Transformers Are Expensive: How to Shrink Them?

Angelina Yang
2 min read · Apr 20, 2023


There are plenty of detailed explanations of these techniques elsewhere; here I'd like to share some example questions as they might come up in an interview setting.

Transformer models are expensive. What are some ways to shrink them?

Source: Knowledge Distillation: Simplified

Here are some tips for readers’ reference:

Several techniques can be used to shrink transformers and make them more efficient and cheaper to deploy:

  1. Pruning: This involves removing unimportant weights or connections within the transformer network. This can significantly reduce the number of parameters in the model without compromising performance (see the pruning sketch after this list).
  2. Quantization: This involves reducing the precision of the weights in the model. By using fewer bits to represent each weight (for example, 8-bit integers instead of 32-bit floats), the model can be made smaller and faster to compute (see the quantization sketch after this list).
  3. Knowledge Distillation: This involves training a smaller, more efficient student model to mimic the output of a larger, more accurate teacher model. By doing so, the smaller model can achieve similar levels of accuracy with fewer parameters (see the distillation loss sketch after this list).
  4. Architecture optimization: This involves redesigning the architecture of the transformer to reduce the number of computations required for inference, for example by simplifying the attention mechanism, using more efficient operations, or simply instantiating a smaller configuration (see the configuration sketch after this list).

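To make the pruning item concrete, here is a minimal sketch using PyTorch's built-in pruning utilities on a Hugging Face model. The model name, the 30% sparsity level, and the choice of L1 (magnitude) pruning are illustrative assumptions, not a recommendation.

```python
# Minimal pruning sketch: zero out low-magnitude weights in every Linear layer.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")  # assumed model

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        # Remove the 30% smallest-magnitude weights (illustrative amount).
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# The pruned weights are now zeros; fine-tuning or sparse kernels would
# typically follow to recover accuracy and realize speedups.
```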
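For quantization, the simplest post-training option in PyTorch is dynamic quantization, which stores linear-layer weights as 8-bit integers and dequantizes them on the fly at inference time. Again, the model name below is an assumption for illustration.

```python
# Minimal post-training dynamic quantization sketch.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")  # assumed model
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model,              # full-precision model
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8,  # 8-bit integer weights
)
# `quantized` is a drop-in replacement that is smaller on disk and
# typically faster on CPU for weight-dominated workloads.
```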
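For knowledge distillation, the core idea fits in a single loss function: the student is trained to match the teacher's temperature-softened output distribution while still fitting the ground-truth labels. The tensor names, temperature, and weighting below are assumed placeholders; this is the standard recipe, not any particular library's API.

```python
# Minimal distillation loss sketch: soft (teacher) targets + hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: usual cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```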
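Finally, the most direct form of architecture optimization is simply instantiating a smaller configuration (fewer layers, heads, and hidden units); more advanced approaches replace full self-attention with cheaper variants. The sizes below are illustrative assumptions.

```python
# Minimal configuration sketch: a smaller BERT-style model.
from transformers import BertConfig, BertModel

small_config = BertConfig(
    num_hidden_layers=6,      # vs. 12 in BERT-base
    num_attention_heads=8,    # vs. 12
    hidden_size=512,          # vs. 768
    intermediate_size=2048,   # vs. 3072
)
small_model = BertModel(small_config)  # trained from scratch or via distillation
```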