
What Are The Main Advantages of BERT Over LSTM Models?

Angelina Yang
2 min read · Aug 15, 2022


There are plenty of deep explanations of BERT and LSTM models elsewhere, so here I’d like to share tips on what you can say in an interview setting.

What are the main advantages of BERT over LSTM models?

Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Here are some example answers for readers’ reference:

Question 1:

The main advantages of BERT over LSTM models are as follows (watch the explanation by Dr. Jacob Devlin from Google AI Language, linked in the video source below):

1. With the self-attention mechanism, BERT has no locality bias, which means long-distance context gets “equal opportunity” with short-distance context (a toy sketch contrasting attention with a recurrent update follows the caption below).

Advantage #1
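To make the locality-bias point concrete, here is a minimal NumPy sketch. It is purely illustrative and not BERT’s actual implementation; the shapes, values, and the toy recurrence are my own assumptions. The idea it shows: in self-attention, the score between two tokens is a single dot product regardless of how far apart they are, whereas a recurrent update has to push information through every intermediate step.

```python
# Illustrative only -- not BERT's code. Toy shapes and a toy recurrence.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 4                      # 8 tokens, 4-dim embeddings (toy sizes)
X = rng.normal(size=(seq_len, d))      # token representations

# Self-attention: every token attends to every other token directly.
# The score between token i and token j is one dot product, whether j is
# 1 position away or seq_len - 1 positions away -- no locality bias.
scores = X @ X.T / np.sqrt(d)                              # (seq_len, seq_len)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
context = weights @ X                                       # mixes all positions at once

# A simple recurrent (LSTM-like) update: information from token 0 only reaches
# the last token after passing through every intermediate state update.
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for x_t in X:                          # strictly sequential, one token at a time
    h = np.tanh(W @ h + x_t)

print(weights[7, 0])   # direct attention weight from the last token to the first
```

In the attention matrix, the last token’s weight on the first token is computed directly; in the recurrence, that long-range signal only survives to the end if it is carried through all seven intermediate updates.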

2. A single large matrix multiplication per layer improves efficiency on TPUs, which means the effective batch size is the number of words, not the number of sequences (see the sketch after the source note below).

Advantage #2

Source: Stanford CS224N
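To illustrate the second point, here is another illustrative NumPy sketch, not an actual BERT layer or TPU kernel; the sizes and the toy recurrence are assumptions. A Transformer layer’s projection can be applied as one large matrix multiply over every word in every sequence at once, while an LSTM-style layer must step through the sequence, multiplying only one (batch, d) slice at a time.

```python
# Illustrative only -- toy sizes, not a real BERT/TPU kernel.
import numpy as np

rng = np.random.default_rng(1)
batch, seq_len, d = 32, 128, 64
X = rng.normal(size=(batch, seq_len, d))     # a batch of tokenized sequences
W = rng.normal(size=(d, d))                  # one layer's projection weights

# Transformer-style projection: flatten sequences so every word becomes a row.
# The effective batch of this single multiply is batch * seq_len words.
out_transformer = (X.reshape(-1, d) @ W).reshape(batch, seq_len, d)

# LSTM-style processing: seq_len sequential steps, each multiplying only a
# (batch, d) slice -- the effective batch per step is just `batch` sequences.
h = np.zeros((batch, d))
for t in range(seq_len):
    h = np.tanh(X[:, t, :] @ W + h @ W)      # toy recurrence for illustration

print(out_transformer.shape, h.shape)
```

One big multiplication over 32 × 128 = 4,096 word vectors keeps the hardware saturated; 128 small sequential multiplications over 32 rows each cannot.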

Happy practicing!

Thanks for reading my newsletter. You can follow me on LinkedIn! The original post is here.

Note: There are different angles from which to answer an interview question. The author of this newsletter does not try to find a reference that answers a question exhaustively. Rather, the author would like to share some quick insights and help readers think, practice, and do further research as needed.

Source of video: Stanford CS224N: NLP with Deep Learning (Winter 2020), “BERT and Other Pre-trained Language Models” by Dr. Jacob Devlin from Google AI Language

Source of images/answers: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Jacob Devlin et al.; Wikipedia: word embeddings

Good reads:
Paper: “Semi-supervised Sequence Learning” by Andrew M. Dai and Quoc V. Le (2015)
Paper: “Deep Contextualized Word Representations” (ELMo) by Matthew E. Peters et al. (2018)
Paper: “Improving Language Understanding by Generative Pre-Training” by OpenAI (2018)
