What Can Go Wrong When Fine-tuning BERT?
There are plenty of explanations elsewhere; here I'd like to share an example question you might face in an interview setting.
When fine-tuning BERT (Bidirectional Encoder Representations from Transformers) for your use case, what can go wrong? Or what should you pay attention to?
Here are some tips for readers' reference.
Fine-tuning a pre-trained language model such as BERT for a specific use case can be a complex process, and it takes careful setup and testing to get good results. Some potential issues to watch for:
- Not using the WordPiece tokenizer that matches your checkpoint (you need the same tokenizer, vocabulary, and special tokens that were used to pre-train the original model; see the sketch after this list).
- Degenerate results on some fine-tuning runs (instability is often task- and seed-specific, so hyper-parameter tuning and restarting from several random seeds really matter).
- Training a new model from scratch instead of starting from a pre-trained checkpoint (pre-training BERT yourself is far too expensive for most use cases).
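To make the first two points concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries. It assumes the bert-base-uncased checkpoint and a toy sentiment dataset; the texts, labels, and output_dir are placeholders you would swap for your own task.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"  # whichever BERT checkpoint you plan to fine-tune

# Load the tokenizer from the SAME checkpoint as the model, so the WordPiece
# vocabulary and special tokens match what BERT saw during pre-training.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny toy dataset standing in for your real task data.
data = Dataset.from_dict(
    {
        "text": ["great movie", "terrible plot", "loved it", "boring and slow"],
        "label": [1, 0, 1, 0],
    }
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

data = data.map(tokenize, batched=True)

# Hyper-parameters in the ranges suggested by the BERT paper; values far
# outside them (especially the learning rate) are a common cause of
# degenerate runs, so sweep a few combinations and random seeds.
args = TrainingArguments(
    output_dir="bert-finetuned",
    learning_rate=2e-5,              # try {2e-5, 3e-5, 5e-5}
    per_device_train_batch_size=16,  # try {16, 32}
    num_train_epochs=3,              # try {2, 3, 4}
    seed=42,                         # repeat with a few different seeds
)

trainer = Trainer(model=model, args=args, train_dataset=data)
trainer.train()
```

The key design choice is that the tokenizer and model come from one and the same checkpoint name, which rules out the vocabulary mismatch described above.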
Other common issues include overfitting on small datasets, poor data quality, and a mismatch between the pre-training language and your target language (cross-lingual transfer). A simple guard against overfitting is sketched below.
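For the overfitting point, one practical guard is to monitor validation loss and stop fine-tuning once it stops improving. This is a generic early-stopping sketch, not a specific library API; the validation losses in the demo are made up, and in practice you would feed in the loss you compute on a held-out set after each epoch.

```python
class EarlyStopper:
    """Stop fine-tuning when validation loss stops improving."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience      # how many bad epochs to tolerate
        self.min_delta = min_delta    # minimum improvement that still counts
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Demo with made-up validation losses from successive epochs.
stopper = EarlyStopper(patience=2)
for epoch, val_loss in enumerate([0.62, 0.48, 0.45, 0.46, 0.47, 0.49]):
    print(f"epoch {epoch}: val_loss={val_loss}")
    if stopper.should_stop(val_loss):
        print("Stopping early: validation loss has stopped improving.")
        break
```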