What Can Go Wrong When Fine-tuning BERT?
There are plenty of explanations elsewhere; here I'd like to share an example question you might face in an interview setting.
When fine-tuning BERT (Bidirectional Encoder Representations from Transformers) for your use case, what can go wrong? Or what should you pay attention to?
Here are some tips for readers' reference.
Fine-tuning a pre-trained language model such as BERT for a specific use case can be a complex process, and it takes careful setup and testing to get good results. Some potential issues to watch for:
- Not using the WordPiece tokenizer that matches your checkpoint (you need the same tokenizer, vocabulary, and special tokens that were used to pre-train the original model; see the sketch after this list).
- Degenerate results on some fine-tuning runs (instability is often task- and seed-specific, so hyper-parameter tuning and restarting from several random seeds really matter).
- Training a new model from scratch instead of starting from a pre-trained checkpoint (pre-training BERT yourself is far too expensive for most use cases).
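To make the first two points concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries. It assumes the bert-base-uncased checkpoint and a toy sentiment dataset; the texts, labels, and output_dir are placeholders you would swap for your own task.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"  # whichever BERT checkpoint you plan to fine-tune

# Load the tokenizer from the SAME checkpoint as the model, so the WordPiece
# vocabulary and special tokens match what BERT saw during pre-training.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny toy dataset standing in for your real task data.
data = Dataset.from_dict(
    {
        "text": ["great movie", "terrible plot", "loved it", "boring and slow"],
        "label": [1, 0, 1, 0],
    }
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

data = data.map(tokenize, batched=True)

# Hyper-parameters in the ranges suggested by the BERT paper; values far
# outside them (especially the learning rate) are a common cause of
# degenerate runs, so sweep a few combinations and random seeds.
args = TrainingArguments(
    output_dir="bert-finetuned",
    learning_rate=2e-5,              # try {2e-5, 3e-5, 5e-5}
    per_device_train_batch_size=16,  # try {16, 32}
    num_train_epochs=3,              # try {2, 3, 4}
    seed=42,                         # repeat with a few different seeds
)

trainer = Trainer(model=model, args=args, train_dataset=data)
trainer.train()
```

The key design choice is that the tokenizer and model come from one and the same checkpoint name, which rules out the vocabulary mismatch described above.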
Other common issues include overfitting on small datasets, poor data quality, and a mismatch between the pre-training language and your target language (cross-lingual transfer). A simple guard against overfitting is sketched below.
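For the overfitting point, one practical guard is to monitor validation loss and stop fine-tuning once it stops improving. This is a generic early-stopping sketch, not a specific library API; the validation losses in the demo are made up, and in practice you would feed in the loss you compute on a held-out set after each epoch.

```python
class EarlyStopper:
    """Stop fine-tuning when validation loss stops improving."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience      # how many bad epochs to tolerate
        self.min_delta = min_delta    # minimum improvement that still counts
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Demo with made-up validation losses from successive epochs.
stopper = EarlyStopper(patience=2)
for epoch, val_loss in enumerate([0.62, 0.48, 0.45, 0.46, 0.47, 0.49]):
    print(f"epoch {epoch}: val_loss={val_loss}")
    if stopper.should_stop(val_loss):
        print("Stopping early: validation loss has stopped improving.")
        break
```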