In this post, we will see how to expand the vocabulary of a transformers model by adding your own words or tokens.
Why do you need to expand the vocabulary?
Every language model trained for an NLP task has a vocabulary: the set of unique tokens from the text corpus the model was trained on. Therefore, depending on the domain and corpus, each model covers a different set of words. Pre-trained language models are no exception.
In many cases, we want to fine-tune a pre-trained model for our particular task rather than train a model from scratch, but our dataset may contain words that don't exist in the model's vocabulary. For example, we may want to use a pre-trained model that has been trained on general English text in the medical domain. In this situation, we need to add the new words (tokens) from our domain-specific corpus to the vocabulary.
You might ask: "Why bother adding new words when transformer models can handle out-of-vocabulary tokens?" The answer is: yes, they can, but not optimally. The subword tokenizers used with transformer models handle unknown words by splitting them into smaller sub-tokens, so essentially arbitrary text can be processed, but it may be hard for the model to capture the special meaning of a word from its fragments. Additionally, splitting words into many sub-tokens produces longer token sequences that need to be processed, reducing the model's efficiency. Therefore, explicitly adding new, domain-specific tokens to the tokenizer and the model allows for faster fine-tuning as well as better capturing the information in the data.
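To see what this splitting looks like, here is a minimal sketch of the greedy longest-match-first strategy that WordPiece-style tokenizers use. The toy vocabulary and the helper function are illustrative only, not a real tokenizer's API:

```python
def wordpiece_split(word, vocab):
    """Split a word into known sub-tokens, longest match first
    (simplified WordPiece-style behavior)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no sub-token matches at all
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary: "cardiology" itself is out of vocabulary,
# so it gets broken into three sub-tokens.
toy_vocab = {"card", "##io", "##logy", "##gram"}
print(wordpiece_split("cardiology", toy_vocab))  # ['card', '##io', '##logy']
```

One whole word becomes three tokens, which is exactly the sequence-length and meaning-fragmentation cost described above.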
How to add new tokens?
Adding new words to the vocabulary of a transformers model is very straightforward:
- Load the pre-trained model as well as its tokenizer.
- Get the vocabulary of the model.
- Create a list of the new words.
- Check to see if the vocabulary already has any of the new words.
- Add the new words that don’t exist in the vocabulary to the tokenizer.
- Resize the input token embeddings of the model based on the new tokenizer.
Fun part: Coding
The code snippet below shows the steps. You can find the code here.
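As a sketch of those steps, assuming a BERT-style checkpoint such as `bert-base-uncased` and the Hugging Face `transformers` library (the example words are placeholders standing in for a medical corpus):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# 1. Load the pre-trained model as well as its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# 2. Get the vocabulary of the model.
vocab = tokenizer.get_vocab()

# 3. Create a list of the new words (placeholders for a real domain corpus).
new_tokens = ["echocardiogram", "electrophysiology", "angioplasty"]

# 4. Keep only the words that don't already exist in the vocabulary.
new_tokens = [tok for tok in new_tokens if tok not in vocab]

# 5. Add the remaining new words to the tokenizer.
tokenizer.add_tokens(new_tokens)

# 6. Resize the model's input token embeddings to match the new tokenizer,
#    so the added tokens get (randomly initialized) embedding rows.
model.resize_token_embeddings(len(tokenizer))
```

The new embedding rows start out randomly initialized, so the model still needs fine-tuning on the domain corpus for them to become meaningful.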
This post is brought to you by Dr. Mehdi Allahyari.
Want to learn more about data science bits and pieces like this, or career development advice? Subscribe to our newsletter!