How to Add New Tokens to a Transformer Model Vocabulary?
In this post, we will see how to expand the vocabulary of a transformer model by adding your own words or tokens.
Why do you need to expand the vocabulary?
Every language model trained for a task in the NLP domain has a vocabulary: the set of unique words (or tokens) from the text corpus the model was trained on. The vocabulary therefore depends on the domain and the corpus, and pre-trained language models are no exception.
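To make this concrete, here is a minimal sketch of how to inspect a pre-trained tokenizer's vocabulary. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, chosen purely for illustration:

```python
from transformers import AutoTokenizer

# Any pre-trained checkpoint works here; bert-base-uncased is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Size of the base vocabulary learned during pre-training.
print(tokenizer.vocab_size)  # 30522 for bert-base-uncased

# The vocabulary itself is a mapping from token string to integer id.
vocab = tokenizer.get_vocab()
print(list(vocab.items())[:5])
```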
In many cases, we want to fine-tune a pre-trained model for our particular task rather than train a model from scratch, but our dataset may contain words that don't exist in the model's vocabulary. For example, we may want to use a model that has been pre-trained on general English text for a task in the medical domain. In this situation, we need to add the new words (tokens) from our domain-specific corpus to the vocabulary, as sketched below.
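Here is a minimal sketch of that workflow with the Hugging Face transformers library; the checkpoint name and the example medical terms are placeholders, not a prescription:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific terms missing from the general-English vocabulary.
new_tokens = ["paracetamol", "bronchodilator"]

# add_tokens() skips tokens already in the vocabulary
# and returns the number of tokens actually added.
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens")

# The embedding matrix must grow to match the new vocabulary size.
model.resize_token_embeddings(len(tokenizer))
```

Note that the embedding rows for the newly added tokens are randomly initialized, so the model only learns useful representations for them during fine-tuning on the domain corpus.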
You might ask: “Why bother adding new words when transformer models can handle out-of-vocabulary tokens?“ It is true that the subword tokenizers used with current transformer models can handle essentially arbitrary input, but this is not always optimal. These tokenizers handle unknown words by splitting them into smaller sub-tokens, which allows the text to be processed, but it may be hard for the model to capture the specific meaning of the word. Additionally, splitting a single word into many sub-tokens increases the length of the input sequence.
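A small sketch makes this effect visible; it again assumes transformers and bert-base-uncased, with an illustrative medical term:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Before: the unseen word is split into several WordPiece sub-tokens
# (the exact split depends on the checkpoint's vocabulary).
print(tokenizer.tokenize("bronchodilator"))

# After adding it to the vocabulary, the word is kept as a single token.
tokenizer.add_tokens(["bronchodilator"])
print(tokenizer.tokenize("bronchodilator"))
```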