
What is WordPiece?

Angelina Yang
Jun 10, 2023

There are plenty of explanations elsewhere; here I'd like to share some example questions in an interview setting.

WordPiece is the tokenization algorithm Google developed to pretrain BERT. How does the WordPiece tokenization work? And why do we use it?

Source: BERT

Here are some tips for readers’ reference:

The WordPiece tokenizer is a type of subword tokenizer that splits words into subword units called “wordpieces.” It is commonly used in natural language processing (NLP) tasks, particularly in models like BERT (Bidirectional Encoder Representations from Transformers).
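To make this concrete, here is a minimal sketch using the Hugging Face transformers library (my own example, not part of the interview question): BERT's pretrained tokenizer splits a longer word into wordpieces and marks continuation pieces with the ## prefix.

```python
# Minimal sketch (assumes the Hugging Face `transformers` package is installed).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A longer word is split into subword units ("wordpieces");
# pieces that continue a word carry the "##" prefix.
print(tokenizer.tokenize("tokenization"))
# Typically prints something like: ['token', '##ization']
```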

If you have BERT on your resume, be prepared to be asked questions like this.

It is very similar to another subword tokenization algorithm, Byte-Pair Encoding (BPE).

WordPiece starts from a small vocabulary including the special tokens used by the model and the initial alphabet. Since it identifies subwords by adding a prefix (like ## for BERT), each word is initially split by adding that prefix to all the characters inside the word. So, for instance, "word" gets split like this:

w ##o ##r ##d
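As a rough illustration, that initial character-level split can be written as a small helper (a hypothetical function for this post, not part of any tokenizer library):

```python
def initial_split(word):
    # Keep the first character as-is; prefix every following character
    # with "##" to mark it as a continuation piece, as in the example above.
    return [word[0]] + ["##" + ch for ch in word[1:]]

print(initial_split("word"))  # ['w', '##o', '##r', '##d']
```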

The algorithm then considers the frequency of adjacent pairs and iteratively merges the highest-scoring one: each pair's frequency is divided by the product of the frequencies of its two parts, so pairs whose parts rarely appear on their own are merged first. Merging continues until the vocabulary reaches the desired size. (This scoring is the main difference from BPE, which simply merges the most frequent pair.)
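Here is a toy sketch of that scoring step (an illustration of the idea only, not BERT's actual training code): given words already split into pieces, pick the adjacent pair with the highest score.

```python
from collections import Counter

def best_merge(tokenized_corpus):
    """Pick the next pair to merge using a WordPiece-style score:
    freq(pair) / (freq(first) * freq(second))."""
    pair_freq = Counter()
    unit_freq = Counter()
    for pieces in tokenized_corpus:
        unit_freq.update(pieces)
        pair_freq.update(zip(pieces, pieces[1:]))
    return max(
        pair_freq,
        key=lambda pair: pair_freq[pair] / (unit_freq[pair[0]] * unit_freq[pair[1]]),
    )

# Toy corpus: each entry is one word already split into initial pieces.
corpus = [
    ["w", "##o", "##r", "##d"],
    ["w", "##o", "##r", "##d", "##s"],
    ["w", "##o", "##r", "##l", "##d"],
]
print(best_merge(corpus))  # one of the highest-scoring adjacent pairs
```

In a full training loop, the chosen pair would be merged throughout the corpus, the new unit added to the vocabulary, and the scoring repeated until the target vocabulary size is reached.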
