
What is WordPiece?

Angelina Yang
Jun 10, 2023

There are plenty of explanations elsewhere; here I'd like to share some example questions in an interview setting.

WordPiece is the tokenization algorithm Google developed to pretrain BERT. How does the WordPiece tokenization work? And why do we use it?

Source: BERT

Here are some tips for readers’ reference:

The WordPiece tokenizer is a type of subword tokenizer that splits words into subword units called “wordpieces.” It is commonly used in natural language processing (NLP) tasks, particularly in models like BERT (Bidirectional Encoder Representations from Transformers).
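To make this concrete, here is a minimal sketch using the Hugging Face transformers library (my own example, not part of the interview question): BERT's pretrained tokenizer splits a longer word into wordpieces and marks continuation pieces with the ## prefix.

```python
# Minimal sketch (assumes the Hugging Face `transformers` package is installed).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A longer word is split into subword units ("wordpieces");
# pieces that continue a word carry the "##" prefix.
print(tokenizer.tokenize("tokenization"))
# Typically prints something like: ['token', '##ization']
```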

If you have BERT on your resume, be prepared to be asked questions like this.

It is very similar to another subword tokenization algorithm, Byte-Pair Encoding (BPE).

WordPiece starts from a small vocabulary including the special tokens used by the model and the initial alphabet. Since it identifies subwords by adding a prefix (like ## for BERT), each word is initially split by adding that prefix to all the characters inside the word. So, for instance, "word" gets split like this:

w ##o ##r ##d
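As a rough illustration, that initial character-level split can be written as a small helper (a hypothetical function for this post, not part of any tokenizer library):

```python
def initial_split(word):
    # Keep the first character as-is; prefix every following character
    # with "##" to mark it as a continuation piece, as in the example above.
    return [word[0]] + ["##" + ch for ch in word[1:]]

print(initial_split("word"))  # ['w', '##o', '##r', '##d']
```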

The algorithm then considers the frequency of adjacent pairs and iteratively merges the highest-scoring one: each pair's frequency is divided by the product of the frequencies of its two parts, so pairs whose parts rarely appear on their own are merged first. Merging continues until the vocabulary reaches the desired size. (This scoring is the main difference from BPE, which simply merges the most frequent pair.)
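Here is a toy sketch of that scoring step (an illustration of the idea only, not BERT's actual training code): given words already split into pieces, pick the adjacent pair with the highest score.

```python
from collections import Counter

def best_merge(tokenized_corpus):
    """Pick the next pair to merge using a WordPiece-style score:
    freq(pair) / (freq(first) * freq(second))."""
    pair_freq = Counter()
    unit_freq = Counter()
    for pieces in tokenized_corpus:
        unit_freq.update(pieces)
        pair_freq.update(zip(pieces, pieces[1:]))
    return max(
        pair_freq,
        key=lambda pair: pair_freq[pair] / (unit_freq[pair[0]] * unit_freq[pair[1]]),
    )

# Toy corpus: each entry is one word already split into initial pieces.
corpus = [
    ["w", "##o", "##r", "##d"],
    ["w", "##o", "##r", "##d", "##s"],
    ["w", "##o", "##r", "##l", "##d"],
]
print(best_merge(corpus))  # one of the highest-scoring adjacent pairs
```

In a full training loop, the chosen pair would be merged throughout the corpus, the new unit added to the vocabulary, and the scoring repeated until the target vocabulary size is reached.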
