What does the tokenizer do for a language model?

Angelina Yang
2 min read · Sep 22

This post is about prepping for interviews for Data Science roles. The original post can be found here.

Welcome to today’s data science interview challenge! Today’s challenge is inspired by a Hugging Face Transformers lecture (2022 version) at Stanford! Relax!

A warm up question 🤓:

See if you can tell me (without writing it down) what the code looks like that creates a torch.tensor with the following contents:

Now tell me what the code looks like to compute the average of each row (.mean()) and each column. What's the shape of the results?

I usually don’t do live coding questions but this one is straightforward and you should be able to speak while thinking. Have fun!

Now back to the basics:

Question: What does the tokenizer do for a language model?

Source: Paper

Here are some tips for readers’ reference:

Warm-up question:

Is the following what you are envisioning?
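The tensor from the warm-up is not reproduced in this text, so here is a minimal sketch that assumes a stand-in 2x3 tensor (the values are made up for illustration). It shows one way to create the tensor and take the mean over each row and each column with `.mean()`:

```python
import torch

# Hypothetical stand-in values; the original tensor is not shown in this text.
# Floats are used because .mean() is not defined for integer tensors.
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Mean of each row: reduce over the column dimension (dim=1).
row_means = x.mean(dim=1)

# Mean of each column: reduce over the row dimension (dim=0).
col_means = x.mean(dim=0)

print(row_means, row_means.shape)
print(col_means, col_means.shape)
```

For a tensor of shape (2, 3), the row means have shape (2,) and the column means have shape (3,), because the reduced dimension is dropped unless you pass `keepdim=True`.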

Question:

Pretrained models are implemented along with tokenizers that are used to preprocess their inputs. The tokenizers take raw strings or lists of strings and output what are effectively dictionaries that contain the model inputs.
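To make the "string in, dictionary out" idea concrete, here is a toy sketch in plain Python. `ToyTokenizer` and its tiny vocabulary are hypothetical and written only for illustration; this is not the Hugging Face implementation. Real pretrained tokenizers typically split text into subword units and use the vocabulary the model was trained with, but the shape of the output is similar: a dictionary with keys such as `input_ids` and `attention_mask`.

```python
from typing import Dict, List, Union


class ToyTokenizer:
    """Hypothetical whitespace tokenizer, for illustration only."""

    def __init__(self, vocab: List[str]):
        # Reserve ids 0 and 1 for padding and unknown words.
        self.pad_id = 0
        self.unk_id = 1
        self.token_to_id = {"[PAD]": self.pad_id, "[UNK]": self.unk_id}
        for token in vocab:
            self.token_to_id.setdefault(token, len(self.token_to_id))

    def __call__(self, text: Union[str, List[str]]) -> Dict[str, List[List[int]]]:
        # Accept a single string or a list of strings.
        batch = [text] if isinstance(text, str) else list(text)

        # Map each lowercase, whitespace-separated word to an id.
        encoded = [
            [self.token_to_id.get(word, self.unk_id) for word in sentence.lower().split()]
            for sentence in batch
        ]

        # Pad every sequence to the longest one in the batch, and record
        # which positions are real tokens (1) versus padding (0).
        max_len = max((len(ids) for ids in encoded), default=0)
        input_ids, attention_mask = [], []
        for ids in encoded:
            pad = max_len - len(ids)
            input_ids.append(ids + [self.pad_id] * pad)
            attention_mask.append([1] * len(ids) + [0] * pad)

        return {"input_ids": input_ids, "attention_mask": attention_mask}


tokenizer = ToyTokenizer(["hello", "world", "tokenizers", "are", "fun"])
out = tokenizer(["hello world", "tokenizers are fun"])
print(out)
```

The point to take away for the interview is the interface, not the splitting rule: text goes in, and the model receives integer ids plus a mask that tells it which positions to ignore.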

Check the lecturer’s explanation below! (To jump to the answer scroll to roughly 3 minutes of the lecture.)

Happy practicing!

Thanks for reading my newsletter. You can follow me on LinkedIn or Twitter @Angelina_Magr!

Note: There are different angles to answer an interview question. The author of this newsletter does not try to find a reference that answers a question exhaustively. Rather, the author would like to share some quick insights and help the readers to think, practice and do further research as necessary.

Source of content/images: please check the original post here.
