How to Solve the Sparsity Problem in an N-gram Language Model?
I know ChatGPT, OpenAI's language model, is the latest hype. But let's go back to the basics when prepping for our interviews!
Why?
If your interviewer is asking you about ChatGPT, he/she is testing your curiosity, proactiveness, and maybe your imagination. 😂
There are a lot of deep explanations elsewhere; here I'd like to share some example questions in an interview setting. Last week we asked about the n-gram model, so let's add a bonus question:
How to solve the sparsity problem in the n-gram language model?
Some word sequences occur rarely or never, so any n-gram that does not appear in the training dataset is assigned zero probability. (In other words, if we have never seen an event in the training data, our model assigns zero probability to that event.)
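To make this concrete, a count-based n-gram model estimates the probability of the next word from corpus counts, roughly

P(w | w1 … w(n-1)) = count(w1 … w(n-1) w) / count(w1 … w(n-1))

so a zero count in the numerator gives an unseen continuation zero probability, and a zero count in the denominator means the distribution for that context cannot be computed at all.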
Here are some tips for readers’ reference:
One way to deal with the sparsity problem is smoothing: add a small delta to the count of every word in the vocabulary, so that every word that can come next has at least some small probability. This addresses the numerator sparsity problem.
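As a rough illustration, here is a minimal add-delta smoothing sketch for a bigram model on a toy corpus. The corpus, the delta value, and the function name smoothed_prob are made up for illustration; with delta = 1 this becomes the familiar Laplace (add-one) smoothing.

```python
from collections import Counter

# Toy corpus for illustration only; a real model would use a large tokenized corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))   # count(prev_word, word)
unigram_counts = Counter(corpus)                   # count(prev_word)
vocab = set(corpus)

def smoothed_prob(prev_word, word, delta=0.01):
    """Add-delta estimate of P(word | prev_word): add delta to every bigram count."""
    numerator = bigram_counts[(prev_word, word)] + delta
    denominator = unigram_counts[prev_word] + delta * len(vocab)
    return numerator / denominator

# "cat dog" never occurs in the corpus, but it still gets a small nonzero probability.
print(smoothed_prob("cat", "dog"))
# A seen bigram like "the cat" keeps most of its probability mass.
print(smoothed_prob("the", "cat"))
```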
If the denominator count is zero, we cannot compute a probability distribution for any next word, because we have never seen that context before. A solution is backoff: if you cannot find occurrences of the n-gram context in the training corpus at all, back off to conditioning on fewer words, i.e., use an (n-1)-gram. Note that the sparsity problem gets worse as you factor in more context; in practice, n is usually kept no larger than about 5.
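And here is a sketch of the backoff idea, reusing bigram_counts and unigram_counts from the snippet above. This follows the simple "stupid backoff" flavour with a made-up discount alpha, rather than a proper scheme such as Katz backoff or Kneser-Ney smoothing:

```python
def backoff_prob(prev_word, word, alpha=0.4):
    """If the bigram was seen, use its conditional estimate;
    otherwise back off to a discounted unigram estimate.
    (These are scores rather than a true probability distribution.)"""
    if bigram_counts[(prev_word, word)] > 0:
        return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]
    total_tokens = sum(unigram_counts.values())
    return alpha * unigram_counts[word] / total_tokens

print(backoff_prob("the", "cat"))  # seen bigram: normal conditional estimate
print(backoff_prob("rug", "cat"))  # unseen bigram: fall back to the unigram estimate
```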
Watch how Abby See from Stanford explains it (remember to watch a bit longer to see the full explanation!):
Happy practicing!
Thanks for reading my newsletter. You can follow me on LinkedIn or Twitter @Angelina_Magr!
Note: There are different angles to answer an interview question. The author of this newsletter does not try to find a reference that answers a question exhaustively. Rather, the author would like to share some quick insights and help the readers to think, practice and do further research as necessary.
Source of quotes/videos: Stanford CS224N (Winter 2019), Lecture 6: Language Models and RNNs, by Dr. Abby See
Source of images/good reads: Blog: NLP Breakfast 2: The Rise of Language Models; Blog: N-gram Model