A step towards statistical language models
website: https://eliteaihub.com/
INTRODUCTION
Language modelling is the task of assigning probabilities to sentences in a language: it assigns a probability to a sequence of words and estimates the likelihood of a new word given the preceding N−1 words. Statistical language models are probabilistic models used to predict the next word given the previous words.
Why would we want to predict upcoming words, or assign probabilities to sentences? These probabilities are essential in tasks such as speech recognition, spelling correction, grammatical error correction and machine translation. In this article we will dive deeper into what an n-gram model is and design a simple n-gram model from scratch.
DOMAIN BACKGROUND
The goal of a probabilistic language model is to compute the probability of a sentence or sequence of words, P(W) = P(w1, w2, …, wn), or the probability of an upcoming word, P(w4 | w1, w2, w3). The intuition behind the n-gram model is that, instead of computing the probability of a word given the entire preceding sequence, we approximate the history by just the last few words. The basic task is therefore predicting the probability of a word w given some history h, that is, P(w | h).
So, for the sentence
Mary likes her coffee with milk and ……?
we want to know the probability that the next word is sugar: P(sugar | Mary likes her coffee with milk and)
Here,
w = sugar
h = Mary likes her coffee with milk and
This can be computed using the chain rule of probability. The probability of the entire sequence is the joint probability P(w1, w2, …, wn), and by the chain rule:
P(X1…Xn) = P(X1) P(X2|X1) P(X3|X1, X2) … P(Xn|X1, X2, …, Xn−1).
Thus for our previous example,
P(Mary likes her coffee with milk and) = P(Mary) x P(likes|Mary) x P(her | Mary likes) x P(coffee | Mary likes her ) x P(with | Mary likes her coffee) x P(milk | Mary likes her coffee with) x P(and | Mary likes her coffee with milk).
Estimating the probability values, using relative frequency counts from a corpus:
P(sugar| Mary likes her coffee with milk and) = Count(Mary likes her coffee with milk and sugar)/ Count(Mary likes her coffee with milk and).
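To make this concrete, here is a minimal sketch (using a hypothetical toy corpus) of estimating P(w | h) by directly counting full histories. In practice, counts of such long histories are almost always zero in any realistic corpus, which is exactly what motivates the Markov models described next.

```python
# Minimal sketch (hypothetical toy corpus): estimate P(w | h) by counting
# how often the full history h, and h followed by w, occur in the corpus.
corpus = [
    "mary likes her coffee with milk and sugar",
    "mary likes her coffee with milk and cream",
    "john likes his tea with lemon",
]

def count_sequence(tokens, sentence):
    """Count occurrences of a token sequence inside a tokenized sentence."""
    words = sentence.split()
    n = len(tokens)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == tokens)

def prob_next_word(history, word):
    """MLE estimate of P(word | history) = Count(history word) / Count(history)."""
    h = history.split()
    count_h = sum(count_sequence(h, s) for s in corpus)
    count_hw = sum(count_sequence(h + [word], s) for s in corpus)
    return count_hw / count_h if count_h else 0.0

print(prob_next_word("mary likes her coffee with milk and", "sugar"))  # 0.5 on this toy corpus
```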
MARKOV MODELS
These are a class of models that assume we can predict the probability of some future unit without looking too far into the past. Based on this intuition we get the bigram model (which looks one word into the past), the trigram model (which looks two words into the past) and, in general, the N-gram model (which looks N−1 words into the past).
BIGRAM MODEL
It is based on the Markov assumption: the probability of a word depends only on the previous word. Thus our chain-rule probability reduces to
P(X1…Xn) = P(X1) P(X2|X1) P(X3|X2) … P(Xk|Xk−1) … P(Xn|Xn−1)
Thus for our example, P(sugar | Mary likes her coffee with milk and) reduces to P(sugar | and). In general,
P(wn | w1, …, wn−1) ≈ P(wn | wn−1),
and the maximum-likelihood estimate of this probability is obtained from counts in the training corpus:
P(wn | wn−1) = C(wn−1 wn) / C(wn−1).
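A minimal sketch of this bigram estimate, using a hypothetical toy corpus and <s>/</s> as assumed sentence-boundary markers:

```python
from collections import Counter

# Minimal bigram sketch: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
# The corpus and the <s>/</s> boundary markers are assumptions for illustration.
corpus = [
    "mary likes her coffee with milk and sugar",
    "mary likes her tea with milk",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens[:-1])            # history counts (exclude </s>)
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("and", "sugar"))   # 1.0 in this toy corpus
print(bigram_prob("her", "coffee"))  # 0.5
```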
TRIGRAM MODEL
It looks two words into the past to predict the next word. Thus our chain-rule probability reduces to
P(wn | w1, …, wn−1) ≈ P(wn | wn−2, wn−1).
Thus for our example, P(sugar | Mary likes her coffee with milk and) reduces to P(sugar | milk, and).
For a complete sentence, with <s> as a start symbol and STOP marking the end of the sentence, we have for example:
P(the dog barks STOP) = P(the | <s>, <s>) P(dog | <s>, the) P(barks | the, dog) P(STOP | dog, barks).
The same approach generalizes to higher-order n-gram models.
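As an illustration of that generalization, the sketch below (function names are hypothetical) pads each sentence with n−1 start symbols and a STOP symbol, mirroring the notation above, and computes the maximum-likelihood estimate for any n:

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Count n-grams and their (n-1)-word histories, with <s> padding and a STOP symbol."""
    grams, histories = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["STOP"]
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            grams[history + (tokens[i],)] += 1
            histories[history] += 1
    return grams, histories

def ngram_prob(grams, histories, history, word):
    """MLE estimate of P(word | history) for an (n-1)-word history tuple."""
    if histories[history] == 0:
        return 0.0
    return grams[history + (word,)] / histories[history]

# Hypothetical usage with a trigram model (n = 3):
corpus = ["the dog barks", "the dog sleeps"]
grams, hists = ngram_counts(corpus, 3)
print(ngram_prob(grams, hists, ("the", "dog"), "barks"))  # 0.5
```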
EVALUATING LANGUAGE MODELS
The performance of a language model can be evaluated using intrinsic or extrinsic evaluation. Extrinsic evaluation embeds the language model in a real application and measures the performance of that application, whereas intrinsic evaluation measures the quality of the model independently of any application. The standard intrinsic metric for language models is perplexity. The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalised by the number of words. For a test set W = w1 w2 … wN:
PP(W) = P(w1 w2 … wN)^(−1/N).
There is another way to think about perplexity: as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word.
So if the perplexity of a unigram model on a test set is 962, this means the model is, on average, as uncertain as if it had to choose the next word of the sequence from 962 equally likely possibilities at each step. Thus, the lower the perplexity, the better the model.
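A minimal sketch of computing perplexity in log space (to avoid numerical underflow); prob(word, history) stands for an assumed model interface returning P(word | history), such as the bigram or trigram estimates sketched above, which would look at only the last one or two words of the history:

```python
import math

def perplexity(test_sentences, prob):
    """Perplexity = P(w1..wN)^(-1/N), computed in log space.
    `prob(word, history)` is an assumed model interface returning P(word | history)."""
    log_prob_sum, n_words = 0.0, 0
    for sentence in test_sentences:
        history = []
        for word in sentence.split():
            p = prob(word, tuple(history))
            if p == 0.0:
                return float("inf")   # a zero-probability event makes perplexity infinite
            log_prob_sum += math.log(p)
            n_words += 1
            history.append(word)
    return math.exp(-log_prob_sum / n_words)
```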
Below are some challenges faced by this modelling approach.
Generalization and Sensitivity:
The n-gram model is sensitive to the training corpus. One reason is that the estimated probabilities often encode specific facts about that particular corpus; another is that the model fits the training corpus more and more closely as the value of n increases, which can come at the cost of generalization.
Unknown Words:
A closed-vocabulary assumption is reasonable in some domains, such as speech recognition or machine translation, where we have a pronunciation dictionary or a phrase table that is fixed in advance, so the language model can only use the words in that dictionary or phrase table. But in other applications we have to deal with words we haven't seen before, i.e. unknown words, sometimes also referred to as OOV (out-of-vocabulary) words.
Context Specific:
There may be cases where a word is present in our vocabulary but appears in the test corpus in an unseen context (for example, it appears after a word it never followed in training). To keep the language model from assigning zero probability to these unseen events, we have to shave off a bit of probability mass from more frequent events and give it to the events we have never seen; this is known as smoothing.
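One standard way to shave off that probability mass (there are several, and the next story will cover them in more detail) is add-one, or Laplace, smoothing. A minimal sketch, assuming bigram and unigram counters like those built earlier and a vocabulary of size |V|:

```python
def laplace_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed estimate:
    P(word | prev) = (C(prev word) + 1) / (C(prev) + |V|).
    Unseen bigrams now receive a small non-zero probability."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
```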
Sources:
[1] Chapter draft, Stanford Lagunita: https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf
Thanks for reading. Stay tuned for the next story, where we will address these challenges and design a simple n-gram model from scratch.
If you have any feedback, please feel free to reach out by commenting on this post.
Links to YouTube videos:
Viterbi decoding algorithm : https://youtu.be/yGtC10JJNq8
Machine Learning : https://youtu.be/Q3mZui3H6MM