Part2: Building Statistical Language Model from Scratch
website: https://eliteaihub.com/
In Part1 we explored the basics of Language models and identified challenges faced with modelling approach.In this Part we will address the challenges identified and build Ngram model from scratch without NLTK Library!
The Challenges that we identified were : Generalization and Sensitivity,Unknown Words and Context Specific.
- Generalization and Sensitivity:
As we saw that the performance on training corpus is proportional to value of n. We would be generating random sentences from different n-gram models. It’s simplest to visualize how this works for the unigram case. Imagine all the words of the English language covering the probability space between 0 and 1, each word covering an interval proportional to its frequency. We choose a random value between 0 and 1 and print the word whose interval includes this chosen value. We continue choosing random numbers and generating words until we randomly generate the sentence-final token . We can use the same technique to generate bigrams by first generating a random bigram that starts with (according to its bigram probability). Let’s say the second word of that bigram is w. We next chose a random bigram starting with w (again, drawn according to its bigram probability), and so on. To give an intuition for the increasing power of higher-order n-grams hence reducing the sensitivity of model towards train set.
- Unknown words:
There are two common ways to train the probabilities of the unknown word model: The first one is to turn the problem back into a closed vocabulary one by choosing a fixed vocabulary in advance:
- Choose a vocabulary (word list) that is fixed in advance.
- Convert in the training set any word that is not in this set (any OOV word) to the unknown word token in a text normalization step.
- Estimate the probabilities for from its counts just like any other regular word in the training set.
The second alternative, in situations where we don’t have a prior vocabulary in advance, is to create such a vocabulary implicitly, replacing words in the training data by based on their frequency. For example we can replace by all words that occur fewer than n times in the training set, where n is some small number, or equivalently select a vocabulary size V in advance (say 50,000) and choose the top V words by frequency and replace the rest by UNK. In either case we then proceed to train the language model as before, treating like a regular word.
- Context Specific:
To keep a language model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen. smoothing This modification is called smoothing or discounting.There are variety of ways to do smoothing: add-1 smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing.
Laplace Smoothing:The simplest way to do smoothing is to add one to all the bigram counts, before we normalize them into probabilities. All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing.
Add-k smoothing:One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?). This algorithm is therefore called add-k smoothing.
Kneser-Ney smoothing: Kneser-Ney smoothing makes use of the probability of a word being a novel continuation. The interpolated Kneser-Ney smoothing algorithm mixes a discounted probability with a lower-order continuation probability.
IDENTIFYING CORPUS AND PREPROCESSING:
The corpus that we would be using is Europarl corpus from Kaggle. It consists of the proceedings of the European Parliament from 1996.
First, we need to preprocess our corpus data. Apply Case folding to convert all text to lower case this would ensure that the ngrams with different case would be treated as same Eg: Bigram (I,am) should be same as (i,am). Next is to tokenize the text, you can use nltk.tokenize or define your own tokenizer. Remove punctuations as they don’t add any significance to the model. After Preprocessing you should have something like this below:
BUILD VOCABULARY:
We need to build the vocabulary from the preprocessed data,Vocabulary is the set of unique words of text.The reason behind building vocabulary is that it helps to get OOV(Out of Vocabulary) Words, and there by determine the OOV rate. It also helps in Smoothing techniques like Add-1 Smoothing where we need to know the vocabulary count |V|.Vocabulary could be created using simple set operation Eg: set(self.tokens_final) add <s> , </s> , <uk> to vocabulary as we would be using these symbols.
BUILD MODEL:
Before defining how to build model note that we would using additional symbols like: <s> to denote start of sentence, </s> to denote end of sentence and <uk> to denote words that are not in vocabulary ie. OOV words.we would be using Trie data structure to store our ngrams the reason being:
- Efficiently extract the count of word.
- Finding ngrams from indices and vice-versa.
- store prefix information.
- Can be used to extract info about smaller grams( n-1 and lesser).
Let’s start with creating our model, For each sentence in our preprocessed text data repeat the below steps note: you can define your own sentence delimiters or use sentence boundary detection algorithms to identify end of sentence. We would use (‘.’ , ‘?’ , ‘!’ , …) as sentence delimiters.
Traversing Trice you can extract your ngrams:
Bigram Model:
The table shows the count of bigrams with previous seen word as <s> ie. It shows all the words that occur as the start of the sentence in our training corpus.
P(<s>,i) = C(<s>,i) / C(<s>)
ie. count of bigrams (<s>,i) / count of unigram <s>
so, = 21 / C(<s>)
Trigram Model:
P(<s>,i, have) = C(<s>,i, have) / C(<s>, i).
ie. count of trigrams(<s>,i, have) / count of bigram <s>,i
so, = 3/ 21
P(<s>,i, should) = C(<s>,i, should) / C(<s>, i).
= 3/21
GENERATING TEXT:
Let’s generate text using language model that we build.We will fix <s>, and let model generate the next most likely word seeing previous history of words.
EVALUATING MODEL:
As seen in previous part the performance of the language model is evaluated using intrinsic method — perplexity. To prevent our language model from assigning zero probabilities to unseen events, we adopt Smoothing or Discounting. Here we would be using Kneser-Ney smoothing. We would be evaluating the performance of our language model on test data that is completely different that our training corpus.
Thus Language models offer a way assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.n-gram language models are evaluated extrinsically in some task, or intrinsically using perplexity.The perplexity of a test set according to a language model is the geometric mean of the inverse test set probability computed by the model.Smoothing algorithms provide a more sophisticated way to estimate the probability of n-grams. Commonly used smoothing algorithms for n-grams rely on lower-order n-gram counts through backoff or interpolation. Kneser-Ney smoothing makes use of the probability of a word being a novel continuation. The interpolated Kneser-Ney smoothing algorithm mixes a discounted probability with a lower-order continuation probability
Sources:
[1] CHAPTER DRAFT — | Stanford Lagunita. https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf
Thanks for reading. Stay tuned for next story — POS Tagging using Hidden Markov Models (HMM) & Viterbi algorithm
If you have any feedback, please feel to reach out by commenting on this post.
Links to the youtube videos
Viterbi decoding algorithm : https://youtu.be/yGtC10JJNq8
Machine Learning : https://youtu.be/Q3mZui3H6MM