The Viterbi algorithm provides an efficient way of finding the most likely state sequence, in the maximum a posteriori probability sense, for a process assumed to be a finite-state, discrete-time Markov process.
I highly recommend knowing the basics of the Viterbi decoding algorithm before reading further; https://youtu.be/yGtC10JJNq8 covers those basics.
Let's get started. You will need the following libraries installed before going further; import them once the installation is done.
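A minimal setup sketch (I am assuming nltk is the main dependency here; the download calls fetch the data used in the next steps):

```python
# One-time setup (adjust to your environment):
#   pip install nltk

import nltk

# Download the corpus and the tag mapping used in this post.
nltk.download('treebank')
nltk.download('universal_tagset')
```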
We will be using the Treebank corpus and the Universal Tagset. You can refer to http://cse.iitkgp.ac.in/~sudeshna/courses/NLP19/Universal%20POS%20tags.html to learn more about Universal POS tags.
Step 1: Dataset Creation
Now, let's look at the tagged sentences of the Penn Treebank corpus. The code below fetches them using nltk.corpus.treebank.tagged_sents().
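A minimal sketch of that call (the variable name tagged_sentences is my own choice):

```python
import nltk

# Penn Treebank sentences, with the tags mapped to the Universal Tagset.
tagged_sentences = nltk.corpus.treebank.tagged_sents(tagset='universal')
print(tagged_sentences[:2])
```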
You'll get output something like the one shown below:
[[('Mr.', 'NOUN'), ('Vinken', 'NOUN'), ('is', 'VERB'), ('chairman', 'NOUN'), ('of', 'ADP'), ('Elsevier', 'NOUN'), ('N.V.', 'NOUN'), ……]]
Step 2: Calculating Emission and Transition Probabilities
2.1 Calculating Emission probabilities:
b_i(o) = Count(i→o) / Count(i)
where Count(i) is the number of times tag i occurs in the training set and Count(i→o) is the number of times the observed word o is tagged with i. To avoid zero probabilities for unseen word–tag pairs, it is possible to use Laplace smoothing. In general, the Laplace-smoothed estimate can be written as:
b_i(o) = (Count(i→o) + 1) / (Count(i) + n)
where n is the number of distinct words in the vocabulary.
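Here is one way the smoothed emission probabilities could be computed. The dictionary and function names are my own, and I take n to be the vocabulary size:

```python
from collections import defaultdict
import nltk

tagged_sentences = nltk.corpus.treebank.tagged_sents(tagset='universal')

tag_count = defaultdict(int)                             # Count(i)
emission_count = defaultdict(lambda: defaultdict(int))   # Count(i -> o)

for sentence in tagged_sentences:
    for word, tag in sentence:
        tag_count[tag] += 1
        emission_count[tag][word] += 1

vocab = {word for sentence in tagged_sentences for word, _ in sentence}
n_words = len(vocab)                                     # n in the formula above

def emission_prob(tag, word):
    """b_i(o) = (Count(i -> o) + 1) / (Count(i) + n)."""
    return (emission_count[tag][word] + 1) / (tag_count[tag] + n_words)
```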
2.2 Calculating Transition probabilities:
(P)_ij = p_ij = P(X_1 = j | X_0 = i) = P(X_(n+1) = j | X_n = i) for any n.
p_ij is the probability of making a transition FROM state i TO state j in a SINGLE step. Let's use Laplace smoothing here as well. In general, the smoothed estimate can be written as:
p_ij = (Count(i→j) + 1) / (Count(i) + n)
where Count(i→j) is the number of times tag j follows tag i in the training set and n is the number of distinct tags.
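And a matching sketch for the smoothed transition probabilities. Here I pad each sentence with the "." start state described in Step 3.1 and take n to be the number of distinct tags (both are my assumptions about the setup):

```python
from collections import defaultdict
import nltk

tagged_sentences = nltk.corpus.treebank.tagged_sents(tagset='universal')

transition_count = defaultdict(lambda: defaultdict(int))  # Count(i -> j)
prev_tag_count = defaultdict(int)                         # Count(i)

for sentence in tagged_sentences:
    # Prepend '.' so transitions out of the start state are also counted.
    tags = ['.'] + [tag for _, tag in sentence]
    for prev_tag, next_tag in zip(tags, tags[1:]):
        transition_count[prev_tag][next_tag] += 1
        prev_tag_count[prev_tag] += 1

all_tags = sorted({tag for sentence in tagged_sentences for _, tag in sentence})
n_tags = len(all_tags)                                    # n in the formula above

def transition_prob(prev_tag, next_tag):
    """p_ij = (Count(i -> j) + 1) / (Count(i) + n)."""
    return (transition_count[prev_tag][next_tag] + 1) / (prev_tag_count[prev_tag] + n_tags)
```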
Step 3: Building the Viterbi Model
3.1 Initialization:
states[] = array that stores the best hidden state sequence; probabilities[] = array of the appended max probabilities; start state = "."; end state = ".".
3.2 Recursion
The multiplication consists of the previous Viterbi variable of state i, times the transition probability from state i to state j, times the emission probability of the observation o from state j. In the notation of Step 2, the Viterbi variable is updated as v_t(j) = max over i of [ v_(t-1)(i) · p_ij · b_j(o_t) ].
3.3 Termination
The termination step gives the probability of the entire state sequence up to point T + 1 having been produced, given the observations and the HMM's parameters; the best tag sequence is then read off by back-tracing the stored states. To tag the data, simply enumerate the word sequences in the dataset and decode each one.
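Putting the three sub-steps together, here is a compact sketch of the decoder. It assumes the emission_prob, transition_prob and all_tags names from the sketches above, and it follows the initialization / recursion / termination description rather than any particular reference implementation:

```python
def viterbi_decode(words, start_state='.'):
    """Return the most likely tag sequence for a list of words."""
    # 3.1 Initialization: probability of reaching each tag from the start state
    # for the first word.
    viterbi = [{tag: transition_prob(start_state, tag) * emission_prob(tag, words[0])
                for tag in all_tags}]
    backpointer = [{tag: start_state for tag in all_tags}]

    # 3.2 Recursion: previous Viterbi variable * transition prob * emission prob.
    for word in words[1:]:
        column, pointers = {}, {}
        for tag in all_tags:
            best_prev = max(viterbi[-1],
                            key=lambda prev: viterbi[-1][prev] * transition_prob(prev, tag))
            column[tag] = (viterbi[-1][best_prev]
                           * transition_prob(best_prev, tag)
                           * emission_prob(tag, word))
            pointers[tag] = best_prev
        viterbi.append(column)
        backpointer.append(pointers)

    # 3.3 Termination: take the best final state and back-trace the stored states.
    best_tag = max(viterbi[-1], key=viterbi[-1].get)
    tags = [best_tag]
    for pointers in reversed(backpointer[1:]):
        tags.append(pointers[tags[-1]])
    return list(reversed(tags))
```

For long sentences the repeated multiplication of small probabilities can underflow, so working with log probabilities (summing instead of multiplying) is a common refinement.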
Let’s Test our model
- Case 1: The sentence is from the corpus, sentence1 = "Vinken is chairman of Elsevier"
- Case 2: The sentence is not from the corpus, sentence2 = "Processing text data so that able to infer some information which is useful"
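Running the two cases through the sketch above (this assumes the viterbi_decode function from Step 3; the exact tags you get may differ from the output shown here):

```python
sentence1 = "Vinken is chairman of Elsevier"
sentence2 = "Processing text data so that able to infer some information which is useful"

for sentence in (sentence1, sentence2):
    words = sentence.split()
    print(list(zip(words, viterbi_decode(words))))
```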
Output:
Sentence 1: "Vinken is chairman of Elsevier"
Sentence 2: "Processing text data so that able to infer some information which is useful"
Summing up:
- Calculate the tag transition probabilities
- Calculate the word emission probabilities
- Initialize the state matrix that stores the states that gave the max probabilities
- Initialize the probabilities array that stores the probabilities from all previous-layer states to the current-layer state
- Get the possible tags for every word in the sentence
- Use the emission and transition probabilities to find the best possible state/tag sequence with the Viterbi algorithm
Sources:
[1] Daniel Jurafsky and James H. Martin, Speech and Language Processing (3rd ed. draft), Chapter 8. https://web.stanford.edu/~jurafsky/slp3/8.pdf
If you have any feedback, please feel free to reach out by commenting on this post.
Links to the YouTube videos:
Viterbi decoding algorithm: https://youtu.be/yGtC10JJNq8
Machine Learning: https://youtu.be/Q3mZui3H6MM