Share with your network!

In previous post, we introduced concept of Markov “memoryless” process and state transition chains for certain class of Predictive Modeling. In this post, we continue mathematical treatment and learning of Markov model.

As you will have noted from last post, Markov processes are represented by series of state transitions in a directed graph. Classical Markov process is of order one i.e. next state transition depends only on current state and not on how current state has been reached, but Markov processes can be of higher order too. Popular children’s game Snakes and Ladder is one example of order one Markov process. We will also make another assumption that events happen in discrete time
, that is, at each discrete time interval one and only one state transition happens. This is again not uncommon in practice. Requirement of time being discrete unit is not mandatory requirement as long as key properties of assumptions are understood. For instance, a person browsing webpages (example we also used in last post) can be considered transitioning from one webpage to another in Markov fashion, even if time to change is not uniform or consistent across persons.

Training Data

Training observations for Markov modeling consists of number of sequence of states. For following process flow diagram in Fig. 1 may generate sequences as shown in Fig. 2 where 40 random sequences have been generated. Note that true process rarely be known in real world, and what will be observed are simply manifest sequences. Here true process flow is shown only for illustration.

Figure 1: True but Unobserved Markov Process

true but unobserved Markov Process

Figure 2: 40 Random Sequences


Given this data, how will we go about learning the Markov process?

First, we need to find number of states. This is easy – there are 6 unique state symbols, 1-6. Of course, in practice data always is not universe of data but a sample. There may be case where some rare states remain unobserved in the training data. This will introduce error in the model since we are not learning reality, but we have to accept this as we cannot do anything except try to get better training data.

Learning Markov Model of Order One

Second, we need to assume order of Markov process. Let’s first try to learn Markov process of order = 1. How? Well, strangely, and very conveniently for us, Maximum Likelihood Estimator of Markov Process is simply count based observed transition probabilities. Statistical details are in this paper for interested reader.

Thus, estimated probability of transition from 1 to 2 is number of times transition from 1 to 2 was observed, divided by number of times transition from 1 to anywhere was observed.


If we sample 70%, or about 31 of these observations for training, and do the counting exercise, we will obtain transition matrix as show in Fig. 3 which is compared against “true” matrix which was used to generate the data. While number of training sequences was fairly small, we are able to do pretty decent job of estimation. In practice, of course, there is no ‘truth’ known. There are no standard metric for performance of fit, since true transition probabilities are not known. Often, data is separated in multiple equal-size chunks and separate estimations are done. If they are close to each other – say, as measured by Root-Mean-Squared-Error – then we have more confidence in estimated probabilities.

Figure 3: Order 1 Markov Model

Order 1 Markov Model

Learning Markov Model of Order Two

But how do we know, if order of Markov process is really 1? Let’s assume it’s 2. Now we compute the estimated transition probabilities in same manner, as


Doing so, produces estimated probabilities, as shown in Fig. 4. For brevity, only non-zero values are presented. Note that since ‘true’ process is only of Order 1, true probabilities are independent of index i.

Figure 4: Order 2 Markov Model

Order 2 Markov Model

How do we know which of Order 1 or Order 2 is true estimation? Here comes cross-validation handy, as is often common in Machine Learning models. We can compute probabilities of observing sequences from cross-validation data, and whichever model provides maximum value is likely correct model. With a caveat which we will come to later.

Let’s pick a sequence 125456 from rest of 30% data. Probability of observing this sequence is

P(12456)= P(2|1)*P(5|2)*P(4|5)*P(5|4)*P(6|5)

P(12456)= 1*0.4*0.4*1*0.6= 0.096

according to Order 1 Markov process. Same for Order 2 is

P(12456)= P(5|12)*P(4|25)*P(5|54)*P(6|45)

P(12456)= 0.4*0.6*1.0*0.7= 0.168

So, for this sequence, Order 2 seems more likely. We will have to do same for all sequences in cross-validation data, and multiply all numbers to obtain joint probability of obtaining observed data. This is actually the very calculation of likelihood which is used in estimation above.

Now about that caveat: Given that two models perform nearly similar, we should always prefer a simpler model, that is, model of smaller order. This can be achieved by applying a penalty to likelihood estimation which is function of order of model. If penalty simply is order, then we obtain what’s known as Akaike Information Criterion (AIC).

In next post, we will talk about training Markov models in cases where observed states are not true states of system.

Other Articles by the same author

Curse Dimensionality

Semi-Supervised Clustering

Other Related Links that you may like

Overview of Text Mining

Role of Business Analyst