Markov chain is a graph model which describes transfering from one state to another with certain probability. In daily life you can experience it, when using smart keyboard on your portable device. You may notice, that keyboard is slowly learning what you are typing and after some time, can predict next word. This characteristics can be used to create simple chat bot or application which mimics someones writing. For example, if your job requires you to fill yearly performance review, this might be helpful ;) In following paragraphs I’m going to show you, how to create your own Markov chain from provided text file. At first we need to formulate constraints about our text, which should be:
- Written with latin letters only (so numbers are not included)
- Might contain dot character
With above limitations we construct following algorithm
- Split text using whitespaces
- Remove any non-latin characters from every splitted word
- If word contains dot character, then treat them as separate word
- Make every word lowercase
- Take first word from list and mark it as previous
- Take next word from list and add it as previous’ connection
- If previous doesn’t contain word as connection make counter equal 1 for this word
- Else if previous contains word as connection, increment counter
- Mark current word as previous
- Go to point 2 if there there are still words to process
Above algorithm will make a chain of words with structure as shown below:
states / (5) united -- (2) nations \ (1) kingdom
As you can see, every connection has a weight, which affects probability of randomly choosing it as a next word. For initial word united we have 5/8 chances for choosing states, 1/4 for nations and 1/8 for kingdom as a next word. Let’s write some code.
At first we should split text into tokens.
As you can see, we introduced space character before every dot, which let as treat dot as a separate word. Then we lowercased every word and removed any non-latin letter or dot characters. Then we filtered out every empty word and at the end we added every word into some repository.
As you can see, NodeRepository is a class which contains a map of nodes, which are simple containers for words. Every new connection is being added to last used node and current (already added) node is set as a last one. We can also retrieve any node, check if it already exist in repository or get random connection from the node of our interest.
The node consists of word pattern, the word variable, connections count for current word and connections map which is being used to get random connection node. Here comes a little magic, because we need to draw a random connection using weights. Without them we’d just just random.nextInt(connectionsTotal) but our problem is a little more complicated. But it’s not rocket science. So, let’s assume we’ve got 9 total connections. 3 for X, 2 for Y and 4 for Z. We can spread them out on the axis so, Z will take 4 units, X 3 and Y will take 2 like below:
Z4 X3 Y2 |====|===|==| 0123 456 78 r
Now we can draw a random number from 0 to 8 and check in which secion it will fit. There are two approchaches for that (actually, there are more complicated algorithms for that, but we will cover only two simplest) we can create an array of possible choices, which will save us time, but no memory. This will give us time complexity of O(1), but memory complexity of O(n^2), where n is connections count. We can also just iterate every element and check if our random number fits the range. This will give us O(n) time complexity and also O(n) memory complexity. The algorithm is following:
- Draw random number r, less or equal n
- Set i := 0
- Get next element from connections collection
- Set i := i + element.weight
- If r < i then return element
- Go to point 3.
In our case r = 5, so when we’re processing word Z we’re using inequality 5 < 4. This statement is false, so we’re processing another word, X for which i = Z + Y = 4 + 3 = 7, so we’re having inequality 5 < 7 which is true and now we know, that we’ve drawn word X.
After we get random word, we can use NodeRepository#getNode(word) method to retrieve node for it and then repeat from the beginning as long as we reach an end point, which can be dot character (if we want to create just single sentence) or final word count in output string. Such code may look like below example from my TextProcessor class:
And that is roughly as much as we need to do. We can get a little trouble with testing code which peeks random connection from word, but we can solve this by mocking random number generator:
In above code in RandomMock class we overrided Random#nextInt method, which now returns numbers from collection provided by us. Here numbers are being shuffled, but we know, that this will always be collection of integrals from 0 to 15 inclusive. So even when words are being returned randomly, we still know that distribution is flat.