linear time string matching using KMP matching algorithm

Given an input string S and a word w, find all occurrences of the word w in S in linear time. In other words, explain linear-time string matching using the KMP matching algorithm.

For example, if S = “ABABAABA” and w = “ABA” then there are 3 occurrences of w in S at {0, 2, 5}.

A trivial solution to this problem would be to scan S from left to right (using a loop index i over S) and do a character-by-character match of w (using an inner loop index j over w). If a character comparison fails at the current position i, the search restarts from the next position i+1 and redoes the character-by-character match of w against S. This is an O(nm) algorithm in the worst case (for example, S = "aaa...a" and w = "aa...ab"), which is certainly inefficient for longer patterns. But there is a very beautiful and efficient solution, based on an analysis of the trivial solution, that minimizes the amount of character-to-character matching after a mismatch by using a finite automaton.

int i = 0;
while (i <= S.length - w.length) {
    int j = 0;
    // compare w character by character against S starting at position i
    while (j < w.length && S[i + j] == w[j]) {
        j++;
    }
    if (j == w.length) {
        // w occurs in S at position i
    }
    i++;
}

For example, let's consider finding the pattern w="nano" in the input string S="banananobano". Observe that when the brute-force search gets a mismatch at [i=2,j=3] (i.e. S[i+j]!=w[j]), it restarts the search from the next position [i=3,j=0]. But notice that in the previous step [i=2,j=2] we had already found a partial match "nan" up to j=2 (i.e. S[i+j]='n'=w[j]). The trivial search does not use this valuable information. Think about it: what will the brute-force search do next? It is eventually going to find a mismatch at [i=3,j=0] and restart the search from [i=4,j=0]. But the partial match "nan" ends with 'n', which is also a prefix of "nano", so at the mismatch at [i=2,j=3] we could have jumped straight to [i=4,j=1], skipping both the redundant search at [i=3,j=0] and the re-examination of S[4]. By skipping redundant comparisons in this way we do much better computationally than the trivial algorithm. Yeah! The KMP matching algorithm does exactly this!

The KMP matching algorithm is based on a finite automaton: it implicitly builds the transition table of an automaton that matches the pattern, using a linear-time preprocessing step over the pattern. The automaton has one state for each amount of the pattern matched so far. In the brute-force solution, matching the next character advances to the next state, and any mismatch transitions all the way back to the beginning. In KMP, for patterns with overlapping structure, appropriate transitions are added that take the automaton back to the next best partial match on a mismatch.
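As an aside, the transition table can also be built fully explicitly. The sketch below is an assumption on my part: it follows the classic explicit-DFA formulation of KMP (as popularized in Sedgewick's Algorithms) rather than the failure-function version developed in the rest of this post, and the names buildDFA and search are mine. dfa[c][j] is the state reached on reading character c while in state j:

```java
public class KmpDfa {

    // Build the explicit transition table over an alphabet of size R.
    static int[][] buildDFA(String pat, int R) {
        int m = pat.length();
        int[][] dfa = new int[R][m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++) {
                dfa[c][j] = dfa[c][x];     // on mismatch, behave like restart state x
            }
            dfa[pat.charAt(j)][j] = j + 1; // on match, advance to the next state
            x = dfa[pat.charAt(j)][x];     // update the restart state
        }
        return dfa;
    }

    // Return the index of the first occurrence of pat in txt, or -1.
    static int search(String txt, String pat) {
        int[][] dfa = buildDFA(pat, 256);
        int m = pat.length();
        int j = 0;
        for (int i = 0; i < txt.length(); i++) {
            j = dfa[txt.charAt(i)][j];     // consume each text character exactly once
            if (j == m) {
                return i - m + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("banananobano", "nano")); // 4
    }
}
```

Each text character is consumed exactly once, so the scan is O(n) after the O(Rm) table construction; the failure-function form below achieves the same effect with only O(m) extra space.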

The automaton is implemented in such a way that at each step the internal loop index j is set either to j+1 if a match occurs, or to the next best partial match, P[j], if a mismatch occurs. P[j] is a function of j and the pattern w alone; it does not depend on the text S being searched. So we can draw an automaton with the values of j as the finite states and the matching character at position j as the transitions of the automaton. Note that the following example figure is taken from this reference page.

[Figure: KMP-DFA]

How do we construct the transition table for a pattern? We know that there is a state for each character of the pattern plus an initial state. But what about the transitions? Whenever we have a match at state j we simply transition to the next state j+1. But when we have a mismatch at state j, we transition to the next best partial match, P[j]. The question is how to find that next partial match efficiently. Observe that when we have a partial match up to "nan", the next partial match 'n' is a proper suffix (a suffix not equal to the whole string) of the partial match "nan" that is also a prefix of the original pattern "nano". In that sense:

The next best partial match at position j is LSP(j): the longest proper suffix of prefix(j), the first j characters of w, that is also a prefix of the pattern w. P(j) is the length of LSP(j).

So P(j) is basically the amount of backtracking needed during a mismatch at the current state j. Let's work through an example to make the above statements concrete. Take the pattern w="ababad". Let's compute LSP(j), the longest proper suffix of prefix(j) that is also a prefix of the pattern, for j=0 to m, and then P[j], the index of the next best partial match at j, as the length of LSP(j).

------------------------------------------------
| j |    prefix(j)       |  LSP (j) |   P[j]   |
------------------------------------------------
| 0 | "" (empty string)  | ""       |    0     |
------------------------------------------------
| 1 | a                  | ""       |    0     |
------------------------------------------------
| 2 | ab                 | ""       |    0     |
------------------------------------------------
| 3 | aba                | a        |    1     |
------------------------------------------------
| 4 | abab               | ab       |    2     |
------------------------------------------------
| 5 | ababa              | aba      |    3     |
------------------------------------------------
| 6 | ababad             | ""       |    0     |
------------------------------------------------
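The table can be checked mechanically. The snippet below is a hypothetical helper written just for this post (not part of the KMP preprocessing itself): it derives each P[j] directly from the definition, by trying every proper suffix of prefix(j) from longest to shortest:

```java
public class NaiveLSP {

    // Length of the longest proper suffix of prefix(j) that is also a prefix of w.
    static int lspLength(String w, int j) {
        String prefix = w.substring(0, j);
        for (int len = j - 1; len > 0; len--) {
            // the suffix of prefix(j) of length len equals the prefix of w
            // of length len exactly when this endsWith check succeeds
            if (prefix.endsWith(w.substring(0, len))) {
                return len;
            }
        }
        return 0; // only the empty suffix qualifies
    }

    public static void main(String[] args) {
        String w = "ababad";
        for (int j = 0; j <= w.length(); j++) {
            System.out.println(j + " -> " + lspLength(w, j)); // 0,0,0,1,2,3,0
        }
    }
}
```

Each entry costs O(j^2) this way, which is exactly the inefficiency the linear-time preprocessing function later in the post avoids.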

Note from the above table that when we have the partial match "ababa" at j=5, we also have the next-longest partial match "aba" at j=3, which in turn contains the next-longest partial match "a" at j=1. In that sense we are actually building a finite automaton where P[j] points to the next state in case of a mismatch at the current state j. That is, P[j] doesn't only encode the best partial match under index j, but chains through every partial match of it: P[j] is the first best partial match, P[P[j]] the second best, P[P[P[j]]] the third, and so on. This means we can calculate P[j] from the values P[k] for all k&lt;j.

The next best partial match at j is the longest partial match at some k&lt;j whose "expanding character" equals the last character of prefix(j), i.e. w[j-1].

So, to compute P[j] for j=0 to m, we examine the candidate partial matches LSP(k), k&lt;=j-1, in descending order of length, and check whether w[j-1] is the "expanding character" of the candidate, i.e. whether it extends that match by one character. If no candidate can be expanded, then LSP(j) is empty and P[j]=0. Otherwise P[j] is the length of the longest expandable candidate plus one. Below is the preprocessing function that computes P[j] for j=0 to m. It runs in O(m) time overall.

private int[] computeLSP(final char[] pattern, final int m) {
    final int[] P = new int[m + 1];

    P[0] = -1;
    P[1] = 0; // always true
    // k is the index of the longest next partial match under
    // index j - 1
    int k = 0;

    for (int j = 2; j <= m; j++) {
        k = P[j - 1];
        do {
            // check if last character at j-1 is the expanding character of the next partial match at k
            if (pattern[k] == pattern[j - 1]) {
                // if expanding then we can expand the partial match
                P[j] = k + 1;
                break;
            }
            // if we cannot "expand" even the empty string
            if (k == 0) {
                P[j] = 0;
                break;
            }
            // else go to the next best "candidate" partial match
            k = P[k];
        } while (true);
    }

    return P;
}
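As a quick sanity check, calling this function on the example pattern w="ababad" reproduces the P[j] column of the table, with the extra sentinel P[0]=-1 (whose role is explained below). The snippet repeats computeLSP in compact form so it compiles on its own:

```java
public class LSPDemo {

    // Compact copy of the preprocessing function above.
    static int[] computeLSP(char[] pattern, int m) {
        int[] P = new int[m + 1];
        P[0] = -1;
        P[1] = 0;
        for (int j = 2; j <= m; j++) {
            int k = P[j - 1];
            while (true) {
                if (pattern[k] == pattern[j - 1]) { P[j] = k + 1; break; }
                if (k == 0) { P[j] = 0; break; }
                k = P[k]; // fall back to the next best candidate match
            }
        }
        return P;
    }

    public static void main(String[] args) {
        int[] P = computeLSP("ababad".toCharArray(), 6);
        System.out.println(java.util.Arrays.toString(P)); // [-1, 0, 0, 1, 2, 3, 0]
    }
}
```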


With LSP(j) and P[j] computed for j=0 to m, we can find the occurrences of the pattern w in the string S by traversing the automaton as we discussed earlier.

P[j] indicates the amount of backtracking needed when the current match at j ends in a mismatch. That is, if we have a match starting at S[i] that fails at character S[i + j] (i.e. S[i + j] != w[j]), then the next best match will start at index i + j - P[j] in S. This has two implications: first, P[0] = -1, which indicates that if w[0] is a mismatch we cannot backtrack and must simply check the next character; and second, although the next possible match begins at index i + j - P[j], as in the example above, the first P[j] characters of that new alignment are already known to match, so we continue matching from w[P[j]].

Below is a simple implementation of the KMP search to find all the occurrences of a pattern in a text in O(n+m) time.

List<Integer> KMPSearch(final String text, final String pattern) {
    final List<Integer> resultSet = new ArrayList<Integer>();
    if (null == text || text.isEmpty() || null == pattern || pattern.isEmpty() || text.length() < pattern.length()) {
        return resultSet;
    }

    final int n = text.length();
    final int m = pattern.length();

    // Compute the next partial matches for each position of the pattern
    final int[] P = computeLSP(pattern.toCharArray(), m);
    int i = 0; // text index (start of the current candidate match)
    int j = 0; // finite automaton state: number of pattern characters matched

    while (j + i < n) {
        // if we have a match starting at text[i] that fails when comparing text[i + j] to pattern[j], then the next
        // possible match will start at index i + j - P[j] in text (that is, P[j] is the amount of "backtracking" we
        // need to do after a mismatch).
        if (text.charAt(i + j) != pattern.charAt(j)) {
            i = i + j - P[j];
            j = -1 < P[j] ? P[j] : 0;
        }
        // otherwise we can "expand" the current match. Check if expanded partial match is a full match or not
        else if (j++ == m - 1) {
            resultSet.add(i);
            // restart from the longest proper suffix of the full match that is
            // also a prefix (P[m]); resetting j to 0 here would re-scan matched
            // characters and make the search quadratic on periodic inputs
            i = i + m - P[m];
            j = P[m];
        }
    }
    return resultSet;
}
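Putting it together on the opening example: the driver below (with both functions repeated in compact form so the snippet compiles on its own) finds w="ABA" in S="ABABAABA" at the expected positions {0, 2, 5}:

```java
import java.util.ArrayList;
import java.util.List;

public class KmpDemo {

    // Compact copy of the preprocessing step.
    static int[] computeLSP(final char[] pattern, final int m) {
        final int[] P = new int[m + 1];
        P[0] = -1;
        P[1] = 0;
        for (int j = 2; j <= m; j++) {
            int k = P[j - 1];
            while (true) {
                if (pattern[k] == pattern[j - 1]) { P[j] = k + 1; break; }
                if (k == 0) { P[j] = 0; break; }
                k = P[k]; // fall back to the next best candidate match
            }
        }
        return P;
    }

    // Compact copy of the search.
    static List<Integer> kmpSearch(final String text, final String pattern) {
        final List<Integer> result = new ArrayList<>();
        if (text == null || pattern == null || pattern.isEmpty() || text.length() < pattern.length()) {
            return result;
        }
        final int n = text.length();
        final int m = pattern.length();
        final int[] P = computeLSP(pattern.toCharArray(), m);
        int i = 0, j = 0;
        while (i + j < n) {
            if (text.charAt(i + j) != pattern.charAt(j)) {
                i = i + j - P[j];         // backtrack the text alignment
                j = P[j] > -1 ? P[j] : 0; // resume from the partial match
            } else if (j++ == m - 1) {
                result.add(i);            // full match found at i
                i = i + m - P[m];         // continue from its longest border
                j = P[m];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(kmpSearch("ABABAABA", "ABA")); // [0, 2, 5]
    }
}
```

Note how the overlapping occurrences at 0 and 2 are found without ever re-reading a matched text character more than a constant number of times, which is where the O(n+m) bound comes from.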
