8.3. Knuth–Morris–Pratt Algorithm

  • File: KMP.ml

This is the first algorithm for string matching in \(O(n)\), where \(n\) is the size of the text where the search takes place). It has been independently invented by Donald Knuth and Vaughan Pratt, and James H. Morris, who published it together in a joint paper.

It is known as one of the most non-trivial basic algorithms, and is commonly just presented as-is, with explanations of its pragmatics and the complexity. In this chapter, we will take a look at a systematic derivation of the algorithm from a naive string search, eventually touching upon a very interesting (although somewhat non-obvious) idea — interrupted partial matches can be used to optimise the search in the rest of the text by “fast-forwarding” through several positions, without re-matching completely starting from the next character.

A very nice video explaining KMP is available on YouTube.

This chapter shows how to systematically derive KMP from the naive search. We will not cover this in the lecture, but you are encouraged to go through the steps below. The material of this chapter is based on this blog article, which, in its turn is based on this research paper by Prof. Olivier Danvy and his co-authors.

8.3.1. Revisiting the naive algorithm

Let us start by re-implementing the naive research algorithm with a single loop that handles both indices k and j, so the former ranges over the text, and the latter goes over the pattern:

let naive_search_one_loop text pattern =
  let n = length text in
  let m = length pattern in
  if n < m then None
  else
    let k = ref 0 in
    let j = ref 0 in
    let stop = ref false in
    let res = ref None in
    while !k <= n && not !stop do
      if !j = m
      then (
        res := Some (!k - !j);
        stop := true)
      else if !k = n then (stop := true)
      else if text.[!k] = pattern.[!j]
      then (
        k := !k + 1;
        j := !j + 1)
      else  (
        k := !k - !j + 1;
        j := 0)
    done;
    !res

If a mismatch occurs at a certain position of k and j (text.[!k] = pattern.[!j]), then k is set up for j positions back, plus one (to move forward), while j is restarted from 0. That is, the variant of the loop is still k.

8.3.2. Returning the Interrupt Index

Let us refactor the code of naive_search_one_loop into a recursive procedure search. While doing so, we also make a dedicated data type search_result that either returns an index where the pattern begins (Found i) or a position j at a pattern, at which the text string has ended (Interrupted j):

type search_result =
  | Found of int
  | Interrupted of int

The result is than processed by a generic function global_search that converts it to a value of the option type:

let global_search search text pattern =
  let n = length text in
  let m = length pattern in
  let res = search pattern m text n 0 0 in
  match res with
  | Found x -> Some x
  | _ -> None

let search_rec =
  let rec search pattern m text n j k =
    if j = m then
      Found (k - j)
    else if k = n then
      Interrupted j
    else if pattern.[j] = text.[k] then
      search pattern m text n (j + 1) (k + 1)
    else
      search pattern m text n 0 (k - j + 1)
  in
  global_search search

The signature of the inner search seems quite verbose, but it is important for the derivation, which is coming: the first three parameters are the pattern, its size m and the text; n stands for the size of the text, but it also limits the search range on the right (and will be a subject to manipulation in the future). Finally, j and k are the current (and also initial for the first run) values of the running indices within pattern and text, correspondingly.

So far, we don’t make any interesting use of a interrupt index j in the case when the inner search returns Interrupted j

8.3.3. Relating Matched Text and the pattern

Let us notice the first interesting detail: at any moment, the positions of j and k relate the prefix of the pattern and a substring of text that fully match. This can be reinforced by the following invariants, instrumenting the search body:

let search_inv =
  let rec search pattern m text n j k =
    assert (0 <= j && j <= m);
    assert (j <= k && k <= n);
    assert (sub pattern 0 j = sub text (k - j) j);

    if j = m then
      Found (k - j)
    else if k = n then
      Interrupted j
    else if pattern.[j] = text.[k] then
      search pattern m text n (j + 1) (k + 1)
    else
      search pattern m text n 0 (k - j + 1)
  in
  global_search search

Therefore, at the last call search pattern m text n 0 (k - j + 1) we might be dropping essential information – the fact that the interval [k j, k) of the text matches the interval [0, j) of the pattern.

8.3.4. Fast-Forwarding Search using Interrupt Index

To exploit the information about already-matched prefix of the pattern, let us split the search, after the interruption, in the shifted range [k j + 1, n) into the search in the intervals [k j + 1, k) and [k, n).

This is possible due to the following fact. For any l, such that for k <= l <= n, the call search pattern m text n j k is equivalent to:

let result = search pattern m text l j k in
 match result with
 | Found _ ->
     result
 | Interrupted j' ->
     search pattern m text n j' l

That is, we can search up to l, and, if interrupted, start from searching l from an fast-forwarded position j'. That is due to the fact that we have managed to reach l and got Interrupted j', so there is no need to re-check the first j' - 1 pattern characters as they match the suffix [k, l) of text.

By using this observation, we can split the last call in the previous version of search into the case j = 0 (which is simple to handle by just incrementing k), and the case of j <> 0, in which case we will calculate the interruption index for computing the search starting at k + 1 - j:

let search_with_shift =
  let rec search pattern m text n j k =
  if j = m then
    Found (k - j)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    search pattern m text n (j + 1) (k + 1)
  else if j = 0 then
    search pattern m text n 0 (k + 1)
  else
    let result = search pattern m text k 0 (k - j + 1) in
    match result with
    | Found _ ->
        result
    | Interrupted j' -> search pattern m text n j' k
  in
  global_search search

Let us notice that the search search pattern m text k 0 (k - j + 1) is deemed to fail, as it searches in the range k - (k - j + 1) = j - 1 < m. However, when it fails, it will give us j', such that it can be used as an initial position in a pattern when starting at k.

Notice that there is some nicely hidden dependency there: the call to search pattern m text k 0 (k - j + 1) might run multiple smaller searches recursively, eventually hitting the right end of the range (i.e., k). As Interrupted j' is only returned when it happens, we can be sure that this is correct answer to the question “which position” should I start from processing the pattern, when I start processing the text from k. It might very well be the case that j' = 0.

8.3.5. Extracting the Interrupt Index

As the goal of calling search pattern m text k 0 (k - j + 1) in the code above is only to extract the fast-forwarding information, and it always fails, we can now make use of this information and eliminate some administrative “boilerplate” code:

let assertInterrupted = function
  | Found _       -> assert false
  | Interrupted j -> j


let search_assert =
  let rec search pattern m text n j k =
  if j = m then
    Found (k - j)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    search pattern m text n (j + 1) (k + 1)
  else if j = 0 then
    search pattern m text n 0 (k + 1)
  else
    let j' = assertInterrupted @@ search pattern m text k 0 (k - j + 1) in
    search pattern m text n j' k
  in
  global_search search

8.3.6. Exploiting the Prefix Equality

From the explanations above, recall that the sub-strings sub pattern 0 j and sub text (k - j) j are equal. Therefore, the sub-call search pattern m text k 0 (k - j + 1) searches for the pattern (or, rather, the interrupt index) within (a prefix of a suffix of) the pattern itself. Therefore, we can remove text from there, thus making this call work exclusively on a pattern:

let search_via_pattern =
  let rec search pattern m text n j k =
  if j = m then
    Found (k - j)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    search pattern m text n (j + 1) (k + 1)
  else if j = 0 then
    search pattern m text n 0 (k + 1)
  else
    (* So we're looking in our own prefix *)
    let j' = assertInterrupted @@ search pattern m pattern j 0 1 in
    assert (j' < j);
    search pattern m text n j' k

  in
  global_search search

8.3.7. Tabulating the interrupt indices

Since the information about interruptions and fast-forwarding can be calculating only using the pattern, without text involved, we might want to pre-compiled it and tabulate before running the search, obtaining a table : int array with this inforations. In other words the value j' = table.(j) answers a question: how many positions j' of the pattern can I skip when starting to look in a text, that begins with my pattern’s substring pattern[1 .. j] (i.e., precisely the value search pattern m pattern j 0 1).

If we had a table like this, we could forumlate search as the following tail-recursive procedure:

let rec loop table pattern m text n j k =
  if j = m then
    Found (k - j)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    loop table pattern m text n (j + 1) (k + 1)
  else if j = 0 then
    loop table pattern m text n 0 (k + 1)
  else
    loop table pattern m text n table.(j) k

To populate such a table, however, we will need the search procedure itself. However, the size of the pattern m is typically much smaller than the size of the text, so creating this table pays off. Int the following implementation the inner procedure loop_search defines the standard search (as before) and uses to populate the table, which is the used for the main matching procedure:

let search_with_inefficient_init =

  let loop_search pattern _ text n j k =
    let rec search pattern m text n j k =
      if j = m then
        Found (k - j)
      else if k = n then
        Interrupted j
      else if pattern.[j] = text.[k] then
        search pattern m text n (j + 1) (k + 1)
      else if j = 0 then
        search pattern m text n 0 (k + 1)
      else
        (* So we're looking in our own prefix *)
        let j' = assertInterrupted @@ search pattern m pattern j 0 1 in
        assert (j' < j);
        search pattern m text n j' k
    in

    let m = length pattern in
    let table = Array.make m 0 in
    for j = 1 to m - 1 do
      table.(j) <- assertInterrupted @@ search pattern m pattern j 0 1
    done;

    let rec loop table pattern m text n j k =
      if j = m then
        Found (k - j)
      else if k = n then
        Interrupted j
      else if pattern.[j] = text.[k] then
        loop table pattern m text n (j + 1) (k + 1)
      else if j = 0 then
        loop table pattern m text n 0 (k + 1)
      else
        loop table pattern m text n table.(j) k
    in

    loop table pattern m text n j k
  in

  global_search loop_search

8.3.8. Boot-strapping the table

We can rewrite the code in a more efficient manner by using the same loop function to populate the table. To do so, let us notice the two following intricate observation.

The value table.(j) can be computed in terms of the tabulated values at j - 1 and smaller. The base case is j = 1 corresponds to an empty interval, so table.(j) = 0, and we can start populating the table from j = 2. With this in mind, we can rewrite the search as follows:

let search_kmp =

  let loop_search pattern _ text n j k =
    let rec loop table pattern m text n j k =
      if j = m then
        Found (k - j)
      else if k = n then
        Interrupted j
      else if pattern.[j] = text.[k] then
        loop table pattern m text n (j + 1) (k + 1)
      else if j = 0 then
        loop table pattern m text n 0 (k + 1)
      else
        loop table pattern m text n table.(j) k
    in
    let m = length pattern in
    let table = Array.make m 0 in

    (*  In the case of j = 1, j' is 0 *)
    for j = 2 to m - 1 do
      table.(j) <- assertInterrupted @@
        loop table pattern m pattern j table.(j - 1) (j - 1)
    done;
    loop table pattern m text n j k
  in

  global_search loop_search

Notice that the mutual dependency between loop and table is resolved, as table is mutable, hence it can be altered by loop a-posteriori (the trick known and Landin’s knot – A technique named after Pater Landin for implementing recursive functions using mutable state).

This concludes our derivation of the Knuth-Morris-Pratt (KMP) algorithm, whose main idea is to pre-compute the table of fast-forwarding shifts for a given pattern, which is then used to avoid redundant work for re-matching already observed parts and the corresponding back-tracking.

The fact that the lookup in the table takes constant and the main iteration through text always progresses without backtracking, yields the linear complexity result \(O(n)\) for the final algorithm.

8.3.9. Comparing performance, again

  • File StringSearchComparison.ml

Let us compare the three studies string matching algorithms on regular and repetitive strings:

let compare_string_search n m =
  let (s, ps, pn) = generate_string_and_patterns n m in
  evaluate_search naive_search "Naive" s ps pn;
  evaluate_search rabin_karp_search "Rabin-Karp" s ps pn;
  evaluate_search search_kmp "Knuth-Morris-Pratt"  s ps pn

let compare_string_search_repetitive n =
  let (s, ps, pn) = repetitive_string n in
  evaluate_search naive_search  "Naive"  s ps pn;
  evaluate_search rabin_karp_search "Rabin-Karp"  s ps pn;
  evaluate_search search_kmp "Knuth-Morris-Pratt"  s ps pn

Here’s the result for repetitive strings, showing that RK and KMP are very close:

utop # compare_string_search_repetitive 50000;;

[Naive] Pattern in: Execution elapsed time: 1.310680 sec
[Naive] Pattern not in: Execution elapsed time: 1.312447 sec

[Rabin-Karp] Pattern in: Execution elapsed time: 0.060640 sec
[Rabin-Karp] Pattern not in: Execution elapsed time: 0.059571 sec

[Knuth-Morris-Pratt] Pattern in: Execution elapsed time: 0.078809 sec
[Knuth-Morris-Pratt] Pattern not in: Execution elapsed time: 0.077379 sec

And here’s the result for arbitrary strings, showing the superiority of KMP on randomised inputs:

utop #  compare_string_search 20000 50;;

[Naive] Pattern in: Execution elapsed time: 1.027522 sec
[Naive] Pattern not in: Execution elapsed time: 2.001959 sec

[Rabin-Karp] Pattern in: Execution elapsed time: 1.106642 sec
[Rabin-Karp] Pattern not in: Execution elapsed time: 2.166105 sec

[Knuth-Morris-Pratt] Pattern in: Execution elapsed time: 0.762785 sec
[Knuth-Morris-Pratt] Pattern not in: Execution elapsed time: 1.421093 sec