String matching using suffix tree

Given that you have one string of length N and M small strings of length L . How do you efficiently find the occurrence of each small string in the larger one ?

Suffix Tree Definition

A suffix tree ST for an m-character string S is a rooted directed tree with exactly m
leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character.

The key feature of the suffix tree is that for any leaf i, the concatenation of the edge labels on the path from the root to the leaf i exactly spells out the suffix of S that
starts at position i.

For example, Suffixes of ‘ XYZXYZ’

XYZXYZ$
 YZXYZ$
  ZXYZ$
   XYZ$
    YZ$
     Z$
      $

The corresponding suffix tree would be –

st

Here is a good explanation on conceptual building of a Suffix Tree

Given a pattern P of length m, find all occurrences of P in text T – O(n+m) algorithm.
Build a suffix tree ST for text T in O(m) time. Then, match the characters of P along the unique path in ST until either P is exhausted or no more matches are possible.

Ukkonen’s Algorithm: O(n) suffix tree building Read this post: Ukkonen’s suffix tree algorithm.

Number of distinct palindromic substring using suffix tree

Observe that

The longest contained palindrome within a string is exactly the longest common substring of this string and its reverse.

Thus, the longest contained palindrome within a string is exactly the longest common prefix of all pairs of suffixes between a string and its reverse. This is what we’re doing here.

Let’s take the word banana.
Sf = banana$
Sr = ananab#

Below is the generalized suffix tree of Sf and Sr, where the number at the end of each path is the index of the corresponding suffix.
Screen Shot 2014-06-05 at 12.34.43 AM

This can be done in linear time using suffix trees:

  1. For constant sized alphabet we can build suffix trees using Ukkonen’s Algorithm in O(n).
  2. For given string S, build a generalized suffix tree of S#S` where S’ is reverse of string S and # is delimiting character.
  3. Now in this suffix tree, for every suffix i in S, look for lowest common ancestor of (2n-i+1) suffix is S`.
  4. count for all such suffixes in the tree to get total count of all palindromes.
  5. You can even go further and get longest palindrome. Look for LCA which is deepest in the tree i.e. one whose distance from root is max.

This algorithm takes linear time (Used by Genebank to find biological palindromes in DNA sequences), although for very large strings you can do quadratic time preprocessing of this tree to answer LCA queries in constant time.

References
Suffix Tree