Sunday, December 14, 2008

Fun with Tries

A number of years ago, Michael Zucchi introduced me to the Aho-Corasick algorithm which he implemented in Evolution for the "Find in Message" feature. Later, I ended up extracting that logic out and refactoring it a bit into the ETrie class which I ended up using to replace the regex logic (which is common among most GNOME apps that do this) to scan for urls in the message body (in order to insert hyperlinks, etc). This turned out to be an order of magnitude faster.

For those who aren't in the know, the Aho-Corasick algorithm is for searching for multiple strings in a given input. In other words, it searches for one of multiple needles in a single haystack.

The pseudocode for the search itself would look something like this:

q = root
FOR i = 1 TO n
  WHILE q != fail AND g(q, text[i]) == fail
    q = h(q)
  ENDWHILE
  IF q == fail
    q = root
  ELSE
    q = g(q, text[i])
  ENDIF
  IF isElement(q, final)
    RETURN TRUE
  ENDIF
ENDFOR
RETURN FALSE

Of course, ETrie has a slight modification to the above algorithm, which is that instead of returning TRUE or FALSE, it returns a pointer to the beginning of the match or NULL if it find no matches.

The problem with this particular implementation, though, is that it would simply return a pointer to the offset into the string of the first match found. But what if you had multiple pattern strings that started with identical characters?

Like the ETrie implementation, the traditional Aho-Corasick algorithm (if I am remembering correctly) stops searching as soon as it finds any match. The difference, of course, is that given access to the current to the state / tree structure, a programmer could choose to continue matching to see if there is a more greedy match. With ETrie, however, in the interest of simplicity, it just returned the first / shortest match it got to.

In case I'm explaining this badly, say our search patterns are "the", "there", and "therefor". In the case of ETrie, it would always return that it matched "the" even when the "the" that it matched was the starting substring of the larger string, "therefor".

In something I was working on more recently (in c++), I wanted a greedier match (e.g. I wanted it to match "therefor" instead of "the").

In order to achieve this, I modified the implementation a slight bit:

const char *
Trie::Search (const char *buf, size_t buflen, int *matched_id)
{
    const char *inptr, *inend, *prev, *pat;
    size_t matched = 0;
    TrieMatch *m = 0;
    TrieState *q;
    char c;
    
    inend = buf + buflen;
    inptr = buf;
    
    q = &root;
    pat = prev = inptr;
    while (inptr < inend) {
        c = *inptr++;
        
        if (icase && (c >= 'A' && c <= 'Z'))
            c += 0x20;
        
        while (q != 0 && (m = g (q, c)) == 0 && matched == 0)
            q = q->fail;
        
        if (q == &root) {
            if (matched)
                return pat;
            
            pat = prev;
        }
        
        if (q == 0) {
            if (matched)
                return pat;
            
            q = &root;
            pat = inptr;
        } else if (m != 0) {
            q = m->state;
            
            if (q->final > matched) {
                if (matched_id)
                    *matched_id = q->id;
                
                matched = q->final;
            }
        }
        
        prev = inptr;
    }
    
    return matched ? pat : 0;
}

I have published the full source code for trie.cpp and trie.h on my website for your perusal.

Post a Comment

Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.