A number of years ago, Michael Zucchi introduced me to the Aho-Corasick algorithm which he implemented in Evolution for the "Find in Message" feature. Later, I ended up extracting that logic out and refactoring it a bit into the ETrie class which I ended up using to replace the regex logic (which is common among most GNOME apps that do this) to scan for urls in the message body (in order to insert hyperlinks, etc). This turned out to be an order of magnitude faster.
For those who aren't in the know, the Aho-Corasick algorithm is for searching for multiple strings in a given input. In other words, it searches for one of multiple needles in a single haystack.
The pseudocode for the search itself would look something like this:
q = root FOR i = 1 TO n WHILE q != fail AND g(q, text[i]) == fail q = h(q) ENDWHILE IF q == fail q = root ELSE q = g(q, text[i]) ENDIF IF isElement(q, final) RETURN TRUE ENDIF ENDFOR RETURN FALSE
Of course, ETrie has a slight modification to the above algorithm, which is that instead of returning TRUE or FALSE, it returns a pointer to the beginning of the match or NULL if it find no matches.
The problem with this particular implementation, though, is that it would simply return a pointer to the offset into the string of the first match found. But what if you had multiple pattern strings that started with identical characters?
Like the ETrie implementation, the traditional Aho-Corasick algorithm (if I am remembering correctly) stops searching as soon as it finds any match. The difference, of course, is that given access to the current to the state / tree structure, a programmer could choose to continue matching to see if there is a more greedy match. With ETrie, however, in the interest of simplicity, it just returned the first / shortest match it got to.
In case I'm explaining this badly, say our search patterns are "the", "there", and "therefor". In the case of ETrie, it would always return that it matched "the" even when the "the" that it matched was the starting substring of the larger string, "therefor".
In something I was working on more recently (in c++), I wanted a greedier match (e.g. I wanted it to match "therefor" instead of "the").
In order to achieve this, I modified the implementation a slight bit:
const char * Trie::Search (const char *buf, size_t buflen, int *matched_id) { const char *inptr, *inend, *prev, *pat; size_t matched = 0; TrieMatch *m = 0; TrieState *q; char c; inend = buf + buflen; inptr = buf; q = &root; pat = prev = inptr; while (inptr < inend) { c = *inptr++; if (icase && (c >= 'A' && c <= 'Z')) c += 0x20; while (q != 0 && (m = g (q, c)) == 0 && matched == 0) q = q->fail; if (q == &root) { if (matched) return pat; pat = prev; } if (q == 0) { if (matched) return pat; q = &root; pat = inptr; } else if (m != 0) { q = m->state; if (q->final > matched) { if (matched_id) *matched_id = q->id; matched = q->final; } } prev = inptr; } return matched ? pat : 0; }
I have published the full source code for trie.cpp and trie.h on my website for your perusal.
5 comments:
Ah, hmm, thats pretty much what I was wanting to use for the search feature in muine. Sweet.
Glad this post has been of service ;-)
Interesting, I would have hoped that a regex engine would parse (foo|bar|etc) into using that optimization itself. Seems a shame to have to give up just using a single regex API to get the optimization.
Just curios, did you consider Boyer-Moore algorithm also ? It looks pretty interesting - http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm Seems like string search is an interesting area of research.
Sankar: Boyer-Moore is pretty cool (implemented that back in college), but it's really only useful for matching against 1 substring at a time (e.g. great for an API like strstr())
Post a Comment