I've always liked the elegance of this algorithm; it's a fun little example of how you can make things incredibly fast by thinking carefully about your problem.
And a related anecdote: I was interviewing at Apple for a systems-development-related role (graphics drivers, I think?) and one of the senior-level folks asked me to write strstr on the whiteboard. I started with a naive, working implementation, then he asked me how I'd optimize it. I said 'Knuth-Morris-Pratt' and gave a basic overview of the algorithm and explained how it's faster.
He insisted the algorithm couldn't possibly work. I spent a few more minutes trying to explain it, but I couldn't convince him. The dark magic of efficient string searches evades us all sometimes, I suppose. I always like coming away from an interview feeling like I learned something, so I hope he googled the algorithm later. :-)
There's some interesting theoretical work done by Srinivas in the 90s[1] that takes a geometric view of pattern matching, based on sheaves, and uses it to derive a generalized version of KMP that can be applied in other domains. I'm not sure what happened to this research program, and I forget most of the details, but I heard a talk by Srinivas and recall thinking it was a very practical and real application of category theory.
That's the risk with interview questions: the candidate may suggest a better solution than the interviewer anticipated, leaving the interviewer with the task of ascertaining whether the solution is correct.
That's true. On the other hand, it's for the interviewee's own good - you don't want to work with people like that ;) except maybe at megacorps with hundreds or thousands of devs.
This story is borderline baffling, though. If you flat out named the algorithm, and it contains a famous name like Knuth, that should be good enough. That you may not have all of the details memorized is irrelevant.
Now, if you name a very obscurely named algorithm, that is one thing. But seriously, Knuth!? Is anyone involved with optimizations and serious algorithm design not aware of that name?
IMO the Boyer-Moore string searching algorithm is way more elegant. Average-case performance of O(n/m)?!? That means the longer the string you're searching for, the faster you can find it! It makes perfect logical sense once you think about it, but when I first heard about the algorithm it seemed magical.
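For the curious, here's a minimal Python sketch of where that sublinear average case comes from, using only the bad-character rule (real Boyer-Moore adds the good-suffix rule on top; the function name and structure here are just for illustration):

```python
def bm_search(text, pattern):
    """Boyer-Moore search, bad-character rule only (a sketch)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For each character, the index of its last occurrence in the pattern.
    last = {c: i for i, c in enumerate(pattern)}
    i = m - 1  # position in text aligned with the pattern's last character
    while i < n:
        j, k = m - 1, i
        # Compare backwards from the end of the pattern.
        while j >= 0 and text[k] == pattern[j]:
            j -= 1
            k -= 1
        if j < 0:
            return k + 1  # full match
        # Shift so the mismatched text character lines up with its last
        # occurrence in the pattern; if it's absent, skip the whole window.
        i += max(1, j - last.get(text[k], -1))
    return -1
```

When the mismatched text character doesn't occur in the pattern at all, the window jumps forward by the full pattern length - hence the longer the pattern, the bigger the average skip.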
Because it is of interest to hackers and rightly so. Some submittals are good because they are cutting edge, others because they touch on the fundamental - the magical - aspects of computing.
In terms of it being news, it's a bit of a feature story rather than on-the-scene reporting of a city council meeting, a cub reporter's coverage of the police blotter, or the Chamber of Commerce's recent press release wrapped up as news.
If nothing else it's better than a blog post about the algorithm, and for me it's always helpful to be reminded what an amazing resource Wikipedia has become for computer science topics.
I personally like this kind of submission. If it's something I was familiar with, then it's a good review/reminder; if it's something I was unfamiliar with, it's a good lesson.
In addition to all the other possible reasons, Knuth-Morris-Pratt was brought up just a few days ago, in the discussion on another article about a letter Knuth wrote about software patents.
I have implemented KMP in Haskell. This version doesn't use any indices! It is built purely functionally by realizing that KMP's failure table is just a finite state automaton (well, almost...). However, it is much longer than the C++ version...
Actually, KMP is a little harder to program purely functionally than the MP algorithm.
An extremely elegant MP algorithm is implemented here:
http://twanvl.nl/blog/haskell/Knuth-Morris-Pratt-in-Haskell
(Note it says the algorithm is KMP, but it is actually the MP algorithm).
KMP is conceptually very cool, as are other clever string searching algorithms (BM, RK, AC), though one of the questions for me has always been whether even its most efficient implementation wouldn't always be slower than executing a brute-force combination of REP CMPSL/CMPSB instructions (x86) for the vast majority of searched strings.
The time complexity of KMP is O(n), while the complexity of your idea is O(n^2). Considering that KMP does very little in its inner loop (few memory accesses), a REP CMPSB approach is really no match for it, even in the trivial cases.
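To make the comparison concrete, here's a compact KMP sketch in Python (names and index choices are mine, not from any canonical implementation). Note that both inner while loops only ever move the pattern index backwards through the failure table, never the text index, which is why the scan is linear:

```python
def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # fail[j] = length of the longest proper prefix of pattern[:j+1]
    # that is also a suffix of it.
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    # Scan the text once; on a mismatch, fall back via the failure table
    # instead of re-examining earlier text characters.
    k = 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == m:
            return i - m + 1
    return -1
```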
I know about the time complexity, but my argument points to the CPU architecture and cache/branch prediction efficiency.
KMP has a lot of branching, which is more expensive than a simple cache-line hit, whereas REP CMPSx doesn't have any branching, aborts immediately after a mismatch, needs only a single loop over the starting position, and its memory access is also pretty much linear. This is not a Turing machine it's being executed on ;-)
So in theory KMP is faster; in practice, for not-very-large strings, I am really not sure... There are plenty of asymptotically optimal algorithms whose fixed cost is too high for the majority of useful cases compared to less optimal algorithms with very low fixed costs. Does KMP have enough "mechanical sympathy" to overcome specific machine code instructions for the most frequent cases?
This depends completely on your needle and your haystack. If it's the string "ABC" in "QABC" then certainly the naive algorithm will be the fastest.
"Mechanical sympathy" unfortunately is easily misinterpreted to mean to only listen to the machine. Remember though that Martin Thompson is alluding to Jackie Stewart and Formula One racing. All of the cars competed on the same track. But you wouldn't enter a Formula One car for the Baja 1000.
In fact, I don't think you should use the term unless you have a specific goal in mind. KMP is only best for certain classes of searches. Boyer-Moore is used for others. And there are plenty of other algorithms with their own pros and cons. See http://www-igm.univ-mlv.fr/~lecroq/string/index.html for descriptions.
You're asking a question that the experts have also asked, so it's definitely a good question. The algorithm you're talking about is also called the naive algorithm. It's much faster in practice than you might expect given its simplicity and poor worst-case performance. That's because pathological behavior doesn't actually happen that often, and most of the time it is quite efficient.
I'm not sure whether it or KMP would be faster. But it doesn't matter much in the real world, because even better algorithms, like Boyer-Moore and its derivatives, are faster still, and they are used instead.
Yes, but if something is O(m*n) and m <= n (which is the case in string search), then it's also O(n^2). On the other hand, since m can be of the order of, say, n/2, it's not too confusing to say that naive string search is O(n^2), since it's actually Theta(n^2).
I really just meant that "in practice" the growth of the string length (m) varies independently and common cases behave more like O(n) than O(n^2). Also that naive search is probably heavily optimized with SSE assembly instructions and cache tuned. I have implemented several of these algorithms in C and it is difficult to impossible to beat C's strstr consistently with a simple implementation, even on cases that start to make the naive algorithm really inefficient.
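For reference, the naive baseline being compared against is essentially this (a Python sketch; libc's strstr is of course far more heavily tuned):

```python
def naive_search(text, pattern):
    """Brute-force strstr: try every starting position.
    O(n*m) worst case, but close to O(n) on typical inputs,
    since most candidate positions mismatch within a character or two."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1
```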
You'd most likely need a hybrid approach to make a good general purpose string search function that is effective across many domains.
If you don't have the entire string in memory then Boyer-Moore is usually fastest because you can avoid doing a lot of I/O (since you skip comparing many subsequences, you don't have to read those subsequences in the first place). This is why grep is very fast given a static string pattern to search for, even on huge files, but grepping for a regex is dog slow.
No, you're missing the point -- using a regex does not allow you to read in less than all of the input (in the best case, even). If you grep a 1GB file with a regex you must read 1GB from disk, no exceptions, and that's many orders of magnitude slower than anything on the CPU (or in memory). With Boyer-Moore you can do less than 100% of the file size in I/O, especially if the given pattern is long relative to the full text.
We have no context for the environment those regexes are running in. For all we know they could have implicit limits and be getting compiled down to FPGAs. I can guarantee you that those regexes aren't running unbounded.
KMP starts with the first character of the pattern (or substring/needle) and, on a mismatch, jumps forward based on how much of the pattern had already matched. A related algorithm is Boyer-Moore (BM) (http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algor...), which operates the other way round: it begins with the last character of the pattern and then compares backwards until the full pattern matches. The advantage of BM is that it allows for bigger jumps.
It seems that KMP works well for small alphabets (e.g., DNA), whereas BM shines for larger alphabets (e.g., plain English).
One difference is that BM is O(m*n) in the worst case while KMP is O(n + m). Depending on what the input string looks like, this can matter - especially with small alphabets, where the likelihood of patterns repeating themselves is bigger.
BM too is provably O(n + m) (and without dependence on alphabet size, unlike KMP) if you apply two heuristics that I don't really remember. For some reason it's not mentioned on Wikipedia.
I always wondered why most common KMP and RE implementations don't take into account the case of searching streams instead of strings. That's why I ended up writing the article (with code) "Searching for Substrings in Streams: a Slight Modification of the Knuth-Morris-Pratt Algorithm in Haxe" [1], and adding information about a currently unsupported RE lib that does take streams into account.