Knuth–Morris–Pratt algorithm (wikipedia.org)
117 points by rbcoffee on July 5, 2014 | 47 comments



I've always liked the elegance of this algorithm; it's a fun little example of how you can make things incredibly fast by thinking about your problem.

And a related anecdote: I was interviewing at Apple for a systems development related role (graphics drivers, I think?) and one of the senior-level folks asked me to write strstr on the whiteboard. I started with a naive, working implementation, then he asked me how I'd optimize it. I said 'knuth morris pratt' and gave a basic overview of the algorithm and explained how it's faster.

He insisted the algorithm couldn't possibly work. I spent a few more minutes trying to explain it, but I couldn't convince him. The dark magic of efficient string searches evades us all sometimes, I suppose. I always like coming away from an interview feeling like I learned something, so I hope he googled the algorithm later. :-)
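
For reference, here's roughly what the two whiteboard stages look like in C. A minimal sketch, not the actual interview code, and the function names are mine:

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* Naive strstr: try every alignment; O(n*m) worst case. */
    const char *naive_strstr(const char *hay, const char *needle) {
        size_t n = strlen(hay), m = strlen(needle);
        if (m == 0) return hay;
        for (size_t i = 0; i + m <= n; i++)
            if (memcmp(hay + i, needle, m) == 0)
                return hay + i;
        return NULL;
    }

    /* KMP: precompute, for each prefix of the needle, the length of
       the longest proper prefix that is also a suffix (the "failure
       table"), then scan the haystack without ever backing up: O(n+m). */
    const char *kmp_strstr(const char *hay, const char *needle) {
        size_t n = strlen(hay), m = strlen(needle);
        if (m == 0) return hay;
        size_t *fail = malloc(m * sizeof *fail);
        if (!fail) return NULL;
        fail[0] = 0;
        for (size_t i = 1, k = 0; i < m; i++) {
            while (k > 0 && needle[i] != needle[k]) k = fail[k - 1];
            if (needle[i] == needle[k]) k++;
            fail[i] = k;
        }
        const char *hit = NULL;
        for (size_t i = 0, k = 0; i < n; i++) {
            while (k > 0 && hay[i] != needle[k]) k = fail[k - 1];
            if (hay[i] == needle[k]) k++;
            if (k == m) { hit = hay + i - m + 1; break; }
        }
        free(fail);
        return hit;
    }

The point the interviewer missed: the text index i never moves backwards, which is where the linear bound comes from.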


There's some interesting theoretical work that was done by Srinivas in the 90s[1] that takes a geometric view of pattern matching, based on sheaves, and uses it to derive a generalized version of KMP that can be applied in other domains. I'm not sure what happened to this research program, and I forget most of the details, but I heard a talk by Srinivas and recall thinking it was a very practical and real application of category theory.

[1] http://www.sciencedirect.com/science/article/pii/03043975939...


That's the risk with interview questions: the candidate may suggest a better solution than the interviewer anticipated, leaving the interviewer with the task of ascertaining whether the solution is correct.


I love to be surprised like that in interviews, let me tell you. Makes my decision much easier.


> Makes my decision much easier.

You didn't say which way. I've worked with and for people who wouldn't hire someone smarter than themselves; the human psyche is a dark place.


That's true. On the other hand, it's for the interviewee's own good - you don't want to work with people like that ;) except maybe at megacorps with hundreds or thousands of devs.


This story is borderline baffling, though. If you flat out named the algorithm, and it contains a famous name like Knuth, that should be good enough to go on. That you may not have all of its details memorized is irrelevant.

Now, if you name a very obscurely named algorithm, that is one thing. But seriously, Knuth!? Is anyone involved with optimizations and serious algorithm design not aware of that name?


IMO the Boyer-Moore string searching algorithm is way more elegant. Average-case performance of O(n/m)?!? That means that the longer the string you're searching for, the faster you can find it! It obviously makes perfect logical sense once you think about it, but when I first heard about the algorithm it seemed magical.
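
The skips come from comparing right-to-left and jumping by how far the mismatched text byte sits from the end of the pattern. Here's a minimal sketch of just the bad-character rule (i.e., Boyer-Moore-Horspool; full BM adds the good-suffix rule, and the names here are mine):

    #include <limits.h>
    #include <stddef.h>

    /* Bad-character rule only (Boyer-Moore-Horspool): compare the
       pattern right-to-left; on a mismatch, shift so the pattern's
       last occurrence of the text byte under its final position
       lines up. Longer patterns allow longer skips, hence the
       O(n/m) average case. */
    const char *bmh_search(const char *text, size_t n,
                           const char *pat, size_t m) {
        if (m == 0) return text;
        if (m > n) return NULL;
        size_t shift[UCHAR_MAX + 1];
        for (size_t c = 0; c <= UCHAR_MAX; c++) shift[c] = m;
        for (size_t i = 0; i + 1 < m; i++)
            shift[(unsigned char)pat[i]] = m - 1 - i;
        size_t i = 0;
        while (i + m <= n) {
            size_t j = m;
            while (j > 0 && text[i + j - 1] == pat[j - 1]) j--;
            if (j == 0) return text + i;
            i += shift[(unsigned char)text[i + m - 1]];
        }
        return NULL;
    }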


Wikipedia says Boyer-Moore is O(n+m), which is the same as this algorithm.

https://en.wikipedia.org/wiki/Boyer-Moore_string_search_algo...


That's worst case complexity. It's the average case where Boyer Moore wins over KMP.


Why is KMP on the front page of HN? Not complaining, just confused. Is this related to some other news?


Because it is of interest to hackers, and rightly so. Some submissions are good because they are cutting edge, others because they touch on the fundamental - the magical - aspects of computing.

In terms of it being news, it's a bit of a feature story rather than on-the-scene reporting of a city council meeting, a cub reporter's coverage of the police blotter, or the Chamber of Commerce's latest press release wrapped up as news.

If nothing else it's better than a blog post about the algorithm, and for me it's always helpful to be reminded what an amazing resource Wikipedia has become for computer science topics.


It's a weekend and other material surfaces.


Yup, welcome to the slow weekend hours on HN. Personally, I like the change.


Because it takes just a few votes (4-7 in the first ~45 minutes, in my experience) to bring something up to the front page. Plus a cool name, I guess.


I personally like this kind of submission. If it's something I was familiar with, then it's a good review/reminder; if it's something I was unfamiliar with, it's a good lesson.


In addition to all the other possible reasons, Knuth-Morris-Pratt was brought up just a few days ago, in the discussion on another article about a letter Knuth wrote about software patents.


Reminds me that I implemented it 10 years ago... and forgot it in the meantime. It's nice to see a good old friend.


Because people like you and me are bored enough to reply here..


I have implemented KMP in Haskell. This version doesn't use any indices! It is built purely functionally by realizing that KMP's failure table is just a finite state automaton (well, almost...). However, it is much longer than the C++ version...

Code: https://github.com/Mgccl/haskell-algorithm/blob/master/KMP.h... Description: http://www.chaoxuprime.com/posts/2014-04-11-the-kmp-algorith...

Actually, KMP is a little harder to program purely functionally than the MP algorithm. An extremely elegant MP implementation can be found here: http://twanvl.nl/blog/haskell/Knuth-Morris-Pratt-in-Haskell (note that it says the algorithm is KMP, but it is actually the MP algorithm).

The Aho–Corasick string matching algorithm is a generalization of the MP algorithm, which I also coded in Haskell, inspired by the MP code above. https://github.com/Mgccl/haskell-algorithm/blob/master/AhoCo...


Looks like there's also a KMP implementation here specialized to ByteStrings: http://hackage.haskell.org/package/stringsearch


KMP is conceptually very cool, as are other clever string searching algorithms (BM, RK, AC). But one question for me has always been whether even its most efficient implementation wouldn't be slower than a brute-force combination of REP CMPSL/CMPSB instructions (x86) for the vast majority of searched strings.


The time complexity of KMP is O(n), while the complexity of your idea is O(n^2). Considering that KMP does very little in its inner loop (few memory accesses), a REP CMPSB approach is really no match for it, even in the trivial cases.

So no, it's much faster.


I know about the time complexity, but my argument points to the CPU architecture and cache/branch prediction efficiency.

KMP has a lot of branching, which is more expensive than a simple cache (line) hit, whereas REP CMPSx doesn't have any branching, aborts immediately after a mismatch, needs only a single outer loop over the starting position, and its memory access is pretty much linear. This is not a Turing machine it's being executed on ;-)

So in theory KMP is faster; in practice, for not-very-large strings, I am really not sure... There are plenty of optimal algorithms where the fixed cost is too high for the majority of useful cases compared to less optimal algorithms with very low fixed costs. Does KMP have enough "mechanical sympathy" to overcome specific machine code instructions for the most frequent cases?
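
For concreteness, here's a minimal sketch (names mine) of the kind of low-fixed-cost scan I mean, written portably rather than with REP CMPSx: delegate the hot work to memchr/memcmp, which are typically SIMD-optimized. Worst case is still O(n*m), but the constant factor is tiny:

    #include <stddef.h>
    #include <string.h>

    /* Naive search with the inner work delegated to memchr/memcmp.
       Worst case O(n*m), but the candidate-skipping memchr is
       vectorized and branch-friendly, so typical inputs fly. */
    const char *scan_search(const char *hay, size_t n,
                            const char *needle, size_t m) {
        if (m == 0) return hay;
        const char *p = hay, *end = hay + n;
        while ((size_t)(end - p) >= m) {
            /* Vectorized skip to the next candidate first byte. */
            p = memchr(p, needle[0], (size_t)(end - p) - m + 1);
            if (!p) return NULL;
            if (memcmp(p + 1, needle + 1, m - 1) == 0) return p;
            p++;
        }
        return NULL;
    }

In practice this sort of baseline is what KMP has to beat, and for short needles it often doesn't.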


This depends completely on your needle and your haystack. If it's the string "ABC" in "QABC" then certainly the naive algorithm will be the fastest.

"Mechanical sympathy" unfortunately is easily misinterpreted to mean to only listen to the machine. Remember though that Martin Thompson is alluding to Jackie Stewart and Formula One racing. All of the cars competed on the same track. But you wouldn't enter a Formula One car for the Baja 1000.

In fact, I don't think you should use the term unless you have a specific goal in mind. KMP is only best for certain classes of searches. Boyer-Moore is used for others. And there are plenty of other algorithms with their own pros and cons. See http://www-igm.univ-mlv.fr/~lecroq/string/index.html for descriptions.

Python, for example, uses (or used?) the algorithm described at http://effbot.org/zone/stringlib.htm , for the reasons listed therein.


As with most optimisations the only way to be sure is to try it. Big-O complexity says very little about actual runtime on real-world data.


You're asking a question that the experts have also asked, so it's definitely a good question. The algorithm you're talking about is also called the naive algorithm. It's much faster in practice than you might expect given its simplicity and poor worst-case performance. That's because pathological behavior doesn't actually happen that often, and most of the time it is quite efficient.

I'm not sure whether it or KMP would be faster. But it doesn't matter much in the real world, because even better algorithms, like Boyer-Moore and its derivatives, are faster still, and they are used instead.


O(m*n) is not the same as O(n^2).


Yes, but if something is O(m*n) and m <= n (which is the case in string search), then it's also O(n^2). On the other hand, since m can be of the order of, say, n/2 (in which case the naive worst case is about n^2/4 comparisons), it's not too confusing to say that naive string search is O(n^2), since in the worst case it's actually Theta(n^2).


Yes, you are right of course.

I really just meant that "in practice" the string length (m) varies independently and common cases behave more like O(n) than O(n^2). Also, naive search is probably heavily optimized with SSE assembly instructions and cache-tuned. I have implemented several of these algorithms in C, and it is difficult, if not impossible, to beat C's strstr consistently with a simple implementation, even on cases that start to make the naive algorithm really inefficient.

You'd most likely need a hybrid approach to make a good general purpose string search function that is effective across many domains.


It's the same when m=n.


Of course in string search it never is. In the m=n case there is only one possible alignment, so KMP would simply compare until the first mismatch and, if one occurs, declare that nothing was found.

Yes, there'd be many more different instructions involved, but I think KMP would start beating naive pretty quickly in the m=n case.


If you don't have the entire string in memory then Boyer-Moore is usually fastest because you can avoid doing a lot of I/O (since you skip comparing many subsequences, you don't have to read those subsequences in the first place). This is why grep is very fast given a static string pattern to search for, even on huge files, but grepping for a regex is dog slow.


Until they implement a JIT and compile the regex.


No, you're missing the point -- using a regex does not allow you to read in less than all of the input (in the best case, even). If you grep a 1GB file with a regex you must read 1GB from disk, no exceptions, and that's many orders of magnitude slower than anything on the CPU (or in memory). With Boyer-Moore you can do less than 100% of the file size in I/O, especially if the given pattern is long relative to the full text.


We have no context about the environment those regexes are running in. For all we know they could have implicit limits and be getting compiled down to FPGAs. I can guarantee you that those regexes aren't running unbounded.

No point was missed.


KMP compares the pattern (or substring/needle) from its first character forward and, on a mismatch, shifts the pattern ahead by an amount given by its prefix table, never moving backwards in the text. A related algorithm is Boyer-Moore (BM) (http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algor...), which operates the other way round: it begins with the last character of the pattern and then compares backwards until the full pattern matches. The advantage of BM is that it allows for bigger jumps.

It seems that KMP works well for small alphabets (e.g., DNA), whereas BM shines for larger alphabets (e.g., plain English).


One difference is that BM is O(m*n) in the worst case while KMP is O(n + m). Depending on what the input string looks like, this can matter - especially with small alphabets, where the likelihood of patterns repeating themselves is higher.


BM too is provably O(n + m) (and, unlike KMP, without dependence on alphabet size) if you apply two heuristics that I don't really remember. For some reason this isn't mentioned on Wikipedia.





I always wondered why most common KMP and RE implementations don't take into account the case of searching streams instead of strings. That's why I ended up writing this article (with code), "Searching for Substrings in Streams: a Slight Modification of the Knuth-Morris-Pratt Algorithm in Haxe" [1], and adding information about a currently unsupported RE lib that takes streams into account.
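
The underlying trick is that KMP's entire state between characters is a single integer (how much of the pattern is currently matched), so it adapts naturally to streams. A minimal sketch in C (the struct and names are mine, not the article's code; assumes a non-empty pattern):

    #include <stddef.h>
    #include <stdlib.h>

    /* Incremental KMP: keep the failure table plus one integer of
       state, and feed bytes as they arrive. Assumes m >= 1. */
    typedef struct {
        const char *pat;
        size_t m;
        size_t *fail;  /* failure table, length m */
        size_t k;      /* pattern chars currently matched */
    } kmp_stream;

    int kmp_stream_init(kmp_stream *s, const char *pat, size_t m) {
        s->pat = pat; s->m = m; s->k = 0;
        s->fail = malloc(m * sizeof *s->fail);
        if (!s->fail) return -1;
        s->fail[0] = 0;
        for (size_t i = 1, k = 0; i < m; i++) {
            while (k > 0 && pat[i] != pat[k]) k = s->fail[k - 1];
            if (pat[i] == pat[k]) k++;
            s->fail[i] = k;
        }
        return 0;
    }

    /* Returns 1 if the pattern ends at this byte, 0 otherwise.
       Resetting k via the failure table allows overlapping matches. */
    int kmp_stream_feed(kmp_stream *s, char c) {
        while (s->k > 0 && c != s->pat[s->k]) s->k = s->fail[s->k - 1];
        if (c == s->pat[s->k]) s->k++;
        if (s->k == s->m) { s->k = s->fail[s->k - 1]; return 1; }
        return 0;
    }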

Hope this helps.

[1] http://blog.databigbang.com/searching-for-substrings-in-stre...


A few months ago I stumbled on James Morris' GitHub profile [1] by accident. There's not much activity, but he has dabbled with Ruby on Rails.

1: https://github.com/jhm15217/


The FM-Index changed the way I think about searching.

http://alexbowe.com/fm-index/


I learned about this algorithm recently, because of an exercise in the Rosalind bioinformatics problems list: http://rosalind.info/problems/kmp/

There are tons of other interesting algorithms put in practice in different bioinformatics problems.


Here's an easy-to-follow video explaining the algorithm: https://www.youtube.com/watch?v=rfisBOOLN9M



