I was working with NLP and its various toolkits (python-nltk, I'm looking at you). The thing about NLP is, there just aren't enough libraries (for humans) that let you simply plug NLP into use. Even NLTK, the premier Python library for NLP, seems to be an NLP core library for building NLP solutions rather than for building NLP-powered apps. It also seems to be extremely unpythonic.
Is there a missing link there? I don't know. So I took a few days and built an extremely simple NER (named entity recognition) engine and made it extremely easy for any programmer to begin using.
While it's possible that some parts of NLTK may be unpythonic, I disagree that it's somehow inaccessible or hard to use. The NLTK project is fairly well-documented, and I've never had any difficulty using it. Also, all the NLTK code I've ever examined looks Pythonic to me, unlike the Java-style Python that I sometimes see coming out of academia; they follow PEP8 and have a developer style guide. There may be pockets of unpythonic code, but I've not come across any.
NLTK is also free and open-source, with a liberal license (Apache), which I appreciate greatly.
Also, I don't understand what you mean by nltk being a library for building NLP solutions, rather than NLP-powered apps. Can you expand on that?
Re: NER, I found this gist (not my own) with a basic example of entity extraction with nltk: https://gist.github.com/322906/90dea659c04570757cccf0ce1e6d2... This looks pretty straightforward to me. What NLP toolkit are you using for the NER service your Chrome extension calls, if not NLTK?
It isn't just extraction, but the entire process, from building a custom dictionary to training it, etc. Have you tried training your own dictionary to extract custom entities? It is "non-trivial".
yeah, for the vast majority of uses, most people really want to do just a fairly small set of things fairly well.
NER comes to mind: lots and lots of toolkits for building up to NER, but very few that let me submit English text and get back a list of people, places and things without having to virtually build my own NER system from scratch anyway.
Give me NER, Entity relationships (ER) and a couple kinds of sentiment analysis scoring (SA) (which can be jump started with decent NER) and I've pretty much exhausted 95% of what I'd ever want to do.
I really, really don't need yet another library to do sentence tokenization, term tokenization, tf counting and stemming. If I were building a free-text indexer or a Bayesian filter or some such it might be useful, but I'm probably not; there are far better solutions in those domains than I'm likely to come up with. There aren't for NER, ER and SA.
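For what it's worth, those basics really are a few lines of stdlib Python, which is part of why another library for them feels unnecessary. A toy sketch (the regex tokenizer and the suffix list are crude stand-ins for a real tokenizer and a real stemmer like Porter's, not anything from NLTK):

```python
import re

def term_tokens(text):
    """Toy term tokenizer: lowercase, keep alphabetic runs."""
    return re.findall(r"[a-z']+", text.lower())

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tf(text):
    """Term-frequency counts over stemmed tokens."""
    counts = {}
    for term in map(naive_stem, term_tokens(text)):
        counts[term] = counts.get(term, 0) + 1
    return counts

print(tf("The dogs walked and the dog walks"))
# {'the': 2, 'dog': 2, 'walk': 2, 'and': 1}
```

The hard part, as the comment says, is everything after this: nobody's toy regexes will give you NER, ER or SA.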
I've used the Stanford NLP library extensively for NER. I made heavy use of it in my senior thesis project.
It's pretty straightforward to use their library to read a document and output an XML file containing NER data (and lots of other fun stuff).
For instance, from the sentence:
> World War II, or the Second World War (often abbreviated as WWII or WW2), was a global military conflict lasting from 1939 to 1945, which involved most of the world's nations, including all of the great powers, eventually forming two opposing military alliances, the Allies and Axis.
Stanford NLP NER will output the following entities:
"World War II" - MISC
"Second World War" - MISC
"1939 to 1945" - DATE - NORMALIZED 1939/1945
"Axis" - MISC
You can view the output of Stanford's CoreNLP library (NER + dependency grammar + coreference resolution + some other stuff) for the Wikipedia article on World War II in my github repo:
edit: I should add that the real fun (for me) came from combining NER with dependency grammars and coreference resolution. It makes it very easy to turn Stanford NLP's output into a knowledge graph combining a large number of documents.
Here is the whole Stanford CoreNLP suite with visualisation output: http://nlp.stanford.edu:8080/corenlp/ It helps me greatly when it comes to interpreting the dependency structure.
If you need an example sentence: "Stanford University is located in California. It is a great university."
I also know that Microsoft Research has a demo online of their NLP tools: http://msrsplatdemo.cloudapp.net/ (Silverlight required) I don't think you can download the tools though, but they do offer to provide you with an API token to call their service from their cloud.
Potential conflict of interest: I wrote parts of the CoreNLP visualiser.
I had experimented with NLTK, CoreNLP, OpenNLP etc., and when it came to NER extraction, I felt NLTK did the better job (none of them were anywhere close to perfect/dependable); NLTK also had a lot more dictionaries to choose from and was better overall.
We use a highly customized/overhauled NLTK for our apps Iris (a Siri for Android) and Friday for Android.
NLTK may very well be better. I only compared OpenNLP and Stanford because I was implementing my thesis in Scala and I wanted a library running on the JVM.
I can't recommend any libraries "for humans" for this, but there are APIs out there for it.
The main problem with many NLP libs (and data mining applications in general) has a lot to do with how much memory good models take up in order to be accurate at all. Here are a few APIs and libs that might be useful though:
(Disclaimer: publisher of this one here)
https://www.mashape.com/agibsonccc/semantic-analytic
There are other text processing APIs on there as well.
As for libraries, I primarily come from the JVM camp for NLP, but I would recommend the following libraries:
My favorite is cleartk (http://code.google.com/p/cleartk/ )
mainly because it presents a consistent interface. That said, UIMA itself can be a difficult toolchain to pick up, and I could understand most of these being overkill for many of the simple applications people may have in mind.
OpenNLP is great. I've used it for a lot of subtasks, but nothing that produces end results as described earlier. It's an amazing library for building NLP systems, but doesn't produce anything directly usable (named entity recognition, etc.). Typically it's coupled with other libraries.
The big problem with NLP in general, I think, is that to do anything you typically need a pipeline: sentence segmentation, tokenization and part-of-speech tagging at a bare minimum. Only then can you do named entity recognition or other tasks that produce actual usable results.
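A minimal sketch of that bare-minimum pipeline, with deliberately naive placeholder stages (these are not OpenNLP calls; a real segmenter and tagger are far more involved):

```python
import re

def sentences(text):
    """Naive sentence segmentation on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    """Naive word tokenization."""
    return re.findall(r"\w+", sentence)

def tag(toks):
    """Toy 'tagger': non-initial capitalized words become entity candidates."""
    return [(t, "CAND" if t[0].isupper() and i > 0 else "O")
            for i, t in enumerate(toks)]

text = "Alice moved to Paris. She loves it there."
pipeline = [tag(tokens(s)) for s in sentences(text)]
print(pipeline[0])
```

The point is structural: each stage consumes the previous stage's output, so swapping in a real tagger or adding an NER stage only touches one step.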
Well, if you want to get a really good accuracy, then you need a model specific to your ___domain - which most likely you'll need to train up yourself.
The NLP systems leading shared tasks such as the CoNLL conferences tend to be publicly available, so you can get a "general purpose" system there; but the usual practice now is to train a specific model for a specific purpose. If you don't have a predetermined purpose, you can't really tell which items should be tagged as places (instead of things), which things should be tagged as 'things', or how they should be classified more deeply; the list of classes tends to be application-specific.
NERILY appears to be closed source and only usable on a subscription pricing model. It appears to be similar to zemanta.com - unless I am mistaken and this is also available as a library?
It is indeed closed source and available on an NLP-as-a-Service model. What we do have is a quick-to-deploy REST API with JSON output, and even a Chrome extension to train your own dictionary.
One of the authors here: we wrote this during the Pragmatic Programmer's writing month in 2010 and some more in 2011. Then I got caught up writing my PhD thesis, and now a new job (as an NLP engineer, but in Java ;)).
So, the book is basically frozen. We hope to have more time in the future to continue the writing...
Nice endeavor, but it ended up like most endeavors: unfinished. :)
That was the first book on NLP (and the only one, for now) that I've read. I've been interested in both NLP and Haskell, so in that respect it fit well. Thanks!
A few points of criticism. For the frequency list one should use multisets, not dictionaries; there are a few multiset packages on Hackage. Suffix arrays are badly explained, and monads very badly. With tagging I had the impression that it could be explained more simply.
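To illustrate the multiset point in Python terms (the book's examples are in Haskell, but `collections.Counter` plays the same multiset role):

```python
from collections import Counter

words = "to be or not to be".split()

# With a plain dictionary you manage the default count yourself:
freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1

# A multiset absorbs that bookkeeping and supports count arithmetic:
bag = Counter(words)
assert bag == freq
assert (bag - Counter(["to"]))["to"] == 1  # multiset subtraction
```

The same argument applies in Haskell: a multiset type makes the "missing key means zero" convention part of the data structure instead of something every call site re-implements.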
Many things are announced but not touched on. The book is not really a book; it's more like an article. Perhaps reconsider it in that light? But OK, hopefully you will find time to continue it as a book.
Perhaps meanwhile you can recommend some other book to continue reading on NLP?
I'm sorry about that! All the other sections were written nicely or OK, and I appreciate what I picked up from the book. I just wanted to point out some places that need to be reworked in case you continue.
Being in industry myself, I know how hard, nearly impossible, it is to find time for anything beyond work and family. And a decent book requires approximately the same amount of effort as finishing a PhD. Perhaps that was my frustration coming out, from the projects I've had to abandon. :(
Take a look at Coursera: the NLP course by Jurafsky/Manning (authors of recommendable books) was OK, and right now there's another course starting by Collins, another state-of-the-art researcher in NLP.
I'd like to thank you for putting out what you've done. I got a lot out of it and I'm sure I'd get more out of it if I understood Haskell better. I look forward to reading the whole thing if you get a chance to finish it!
Basing the examples on the standard String class seems dangerous.
As soon as you get a corpus of any reasonable size (and you'll have to use large corpora for any meaningful, non-toy results), the various Haskell String-like classes and laziness-control options are mandatory, but tricky/ugly when starting to use them.
I'm not that familiar with Haskell, and the past week's HN front-page articles on monads were just confusing... but what is it about Haskell that makes it more useful for NLP than, say, Python?
I've only played around with Haskell and NLP (using this guide, actually), but functional languages are a very nice fit for natural language processing, which often involves pipelining text (in the form of arrays or lists of characters) from function to function (tokenization -> tagging -> chunking -> extraction). This fits the functional paradigm very well. I really like using NLTK (Python), but if I were more comfortable in Haskell and if Haskell had better NLP libraries, I'd probably switch, because it's such a natural fit for NLP. I have to agree with your assessment of monads, though... I've been learning Haskell on and off for over a year and I'm still shaky on them. I'm still hoping I'll eventually experience the same moment of epiphany with monads that I did with recursion when I first started programming.
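The pipelining point can be made concrete even in Python: a small right-to-left `compose`, in the spirit of Haskell's `(.)` operator (the stage functions here are toy placeholders):

```python
from functools import reduce

def compose(*fns):
    """Right-to-left function composition, like Haskell's (.)."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(fns), x)

def lower(s):
    return s.lower()

def tokenize(s):
    return s.split()

def drop_short(toks):
    """Toy stand-in for a stopword/length filter."""
    return [t for t in toks if len(t) > 2]

pipeline = compose(drop_short, tokenize, lower)
print(pipeline("NLP Is Fun and so is Haskell"))
# ['nlp', 'fun', 'and', 'haskell']
```

In Haskell this shape is just `dropShort . tokenize . lower`, which is why the paradigm feels so natural for text pipelines.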
What worked for me was going through worked examples with IO, List, Maybe and State. You don't want to just do List, Maybe and Either or you'll associate it with holding onto particular data. You want to use highly disparate things so you can get to the fully abstract understanding.
You might try to understand monoids first, because you already have familiarity with many applications of monoids. The realization "oh, this is just two functions" at the heart of monoids is also what's at the heart of monads, but the applications are different.
Desugaring helps a lot too. I learned by avoiding do-notation, but you can learn do-notation at the same time if you try desugaring as you go, so you can make explicit what's going on under the covers.
It's like math, you have to keep playing with it until you grok it. I find it's good to try out different expressions in ghci, use :t a lot to see what types are coming back, to build intuition.
If you put a few hours into it for a few days in a row, you can probably get to this epiphany in one weekend. The trick is building up enough examples that your brain can generalize it. Nobody's going to learn it by staring at the abstract form and thinking hard--if we did work that way, there would be a lot more use of comonads. That's why it's important to re-type examples. You're not going to be able to write the examples yourself until you understand them, but working through them gives you something to build on, and builds healthy expectations (I'm going to need return here, because the naked value isn't in the monad, etc.)
The epiphany is worth it--but don't count on a monad tutorial to help you much, they're mainly a side-effect of other people having the epiphany.
Thanks so much for your advice -- much appreciated. I've pulled back out my copy of Learn You a Haskell and have just re-read the Haskell wikibook chapter on monads.
> Not much. It's a more expressive and cleaner language, but on the other hand python has NLTK + scipy community.
Haskell's mechanisms for defining parsers, lexers, and other pattern-matching tools are so good they probably pass over the line from "pretty" to "objectively better".
A lot of people who need to lex and parse data and then act on it turn to Haskell. It has some really remarkable and efficient libraries. And even for "common" target languages it's reasonable to write extremely fast parsers. With tuning, projects like Aeson are among some of the fastest JSON parsers and writers out there (only a few projects exceed its speed and resource efficiency in ANY runtime).
I am guessing you might be conflating parsing natural language with parsing something that has a rigid and well defined grammar (like a programming language). NLP is a whole different beast.
The very same patterns that define "packrat-like" parsers (which share a strong relationship to the monadic and "arrow-adic" parsers) can be extended to define things like DFAs and semantic pattern matching. And languages with support for rich, somewhat lazy pattern matching like Haskell and Prolog wipe the floor with eager languages without (e.g., C), which is ideal for semantic analysis.
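As a sketch of the monadic-parser pattern being described (written in Python for accessibility; names like `many` are borrowed from Parsec but this is a toy, not a real library API):

```python
# Each parser maps an input string to (value, rest) on success, or None.

def pure(v):
    """Lift a value into a parser that consumes nothing."""
    return lambda s: (v, s)

def fail(s):
    return None

def bind(p, f):
    """Monadic bind: run p, feed its result to f, run the parser f returns."""
    def parse(s):
        r = p(s)
        if r is None:
            return None
        value, rest = r
        return f(value)(rest)
    return parse

def one_of(chars):
    def parse(s):
        return (s[0], s[1:]) if s and s[0] in chars else None
    return parse

def many(p):
    """Zero-or-more repetition, like Parsec's many."""
    def parse(s):
        values = []
        while (r := p(s)) is not None:
            v, s = r
            values.append(v)
        return (values, s)
    return parse

digit = one_of("0123456789")
number = bind(many(digit), lambda ds: pure(int("".join(ds))) if ds else fail)

print(number("42abc"))  # (42, 'abc')
```

In Haskell, `bind` is `>>=` and the whole thing collapses into do-notation; the laziness and pattern matching the parent mentions make the combinators both terser and faster there.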
While not an "authority" on the subject, I've spent a lot of time working with some very skilled folks in the fields of NLP and linguistics. Most tools they used (in our case licensed from X/PARC) had C underpinnings for performance, but ultimately consumed specifications that were very much like Prolog or Haskell in character. Talking to some of the linguists who wrote those tools suggested that had GHC existed (or had Allegro or a fast Prolog been cheaper), they would have been much easier to write in those languages.
I'm afraid I can't say much more beyond what I have without talking out of my rear. But you can read about X/P's XLE project here: http://www2.parc.com/isl/groups/nltt/xle/
I'm always fascinated by NLP. My undergrad work was around it, and I'm currently doing research for my MS degree on a variant of automatic summarization, wherein I extract the most important sentences in an article. I'll open an API for it soon. :) If you're interested, just contact me by email (check my profile for it).
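Not the parent's actual method, but the classic baseline for that task is Luhn-style frequency scoring: rank sentences by the document-level frequency of their words. A toy sketch:

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Toy extractive summarizer: score each sentence by the summed
    document-level frequency of its words, keep the top n sentences,
    and return them in their original order."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = lambda s: re.findall(r"[a-z']+", s.lower())
    freq = Counter(w for s in sents for w in words(s))
    ranked = sorted(sents, key=lambda s: sum(freq[w] for w in words(s)),
                    reverse=True)
    keep = set(ranked[:n])
    return [s for s in sents if s in keep]

print(summarize("Cats sleep a lot. Cats and dogs play. Fish swim.", 1))
```

Real systems normalize for sentence length and filter stopwords, which this sketch deliberately skips.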
See http://blog.nerily.com/howto-train-your-own-modelset-for-you... . We'll see how NLP becomes more easily accessible with better tools to come over time.