I was working with NLP, and its various toolkits (python-nltk I'm looking at you). The thing about NLP is, there just aren't enough libraries (for humans) that let you simply plug NLP into use. Even nltk, the premier python library for NLP, seems to be an NLP core-library for building NLP solutions, rather than for building NLP-powered apps. It also seems to be extremely unpythonic.
Is there a missing link there? I don't know. So I took a few days and built an extremely simple NER (named entity recognition) engine and made it extremely easy for any programmer to begin using.
While it's possible that some parts of NLTK may be unpythonic, I disagree that it's somehow inaccessible or hard to use. The NLTK project is fairly well-documented, and I've never had any difficulty using it. Also, all the NLTK code I've ever examined looks Pythonic to me, unlike the Java-style Python that I sometimes see coming out of academia; they follow PEP8 and have a developer style guide. There may be pockets of unpythonic code, but I've not come across any.
NLTK is also free and open-source, with a liberal license (Apache), which I appreciate greatly.
Also, I don't understand what you mean by nltk being a library for building NLP solutions, rather than NLP-powered apps. Can you expand on that?
Re: NER, I found this gist (not my own) for a basic example of entity extraction with nltk: https://gist.github.com/322906/90dea659c04570757cccf0ce1e6d2... This looks pretty straightforward to me; the basic nltk recipe is roughly the snippet below. What NLP toolkit are you using for the NER service your Chrome extension calls, if not NLTK?
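A minimal sketch of that recipe (it assumes NLTK's default models, e.g. punkt, the POS tagger, and maxent_ne_chunker, have already been fetched via nltk.download()):

    import nltk

    sentence = "Jim bought 300 shares of Acme Corp. in 2006."

    # tokenize -> POS tag -> chunk named entities with NLTK's default models
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    print(tree)  # an nltk.Tree with NE chunks like PERSON / ORGANIZATION / GPE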
It isn't just extraction, it's the entire process: building a dictionary, training it, and so on. Have you tried training your own dictionary to extract custom entities? It is "non-trivial".
yeah, for the vast majority of uses, most people really want to do just a fairly small set of things fairly well.
NER comes to mind: lots and lots of toolkits for building up to NER, but very few that let me submit English text and get back a list of people, places, and things without having to virtually build my own NER system from scratch anyway.
Give me NER, Entity relationships (ER) and a couple kinds of sentiment analysis scoring (SA) (which can be jump started with decent NER) and I've pretty much exhausted 95% of what I'd ever want to do.
I really, really, really don't need yet another library to do sentence tokenization, term tokenization, tf counting, and stemming. If I were building a free-text indexer or Bayesian filter or some such they might be useful, but I'm probably not; there are far better solutions in those domains than I'm likely to come up with, but there aren't for NER, ER, and SA.
I've used the Stanford NLP library extensively for NER. I made heavy use of it in my senior thesis project.
It's pretty straightforward to use their library to read a document and output an XML file containing NER data (and lots of other fun stuff).
For instance, from the sentence:
> World War II, or the Second World War (often abbreviated as WWII or WW2), was a global military conflict lasting from 1939 to 1945, which involved most of the world's nations, including all of the great powers, eventually forming two opposing military alliances, the Allies and Axis.
Stanford NLP NER will output the following entities:
"World War II" - MISC
"Second World War" - MISC
"1939 to 1945" - DATE - NORMALIZED 1939/1945
"Axis" - MISC
You can view the output of Stanford's CoreNLP library (NER + dependency grammar + coreference resolution + some other stuff) for the Wikipedia article on World War II in my github repo:
edit: I should add that the real fun (for me) came from combining NER with dependency grammars and coreference resolution. It makes it very easy to turn Stanford NLP's output into a knowledge graph combining a large number of documents.
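For Python users, NLTK (3.x at least) also ships a thin wrapper around the standalone Stanford NER classifier. A rough sketch with placeholder paths for the jar and model you'd download from Stanford; note this gives raw per-token labels, not the full CoreNLP XML output with normalization and coreference:

    # Sketch: calling the standalone Stanford NER classifier via NLTK's wrapper.
    # Both paths below are placeholders -- point them at your Stanford NER download.
    # Requires Java on the PATH.
    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    st = StanfordNERTagger(
        "classifiers/english.all.3class.distsim.crf.ser.gz",  # placeholder model path
        "stanford-ner.jar",                                    # placeholder jar path
    )

    sentence = ("World War II, or the Second World War, was a global military "
                "conflict lasting from 1939 to 1945.")

    # tag() returns (token, label) pairs; "O" marks tokens outside any entity.
    for token, label in st.tag(word_tokenize(sentence)):
        if label != "O":
            print(token, "-", label)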
Here is the whole Stanford CoreNLP suite with visualisation output: http://nlp.stanford.edu:8080/corenlp/ It helps me greatly when it comes to interpreting the dependency structure.
If you need an example sentence: "Stanford University is located in California. It is a great university."
I also know that Microsoft Research has a demo online of their NLP tools: http://msrsplatdemo.cloudapp.net/ (Silverlight required). I don't think you can download the tools, but they do offer to provide you with an API token to call their service from their cloud.
Potential conflict of interest: I wrote parts of the CoreNLP visualiser.
I had experimented with NLTK, CoreNLP, OpenNLP, etc., and when it came to NER extraction, I felt NLTK did the better job (none of them were anywhere close to perfect/dependable), but NLTK had a lot more dictionaries to choose from and was better overall.
We use a highly customized/overhauled NLTK for our apps Iris (Siri for Android) and Friday for Android.
NLTK may very well be better. I only compared OpenNLP and Stanford because I was implementing my thesis in Scala and I wanted a library running on the JVM.
I can't recommend any libraries "for humans" for this, but there are APIs out there for it.
The main problem with many NLP libs (and data mining applications in general) has a lot to do with how much memory good models take up in order to be accurate at all. Here are a few APIs and libs that might be useful, though:
(Disclaimer: publisher of this one here)
https://www.mashape.com/agibsonccc/semantic-analytic
There are other text processing APIs on there as well.
As for libraries, I primarily come from the JVM camp for NLP, but I would recommend the following libraries:
My favorite is cleartk (http://code.google.com/p/cleartk/), mainly because it has a consistent interface, but UIMA itself can be a difficult toolchain to pick up, and I could understand most of these being overkill for many of the simple applications people may have in mind.
OpenNLP is great. I've used it for a lot of subtasks, but nothing that produces end results as described earlier. It's an amazing library for building NLP systems, but it doesn't produce anything directly (named entity recognition, etc.). Typically it's coupled with other libraries.
The big problem, I think, with NLP in general is that to do anything you typically need a pipeline: sentence segmentation, tokenization, and part-of-speech tagging at a bare minimum. Then from there you can do named entity recognition or other tasks that produce actual usable results; a rough sketch of that staging (using NLTK, just for illustration) is below.
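A minimal sketch, assuming NLTK's default models (punkt, the POS tagger, and the NE chunker) are already downloaded via nltk.download():

    import nltk

    text = ("Stanford University is located in California. "
            "It is a great university.")

    # The bare-minimum pipeline: sentence segmentation -> tokenization ->
    # POS tagging -> NE chunking, all with NLTK's default models.
    for sent in nltk.sent_tokenize(text):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        for subtree in tree.subtrees():
            if subtree.label() != "S":  # skip the root; keep NE chunks only
                entity = " ".join(word for word, tag in subtree.leaves())
                print(entity, "-", subtree.label())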
Well, if you want really good accuracy, then you need a model specific to your ___domain, which most likely you'll need to train up yourself.
The NLP systems leading in competitions such as the CoNLL shared tasks tend to be publicly available, so you can get a "general purpose" system there; but the usual approach now is to train a specific model for a specific purpose, since if you don't have a predetermined purpose, you can't really tell which items should be tagged as places (instead of things), which things should be tagged as "things", and how they should be classified more deeply; the list of classes tends to be application-specific.
NERILY appears to be closed source and only usable on a subscription pricing model. It appears to be similar to zemanta.com - unless I am mistaken and this is also available as a library?
It is indeed closed source and offered on an NLP-as-a-Service model; what we do have is a quick-to-deploy REST API with JSON output, and even a Chrome extension to train your own dictionary.
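To give a sense of what calling a JSON NER endpoint looks like from Python, here's a purely illustrative sketch; the URL, auth header, and response fields below are made up for the example, not NERILY's actual API:

    # Purely illustrative: calling a hypothetical JSON NER endpoint over HTTP.
    # The endpoint, auth header, and field names are placeholders.
    import requests

    API_URL = "https://api.example.com/v1/entities"  # placeholder endpoint
    API_KEY = "your-api-key"                          # placeholder credential

    text = "Stanford University is located in California."

    resp = requests.post(API_URL,
                         headers={"Authorization": "Bearer " + API_KEY},
                         json={"text": text})
    resp.raise_for_status()

    # Assume a response shaped like:
    # {"entities": [{"text": "Stanford University", "type": "ORGANIZATION"}, ...]}
    for entity in resp.json().get("entities", []):
        print(entity["text"], "-", entity["type"])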
See http://blog.nerily.com/howto-train-your-own-modelset-for-you... We'll see how NLP becomes more easily accessible as better tools come along over time.