I was working with NLP, and its various toolkits (python-nltk I'm looking at you). The thing about NLP is, there just aren't enough libraries (for humans) that let you simply plug NLP into use. Even nltk, the premier python library for NLP, seems to be an NLP core-library for building NLP solutions, rather than for building NLP-powered apps. It also seems to be extremely unpythonic.
Is there a missing link there? I don't know. So I took a few days and built an extremely simple NER (named entity recognition) engine and made it extremely easy for any programmer to begin using.
While it's possible that some parts of NLTK may be unpythonic, I disagree that it's somehow inaccessible or hard to use. The NLTK project is fairly well-documented, and I've never had any difficulty using it. Also, all the NLTK code I've ever examined looks Pythonic to me, unlike the Java-style Python that I sometimes see coming out of academia; they follow PEP8 and have a developer style guide. There may be pockets of unpythonic code, but I've not come across any.
NLTK is also free and open-source, with a liberal license (Apache), which I appreciate greatly.
Also, I don't understand what you mean by nltk being a library for building NLP solutions, rather than NLP-powered apps. Can you expand on that?
Re: NER, I found this gist (not my own) for a basic example of entity extraction with nltk: https://gist.github.com/322906/90dea659c04570757cccf0ce1e6d2... This looks pretty straightforward to me; the basic nltk recipe is roughly the snippet below. What NLP toolkit are you using for the NER service your Chrome extension calls, if not NLTK?
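A minimal sketch of that recipe (it assumes NLTK's default models, e.g. punkt, the POS tagger, and maxent_ne_chunker, have already been fetched via nltk.download()):

    import nltk

    sentence = "Jim bought 300 shares of Acme Corp. in 2006."

    # tokenize -> POS tag -> chunk named entities with NLTK's default models
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    print(tree)  # an nltk.Tree with NE chunks like PERSON / ORGANIZATION / GPE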
It isn't just extraction, it's the entire process: building a dictionary, training it, and so on. Have you tried training your own dictionary to extract custom entities? It is "non-trivial".
yeah, for the vast majority of uses, most people really want to do just a fairly small set of things fairly well.
NER comes to mind: lots and lots of toolkits for building up to NER, but very few that let me submit English text and get back a list of people, places, and things without having to virtually build my own NER system from scratch anyway.
Give me NER, Entity relationships (ER) and a couple kinds of sentiment analysis scoring (SA) (which can be jump started with decent NER) and I've pretty much exhausted 95% of what I'd ever want to do.
I really, really, really don't need yet another library to do sentence tokenization, term tokenization, tf counting, and stemming. If I were building a free-text indexer or Bayesian filter or some such they might be useful, but I'm probably not; there are far better solutions in those domains than I'm likely to come up with, but there aren't for NER, ER, and SA.
I've used the Stanford NLP library extensively for NER. I made heavy use of it in my senior thesis project.
It's pretty straightforward to use their library to read a document and output an XML file containing NER data (and lots of other fun stuff).
For instance, from the sentence:
> World War II, or the Second World War (often abbreviated as WWII or WW2), was a global military conflict lasting from 1939 to 1945, which involved most of the world's nations, including all of the great powers, eventually forming two opposing military alliances, the Allies and Axis.
Stanford NLP NER will output the following entities:
"World War II" - MISC
"Second World War" - MISC
"1939 to 1945" - DATE - NORMALIZED 1939/1945
"Axis" - MISC
You can view the output of Stanford's CoreNLP library (NER + dependency grammar + coreference resolution + some other stuff) for the Wikipedia article on World War II in my github repo:
edit: I should add that the real fun (for me) came from combining NER with dependency grammars and coreference resolution. It makes it very easy to turn Stanford NLP's output into a knowledge graph combining a large number of documents.
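For Python users, NLTK (3.x at least) also ships a thin wrapper around the standalone Stanford NER classifier. A rough sketch with placeholder paths for the jar and model you'd download from Stanford; note this gives raw per-token labels, not the full CoreNLP XML output with normalization and coreference:

    # Sketch: calling the standalone Stanford NER classifier via NLTK's wrapper.
    # Both paths below are placeholders -- point them at your Stanford NER download.
    # Requires Java on the PATH.
    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    st = StanfordNERTagger(
        "classifiers/english.all.3class.distsim.crf.ser.gz",  # placeholder model path
        "stanford-ner.jar",                                    # placeholder jar path
    )

    sentence = ("World War II, or the Second World War, was a global military "
                "conflict lasting from 1939 to 1945.")

    # tag() returns (token, label) pairs; "O" marks tokens outside any entity.
    for token, label in st.tag(word_tokenize(sentence)):
        if label != "O":
            print(token, "-", label)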
Here is the whole Stanford CoreNLP suite with visualisation output: http://nlp.stanford.edu:8080/corenlp/ It helps me greatly when it comes to interpreting the dependency structure.
If you need an example sentence: "Stanford University is located in California. It is a great university."
I also know that Microsoft Research has a demo online of their NLP tools: http://msrsplatdemo.cloudapp.net/ (Silverlight required). I don't think you can download the tools, but they do offer to provide you with an API token to call their service from their cloud.
Potential conflict of interest: I wrote parts of the CoreNLP visualiser.
I had experimented with NLTK, CoreNLP, OpenNLP, etc., and when it came to NER extraction, I felt NLTK did the better job (none of them were anywhere close to perfect/dependable), but NLTK had a lot more dictionaries to choose from and was better overall.
We use a highly customized/overhauled NLTK for our apps Iris (Siri for Android) and Friday for Android.
NLTK may very well be better. I only compared OpenNLP and Stanford because I was implementing my thesis in Scala and I wanted a library running on the JVM.
I can't recommend any libraries "for humans" for this, but there are APIs out there for it.
The main problem with many NLP libs (and data mining applications in general) has a lot to do with how much memory good models take up in order to be accurate at all. Here are a few APIs and libs that might be useful, though:
(Disclaimer: publisher of this one here)
https://www.mashape.com/agibsonccc/semantic-analytic
There are other text processing APIs on there as well.
As for libraries, I primarily come from the JVM camp for NLP, but I would recommend the following libraries:
My favorite is cleartk (http://code.google.com/p/cleartk/), mainly because it has a consistent interface, but UIMA itself can be a difficult toolchain to pick up, and I could understand most of these being overkill for many of the simple applications people may have in mind.
OpenNLP is great. I've used it for a lot of subtasks, but nothing that produces end results as described earlier. It's an amazing library for building NLP systems, but it doesn't produce anything directly (named entity recognition, etc.). Typically it's coupled with other libraries.
The big problem, I think, with NLP in general is that to do anything you typically need a pipeline: sentence segmentation, tokenization, and part-of-speech tagging at a bare minimum. Then from there you can do named entity recognition or other tasks that produce actual usable results; a rough sketch of that staging (using NLTK, just for illustration) is below.
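A minimal sketch, assuming NLTK's default models (punkt, the POS tagger, and the NE chunker) are already downloaded via nltk.download():

    import nltk

    text = ("Stanford University is located in California. "
            "It is a great university.")

    # The bare-minimum pipeline: sentence segmentation -> tokenization ->
    # POS tagging -> NE chunking, all with NLTK's default models.
    for sent in nltk.sent_tokenize(text):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        for subtree in tree.subtrees():
            if subtree.label() != "S":  # skip the root; keep NE chunks only
                entity = " ".join(word for word, tag in subtree.leaves())
                print(entity, "-", subtree.label())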
Well, if you want really good accuracy, then you need a model specific to your ___domain, which most likely you'll need to train up yourself.
The NLP systems leading in competitions such as the CoNLL shared tasks tend to be publicly available, so you can get a "general purpose" system there; but the usual approach now is to train a specific model for a specific purpose, since if you don't have a predetermined purpose, you can't really tell which items should be tagged as places (instead of things), which things should be tagged as "things", and how they should be classified more deeply; the list of classes tends to be application-specific.
NERILY appears to be closed source and only usable on a subscription pricing model. It appears to be similar to zemanta.com - unless I am mistaken and this is also available as a library?
It is indeed closed source and offered on an NLP-as-a-Service model; what we do have is a quick-to-deploy REST API with JSON output, and even a Chrome extension to train your own dictionary.
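To give a sense of what calling a JSON NER endpoint looks like from Python, here's a purely illustrative sketch; the URL, auth header, and response fields below are made up for the example, not NERILY's actual API:

    # Purely illustrative: calling a hypothetical JSON NER endpoint over HTTP.
    # The endpoint, auth header, and field names are placeholders.
    import requests

    API_URL = "https://api.example.com/v1/entities"  # placeholder endpoint
    API_KEY = "your-api-key"                          # placeholder credential

    text = "Stanford University is located in California."

    resp = requests.post(API_URL,
                         headers={"Authorization": "Bearer " + API_KEY},
                         json={"text": text})
    resp.raise_for_status()

    # Assume a response shaped like:
    # {"entities": [{"text": "Stanford University", "type": "ORGANIZATION"}, ...]}
    for entity in resp.json().get("entities", []):
        print(entity["text"], "-", entity["type"])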
See http://blog.nerily.com/howto-train-your-own-modelset-for-you... We'll see how NLP becomes more easily accessible as better tools come along over time.