
The issue with Gensim is that you have to know what you are trying to analyze before you analyze it. It doesn't do well if you use the wrong corpus or if, like you mention, you start with a million-word corpus.

If you were analyzing emails in a single organization all day, you could probably sort out topics really well. Doing all of the web, it breaks down, because accuracy drops as the variety of content grows.




"Doing all of the web" will cause pretty much any approach to AI/machine learning/NLP to break down. I'm a big believer that it's the responsibility of the engineer employing these techniques to take stock of the problem at hand and find out which constraints they can take advantage of to achieve better performance, accuracy, or prettiness of code. There's not really a silver bullet that you can just release on the internet with the task of bringing back incredibly useful information without "knowing what you're trying to analyze before you analyze it."


Web developers are a neat bunch. It's amazing what kinds of inference you can do by exploiting document structure alongside more traditional approaches like word frequency analysis, LDA, or even deep learning/distributional word inference. NLP on the web, especially question answering and search, can still be greatly expanded upon.
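
To make the "document structure plus word frequency" idea a bit more concrete, here is a rough sketch using only the Python standard library. It counts words in an HTML page but gives extra weight to words that appear inside titles and headings. The names (TAG_WEIGHTS, WeightedTermCounter, weighted_term_counts) and the weights themselves are made up for illustration, not taken from any library, and real-world HTML would need more careful handling than this.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: a word in a title or heading counts more than body text.
TAG_WEIGHTS = {"title": 5, "h1": 3, "h2": 2}


class WeightedTermCounter(HTMLParser):
    """Counts words, weighting each by the heaviest enclosing structural tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open weighted tags
        self.counts = Counter()  # word -> weighted frequency

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: only pop when the closing tag matches the top.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Plain body text gets weight 1.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1)
        for word in re.findall(r"[a-z']+", data.lower()):
            self.counts[word] += weight


def weighted_term_counts(html):
    """Return a Counter of structure-weighted word frequencies for one page."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


page = (
    "<html><head><title>Topic models</title></head>"
    "<body><h1>Topic modeling</h1>"
    "<p>Models of email topics. Email is narrow.</p></body></html>"
)
counts = weighted_term_counts(page)
print(counts.most_common(3))
```

Here "topic" outranks "email" even though each appears twice, because both occurrences of "topic" sit in the title and the h1. That is the kind of constraint you get for free on the web and lose if you treat every page as a flat bag of words.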


Wait an hour. We decided to push that bit of code live in Alpha. :-)


SCIENCE!



