
While CPU cost is a concern, memory, and correspondingly I/O, is often the bottleneck in vector space approaches. Practitioners can lean on highly optimized libraries for the matrix decompositions themselves, so random disk seeks become more of a concern than CPU time. Iterative SVD is where Gensim really shines, in my opinion.
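The key to keeping memory and I/O in check is streaming: the corpus is consumed one document at a time as sparse bag-of-words vectors, and the decomposition is updated incrementally, so the full term-document matrix never has to fit in RAM. A minimal pure-Python sketch of the streaming side of that idea (the class and names are illustrative, not Gensim's actual API):

```python
class StreamedCorpus:
    """Yield one sparse bag-of-words vector per document, so only the
    current document is ever held in memory."""

    def __init__(self, docs, vocab):
        self.docs = docs    # in practice, e.g. lines read lazily from disk
        self.vocab = vocab  # token -> integer id

    def __iter__(self):
        for doc in self.docs:
            counts = {}
            for tok in doc.lower().split():
                if tok in self.vocab:
                    tid = self.vocab[tok]
                    counts[tid] = counts.get(tid, 0) + 1
            # sparse (term_id, count) pairs, sorted by term id
            yield sorted(counts.items())

vocab = {"svd": 0, "memory": 1, "disk": 2}
corpus = StreamedCorpus(["SVD on disk", "memory memory"], vocab)
for bow in corpus:
    print(bow)  # [(0, 1), (2, 1)] then [(1, 2)]
```

An iterative SVD then makes a small number of sequential passes over such a stream, which turns random seeks into sequential reads.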

Turning to your example, any model based on term frequencies, vector space treatments included, would have trouble identifying 'Drakaal' as the most important term, but this can be mitigated to some extent by preprocessing. In particular, naive coreference resolution would simply assign 'Drakaal' to every occurrence of 'he'/'his' in the sentence (since there are no other candidates), in which case the count of 'Drakaal' jumps from 1 to 5. Taking just the comments in this thread as the corpus, that's a high frequency for a single document, which might indeed make it stand out on that basis alone.
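That naive coreference pass is a one-liner when there's only one candidate antecedent in scope: resolve every third-person pronoun to it before counting. A hedged sketch (the function name and the example sentence are invented for illustration, not taken from the thread):

```python
def naive_coref(tokens, antecedent):
    """Replace he/him/his with the single candidate antecedent.
    Only valid when no other candidate appears in the sentence."""
    pronouns = {"he", "him", "his"}
    return [antecedent if t.lower() in pronouns else t for t in tokens]

sentence = "Drakaal said he would guard his post because he knew his orders"
resolved = naive_coref(sentence.split(), "Drakaal")
print(resolved.count("Drakaal"))  # 5 -- up from 1 in the raw tokens
```

Real coreference is much harder once multiple candidates compete, but for the single-antecedent case this is enough to move the term frequency.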

Now, whether we could get even more nuanced and determine that it's not just about 'Drakaal' but also a certain disposition toward him really depends on the task. If it's important to uncover those sorts of patterns, I would incorporate some documents that are illustrative of the distinction. In this sense, vector space approaches can be either purely exploratory or guided toward the divisions you're aiming for.
