
How do you deal with keeping your list of super-rare words sensible? For many forms of technical writing, I could see things getting out of hand, where you end up with lots of tiny, dense clusters that aren't really close to anything else if you don't manage the list well.



As with most successful applications of machine learning, it's about finessing your approach based on the problem at hand. In our case, we have classes divided at the level of "Medicine," "Real Estate," etc. So we could throw away lots of words that only occurred once or twice in the massive corpus we crawled to build the language model and still have a pretty robust representation of each subject.


In fact, if your training corpus is sufficiently large, you'd be shocked how many words you can eliminate right away just by dropping anything with a term frequency of one or two. I went from millions of words in the vocabulary to something like 60k just by ignoring words that occur only once or twice in the corpus. Plus, you probably won't learn much about the relationships between words if they only appear a handful of times anyway.
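
If it helps, here's a minimal sketch of that kind of cutoff in plain Python (the documents, the naive whitespace tokenizer, and the min_count threshold are placeholders for illustration, not a description of the actual pipeline):

    from collections import Counter

    def prune_vocabulary(documents, min_count=3):
        """Keep only words that occur at least `min_count` times across the whole corpus."""
        counts = Counter()
        for doc in documents:
            counts.update(doc.lower().split())  # naive whitespace tokenizer
        return {word for word, n in counts.items() if n >= min_count}

    docs = [
        "the motherboard in the new build failed",
        "the old motherboard in the server still works",
        "the patient recovered and was discharged",
    ]
    print(prune_vocabulary(docs, min_count=2))  # words seen only once are dropped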


Yeah, but consider that some rare words are much stronger indicators of topic than more common ones, even more so if you look at n-grams. If you use something like WordNet, you can get a lot of meaning out of low-frequency words and throw away the meaningless higher-frequency ones that occur in too many categories to be useful.
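
As a rough sketch of the WordNet idea, using NLTK (the generalize-via-first-hypernym strategy is just one illustrative choice, not anyone's actual setup, and it needs the 'wordnet' corpus downloaded):

    from nltk.corpus import wordnet as wn  # pip install nltk; nltk.download('wordnet')

    def generalize(word):
        """Map a low-frequency word to a broader concept via its first WordNet hypernym."""
        synsets = wn.synsets(word)
        if not synsets:
            return word  # not in WordNet: keep the word as-is
        hypernyms = synsets[0].hypernyms()
        return hypernyms[0].lemma_names()[0] if hypernyms else word

    # e.g. folds a rare technical term into a more general one when WordNet covers it
    print(generalize("arthroscopy"))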


Sure, there's value in rare words, but I don't think anything that occurs fewer than 3 times across the corpus is going to tell you anything useful. You need a certain number of occurrences before it's a real signal. What was the least frequent useful word in the data set, msalahi?


You can often strike a balance between rare words that appear in only a couple of documents and very frequent words that occur all over the place by employing both a term-frequency and an inverse-document-frequency weighting scheme; 'tf-idf' in the nomenclature [1].

The basic idea is that you keep track of counts both within documents and across documents. In English, a word like 'the' will be frequent in each document it occurs in, but it will also occur in every document, so the high document frequency counteracts the high term frequency. On the other hand, 'motherboard' might be infrequent overall (though not vanishingly rare), and its low document frequency boosts its importance.
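
A toy illustration of that trade-off, using the standard tf * log(N / df) weighting (the counts are made up for the example):

    import math

    def tf_idf(term_count, doc_length, n_docs, n_docs_with_term):
        tf = term_count / doc_length               # how frequent the term is in one document
        idf = math.log(n_docs / n_docs_with_term)  # penalizes terms that appear everywhere
        return tf * idf

    # 'the': very frequent in the document, but in all 1000 documents -> weight of 0
    print(tf_idf(term_count=50, doc_length=1000, n_docs=1000, n_docs_with_term=1000))
    # 'motherboard': rarer in the document, but in only 20 documents -> gets boosted
    print(tf_idf(term_count=5, doc_length=1000, n_docs=1000, n_docs_with_term=20))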

The scheme is commonly employed and works quite well, sometimes obviating the need for careful vocabulary pruning. FWIW, scikit-learn implements it in its feature extraction library [2].
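
For example, a minimal use of that implementation might look like the following (the documents and the min_df value are placeholders; min_df also gives you the rare-word pruning discussed above):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the motherboard in the new build failed",
        "the old motherboard in the server still works",
        "the patient recovered and the patient was discharged",
    ]
    # min_df=2 drops terms that appear in fewer than 2 documents
    vectorizer = TfidfVectorizer(min_df=2)
    X = vectorizer.fit_transform(docs)  # sparse matrix of tf-idf weights, one row per document
    print(vectorizer.vocabulary_)       # surviving terms mapped to column indices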

[1] http://en.wikipedia.org/wiki/Tf–idf
[2] http://scikit-learn.org/stable/modules/generated/sklearn.fea...



