One place to improve this would be to use a better set of word-embeddings. FastText is, well, fast, but it's no longer close to SOTA.
You're most likely using simple average pooling, which is why many users are getting results that don't look right to them. Try a chunking approach, where you get a vector for each chunk of the document and horizontally concatenate those together (if your vectors are 50d and you do 5 chunks per doc, then you get a 250d fixed vector for each document regardless of length). This partially solves the problem of highly diluted vectors, which is what's behind the poor results some users are reporting. You can also do "attentive pooling", where you pool the way a transformer head would - though that's an O(N^2) operation, so YMMV.
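Here's a minimal sketch of that chunked pooling, assuming you already have one vector per word (e.g. 50d fastText vectors) stacked into a numpy array; the function name and chunk count are just illustrative:

    import numpy as np

    def chunked_doc_vector(word_vectors, n_chunks=5):
        # word_vectors: (n_words, dim) array of per-word embeddings
        # returns a fixed (n_chunks * dim,) vector regardless of doc length
        chunks = np.array_split(word_vectors, n_chunks)
        dim = word_vectors.shape[1]
        pooled = [c.mean(axis=0) if len(c) else np.zeros(dim) for c in chunks]
        return np.concatenate(pooled)

    doc = np.random.rand(40, 50)          # a 40-word doc with 50d vectors
    print(chunked_doc_vector(doc).shape)  # (250,)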
If you have the GPU compute, try something like BERT, or GPT-2 (which was trained on a huge scrape of pages linked from Reddit). Better yet, if you have the compute, try vertically concatenating the outputs of every word-embedding model you can get your hands on (just stack the embeddings from each model).
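For static embeddings, that stacking can look roughly like this, assuming two already-loaded models that map a word to a vector (the model names and dimensions are placeholders):

    import numpy as np

    def stacked_word_vector(word, models, dims):
        # models: objects/dicts mapping word -> vector (placeholders here)
        # dims:   output size of each model, used as a zero fallback on OOV
        parts = []
        for model, dim in zip(models, dims):
            try:
                parts.append(np.asarray(model[word]))
            except KeyError:
                parts.append(np.zeros(dim))
        return np.concatenate(parts)

    # e.g. a 300d fastText vector + a 100d GloVe vector -> one 400d vector
    # vec = stacked_word_vector("example", [fasttext_model, glove_vectors], [300, 100])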
To respond to your comment (since HN isn't letting me post cus I'm 'posting too fast')
You can use cheaper and more effective approaches for getting the subword functionality you want.
Look up "Byte Pair Embeddings". Those will also handle the OOV problem, but for far less CPU/RAM overhead. BERT also does this for you via its WordPiece tokenization.
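For instance, with the bpemb package (pretrained byte-pair embeddings); the vocab size and dimensionality below are just example settings:

    from bpemb import BPEmb

    bp = BPEmb(lang="en", vs=50000, dim=100)
    pieces = bp.encode("fasttextification")   # rare word -> known subword pieces
    vectors = bp.embed("fasttextification")   # one 100d vector per piece
    doc_vec = vectors.mean(axis=0)            # simple pooled vector, no OOV issue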
A home CPU can fine-tune FastText in a day on 4 million documents if you're able to walk away from your computer for a while. It shouldn't cost you anything except electricity. If you set the number of epochs higher, you'll get better vectors but correspondingly longer training times.
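With the official fasttext Python bindings, that kind of run looks something like this (the corpus path and hyperparameters are placeholders):

    import fasttext

    # corpus.txt: plain text, one document per line (placeholder path)
    model = fasttext.train_unsupervised(
        "corpus.txt",
        model="skipgram",
        dim=100,
        epoch=10,    # more epochs -> better vectors, longer training
        thread=8,    # scales across your CPU cores
    )
    model.save_model("vectors.bin")
    print(model.get_word_vector("example").shape)  # (100,)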
For BERT/GPT-2, you may want to fine-tune a small version of the model (say, the 117M-parameter version of GPT-2) and then vertically concatenate its features with those of the regular, un-fine-tuned GPT-2 model. That should be very fast and hopefully not expensive (and also possible on your home GPU).
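A sketch of that concatenation with the transformers library, mean-pooling the hidden states from each model (the fine-tuned checkpoint path is a placeholder):

    import torch
    from transformers import GPT2Model, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    base = GPT2Model.from_pretrained("gpt2").eval()
    tuned = GPT2Model.from_pretrained("path/to/your-finetuned-gpt2").eval()  # placeholder

    def doc_vector(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            h_base = base(**inputs).last_hidden_state.mean(dim=1)    # (1, 768)
            h_tuned = tuned(**inputs).last_hidden_state.mean(dim=1)  # (1, 768)
        return torch.cat([h_base, h_tuned], dim=-1).squeeze(0)       # (1536,)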
Cool man! Thanks for sharing :) I wasn't familiar with the chunking approach. I will read more!
Regarding BERT, it may indeed perform better if fine-tuned correctly. For a baseline, fastText is great because it is super fast and runs on a CPU. It cost me $24 to run a 24h autotune on a 16-core machine. Also, fastText is great out of the box because it builds word vectors for subwords, which helps with typos and specific terms that might otherwise be out of vocabulary.
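(For reference, a run like the autotune described above looks roughly like this with the official fasttext bindings; file names are placeholders:)

    import fasttext

    # train.txt / valid.txt: __label__... formatted lines (placeholder paths)
    model = fasttext.train_supervised(
        input="train.txt",
        autotuneValidationFile="valid.txt",   # held-out set the tuner optimizes on
        autotuneDuration=24 * 60 * 60,        # tune for 24 hours
    )
    model.save_model("autotuned.bin")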
I am betting that fine-tuning BERT will cost me at least 10x more. But this project is a chance to try it out :) Looking forward to v2!
Luckily, with Valohai, I get access to GPU credits for open source projects!