A very cool demo and I congratulate the author, but I am always a little sad to see more data-science-type demos that try to answer the (increasingly toxic) question: "given what I know about you, how can I find a community of people just like you?"
I would love to see a subreddit finder that answers questions like "what community would complement your interests?" or "what community needs to hear what you have to say?" or "what community would be made better by your presence?". Similarity is at best a proxy for any of those.
Thanks for your valuable feedback. This indeed answers a question more like "What community is used to hearing what you have to say?".
It is not really based on your interests; it just takes your text and suggests subreddits where people have posted similar texts.
Otherwise, I agree with what you say. I would also love to see those kinds of systems. Kind of like the reaction you get when you talk with a mentor who surprises you ;)
Unfortunately, my skills are not there yet but I am working hard to eventually be able to build those "surprise/discovery" systems.
They aren't plug 'n play for advertising, though. "How can I find a community for you" is the charitable flip side of "here's a community you might like to be a part of," where the community is "Coors Lite purchasers."
That's so short-sighted though; I'd be so much more likely to engage with an advert that figured out a new thing that would interest me than the usual "you've been reading about sc2 for a week, so have more of the same" nonsense.
Yeah nobody knows how to do that. To the degree that anybody's figured any amount of it out, it's much more likely and lucrative to point that code at changing your vote than your brand of toilet paper (wink wink).
Not sure if you're familiar with reddit, but question posts like that are incredibly common, especially for technology topics and the like. It's still a forum when you get down to it, so lots of people like me post questions on niche topics, because you can find small communities of experts on everything from dogs to obscure vintage computers.
Just saying, as a huge user of reddit - I'd expect the same as OP, those seem like reasonable searches to get those results.
I'm a long-time redditor, and that's why I said what I did. I felt the tool was more for finding your niche community than for finding an answer to a question. Different interpretations, I guess.
In my case the first sentence was the title and the second the body; I guess having the 2 together might’ve thrown it off. I didn’t know only one of the fields was actually required.
Good feedback. Indeed, it is not clear in the UI. The model concatenates the two fields, so I'm not sure whether changing the label "Message" to "Text" to mimic Reddit's UI for text posts would be enough... or whether to just have a single "Text" textarea, since it's all the same to the model.
The intercom chat widget makes the tab title switch back and forth between "Subreddit Finder" and "Valohai says". There does not appear to be a way to dismiss the chat widget, so it just keeps flipping back and forth, which is visually annoying.
I keep many tabs open, but I am going to close this one immediately because I don't want to have something flashing at me out of the corner of my eye all day.
One place to improve this would be to use a better set of word-embeddings. FastText is, well, fast, but it's no longer close to SOTA.
You're most likely using simple average pooling, which is why many users are getting results that don't look right to them. Try a chunking approach, where you get a vector for each chunk of the document and horizontally concatenate those together (if your vectors are 50d and you do 5 chunks per doc, then you get a 250d fixed vector for each document regardless of length). This partially solves the problem of highly diluted vectors, which is responsible for the poor results some users are reporting. You can also do "attentive pooling", where you pool the way a transformer head would pool - though that's an O(N^2) operation, so YMMV.
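To make the chunking idea concrete, here's a minimal numpy sketch (the function name, dimensions, and the zero-padding for very short documents are my own choices, not anything from the demo):

    import numpy as np

    def chunked_pool(word_vectors, n_chunks=5):
        # Pad very short documents so every chunk has at least one row.
        if len(word_vectors) < n_chunks:
            pad = np.zeros((n_chunks - len(word_vectors), word_vectors.shape[1]))
            word_vectors = np.vstack([word_vectors, pad])
        # Split into contiguous chunks, average-pool each, then concatenate.
        chunks = np.array_split(word_vectors, n_chunks)
        return np.concatenate([c.mean(axis=0) for c in chunks])

    doc = np.random.rand(123, 50)  # 123 words, 50d embeddings
    vec = chunked_pool(doc)        # always 250d, regardless of doc length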
If you have the GPU compute, try something like BERT, or GPT-2, whose training data was scraped from links shared on reddit. Better yet, try vertically concatenating all of the word-embedding models you can (just stack the embeddings from each model) if you have the compute.
To respond to your comment (since HN isn't letting me post because I'm "posting too fast"):
You can use cheaper and more effective approaches for getting the subword functionality you want.
Look up "Byte Pair Embeddings". That will also handle the OOV problem but for far less CPU/RAM overhead. BERT also does this for you with its unique form of tokenization.
A home CPU can fine-tune FastText in a day on 4 million documents if you're able to walk away from your computer for a while. It shouldn't cost you anything except electricity. If you set the number of epochs higher, you'll get better performance but correspondingly longer training times.
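With the fastText Python bindings, that training run is just a few lines (file names are placeholders; fastText expects one "__label__<subreddit> <post text>" line per example):

    import fasttext

    model = fasttext.train_supervised(
        input="reddit.train",  # e.g. "__label__hearthstone Arena draft tips?"
        epoch=25,              # more epochs: better fit, longer training
        wordNgrams=2,
        thread=8,
    )
    print(model.predict("Which GPU for 1440p gaming?", k=5))  # top-5 subreddits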
For BERT/GPT-2, you'll maybe want to fine-tune a small version of the model (say, the 117M-parameter version of GPT-2) and then vertically concatenate that with the regular un-fine-tuned GPT-2 model. That should be fairly fast and hopefully not expensive (and also possible on your home GPU).
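Stacking two models' embeddings is just concatenation along the feature axis. A rough sketch with Hugging Face transformers and fastText (the glue code and pooling choices are mine; "reddit.bin" is a hypothetical fine-tuned fastText model):

    import numpy as np
    import torch
    from transformers import GPT2Model, GPT2Tokenizer
    import fasttext

    tok = GPT2Tokenizer.from_pretrained("gpt2")   # the 117M-parameter model
    gpt2 = GPT2Model.from_pretrained("gpt2").eval()
    ft = fasttext.load_model("reddit.bin")

    def stacked_embedding(text):
        with torch.no_grad():
            ids = tok(text, return_tensors="pt", truncation=True)
            hidden = gpt2(**ids).last_hidden_state       # (1, seq_len, 768)
            gpt2_vec = hidden.mean(dim=1).squeeze(0).numpy()
        ft_vec = ft.get_sentence_vector(text)            # e.g. 100d
        return np.concatenate([gpt2_vec, ft_vec])        # 768d + 100d stacked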
Cool man! Thanks for sharing :) I wasn't familiar with the chunking approach. I will read more!
Regarding BERT, it may indeed perform better if fine-tuned correctly. As a baseline, fastText is great because it is super fast and runs on a CPU: it cost me $24 to run a 24h autotune on a 16-core machine. fastText is also great out of the box because it builds word vectors for subwords, which helps with typos and specific terms that would otherwise be out of vocabulary.
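For reference, that 24h autotune is roughly a single call in the fastText Python API (file names here are placeholders):

    import fasttext

    model = fasttext.train_supervised(
        input="reddit.train",
        autotuneValidationFile="reddit.valid",
        autotuneDuration=60 * 60 * 24,  # search hyperparameters for 24 hours
        thread=16,
    )
    # Subword n-grams mean even typos get a usable vector:
    vec = model.get_word_vector("hearthstne")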
I am betting that fine-tuning BERT will cost me at least 10x more. But this project is a chance to try it out :) Looking forward to v2!
Luckily, with Valohai, I get access to GPU credits for open source projects!
Thanks! That helps a lot, although I am not familiar with that area of knowledge.
Indeed, I have some ML metrics on a test split that give me an idea of its accuracy :) But it's just an estimate, so I am indeed looking out for feedback on its real performance, so I can debug bad cases and fix them with more data or a better model.
The test performance on subreddit r/hearthstone is a 0.21 f1-score, which is not great. And the confusion matrix shows that r/hearthstone often gets confused with a handful of other subreddits.
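These per-subreddit numbers are standard multiclass metrics on the held-out split; a sketch with scikit-learn (the label arrays are placeholders):

    from sklearn.metrics import confusion_matrix, f1_score

    # Placeholder labels; in practice these come from the test split.
    y_true = ["hearthstone", "hearthstone", "wow", "gaming"]
    y_pred = ["hearthstone", "wow", "wow", "gaming"]

    per_class_f1 = f1_score(y_true, y_pred, average=None)  # one score per class
    cm = confusion_matrix(y_true, y_pred)  # row i: where true class i ended up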
Cool. Last year I created something like this as a Chrome extension: you could type in your post and it would show you where on reddit to post it. You could then just select a subreddit by clicking a link. Project is here: https://github.com/wesbarnett/insight
Nice work @exegete! The Chrome extension idea is great ;) Nice to also see some metrics for comparison. I will review your work. Looks like you achieved 0.6 accuracy with 600 classes. I got a 0.4 f1-score on 4000 classes, but I have a ton of posts and subreddits with images and no text ;) For this case, it is also nice to report Recall@k: the current model has a Recall@5 of 0.6, meaning that on the test dataset, the human's choice is within the first 5 suggestions 60% of the time. Currently, it is not supposed to post automatically but to help the user discover new subreddits :)
I will probably retrain it on more subreddits and fine-tune a few things.
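Recall@5 is straightforward to compute from fastText's top-k predictions; a sketch (assuming a trained fastText classifier `model`; the test pairs are placeholders):

    # Placeholder test data; in practice, (post text, true subreddit) pairs.
    test_pairs = [("Arena draft tips?", "hearthstone")]

    hits = 0
    for text, true_sub in test_pairs:
        labels, _probs = model.predict(text, k=5)    # top-5 suggestions
        hits += f"__label__{true_sub}" in labels
    print("Recall@5:", hits / len(test_pairs))       # ~0.6 on my test split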
Here were my two experiences, one I felt would be easy and the other hard:
Title: Build recommendations. Message:
"I'd like to upgrade some components. My current rig has an old i7 and an RTX 2060. Looking for something midrange that can handle modern games at high settings (but maybe not ultra)."
Title: Travel advice. Message:
"I'm returning to Ireland in July from the USA. My visa is up. I know I will have to self-quarantine for two weeks. I cannot move back to my family home due to elderly parents. Are there any recommendations for people in this sort of situation? I'm happy to pay for a hotel, but don't want to put a hotel worker at risk. We have an old house down in Wexford I could stay in, but would involve taking a train when I arrive, and the HSE guidance says not to take public transport. Any advice?"
Overall I think this was pretty good, even if it wasn't perfect. I thought it would struggle more with the second one (maybe getting confused and suggesting vacation planning subreddits). A little controversial that it kept suggesting "UK" reddits for a question about Ireland though :)
Well yes that is likely, but maybe not a good suggestion as that is a place where folks point out people who posted the wrong thing in the wrong sub or conversation ;)
Yeah - had the same experience looking for a whole buncha vintage computer/development topics. There would be 1-2 applicable results, but the majority were incorrect.
I find your example to be the biggest "no-go" for any practical application, because the description of /r/DevelEire is literally "A sub reddit for Irish Software Developers". That's something a simple FULLTEXT search would have easily found with high confidence.
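To be concrete, here's the kind of FULLTEXT baseline I mean, sketched with SQLite's FTS5 (assuming your Python build ships SQLite with FTS5 compiled in):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE subs USING fts5(name, description)")
    db.execute("INSERT INTO subs VALUES ('DevelEire', "
               "'A sub reddit for Irish Software Developers')")
    query = "irish AND software AND developers"  # matching is case-insensitive
    for (name,) in db.execute(
            "SELECT name FROM subs WHERE subs MATCH ?", (query,)):
        print(name)  # DevelEire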
Right! I used the official Reddit API: I created an app, got the API credentials, and then used the Python library PRAW to consume the API. https://praw.readthedocs.io/en/latest/
It took me 36 hours to collect the 4M posts. The Reddit API returns results in batches of 100, and my script sleeps for 2 seconds between requests.
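The collection loop was roughly this shape (a sketch; the credentials, subreddit, and fields kept are illustrative):

    import praw

    reddit = praw.Reddit(
        client_id="...",         # credentials from your Reddit app
        client_secret="...",
        user_agent="subreddit-finder data collection",
    )
    # PRAW pages through listings 100 results at a time; I also slept
    # 2 seconds between requests to stay well within the rate limits.
    for submission in reddit.subreddit("MachineLearning").new(limit=1000):
        if submission.is_self:   # keep text posts only
            print(submission.title, submission.selftext[:80])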
Yes, that is the human solution to the problem. I manually tested a bit on what people ask there :)
On the other hand, the machine is faster, and a lot of people either don't get an answer there or can't wait for one. The machine is not necessarily better, just a complement.
HN doesn't get enough credit for the tightrope they walk maintaining this community. I see some people post that HN should expand to other topics a la Reddit, but the team does a great job of maintaining focus.
It's not just HN's aesthetic that is minimal and no-nonsense, it's their moderation policies and the tone they set for the community. There is perfect alignment between their approach to content, community, and UX—no fluff, no nonsense, no manipulation, just the simplest, most valuable material possible.
If they expand the scope of acceptable content, it will be really hard not to tweak moderation policies, and eventually you end up with something like reddit, where each subreddit might as well be its own (typically under-staffed) site.
EDIT: This is not critical of any post on this thread, just seeing the above comment about old-school reddit got me thinking.
> HN doesn't get enough credit for the tight rope they walk maintaining this community
You sure? Let me just check: theoretically, if you wanted to build a system to reinforce bubble thinking, how would it look different from this? There is hiding of disagreeable posts, invisible moderation, and a magic karma system where one vote is not always one point.
Typically this isn't a problem, but you aren't paying attention if you think this isn't by design - and it doesn't exactly lead to a diverse spectrum of opinions here.
Okay, so how would you design it differently and achieve better results? Or point to examples that do it better? If not, the criticism is highly unwarranted.
Depends what you mean by "better". I believe this is supposed to be a bubble. The issue specifically in HN's case is that most people don't realize that.
If you mean: how would I present a variety of ideas without letting things get so extreme one way or another that they put common people off... easy.
Remove the score system. That little number in the corner is cancer.
Keep the vote system, but only highlight when "many" people agree or disagree. Otherwise, posts are presented neutrally and the merit of the content has to be evaluated on its own. Even keep the grey-out system, but trigger it at 10 or so disagreements rather than 4. It's easy enough to find 4 people here who will want to hide the fact that the WHO has dropped many balls during covid, including faking that video-interview dropout to avoid addressing that Taiwan is its own country and not an "area of China" - that doesn't mean it's not true.
The thing that might not be clear here is that I do think this is all intentional, and you used the right word, "community"... but I think the danger is that even long-time users don't know this and think their ideas are "just right", not realizing they are being cultivated into the same bubble they themselves are cultivating. Is everyone aware the "community" is not entirely natural?
The easy way to think about this is to steelman a topic you know a lot about. If you were to argue the other side of that topic, how would it be received on this site? (For example, argue some debatable aspect against anthropogenic climate change.) The answer is: most likely hidden and downvoted into oblivion, so much so that it creates a chilling effect for anyone who would disagree in the future. That is wrong, imo.
Edit: if you need proof there is a bubble with chilling effect, these posts are being hidden by anonymous disagreement :D
I've found accounts (possibly bots) that seem to find the accompanying HN thread and post a comment from it onto the reddit thread for karma. I sent reddit's anti-evil team a note about it, since it's probably karma farming, but they never responded. Maybe it doesn't matter? It's not like we hold exclusive rights to what's posted here, so there's no legal issue; it just seems to be an efficient way of farming karma on that specific subreddit.
The reposts also result in a lot of non-programming content landing on that sub, which the mods don't seem to delete very often (stuff that should go to r/sysadmin or even r/technology).
I used to do something like this any time I detected twin threads (discussing the same URL) across HN and reddit. If anyone asked any unanswered question at one source, one of my bots would ask it at the other sources (plus Quora, usually) and wait for a response elsewhere, then paste that response back to the original OP with a link/citations. Including the citations got me autobanned a few times (which reddit admins graciously removed, repeatedly); if I weren't concerned with plagiarism, bot management would probably have been much smoother.
It would be pretty trivial to skip all the question/answer stuff and just share comments around sites. In a vacuum, I'd say it could be argued that mirroring comments around the Internet does good in various ways (sharing information, letting people choose which site they want to use, limiting censorship and/or mitigating site downtime, getting answers to people who might not know the best place to ask, etc.).
What you saw was probably karma farming, but could also have been someone trying to help in some abstract way. :)
"When should I kill my chicken" -> http://reddit.com/r/csgo, 19%
"Am I conscious" -> http://reddit.com/r/INTP, 25%
"How to not think" -> http://www.reddit.com/r/howtonotgiveafuck/, 49%
"Is the government evil" -> http://www.reddit.com/r/ENLIGHTENEDCENTRISM/, 19%
"Is the government good" -> http://www.reddit.com/r/CoronavirusUK, 10%
"Is the government useful" -> http://www.reddit.com/r/iran, 31%