Hacker News
Show HN: Subreddit Finder - Trained on 4M Reddit Posts from 4K Subreddits (valohai.com)
152 points by Arimbr on April 9, 2020 | 79 comments



"What is the penalty for living" -> http://reddit.com/r/Poland, 28%

"When should I kill my chicken" -> http://reddit.com/r/csgo, 19%

"Am I conscious" -> http://reddit.com/r/INTP, 25%

"How to not think" -> http://www.reddit.com/r/howtonotgiveafuck/, 49%

"Is the government evil" -> http://www.reddit.com/r/ENLIGHTENEDCENTRISM/, 19%

"Is the government good" -> http://www.reddit.com/r/CoronavirusUK, 10%

"Is the government useful" -> http://www.reddit.com/r/iran, 31%


Poland for the win :D


I tried "find hot local singles in your area" and the top result was /r/vinyls

actually very impressed


lol


A very cool demo, and I congratulate the author, but I am always a little sad about data-science-type demos that try to answer the question (one that is proving toxic) "given what I know about you, how can I find a community of people just like you?"

I would love to see a subreddit finder that answers questions like "what community would complement your interests?" or "what community needs to hear what you have to say?" or "what community would be made better by your presence?". Similarity is at best a proxy for it.

Those are harder but, I think, more useful.


Thanks for your valuable feedback. This indeed answers a question such as "What community is used to hearing what you have to say?".

It is not really based on your interests, it just takes your text and suggests subreddits where people have posted similar texts.

Otherwise, I agree with what you say. I would also love to see those kinds of systems. Kind of the reaction you get when you talk with a mentor who surprises you ;)

Unfortunately, my skills are not there yet but I am working hard to eventually be able to build those "surprise/discovery" systems.


They aren't plug 'n play for advertising, though. "How can I find a community for you," is the charitable flipside to, "here's a community you might like to be a part of," where the community is "Coors Lite purchasers."


That's so short-sighted, though; I'd be much more likely to engage with an advert that figured out a new thing that would interest me than the usual "you've been reading about sc2 for a week, so have more of the same" nonsense.


Yeah nobody knows how to do that. To the degree that anybody's figured any amount of it out, it's much more likely and lucrative to point that code at changing your vote than your brand of toilet paper (wink wink).


Good point, but I think the goal that you outline is harder and that this type of research is a stepping stone along that path.


I tried it with "best time tracking app for iOS?" and "I'm looking for a time tracking app. Any recommendations?"

I expected the iPhone or iOS subreddit to be suggested, but it suggested GearVR | 13.0%, ringdoorbell | 9.0%, canadacordcutters | 5.0%, TTVreborn | 5.0%, AusSkincare | 4.0%, sideloaded | 4.0%, FlutterDev | 2.0%, shopify | 2.0%, weightwatchers | 2.0%, crossfit | 2.0%.

Congrats on the attempt but it does still need some work.


I'm not really sure your query is what it was built for. That's more of a google search than a community idea.


Not sure if you're familiar with reddit, but question posts like that are incredibly common, especially in regard to technology questions and the like. It's still a forum when you get down to it, so lots of people like myself post questions on niche topics, because you can find small communities of experts on everything from dogs to obscure vintage computers.

Just saying, as a huge user of reddit - I'd expect the same as OP, those seem like reasonable searches to get those results.


I'm a long-time redditor, and that's why I said what I did. I felt the tool was more for finding your niche community than finding an answer to a question. Different interpretations, I guess.


Weird! I just tried "best time tracking app for iOS?" and got "iOSProgramming | 45.0 % swift | 18.0% ProtonMail | 14.0% ios | 4.0% apple | 3.0% iphone | 3.0% FlutterDev | 2.0% freebies | 1.0% jailbreak_ | 1.0% jailbreak | 1.0%"

For the second case: "I'm looking for a time tracking app. Any recommendations?" and got "GearVR | 43.0% TTVreborn | 14.0% WearOS | 4.0% OculusGo | 3.0% IPTV | 3.0% ApksApps | 2.0% animepiracy | 2.0% RabbitReddit | 2.0% NetflixViaVPN | 1.0% androidapps | 1.0%"

Which is a bit better! But still not perfect ;)


In my case the first sentence was the title and the second the body; I guess having the 2 together might’ve thrown it off. I didn’t know only one of the fields was actually required.


Good feedback. Indeed, it is not clear in the UI. The model concatenates the two, so I am not sure if changing the label "Message" to "Text" to mimic the Reddit UI for text posts would be enough... Or rather just have a single "Text" textarea, since for the model it is the same.


The intercom chat widget makes the tab title switch back and forth between "Subreddit Finder" and "Valohai says". There does not appear to be a way to dismiss the chat widget, so it just keeps flipping back and forth, which is visually annoying.

I keep many tabs open, but I am going to close this one immediately because I don't want to have something flashing at me out of the corner of my eye all day.


One place to improve this would be to use a better set of word-embeddings. FastText is, well, fast, but it's no longer close to SOTA.

You're most likely using simple average pooling, which is why many users are getting results that don't look right to them. Try a chunking approach, where you get a vector for each chunk of the document and horizontally concatenate those together (if your vectors are 50d and you do 5 chunks per doc, then you get a 250d fixed vector for each document regardless of length). This partially solves the issue of highly diluted vectors, which is responsible for the poor results that some users are reporting. You can also do "attentive pooling", where you pool the way a transformer head would pool - though that's an O(N^2) operation, so YMMV.
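The chunking idea above can be sketched in a few lines of numpy. This is a minimal illustration, not anything from the project; the function name `chunk_pool` is made up here, and it assumes you already have one embedding vector per token:

```python
import numpy as np

def chunk_pool(token_vectors: np.ndarray, n_chunks: int = 5) -> np.ndarray:
    """Average-pool token vectors within n_chunks equal slices of the
    document, then concatenate the per-chunk means into one fixed vector.

    token_vectors: (n_tokens, dim) array of per-token embeddings.
    Returns a (n_chunks * dim,) vector regardless of document length.
    """
    # Split token positions into n_chunks roughly equal runs.
    chunks = np.array_split(token_vectors, n_chunks, axis=0)
    # Mean-pool each chunk; an empty chunk (very short doc) becomes zeros.
    pooled = [c.mean(axis=0) if len(c) else np.zeros(token_vectors.shape[1])
              for c in chunks]
    return np.concatenate(pooled)

# 12 tokens with 50-d embeddings -> one 250-d document vector.
doc = np.random.rand(12, 50)
vec = chunk_pool(doc, n_chunks=5)
assert vec.shape == (250,)
```

Unlike whole-document average pooling, a word near the start and a word near the end land in different slots of the output vector, which is what keeps long documents from diluting into mush.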

If you have the GPU compute, try something like BERT, or GPT-2, which is fine-tuned on all of reddit. Better yet, try vertically concatenating all of the word-embedding models you can together (just stack the embeddings from each model) if you have the compute.

To respond to your comment (since HN isn't letting me post because I'm "posting too fast"):

You can use cheaper and more effective approaches for getting the subword functionality you want.

Look up "Byte Pair Embeddings". That will also handle the OOV problem, but with far less CPU/RAM overhead. BERT also does this for you with its unique form of tokenization.
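For intuition on what byte-pair methods do under the hood, here is a toy sketch of the classic merge-learning loop (repeatedly fuse the most frequent adjacent symbol pair). This is only an illustration of the idea, not the BPEmb library or BERT's actual tokenizer, and `learn_bpe_merges` is a made-up name:

```python
from collections import Counter

def learn_bpe_merges(words, n_merges):
    """Learn byte-pair merges: repeatedly merge the most frequent
    adjacent symbol pair across the corpus (Sennrich-style sketch)."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(n_merges):
        # Count every adjacent symbol pair across the whole vocab.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# On this toy corpus the frequent pair ('l', 'o') is merged first,
# then ('lo', 'w') - so "low" becomes a single learned subword.
print(learn_bpe_merges(["low", "lower", "lowest", "low"], n_merges=3))
```

Because unseen words still decompose into learned subwords, there is no hard OOV problem, which is the property the comment above is pointing at.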

A home CPU can fine-tune fastText in a day on 4 million documents if you're able to walk away from your computer for a while. It shouldn't cost you anything except electricity. If you set the number of epochs higher, you'll get better performance but correspondingly longer training times.

For BERT/GPT-2, you'll maybe want to fine-tune a small version of the model (say, the 117M-parameter version of GPT-2) and then vertically concatenate that with the regular un-fine-tuned GPT-2 model. That should be very fast and hopefully not expensive (and also possible on your home GPU).


Cool man! Thanks for sharing :) I wasn't familiar with the chunking approach. I will read more!

Regarding BERT, it indeed may perform better if fine-tuned correctly. For a baseline, fastText is great because it is super fast and runs on a CPU. It cost me $24 to run a 24h autotune on a 16-CPU-core machine. Also, fastText is great out of the box because it also builds word vectors for subwords, which helps with typos and specific terms that may otherwise be out of vocabulary.

I am betting that fine-tuning BERT will cost me at least 10x more. But this project is a chance to try it out :) Looking forward to v2!

Luckily, with Valohai, I get access to GPU credits for open source projects!


Would be nice to have the subreddits be links, so I could just click one to open that subreddit in a new tab.


Great idea! Will add that :)


Tried it with Hearthstone-related content. Title: "turn 2 lethal". Content: "I managed to cheat out 4 prophet valens on turn 2 followed up by mind blast."

Results: shadowverse, elderscrollslegends, teamfighttactics, teemotalk, fioramains, ekkomains, ezrealmains, bobstavern, kaisamains, xcom2

Should include: hearthstone. It did pick up BobsTavern, which is something. I thought you would want some feedback.


Thanks! That helps a lot, although I am not familiar with that area of knowledge.

Indeed, I have some ML metrics on a test split that give me an idea of its accuracy :) But it's just an estimate, so I am looking out for feedback on its real performance, so I can debug bad cases and fix those with more data or a better model.

The test performance on subreddit r/hearthstone is a 0.21 f1-score, which is not great. Looking at the confusion matrix, r/hearthstone often gets confused with:

r/BobsTavern, r/CompetitiveHS, r/customhearthstone, r/Blizzard

If you are curious, I uploaded the metrics (precision, recall, f1-score) and confusion matrix on the test dataset on a Google Spreadsheet.

https://docs.google.com/spreadsheets/d/1NBY1o85ZiNpcm4tcYhKk...

The sheet 'confusion_matrix_gt2' can be used to find similar subreddits.


Cool. Last year I created something similar as a Chrome extension: you could type in your post and it would show you where on reddit to post it. You could then just select a subreddit by clicking a link. Project is here: https://github.com/wesbarnett/insight


Nice work @exegete! The Chrome extension idea is great ;) Nice to also see some metrics for comparison; I will review your work. It looks like you achieved 0.6 accuracy with 600 classes. I got a 0.4 f1-score on 4000 classes, but I have a ton of posts and subreddits with images and no text ;) For this case, it is also nice to report Recall@k. The current model has a Recall@5 of 0.6, meaning that on the test dataset, 60% of the time the human choice is within the first 5 suggestions. Currently, it is not supposed to post automatically, but to help the user discover new subreddits :)
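The Recall@5 figure quoted above is simple to compute from ranked predictions. A minimal sketch (the helper name `recall_at_k` is made up, and the subreddit lists below are just example data, not the model's real output):

```python
def recall_at_k(true_labels, ranked_predictions, k=5):
    """Fraction of examples whose true subreddit appears in the
    model's top-k suggestions."""
    hits = sum(1 for truth, preds in zip(true_labels, ranked_predictions)
               if truth in preds[:k])
    return hits / len(true_labels)

# Toy evaluation: the true subreddit is in the top 5 for 2 of 3 posts.
truths = ["hearthstone", "ios", "wallstreetbets"]
preds = [
    ["BobsTavern", "shadowverse", "hearthstone", "Blizzard", "xcom2"],
    ["GearVR", "ringdoorbell", "canadacordcutters", "TTVreborn", "AusSkincare"],
    ["wallstreetbets", "stocks", "investing", "options", "pennystocks"],
]
print(recall_at_k(truths, preds, k=5))  # 2/3 -> 0.666...
```

So "Recall@5 of 0.6" means exactly this statistic evaluated over the whole test split.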

I will probably retrain it on more subreddits, and fine tune a few things.


Searching for "marijuana" should point to trees (internal Reddit joke) and not marijuana primarily.


Hilariously enough: those that want to stop smoking post to r/leaves.


And those who want to see trees go to r/marijuanaenthusiasts


This is pretty good. Typed in "$spy 1000" and it said r/wallstreetbets (100%). Accurate.


Tried stocks, stock options, investing - all kept giving Robinhoodpennystocks as the top option. Not sure if the model is not fully trained?

What are some examples where the model does recommend meaningful things?


Here were my two experiences, one I felt would be easy and the other hard:

Title: Build recommendations. Message: "I'd like to upgrade some components. My current rig has an old i7 and an RTX 2060. Looking for something midrange that can handle modern games at high settings (but maybe not ultra)."

Matches: Nvidia (19%), IndianGaming (8%), GamingLaptops (8%), pcgamingtechsupport (6%)

Title: Travel advice. Message: "I'm returning to Ireland in July from the USA. My visa is up. I know I will have to self-quarantine for two weeks. I cannot move back to my family home due to elderly parents. Are there any recommendations for people in this sort of situation? I'm happy to pay for a hotel, but don't want to put a hotel worker at risk. We have an old house down in Wexford I could stay in, but would involve taking a train when I arrive, and the HSE guidance says not to take public transport. Any advice?"

Recommendations: LegalAdviceUK (8%), IWantOut (7%), AskUK (5%)

Overall I think this was pretty good, even if it wasn't perfect. I thought it would struggle more with the second one (maybe getting confused and suggesting vacation planning subreddits). A little controversial that it kept suggesting "UK" reddits for a question about Ireland though :)


Thanks so much for your detailed feedback. There is definitely something going on with the UK thing... Need to dig deeper!


Thanks for making it, super cool! Btw, if it wasn't clear, my "controversial" comment isn't to be taken seriously


Suggested subreddits for this post:

lostredditors 45%

Well, yes, that is likely, but maybe not a good suggestion, as that is a place where folks point out people who posted the wrong thing in the wrong sub or conversation ;)


Interesting! I didn't know about that subreddit. Which text did you try? :)


IIRC I had some text about cat pics ;)


;)


Or rather people who point to a sub in a comment to a post... in that same sub :)


“I got straight A’s this semester!!!”

aggies 19.0%


Clicking a sub-reddit name should open the sub-reddit in a new window/tab.


This is awesome!

I often find that when I'm buying something new, I want to find subs related to that product category.

While this doesn't find me direct results, it shows me communities that I should focus my research on.


Tried to find /r/DevelEire using the search terms "Irish Software Developers"

No luck, but Google will bring it up as the first result if the query is "Irish Software Developers Reddit".


Unfortunately, I trained only on what Reddit told me are the most popular 4k subreddits :) It was not trained on r/DevelEire :/

So, I think I need to train it on more subreddits to make it more useful. Thanks for sharing!


Yeah - had the same experience looking for a whole buncha vintage computer/development topics. There would be 1-2 applicable results, but the majority were incorrect.

I find your example to be the biggest "no-go" for any practical application, because the info/description of /r/DevelEire is literally "A sub reddit for Irish Software Developers". That's something that a simple FULLTEXT solution would have easily found with high confidence.


I wish reddit would allow me to download all my comments.

Apparently it's not possible since they're all archived, because reddit constantly regenerates its webpages.


You can download them from Google:

https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...

Run a SQL query over everyone's comments.

edit: so only comments up to October, 6 months old, which is when they become archived. I guess Google only has the archived comments.


You can also download them from their original mirror at pushshift[1]

PS: that BQ you linked is maintained by fhoffa[2]

1: https://files.pushshift.io/reddit/

2: https://news.ycombinator.com/user?id=fhoffa


Original mirror? But the BQ has 2019_10, while pushshift only goes up to 2019-09.


Apologies, let me clarify: Pushshift has a live API that mirrors reddit, which you can use to compile a more recent dataset[0].

The uploader of that BQ has cited Pushshift as their source[1].

PS: In the next couple of days the batched archive data for Q4 of 2019 as well as Q1 of 2020 will be available[2]

0: https://github.com/pushshift/api

1: https://www.reddit.com/r/bigquery/comments/fcyu4m/extended_o...

2: https://www.reddit.com/r/pushshift/comments/fuoe2d/september...


Is r/thedonald not included in the database? I'm trying the usual suspect titles, but get nothing.


Yes, unfortunately it is not part of the 4k subreddits I trained on. I will retrain it with more subreddits :)

The list of subreddits and an estimation of the performance for each one is on this Google Spreadsheet

https://docs.google.com/spreadsheets/d/1NBY1o85ZiNpcm4tcYhKk...


You should probably add how exactly you retrieved the 4 million Reddit posts.


The official API is well documented.

https://www.reddit.com/dev/api/


True, but 4M posts is a heavy download (even when bypassing the API itself and using unauthenticated requests), so I was curious.


Probably with the official Reddit API? There are several libraries for it.


Right! I used the official Reddit API. I created an app and got the credentials for the API. Then I used the Python library PRAW to consume the API. https://praw.readthedocs.io/en/latest/

It took me 36 hours to collect the 4M posts. The Reddit API returns results in batches of 100, and then the client sleeps for 2 seconds.
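Once posts are collected, fastText's supervised mode expects one line per example with a `__label__` prefix (that prefix is fastText's real default; the helper name `to_fasttext_line` is made up for this sketch):

```python
def to_fasttext_line(subreddit: str, title: str, body: str) -> str:
    """Format one Reddit post as a fastText supervised-training line:
    '__label__<subreddit> <title> <body>', whitespace-normalized so the
    example stays on a single line."""
    text = " ".join((title + " " + body).split())
    return f"__label__{subreddit} {text}"

line = to_fasttext_line("hearthstone", "turn 2 lethal",
                        "I managed to cheat out 4 prophet valens\non turn 2")
print(line)
# __label__hearthstone turn 2 lethal I managed to cheat out 4 prophet valens on turn 2
```

A file of such lines can then be fed to fastText's supervised trainer; the newline-collapsing matters because each training example must occupy exactly one line.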

You can find some more details on how it was built here https://blog.valohai.com/machine-learning-pipeline-classifyi...

I can publish the repository that runs two commands to collect the data on GitHub, if you are interested.


Thanks for the answer!

I was curious since it seems like it was using the BigQuery dataset, but PRAW works too.


Hey, I tried:

Title: "My siberian cat" Message: "My floof"

I was hoping to find r/SiberianCats, where I usually post, but it wasn't in the list.

I googled "siberian cat subreddit" and r/SiberianCats was the first link.


Thanks man! Right, unfortunately it was not trained on r/SiberianCats. It was trained on what Reddit said are the most popular 4k subreddits.

To make it more useful, I need to collect more data from less popular subreddits :)


Maybe not a top 4k subreddit?


You are right!


Oh, ok. Thank you. I overestimated Siberian cats' popularity :)


Or you could just ask here:

https://old.reddit.com/r/findareddit/


Yes, that is the human solution to the problem. I manually tested a bit on what people ask there :)

On the other hand, the machine is faster, and a lot of people don't get an answer there or can't wait for it. The machine is not necessarily better, just a complement.


Anyone remember /r/reddit.com? That was around the time reddit looked like Hacker News and people were embarrassed to admit they use it.


HN doesn't get enough credit for the tight rope they walk maintaining this community. I see some people post that HN should expand to other topics a la Reddit, but the team does a great job of maintaining focus.

It's not just HN's aesthetic that is minimal and no-nonsense, it's their moderation policies and the tone they set for the community. There is perfect alignment between their approach to content, community, and UX—no fluff, no nonsense, no manipulation, just the simplest, most valuable material possible.

If they expand the scope of acceptable content, it will be really hard not to tweak moderation policies, and eventually you end up with something like reddit, where each subreddit might as well be its own (typically under-staffed) site.

EDIT: This is not critical of any post on this thread, just seeing the above comment about old-school reddit got me thinking.


> HN doesn't get enough credit for the tight rope they walk maintaining this community

You sure? Let me just check: theoretically, if you wanted to build a system to reinforce bubble thinking, how would it look different from this? There is hiding of disagreeable posts, invisible moderation, and a magic karma system where 1 vote is not always 1 point.

Typically this isn’t a problem, but you aren’t paying attention if you think this isn’t by design and doesn’t exactly lead to a diverse spectrum of opinions here.


Okay, so how would you design it differently and achieve better results? Or point to examples that do it better? If not, the criticism is highly unwarranted.


Depends what you mean by “better”. I believe this is supposed to be a bubble. The issue specifically in HN case is most people don’t realize that.

If you mean, how would I present a variety of ideas but not let it get out of control with an extreme one way or another that puts common people off... easy.

Remove the score system. That little number in the corner is cancer.

Keep the vote system, but only highlight when "many" people agree or disagree. Otherwise, posts are presented neutrally and the merit of the content must be evaluated. Even keep the grey-out system, but not when 4 people disagree; at 10 or so. It's easy enough to find 4 people here who will want to hide the fact that the WHO has dropped many balls during covid, including faking that video interview dropout to avoid addressing that Taiwan is its own country and not an "area of China" - doesn't mean it's not true.

The thing that might not be clear here is that I do think this is all intentional, and you used the right word, "community"... but I think the danger is that even long-time users don't know this, and think their ideas are "just right", not that they are being cultivated into the same bubble they themselves are cultivating. Is everyone aware the "community" is not entirely natural?

The easy way to think about this is to steelman a topic you know a lot about. If you were to argue the other side of a topic, how would it be presented on this site? (For example, argue some debatable aspect against anthropogenic climate change.) The answer is that it would most likely be hidden and downvoted into oblivion, so much so that it creates a chilling effect for anyone who would disagree in the future. That is wrong imo.

Edit: if you need proof there is a bubble with chilling effect, these posts are being hidden by anonymous disagreement :D


/r/programming is basically "I reposted this from https://news.ycombinator.com"


I've found accounts (possibly bots) that seem to go find the accompanying HN thread, and post a comment from it onto the reddit thread for karma. I sent reddit's anti-evil team a note about it since it's probably karma farming, but they never responded. Maybe it doesn't matter? It's not like we hold exclusive rights to stuff on here, so there's no legal issue, it just seems to be an efficient way of getting karma for that specific subreddit.

The reposts also result in a lot of non-programming-related content getting onto that sub, which the mods don't seem to delete very often (stuff that should go to r/sysadmin or even r/technology).


I used to do something like this any time I detected twin threads (discussing the same URL) across HN and reddit. If anyone asked any unanswered question at one source, one of my bots would ask it at the other sources (plus Quora, usually) and wait for a response elsewhere, then paste that response back to the original OP with a link/citations. Including the citations got me autobanned a few times (which reddit admins graciously removed, repeatedly); if I weren't concerned with plagiarism, bot management would probably have been much smoother.

Would be pretty trivial to skip all the question/answer stuff and just share comments around sites. In a vacuum, I'd say it could be argued that mirroring comments around the Internet would result in good in various ways (sharing information, letting people choose what site they want to use, limiting censorship and/or site downtime, getting answers to people who might not know the best place to ask them, etc).

What you saw was probably karma farming, but could also have been someone trying to help in some abstract way. :)


It would be better if it was just that...


Thanks for reminding me, I just found gold here.

* After 5 years of surfing reddit, these are my favorite discoveries...

https://old.reddit.com/r/reddit.com/comments/guktv/after_5_y...

A collection of science and computing related links.


I for one was embarrassed to admit I used reddit long after /r/reddit.com was defunct.


I just ran some pretty dumb queries. I can assure you something is missing.


Good attempt, but it needs a lot of work.


Thanks man! Looking forward to working on V2 :)



