
Probably with the official Reddit API? There are several libraries for it.



Right! I used the official Reddit API. I created an app, got the API credentials, and then used the Python library PRAW to consume the API. https://praw.readthedocs.io/en/latest/
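For anyone curious, here is a minimal sketch of that setup, assuming a "script"-type Reddit app with placeholder credentials (not the actual code from the project):

    import praw

    # Credentials come from a Reddit app created at
    # https://www.reddit.com/prefs/apps (placeholders below).
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="reddit-post-collector/0.1 by u/YOUR_USERNAME",
    )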

It took me 36 hours to collect the 4M posts. The Reddit API returns results in batches of 100, and the client then sleeps for 2 seconds before the next request.
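That timing roughly checks out: 4M posts at 100 per request is about 40,000 requests, and with a 2-second pause plus fetch time per request you land in the 30+ hour range. Below is a hedged sketch of a collection loop using PRAW's listing generators, which page through results 100 at a time under the hood; the subreddit name and the fields kept are placeholders, not the author's actual pipeline:

    import time

    # "reddit" is the authenticated praw.Reddit client from the snippet above.
    collected = []
    for submission in reddit.subreddit("MachineLearning").new(limit=None):
        collected.append({
            "id": submission.id,
            "title": submission.title,
            "selftext": submission.selftext,
            "created_utc": submission.created_utc,
        })
        # PRAW fetches listings 100 items at a time; an explicit pause after
        # each full page mirrors the 2-second sleep mentioned above.
        if len(collected) % 100 == 0:
            time.sleep(2)

Note that a single listing endpoint caps out around 1,000 items, so collecting millions of posts means iterating over many subreddits or time windows rather than one loop like this.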

You can find some more details on how it was built here https://blog.valohai.com/machine-learning-pipeline-classifyi...

I can publish the repository on GitHub if you're interested; it collects the data in two commands.


Thanks for the answer!

I was curious because it looked like the data came from the public Reddit dataset on BigQuery, but PRAW works too.



