
Probably with the official Reddit API? There are several libraries for it.



Right! I used the official Reddit API. I created an app, got the API credentials, and then used the Python library PRAW to consume the API. https://praw.readthedocs.io/en/latest/
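For anyone curious, here is a minimal sketch of that setup, assuming a "script"-type Reddit app with placeholder credentials (not the actual code from the project):

    import praw

    # Credentials come from a Reddit app created at
    # https://www.reddit.com/prefs/apps (placeholders below).
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="reddit-post-collector/0.1 by u/YOUR_USERNAME",
    )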

It took me 36 hours to collect the 4M posts. The Reddit API returns results in batches of 100, and the client then sleeps for 2 seconds before the next request.
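That timing roughly checks out: 4M posts at 100 per request is about 40,000 requests, and with a 2-second pause plus fetch time per request you land in the 30+ hour range. Below is a hedged sketch of a collection loop using PRAW's listing generators, which page through results 100 at a time under the hood; the subreddit name and the fields kept are placeholders, not the author's actual pipeline:

    import time

    # "reddit" is the authenticated praw.Reddit client from the snippet above.
    collected = []
    for submission in reddit.subreddit("MachineLearning").new(limit=None):
        collected.append({
            "id": submission.id,
            "title": submission.title,
            "selftext": submission.selftext,
            "created_utc": submission.created_utc,
        })
        # PRAW fetches listings 100 items at a time; an explicit pause after
        # each full page mirrors the 2-second sleep mentioned above.
        if len(collected) % 100 == 0:
            time.sleep(2)

Note that a single listing endpoint caps out around 1,000 items, so collecting millions of posts means iterating over many subreddits or time windows rather than one loop like this.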

You can find some more details on how it was built here https://blog.valohai.com/machine-learning-pipeline-classifyi...

I can publish the repository on GitHub if you're interested; it collects the data in two commands.


Thanks for the answer!

I was curious because it looked like the data came from the public Reddit dataset on BigQuery, but PRAW works too.



