Scraping Data from Reddit

By Max Candocia

December 30, 2016

A couple years ago, I finished a project titled "Analyzing Political Discourse on Reddit", which utilized some outdated code that was inefficient and no longer works due to Reddit's API changes.

Now I've released a newer, more flexible, and much more useful script called Tree Grab for Reddit. I call it this because it can navigate through comment trees, storing both parent IDs of comments and an array of numeric indices, among many other features.
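To make the idea of "parent IDs plus an array of numeric indices" more concrete, here is a minimal sketch that flattens a nested comment tree into rows. It is not Tree Grab's actual code or schema. The dict layout (id, replies) and the name flatten_tree are assumptions made up for illustration.

```python
def flatten_tree(comments, parent_id=None, path=()):
    """Yield (comment_id, parent_id, index_path) for every comment.

    `comments` is a list of dicts with the hypothetical keys 'id' and
    'replies'. Top-level comments get parent_id=None in this sketch.
    """
    for position, comment in enumerate(comments):
        # The index path records the position of each ancestor, then this comment.
        index_path = path + (position,)
        yield comment['id'], parent_id, list(index_path)
        # Recurse into the replies, passing this comment as their parent.
        for row in flatten_tree(comment.get('replies', []),
                                comment['id'], index_path):
            yield row


# Made-up thread: c1 has replies c2 and c3, c3 has reply c4, c5 is top-level.
thread = [
    {'id': 'c1', 'replies': [
        {'id': 'c2', 'replies': []},
        {'id': 'c3', 'replies': [{'id': 'c4', 'replies': []}]},
    ]},
    {'id': 'c5', 'replies': []},
]

rows = list(flatten_tree(thread))
```

With rows shaped like this, the parent ID lets you join a comment to what it replied to, while the index path lets you sort or filter by position in the tree without walking it again.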

How to get started

The three things you will need are:

  1. A Reddit account with developer permissions
  2. A computer with an internet connection and a PostgreSQL database version ≥ 9.1
  3. Python 2.7.x installed with the most recent psycopg2, pytz, and praw packages.

Reddit Account

If you do not have an account, register for one first. Then you can go to https://www.reddit.com/wiki/api to register information about your project and https://www.reddit.com/prefs/apps to add your application to your account so you can use OAuth2. When you are done with that, you can store your information in praw_user.py. Remember to avoid sharing your password/client secret with others if you are putting code on GitHub or another public site.

PostgreSQL

See this wiki for information on how to install PostgreSQL if you haven't already done so. After you've installed it, all you need to do is create a database and a user to access the database, then update dbinfo.py to reflect that. Note that by default, tablespace should be "pg_default" unless you have the database set up on a different tablespace (for example, on a separate hard drive).

Python

The only non-obvious thing about setting up Python for this is that psycopg2, the package that interfaces Postgres with Python, must be installed after the proper Postgres libraries are installed to your computer. See this documentation for instructions.

Use Cases

Get data for a post/many posts with specific IDs

For example, from this thread with ID 5l5afx.

python scraper.py nameofschema --ids 5l5afx --grab-authors

Data will be collected from the comment tree of the post. User data for the top 100 comments and top 100 posts (both default settings) will be gathered for each commenter in the thread (the default behavior) and for the post's author (because of the --grab-authors argument).

Gather data from a user, then scrape all of their posts

Using a website admin, kn0thing, as an example, we can go back 1,000 comments and 1,000 posts into his history and then begin scraping those as soon as the IDs are all collected:

python scraper.py kn0thing_db --users kn0thing --man-user-thread-limit 1000 --man-user-comment-limit 1000 --deepuser --log

Without the --deepuser argument, this would just collect user and comment data for one user. With the --deepuser argument, this will scrape each thread in his comment history as if it were supplied to the --ids argument. Using this does scramble the order of all IDs, however. The "man" prefix in --man-user-comment-limit and --man-user-thread-limit indicates that the limit only applies to manually supplied users. Users encountered in threads will use the default or a supplied --user-comment-limit/--user-thread-limit argument instead.

Scrape data from a few political subreddits

python scraper.py politics --subreddits politics politics Conservative Libertarian progressive --age 0.5 10 --type hot --limit 500 --mincomments 1 -n 100

This command will take the top 500 "hot" posts (a ranking that combines popular and new) from /r/politics, /r/politics (yes, twice), /r/Conservative, /r/Libertarian, and /r/progressive, and keep the ones that are between 0.5 days and 10 days old. Posts are only accepted if they have at least one comment (possibly deleted) via the --mincomments argument, and it will cycle 20 times, going through n=100 submissions/threads. If you want to scrape posts in a random order, you can use the -uo (unordered) argument to randomize it:

python scraper.py politics --subreddits politics politics Conservative Libertarian progressive --age 0.5 10 --type hot --limit 500 --mincomments 1 -n 100 -uo
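The --age and --mincomments filters described above boil down to a simple check on each post. The following is a rough sketch of that check, not the scraper's own implementation. The function name keep_post, its parameters, and the choice of inclusive age bounds are all assumptions.

```python
import time

SECONDS_PER_DAY = 86400.0


def keep_post(created_utc, num_comments, min_age_days, max_age_days,
              min_comments, now=None):
    """Return True if a post passes the age and comment-count filters.

    created_utc and now are Unix timestamps in seconds. Treating both
    age bounds as inclusive is an assumption of this sketch.
    """
    if now is None:
        now = time.time()
    age_days = (now - created_utc) / SECONDS_PER_DAY
    in_age_window = min_age_days <= age_days <= max_age_days
    return in_age_window and num_comments >= min_comments


# Fixed reference time so the example is reproducible.
now = 1483056000
two_days_ago = now - 2 * 86400
one_hour_ago = now - 3600

fresh_enough = keep_post(two_days_ago, 12, 0.5, 10, 1, now=now)
too_new = keep_post(one_hour_ago, 12, 0.5, 10, 1, now=now)
no_comments = keep_post(two_days_ago, 0, 0.5, 10, 1, now=now)
```

Here only the two-day-old post with comments passes. The one-hour-old post is younger than 0.5 days, and the last one fails the minimum comment count.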

Tracking Post History of a Subreddit

This may be difficult to do comprehensively for a larger subreddit, but it can easily be done for a smaller subreddit.

python scraper.py ml --subreddits MachineLearning --history users threads comments --thread-delay 0.1 --user-delay 0.1 --log --post-refresh-time 0.1 --drop-old-posts

This will constantly check /r/MachineLearning for the top 100 new posts (default), and whenever it encounters a comment, thread, or user that hasn't been updated in the past 0.1 days, it will update it. It will also refresh the list of posts once every 0.1 days, removing old posts from that list (via --drop-old-posts). Note that when users, threads, and/or comments are used with --history, the usernames/IDs of those tables are not primary keys/indexed, so querying in the future will most likely have worse performance. The --log argument keeps track of your command and start/end time in a table called ml.log.
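The "hasn't been updated in the past 0.1 days" rule can be pictured as a small staleness check. This sketch only illustrates the idea. The name needs_refresh and the convention that a never-scraped row has a last-updated value of None are assumptions, not part of the script.

```python
from datetime import datetime, timedelta


def needs_refresh(last_updated, now, delay_days):
    """Return True if a row was never scraped or is older than delay_days."""
    if last_updated is None:
        return True
    return now - last_updated > timedelta(days=delay_days)


# Fixed reference time. 0.1 days is 2.4 hours.
now = datetime(2016, 12, 30, 12, 0, 0)

stale = needs_refresh(now - timedelta(hours=3), now, 0.1)
recent = needs_refresh(now - timedelta(hours=1), now, 0.1)
never_seen = needs_refresh(None, now, 0.1)
```

A row last updated three hours ago is refreshed, one updated an hour ago is left alone, and a row that has never been scraped is always picked up.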

Scrape all threads encountered in comment histories but not via subreddits

If you want to get a good sample of posts from a large subreddit, it's nearly impossible to get thread IDs for posts going back more than a few days. However, those IDs are well-represented in the comment histories of users. The --rescrape-threads argument can be used to take advantage of this.

python scraper.py news --subreddits news -n 50 --type hot --age 0.5 5 --limit 1000 --user-comment-limit 200 --mincomments 5 --rescrape-with-comment-histories --avoid-full-threads

The scraper will go through 50 "hot" threads from /r/news that have at least 5 comments and are 0.5 to 5 days old. It will go back 200 comments in each user's comment history. Then, when it is finished, it will go through all of the users' comment and post histories, look for thread IDs of posts from /r/news that were not among the original 50, and scrape each one in a random order.
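Conceptually, the second pass is a set difference: thread IDs seen in user histories minus thread IDs already scraped. The sketch below shows that step under stated assumptions. The function name threads_to_rescrape and the (thread_id, subreddit) pair format are hypothetical and not taken from the script or its database schema.

```python
def threads_to_rescrape(history, scraped_ids, subreddit):
    """Return thread IDs from `subreddit` found in user histories
    that are not in `scraped_ids`.

    `history` is an iterable of (thread_id, subreddit_name) pairs,
    a made-up shape for this example.
    """
    wanted = subreddit.lower()
    found = set(thread_id for thread_id, sub in history
                if sub.lower() == wanted)
    return found - set(scraped_ids)


# Made-up history rows and already-scraped IDs.
history = [
    ('aaa111', 'news'),
    ('bbb222', 'News'),
    ('ccc333', 'politics'),
    ('aaa111', 'news'),   # the same thread can appear many times
]
already_scraped = ['aaa111']

remaining = threads_to_rescrape(history, already_scraped, 'news')
```

The result can then be turned into a list and passed through random.shuffle to get the random scraping order mentioned above.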

Scraping through more comments in a thread

The --pattern argument is useful for increasing/decreasing the number of comments that the scraper can go through. For example, --pattern 500 100 20 5 5 will navigate through the top 500 top-level comments, then for any of those, up to the top 100 replies, and for those, 20 replies, etc. The default is somewhat small, so you may want to increase it if you don't mind longer times processing individual threads.
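One way to picture what a pattern does is a depth-limited walk in which each level of the tree has its own cap. This is a simplified sketch rather than the scraper's real traversal. It assumes comments are already sorted with the top ones first, that levels deeper than the pattern's length are skipped, and it uses a made-up dict layout (id, replies).

```python
def limited_walk(comments, pattern, depth=0):
    """Yield comment IDs, visiting at most pattern[depth] comments
    in each list of siblings at the given depth."""
    if depth >= len(pattern):
        # Assumption: nothing is visited below the last pattern level.
        return
    for comment in comments[:pattern[depth]]:
        yield comment['id']
        for child_id in limited_walk(comment.get('replies', []),
                                     pattern, depth + 1):
            yield child_id


# Made-up thread with three top-level comments.
thread = [
    {'id': 'a', 'replies': [
        {'id': 'a1', 'replies': [{'id': 'a1x', 'replies': []}]},
        {'id': 'a2', 'replies': []},
    ]},
    {'id': 'b', 'replies': []},
    {'id': 'c', 'replies': []},
]

# Two top-level comments, then one reply under each of them.
visited = list(limited_walk(thread, [2, 1]))
```

With the pattern [2, 1], the walk visits a, its first reply a1, and then b. The comments a2, c, and a1x are skipped because they fall outside the caps.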

Other Features

There are other features here I haven't listed, such as reading usernames/subreddits from files instead of writing them on the command line. You can read more about them on the GitHub page for the project or use the --help argument on the command line to see the documentation.

Future of Project

I am planning on doing some interesting things with this data, but I'm still open to feature requests/possible assistance if anyone is interested. My email is my full name at gmail dot com, all lowercase no spaces.

