December 30, 2016
I've now released a newer, more flexible, and much more useful script called Tree Grab for Reddit. The name comes from its ability to navigate comment trees, storing both the parent ID of each comment and an array of numeric indices, among many other features.
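To make the parent-ID-plus-indices idea concrete, here is a minimal sketch of one way to flatten a comment tree into rows. It assumes the "array of numeric indices" is the position of a comment at each level of the tree; the Comment class and flatten function are hypothetical names for illustration, not code from the script itself.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Comment:
    """Hypothetical stand-in for a Reddit comment with nested replies."""
    id: str
    children: List["Comment"] = field(default_factory=list)


def flatten(comments: List[Comment],
            parent_id: Optional[str] = None,
            path: Tuple[int, ...] = ()) -> Iterator[Tuple[str, Optional[str], List[int]]]:
    """Yield (comment_id, parent_id, index_path) rows in depth-first order."""
    for i, comment in enumerate(comments):
        here = path + (i,)  # position of this comment at every level so far
        yield (comment.id, parent_id, list(here))
        yield from flatten(comment.children, comment.id, here)


tree = [
    Comment("a", [Comment("b"), Comment("c", [Comment("d")])]),
    Comment("e"),
]
rows = list(flatten(tree))
for row in rows:
    print(row)
```

With rows shaped like this, a whole subtree can be selected by matching on a prefix of the index path, while the parent ID gives the direct link one level up.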
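The three things you will need are a Reddit account with API (OAuth2) access, a PostgreSQL database, and a Python installation with psycopg2: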
If you do not have an account, register for one first. Then you can go to https://www.reddit.com/wiki/api to register information about your project and https://www.reddit.com/prefs/apps to add your application to your account so you can use OAuth2. When you are done with that, you can store your information in praw_user.py. Remember to avoid sharing your password/client secret with others if you are putting code on GitHub or another public site.
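One way to keep secrets out of a public repository is to read them from environment variables rather than typing them into a file that gets committed. The sketch below is only an illustration: the variable names and the load_reddit_credentials helper are assumptions, and the real layout of praw_user.py may differ.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit OAuth2 credentials from environment variables.

    The REDDIT_* names are hypothetical; pick whatever names suit your setup.
    """
    keys = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET",
            "REDDIT_USERNAME", "REDDIT_PASSWORD")
    missing = [k for k in keys if not env.get(k)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    # Strip the prefix so the keys read client_id, client_secret, username, password.
    return {k.lower().replace("reddit_", "", 1): env[k] for k in keys}


# Demonstration with a fake environment instead of real secrets.
fake_env = {
    "REDDIT_CLIENT_ID": "abc",
    "REDDIT_CLIENT_SECRET": "shh",
    "REDDIT_USERNAME": "someone",
    "REDDIT_PASSWORD": "hunter2",
}
creds = load_reddit_credentials(fake_env)
print(sorted(creds))
```

A file such as praw_user.py could then call a helper like this instead of containing the values directly, so the file itself is safe to publish.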
See this wiki for information on how to install PostgreSQL if you haven't already done so. After you've installed it, all you need to do is create a database and a user to access the database, then update dbinfo.py to reflect that. Note that by default, tablespace should be "pg_default" unless you have the database set up on a different tablespace (for example, on a separate hard drive).
The only non-obvious thing about setting up Python for this is that psycopg2, the package that interfaces Postgres with Python, must be installed after the proper Postgres libraries are installed to your computer. See this documentation for instructions.
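As a rough picture of how the database settings fit together, here is a sketch of a settings dictionary and two helpers built on it. The key names, the DBINFO variable, and both functions are assumptions for illustration; the actual contents of dbinfo.py are not shown in this post.

```python
# Hypothetical settings in the spirit of dbinfo.py; every value is a placeholder.
DBINFO = {
    "dbname": "reddit",
    "user": "scraper",
    "password": "change-me",
    "host": "localhost",
    "port": 5432,
    "tablespace": "pg_default",  # the default unless data lives on another tablespace
}


def build_dsn(info):
    """Build a libpq keyword/value connection string.

    psycopg2.connect() accepts a string in this form. Values containing
    spaces or quotes would need escaping, which this sketch does not do.
    """
    return "dbname={dbname} user={user} password={password} host={host} port={port}".format(**info)


def create_database_sql(info):
    """Return the CREATE DATABASE statement matching the settings.

    Assumes simple identifiers that need no quoting.
    """
    return "CREATE DATABASE {dbname} OWNER {user} TABLESPACE {tablespace};".format(**info)


print(build_dsn(DBINFO))
print(create_database_sql(DBINFO))
```

The tablespace is kept out of the connection string on purpose: it matters when the database or its tables are created, not when connecting.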
For example, to scrape this thread with ID 5l5afx:
python scraper.py nameofschema --ids 5l5afx --grab-authors
Data will be collected from the comment tree of the post, and user data for the top 100 comments and top 100 posts (both default settings) will be gathered from each commenter in the thread (default) and the author (because of the --grab-authors argument).
Using a website admin, kn0thing, as an example, we can go back 1,000 comments and 1,000 posts into his history and then begin scraping those as soon as the IDs are all collected:
python scraper.py kn0thing_db --users kn0thing --man-user-thread-limit 1000 --man-user-comment-limit 1000 --deepuser --log
Without the --deepuser argument, this would just collect user and comment data for one user. With the --deepuser argument, it scrapes each thread in his comment history as if it had been supplied to the --ids argument. Using this does scramble the order of all IDs, however. The man- prefix on user-thread-limit indicates that the limit applies only to manually supplied user IDs. User IDs encountered in threads will use the default or a supplied --user-thread-limit argument instead.
python scraper.py politics --subreddits politics politics Conservative Libertarian progressive --age 0.5 10 --type hot --limit 500 --mincomments 1 -n 100
This command will take the top 500 "hot" posts (a combination of popular and new) from /r/politics, /r/politics (yes, twice), /r/Conservative, /r/Libertarian, and /r/progressive, and keep the ones that are between 0.5 and 10 days old. Posts are only accepted if they have at least one comment (possibly deleted) via the --mincomments argument, and it will cycle 20 times, going through n=100 submissions/threads. If you want to scrape posts in a random order, you can use the -uo (unordered) argument to randomize it:
python scraper.py politics --subreddits politics politics Conservative Libertarian progressive --age 0.5 10 --type hot --limit 500 --mincomments 1 -n 100 -uo
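The age and comment-count rules in the commands above boil down to a simple filter. Here is a minimal sketch of that idea, assuming each post is a dictionary with the created_utc and num_comments fields that Reddit's API reports; the eligible function and its parameters are hypothetical and not taken from the scraper.

```python
import random

SECONDS_PER_DAY = 86400


def eligible(posts, now, min_age_days, max_age_days, min_comments, shuffle_seed=None):
    """Keep posts inside the age window that have enough comments.

    If shuffle_seed is given, the result is shuffled, loosely mimicking
    the idea behind an unordered scrape.
    """
    kept = []
    for post in posts:
        age_days = (now - post["created_utc"]) / SECONDS_PER_DAY
        if min_age_days <= age_days <= max_age_days and post["num_comments"] >= min_comments:
            kept.append(post)
    if shuffle_seed is not None:
        random.Random(shuffle_seed).shuffle(kept)
    return kept


now = 2_000_000  # pretend "current time" in epoch seconds
posts = [
    {"id": "too_new", "created_utc": now - 6 * 3600, "num_comments": 9},
    {"id": "no_comments", "created_utc": now - 1 * SECONDS_PER_DAY, "num_comments": 0},
    {"id": "ok", "created_utc": now - 2 * SECONDS_PER_DAY, "num_comments": 3},
    {"id": "too_old", "created_utc": now - 11 * SECONDS_PER_DAY, "num_comments": 50},
]
kept = eligible(posts, now, min_age_days=0.5, max_age_days=10, min_comments=1)
print([p["id"] for p in kept])
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the filter easy to test.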
This may be difficult to do comprehensively for a larger subreddit, but it can easily be done for a smaller subreddit.
python scraper.py ml --subreddits MachineLearning --history users threads comments --thread-delay 0.1 --user-delay 0.1 --log --post-refresh-time 0.1 --drop-old-posts
This will constantly check /r/MachineLearning for the top 100 new posts (default), and whenever it encounters a comment, thread, or user that hasn't been updated in the past 0.1 days, it will update it. It will also refresh the list of posts once every 0.1 days, removing old posts from that list (via --drop-old-posts). Note that when users, threads, and/or comments are used with --history, the usernames/IDs of those tables are not primary keys/indexed, so querying in the future will most likely have worse performance. The --log argument keeps track of your command and start/end time in a table called ml.log.
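The refresh rule described here ("update anything not updated in the past 0.1 days") can be sketched as a small staleness check. This is an illustration under my own assumptions, with timestamps in epoch seconds and a hypothetical needs_update function; it is not the scraper's actual code.

```python
SECONDS_PER_DAY = 86400


def needs_update(last_updated, now, delay_days):
    """Return True when an item was never scraped or is older than the delay.

    last_updated and now are epoch seconds; delay_days plays the role of a
    value like the 0.1 passed to the delay arguments.
    """
    if last_updated is None:
        return True  # never seen before, so always scrape it
    return (now - last_updated) >= delay_days * SECONDS_PER_DAY


now = 1_000_000
print(needs_update(None, now, 0.1))        # never scraped
print(needs_update(now - 100, now, 0.1))   # scraped 100 seconds ago
print(needs_update(now - 9000, now, 0.1))  # scraped 2.5 hours ago
```

A delay of 0.1 days is 8,640 seconds, or 2.4 hours, so anything older than that gets scraped again the next time it is encountered.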
If you want to get a good sample of posts from a large subreddit, it's nearly impossible to get thread IDs for posts going back more than a few days. However, those IDs are well-represented in the comment histories of users. The --rescrape-threads argument can be used to take advantage of this.
python scraper.py news --subreddits news -n 50 --type hot --age 0.5 5 --limit 1000 --user-comment-limit 200 --mincomments 5 --rescrape-with-comment-histories --avoid-full-threads
The scraper will go through 50 "hot" threads from /r/news with at least 5 comments, with each post being 0.5 to 5 days old. It will go back 200 comments in users' comment histories. Then, when it is finished, it will go through all of the users' comment and post histories, look for thread IDs of posts from /r/news that were not in the original 50, and then scrape each one in a random order.
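The second pass described above is essentially a set difference followed by a shuffle. Here is a minimal sketch, assuming each history entry is a dictionary with thread_id and subreddit keys; those key names and the threads_to_rescrape function are hypothetical.

```python
import random


def threads_to_rescrape(history, subreddit, already_scraped, seed=None):
    """Collect unseen thread IDs from user histories, in random order.

    history is an iterable of dicts with hypothetical "thread_id" and
    "subreddit" keys gathered from users' comment and post histories.
    """
    wanted = subreddit.lower()
    found = {h["thread_id"] for h in history if h["subreddit"].lower() == wanted}
    pending = sorted(found - set(already_scraped))  # sort first so a seed is reproducible
    random.Random(seed).shuffle(pending)
    return pending


history = [
    {"thread_id": "t1", "subreddit": "news"},
    {"thread_id": "t2", "subreddit": "News"},
    {"thread_id": "t2", "subreddit": "news"},   # duplicates collapse in the set
    {"thread_id": "t3", "subreddit": "pics"},   # wrong subreddit, ignored
    {"thread_id": "t4", "subreddit": "news"},
]
pending = threads_to_rescrape(history, "news", already_scraped={"t1"}, seed=0)
print(pending)
```

Because users comment on threads of very different ages, the IDs recovered this way reach much further back than the subreddit listings do.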
The --pattern argument is useful for increasing/decreasing the number of comments that the scraper can go through. For example, --pattern 500 100 20 5 5 will navigate through the top 500 top-level comments, then for any of those, up to the top 100 replies, and for those, 20 replies, etc. The default is somewhat small, so you may want to increase it if you don't mind longer times processing individual threads.
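A per-depth limit like this can be sketched as a recursive walk that slices each level of replies. The code below is an illustration only: it assumes comments arrive already sorted best-first, and that the walk stops descending once the pattern runs out of entries, which may not match what the scraper really does. The limited_walk name and the dictionary layout are hypothetical.

```python
def limited_walk(comments, pattern, depth=0):
    """Yield comment IDs, visiting at most pattern[depth] comments per reply list.

    comments is a list of dicts with an "id" and an optional "replies" list.
    """
    if depth >= len(pattern):
        return  # assumption: nothing deeper than the pattern is visited
    for comment in comments[:pattern[depth]]:
        yield comment["id"]
        yield from limited_walk(comment.get("replies", []), pattern, depth + 1)


tree = [
    {"id": "a", "replies": [
        {"id": "a1", "replies": [{"id": "a1x"}]},
        {"id": "a2"},
    ]},
    {"id": "b"},
    {"id": "c"},
]
visited = list(limited_walk(tree, pattern=[2, 1]))
print(visited)
```

With a pattern of 2 then 1, only the first two top-level comments and the first reply to each are visited; widening the numbers grows the work multiplicatively, which is why larger patterns mean longer processing times per thread.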
There are other features here I haven't listed, such as reading usernames/subreddits from files instead of writing them on the command line. You can read more about them on the GitHub page for the project or use the --help argument on the command line to see the documentation.
I am planning on doing some interesting things with this data, but I'm still open to feature requests/possible assistance if anyone is interested. My email is my full name at gmail dot com, all lowercase no spaces.