January 05, 2017
A couple of weeks ago, I started collecting data from reddit.com with Tree Grab for Reddit. Since then, I have gathered 3.5 million comments across 69 different political Subreddits. With this data, I have looked at different ways that the Subreddits can be grouped together. Specifically, grouping subreddits by...
The above graph is called a "dendrogram". The closeness (Jaccard Index) of two Subreddits is determined by taking the number of users who commented in both Subreddits, and then dividing that by the number of users who commented in at least one of those Subreddits.
For example, imagine you have 2 parks (Park A and Park B) and 10 people, each labeled with a number 1 through 10. Imagine that people labeled 1 through 7 visited Park A and people labeled 6 through 9 visited Park B. A total of 2 people (person 6 and person 7) visited both parks, and a total of 9 people (persons 1 through 9) visited either park. Therefore, the similarity of the two parks would be 2/9, or 0.22.
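The park example can be checked with a few lines of Python. This is a minimal sketch, not the code used for the graphs; the function name `jaccard_index` and the choice to return 0 for two empty groups are assumptions made for illustration.

```python
def jaccard_index(users_a, users_b):
    """Share of users seen in both groups among users seen in either group."""
    users_a, users_b = set(users_a), set(users_b)
    union = users_a | users_b
    if not union:
        return 0.0  # assumed convention when both groups are empty
    return len(users_a & users_b) / len(union)


# People 1 through 7 visited Park A, people 6 through 9 visited Park B.
park_a = set(range(1, 8))
park_b = set(range(6, 10))

similarity = jaccard_index(park_a, park_b)
print(round(similarity, 2))  # 0.22
```

Swapping the park visitors for sets of commenter names gives the closeness of two Subreddits.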
Each tip of the graph is a Subreddit, and it joins a branch with the Subreddits that are closest to it in terms of shared commenters. The shorter the line connecting the branches, the more people comment in both Subreddits, and the closer the Subreddits are. When comparing groups of Subreddits that have already been clustered together, an average closeness between members of different groups is used to determine how close they are. See Ward's method for the exact algorithm I used.
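To make the merging step concrete, here is a rough sketch of bottom-up clustering that uses the average distance between group members, as described in the paragraph above. It is not Ward's method, which the post names as the exact algorithm, and it is not the code behind the graph. The Subreddit names and user IDs are made up, and distance is assumed to be one minus the Jaccard Index.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the Jaccard Index, so 0 means identical commenter sets."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)


def average_linkage(commenters):
    """Repeatedly merge the two closest clusters and record each merge."""
    dist = {
        frozenset((x, y)): jaccard_distance(commenters[x], commenters[y])
        for x, y in combinations(commenters, 2)
    }

    def gap(pair):
        # Average distance over every cross-cluster pair of Subreddits.
        left, right = pair
        total = sum(dist[frozenset((x, y))] for x in left for y in right)
        return total / (len(left) * len(right))

    clusters = [(name,) for name in commenters]
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=gap)
        merges.append((left, right, gap((left, right))))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


# Hypothetical commenter sets, keyed by Subreddit name.
commenters = {
    "sub_a": {1, 2, 3, 4},
    "sub_b": {3, 4, 5},
    "sub_c": {8, 9},
    "sub_d": {9, 10},
}

merges = average_linkage(commenters)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Each recorded height corresponds to where two branches join in a dendrogram: the lower the join, the more commenters the two sides share.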
There is quite a bit of information that can be dissected. Much of it is unsurprising, such as /r/Libertarian being close to /r/GaryJohnson and /r/RandPaul, two Subreddits for the Libertarian candidate and a libertarian-leaning candidate in last year's presidential election.
One interesting correlation I noticed was that /r/SandersForPresident, a strongly left-leaning Subreddit, was relatively close to /r/HillaryForPrison and /r/The_Donald, which are largely right-leaning Subreddits. Part of this can be explained by the similarity in sizes of the Subreddits, since a Subreddit with a large userbase will never share a significant portion of its userbase with a much smaller Subreddit. Regardless, this suggests that the left-right scale is not as important for this relationship as one might expect.
Unlike the previous graph, this graph divides each Subreddit into two halves: one half where users write positive-scoring comments, and another half where they write negative-scoring comments. If a community is relatively normal, then both halves will be grouped together. However, if the halves are separated, that indicates a very strong divide among the commenters in a Subreddit, where the majority overwhelmingly downvotes the minority.
The one explicit case of this happening is with /r/communism and /r/DebateCommunism. If a user makes a positive-scoring comment in one of them, they are more likely to make a positive-scoring comment in the other Subreddit than make a negative-scoring comment in the original Subreddit. Similarly, if they make a negative-scoring comment in one of the subreddits, they're more likely to make a negative-scoring comment in the other one than a positive-scoring comment in the original one.
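The split into halves can be sketched as follows. This is an illustration under assumptions, not the original pipeline: the `(author, subreddit, score)` tuple layout, the names, the scores, and the decision to skip comments with a score of exactly zero are all hypothetical.

```python
from collections import defaultdict


def jaccard_index(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_by_score(comments):
    """Map (subreddit, 'pos' or 'neg') to the set of authors in that half."""
    halves = defaultdict(set)
    for author, subreddit, score in comments:
        if score > 0:
            halves[(subreddit, "pos")].add(author)
        elif score < 0:
            halves[(subreddit, "neg")].add(author)
        # Assumption: zero-score comments belong to neither half.
    return dict(halves)


# Made-up comments: (author, subreddit, score).
comments = [
    ("ann", "sub_x", 5), ("bob", "sub_x", 3), ("cat", "sub_x", -4),
    ("ann", "sub_y", 2), ("bob", "sub_y", 7), ("cat", "sub_y", -2),
    ("dan", "sub_y", -1),
]

halves = split_by_score(comments)
across = jaccard_index(halves[("sub_x", "pos")], halves[("sub_y", "pos")])
within = jaccard_index(halves[("sub_x", "pos")], halves[("sub_x", "neg")])
print(across > within)  # True
```

In this toy data the positive half of `sub_x` is closer to the positive half of `sub_y` than to its own negative half, which is the pattern described above.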
A similar trend can be seen between /r/Anarchy101 and a group of other anarchy-related Subreddits. This would suggest that anarchy and communism-related Subreddits have a strong tendency to downvote opinions contrary to their primary beliefs.
The above is a network graph showing how different Subreddits are connected by having the same moderators. Moderators can have a strong influence over a community, so Subreddits that share moderators are likely to be somewhat similar.
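The edges of such a network can be derived by intersecting moderator lists. The sketch below assumes a plain dictionary from Subreddit name to a set of moderator names. All names are hypothetical, and how the original graph was built or weighted is not stated in the post.

```python
from itertools import combinations


def shared_moderator_edges(moderators):
    """Return {(sub_a, sub_b): shared moderators} for pairs that share any."""
    edges = {}
    for sub_a, sub_b in combinations(sorted(moderators), 2):
        shared = moderators[sub_a] & moderators[sub_b]
        if shared:
            edges[(sub_a, sub_b)] = shared
    return edges


# Made-up moderator lists.
moderators = {
    "sub_one": {"mod_a", "mod_b", "mod_c"},
    "sub_two": {"mod_b", "mod_c"},
    "sub_three": {"mod_z"},
}

edges = shared_moderator_edges(moderators)
for (sub_a, sub_b), shared in edges.items():
    print(sub_a, sub_b, len(shared))
```

A Subreddit with no shared moderators, like `sub_three` here, ends up with no edges, however many moderators it has of its own.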
There are a few parts of the graph worth noting. One is the socialism/anarchy/communism network at the very top, which suggests that those Subreddits are run similarly. Another is how /r/The_Donald shares many moderators with /r/HillaryMeltdown and /r/HillaryForPrison. One other tidbit is that /r/politics and /r/news, two very large Subreddits with many moderators, do not appear to have a huge influence over other Subreddits, moderator-wise.
This graph is similar to the first one, but it connects Subreddits through users' submissions instead of their comments. The branches are longer, which indicates that Subreddits share fewer submitters than commenters and so sit further apart from each other. Below is an animated GIF of a vertical dendrogram that allows you to compare the branch lengths. Don't worry if you can't read the graph below. It is just an unraveled version of the dendrogram above it and the first dendrogram on the page.
Different political groups are also more separated, such as libertarian-leaning and conservative-leaning ones, which are relatively close in the comment-based graph.
When comparing words and phrases used across Subreddits (using TF-IDF and cosine similarity), some of the clustering seen before remains, but the Subreddits become more strongly clustered by topic, rather than just the political leanings of their userbases. An obvious example of this is in the lower left-hand side of the graph, where /r/progun, a pro-firearms Subreddit, is closest to /r/GunsAreCool, a sarcastically named Subreddit with a noticeable left leaning that mocks irresponsible firearm use.
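For readers unfamiliar with TF-IDF and cosine similarity, here is a small self-contained sketch. It assumes one common weighting (term frequency times `log(N / document frequency)`); the post does not say which variant or library was used. The Subreddit names and word lists are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a name to a list of tokens; returns name -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented word lists standing in for each Subreddit's comments.
docs = {
    "sub_guns_1": "rifle permit rifle safety".split(),
    "sub_guns_2": "rifle safety accident".split(),
    "sub_tax": "tax budget tax permit".split(),
}

vectors = tfidf_vectors(docs)
sim_guns = cosine_similarity(vectors["sub_guns_1"], vectors["sub_guns_2"])
sim_tax = cosine_similarity(vectors["sub_guns_1"], vectors["sub_tax"])
print(sim_guns > sim_tax)  # True
```

The two firearm-themed word lists score as more similar to each other than to the tax-themed one, regardless of any stance expressed, which is why a vocabulary-based comparison tends to cluster by topic.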
A couple of fun pairs stand out here. /r/jillstein and /r/GaryJohnson, Subreddits focused on third-party presidential candidates, are right next to each other just below the center on the right side of the graph, as are /r/Republican and /r/democrats just below those two.
In the past, I analyzed political discourse on Reddit using a customized generative model, looking at how users of different ideologies interacted with each other across various Subreddits. I may return to this in the future, but as of right now, I am gathering more data and thinking of different ideas, such as using sentiment analysis.