October 12, 2017
Recently, at the suggestion of a commenter on Reddit, I downloaded some data from Google BigQuery, which contains comprehensive Reddit post data from the past year and a half. From about 717,000 randomly sampled submissions, the main insights I came across can be seen in the graph below:
As before, I would recommend posting early on the weekends in US Central Time in order to maximize your post score. When I only looked at default subreddits, this is what I found:
The weekend morning effect is even more pronounced, likely due to a more American audience around those times and the hours following them.
In addition to looking at the percent-change in score, I also looked at "over 50" and "over 500" models, which aim to predict the chances that a post will score at least 50 or at least 500 points, respectively. These are similar to the "percent-change" graphs above, but with more pronounced features. The default subreddit versions are a bit too choppy for me to display here due to low sample size.
Notice how the effects become stronger the higher the threshold is. The scale is about 3 times stronger for the "over 500" model than the "over 50" model. Effects are more tightly concentrated for the "over 500" model, suggesting a narrower time window when trying to get a very high-scoring post.
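The "over 50" and "over 500" targets described above boil down to binary labels derived from post score. A minimal sketch of that labeling step, using a toy data frame (the column names are my assumptions, not the actual dataset schema):

```python
import pandas as pd

# Toy frame standing in for the sampled submissions (columns are illustrative).
posts = pd.DataFrame({
    "score": [3, 72, 510, 48, 1200, 15],
})

# Binary targets for the two logistic models: did the post reach the threshold?
posts["over_50"] = (posts["score"] >= 50).astype(int)
posts["over_500"] = (posts["score"] >= 500).astype(int)

print(posts)
```

Each model is then a separate classifier fit against the same features, just with a different label column.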
Looking at the effect of domains in submissions, it's obvious that image and short video-based submissions vastly outperform others, all else equal. News sites and some presumably spam sites seem to perform the worst.
The subreddits associated with the largest increases in score seem to be /r/BlackPeopleTwitter and /r/The_Donald. This effect multiplies with the effects of time and submission domain, so a GIF posted to /r/BlackPeopleTwitter at 8 am US Central Time on a Sunday would more likely than not score well.
When estimating each of these values, I used elastic net regularization in tandem with logistic and linear regression. In layman's terms, each subreddit, domain, and time of day has a specific score/weight that it adds to a prediction (that's the regression), and the regularization shrinks those weights down a bit to avoid overfitting, which is the statistics equivalent of playing connect-the-dots.
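For a concrete sense of the setup, here is a minimal sketch of an elastic-net logistic regression in scikit-learn. The feature matrix and names are stand-ins (synthetic one-hot columns, not the post's actual subreddit/domain/hour features), and the hyperparameters are illustrative, not the values used in the original analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical one-hot design matrix standing in for subreddit, domain,
# and hour-of-week indicator columns.
n = 2000
X = rng.integers(0, 2, size=(n, 10)).astype(float)

# Synthetic "scored over 50" target driven by two of the features plus noise.
logits = 1.5 * X[:, 0] + 0.8 * X[:, 1] - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Elastic net mixes L1 and L2 penalties; l1_ratio=0.5 weights them equally.
# The penalty shrinks each feature's weight toward zero, reducing overfitting.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000)
model.fit(X, y)

# Each coefficient is that feature's additive contribution to the log-odds.
for i, coef in enumerate(model.coef_[0]):
    print(f"feature_{i}: {coef:+.3f}")
```

The "percent-change" figures would come from the analogous linear-regression version on a transformed score rather than a binary label.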
To (quite) roughly verify the consistency of the models over time, I compared the general trends of the March, June, and combined March & June samples across the "percent-change", "over 50", and "over 500" models. I did this for models including all data and for models only including default subreddits. They were similar enough that I am not too worried about making generalizations, although the default subreddit estimates were a bit choppier.
Although the percent-change predictions were pretty safe, and probably the "over 50" models as well, there is still some overfitting when the number of "positive" examples is too low in comparison with the number of variables. For example, when I was looking at the subreddits that were best for getting a high post score, /r/CasualConversation showed up an order of magnitude further ahead than any other subreddit. Looking at the specific data, there were only 5 posts out of 325 that had a score of at least 500. This is about 50% higher than the average of ~1.1%, but nowhere near the 87,000% shown below. This is the result of the self posts scoring low, which is captured by domain, and a few of the posts scoring particularly high, including one image post by a moderator. This is an example of multicollinearity, which is when two or more input variables (in this case the domain of "self.CasualConversation" and a post being in /r/CasualConversation) are highly correlated, resulting in erroneous estimates for their respective effects on the dependent (target/response) variable.
In the above case, the one image post by the moderator was enough to throw the estimate into disarray. Usually regularization would help cancel out those effects for larger sample sizes.
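The /r/CasualConversation failure mode can be reproduced with a tiny synthetic example: two indicator columns that agree on all but a handful of rows, like "posted in /r/CasualConversation" vs. "domain is self.CasualConversation". The data here is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two nearly identical indicator columns; they differ on only 3 of 300 rows,
# like the one moderator image post in an otherwise all-self-post subreddit.
n = 300
in_sub = rng.integers(0, 2, size=n).astype(float)
self_domain = in_sub.copy()
self_domain[:3] = 1 - self_domain[:3]

X = np.column_stack([in_sub, self_domain])
y = 2.0 * in_sub + rng.normal(0, 0.5, size=n)  # only the subreddit truly matters

# Ordinary least squares: the combined effect (weight sum) is well determined,
# but splitting it between the two columns hinges on just those 3 rows.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS weights:", beta_ols)

# Ridge (the L2 half of elastic net) shrinks the poorly-determined
# difference between the two weights toward zero.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print("Ridge weights:", beta_ridge)
```

In both fits the sum of the two weights stays near the true combined effect; it is the split between them that is fragile, which is exactly what regularization stabilizes as the sample grows.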
Each subreddit behaves differently, so the effect of posting time is not 100% consistent between them. The same goes for the domain- and subreddit-based figures. The above graphs are meant as general guidelines only. Take a look at these examples using /r/The_Donald and /r/DataIsBeautiful as the only sampling source:
While both subreddits tend to have higher-scoring posts in the mornings and on weekends (like most other subreddits), the estimates are a bit flimsier for /r/DataIsBeautiful, which has about 9 times fewer posts, as well as a different audience.
I randomly sampled about 3% of all posts from March and June of 2017. For subreddit-specific insights (/r/The_Donald and /r/DataIsBeautiful), I used a complete sample over specific months: the /r/The_Donald sample consisted only of June 2017, while the /r/DataIsBeautiful sample consisted of January-June of 2017, as well as October 2016.
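The ~3% simple random sample can be sketched in one line once the BigQuery extract is in a data frame. The frame below is a stand-in with made-up columns, not the real extract:

```python
import numpy as np
import pandas as pd

# Stand-in for the full BigQuery extract (the real data has many more columns).
full = pd.DataFrame({"id": range(1000), "score": np.arange(1000) % 97})

# ~3% simple random sample without replacement, analogous to the
# sampling described above; random_state fixes the draw for reproducibility.
sample = full.sample(frac=0.03, random_state=42)
print(len(sample))
```

In practice the same thing can be done upstream in BigQuery itself by filtering on a random value, which avoids downloading the full table first.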
The code for the model is hosted on this GitHub repository: https://github.com/mcandocia/reddit_posting