July 29, 2017
Last Updated 10/24/2017
UPDATE: A more recent and thorough analysis can be found here.
When posting anything on social media, whether a news article, a picture of yourself, or a funny image (or a combination thereof), you usually want to reach the largest possible audience. On Reddit, I have noticed that the success of a post is largely determined by the time of day and day of week at which it is submitted. A few other factors matter as well, such as whether the post is an image, an article, or a text-only submission.
I used the Python scraper I built to collect data on submissions in the particular subreddits I wish to analyze. Among the data collected, I have...
Using this information, I can formulate a model that describes what attributes affect the score. Specifically, I am looking for a percent change in the score with respect to values such as time of day, day of week, whether a post is an image post, etc. In my case, this can be approximated with this formula:
sign(score) * log(abs(score) + 1) = time_of_day_and_day_of_week + is_image_post + is_text_post + length_of_submission_title
I log-transform the score on the left side. Doing so ensures that the terms on the right side have a multiplicative effect on the score, as opposed to additive. The right side treats the time of day + day of week, the post being an image post, and its other attributes as independent factors that each scale the score by some value; i.e., I am controlling for other effects.
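To make the multiplicative interpretation concrete, here is a short derivation sketch. It assumes a positive score and uses symbols introduced only for illustration: s is the score, β_0 the intercept, β_t the coefficient of a given day-and-time slot t, and the remaining terms stand for the other attributes. The reference slot (Monday from 8 to 10 am) has a coefficient of zero by construction.

```latex
\begin{align*}
\log(s + 1) &= \beta_0 + \beta_t + \beta_{\text{image}}\, x_{\text{image}} + \dots \\
s + 1 &= e^{\beta_0} \cdot e^{\beta_t} \cdot e^{\beta_{\text{image}}\, x_{\text{image}}} \cdots \\
\frac{s_t + 1}{s_{\text{ref}} + 1} &= e^{\beta_t} \qquad (\text{all other terms equal},\ \beta_{\text{ref}} = 0) \\
\text{percent change} &= e^{\beta_t} - 1
\end{align*}
```

Under this reading, a coefficient of about 0.554 corresponds to a change of roughly 74%, since exp(0.554) is about 1.74.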
Below is a graph that estimates the effect of the time of day and day of week on the score across six different subreddits I sampled collectively. I use Monday from 8 to 10 am as the reference, so each percentage is the percent increase in score you can expect if you post at the given time rather than on Monday from 8 to 10 am US Central Time.
Monday morning is a relatively good time to post in these subreddits, especially from 6-8 am. Sunday is even better during that time frame, with an expected score that is 74% higher than our reference, Monday from 8 to 10 am. Saturday, however, seems fairly strong most of the day.
Because the above image is based on a relatively small amount of data, it helps to compare it to a different data set. Below I sampled default subreddits, as well as thread commenters' comment histories, so that the model generalizes better to Reddit as a whole.
This tells a similar story, except the tiles change a lot more smoothly. You could repeat the process, but the general takeaway is that the best time to post on Reddit is on Sunday, Monday, or Saturday from 6 to 8 am US Central Time. The next best times would be within 2 hours of that time range on those same days, or during that same time range on other days.
Technically, the transformation I made to the score adds 1 to the score before calculating the percent change, and negative scores are calculated as having points equal to 1/(1+abs(score)), which is a fractional score that keeps shrinking toward zero as the score becomes more negative.
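As a minimal sketch of that transformation, the R snippet below applies the signed log to a few made-up scores and then exponentiates the result to show the "points" each score is effectively treated as. The function names signed_log and implied_points and the sample scores are hypothetical and chosen only for illustration.

```r
# Signed log transform of the score (same expression as the model's left side)
signed_log <- function(score) {
  sign(score) * log(1 + abs(score))
}

# Exponentiating the transformed score gives the implied "points":
# 1 + score for non-negative scores, 1 / (1 + abs(score)) for negative ones
implied_points <- function(score) {
  exp(signed_log(score))
}

# Made-up example scores (not taken from the data set)
scores <- c(-9, -1, 0, 1, 9)
result <- data.frame(score = scores,
                     logscore = signed_log(scores),
                     points = implied_points(scores))
print(result)
```

A score of -9 maps to 0.1 points and a score of 9 maps to 10 points, so the scale stays positive and is symmetric around a score of 0 on the log scale.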
Below is the R code I used to generate the images. You can download the data file here: contrasts_threadmode.csv.
library(plyr)
library(dplyr)
library(htmlTable)
library(ggplot2)
library(scales)
setwd('/mydirectory/reddit_posting')
#makes filenames possible/better
subslash <- function(x){
  x = (gsub(' ','-',x))
  return(gsub('/','-',x))
}
create_threads_plot <- function(threads, tname='none', subtitle_size=18){
  #group times into 2-hour bins to increase significance of data
  threads$hour_ = cut(threads$hour, seq(0,24,2), include.lowest=TRUE, right=FALSE)
  source_hour_ = levels(threads$hour_)
  target_hour_ = c('12-2 am','2-4 am', '4-6 am', '6-8 am','8-10 am', '10 am - 12 pm','12-2 pm', '2-4 pm', '4-6 pm','6-8 pm',
                   '8-10 pm','10 pm - 12 am')
  threads$hour_ = mapvalues(threads$hour_, from = source_hour_, to=target_hour_)
  #title length in units of 100 characters
  threads$titlelen = nchar(as.character(threads$title))/100
  #signed log transform of the score
  threads$logscore = sign(threads$score) * log(1 + abs(threads$score))
  threads$is_self = with(threads, ifelse(is_self=='t','Self Post','Link Post'))
  daysofweek = c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')
  #dow is expected to be coded 0 (Sunday) through 6 (Saturday)
  threads$dow = factor(daysofweek[threads$dow+1], levels=daysofweek)
  weekday_hour_grid = expand.grid(target_hour_, daysofweek)
  #make sure order is right
  weekday_hour_levels = paste(weekday_hour_grid[,2], weekday_hour_grid[,1])
  #for a better reference, ref=Monday 8-10 am (element 17 of the grid)
  weekday_hour_levels_ = c(weekday_hour_levels[17], weekday_hour_levels[-17])
  threads$weekday_hour = factor(paste(threads$dow,threads$hour_), levels =weekday_hour_levels_ )
  #domain vars
  threads$image_submission = factor( c('Image Submission',
    'Non-Image Submission')[2 - threads$domain %in% c('imgur.com','i.imgur.com','i.reddit.com')])
  threads$image_submission = relevel(threads$image_submission,ref='Non-Image Submission')
  #remove moderator posts, which will most likely be very high
  threads = threads %>% filter(is_distinguished=='f', is_stickied=='f')
  n_data_points = nrow(threads)
  #run linear model and extract coefficients
  model = lm(logscore ~ weekday_hour + titlelen + is_self + image_submission + subreddit, data=threads)
  model_summary = summary(model)
  coefs = model_summary$coefficients
  #round sig figs
  for (i in 1:4) {
    coefs[,i] = signif(coefs[,i], 4)
  }
  #used to produce HTML output of the model summary for display on web
  sink(subslash(paste0('reddit_thread_summary_table_', tname , '.html')))
  print(htmlTable(coefs))
  sink()
  #now format matrix to show results
  #build the data frame directly so Estimate stays numeric (no factor/character round trip)
  coefmat = data.frame(varname = rownames(coefs), Estimate = coefs[,'Estimate'],
                       row.names=NULL, stringsAsFactors=FALSE)
  coefmat = coefmat %>% filter(grepl('weekday_hour.*',varname))
  #the reference level has no coefficient, so add it with an estimate of 0
  coefmat = rbind(data.frame(varname='weekday_hourMonday 8-10 am',Estimate=0, stringsAsFactors=FALSE), coefmat)
  coefmat$dow = factor(gsub( '.*hour','', gsub(' .*','',coefmat$varname) ), levels=daysofweek)
  coefmat$hour = factor(gsub('^[^0-9-]*? ','', coefmat$varname), levels=rev(target_hour_) )
  #convert log-scale coefficients to percent change relative to the reference
  coefmat$`Percent Change`= (exp(as.numeric(coefmat$Estimate)) - 1)
  #save plot to png
  png(subslash(paste0('expected_reddit_score_',tname, '.png')), height=720, width=920)
  print(
    ggplot(coefmat, aes(x=hour, y=dow, fill=`Percent Change`)) +
      geom_tile() + xlab('') + ylab('') + #axes are self-explanatory with title
      ggtitle('Percent Change in Expected Reddit Submission Score Based on Time Posted',
              subtitle=paste('compared to Monday from 8 - 10 am & using',comma(n_data_points), tname,'submissions')) +
      theme_bw() + theme(plot.title = element_text(hjust=0.5, size=24), plot.subtitle=element_text(hjust=0.5, size=subtitle_size),
                         axis.text.x=element_text(size=18,angle=0, vjust=0.8), axis.text.y = element_text(size=18)) +
      scale_fill_gradient2(labels=scales::percent) + geom_text(aes(label=scales::percent(`Percent Change`)),size=6) +
      coord_flip()
  )
  dev.off()
}
#load file and create a plot + html table for each
threads = read.csv('/mydirectory/contrasts_threadmode.csv')
create_threads_plot(threads, 'nintendo/boardgames/rap/classicalmusic/democrats/conservative', subtitle_size=12)