October 18, 2017

Recently I used linear and logistic models with elastic net regularization to determine the best times to post on Reddit. With about 1,000 parameters and 700,000 rows, I wasn't too worried about the intepretation of the weekday and hour parameters indicating the estimated effect of posting time. I did have some issues with the Subreddit and domain variables, which were greater in number and not evenly distributed.

One of the models I created predicted the probability of a submission scoring at least 500 points. The effects by time didn't look to strange, so I decided to look at what sort of effect the source Subreddit had. I was a bit surprised by the results:

It turned out that out of the roughly 350 data points for /r/CasualConversation, all but one of them had the same domain (another variable in the model).The domain was unique to that Subreddit, "self.CasualConversation", and only 5 posts from the Subreddit had a score over 500, including the one with a different domain. What happened in the model is that it was very easy for the model to vary the /r/CasualConversation subreddit variable and the "self.CasualConversation" domain variable to be roughly opposite of each other, despite the L1 and L2 penalties. Doing so allowed the one non-"self.CausalConversation" row to get a much better fit, since there was a lot of room to reduce the error penalty for that point.

To illustrate the effect I observed, I made a small simulation using 500 data points with 2 group variables ("group" and "class"), as well as an "output" variable that is a count, based on a Poisson distribution that uses the two groups to determine the lambda (rate) parameter.

First, I set up a few functions and import some packages.

library(plyr) library(dplyr) library(glmnet) library(reshape2) library(ggplot2) #glmnet function for easier use clean_glmnet <- function(model, data, alpha=0.5, family='gaussian', ...){ x = model.matrix(model, data=data, contrasts.arg = list(group=contrasts(data$group, contrasts=FALSE), class=contrasts(data$class, contrasts=FALSE)) ) y = data[,as.character(model)[2]] model = glmnet(x, y, family=family, alpha=alpha, ...) return(model) } get_params <- function(model){ if ('glmnet' %in% class(model)){ le = ncol(model$beta) #print(model$beta[,le]) return(model$beta[,le]) } else{ return(coef(model)) } } #function describing the rate estimate for the model rate1 <- function(group, class){ base_rates = rep(4, length(group)) base_rates = ifelse(group=='a', base_rates + 15, base_rates) base_rates = ifelse(class=='a', base_rates + 30, base_rates) base_rates = ifelse(class=='b', base_rates + 5, base_rates) return(base_rates) }

This last function, `rate1`

, is used to determine the rate used for the Poisson-distributed output variables.

Next, I generate the data according to various configurations and run the models using a loop. I run each model with a plain GLM and an elastic net-regularized GLM. I also run the model on data with the last (outlier) row dropped, as well as a value of 30 added to the row (for even greater leverage). Lastly, I run the models on data that has 4 points in group bfrom another, but not as impactful, class that is present in group a. This last change should reduce the impact of the main outlier.

make_data <- function(group_b_classes){ set.seed(2017) if (group_b_classes=='1a') high_leverage = data.frame( group = c(rep('a',250), rep('b', 250)), class = c(rep('a',125), rep('b', 125), rep('c', 249), 'a') ) %>% mutate(output = rpois(500, rate1(group, class))) else if (group_b_classes=='1a4b') high_leverage = data.frame( group = c(rep('a',250), rep('b', 250)), class = c(rep('a',125), rep('b', 125), rep('c', 245), rep('b',4), 'a') ) %>% mutate(output = rpois(500, rate1(group, class))) return(high_leverage) } data_list = list() for (group_b_classes in c('1a','1a4b')){ data = make_data(group_b_classes) for (model_type in c('glmnet','glm')){ if (model_type=='glmnet'){ func = clean_glmnet func_family = 'poisson' } else{ func = glm family = poisson } for (modification in c('drop last','none','add 30 to last')){ if (modification=='drop last') moddata = data[-500,] else if (modification=='add 30 to last'){ moddata = data moddata[500,'output'] = moddata[500,'output'] + 30 } else moddata = data model = func(output ~ group + class, family=func_family, data= moddata) param_df = as.data.frame(t(get_params(model))) %>% mutate(group_b_classes=group_b_classes, modification=modification, model_type=model_type) data_list[[paste(group_b_classes, model_type, modification)]] = param_df } } } result_frame = rbind.fill(data_list) param_frame = melt(result_frame, id.vars = c('group_b_classes', 'modification','model_type')) %>% mutate(value = coalesce(value,0), group_b_classes = mapvalues(group_b_classes, from=c('1a','1a4b'), to=c('group b: 1 class a, 249 class c','group b: 1 class a, 4 class b, 245 class c')), modification=factor(modification, levels=c('none','drop last','add 30 to last')), model_type = mapvalues(model_type, from=c('glm','glmnet'), to=c('Poisson GLM','Poisson GLM Elastic-Net')), variable=mapvalues(variable, from=c('classa','classb','classc','groupa','groupb'), to=c('class a','class b','class c','group a','group b')) )

Lastly, I generate a plot of the model parameters:

print( ggplot(param_frame, aes(x=variable, y=value, fill=modification)) + geom_bar(stat='identity', position='dodge') + facet_grid(group_b_classes ~ model_type) + scale_fill_brewer('Data Modification',palette='Dark2') + coord_flip() + ylab("Estimate") + ggtitle("Effect of High-Leverage Point on Regression Parameter Estimates", subtitle="Poisson GLM and Poisson Elastic-Net models in first and second columns\nData source by counts of classes in Group B separated in two rows") + theme(text=element_text(size=rel(4.3)), legend.text=element_text(size=rel(3)), plot.title=element_text(size=rel(5)), plot.subtitle=element_text(size=rel(3))) )

It can help to look at the orange bar in each graph, as that is most representative of what the model "should" be. In the plain GLMs on the left half of the graph, the group b estimates are strongly affected by the presence of the outlier, especially since its presence allows for a separate estimate of class c and group b, which are no longer identical. When a bit of class b is added to group b, the effect isn't as strong, but there is still an adverse affect.

The elastic net models are technically more consistent than their non-regularized counterparts, but the regularization can only contrain the parameters so much, even when class b is added to group b.

Although the above is a contrived case, it is very similar to the one I encountered with Reddit submission scores. Below are a few suggestions I have for mitigating these effects:

- Run the model through a non-regularized form of the regression first (if possible) to determine if there are high-leverage points. Leverage is a very good metric to use to determine if any points have a lot of influence on a model:
#plot leverage and cook's distance plot(glm(output ~ group+class, data=make_data('1a4b') %>% mutate(output=output+c(numeric(499),30))), which=6)

You may want to remove that point if it is unrealistic, wuch as the five points in the above graph.

- In general, points that have an unusual combination of classes will tend to have higher leverage, especially when the response variable is an outlier. This may be difficult to detect in models with larger numbers of variables. Validating models and/or creating ensembles of learners (such as random forest models or gradient boosting) can reduce this effect, although it limits the intepretability of the model.
- Proper sampling and/or experimental design can significantly reduce the chances of high-leverage points impacting a model. This cannot be done in every situation, but it is good if these problems can be avoided in the first place.
- Increasing sample size will reduce vulnerability, but it may have the opposite effect if new categories end up appearing (but this would not affect most of the data).

Tags: