June 24, 2018
When working with survey data, your sampling is oftentimes imperfect. For example, women respond to surveys more frequently than men, so you could easily end up with a 70-30 split of women to men. Similarly, middle-aged people may be harder to reach online than older or younger people (at least that's been my experience).
When you want to report a statistic, such as "What percentage of people in the US watch the TV show Game of Thrones?" or "How many Big Macs® does the average American eat a year?", you want to make sure that the way you report your statistic is representative of reality. If you cannot get a sample perfectly proportional to reality (e.g., 50% men and 50% women; for age, roughly 27% ages 18-29, 40% ages 30-54, and 33% ages 55+), then weighting the data is often a good choice.
The simplest method of weighting is taking the cross-tabulation of all of the variables you want to weight by (e.g., age and gender, or possibly just gender), and then assigning weights to each row of your data so that the sum of weights in each cross-tabulated group is proportional to the true population. For example, with the 70-30 split of women to men above, you would give a weight of 2.33 (0.7/0.3 * 0.5/0.5) to each man and 1 to each woman, and then normalize them (multiply by the population size divided by the sum of weights) to get weights of 1.67 and 0.714 for men and women, respectively.
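The arithmetic above can be checked with a few lines of R (the 70/30 counts are the made-up numbers from the example):

```r
# Toy example: 70 women and 30 men observed, true split 50/50.
n_women <- 70
n_men <- 30
n <- n_women + n_men

# Raw weights: 1 for each woman, 0.7/0.3 * 0.5/0.5 = 2.33 for each man
raw_weights <- c(rep(1, n_women), rep(0.7/0.3 * 0.5/0.5, n_men))

# Normalize so the weights sum to the sample size
weights <- raw_weights * n / sum(raw_weights)

round(unique(weights), 3)  # women: 0.714, men: 1.667
```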
Oftentimes cross-tabulations are not available for every single variable, but you have individual figures for each of those variables, or perhaps a cross-tabulation of a subset of them. You can still get a good weight estimate by applying a technique called raking, which iteratively adjusts the weights of the rows so that the weighted proportions of each variable match that variable's target margins.
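The core adjustment step can be sketched on a tiny made-up data set: for one variable at a time, each row's weight is multiplied by (target share / current weighted share) of its group, and raking simply cycles this step over all variables until the margins stabilize.

```r
# One raking adjustment for a single variable (illustrative toy data).
df <- data.frame(
  gender = c("F", "F", "F", "M"),  # observed 75/25 split
  race   = c("White", "Black", "White", "White")
)
weights <- rep(1, nrow(df))

target_gender <- c(F = 0.5, M = 0.5)

# current weighted proportions by gender
current <- tapply(weights, df$gender, sum) / sum(weights)

# multiply each row's weight by (target share / current share) for its group
weights <- weights *
  target_gender[as.character(df$gender)] / current[as.character(df$gender)]

tapply(weights, df$gender, sum) / sum(weights)  # now 0.5 / 0.5
```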
Once the weights are calculated, statistics are reported using the weighted mean:
weighted_mean(x, weights) = sum(x*weights)/sum(weights)
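As a quick sanity check of the formula, here it is on a small vector (base R's `weighted.mean` computes the same quantity):

```r
# weighted_mean(x, w) = sum(x*w)/sum(w)
x <- c(2, 4, 6)
w <- c(1, 1, 2)

sum(x * w) / sum(w)   # 4.5
weighted.mean(x, w)   # base R equivalent, also 4.5
```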
Below is an implementation of raking in R.
```r
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)

# will make sure a list's elements match the order of the levels of
# the data it corresponds to
reorder_list <- function(x, reference_data){
  new_list = list()
  for (level in levels(as.factor(reference_data)))
    new_list[[level]] = x[[level]]
  return(new_list)
}

# calculates weights for data based on selected variables and their "true" margins
rake_data <- function(data, variables, true_or_estimated_margins, max_iter=50,
                      truncate_extreme_weights=TRUE){
  weights = rep(1, nrow(data))

  # normalize the target margins of each variable to proportions in advance,
  # making sure their order matches the level order of that variable in the data
  normalized_margins = list()
  for (variable in variables){
    reordered_margins = reorder_list(true_or_estimated_margins[[variable]],
                                     data[,variable])
    total_margin = sum(unlist(reordered_margins))
    for (level in names(reordered_margins))
      reordered_margins[[level]] = reordered_margins[[level]]/total_margin
    normalized_margins[[variable]] = reordered_margins
  }

  # create design matrices (columns of 1s and 0s in this case) for faster calculations
  design_matrices = list()
  for (variable in variables){
    # create model matrix with 0-intercept, which removes the concept of a "reference level"
    design_matrices[[variable]] = as.data.frame(
      model.matrix(~.+0, data=data[,variable,drop=FALSE])
    )
    # remove the variable name from the column names so only the level remains
    colnames(design_matrices[[variable]]) = substring(
      colnames(design_matrices[[variable]]), nchar(variable) + 1
    )
  }

  # perform raking
  for (i in 1:max_iter){
    for (variable in variables){
      weighted_margins = colSums(weights * design_matrices[[variable]])
      level_weight_modifications =
        unlist(normalized_margins[[variable]])/weighted_margins
      # this multiplies each column of the design matrix by the corresponding
      # weight change factor, then each column is multiplied by the weights
      # and the rows are added up, since each row only has one non-zero value
      weights = rowSums(weights * mapply(`*`, design_matrices[[variable]],
                                         level_weight_modifications))
    }
  }

  # limits extreme weights to median plus 6 times inter-quartile range
  # IQR = difference between 75th percentile and 25th percentile of data
  if (truncate_extreme_weights){
    weight_threshold = median(weights) + 6*IQR(weights)
    weights = pmin(weights, weight_threshold)
  }

  # normalize to population size
  weights = weights*length(weights)/sum(weights)
  return(weights)
}

# weighted variance; formula in http://www.analyticalgroup.com/download/WEIGHTED_MEAN.pdf
weighted.var.se <- function(x, w, na.rm=FALSE) {
  x_weighted_mean = sum(x*w)/sum(w)
  sum(w*(x-x_weighted_mean)^2)/(sum(w)-1)
}

# the error for the weighted mean estimate
weighted.var.sigma <- function(x, w){
  root_b = sqrt(sum(w)^2/sum(w^2))
  weighted.var.se(x, w)/root_b
}
```
Below, I create a small data set with a gender and race variable as the groups, and a "random score" as the response variable. The goal is to find the average score as if it were truly randomly sampled from the target population.
In this case, the sample has a 70:30 male:female ratio and a 50:40:10 ratio of white:black:other races, while the "true" proportions are 50:50 male:female and 70:15:15 white:black:other. These are just made-up values for this example, so don't read into them any further than that.
```r
set.seed(525600)
size_1 = 200

example_data = data.frame(
  gender=c("M","F")[1+rbinom(size_1, 1, 0.7)],
  race=sample(c('White','Black','Other'), size_1, replace=TRUE,
              prob=c(0.5,0.4,0.1))
)

actual_margins = list(
  gender=list('M'=0.5, 'F'=0.5),
  race=list('White'=0.7, 'Black'=0.15, 'Other'=0.15)
)

survey_weights = rake_data(example_data, c('gender','race'), actual_margins)

example_data = example_data %>%
  mutate(
    random_score = 2 + rnorm(size_1)/3 + 0.2*(gender=='F') -
      0.25*(race=='Other') + 0.35*(race=='Black')
  )
```
Now we can rake the data by gender and race, and then compare the weighted average score to the unweighted one.
```r
example_data_summary = rbind(
  example_data %>%
    summarize(
      score_mean = mean(random_score),
      score_error = sd(random_score),
      uses_weighting = 'Unweighted',
      mean_estimate_sigma = sd(random_score)/sqrt(length(random_score))
    ),
  example_data %>%
    summarize(
      score_mean = sum(survey_weights*random_score)/sum(survey_weights),
      score_error = sqrt(weighted.var.se(random_score, survey_weights)),
      uses_weighting = 'Weighted',
      mean_estimate_sigma = weighted.var.sigma(random_score, survey_weights)
    )
) %>%
  mutate(uses_weighting = factor(uses_weighting))
```
| score_mean | score_error | uses_weighting | mean_estimate_sigma |
|---|---|---|---|
| 2.254173 | 0.4165295 | Unweighted | 0.0294531 |
| 2.329059 | 0.3912572 | Weighted | 0.0157915 |
We can compare the different estimates visually, plotting the means along with the standard error estimates for the data, followed by the standard error estimates for the mean estimates themselves.
The standard error here is the weighted mean estimate standard error (very small) plus the standard error of the weighted data (much larger). The interpretation of the red ribbons is that you could expect 95% of the population to fall within them when sampled randomly. The weighted sample is more accurate when applied to the general population.
We can also look at how precise the mean estimate itself is. That is, "What is the true mean?" The weighted mean is slightly higher, and also much more precise than the unweighted version. This is likely because values near the average have more representation under the weights than in the unweighted sample.
Note that the trend of the weighted standard error/estimate standard error being smaller is not always true. If the more extreme values had higher weights, then the opposite effect would be seen.
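This effect can be demonstrated with a small made-up vector: concentrating weight on central versus extreme values changes the weighted variance (computed with the same formula as `weighted.var.se` above) in opposite directions.

```r
# Weighted variance, same formula as weighted.var.se in the post
weighted_var <- function(x, w) {
  sum(w * (x - sum(x * w) / sum(w))^2) / (sum(w) - 1)
}

x <- c(-3, -0.1, 0, 0.1, 3)
w_center  <- c(1, 3, 3, 3, 1)   # extra weight on central values
w_extreme <- c(3, 1, 1, 1, 3)   # extra weight on extreme values

weighted_var(x, w_center) < weighted_var(x, w_extreme)  # TRUE
```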
Here is the GitHub repository for this code: https://github.com/mcandocia/raking