Fixing Survey Samples by Raking Weights

Fixing Survey Samples by Raking Weights

By Max Candocia

|

June 24, 2018

When working with survey data, your sampling is oftentimes imperfect. For example, women respond to surveys more frequently than men, so you could easily end up with a 70-30 split of women to men. Similarly, middle-aged people may be harder to reach than older people or younger people online (at least that's been my experience).

When you want to report a statistic, such as "What percentage of people in the US watch the TV Show Game of Thrones?", or "How many Big MacsĀ® does the average American eat a year?", you want to make sure that the way you report your statistic is representative of reality. If you cannot get a sample perfectly proportional to reality (e.g., 50% men, 50% women--for ages about 27% ages 18-29, 40% ages 30-54, 33% ages 55+), then weighting the data is often a good choice.

Methods of Weighting

Basic Proportions

The simplest method of weighting is taking the cross-tabulation of all of the variables you want to weight by (e.g., age and gender, or possibly just gender), and then assigning weights to each row of your data so that the sum of weights in each cross-tabulated group is propoertional to the true population. For example, in the example above with a 70% response rate for women, you would give a weight of 2.33 (0.7/0.3 * 0.5/0.5) to each man, and 1 to each woman, and then normalize them (multiply by the population divided by sum of weights) to get weights of 1.66 and 0.714 for men and women, respectively.

Raking

Oftentimes cross-tabulations are not available for every single variable, but you have individual figures for each of those variables. Or perhaps even a cross-tabulation of a subset of those variables. You can still get a good weight estimate by applying a technique called raking, which iteratively adjusts the weights of each row so that a particular column's proportions match with the weight proportions.

The process looks like this:

  1. Set each weight to 1
  2. Select your variables to weight by and their true population/percentages
  3. Repeat steps 4-5 until convergence
  4. Repeat step 5 for each variable
  5. Set the weight of the column to its weight times the true proportion of each group divided by the proportion represented by the sum of weights in each group. A more proper description:
  6. (optional) Limit the maximum weight to the median of the weights plus 6 times the inter-quartile range (IQR) of the weights. This prevents rare categories from having too much impact in your results
  7. Multiply the final weights by the total sample size divided by the sum of the weights (for normalization)

What to do with the weights

What NOT to do with the weights

Possible Pitfalls

Code Example

Below is an implementation of raking in R

library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)

# will make sure a list's elements match the levels of
# the data it corresponds to
reorder_list <- function(x, reference_data){
  new_list = list()
  data_levels = levels(reference_data)
  for (level in levels(as.factor(reference_data)))
    new_list[[level]] = x[[level]]
  return(new_list)
}

# calcualtes weights for data based on selected variables and their "true" margins
rake_data <- function(data, variables, true_or_estimated_margins, max_iter=50, truncate_extreme_weights=TRUE){
  weights = rep(1, nrow(data))

  # calculate the sum and proportions of each variable + level in advance
  # and make sure that order matches the same order in the data
  total_margins = list()
  for (variable in variables){
    original_margins = true_or_estimated_margins[[variable]]
    reordered_margins = reorder_list(original_margins, data[,variable])

    total_margin =  sum(unlist(reordered_margins[[variable]]))
    total_margins[[variable]] = total_margin
    for (level in names(true_or_estimated_margins[[variable]]))
      reordered_margins[[variable]][[level]] =
          reordered_margins[[variable]][[level]]/total_margin
  }

  # create design matrices (columns of 1s and 0s in this case) for faster calculations
  design_matrices = list()
  for (variable in variables){
    # create model matrix with 0-intercept, which removes the concept of "reference variable"
    design_matrices[[variable]] = as.data.frame(model.matrix(~.+0, data=data[,variable,drop=FALSE]))
    # remove variable name from column name so only level remains
    colnames(design_matrices[[variable]]) = substr(colnames(design_matrices), 1, nchar(variable))
  }

  # perform raking
  for (i in 1:max_iter){
    for (variable in variables){
      weighted_margins = colSums(weights * design_matrices[[variable]])
      level_weight_modifications = unlist(true_or_estimated_margins[[variable]])/weighted_margins

      # this multiplies each column of the design matrices by the corresponding weight change factor
      # then each column is multiplied by the weights, and the rows are added up, since each row only
      # has one non-zero value
      weights = rowSums(weights * mapply(`*`, design_matrices[[variable]], level_weight_modifications))
    }
  }

  # limits extreme weights to median plus 6 times inter-quartile range
  # IQR = difference between 75th percentile and 25th percentile of data
  if (truncate_extreme_weights){
    weight_threshold = median(weights) + 6*IQR(weights)
    weights = pmin(weights, weight_threshold)
  }
  #normalize to population size
  weights = weights*length(weights)/(sum(weights))

  return(weights)
}

# formula in http://www.analyticalgroup.com/download/WEIGHTED_MEAN.pdf
weighted.var.se <- function(x, w, na.rm=FALSE)
{
  x_weighted_mean = sum(x*w)/sum(w)
  sum(w*(x-x_weighted_mean)^2)/(sum(w)-1)
}

# the error for the weighted mean estimate
weighted.var.sigma <- function(x, w){
  root_b = sqrt(sum(w)^2/sum(w^2))
  weighted.var.se(x,w)/root_b
}

Simulated Data

Below, I create a small data set with a gender and race variable as the groups, and a "random score" as the response variable. The goal is to find the average score as if it were truly randomly sampled from the target population.

In this case, there is a 70:30 female:male ratio, and a 50:40:10 ratio of white:black:other races. The true proportions are 50:50 female:male and 70:15:15 white:black:other. These are just made up values for this example, so don't look into them any further than that.

set.seed(525600)
size_1 = 200
example_data = data.frame(
  gender=c("M","F")[1+rbinom(size_1, 1, 0.7)],
  race=sample(c('White','Black','Other'), size_1, replace=TRUE, prob=c(0.5,0.4,0.1))
)

actual_margins=list(gender=list('M'=0.5,'F'=0.5), race=list('White'=0.7, 'Black'=0.15, 'Other'=0.15))

survey_weights = rake_data(example_data, c('gender','race'), actual_margins)

example_data = example_data %>% mutate(random_score = 2+rnorm(size_1)/3 +
                                         0.2*(gender=='F') -
                                         0.25 * (race=='Other') +
                                         0.35 * (race=='Black')
)

Now we can rake the data by gender and race, and then compare the weighted average score to the unweighted.

example_data_summary = rbind(
  example_data %>% summarize(
  score_mean = mean(random_score),
  score_error = sd(random_score),
  uses_weighting='Unweighted',
  mean_estimate_sigma=sd(random_score)/sqrt(length(random_score))
  ),
  example_data %>% summarize(
    score_mean=sum(survey_weights*random_score)/sum(survey_weights),
    score_error=sqrt(weighted.var.se(random_score, survey_weights)),
    uses_weighting='Weighted',
    mean_estimate_sigma=weighted.var.sigma(random_score, survey_weights)
  )
) %>%
  mutate(uses_weighting=factor(uses_weighting))
score_mean score_error uses_weighting mean_estimate_sigma
2.254173 0.4165295 Unweighted 0.0294531
2.329059 0.3912572 Weighted 0.0157915

We can compare the different estimates visually, plotting the means along with the standard error estimates for the data, followed by the standard error estimates for the mean estimates themselves.

Standard Error

The standard error here is the weighted mean estimate standard error (very small) plus the standard error of the weighted data (much larger). The interpretation of the red ribbons is that you could expect 95% of the population to fall within them when sampled randomly. The weighted sample is more accurate when applied to the general population.

plot of chunk se_estimate_plot

Mean Estimate Error

We can also look at how precise the mean estimate is itself. That is, "What is the true mean?" The weighted mean is slightly higher, and also much more precise than the unweighted version. This is likely due to more of the "average" values having more representation with the weights than in the unweighted sample.

Note that the trend of the weighted standard error/estimate standard error being smaller is not always true. If the more extreme values had higher weights, then the opposite effect would be seen.

plot of chunk mean_estimate_se_estimate_plot

Sources

GitHub Repository Code

Here is the GitHub repository for this code: https://github.com/mcandocia/raking


Tags: 

Recommended Articles

Converting fields of lists into wide and tall formats in R

If you have ever downloaded survey data, or any other kind of data, that has a field that is itself comma-separated, you may have found it annoying/difficult to reshape the data into a more useful form.

Pooled Testing for Viruses: How many tests can it save?

Pooled testing is a method of increasing the efficiency of medical tests, such as COVID-19 detection tests. How efficient is it, though? That largely depends on what percentage of the population is infected.