Converting fields of lists into wide and tall formats in R

Converting fields of lists into wide and tall formats in R

By Max Candocia

|

December 03, 2020

If you have ever downloaded survey data, or any other kind of data, that has a field that is itself comma-separated (or space/semicolon/etc.), you may have found it annoying/difficult to reshape the data into a more useful form.

Take for example, the following simple data frame:

library(tidyverse)

df = data.frame(
  id=1:3,
  wishlist=c(
    'peach,grapes',
    'peach,cherry',
    'grapes,cherry'
  )
)
idwishlist
1peach,grapes
2peach,cherry
3grapes,cherry

The above data is a simple set of IDs and wishlists, with items in each wishlist separated by commas. As it is, it is difficult to do anything with that data other than print it out.

Generally, you will want the data in one of two formats:

  1. Wide, where the number of rows is unchanged, and each possible item gets its own column. This is useful if there are a small number of possible items and you want to look at them independently.
  2. Tall, where there is a row for each item for each ID. This is useful if there is a large/unknown number of items, or you need the data in that format for easier visualization (e.g., with ggplot).

Below I will show a few ways of transforming the data, staring with tall, then wide. At the end I will show a few ways to convert back from tall to wide and vice-versa.

Note: Outside of the tidyverse package, I will import functions from packages using the package::function() notation to avoid loading functions from different packages that have the same names.

Converting to Tall Format

With Tidyverse

A relatively simple way of splitting the column into tall format is by using separate_rows() from the tidyr package:

tall_df_1 = df %>%
  separate_rows(wishlist, sep=',')
idwishlist
1peach
1grapes
2peach
2cherry
3grapes
3cherry

With purrrlyr

One less orthodox approach is to apply a function to each row of the data frame with by_row() from purrrlyr. This can be slower than other methods, but you can do additional processing of the data by-ID quite easily.

tall_df_2 = df %>%
  purrrlyr::by_row(
    ..f = (function(data){
      data.frame(item = data$wishlist %>%
                   str_split_fixed(',', n=Inf) %>%
                   as.character()
      )
    }),
    .collate='rows'
  ) %>%
  select(-wishlist, -.row)
iditem
1peach
1grapes
2peach
2cherry
3grapes
3cherry

With Base R

A simple way of splitting the values into tall format can be done with base R.

tall_df_3 = stack(
  setNames(
    strsplit(as.character(df$wishlist), ','),
    df$id
  )
)[2:1]
indvalues
1peach
1grapes
2peach
2cherry
3grapes
3cherry

Here, strsplit() splits the wishlist column into a list of character vectors according to the split pattern. The names of each element of the list are set to the original ID of the list with setNames(), and stack() converts the data to tall format. The order of the columns is reversed so that the ID column is in the right position.

Converting to Wide Format

With qdapTools

Below, I use mtabulate from the qdapTools package to create a table filled with 1s and 0s, and bind the column back to the original data without the wishlist column.

## wide form
wide_df_1 = bind_cols(
  df %>% select(-wishlist),
  qdapTools::mtabulate(
    strsplit(df$wishlist, ',')
  ) %>%
    rename_all(function(x) paste0('wishlist_',x))
)
idwishlist_cherrywishlist_grapeswishlist_peach
1011
2101
3110

With tidyverse only

Alternatively, you can use exclusively tidyverse packages to return a tibble instead of a data frame. The result is otherwise the same.

wide_df_2 = df %>%
  separate_rows(wishlist, sep=',') %>%
  mutate(i=1) %>%
  spread(wishlist, i, fill=0) %>%
  rename_at(-1, function(x) paste0('wishlist_',x))
idwishlist_cherrywishlist_grapeswishlist_peach
1011
2101
3110

As you can recognize, the separate_rows function converts the data into a tall format. Then a dummy value of 1 is inserted as a column. spread, from tidyr is then used to spread the values apart into their own columns, taking the value of 1 (i) if they are present, 0 otherwise.

The rename_at function gives a prefix of 'wishlist_' to all of the newly created columns, omitting the first (ID) column.

Base R Only

Lastly, we can avoid using the tidyverse package altogether and use only functions available in base R:

cherrygrapespeach
1011
2101
3110

Note that this approach results in the ID column named ind, and requires more processing to convert back to data frame in a wide format with the desired column names.

Converting between wide and tall

As you probably noticed, a lot of the methods of making a wide data frame involved creating a tall one, then applying additional functions to it. Likewise, you can convert back from wide to tall a few different ways.

Tidyverse Approach

The tidyr package has some good options for converting between the formats.

tidyr_tall_df = wide_df_2 %>%
  pivot_longer(starts_with('wishlist'))
idnamevalue
1wishlist_cherry0
1wishlist_grapes1
1wishlist_peach1
2wishlist_cherry1
2wishlist_grapes0
2wishlist_peach1
3wishlist_cherry1
3wishlist_grapes1
3wishlist_peach0
tidyr_wide_df = tall_df_1 %>%
  mutate(i=1) %>%
  pivot_wider(
    id_cols='id',
    names_from='wishlist',
    values_from=i,
    values_fill=0
  )
idpeachgrapescherry
1110
2101
3011

Note that whenever you specify a fill value of 0, you will probably want to filter out those values unless you need to keep them. It usually unnecessarily clutters your data, though, especially if it is very sparse.

reshape2 package

While not used as often, reshape2 can be used with its dcast() function to reshape data from tall to wide. Similarly, one can use melt() to reverse the operation and convert back into tall format. I use subset() here as a tidyverse alternative to filter out the 0 values.

new_wide_df = reshape2::dcast(
  tall_df_1,
  id ~ wishlist,
  fill=0,
  fun.aggregate=function(x) 1
)
idcherrygrapespeach
1011
2101
3110
new_tall_df = subset(
  reshape2::melt(
    new_wide_df,
    id.vars='id'
  ),
  value==1,
)
idvariablevalue
2cherry1
3cherry1
1grapes1
3grapes1
1peach1
2peach1

Tags: 

Recommended Articles

Effectively Hiding Data in Images

How to effectively hide messages and other data inside images.

Exploring Data with Generative Clustering

Clustering is a common machine learning technique applied to data for a variety of reasons, including dimensionality reduction, finding similar objects, and discovering important features. Here I demonstrate how generative clustering can be used with the Iris data set.