December 03, 2020
If you have ever downloaded survey data, or any other kind of data, that has a field that is itself comma-separated (or space/semicolon/etc.), you may have found it annoying/difficult to reshape the data into a more useful form.
Take for example, the following simple data frame:
library(tidyverse) df = data.frame( id=1:3, wishlist=c( 'peach,grapes', 'peach,cherry', 'grapes,cherry' ) )
id | wishlist |
---|---|
1 | peach,grapes |
2 | peach,cherry |
3 | grapes,cherry |
The above data is a simple set of IDs and wishlists, with items in each wishlist separated by commas. As it is, it is difficult to do anything with that data other than print it out.
Generally, you will want the data in one of two formats:
ggplot
). Below I will show a few ways of transforming the data, staring with tall, then wide. At the end I will show a few ways to convert back from tall to wide and vice-versa.
Note: Outside of the tidyverse
package, I will import functions from packages using the package::function()
notation to avoid loading functions from different packages that have the same names.
A relatively simple way of splitting the column into tall format is by using separate_rows()
from the tidyr
package:
tall_df_1 = df %>% separate_rows(wishlist, sep=',')
id | wishlist |
---|---|
1 | peach |
1 | grapes |
2 | peach |
2 | cherry |
3 | grapes |
3 | cherry |
One less orthodox approach is to apply a function to each row of the data frame with by_row()
from purrrlyr
. This can be slower than other methods, but you can do additional processing of the data by-ID quite easily.
tall_df_2 = df %>% purrrlyr::by_row( ..f = (function(data){ data.frame(item = data$wishlist %>% str_split_fixed(',', n=Inf) %>% as.character() ) }), .collate='rows' ) %>% select(-wishlist, -.row)
id | item |
---|---|
1 | peach |
1 | grapes |
2 | peach |
2 | cherry |
3 | grapes |
3 | cherry |
A simple way of splitting the values into tall format can be done with base R.
tall_df_3 = stack( setNames( strsplit(as.character(df$wishlist), ','), df$id ) )[2:1]
ind | values |
---|---|
1 | peach |
1 | grapes |
2 | peach |
2 | cherry |
3 | grapes |
3 | cherry |
Here, strsplit()
splits the wishlist
column into a list of character
vectors according to the split pattern. The names of each element of the list are set to the original ID of the list with setNames()
, and stack()
converts the data to tall format. The order of the columns is reversed so that the ID column is in the right position.
Below, I use mtabulate
from the qdapTools
package to create a table filled with 1
s and 0
s, and bind the column back to the original data without the wishlist column.
## wide form wide_df_1 = bind_cols( df %>% select(-wishlist), qdapTools::mtabulate( strsplit(df$wishlist, ',') ) %>% rename_all(function(x) paste0('wishlist_',x)) )
id | wishlist_cherry | wishlist_grapes | wishlist_peach |
---|---|---|---|
1 | 0 | 1 | 1 |
2 | 1 | 0 | 1 |
3 | 1 | 1 | 0 |
Alternatively, you can use exclusively tidyverse
packages to return a tibble
instead of a data frame. The result is otherwise the same.
wide_df_2 = df %>% separate_rows(wishlist, sep=',') %>% mutate(i=1) %>% spread(wishlist, i, fill=0) %>% rename_at(-1, function(x) paste0('wishlist_',x))
id | wishlist_cherry | wishlist_grapes | wishlist_peach |
---|---|---|---|
1 | 0 | 1 | 1 |
2 | 1 | 0 | 1 |
3 | 1 | 1 | 0 |
As you can recognize, the separate_rows
function converts the data into a tall format. Then a dummy value of 1
is inserted as a column. spread
, from tidyr
is then used to spread the values apart into their own columns, taking the value of 1 (i
) if they are present, 0
otherwise.
The rename_at
function gives a prefix of 'wishlist_'
to all of the newly created columns, omitting the first (ID) column.
Lastly, we can avoid using the tidyverse package altogether and use only functions available in base R:
cherry | grapes | peach | |
---|---|---|---|
1 | 0 | 1 | 1 |
2 | 1 | 0 | 1 |
3 | 1 | 1 | 0 |
Note that this approach results in the ID column named ind
, and requires more processing to convert back to data frame in a wide format with the desired column names.
As you probably noticed, a lot of the methods of making a wide data frame involved creating a tall one, then applying additional functions to it. Likewise, you can convert back from wide to tall a few different ways.
The tidyr
package has some good options for converting between the formats.
tidyr_tall_df = wide_df_2 %>% pivot_longer(starts_with('wishlist'))
id | name | value |
---|---|---|
1 | wishlist_cherry | 0 |
1 | wishlist_grapes | 1 |
1 | wishlist_peach | 1 |
2 | wishlist_cherry | 1 |
2 | wishlist_grapes | 0 |
2 | wishlist_peach | 1 |
3 | wishlist_cherry | 1 |
3 | wishlist_grapes | 1 |
3 | wishlist_peach | 0 |
tidyr_wide_df = tall_df_1 %>% mutate(i=1) %>% pivot_wider( id_cols='id', names_from='wishlist', values_from=i, values_fill=0 )
id | peach | grapes | cherry |
---|---|---|---|
1 | 1 | 1 | 0 |
2 | 1 | 0 | 1 |
3 | 0 | 1 | 1 |
Note that whenever you specify a fill value of 0, you will probably want to filter out those values unless you need to keep them. It usually unnecessarily clutters your data, though, especially if it is very sparse.
While not used as often, reshape2
can be used with its dcast()
function to reshape data from tall to wide. Similarly, one can use melt()
to reverse the operation and convert back into tall format. I use subset()
here as a tidyverse
alternative to filter out the 0
values.
new_wide_df = reshape2::dcast( tall_df_1, id ~ wishlist, fill=0, fun.aggregate=function(x) 1 )
id | cherry | grapes | peach |
---|---|---|---|
1 | 0 | 1 | 1 |
2 | 1 | 0 | 1 |
3 | 1 | 1 | 0 |
new_tall_df = subset( reshape2::melt( new_wide_df, id.vars='id' ), value==1, )
id | variable | value |
---|---|---|
2 | cherry | 1 |
3 | cherry | 1 |
1 | grapes | 1 |
3 | grapes | 1 |
1 | peach | 1 |
2 | peach | 1 |
Tags: