February 24, 2021
When visualizing data, you might sometimes want to represent points as tiles on a heatmap, where the color indicates the value. Take, for example, the following toy data:
    library(tidyverse)
    library(cetcolor)

    set.seed(2021 * 02 * 24)

    mydata = expand.grid(
      x=1:6,
      y=1:6
    ) %>%
      mutate(
        val = rnorm(36),
        err_width = rexp(36) * 2/3,
        lower = val-err_width,
        upper = val+err_width
      )
It is a 6-by-6 grid of normally distributed values representing some sort of estimate, each with an exponentially distributed error width. The first 7 rows:
x | y | val | err_width | lower | upper |
---|---|---|---|---|---|
1 | 1 | -0.418 | 0.173 | -0.592 | -0.245 |
2 | 1 | 0.963 | 0.424 | 0.538 | 1.387 |
3 | 1 | -0.531 | 0.060 | -0.590 | -0.471 |
4 | 1 | 0.286 | 0.785 | -0.499 | 1.071 |
5 | 1 | 0.081 | 0.103 | -0.022 | 0.183 |
6 | 1 | -0.550 | 0.224 | -0.774 | -0.325 |
1 | 2 | -0.160 | 1.186 | -1.345 | 1.026 |
Ideally, an “error bar” shares the same type of scale as the data values while being less prominent. One way to do this is to place two small, colored dots vertically within each cell. The upper dot represents the upper limit of a confidence interval (or any other kind of interval, such as a prediction interval or a quantile range), and the lower dot represents the lower limit.
    # symmetric color range
    color_range = c(-1,1) * max(abs(c(min(mydata$lower), max(mydata$upper))))

    ggplot(
      mydata
    ) +
      # tile fill shows the point estimate
      geom_tile(aes(x=x,y=y, fill=val)) +
      # lower dot shows the lower limit of the interval
      geom_point(
        aes(
          x=x,y=y-0.2, color=lower
        )
      ) +
      # upper dot shows the upper limit of the interval
      geom_point(
        aes(
          x=x,y=y+0.2, color=upper
        )
      ) +
      # limits should be the same, using divergent palette for ease of seeing when
      # interval contains 0
      scale_fill_gradientn(colors=cetcolor::cet_pal(7, 'd1a'), limits=color_range) +
      scale_color_gradientn(colors=cetcolor::cet_pal(7, 'd1a'), limits=color_range) +
      scale_x_continuous(breaks=1:6) +
      scale_y_continuous(breaks=1:6) +
      # hide the redundant color legend ('none' replaces the deprecated FALSE)
      guides(color='none') +
      theme_bw() +
      ggtitle('Sample Heatmap Tile with Dot "Error Bars"',
              subtitle='Lower dot = lower estimate | Upper dot = upper estimate') +
      labs(caption='https://maxcandocia.com/link/heatmap-errorbar')
When the dots are barely visible, their colors nearly match the tile, so the interval is very narrow. When the two dots are different colors (red vs. blue), the interval contains 0, which is often the reference value in statistical significance tests: if the confidence interval contains 0, the null hypothesis that the value is 0 is not rejected.
This can get a bit messier for larger data, but it is still manageable. Take, for example, a 25-by-25 grid:
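The code for the larger grid is not shown here, so the following is only a minimal sketch of how such a plot could be produced by scaling up the 6-by-6 recipe. The seed, the names `n_side`, `big_data`, `big_range`, and `big_plot`, and the smaller dot size are all assumptions made for illustration, so the result will not reproduce the original figure exactly.

```r
library(tidyverse)
library(cetcolor)

# Assumption: arbitrary seed, not the one used for the original figure
set.seed(625)

n_side = 25  # grid dimension (25-by-25 = 625 tiles)

big_data = expand.grid(x = 1:n_side, y = 1:n_side) %>%
  mutate(
    val = rnorm(n_side^2),
    err_width = rexp(n_side^2) * 2/3,
    lower = val - err_width,
    upper = val + err_width
  )

# symmetric color range shared by tiles and dots
big_range = c(-1, 1) * max(abs(c(big_data$lower, big_data$upper)))

big_plot = ggplot(big_data) +
  geom_tile(aes(x = x, y = y, fill = val)) +
  # Assumption: smaller dots (size = 0.6) to suit the narrower tiles
  geom_point(aes(x = x, y = y - 0.2, color = lower), size = 0.6) +
  geom_point(aes(x = x, y = y + 0.2, color = upper), size = 0.6) +
  scale_fill_gradientn(colors = cetcolor::cet_pal(7, 'd1a'), limits = big_range) +
  scale_color_gradientn(colors = cetcolor::cet_pal(7, 'd1a'), limits = big_range) +
  guides(color = 'none') +
  theme_bw()

print(big_plot)
```

The dot size is the main knob to adjust: it probably needs to shrink further as the grid grows or the output image gets smaller.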
While it's slightly trickier to read, the graph is not much harder to interpret than one without the error dots. I doubt this approach would work with a much larger number of tiles, at least on regular computer screens or mobile devices.
In those cases, it might suffice to show only the least extreme estimate (the interval endpoint closest to 0 if the interval does not contain 0, or 0 itself if it does), or a binary (0 for "zero", 1 for "nonzero") or ternary (-1, 0, or 1, matching the sign) indicator of that least extreme value.
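As a rough illustration of those three summaries, here is a small base R sketch. The helper name `least_extreme` is hypothetical, and the example intervals are taken from the toy data table above (the rows with x = 1, 2, and 4 at y = 1).

```r
# Least extreme value of an interval: the endpoint closest to 0,
# or 0 itself when the interval contains 0.
# (Hypothetical helper, not part of any package.)
least_extreme = function(lower, upper) {
  ifelse(lower > 0, lower, ifelse(upper < 0, upper, 0))
}

# Three intervals from the toy data table
lower = c(-0.592, 0.538, -0.499)
upper = c(-0.245, 1.387, 1.071)

le = least_extreme(lower, upper)  # -0.245  0.538  0.000
binary = as.integer(le != 0)      # 1 1 0  (0 = "zero", 1 = "nonzero")
ternary = sign(le)                # -1 1 0 (sign of the least extreme value)
```

Any of these vectors could then be mapped to the tile fill in place of `val`, which leaves a single value per tile and removes the need for the dots.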
All of this is dependent upon the application, of course. At some point, though, there can be so much information that it is not reasonable to neatly fit it all in a static visualization.