September 26, 2017
Choosing the right color palette for a visualization can make a night-and-day difference for someone trying to understand the underlying data at a glance.
Most visualizations I see use an application's default color palette. These are often aesthetically pleasing, but the colors become harder to tell apart as more categories are added to the palette, whether or not the reader is colorblind.
Here are a few guidelines, which I will follow with examples:
The brightness of different colors within a palette can be used to distinguish the elements, in addition to hue. Below, the default (rainbow) palette is pretty, but the greens look similar in full color, and in grayscale it becomes ambiguous which shade refers to which diamond color grade.
The yellow-orange-red palette, on the other hand, does a good job of distinguishing elements. Because the categories are ordered, the physical interpretation is also much stronger: D is the best color and J is the worst.
When simulating these for the deuteranope (a form of red/green colorblindness) and tritanope (a rare blue/yellow colorblindness) varieties of colorblindness, it is obvious which of the above palettes is preferable:
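One way to check the brightness guideline numerically is to compute each color's relative luminance, which is roughly what survives a grayscale conversion. The sketch below is only an illustration in Python, not the code behind the plots in this post: the hex values are hypothetical stand-ins (two similar greens and a yellow-to-red ramp), and the helper names are my own.

```python
def srgb_to_linear(c):
    """Undo the sRGB gamma curve for one channel given in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance (0 = black, 1 = white) of a '#RRGGBB' color."""
    h = hex_color.lstrip("#")
    r, g, b = (srgb_to_linear(int(h[i:i + 2], 16) / 255) for i in (0, 2, 4))
    # Standard sRGB / Rec. 709 luminance weights.
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Hypothetical palettes, chosen for illustration only.
similar_greens = ["#53B400", "#00C094"]
yellow_to_red = ["#FFFFB2", "#FED976", "#FEB24C", "#FD8D3C",
                 "#FC4E2A", "#E31A1C", "#B10026"]

for name, palette in [("greens", similar_greens), ("yellow-red", yellow_to_red)]:
    print(name, [round(relative_luminance(c), 3) for c in palette])
```

If two colors in a palette come out with nearly the same luminance, they will be hard to separate in grayscale. A sequential ramp whose luminance changes steadily from one end to the other avoids that problem.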
Below are three plots of the famous Iris dataset, which contains three different species of iris. Because there are few groups, the choice of color palette doesn't make a huge difference for most people:
In fact, the "colorblind-friendly" palette is a bit harder to read, since the background color is not that much lighter than the shade of orange used for Versicolor. I personally prefer the middle one, since the colors do a better job of contrasting with the background.
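The point about contrast with the background can be quantified with the WCAG contrast ratio, (L1 + 0.05) / (L2 + 0.05), where L1 is the lighter of the two relative luminances. This is a rough Python sketch rather than anything used for the plots above, and the orange and light-gray hex values are assumptions picked for illustration, not colors read off the figures.

```python
def _linear(c):
    """Undo the sRGB gamma curve for one channel given in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _luminance(hex_color):
    """Relative luminance of a '#RRGGBB' color."""
    h = hex_color.lstrip("#")
    r, g, b = (_linear(int(h[i:i + 2], 16) / 255) for i in (0, 2, 4))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio between two colors; ranges from 1 to 21."""
    la, lb = _luminance(color_a), _luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


# Assumed colors: an orange point color against two candidate backgrounds.
orange = "#E69F00"
light_gray_background = "#EBEBEB"
white_background = "#FFFFFF"

print("orange on light gray:", round(contrast_ratio(orange, light_gray_background), 2))
print("orange on white:     ", round(contrast_ratio(orange, white_background), 2))
```

A ratio close to 1 means the color nearly blends into the background. Comparing the ratio for each group color against the plot background is a quick way to spot the kind of problem described for Versicolor.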
However, when you increase the number of groups being plotted, using color becomes a lot trickier. Take these plots of city & highway mileage for different cars built in the 2000s:
The second palette is obviously an improvement, but it is still difficult to distinguish between elements, especially with all of the overlapping. We can try using a lighter color palette on a dark background, and then add different shapes to some of the groups to improve this:
These images still look chaotic. Grouped boxplots, on the other hand, are very organized, and you can see the important aspects of the data immediately:
The color palette doesn't even matter that much in this case, since there are labels at the top of each column. Some information is lost in this visualization, but the important information is much clearer.
In many cases, if there are too many categories to assign colors to, you will be better off doing one of the following:
facet_grid) can be a useful way of organizing information.
So far I have only looked at categorical data. However, using a proper color scale for continuous data is just as important.
I highly recommend using a perceptually uniform colormap. Many colormaps have wide regions in which the colors are nearly indistinguishable, which is a problem if you want to tell apart values that fall within such a region.
Take the following plot of CO2 levels at Mauna Loa, for example:
It's not a bad choice of color, but it is a bit drab, and smaller differences are difficult to make out. Compare this to the "inferno" palette from the cetcolor package in R:
Look at any cell in the middle on the first one and see if you can find where it fits in the legend. Do the same for the second graph. There is a night-and-day difference between the readability of these graphs, which is a result of the second palette being designed for optimal human readability.
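To get a feel for why some colormaps are harder to read than others, you can sample a colormap and look at how its perceived lightness (CIE L*) changes from step to step. The Python sketch below is an illustration under my own assumptions, not the code used for these plots: it builds a plain HSV rainbow with the standard library and compares it with a simple gray ramp, and the function names are hypothetical.

```python
import colorsys


def srgb_to_linear(c):
    """Undo the sRGB gamma curve for one channel given in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def lightness(rgb):
    """CIE L* (0 to 100) of an sRGB color given as floats in [0, 1]."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    d = 6 / 29
    f = y ** (1 / 3) if y > d ** 3 else y / (3 * d * d) + 4 / 29
    return 116 * f - 16


def lightness_steps(colors):
    """Change in L* between each pair of neighboring colors."""
    values = [lightness(c) for c in colors]
    return [b - a for a, b in zip(values, values[1:])]


n = 9
# HSV rainbow from red (hue 0) to blue (hue 2/3) at full saturation and value.
rainbow = [colorsys.hsv_to_rgb(2 / 3 * i / (n - 1), 1.0, 1.0) for i in range(n)]
# Gray ramp from black to white, used here only as a monotonic baseline.
gray = [(i / (n - 1),) * 3 for i in range(n)]

print("rainbow steps:", [round(s, 1) for s in lightness_steps(rainbow)])
print("gray steps:   ", [round(s, 1) for s in lightness_steps(gray)])
```

In the rainbow, lightness goes up and down several times, so equal steps in the data do not look like equal steps in color. A colormap designed for perceptual uniformity aims for lightness that changes steadily in one direction, which is what makes it easier to match a cell to the legend.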
One last criterion for choosing color I will mention is literally representing the color of whatever you're plotting. You still want groups to be distinguishable (in most cases), but it reduces the need for the reader to rely on the legend, since they can use their own intuition to know what the colors mean. Here is an example using the fixed and volatile acidity of white and red wines:
I don't really need a legend here, since the title mentions it is by color. In this case, the takeaway is obvious: white wines (from this data set) have lower and more consistent acidity than red wines.
The code I used to generate the images can be found on my GitHub.