February 26, 2020
Last fall I competed in my first triathlon, Tri the Illini, a sprint triathlon at the University of Illinois at Urbana-Champaign. The race consisted of a 300m swim in a pool, a 14-mile bike ride, and a 5K run, with a short transition between each event.
Naturally, I was confident I would do well in the running portion, as I run 5-6 times a week, and I had spent a few weeks training for the biking, which I was alright at. However, I had not swum in ages, and it was never really my strong point. Needless to say, I did pretty badly in the swimming portion. Out of 424 athletes, I finished 410th in swimming, 142nd in biking, and 10th in running. Here are a couple of plots, one of times and another of rankings, to show where I fell in the data.
I wouldn't normally discuss a single race on this blog about analytics, but it is a good example of how to identify outliers in numeric data, and of what might constitute an outlier. The variables I will use below are the swim, bike, and run times, as well as individuals' respective rankings in those events and their overall places.
An outlier is generally a data point that doesn't seem to belong to a group of data, whether that group is a slice of a dataset or the entire dataset itself. In some cases, a point may belong to the dataset but not be part of the data we want to analyze or model. For example, suppose we want to look at the effect of a drug on a population: if most of the population is 40-50 years old and a very small number are 80-90, the latter age group may be excluded, as they would have to be treated completely differently from a much younger group.
In other cases, our goal may be to identify anomalies, such as fraudulent behavior, individuals who don't belong to a group, and forged/falsified records. In that case, finding outliers and their characteristics is the goal.
In my case, we'll take a look at the considerations under which different methods of outlier identification can be used.
A very simple way of dealing with outliers is by removing data with values that are well outside the normal range (usually the top/bottom few percent or less). We might remove the fastest and slowest athletes if we were, for example, trying to determine the effect of a sports drink on performance. In that case, the purpose of identifying the outliers is to remove them because they would skew the model, mostly because of how long-tailed/sparse data can be at both ends (especially the slow end). If there's not enough data or a more complex model is not allowed, removing outliers may be appropriate.
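As a minimal sketch of this trimming approach, here's what removing the top and bottom few percent might look like. The finish times below are simulated with a long right tail to mimic race data; they are my own illustration, not the actual results:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated finish times in minutes: lognormal gives the long slow tail
# typical of race results (illustrative values, not the real data)
times = rng.lognormal(mean=4.5, sigma=0.2, size=424)

# Keep only athletes between the 2nd and 98th percentiles,
# dropping the fastest and slowest extremes before modeling
lo, hi = np.percentile(times, [2, 98])
trimmed = times[(times >= lo) & (times <= hi)]

print(len(times), len(trimmed))
```

The 2%/98% cutoffs are arbitrary here; in practice the threshold depends on how sparse the tails are and how sensitive the downstream model is.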
A simple way of identifying an outlier is making a simple rule, a heuristic, based on intuition that should identify more extreme characteristics. For the triathlon, each athlete who completed has a ranking for each event (as well as the transitions, which we will ignore for now). Generally, you would imagine that someone who is a triathlete would be well-balanced and train roughly equally among each event, while someone who is not would have wildly varying rankings (or just very poor rankings overall).
For this test, I simply calculate the standard deviation of the rankings for each athlete, and sort them in order.
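A small sketch of this heuristic, using a few hypothetical athletes (my own 410/142/10 rankings among them). I use the population standard deviation (`ddof=0`) here, which for my rankings comes out a bit over 160:

```python
import pandas as pd

# Hypothetical per-event rankings (1 = fastest); only the first row
# uses real numbers from the article, the rest are made up
df = pd.DataFrame({
    "swim_rank": [410, 5, 200],
    "bike_rank": [142, 8, 205],
    "run_rank":  [10, 3, 198],
}, index=["me", "balanced_elite", "balanced_mid"])

# Population standard deviation across the three event rankings,
# then sort so the most lopsided athletes come first
df["rank_sd"] = df[["swim_rank", "bike_rank", "run_rank"]].std(axis=1, ddof=0)
print(df.sort_values("rank_sd", ascending=False))
```

Well-balanced athletes, fast or slow, get a small standard deviation; athletes strong in only one or two events rise to the top of the sort.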
According to this metric, my standard deviation was a bit over 160, which was much higher than anyone else's. Interestingly, the slope of the line seems to steepen around the top 50, and the top 6 break away from the rest of the graph. Looking at the top 6, one other athlete and I were good at running, not as good at biking, and much worse at swimming. There were a couple of swimmers who did pretty poorly on the bike, and one runner who did much worse in both other events.
In this case, the heuristic flagged individuals who most likely were not training for all 3 events equally, and were instead focused primarily on one. The least outlier-like athletes were those at the very top and the very bottom of the overall standings.
Mahalanobis distance is a metric that measures the distance of points from the center of a cluster, adjusting for the variation and correlations within the cluster. For example, if we had three variables, age, net worth, and income of an individual, we would expect those to be all correlated positively. An old rich man with a high income might be an outlier in one sense since all 3 of those values are away from the average age/wealth/income, but a young rich person with low income would be even more anomalous, since without time to accumulate wealth or the apparent income required for it, it would seem extraordinary (and one would guess from an inheritance/trust fund).
The exact calculation is simple: multiply the centered data vector by the inverse covariance matrix, then multiply by the centered vector again. For the curious, this is what the formula looks like:

D²(x) = (x − μ)ᵀ Σ⁻¹ (x − μ)

where x is an individual's vector of values, μ is the vector of means, and Σ is the covariance matrix.
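Here's a sketch of that calculation in Python, applied to simulated correlated data (the covariance values below are my own illustration, not the race data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated standardized swim/bike/run times with positive correlations,
# roughly mimicking how the three events relate (illustrative only)
cov_true = np.array([[1.0, 0.6, 0.5],
                     [0.6, 1.0, 0.7],
                     [0.5, 0.7, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=cov_true, size=400)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
centered = X - mu

# Row-wise quadratic form: D^2 = (x - mu)^T  Sigma^{-1}  (x - mu)
d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
d = np.sqrt(d2)

print(d[np.argsort(-d)[:5]])  # the five most anomalous athletes
```

A point can have a large distance either by being far from the mean in every variable, or by having a combination of values that runs against the correlations, like the young rich person with low income above.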
Looking at the Mahalanobis distance of times, I come in at #8 for highest distance. What is unusual, though, is that of the top 10, I am the only one who is not far back in the overall standings. My overall performance is fairly average; it is the individual times, varying in opposite directions, that drive my distance up. The slowest two individuals are at the top of the list, which makes sense from an "outlier" perspective.
Looking at the Mahalanobis distances of rankings, the distribution of athletes is more varied, as the long-tailed nature of event times is not captured here. Since the rank variables are bounded and not skewed, the distance here more or less represents being "outside of the norm" for event ranking correlations, rather than for the event rankings themselves.
If you don't take correlations into account, all of the "outliers" are the slowest and fastest people. It is quite normal to have both slow and fast people in a race, so most of them really wouldn't be considered outliers in that sense. To better visualize the respective event and overall rankings of the individuals and their Mahalanobis distances, here are a couple plots that demonstrate the trends:
While using the raw times tends to produce outliers mostly from the long-tailed end of the slower times, the ranking of times tends to produce more individuals who do well in some events, and more poorly in others. Interestingly, most of these "outlier" athletes perform relatively average overall. This is another good way to detect athletes who have noticeable strengths and weaknesses that average each other out.
One final approach that I will use is clustering, where individuals are placed into a fixed number of groups based on their characteristics, and "outliers" may be either a group that is quite separate from all the others, or possibly just individuals who do not fit particularly close to any given group.
Below are dendrograms of the Tri the Illini athletes, where each marker on the right indicates an athlete, and the length of the branches extending from them indicates how "far away" they are from their closest neighbors. Groups of athletes that are joined together in this tree tend to have average times similar to each other. Red markers were fastest overall, blue the slowest, and black/darker colors somewhere in between. I've highlighted myself with a semi-transparent marker.
Note: I've scaled the swim, bike, and run times when performing distance calculations.
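A sketch of the scaling and clustering steps, using SciPy's agglomerative hierarchical clustering on simulated event times (all numbers and the choice of Ward linkage are my own illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

rng = np.random.default_rng(2)
# Simulated swim/bike/run times in seconds for 50 athletes (illustrative)
times = np.column_stack([
    rng.normal(420, 90, 50),    # swim
    rng.normal(2700, 300, 50),  # bike
    rng.normal(1500, 200, 50),  # run
])

# Scale each event so the longer bike times don't dominate the distances
scaled = zscore(times, axis=0)

# Agglomerative hierarchical clustering; the resulting linkage matrix Z
# is what a dendrogram plot would be drawn from
Z = linkage(scaled, method="ward")

# Cut the tree into 4 clusters; very small clusters, or points that join
# the tree only at a great height, are outlier candidates
labels = fcluster(Z, t=4, criterion="maxclust")
print(np.bincount(labels)[1:])  # cluster sizes
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would produce a tree like the ones shown here, with branch heights indicating how far apart the merged groups are.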
One way to create clusters from the dendrograms is to imagine a vertical line cutting off all the branches to the left, leaving the unconnected branches on the right in their own clusters. For the dendrogram based on times, it is easy to see 3 or 4 clusters, depending on whether the bottom area with the blue markers counts as one or two. The one based on ranks can also be cut into 3 or 4, but it's easier to make the case for 4 than for 3 if you want more coherent clusters.
However, if we want to identify outliers, we would look at the branches with the greatest height, or at very small groups that only join any other group at a very great height. The bottom dozen or so could definitely be considered an outlier group. I've manually marked a few individuals/small groups that stood out using this technique.
As a final remark, clustering to detect outliers will produce drastically different results depending on what technique/algorithm you use, and the sample size can greatly affect the interpretation, as well. The above was an agglomerative hierarchical clustering technique, but there are many others that can be used.
There are a few different methods of identifying outliers that have been outlined:

* Removing values well outside the normal range (trimming the extremes)
* Simple heuristics, such as the standard deviation of an athlete's event rankings
* Mahalanobis distance, applied to either times or rankings
* Clustering, looking for small or distant groups in a dendrogram
Of course, the above only really holds true for this data, and possibly other triathlon data, but it should give you an idea of some basic techniques to try when looking at numeric data. For other sorts of data, such as collections of text or other models with a large number of variables, you will likely be more worried about sample size and weird combinations that give certain data points too much influence over some variables. See my article When Leverage Overshadows Regularization, where I describe outliers in a model-oriented context.
Lastly: I really need to work on my swimming...
Code used for this project/article is hosted here: https://github.com/mcandocia/tri-outlier