Calculating Similarity of Running Routes

By Max Candocia

|

September 13, 2020

Overview

Measuring the closeness of two paths on a map is a tricky task. There's no universal algorithm to measure similarity, and the nature of the task itself depends on the purpose and the data.

I run quite a bit, and with my 600+ runs I've measured on my Garmin GPS watch, I would like to be able to know/visualize the following:

  1. Which of my runs are "repeats" of each other. That is, they are very similar in path.
  2. Which of my runs are the "reverse" of each other? This usually means the reverse of a loop, but it applies to other shapes, too.
  3. How similar are my runs to each other over time? How much do my routes vary?

With a python script utilizing rasterization, that is, placing latitude & longitude points into discrete bins, I will show how this can be done with two similarity metrics: one that doesn't take direction into account, and one that does.

Note: The underlying code for this is a bit complicated, so it is not shown here, but it is available on my GitHub.

Results

The algorithm is a bit tedious, so I will show the results here, which are motivated by the similarities calculated by the algorithms in the below sections.

Identifying Running Route Patterns

A lot of my runs are similar routes. A few of them are fixed-distance, but others are out-and-back, where I will pick a general route, and then follow it until I am halfway done, and then I turn back along the same route.

Using similarity, I can visualize how my runs relate to each other over time in terms of what routes I take.

plot of chunk tile_sim

A lot of my runs are similar, as they go along the Chicago Lakefront Trail. These are the yellow/red-orange streaks throughout the graph. Yellow dots are the ones that have the closest distance to each other.

You can also see where my route change, since the trail was closed from March through the early summer this year on order of the mayor. Even since it's opened I've avoided it for the most part, and I've found new routes that I can run on, since the trail is still too crowded for my liking most of the time.

An alternate take on this is by looking at the graph over time rather than run #:

plot of chunk time_tile_sim

Since there were periods of time I did not run—usually due to sickness or injury, or just rest days in general—there are a lot of gaps that you can see. I've also darkened the background to accentuate the higher values.

Algorithms

Grouping

The first step, which is relatively simple, is grouping different tracks together, so that there are clusters of tracks that do not touch each other at all between clusters. This is primarily done so that I can use the median latitude of each group as the basis for the longitudinal width of the bins within a group. If I didn't, then if I had a track near the equator, those bins would be about 40% wider than ones in Wisconsin.

Steps:

  1. Calculate the bounding box of each track, i.e., the minimum and maximum longitude and latitude of each track.
  2. Increase the size of the box by an amount equal to the maximum distance that you want to consider two tracks "touching" each other. The motivation for this will be explained in the next section
  3. For the first track, assign it to a new group
  4. For each remaining track, assign it to an existing group if its bounding box overlaps with any member of that group. Otherwise create a new group with just that member
  5. Once all tracks are assigned a group, for each group, check to see if any members of that group overlap with any members of another group. If one is found, then merge the two groups
  6. Repeat step 5 until no new merges are made
  7. Calculate the median longitude of each group's members. This doesn't have to be super-precise, but this is the basis for mapping points of the tracks to the grid.

Once this step is done, rasterization can begin

Rasterization

The second step in this process is rasterization of all of the points. For the examples here, I place each point into a roughly 15-meter by 15-meter bin. I say roughly because as latitude changes, the number of miles/kilometers per degree longitude changes, and because I want these tiles to line up with each other on a rectangular grid, the east-west length of each of the bins varies slightly for different degrees of latitude

An example of what this looks like on a map with a 15-meter by 15-meter grid is the following run from back in 2018:

plot of chunk raster_example

Note how even along straight streets, the points on the rasterized grid go up and down due to shakiness in GPS data. For the most part, my actual running does not vary this much. It is just a natural artifact of small amounts of GPS error that end up bouncing points across the borders of different bins.

The way I resolve this problem is by measuring a certain distance out from each bin. For computational ease, I use all bins on a grid a certain Manhattan distance, from the others. As opposed to our normal way of measuring distance, where we use the Pythagorean theorem $$ d^2 \sqrt{x^2 + y^2} $$ we instead use $$ d = x + y $$

For our purposes, there isn't much of an advantage of doing it this way apart from simplifying calculations and making the code run faster. The only noticeable effect is that diagonal paths (NE/NW/SE/SW) are more vulnerable to error than N/S/E/W paths. An example of the above map using 5 additional bins away from the center, for a maximum width of \(15 \frac{m}{bin} \times 9\;bins = 135m\), which is wide, but useful for measuring similarity.

plot of chunk raster_expanded_example

As you can see, it is much wider, and it looks more "straight", in spite of the wobbles in the data.

Non-directional Similarity

The simpler version of similarity does not take direction into account. Instead, it looks at the overlaps of each points, as well as assigning weights to each point. Weights are assigned as follows:

  1. First, choose a certain amount of time to allow the same bins to be traversed before traversing that bin again counts as "overlap". I use 60 seconds, since it corresponds to about a 12 minutes per mile pace to clear 135 meters.
  2. Second, define a weight function to describe how much weight should be given to a bin based on its Manhattan distance to the center bin. This can be tricky, but I chose to have very high weights for all except the outermost bins.
  3. For each track, calculate the weight of each bin by iterating through each point in time, and for each bin,
    • If at a point in time, the track hs not been encountered within the 60-second window, increment its weight by the appropriate value and make note of that value
    • If the point has been encountered within that time, but the weight is greater than the previous weight recorded, reset the 60-second window to start at that time point, and increase its weight by the difference of the weights, making note of the higher weight
      • For each track, calculate the norm, which is the square root of the sum of the squares of each bin.
      • For each pair of tracks within a group, for each shared bin, add the products of the two weights for each of the bins. Divide that product by the product of the norms of those tracks.

Below is an example of 3 routes, with the accompanying similarity matrix, where 0 = no similarity and 1 = identical.

plot of chunk nd_similarity_example

Similarity Matrix

id1 127 175 270
127 1.00 0.92 0.28
175 0.92 1.00 0.29
270 0.28 0.29 1.00

As you can see, the two tracks in the southern half of the map have a high correlation of 0.92, while the the one on the top only shares a correlation of about 0.28 with the others.

Directional Similarity

A more complicated way of comparing two paths is by taking direction into account. The way I wanted to define this similarity is one that would result in the following scenarios:

  1. If two paths perfectly overlap in the same direction half the time, and then abruptly diverge, the similarity is \(0.5\)
  2. If two paths are identical circles, except one is clockswise and the other is counterclockwise, the similarity is \(-1\)
  3. If two paths are back-and-forth paths, and one goes east-west-east, and the other west-east-west, the similarity should be roughly \(-1\)

Yes, I know similarity should not be negative, but it the simplest term I can think of for it.

Regardless, given the above, the algorithm is very similar to the one for non-directional similarity. However, the following changes must be made:

  1. Instead of adding weights across different 60-second time frames for a given bin, a new "entry" is created in a list for that bin, so the weight and angle of that bin can be recorded for each window
  2. The direction/angle of each bin is calculated using the transform $$ \text{average angle} = \text{atan2}\left(\sum{\sin(angle)}, \sum{\cos(angle)}\right)$$
  3. When calculating the norms and similarities, they are done element-wise, in-order for each "entry" of each bin. The cosine between the two angles of the bins is multiplied by the weights of each of the bins, and these products are added together. If one track has more entries in a bin than another track, then any entries beyond the length of the shortest's are skipped.
  4. To calculate the angle, I used a 5-second lag between points to help de-noise wobbles in the data. A longer delay may be appropriate, although this also means the end of the track is forward-filled with the last calculable angle more

An example of what one of these tracks looks like with the direction included in the bin (for the first instance of each bin):

plot of chunk directional_raster_example

Note that there are some artifacts from changing directions, as well as stopping and starting again slightly behind where the stop was. Adding direction makes this quite a bit noisier, and difficult to compare. Here's the similarity matrix for the 3 example tracks from the non-directional example:

Directional Similarity Matrix

id1 127 175 270
127 1.00 0.90 -0.16
175 0.90 1.00 -0.16
270 -0.16 -0.16 1.00

You can see that run #270 has a negative similarity to the other ones. This is a result of the run starting east on the road that's traveled westward for the other runs.

Possible Changes in Future

  1. Apply a transform to each individual bin's weight for non-directional transforms to reduce the effect of overlap. I think a square-root transform for values above 1 would be ideal.
  2. Use Euclidean distance for more consistency
  3. Use a more forgiving angle-difference metric (e.g., pull values near 0 and 180 degrees closer to those values)

GitHub

https://github.com/mcandocia/track_sim

Citations

D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf


Tags: 

Recommended Articles

Dealing with Zeros and Negative Values with a Log Scale

When plotting data, you may want to use a log-scale for most of your data, but zeros, near-zero values, and negative values make this impossible. With piecewise linear and logarithmic functions, however, this effect can still be achieved.

Pacing at the Boston Marathon