September 13, 2020
Measuring the closeness of two paths on a map is a tricky task. There's no universal algorithm to measure similarity, and the nature of the task itself depends on the purpose and the data.
I run quite a bit, and with my 600+ runs I've measured on my Garmin GPS watch, I would like to be able to know/visualize the following:
With a python script utilizing rasterization, that is, placing latitude & longitude points into discrete bins, I will show how this can be done with two similarity metrics: one that doesn't take direction into account, and one that does.
Note: The underlying code for this is a bit complicated, so it is not shown here, but it is available on my GitHub.
The algorithm is a bit tedious, so I will show the results here, which are motivated by the similarities calculated by the algorithms in the below sections.
A lot of my runs are similar routes. A few of them are fixed-distance, but others are out-and-back, where I will pick a general route, and then follow it until I am halfway done, and then I turn back along the same route.
Using similarity, I can visualize how my runs relate to each other over time in terms of what routes I take.
A lot of my runs are similar, as they go along the Chicago Lakefront Trail. These are the yellow/red-orange streaks throughout the graph. Yellow dots are the ones that have the closest distance to each other.
You can also see where my route change, since the trail was closed from March through the early summer this year on order of the mayor. Even since it's opened I've avoided it for the most part, and I've found new routes that I can run on, since the trail is still too crowded for my liking most of the time.
An alternate take on this is by looking at the graph over time rather than run #:
Since there were periods of time I did not run—usually due to sickness or injury, or just rest days in general—there are a lot of gaps that you can see. I've also darkened the background to accentuate the higher values.
The first step, which is relatively simple, is grouping different tracks together, so that there are clusters of tracks that do not touch each other at all between clusters. This is primarily done so that I can use the median latitude of each group as the basis for the longitudinal width of the bins within a group. If I didn't, then if I had a track near the equator, those bins would be about 40% wider than ones in Wisconsin.
Steps:
Once this step is done, rasterization can begin
The second step in this process is rasterization of all of the points. For the examples here, I place each point into a roughly 15-meter by 15-meter bin. I say roughly because as latitude changes, the number of miles/kilometers per degree longitude changes, and because I want these tiles to line up with each other on a rectangular grid, the east-west length of each of the bins varies slightly for different degrees of latitude
An example of what this looks like on a map with a 15-meter by 15-meter grid is the following run from back in 2018:
Note how even along straight streets, the points on the rasterized grid go up and down due to shakiness in GPS data. For the most part, my actual running does not vary this much. It is just a natural artifact of small amounts of GPS error that end up bouncing points across the borders of different bins.
The way I resolve this problem is by measuring a certain distance out from each bin. For computational ease, I use all bins on a grid a certain Manhattan distance, from the others. As opposed to our normal way of measuring distance, where we use the Pythagorean theorem $$ d^2 \sqrt{x^2 + y^2} $$ we instead use $$ d = x + y $$
For our purposes, there isn't much of an advantage of doing it this way apart from simplifying calculations and making the code run faster. The only noticeable effect is that diagonal paths (NE/NW/SE/SW) are more vulnerable to error than N/S/E/W paths. An example of the above map using 5 additional bins away from the center, for a maximum width of \(15 \frac{m}{bin} \times 9\;bins = 135m\), which is wide, but useful for measuring similarity.
As you can see, it is much wider, and it looks more "straight", in spite of the wobbles in the data.
The simpler version of similarity does not take direction into account. Instead, it looks at the overlaps of each points, as well as assigning weights to each point. Weights are assigned as follows:
Below is an example of 3 routes, with the accompanying similarity matrix, where 0 = no similarity and 1 = identical.
id1 | 127 | 175 | 270 |
---|---|---|---|
127 | 1.00 | 0.92 | 0.28 |
175 | 0.92 | 1.00 | 0.29 |
270 | 0.28 | 0.29 | 1.00 |
As you can see, the two tracks in the southern half of the map have a high correlation of 0.92, while the the one on the top only shares a correlation of about 0.28 with the others.
A more complicated way of comparing two paths is by taking direction into account. The way I wanted to define this similarity is one that would result in the following scenarios:
Yes, I know similarity should not be negative, but it the simplest term I can think of for it.
Regardless, given the above, the algorithm is very similar to the one for non-directional similarity. However, the following changes must be made:
An example of what one of these tracks looks like with the direction included in the bin (for the first instance of each bin):
Note that there are some artifacts from changing directions, as well as stopping and starting again slightly behind where the stop was. Adding direction makes this quite a bit noisier, and difficult to compare. Here's the similarity matrix for the 3 example tracks from the non-directional example:
id1 | 127 | 175 | 270 |
---|---|---|---|
127 | 1.00 | 0.90 | -0.16 |
175 | 0.90 | 1.00 | -0.16 |
270 | -0.16 | -0.16 | 1.00 |
You can see that run #270 has a negative similarity to the other ones. This is a result of the run starting east on the road that's traveled westward for the other runs.
https://github.com/mcandocia/track_sim
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
Tags: