A Tool for Censoring Geographic Data

By Max Candocia


October 03, 2017

Imagine you have a bunch of GPS data recorded from bike rides, runs, etc. You might want to share it with someone, but you are worried that they might find out too much information about you based on where your activities begin and end. Ideally, you should be able to censor segments that are too close to your living spaces, work spaces, etc.

This feature exists on some activity social media sites, such as strava.com, but if you want to process your own data, I wrote a script that can do that for you. Here is the GitHub repository for the code: https://github.com/mcandocia/examples/tree/master/censor_geo_data.


Below is a visualization made with two different configurations of the code. You can find the sample data that I used here.


The script, censor_and_package.py, is configured by hand. It searches the directories you provide for CSV and GPX files and censors the fields you request whenever a point falls within a certain radius of a set of (longitude, latitude) coordinates. You can provide those coordinates either in the program file or in a CSV (recommended). Fields can also be dropped silently by setting 'timestamp' and 'time' in the CENSOR_PARAMS object, which apply to CSV and GPX files, respectively.
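
To illustrate the radius check described above, here is a minimal sketch and not the code from the repository. It assumes each point is a dict with "longitude" and "latitude" keys in decimal degrees and that each censor center is a (longitude, latitude, radius in meters) tuple; the function names and the choice of the haversine formula are my own assumptions for the example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def censor_rows(rows, centers, fields=("longitude", "latitude")):
    """Blank out `fields` in every row that lies inside any censor circle.

    rows    -- list of dicts with "longitude" and "latitude" keys (assumed names)
    centers -- list of (longitude, latitude, radius_m) tuples
    """
    censored = []
    for row in rows:
        inside = any(
            haversine_m(row["longitude"], row["latitude"], lon, lat) <= radius
            for lon, lat, radius in centers
        )
        if inside:
            # Copy the row so the caller's data is left untouched.
            row = {k: (None if k in fields else v) for k, v in row.items()}
        censored.append(row)
    return censored
```

Blanking fields rather than deleting rows keeps the remaining columns, such as heart rate or cadence, available if you still want to share them.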

The output is written to a new directory that you specify, optionally along with a zip archive of that directory. Your original data is not overwritten unless you set the output directory to be the same as the root directory.
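
The output step can be sketched roughly as follows. This is an illustration using only the standard library, not the repository's implementation; the function name package_output and its parameters are hypothetical, and the refusal to write into the root directory is a safeguard I added for the example.

```python
import shutil
from pathlib import Path


def package_output(root_dir, output_dir, make_zip=True):
    """Prepare a separate output directory and optionally zip it.

    Raises ValueError if the output directory is the same as the root
    directory, since that is the one case where originals could be overwritten.
    Returns the path of the zip archive, or None if no archive was requested.
    """
    root = Path(root_dir).resolve()
    out = Path(output_dir).resolve()
    if out == root:
        raise ValueError("output directory must differ from the root directory")
    out.mkdir(parents=True, exist_ok=True)

    # ... the censored files would be written into `out` at this point ...

    if make_zip:
        # Creates "<output_dir>.zip" next to the output directory.
        archive = shutil.make_archive(str(out), "zip", root_dir=str(out))
        return Path(archive)
    return None
```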

The program requires BeautifulSoup 4 (for GPX processing) and numpy. To use it, you only need to change the constants, unless you want to extract additional information via the program (even though it is meant to remove information). For more details, see the README.md or the comments in the code.


Regardless of how much you censor, if there is any pattern to your data, there are still ways for others to abuse it. Removing points close to your home or work makes your personal life harder to infer, but it is no guarantee. I would recommend specifying several points in your general area, not exactly centered on your points of interest, so that it is harder to guess the reasoning behind the censoring criteria.
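
One way to pick an off-center point is to shift the true location by a random distance and direction. The sketch below is only an illustration of that idea; the function name offset_center, the flat-earth approximation, and the 111,320 meters-per-degree constant are my own assumptions, and none of this is part of the script.

```python
import math
import random

METERS_PER_DEGREE_LAT = 111_320.0  # rough approximation, good enough for short offsets


def offset_center(lon, lat, max_offset_m, rng=random):
    """Return a (lon, lat) shifted by a random offset of at most max_offset_m."""
    # The square root makes the offset uniform over the disk, not clustered
    # near the true point.
    distance = max_offset_m * math.sqrt(rng.random())
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    dlat = distance * math.cos(bearing) / METERS_PER_DEGREE_LAT
    dlon = distance * math.sin(bearing) / (
        METERS_PER_DEGREE_LAT * math.cos(math.radians(lat))
    )
    return lon + dlon, lat + dlat
```

If you do this, generate the offset once and reuse it, since fresh offsets on every export could be averaged to recover the true location. The censor radius should also be larger than the maximum offset so that the real point of interest is still covered.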

