# Reverse-Engineering Strava's Calorie Estimator with Difficulty


September 19, 2017

Since I started road biking in Texas, I have used Strava to record my bike rides and, more recently, my runs. One thing that always confused me was how its Calorie estimator works. To get an exact value, you need to be hooked up to scientific equipment that monitors your breathing. Because that is impractical outside of a lab, Strava can only use GPS data and some personal information, such as weight and sex.

The number of Calories burned while biking is mostly a result of the following factors:

1. The energy it takes to accelerate the bicycle & rider
2. The energy it takes to go up a hill
3. Air resistance, for which the power required (energy per unit time) grows with the cube of the speed. E.g., overcoming air resistance takes 8x as much power at 20 mph as it does at 10 mph
4. Rolling resistance, which is a result of the bike tire deforming. The power expended is proportional to velocity, and it usually has less of an effect than air resistance at speeds over 10 mph.
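As a rough illustration of items 3 and 4, the sketch below models air resistance power as `c3 * v**3` and rolling resistance power as `c1 * v`. The function names are hypothetical, and the default coefficients (C1 = 3.0, C3 = 0.2, in SI units) are simply the "realistic" values quoted in the Coefficient Estimation section, not measurements.

```python
MPH_TO_MS = 0.44704  # metres per second in one mile per hour

def rolling_power(speed_ms, c1=3.0):
    """Rolling resistance power in watts: proportional to speed."""
    return c1 * speed_ms

def air_power(speed_ms, c3=0.2):
    """Air resistance power in watts: proportional to speed cubed."""
    return c3 * speed_ms ** 3

def crossover_speed_mph(c1=3.0, c3=0.2):
    """Speed at which air resistance power equals rolling resistance power."""
    # c1 * v == c3 * v**3  =>  v == sqrt(c1 / c3)
    return (c1 / c3) ** 0.5 / MPH_TO_MS

# Doubling the speed multiplies air resistance power by 2**3 = 8
ratio = air_power(20 * MPH_TO_MS) / air_power(10 * MPH_TO_MS)
print(f"air power at 20 mph vs 10 mph: {ratio:.1f}x")
print(f"air overtakes rolling resistance near {crossover_speed_mph():.1f} mph")
```

With these assumed coefficients the crossover lands a little under 9 mph, which is in line with the 10 mph rule of thumb above.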

I noticed that when I was biking, the estimates could vary wildly. Take these rows, for example:

| Date | Distance (mi) | Elevation Change (ft) | Average Speed (mph) | Estimated Calories |
|---|---|---|---|---|
| 08/14/2016 | 113.0 | 2,456 | 15.2 | 3,204 |
| 08/27/2016 | 100.9 | 1,291 | 17.4 | 1,970 |
| 11/06/2016 | 60.3 | 1,088 | 16.9 | 1,865 |
| 08/12/2017 | 52.8 | 762 | 17.2 | 1,805 |
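One way to make these rows comparable is to divide each estimate by the distance. The snippet below is a small sketch that does this for the four rides listed; the variable and function names are only illustrative.

```python
# (date, distance in miles, Strava's estimated Calories), copied from the rows above
rides = [
    ("08/14/2016", 113.0, 3204),
    ("08/27/2016", 100.9, 1970),
    ("11/06/2016", 60.3, 1865),
    ("08/12/2017", 52.8, 1805),
]

def calories_per_mile(distance_mi, calories):
    """Estimated Calories burned per mile ridden."""
    return calories / distance_mi

per_mile = {date: calories_per_mile(dist, cal) for date, dist, cal in rides}
for date, value in per_mile.items():
    print(f"{date}: {value:.1f} Calories/mile")
```

The 08/27/2016 ride works out to roughly 20 Calories per mile, while the other three fall between roughly 28 and 34.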

In the above table, it is obvious that even with similar distances/average speeds/elevation change, the estimate can vary wildly. The second row, which was the Hotter'N Hell Hundred, has barely any more Calories than the following two rides, which were slightly slower on average and about half the distance!

This ride is atypical, though, in that most of it was clear of any sort of stops, whereas almost every other bike ride that I go on, including the other three, involves constant starting and stopping, which requires more energy overall, especially in areas with a lot of traffic. Because of this major discrepancy, I am interested in seeing what

## Basic Insights with Linear Regression

Before I dive into the specifics of analyzing individual time segments of the GPS data, I run a linear regression on 76 of my bike rides to see what the most significant factors in Strava's estimate are.

The considered variables are as follows:

• distance
• time spent moving
• average speed
• elevation change
• whether my GPS device (Cateye Stealth 50) or the Strava iPhone app was used to record the route

Below are the coefficients for the three regression models I tested (using stepwise selection):

1. A model that only considers the distance, duration, average speed, and elevation changes
2. A model that includes the above, as well as the GPS device used
3. A model that includes the above, plus interaction terms between the GPS device and the other variables

| | Calories: B | CI | p | Calories (w/Device): B | CI | p | Calories (w/Device Interactions): B | CI | p |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 10.60 | -72.25 – 93.46 | .799 | -122.58 | -215.81 – -29.35 | .011 | -34.16 | -120.60 – 52.28 | .433 |
| Ride Duration | 6.75 | 6.23 – 7.27 | <.001 | | | | 8.77 | 1.11 – 16.43 | .025 |
| Ride Distance | | | | 40.95 | 27.23 – 54.66 | <.001 | -9.70 | -68.91 – 49.50 | .745 |
| Strava App Used? | | | | 214.20 | 131.27 – 297.13 | <.001 | 33.41 | -81.63 – 148.46 | .564 |
| Distance * Average Speed | | | | -0.89 | -1.71 – -0.06 | .035 | 0.08 | -1.92 – 2.07 | .937 |
| Distance (If Strava App Used) | | | | | | | -43.69 | -67.84 – -19.54 | <.001 |
| Distance * Average Speed (If Strava App Used) | | | | | | | 3.14 | 1.68 – 4.59 | <.001 |
| Observations | 77 | | | 77 | | | 77 | | |
| R2 / adj. R2 | .900 / .899 | | | .930 / .927 | | | .960 / .956 | | |
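For reference, the R2 and adjusted R2 figures reported for these models follow the standard definitions, sketched below. The function names and the sample numbers are illustrative assumptions and are not taken from the analysis code.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that `predicted` accounts for."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - mean) ** 2 for a in actual)                  # total
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R2 for the number of predictors used by the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Made-up Calorie values, purely to exercise the functions
actual = [1000, 1500, 2000, 2500]
predicted = [1100, 1400, 2100, 2400]
print(round(r_squared(actual, predicted), 3))
```

Because the adjustment grows with the number of predictors, adjusted R2 is the fairer figure to use when comparing models of different sizes.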

With respect to the first model, 0.9 is a pretty good value for R2. It means that 90% of the variance in Strava's estimates is explained by the model. This is a good start. However, the following two models indicate that there is a discrepancy between rides recorded using the Cateye device and rides recorded using the Strava iPhone app. When the iPhone app is used, the estimates tend to be higher than their Cateye counterparts (assuming an average speed of at least 15 mph). Below are three plots of the Strava estimate vs. the model estimates, with shape indicators for recording device:

## Strava's Explanations

On its website, Strava gives an explanation for its biking Calorie estimate:

• For biking, the power (energy per unit time) is calculated with `P(total) = P(rolling resistance) + P(wind) + P(gravity) + P(acceleration)`. Rolling resistance and wind contributions are functions of the bike's speed and of its speed cubed, respectively. Gravity refers to energy expended/saved by going up/down hills, and acceleration is the energy from speeding the bike up
• On a feedback survey, Strava indicates that it estimates that 21% of the Calories burned go into moving the bike
• In that survey, it also indicates that it ignores regions with terrible GPS signals
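Written out, one common physical form of that sum is shown below. This is a sketch under assumptions: $C_1$ and $C_3$ (rolling and air resistance coefficients), $m$ (combined mass of rider and bike), $g$ (gravitational acceleration), $h$ (elevation), and $v$ (speed) are notation chosen here, and the explanation quoted above does not give Strava's exact expressions or coefficient values.

$$
\begin{aligned}
P_{\text{total}} &= P_{\text{rolling}} + P_{\text{wind}} + P_{\text{gravity}} + P_{\text{acceleration}} \\
&= C_1 v + C_3 v^3 + m g \frac{dh}{dt} + m v \frac{dv}{dt}
\end{aligned}
$$

If 21% of the Calories burned go into moving the bike, the energy burned follows from the mechanical energy as

$$
E_{\text{burned}} \approx \frac{E_{\text{mechanical}}}{0.21}, \qquad E_{\text{mechanical}} = \int P_{\text{total}} \, dt
$$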

Obviously, the linear regression above is not going to achieve 100% accuracy, at least not without overfitting the model, which isn't too hard to do with a low sample size. Additionally, the above models do not generalize well, since they are sensitive to the devices and data that I used. The next step will be using GPS data to manually calculate energy consumed and estimate the coefficients used in Strava's air resistance and rolling resistance estimates.

## Estimating Energy with GPS Data

The next step is to try directly computing the energy required to cycle, estimating the air resistance and rolling resistance coefficients, and comparing the numbers to 21% of Strava's Calorie estimate.

### Extracting Speed & Smoothing GPS Data

The GPS data is stored in .gpx files, which contain the elevation, latitude/longitude, and timestamp of each point recorded by the GPS. One thing I noticed right off the bat was that the Cateye was very reliable in recording every single second (except when paused due to breaks), whereas the iPhone app data was riddled with 2-second gaps between time points, which calls some of the accuracy of that data into question.

Smoothing is very important because of the way energy is calculated. An average speed is not useful when air resistance power grows with the cube of the speed, so time spent at higher speeds costs disproportionately more energy. Knowing when the bike is accelerating vs. braking vs. maintaining speed is also very important.
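A minimal sketch of this step is shown below, using only the Python standard library. It assumes GPX 1.1 files with whole-second UTC timestamps, and it uses a centred moving average as a stand-in smoother; the function names and the smoothing choice are assumptions, not necessarily what the linked code does.

```python
import math
import xml.etree.ElementTree as ET
from datetime import datetime

GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}  # GPX 1.1 namespace
EARTH_RADIUS_M = 6371000.0

def parse_gpx(gpx_text):
    """Return a list of (time, lat, lon, elevation_m) tuples from GPX text."""
    root = ET.fromstring(gpx_text)
    points = []
    for pt in root.iterfind(".//gpx:trkpt", GPX_NS):
        stamp = pt.find("gpx:time", GPX_NS).text
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        ele = float(pt.find("gpx:ele", GPX_NS).text)
        points.append((when, float(pt.get("lat")), float(pt.get("lon")), ele))
    return points

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def speeds_ms(points):
    """Speed over each interval, dividing by the actual time gap (1 s or 2 s)."""
    out = []
    for (t1, la1, lo1, _), (t2, la2, lo2, _) in zip(points, points[1:]):
        dt = (t2 - t1).total_seconds()
        out.append(haversine_m(la1, lo1, la2, lo2) / dt)
    return out

def moving_average(values, window=5):
    """Centred moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Tiny made-up track with a 2-second gap before the last point
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="example">
  <trk><trkseg>
    <trkpt lat="32.00000" lon="-97.0"><ele>200.0</ele><time>2017-08-12T12:00:00Z</time></trkpt>
    <trkpt lat="32.00007" lon="-97.0"><ele>200.5</ele><time>2017-08-12T12:00:01Z</time></trkpt>
    <trkpt lat="32.00014" lon="-97.0"><ele>201.0</ele><time>2017-08-12T12:00:03Z</time></trkpt>
  </trkseg></trk>
</gpx>"""

track = parse_gpx(SAMPLE)
print(moving_average(speeds_ms(track), window=3))
```

Dividing by the real time gap rather than assuming one sample per second matters for the iPhone recordings, where a 2-second gap would otherwise look like a burst of double speed.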

1. First, I calculate the changes in kinetic energy (due to speed), potential energy (due to change in elevation), rolling resistance, and air resistance over an interval.
2. I add all the terms together for each point in time. If the sum is less than zero (i.e., more energy was lost than produced), I assume that I was coasting/braking, which consumes no energy on my part.
3. The energy calculations for each ride are added together and scaled by 1/0.21, and the result is compared to Strava's Calorie estimate.
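The three steps above can be sketched as follows. This is an illustration under assumptions rather than the code behind the results: the function name, the use of the interval's mean speed for the resistance terms, and the default `mass_kg` (a placeholder value) are all hypothetical, and the default coefficients are the "realistic" C1 = 3.0 and C3 = 0.2.

```python
G = 9.81                  # gravitational acceleration, m/s^2
JOULES_PER_KCAL = 4184.0  # one food Calorie (kcal) in joules
EFFICIENCY = 0.21         # share of Calories that moves the bike, per Strava's survey

def ride_calories(speeds, elevations, dt=1.0, mass_kg=85.0, c1=3.0, c3=0.2):
    """Estimate Calories from per-point speed (m/s) and elevation (m) samples."""
    work_j = 0.0
    for i in range(1, len(speeds)):
        v0, v1 = speeds[i - 1], speeds[i]
        v_mid = (v0 + v1) / 2
        # Step 1: energy terms over this interval, in joules
        kinetic = 0.5 * mass_kg * (v1 ** 2 - v0 ** 2)
        potential = mass_kg * G * (elevations[i] - elevations[i - 1])
        rolling = c1 * v_mid * dt
        air = c3 * v_mid ** 3 * dt
        # Step 2: a negative total means coasting/braking, which costs nothing
        work_j += max(kinetic + potential + rolling + air, 0.0)
    # Step 3: scale mechanical work by 1/0.21 and convert to Calories
    return work_j / EFFICIENCY / JOULES_PER_KCAL

# One hour at a steady 10 m/s on flat ground, sampled once per second
steady = ride_calories([10.0] * 3601, [0.0] * 3601)
print(f"{steady:.0f} Calories")
```

In the steady case only the resistance terms contribute, so the result depends on C1 and C3 alone and not on the assumed mass.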

### Coefficient Estimation

I use the variable C1 to describe the value multiplied by the speed in meters per second to estimate the rolling resistance in watts. Similarly, I use the variable C3 to describe the value multiplied by the cube of the speed to estimate the air resistance. With R's `optim()` function, and bounds of 0.09 to 12 for both variables, I get estimates of C1 = 6.754 and C3 = 0.09 (the lower bound). These values are somewhat unrealistic, as that is a very low air resistance and a very high rolling resistance. The calculation does worse than any of the above regressions, with an R2 of 0.83. If I put in realistic values of C1 = 3.0 and C3 = 0.2, the prediction is even worse, with an R2 of 0.73. Below are graphs of the comparisons:

And comparing the contributions each source of resistance makes, it is obvious that the "optimized" model places way too much weight on rolling resistance:

## Conclusions (for now)

It appears that I can get a semblance of a description of what goes on in Strava's calculations, and here are a few things I know for sure:

• Strava uses a reasonable amount of smoothing, although not necessarily the same type I used, to perform calculations
• There are discrepancies between using the Strava app and Cateye for recording. It is possible that Strava considers more of the data taken from a phone GPS as "unusable", but I cannot know that for sure
• Judging by my data set, my weight + my bike's weight did not come into play very much. Most of my rides were pretty flat, and the net change in elevation was always zero.

These, however, are still mysteries:

• Why is there such a huge discrepancy between three of my longest rides in terms of Calories/distance? Their energy calculations did not appear to be very different.
• Are there any geographic corrections/interpolations that Strava makes when making these calculations?
• Why does the wind resistance calculation not seem to be very significant in these calculations? It should be a majority of the power for bike rides above 15 mph.
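To put numbers on the last question, the sketch below compares how much of the resistance power comes from air at 15 mph under the two coefficient pairs from the Coefficient Estimation section. The function name is hypothetical, and only rolling and air resistance are considered (flat road, constant speed).

```python
MPH_TO_MS = 0.44704  # metres per second in one mile per hour

def air_share(speed_mph, c1, c3):
    """Fraction of resistance power (rolling + air) that is due to air."""
    v = speed_mph * MPH_TO_MS
    rolling = c1 * v
    air = c3 * v ** 3
    return air / (rolling + air)

optimized = air_share(15, c1=6.754, c3=0.09)  # values found by the optimizer
realistic = air_share(15, c1=3.0, c3=0.2)     # "realistic" values
print(f"optimized: {optimized:.0%} air, realistic: {realistic:.0%} air")
```

Under the realistic pair, air resistance is about three quarters of the resistance power at 15 mph; under the optimized pair it is a bit over a third.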

I will probably revisit this data in the future, but I need to go back to the drawing board and figure out what else can be done.

## Code

Here is a link to my code: https://github.com/mcandocia/bike_workouts. I do not currently have the data available due to privacy concerns. I may release a modified version in the future if there is interest. Note that some of the code/variables used in the Python file did not make it into the above analysis.
