Reverse-Engineering Strava's Calorie Estimator with Difficulty

By Max Candocia

September 19, 2017

Since I started road biking in Texas, I have used Strava to record my bike rides and, more recently, my runs. One thing that always confused me was how the Calorie estimator worked. To get an exact value, you need to be hooked up to scientific equipment that monitors your breathing. Because that is impractical outside of a lab, Strava can only use GPS data, along with some personal information such as weight and sex.

The number of Calories burned while biking is mostly a result of the following factors:

  1. The energy it takes to accelerate the bicycle & rider
  2. The energy it takes to go up a hill
  3. Air resistance, for which the power required (energy per unit time) grows with the cube of the speed. E.g., overcoming air resistance takes 8x as much power at 20 mph as it does at 10 mph
  4. Rolling resistance, which is a result of the bike tire deforming. The power expended is proportional to velocity, and usually has less of an effect than air resistance at speeds of over 10 mph.
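To make the scaling in items 3 and 4 concrete, here is a minimal Python sketch. The function names are hypothetical, and the default coefficients (3.0 for rolling resistance, 0.2 for air resistance, both applied to speed in meters per second) are illustrative assumptions rather than anything Strava publishes.

```python
import math

MPH_TO_MPS = 0.44704  # 1 mph expressed in meters per second


def air_power_watts(speed_mps, c3=0.2):
    """Power spent against air resistance: proportional to speed cubed."""
    return c3 * speed_mps ** 3


def rolling_power_watts(speed_mps, c1=3.0):
    """Power spent against rolling resistance: proportional to speed."""
    return c1 * speed_mps


v10 = 10 * MPH_TO_MPS
v20 = 20 * MPH_TO_MPS

# Doubling the speed multiplies air power by 2**3 and rolling power by 2.
air_ratio = air_power_watts(v20) / air_power_watts(v10)
rolling_ratio = rolling_power_watts(v20) / rolling_power_watts(v10)

# Speed at which the two terms are equal: c1 * v == c3 * v**3.
crossover_mph = math.sqrt(3.0 / 0.2) / MPH_TO_MPS

print(f"air power ratio (20 vs 10 mph): {air_ratio:.1f}")
print(f"rolling power ratio (20 vs 10 mph): {rolling_ratio:.1f}")
print(f"air overtakes rolling above about {crossover_mph:.1f} mph")
```

With these assumed coefficients the crossover falls below 10 mph, which is consistent with rolling resistance mattering less than air resistance at typical road speeds.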

I noticed that when I was biking, the estimates could vary wildly. Take these rows, for example:

Date        Distance (mi)  Elevation Change (ft)  Average Speed (mph)  Estimated Calories
08/14/2016  113.0          2,456                  15.2                 3,204
08/27/2016  100.9          1,291                  17.4                 1,970
11/06/2016  60.3           1,088                  16.9                 1,865
08/12/2017  52.8           762                    17.2                 1,805

In the above table, it is obvious that the estimates can vary wildly relative to distance, average speed, and elevation change. The second row, which was the Hotter'N Hell Hundred, has barely any more estimated Calories than the following two rides, which were slightly slower on average and little more than half the distance!
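One quick way to see the discrepancy is to divide each estimate by the ride distance. The short Python snippet below does this with the four rows from the table; the variable names are only illustrative.

```python
# (date, distance in miles, estimated Calories), copied from the table.
rides = [
    ("08/14/2016", 113.0, 3204),
    ("08/27/2016", 100.9, 1970),
    ("11/06/2016", 60.3, 1865),
    ("08/12/2017", 52.8, 1805),
]

calories_per_mile = {date: calories / miles for date, miles, calories in rides}

for date, value in calories_per_mile.items():
    print(f"{date}: {value:.1f} Calories per mile")
```

The Hotter'N Hell Hundred comes out near 19.5 Calories per mile, while the other three rides fall roughly between 28 and 34.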

This ride is atypical, though, in that most of it was clear of any sort of stops, whereas almost every other bike ride that I go on, including the other three, involves constant starting and stopping, which requires more energy overall, especially in areas with a lot of traffic. Because of this major discrepancy, I am interested in seeing what goes into Strava's estimate.

Basic Insights with Linear Regression

Before I dive into the specifics of analyzing individual time segments of the GPS data, I run a linear regression on 76 of my bike rides to see what the most significant factors in Strava's estimate are.

The considered variables are as follows:

Below is a table showing the coefficients for the three regression models I tested (using stepwise selection):

  1. A model that only considers the distance, duration, average speed, and elevation changes
  2. A model that includes the above, as well as the GPS device used
  3. A model that includes the above, but also adds interaction effects between the GPS device and the other variables
                                               Calories                          Calories (w/Device)               Calories (w/Device Interactions)
                                               B        CI                p      B        CI                p      B        CI                p
(Intercept)                                    10.60    -72.25 – 93.46    .799   -122.58  -215.81 – -29.35  .011   -34.16   -120.60 – 52.28   .433
Ride Duration                                  6.75     6.23 – 7.27       <.001                                    8.77     1.11 – 16.43      .025
Ride Distance                                                                    40.95    27.23 – 54.66     <.001  -9.70    -68.91 – 49.50    .745
Strava App Used?                                                                 214.20   131.27 – 297.13   <.001  33.41    -81.63 – 148.46   .564
Distance * Average Speed                                                         -0.89    -1.71 – -0.06     .035   0.08     -1.92 – 2.07      .937
Distance (If Strava App Used)                                                                                      -43.69   -67.84 – -19.54   <.001
Distance * Average Speed (If Strava App Used)                                                                      3.14     1.68 – 4.59       <.001
Observations                                   77                                77                                77
R2 / adj. R2                                   .900 / .899                       .930 / .927                       .960 / .956

With respect to the first model, 0.9 is a pretty good value for R2. It means that the model explains 90% of the variance in Strava's Calorie estimates. This is a good start. However, the following two models indicate that there is a discrepancy between rides recorded using the Cateye device and rides recorded using the Strava iPhone app. When the iPhone app is used, the estimates tend to be higher than their Cateye counterparts (assuming an average speed of at least 15 mph). Below are three plots of the Strava estimate vs. the model estimates, with shape indicators for recording device:
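The speed condition can be checked by evaluating the app-related terms of the third model together. The Python sketch below uses only the point estimates from the table and ignores their confidence intervals. It assumes distance is in miles and average speed in mph, as in the ride table earlier; the regression table itself does not state units. The function name is hypothetical.

```python
# Point estimates from the "Calories (w/Device Interactions)" column.
APP_INTERCEPT = 33.41        # Strava App Used?
APP_DISTANCE = -43.69        # Distance (If Strava App Used)
APP_DISTANCE_SPEED = 3.14    # Distance * Average Speed (If Strava App Used)


def app_adjustment(distance, avg_speed):
    """Calories the third model adds when the Strava app recorded the ride.

    Units are assumed to be miles and mph.
    """
    return (APP_INTERCEPT
            + APP_DISTANCE * distance
            + APP_DISTANCE_SPEED * distance * avg_speed)


# Average speed at which the two distance-dependent terms cancel out.
break_even_speed = -APP_DISTANCE / APP_DISTANCE_SPEED

print(f"break-even average speed: {break_even_speed:.1f}")
print(f"50 units of distance at speed 15: {app_adjustment(50, 15):+.0f} Calories")
print(f"50 units of distance at speed 12: {app_adjustment(50, 12):+.0f} Calories")
```

Under these assumptions the distance-dependent terms cancel at an average speed just under 14, so the app adjustment is positive for rides at 15 mph and above.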

Strava's Explanations

On its website, Strava gives an explanation for its biking:

Obviously, the linear regression above is not going to achieve 100% accuracy, at least without overfitting the model, which isn't too hard to do with a low sample size. Additionally, the above models are not generalizable to a good precision, since they are sensitive to the devices and data that I used. The next step will be using GPS data to manually calculate energy consumed and estimate the coefficients used in Strava's air resistance and rolling resistance estimate.

Estimating Energy with GPS Data

The next step is to try directly computing the energy required to cycle, estimating the air resistance and rolling resistance coefficients, and comparing the numbers to 21% of Strava's Calorie estimate.
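As a sketch of the unit conversion involved, assuming the 21% figure is the fraction of burned energy that ends up as mechanical work, and using 4184 joules per food Calorie (kcal). The function names are hypothetical.

```python
JOULES_PER_KCAL = 4184.0  # 1 food Calorie (kcal) in joules
EFFICIENCY = 0.21         # assumed fraction of burned energy turned into work


def mechanical_joules_from_calories(calories):
    """Mechanical work implied by a Calorie estimate."""
    return calories * EFFICIENCY * JOULES_PER_KCAL


def calories_from_mechanical_joules(joules):
    """Calorie estimate implied by a computed amount of mechanical work."""
    return joules / JOULES_PER_KCAL / EFFICIENCY


work_joules = mechanical_joules_from_calories(1970)
print(f"{work_joules / 1000:.0f} kJ of mechanical work for 1,970 Calories")
```

The two functions are inverses, so a computed energy total can be compared either in joules or in Calories.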

Extracting Speed & Smoothing GPS Data

The GPS data is stored in .gpx files, which contain the elevation, latitude/longitude, and timestamp of each point recorded by the GPS. One thing I noticed, right off the bat, was that the Cateye was very reliable in recording every single second (except when paused due to breaks), whereas the iPhone app was riddled with 2-second gaps between time points, calling into question the accuracy of some of the data.
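Here is a minimal Python sketch of reading track points, computing speeds, and spotting gaps. It assumes GPX 1.1 files with whole-second UTC timestamps and uses the haversine formula for distance. The sample track is made up for illustration and is not real ride data, and the function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from datetime import datetime

GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}
EARTH_RADIUS_M = 6371000.0  # mean Earth radius

# Hypothetical four-point track with one 2-second gap.
SAMPLE_GPX = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="32.00000" lon="-97.00000"><ele>200.0</ele><time>2017-08-12T14:00:00Z</time></trkpt>
    <trkpt lat="32.00006" lon="-97.00000"><ele>200.1</ele><time>2017-08-12T14:00:01Z</time></trkpt>
    <trkpt lat="32.00018" lon="-97.00000"><ele>200.3</ele><time>2017-08-12T14:00:03Z</time></trkpt>
    <trkpt lat="32.00024" lon="-97.00000"><ele>200.4</ele><time>2017-08-12T14:00:04Z</time></trkpt>
  </trkseg></trk>
</gpx>"""


def read_track_points(gpx_text):
    """Return a list of dicts with lat, lon, ele, and time for each track point."""
    root = ET.fromstring(gpx_text)
    points = []
    for pt in root.iterfind(".//gpx:trkpt", GPX_NS):
        points.append({
            "lat": float(pt.get("lat")),
            "lon": float(pt.get("lon")),
            "ele": float(pt.find("gpx:ele", GPX_NS).text),
            "time": datetime.strptime(pt.find("gpx:time", GPX_NS).text,
                                      "%Y-%m-%dT%H:%M:%SZ"),
        })
    return points


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_speeds(points):
    """Return (seconds elapsed, speed in m/s) for each consecutive pair of points."""
    segments = []
    for a, b in zip(points, points[1:]):
        dt = (b["time"] - a["time"]).total_seconds()
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        dist = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
        segments.append((dt, dist / dt))
    return segments


points = read_track_points(SAMPLE_GPX)
segments = segment_speeds(points)
gaps = [dt for dt, _ in segments if dt > 1.0]
print(f"{len(points)} points, {len(gaps)} interval(s) longer than 1 second")
```

Dividing by the actual elapsed time, rather than assuming one second per point, keeps the speed estimate sensible across the 2-second gaps.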

Smoothing is very important because of the way energy is calculated. An average speed is not useful because air resistance grows so much faster at higher speeds. Knowing when the bike is accelerating vs. braking vs. maintaining speed is also very important.
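The sketch below illustrates both points in Python. The centered moving average is only one possible smoother and is an assumption here, since the text does not say which smoothing method was used. The speed series are made up for illustration.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def mean(values):
    return sum(values) / len(values)


# Two hypothetical speed series (m/s) with the same average speed.
steady = [7.0] * 6
surging = [4.0, 10.0] * 3

# Air resistance power scales with speed cubed, so compare the mean cubes.
steady_cube = mean([v ** 3 for v in steady])
surging_cube = mean([v ** 3 for v in surging])
print(f"mean speed: {mean(steady):.1f} vs {mean(surging):.1f}")
print(f"mean cubed speed: {steady_cube:.0f} vs {surging_cube:.0f}")

# A single noisy spike is spread across its neighbors by the smoother.
noisy = [7.0, 7.0, 12.0, 7.0, 7.0]
print(moving_average(noisy, window=3))
```

Both series average 7 m/s, but the surging one has a much larger mean cubed speed, which is why per-second speeds matter more than the ride average.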

  1. First, I calculate the changes in kinetic energy (due to speed), potential energy (due to change in elevation), rolling resistance, and air resistance over an interval.
  2. I add all the terms together for each point in time. If the sum is less than zero (i.e., more energy was lost than produced), I assume that I was coasting/braking, which consumes no energy on my part.
  3. The above energy calculation for each ride is summed and scaled by 1/0.21, and the result is compared to Strava's Calorie estimate
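The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from the linked repository: the function name, the input layout, the use of the mid-interval speed for the resistance terms, and the 85 kg rider-plus-bike mass are all hypothetical choices.

```python
G = 9.81                  # gravitational acceleration, m/s^2
JOULES_PER_KCAL = 4184.0  # 1 food Calorie (kcal) in joules
EFFICIENCY = 0.21         # fraction of burned energy turned into work


def ride_calories(samples, mass_kg, c1, c3):
    """Estimate Calories for one ride.

    samples: iterable of (dt_s, speed_start_mps, speed_end_mps, elevation_change_m)
    c1: rolling resistance coefficient (watts per m/s)
    c3: air resistance coefficient (watts per (m/s)^3)
    """
    total_joules = 0.0
    for dt, v0, v1, d_ele in samples:
        v_mid = 0.5 * (v0 + v1)  # assumed representative speed for the interval
        d_kinetic = 0.5 * mass_kg * (v1 ** 2 - v0 ** 2)
        d_potential = mass_kg * G * d_ele
        rolling = c1 * v_mid * dt
        air = c3 * v_mid ** 3 * dt
        interval_joules = d_kinetic + d_potential + rolling + air
        # Step 2: a negative total is treated as coasting/braking (no effort).
        if interval_joules > 0:
            total_joules += interval_joules
    # Step 3: convert to Calories and scale by 1/0.21.
    return total_joules / JOULES_PER_KCAL / EFFICIENCY


# One hour of steady riding at 7 m/s on flat ground (made-up input).
steady_hour = [(1.0, 7.0, 7.0, 0.0)] * 3600
print(f"{ride_calories(steady_hour, 85.0, 3.0, 0.2):.0f} Calories")

# A hard braking interval contributes nothing.
print(ride_calories([(1.0, 7.0, 5.0, 0.0)], 85.0, 3.0, 0.2))
```

Clamping each interval at zero is what makes stop-and-go riding more expensive than steady riding at the same average speed: energy lost to braking is never credited back.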

Coefficient Estimation

I use the variable C1 to describe the value multiplied by the speed in meters per second to estimate the rolling resistance in watts. Similarly, I use the variable C3 to describe the value multiplied by the cube of the speed to estimate the air resistance. With R's optim() function, and bounds of 0.09 to 12 for both variables, I get estimates of C1=6.754 and C3=0.09. These values are somewhat unrealistic, as that is a very low air resistance and a very high rolling resistance; the C3 estimate also sits exactly on the lower bound. The calculation does worse than any of the above regressions, with an R2 of 0.83. If I put in realistic values of C1=3.0 and C3=0.2, the prediction is even worse, with an R2 of 0.73. Below are graphs of the comparisons:
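For readers who want to try a bounded fit of this kind, here is a minimal Python sketch. It swaps R's optim() for a plain grid search over candidate values between 0.09 and 12, and it minimizes the sum of squared errors, which is an assumed objective. The rides are synthetic, generated from known coefficients so the search can be checked, and all names and the 85 kg mass are hypothetical.

```python
import itertools

G = 9.81  # gravitational acceleration, m/s^2


def ride_joules(samples, mass_kg, c1, c3):
    """Sum the positive per-interval energies for one ride.

    samples: (dt_s, speed_start_mps, speed_end_mps, elevation_change_m) tuples.
    """
    total = 0.0
    for dt, v0, v1, d_ele in samples:
        v_mid = 0.5 * (v0 + v1)
        energy = (0.5 * mass_kg * (v1 ** 2 - v0 ** 2)
                  + mass_kg * G * d_ele
                  + c1 * v_mid * dt
                  + c3 * v_mid ** 3 * dt)
        total += max(energy, 0.0)
    return total


def grid_search(rides, targets, mass_kg, candidates):
    """Return the (c1, c3) pair with the smallest sum of squared errors."""
    def sse(pair):
        c1, c3 = pair
        return sum((ride_joules(ride, mass_kg, c1, c3) - target) ** 2
                   for ride, target in zip(rides, targets))
    return min(itertools.product(candidates, candidates), key=sse)


# Synthetic rides: ten minutes of steady flat riding at two different speeds.
rides = [[(1.0, 5.0, 5.0, 0.0)] * 600, [(1.0, 9.0, 9.0, 0.0)] * 600]
MASS_KG = 85.0  # assumed rider + bike mass

# Targets generated with c1=3.0 and c3=0.2, so the true answer is known.
targets = [ride_joules(ride, MASS_KG, 3.0, 0.2) for ride in rides]

# Coarse candidate values inside the 0.09 to 12 bounds.
CANDIDATES = [0.09, 0.2, 0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 12.0]

best_c1, best_c3 = grid_search(rides, targets, MASS_KG, CANDIDATES)
print(best_c1, best_c3)
```

On this synthetic data the search recovers the coefficients used to generate the targets. With real rides, the clamping of negative intervals makes the objective non-smooth, and an estimate that lands on a bound is a hint that the model is missing something.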


And comparing the contributions each source of resistance makes, it is obvious that the "optimized" model places way too much weight on rolling resistance:
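The imbalance can also be checked numerically from the two coefficient pairs. The Python sketch below computes the share of resistance power that goes to rolling resistance at a single speed, plus the speed at which the two terms are equal. The 7 m/s cruising speed is an illustrative assumption, and the function names are hypothetical.

```python
import math

MPH_PER_MPS = 1 / 0.44704  # miles per hour in one meter per second


def rolling_share(speed_mps, c1, c3):
    """Fraction of resistance power that goes to rolling resistance."""
    rolling = c1 * speed_mps
    air = c3 * speed_mps ** 3
    return rolling / (rolling + air)


def crossover_speed_mps(c1, c3):
    """Speed at which air resistance power equals rolling resistance power."""
    return math.sqrt(c1 / c3)


# (C1, C3) pairs from the text.
FITS = {"optimized": (6.754, 0.09), "realistic": (3.0, 0.2)}
SPEED_MPS = 7.0  # illustrative cruising speed, a little under 16 mph

shares = {name: rolling_share(SPEED_MPS, c1, c3)
          for name, (c1, c3) in FITS.items()}
crossovers_mph = {name: crossover_speed_mps(c1, c3) * MPH_PER_MPS
                  for name, (c1, c3) in FITS.items()}

for name in FITS:
    print(f"{name}: rolling share {shares[name]:.0%}, "
          f"air overtakes rolling above {crossovers_mph[name]:.1f} mph")
```

With the optimized coefficients, rolling resistance stays the larger term up to roughly 19 mph, whereas with the realistic ones air resistance takes over below 10 mph.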


Conclusions (for now)

It appears that I can get a semblance of a description of what goes on in Strava's calculations, and here are a few things I know for sure:

These, however, are still mysteries:

I will probably revisit this data in the future, but I need to go back to the drawing board and figure out what else can be done.

Code

Here is a link to my code https://github.com/mcandocia/bike_workouts . I do not currently have the data available due to my privacy concerns. I may release a modified version in the future if there is interest. Note that some of the code/variables used in the Python file did not make it to the above analysis.

