September 19, 2017
Since I started road biking in Texas, I have used Strava to record my bike rides and, more recently, my runs. One thing that always confused me was how its Calorie estimator worked. To get an exact value, you need to be hooked up to lab equipment that monitors your breathing. Because that is impractical outside of a lab, Strava can only use GPS data along with some personal information, such as weight and sex.
The number of Calories burned while biking is mostly a result of the following factors:
I noticed that when I was biking, the estimates could vary wildly. Take these rows, for example:
|Date||Distance (mi)||Elevation Change (ft)||Average Speed (mph)||Estimated Calories|
In the above table, it is clear that even with similar distances, average speeds, and elevation changes, the estimate can vary wildly. The second row, which was the Hotter'N Hell Hundred, has barely more estimated Calories than the following two rides, which were slightly slower on average and about half the distance!
This ride is atypical, though, in that most of it was clear of any sort of stops, whereas almost every other bike ride that I go on, including the other three, involves constant starting and stopping, which requires more energy overall, especially in areas with a lot of traffic. Because of this major discrepancy, I am interested in seeing what
Before I dive into the specifics of analyzing individual time segments of the GPS data, I run a linear regression on 76 of my bike rides to see what the most significant factors in Strava's estimate are.
The considered variables are as follows:
Below I have three tables showing the coefficients for the three regression models I tested (using stepwise selection):
|Calories||Calories (w/Device)||Calories (w/Device Interactions)|
|(Intercept)||10.60||-72.25 – 93.46||.799||-122.58||-215.81 – -29.35||.011||-34.16||-120.60 – 52.28||.433|
|Ride Duration||6.75||6.23 – 7.27||<.001||8.77||1.11 – 16.43||.025|
|Ride Distance||40.95||27.23 – 54.66||<.001||-9.70||-68.91 – 49.50||.745|
|Strava App Used?||214.20||131.27 – 297.13||<.001||33.41||-81.63 – 148.46||.564|
|Distance * Average Speed||-0.89||-1.71 – -0.06||.035||0.08||-1.92 – 2.07||.937|
|Distance (If Strava App Used)||-43.69||-67.84 – -19.54||<.001|
|Distance * Average Speed (If Strava App Used)||3.14||1.68 – 4.59||<.001|
|R2 / adj. R2||.900 / .899||.930 / .927||.960 / .956|
With respect to the first model, 0.9 is a pretty good value for R2. It means that the model explains 90% of the variance in Strava's Calorie estimates. This is a good start. However, the following two models indicate that there is a discrepancy between rides recorded using the Cateye device and rides recorded using the Strava iPhone app. When the iPhone app is used, the estimates tend to be higher than their Cateye counterparts (assuming an average speed of at least 15 mph). Below are three plots of the Strava estimate vs. the model estimates, with shape indicators for the recording device:
On its website, Strava gives an explanation of how it estimates power for bike rides:
P(total) = P(rolling resistance) + P(wind) + P(gravity) + P(acceleration). Rolling resistance and wind contributions are a function of the speed and the speed cubed of the bike, respectively. Gravity refers to energy expended/saved by going up/down hills, and acceleration is the energy from speeding the bike up*
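To make the four terms concrete, here is a minimal sketch of that sum in Python. It is not Strava's actual implementation: the gravity and acceleration terms use the textbook forms (mass × g × grade × speed and mass × acceleration × speed), and the coefficient names `c1` and `c3`, the 85 kg mass, and the example values are all placeholders chosen for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2


def cycling_power(speed, grade, accel, mass, c1, c3):
    """Estimate the total power (watts) needed at one instant.

    speed: m/s; grade: rise over run (0.02 means a 2% climb);
    accel: m/s^2; mass: rider plus bike in kg.
    c1 and c3 are hypothetical rolling and wind coefficients.
    """
    p_rolling = c1 * speed                  # proportional to speed
    p_wind = c3 * speed ** 3                # proportional to speed cubed
    p_gravity = mass * G * grade * speed    # negative when descending
    p_accel = mass * accel * speed          # negative when slowing down
    return p_rolling + p_wind + p_gravity + p_accel


# Made-up example: 8 m/s on flat ground at constant speed.
flat = cycling_power(8.0, 0.0, 0.0, mass=85.0, c1=3.0, c3=0.25)
# The same speed on a 2% climb costs noticeably more.
climb = cycling_power(8.0, 0.02, 0.0, mass=85.0, c1=3.0, c3=0.25)
print(round(flat, 1), round(climb, 1))
```

Note that the sum can go negative on descents or while braking. Whether those instants should be clamped to zero is a modeling choice this sketch does not make.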
Obviously, the linear regression above is not going to achieve 100% accuracy, at least not without overfitting the model, which isn't hard to do with such a small sample. Additionally, the above models do not generalize well, since they are sensitive to the devices and data that I used.
The next step is to use the GPS data to directly compute the energy required to cycle, estimate the coefficients Strava uses for air resistance and rolling resistance, and compare the result to 21% of Strava's Calorie estimate.
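A minimal sketch of that comparison follows. It assumes the 21% figure is the fraction of burned Calories that ends up as mechanical work, which is my reading of the number rather than something Strava documents here. The function names are made up for illustration, and one Calorie (kcal) is taken as 4184 joules.

```python
JOULES_PER_KCAL = 4184.0
EFFICIENCY = 0.21  # assumed share of burned energy that becomes mechanical work


def mechanical_joules_to_calories(joules, efficiency=EFFICIENCY):
    """Convert mechanical work at the bike into Calories burned by the rider."""
    return joules / efficiency / JOULES_PER_KCAL


def calories_to_mechanical_joules(calories, efficiency=EFFICIENCY):
    """Convert a Calorie estimate into the mechanical work it implies."""
    return calories * efficiency * JOULES_PER_KCAL


# Made-up example: holding 150 W for one hour is 150 * 3600 joules of work.
work = 150.0 * 3600.0
print(round(mechanical_joules_to_calories(work)))
```

Working in joules on the mechanical side keeps the physics in SI units, with the efficiency factor applied only at the point of comparison.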
The GPS data is stored in .gpx files, which contain the elevation, latitude/longitude, and timestamp of each point recorded by the GPS. One thing I noticed right off the bat was that the Cateye reliably recorded every single second (except when paused for breaks), whereas the iPhone app's recordings were riddled with 2-second gaps between time points, which calls the accuracy of some of that data into question.
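As an illustration of how such gaps can be spotted, here is a small sketch that reads track points with Python's standard library and lists every interval longer than one second. It assumes GPX 1.1 files with whole-second UTC timestamps and an `<ele>` tag on every point. The sample track and the function names are made up, and this is not the code from my repository.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}  # assumes GPX 1.1

# Tiny made-up track with a 2-second gap before the last point.
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
<trk><trkseg>
<trkpt lat="32.0000" lon="-97.0000"><ele>200.0</ele><time>2017-08-26T12:00:00Z</time></trkpt>
<trkpt lat="32.0001" lon="-97.0000"><ele>200.5</ele><time>2017-08-26T12:00:01Z</time></trkpt>
<trkpt lat="32.0003" lon="-97.0000"><ele>201.0</ele><time>2017-08-26T12:00:03Z</time></trkpt>
</trkseg></trk></gpx>"""


def read_trackpoints(gpx_text):
    """Return a list of dicts with lat, lon, ele and time for each track point."""
    root = ET.fromstring(gpx_text)
    points = []
    for trkpt in root.iterfind(".//gpx:trkpt", GPX_NS):
        points.append({
            "lat": float(trkpt.get("lat")),
            "lon": float(trkpt.get("lon")),
            "ele": float(trkpt.find("gpx:ele", GPX_NS).text),
            "time": datetime.strptime(
                trkpt.find("gpx:time", GPX_NS).text, "%Y-%m-%dT%H:%M:%SZ"
            ),
        })
    return points


def time_gaps(points, expected_seconds=1.0):
    """Return (start_time, seconds) for every interval longer than expected."""
    gaps = []
    for prev, cur in zip(points, points[1:]):
        dt = (cur["time"] - prev["time"]).total_seconds()
        if dt > expected_seconds:
            gaps.append((prev["time"], dt))
    return gaps


points = read_trackpoints(SAMPLE)
print(len(points), time_gaps(points))
```

Deliberate pauses show up as gaps too, so a real analysis would need a separate threshold to tell a 2-second hiccup from a break.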
Smoothing is very important because of the way energy is calculated. An average speed is not useful on its own, because the power lost to air resistance grows with the cube of the speed, so fast stretches cost disproportionately more energy than slow ones. Knowing when the bike is accelerating, braking, or maintaining speed is also very important.
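The sketch below illustrates both points with made-up numbers. The centered moving average is only a stand-in, since this post does not depend on a particular smoothing method. The second half shows why the cube of the average speed understates the average of the cubed speeds.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks near both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def mean(values):
    return sum(values) / len(values)


# Made-up speeds in m/s: alternating slow and fast stretches of equal length.
speeds = [4.0, 12.0, 4.0, 12.0]

# Wind power scales with speed cubed, so compare the two ways of averaging.
cube_of_mean = mean(speeds) ** 3                 # uses only the average speed
mean_of_cubes = mean([v ** 3 for v in speeds])   # uses every time point

print(cube_of_mean, mean_of_cubes)
print(moving_average([0.0, 3.0, 6.0], window=3))
```

In this toy case, using only the average speed would miss over 40% of the wind term, which is why per-second speeds matter.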
I use the variable C1 to describe the value multiplied by the speed in meters per second to estimate the rolling resistance in watts. Similarly, I use the variable C3 to describe the value multiplied by the cube of the speed to estimate the air resistance. With R's optim() function, and bounds of 0.09 to 12 for both variables, I get estimates of C1 = 6.754 and C3 = 0.09. These values are somewhat unrealistic: the air resistance is very low (C3 sits at its lower bound) and the rolling resistance is very high. The calculation does worse than any of the above regressions, with an R2 of 0.83. If I put in realistic values of C1=3.0
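For readers who want to try a similar fit without R, here is a rough Python sketch. It is not R's optim() and not the code I ran. It relies on the energy model being linear in C1 and C3, so the bounded least-squares problem can be solved directly: take the unconstrained solution if it lies inside the bounds, otherwise search the four edges of the box. The per-ride inputs (`a` as the sum of speed × dt, `b` as the sum of speed cubed × dt, `y` as the target energy after subtracting the other terms) and all numbers are made up.

```python
def sse(c1, c3, a, b, y):
    """Sum of squared errors of the model c1*a + c3*b against y."""
    return sum((c1 * ai + c3 * bi - yi) ** 2 for ai, bi, yi in zip(a, b, y))


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def fit_bounded(a, b, y, lo=0.09, hi=12.0):
    """Least-squares fit of (c1, c3) with both values kept in [lo, hi].

    Assumes a and b are not all zero (the rider actually moved).
    """
    saa = sum(ai * ai for ai in a)
    sbb = sum(bi * bi for bi in b)
    sab = sum(ai * bi for ai, bi in zip(a, b))
    say = sum(ai * yi for ai, yi in zip(a, y))
    sby = sum(bi * yi for bi, yi in zip(b, y))

    # Unconstrained solution from the normal equations (Cramer's rule).
    det = saa * sbb - sab * sab
    if det != 0:
        c1 = (say * sbb - sab * sby) / det
        c3 = (saa * sby - sab * say) / det
        if lo <= c1 <= hi and lo <= c3 <= hi:
            return c1, c3

    # Otherwise the best point is on an edge of the box: fix one value at a
    # bound, solve for the other in one dimension, and clip it to the bounds.
    candidates = []
    for bound in (lo, hi):
        candidates.append((clip((say - bound * sab) / saa, lo, hi), bound))
        candidates.append((bound, clip((sby - bound * sab) / sbb, lo, hi)))
    return min(candidates, key=lambda c: sse(c[0], c[1], a, b, y))


# Made-up rides generated from c1 = 3.0 and c3 = 0.25 with no noise.
a = [1000.0, 2000.0, 1500.0]
b = [50000.0, 160000.0, 60000.0]
y = [3.0 * ai + 0.25 * bi for ai, bi in zip(a, b)]
print(fit_bounded(a, b, y))
```

A fitted value landing exactly on a bound is a hint that the data, or the model, is pulling the estimate somewhere the bounds do not allow.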
And comparing the contributions each source of resistance makes, it is obvious that the "optimized" model places way too much weight on rolling resistance:
It appears that I can get a semblance of a description of what goes on in Strava's calculations, and here are a few things I know for sure:
These, however, are still mysteries:
I will probably revisit this data in the future, but I need to go back to the drawing board and figure out what else can be done.
Here is a link to my code https://github.com/mcandocia/bike_workouts . I do not currently have the data available due to my privacy concerns. I may release a modified version in the future if there is interest. Note that some of the code/variables used in the Python file did not make it to the above analysis.