January 09, 2019

What kind of a tool do you use when you want to describe a complex phenomenon?

Linear regression is useful when your model is described by coefficients multiplying things that are added together, but it often cannot be used when coefficients are inside functions such as an exponential or log function. An example of a *valid* linear model could be $$Y = \beta_0 + \beta_1 \times X_1 + \beta_2 \times \log(X_2) + \epsilon$$
where \(\beta_i\) are the different coefficients, \(Y\) is the response variable, \(X_i\) are the different predictors, and \(\epsilon\) is the normal error/noise (i.e., randomness) of each response.

On the other hand, models such as random forests or neural networks are good at fitting highly complex shapes, but the underlying structure cannot be interpreted very easily. Random forests are an ensemble of decision trees that ask a bunch of yes/no questions about the different variables, and neural networks can sometimes be described as a very complicated flowchart that involves adding up points. See this article of mine for a very basic introduction to neural networks.

Another method, nonlinear regression, can be very useful for fitting a model. Very generally, this is $$Y = f(X_1, X_2,...,X_N,\beta_1, \beta_2, ..., \beta_K)$$

This can be very useful if you know what the parameters of the equation should be, as well as a (wide) ballpark idea of what their values should be. Fitting it, however, is much more complicated than fitting a linear model or random forest, as well as many neural networks.

Over the past 16 months, I have been collecting heart rate data from my Garmin GPS watch for my runs, along with temperature data. One question I want to answer with that data is "How quickly does my heart rate decrease after I stop running, and how does temperature affect that?"

I had discussed certain factors affecting heart rate after exercise with my friend Allen, who has worked in various human biological research environments. The main factor affecting heart rate increase during exercise is the breaking down of lactic acid and the regulation of \(CO_2\). Another significant factor he described was temperature, where the body tries to regulate its temperature by circulating blood faster.

Of course, I do not have temperature of the body or \(CO_2\) levels directly measured by equipment, but the ambient temperature and the heart rate measurements should be sufficient, provided these are the descriptive variables we want to use.

The general shape of the data itself can be seen below. Note that as the temperature increases (from top to bottom), the asymptote of the heart rate as rest time grows large becomes higher. Any changes in rate are not as obvious. Higher initial heart rates are yellow, and lower ones are more purple/blue.

For more technical information on heart rate during exercise, see Cardiac Autonomic Responses during Exercise and Post-exercise Recovery Using Heart Rate Variability and Systolic Time Intervalsâ€”A Review. ** Figure 2** is a good example for a theoretical curve.

My first model was a bit more complicated, and looked like this: $$HR_{start} = HR_{stop} + (HR_a - HR_{stop}) \times (1-e^{-((\min(c_1t, lag))-max(c_2(t-lag), 0))})$$

Where \(HR_a\) is the asymptotic heart rate, roughly what my heart rate is when walking, which I usually did after stopping, \(lag\) is the amount of time between the stop and the change of the heart rate recovery mechanism, and \(c_1\) and \(c_2\) describe the rates of change of the heart rate before and after the mechanism change.

Later, I used a simpler model, $$HR_{start} = HR_{stop} + (HR_a - HR_{stop}) \times (1-e^{-c_1t})$$ since my row count was not too high (\(N=646\)) and the \(c_2\) and \(lag\) parameters were not fitting well or consistently. Additionally, the fit for the simpler model had an \(R^2\) value very close to the random forest model (0.73 vs 0.75), suggesting that any additional parameterization with the given variables would not improve performance.

Each of the above parameters itself is a function of temperature, so, for example $$HR_a = \beta_{HR_a,1} + \beta_{HR_a, T}T$$ where \(T\) is the average temperature.

When I first tried modeling, I used R's `optim()`

function to find the best solution. However, that method is not particularly robust, and had fairly poor performance and results.

My next approach was to use the `dfoptim`

package's `hjk()`

function, which uses the Hooke-Jeeves algorithm, which is a pattern-search algorithm that does not require functions to be continuously differentiable.

While this was a huge improvement over `optim`

, it was quite slow, was very sensitive to initialization parameters, and did not produce very robust results.

The final tool I used was a package called `rstan`

, which is an interface to the C++ STAN platform, which is used for high-performance statistical modeling used in many domains.

Like the `hjk()`

function, STAN benefits greatly from sensical initial parameters. The biggest difference this choice made was setting the constant term for \(HR_a\) to 80, which is a reasonable value for a heart rate when walking (for myself, at least).

I call the code in R using the following block:

library(rstan) library(dplyr) heart = read.csv('s1_heart_data.csv') %>% filter(!is.na(heart_rate_start) & !is.na(heart_rate_stop) & !is.na(avg_temp)) heart_data = list(N=nrow(heart)) for (name in names(heart)) heart_data[[name]] = heart[,name] init_01 = list( hr_a_const=80, hr_a_temp=1, c1_const=0.01, c1_temp=-0.001, sigma2_1=400 ) fit <- stan(file='model_01.stan', data=heart_data, init= function(chain_id) init_01, verbose=TRUE, chains=1, diagnostic_file='model_01_diagnostic.txt' ) summary(fit)

Below is the model specification written for STAN in `model_01.stan`

:

data { int<lower=0> N; vector[N] heart_rate_start; vector[N] heart_rate_stop; vector[N] avg_temp; vector[N] rest_time; } parameters { real hr_a_const; real c1_const; real hr_a_temp; real c1_temp; real<lower=0> sigma2_1; } model{ vector[N] hr_a; vector[N] c1; vector[N] yhat; vector[N] sigma2; for (n in 1:N){ hr_a[n] = hr_a_const + avg_temp[n]*hr_a_temp; c1[n] = c1_const + avg_temp[n] * c1_temp; } for (n in 1:N){ yhat[n] = heart_rate_stop[n] + (hr_a[n] - heart_rate_stop[n]) * (1-exp(-rest_time[n]*c1[n])); sigma2[n] = sigma2_1; } heart_rate_start ~ normal(yhat, sigma2); }

The STAN model, because it uses MCMC iterations, reports mean and quantile statistics for each parameter (e.g., median, 75th percentile). Using the mean, the model can be described as: $$ HR_{start} = HR_{stop} + (75.9 + 0.882T - HR_{stop})\times(1- 2^{-(0.0263 + 0.000277T)t }) + N(0, 145) $$ where \(t\) is rest time, \(T\) is temperature in Celsius, and \(N(0,145)\) describes normal error with variance 145, or a standard error of about 12. In the model, though, the temperature term does not significantly affect the rate, as the 25th and 75th percentile estimates have different signs.

However, because there is some nonlinear correlation within these parameters, if they are all to be taken at the same time, a more precise (but vulnerable to overfitting) estimate can be made using the original `hjk()`

function, with the above parameters as estimates. The results are close, but the model becomes:
$$HR_{start} = HR_{stop} + (76.9 + 0.814T - HR_{stop})\times (1- 2^{-(0.0232 + 0.00101T)t}) + N(0, 2659) $$

The overall fit is technically higher, but the \(\sigma^2\) estimate is very poor, probably since it is very sensitive to small changes that greatly increase inaccuracy in the estimates.

Visualizing the first model with the graph of the data from above:

According to the model, my normal "walking" heart rate is 75.9 bpm, plus an extra 0.88 bmp per degree Celsius outside, and after pausing, the average "half-life" of my heart rate recovery is \(\frac{1}{0.0632}\), or about 38 seconds. For example, if my heart rate was 160 bpm when it is about 5 degrees Celsius outside, my walking heart rate would eventually be 80 bpm, and after 38 seconds, that initial difference would be halved, so it would be at 120 bpm (since 160 bpm = 80 bpm + 80 bpm, and 80 bpm + 80/2 bpm = 120 bpm).

Of course, the standard error of these estimates is about 12 bpm, and the 95% confidence interval is about 10 bpm (71-81 bpm) leaving a 95% prediction interval roughly 30 bpm below and 30 bpm above that estimate for any single event. Some of this error may be hidden in some of the data I have collected but yet to transform, but much of it is likely in difference in behavior during stop time, as well as other values that were not measured.

I am constantly collecting more data, and a friend of mine, using the same models of equipment, is doing the same. I am hoping to add some complexity to these models as more data is collected and extract more useful insights from these models.

The code and data for this model can be found at https://github.com/mcandocia/heart_rate_modeling

- The more complex model had a separate standard error and "burn in rate" for that error when mechanisms for heart rate variability changed.
- I originally tested average speed over a moving window as one of the parameters, but I decided on removing that, as the heart rate at the stopwatch stop was sufficient.
- I removed events that occured less than 0.8 miles into a run, since the heart rate monitor tends to give inaccurate measurements early in a run.

You can also see the summary of the rstan model fit below:

load(file='example_rstan_fit.RData') summary(fit)$summary

## mean se_mean sd 2.5% ## hr_a_const 7.592154e+01 1.432622e-01 2.4914026031 7.136686e+01 ## c1_const 1.828310e-02 5.558837e-05 0.0014289690 1.566018e-02 ## hr_a_temp 8.821563e-01 7.072157e-03 0.1303897701 6.341916e-01 ## c1_temp 1.922401e-04 4.077463e-06 0.0001121361 -3.493921e-05 ## sigma2_1 1.453253e+01 1.953821e-02 0.3920238240 1.376281e+01 ## lp__ -2.047064e+03 6.985942e-02 1.4589710881 -2.050706e+03 ## 25% 50% 75% 97.5% ## hr_a_const 7.406508e+01 7.589436e+01 7.773107e+01 8.092963e+01 ## c1_const 1.724736e-02 1.819029e-02 1.922995e-02 2.129886e-02 ## hr_a_temp 7.880944e-01 8.813928e-01 9.784810e-01 1.121392e+00 ## c1_temp 1.226810e-04 1.933863e-04 2.692400e-04 4.072611e-04 ## sigma2_1 1.427220e+01 1.453229e+01 1.478619e+01 1.529961e+01 ## lp__ -2.047807e+03 -2.046837e+03 -2.046006e+03 -2.045062e+03 ## n_eff Rhat ## hr_a_const 302.4297 1.0000841 ## c1_const 660.8117 1.0032265 ## hr_a_temp 339.9251 1.0008696 ## c1_temp 756.3285 1.0064864 ## sigma2_1 402.5830 0.9998339 ## lp__ 436.1576 1.0044044

Tags: