Forecasting my weight with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve been measuring my weight almost daily for almost 2 years now; I actually started earlier, but
not as consistently. The goal of this blog post is to get re-acquaiented with time series; I haven’t
had the opportunity to work with time series for a long time now and I have seen that quite a few
packages that deal with time series have been released on CRAN. In this blog post, I will explore
my weight measurements using some functions from the {tsibble}
and {tibbletime}
packages,
and then do some predictions with the {forecast}
package.
First, let’s load the needed packages, read in the data and convert it to a tsibble
:
library("tidyverse") library("readr") library("forecast") library("tsibble") library("tibbletime") library("mice") weight <- read_csv("https://gist.githubusercontent.com/b-rodrigues/ea60679135f8dbed448ccf66a216811f/raw/18b469f3b0720f76ce5ee2715d0f9574b615f170/gistfile1.txt") %>% as_tsibble() ## Parsed with column specification: ## cols( ## Date = col_date(format = ""), ## Poids = col_double() ## ) ## The `index` is `Date`.
You can read more about {tsibble}
here. Here, I use {tsibble}
mostly
for the next step, which is using the function fill_na()
on the tsibble. fill_na()
turns
implicit missing values into explicit missing values. These are implicit missing values:
Date Poids 1 2013-01-01 84.10 2 2013-01-04 85.60
and this is the same view, but with explicit missing values:
Date Poids 1 2013-01-01 84.10 2 2013-01-02 NA 3 2013-01-03 NA 4 2013-01-04 85.60
This is useful to do, because I want to impute the missing values using the {mice}
package.
Let’s do this:
weight <- weight %>% fill_na() imp_weight <- mice(data = weight) %>% mice::complete("long") ## ## iter imp variable ## 1 1 Poids ## 1 2 Poids ## 1 3 Poids ## 1 4 Poids ## 1 5 Poids ## 2 1 Poids ## 2 2 Poids ## 2 3 Poids ## 2 4 Poids ## 2 5 Poids ## 3 1 Poids ## 3 2 Poids ## 3 3 Poids ## 3 4 Poids ## 3 5 Poids ## 4 1 Poids ## 4 2 Poids ## 4 3 Poids ## 4 4 Poids ## 4 5 Poids ## 5 1 Poids ## 5 2 Poids ## 5 3 Poids ## 5 4 Poids ## 5 5 Poids
Let’s take a look at imp_weight
:
head(imp_weight) ## .imp .id Date Poids ## 1 1 1 2013-10-28 84.1 ## 2 1 2 2013-10-29 84.4 ## 3 1 3 2013-10-30 83.5 ## 4 1 4 2013-10-31 84.1 ## 5 1 5 2013-11-01 85.6 ## 6 1 6 2013-11-02 85.2
Let’s select the relevant data. I filter from the 11th of July 2016, which is where I started weighing myself almost every day, to the 31st of May 2018. I want to predict my weight for the month of June (you might think of the month of June 2018 as the test data, and the rest as training data):
imp_weight_train <- imp_weight %>% filter(Date >= "2016-07-11", Date <= "2018-05-31")
In the next lines, I create a column called imputation
which is simply the same as the column
.imp
but of character class, remove unneeded columns and rename some other columns (“Poids” is
French for weight):
imp_weight_train <- imp_weight_train %>% mutate(imputation = as.character(.imp)) %>% select(-.id, -.imp) %>% rename(date = Date) %>% rename(weight = Poids)
Let’s take a look at the data:
ggplot(imp_weight_train, aes(date, weight, colour = imputation)) + geom_line() + theme(legend.position = "bottom")
This plots gives some info, but it might be better to smooth the lines. This is possible by
computing a rolling mean. For this I will use the rollify()
function of the {tibbletime}
package:
mean_roll_5 <- rollify(mean, window = 5) mean_roll_10 <- rollify(mean, window = 10)
rollify()
can be seen as an adverb, pretty much like purrr::safely()
; rollify()
is a higher
order function that literally rollifies a function, in this case mean()
which means that
rollifying the mean creates a function that returns the rolling mean. The window
argument lets
you decide how smooth you want the curve to be: the higher the smoother. However, you will lose
some observations. Let’s use this functions to add the rolling means to the data frame:
imp_weight_train <- imp_weight_train %>% group_by(imputation) %>% mutate(roll_5 = mean_roll_5(weight), roll_10 = mean_roll_10(weight))
Now, let’s plot these new curves:
ggplot(imp_weight_train, aes(date, roll_5, colour = imputation)) + geom_line() + theme(legend.position = "bottom") ## Warning: Removed 20 rows containing missing values (geom_path).
ggplot(imp_weight_train, aes(date, roll_10, colour = imputation)) + geom_line() + theme(legend.position = "bottom") ## Warning: Removed 45 rows containing missing values (geom_path).
That’s easier to read, isn’t it?
Now, I will use the auto.arima()
function to train a model on the data to forecast my weight for
the month of June. However, my data, imp_weight_train
is a list of datasets. auto.arima()
does
not take a data frame as an argument, much less so a list of datasets. I’ll create a wrapper around
auto.arima()
that works on a dataset, and then map it to the list of datasets:
auto.arima.df <- function(data, y, ...){ y <- enquo(y) yts <- data %>% pull(!!y) %>% as.ts() auto.arima(yts, ...) }
auto.arima.df()
takes a data frame as argument, and then y
, which is the column that contains the
univariate time series. This column then gets pulled out of the data frame, converted to a time
series object with as.ts()
, and then passed down to auto.arima()
. I can now use this function
on my list of data sets. The first step is to nest the data:
nested_data <- imp_weight_train %>% group_by(imputation) %>% nest()
Let’s take a look at nested_data
:
nested_data ## # A tibble: 5 x 2 ## imputation data ## <chr> <list> ## 1 1 <tibble [690 × 4]> ## 2 2 <tibble [690 × 4]> ## 3 3 <tibble [690 × 4]> ## 4 4 <tibble [690 × 4]> ## 5 5 <tibble [690 × 4]>
nested_data
is a tibble with a column called data
, which is a so-called list-column. Each
element of data
is itself a tibble. This is a useful structure, because now I can map auto.arima.df()
to the data frame:
models <- nested_data %>% mutate(model = map(data, auto.arima.df, y = weight))
This trick can be a bit difficult to follow the first time you see it. The idea is the following:
nested_data
is a tibble. Thus, I can add a column to it using mutate()
. So far so good.
Now that I am “inside” the mutate call, I can use purrr::map()
. Why? purrr::map()
takes a list
and then a function as arguments. Remember that data
is a list column; you can see it above,
the type of the column data
is list. So data
is a list, and thus can be used inside purrr::map()
.
Great. Now, what is inside data
? tibbles, where inside each of them is a column
called weight
. This is the column that contains my univariate time series I want to model. Let’s
take a look at models
:
models ## # A tibble: 5 x 3 ## imputation data model ## <chr> <list> <list> ## 1 1 <tibble [690 × 4]> <S3: ARIMA> ## 2 2 <tibble [690 × 4]> <S3: ARIMA> ## 3 3 <tibble [690 × 4]> <S3: ARIMA> ## 4 4 <tibble [690 × 4]> <S3: ARIMA> ## 5 5 <tibble [690 × 4]> <S3: ARIMA>
models
is a tibble with a column called model
, where each element is a model of type ARIMA
.
Adding forecasts is based on the same trick as above, and we use the forecast()
function:
forecasts <- models %>% mutate(predictions = map(model, forecast, h = 24)) %>% mutate(predictions = map(predictions, as_tibble)) %>% pull(predictions)
I forecast 24 days (I am writing this on the 24th of June), and convert the predictions to tibbles, and then pull only the predictions tibble:
forecasts ## [[1]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 71.5 70.7 72.3 70.2 72.8 ## 2 71.5 70.7 72.4 70.3 72.8 ## 3 71.5 70.6 72.3 70.1 72.8 ## 4 71.5 70.6 72.4 70.1 72.9 ## 5 71.4 70.5 72.4 70.0 72.9 ## 6 71.5 70.5 72.4 70.0 72.9 ## 7 71.4 70.5 72.4 69.9 72.9 ## 8 71.4 70.4 72.4 69.9 72.9 ## 9 71.4 70.4 72.4 69.9 72.9 ## 10 71.4 70.4 72.4 69.8 73.0 ## # ... with 14 more rows ## ## [[2]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 71.6 70.8 72.3 70.3 72.8 ## 2 71.6 70.8 72.5 70.3 72.9 ## 3 71.5 70.6 72.4 70.2 72.9 ## 4 71.5 70.6 72.5 70.1 72.9 ## 5 71.5 70.5 72.5 70.0 73.0 ## 6 71.5 70.5 72.5 70.0 73.0 ## 7 71.5 70.5 72.5 69.9 73.0 ## 8 71.5 70.4 72.5 69.9 73.1 ## 9 71.5 70.4 72.5 69.8 73.1 ## 10 71.4 70.3 72.6 69.7 73.1 ## # ... with 14 more rows ## ## [[3]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 71.6 70.8 72.4 70.4 72.8 ## 2 71.5 70.7 72.4 70.2 72.8 ## 3 71.5 70.6 72.4 70.2 72.9 ## 4 71.5 70.6 72.4 70.1 72.9 ## 5 71.5 70.5 72.4 70.0 72.9 ## 6 71.5 70.5 72.4 70.0 73.0 ## 7 71.5 70.5 72.5 69.9 73.0 ## 8 71.4 70.4 72.5 69.9 73.0 ## 9 71.4 70.4 72.5 69.8 73.0 ## 10 71.4 70.4 72.5 69.8 73.1 ## # ... with 14 more rows ## ## [[4]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 71.5 70.8 72.3 70.3 72.8 ## 2 71.5 70.7 72.4 70.3 72.8 ## 3 71.5 70.7 72.4 70.2 72.8 ## 4 71.5 70.6 72.4 70.1 72.9 ## 5 71.5 70.6 72.4 70.1 72.9 ## 6 71.5 70.5 72.5 70.0 73.0 ## 7 71.5 70.5 72.5 69.9 73.0 ## 8 71.5 70.4 72.5 69.9 73.0 ## 9 71.4 70.4 72.5 69.8 73.1 ## 10 71.4 70.3 72.5 69.8 73.1 ## # ... with 14 more rows ## ## [[5]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 71.5 70.8 72.3 70.3 72.8 ## 2 71.5 70.7 72.4 70.3 72.8 ## 3 71.5 70.7 72.4 70.2 72.8 ## 4 71.5 70.6 72.4 70.1 72.9 ## 5 71.5 70.6 72.4 70.1 72.9 ## 6 71.5 70.5 72.4 70.0 73.0 ## 7 71.5 70.5 72.5 69.9 73.0 ## 8 71.5 70.4 72.5 69.9 73.0 ## 9 71.4 70.4 72.5 69.8 73.1 ## 10 71.4 70.3 72.5 69.8 73.1 ## # ... with 14 more rows
So forecasts
is a list of tibble, each containing a forecast. Remember that I have 5 tibbles, because
I imputed the data 5 times. I will merge this list of data sets together into one, but before I need
to add a column that indices the forecasts:
forecasts <- map2(.x = forecasts, .y = as.character(seq(1, 5)), ~mutate(.x, id = .y)) %>% bind_rows() %>% select(-c(`Lo 80`, `Hi 80`)) colnames(forecasts) <- c("point_forecast", "low_95", "hi_95", "id")
Let’s take a look again at forecasts
:
forecasts ## # A tibble: 120 x 4 ## point_forecast low_95 hi_95 id ## <dbl> <dbl> <dbl> <chr> ## 1 71.5 70.2 72.8 1 ## 2 71.5 70.3 72.8 1 ## 3 71.5 70.1 72.8 1 ## 4 71.5 70.1 72.9 1 ## 5 71.4 70.0 72.9 1 ## 6 71.5 70.0 72.9 1 ## 7 71.4 69.9 72.9 1 ## 8 71.4 69.9 72.9 1 ## 9 71.4 69.9 72.9 1 ## 10 71.4 69.8 73.0 1 ## # ... with 110 more rows
I now select the true values for the month of June. I also imputed this data, but here I will simply keep the average of the imputations:
weight_june <- imp_weight %>% filter(Date >= "2018-06-01") %>% select(-.id) %>% group_by(Date) %>% summarise(true_weight = mean(Poids)) %>% rename(date = Date)
Let’s take a look at weight_june
:
weight_june ## # A tibble: 24 x 2 ## date true_weight ## <date> <dbl> ## 1 2018-06-01 71.8 ## 2 2018-06-02 70.8 ## 3 2018-06-03 71.2 ## 4 2018-06-04 71.4 ## 5 2018-06-05 70.9 ## 6 2018-06-06 70.8 ## 7 2018-06-07 70.5 ## 8 2018-06-08 70.1 ## 9 2018-06-09 70.3 ## 10 2018-06-10 71.0 ## # ... with 14 more rows
Let’s repeat weight_june
5 times, and add the index 1 to 5. Why? Because I want to merge the
true data with the forecasts, and having the data in this form makes things easier:
weight_june <- modify(list_along(1:5), ~`<-`(., weight_june)) %>% map2(.y = as.character(seq(1, 5)), ~mutate(.x, id = .y)) %>% bind_rows()
The first line:
modify(list_along(1:5), ~`<-`(., weight_june))
looks quite complicated, but you will see that it is not, once we break it apart. modify()
modifies a list. The list to modify is list_along(1:5)
, which create a list of NULL
s:
list_along(1:5) ## [[1]] ## NULL ## ## [[2]] ## NULL ## ## [[3]] ## NULL ## ## [[4]] ## NULL ## ## [[5]] ## NULL
The second argument of modify()
is either a function or a formula. I created the following
formula:
~`<-`(., weight_june)
We all know the function <-()
, but are not used to see it that way. But consider the following:
a <- 3 `<-`(a, 3)
These two formulations are equivalent. So these lines fill the empty element of the list of NULL
s
with the data frame weight_june
. Then I add the id
column and then bind the rows together: bind_rows()
.
Let’s bind the columns of weight_june
and forecasts
and take a look at it:
forecasts <- bind_cols(weight_june, forecasts) %>% select(-id1) forecasts ## # A tibble: 120 x 6 ## date true_weight id point_forecast low_95 hi_95 ## <date> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 2018-06-01 71.8 1 71.5 70.2 72.8 ## 2 2018-06-02 70.8 1 71.5 70.3 72.8 ## 3 2018-06-03 71.2 1 71.5 70.1 72.8 ## 4 2018-06-04 71.4 1 71.5 70.1 72.9 ## 5 2018-06-05 70.9 1 71.4 70.0 72.9 ## 6 2018-06-06 70.8 1 71.5 70.0 72.9 ## 7 2018-06-07 70.5 1 71.4 69.9 72.9 ## 8 2018-06-08 70.1 1 71.4 69.9 72.9 ## 9 2018-06-09 70.3 1 71.4 69.9 72.9 ## 10 2018-06-10 71.0 1 71.4 69.8 73.0 ## # ... with 110 more rows
Now, for the last plot:
ggplot(forecasts, aes(x = date, colour = id)) + geom_line(aes(y = true_weight), size = 2) + geom_line(aes(y = hi_95)) + geom_line(aes(y = low_95)) + theme(legend.position = "bottom")
The true data fall within all the confidence intervals, but I am a bit surprised by the intervals, especially the upper confidence intervals; they all are way above 72kg, however my true weight has been fluctuating around 71kg for quite some months now. I think I have to refresh my memory on time series, because I am certainly missing something!
If you found this blog post useful, you might want to follow me on twitter for blog post updates.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.