Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Entropy. I still shiver when I hear that word, since I never fully understood that concept. Today marks the first time I was kind of forced to look into it in more detail. And by “in detail”, I mean I found a StackOverflow question that had something to do with a problem I am having (sound familiar?). The problem was is about complexity of time series and one of the suggested methods was Sample Entropy.
#used packages library(tidyverse) # for data wrangling library(pracma) # for Sample Entropy code library(Rcpp) # integrate C++ in R
Sample Entropy
Sample entropy is similar to Approximate Entropy and used for assessing the complexity of time-series. The less “complex” the time series is the easier it may be to forecast it.
Sample Entropy in R
I found two packages that implement sample entropy,
pracma
and nonlinearTimeSeries
. I looked into
nonlinearTimeSeries
first but the data structure seemed a bit too complex on first glance (for me!).
So I decided to go for pracma
. When you are ok with the default parameters, then you can simple call
sample_entropy()
.
set.seed(1886) ts <- rnorm(200) sample_entropy(ts) ## [1] 2.302585
Simple. Problem is, I need to calculate the sample entropy of 150,000 time series. Can the function handle that in reasonable time?
#calculate sample entropy for 500 time series set.seed(1886) A <- matrix(runif(500*200),500,200) system.time(apply(A,1,function(x)sample_entropy(x))) ## user system elapsed ## 40.775 0.004 40.782
This translates to several hours for 150,000 time series, which is kind of not ok. I would prefer it a little faster.
Sample Entropy with Rcpp
Sample Entropy is actually super easy to implement. So I used my rusty c++ skills
and implemented the function myself with the help of Rcpp
.
cppFunction( "double SampleEntropy(NumericVector data, int m, double r, int N, double sd) { int Cm = 0, Cm1 = 0; double err = 0.0, sum = 0.0; err = sd * r; for (unsigned int i = 0; i < N - (m + 1) + 1; i++) { for (unsigned int j = i + 1; j < N - (m + 1) + 1; j++) { bool eq = true; //m - length series for (unsigned int k = 0; k < m; k++) { if (std::abs(data[i+k] - data[j+k]) > err) { eq = false; break; } } if (eq) Cm++; //m+1 - length series int k = m; if (eq && std::abs(data[i+k] - data[j+k]) <= err) Cm1++; } } if (Cm > 0 && Cm1 > 0) return std::log((double)Cm / (double)Cm1); else return 0.0; }" )
The code can also be found on github.
Let’s see if it produces the same output as the pracma
version.
set.seed(1886) ts <- rnorm(200) sample_entropy(ts) ## [1] 2.302585 SampleEntropy(ts,2L,0.2,length(ts),sd(ts)) ## [1] 2.302585
Perfect. Now let’s check if we gained some speed up.
system.time(apply(A,1,function(x)SampleEntropy(x,2L,0.2,length(ts),sd(ts)))) ## user system elapsed ## 0.084 0.000 0.084
The speed up is actually ridiculous. Remember that the pracma code ran 40 seconds. The Rcpp code not even a tenth of a second. This is definitely good enough for 150,000 time series.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.