Time series data are common in analytics; examples include stock indexes and prices, currency exchange rates and electrocardiograms (ECG). Traditional time series analysis focuses on smoothing, decomposition and forecasting, and many R functions and packages are available for those purposes (see CRAN Task View: Time Series Analysis). However, classification and clustering of time series data are not readily supported by existing R functions or packages. There are many research publications on time series classification and clustering, but, as far as I know, no R implementations are available.
To demonstrate some possible approaches to time series analysis and mining with R, I gave a talk on Time Series Analysis and Mining with R at the Canberra R Users Group on 18 July 2011. It presents time series decomposition, forecasting, clustering and classification with R code examples. Some examples from the talk are presented below.
The complete slides of the talk and a comprehensive document titled R and Data Mining: Examples and Case Studies are available at the RDataMining website.
Time Series Decomposition
Time series decomposition breaks a time series into trend, seasonal, cyclical and irregular components. The AirPassengers time series is used below to demonstrate it.
> f <- decompose(AirPassengers)
> # seasonal figures
> f$figure
> plot(f$figure, type="b", xaxt="n", xlab="")
> # get names of 12 months in English words
> monthNames <- months(ISOdate(2011,1:12,1))
> # label x-axis with month names
> # las is set to 2 for vertical label orientation
> axis(1, at=1:12, labels=monthNames, las=2)
> plot(f)
In the above figure, the first chart is the original time series, the second is trend, the third shows seasonal factors, and the last chart is the remaining component.
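decompose() fits an additive model by default. The seasonal swings in AirPassengers grow with the level of the series, so a multiplicative model may suit it better. The following is only a sketch of that variant, not part of the original talk; the variable names fm, recon and ok are illustrative.

```r
# Multiplicative decomposition: x = trend * seasonal * random
fm <- decompose(AirPassengers, type = "multiplicative")

# Multiplying the components back together should recover the series
# wherever the moving-average trend is defined (it is NA at both ends).
recon <- fm$trend * fm$seasonal * fm$random
ok <- !is.na(recon)
max(abs(recon[ok] - AirPassengers[ok]))

# The 12 seasonal factors are scaled to average 1
mean(fm$figure)
```

Calling plot(fm) draws the same four panels as before, with seasonal and random components expressed as factors around 1 rather than offsets around 0.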
Some other functions for time series decomposition are stl() in package stats, decomp() in package timsac, and tsr() in package ast.
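As a brief illustration of the first alternative, here is a minimal sketch using stl() from package stats. Taking logs first and setting s.window="periodic" are assumptions made for this sketch, not choices from the talk.

```r
# Seasonal-trend decomposition by loess on the log scale,
# which turns the roughly multiplicative seasonality into an additive one.
fit <- stl(log(AirPassengers), s.window = "periodic")

# The components are stored as columns of a multivariate time series
comp <- fit$time.series
colnames(comp)

# stl() is additive: seasonal + trend + remainder gives back the input
recon <- rowSums(comp)
max(abs(recon - log(AirPassengers)))
```

plot(fit) shows the data, seasonal, trend and remainder panels in one figure.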
Time Series Forecasting
Time series forecasting predicts future values from known past data. Below is an example of time series forecasting with an autoregressive integrated moving average (ARIMA) model.
> fit <- arima(AirPassengers, order=c(1,0,0),
+ seasonal=list(order=c(2,1,0), period=12))
> fore <- predict(fit, n.ahead=24)
> # approximate error bounds at 95% confidence level (+/- 2 standard errors)
> U <- fore$pred + 2*fore$se
> L <- fore$pred - 2*fore$se
> ts.plot(AirPassengers, fore$pred, U, L, col=c(1,2,4,4), lty = c(1,1,2,2))
> legend("topleft", c("Actual", "Forecast", "Error Bounds (95% Confidence)"),
+ col=c(1,2,4), lty=c(1,1,2))
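The forecast above extends past the end of the data, so it cannot be checked against actual values. One way to get a rough idea of forecast quality is to hold out the last two years, fit the same model on the rest and compare. This is only a sketch under assumptions of my own: the 1958/1959 split, method="ML" (chosen so the fit does not depend on CSS starting values), and the names train, test, mape and coverage are all illustrative.

```r
# Hold out the last 24 months for evaluation
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))

# Same model structure as in the example, fitted on the training part only
fit <- arima(train, order = c(1, 0, 0),
             seasonal = list(order = c(2, 1, 0), period = 12),
             method = "ML")
fore <- predict(fit, n.ahead = length(test))

# 95% bounds using the normal quantile instead of the rounded value 2
z <- qnorm(0.975)
upper <- fore$pred + z * fore$se
lower <- fore$pred - z * fore$se

# Mean absolute percentage error and share of actuals inside the bounds
mape <- mean(abs((test - fore$pred) / test)) * 100
coverage <- mean(test >= lower & test <= upper)
c(MAPE = mape, coverage = coverage)
```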
Time Series Clustering
Time series clustering partitions time series data into groups based on similarity or distance, so that time series in the same cluster are similar to each other. With R, the first step is to choose an appropriate distance or similarity metric; the second step is to apply an existing clustering technique, such as k-means, hierarchical clustering, density-based clustering or subspace clustering, to find the clustering structure.
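The two steps can be sketched on simulated data with base R alone. Everything in this sketch is an assumption made for illustration: the three simulated groups (flat, increasing, decreasing), plain Euclidean distance as the metric, and the names make_group, series, truth and memb.

```r
set.seed(42)
n_per_group <- 10
len <- 60
tt <- seq_len(len)

# Simulate one group of series: a linear trend plus Gaussian noise,
# returned as a matrix with one series per row.
make_group <- function(slope) {
  t(replicate(n_per_group, slope * tt + rnorm(len)))
}
series <- rbind(make_group(0), make_group(0.5), make_group(-0.5))
truth <- rep(1:3, each = n_per_group)

# Step 1: pairwise distances between series (Euclidean here)
d <- dist(series)

# Step 2: an existing clustering technique on that distance matrix
hc <- hclust(d, method = "average")
memb <- cutree(hc, k = 3)

# Cross-tabulate simulated groups against cluster membership
table(truth, memb)
```

Swapping the distance in step 1 changes what "similar" means without touching step 2, which is why the choice of metric comes first.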
Dynamic Time Warping (DTW) finds an optimal alignment between two time series, and the DTW distance is used as the distance metric in the example below. The example uses the Synthetic Control Chart Time Series data set, which contains 600 control charts. Each control chart is a time series with 60 values. There are six classes: 1) 1-100 Normal, 2) 101-200 Cyclic, 3) 201-300 Increasing trend, 4) 301-400 Decreasing trend, 5) 401-500 Upward shift, and 6) 501-600 Downward shift. The dataset is downloadable at UCI KDD Archive.
> sc <- read.table("E:/Rtmp/synthetic_control.data", header=FALSE, sep="")
> # randomly sample n cases from each class, to make it easy for plotting
> n <- 10
> s <- sample(1:100, n)
> idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
> sample2 <- sc[idx,]
> observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
> # compute DTW distances (package dtw registers the "DTW" method for dist())
> library(dtw)
> distMatrix <- dist(sample2, method="DTW")
> # hierarchical clustering
> hc <- hclust(distMatrix, method="average")
> plot(hc, labels=observedLabels, main="")
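To make the DTW distance less of a black box, here is a minimal base R sketch of the classic dynamic programming recurrence. It is not the implementation used by package dtw, which supports several step patterns and windowing options, so its values need not match this sketch. The function name dtw_distance and the absolute-difference local cost are assumptions.

```r
# DTW distance between two numeric vectors using the basic recurrence
# D[i, j] = cost(i, j) + min(D[i-1, j-1], D[i-1, j], D[i, j-1])
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # Extra first row and column of Inf act as the boundary condition
  D <- matrix(Inf, n + 1, m + 1)
  D[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- abs(x[i] - y[j])
      D[i + 1, j + 1] <- cost + min(D[i, j], D[i, j + 1], D[i + 1, j])
    }
  }
  D[n + 1, m + 1]
}

# A series and a time-stretched copy of it align perfectly
dtw_distance(c(1, 2, 3), c(1, 1, 2, 3))

# A constant offset cannot be warped away
dtw_distance(c(0, 0), c(1, 1))
```

Because the alignment may repeat points, the two series do not need the same length, which a pointwise Euclidean distance would require.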
Time Series Classification
Time series classification builds a classification model from labelled time series and then uses the model to predict the labels of unlabelled time series. One way to do this with R is to first extract features from the time series data, and then apply existing classification techniques, such as SVM, k-NN, neural networks, regression and decision trees, to the feature set.
Discrete Wavelet Transform (DWT) provides a multi-resolution representation using wavelets and is used in the example below. Another popular feature extraction technique is Discrete Fourier Transform (DFT).
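To give a feel for what the Haar filter computes before the full example, here is a minimal base R sketch of one Haar step applied twice. The function name haar_step and the toy vector are illustrative, and the sign and ordering conventions shown are one common choice that may differ from those of package wavelets.

```r
# One level of the Haar transform on an even-length vector:
# pairwise scaled sums (smooth part) and differences (detail part).
haar_step <- function(x) {
  stopifnot(length(x) %% 2 == 0)
  odd  <- x[seq(1, length(x), by = 2)]
  even <- x[seq(2, length(x), by = 2)]
  list(V = (odd + even) / sqrt(2),   # scaling (approximation) coefficients
       W = (even - odd) / sqrt(2))   # wavelet (detail) coefficients
}

x <- c(4, 6, 10, 12, 8, 6, 5, 5)
lvl1 <- haar_step(x)        # 4 approximation + 4 detail coefficients
lvl2 <- haar_step(lvl1$V)   # repeat on the approximation: coarser resolution

# The transform is orthonormal, so the sum of squares is preserved
c(sum(x^2), sum(lvl1$V^2) + sum(lvl1$W^2))
```

Repeating the step on the approximation coefficients is what gives the multi-resolution representation: each level describes the series at half the previous resolution.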
> # extract DWT coefficients (with Haar filter)
> library(wavelets)
> wtData <- NULL
> for (i in 1:nrow(sc)) {
+ a <- t(sc[i,])
+ wt <- dwt(a, filter="haar", boundary="periodic")
+ wtData <- rbind(wtData, unlist(c(wt@W,wt@V[[wt@level]])))
+ }
> wtData <- as.data.frame(wtData)
> # set class labels into categorical values
> classId <- c(rep("1",100), rep("2",100), rep("3",100),
+ rep("4",100), rep("5",100), rep("6",100))
> # store the labels as a factor so that ctree() treats them as classes
> wtSc <- data.frame(classId=factor(classId), wtData)
> # build a decision tree with ctree() in package party
> library(party)
> ct <- ctree(classId ~ ., data=wtSc,
+ controls = ctree_control(minsplit=30, minbucket=10, maxdepth=5))
> # predictions are made on the training data itself
> pClassId <- predict(ct)
> # check predicted classes against original class labels
> table(classId, pClassId)
> # accuracy (on the training data)
> (sum(classId==pClassId)) / nrow(wtSc)
[1] 0.8716667
> plot(ct, ip_args=list(pval=FALSE), ep_args=list(digits=0))
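The DFT mentioned above as an alternative feature extractor is available in base R through fft(). The sketch below is an assumption-laden illustration rather than part of the talk: the function name dft_features, the choice of the first k magnitudes as features, and the simulated cyclic series are all made up for the example.

```r
# Use the magnitudes of the first k non-constant Fourier coefficients
# of a series as its feature vector.
dft_features <- function(x, k = 5) {
  Mod(fft(x - mean(x)))[2:(k + 1)]
}

# A simulated cyclic series: 5 full cycles over 60 points, plus noise
set.seed(1)
tt <- 1:60
cyclic <- sin(2 * pi * 5 * tt / 60) + rnorm(60, sd = 0.1)

feat <- dft_features(cyclic, k = 10)
round(feat, 2)

# Position of the largest magnitude = dominant number of cycles
which.max(feat)
```

Applied row by row to a data set such as sc, a function like this would produce a feature table that could stand in for wtData in the classification step.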