Samsung Phone Data Analysis Project
Below are my findings from the second data analysis project in Dr. Jeff Leek’s Johns Hopkins Coursera class.
Introduction
I used the “Human Activity Recognition Using Smartphones Dataset” (UCI, 2013) to build a model. This data was recorded from a waist-mounted Samsung smartphone with an embedded accelerometer and gyroscope. The purpose of my model was to recognize the type of activity (walking, walking upstairs, walking downstairs, sitting, standing, laying) the wearer of the device was performing, based on the 3-axial linear acceleration and 3-axial angular velocity captured at a constant rate of 50 Hz.
Methods
The “Human Activity Recognition Using Smartphones Dataset” consists of recordings from 30 participants who performed a set of activities over a specified period while wearing a Samsung device on their waist. The assignment specified that I use only a subset consisting of data from 21 participants. The data was already normalized and bounded within [-1, 1], which meant extreme values were not to be expected. The dataset included 7352 observations of 563 variables. These variables can be summarized as:
- Accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time domain signals (prefix ‘t’ to denote time) were captured at a constant rate of 50 Hz.
- Similarly, the acceleration signal was then separated into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ).
- Subsequently, the body linear acceleration and angular velocity were derived in time to obtain Jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). Also, the magnitude of these three-dimensional signals was calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag).
The following features were estimated from the aforementioned signals.
- mean(): Mean value
- std(): Standard deviation
- mad(): Median absolute deviation
- max(): Largest value in array
- min(): Smallest value in array
- sma(): Signal magnitude area
- energy(): Energy measure. Sum of the squares divided by the number of values.
- iqr(): Interquartile range
- entropy(): Signal entropy
- arCoeff(): Autoregression coefficients with Burg order equal to 4
- correlation(): correlation coefficient between two signals
- maxInds(): index of the frequency component with largest magnitude
- meanFreq(): Weighted average of the frequency components to obtain a mean frequency
- skewness(): skewness of the frequency domain signal
- kurtosis(): kurtosis of the frequency domain signal
- bandsEnergy(): Energy of a frequency interval within the 64 bins of the FFT of each window.
- angle(): Angle between two vectors.
Exploratory Analysis
I started the exploration process by examining structures, summaries, and distribution plots of the variables. I searched for missing data (the data was supposedly normalized), naming-convention or level issues, and eventually determined the variables to be used in a classification or regression model. Special characters and spaces were converted or removed. In order to utilize Random Forests, the “activity” and “subject” columns needed to be converted to factors. Additionally, character vectors needed to be made syntactically valid (e.g. “a and b” and “a-and-b” become “a.and.b” and “a.and.b.1”) using gsub and make.names.
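A minimal sketch of that cleanup, assuming the raw data frame is called samsungData (a hypothetical name) with columns named activity and subject:

```r
# Hypothetical raw data frame with activity labels and subject IDs
load("samsungData.rda")

# Replace special characters (parentheses, commas, hyphens) with dots
names(samsungData) <- gsub("[(),-]", ".", names(samsungData))

# Ensure names are syntactically valid and unique; duplicates get ".1", ".2", ...
# e.g. "a and b" and "a-and-b" both map to "a.and.b", so the second becomes "a.and.b.1"
names(samsungData) <- make.names(names(samsungData), unique = TRUE)

# randomForest() requires the outcome (and any grouping column) to be a factor
samsungData$activity <- as.factor(samsungData$activity)
samsungData$subject  <- as.factor(samsungData$subject)
```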
Per assignment instructions, the dataset needed to be divided into a training set that included at minimum participants 1, 3, 5, and 6, and a test set that included at minimum participants 27, 28, 29, and 30. Some participants might be responsible for more of the observations in the dataset than others, thus skewing the data. As such, I took random, equally sized samples of observations without replacement for each participant.
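One way to sketch that balanced split, assuming the samsungData frame from above, using only the minimum required participants for brevity (the 200 rows per subject is illustrative, not the value actually used):

```r
set.seed(1234)

train.subjects <- c(1, 3, 5, 6)
test.subjects  <- c(27, 28, 29, 30)

# Draw the same number of rows, without replacement, from each subject
sample.per.subject <- function(data, subjects, n) {
  do.call(rbind, lapply(subjects, function(s) {
    rows <- which(data$subject == s)
    data[sample(rows, min(n, length(rows))), ]
  }))
}

train <- sample.per.subject(samsungData, train.subjects, 200)
test  <- sample.per.subject(samsungData, test.subjects, 200)
```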
I wanted to check the distributions of the training and test sets to see if any differences were egregious enough to warrant transformation. The power of most modeling methods, such as decision trees and k-NN, depends on homogeneity of the data. I used Principal Components Analysis to assess the distributions. The image below shows each data set projected onto two principal components that together represent about 90% of the variability in the data. There is definite homogeneity between the sets. (I’m still exploring ways to plot this using ggplot2 instead of base graphics.)
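A sketch of that check with prcomp and base graphics, assuming the train and test data frames from the split above and that columns 1:561 hold the numeric features (the column range is an assumption):

```r
# PCA on the numeric feature columns only (excluding subject and activity)
feature.cols <- 1:561   # assumed position of the numeric features

pca.train <- prcomp(train[, feature.cols])
pca.test  <- prcomp(test[, feature.cols])

# Plot the first two principal components of each set side by side
par(mfrow = c(1, 2))
plot(pca.train$x[, 1], pca.train$x[, 2],
     xlab = "PC1", ylab = "PC2", main = "Training set")
plot(pca.test$x[, 1], pca.test$x[, 2],
     xlab = "PC1", ylab = "PC2", main = "Test set")
```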
Statistical Modeling
I decided to employ random forests modeling (an excellent layman’s explanation of the algorithm can be found here). If you are already familiar with classification trees then Dr. Breiman’s explanation should make sense:
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
I chose random forests after I considered the nature of the activities I was trying to predict. I intuitively thought that some of the activities would be hard to distinguish, such as walking upstairs versus downstairs or standing versus sitting. I thought random forests robust enough, as single decision trees are likely to suffer from high variance or high bias, while random forests average over many trees to strike a natural balance between the two extremes. I also appreciate the principle of Occam’s razor and wanted to employ an algorithm with the fewest assumptions about the underlying distribution.
This model was successful, as the OOB estimate of the error rate was 1.4%. The Random Forests importance() output column MeanDecreaseAccuracy describes how much each variable contributes to the model’s ability to predict activity. Table 1 below displays the variables with the top ten MeanDecreaseAccuracy scores.
Table 1
| Variable | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|
| angle.Y.gravityMean. | 17.2437045 | 57.2346914 |
| tGravityAcc.min...Y | 16.8232063 | 54.0606836 |
| tGravityAcc.mean...Y | 16.7407855 | 57.1805001 |
| tGravityAcc.max...Y | 16.592065 | 48.0024562 |
| tGravityAcc.energy...Y | 14.1988198 | 33.2963559 |
| tGravityAcc.min...X | 13.0430767 | 63.426501 |
| tGravityAcc.energy...X | 12.8807273 | 58.1172211 |
| angle.X.gravityMean. | 12.7779327 | 54.9267196 |
| tGravityAcc.mean...X | 12.6165942 | 48.6299317 |
This dotchart displays variable importance as measured by a Random Forest (varImpPlot) similar to the table above.
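A sketch of how the model, the importance table, and the dotchart would be produced with the randomForest package, assuming the train data frame from earlier (the ntree value is illustrative):

```r
library(randomForest)

set.seed(1234)

# Fit the forest; activity is the factor outcome, subject is dropped as a predictor
rf.fit <- randomForest(activity ~ . - subject, data = train,
                       importance = TRUE, ntree = 500)

# The OOB estimate of the error rate is reported in the model print-out
print(rf.fit)

# Per-variable importance (MeanDecreaseAccuracy, MeanDecreaseGini)
imp <- importance(rf.fit)
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)

# Dotchart of variable importance
varImpPlot(rf.fit)
```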
Examining a confusion matrix (Table 2) of predictions on the test set, we see that the highest prediction success rate was for laying, at 100% (intuitive considering the above weighting of angle), and the lowest was for walkdown.
Table 2
| observed \ predicted | laying | sitting | standing | walk | walkdown | walkup |
|---|---|---|---|---|---|---|
| laying | 100.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| sitting | 0.00% | 86.53% | 13.47% | 0.00% | 0.00% | 0.00% |
| standing | 0.00% | 10.02% | 89.98% | 0.00% | 0.00% | 0.00% |
| walk | 0.00% | 0.00% | 0.24% | 97.62% | 1.43% | 0.71% |
| walkdown | 0.00% | 0.00% | 0.00% | 1.93% | 81.77% | 16.30% |
| walkup | 0.00% | 0.00% | 7.24% | 0.26% | 6.72% | 85.79% |
There were 245 misclassification errors on a test set of 2658 observations, for an error rate of 9.21%.
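A sketch of how that confusion matrix and error rate would be computed, assuming the fitted rf.fit model and the test data frame from earlier:

```r
# Predict activities for the held-out test set
pred <- predict(rf.fit, newdata = test)

# Confusion matrix as row percentages (observed activity by predicted activity)
conf <- table(observed = test$activity, predicted = pred)
round(prop.table(conf, margin = 1) * 100, 2)

# Overall misclassification error rate
errors <- sum(pred != test$activity)
error.rate <- errors / nrow(test)
error.rate
```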
Conclusions
The confusion matrix (Table 2) shows that misclassifications, or false positives, occurred mostly between transitional states such as sitting to standing and vice versa. This makes me curious about the orientation of these devices on the waist. Differentiating walkdown from walkup proved difficult as well. This is interesting, as I believe that the speed and acceleration of walking downstairs would be significantly higher in magnitude than walking upstairs due to simple physical exertion and gravity. Variations are subjective, though, as they depend on the health and ability of individual users.