[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
To demonstrate this technique, I’m using my 2017 reading dataset. A reader requested I make this dataset available, which I have done – you can download it here. This post describes the data in more detail, but the short description is that this dataset contains the 53 books I read last year, with information on book genre, page length, how long it took me to read it, and two ratings: my own rating and the average rating on Goodreads. Look for another dataset soon, containing my 2018 reading data; I made a goal of 60 books and I’ve already read 50, meaning lots more data than last year. I’m thinking of going for broke and bumping my reading goal up to 100, because apparently 60 books is not enough of a challenge for me now that I spend so much downtime reading.
First, I’ll load my dataset, then conduct the basic linear model I demonstrated in the post linked above.
setwd("~/R") library(tidyverse) ## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' [13] ## Parsed with column specification: ## cols( ## .default = col_integer(), ## Title = col_character(), ## Author = col_character(), ## G_Rating = col_double(), ## Started = col_character(), ## Finished = col_character() ## ) ## See spec(...) for full column specifications. colnames(books)[13] <- "Author_Gender" myrating<-lm(My_Rating ~ Pages + Read_Time + Author_Gender + Fiction + Fantasy + Math_Stats + YA_Fic, data=books) summary(myrating) ## ## Call: ## lm(formula = My_Rating ~ Pages + Read_Time + Author_Gender + ## Fiction + Fantasy + Math_Stats + YA_Fic, data = books) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.73120 -0.34382 -0.00461 0.24665 1.49932 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.5861211 0.2683464 13.364 <2e-16 *** ## Pages 0.0019578 0.0007435 2.633 0.0116 * ## Read_Time -0.0244168 0.0204186 -1.196 0.2380 ## Author_Gender -0.1285178 0.1666207 -0.771 0.4445 ## Fiction 0.1052319 0.2202581 0.478 0.6351 ## Fantasy 0.5234710 0.2097386 2.496 0.0163 * ## Math_Stats -0.2558926 0.2122238 -1.206 0.2342 ## YA_Fic -0.7330553 0.2684623 -2.731 0.0090 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4952 on 45 degrees of freedom ## Multiple R-squared: 0.4624, Adjusted R-squared: 0.3788 ## F-statistic: 5.53 on 7 and 45 DF, p-value: 0.0001233
These analyses show that I give higher ratings for books that are longer and Fantasy genre, and lower ratings to books that are Young Adult Fiction. Now let’s see what happens if I run this same linear model through rpart. Note that this is a slightly different technique, looking for cuts that differentiate outcomes, so it will find slightly different results.
library(rpart) tree1 <- rpart(My_Rating ~ Pages + Read_Time + Author_Gender + Fiction + Fantasy + Math_Stats + YA_Fic, method = "anova", data=books) printcp(tree1) ## ## Regression tree: ## rpart(formula = My_Rating ~ Pages + Read_Time + Author_Gender + ## Fiction + Fantasy + Math_Stats + YA_Fic, data = books, method = "anova") ## ## Variables actually used in tree construction: ## [1] Fantasy Math_Stats Pages ## ## Root node error: 20.528/53 = 0.38733 ## ## n= 53 ## ## CP nsplit rel error xerror xstd ## 1 0.305836 0 1.00000 1.03609 0.17531 ## 2 0.092743 1 0.69416 0.76907 0.12258 ## 3 0.022539 2 0.60142 0.71698 0.11053 ## 4 0.010000 3 0.57888 0.74908 0.11644
These results different somewhat. Pages is still a significant variable, as is Fantasy. But now Math_Stats (indicating books that are about mathematics or statistics, one of my top genres from last year) also is. These are the variables used by the analysis to construct my regression tree. If we look at the full summary –
summary(tree1) ## Call: ## rpart(formula = My_Rating ~ Pages + Read_Time + Author_Gender + ## Fiction + Fantasy + Math_Stats + YA_Fic, data = books, method = "anova") ## n= 53 ## ## CP nsplit rel error xerror xstd ## 1 0.30583640 0 1.0000000 1.0360856 0.1753070 ## 2 0.09274251 1 0.6941636 0.7690729 0.1225752 ## 3 0.02253938 2 0.6014211 0.7169813 0.1105294 ## 4 0.01000000 3 0.5788817 0.7490758 0.1164386 ## ## Variable importance ## Pages Fantasy Fiction YA_Fic Math_Stats Read_Time ## 62 18 8 6 4 3 ## ## Node number 1: 53 observations, complexity param=0.3058364 ## mean=4.09434, MSE=0.3873265 ## left son=2 (9 obs) right son=3 (44 obs) ## Primary splits: ## Pages < 185 to the left, improve=0.30583640, (0 missing) ## Fiction < 0.5 to the left, improve=0.24974560, (0 missing) ## Fantasy < 0.5 to the left, improve=0.20761810, (0 missing) ## Math_Stats < 0.5 to the right, improve=0.20371790, (0 missing) ## Author_Gender < 0.5 to the right, improve=0.02705187, (0 missing) ## ## Node number 2: 9 observations ## mean=3.333333, MSE=0.2222222 ## ## Node number 3: 44 observations, complexity param=0.09274251 ## mean=4.25, MSE=0.2784091 ## left son=6 (26 obs) right son=7 (18 obs) ## Primary splits: ## Fantasy < 0.5 to the left, improve=0.15541600, (0 missing) ## Fiction < 0.5 to the left, improve=0.12827990, (0 missing) ## Math_Stats < 0.5 to the right, improve=0.10487750, (0 missing) ## Pages < 391 to the left, improve=0.05344995, (0 missing) ## Read_Time < 7.5 to the right, improve=0.04512078, (0 missing) ## Surrogate splits: ## Fiction < 0.5 to the left, agree=0.773, adj=0.444, (0 split) ## YA_Fic < 0.5 to the left, agree=0.727, adj=0.333, (0 split) ## Pages < 370 to the left, agree=0.682, adj=0.222, (0 split) ## Read_Time < 3.5 to the right, agree=0.659, adj=0.167, (0 split) ## ## Node number 6: 26 observations, complexity param=0.02253938 ## mean=4.076923, MSE=0.2248521 ## left son=12 (7 obs) right son=13 (19 obs) ## Primary splits: ## Math_Stats < 0.5 to the right, improve=0.079145230, (0 missing) ## Pages < 364 to the left, improve=0.042105260, (0 missing) ## Fiction < 0.5 to the left, improve=0.042105260, (0 missing) ## Read_Time < 5.5 to the left, improve=0.016447370, (0 missing) ## Author_Gender < 0.5 to the right, improve=0.001480263, (0 missing) ## ## Node number 7: 18 observations ## mean=4.5, MSE=0.25 ## ## Node number 12: 7 observations ## mean=3.857143, MSE=0.4081633 ## ## Node number 13: 19 observations ## mean=4.157895, MSE=0.132964
we see that Fiction, YA_Fic, and Read_Time were also significant variables. The problem is that there is multicollinearity between Fiction, Fantasy, Math_Stats, and YA_Fic. All Fantasy and YA_Fic books are Fiction, while all Math_Stats books are not Fiction. And all YA_Fic books I read were Fantasy. This is probably why the tree didn’t use Fiction or YA_Fic. I’m not completely clear on why Read_Time didn’t end up in the regression tree, but it may be because my read time was pretty constant among the different splits and didn’t add any new information to the tree. If I were presenting these results somewhere other than my blog, I’d probably want to do some follow-up analyses to confirm this fact.
Now the fun part: let’s plot our regression tree:
plot(tree1, uniform = TRUE, main = "Regression Tree for My Goodreads Ratings") text(tree1, use.n = TRUE, all = TRUE, cex = 0.8)