Site icon R-bloggers

Taking Expectations to the Next Level

[This article was first published on jacobsimmering.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !-- Styles for R syntax highlighter --> < !-- R syntax highlighter -->

Higher Expectations

I came across this post on Thursday and found it to be quite interesting. Clearly rental prices vary according to where you live. That isn't too surprising. I started thinking a bit more about it and thought that Boston and the nearby communities would have to have some differences — the area near campus would probably be high bedroom places split among students and other areas would be more “adult” lodging with lower intended occupancy rates.

Some people might know what parts of Boston are the snobby parts and which are the low rent student areas, but I don't have a clue. This might be a wonderful use of EM like I posted about yesterday.

And I was lucky enough the the scripts and data were openly shared by the author Jeff Kaufman. This seems like the prefect time to load it up and see what we can do with it.

Lets first load the data and get a feel for it.

# data at
# http://raw.github.com/jeffkaufman/apartment_prices/master/apts-2013-01-29.txt
bostonRent <- read.table("~/bostonRents")
# from the GitHub code we know the variable names, so lets make them human
# readable
names(bostonRent) <- c("rent", "bedrooms", "id", "long", "lat")
summary(bostonRent[c(1, 2)])

##       rent          bedrooms   
##  Min.   :  250   Min.   :0.00  
##  1st Qu.: 1700   1st Qu.:1.00  
##  Median : 2400   Median :2.00  
##  Mean   : 2697   Mean   :2.31  
##  3rd Qu.: 3300   3rd Qu.:3.00  
##  Max.   :23000   Max.   :9.00

So there are a few apartments with zero bedrooms. Those could either be studio apartments, apartments where the number of bedrooms was incorrectly entered or other properties. Since we don't know and they are relatively rare {r} NROW(bostonRent$bedrooms[bostonRent[2] == 0]) out of {r} nrow(bostonRent) we will just delete them.

bostonRentNonZero <- bostonRent[bostonRent$bedrooms != 0, ]
summary(bostonRentNonZero[c(1, 2)])

##       rent          bedrooms   
##  Min.   :  550   Min.   :1.00  
##  1st Qu.: 1850   1st Qu.:1.00  
##  Median : 2500   Median :2.00  
##  Mean   : 2808   Mean   :2.55  
##  3rd Qu.: 3456   3rd Qu.:3.00  
##  Max.   :23000   Max.   :9.00

That looks better. Lets put together the simple model where rent is simply a function of the number of bedrooms:

# just to visually confirm that this is a sensible model
library(ggplot2)
ggplot(bostonRentNonZero, aes(x = bedrooms, y = rent)) + geom_point(position = "jitter")

# that looks promising, a few extreme values but mostly linear and well
# behaved lets make it second order to account for potentially decreasing
# marginal cost of bedrooms
bedroomModel <- lm(rent ~ I(bedrooms^2) + bedrooms, data = bostonRentNonZero)
summary(bedroomModel)

## 
## Call:
## lm(formula = rent ~ I(bedrooms^2) + bedrooms, data = bostonRentNonZero)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2761   -631   -158    367  19967 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1463.72      75.21   19.46   <2e-16 ***
## I(bedrooms^2)    14.59       7.78    1.87    0.061 .  
## bedrooms        479.48      52.11    9.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1130 on 3081 degrees of freedom
## Multiple R-squared: 0.33,    Adjusted R-squared: 0.33 
## F-statistic:  760 on 2 and 3081 DF,  p-value: <2e-16

Considering how simple the model is, it does relatively well with an adjusted R squared of {r} signif(summary(bedroomModel)$adj, 3). I mean, we aren't considering the number of bathrooms, on or off-street parking or the size of the unit. Glad to know that housing prices are still somewhat sensible.

Now what happens if we try and discover what are the different sections of Boston using two different sets of data to run EM on. Also, I don't know anything about Boston so lets set the max number of clusters to be 20 as that seems reasonable.

library(mclust)

## Package 'mclust' version 4.0

# for some reason mclust treats the matrix as univariate unless we build
# it this way simple just considering location
simpleEM <- matrix(NA, nrow = nrow(bostonRentNonZero), ncol = 2)
simpleEM[, 1] <- bostonRentNonZero$long
simpleEM[, 2] <- bostonRentNonZero$lat

# complex considering location and the number of bedrooms
complexEM <- matrix(NA, nrow = nrow(bostonRentNonZero), ncol = 3)
complexEM[, 1] <- bostonRentNonZero$long
complexEM[, 2] <- bostonRentNonZero$lat
complexEM[, 3] <- bostonRentNonZero$bedrooms


simpleModel <- Mclust(simpleEM, G = 1:20)
summary(simpleModel)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VEV (ellipsoidal, equal shape) model with 19 components:
## 
##  log.likelihood    n df   BIC
##           15303 3084 95 29842
## 
## Clustering table:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 110  83  88 181 239 130 521 112 229  79 336 107 108  81 373  60  60  47 
##  19 
## 140

complexModel <- Mclust(complexEM, G = 1:20)
summary(complexModel)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust EEV (ellipsoidal, equal volume and shape) model with 12 components:
## 
##  log.likelihood    n df   BIC
##           35443 3084 86 70194
## 
## Clustering table:
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 737 725 541  41 430  60 261  21 152  79  32   5

bostonRentNonZero$simple <- as.factor(simpleModel$classification)
bostonRentNonZero$complex <- as.factor(complexModel$classification)

We found somewhat fewer clusters using the model that considered both location and number of bedrooms ({r} complexModel$G vs {r} simpleModel$G). Lets see what these clusters look like in space:

ggplot(bostonRentNonZero) + geom_point(aes(x = long, y = lat, color = simple))

ggplot(bostonRentNonZero) + geom_point(aes(x = long, y = lat, color = complex))

I can plausibly buy that those map to distinct areas of Boston. I am less sure about the value of including bedrooms in the EM model but we will check it out anyway.

Lets see up the EM-improved regression models:

simpleLm <- lm(rent ~ I(bedrooms^2) + bedrooms + simple, data = bostonRentNonZero)
complexLm <- lm(rent ~ I(bedrooms^2) + bedrooms + complex, data = bostonRentNonZero)

summary(simpleLm)

## 
## Call:
## lm(formula = rent ~ I(bedrooms^2) + bedrooms + simple, data = bostonRentNonZero)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3399   -498   -123    333  19007 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     685.43     113.41    6.04  1.7e-09 ***
## I(bedrooms^2)     8.40       6.88    1.22  0.22258    
## bedrooms        613.00      46.60   13.15  < 2e-16 ***
## simple2         948.74     143.00    6.63  3.8e-11 ***
## simple3          39.29     140.41    0.28  0.77964    
## simple4        -339.03     118.85   -2.85  0.00436 ** 
## simple5         470.97     115.80    4.07  4.9e-05 ***
## simple6        1820.51     127.32   14.30  < 2e-16 ***
## simple7          26.43     105.18    0.25  0.80160    
## simple8          51.02     132.03    0.39  0.69917    
## simple9         378.53     114.08    3.32  0.00092 ***
## simple10       -279.98     144.73   -1.93  0.05314 .  
## simple11        367.45     107.82    3.41  0.00066 ***
## simple12        -63.40     133.27   -0.48  0.63433    
## simple13        724.24     133.23    5.44  5.9e-08 ***
## simple14        534.48     144.63    3.70  0.00022 ***
## simple15       1393.39     106.51   13.08  < 2e-16 ***
## simple16        899.21     157.51    5.71  1.2e-08 ***
## simple17       1351.83     157.49    8.58  < 2e-16 ***
## simple18        343.93     171.01    2.01  0.04440 *  
## simple19       1060.51     125.08    8.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 981 on 3063 degrees of freedom
## Multiple R-squared: 0.499,   Adjusted R-squared: 0.496 
## F-statistic:  153 on 20 and 3063 DF,  p-value: <2e-16

summary(complexLm)

## 
## Call:
## lm(formula = rent ~ I(bedrooms^2) + bedrooms + complex, data = bostonRentNonZero)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2546   -618   -169    357  19804 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        374       1200    0.31    0.756   
## I(bedrooms^2)     -121        138   -0.88    0.382   
## bedrooms          1740       1337    1.30    0.193   
## complex2          -825        925   -0.89    0.372   
## complex3         -1308       1572   -0.83    0.405   
## complex4          -552        179   -3.09    0.002 **
## complex5         -1800       1944   -0.93    0.355   
## complex6         -2463       1577   -1.56    0.118   
## complex7         -1119        926   -1.21    0.227   
## complex8         -3054       1958   -1.56    0.119   
## complex9         -1728       2043   -0.85    0.398   
## complex10        -1578       1873   -0.84    0.400   
## complex11        -1039       1446   -0.72    0.473   
## complex12           NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1110 on 3071 degrees of freedom
## Multiple R-squared: 0.354,   Adjusted R-squared: 0.351 
## F-statistic:  140 on 12 and 3071 DF,  p-value: <2e-16

I don't particularly like the effect the clustering method that includes bedrooms. I feel like it is too co-linear with the bedrooms variable. In fact, the variance inflation factor suggests that this is true.

In addition to the technical issues, including the bedroom number seems to move the clusters too far from geographic space. So I think we can safely conclude that adding bedrooms to the cluster definition doesn't help and might hurt our performance.

However, the model that clustered simply on geographic space did wonderfully. We increased our adjusted R squared to {r} summary(simpleLm)$adj, an increase of {r} summary(simpleLm)$adj - summary(bedroomModel)$adj. That is pretty impressive. Lets test the value of the addition more formally with an ANOVA:

anova(simpleLm, bedroomModel)

## Analysis of Variance Table
## 
## Model 1: rent ~ I(bedrooms^2) + bedrooms + simple
## Model 2: rent ~ I(bedrooms^2) + bedrooms
##   Res.Df      RSS  Df Sum of Sq    F Pr(>F)    
## 1   3063 2.95e+09                              
## 2   3081 3.95e+09 -18 -9.96e+08 57.5 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I think it is pretty safe to say that the addition of the geographically defined clusters has use in improving our housing price model.

I feel like this is a pretty interesting use of EM in real world data. I know nothing about Boston housing pricing but using EM to automagically find geographic clusters and the number of bedrooms, I was able to build a fairly simple linear model to explain roughly 50% of the variance in rent prices. The expected pattern of prices holds up in the model — there are places you would pay more (as much as 1,800 dollars more) to live and other places where prices are very low. We also learned that Boston is a bit pricey, your zero bedroom apartment will run you 685 a month.

To leave a comment for the author, please follow the link and comment on their blog: jacobsimmering.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.