[This article was first published on Wiekvoet, and kindly contributed to R-bloggers.]
I wanted to try some more machine learning. On Kaggle there is a competition, How Much Did It Rain? II. This is a much bigger data set than Titanic. To quote from Kaggle:
Rainfall is highly variable across space and time, making it notoriously tricky to measure. Rain gauges can be an effective measurement tool for a specific location, but it is impossible to have them everywhere. In order to have widespread coverage, data from weather radars is used to estimate rainfall nationwide. Unfortunately, these predictions never exactly match the measurements taken using rain gauges.
Data
On the data themselves: To understand the data, you have to realize that there are multiple radar observations over the course of an hour, and only one gauge observation (the ‘Expected’). That is why there are multiple rows with the same ‘Id’.
I have downloaded the data and at this point am just experimenting with them. It is quite a big data set: there are 9,125,329 rows in the training set. My idea was to do ‘something’ per record and then combine the records of one hour into a prediction. The ‘something’ is as yet undefined; the combination by Id is meant to be retained throughout.
What became clear pretty quickly is that everything is slow with this amount of data. Hence for now I will use only 10% of the training data. For ease of access the data are sitting in an R data set.
load('aaa3.RData')
###
# take 10% of the data (subsetting by Id keeps the hours together)
rawdata <- rawdata[rawdata$Id < quantile(rawdata$Id, .1), ]
# extract the keys per hour: Id & Expected.
# rawdata$Id != c(0, rawdata$Id[1:(nrow(rawdata)-1)])
# is the R way to write the SAS code: by Id; if first.Id;
r1b <- rawdata[rawdata$Id != c(0, rawdata$Id[1:(nrow(rawdata)-1)]), c(1, 24)]
Id <- factor(rawdata$Id)
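To see this per-hour structure in the subset, a quick check along these lines can help. This is a minimal sketch of my own, not part of the modeling pipeline; it only assumes the Id and Expected columns from the competition data:
library(dplyr)
# count radar scans per hour; each Id has one gauge value (Expected)
rawdata %>%
  group_by(Id) %>%
  summarise(n_scans = n(), Expected = first(Expected)) %>%
  head()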
Model
To get an idea of the process I started with linear regression, but that is just a temporary approach. For linear regression there are 22 parameters: 21 observed values and an intercept. The prediction per row follows from a simple matrix multiplication. The model, including the calculation of the error of the fit, sits in a small R function. As preparation, a column of ones is added to the x data. The summary per Id can be done quickly and easily via the group_by() and summarise() functions from dplyr. Based on the current results I have decided that such a function will have to be transferred to C++ or similar to get a decent computation time. But that is for a future time; it has been quite some years since I programmed in C or Fortran, so I'll need a refresher first. Luckily edX has a course ‘Introduction to C++’ running right now.
library(dplyr) # for %>%, group_by() and summarise()
r1m <- as.matrix(rawdata[, c(-1, -2, -24)])
rm(rawdata) # control memory usage
r1m <- cbind(rep(1, nrow(r1m)), r1m) # add column of ones for the intercept
r1m[is.na(r1m)] <- 0
betas <- rep(1, ncol(r1m))
# myerr calculates the mean prediction per Id and compares it with the Expected values
myerr <- function(betas) {
  pred <- data.frame(Id = Id,
                     pred = as.numeric(tcrossprod(betas, r1m))) %>%
    group_by(., Id) %>%
    summarise(., m = mean(pred))
  sum(abs(pred$m - r1b$Expected))
}
# mmyerr negates myerr, for optimizers that maximize
mmyerr <- function(betas) -myerr(betas)
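Before optimizing it is worth evaluating the objective once, both as a sanity check and to see why run time matters. A minimal sketch; the zero-vector reference point is my addition:
# error at the all-ones starting values used below
myerr(betas)
# reference: predicting 0 for every hour, i.e. sum(r1b$Expected)
myerr(rep(0, ncol(r1m)))
# time a single evaluation; this dominates each optimizer's run time
system.time(myerr(betas))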
Parameter estimation & optimization
The problem has now been reduced to finding the parameters which give the lowest prediction error. Just throwing this into optim() did not lead to satisfactory results, so this post records some experiments with alternative approaches. I played with several of them and, for this post, ran them all to get comparable numbers. The table gives a quick summary (time is elapsed seconds; result is the error function value reached):

package | function     | time (s) | result  | converged
--------|--------------|----------|---------|----------
stats   | optim        | 771      | 1805459 | No
stats   | optim (BFGS) | 729      | 1678736 | Yes
stats   | optim (CG)   | 5527     | 1678775 | Yes
adagio  | simpleEA     | 623      | 1722928 | No
dfoptim | hjk          | 4289     | 1678734 | Yes
GA      | ga           | 589      | 1775910 | NA
> system.time(
+ ooptim <- optim(betas,
+ myerr,
+ control=list(maxit=5000)))
user system elapsed
771.204 0.216 770.743
> ooptim
$par
[1] 5.91962946 -0.26798071 -1.19797656 -0.13297181 0.51015756 1.49454289
[7] 0.75377772 0.47774932 -0.54917591 -1.36237144 9.32248883 -4.58509259
[13] 1.53973413 -1.28367152 1.57070010 2.80960763 -1.53639162 0.24043815
[19] 0.55385066 0.65754031 2.25820673 -0.09389256
$value
[1] 1805459
$counts
function gradient
5001 NA
$convergence
[1] 1
$message
NULL
> system.time(
+ ooptimBFGS <- optim(betas,
+ myerr,
+ control=list(maxit=5000),
+ method='BFGS'))
user system elapsed
730.253 0.226 729.741
> ooptimBFGS
$par
[1] -0.193572528 0.036277278 0.018915001 0.055201253 -0.005343143
[6] 0.048512761 0.018770974 -0.067885397 0.006576984 0.009960527
[11] 0.354908920 -0.417610915 0.004300646 -0.313622718 0.013003956
[16] 0.030448388 0.004622354 0.026076960 0.022466472 0.015661149
[21] 0.093649287 -0.056203122
$value
[1] 1678736
$counts
function gradient
626 94
$convergence
[1] 0
$message
NULL
> system.time(
+ ooptimCG <- optim(betas,
+ myerr,
+ control=list(maxit=5000),
+ method='CG'))
user system elapsed
5531.910 1.397 5527.053
> ooptimCG
$par
[1] -0.112575961 0.030875504 0.019111803 0.060134187 -0.009593463
[6] 0.051076794 0.019737360 -0.077191287 0.016508372 0.003658111
[11] -0.071932138 -0.036570553 -0.051100594 -0.179869515 -0.010209693
[16] 0.066388807 0.005459227 0.039691325 0.020881573 0.020044547
[21] 0.068064112 -0.051825924
$value
[1] 1678775
$counts
function gradient
2283 772
$convergence
[1] 0
$message
NULL
> library(adagio)
> system.time(
+ osimpleEA <- simpleEA(myerr,
+ lower=rep(-5,ncol(r1m)),
+ upper=rep(5,ncol(r1m))))
user system elapsed
624.432 0.176 623.857
> osimpleEA
$par
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
$val
[1] 1722928
$fun.calls
[1] 4060
$rel.scl
[1] 0.625
$rel.tol
[1] 0
> library(dfoptim)
> system.time(
+ ohjk <- hjk(betas,myerr)
+ )
user system elapsed
4291.935 2.036 4289.022
> ohjk
$par
[1] -0.188190460 0.035865784 0.020156860 0.056446075 -0.005027771
[6] 0.046699524 0.018516541 -0.069412231 0.006889343 0.010589600
[11] 0.257308960 -0.412174225 0.065429688 -0.279411316 0.014663696
[16] 0.027637482 0.011859894 0.020298004 0.023769379 0.013179779
[21] 0.092247009 -0.056926727
$value
[1] 1678734
$convergence
[1] 0
$feval
[1] 28245
$niter
[1] 19
> library(GA)
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Package ‘GA’ version 2.2
Type ‘citation(“GA”)’ for citing this R package in publications.
> system.time(
+ oga <- ga(type='real-valued',
+ fitness=mmyerr,
+ min=rep(-5,ncol(r1m)),
+ max=rep(5,ncol(r1m))))
Iter = 1 | Mean = -10979924 | Best = -2576076
Iter = 2 | Mean = -7162806 | Best = -2576076
Iter = 3 | Mean = -4932565 | Best = -2318262
Iter = 4 | Mean = -4639204 | Best = -2237286
Iter = 5 | Mean = -3638851 | Best = -2237286
Iter = 6 | Mean = -3689674 | Best = -2068472
Iter = 7 | Mean = -3226292 | Best = -1990599
Iter = 8 | Mean = -2977074 | Best = -1876791
Iter = 9 | Mean = -2775401 | Best = -1876791
Iter = 10 | Mean = -2332140 | Best = -1865152
Iter = 11 | Mean = -2437267 | Best = -1817242
Iter = 12 | Mean = -2471635 | Best = -1799720
Iter = 13 | Mean = -2168155 | Best = -1799720
Iter = 14 | Mean = -2096314 | Best = -1799720
Iter = 15 | Mean = -2253981 | Best = -1799720
Iter = 16 | Mean = -2049947 | Best = -1799720
Iter = 17 | Mean = -2089075 | Best = -1789813
Iter = 18 | Mean = -2093853 | Best = -1789813
Iter = 19 | Mean = -1976973 | Best = -1789813
Iter = 20 | Mean = -1933447 | Best = -1789813
Iter = 21 | Mean = -2000920 | Best = -1789813
Iter = 22 | Mean = -1950218 | Best = -1788022
Iter = 23 | Mean = -1886263 | Best = -1788022
Iter = 24 | Mean = -1914422 | Best = -1788022
Iter = 25 | Mean = -2024395 | Best = -1788022
Iter = 26 | Mean = -2016097 | Best = -1788022
Iter = 27 | Mean = -1947956 | Best = -1788022
Iter = 28 | Mean = -1900507 | Best = -1787331
Iter = 29 | Mean = -2066703 | Best = -1787331
Iter = 30 | Mean = -1835251 | Best = -1787331
Iter = 31 | Mean = -1942032 | Best = -1787331
Iter = 32 | Mean = -2019920 | Best = -1787331
Iter = 33 | Mean = -1891563 | Best = -1787331
Iter = 34 | Mean = -1838287 | Best = -1787331
Iter = 35 | Mean = -1925757 | Best = -1787331
Iter = 36 | Mean = -1972783 | Best = -1787331
Iter = 37 | Mean = -2053707 | Best = -1787331
Iter = 38 | Mean = -1999298 | Best = -1787331
Iter = 39 | Mean = -2057161 | Best = -1787331
Iter = 40 | Mean = -2016452 | Best = -1787331
Iter = 41 | Mean = -2106619 | Best = -1787331
Iter = 42 | Mean = -1848483 | Best = -1787331
Iter = 43 | Mean = -1952123 | Best = -1784611
Iter = 44 | Mean = -1911939 | Best = -1784611
Iter = 45 | Mean = -1867947 | Best = -1784611
Iter = 46 | Mean = -1961794 | Best = -1784611
Iter = 47 | Mean = -1878957 | Best = -1784611
Iter = 48 | Mean = -2004529 | Best = -1784611
Iter = 49 | Mean = -1819421 | Best = -1784611
Iter = 50 | Mean = -1904183 | Best = -1784611
Iter = 51 | Mean = -1870062 | Best = -1783617
Iter = 52 | Mean = -1948733 | Best = -1782194
Iter = 53 | Mean = -2023928 | Best = -1782194
Iter = 54 | Mean = -2008359 | Best = -1782194
Iter = 55 | Mean = -2022090 | Best = -1782194
Iter = 56 | Mean = -2086763 | Best = -1782194
Iter = 57 | Mean = -1959158 | Best = -1782194
Iter = 58 | Mean = -1849578 | Best = -1782194
Iter = 59 | Mean = -2119746 | Best = -1782194
Iter = 60 | Mean = -2218175 | Best = -1780252
Iter = 61 | Mean = -2096596 | Best = -1780252
Iter = 62 | Mean = -1979876 | Best = -1779764
Iter = 63 | Mean = -2172198 | Best = -1779187
Iter = 64 | Mean = -1876837 | Best = -1779187
Iter = 65 | Mean = -1880262 | Best = -1779187
Iter = 66 | Mean = -1830083 | Best = -1779187
Iter = 67 | Mean = -1908915 | Best = -1779187
Iter = 68 | Mean = -1994229 | Best = -1779187
Iter = 69 | Mean = -2090335 | Best = -1779187
Iter = 70 | Mean = -2314393 | Best = -1779187
Iter = 71 | Mean = -2072167 | Best = -1779187
Iter = 72 | Mean = -1901014 | Best = -1779187
Iter = 73 | Mean = -1805347 | Best = -1779187
Iter = 74 | Mean = -1868138 | Best = -1779187
Iter = 75 | Mean = -2054748 | Best = -1779187
Iter = 76 | Mean = -1894803 | Best = -1779187
Iter = 77 | Mean = -1789384 | Best = -1779187
Iter = 78 | Mean = -1919701 | Best = -1779187
Iter = 79 | Mean = -1885417 | Best = -1779187
Iter = 80 | Mean = -1911716 | Best = -1779187
Iter = 81 | Mean = -1951938 | Best = -1779187
Iter = 82 | Mean = -1968106 | Best = -1779187
Iter = 83 | Mean = -2118552 | Best = -1778650
Iter = 84 | Mean = -1891108 | Best = -1778650
Iter = 85 | Mean = -1967131 | Best = -1778650
Iter = 86 | Mean = -1971338 | Best = -1778650
Iter = 87 | Mean = -1798790 | Best = -1778508
Iter = 88 | Mean = -1990970 | Best = -1778440
Iter = 89 | Mean = -2103851 | Best = -1778440
Iter = 90 | Mean = -2176462 | Best = -1778440
Iter = 91 | Mean = -1852161 | Best = -1778440
Iter = 92 | Mean = -1913977 | Best = -1778440
Iter = 93 | Mean = -2110709 | Best = -1778440
Iter = 94 | Mean = -1948281 | Best = -1778440
Iter = 95 | Mean = -2221610 | Best = -1777905
Iter = 96 | Mean = -2276691 | Best = -1775910
Iter = 97 | Mean = -2141834 | Best = -1775910
Iter = 98 | Mean = -1871241 | Best = -1775910
Iter = 99 | Mean = -1829631 | Best = -1775910
Iter = 100 | Mean = -1914998 | Best = -1775910
user system elapsed
590.256 0.418 589.855
> summary(oga)
+-----------------------------------+
|         Genetic Algorithm         |
+-----------------------------------+
GA settings:
Type = real-valued
Population size = 50
Number of generations = 100
Elitism = 2
Crossover probability = 0.8
Mutation probability = 0.1
Search domain
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21
Min -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
Max 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
x22
Min -5
Max 5
GA results:
Iterations = 100
Fitness function value = -1775910
Solution =
x1 x2 x3 x4 x5 x6 x7
[1,] 0.2099863 0.3481724 0.3910016 -1.416855 -0.3929375 0.5143907 0.319798
x8 x9 x10 x11 x12 x13 x14
[1,] 1.795368 -0.445552 -0.8136325 -0.01177531 -0.823179 0.744795 -0.7005772
x15 x16 x17 x18 x19 x20 x21
[1,] 0.3549757 0.4885278 0.3535035 -0.04773755 -0.5155096 0.3308792 0.1921899
x22
[1,] 0.3446834
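As a cross-check (my addition, not in the original transcript), the GA solution can be pushed back through myerr(); the GA package stores the best parameter vector in the solution slot of the result object:
# evaluate the GA's best parameter vector with the original objective
best_ga <- as.numeric(oga@solution[1, ])
myerr(best_ga) # equals 1775910, the negative of the reported best fitness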