In the first part, I introduced the weather dataset and outlined its exploratory analysis. In this second part of our tutorial, we are going to build multiple logistic regression models to produce weather forecasts. Specifically, we intend to produce the following forecasts:
- tomorrow’s weather forecast at 9am of the current day
- tomorrow’s weather forecast at 3pm of the current day
- tomorrow’s weather forecast at late evening time of the current day
For each of the above tasks, a specific subset of variables is available, namely:
- 9am: MinTemp, WindSpeed9am, Humidity9am, Pressure9am, Cloud9am, Temp9am
- 3pm: (9am variables) + Humidity3pm, Pressure3pm, Cloud3pm, Temp3pm, MaxTemp
- evening: (3pm variables) + Evaporation, Sunshine, WindGustDir, WindGustSpeed
We suppose MinTemp is already available at 9am, as the overnight minimum temperature has been reached by then. We suppose MaxTemp is already available at 3pm, as it is determined during the central hours of the day. Further, we suppose Sunshine, Evaporation, WindGustDir and WindGustSpeed are final only by late evening. The other variables are explicitly bound to a specific time of day.
suppressPackageStartupMessages(library(caret))
set.seed(1023)
weather_data5 <- read.csv("weather_data5.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(weather_data5)
[1] "MinTemp" "MaxTemp" "Evaporation" "Sunshine" "WindGustDir"
[6] "WindGustSpeed" "WindSpeed9am" "WindSpeed3pm" "Humidity9am" "Humidity3pm"
[11] "Pressure9am" "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
[16] "Temp3pm" "RainTomorrow"
nrow(weather_data5)
[1] 353
sum(weather_data5["RainTomorrow"] == "Yes")
[1] 64
sum(weather_data5["RainTomorrow"] == "No")
[1] 289
We are going to split the original dataset into a training dataset (70% of the original data) and a testing dataset (the remaining 30%).
train_rec <- createDataPartition(weather_data5$RainTomorrow, p = 0.7, list = FALSE)
training <- weather_data5[train_rec,]
testing <- weather_data5[-train_rec,]
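Note that createDataPartition performs a stratified split, sampling within each class of the outcome. The same idea can be sketched in base R; the snippet below is only an illustration on simulated labels (all names are assumptions, not part of the weather dataset):

```r
set.seed(1023)
# Simulated unbalanced labels, an illustrative stand-in for RainTomorrow
labels <- factor(sample(c("No", "Yes"), 350, replace = TRUE, prob = c(0.8, 0.2)))

# Stratified 70/30 split: sample indexes within each class, then combine them
train_idx <- unlist(lapply(levels(labels), function(lv) {
  idx <- which(labels == lv)
  sample(idx, round(0.7 * length(idx)))
}))

# The class fractions are (approximately) preserved in both partitions
table(labels[train_idx]) / length(train_idx)
table(labels[-train_idx]) / (length(labels) - length(train_idx))
```

Because the sampling happens per class, the Yes/No ratio of both partitions stays close to that of the full dataset, which is exactly what we verify next for our real split.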
We check the balance of RainTomorrow Yes/No fractions in the training and testing datasets.
sum(training["RainTomorrow"] == "Yes")/sum(training["RainTomorrow"] == "No")
[1] 0.2216749
sum(testing["RainTomorrow"] == "Yes")/sum(testing["RainTomorrow"] == "No")
[1] 0.2209302
The dataset is slightly unbalanced with respect to the label to be predicted. We will check later whether some remedy is needed or whether the achieved accuracy is satisfactory as it is.
9AM Forecast Model
For all models, we are going to take advantage of a train control directive made available by the caret package, which prescribes repeated k-fold cross-validation. The k-fold cross-validation method involves splitting the training dataset into k subsets. Each subset is held out in turn while the model is trained on the remaining subsets. This process continues until accuracy has been determined for each instance in the training dataset, and an overall accuracy estimate is provided. Splitting the data into k folds can itself be repeated a number of times; this is called repeated k-fold cross-validation. The final model accuracy is taken as the mean over the repeats.
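The procedure described above can be sketched in base R. The example below is self-contained and uses simulated data with one informative predictor (the variable names and the data-generating model are illustrative assumptions, not the weather data):

```r
set.seed(1023)
# Simulated binary outcome with one informative predictor (illustrative names)
n <- 200
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(1.5 * x), "Yes", "No"))
df <- data.frame(x = x, y = y)

k <- 10       # number of folds
repeats <- 5  # number of repetitions of the whole k-fold split
accuracies <- numeric(0)

for (r in seq_len(repeats)) {
  folds <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
  for (f in seq_len(k)) {
    # Train on k-1 folds, evaluate on the held-out fold
    fit <- glm(y ~ x, data = df[folds != f, ], family = binomial)
    prob <- predict(fit, df[folds == f, ], type = "response")
    pred <- ifelse(prob > 0.5, "Yes", "No")
    accuracies <- c(accuracies, mean(pred == df$y[folds == f]))
  }
}
mean(accuracies)  # overall accuracy estimate, averaged over k * repeats folds
```

caret's trainControl/train pair, used below, performs exactly this bookkeeping for us.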
trControl <- trainControl(method = "repeatedcv", repeats = 5, number = 10, verboseIter = FALSE)
The trainControl result is passed as a parameter to the caret train() function. As a further input to train(), the tuneLength parameter indicates the number of different values to try for each algorithm parameter. However, in the case of the “glm” method it does not drive any specific behavior, and hence it will be omitted.
By taking into account the results from the exploratory analysis, we herein compare two models for the 9AM prediction. The first one is determined and evaluated as follows.
predictors_9am_c1 <- c("Cloud9am", "Humidity9am", "Pressure9am", "Temp9am")
formula_9am_c1 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c1, collapse="+"), sep="~"))
mod9am_c1_fit <- train(formula_9am_c1, data=training, method="glm",
family="binomial", trControl = trControl, metric = 'Accuracy')
mod9am_c1_fit$results$Accuracy
[1] 0.8383538
The resulting accuracy shown above is the one determined by the repeated k-fold cross-validation explained earlier. The obtained value is not bad, considering that only data available at 9AM is used.
(summary_rep <- summary(mod9am_c1_fit$finalModel))
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7667 -0.5437 -0.3384 -0.1874 2.7742
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 96.01611 33.57702 2.860 0.004242 **
Cloud9am 0.17180 0.07763 2.213 0.026893 *
Humidity9am 0.06769 0.02043 3.313 0.000922 ***
Pressure9am -0.10286 0.03283 -3.133 0.001729 **
Temp9am 0.10595 0.04545 2.331 0.019744 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.90 on 247 degrees of freedom
Residual deviance: 177.34 on 243 degrees of freedom
AIC: 187.34
Number of Fisher Scoring iterations: 5
From the above summary, all predictors result as significant for the logistic regression model. We can test the null hypothesis that the logistic model represents an adequate fit for the data by computing the following p-value.
1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9994587
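To see what this test does, here is a minimal self-contained illustration on simulated data (the names and the data-generating model are assumptions, not the weather data): the residual deviance of the fit is compared against a chi-squared distribution with the residual degrees of freedom.

```r
set.seed(1023)
# Simulated data generated from a true logistic model (illustrative names),
# so the fitted model should be an adequate description of the data
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)
s <- summary(fit)

# Compare the residual deviance to a chi-squared distribution with the
# residual degrees of freedom; a large p-value gives no evidence of lack of fit
p_value <- 1 - pchisq(s$deviance, s$df.residual)
p_value
```

Note that for ungrouped binary data this deviance-based check is only a rough diagnostic, since the residual deviance does not follow a chi-squared distribution exactly.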
We further investigate whether any predictor can be dropped by taking advantage of the drop1() function.
drop1(mod9am_c1_fit$finalModel, test="Chisq")
Single term deletions
Model:
.outcome ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
Df Deviance AIC LRT Pr(>Chi)
177.34 187.34
Cloud9am 1 182.48 190.48 5.1414 0.0233623 *
Humidity9am 1 190.19 198.19 12.8502 0.0003374 ***
Pressure9am 1 187.60 195.60 10.2668 0.0013544 **
Temp9am 1 183.13 191.13 5.7914 0.0161050 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can evaluate a second model where Temp9am is replaced by MinTemp. We do not keep both, as they are correlated (see the exploratory analysis in part #1).
predictors_9am_c2 <- c("Cloud9am", "Humidity9am", "Pressure9am", "MinTemp")
formula_9am_c2 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c2, collapse="+"), sep="~"))
mod9am_c2_fit <- train(formula_9am_c2, data=training, method="glm",
family="binomial", trControl = trControl, metric = 'Accuracy')
mod9am_c2_fit$results$Accuracy
[1] 0.8366462
(summary_rep <- summary(mod9am_c2_fit$finalModel))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7233 -0.5446 -0.3510 -0.1994 2.7237
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 103.00407 33.57102 3.068 0.00215 **
Cloud9am 0.16379 0.08014 2.044 0.04096 *
Humidity9am 0.05462 0.01855 2.945 0.00323 **
Pressure9am -0.10792 0.03298 -3.272 0.00107 **
MinTemp 0.06750 0.04091 1.650 0.09896 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.9 on 247 degrees of freedom
Residual deviance: 180.3 on 243 degrees of freedom
AIC: 190.3
Number of Fisher Scoring iterations: 5
The p-value associated with the null hypothesis that this model is a good fit for the data is:
1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9990409
This second model shows a similar accuracy value; however, the MinTemp p-value shows that this predictor is not significant. Further, the explained deviance is slightly lower than the first model's. As a conclusion, we select the first model for the 9AM prediction purpose and go on by computing its accuracy on the test set.
mod9am_pred <- predict(mod9am_c1_fit, testing)
confusionMatrix(mod9am_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 80 12
Yes 6 7
Accuracy : 0.8286
95% CI : (0.7427, 0.8951)
No Information Rate : 0.819
P-Value [Acc > NIR] : 0.4602
Kappa : 0.3405
Mcnemar's Test P-Value : 0.2386
Sensitivity : 0.9302
Specificity : 0.3684
Pos Pred Value : 0.8696
Neg Pred Value : 0.5385
Prevalence : 0.8190
Detection Rate : 0.7619
Detection Prevalence : 0.8762
Balanced Accuracy : 0.6493
'Positive' Class : No
The confusion matrix shows encouraging results: not a bad test accuracy for a 9AM weather forecast. We then go on to build the 3PM prediction model.
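As a quick check of how these statistics follow from the table above, we can recompute them from the reported counts (plain arithmetic; recall that the 'No' class is treated as the positive one):

```r
# Counts from the 9AM confusion matrix above; 'No' is the positive class
TP <- 80  # predicted No, observed No
FN <- 6   # predicted Yes, observed No
FP <- 12  # predicted No, observed Yes
TN <- 7   # predicted Yes, observed Yes

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.8286
sensitivity <- TP / (TP + FN)                   # 0.9302
specificity <- TN / (TN + FP)                   # 0.3684
pos_pred    <- TP / (TP + FP)                   # 0.8696
```

The low specificity reflects the class imbalance noted earlier: rainy days are rare and harder to catch.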
3PM Forecast Model
We are going to use what we expect to be reliable predictors at 3PM. We have already seen that they are not strongly correlated with each other and are not linearly dependent.
predictors_3pm_c1 <- c("Cloud3pm", "Humidity3pm", "Pressure3pm", "Temp3pm")
formula_3pm_c1 <- as.formula(paste("RainTomorrow", paste(predictors_3pm_c1, collapse="+"), sep="~"))
mod3pm_c1_fit <- train(formula_3pm_c1, data = training, method = "glm", family = "binomial",
trControl = trControl, metric = 'Accuracy')
mod3pm_c1_fit$results$Accuracy
[1] 0.8582333
This model shows an acceptable accuracy as measured on the training set.
(summary_rep <- summary(mod3pm_c1_fit$finalModel))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3065 -0.5114 -0.2792 -0.1340 2.6641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 122.56229 35.20777 3.481 0.000499 ***
Cloud3pm 0.27754 0.09866 2.813 0.004906 **
Humidity3pm 0.06745 0.01817 3.711 0.000206 ***
Pressure3pm -0.12885 0.03443 -3.743 0.000182 ***
Temp3pm 0.10877 0.04560 2.385 0.017071 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.90 on 247 degrees of freedom
Residual deviance: 159.64 on 243 degrees of freedom
AIC: 169.64
Number of Fisher Scoring iterations: 6
All predictors are reported as significant for the model. Let us further verify whether any predictor can be dropped.
drop1(mod3pm_c1_fit$finalModel, test="Chisq")
Single term deletions
Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
Df Deviance AIC LRT Pr(>Chi)
159.64 169.64
Cloud3pm 1 168.23 176.23 8.5915 0.003377 **
Humidity3pm 1 175.91 183.91 16.2675 5.500e-05 ***
Pressure3pm 1 175.60 183.60 15.9551 6.486e-05 ***
Temp3pm 1 165.77 173.77 6.1329 0.013269 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value associated with the null hypothesis that this model is a good fit for the data is:
1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999913
We go on with the computation of the test set accuracy.
mod3pm_pred <- predict(mod3pm_c1_fit, testing)
confusionMatrix(mod3pm_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 82 8
Yes 4 11
Accuracy : 0.8857
95% CI : (0.8089, 0.9395)
No Information Rate : 0.819
P-Value [Acc > NIR] : 0.04408
Kappa : 0.58
Mcnemar's Test P-Value : 0.38648
Sensitivity : 0.9535
Specificity : 0.5789
Pos Pred Value : 0.9111
Neg Pred Value : 0.7333
Prevalence : 0.8190
Detection Rate : 0.7810
Detection Prevalence : 0.8571
Balanced Accuracy : 0.7662
'Positive' Class : No
The test set prediction accuracy is quite satisfactory.
Evening Forecast Model
As a first evening forecast model, we introduce the Sunshine variable and, at the same time, take Cloud3pm and Humidity3pm out, as they are strongly correlated with Sunshine.
predictors_evening_c1 <- c("Pressure3pm", "Temp3pm", "Sunshine")
formula_evening_c1 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c1, collapse="+"), sep="~"))
mod_ev_c1_fit <- train(formula_evening_c1, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c1_fit$results$Accuracy
[1] 0.8466974
The training set based accuracy is acceptable.
(summary_rep <- summary(mod_ev_c1_fit$finalModel))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9404 -0.4532 -0.2876 -0.1350 2.4664
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 174.43458 37.12201 4.699 2.61e-06 ***
Pressure3pm -0.17175 0.03635 -4.726 2.29e-06 ***
Temp3pm 0.06506 0.03763 1.729 0.0838 .
Sunshine -0.41519 0.07036 -5.901 3.61e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.90 on 247 degrees of freedom
Residual deviance: 156.93 on 244 degrees of freedom
AIC: 164.93
Number of Fisher Scoring iterations: 6
The Temp3pm predictor is reported as not significant and can be dropped, as confirmed below.
drop1(mod_ev_c1_fit$finalModel, test="Chisq")
Single term deletions
Model:
.outcome ~ Pressure3pm + Temp3pm + Sunshine
Df Deviance AIC LRT Pr(>Chi)
156.93 164.93
Pressure3pm 1 185.64 191.64 28.712 8.400e-08 ***
Temp3pm 1 160.03 166.03 3.105 0.07805 .
Sunshine 1 203.03 209.03 46.098 1.125e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value associated with the null hypothesis that this model is a good fit for the data is:
1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999968
As a second tentative model, we take advantage of the 3PM model predictors together with WindGustDir and WindGustSpeed.
predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir", "WindGustSpeed")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8681179
It results in an improved training set accuracy.
(summary_rep <- summary(mod_ev_c2_fit$finalModel))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.44962 -0.42289 -0.20929 -0.04328 2.76204
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.928e+01 4.863e+01 2.042 0.041193 *
Cloud3pm 2.569e-01 1.081e-01 2.376 0.017481 *
Humidity3pm 8.441e-02 2.208e-02 3.822 0.000132 ***
Pressure3pm -1.088e-01 4.695e-02 -2.318 0.020462 *
Temp3pm 1.382e-01 5.540e-02 2.495 0.012596 *
WindGustDirENE -1.064e+00 1.418e+00 -0.750 0.453254
WindGustDirESE 1.998e+00 1.088e+00 1.837 0.066235 .
WindGustDirN -2.731e-03 1.759e+00 -0.002 0.998761
WindGustDirNE 1.773e+00 1.260e+00 1.407 0.159364
WindGustDirNNE -1.619e+01 2.884e+03 -0.006 0.995520
WindGustDirNNW 1.859e+00 1.021e+00 1.821 0.068672 .
WindGustDirNW 1.011e+00 1.009e+00 1.002 0.316338
WindGustDirS 1.752e+00 1.248e+00 1.404 0.160394
WindGustDirSE -1.549e+01 2.075e+03 -0.007 0.994043
WindGustDirSSE -1.051e-01 1.636e+00 -0.064 0.948777
WindGustDirSSW -1.451e+01 6.523e+03 -0.002 0.998225
WindGustDirSW 2.002e+01 6.523e+03 0.003 0.997551
WindGustDirW 1.108e+00 1.183e+00 0.936 0.349051
WindGustDirWNW 1.325e-01 1.269e+00 0.104 0.916848
WindGustDirWSW -1.574e+01 4.367e+03 -0.004 0.997124
WindGustSpeed 1.674e-02 2.065e-02 0.811 0.417420
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.90 on 247 degrees of freedom
Residual deviance: 135.61 on 227 degrees of freedom
AIC: 177.61
Number of Fisher Scoring iterations: 17
The WindGustDir and WindGustSpeed predictors are reported as not significant. Also, the AIC value has increased considerably.
drop1(mod_ev_c2_fit$finalModel, test="Chisq")
Single term deletions
Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE +
WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE +
WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE +
WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW +
WindGustDirWNW + WindGustDirWSW + WindGustSpeed
Df Deviance AIC LRT Pr(>Chi)
135.61 177.61
Cloud3pm 1 141.64 181.64 6.0340 0.01403 *
Humidity3pm 1 154.49 194.49 18.8817 1.391e-05 ***
Pressure3pm 1 141.38 181.38 5.7692 0.01631 *
Temp3pm 1 142.21 182.21 6.5992 0.01020 *
WindGustDirENE 1 136.20 176.20 0.5921 0.44159
WindGustDirESE 1 139.40 179.40 3.7883 0.05161 .
WindGustDirN 1 135.61 175.61 0.0000 0.99876
WindGustDirNE 1 137.55 177.55 1.9412 0.16354
WindGustDirNNE 1 136.40 176.40 0.7859 0.37535
WindGustDirNNW 1 139.43 179.43 3.8160 0.05077 .
WindGustDirNW 1 136.69 176.69 1.0855 0.29747
WindGustDirS 1 137.64 177.64 2.0293 0.15430
WindGustDirSE 1 136.36 176.36 0.7458 0.38782
WindGustDirSSE 1 135.61 175.61 0.0041 0.94866
WindGustDirSSW 1 135.64 175.64 0.0339 0.85384
WindGustDirSW 1 138.38 178.38 2.7693 0.09609 .
WindGustDirW 1 136.52 176.52 0.9106 0.33996
WindGustDirWNW 1 135.62 175.62 0.0109 0.91689
WindGustDirWSW 1 135.85 175.85 0.2414 0.62318
WindGustSpeed 1 136.27 176.27 0.6633 0.41541
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
WindGustDir has borderline p-values for some specific directions, while WindGustSpeed is not significant and should be dropped from the model. Hence, we redefine the model after having taken WindGustSpeed off.
predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8574256
(summary_rep <- summary(mod_ev_c2_fit$finalModel))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.45082 -0.40624 -0.21271 -0.04667 2.71193
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 118.76549 42.77047 2.777 0.005490 **
Cloud3pm 0.25566 0.10794 2.369 0.017859 *
Humidity3pm 0.08052 0.02126 3.787 0.000152 ***
Pressure3pm -0.12701 0.04172 -3.044 0.002333 **
Temp3pm 0.13035 0.05441 2.396 0.016592 *
WindGustDirENE -1.13954 1.43581 -0.794 0.427398
WindGustDirESE 2.03313 1.10027 1.848 0.064624 .
WindGustDirN -0.00212 1.68548 -0.001 0.998996
WindGustDirNE 1.59121 1.25530 1.268 0.204943
WindGustDirNNE -16.31932 2891.85796 -0.006 0.995497
WindGustDirNNW 1.87617 1.03532 1.812 0.069962 .
WindGustDirNW 1.09105 1.01928 1.070 0.284433
WindGustDirS 1.93346 1.24036 1.559 0.119046
WindGustDirSE -15.50996 2073.57766 -0.007 0.994032
WindGustDirSSE -0.09029 1.63401 -0.055 0.955934
WindGustDirSSW -14.34617 6522.63868 -0.002 0.998245
WindGustDirSW 19.99670 6522.63868 0.003 0.997554
WindGustDirW 1.16834 1.18736 0.984 0.325127
WindGustDirWNW 0.16961 1.28341 0.132 0.894858
WindGustDirWSW -15.71816 4349.40572 -0.004 0.997117
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.90 on 247 degrees of freedom
Residual deviance: 136.27 on 228 degrees of freedom
AIC: 176.27
Number of Fisher Scoring iterations: 17
drop1(mod_ev_c2_fit$finalModel, test="Chisq")
Single term deletions
Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE +
WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE +
WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE +
WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW +
WindGustDirWNW + WindGustDirWSW
Df Deviance AIC LRT Pr(>Chi)
136.27 176.27
Cloud3pm 1 142.24 180.24 5.9653 0.014590 *
Humidity3pm 1 154.50 192.50 18.2255 1.962e-05 ***
Pressure3pm 1 146.76 184.76 10.4852 0.001203 **
Temp3pm 1 142.33 180.33 6.0611 0.013819 *
WindGustDirENE 1 136.93 174.93 0.6612 0.416126
WindGustDirESE 1 140.13 178.13 3.8568 0.049545 *
WindGustDirN 1 136.27 174.27 0.0000 0.998996
WindGustDirNE 1 137.87 175.87 1.5970 0.206332
WindGustDirNNE 1 137.14 175.14 0.8698 0.351012
WindGustDirNNW 1 140.09 178.09 3.8159 0.050769 .
WindGustDirNW 1 137.52 175.52 1.2477 0.263994
WindGustDirS 1 138.78 176.78 2.5032 0.113617
WindGustDirSE 1 137.03 175.03 0.7541 0.385179
WindGustDirSSE 1 136.28 174.28 0.0031 0.955862
WindGustDirSSW 1 136.30 174.30 0.0290 0.864850
WindGustDirSW 1 139.00 177.00 2.7314 0.098393 .
WindGustDirW 1 137.28 175.28 1.0100 0.314905
WindGustDirWNW 1 136.29 174.29 0.0174 0.894921
WindGustDirWSW 1 136.51 174.51 0.2373 0.626164
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
WindGustDirESE is reported as significant, hence including WindGustDir was a right choice, and we accept this model as a candidate. The p-value associated with the null hypothesis that this model is a good fit for the data is:
1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999998
To investigate a final, third choice, we gather a small set of predictors: Pressure3pm and Sunshine.
predictors_evening_c3 <- c("Pressure3pm", "Sunshine")
formula_evening_c3 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c3, collapse="+"), sep="~"))
mod_ev_c3_fit <- train(formula_evening_c3, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c3_fit$results$Accuracy
[1] 0.8484513
The training set based accuracy has an acceptable value.
(summary_rep <- summary(mod_ev_c3_fit$finalModel))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1123 -0.4602 -0.2961 -0.1562 2.3997
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 192.97146 36.09640 5.346 8.99e-08 ***
Pressure3pm -0.18913 0.03545 -5.336 9.52e-08 ***
Sunshine -0.35973 0.06025 -5.971 2.36e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.90 on 247 degrees of freedom
Residual deviance: 160.03 on 245 degrees of freedom
AIC: 166.03
Number of Fisher Scoring iterations: 6
All predictors are reported as significant.
drop1(mod_ev_c3_fit$finalModel, test="Chisq")
Single term deletions
Model:
.outcome ~ Pressure3pm + Sunshine
Df Deviance AIC LRT Pr(>Chi)
160.03 166.03
Pressure3pm 1 199.36 203.36 39.324 3.591e-10 ***
Sunshine 1 206.09 210.09 46.060 1.147e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value associated with the null hypothesis that this model is a good fit for the data is:
1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999938
We compare the last two models by running an ANOVA analysis on them, to check whether the lower residual deviance of the first model is significant or not.
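The mechanics of such a deviance comparison are easiest to see with two properly nested models. The following self-contained sketch uses simulated data (all names and the data-generating model are illustrative assumptions):

```r
set.seed(1023)
# Simulated outcome where only x1 carries signal (illustrative names)
n <- 250
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x1))

small <- glm(y ~ x1, family = binomial)       # reduced model
large <- glm(y ~ x1 + x2, family = binomial)  # full model

# Chi-squared test on the drop in residual deviance between the two fits;
# a large p-value would suggest the extra predictor is not needed
anova(small, large, test = "Chisq")
```

In our case the two evening models are not strictly nested, so the resulting p-value should be read as an informal comparison rather than a formal test.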
anova(mod_ev_c2_fit$finalModel, mod_ev_c3_fit$finalModel, test="Chisq")
Analysis of Deviance Table
Model 1: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE +
WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE +
WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE +
WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW +
WindGustDirWNW + WindGustDirWSW
Model 2: .outcome ~ Pressure3pm + Sunshine
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 228 136.27
2 245 160.03 -17 -23.76 0.1261
Based on the p-value, there is no significant difference between them. We then keep both models: the first one because it provides a better accuracy than the second, and the second one for its simplicity. Let us evaluate the test accuracy for both of them.
modevening_pred <- predict(mod_ev_c2_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 83 9
Yes 3 10
Accuracy : 0.8857
95% CI : (0.8089, 0.9395)
No Information Rate : 0.819
P-Value [Acc > NIR] : 0.04408
Kappa : 0.5604
Mcnemar's Test P-Value : 0.14891
Sensitivity : 0.9651
Specificity : 0.5263
Pos Pred Value : 0.9022
Neg Pred Value : 0.7692
Prevalence : 0.8190
Detection Rate : 0.7905
Detection Prevalence : 0.8762
Balanced Accuracy : 0.7457
'Positive' Class : No
Good test accuracy, with a high sensitivity and positive predicted value. We then test the second evening forecast candidate model.
modevening_pred <- predict(mod_ev_c3_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 81 7
Yes 5 12
Accuracy : 0.8857
95% CI : (0.8089, 0.9395)
No Information Rate : 0.819
P-Value [Acc > NIR] : 0.04408
Kappa : 0.598
Mcnemar's Test P-Value : 0.77283
Sensitivity : 0.9419
Specificity : 0.6316
Pos Pred Value : 0.9205
Neg Pred Value : 0.7059
Prevalence : 0.8190
Detection Rate : 0.7714
Detection Prevalence : 0.8381
Balanced Accuracy : 0.7867
'Positive' Class : No
Same test accuracy as the previous model, with a slightly lower sensitivity but a higher specificity and balanced accuracy. The positive predictive value is also improved with respect to it.
Summing up, our final choices are:
mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir
mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine
We save the training record indexes, the training and testing datasets together with all the chosen models for further analysis.
saveRDS(list(weather_data5, train_rec, training, testing, mod9am_c1_fit, mod9am_c2_fit, mod3pm_c1_fit, mod_ev_c2_fit, mod_ev_c3_fit), file="wf_log_reg_part2.rds")
Conclusions
We built models for weather forecasting purposes at different hours of the day. We explored training accuracy and predictor significance to determine the best models. We then computed the test accuracy and collected the results. Prediction accuracy was shown to be quite satisfactory, especially for the 3PM and evening models. In the next part of this tutorial, we will tune those models to improve their accuracy, if possible.
If you have any questions, please feel free to comment below.
Related Post
- Weather forecast with regression models – part 1
- Weighted Linear Support Vector Machine
- Logistic Regression Regularized with Optimization
- Analytical and Numerical Solutions to Linear Regression Problems
- How to create a loop to run multiple regression models
