Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Categories
Tags
In this post, I am going to build a statistical learning model as based upon plant leaf datasets introduced in part one of this tutorial. We have available three datasets, each one providing sixteen samples each of one-hundred plant species.
The features are:
- shape
- texture
- margin
Specifically, I will take advantage of Discrimination Analysis for classification purposes and I will compare my results with the ones of ref. [1] table 2, as herein reported.
Ref. [1] authors show the accuracy reached by KNN (proportional and weighted proportional kernel density) for any combination of the datasets in use. Their top accuracy is 96% and it is achieved when using all three datasets. In the present tutorial I am going to compare my results with the ones obtained in ref. [1].
Packages
suppressPackageStartupMessages(library(caret)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(ggplot2))
Classification Models
Loading previous post environment where datasets are available.
load('PlantLeafEnvironment.RData')
We define an utility function to gather all the steps needed for model building by caret package (ref. [3]), specifically:
- train partition as 70% training and 30% test samples
- train control specifying repeated cross-validation with ten folds
- features set, all features besides species and id
- train dataset
- test dataset
- linear discriminant analysis fit, with preprocessing to compensate for spatial distribution asymmetries and to center and scale values
- prediction computation
- confusion matrix computed on the test dataset
- result as made of the linear discriminant analysis fit and the test dataset confusion matrix
As pre-processing step, also PCA (Principal Component Analysis) is specified. That allows for slightly better accuracy results compared to non PCA preprocessing scenario (see also ref. [4] for further insights about combining PCA and LDA).
Among all available Discriminant Analysis models within caret package, the lda appears to be the most efficient with time and as well providing with good results, as we will see at the end. For details about Linear Discriminant Analysis, please see ref. [5] and [6].
set.seed(1023) plant_leaf_model <- function(leaf_dataset) { train_idx <- createDataPartition(leaf_dataset$species, p = 0.7, list = FALSE) trControl <- trainControl(method = "repeatedcv", number = 10, verboseIter = FALSE, repeats = 5) relevant_features <- setdiff(colnames(leaf_dataset), c("id", "species")) feat_sum <- paste(relevant_features, collapse = "+") frm <- as.formula(paste("species ~ ", feat_sum)) train_set <- leaf_dataset[train_idx,] test_set <- leaf_dataset[-train_idx,] lda_fit <- train(frm, data = train_set, method ="lda", tuneLength = 10, preProcess = c("BoxCox", "center", "scale", "pca"), trControl = trControl, metric = 'Accuracy') lda_test_pred <- predict(lda_fit, test_set) cfm <- confusionMatrix(lda_test_pred, test_set$species) list(model=lda_fit, confusionMatrix = cfm) }
Margin dataset model
We build our model for margin dataset only.
result <- plant_leaf_model(margin_data)
Model fit results.
result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 64 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: centered (64), scaled (64), principal component ## signal extraction (64) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1082, 1076, 1078, 1081, 1086, 1079, ... ## Resampling results: ## ## Accuracy Kappa ## 0.7947299 0.7924885
Testset confusion matrix.
result$confusionMatrix$overall[1:4] ## Accuracy Kappa AccuracyLower AccuracyUpper ## 0.7950000 0.7929293 0.7520681 0.8335033
We take notice of accuracy for final report.
margin_data_accuracy <- result$confusionMatrix$overall[1]
Further details can be printed out by:
result$model$finalModel varImp(result$model)
that we do not show for brevity.
Shape dataset model
We build our model for shape dataset only.
result <- plant_leaf_model(shape_data)
Model fit results.
result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 64 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: Box-Cox transformation (64), centered (64), scaled ## (64), principal component signal extraction (64) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1080, 1072, 1074, 1084, 1076, 1087, ... ## Resampling results: ## ## Accuracy Kappa ## 0.5167374 0.5116926
Testset confusion matrix.
result$confusionMatrix$overall[1:4] ## Accuracy Kappa AccuracyLower AccuracyUpper ## 0.5150000 0.5101010 0.4648181 0.5649584
We take notice of accuracy for final report.
shape_data_accuracy <- result$confusionMatrix$overall[1]
Texture dataset model
We build our model for texture dataset only.
result <- plant_leaf_model(texture_data)
Model fit results.
result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 64 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: centered (64), scaled (64), principal component ## signal extraction (64) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1078, 1078, 1086, 1076, 1076, 1082, ... ## Resampling results: ## ## Accuracy Kappa ## 0.7580786 0.7554372
Testset confusion matrix.
result$confusionMatrix$overall[1:4] ## Accuracy Kappa AccuracyLower AccuracyUpper ## 0.7575000 0.7550505 0.7124314 0.7987117
We take notice of accuracy for final report.
texture_data_accuracy <- result$confusionMatrix$overall[1]
Margin+Shape datasets model
We build our model for margin and shape datasets.
margin_shape_data <- left_join(margin_data, shape_data[,-which(colnames(texture_data) %in% c("species"))], by=c("id")) result <- plant_leaf_model(margin_shape_data)
Model fit results.
result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 128 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: Box-Cox transformation (64), centered (128), scaled ## (128), principal component signal extraction (128) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1082, 1084, 1074, 1087, 1084, 1081, ... ## Resampling results: ## ## Accuracy Kappa ## 0.9511508 0.9506051
Testset confusion matrix.
result$confusionMatrix$overall[1:4] ## Accuracy Kappa AccuracyLower AccuracyUpper ## 0.9300000 0.9292929 0.9004175 0.9529847
We take notice of accuracy for final report.
margin_shape_data_accuracy <- result$confusionMatrix$overall[1]
Margin+Texture datasets model
We build our model for margin and texture datasets.
margin_texture_data <- left_join(margin_data, texture_data[,-which(colnames(texture_data) %in% c("species"))], by=c("id")) result <- plant_leaf_model(margin_texture_data)
Model fit results.
result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 128 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: centered (128), scaled (128), principal component ## signal extraction (128) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1079, 1078, 1084, 1074, 1085, 1081, ... ## Resampling results: ## ## Accuracy Kappa ## 0.9666299 0.9662565
Testset confusion matrix.
result$confusionMatrix$overall[1:4] ## Accuracy Kappa AccuracyLower AccuracyUpper ## 0.9475000 0.9469697 0.9208654 0.9672116
We take notice of accuracy for final report.
margin_texture_data_accuracy <- result$confusionMatrix$overall[1]
Shape+Texture datasets model
We build our model for shape and texture datasets.
shape_texture_data <- left_join(shape_data, texture_data[,-which(colnames(texture_data) %in% c("species"))], by=c("id")) result <- plant_leaf_model(shape_texture_data)
Model fit results.
result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 128 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: Box-Cox transformation (64), centered (128), scaled ## (128), principal component signal extraction (128) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1083, 1080, 1080, 1082, 1082, 1081, ... ## Resampling results: ## ## Accuracy Kappa ## 0.9102814 0.9092814
Testset confusion matrix.
result$confusionMatrix$overall[1:4] ## Accuracy Kappa AccuracyLower AccuracyUpper ## 0.9225000 0.9217172 0.8917971 0.9467373
We take notice of accuracy for final report.
shape_texture_data_accuracy <- result$confusionMatrix$overall[1]
Margin+Shape+Texture datasets model
We build our model for all three datasets.
leaves_data <- left_join(margin_data, shape_texture_data[,-which(colnames(texture_data) %in% c("species"))], by = c("id")) result <- plant_leaf_model(leaves_data)
Model fit results.
result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 192 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: Box-Cox transformation (64), centered (192), scaled ## (192), principal component signal extraction (192) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1078, 1074, 1084, 1076, 1083, 1082, ... ## Resampling results: ## ## Accuracy Kappa ## 0.9832037 0.9830154
Testset confusion matrix.
result$confusionMatrix$overall[1:4] ## Accuracy Kappa AccuracyLower AccuracyUpper ## 0.9900000 0.9898990 0.9745952 0.9972688
We take notice of accuracy for final report.
leaf_data_accuracy <- result$confusionMatrix$overall[1]
Final Results
Finally, we gather all results in a data frame and show for comparison. The V symbol indicates when a dataset is selected for model building.
LDA_accuracy = c(shape_data_accuracy, texture_data_accuracy, margin_data_accuracy, shape_texture_data_accuracy, margin_shape_data_accuracy, margin_texture_data_accuracy, leaf_data_accuracy) LDA_accuracy <- 100*LDA_accuracy # as percentage results <- data.frame(SHA = c("V", " ", " ", "V", "V", " ", "V"), TEX = c(" ", "V", " ", "V", " ", "V", "V"), MAR = c(" ", " ", "V", " ", "V", "V", "V"), prop_KNN_accuracy = c(62.13, 72.94, 75.00, 86.19, 87.19, 93.38, 96.81), wprop_KNN__accuracy = c(61.88, 72.75, 75.75, 86.06, 86.75, 93.31, 96.69), LDA_accuracy = LDA_accuracy )
Here are ref.[1] results compared to ours.
results ## SHA TEX MAR prop_KNN_accuracy wprop_KNN__accuracy LDA_accuracy ## 1 V 62.13 61.88 51.50 ## 2 V 72.94 72.75 75.75 ## 3 V 75.00 75.75 79.50 ## 4 V V 86.19 86.06 92.25 ## 5 V V 87.19 86.75 93.00 ## 6 V V 93.38 93.31 94.75 ## 7 V V V 96.81 96.69 99.00
As we can see, with the only exception of the shape dataset scenario, for all other cases as defined by the datasets in use, Linear Discriminant Analysis achieves higher accuracy with respect ref. [1] K-Nearest-Neighbor based models.
References
- Charles Mallah, James Cope and James Orwell, “PLANT LEAF CLASSIFICATION USING PROBABILISTIC INTEGRATION OF SHAPE, TEXTURE AND MARGIN FEATURES” link
- Leaf Dataset
- Caret package – train models by tag
- Combining PCA and LDA
- Linear Discriminant Analysis
- Linear Discriminant Analysis – Bit by Bit
Related Post
- NYC buses: C5.0 classification with R; more than 20 minute delay?
- NYC buses: Cubist regression with more predictors
- NYC buses: simple Cubist regression
- Visualization of NYC bus delays with R
- Explaining Black-Box Machine Learning Models – Code Part 2: Text classification with LIME
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.