Classification of the Hyper-Spectral and LiDAR Imagery using R (mostly). Part 1: Result Evaluation
[This article was first published on Misanthrope's Thoughts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
There was the EEEI Data Fusion Contest this spring. This year they wanted people to elaborate about hyper-spectral (142-bands imagery) and LiDAR data. The resolution of the data-set was about 5 m. There were 2 nominations: best classification and the best scientific paper.
I work with high-resolution imagery quite often, but classification is a very rear task for me though. I thought that this contest was a great opportunity to develop my skills. And not just a classification skills, but R skills as well… I decided to participate in best classification contest, and to use R for the most part.
I learned a lot and I will share my knowledge with you in a series of posts starting with this one. And like in some great novels, I will start from the very end – evaluation of my results.
Results of my classification (created in R, designed in QGIS) |
Contest results
…are available here. As you may notice, I’m not on the list of the 10 best classification 😉 But there is almost unnoticeable 0.03% difference between my result (85.93% accuracy) and the result of the 10-th place. Not a bad result for a relatively inexperienced guy, don’t you think? And I know, that I could have done better – I had 99% prediction accuracy for the training samples. It’s funny, but my classification map looks better than map that took 7-th place!
How to evaluate classification using R
Due to I was not on the top ten list, I had to evaluate the result on my own. The organisers finally disclosed evaluation samples and I got a chance for a self assessment. So we have a set of .shp-files – each contains ground-truth polygons for one of the 16 classes and a classification map. We need to accomplish 3 goals:
- Create a visual representation of missclassification.
- Assess accuracy.
- Create a confusion matrix.
- Visualise classification map using EEEI colour palette.
Lets get a palette!
Official EEEI palette |
To extract colour values from the palette above you may use GIMP. But I used a widget that every KDE-user (Linux) should have by default. You can probe and save colour values from any part of your screen. Quite useful!
‘Color picker’ in work |
Now let’s see the code for our tasks.
Load needed libraries
library(rgdal) library(raster) library(reshape2) library(caret) library(ggplot2)
Load and process data
# get classification raster ras <- raster('~/GIS/IEEE_contest_2013/2013_IEEE_GRSS_DF_Contest/raster.tif', verbose = T) # get list of shp-files for evaluation shapes <- list.files(path = '~/GIS/IEEE_contest_2013/2013_IEEE_GRSS_DF_Contest/roi', pattern = '*shp') # a list for accuracy assessment accuracy_list <- list() # create an empty dataframe to be filled vith evaluation results # field names are not arbitrary!!! eval_df <- data.frame(variable = character(), value = character()) # create an empty dataframe to be used for plotting # field names are not arbitrary!!! plot_data <- data.frame(variable = character(), value = character(), Freq = integer()) for (f_name in shapes) { # delete '.shp' from the filename layer_name <- paste(sub('.shp', '', f_name)) class <- readOGR("~/GIS/IEEE_contest_2013/2013_IEEE_GRSS_DF_Contest/roi", layer = layer_name, verbose = F) # extract values from raster probe <- extract(ras, class) # replace class numbers with names samples <- list() for (lis in probe) { for (value in lis) { if (value == 0) {c_name <- Unclassified } else if (value == 1) {c_name <- 'Healthy grass' } else if (value == 2) {c_name <- 'Stressed grass' } else if (value == 3) {c_name <- 'Synthetic grass' } else if (value == 4) {c_name <- 'Trees' } else if (value == 5) {c_name <- 'Soil' } else if (value == 6) {c_name <- 'Water' } else if (value == 7) {c_name <- 'Residential' } else if (value == 8) {c_name <- 'Commercial' } else if (value == 9) {c_name <- 'Road' } else if (value == 10) {c_name <- 'Highway' } else if (value == 11) {c_name <- 'Railway' } else if (value == 12) {c_name <- 'Parking Lot 1' } else if (value == 13) {c_name <- 'Parking Lot 2' } else if (value == 14) {c_name <- 'Tennis Court' } else if (value == 15) {c_name <- 'Running Track'} samples <- c(samples, c = c_name) } } # make layer_name match sample name if (layer_name == 'grass_healthy') {layer_name <- 'Healthy grass' } else if (layer_name == 'grass_stressed') {layer_name <- 'Stressed grass' } else if (layer_name == 'grass_syntethic') {layer_name <- 'Synthetic grass' } else if (layer_name == 'tree') {layer_name <- 'Trees' } else if (layer_name == 'soil') {layer_name <- 'Soil' } else if (layer_name == 'water') {layer_name <- 'Water' } else if (layer_name == 'residental') {layer_name <- 'Residential' } else if (layer_name == 'commercial') {layer_name <- 'Commercial' } else if (layer_name == 'road') {layer_name <- 'Road' } else if (layer_name == 'highway') {layer_name <- 'Highway' } else if (layer_name == 'railway') {layer_name <- 'Railway' } else if (layer_name == 'parking_lot1') {layer_name <- 'Parking Lot 1' } else if (layer_name == 'parking_lot2') {layer_name <- 'Parking Lot 2' } else if (layer_name == 'tennis_court') {layer_name <- 'Tennis Court' } else if (layer_name == 'running_track') {layer_name <- 'Running Track'} # create a dataframe with classification results df <- as.data.frame(samples) dfm <- melt(df, id = 0) dfm['variable'] <- layer_name # add data to evaluation dataframe eval_df <- rbind(eval_df, dfm) # assess accuracy of current class mytable <- table(dfm) dmt <- as.data.frame(mytable) total_samples <- 0 correct_predictions <- 0 for (i in 1:nrow(dmt)) { predict_class <- toString(dmt[i,2]) pc_frequency <- dmt[i,3] if (predict_class == layer_name) { correct_predictions <- dmt[i,3] } total_samples <- total_samples + pc_frequency } accuracy <- round(correct_predictions/total_samples, 2) accuracy_list <- c(accuracy_list, c = accuracy) # append data for plotting plot_data <- rbind(plot_data, dmt) }
Create a fancy graph (that is shown on the map)
# create facets plot EEEI_palette <- c('#D4D4F6', '#5F5F5F', '#710100', '#00B300', '#007900', '#014400', '#008744', '#D0B183', '#BAB363', '#DAB179', '#737373', '#A7790A', '#00EA01', '#CA1236', '#00B9BB') plot_data <- plot_data[order(plot_data$variable),] p = ggplot(data = plot_data, aes(x = factor(1), y = Freq, fill = factor(value) ) ) p <- p + facet_grid(facets = . ~ variable) p <- p + geom_bar(width = 1) + xlab('Classes') + ylab('Classification rates') + guides(fill=guide_legend(title="Classes"))+ scale_fill_manual(values = EEEI_palette)+ ggtitle('Classification Accuracy')+ theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) p
Let't finally get our accuracy result!
# accuracy assessment fin_accuracy <- mean(unlist(accuracy_list)) fin_accuracy <- paste(round(fin_accuracy*100, 2), '%', sep = '') print(paste('Total accuracy:', fin_accuracy), sep = ' ')
[1] "Total accuracy: 85.93%"
Confusion matrix
# confusion matrix creation true <- as.factor(eval_df$variable) predict <- as.factor(eval_df$value) confusionMatrix(predict, true)
Enjoy statistics!
Confusion Matrix and Statistics Reference Prediction Commercial Healthy grass Highway Parking Lot 1 Commercial 850 0 14 151 Healthy grass 0 868 0 0 Highway 0 0 718 14 Parking Lot 1 41 0 54 641 Parking Lot 2 0 0 19 73 Railway 0 0 11 44 Residential 155 0 191 0 Road 0 0 20 107 Running Track 0 0 0 0 Soil 0 0 1 0 Stressed grass 0 61 0 0 Synthetic grass 0 0 0 0 Tennis Court 0 0 0 0 Trees 0 117 0 0 Water 0 0 0 0 Reference Prediction Parking Lot 2 Railway Residential Road Running Track Commercial 0 0 74 2 0 Healthy grass 0 0 0 0 0 Highway 0 9 0 0 0 Parking Lot 1 104 8 0 11 0 Parking Lot 2 155 4 0 3 0 Railway 2 917 11 61 0 Residential 0 40 918 0 0 Road 12 14 0 930 0 Running Track 0 0 1 0 465 Soil 6 2 0 27 1 Stressed grass 0 11 4 0 0 Synthetic grass 0 0 0 0 3 Tennis Court 0 3 0 0 0 Trees 0 0 10 0 0 Water 0 0 0 0 0 Reference Prediction Soil Stressed grass Synthetic grass Tennis Court Trees Commercial 0 0 0 0 0 Healthy grass 0 14 0 0 34 Highway 0 0 0 0 0 Parking Lot 1 0 0 0 0 0 Parking Lot 2 0 0 0 0 0 Railway 0 0 0 0 0 Residential 0 1 0 0 0 Road 14 0 0 0 0 Running Track 0 0 0 0 0 Soil 1040 0 0 0 0 Stressed grass 0 931 0 0 17 Synthetic grass 0 0 503 0 0 Tennis Court 0 0 0 245 0 Trees 0 85 0 0 1004 Water 0 0 0 0 0 Reference Prediction Water Commercial 0 Healthy grass 3 Highway 0 Parking Lot 1 22 Parking Lot 2 0 Railway 0 Residential 0 Road 0 Running Track 0 Soil 0 Stressed grass 0 Synthetic grass 0 Tennis Court 0 Trees 0 Water 118 Overall Statistics Accuracy : 0.859 95% CI : (0.853, 0.866) No Information Rate : 0.088 P-Value [Acc > NIR] : <2e-16 Kappa : 0.847 Mcnemar's Test P-Value : NA Statistics by Class: Class: Commercial Class: Healthy grass Class: Highway Sensitivity 0.8126 0.8298 0.6984 Specificity 0.9780 0.9953 0.9979 Pos Pred Value 0.7791 0.9445 0.9690 Neg Pred Value 0.9820 0.9839 0.9724 Prevalence 0.0872 0.0872 0.0857 Detection Rate 0.0709 0.0724 0.0599 Detection Prevalence 0.0910 0.0767 0.0618 Class: Parking Lot 1 Class: Parking Lot 2 Sensitivity 0.6223 0.5556 Specificity 0.9781 0.9915 Pos Pred Value 0.7276 0.6102 Neg Pred Value 0.9650 0.9894 Prevalence 0.0859 0.0233 Detection Rate 0.0535 0.0129 Detection Prevalence 0.0735 0.0212 Class: Railway Class: Residential Class: Road Sensitivity 0.9097 0.9018 0.8994 Specificity 0.9883 0.9647 0.9848 Pos Pred Value 0.8767 0.7034 0.8478 Neg Pred Value 0.9917 0.9906 0.9905 Prevalence 0.0841 0.0849 0.0862 Detection Rate 0.0765 0.0766 0.0776 Detection Prevalence 0.0872 0.1088 0.0915 Class: Running Track Class: Soil Sensitivity 0.9915 0.9867 Specificity 0.9999 0.9966 Pos Pred Value 0.9979 0.9656 Neg Pred Value 0.9997 0.9987 Prevalence 0.0391 0.0879 Detection Rate 0.0388 0.0867 Detection Prevalence 0.0389 0.0898 Class: Stressed grass Class: Synthetic grass Sensitivity 0.9030 1.0000 Specificity 0.9915 0.9997 Pos Pred Value 0.9092 0.9941 Neg Pred Value 0.9909 1.0000 Prevalence 0.0860 0.0420 Detection Rate 0.0777 0.0420 Detection Prevalence 0.0854 0.0422 Class: Tennis Court Class: Trees Class: Water Sensitivity 1.0000 0.9517 0.82517 Specificity 0.9997 0.9806 1.00000 Pos Pred Value 0.9879 0.8257 1.00000 Neg Pred Value 1.0000 0.9953 0.99789 Prevalence 0.0204 0.0880 0.01193 Detection Rate 0.0204 0.0837 0.00984 Detection Prevalence 0.0207 0.1014 0.00984
As you may see, the main source of misclassification are Parking Lot 1 and Parking Lot 2. The accuracy for other classes is above 90%, and it is great. Frankly, I still don't understand what is the difference between Parking Lots 1 and 2... Official answer was that Parking Lot 2 is cars (isn't detecting them using 5 m resolution imagery is a questionable task???)... But it seems that it is something else. It is hard to classify something that you don't understand what it is...
To leave a comment for the author, please follow the link and comment on their blog: Misanthrope's Thoughts.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.