Basic data analysis with palmerpenguins
[This article was first published on R-posts.com, and kindly contributed to R-bloggers].
Introduction
On June 17, a nice article introducing a new practice dataset was posted on R-bloggers.
iris is one of the most commonly used datasets for simple data analysis, but there is a small issue with using it.
It is too good.
All of its data is well structured, and most analysis methods work very well on it.
In reality, most datasets are not pretty and require a lot of pre-processing just to get started. Possible pre-processing tasks include:
- Removing NAs.
- Selecting meaningful features.
- Handling duplicated or inconsistent values.
- Or even just loading the dataset, if it is not well structured (like Flipkart-products).
The penguin dataset, however, lets you practice this kind of work, and it comes with a pre-processed version too.
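As a rough illustration of the tasks listed above, here is a minimal base R sketch. It does not use the penguins data: it takes the built-in iris data and deliberately injects a few NAs and duplicated rows so that there is something to clean. The `messy` data frame and the injected rows are assumptions made up for this example.

```r
# Hypothetical messy data: iris with a few injected problems
messy <- iris
messy$Sepal.Length[c(3, 10)] <- NA      # inject missing values
messy <- rbind(messy, messy[1:2, ])     # inject duplicated rows

# 1. Remove rows that contain NA
clean <- messy[complete.cases(messy), ]

# 2. Drop rows that are exact duplicates of an earlier row
clean <- clean[!duplicated(clean), ]

# 3. Keep only the features of interest
clean <- clean[, c("Species", "Petal.Length", "Petal.Width")]

nrow(messy)   # rows before cleaning
nrow(clean)   # rows after cleaning
```

The order matters a little here: duplicates are checked on full rows, before any columns are dropped.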
For more information, see the palmerpenguins page.
I have a routine for quick data analysis, and today I want to share it using these lovely penguins.
Contents
0. Load the dataset and libraries into the workspace.
    library(palmerpenguins) # for data
    library(dplyr) # for data-handling
    library(ggplot2) # for ggplot(), used in steps 4 and 6
    library(corrplot) # for correlation plot
    library(GGally) # for parallel coordinate plot
    library(e1071) # for svm
    data(penguins) # load pre-processed penguins
palmerpenguins has two datasets, penguins and penguins_raw. As the names suggest, penguins is the pre-processed one.
1. See the summary and plot of the dataset
    summary(penguins)
    plot(penguins)
It seems that species, island, and sex are categorical features, and the remaining ones are numerical features.
2. Set the format of the features
    penguins$species <- as.factor(penguins$species)
    penguins$island <- as.factor(penguins$island)
    penguins$sex <- as.factor(penguins$sex)
    summary(penguins)
    plot(penguins)
Then look at the summary and plot again. Note that the plot result is the same.
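When there are many categorical columns, the same conversion can be done in one step. This is a minimal sketch, assuming a made-up data frame `df` (not the penguins data) whose categorical columns are stored as character.

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(
  group = c("a", "b", "a", "c"),
  site  = c("x", "x", "y", "y"),
  value = c(1.5, 2.0, 3.2, 4.1),
  stringsAsFactors = FALSE
)

catCols <- c("group", "site")                  # columns to treat as categorical
df[catCols] <- lapply(df[catCols], as.factor)  # convert them all at once

sapply(df, class)   # check the resulting column types
summary(df)         # factors are now summarised as counts per level
```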
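There are unwanted NA and . values in some features.
3. Remove unnecessary data (in this tutorial, NA)
    # keep only rows with a known sex; this drops the NA and '.' rows
    # (newer palmerpenguins releases code sex in lowercase: 'male' / 'female')
    penguins <- penguins %>% filter(sex == 'MALE' | sex == 'FEMALE')
    summary(penguins)
Here I additionally defined a color for each penguin species to make the plot result easier to read.
    # Green, Orange, Purple
    pCol <- c('#057076', '#ff8301', '#bf5ccb')
    names(pCol) <- c('Gentoo', 'Adelie', 'Chinstrap')
    # index by name: indexing with the factor itself would use its integer codes
    plot(penguins, col = pCol[as.character(penguins$species)], pch = 19)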
Now the plot results give much better insight.
Note that other pre-processing steps may be required for different datasets.
4. See the relation between categorical features
My main target in analyzing these penguins is species, so I will look at the relation between species and the other categorical values.
4-1. species, island
    table(penguins$species, penguins$island)
    chisq.test(table(penguins$species, penguins$island)) # meaningful difference
    ggplot(penguins, aes(x = island, y = species, color = species)) +
      geom_jitter(size = 3) +
      scale_color_manual(values = pCol)
Wow, there is a strong relationship between species and island:
- Adelie lives on every island
- Gentoo lives only on Biscoe
- Chinstrap lives only on Dream
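To make the chi-squared step more concrete, here is a small sketch on the built-in iris data rather than the penguins. iris has only one categorical column, so the `petalSize` grouping below is an assumption invented for this example.

```r
# Hypothetical second categorical feature: long vs short petals
petalSize <- ifelse(iris$Petal.Length > median(iris$Petal.Length), "long", "short")

tab <- table(iris$Species, petalSize)   # contingency table of counts
tab

res <- chisq.test(tab)
res$p.value                    # a small p-value suggests the two features are related

round(prop.table(tab, 1), 2)   # row-wise proportions: where each species falls
```

The row-wise proportions are often easier to read than the raw counts when the groups differ in size.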
4-2 & 4-3.
However, neither species and sex nor sex and island showed any meaningful relation. You can try the following code.
    # species vs sex
    table(penguins$sex, penguins$species)
    # [-1, ] drops the first row, the now-empty '.' level of sex
    chisq.test(table(penguins$sex, penguins$species)[-1,]) # not meaningful difference 0.916
    # sex vs island
    table(penguins$sex, penguins$island)
    chisq.test(table(penguins$sex, penguins$island)[-1,]) # not meaningful difference 0.9716
5. Look at the numerical features
I will select the numerical features and look at the correlation plot and parallel coordinate plot.
    # Select numericals
    penNumeric <- penguins %>% select(-species, -island, -sex)
    # Correlation between numerics
    corrplot(cor(penNumeric), type = 'lower', diag = FALSE)
    # parallel coordinate plots
    ggparcoord(penguins, columns = 3:6, groupColumn = 1, order = c(4,3,5,6)) +
      scale_color_manual(values = pCol)
    # index colors by name, not by factor code
    plot(penNumeric, col = pCol[as.character(penguins$species)], pch = 19)
Running this produces the correlation plot, the parallel coordinate plot, and the scatterplot matrix.
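When the column names are not known in advance, the numeric columns can also be picked by type instead of by excluding names. This is a minimal base R sketch on the built-in iris data (not the penguins); the `irisCol` color vector is an assumption for this example.

```r
isNum <- sapply(iris, is.numeric)   # TRUE for each numeric column
irisNumeric <- iris[, isNum]

corMat <- cor(irisNumeric)          # pairwise Pearson correlations
round(corMat, 2)

# Scatterplot matrix colored by species; colors are looked up by name
irisCol <- c(setosa = "#057076", versicolor = "#ff8301", virginica = "#bf5ccb")
plot(irisNumeric, col = irisCol[as.character(iris$Species)], pch = 19)
```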
Luckily, all of the numeric features (only 4 of them) are meaningfully correlated, and their combination shows a trend for each species (see the parallel coordinate plot).
6. Apply statistical methods to the dataset
In this step, I usually use linear modeling or svm for prediction.
6-1. linear modeling
species is a categorical value, so it needs to be changed to a numeric value.
    set.seed(1234)
    idx <- sample(1:nrow(penguins), size = nrow(penguins)/2)
    # as.numeric: factor levels become 1, 2, 3
    speciesN <- as.numeric(penguins$species)
    penguins$speciesN <- speciesN
    train <- penguins[idx,]
    test <- penguins[-idx,]
    # newer palmerpenguins releases name the culmen columns bill_length_mm and bill_depth_mm
    fm <- lm(speciesN ~ flipper_length_mm + culmen_length_mm + culmen_depth_mm + body_mass_g, train)
    summary(fm)
It shows that body_mass_g is not a meaningful feature, as seen in the plot above (it may explain Gentoo, but not the other penguins).
To predict, I used the code below. However, numeric prediction does not produce whole values (for example, 2.123 instead of 2), so I added a rounding step.
    predRes <- round(predict(fm, test))
    predRes[which(predRes>3)] <- 3
    predRes[which(predRes<1)] <- 1 # guard the lower bound too, so every index is valid
    predRes <- sort(names(pCol))[predRes] # alphabetical order matches the factor levels
    test$predRes <- predRes
    ggplot(test, aes(x = species, y = predRes, color = species)) +
      geom_jitter(size = 3) +
      scale_color_manual(values = pCol)
    table(test$predRes, test$species)
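The round-and-clamp idea, together with one way the accuracy could be computed from the predictions, can be sketched on the built-in iris data. This is only an illustration under my own assumptions (the iris data, the seed, and the 50/50 split are not taken from the analysis above), so it will not reproduce the accuracy reported here.

```r
set.seed(1234)
idx <- sample(seq_len(nrow(iris)), size = floor(nrow(iris) / 2))

dat <- iris
dat$SpeciesN <- as.numeric(dat$Species)   # factor levels become 1, 2, 3
train <- dat[idx, ]
test  <- dat[-idx, ]

fit <- lm(SpeciesN ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
          data = train)

# Round to the nearest class code and clamp into the valid range 1..3
predCode <- round(predict(fit, test))
predCode <- pmin(pmax(predCode, 1), 3)
predLabel <- levels(dat$Species)[predCode]

# Confusion table and the share of correct predictions
confusion <- table(predicted = predLabel, actual = test$Species)
confusion
accuracy <- mean(predLabel == as.character(test$Species))
accuracy
```

Treating class codes as a numeric response only makes sense as a quick check, since it assumes the classes are ordered and evenly spaced.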
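The accuracy of basic linear modeling is 94.6%.
6-2. svm
Using svm is also an easy step.
    # note: 'species ~ .' uses every other column in train, including speciesN
    # (a numeric copy of the label); drop speciesN first for a fair accuracy estimate
    m <- svm(species ~., train)
    predRes2 <- predict(m, test)
    test$predRes2 <- predRes2
    ggplot(test, aes(x = species, y = predRes2, color = species)) +
      geom_jitter(size = 3) +
      scale_color_manual(values = pCol)
    table(test$species, test$predRes2)
Running this produces the jitter plot and the confusion table.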
The accuracy of svm is 100%. Wow.
Conclusion
Today I introduced a simple routine for EDA and statistical analysis with penguins.
It is not that difficult, and it shows good performance.
Of course, I skipped a lot of things, such as processing the raw dataset.
However, I hope this trial gives you inspiration for further data analysis.
Thanks.