L is for Log Transformation
One of the more skewed variables in my reading dataset is read_time. I was able to read many books in a pretty short amount of time (a few days), but others took longer, either because the book was long or because I was busy with other things and didn't have as much time to read. Let's take a quick look.
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv",
                      col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
##     set_names
## The following object is masked from 'package:tidyr':
##
##     extract

reads2019 %$%
  range(read_time)
## [1]  0 25

Read time ranges from 0 (finished in the same day) to almost a month. If I created box-plots of reading time, I'd likely have some outliers. I'll use my Fantasy genre to generate 2 box-plots. To make these data a bit easier to visualize, I'll also change my Fantasy flag into a labeled factor.
reads2019 <- reads2019 %>%
  mutate(Fantasy = factor(Fantasy,
                          labels = c("Non-Fantasy", "Fantasy"),
                          ordered = TRUE))

reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot()
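
To check for outliers without eyeballing the figure, base R's boxplot.stats() applies the usual 1.5 × IQR whisker rule and returns the points beyond the whiskers. Here is a minimal sketch on a made-up vector: `toy_read_time` is hypothetical and is not taken from the reading dataset. The cutoffs from geom_boxplot() can differ slightly, because it computes quartiles a little differently from the hinges that boxplot.stats() uses.

```r
# Hypothetical reading times in days (NOT the real data)
toy_read_time <- c(0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 25)

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges
toy_stats <- boxplot.stats(toy_read_time)

# The values a box plot would draw as individual outlier points
toy_outliers <- toy_stats$out
toy_outliers
```

On the real data, the same call could be run once per genre group, for example on `reads2019$read_time[reads2019$Fantasy == "Fantasy"]`.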
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
##     discard
## The following object is masked from 'package:readr':
##
##     col_factor

reads2019 %>%
  ggplot(aes(Fantasy, read_time)) +
  geom_boxplot() +
  scale_y_continuous(trans = log2_trans()) +
  ylab("Read Time (in days)") +
  labs(caption = "Because reading time was skewed, data have been log-transformed.")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
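
The two warnings presumably come from books finished on the day they were started: `log2(0)` is `-Inf`, so those rows cannot be placed on a log axis and are dropped from the plot. One common workaround is to add 1 before taking the log so that zeros map to 0. This is only a sketch of that idea, not what the plot above does, and `toy_read_time` is a made-up vector rather than the real data.

```r
# Hypothetical reading times in days (NOT the real data)
toy_read_time <- c(0, 0, 1, 3, 7, 25)

# A plain log2 sends zeros to -Inf, which a plot has to drop
plain_log <- log2(toy_read_time)
n_dropped <- sum(!is.finite(plain_log))

# Adding 1 first keeps every value finite (0 days maps to 0)
shifted_log <- log2(toy_read_time + 1)

n_dropped
shifted_log
```

In the plot, the analogous change would be mapping `read_time + 1` to the y-axis. The trade-off is that the axis then shows days plus one, which the axis label or caption should say.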