P is for percent
[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We’ve used ggplots throughout this blog series, but today, I want to introduce another package that helps you customize scales on your ggplots – the scales package. I use this package most frequently to format scales as percent. There aren’t a lot of good ways to use percents with my dataset, but one example would be to calculate the percentage each book contributes to the total pages I read in 2019.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
library(tidyverse) ## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 -- ## <U+2713> ggplot2 3.2.1 <U+2713> purrr 0.3.3 ## <U+2713> tibble 2.1.3 <U+2713> dplyr 0.8.3 ## <U+2713> tidyr 1.0.0 <U+2713> stringr 1.4.0 ## <U+2713> readr 1.3.1 <U+2713> forcats 0.4.0 ## -- Conflicts ---------------------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv", col_names = TRUE) ## Parsed with column specification: ## cols( ## Title = col_character(), ## Pages = col_double(), ## date_started = col_character(), ## date_read = col_character(), ## Book.ID = col_double(), ## Author = col_character(), ## AdditionalAuthors = col_character(), ## AverageRating = col_double(), ## OriginalPublicationYear = col_double(), ## read_time = col_double(), ## MyRating = col_double(), ## Gender = col_double(), ## Fiction = col_double(), ## Childrens = col_double(), ## Fantasy = col_double(), ## SciFi = col_double(), ## Mystery = col_double(), ## SelfHelp = col_double() ## ) reads2019 <- reads2019 %>% mutate(perpage = Pages/sum(Pages))The new variable, perpage, is a proportion. But if I display those data with a figure, I want them to be percentages instead. Here’s how to do that. (If you don’t already have the scales package, add install.packages(“scales”) at the beginning of this code.)
library(scales) ## ## Attaching package: 'scales' ## The following object is masked from 'package:purrr': ## ## discard ## The following object is masked from 'package:readr': ## ## col_factor reads2019 %>% ggplot(aes(perpage)) + geom_histogram() + scale_x_continuous(labels = percent, breaks = seq(0,.05,.005)) + xlab("Percentage of Total Pages Read") + ylab("Books") ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This post also seems like a great opportunity to hop on my statistical highhorse and talk about the difference between a histogram and a bar chart. Why is this important? With everything going on in the world – pandemics, political elections, etc. – I’ve seen lots of comments on others’ intelligence, many of which show a misunderstanding of the most well-known histogram: the standard normal curve. You see, raw data, even from a huge number of people and even on a standardized test, like a cognitive ability (aka: IQ) test, is never as clean or pretty as it appears in a histogram.
Histograms use a process called “binning”, where ranges of scores are combined to form one of the bars. The bins can be made bigger (including a larger range of scores) or smaller, and smaller bins will start showing the jagged nature of most data, even so-called normally distributed data.
As one example, let’s show what my percent figure would look like as a bar chart instead of a histogram (like the one above).
reads2019 %>% ggplot(aes(perpage)) + geom_bar() + scale_x_continuous(labels = percent, breaks = seq(0,.05,.005)) + xlab("Percentage of Total Pages Read") + ylab("Books")