R-bloggers

Intro to R Programming | AP Statistics Histograms with ggplot

I will continue to work out of The Practice of Statistics, 4th edition, and recreate some histograms. I used the data on page 35 in Chapter 1. My students will be getting this lesson this week. After they make their first histograms, I’ll have them compare the R histogram to the histogram on their TI-Nspire calculator.

On the second day in lab, I’ll introduce them to the workflow of coding in an R script, copying the code to an R Markdown file, using knitr to create a Word file, and then uploading it to our Moodle AP Stat page for me to review. I’ll be going over this with my IB class when we’re in the statistics unit.

You can read my other post about using ggplot to create scatter plots here.

I’m using RStudio and the knitr package to output my R Markdown file to an HTML file. This way, I can cut and paste the HTML into my WordPress post. I prefer this process because it makes my statistical analysis reproducible.
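As far as I know, the Knit button in RStudio calls rmarkdown::render(), which runs knitr and then pandoc. Here is a minimal sketch of the same step from the console, assuming the rmarkdown package and pandoc are installed. The file name and the document contents are made up for illustration.

```r
# Write a tiny R Markdown file to a temporary folder (hypothetical content).
rmd_file <- file.path(tempdir(), "histogram-demo.Rmd")
writeLines(c(
  "---",
  "title: \"Histogram demo\"",
  "output: html_document",
  "---",
  "",
  "```{r}",
  "hist(c(2.8, 7.0, 15.1, 3.8, 27.2))",
  "```"
), rmd_file)

# Render only when rmarkdown and pandoc are available on this machine.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
if (can_render) {
  out_file <- rmarkdown::render(rmd_file, output_format = "html_document",
                                quiet = TRUE)
  print(out_file)  # path to the generated .html file
}
```

Changing output_format to "word_document" should give a Word file instead, which is the format my students will upload.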

Make a histogram in ggplot2

First, to use ggplot, the data must be a numeric vector or a data.frame (columns are variables and rows are observations).

I entered the data from page 35 of the TPS4e book with the combine function, c(), which creates a numeric vector, and stored it in the object foreign.born. A vector works, but I prefer to work with a data.frame.

In the first line of code, I load the ggplot2 package with library() so I can create the histogram. The next lines store the data in foreign.born. Many R programmers assign values with <-, but the = sign also works for assignment at the top level, and I use it here.

library(ggplot2)  # load the ggplot2 package
foreign.born=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
                 6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
                 10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
                 3.9,10.1,12.4,1.2,4.4,2.7)
str(foreign.born)  # show the structure of the object

##  num [1:50] 2.8 7 15.1 3.8 27.2 10.3 12.9 8.1 18.9 9.2 ...

From the output above, you can see that foreign.born is a numeric vector with 50 values: num [1:50].

Below is a very simple histogram using ggplot2 graphics in R. ggplot2 was written by Hadley Wickham, who is the Chief Scientist at RStudio. He, his sister, and his father were all statisticians.

ggplot() + aes(foreign.born)+
  geom_histogram(binwidth = 2.5)

 

This histogram is just a starting point and I can add to it to make it better.
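Here is one way to build on this starting point, as a sketch. The title, axis labels, and colors are my own choices, not from the book, and boundary = 0 is something I add so the bins start at 0, 2.5, 5, and so on.

```r
library(ggplot2)

foreign.born <- c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
                  6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
                  10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
                  3.9,10.1,12.4,1.2,4.4,2.7)

# Same histogram as before, with bin edges fixed at multiples of 2.5
# and with a title, axis labels, and bar colors added.
p <- ggplot() + aes(foreign.born) +
  geom_histogram(binwidth = 2.5, boundary = 0,
                 color = "black", fill = "steelblue") +
  labs(title = "Foreign-born residents by state",
       x = "Percent of foreign born residents",
       y = "Number of States")
p

# layer_data() returns the bins ggplot2 computed for the first layer.
bins <- layer_data(p, 1)
sum(bins$count)  # every observation lands in exactly one bin
```

Looking at the computed bins with layer_data() is a handy way for students to connect the picture to a frequency table.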

Just so you can compare, here is the histogram using base graphics. I use breaks instead of bin width. Base graphics are more limited, and I prefer ggplot2 because it gives you more control over the plot. I agree with David Robinson’s article that beginners should be taught ggplot and not base graphics. You can read his article here.

foreign.born=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
                 6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
                 10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
                 3.9,10.1,12.4,1.2,4.4,2.7)
hist(foreign.born, breaks = 10,
     main = "Histogram with Base Graphics",
     ylim = c(0,15))
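A note on breaks, as I understand the hist() documentation: a single number such as breaks = 10 is only a suggestion for the number of cells, and R picks "pretty" break points near it. To control the bin width the way binwidth does in ggplot2, you can pass the break points themselves. This is a sketch, and the range 0 to 30 is my choice to cover the data.

```r
foreign.born <- c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
                  6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
                  10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
                  3.9,10.1,12.4,1.2,4.4,2.7)

# Explicit break points every 2.5 percentage points from 0 to 30
cuts <- seq(0, 30, by = 2.5)

# hist() draws the plot and also returns the bin information
h <- hist(foreign.born, breaks = cuts,
          main = "Base Graphics with 2.5-wide Bins")

h$counts       # number of states in each bin
sum(h$counts)  # all 50 states are counted
```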

Transform the data to make a data frame.

Many times your data will come from a .csv file exported from Excel, so it’s best to import your data as a data.frame. It’s a little extra work, but once the data is converted, you have many options to play with in the ggplot histogram. Don’t be afraid to search on stackoverflow.com. If you post there, make sure to ask a very specific question after you have read the relevant documentation and searched on Google.
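Here is a small sketch of that import step. The file and the column name percent are made up for illustration, and I write the file to a temporary folder first so the example runs on its own. With a real export from Excel you would only need the read.csv() line.

```r
# Create a small stand-in for a CSV exported from Excel (hypothetical data file)
csv_file <- file.path(tempdir(), "foreign_born.csv")
writeLines(c("percent", "2.8", "7.0", "15.1", "3.8", "27.2"), csv_file)

# read.csv() returns a data.frame: one column per variable, one row per observation
fb_csv <- read.csv(csv_file)
str(fb_csv)
```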

Here is a good way to change the vector to a data.frame. As you can see in the output, the structure (str) is numeric (num), and after I use as.data.frame it is converted to a data.frame.

Convert your numeric data to a data.frame

  1. Put your list of data into an object using the combine “c” function
  2. Use the as.data.frame function and save into another object

See the code below for an example.

foreign.born3=c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
  6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
  10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
  3.9,10.1,12.4,1.2,4.4,2.7)

str(foreign.born3)

##  num [1:50] 2.8 7 15.1 3.8 27.2 10.3 12.9 8.1 18.9 9.2 ...

fb3=as.data.frame(foreign.born3)

str(fb3)

## 'data.frame':    50 obs. of  1 variable:
##  $ foreign.born3: num  2.8 7 15.1 3.8 27.2 10.3 12.9 8.1 18.9 9.2 ...
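Another option, shown here as a sketch, is to build the data frame directly with data.frame() and name the column yourself. The names fb4 and percent are my own and are not used elsewhere in this post.

```r
foreign.born3 <- c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
                   6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
                   10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
                   3.9,10.1,12.4,1.2,4.4,2.7)

# data.frame() lets you choose the column name in the same step
fb4 <- data.frame(percent = foreign.born3)

str(fb4)    # a data.frame with 50 observations of 1 variable
names(fb4)  # the column name to use inside aes()
```

A short column name like this keeps the aes() call easier to read.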

Next I’ll use the original data, now stored as a data frame in the object fb3. I call geom_histogram on fb3, which gives a histogram similar to the one found on page 37.

ggplot(fb3,aes(x=foreign.born3))+
  geom_histogram(color="black",fill="orange",binwidth = 3)+
  labs(x="Percent of foreign born residents",y="Number of States")+
  # the bars show counts, so this density curve sits almost flat along the x-axis
  geom_density()

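Here is the same histogram with a density overlay. To put the bars and the curve on the same scale, map y=..density.. inside aes() in geom_histogram; otherwise the bars show counts and the curve is barely visible. Newer versions of ggplot2 (3.4.0 and later) prefer after_stat(density) over ..density.., which still works but prints a deprecation warning.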

ggplot(fb3,aes(x=foreign.born3))+ 
  geom_histogram(aes(y=..density..),color="black",fill="orange",binwidth = 3)+
  labs(x="Percent of foreign born residents",y="Density of States")+
  geom_density(alpha=0.2,fill="#FF6666")

This is a density overlay on top of the histogram. I found out how to do this while reading some R blogs on the internet. There is much to learn with R and ggplot but it’s so much fun creating these plots.

ggplot(fb3, aes(x=foreign.born3)) + 
  geom_histogram(aes(y=..density..),    
                 binwidth=3,
                 colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")
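If you are curious what y=..density.. changes, here is a sketch that checks it. I am assuming ggplot2 3.3.0 or later so that after_stat(density), the newer spelling of ..density.., is available. On the density scale each bar's height is its share of the data divided by the bin width, so the bar areas add up to 1, the same as the area under the density curve.

```r
library(ggplot2)

foreign.born3 <- c(2.8,7.0,15.1,3.8,27.2,10.3,12.9,8.1,18.9,9.2,16.3,5.6,13.8,4.2,3.8,
                   6.3,2.7,2.9,3.2,12.2,14.1,5.9,6.6,1.8,3.3,1.9,5.6,19.1,5.4,20.1,
                   10.1,21.6,6.9,2.1,3.6,4.9,9.7,5.1,12.6,4.1,2.2,3.9,15.9,8.3,
                   3.9,10.1,12.4,1.2,4.4,2.7)
fb3 <- as.data.frame(foreign.born3)

# Same plot as above, written with after_stat(density)
p <- ggplot(fb3, aes(x = foreign.born3)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 3, colour = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "#FF6666")
p

# Pull out the computed bars and add up height times width
bars <- layer_data(p, 1)
bar_area <- sum(bars$density * (bars$xmax - bars$xmin))
bar_area
```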

Last year I was able to teach my AP Stat class some of the basics of R and ggplot and they really liked creating plots and doing some analysis in RStudio. This year I hope to continue and also have them use R Markdown and knitr so they can make their analysis reproducible.

I’ll have another post towards the end of next week on our progress.