Visualizing obesity across United States by using data from Wikipedia
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post I will show how to collect from a webpage and to analyze or visualize in R. For this task I will use the rvest
package and will get the data from Wikipedia. I got the idea to write this post from Fisseha Berhane.
I will gain access to the prevalence of obesity in United States from Wikipedia page, then I will plot it in the map. Lets begin with loading the required packages.
## LOAD THE PACKAGES #### library(rvest) library(ggplot2) library(dplyr) library(scales)
After I loaded the packages in R, will upload the data. As I mention before, I will download the data from Wikipedia.
## LOAD THE DATA #### obesity = read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States") obesity = obesity %>% html_nodes("table") %>% .[[1]]%>% html_table(fill=T)
The first line of code is calling the data from Wikipedia and the second line of codes is transforming the table that we are interested into dataframe in R.
Now lets check how our date looks alike.
head(obesity) State and District of Columbia Obese adults Overweight (incl. obese) adults 1 Alabama 30.1% 65.4% 2 Alaska 27.3% 64.5% 3 Arizona 23.3% 59.5% 4 Arkansas 28.1% 64.7% 5 California 23.1% 59.4% 6 Colorado 21.0% 55.0% Obese children and adolescents Obesity rank 1 16.7% 3 2 11.1% 14 3 12.2% 40 4 16.4% 9 5 13.2% 41 6 9.9% 51
The dataframe looks good, now we need to clean it from making ready to plot.
## CLEAN THE DATA #### str(obesity) 'data.frame': 51 obs. of 5 variables: $ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ... $ Obese adults : chr "30.1%" "27.3%" "23.3%" "28.1%" ... $ Overweight (incl. obese) adults: chr "65.4%" "64.5%" "59.5%" "64.7%" ... $ Obese children and adolescents : chr "16.7%" "11.1%" "12.2%" "16.4%" ... $ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ... # remove the % and make the data numeric for(i in 2:4){ obesity[,i] = gsub("%", "", obesity[,i]) obesity[,i] = as.numeric(obesity[,i]) } # check data again str(obesity) 'data.frame': 51 obs. of 5 variables: $ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ... $ Obese adults : num 30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ... $ Overweight (incl. obese) adults: num 65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ... $ Obese children and adolescents : num 16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ... $ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...
Now we will fix the names of variables by removing the spaces
names(obesity) [1] "State and District of Columbia" "Obese adults" [3] "Overweight (incl. obese) adults" "Obese children and adolescents" [5] "Obesity rank" names(obesity) = make.names(names(obesity)) names(obesity) [1] "State.and.District.of.Columbia" "Obese.adults" [3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents" [5] "Obesity.rank"
Our data looks good. Its time to load the map data
# load the map data states = map_data("state") str(states) 'data.frame': 15537 obs. of 6 variables: $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ... $ lat : num 30.4 30.4 30.4 30.3 30.3 ... $ group : num 1 1 1 1 1 1 1 1 1 1 ... $ order : int 1 2 3 4 5 6 7 8 9 10 ... $ region : chr "alabama" "alabama" "alabama" "alabama" ... $ subregion: chr NA NA NA NA ...
We will merge two datasets (obesity and states) by region, therefore we need first to create new variable (region) in obesity dataset.
# create a new variable name for state obesity$region = tolower(obesity$State.and.District.of.Columbia)
Now we will merge the datasets.
states = merge(states, obesity, by="region", all.x=T) str(states) 'data.frame': 15537 obs. of 11 variables: $ region : chr "alabama" "alabama" "alabama" "alabama" ... $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ... $ lat : num 30.4 30.4 30.4 30.3 30.3 ... $ group : num 1 1 1 1 1 1 1 1 1 1 ... $ order : int 1 2 3 4 5 6 7 8 9 10 ... $ subregion : chr NA NA NA NA ... $ State.and.District.of.Columbia : chr "Alabama" "Alabama" "Alabama" "Alabama" ... $ Obese.adults : num 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ... $ Overweight..incl..obese..adults: num 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ... $ Obese.children.and.adolescents : num 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ... $ Obesity.rank : int 3 3 3 3 3 3 3 3 3 3 ...
Plot the data
Finally we will plot the prevalence of obesity in adults.
## MAKE THE PLOT #### # adults ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) + geom_polygon(color = "white") + scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) + labs(title="Prevalence of Obesity in Adults") + coord_map()
Similarly, we can plot the prevalence of obesity in children.
# children ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) + geom_polygon(color = "white") + scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) + labs(title="Prevalence of Obesity in Adults") + coord_map()
If you like to show the name of State in the map use the code below to create a new dataset.
statenames = states %>% group_by(region) %>% summarise( long = mean(range(long)), lat = mean(range(lat)), group = mean(group), Obese.adults = mean(Obese.adults), Obese.children.and.adolescents = mean(Obese.children.and.adolescents) )
After you add this code to ggplot code above
geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)
That’s all. I hope you learned something useful today.
Related Post
- Plotting App for ggplot2 – Part 2
- Mastering R plot – Part 3: Outer margins
- Interactive plotting with rbokeh
- Mastering R plot – Part 2: Axis
- How to create a Twitter Sentiment Analysis using R and Shiny
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.