How Best To Learn R Programming
So what’s the best or most efficient way to learn R programming? I have read numerous books and blogs on R programming in an attempt to learn more about R. While doing this has helped me understand the basics, these resources tend to get off track.
Usually, when I find a good resource, it gives the same basic commands to form a vector or import a data frame. These authors then proceed to give examples with R’s built-in data such as the Iris data set, the MPG gas mileage data or the Diamonds data.
While this is good in the beginning, and the data are interesting for a while, it gets difficult to follow when you are not really interested in the data or the question being asked.
What I have found that works
I have found that, for me, the best way to learn R programming is to read some of these resources for the basics and then to find something you are interested in and find data to work with that is related to your interest. Ask a good question and work with your data to see if it can give you some insights. What story can your data tell? What things do you need to learn to solve the problem? Keep those resources nearby and Google open in another tab.
This is the way to go. When you have a good question to answer and authentic data, the challenge is a lot of fun. If you then publish your findings, it takes on more importance. That is what I will do here as I learn more R programming and some of the way cool packages like dplyr and ggplot2.
My Story
Since I live in Chengdu, China, I am concerned with the air quality here. PM2.5 refers to particulate matter 2.5 micrometers or smaller in diameter, small enough to enter the blood via the lungs and cause damage to the human body.
Analysis Of Air Quality (AQI) from 2015 in Chengdu, China
I will be modeling the reproducible research that I am teaching my students. I am writing this in an R Markdown file that contains my narrative, code chunks, output statistics and plots. I’m trying to keep the formatting to a minimum so I can focus on my thoughts and writing and avoid much editing. My time is limited on this project. Given that, I was up past midnight last night and I wanted to keep going but my eyelids kept falling down.
I am working on my analysis skills using R and R Markdown to try to understand whether the air quality changes during the year. The AQI values represent pollution at the 2.5 micron (PM2.5) level. I found the AQI data online at the US Department of State. I think this data is important because I have not read any published research on it.
This data includes an air quality reading every hour for every day of the year. The most recent completed year is 2015. The data file has 8760 rows and ten columns, imported as a data frame from an Excel file from the State Department. According to the website, these data are not fully verified or validated.
I cleaned it up a bit by deleting the top two rows that contain narrative. I kept the row of variable names and all the rest. I deleted one column labeled “microgram/meter^3” because it wouldn’t import on my Mac, but all readings are in micrograms per cubic meter.
Following Are The Code Chunks
library(ggplot2)
library(dplyr)
ChAir2015 <- read.csv("~/Desktop/LearnR/Chengdu Air Data/ChAir2015.csv") # Import the data
air21015 = ChAir2015 # copy the data to a shorter name
str(air21015) # check that the data is a data frame
## 'data.frame': 8760 obs. of 10 variables:
##  $ Site      : Factor w/ 1 level "Chengdu": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Parameter : Factor w/ 1 level "PM2.5": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date..LST.: Factor w/ 8759 levels "1/1/15 0:00",..: 1 2 13 18 19 20 21 22 23 24 ...
##  $ Year      : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Month     : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ Day       : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ Hour      : int 0 1 2 3 4 5 6 7 8 9 ...
##  $ Value     : int 152 130 125 131 133 133 131 142 154 153 ...
##  $ Duration  : Factor w/ 1 level "1 Hr": 1 1 1 1 1 1 1 1 1 1 ...
##  $ QC.Name   : Factor w/ 2 levels "Missing","Valid": 2 2 2 2 2 2 2 2 2 2 ...
# take a look at the first 7 rows of the data
print.data.frame(head(air21015, 7))
##      Site Parameter  Date..LST. Year Month Day Hour Value Duration QC.Name
## 1 Chengdu     PM2.5 1/1/15 0:00 2015     1   1    0   152     1 Hr   Valid
## 2 Chengdu     PM2.5 1/1/15 1:00 2015     1   1    1   130     1 Hr   Valid
## 3 Chengdu     PM2.5 1/1/15 2:00 2015     1   1    2   125     1 Hr   Valid
## 4 Chengdu     PM2.5 1/1/15 3:00 2015     1   1    3   131     1 Hr   Valid
## 5 Chengdu     PM2.5 1/1/15 4:00 2015     1   1    4   133     1 Hr   Valid
## 6 Chengdu     PM2.5 1/1/15 5:00 2015     1   1    5   133     1 Hr   Valid
## 7 Chengdu     PM2.5 1/1/15 6:00 2015     1   1    6   131     1 Hr   Valid
Next, I will use the dplyr package to filter the data needed to make a histogram and then a box plot.
Jan.air21015 = filter(air21015, Month == 1, Value > 0) # keep just January data with values above 0
## If there was no reading for an hour, a value of -999 was assigned
dim(Jan.air21015) # rows and columns. This is January 2015 with Value > 0
## [1] 741 10
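To see what that filter does with the -999 placeholder, here is a minimal sketch on a tiny made-up data frame. The name `toy` and its values are assumptions for illustration only, not the Chengdu readings. It also shows one possible alternative, recoding -999 to NA, which is not what the code above does.

```r
library(dplyr)

# Made-up stand-in for the real file (hypothetical values)
toy <- data.frame(
  Month = c(1, 1, 1, 2, 2),
  Value = c(152, -999, 130, 95, -999)
)

# Same idea as the filter above: January rows with a real reading
jan <- filter(toy, Month == 1, Value > 0)
nrow(jan)  # 2

# Alternative: turn the placeholder into NA and skip it with na.rm = TRUE
toy_na <- mutate(toy, Value = ifelse(Value == -999, NA, Value))
mean(toy_na$Value[toy_na$Month == 1], na.rm = TRUE)  # 141
```

Either way, the point is that the -999 rows never reach the mean or the histogram.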
ggplot(Jan.air21015, aes(x = Value)) +
  geom_histogram(binwidth = 10, color = "black", fill = "yellow") +
  labs(x = "AQI Value") +
  ggtitle("January Days With AQI Values")
mean(Jan.air21015$Value)
## [1] 142.2497
fivenum(Jan.air21015$Value)
## [1] 15 91 149 187 349
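The mean and five-number summary above cover January only. As a rough sketch of how the same kind of summary could be produced for every month in one step, dplyr's group_by() and summarise() can be chained after the filter. The data frame `toy`, the result `monthly`, and the column names `mean_value` and `median_value` are made up for this example.

```r
library(dplyr)

# Made-up stand-in for the real file (hypothetical values)
toy <- data.frame(
  Month = c(1, 1, 1, 2, 2, 2),
  Value = c(150, 100, -999, 80, 60, 40)
)

monthly <- toy %>%
  filter(Value > 0) %>%           # drop the -999 placeholder rows
  group_by(Month) %>%             # one group per month
  summarise(
    n            = n(),           # readings kept in that month
    mean_value   = mean(Value),
    median_value = median(Value)
  )

monthly  # month 1: n = 2, mean 125; month 2: n = 3, mean 60
```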
Next, I’ll take a look at February.
air21015 = ChAir2015
Feb.air2015 = filter(air21015, Month == 2, Value > 0) # keep just February data with values above 0
## If there was no reading for an hour, a value of -999 was assigned
dim(Feb.air2015)
## [1] 670 10
Feb = Feb.air2015
ggplot(Feb, aes(x = Value)) +
  geom_histogram(binwidth = 10, color = "black", fill = "yellow") +
  labs(x = "AQI Value") +
  ggtitle("February Days With AQI Values")
Now I will concentrate on using the dplyr package to subset the AQI Value data so I can compare all the months and see if there is some pattern. March is next.
air21015 = ChAir2015
Mar.air2015 = filter(air21015, Month == 3, Value > 0)
April is next.
air21015 = ChAir2015
Apr.air2015 = filter(air21015, Month == 4, Value > 0)
May is next.
air21015 = ChAir2015
May.air2015 = filter(air21015, Month == 5, Value > 0)
June is next.
air21015 = ChAir2015
Jun.air2015 = filter(air21015, Month == 6, Value > 0)
Now I will combine the months into another data frame so I can plot their box plots on one graph. The second box plot shows how I rotated the month labels so they fit.
a = data.frame(Month = "Jan", value = Jan.air21015$Value)
b = data.frame(Month = "Feb", value = Feb$Value)
c = data.frame(Month = "Mar", value = Mar.air2015$Value)
d = data.frame(Month = "Apr", value = Apr.air2015$Value)
e = data.frame(Month = "May", value = May.air2015$Value) # data.frame, not the deprecated data_frame
f = data.frame(Month = "Jun", value = Jun.air2015$Value)
plot.data <- rbind(a, b, c, d, e, f)
# keep the months in calendar order on the x axis
plot.data$Month <- factor(plot.data$Month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"))
AQI.plot = ggplot(plot.data, aes(x = Month, y = value, fill = Month)) +
  geom_boxplot()
AQI.plot
AQI.plot +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  ggtitle("2015 Chengdu AQI")
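The month-by-month filtering and rbind above work, but the same box plot can probably be reached in fewer steps. The following is only a sketch under assumptions: `toy` is randomly generated stand-in data, not the Chengdu file, and `plot_data` and `p` are names made up here. It filters once, turns the numeric Month into a labelled factor with R's built-in month.abb, and plots every month present in the data.

```r
library(dplyr)
library(ggplot2)

# Randomly generated stand-in data for six months (hypothetical values)
set.seed(1)
toy <- data.frame(
  Month = rep(1:6, each = 20),
  Value = round(runif(120, 20, 300))
)
toy$Value[c(5, 50)] <- -999  # pretend two readings are missing

plot_data <- toy %>%
  filter(Value > 0) %>%                                      # drop placeholders once
  mutate(Month = factor(month.abb[Month], levels = month.abb))  # 1 -> "Jan", in calendar order

p <- ggplot(plot_data, aes(x = Month, y = Value, fill = Month)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  ggtitle("Toy AQI values by month")
p
```

Because nothing here is tied to a particular month, the same few lines should carry over to all 12 months once the full year is loaded.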
Conclusion
This is as far as I got last night. Looking at the box plots, the values seem to trend downward from January to June. I still need to finish the rest of the months. This week I’ll work on plotting all 12 months together and drawing a conclusion about whether there is a pattern. This is my first attempt at using this data and I’m sure there are shorter and better ways of doing it. I read somewhere: “Make it work, then make it fast”.