Exploratory Data Analysis – Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City
Introduction
This is the first of a series of posts on exploratory data analysis (EDA). This post will calculate the common summary statistics of a univariate continuous data set – the data on ozone pollution in New York City that is part of the built-in “airquality” data set in R. This is a particularly good data set to work with, since it has missing values – a common problem in many real data sets. In later posts, I will continue this series by exploring other methods in EDA, including box plots and kernel density plots.
The Original “airquality” Data Set
I used the “Ozone” vector in the “airquality” data set that is built into R. It’s always a good idea to get a sense of what a data table looks like by using the head() function; by default, it shows the first 6 rows.
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
*I have manually added spaces between the columns for ease of viewing.
To extract the “Ozone” vector, use the $ operator.
> # extract "Ozone" data vector
> ozone = airquality$Ozone
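As a small aside, the $ operator is only one of several ways to pull a column out of a data frame. The sketch below is an illustration rather than part of the original analysis; the variable names ozone_dollar, ozone_bracket and ozone_index are made up for this example.

```r
# Three equivalent ways to pull the Ozone column out of the data frame
# (variable names are illustrative only)
ozone_dollar  <- airquality$Ozone
ozone_bracket <- airquality[["Ozone"]]
ozone_index   <- airquality[, "Ozone"]

# Check that all three approaches return the same vector
print(identical(ozone_dollar, ozone_bracket))
print(identical(ozone_dollar, ozone_index))

# str() gives a compact view of the vector's type and its first few values
str(ozone_dollar)
```

The [[ ]] form is handy when the column name is stored in a variable, since $ only accepts a literal name.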
Counting the Number of Observations
I initially thought that counting the sample size would simply involve using the length() function.
> # sample size of "ozone"
> length(ozone)
[1] 153
However, the summary() function showed that the vector contains missing values, which length() counts along with everything else.
> summary(ozone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00   18.00   31.50   42.13   63.25  168.00      37
Notice the last column; “NA” stands for “Not Available”, and this output shows that there are 37 missing values.
I found 3 different ways to find the number of non-missing values in “ozone”. The last one is simplest.
> # 3 ways to find number of non-missing values in "ozone"
> length(ozone[is.na(ozone) == FALSE])
[1] 116
> length(ozone[!is.na(ozone)])
[1] 116
> sum(!is.na(ozone))
[1] 116
This last approach works because sum() coerces logical values to numbers: TRUE (or its shorthand, T) counts as 1 and FALSE (or F) counts as 0. Summing !is.na(ozone) therefore adds 1 for every non-missing value, which gives the number of non-missing values.
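The same logical-to-numeric coercion can be pushed a little further. The following is a minimal sketch, not part of the original post, that counts the missing values in every column of “airquality” at once and computes the proportion of missing values in “Ozone”. The names na_counts and prop_missing are chosen just for this illustration.

```r
# Count missing values in every column of airquality at once;
# is.na() on a data frame returns a logical matrix, and colSums()
# adds up the TRUE values (treated as 1) column by column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Proportion of missing values in Ozone: the mean of a logical
# vector is the fraction of its elements that are TRUE
ozone <- airquality$Ozone
prop_missing <- mean(is.na(ozone))
print(round(prop_missing, 3))
```

Checking every column this way is a quick way to see whether missing values are confined to one variable or spread across the table.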
Calculating the Summary Statistics
The summary() output above already shows the mean, calculated after removing the missing values. If you call the mean() function directly on the vector, however, you will get this result:
> mean(ozone)
[1] NA
This happens because arithmetic involving NA returns NA, so a single missing value is enough to make the whole mean NA. To compute the mean of the non-missing values only, use the “na.rm” argument.
> # calculate mean of "ozone" by excluding missing values
> mean(ozone, na.rm = TRUE)
[1] 42.12931
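To see what “na.rm” is doing, the mean can also be computed by hand from the non-missing values. This is only a sketch for illustration; the names observed and manual_mean are not from the original post.

```r
ozone <- airquality$Ozone

# Keep only the non-missing observations
observed <- ozone[!is.na(ozone)]

# Mean by hand: sum of the observed values divided by how many there are
manual_mean <- sum(observed) / length(observed)

# Compare with the built-in calculation that drops NAs
print(manual_mean)
print(all.equal(manual_mean, mean(ozone, na.rm = TRUE)))
```

Note that the denominator is the number of non-missing values, not the full length of the vector.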
The same argument is needed for the var() and sd() functions when calculating the variance and the standard deviation.
> var(ozone, na.rm = TRUE)
[1] 1088.201
> sd(ozone, na.rm = TRUE)
[1] 32.98788
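The remaining numbers in the summary() output can be reproduced individually in the same way. The sketch below assumes the default quantile method in R and uses illustrative variable names (ozone_median, ozone_quartiles, and so on) that are not from the original post.

```r
ozone <- airquality$Ozone

# Each of these functions also needs na.rm = TRUE when NAs are present
ozone_median    <- median(ozone, na.rm = TRUE)
ozone_quartiles <- quantile(ozone, probs = c(0.25, 0.75), na.rm = TRUE)
ozone_range     <- range(ozone, na.rm = TRUE)
ozone_iqr       <- IQR(ozone, na.rm = TRUE)

print(ozone_median)
print(ozone_quartiles)
print(ozone_range)
print(ozone_iqr)

# The standard deviation is the square root of the variance
print(all.equal(sd(ozone, na.rm = TRUE), sqrt(var(ozone, na.rm = TRUE))))
```

These values should agree with the minimum, quartiles, median and maximum reported by summary(ozone) earlier.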