Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Pivot tables are a a growing staple for analysis in excel yet they remain limited to the functionality which Microsoft has chosen to include. Typical operations are the inclusion of filters, choice over rows, columns, and maths operations. In R this functionality is not always so straightforward. That is until you try sqldf(). The sqldf package implements sql (Structured Query Language) query functionality onto traditional data.frames. A simple example being:
sqldf("select * from irisTable")
Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | Species | |
---|---|---|---|---|---|
1 | 5.10 | 3.50 | 1.40 | 0.20 | setosa |
2 | 4.90 | 3.00 | 1.40 | 0.20 | setosa |
3 | 4.70 | 3.20 | 1.30 | 0.20 | setosa |
4 | 4.60 | 3.10 | 1.50 | 0.20 | setosa |
5 | 5.00 | 3.60 | 1.40 | 0.20 | setosa |
6 | 5.40 | 3.90 | 1.70 | 0.40 | setosa |
This simple statement will request all columns (all indicated with *) from a table which you are basing your work on. Before we start we will create the sample data. We will use the Iris dataset and change column names to remove any full stops and spaces.
irisTable <- iris names(irisTable) <- c("Sepal_length", "Sepal_Width", "Petal_length", "Petal_width", "Species")
However, what if we only want a selection of all columns? Further, we may wish to order our data by a particular column. This can be achieved by appending ‘order by <col_name> desc or asc.
sqldf("select Species, Sepal_length from irisTable")
Species | Sepal_length | |
---|---|---|
1 | setosa | 5.10 |
2 | setosa | 4.90 |
3 | setosa | 4.70 |
4 | setosa | 4.60 |
5 | setosa | 5.00 |
6 | setosa | 5.40 |
By specifying the column names as they are displayed in the data.frame we can select these specific columns. At this stage we have the basic functionality available in the standard Pivot Table tool. Moving on to aggregating our data we can use the ‘group by’ function as demonstrated below.
sqldf("select Species, count(*) from irisTable group by Species")
Species | count(*) | |
---|---|---|
1 | setosa | 50 |
2 | versicolor | 50 |
3 | virginica | 50 |
Having added the group by statement we are now bringing the real power of Pivot Tables to R. If we assume that Species is a list of row names we have brought together all of our records into a much smaller number of categories. Questions we may now want to ask are: What is the average Sepal Length of each Species? What is the Min, Max or average petal length? these function may be simply applied by wrapping the chosen variable with the function you wish to perform.
sqldf("select Species, avg(Sepal_length) from irisTable group by Species")
Species | avg(Sepal_length) | |
---|---|---|
1 | setosa | 5.01 |
2 | versicolor | 5.94 |
3 | virginica | 6.59 |
A full list of these functions can be found at W3 SQL. In addition to math operations a number of text based operations are available such as length() which determines the length of a string, UCASE() converts to upper-case and LCASE converts to lower case.
The next aspect of Pivot Tables are filters. Filters allow subsets of the data to be generated for example we want to study a specific district, or are looking for values between two points. You will notice that once we have used the ‘group by’ statement we use ‘having’ rather than ‘where’.
sqldf("select Species, Sepal_length from irisTable group by Species having Sepal_length between 5 and 6")
Species | Sepal_length | |
---|---|---|
1 | setosa | 5.00 |
2 | versicolor | 5.70 |
3 | virginica | 5.90 |
sqldf("select Species, Sepal_length from irisTable group by Species having Species='virginica'")
Species | Sepal_length | |
---|---|---|
1 | virginica | 5.90 |
The above two example highlight the use of where with two different approached. The first example is looking for values between, and including 0 and 10. The second example is looking for all records relating to Berkshire. It should be noted that where clauses are not case sensitive.
One issue which many users experience in R is the generated field names created during joins, or maths functions. In its basic format, sqldf will produce similar column names however this can be rectified using the ‘as’ statement.
sqldf("select Species, Sepal_length as 'Avg Sepal Length' from irisTable group by Species having Species='virginica'")
Species | Avg Sepal Length | |
---|---|---|
1 | virginica | 6.59 |
Though brief, this tutorial is able to address all and more of the functionality found within Pivot Tables and yet leaves many features of sqldf unused. Additional functionality includes table join, SQLite backend databases and data recording.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.