Site icon R-bloggers

Subsetting data from Data Frame in R with references to SQL commands – Part 1

[This article was first published on R Language – LearnR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hello there, welcome to my blog post. The reason behind this post is to practice R and R markdown and share my work with the audience. I would make references to SQL/relational database commands while working with R. If you are not familiar with SQL then please ignore SQL parts, SQL references are just for easy understanding. The primary focus of this blog post is to learn data selection from data frames. Here in this post, we will look at the following:

Prerequisites:

  1. R
  2. R Studio (for ease)

Assumption: Working directory is set and datasets are stored in the working directory. setwd() command is used to set the working directory. Supply the path of directory enclosed in double quotes to set it as a working directory.

Data

Data used in this exercise is downloaded from https://www.ucrdatatool.gov/ website and can be accessed from [https://www.ucrdatatool.gov/Search/Crime/State/OneYearofData.cfm] (CrimeOneYearofData.csv)

Reading data

Before we learn how to select data, we should learn that unlike relational databases, data in R resides in memory. All of the imported data will be loaded into the memory of your computer and hence before loading the data, you must check the capacity of your computer and whether it is sufficient enough to support the dataset.

Data can come from any source, it can be a flat file, database system, or handwritten notes. Usually, flat files are the most common source of the data. In this section, we will see how to load data from a CSV file.

crimeData <- read.csv("CrimeOneYearofData.csv")     

The command above will import the content of the CrimeOneYearofData.csv file into an object called the crimeData data frame. Object crimeData is a data frame that contains all the data from the CrimeOneYearofData.csv file. We can loosely compare this to a relational database object “table”. Data frames too, have rows and columns, and data is presented in structured form.

Tip: If you want to know the number of rows and columns of your Data Frame then dim() can do that for you.

Command read.csv above take multiple other arguments other than just the name of the file. Information on additional arguments can be found [https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table] (read.csv).

Subsetting data from data frame

Before we learn how to select data from data frame crimeData, I would recommend learning following commands using crimeData frame:

  names(crimeData)
  str(crimeData)
  dim(crimeData)

Command names(crimeData) above would return all the column names of the data frame. Checking column names just after loading the data is useful as this will make you familiar with the data frame. Following is the output of the names command. Do not worry about the numbers in the square brackets just yet, we will look at them in a subsequent blog. The names of the columns are listed next to the numbers in the brackets and there are a total of 22 columns in the crimeData data frame.

names(crimeData)
#>  [1] "State"                                    
#>  [2] "Population"                               
#>  [3] "Violent.crime.total"                      
#>  [4] "Murder.and.nonnegligent.Manslaughter"     
#>  [5] "Legacy.rape..1"                           
#>  [6] "Revised.rape..2"                          
#>  [7] "Robbery"                                  
#>  [8] "Aggravated.assault"                       
#>  [9] "Property.crime.total"                     
#> [10] "Burglary"                                 
#> [11] "Larceny.theft"                            
#> [12] "Motor.vehicle.theft"                      
#> [13] "Violent.Crime.rate"                       
#> [14] "Murder.and.nonnegligent.manslaughter.rate"
#> [15] "Legacy.rape.rate..1"                      
#> [16] "Revised.rape.rate..2"                     
#> [17] "Robbery.rate"                             
#> [18] "Aggravated.assault.rate"                  
#> [19] "Property.crime.rate"                      
#> [20] "Burglary.rate"                            
#> [21] "Larceny.theft.rate"                       
#> [22] "Motor.vehicle.theft.rate"

Command str(crimeData) would return the structure of the data frame. The command str() compactly displays the internal structure of the object be it data frame or any other. following is the structure of the crimeData data frame. The following result from str() command shows the data type of the columns crimeData data frame has, as well as sample data from the individual columns.

str(crimeData)
#> 'data.frame':    159 obs. of  22 variables:
#>  $ State                                    : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
#>  $ Population                               : chr  "4849377" "736732" "6731484" "2966369" ...
#>  $ Violent.crime.total                      : chr  "20727" "4684" "26916" "14243" ...
#>  $ Murder.and.nonnegligent.Manslaughter     : int  276 41 319 165 1699 151 86 54 105 1149 ...
#>  $ Legacy.rape..1                           : int  1436 555 2464 1182 8398 2121 571 249 352 6051 ...
#>  $ Revised.rape..2                          : int  2005 771 3378 1763 11527 3039 782 386 472 8563 ...
#>  $ Robbery                                  : int  4701 629 6249 2050 48680 3039 3159 1269 3497 24914 ...
#>  $ Aggravated.assault                       : int  13745 3243 16970 10265 91803 10325 4495 2867 4125 72895 ...
#>  $ Property.crime.total                     : int  154094 20334 215240 99018 947192 135510 69070 27900 34147 679446 ...
#>  $ Burglary                                 : int  39715 3150 43562 24790 202670 23472 11955 5768 3466 143220 ...
#>  $ Larceny.theft                            : int  104238 15445 154091 68627 592670 99464 51005 20865 26898 493647 ...
#>  $ Motor.vehicle.theft                      : int  10141 1739 17587 5601 151852 12574 6110 1267 3783 42579 ...
#>  $ Violent.Crime.rate                       : num  427 636 400 480 396 ...
#>  $ Murder.and.nonnegligent.manslaughter.rate: num  5.7 5.6 4.7 5.6 4.4 2.8 2.4 5.8 15.9 5.8 ...
#>  $ Legacy.rape.rate..1                      : num  29.6 75.3 36.6 39.8 21.6 39.6 15.9 26.6 53.4 30.4 ...
#>  $ Revised.rape.rate..2                     : num  41.3 104.7 50.2 59.4 29.7 ...
#>  $ Robbery.rate                             : num  96.9 85.4 92.8 69.1 125.5 ...
#>  $ Aggravated.assault.rate                  : num  283 440 252 346 237 ...
#>  $ Property.crime.rate                      : num  3178 2760 3198 3338 2441 ...
#>  $ Burglary.rate                            : num  819 428 647 836 522 ...
#>  $ Larceny.theft.rate                       : num  2150 2096 2289 2314 1527 ...
#>  $ Motor.vehicle.theft.rate                 : num  209 236 261 189 391 ...

Command dim(crimeData) mentioned above will result in dimensions of the crimeData data frame or in other words total number of rows and columns this data frame has. In statistics terms, a column is a variable and row is an observation. Data frame crimeData has 159 observations and 22 variables.

dim(crimeData)
#> [1] 159  22

Select all data from a data frame

In base R, just putting the name of the data frame crimeData on the prompt will display all of the data for that data frame. It might be overwhelming if you have a large data frame. In such case, use of commands head(crimeData), head(crimeData, 10) (10 is just to show the parameter which limit the number of lines, you can choose any other number), tail(crimeData) or tail(crimeData, 10) will be helpful.

SQL Reference – select * from crimeData

Subset specific column from a data frame

In base R, you can specify the name of the column that you would like to select with $ sign along with the data frame. Command head(crimeData$Population, 10) would show first 10 observations from column Population from data frame crimeData:

head(crimeData$Population, 10)
#>  [1] "4849377"  "736732"   "6731484"  "2966369"  "38802500" "5355866" 
#>  [7] "3596677"  "935614"   "658893"   "19893297"

Above can be done using dplyr package as well. As per rdocumentation.org “dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” Here is a command using dplyr package which selects Population column from crimeData data frame:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
result <- select(crimeData, Population)
head(result, 10)
#>    Population
#> 1     4849377
#> 2      736732
#> 3     6731484
#> 4     2966369
#> 5    38802500
#> 6     5355866
#> 7     3596677
#> 8      935614
#> 9      658893
#> 10   19893297

You can see the presentation of the result between subsetting using $ sign and using dplyr package. dplyr package present result into a tabular format and more nearer to SQL select statement though two are completely different.

Subsetting multiple columns from a data frame

Using base R, the following command will help subset multiple columns. In the command below first two columns are selected from the data frame crimeData. If you see the result for command names(crimeData) above, you would find that State and Population are the first two columns.

SQL Reference – select State, Population from crimeData

  head(crimeData[,c(1,2)],10)
#>                   State Population
#> 1               Alabama    4849377
#> 2                Alaska     736732
#> 3               Arizona    6731484
#> 4              Arkansas    2966369
#> 5            California   38802500
#> 6              Colorado    5355866
#> 7           Connecticut    3596677
#> 8              Delaware     935614
#> 9  District of Columbia     658893
#> 10              Florida   19893297

Or we can supply the name of the columns and select them. Here is an example:

  head(crimeData[,c("State","Population")],10)
#>                   State Population
#> 1               Alabama    4849377
#> 2                Alaska     736732
#> 3               Arizona    6731484
#> 4              Arkansas    2966369
#> 5            California   38802500
#> 6              Colorado    5355866
#> 7           Connecticut    3596677
#> 8              Delaware     935614
#> 9  District of Columbia     658893
#> 10              Florida   19893297

Using dplyr package select clause

result <- select(crimeData, State, Population)
head(result, 10)
#>                   State Population
#> 1               Alabama    4849377
#> 2                Alaska     736732
#> 3               Arizona    6731484
#> 4              Arkansas    2966369
#> 5            California   38802500
#> 6              Colorado    5355866
#> 7           Connecticut    3596677
#> 8              Delaware     935614
#> 9  District of Columbia     658893
#> 10              Florida   19893297

Any number of columns can be selected this way by giving the number or the name of the column within a vector. Let’s find out the first, seventh, and tenth column from crimeData data frame. Remember, instead of the number you can give the name of the column enclosed in double-quotes:

SQL Reference – select State, Robbery, Burglary from crimeData

  head(crimeData[,c(1,7,10)],10)
#>                   State Robbery Burglary
#> 1               Alabama    4701    39715
#> 2                Alaska     629     3150
#> 3               Arizona    6249    43562
#> 4              Arkansas    2050    24790
#> 5            California   48680   202670
#> 6              Colorado    3039    23472
#> 7           Connecticut    3159    11955
#> 8              Delaware    1269     5768
#> 9  District of Columbia    3497     3466
#> 10              Florida   24914   143220

Using dplyr package select clause

result <- select(crimeData, State, Robbery, Burglary)
head(result, 10)
#>                   State Robbery Burglary
#> 1               Alabama    4701    39715
#> 2                Alaska     629     3150
#> 3               Arizona    6249    43562
#> 4              Arkansas    2050    24790
#> 5            California   48680   202670
#> 6              Colorado    3039    23472
#> 7           Connecticut    3159    11955
#> 8              Delaware    1269     5768
#> 9  District of Columbia    3497     3466
#> 10              Florida   24914   143220

Subsetting all columns but one from data frame

This approach is called subsetting by the deletion of entries. In base R you can specify which column you would like to exclude from the selection by putting a minus sign in from of it. Let’s try:

  result <- head(crimeData[,-6],10)
  dim(result)
#> [1] 10 21

Now if we analyse the result of the above command, we can see the dimension of result variable is showing 10 observations (rows) and 21 variables (columns). If you check the result of command dim(crimeData) above, you can see there were total 22 variables in the crimeData data frame but as we have excluded the sixth column using -6 in column section in command result <- head(crimeData[,-6],10) which returned result for all columns except sixth. A similar operation can be performed using dplyr package as well and instead of using the minus sign on the number of a column, you can use it directly on the name of the column. Here is the example where we would exclude column “Burglary” form the result set:

result <- select(crimeData, -Burglary)
dim(head(result,10))
#> [1] 10 21

SQL Reference – In SQL to avoid a column, we must specify the list of columns in the select clause. This may sometime be overwhelming in case of a large table when a few columns need elimination from a large data set.

Subsetting all columns which starts with a particular character or string

If you go back to the result of names(crimeData) command you would see that few column names start with the same string. Two columns we particularly interested in here are “Burglary” and “Burglary rate”. Following R command using dplyr package will help us subset these two columns by writing as little code as possible. Imagine a scenario when you have several columns which start with the same character or string and in such scenario following command will be helpful:

result <- select(crimeData, starts_with("Burglary"))
head(result, 10)
#>    Burglary Burglary.rate
#> 1     39715         819.0
#> 2      3150         427.6
#> 3     43562         647.1
#> 4     24790         835.7
#> 5    202670         522.3
#> 6     23472         438.2
#> 7     11955         332.4
#> 8      5768         616.5
#> 9      3466         526.0
#> 10   143220         719.9

I hope you enjoyed this post. If it helped you in any way then please do not forget to subscribe to mailers and share this post. Thank you, it means a lot!!

In the next post, we will look at subsetting observations or selecting rows. Stay tuned.

To leave a comment for the author, please follow the link and comment on their blog: R Language – LearnR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.