I’m frequently asked by newcomers to R to provide an easy-to-follow, generic set of instructions on how to download data, transform it, aggregate it, make graphs, and write it all up for publication in a high impact journal, all by the end of the day! While such a request is somewhat understandable coming from a student, it’s rather awkward coming from a seasoned research investigator (i.e., someone who should know better). Obviously, no such set of instructions exists given the astonishing variety of available data, stored in differing formats with varying degrees of adherence to standards. Simply put, “real world” data wrangling and analysis can be a complex process independent of the language you choose to employ. Nonetheless, what I’ll attempt to do here is provide a “tour” involving some data I picked at random. I’ll stick to the problems of downloading, filtering, and graphing some of the data. And rather than dictate the “best way” to do something, I’ll simply show you what is on offer and you can make your own decisions. R is amazingly flexible, which means that there are many different ways to do the same thing.
I have chosen information relating to San Francisco restaurant inspections, which can be found at https://data.sfgov.org/ along with an impressive variety of data sets covering crime reports, parking availability, bus routes, campaign funding, and many other types of civic information. For reference, we can map the zip codes mentioned in the health reports using methods outlined in my blog post on Geocoding; this is basically the area we are considering.
I’m hungry!
Without even knowing how the restaurant information is organized, one can easily imagine a number of interesting questions, such as: What is the average health score for all restaurants? Which restaurants have the highest (or lowest) scores? What is the distribution of scores across zip codes? Are there seasonal scoring trends? While you can download the files from the above website, I have saved a copy on my web site in case the original location changes at some point. The data is available in a zipped file that contains three CSV files that we can easily unpack into data frames.
url <- "http://steviep42.bitbucket.org/data/SFFoodProgram_Complete_Data.zip"
download.file(url,"SFFood.zip")
system("unzip -l SFFood.zip")

Archive:  SFFood.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
 4613344  04-23-14 09:51   violations_plus.csv
 1175067  04-23-14 09:51   businesses_plus.csv
 1754811  04-23-14 09:51   inspections_plus.csv
 --------                  -------
 7543222                   3 files
There are three files, each of which contains key information relating to the restaurant inspection process. I should point out that a description file comes with the dataset, which you can read to get a better idea of the column names and formats. But we can also use R to help us understand what is going on. Let’s read these files into data frames:
businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)

str(violations)
'data.frame':   55800 obs. of  5 variables:
 $ business_id    : int  10 10 10 10 10 10 17 17 17 17 ...
 $ date           : int  20140114 20140114 20140114 20120403 20110428 20121114 20130605 20130605 20130605 20140312 ...
 $ ViolationTypeID: int  103119 103145 103154 103154 103119 103154 103154 103144 103142 103142 ...
 $ risk_category  : Factor w/ 4 levels "High Risk","Low Risk",..: 3 2 2 2 3 2 2 2 2 2 ...
 $ description    : Factor w/ 70 levels "","Consumer advisory not provided for raw or undercooked foods",..: 22 17 62 62 22 62 62 58 61 61 ...

str(inspections)
'data.frame':   40935 obs. of  4 variables:
 $ business_id: int  10 10 10 10 10 10 10 10 17 17 ...
 $ Score      : int  NA 92 98 NA 98 100 NA 96 94 NA ...
 $ date       : int  20140124 20140114 20121114 20120920 20120403 20110928 20110601 20110428 20140312 20130711 ...
 $ type       : Factor w/ 14 levels "Administrative or Document Review",..: 11 13 13 11 13 13 11 13 13 11 ...

str(businesses)
'data.frame':   6109 obs. of  17 variables:
 $ business_id         : int  10 17 19 24 29 31 37 45 48 50 ...
 $ name                : Factor w/ 5562 levels " HOL N JAM LEMONADE STAND #2",..: 5110 1901 3525 3562 1067 3509 793 1027 265 4697 ...
 $ address             : Factor w/ 5444 levels ""," "," VARIOUS LOCATIONS ",..: 119 1868 548 3892 713 2478 3506 2870 4814 1446 ...
 $ city                : Factor w/ 1 level "San Francisco": 1 1 1 1 1 1 1 1 1 1 ...
 $ state               : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
 $ postal_code         : Factor w/ 47 levels "","00000","04102",..: 12 29 16 12 10 35 24 17 27 10 ...
 $ latitude            : num  37.8 37.7 37.8 37.8 37.8 ...
 $ longitude           : num  -122 -122 -122 -122 -122 ...
 $ phone_number        : num  NA 1.42e+10 NA NA 1.42e+10 ...
 $ TaxCode             : Factor w/ 32 levels "AA","H07","H08",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ business_certificate: int  779059 78443 NA 352312 939325 346882 315736 340024 318022 NA ...
 $ application_date    : Factor w/ 2327 levels "","01/01/2005",..: 1 620 1 1 2137 1 1473 1836 1 1 ...
 $ owner_name          : Factor w/ 5294 levels "","\tCARLIN'S CORNER COFFEESHOP LLC",..: 4746 2704 61 3387 186 3335 4297 1809 5198 333 ...
 $ owner_address       : Factor w/ 5366 levels "","\t1 INDEPENDENCE POINTE #305",..: 2898 4397 484 3785 1479 2492 3413 414 1103 2206 ...
 $ owner_city          : Factor w/ 335 levels "","\tBRISBANE",..: 239 88 239 239 88 239 241 226 241 280 ...
 $ owner_state         : Factor w/ 51 levels "","ARIZONA","AZ",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ owner_zip           : Factor w/ 376 levels "","\t28273","\t29615",..: 165 139 170 165 139 199 183 178 188 177 ...
It looks like all three data frames have a column in common called “business_id”, which we can use as a “key” to link/merge them. Even before merging, we can answer some basic questions: there were 40,935 inspections, which led to 55,800 violations, over a time range of 04/25/2011 to 04/22/2014. I use the strptime function here to turn the integer dates into POSIX dates.
inspections$date <- strptime(inspections$date,format="%Y%m%d")
violations$date  <- strptime(violations$date,format="%Y%m%d")

range(inspections$date)
[1] "2011-04-25 EDT" "2014-04-22 EDT"
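As an aside, strptime() returns POSIXlt objects, which can be a bit awkward to keep inside data frames. If you only need day-level precision, as.Date() is a lighter-weight alternative; a minimal sketch:

# Alternative: store plain Date objects rather than POSIXlt
# (if you go this route, build any start/stop boundaries with as.Date too)
inspections$date <- as.Date(as.character(inspections$date), format = "%Y%m%d")
violations$date  <- as.Date(as.character(violations$date),  format = "%Y%m%d")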
Okay, let’s extract the information just for the year 2013 and use that as our base data set. Note that I could merge the data frames into a single data frame, and we will eventually look at how to do that, but for now let’s see what we can get out of each data frame individually.
start_time <- strptime("20130101","%Y%m%d")
stop_time  <- strptime("20131231","%Y%m%d")

inspections <- subset(inspections, date > start_time & date < stop_time)
violations  <- subset(violations,  date > start_time & date < stop_time)

range(inspections$date)
[1] "2013-01-02 EST" "2013-12-30 EST"

nrow(inspections)
[1] 13110

nrow(violations)
[1] 19091
As I understand the restaurant inspection process, an inspector visits a restaurant at some frequency or in response to a diner’s complaint. The inspector evaluates the establishment, documents any code violations at a severity level of low, moderate, or high, and ultimately assigns a numeric score from 0 to 100 with the following interpretation: 0-70: Poor, 71-85: Needs Improvement, 86-89: Adequate, 90-100: Good. If violations were noted, then one or more followup inspections will usually occur to ensure that the violations are adequately addressed. Such visits usually do not involve a “re-scoring”, just a verification that the establishment dealt with the previously observed violations. Evidently this can take multiple visits.
We might then expect the inspections file to contain multiple rows for a business, starting with the original visit and any associated followups, with the Score set to NA for the followups. Let’s see if that is true. According to the inspections file, a “Routine - Unscheduled” inspection was performed on 06/05/2013 for business_id 17, resulting in a score of 94, which is good, although consulting the violations file shows that three “Low Risk” violations were noted. A reinspection happened on 07/11/2013.
head(inspections[order(inspections$business_id),])
   business_id Score       date                  type
10          17    NA 2013-07-11 Reinspection/Followup
11          17    94 2013-06-05 Routine - Unscheduled
18          19    96 2013-09-04 Routine - Unscheduled
23          24   100 2013-11-18 Routine - Unscheduled
27          29    87 2013-10-01 Routine - Unscheduled
34          37    96 2013-08-08 Routine - Unscheduled

head(violations[order(violations$business_id),])
   business_id       date ViolationTypeID risk_category                                      description
7           17 2013-06-05          103154      Low Risk     Unclean or degraded floors walls or ceilings
8           17 2013-06-05          103144      Low Risk Unapproved or unmaintained equipment or utensils
9           17 2013-06-05          103142      Low Risk                 Unclean nonfood contact surfaces
18          19 2013-09-04          103133 Moderate Risk           Foods not protected from contamination
23          29 2013-10-01          103142      Low Risk                 Unclean nonfood contact surfaces
24          29 2013-10-01          103120 Moderate Risk           Moderate risk food holding temperature
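To check the followup/NA pattern across the whole data set rather than for a single business, a quick cross-tabulation of inspection type against missing scores would do it; a minimal sketch using the columns above:

# Do reinspections/followups generally lack a Score ?
table(inspections$type, is.na(inspections$Score))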
Grading
By referring to the inspections file we can answer one of our questions easily: what is the average health score across all restaurants?
mean(inspections$Score,na.rm=T)
[1] 91.95652
But let’s dig a little deeper. I’ll create a factor called rating that implements the scoring interpretation given above. This is easy to do using the cut command. We’ll then summarize the mean score for each category and follow that up with a boxplot of the scores per category. This should give us some initial ideas about the data.
inspections$rating <- cut(inspections$Score, breaks=c(0,70,85,89,100), right=T,
                          labels=c("P","NI","A","G"))

# P = Poor, NI = Needs Improvement, A = Adequate, G = Good
tapply(inspections$Score, inspections$rating, mean)
       P       NI        A        G
64.34337 79.97907 87.55324 95.94264

ylim <- c(min(inspections$Score,na.rm=T)-5, max(inspections$Score,na.rm=T)+5)

# Capture the boxplot output to get the number of observations per category
myb <- boxplot(Score~rating, data=inspections, main="Scores per Rating Category",
               col=rainbow(4), ylim=ylim)
leg.txt <- paste(levels(inspections$rating), myb$n, sep=" : ")
legend(3, 70, leg.txt, title="Obs. per Category", cex=0.8, pch=19, col=rainbow(4))
Hmm, this information is a little shocking, since no one wants to think that there are restaurants anywhere with a health score in the “Poor” category, let alone 166 of them as we see here. Who does the lowest score of 2013 belong to? Business_id 74522. Note that it would be really easy to consult the “businesses” data frame at this point to determine the full name associated with this id, though since this is a blog on R and not on restaurant reviews, I’ll leave it to you to determine what business that is so you can make your dining plans accordingly.
# What business id had the lowest score in 2013 ?
inspections[which(inspections$Score==min(inspections$Score,na.rm=T)),]
      business_id Score       date                  type rating
39498       74522    42 2013-10-03 Routine - Unscheduled      P

# Let's see what the violations were
violations[violations$business_id==74522,][,-3]
 business_id       date risk_category                                         description
       74522 2013-10-03      Low Risk                    Unclean nonfood contact surfaces
       74522 2013-10-03     High Risk            Unclean hands or improper use of gloves
       74522 2013-10-03      Low Risk        Unclean or degraded floors walls or ceilings
       74522 2013-10-03      Low Risk   Improper storage of equipment utensils or linens
       74522 2013-10-03      Low Risk                               Improper food storage
       74522 2013-10-03 Moderate Risk                            Improper thawing methods
       74522 2013-10-03 Moderate Risk Inadequate and inaccessible handwashing facilities
       74522 2013-10-03     High Risk                        High risk vermin infestation
       74522 2013-10-03     High Risk         Unclean or unsanitary food contact surfaces
       74522 2013-10-03     High Risk                 High risk food holding temperature
       74522 2013-10-03     High Risk                    Contaminated or adulterated food
       74522 2013-10-03     High Risk               Improper cooking time or temperatures
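For the curious, mapping a business_id back to its name is a one-line subset of the businesses data frame; a sketch (I won’t print the result here, for the reason given above):

# Look up the name and address behind a business_id
businesses[businesses$business_id == 74522, c("name","address")]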
So let’s go ahead and find the 25 restaurants that had the highest number of “High Risk” violations. How would we find those?
# Pull out the high risk violations
violations_high <- violations[violations$risk_category == "High Risk",]

# Split the data frame by business_id
my_high_viols <- split(violations_high, violations_high$business_id)

# Count all violations for a given business id and then sort all
# business ids from highest to lowest violation count
violation_high_count <- sort(sapply(my_high_viols,nrow),T)[1:25]

violation_high_count
33446  2427   286  3480 70068  6695 16806 67330 70281 73926 74522   853  1000  1343  1358  1458
   11     9     7     7     7     6     6     6     6     6     6     5     5     5     5     5
 2189  2302  2401  3217  3948  4590 18800 21524 36201
    5     5     5     5     5     5     5     5     5

inspections[inspections$business_id==33446,]
      business_id Score       date                  type rating
22145       33446    73 2013-12-11 Routine - Unscheduled     NI
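As an aside, the split()/sapply() idiom above is general-purpose, but for a simple count per business_id the table() function gets us there more compactly; a minimal equivalent sketch:

# Equivalent top-25 count using table() instead of split()/sapply()
sort(table(violations_high$business_id), decreasing = TRUE)[1:25]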
Notice that while business_id 74522 has the worst overall score, 42, it didn’t actually receive the highest number of “High Risk” violations in 2013. That honor goes to business_id 33446, which received 11 such violations and a score of 73 on 12/11/2013. But let’s go back to a general view here. I want to slice up the inspections into quarters of the year (four groups of three months) to see if scores follow some seasonal pattern, at least from a visual point of view. To do this we need some of R’s date functions.
inspections$quarters <- quarters(inspections$date)
inspections$quarters <- factor(inspections$quarters, levels=c("Q1","Q2","Q3","Q4"), ordered=TRUE)

# Now let's build the barchart out of a table of quarters vs ratings
library(lattice)
barchart(table(inspections$quarters, inspections$rating),
         auto.key=list(columns=4),
         main="Violation Category Counts by Quarter")
Well, it doesn’t look like there are drastic differences between the scoring categories on a quarterly basis, though there appear to be slightly fewer scores in the “NI” category in Quarter 1 than in the other quarters. We could drop down to a month level of granularity to check things out from month to month. All we would need to do is use the “months” date function instead of “quarters” and then apply the level labels accordingly. I’ll leave that to you as an exercise, though a possible starting point is sketched below.
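For those who want a head start on that exercise, here is one possible sketch, assuming an English locale so that months() returns names matching the built-in month.name vector:

# Sketch: monthly granularity instead of quarterly
inspections$month <- months(inspections$date)
inspections$month <- factor(inspections$month, levels = month.name, ordered = TRUE)
barchart(table(inspections$month, inspections$rating),
         auto.key = list(columns = 4),
         main = "Violation Category Counts by Month")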
Grades by zipcode
Well, what’s next? Let’s take a look at the distribution of scores across zip codes, which live in the businesses file. We’ll do some merging of the data frames to make our process a bit easier. First, how many unique zip codes do we have?
unique(businesses$postal_code)
 [1] "94104"     "94124"     "94109"     "94103"     "94133"     "94118"     "94110"     "94122"     "94115"
[10] "94131"     "94111"     "94117"     "94107"     "94108"     "94102"     "94132"     "94105"     "94134"
[19] "94116"     "94121"     "94112"     "94127"     "94123"     "94114"     "94513"     "94545"     "94066"
[28] "941033148" ""          "94158"     "95105"     "94140"     "94013"     "94130"     "CA"        "92672"
[37] "94120"     "94143"     "94609"     "94101"     "00000"     "CA 94122"  "CA 94523"  "94188"     "94014"
[46] "04102"     "94129"
Most of them look legitimate, whereas others don’t appear to be valid and are perhaps the result of data entry errors. For example, while “CA 94122” does contain a legal zip code, the expected format is the five digits alone. Also, “00000” might be used to signal a missing value, but we don’t know. For simplicity we’ll strip out values that are NOT five characters. We also have some oddities like “04102” (Portland) and “92672” (San Clemente), although if we look closely at the businesses data frame for these records we see street addresses of, respectively, “366 GOLDEN GATE AVE” and “1530 Market Street”, which are San Francisco addresses. Looks like somebody messed up when entering the data. In a “real” analysis we would need to deal with these issues more definitively, but we’ll pass on them for now and merge the inspections data frame with the cleaned-up version of the businesses data frame. Finally, we’ll plot a barchart of the average score for each zip code.
myb <- businesses[nchar(businesses$postal_code)==5 & businesses$postal_code != "00000",]

unique(myb$postal_code)
 [1] "94104" "94124" "94109" "94103" "94133" "94118" "94110" "94122" "94115" "94131" "94111" "94117" "94107"
[14] "94108" "94102" "94132" "94105" "94134" "94116" "94121" "94112" "94127" "94123" "94114" "94513" "94545"
[27] "94066" "94158" "95105" "94140" "94013" "94130" "92672" "94120" "94143" "94609" "94101" "94188" "94014"
[40] "04102" "94129"

# Let's merge the two data frames myb and inspections using business_id as a key.
mym <- merge(myb[,c(1:2,6)], inspections, by="business_id")
head(mym)
  business_id                               name postal_code Score       date
1          17               GEORGE'S COFFEE SHOP       94124    94 2013-06-05
2          17               GEORGE'S COFFEE SHOP       94124    NA 2013-07-11
3          19              NRGIZE LIFESTYLE CAFE       94109    96 2013-09-04
4          24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY       94104   100 2013-11-18
5          29                      CHICO'S PIZZA       94103    87 2013-10-01
6          37                        CAFE BISTRO       94118    NA 2013-01-30
                   type rating quarters
1 Routine - Unscheduled      G       Q2
2 Reinspection/Followup   <NA>       Q3
3 Routine - Unscheduled      G       Q3
4 Routine - Unscheduled      G       Q4
5 Routine - Unscheduled      A       Q4
6 Reinspection/Followup   <NA>       Q1

# Let's create an aggregation
hold <- aggregate(Score~postal_code, data=mym, mean)
hvec <- hold$Score
names(hvec) <- hold$postal_code

# Sort once and keep the sorted vector so the axis labels match the bars
svec   <- rev(sort(hvec))
myCols <- colorRampPalette(c("blue", "red"))( length(svec) )
mp <- barplot(svec, axisnames=F, col=myCols, main="Average Score per Zipcode")
axis(1, at=mp, labels=names(svec), las=2, cex.axis=0.8)
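Note that the nchar() test keeps any five-character string, not just five digits; if you want to be stricter, a regular expression via grepl() would enforce digits-only. A sketch:

# Stricter alternative for building myb: exactly five digits (still excluding "00000")
myb <- businesses[grepl("^[0-9]{5}$", businesses$postal_code) &
                  businesses$postal_code != "00000", ]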
Let’s see if the zip codes of the 100 lowest-scoring inspections overlap with those of the 100 highest-scoring ones. From the following it looks like there is significant overlap, so it’s not as if there is a big difference between the zip codes.
# Zip codes represented among the 100 lowest scores
hi <- unique(mym[order(mym$Score),][1:100,]$postal_code)
hi
 [1] "94133" "94103" "94102" "94111" "94108" "94116" "94127" "94110" "94122"
[10] "94121" "94114" "94123" "94107" "94132" "94118" "94109"

# Zip codes represented among the 100 highest scores
lo <- unique(mym[order(-mym$Score),][1:100,]$postal_code)
lo
 [1] "94104" "94103" "94111" "94122" "94117" "94108" "94115" "94132" "94105"
[10] "94110" "94123" "94118" "94107" "94112" "94133" "94131" "94121" "94102"
[19] "94109"

sum(lo %in% hi)
[1] 13
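Incidentally, if you want to see which zip codes the two groups share rather than just how many, intersect() returns them directly:

# The zip codes common to the low-score and high-score groups
intersect(lo, hi)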
What’s for Dessert?
What next? This is really just the beginning. One approach I didn’t take was to use the very cool sqldf package, which allows us to treat data frames as tables within a relational database. However, this assumes some familiarity with SQL, which, at least in my experience, newcomers to R don’t usually possess, although it is worth learning if you plan on a career in data mining. As a teaser, here is how we could have done some of the above using sqldf. Check this out and see you next time.
library(sqldf)

businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)

# How many restaurants are there ?
sqldf("select count(*) from businesses")
  count(*)
1     6109

# What is the average score for all restaurants ?
sqldf("select avg(Score) from inspections")
  avg(Score)
1   91.95652

# What are the average scores per zip code ?
# Note that since this is the raw data we have some "bad" zipcodes
sqldf("select avg(Score), businesses.postal_code from inspections
       join businesses using(business_id) group by postal_code")
   avg(Score) postal_code
1    91.85986
2    90.00000       00000
3    96.00000       04102
4    96.50000       92672
5    98.33333       94013
6    85.66667       94014
7   100.00000       94066
8    95.75000       94101
9    91.73356       94102
10   91.34171       94103
11   98.00000   941033148
12   95.65000       94104
13   94.11030       94105
14   94.72059       94107
15   90.39931       94108
16   90.48051       94109
17   91.54456       94110
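And just to tie the teaser back to the earlier analysis, the high-risk violation count could be phrased in SQL as well. A sketch; note it runs against the raw, unfiltered data here, so the counts will differ from the 2013-only numbers above:

# Top 10 business ids by number of "High Risk" violations (raw data, all years)
sqldf("select business_id, count(*) as n
       from violations
       where risk_category = 'High Risk'
       group by business_id
       order by n desc
       limit 10")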