
Dining in San Francisco – Let R Guide You


I’m frequently asked by newcomers to R to provide an easy-to-follow, generic set of instructions on how to download data, transform it, aggregate it, make graphs, and write it all up for publication in a high-impact journal – all by the end of the day! While such a request is somewhat understandable coming from a student, it’s more awkward coming from a seasoned research investigator (i.e. someone who should know better). Obviously, no such set of instructions exists, given the astonishing variety of available data stored in differing formats with varying degrees of adherence to any standard. Simply put, “real world” data wrangling and analysis can be a complex process independent of the language you choose to employ. Nonetheless, what I’ll attempt to do here is provide a “tour” involving some data I picked at random. I’ll stick to the problems of downloading, filtering, and graphing some of the data. And rather than dictate to you the “best way” to do something, I’ll simply show you what is on offer and you can make your own decisions. R is amazingly flexible, which means there are many different ways to do the same thing.

I have chosen information relating to San Francisco restaurant inspections, which can be found at https://data.sfgov.org/ along with an impressive variety of data sets involving crime reports, parking availability, bus routes, campaign funding, and many other types of civic info. For reference, we can map the zip codes mentioned in the health reports using the methods outlined in my blog post on Geocoding. This is basically the area we are considering.

Mmm. Let’s eat! I’m hungry!

Without even knowing how the restaurant information is organized, one can easily imagine a number of interesting questions, such as: “What is the average health score for all restaurants? Which restaurants have the highest (or lowest) scores? What is the distribution of scores across zip codes? Are there seasonal scoring trends?” While you can download the files from the above website, I have saved a copy on my web site in case the original location changes at some point. The data is available in a zipped file that contains three CSV files that we can easily unpack into data frames.

url <- "http://steviep42.bitbucket.org/data/SFFoodProgram_Complete_Data.zip"

# Fetch the zip archive and list its contents (requires a command-line unzip)
download.file(url,"SFFood.zip")
system("unzip -l SFFood.zip")

Archive:  SFFood.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
  4613344  04-23-14 09:51   violations_plus.csv
  1175067  04-23-14 09:51   businesses_plus.csv
  1754811  04-23-14 09:51   inspections_plus.csv
 --------                   -------
  7543222                   3 files
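As an aside, if you don’t have a command-line unzip available (e.g. on some Windows setups), base R’s unzip() function can list the archive contents directly. A quick equivalent:

# List the archive contents from within R (no external unzip needed)
unzip("SFFood.zip", list=TRUE)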

There are three files, each of which has some key information relating to the restaurant inspection process. I should point out that a description file comes with the dataset, which you can read to get a better idea of the column names and formats. But we can also use R to help us understand what is going on. Let’s read these files into data frames:

businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)


str(violations)

'data.frame':	55800 obs. of  5 variables:
 $ business_id    : int  10 10 10 10 10 10 17 17 17 17 ...
 $ date           : int  20140114 20140114 20140114 20120403 20110428 20121114 20130605 20130605 20130605 20140312 ...
 $ ViolationTypeID: int  103119 103145 103154 103154 103119 103154 103154 103144 103142 103142 ...
 $ risk_category  : Factor w/ 4 levels "High Risk","Low Risk",..: 3 2 2 2 3 2 2 2 2 2 ...
 $ description    : Factor w/ 70 levels "","Consumer advisory not provided for raw or undercooked foods",..: 22 17 62 62 22 62 62 58 61 61 ...

str(inspections)

'data.frame':	40935 obs. of  4 variables:
 $ business_id: int  10 10 10 10 10 10 10 10 17 17 ...
 $ Score      : int  NA 92 98 NA 98 100 NA 96 94 NA ...
 $ date       : int  20140124 20140114 20121114 20120920 20120403 20110928 20110601 20110428 20140312 20130711 ...
 $ type       : Factor w/ 14 levels "Administrative or Document Review",..: 11 13 13 11 13 13 11 13 13 11 ...

str(businesses)

'data.frame':	6109 obs. of  17 variables:
 $ business_id         : int  10 17 19 24 29 31 37 45 48 50 ...
 $ name                : Factor w/ 5562 levels " HOL N JAM LEMONADE STAND  #2",..: 5110 1901 3525 3562 1067 3509 793 1027 265 4697 ...
 $ address             : Factor w/ 5444 levels "","   ","  VARIOUS LOCATIONS  ",..: 119 1868 548 3892 713 2478 3506 2870 4814 1446 ...
 $ city                : Factor w/ 1 level "San Francisco": 1 1 1 1 1 1 1 1 1 1 ...
 $ state               : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
 $ postal_code         : Factor w/ 47 levels "","00000","04102",..: 12 29 16 12 10 35 24 17 27 10 ...
 $ latitude            : num  37.8 37.7 37.8 37.8 37.8 ...
 $ longitude           : num  -122 -122 -122 -122 -122 ...
 $ phone_number        : num  NA 1.42e+10 NA NA 1.42e+10 ...
 $ TaxCode             : Factor w/ 32 levels "AA","H07","H08",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ business_certificate: int  779059 78443 NA 352312 939325 346882 315736 340024 318022 NA ...
 $ application_date    : Factor w/ 2327 levels "","01/01/2005",..: 1 620 1 1 2137 1 1473 1836 1 1 ...
 $ owner_name          : Factor w/ 5294 levels "","\tCARLIN'S CORNER COFFEESHOP LLC",..: 4746 2704 61 3387 186 3335 4297 1809 5198 333 ...
 $ owner_address       : Factor w/ 5366 levels "","\t1 INDEPENDENCE POINTE #305",..: 2898 4397 484 3785 1479 2492 3413 414 1103 2206 ...
 $ owner_city          : Factor w/ 335 levels "","\tBRISBANE",..: 239 88 239 239 88 239 241 226 241 280 ...
 $ owner_state         : Factor w/ 51 levels "","ARIZONA","AZ",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ owner_zip           : Factor w/ 376 levels "","\t28273","\t29615",..: 165 139 170 165 139 199 183 178 188 177 ...

It looks like all three data frames have a column in common called “business_id”, which we can use as a “key” to link/merge them. Even before merging, we can answer some basic questions: there were 40,935 inspections, which led to 55,800 violations, over a time range of 04/25/2011 to 04/22/2014. I use the strptime function here to turn the integer dates (e.g. 20140114) into POSIX dates.

inspections$date <- strptime(inspections$date,format="%Y%m%d")
violations$date  <- strptime(violations$date,format="%Y%m%d")
range(inspections$date)
[1] "2011-04-25 EDT" "2014-04-22 EDT"

Okay, let’s extract information just for the year 2013 and use that as our base data set. Note that I could merge the data frames into a single data frame (and we will eventually look at how to do that), but for now let’s see what we can get out of each data frame individually.

start_time <- strptime("20130101","%Y%m%d")
stop_time  <- strptime("20131231","%Y%m%d")

# Note: the strict inequalities exclude records dated exactly Jan 1 or Dec 31
inspections <- subset(inspections, date > start_time & date < stop_time)
violations  <- subset(violations,  date > start_time & date < stop_time)

range(inspections$date)
[1] "2013-01-02 EST" "2013-12-30 EST"

nrow(inspections)
[1] 13110

nrow(violations)
[1] 19091

Relative to the restaurant inspection process, the way I understand it is that an inspector will visit a restaurant at some frequency, or in response to a diner’s complaint. The inspector will evaluate the establishment, document any code violations at a severity level of low, moderate, or high, and ultimately assign a numeric score from 0 to 100 with the following interpretation: 0–70: Poor, 71–85: Needs Improvement, 86–89: Adequate, 90–100: Good. If violations were noted, then one or more follow-up inspections will usually occur to ensure that the violations are adequately addressed. Such visits usually do not involve a “re-scoring”, just a verification that the establishment dealt with the previously observed violations. Evidently this can take multiple visits.

We might then expect the inspections file to contain multiple rows for a business, starting with the original visit followed by any associated follow-ups, and in the case of the follow-ups it looks like the Score is set to NA. Let’s see if that is true. According to the inspections file, a “Routine – Unscheduled” inspection was performed on 06/05/2013 for business_id 17, resulting in a score of 94, which is good, although consulting the violations file shows that three “Low Risk” violations were noted. A reinspection happened on 07/11/2013.


head(inspections[order(inspections$business_id),])

   business_id Score       date                  type
10          17    NA 2013-07-11 Reinspection/Followup
11          17    94 2013-06-05 Routine - Unscheduled
18          19    96 2013-09-04 Routine - Unscheduled
23          24   100 2013-11-18 Routine - Unscheduled
27          29    87 2013-10-01 Routine - Unscheduled
34          37    96 2013-08-08 Routine - Unscheduled


head(violations[order(violations$business_id),])

   business_id       date ViolationTypeID risk_category                                      description
7           17 2013-06-05          103154      Low Risk     Unclean or degraded floors walls or ceilings
8           17 2013-06-05          103144      Low Risk Unapproved or unmaintained equipment or utensils
9           17 2013-06-05          103142      Low Risk                 Unclean nonfood contact surfaces
18          19 2013-09-04          103133 Moderate Risk           Foods not protected from contamination
23          29 2013-10-01          103142      Low Risk                 Unclean nonfood contact surfaces
24          29 2013-10-01          103120 Moderate Risk          Moderate risk food holding temperature 

Grading

By referring to the inspections file we can easily answer one of our questions: what is the average health score across all restaurants?

mean(inspections$Score,na.rm=T)
[1] 91.95652 
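Because the follow-up visits carry NA scores, it is also worth running summary(), which reports the quartiles and the NA count in one shot:

# Five-number summary of the scores plus a count of the NA follow-ups
summary(inspections$Score)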

But let’s dig a little deeper. I’ll create a factor called rating that implements the scoring interpretation given above. This is easy to do using the cut command. We’ll then summarize the mean score for each category and follow that up with a boxplot of the scores per category. This should give us some initial ideas about the data.


inspections$rating <- cut(inspections$Score,breaks=c(0,70,85,89,100),right=T,
    labels=c("P","NI","A","G")) 

# P = Poor, NI = Needs Improvement, A = Adequate, G = Good

tapply(inspections$Score,inspections$rating,mean)
       P       NI        A        G 
64.34337 79.97907 87.55324 95.94264 
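It is also worth checking how many scored inspections land in each category; table() gives the counts directly (the same numbers the boxplot legend below will pick up from the captured boxplot output):

# How many inspections fall into each rating category ?
table(inspections$rating)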

ylim <- c(min(inspections$Score,na.rm=T)-5,max(inspections$Score,na.rm=T)+5)

# Capture the boxplot output to get the number of observations per category

myb <- boxplot(Score~rating,data=inspections,main="Scores per Rating Category",
               col=rainbow(4),ylim=ylim)

leg.txt <- paste(levels(inspections$rating),myb$n,sep=" : ")

legend(3,70,leg.txt,title="Obs. per Category",cex=0.8,pch=19,col=rainbow(4))

Boxplots of score per category

Hmm, this information is a little shocking, since no one wants to think that there are restaurants anywhere with a health score in the “Poor” category, let alone the 166 of them we see here. Who does the lowest score of 2013 belong to? Business_id 74522. Note that it would be easy to consult the “businesses” data frame at this point to determine the full name associated with this id, though since this is a blog about R and not restaurant reviews, I’ll leave it to you to look up the business name so you can make your dining plans accordingly.

# What business id had the lowest score in 2013 ?

inspections[which(inspections$Score==min(inspections$Score,na.rm=T)),]

      business_id Score       date                  type rating
39498       74522    42 2013-10-03 Routine - Unscheduled      P

# Let's see what the violations were

violations[violations$business_id==74522,][,-3]

 business_id       date  risk_category                                        description
       74522 2013-10-03                Low Risk                   Unclean nonfood contact surfaces
       74522 2013-10-03               High Risk            Unclean hands or improper use of gloves
       74522 2013-10-03                Low Risk       Unclean or degraded floors walls or ceilings
       74522 2013-10-03                Low Risk   Improper storage of equipment utensils or linens
       74522 2013-10-03                Low Risk                              Improper food storage
       74522 2013-10-03           Moderate Risk                          Improper thawing methods 
       74522 2013-10-03           Moderate Risk Inadequate and inaccessible handwashing facilities
       74522 2013-10-03               High Risk                       High risk vermin infestation
       74522 2013-10-03               High Risk        Unclean or unsanitary food contact surfaces
       74522 2013-10-03               High Risk                High risk food holding temperature 
       74522 2013-10-03               High Risk                   Contaminated or adulterated food
       74522 2013-10-03               High Risk              Improper cooking time or temperatures

So let’s go ahead and find the 25 restaurants that had the highest number of “High Risk” violations. How would we find those?

# Pull out the high risk violations

violations_high <- violations[violations$risk_category == "High Risk",]

# Split the data frame by business_id

my_high_viols <- split(violations_high,violations_high$business_id)

# Count all violations for a given business id and then sort all
# business ids from highest to lowest

violation_high_count <- sort(sapply(my_high_viols,nrow),decreasing=TRUE)[1:25]

violation_high_count

33446  2427   286  3480 70068  6695 16806 67330 70281 73926 74522   853  1000  1343  1358  1458 
   11     9     7     7     7     6     6     6     6     6     6     5     5     5     5     5 
 2189  2302  2401  3217  3948  4590 18800 21524 36201 
    5     5     5     5     5     5     5     5     5 
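Incidentally, the split()/sapply() idiom generalizes to fancier per-group computations, but for a plain count, table() produces the same ranking in one line:

# The same ranking in one line: tabulate high-risk violations per business
head(sort(table(violations_high$business_id), decreasing=TRUE), 25)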

inspections[inspections$business_id==33446,]
      business_id Score       date                  type rating 
22145       33446    73 2013-12-11 Routine - Unscheduled     NI      

Notice that while business_id 74522 has the worst overall score, 42, it didn’t actually receive the highest number of “High Risk” violations in 2013. That honor goes to business_id 33446, which received 11 such violations along with a score of 73 on 12/11/2013. But let’s go back to a more general view. I want to slice the inspections up into quarters of the year (four groups of three months) to see if scores follow some seasonal pattern, at least from a visual point of view. To do this we need some of R’s date functions.

inspections$quarters <- quarters(inspections$date)
inspections$quarters <- factor(inspections$quarters,levels=c("Q1","Q2","Q3","Q4"),ordered=TRUE)

# Now let's build the barchart out of a table of quarters vs ratings

library(lattice)
barchart(table(inspections$quarters,inspections$rating),auto.key=list(columns=4),
         main="Violation Category Counts by Quarter")

Barchart of quarters vs. ratings

Well, it doesn’t look like there are drastic differences between the scoring categories on a quarterly basis, though there appear to be slightly fewer scores in the “NI” category in Quarter 1 than in the other quarters. We could drop down to a month level of granularity to check things out from month to month. All we would need to do is use the “months” date function instead of “quarters” and then apply level labels accordingly. I’ll leave that to you as an exercise, but the sketch below should get you started.
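Here is a minimal sketch of that monthly version (it assumes an English locale so that the abbreviated month names match R’s built-in month.abb vector):

# Monthly granularity: same idea as the quarterly version above
inspections$month <- months(inspections$date, abbreviate=TRUE)
inspections$month <- factor(inspections$month, levels=month.abb, ordered=TRUE)
barchart(table(inspections$month, inspections$rating), auto.key=list(columns=4),
         main="Rating Counts by Month")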

Grades by zip code

Well, what’s next? Let’s take a look at the distribution of scores across zip codes, which we can access from the businesses file. We’ll also do some merging of the data frames to make our process a bit easier. First, how many unique zip codes do we have?

unique(businesses$postal_code)
 [1] "94104"     "94124"     "94109"     "94103"     "94133"     "94118"     "94110"     "94122"     "94115"    
[10] "94131"     "94111"     "94117"     "94107"     "94108"     "94102"     "94132"     "94105"     "94134"    
[19] "94116"     "94121"     "94112"     "94127"     "94123"     "94114"     "94513"     "94545"     "94066"    
[28] "941033148" ""          "94158"     "95105"     "94140"     "94013"     "94130"     "CA"        "92672"    
[37] "94120"     "94143"     "94609"     "94101"     "00000"     "CA  94122" "CA  94523" "94188"     "94014"    
[46] "04102"     "94129"

Most of them look legitimate, whereas others don’t appear to be valid and are perhaps the result of data entry errors. For example, while “CA  94122” does contain a legal zip code, the expected format is five digits. The “00000” might be used to signal a missing value, but we don’t know. For simplicity we’ll strip out values that are NOT five characters long. We also have some oddities like “04102” (Portland) and “92672” (San Clemente), although if we look closely at the businesses data frame for these records we see street addresses of, respectively, “366 GOLDEN GATE AVE” and “1530 Market Street”, which are San Francisco addresses. Looks like somebody messed up when entering the data. In a “real” analysis we would need to deal with these issues more definitively, but for now we’ll pass on them and merge the inspections data frame with a new version of the businesses data frame containing the cleaned-up zip codes. Finally, we’ll plot a bar chart of the average score for each zip code.


# Keep the five-character postal codes and drop the "00000" placeholder
myb <- businesses[nchar(businesses$postal_code)==5 & businesses$postal_code != "00000",]

unique(myb$postal_code)

[1] "94104" "94124" "94109" "94103" "94133" "94118" "94110" "94122" "94115" "94131" "94111" "94117" "94107"
[14] "94108" "94102" "94132" "94105" "94134" "94116" "94121" "94112" "94127" "94123" "94114" "94513" "94545"
[27] "94066" "94158" "95105" "94140" "94013" "94130" "92672" "94120" "94143" "94609" "94101" "94188" "94014"
[40] "04102" "94129"

# Let's merge the two data frames myb and inspections using business_id as a key.

mym <- merge(myb[,c(1:2,6)],inspections,by="business_id")

head(mym)

business_id                               name postal_code Score       date
1          17               GEORGE'S COFFEE SHOP       94124    94 2013-06-05
2          17               GEORGE'S COFFEE SHOP       94124    NA 2013-07-11
3          19              NRGIZE LIFESTYLE CAFE       94109    96 2013-09-04
4          24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY       94104   100 2013-11-18
5          29                      CHICO'S PIZZA       94103    87 2013-10-01
6          37                        CAFE BISTRO       94118    NA 2013-01-30
                   type rating quarters
1 Routine - Unscheduled      G       Q2
2 Reinspection/Followup   <NA>       Q3
3 Routine - Unscheduled      G       Q3
4 Routine - Unscheduled      G       Q4
5 Routine - Unscheduled      A       Q4
6 Reinspection/Followup   <NA>       Q1
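Note that merge() performs an inner join by default, so inspections whose business_id does not appear in myb (i.e. businesses with malformed zip codes) are silently dropped. A quick check of how many rows that costs us:

# merge() keeps only matching business_ids; count the dropped inspection rows
nrow(inspections) - nrow(mym)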

# Let's create an aggregation

hold <- aggregate(Score~postal_code,data=mym,mean)

hvec <- hold$Score

names(hvec) <- hold$postal_code

myCols <- colorRampPalette(c("blue", "red"))( length(hvec) )
mp <- barplot(rev(sort(hvec)),axisnames=F,col=myCols,main="Average Score per Zipcode")
axis(1,at=mp,labels=names(hvec),las=2,cex.axis=0.8)

Avg Score per Zip Code

Let’s see if the 100 lowest-scoring inspections fall in the same zip codes as the 100 highest-scoring ones. From the following it looks like there is significant overlap, so it’s not as if there is a big difference between the zip codes.

lo <- unique(mym[order(mym$Score),][1:100,]$postal_code)

lo
 [1] "94133" "94103" "94102" "94111" "94108" "94116" "94127" "94110" "94122"
[10] "94121" "94114" "94123" "94107" "94132" "94118" "94109"

hi <- unique(mym[order(-mym$Score),][1:100,]$postal_code)
hi
 [1] "94104" "94103" "94111" "94122" "94117" "94108" "94115" "94132" "94105"
[10] "94110" "94123" "94118" "94107" "94112" "94133" "94131" "94121" "94102"
[19] "94109"

sum(lo %in% hi)
[1] 13

What’s for Dessert?

What next? This is really just the beginning. One approach I didn’t take was to use the very cool sqldf package, which lets us treat the data frames as tables within a relational database. This assumes some familiarity with SQL which, at least in my experience, newcomers to R don’t usually possess, although it is worth learning if you plan on a career in data mining. As a teaser, here is how we could have done some of the above using sqldf. Check this out, and see you next time.

library(sqldf)
businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)

# How many restaurants are there ?

sqldf("select count(*) from businesses")
count(*)
1     6109

# What is the average score for all restaurants ?

sqldf("select avg(Score) from inspections")
  avg(Score)
1   91.95652

# What are the average scores per zip code ?
# Note that since this is the raw data we have some "bad" zipcodes

sqldf("select avg(Score),businesses.postal_code from inspections join businesses using(business_id) group by postal_code")

   avg(Score) postal_code
1    91.85986            
2    90.00000       00000
3    96.00000       04102
4    96.50000       92672
5    98.33333       94013
6    85.66667       94014
7   100.00000       94066
8    95.75000       94101
9    91.73356       94102
10   91.34171       94103
11   98.00000   941033148
12   95.65000       94104
13   94.11030       94105
14   94.72059       94107
15   90.39931       94108
16   90.48051       94109
17   91.54456       94110

Filed under: GeoCoding XML processing, Graphics, Public Health
