Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A Question
Recently, a student, working on her Senior Thesis at Northland College, asked me the following question:
Attached is an Excel file with three “important to R” worksheets. The only thing that connects all 3 worksheets is the Lift.ID variable. Is there a way to tell R this so that I don’t have to paste all three worksheets onto one new worksheet and then manually format the data so that there is one Lift.ID that connects the Count sheet info to the Lift sheet info?
An Answer
Or course, the best answer to this (type of) question would be to teach this student how to use a relational database and generate a query. Getting that into our curriculum is a discussion for another time (and venue).
However, as these data were fairly simple, her question can be completed in R with merge()
. I demonstrate this below with two practice data frames that each have a Lift.ID variable. To test behavior, df2
only had one species for lift.ID==5
and neither species for lift.ID==6
. The data frames were constructed with (again, these are just example data, real data would be imported from an external file)
( df1 <- data.frame(Lift.ID=1:6, eff=c(3,4,2,5,1,1), loc=c("A","B","A","C","B","D")) ) ## Lift.ID eff loc ## 1 1 3 A ## 2 2 4 B ## 3 3 2 A ## 4 4 5 C ## 5 5 1 B ## 6 6 1 D ( df2 <- data.frame(Lift.ID=c(rep(1:4,each=2),5), species=c(rep(c("LKT","PWF"),4),"LKT"), catch=round(rnorm(9,mean=20,sd=3),0))) ## Lift.ID species catch ## 1 1 LKT 14 ## 2 1 PWF 21 ## 3 2 LKT 14 ## 4 2 PWF 20 ## 5 3 LKT 15 ## 6 3 PWF 19 ## 7 4 LKT 21 ## 8 4 PWF 18 ## 9 5 LKT 23
These two data frames can be joined to form one data frame with merge()
where the first two arguments are the two data frames (the first one is considered to be the “x” data frame) and the by=
argument contains the name of the variable that links the data frames. For example,
( df.comb1 <- merge(df1,df2,by="Lift.ID") ) ## Lift.ID eff loc species catch ## 1 1 3 A LKT 14 ## 2 1 3 A PWF 21 ## 3 2 4 B LKT 14 ## 4 2 4 B PWF 20 ## 5 3 2 A LKT 15 ## 6 3 2 A PWF 19 ## 7 4 5 C LKT 21 ## 8 4 5 C PWF 18 ## 9 5 1 B LKT 23
Thus, for example, the catch-per-unit-effort (CPE) for each lift could be appended with
df.comb1$CPE <- df.comb1$catch/df.comb1$eff df.comb1 ## Lift.ID eff loc species catch CPE ## 1 1 3 A LKT 14 4.667 ## 2 1 3 A PWF 21 7.000 ## 3 2 4 B LKT 14 3.500 ## 4 2 4 B PWF 20 5.000 ## 5 3 2 A LKT 15 7.500 ## 6 3 2 A PWF 19 9.500 ## 7 4 5 C LKT 21 4.200 ## 8 4 5 C PWF 18 3.600 ## 9 5 1 B LKT 23 23.000
A Potential Problem
A potential problem exists if the student wanted to computed mean CPE by species. For species=="LKT"
the mean will not include a zero CPE for Lift.ID==6
and for species=="PWF"
the mean will not include zeroes for each of the last two lifts.
This problem can be corrected by first including the all.x=TRUE
argument to merge()
such that each row (lift) in the “x” data frame will be included in the combined data frame whether or not there is a corresponding row in the “y” data frame. In other words, information about Lift.ID==6
needs to be in the combined data frame. For example, compare the following result of merge()
to the previous result.
( df.comb2 <- merge(df1,df2,by="Lift.ID",all.x=TRUE) ) ## Lift.ID eff loc species catch ## 1 1 3 A LKT 14 ## 2 1 3 A PWF 21 ## 3 2 4 B LKT 14 ## 4 2 4 B PWF 20 ## 5 3 2 A LKT 15 ## 6 3 2 A PWF 19 ## 7 4 5 C LKT 21 ## 8 4 5 C PWF 18 ## 9 5 1 B LKT 23 ## 10 6 1 D NA NA
However, this still does not completely solve the problem because there are no “zeroes” for catches. The “zeroes” can be added to the dataframe with addZeroCatch()
from the FSA package. First, load the FSA package
library(FSA)
The addZeroCatch()
function requires the original data frame as the first argument, the name of the variable that defines the sampling event (i.e., a net lift) in the eventvar=
argument, the variabile that identifies the species caught in the speciesvar=
argument, and the variable that should receive the zeroes in the zerovar=
argument. For example,
( df.comb2 <- addZeroCatch(df.comb2,eventvar="Lift.ID", specvar="species",zerovar="catch") ) ## Lift.ID eff loc species catch ## 1 1 3 A LKT 14 ## 2 1 3 A PWF 21 ## 3 2 4 B LKT 14 ## 4 2 4 B PWF 20 ## 5 3 2 A LKT 15 ## 6 3 2 A PWF 19 ## 7 4 5 C LKT 21 ## 8 4 5 C PWF 18 ## 9 5 1 B LKT 23 ## 10 6 1 D NA NA ## 102 6 1 D LKT 0 ## 91 5 1 B PWF 0 ## 101 6 1 D PWF 0
The last remaining problem is the row with the NA
s that was created when the all.x=TRUE=
argument was used (see row 10 above). This row (and if there were more rows like it) can be removed as follows
( df.comb2 <- subset(df.comb2,!is.na(catch)) ) ## Lift.ID eff loc species catch ## 1 1 3 A LKT 14 ## 2 1 3 A PWF 21 ## 3 2 4 B LKT 14 ## 4 2 4 B PWF 20 ## 5 3 2 A LKT 15 ## 6 3 2 A PWF 19 ## 7 4 5 C LKT 21 ## 8 4 5 C PWF 18 ## 9 5 1 B LKT 23 ## 102 6 1 D LKT 0 ## 91 5 1 B PWF 0 ## 101 6 1 D PWF 0
Now the summary statistics for CPE for each species can be computed (note that Summarize()
is from FSA).
df.comb2$CPE <- df.comb2$catch/df.comb2$eff ( Summarize(CPE~species,data=df.comb2,digits=2) ) ## species n mean sd min Q1 median Q3 max percZero ## 1 LKT 6 7.14 8.13 0 3.68 4.43 6.79 23.0 16.67 ## 2 PWF 6 4.18 3.80 0 0.90 4.30 6.50 9.5 33.33
Compare these to the incorrect values from the first use of merge()
.
( Summarize(CPE~species,data=df.comb1,digits=2) ) ## species n mean sd min Q1 median Q3 max percZero ## 1 LKT 5 8.57 8.21 3.5 4.20 4.67 7.50 23.0 0 ## 2 PWF 4 6.28 2.56 3.6 4.65 6.00 7.62 9.5 0
Filed under: Fisheries Science, R Tagged: Data Manipulation, R
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.