Fix missing dates with R
[This article was first published on plausibel, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have data on user access to a website. This log file (helpdesk log.csv) just contains the date of access, and how many accesses were counted. It would look like this:
Date hits
13-07-2011 2
14-07-2011 1
16-07-2011 3
17-07-2011 4
…
As you can see, for days with no access (like 15-07-2011 for example), there is no entry.
I wanted to draw a graph showing the number of hits over time. plotting this shows the graph below, but it’s conditional on there having been at least one hit. So it’s a bit misleading. We don’t know if there was zero hits or one.
What I plot looks like this:
2011-06-28 1
2011-06-29 2
2011-06-30 3
2011-07-01 1
2011-07-04 3
2011-07-05 3
Obviously, no data for 2011-07-02 and 2011-07-03, when I would want an entry 2011-07-02 = 0. In other words, I want this
2011-06-28 1
2011-06-29 2
2011-06-30 3
2011-07-01 1
2011-07-02 0
2011-07-03 0
2011-07-04 3
2011-07-05 3
So, I need to insert date and a value of zero for each date with no activity. There’s an easy way to do this in R.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# load data | |
hdata <- read.csv("helpdesk log.csv",header=TRUE,sep=",") | |
hdata$date <- as.Date(hdata$Timestamp,format="%m/%d/%Y") # Timestamp is in mm/dd/YYYY format | |
daycount <- table(hdata$date) # count how many hits per day | |
# daycount is what I call the "short" vector. It misses some days. | |
################################## | |
# daily response wrong graph | |
oldmar <- c(5,4,4,2) + 0.1 # store default pars | |
par(mar = c(7, 4, 4, 2) + 0.1) # setup bigger plotting window | |
png(file="daily-wrong.png") | |
plot(daycount,xaxt="no",type="l",main="Daily number of Interactions",ylab = "Number of Interactions",xlab="") | |
axis(1,at=seq(1,44,by=6),labels=FALSE) | |
labs <- names(daycount) | |
text(seq(1,44,by=6), par("usr")[3] - 0.5, srt = 45, adj = 1, # text() allows to print axis labels rotated (srt=45) | |
labels = labs[seq(1,44,by=6)], xpd = TRUE) | |
mtext("Date",side=1,line = 5) | |
grid() | |
dev.off() | |
par(mar = oldmar) | |
################################## | |
# generate vector of all dates | |
alldays <- seq(hdata$date[1],length=62,by="+1 day") # this vector has all days | |
# notice how R recognizes the date format and chooses the appropriate method for the seq() function. | |
# You don't have to worry about 30 days in June, 31 in July etc. R does it all for you. | |
allcount <- table(alldays) # create table object from alldays. | |
actindex <- match(names(allcount),names(daycount),nomatch = 0) | |
# create "active" index: vector of length(allcount), i.e. all days. | |
# on days with no activity (i.e. a missing day in daycount), this has value 0 (nomatch = 0). | |
# For days with activity, actindex holds the index of the matching position in daycount. | |
# function to get entries of daycount corresponding to actindex | |
# indexing is a bit tricky. i loops over all days. get correct date by | |
# substracting all "zero-activity" days accumulated so far. | |
days <- function(actindex,daycount){ | |
n <- length(actindex) | |
x <- rep(NA,times=n) | |
zero <- 0 | |
for (i in 1:n){ | |
if (actindex[i]==0) { | |
zero <- zero +1 | |
x[i] <- 0 | |
} else { | |
x[i] <- daycount[i-zero] | |
} | |
} | |
return(x) | |
} | |
alldaycount <- array(days(actindex,daycount)) # construct vector with number of hits per day | |
names(alldaycount) <- names(allcount) # name entries by consecutive dates. | |
################################## | |
# daily response: correct graph | |
png(file="dailyrespons.png") # save next graph as .png | |
par(mar = c(7, 4, 4, 2) + 0.1) | |
plot(alldaycount,axes=FALSE,type="l",main="Daily number of Interactions",ylab = "Number of Interactions",xlab="") | |
axis(1,at=seq(1,62,by=5),labels=FALSE) | |
axis(2,at=c(0,1,2,3,4,5,8)) | |
labs <- names(alldaycount)[seq(1,62,by=5)] | |
text(seq(1,62,by=5), par("usr")[3] - 0.5, srt = 45, adj = 1, | |
labels = labs, xpd = TRUE) | |
mtext("Date",side=1,line = 5) | |
grid() | |
dev.off() | |
par(mar = oldmar) | |
################################## |
so actind is an index vector. the first seven entries are
2011-06-28 1
2011-06-29 2
2011-06-30 3
2011-07-01 4
2011-07-02 0
2011-07-03 0
2011-07-04 5
where each row corresponds to a consecutive date, zero means no activity on that date, and a positive number is the INDEX of the element in “daycount” (the short vector) corresponding to that date.
The correct graph is this one:
To leave a comment for the author, please follow the link and comment on their blog: plausibel.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.