Fix missing dates with R

[This article was first published on plausibel, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have data on user access to a website. This log file (helpdesk log.csv) just contains the date of access, and how many accesses were counted. It would look like this:

Date hits
13-07-2011 2
14-07-2011 1
16-07-2011 3
17-07-2011 4

As you can see, for days with no access (like 15-07-2011 for example), there is no entry.
I wanted to draw a graph showing the number of hits over time. plotting this shows the graph below, but it’s conditional on there having been at least one hit. So it’s a bit misleading. We don’t know if there was zero hits or one.



What I plot looks like this:
2011-06-28 1
2011-06-29 2
2011-06-30 3
2011-07-01 1
2011-07-04 3
2011-07-05 3


Obviously, no data for 2011-07-02 and 2011-07-03, when I would want an entry 2011-07-02 = 0. In other words, I want this

2011-06-28 1
2011-06-29 2
2011-06-30 3
2011-07-01 1
2011-07-02 0
2011-07-03 0
2011-07-04 3
2011-07-05 3

So, I need to insert date and a value of zero for each date with no activity. There’s an easy way to do this in R.

# load data
hdata <- read.csv("helpdesk log.csv",header=TRUE,sep=",")
hdata$date <- as.Date(hdata$Timestamp,format="%m/%d/%Y") # Timestamp is in mm/dd/YYYY format
daycount <- table(hdata$date) # count how many hits per day
# daycount is what I call the "short" vector. It misses some days.
##################################
# daily response wrong graph
oldmar <- c(5,4,4,2) + 0.1 # store default pars
par(mar = c(7, 4, 4, 2) + 0.1) # setup bigger plotting window
png(file="daily-wrong.png")
plot(daycount,xaxt="no",type="l",main="Daily number of Interactions",ylab = "Number of Interactions",xlab="")
axis(1,at=seq(1,44,by=6),labels=FALSE)
labs <- names(daycount)
text(seq(1,44,by=6), par("usr")[3] - 0.5, srt = 45, adj = 1, # text() allows to print axis labels rotated (srt=45)
labels = labs[seq(1,44,by=6)], xpd = TRUE)
mtext("Date",side=1,line = 5)
grid()
dev.off()
par(mar = oldmar)
##################################
# generate vector of all dates
alldays <- seq(hdata$date[1],length=62,by="+1 day") # this vector has all days
# notice how R recognizes the date format and chooses the appropriate method for the seq() function.
# You don't have to worry about 30 days in June, 31 in July etc. R does it all for you.
allcount <- table(alldays) # create table object from alldays.
actindex <- match(names(allcount),names(daycount),nomatch = 0)
# create "active" index: vector of length(allcount), i.e. all days.
# on days with no activity (i.e. a missing day in daycount), this has value 0 (nomatch = 0).
# For days with activity, actindex holds the index of the matching position in daycount.
# function to get entries of daycount corresponding to actindex
# indexing is a bit tricky. i loops over all days. get correct date by
# substracting all "zero-activity" days accumulated so far.
days <- function(actindex,daycount){
n <- length(actindex)
x <- rep(NA,times=n)
zero <- 0
for (i in 1:n){
if (actindex[i]==0) {
zero <- zero +1
x[i] <- 0
} else {
x[i] <- daycount[i-zero]
}
}
return(x)
}
alldaycount <- array(days(actindex,daycount)) # construct vector with number of hits per day
names(alldaycount) <- names(allcount) # name entries by consecutive dates.
##################################
# daily response: correct graph
png(file="dailyrespons.png") # save next graph as .png
par(mar = c(7, 4, 4, 2) + 0.1)
plot(alldaycount,axes=FALSE,type="l",main="Daily number of Interactions",ylab = "Number of Interactions",xlab="")
axis(1,at=seq(1,62,by=5),labels=FALSE)
axis(2,at=c(0,1,2,3,4,5,8))
labs <- names(alldaycount)[seq(1,62,by=5)]
text(seq(1,62,by=5), par("usr")[3] - 0.5, srt = 45, adj = 1,
labels = labs, xpd = TRUE)
mtext("Date",side=1,line = 5)
grid()
dev.off()
par(mar = oldmar)
##################################
view raw hits.R hosted with ❤ by GitHub

so actind is an index vector. the first seven entries are

2011-06-28 1
2011-06-29 2
2011-06-30 3
2011-07-01 4
2011-07-02 0
2011-07-03 0
2011-07-04 5

where each row corresponds to a consecutive date, zero means no activity on that date, and a positive number is the INDEX of the element in “daycount” (the short vector) corresponding to that date.

The correct graph is this one:






To leave a comment for the author, please follow the link and comment on their blog: plausibel.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)