Estimating arrival times of people in a shop using R

[This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Since the first time I was taught the Poisson and Exponential distributions I have been wondering how to apply them to model a real world scenario such as arrivals of people in a shop. Last week I was in a big shopping center and I thought that was the perfect time to gather some data so I chose a shop (randomly) and started recording on a piece of paper how many people arrived and their inter-arrival time.

I went through this process for an hour, in the middle of the afternoon (around 4 to 5pm) during a week day where the influx of customers should be approximately uniform, or so I am told. Here you can download in .txt format, the data I gathered. Note that even though the file is a text file, it is arranged as a .csv so that R can easily detect data.

What is an inter-arrival time? It is the interval of time between each arrival. Assuming that the arrivals are independent, their distribution is exponential. This assumption is further confirmed below by the blue histogram of the observations.

Summary of gathered data:
– I observed the arrival of customer for 58.73 minutes (around an hour)
– During that time 74 “shopping groups” entered the shop. “A shopping group” here is a broad label for a group of 1, 2, 3 or 4 actual customers. The red histograms describes the phenomena in greater detail.

 Rplot

As you can see it seems reasonable to assume that inter-arrival times are exponentially distributed

Rplot01

The red histogram above clearly shows that the majority of “shopping groups” is composed by either a single customer or a couple of customers. Groups of 3 or 4 people are less likely to appear according to the observed data. A good estimate of the probability of the number of people in a shopping group is the relative frequency of the observations.

The following code produces the two plots above

#-------------------------------------------------------------------------------
# Loading data
#-------------------------------------------------------------------------------
# Observed data
data <- read.csv('data.txt',header=T)
# Take a look of the data
View(data)
# Total minutes ~ 60 minutes (~ 1 hour of observations)
total_minutes <- sum(data$min)+sum(data$sec)/60+sum(data$cent)/100/60
print(total_minutes)
# Compute the inter-arrival times in minutes
interarrivals <- data$min + data$sec/60 + data$cent/100/60
# Range of the bins for the histogram
bins.range <- 0.3
# Intervals of the histograms
breaks.points1 <- seq(0,max(interarrivals)+1,bins.range)
# Histogram of observed data
x.lab <- 'Minutes'
y.lab <- 'Frequencies'
main.lab <- 'Interarrival times'
hist(interarrivals,breaks=breaks.points1,xlab=x.lab,ylab=y.lab,main=main.lab,
col='cyan',border='blue',density=30)
# Boxplot of observed data
boxplot(interarrivals,col='cyan',border='blue',horizontal=T,xlab='Minutes',
main='Interarrival times')
# Customers arrive alone or in groups according to the observed probabilities
# Probability list -> probability vector
break.points2 <- seq(0,max(data$pers),1)
hist(data$pers,xlab='No. of people',breaks=break.points2,
main='Number of people per arrival',col='red',border='red4',density=30)
# Calculating relative frequencies
p <- table(data$pers)/sum(data$pers)
p.vec <- c(p[[1]],p[[2]],p[[3]],p[[4]])
names(p.vec) <- c('p(1)','p(2)','p(3)','p(4)')
# Vector of observed number of people per arrival
n.of.people <- min(data$pers):max(data$pers)
print(p.vec)
view raw arrivalTimes1.R hosted with ❤ by GitHub

Now that we have uploaded the data, it is time to simulate some observations and compare them to the real one. Below in green is the simulated data.

Rplot03

Rplot02

The code used to simulate the data is here below

#-------------------------------------------------------------------------------
# Simulation
#-------------------------------------------------------------------------------
# Estimated parameters of the exponential distribution
x.rate <- length(data$num)
# Remember that mean = 1/x.rate
# meaning that, on average, we expect a new arrival every 1/74 of an hour.
# (1/74 =~ 0.01355)
# The random number generated using rexp() will be fractions of an hour.
lambda <- x.rate
# Number of simulated samples
n <- 74
# Set.seed()
set.seed(300)
# Let's generate random samples
# The random number generated using rexp() will be fractions of an hour.
simulated <- rexp(n,rate=lambda)
# We expect mean(simulated) =~ 0.01355* of an hour
avg <- mean(simulated)
print(paste('Average inter-arrival time (fraction of an hour)',avg)) # *
print(paste('Average inter-arrival time (minutes)',avg*60))
print(paste('Average inter-arrival time (seconds)',avg*60*60))
# Let's convert the simulated interarrival times into minutes
simulated.min <- simulated * 60
# Let's plot the histogram
breaks.points <- seq(0,max(simulated.min)+1,bins.range)
hist(simulated.min,col='chartreuse',breaks=breaks.points,border='seagreen',density=30)
boxplot(simulated.min,col='chartreuse',border='seagreen',horizontal=T,xlab='Minutes',
main='Simulated interarrival times')
# Let's simulate the number of people per arrival
people.per.arr <- sample(x=n.of.people,size=n,replace=TRUE,prob=p.vec)
hist(people.per.arr,xlab='No. of people',breaks=break.points2,
main='Simulated number of people per arrival',col='chartreuse',
border='seagreen',density=30,angle=30)
view raw arrivalTimes2.R hosted with ❤ by GitHub

Now, the graphs above are pretty, but they are not much use, we need to compare the two sets of graphs to get an idea if our model is fine for this simple task.

Rplot04 Rplot06

Rplot07

It looks like our model tends to overestimate the number of arrivals in the 0-0.5 minutes section, however this is just one simulation and the average arrival time is close to what we expected using this model. It seems reasonable to use this distribution to estimate the inter-arrival times and the number of people arrived in the shop.

The R-code

#-------------------------------------------------------------------------------
# Compare simulated to obseved data
#-------------------------------------------------------------------------------
# Let's compare observed data with simulated data
# Comparative boxplot
boxplot(interarrivals,simulated.min,xlab='Minutes',col=c('cyan','chartreuse'),
border=c('blue','seagreen'),names=c('Observed','Simulated'),
main='Interarrival times',notch=TRUE,horizontal=TRUE)
# Comparative histograms
hist(simulated.min,col='chartreuse',breaks=breaks.points1,xlab=x.lab,ylab=y.lab,
main=main.lab,border='seagreen',density=20,angle=30)
hist(interarrivals,col='cyan',breaks=breaks.points1,border='blue',
density=20,angle=160,add=TRUE)
# Add a legend
legend("topright",c("Simulated","Observed"), col=c("seagreen", "blue"), lwd=10)
# Comparative histograms of number of people per arrival
hist(people.per.arr,xlab='No. of people',breaks=break.points2,
main='Simulated number of people per arrival',col='chartreuse',
border='seagreen',density=30,angle=30)
hist(data$pers,breaks=break.points2,col='red',border='red4',density=30,
angle=160,add=TRUE)
# Add a legend
legend("topright",c("Simulated","Observed"), col=c("seagreen", "red"), lwd=10)
view raw arrivalTimes3.R hosted with ❤ by GitHub

Finally, let’s have a look at the ideal exponential distribution given the estimated parameters. Note that some adjustments are necessary because our model is set so that the simulated times are hours, but for practical purposes it is better to work with minutes. Furthermore, by using the exponential distribution formulas we can calculate some probabilities.

Rplot08

#-------------------------------------------------------------------------------
# Probability distributions
#-------------------------------------------------------------------------------
# Density plot
#Since the rate is given as person/hour we need to set time in hours
#Initial time Final time Step
#0 hour 0.15 of an hour 0.001 of an hour
#0 minutes 9 minutes 0.06 minutes = 3.6 seconds
# Time vector
time <- seq(0,0.15,0.001)
# 1 row, 2 columns
par(mfrow=c(1,2))
# Probability distribution plot
plot(time*60,dexp(time,rate=lambda),type='l',xlab='Minutes',
ylab='Probability density',main='Probability distribution',col='blue',lwd=2)
# Cumulative distribution plot
plot(time*60,pexp(time,rate=lambda),type='l',xlab='Minutes',
ylab='Cumulative probability',main='Cumulative distribution',col='red',lwd=2)
#-------------------------------------------------------------------------------
# Useful functions
#-------------------------------------------------------------------------------
# T = waiting time before an arrival (Random variable)
l = lambda
# Note: since it is more practical, t must be entered in seconds,
# it will be converted into hours later
# Probability of waiting time of the arrival (T) being greater than t
# P(T > t)
p.right <- function(t)
{
t <- t/60/60
p <- exp(-l*t)
return(p)
}
# Probability of waiting time of the arrival (T) being less than t
# P (T < t) cumulative dist
p.left <- function(t)
{
t <- t/60/60
p <- 1 - exp(-l*t)
return(p)
}
print('Probability of waiting time before an arrival being greater than 3 minutes')
p.right(3*60)
print('Probability of waiting time before an arrival being less than 3 minutes')
p.left(3*60)
view raw arrivalTimes4.R hosted with ❤ by GitHub

By running the code we find out that the probability of the waiting time being less than 3 minutes is 0.975 meaning that it is very likely that one customer group will show up in less than three minutes.

Conclusions.

This model is very basic and simple, both to fit and to run, therefore is very fast to implement. It can be useful to optimize the number of personnel in the store and their function, according to the expected inflow of customers and the average time needed to help the current customers. The hypothesis seem reasonable, given the time of the day considered and provided that the shop does not make some promotion or discount or any other event that could suggest a different behaviour of customers (for instance when there is a discount, it is likely that customers might tell each other to visit the store). For different times of the day, different parameters could be estimated by gathering more data and fitting different exponential distributions.

To leave a comment for the author, please follow the link and comment on their blog: The Beginner Programmer.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)