DEP="Paris" ARR="Montreal" DATE1D=rep(c(1:30,1:31,1:30,1:31,1:31,1:30,1:31,1:30, 1:31,1:31,1:29),3) DATE1M=rep(c(rep(4,30),rep(5,31),rep(6,30),rep(7,31), rep(8,31),rep(9,30),rep(10,31),rep(11,30),rep(12,31), rep(1,31),rep(2,29)),3) DATE1Y=rep(c(rep(2011,30+31+30+31+31+30+31+ 30+31+31+28),rep(2012,31+29)),3) k=3 DATE3D=c((1+k):30,1:31,1:30,1:31,1:31,1:30,1:31, 1:30,1:31,1:31,1:29,1:k) DATE3M=c(rep(4,30-k),rep(5,31),rep(6,30),rep(7,31),rep(8,31), rep(9,30),rep(10,31),rep(11,30),rep(12,31),rep(1,31),rep(2,29), rep(3,k)) DATE3Y=c(rep(2011,30+31+30+31+31+30+31+30+31+ 31+28-k),rep(2012,31+29+k))
It is also possible (for a nice robot) to skip all prior dates,
skip=max(as.numeric(Sys.Date()-as.Date("2011-04-01")),1)  # days of the horizon already in the past (at least 1)
Then, we need a website where requests can be written nicely, with cities and dates appearing explicitly in the URL. Here, I cannot mention the website that I used, since it is stated there that running automated requests is strictly forbidden… Anyway, consider a loop that creates a URL address (actually, I chose the value of the date randomly, since I had been told that those websites have memory: if you ask too many times for the same thing during a short period of time, prices go up),
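Just to fix ideas, here is a minimal sketch of such a loop; the number of requests and the pause between them are my own additions, not from the original code,

for(i in 1:100){                        # number of requests: arbitrary choice
  s=sample(skip:length(DATE1D),1)       # random date index, skipping past dates
  Sys.sleep(runif(1,5,15))              # polite pause between requests (assumption)
  # ... then build and scan the URL, as below ...
}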
URL=paste("http://www.♦♦♦♦/dest.dll?qscr=fx&flag=q&city1=",DEP,
  "&citd1=",ARR,
  "&date1=",DATE1D[s],"/",DATE1M[s],"/",DATE1Y[s],
  "&date2=",DATE3D[s],"/",DATE3M[s],"/",DATE3Y[s],
  "&cADULT=1",sep="")
Then, we just have to scan the webpage, looking for ticket prices (simply searching for some specific names),
page=as.character(scan(URL,what="character"))
I=which(page%in%c("Price0","Price1","Price2"))       # tokens that precede a fare
if(length(I)>0){
  PRIX=substr(page[I+1],2,nchar(page[I+1]))          # drop the leading character (the currency symbol)
  if(PRIX[1]=="1"){PRIX=paste(PRIX,page[I+2],sep="")}  # fare above 1000: digits split over two tokens
  if(PRIX[1]=="2"){PRIX=paste(PRIX,page[I+2],sep="")}  # idem for fares above 2000
}
Here, we have to be a bit cautious when prices exceed 1000 (hence the two extra tests above). Then, it is possible to start a statistical study. For instance, if we compare two destinations (from Paris), e.g. Montréal and New York, we obtain the following patterns (with high prices during holidays),
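For such a study, the price returned by each request has to be stored somewhere; a minimal sketch of what that could look like (the results data frame, the gsub() cleanup and keeping the cheapest quote are my assumptions, they do not appear in the original code),

# before the loop: results=data.frame()
PRIX=as.numeric(gsub("[^0-9.]","",PRIX))   # keep digits only, in case of stray separators (assumption)
results=rbind(results,data.frame(
  dep=as.Date(paste(DATE1Y[s],DATE1M[s],DATE1D[s],sep="-")),
  ret=as.Date(paste(DATE3Y[s],DATE3M[s],DATE3D[s],sep="-")),
  price=min(PRIX)))                        # keep the cheapest quoted fare (assumption)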
It is also possible to run the code twice (here it was run last month, and then a couple of days ago) for the same destination (from Paris to Montréal),
Of course, it would be great if I could run that code, say, every week, to build up a nice dataset and to study the dynamics of prices…
The problem is that it is forbidden to do this. In fact, the website mentions that if we want to extract data (for an academic purpose), it is possible to ask for an extraction. But if we tell them that we are studying specific prices, the data might be biased. So the good idea would be to use several servers, to make several requests, randomly, and to collect them (changing dates and destinations). But here, my computing skills, unfortunately, reach their limit….