Expedia Data Analysis Part 1

datacademy

6 years ago

[This article was first published on R – datacademy, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Expedia Hotel Recommendations

This dataset can be found at Kaggle. We are given logs of visitors at different Expedia sites and are asked to predict the hotel clusters in the test set. Expedia aims to use customer data to improve their hotel recommendations. In this blog, I will analyze this dataset and try to get some insight of it.

## 

Read 37670293 rows and 24 (of 24) columns from 3.791 GB file in 00:02:37

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  25981347 1387.6   36236006 1935.3  25989623 1388.0
## Vcells 646504616 4932.5 1095841779 8360.7 651377761 4969.7

## Classes 'data.table' and 'data.frame':   37670293 obs. of  24 variables:
##  $ date_time                : chr  "2014-08-11 07:46:59" "2014-08-11 08:22:12" "2014-08-11 08:24:33" "2014-08-09 18:05:16" ...
##  $ site_name                : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ posa_continent           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ user_location_country    : int  66 66 66 66 66 66 66 66 66 66 ...
##  $ user_location_region     : int  348 348 348 442 442 442 189 189 189 189 ...
##  $ user_location_city       : int  48862 48862 48862 35390 35390 35390 10067 10067 10067 10067 ...
##  $ orig_destination_distance: num  2234 2234 2234 913 914 ...
##  $ user_id                  : int  12 12 12 93 93 93 501 501 501 501 ...
##  $ is_mobile                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_package               : int  1 1 0 0 0 0 0 1 0 0 ...
##  $ channel                  : int  9 9 9 3 3 3 2 2 2 2 ...
##  $ srch_ci                  : chr  "2014-08-27" "2014-08-29" "2014-08-29" "2014-11-23" ...
##  $ srch_co                  : chr  "2014-08-31" "2014-09-02" "2014-09-02" "2014-11-28" ...
##  $ srch_adults_cnt          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ srch_children_cnt        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ srch_rm_cnt              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ srch_destination_id      : int  8250 8250 8250 14984 14984 14984 8267 8267 8267 8267 ...
##  $ srch_destination_type_id : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ is_booking               : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ cnt                      : int  3 1 1 1 1 1 2 1 1 1 ...
##  $ hotel_continent          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ hotel_country            : int  50 50 50 50 50 50 50 50 50 50 ...
##  $ hotel_market             : int  628 628 628 1457 1457 1457 675 675 675 675 ...
##  $ hotel_cluster            : int  1 1 1 80 21 92 41 41 69 70 ...
##  - attr(*, ".internal.selfref")=

##   date_time           site_name      posa_continent user_location_country
##  Length:37670293    Min.   : 2.000   Min.   :0.00   Min.   :  0.00       
##  Class :character   1st Qu.: 2.000   1st Qu.:3.00   1st Qu.: 66.00       
##  Mode  :character   Median : 2.000   Median :3.00   Median : 66.00       
##                     Mean   : 9.795   Mean   :2.68   Mean   : 86.11       
##                     3rd Qu.:14.000   3rd Qu.:3.00   3rd Qu.: 70.00       
##                     Max.   :53.000   Max.   :4.00   Max.   :239.00       
##                                                                          
##  user_location_region user_location_city orig_destination_distance
##  Min.   :   0.0       Min.   :    0      Min.   :    0            
##  1st Qu.: 174.0       1st Qu.:13009      1st Qu.:  313            
##  Median : 314.0       Median :27655      Median : 1140            
##  Mean   : 308.4       Mean   :27753      Mean   : 1970            
##  3rd Qu.: 385.0       3rd Qu.:42413      3rd Qu.: 2553            
##  Max.   :1027.0       Max.   :56508      Max.   :12408            
##                                          NA's   :13525001         
##     user_id          is_mobile        is_package        channel      
##  Min.   :      0   Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.: 298910   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 2.000  
##  Median : 603914   Median :0.0000   Median :0.0000   Median : 9.000  
##  Mean   : 604452   Mean   :0.1349   Mean   :0.2489   Mean   : 5.871  
##  3rd Qu.: 910168   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.: 9.000  
##  Max.   :1198785   Max.   :1.0000   Max.   :1.0000   Max.   :10.000  
##                                                                      
##    srch_ci            srch_co          srch_adults_cnt srch_children_cnt
##  Length:37670293    Length:37670293    Min.   :0.000   Min.   :0.0000   
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:0.0000   
##  Mode  :character   Mode  :character   Median :2.000   Median :0.0000   
##                                        Mean   :2.024   Mean   :0.3321   
##                                        3rd Qu.:2.000   3rd Qu.:0.0000   
##                                        Max.   :9.000   Max.   :9.0000   
##                                                                         
##   srch_rm_cnt    srch_destination_id srch_destination_type_id
##  Min.   :0.000   Min.   :    0       Min.   :0.000           
##  1st Qu.:1.000   1st Qu.: 8267       1st Qu.:1.000           
##  Median :1.000   Median : 9147       Median :1.000           
##  Mean   :1.113   Mean   :14441       Mean   :2.582           
##  3rd Qu.:1.000   3rd Qu.:18790       3rd Qu.:5.000           
##  Max.   :8.000   Max.   :65107       Max.   :9.000           
##                                                              
##    is_booking           cnt          hotel_continent hotel_country  
##  Min.   :0.00000   Min.   :  1.000   Min.   :0.000   Min.   :  0.0  
##  1st Qu.:0.00000   1st Qu.:  1.000   1st Qu.:2.000   1st Qu.: 50.0  
##  Median :0.00000   Median :  1.000   Median :2.000   Median : 50.0  
##  Mean   :0.07966   Mean   :  1.483   Mean   :3.156   Mean   : 81.3  
##  3rd Qu.:0.00000   3rd Qu.:  2.000   3rd Qu.:4.000   3rd Qu.:106.0  
##  Max.   :2.00000   Max.   :269.000   Max.   :6.000   Max.   :212.0  
##                                                                     
##   hotel_market    hotel_cluster  
##  Min.   :   0.0   Min.   : 0.00  
##  1st Qu.: 160.0   1st Qu.:25.00  
##  Median : 593.0   Median :49.00  
##  Mean   : 600.5   Mean   :49.81  
##  3rd Qu.: 701.0   3rd Qu.:73.00  
##  Max.   :2117.0   Max.   :99.00  
##

The is_booking and cnt variables are not present in the test
set. And as expected the hotel_cluster variable is also not present
in the test set, as this is the variable to be predicted.

The test set has an ID column, which is not present in the train set.
The ID will be the first column in the submission file, and the
predicted hotel_cluster will be the second. The sample submission file
already has the ID column, thus the ID of test file is not imported.

Let’s convert the types of some columns. is_mobile, is_package,
channel, posa_continent, hotel_continent are categorical variables and
will be converted by using the as.factor function. There are some
other categorical varaibles such as hotel_cluster, hotel_country,
user_location, etc. but these will be left as integer values, since
they have too manyentries, which makes it harder to deal with them when
plotting.

The date columns are imported as characters, and will be converted to
POSIXct date format. The date_time column is the time a search was
made on the website by a user. The year, month, day of week, and day of
month, and hour information will be extracted from the date_time
columns.

I am also making use of garbage collector gc() functions, since I have
realized that when working with big data, there is a lot of leftovers in
the memory which should be cleaned.

trn[, `:=`(is_mobile = as.factor(is_mobile),
               is_package = as.factor(is_package),
               channel = as.factor(channel),
               posa_continent = as.factor(posa_continent),
               hotel_continent = as.factor(hotel_continent))]
gc()


trn[, date_time := parse_date_time(date_time, "%y-%m-%d %H:%M:%S")]
class(trn[,date_time])

## [1] 'POSIXct' 'POSIXt'


trn[, `:=`(date_year = as.factor(year(date_time)), 
               date_month = as.factor(month(date_time)), 
               date_day = as.factor(day(date_time)), 
               date_wday = as.factor(wday(date_time, label = T)), 
               date_hour = as.factor(hour(date_time)) )]
gc()

The train dataset has 37,670,293 rows, and 24 columns. This is huge! Now
that we have imported the data and made some conversions, let's start
our analysis.

Exploring the proportion of sites visited reveals that site_name 2 is
the most frequently visited site, which dominates all the others. So, it
has been taken out in the second graph, in order to see other
proportions better.

# barplot site_name
d = trn[, .N, by = site_name][, j = .(site_name, Prop = N/ sum(N))]
g = ggplot(d, aes(x = site_name, y = Prop))
g1 = g + geom_bar(aes(fill = factor(site_name)), stat = 'identity' ) + scale_fill_discrete(name = 'site_name')
ggplotly(g1)

# Site name 2 dominates all the others
d = d[i = site_name != 2]
g = ggplot(d, aes(x = site_name, y = Prop))
g1 = g + geom_bar(aes(fill = factor(site_name)), stat = 'identity') + scale_fill_discrete(name = 'site_name')
ggplotly(g1)

Hotel Cluster – Mobile – Package Relationship

The following graphs are to gain some insight into hotel cluster –
mobile relationship. The majority of customers are not mobile. I use
tile graph where each point is depicted as a rectangle and colored based
on the intensity of the 'fill' value. In this case this is the number of
points falling into each category.

d = trn[,j = .N, by = .(hotel_cluster, is_mobile)]
g = ggplot(d, aes(x = is_mobile, y = hotel_cluster)) + geom_raster(aes(fill = N))
ggplotly(g)

Are there any hotel clusters in which mobile is more prevalent? Instead
of total number at each hotel cluster, it makes more sense to check the
proportions. The proportions are calculated over each hotel_cluster.
Thus, the probabilities in is_mobile0 and is_mobile1 sum up to 1 for
each hotel_cluster.

d = trn[,j = .N, by = .(hotel_cluster, is_mobile)][, Prop := N/ sum(N), by = .(hotel_cluster)]
g = ggplot(d, aes(x = is_mobile, y = hotel_cluster)) + geom_raster(aes(fill = Prop))
ggplotly(g)

All the hotel_clusters have more or less the same proportion among
is_mobile 0 and 1, with just a few having a larger proportion than
average. Although I like tile graphs, bar graph could be a better
visualization in making a comparison. Let's do that.

d = trn[,j = .N, by = .(hotel_cluster, is_mobile)][, Prop := N/ sum(N), by = .(hotel_cluster)]
g = ggplot(d, aes(x = hotel_cluster, y = Prop)) + geom_bar(aes(fill = Prop), stat = 'identity') + facet_grid(is_mobile~., labeller = label_both)
ggplotly(g)

OK, we see the same here. So no differnce between being mobile or not
and hotel_clusters.

Now, let's check the relation between is_package and hotel_cluster.
Here I have preferred to use the bar plot rather than the tile graph.

d = trn[,j = .N, by = .(hotel_cluster, is_package)]
g = ggplot(d, aes(x = hotel_cluster, y = N)) + geom_bar(aes(fill=factor(hotel_cluster)), stat = 'identity') + facet_grid(is_package~., labeller = label_both)
ggplotly(g)

The above graph tells us that some hotel clusters have more visits. But
in order to see if any hotel cluster is more related to being searched
within a package, we will need to plot the proportions within each
hotel_cluster. In my opinion, some hotel clusters can be related to
being searched within a package deal.

d = trn[,j = .N, by = .(hotel_cluster, is_package)][, Prop := N/ sum(N), by = .(hotel_cluster)]
g = ggplot(d, aes(x = hotel_cluster, y = Prop)) + geom_bar(aes(fill=factor(hotel_cluster)) , stat = 'identity') + facet_grid(is_package~., labeller = label_both) + theme_dark()
ggplotly(g)

Indeed, some clusters such as 52, 65, 66, and 87 are more related to
being in a package.

Below I am trying to find a relationship between hotel_cluster, is_mobile, and is_package.

First, when is_mobile is 1, does is_package change or stay the same? As, can be seen, it stays the same, albeit slightly higher proportions of is_package 1 and is_mobile1. Second, when is_package has a large value for a hotel_cluster, does it behave similarly in is_mobile? As can be seen from second graph, is_mobile behaves the same in both of its categories, suggesting indifference to is_package.

d = trn[,j = .N, by = .(hotel_cluster, is_mobile, is_package)][, Prop := N/ sum(N), by = .(hotel_cluster, is_package)]
g = ggplot(d, aes(x = hotel_cluster, y = Prop, fill = Prop))+ geom_bar(stat = 'identity')+facet_grid(is_mobile~is_package, labeller = label_both)
ggplotly(g)

d = trn[,j = .N, by = .(hotel_cluster, is_mobile, is_package)][, Prop := N/ sum(N), by = .(hotel_cluster, is_mobile)]
g = ggplot(d, aes(x = hotel_cluster, y = Prop, fill = Prop))+ geom_bar(stat = 'identity')+facet_grid(is_mobile~is_package, labeller = label_both)
ggplotly(g)

Channel of Marketing

Let's investigate the behavior of marketing channel.

levels(trn$channel)

    ##  "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"


d = trn[, j = .N, by = .(channel)][, Prop:=N/sum(N)]
g = ggplot(d, aes(x=channel, y= Prop, fill = channel)) + geom_bar(stat = 'identity')
ggplotly(g)

The proportion of channel 9 dominates all the others. Channel 10 is the
least used channel. It makes me wonder whether there is any relation
between channel and being mobile?

d = trn[, j = .N, by = .(is_mobile, channel)][, Prop:=N/sum(N), by = channel]
g = ggplot(d, aes(x = is_mobile, y = channel, fill = Prop)) + geom_raster() + scale_fill_gradient2(name="N", low="blue", mid = 'white' , high="red")
ggplotly(g)

Channel 0, 1 and 2 have larger proportions than average in mobile users,
suggesting that these marketing channels might be a little bit more
geared towards mobile users.

Channel 10, and 6 are interesting as their proportion was very low which
requires further investigation. It might be that Channel 10 and 6 were
unsuccessful and were abandoned after a while. Let's see.

Let's plot months over the two years for channel 9, 6, and 10. For this
I will use the new data extracted from the date field.

d = trn[i = channel == 9, .N, by = .(date_year, date_month)][order(date_year, date_month)]
g = ggplot(d,aes(x= date_month, y = N))+ geom_bar(aes(fill = date_month), stat = 'identity') + facet_grid(date_year~.) + theme(legend.position="none") + ggtitle("Channel 9")
ggplotly(g)

d = trn[i = channel == 6, .N, by = .(date_year, date_month)][order(date_year, date_month)]
g = ggplot(d,aes(x= date_month, y = N, fill = date_month))+ geom_bar(stat = 'identity') + facet_grid(date_year~.) + theme(legend.position="none") + ggtitle("Channel 6")
ggplotly(g)

d = trn[i = channel == 10, .N, by = .(date_year, date_month)][order(date_year, date_month)]
g = ggplot(d,aes(x= date_month, y = N, fill = date_month))+ geom_bar(stat = 'identity') + facet_grid(date_year~.) + theme(legend.position="none") + ggtitle("Channel 10")
ggplotly(g)

As can be seen from the tables, I first checked channel 9, which is
distributed over the months in 2013, and 2014. The amount increased by
more than 100% after the first half of 2014.

Channel 6 seems to be a new marketing channel which is seeing some
growth over time. In fact, it has almost quadrupled in the second half
of 2014, compared to January of 2014.

Channel 10 as suspected was tried in 2013 and 2014 but it seems to be a
failure. The numbers are so low, and it seems abandoned after June of
2014. It might also be a very niche market, tried for a little while.

I also have checked all the other marketing channels,and they all behave
almost the same as channel 9. This makes me wonder that ther are amny
more datapoints from 2014, than 2013.

trn[, .N, by = date_year]

##    date_year        N
## 1:      2014 26483412
## 2:      2013 11186881

Indeed, as can be seen there about 11 million points from 2013, and
almost 26.5 million from 2014. This change might be due to the fact the
internet traffic to Expedia is rising by large amounts, and/or simply
more points were sampled from 2014.

Anyway, this is all for now. Hope you gained some insight. I will
analyze the other variables in my second post on this topic. I will also
look further into the trends of different variables. I am curious to see
whether mobile traffic increased from 2013 to 2014.

Please comment below. I am always open to suggestions, and best
practices. Thank you and see you soon!

To leave a comment for the author, please follow the link and comment on their blog: R – datacademy.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.