Hierarchical Clustering for Location based Strategy using R for E-Commerce
Hi folks! This is my first blog post, and I am super excited to share how I used R to work on a location-based strategy at my e-commerce organization.
Please check out r-bloggers.com for more exciting stuff on R
Just a little brief about the problem statement
I work for an India-based e-commerce organization, an online travel platform for booking hotels and flights. This problem concerns the hotel department.
Each locality in a city behaves differently depending on its features. For example, the Airport Zone of a city behaves differently from a Central Zone in the vicinity of a famous historical site. Separate strategies are therefore required for different areas to monitor and control parameters such as inventory, production and demand.
The data had the latitude and longitude of each hotel, and the task was to identify clusters of these hotels, or what we call hyperlocations.
Let’s Get Started
This is how the data looks
Let’s look at it Visually (I am using Power BI here)
Outlier Removal
In the image above we can see that certain hotels lie outside the city and can create problems while forming clusters. Let’s remove these outliers statistically.
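library(geosphere)
library(dplyr)  # needed for filter()

# Mean latitude and longitude of all hotels in the city
MeanLat <- mean(HotelsCity$latitude, na.rm = TRUE)
MeanLon <- mean(HotelsCity$longitude, na.rm = TRUE)

# Distance (in metres) of every hotel from the mean point.
# Columns 4 and 5 are taken to be latitude and longitude; [2:1] reorders
# them to (longitude, latitude), the order geosphere expects.
HotelsLatLon <- HotelsCity[, c(4, 5)]
MeanLatLon <- data.frame(MeanLat, MeanLon)
Distance_Mat <- distm(HotelsLatLon[2:1], MeanLatLon[2:1], fun = distHaversine)
Distance_Mat <- as.data.frame(Distance_Mat)  # single column named V1

# Cutoff distance for outlier removal: third quartile + 1.5 * IQR
DistIQR <- IQR(Distance_Mat$V1, na.rm = TRUE)
Cutoff <- as.numeric(quantile(Distance_Mat$V1, 0.75, na.rm = TRUE) + DistIQR * 1.5)

# Assumption: HotelDetail is the hotel data with the distance column V1 attached
HotelDetail <- cbind(HotelsCity, Distance_Mat)
HotelDetail$Flag <- ifelse(HotelDetail$V1 > Cutoff, "Incorrect", "Correct")
Outliers_Final <- filter(HotelDetail, Flag == "Incorrect")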
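To make the cutoff rule concrete, here is a minimal base-R sketch on made-up numbers. The distances below are invented for illustration and are not from the hotel data; only the rule itself (third quartile plus 1.5 times the IQR) comes from the code above.

```r
# Toy distances (in metres) of eight hypothetical hotels from the city's mean point
dist_from_centre <- c(1200, 1500, 1800, 2100, 2400, 2700, 3000, 25000)

# Upper fence: third quartile plus 1.5 times the interquartile range
q3 <- as.numeric(quantile(dist_from_centre, 0.75, na.rm = TRUE))
iqr_dist <- IQR(dist_from_centre, na.rm = TRUE)
cutoff <- q3 + 1.5 * iqr_dist

# Hotels beyond the fence are flagged, mirroring the "Incorrect"/"Correct" labels
flag <- ifelse(dist_from_centre > cutoff, "Incorrect", "Correct")
print(data.frame(dist_from_centre, flag))
```

In this toy case only the hotel 25 km away falls beyond the fence, which is the behaviour we want for hotels far outside the city.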
Clustering
After cleaning the data (outlier removal), let’s create a distance matrix, i.e. the distance of each hotel from every other hotel. I am doing this with the geosphere library in R.
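# Distance matrix for the city: distance (in metres) between every pair of hotels.
# HotelsLatLon is assumed here to hold only the hotels kept after outlier removal.
Distance_Mat <- distm(HotelsLatLon[2:1], HotelsLatLon[2:1], fun = distHaversine)
Distance_Mat <- as.data.frame(Distance_Mat)

# Replace missing distances with 0, then convert to a dist object for hclust()
Distance_Mat[is.na(Distance_Mat)] <- 0
DMat <- as.dist(Distance_Mat)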
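For readers without geosphere at hand, the sketch below shows roughly what such a distance matrix contains. The helper haversine_m is a hypothetical base-R stand-in written for this example (it is not part of geosphere), and the three hotel coordinates are made up.

```r
# Hypothetical helper: great-circle (Haversine) distance in metres between
# two points given as longitude/latitude in degrees. r is the Earth radius.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Three made-up hotel locations (coordinates are illustrative only)
hotels <- data.frame(
  name = c("A", "B", "C"),
  latitude = c(28.6139, 28.6200, 28.5562),
  longitude = c(77.2090, 77.2150, 77.1000)
)

# Pairwise distance matrix: entry [i, j] is the distance from hotel i to hotel j
n <- nrow(hotels)
dist_mat <- outer(seq_len(n), seq_len(n), function(i, j) {
  haversine_m(hotels$longitude[i], hotels$latitude[i],
              hotels$longitude[j], hotels$latitude[j])
})
dimnames(dist_mat) <- list(hotels$name, hotels$name)
print(round(dist_mat))

# as.dist() keeps only the lower triangle, the form hclust() expects
d <- as.dist(dist_mat)
```

The matrix is symmetric with zeros on the diagonal, so as.dist() loses no information by keeping one triangle.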
Let’s Create Clusters now.
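# Hierarchical clustering with complete linkage
hc <- hclust(DMat, method = "complete")

# Cut the tree at the city-specific cutoff distance AvgDist2,
# in metres (the unit distHaversine returns)
HotelCity_Valid$Clusters <- cutree(hc, h = AvgDist2)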
In the cutree call above, I use a different cutoff distance for each city. How I arrived at that distance is a different science altogether. In this case the cutoff is around 2 km. With complete linkage, this means no two hotels in the same cluster are more than about 2 km apart, so each cluster has a diameter of roughly 2 km at most.
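The link between the cutoff height and cluster diameter can be checked on a small example. This is only a sketch: the points are made up, they sit on a flat grid in metres, and plain Euclidean distances stand in for the Haversine distances used above.

```r
# Made-up hotel positions on a flat grid, in metres (illustrative only)
pts <- data.frame(
  x = c(0, 500, 1200, 5000, 5600, 6100, 12000),
  y = c(0, 300, 800, 5000, 5200, 4700, 100)
)
d <- dist(pts)  # Euclidean distances stand in for the Haversine distances

# Complete linkage, cut at a 2 km cutoff
hc <- hclust(d, method = "complete")
cutoff_m <- 2000
clusters <- cutree(hc, h = cutoff_m)

# Largest pairwise distance inside each cluster (its "diameter")
d_mat <- as.matrix(d)
diameters <- sapply(split(seq_along(clusters), clusters),
                    function(idx) max(d_mat[idx, idx]))
print(clusters)
print(diameters)
```

Every diameter printed stays at or below the cutoff. That guarantee is specific to complete linkage; other linkage methods, such as single or average, do not bound the cluster diameter in the same way.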
This is what these clusters look like when plotted
How did I name these localities? Each hotel’s locality had a system name tagged to it, and I used the most frequent name in each cluster as the cluster name.
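One possible way to pick the most frequent name per cluster is sketched below in base R. The data frame, the locality column and the most_frequent helper are all hypothetical and made up for this example; only the Clusters column name follows the code above.

```r
# Made-up example: cluster ids and the system locality name tagged to each hotel
hotels <- data.frame(
  Clusters = c(1, 1, 1, 2, 2, 2, 2),
  locality = c("Airport Zone", "Airport Zone", "Highway Road",
               "Old Town", "Fort Area", "Fort Area", "Fort Area"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value in x.
# table() sorts names alphabetically, so ties go to the first name in that order.
most_frequent <- function(x) names(which.max(table(x)))

# Most frequent locality name within each cluster
cluster_names <- tapply(hotels$locality, hotels$Clusters, most_frequent)

# Attach the chosen name to every hotel in the cluster
hotels$ClusterName <- as.vector(cluster_names[as.character(hotels$Clusters)])
print(hotels)
```

Ties are broken alphabetically here, which is an arbitrary choice; a real pipeline might prefer a manual review for clusters without a clear majority name.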
Please reach out to me at [email protected] for any kind of queries regarding this.