Hierarchical Clustering for Location based Strategy using R for E-Commerce
Hi folks! This is my first blog post, and I am super excited to share how I used R to work on a location-based strategy at my e-commerce organization.
Please check out r-bloggers.com for more exciting stuff on R
A brief overview of the problem statement
I work for an e-commerce organization based out of India: an online travel platform for booking hotels and flights. This problem concerns the hotel department.
Each locality in a city behaves differently depending on its features. For example, the airport zone of a city behaves differently from a central zone near a famous historical site. Different areas therefore need separate strategies for monitoring and controlling parameters such as inventory, production and demand.
The data had a latitude and longitude for each hotel, and the task was to identify clusters of these hotels, which we call hyperlocations.
Let’s Get Started
This is how the data looks
![](https://rvsshubhamhome.files.wordpress.com/2019/06/hotellist-1.png?w=578)
Let’s look at it visually (I am using Power BI here)
![](https://rvsshubhamhome.files.wordpress.com/2019/06/delhi.png?w=578)
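Outlier Removal

In the image above, some hotels lie well outside the city and can cause problems when forming clusters. Let’s remove these outliers statistically: a hotel is flagged when its distance from the city's mean latitude/longitude exceeds the third quartile of those distances plus 1.5 times their interquartile range.

    library(geosphere)
    library(dplyr)  # for filter()

    # Mean latitude and longitude of all hotels in the city
    MeanLat <- mean(HotelsCity$latitude, na.rm = TRUE)
    MeanLon <- mean(HotelsCity$longitude, na.rm = TRUE)

    # Distance of all hotels from the mean lat/lon
    # Columns 4 and 5 are assumed to be latitude and longitude; [2:1] puts them in lon, lat order
    HotelsLatLon <- HotelsCity[, c(4, 5)]
    MeanLatLon <- data.frame(MeanLat, MeanLon)
    Distance_Mat <- distm(HotelsLatLon[2:1], MeanLatLon[2:1], fun = distHaversine)
    Distance_Mat <- as.data.frame(Distance_Mat)

    # Cutoff distance for outlier removal: Q3 + 1.5 * IQR
    DistIQR <- IQR(Distance_Mat$V1, na.rm = TRUE)
    Cutoff <- as.numeric(quantile(Distance_Mat$V1, 0.75, na.rm = TRUE) + DistIQR * 1.5)

    # Assumption: HotelDetail is the hotel list with the distance column V1 attached
    HotelDetail <- cbind(HotelsCity, Distance_Mat)
    HotelDetail$Flag <- ifelse(HotelDetail$V1 > Cutoff, "Incorrect", "Correct")
    Outliers_Final <- filter(HotelDetail, Flag == "Incorrect")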
![](https://rvsshubhamhome.files.wordpress.com/2019/06/delhioutliers.png?w=578)
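To make the outlier rule concrete, here is a minimal sketch on made-up numbers. The vector `dist_from_centre` and its values are purely illustrative (they are not from my data); the rule itself is the same upper fence of third quartile plus 1.5 times the IQR.

```r
# Toy distances (in metres) of hotels from the city centroid; values are made up
dist_from_centre <- c(1200, 2500, 3100, 4000, 4800, 5200, 6100, 7000, 48000)

# Upper fence: third quartile plus 1.5 times the interquartile range
q3 <- quantile(dist_from_centre, 0.75, names = FALSE)
iqr_dist <- IQR(dist_from_centre)
cutoff <- q3 + 1.5 * iqr_dist

# Flag anything beyond the fence as an outlier
flag <- ifelse(dist_from_centre > cutoff, "Incorrect", "Correct")
print(data.frame(dist_from_centre, flag))
```

In this toy case only the hotel 48 km from the centre falls beyond the fence; the rest stay in.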
Clustering
With the outliers removed, let’s create a distance matrix, i.e. the distance of each hotel from every other hotel. I am doing this with the geosphere library in R.

    # Distance matrix for the city (pairwise haversine distances)
    Distance_Mat <- distm(HotelsLatLon[2:1], HotelsLatLon[2:1], fun = distHaversine)
    Distance_Mat <- as.data.frame(Distance_Mat)
    Distance_Mat[is.na(Distance_Mat)] <- 0
    DMat <- as.dist(Distance_Mat)
![](https://rvsshubhamhome.files.wordpress.com/2019/06/dmat.png?w=578)
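If you want to see what goes into such a matrix without installing anything, the sketch below builds one for three made-up points using only base R. The helper `haversine_m` is a hypothetical hand-written stand-in for the haversine calculation (it is not the geosphere function), and the Earth radius of 6378137 m is an assumption chosen to match what I understand to be geosphere's default.

```r
# Hypothetical helper: great-circle (haversine) distance in metres
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Three made-up hotels; A and B differ by 0.01 degrees of latitude
hotels <- data.frame(
  name = c("A", "B", "C"),
  latitude = c(28.6139, 28.6239, 28.5562),
  longitude = c(77.2090, 77.2090, 77.1000)
)

# Pairwise distances: outer() evaluates the helper for every pair of rows
idx <- seq_len(nrow(hotels))
dist_mat <- outer(idx, idx, function(i, j) {
  haversine_m(hotels$longitude[i], hotels$latitude[i],
              hotels$longitude[j], hotels$latitude[j])
})
dimnames(dist_mat) <- list(hotels$name, hotels$name)

# Convert to the lower-triangle "dist" object that hclust() expects
d <- as.dist(dist_mat)
print(round(dist_mat))
```

The matrix is symmetric with zeros on the diagonal, which is why `as.dist` only needs to keep one triangle of it.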
Let’s Create Clusters now.
    # Hierarchical clustering with complete linkage
    hc <- hclust(DMat, method = "complete")
    # Cut the tree at the city's cutoff distance (AvgDist2)
    HotelCity_Valid$Clusters <- cutree(hc, h = AvgDist2)

In the snippet above, the h argument of cutree is a cutoff distance that I set differently for each city. How I arrived at that distance is a different science altogether; in this case it is around 2 km. Because the tree is built with complete linkage, cutting at this height means no two hotels in the same cluster are more than about 2 km apart, so each cluster has a diameter of at most roughly 2 km.
This is how the clusters look when plotted
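Here is a small sketch of how the cut height relates to cluster size, using base R only. The six points on a line and the 2000 m cut are made up for illustration; the real data uses haversine distances between hotels.

```r
# Toy points on a line (positions in metres); values are made up
pos <- c(a = 0, b = 500, c = 1500, d = 5000, e = 5800, f = 12000)
d <- dist(pos)  # pairwise absolute differences

# Complete linkage, then cut the tree at a height of 2000 m
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, h = 2000)

# Largest pairwise distance inside each cluster (0 for single-point clusters)
diameters <- tapply(seq_along(pos), clusters, function(i) {
  if (length(i) < 2) 0 else max(dist(pos[i]))
})

print(clusters)
print(diameters)
```

With complete linkage, two groups merge at the largest distance between any of their members, so every cluster that survives the cut has a diameter no larger than the cut height.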
![](https://rvsshubhamhome.files.wordpress.com/2019/06/clusters.png?w=578)
How did I name these localities? Each hotel’s locality had a system name tagged to it, and I used the most frequent name within a cluster as the cluster name.
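The naming step can be sketched roughly as follows. The data frame, its column names (`cluster`, `locality`) and the locality values are hypothetical placeholders, and `most_frequent` is a helper written just for this example.

```r
# Made-up hotels with a cluster id and a system-tagged locality name
hotels <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 3),
  locality = c("Aerocity", "Aerocity", "Mahipalpur",
               "Connaught Place", "Connaught Place", "Karol Bagh"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value in a character vector
# (on a tie, the alphabetically first name wins because table() sorts names)
most_frequent <- function(x) names(which.max(table(x)))

# One name per cluster, then map it back onto every hotel
cluster_names <- vapply(split(hotels$locality, hotels$cluster),
                        most_frequent, character(1))
hotels$cluster_name <- unname(cluster_names[as.character(hotels$cluster)])

print(hotels)
```

How ties are broken is a design choice; the alphabetical fallback here is only a side effect of `table()` and may not be what you want in practice.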
Please reach out to me at [email protected] for any kind of queries regarding this.