Last month I attended the CeBIT trade fair in Hannover. Besides the so-called “shareconomy”, there was another main topic across all exhibition halls – Big Data. This subject is not completely new, and I think many of you already have experience with some of the tools associated with Big Data. But given the great number of databases, frameworks and engines in this field, there will always be something new to learn. So two weeks ago I started my own experiments with a graph database called Neo4j, one of the NoSQL databases, a family often designed to distribute computation across clusters of machines in a fault-tolerant way. What attracted me was reading that it is well suited for highly connected data and offers a descriptive language for querying the graph database.

Roughly speaking, a graph database consists of nodes and edges connecting the nodes; both can be enriched with properties. Some introductions that helped me can be found here and here. The graph query language “Cypher” can then be used to query the data by traversing the graph. Cypher itself is a declarative pattern-matching language and should be easily understandable for anyone familiar with SQL. There is a well-arranged overview under this address.

If you look at my older posts, you will see that most of them are about spatial data, or at least data with some spatial dimension. This kind of data often has inherent relationships – for example, streets connected in a street network, regions sharing a border, or places visited by people. So for my first experiment with Neo4j, I decided to combine one of the most-discussed Big Data use cases – recommendation/recommender systems – with an attractive dataset about the location-based social network Foursquare that I collected last year.
The main idea behind this simple “Spatial Recommendation Engine” is to use publicly available check-in data to recommend users new types of places they have never visited before. Such a “check-in” consists of a user ID, a place (called a venue), a check-in time, plus additional information (venue type, ..). The following code shows the structure of the already preprocessed data:
options(width = 90)

# load required libraries
require(data.table)
require(reshape)
require(reshape2)
require(bitops)
require(RCurl)
require(RJSONIO)
require(plyr)

# load Foursquare data
fileName <- "DATA/Foursquare_Checkins_Cologne.csv"
dataset <- read.csv2(fileName, colClasses = c(rep("character", 7),
    rep("factor", 2), rep("character", 2)), dec = ",", encoding = "UTF-8")

# how the first rows look like
head(dataset)
##                 CHECKIN_ID        CHECKIN_DATE CHECKIN_TEXT   USERID
## 1 4ff0244de4b000351aa08c35 2012-07-01 11:19:57              15601925
## 2 50a66a8ee4b04d0625654fad 2012-11-16 16:32:14               7024136
## 3 50fbe6b6e4b03e4eab759beb 2013-01-20 12:44:38                193265
## 4 50647c22e4b011670f2a173e 2012-09-27 17:17:38              10795451
## 5 500fc5b9e4b0d630c79ab4f8 2012-07-25 11:08:57              13964243
## 6 50d09108e4b013668d5538f3 2012-12-18 15:51:36                126823
##                    VENUEID        VENUE_NAME   GKZ_ID              CATEGORY_ID
## 1 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 2 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 3 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 4 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 5 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 6 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
##        CATEGORY_NAME              LAT              LNG
## 1 Travel & Transport 50,9431986273333 6,95889741182327
## 2 Travel & Transport 50,9431986273333 6,95889741182327
## 3 Travel & Transport 50,9431986273333 6,95889741182327
## 4 Travel & Transport 50,9431986273333 6,95889741182327
## 5 Travel & Transport 50,9431986273333 6,95889741182327
## 6 Travel & Transport 50,9431986273333 6,95889741182327
The data was crawled last year as the basis for an academic paper in the field of Urban Computing (which will be presented in May at the AGILE Conference on Geographic Information Science in Brussels) and contains publicly available check-ins for Germany. It seems to me that this kind of data is ideally suited for doing recommendations in a graph database, and it avoids the use of well-known toy datasets.

The key idea behind our recommendation system is the following: starting with a person for whom we want to make a recommendation, we calculate the most similar users. A similar user is someone who rated venues in the same way as the person of interest. Because there is no explicit rating in the Foursquare data, we take the number of visits as the rating; the logic behind this is that the person either likes the place or it is important to them. So if both the person and a user give high “ratings” to venues they both visited (thus both are similar), then the person may also be interested in visiting other venues highly rated by that user which the person has not seen yet. Technically speaking, this approach is collaborative filtering (user similarity calculated from behavior), with implicit data collection (we have no explicit ratings).

Our data model is therefore straightforward: we take the venues and the users as nodes and transform all their related attributes into corresponding node properties. Then we connect a user node and a venue node with a relationship whenever the user has visited that venue; the number of visits is encoded as a property of the relationship. For the recommender system we use a combination of R and Cypher statements, the latter primarily for loading the data into Neo4j and traversing the graph. To send Cypher statements to Neo4j, the REST API is of great value. We can then use the great abilities of R to preprocess the data, catch the results and calculate the final recommendation list.
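The step from raw check-ins to implicit ratings can be illustrated in isolation. Here is a minimal, language-agnostic sketch in Python (the user and venue IDs are invented for demonstration; in the graph each counted pair becomes a RATED relationship with the count stored as its "stars" property):

```python
from collections import Counter

# hypothetical check-ins: one (user_id, venue_id) pair per check-in
checkins = [
    ("u1", "station"), ("u1", "station"), ("u1", "museum"),
    ("u2", "station"), ("u2", "pub"), ("u2", "pub"), ("u2", "pub"),
]

# implicit rating = number of visits per (user, venue) pair
ratings = Counter(checkins)

print(ratings[("u1", "station")])  # 2
print(ratings[("u2", "pub")])      # 3
```

The same aggregation is done over the real data with data.table below.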
The following is a short overview of all the steps:
- Extracting all relevant information (venues, users, ratings) from the check-in data
- Loading the data into Neo4j
- Calculating similarities for a specific user and making a recommendation on-the-fly
- Plotting the results on a map
I assume that Neo4j is installed (it’s very simple – look here) and that the graph database is empty. To empty it, delete the “graph.db” directory, then start Neo4j.
So our first step is to extract all venues, users and ratings from the check-in data.
# --------------------------------------
# data preprocessing
# --------------------------------------
dataset$CHECKIN_DATE <- as.POSIXct(dataset$CHECKIN_DATE, format = "%Y-%m-%d %H:%M:%S")
dataset$LAT <- sub(",", ".", dataset$LAT)
dataset$LNG <- sub(",", ".", dataset$LNG)
dataset$LAT <- as.numeric(dataset$LAT)
dataset$LNG <- as.numeric(dataset$LNG)
dataset$HOUR24 <- as.numeric(format(dataset$CHECKIN_DATE, "%H"))
venueDataset <- unique(dataset[, c("VENUEID", "LNG", "LAT", "VENUE_NAME", "CATEGORY_NAME")])

# use data.table for aggregation
datasetDT <- data.table(dataset)
venueUserDataset <- datasetDT[, list(COUNT_CHECKINS = length(unique(CHECKIN_ID))),
    by = list(VENUEID, USERID)]
venueUserDataset <- data.frame(venueUserDataset)

# now unique(venueUserDataset$USERID) contains all user IDs,
head(unique(venueUserDataset$USERID))
## [1] "15601925" "7024136"  "193265"   "10795451" "13964243" "126823"

# venueDataset contains all venues and
head(venueDataset)
##                     VENUEID   LNG   LAT                       VENUE_NAME
## 1  4aef5d85f964a520dfd721e3 6.959 50.94                Köln Hauptbahnhof
## 24 4bade052f964a520506f3be3 6.949 50.93             Stadtbibliothek Köln
## 25 4baf1998f964a52033eb3be3 6.964 50.93 Deutsches Sport & Olympia Museum
## 26 4baf428cf964a52024f43be3 6.962 50.92                     Ubierschänke
## 27 4ba4f032f964a520dac538e3 6.849 50.92                     OBI Baumarkt
## 28 4bc210d92a89ef3b7925f388 6.927 50.95                    Pfeiler Grill
##           CATEGORY_NAME
## 1    Travel & Transport
## 24 College & University
## 25 Arts & Entertainment
## 26       Nightlife Spot
## 27       Shop & Service
## 28                 Food

# venueUserDataset contains all the relationships (aka ratings)
head(venueUserDataset)
##                    VENUEID   USERID COUNT_CHECKINS
## 1 4aef5d85f964a520dfd721e3 15601925              5
## 2 4aef5d85f964a520dfd721e3  7024136              1
## 3 4aef5d85f964a520dfd721e3   193265              1
## 4 4aef5d85f964a520dfd721e3 10795451              6
## 5 4aef5d85f964a520dfd721e3 13964243              6
## 6 4aef5d85f964a520dfd721e3   126823             11
The next step is to import all that data into Neo4j. We do this by generating dynamic Cypher statements that create all the nodes and relationships, which will of course take some time. If you have more data, it may be wiser to use the “Batch Importer”, but this needs more development effort and will not be explained here. Neo4j’s website describes many ways to import data from various sources into the graph database. All of our Cypher statements are sent to Neo4j via the “query” function, which I got from here.
# Function for querying Neo4j from within R
# from http://stackoverflow.com/questions/11188918/use-neo4j-with-r
query <- function(querystring) {
    h = basicTextGatherer()
    curlPerform(url = "localhost:7474/db/data/ext/CypherPlugin/graphdb/execute_query",
        postfields = paste("query", curlEscape(querystring), sep = "="),
        writefunction = h$update, verbose = FALSE)
    result <- fromJSON(h$value())
    data <- data.frame(t(sapply(result$data, unlist)))
    names(data) <- result$columns
    return(data)
}

# --------------------------------------
# import all data into neo4j
# --------------------------------------
nrow(venueDataset)  # number of venues
## [1] 3352
length(unique(venueUserDataset$USERID))  # number of users
## [1] 3306
nrow(venueUserDataset)  # number of relationships
## [1] 11293

# venues (-> nodes)
for (i in 1:nrow(venueDataset)) {
    q <- paste("CREATE venue={name:\"", venueDataset[i, "VENUEID"],
        "\",txt:\"", venueDataset[i, "VENUE_NAME"],
        "\",categoryname:\"", venueDataset[i, "CATEGORY_NAME"],
        "\",type:\"venue\",\nlng:", venueDataset[i, "LNG"],
        ", lat:", venueDataset[i, "LAT"], "} RETURN venue;", sep = "")
    data <- query(q)
}

# users (-> nodes)
for (i in unique(venueUserDataset$USERID)) {
    q <- paste("CREATE user={name:\"", i, "\",type:\"user\"} RETURN user;", sep = "")
    data <- query(q)
}

# number of checkins (-> relationships)
for (i in 1:nrow(venueUserDataset)) {
    q <- paste("START user=node:node_auto_index(name=\"", venueUserDataset[i, "USERID"],
        "\"), venue=node:node_auto_index(name=\"", venueUserDataset[i, "VENUEID"],
        "\") CREATE user-[:RATED {stars : ", venueUserDataset[i, "COUNT_CHECKINS"],
        "}]->venue;", sep = "")
    data <- query(q)
}
Before we start with the recommender itself, I will discuss some of its details. The first part of the plan is to compute the similarities between a person and all other users who also visited at least one of the venues the person did. Based on these similarities we then determine the recommendations. This means we need a similarity measure first. In our case we use the cosine similarity, a measure typically used in text mining for high-dimensional data (which also fits our case). A maximum value of 1 means that both users rated all the venues they both visited in the same way (“the profiles of both are similar”). If you calculated the similarity in the traditional way, you would first have to build up a feature table of size \(m \times n\) (\(m\) ~ number of users, \(n\) ~ number of venues), where value \((i,j)\) represents the rating from user \(i\) for venue \(j\). This feature table would be huge and sparse, because most users only visit a few venues. A graph represents this efficiently, because only the ratings that actually exist have to be encoded as explicit relationships.
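Written out, for two users with rating vectors \(r_1\) and \(r_2\) over the venues both have visited, the cosine similarity is

\[
\text{cossim}(r_1, r_2) = \frac{\sum_k r_{1,k}\, r_{2,k}}{\sqrt{\sum_k r_{1,k}^2}\;\sqrt{\sum_k r_{2,k}^2}}
\]

which corresponds term by term to the dotprod, mag1 and mag2 expressions in the Cypher query below.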
After choosing a person for whom we want to compute recommendations, we start by calculating all of the relevant similarities. To get more meaningful recommendations, we exclude all venues of the type “Travel & Transport” and only take those users into account who have at least two visited venues in common with the chosen person. For this last part we have to use R, because, if I'm right, Neo4j is currently unable to carry out subselects.
# --------------------------------------
# simple venue recommendation
# --------------------------------------
userName <- "7347011"  # chosen username/ID
nOfRecommendations <- 20  # number of recommendations

# Determine similar users using the cosine similarity measure
q <- paste("START me=node:node_auto_index(name=\"", userName, "\")
    MATCH (me)-[r1]->(venue)<-[r2]-(simUser)
    WHERE venue.categoryname <> \"Travel & Transport\"
    RETURN id(me) as id1, id(simUser) as id2,
        sqrt(sum(r1.stars*r1.stars)) as mag1,
        sqrt(sum(r2.stars*r2.stars)) as mag2,
        sum(r1.stars * r2.stars) as dotprod,
        sum(r1.stars * r2.stars) /
            (sqrt(sum(r1.stars*r1.stars)) * sqrt(sum(r2.stars*r2.stars))) as cossim,
        count(venue) as anz_venues
    ORDER BY count(venue) DESC;", sep = "")
ans <- query(q)
simUser <- subset(ans, anz_venues >= 2)
head(simUser)
##    id1  id2   mag1   mag2 dotprod cossim anz_venues
## 1 3450 3518 16.371 16.643     196 0.7194         15
## 2 3450 3782  3.000  3.000       8 0.8889          6
## 3 3450 3382  2.828  2.828       7 0.8750          5
## 4 3450 4031  4.690  2.236      10 0.9535          5
## 5 3450 3860  2.236  7.483      12 0.7171          5
## 6 3450 3537  2.236 11.180      15 0.6000          5
The second query then selects all venues (call them recommendation candidates) that were rated by the similar users but have not yet been visited by the person. It returns the user ratings and the venue properties such as name, type and the geographic coordinates.
# Query all venues from the similar users still not visited by the chosen user
q2 <- paste("START su=node(", paste(simUser$id2, collapse = ","),
    "), me=node:node_auto_index(name=\"", userName, "\")
    MATCH su-[r]->v
    WHERE NOT(v<-[]-me) AND v.categoryname <> \"Travel & Transport\"
    RETURN id(su) as id_su, r.stars as rating, id(v) as id_venue,
        v.txt as venue_name, v.lng as lng, v.lat as lat
    ORDER BY v.txt;", sep = "")
ans2 <- query(q2)
head(ans2)
##   id_su rating id_venue           venue_name              lng              lat
## 1  3480      1     1297 . HEDONISTIC CRUISE .         6.926502        50.949425
## 2  3436      1     1269              30works 6.93428135142764 50.9401634603315
## 3  3480      1     1376            3DFACTORY  6.9209361076355 50.9508572326839
## 4  3381      1     2274  4 Cani della Citta          6.942126        50.938178
## 5  3369      1     1418    4010 Telekom Shop          6.94348         50.93852
## 6  3547      1     1418    4010 Telekom Shop          6.94348         50.93852
The last step is to determine the top X recommendations. For every recommendation candidate, we compute a rating weighted by the similarity between the chosen person and each similar user who has already visited it, and pick the top X venues as our recommendations.
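The weighting scheme can be shown in isolation. Here is a minimal Python sketch with invented similarities and ratings, mirroring what the ddply step does for each candidate venue (this is an illustration, not the actual data):

```python
def weighted_rating(entries):
    """entries: list of (similarity, rating) pairs for one candidate venue.
    Returns the similarity-weighted average rating:
    sum(sim * rating) / sum(sim)."""
    num = sum(sim * rating for sim, rating in entries)
    den = sum(sim for sim, _ in entries)
    return num / den

# two similar users rated the candidate venue 4 and 2, with
# similarities 0.9 and 0.3 to the chosen person; the more similar
# user's rating dominates the weighted result
print(weighted_rating([(0.9, 4), (0.3, 2)]))  # 3.5
```

Candidates are then sorted by this weighted rating and the top X are kept.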
# Calculate top X recommendations
recommendationCandidates <- ans2
venueRecommendationCandidates <- merge(ans, recommendationCandidates,
    by.x = "id2", by.y = "id_su")
venueRecommendationCandidates$rating <- as.numeric(as.character(venueRecommendationCandidates$rating))
venueRecommendation <- ddply(venueRecommendationCandidates,
    c("id_venue", "venue_name", "lng", "lat"),
    function(df) {
        sum(df$cossim * as.numeric(df$rating))/sum(df$cossim)
    })
venueRecommendation <- venueRecommendation[order(venueRecommendation[, 5],
    decreasing = TRUE), ]
venueRecommendation$lat <- as.numeric(as.character(venueRecommendation$lat))
venueRecommendation$lng <- as.numeric(as.character(venueRecommendation$lng))

# Our recommendations for the chosen user
venueRecommendation[c(1:nOfRecommendations), ]
##     id_venue                                     venue_name   lng   lat     V1
## 687      168                                     Wohnung 16 6.922 50.94 100.00
## 697      187                       Fork Unstable Media GmbH 6.966 50.93  52.00
## 152       56                  Pixum | Diginet GmbH & Co. KG 7.000 50.87  44.00
## 49       536     Fachhochschule des Mittelstands (FHM) Köln 6.939 50.94  37.00
## 635      154 Seminar für Politische Wissenschaft - Uni Köln 6.924 50.93  30.00
## 752      201      Sinn und Verstand Kommunikationswerkstatt 6.954 50.95  27.00
## 789      677                               Happyjibe's Loft 6.919 50.97  26.00
## 831      425                     PlanB. GmbH Office Cologne 6.963 50.93  26.00
## 484      586                               Praxis Dokter H. 6.989 50.93  25.00
## 666      303                      Bürogemeinschaft Eckladen 6.962 50.93  24.00
## 223      337                                Köln-Lindweiler 6.887 51.00  23.00
## 516      723                                    Health City 6.932 50.94  23.00
## 611      306     Kreuzung Zollstockgürtel / Vorgebirgstraße 6.945 50.90  23.00
## 201      784                            Paul-Humburg-Schule 6.922 50.99  21.00
## 412      122                                     unimatrix0 6.934 50.93  21.00
## 601      989                            Prinzessinnenküche 6.947 50.90  20.00
## 754      957                Reitergemeinschaft Kornspringer 7.076 50.97  19.00
## 371      233                                            MTC 6.939 50.93  17.56
## 721      188                           ESA-Besprechungsecke 6.976 50.96  17.00
## 694      973                                  fischerappelt 6.966 50.93  16.00
We close the coding section with a visualization of the already visited (red) and recommended (blue) venues on a map. This gives a first impression of how the venues and the recommendations are distributed in geographic space.
# --------------------------------------
# Plot all of the visited and recommended venues
# --------------------------------------
# get coordinates of the venues already visited
qUserVenues <- paste("START me=node:node_auto_index(name=\"", userName, "\")
    MATCH me-[r]->v
    WHERE v.categoryname <> \"Travel & Transport\"
    RETURN r.stars, v.txt as venuename, v.type as type,
        v.lng as lng, v.lat as lat
    ORDER BY r.stars, v.txt;", sep = "")
userVenues <- query(qUserVenues)
userVenues$lng <- as.numeric(as.character(userVenues$lng))
userVenues$lat <- as.numeric(as.character(userVenues$lat))

# plot venues using the ggmap package
require(ggmap)
theme_set(theme_bw(16))
hdf <- get_map(location = c(lon = mean(userVenues$lng), lat = mean(userVenues$lat)),
    zoom = 11)
ggmap(hdf, extent = "normal") +
    geom_point(aes(x = lng, y = lat), size = 4, colour = "red", data = userVenues) +
    geom_point(aes(x = lng, y = lat), size = 4, colour = "blue",
        data = venueRecommendation[c(1:nOfRecommendations), ])
Finally it is time to summarize what was done: we built a simple recommendation engine that recommends users new places to visit based on their past behavior, computed in near real time (there is no long-running batch job). What we left out is a rigorous evaluation of the recommendation performance; because recommendation only serves as a demo use case here, this was not the topic of this posting. More importantly, I have to say that I'm really impressed by how easy it is to set up a Neo4j graph database and how simple the first steps with the query language Cypher are. The SQL-like style of Cypher makes the determination of the most similar users straightforward. Also interesting is the simplicity of the connection from R to Neo4j via the REST interface – no additional tools are needed. The outlook is even more promising given the main advantages of NoSQL databases, such as fast operations on huge datasets and easy, fault-tolerant scalability (not shown here). Even though the operations run fast on this moderately sized dataset, a broader test lies beyond this session. But maybe in one of the next postings …