Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Databricks recently announced GraphFrames, an awesome Spark extension for graph processing with DataFrames.
I performed a graph analysis and visualized the beautiful ball-movement network of the Golden State Warriors using the rich data provided by NBA.com’s stats.
Pass network of Warriors
Passes received & made
The league’s MVP, Stephen Curry, received the most passes, and the team’s MVP, Draymond Green, made the most passes.
We’ve seen most of the offense start with their pick & roll or Curry’s off-ball cuts with Green as a pass provider.
inDegree
id | inDegree |
---|---|
CurryStephen | 3993 |
GreenDraymond | 3123 |
ThompsonKlay | 2276 |
LivingstonShaun | 1925 |
IguodalaAndre | 1814 |
BarnesHarrison | 1241 |
BogutAndrew | 1062 |
BarbosaLeandro | 946 |
SpeightsMarreese | 826 |
ClarkIan | 692 |
RushBrandon | 685 |
EzeliFestus | 559 |
McAdooJames Michael | 182 |
VarejaoAnderson | 67 |
LooneyKevon | 22 |
outDegree
id | outDegree |
---|---|
GreenDraymond | 3841 |
CurryStephen | 3300 |
IguodalaAndre | 1896 |
LivingstonShaun | 1878 |
BogutAndrew | 1660 |
ThompsonKlay | 1460 |
BarnesHarrison | 1300 |
SpeightsMarreese | 795 |
RushBrandon | 772 |
EzeliFestus | 765 |
BarbosaLeandro | 758 |
ClarkIan | 597 |
McAdooJames Michael | 261 |
VarejaoAnderson | 94 |
LooneyKevon | 36 |
Label Propagation
Label Propagation is an algorithm to find communities in a graph network.
The algorithm nicely classifies players into backcourt and frontcourt without being given any labels!
name | label |
---|---|
Thompson, Klay | 3 |
Barbosa, Leandro | 3 |
Curry, Stephen | 3 |
Clark, Ian | 3 |
Livingston, Shaun | 3 |
Rush, Brandon | 7 |
Green, Draymond | 7 |
Speights, Marreese | 7 |
Bogut, Andrew | 7 |
McAdoo, James Michael | 7 |
Iguodala, Andre | 7 |
Varejao, Anderson | 7 |
Ezeli, Festus | 7 |
Looney, Kevon | 7 |
Barnes, Harrison | 7 |
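To make the idea concrete, here is a minimal pure-Python sketch of label propagation on a toy graph. It is not the GraphFrames implementation: the function name, the toy edges and the tie-breaking rule (smallest label wins) are my own assumptions, and edges are treated as undirected for simplicity.

```python
from collections import Counter

def label_propagation(edges, max_iter=5):
    """Synchronous label propagation on an undirected edge list (sketch)."""
    nodes = sorted({n for edge in edges for n in edge})
    # Every node starts in its own community, labelled with its own name.
    labels = {n: n for n in nodes}
    neighbours = {n: [] for n in nodes}
    for src, dst in edges:
        neighbours[src].append(dst)
        neighbours[dst].append(src)
    for _ in range(max_iter):
        new_labels = {}
        for n in nodes:
            counts = Counter(labels[m] for m in neighbours[n])
            if not counts:
                new_labels[n] = labels[n]
                continue
            best = max(counts.values())
            # Assumption: ties are broken by taking the smallest label.
            new_labels[n] = min(l for l, c in counts.items() if c == best)
        labels = new_labels
    return labels

# Hypothetical toy graph: two triangles joined by a single edge (c-d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
labels = label_propagation(toy_edges, max_iter=5)
```

On this toy graph the two triangles end up with two different labels, which is the same kind of split the table above shows for the backcourt and frontcourt players.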
PageRank
PageRank can detect important nodes (players in this case) in a network.
It’s no surprise that Stephen Curry, Draymond Green and Klay Thompson are the top three.
The algorithm also detects that Shaun Livingston and Andre Iguodala play key roles in the Warriors’ passing game.
name | pagerank |
---|---|
Curry, Stephen | 2.17 |
Green, Draymond | 1.99 |
Thompson, Klay | 1.34 |
Livingston, Shaun | 1.29 |
Iguodala, Andre | 1.21 |
Barnes, Harrison | 0.86 |
Bogut, Andrew | 0.77 |
Barbosa, Leandro | 0.72 |
Speights, Marreese | 0.66 |
Clark, Ian | 0.59 |
Rush, Brandon | 0.57 |
Ezeli, Festus | 0.48 |
McAdoo, James Michael | 0.27 |
Varejao, Anderson | 0.19 |
Looney, Kevon | 0.16 |
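For intuition, here is a small pure-Python sketch of a PageRank iteration over a directed edge list. It assumes the un-normalized update rule (rank = reset probability + (1 - reset probability) * sum of incoming rank shares), which I believe is what Spark's implementation follows, but treat that as an assumption. The function name, the fixed iteration count and the toy edges are made up for illustration.

```python
from collections import Counter

def pagerank(edges, reset_probability=0.15, n_iter=100):
    """Iterate the un-normalized PageRank update on a directed edge list (sketch)."""
    nodes = sorted({n for edge in edges for n in edge})
    # Repeated edges count multiple times, like one edge row per pass.
    out_degree = Counter(src for src, _ in edges)
    ranks = {n: 1.0 for n in nodes}
    for _ in range(n_iter):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            # Each node splits its rank evenly over its outgoing edges.
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {n: reset_probability + (1 - reset_probability) * incoming[n]
                 for n in nodes}
    return ranks

# Hypothetical toy graph: b and c both pass to a, and a passes back to b.
toy_ranks = pagerank([("b", "a"), ("c", "a"), ("a", "b")])
# A simple cycle, where every node should end up equally important.
cycle_ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a")])
```

In the toy graph, the node that receives from everyone ranks highest, and the node nobody passes to keeps only the reset probability. That matches the reading of the table above: a high score means the ball keeps coming back to that player.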
Everything together
Here is a network visualization using the results above.
- Node size: pagerank
- Node color: community
- Link width: passes received & made
Workflow
Calling API
I used the endpoint playerdashptpass and saved data for all the players in the team into local JSON files.
The data records who passed to whom, and how many times, in the 2015-16 season.
import os

# GSW player IDs
playerids = [201575, 201578, 2738, 202691, 101106, 2760, 2571, 203949, 203546,
             203110, 201939, 203105, 2733, 1626172, 203084]
# Call the API for each player and store the result as {playerid}.json
for playerid in playerids:
    os.system('curl "http://stats.nba.com/stats/playerdashptpass?'
              'DateFrom=&'
              'DateTo=&'
              'GameSegment=&'
              'LastNGames=0&'
              'LeagueID=00&'
              'Location=&'
              'Month=0&'
              'OpponentTeamID=0&'
              'Outcome=&'
              'PerMode=Totals&'
              'Period=0&'
              'PlayerID={playerid}&'
              'Season=2015-16&'
              'SeasonSegment=&'
              'SeasonType=Regular+Season&'
              'TeamID=0&'
              'VsConference=&'
              'VsDivision=" > {playerid}.json'.format(playerid=playerid))
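If you would rather not assemble the query string by hand, the same URL can be built from a list of parameters. The sketch below only constructs the URL and does not call the API; `build_pass_url` is a hypothetical helper name of mine, while the endpoint and the parameters are the ones used above.

```python
from urllib.parse import urlencode

BASE_URL = "http://stats.nba.com/stats/playerdashptpass"

def build_pass_url(playerid, season="2015-16"):
    """Build the playerdashptpass URL for one player (hypothetical helper)."""
    params = [
        ("DateFrom", ""), ("DateTo", ""), ("GameSegment", ""),
        ("LastNGames", 0), ("LeagueID", "00"), ("Location", ""),
        ("Month", 0), ("OpponentTeamID", 0), ("Outcome", ""),
        ("PerMode", "Totals"), ("Period", 0), ("PlayerID", playerid),
        ("Season", season), ("SeasonSegment", ""),
        # urlencode turns the space into "+", as in the curl command.
        ("SeasonType", "Regular Season"), ("TeamID", 0),
        ("VsConference", ""), ("VsDivision", ""),
    ]
    return BASE_URL + "?" + urlencode(params)

url = build_pass_url(201939)
```

Keeping the parameters in one list makes it easier to switch the season or the season type later without touching a long string.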
JSON -> pandas DataFrame
Then I combined all the individual JSON files into a single DataFrame for later aggregation.
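import json
import pandas as pd

# Read each player's JSON file and collect one DataFrame per player
frames = []
for playerid in playerids:
    with open("{playerid}.json".format(playerid=playerid)) as json_file:
        parsed = json.load(json_file)['resultSets'][0]
        frames.append(
            pd.DataFrame(parsed['rowSet'], columns=parsed['headers']))
# DataFrame.append was removed in pandas 2.0, so concatenate once instead
raw = pd.concat(frames, ignore_index=True)
raw = raw.rename(columns={'PLAYER_NAME_LAST_FIRST': 'PLAYER'})
# The id is the name without the ", " separator, e.g. "CurryStephen"
raw['id'] = raw['PLAYER'].str.replace(', ', '')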
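The shape the code above relies on is a `resultSets` list whose first entry carries `headers` and `rowSet`. Here is a minimal sketch with the standard library only; the payload is a made-up sample with invented names and numbers (the real response has many more columns), and `rows_as_dicts` is a hypothetical helper.

```python
import json

# Made-up sample payload; only the structure mirrors what the code above reads.
sample = json.loads("""
{"resultSets": [{"headers": ["PLAYER_NAME_LAST_FIRST", "PASS_TO", "PASS"],
                 "rowSet": [["Passer, Example", "Receiver, One", 12],
                            ["Passer, Example", "Receiver, Two", 7]]}]}
""")

def rows_as_dicts(payload):
    """Pair each row of the first result set with its column headers."""
    result_set = payload["resultSets"][0]
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]

records = rows_as_dicts(sample)
for record in records:
    # Same renaming and id construction as in the pandas version.
    record["PLAYER"] = record.pop("PLAYER_NAME_LAST_FIRST")
    record["id"] = record["PLAYER"].replace(", ", "")
```

Each record now has the `PLAYER`, `PASS_TO`, `PASS` and `id` fields that the later steps use.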
Prepare vertices and edges
GraphFrames in Spark needs the data in a special format: vertices and edges.
Vertices are the list of nodes, with their IDs, in a graph.
Edges are the relationships between the nodes.
You can pass additional features such as weight, but I couldn’t find a way to use these features well in the later analysis.
The workaround I took below is brute force (one edge row per pass) and not even a proper graph operation, but it works (suggestions/comments are very welcome).
# Make raw vertices
pandas_vertices = raw[['PLAYER', 'id']].drop_duplicates()
pandas_vertices.columns = ['name', 'id']
# Make raw edges: one row per pass, so that degree counts equal pass counts
edge_frames = []
for passer in raw['id'].drop_duplicates():
    # Only keep receivers who are themselves in the player list
    for receiver in raw[(raw['PASS_TO'].isin(raw['PLAYER'])) &
                        (raw['id'] == passer)]['PASS_TO'].drop_duplicates():
        n_passes = int(raw[(raw['id'] == passer) &
                           (raw['PASS_TO'] == receiver)]['PASS'].values[0])
        edge_frames.append(pd.DataFrame(
            {'passer': passer, 'receiver': receiver.replace(', ', '')},
            index=range(n_passes)))
# DataFrame.append was removed in pandas 2.0, so concatenate once instead
pandas_edges = pd.concat(edge_frames, ignore_index=True)
pandas_edges.columns = ['src', 'dst']
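One way to see why the brute-force expansion works: counting the repeated edge rows gives the same in- and out-degrees as summing a weight column directly. The sketch below checks this on made-up data; the edge triples and both helper names are hypothetical, and it is only a suggestion, not something GraphFrames provides.

```python
from collections import Counter

# Made-up weighted edges: (passer, receiver, number of passes).
weighted_edges = [("A", "B", 3), ("B", "A", 2), ("A", "C", 1)]

def expand(weighted):
    """Brute force: repeat each (src, dst) pair once per pass."""
    return [(src, dst) for src, dst, n in weighted for _ in range(n)]

def weighted_degrees(weighted):
    """Alternative: sum the pass counts per receiver and per passer."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst, n in weighted:
        out_deg[src] += n
        in_deg[dst] += n
    return in_deg, out_deg

repeated = expand(weighted_edges)
in_deg, out_deg = weighted_degrees(weighted_edges)
```

For the degree tables, then, a grouped sum over the `PASS` column would be enough. The expansion is still useful for algorithms that only look at edge rows and ignore extra columns.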
Graph analysis
Bring the local vertices and edges to Spark and let it spark.
from graphframes import GraphFrame

# sqlContext is the SQLContext that the pyspark shell creates on startup
vertices = sqlContext.createDataFrame(pandas_vertices)
edges = sqlContext.createDataFrame(pandas_edges)
# Analysis part
g = GraphFrame(vertices, edges)
print("vertices")
g.vertices.show()
print("edges")
g.edges.show()
print("inDegrees")
g.inDegrees.sort('inDegree', ascending=False).show()
print("outDegrees")
g.outDegrees.sort('outDegree', ascending=False).show()
print("degrees")
g.degrees.sort('degree', ascending=False).show()
print("labelPropagation")
g.labelPropagation(maxIter=5).show()
print("pageRank")
g.pageRank(resetProbability=0.15, tol=0.01).vertices.sort(
    'pagerank', ascending=False).show()
Visualise the network
When you run gsw_passing_network.py from my GitHub repo, it writes passes.csv, groups.csv and size.csv to your working directory.
I used the networkD3 package in R to make a cool interactive D3 chart.
library(networkD3)
setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')
passes <- read.csv("passes.csv")
groups <- read.csv("groups.csv")
size <- read.csv("size.csv")
# networkD3 expects zero-based node indices for the link endpoints
passes$source <- as.numeric(as.factor(passes$PLAYER))-1
passes$target <- as.numeric(as.factor(passes$PASS_TO))-1
# Scale the pass counts down so the link widths stay readable
passes$PASS <- passes$PASS/50
groups$nodeid <- groups$name
groups$name <- as.numeric(as.factor(groups$name))-1
groups$group <- as.numeric(as.factor(groups$label))-1
nodes <- merge(groups,size[-1],by="id")
# Exaggerate the PageRank differences for the node sizes
nodes$pagerank <- nodes$pagerank^2*100
# Note: d3.scale.category10() is D3 v3 syntax; newer networkD3 releases
# may need JS("d3.scaleOrdinal(d3.schemeCategory10);") instead
forceNetwork(Links = passes,
             Nodes = nodes,
             Source = "source",
             fontFamily = "Arial",
             colourScale = JS("d3.scale.category10()"),
             Target = "target",
             Value = "PASS",
             NodeID = "nodeid",
             Nodesize = "pagerank",
             linkDistance = 350,
             Group = "group",
             opacity = 0.8,
             fontSize = 16,
             zoom = TRUE,
             opacityNoHover = TRUE)
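One thing to watch in the R code: `source` and `target` are converted to factors separately, so the indices only line up when both columns contain exactly the same set of names. A safer pattern, sketched here in Python with a hypothetical helper and a made-up two-row pass list, is to build one shared name-to-index mapping and use it for both columns.

```python
def zero_based_index(names):
    """Map each distinct name to its position in sorted order, starting at 0."""
    return {name: i for i, name in enumerate(sorted(set(names)))}

# Made-up subset of passes: (passer, receiver).
passes = [("Curry, Stephen", "Green, Draymond"),
          ("Green, Draymond", "Thompson, Klay")]

# One mapping built from every name that appears on either side.
index = zero_based_index([name for pair in passes for name in pair])
links = [(index[src], index[dst]) for src, dst in passes]
```

The node table should then be ordered by the same mapping, so that index 0 in the links refers to the first row of the nodes.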
Code
The full code is available on GitHub.
Analyzing Golden State Warriors’ passing network using GraphFrames in Spark was originally published by Kirill Pomogajko at Opiate for the masses on March 15, 2016.