Network analysis in Bollywood

R on Asitav Sen

2 years ago

[This article was first published on R on Asitav Sen, and kindly contributed to R-bloggers.]

Network Analysis

Network theory is the study of graphs as a representation of relationships between discrete objects. When applied to social relations, it is known as social network analysis.¹

Bollywood

In this article, network theory is applied to analyse the relationships between professionals in Bollywood, based on a movie data set. The data was compiled by Parth Parikh from various sources. The analysis in this article uses a relevant subset of that data.

Data preparation

The downloaded CSVs are imported as follows:

film1: Movie details from 1950 to 1989
film2: Movie details from 1990 to 2009
film3: Movie details from 2010 to 2019
crew: Crew information with unique identifier
filmcrew: Crew details (unique identifier) in each movie

# Packages used throughout (assumed to be installed)
library(dplyr)    # %>%, filter, joins, mutate, select
library(tidyr)    # separate
library(stringr)  # str_detect
library(igraph)   # graph construction and analysis

# Combining film data
films<-rbind(film1,film2,film3)
rm(film1,film2,film3)
# Actor names are stored in a single column, separated by "|"; split them into up to 10 columns
f<-films%>%
  separate(actors,
           c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10"),sep="[|]")%>%
  filter(a1!="NA")
# Remove any duplicated movies
f<-f[!duplicated(f$imdb_id), ]
rm(films)
# Selecting relevant columns from filmcrew
fc<-filmcrew[,c(1,2,4,5)]
rm(filmcrew)
# Writer ids are also stored in a single column; split them and drop rows with no writer ("\N")
c<-crew%>%
  separate(writers, c("w1","w2","w3","w4","w5","w6","w7","w8","w9","w10"), sep="[|]")%>%
  filter(w1!="\\N")
# Remove any duplicated movies
c<-c[!duplicated(c$imdb_id), ]
rm(crew)
# Combining actor and crew information per movie
full.coded.raw<-c%>%right_join(f,by=c("imdb_id"))

# Relevant columns are directors, writers and actors, i.e. director, wx and ax
dt<-full.coded.raw[,c(2:12,16:25)]
rm(full.coded.raw)
# To create a file suitable for network analysis, a two-column file that depicts relationships is required.
# This can be achieved by creating combinations of columns and binding them together.
# It can be done manually, or with the `juggling_jaguar` function from the package `Rmessy`.
# The package is under development. Please feel free to use it and develop it further.
# It can be installed from GitHub with devtools using the following commands.

# install.packages("devtools") # if not installed
# devtools::install_github("asitav-sen/Rmessy")
library(Rmessy)

# using juggling_jaguar

dt.net<-juggling_jaguar(dt)
rm(dt)

# names(dt.net)[1]="x"
# names(dt.net)[2]="y"
#
# dt.net<-
#   dt.net%>%filter(x!=y)

# The data frame contains the unique ids of the crew (and the names of the actors).
# The data frame fc holds the information needed to map those ids to names.
# Since actor names are not coded, they must be kept intact.

dt.net1<-
dt.net%>%
  left_join(fc,by=c("x"="crew_id"))%>%
  mutate(from=ifelse(is.na(name),x,name))%>%
  select(c(6,2))%>%   # keeps the columns from and y
  left_join(fc,by=c("y"="crew_id"))%>%
  mutate(to=ifelse(is.na(name),y,name))%>%
  select(c(1,6))      # keeps the columns from and to
rm(dt.net)
# Some ids were not found in the available crew list, so these rows may be deleted.
# This is optional. One may want to analyse using the codes instead.
dt.net2<-
dt.net1%>%
  filter(!str_detect(to,"^nm"))%>%
  filter(!str_detect(from,"^nm"))
rm(dt.net1)
# Removing possible empty rows.
# Logical indexing is used because dt.net2[-which(...), ] would drop every row when nothing matches.
dt.net2<-dt.net2[!(dt.net2$to %in% ""), ]

# In this data set, the pairs A-B and B-A are treated as different although they describe the same relationship.
# The data is therefore passed through an undirected igraph object, which treats both as the same edge.

gra.ph<-graph_from_data_frame(d=dt.net2, directed=FALSE)

# converting back to data frame

data.df<-get.data.frame(gra.ph)
rm(gra.ph)
# To analyse 'strength of relation', the number of times two people have worked together is assumed to be a good indicator.
data.df<-
  data.df%>%
  group_by(from,to)%>%
  count()%>%
  arrange(desc(n))%>%
  rename(works=n)%>%
  filter(from!=to)

# creating the graph object and removing scattered small components

gra.ph<-graph_from_data_frame(d=data.df, directed=FALSE)
# Note: this stores the counts as a graph-level attribute.
# The same counts are also available per edge as E(gra.ph)$works.
gra.ph$weight<-data.df$works
comp<-components(gra.ph)
V(gra.ph)$comp <- comp$membership
# keep the largest connected component (not necessarily the one labelled 1)
gra.main <- induced_subgraph(gra.ph,V(gra.ph)$comp==which.max(comp$csize))
rm(gra.ph)
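
The pairing step in the listing above relies on `juggling_jaguar` from `Rmessy`. For readers who would rather see the idea spelled out, here is a minimal base-R sketch, assuming a wide data frame with one row per film and one column per person. The helper `pair_edges` and the toy data are made up for illustration. This is not the package's implementation.

```r
# Hypothetical helper: turn each row of a wide table into all unordered pairs.
pair_edges <- function(wide) {
  rows <- lapply(seq_len(nrow(wide)), function(i) {
    people <- unlist(wide[i, ], use.names = FALSE)
    people <- unique(people[!is.na(people) & people != ""])
    if (length(people) < 2) return(NULL)
    # sort() puts every pair in a fixed order, so A-B and B-A coincide
    pairs <- t(combn(sort(people), 2))
    data.frame(from = pairs[, 1], to = pairs[, 2], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Toy data: three films, one director column and two actor columns
toy <- data.frame(
  director = c("D1", "D1", "D2"),
  a1 = c("A", "B", "A"),
  a2 = c("B", "A", NA),
  stringsAsFactors = FALSE
)

edges <- pair_edges(toy)
# Count how often each pair worked together
works <- aggregate(n ~ from + to, data = cbind(edges, n = 1), FUN = sum)
works
```

Sorting the names within each film before pairing them plays the same role as the round trip through an undirected graph in the listing above: both make A-B and B-A count as one relationship.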

Analyses

Understanding importance of nodes (individuals) and the network

The importance of individuals is highly unequal. Very few individuals have worked with more than 100 different people in the industry, as the histogram of degrees shows. A similar trend appears in the histogram of betweenness, which is another indicator of importance. Roughly, betweenness in this case can be read as the tendency to have worked with different individuals who have not worked with each other. (This is an oversimplification.)

main.deg<-degree(gra.main, mode = "all")
main.bw<-betweenness(gra.main,directed = FALSE, normalized = TRUE)
eigen.main<-eigen_centrality(gra.main)
par(mfrow=c(1,3))
hist(main.deg, breaks = 10, main = "Degree", xlab = "Degree")
hist(main.bw, breaks = 100, main = "Betweenness", xlab = "Betweenness")
hist(eigen.main$vector, breaks = 100, main= "Eigen Vector", xlab = "Eigen Vector")
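
To make the idea of betweenness more concrete, here is a small sketch on made-up data (the vertex names are placeholders, not people from the data set). Two groups have each worked together internally and are linked only through one "broker", X.

```r
library(igraph)

# Two triangles (A-B-C and E-F-G) connected only through the broker X
toy <- graph_from_literal(A-B, B-C, A-C, C-X, X-E, E-F, F-G, E-G)

toy.deg <- degree(toy)
toy.bw <- betweenness(toy, directed = FALSE)

# X has only two connections, yet every path between the two groups passes through it
rbind(degree = toy.deg, betweenness = toy.bw)
```

In this toy network X has one of the lowest degrees but the highest betweenness, which is why the two measures can rank individuals differently.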

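The top 20 values of each measure are listed in the table. Note that each column is sorted independently, and the row names are taken from the degree ranking. The names therefore identify individuals only for the bydegree column.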

bydegree<-sort(main.deg, decreasing=TRUE)[1:20]
bybetweenness<-sort(main.bw, decreasing=TRUE)[1:20]
byeigen<-sort(eigen.main$vector,decreasing = TRUE)[1:20]

top<-data.frame(bydegree,bybetweenness,byeigen)
top

##                    bydegree bybetweenness   byeigen
## Anupam Kher             904    0.03737696 1.0000000
## Gulshan Grover          798    0.02848179 0.9701978
## Shakti Kapoor           779    0.02748425 0.9354072
## Aruna Irani             773    0.02258863 0.9272276
## Prem Chopra             739    0.02089639 0.8935966
## Amitabh Bachchan        725    0.02057037 0.8924957
## Asrani                  723    0.01898179 0.8523556
## Dharmendra              707    0.01789976 0.8479439
## Mithun Chakraborty      672    0.01777744 0.8417188
## Om Puri                 657    0.01526129 0.8135059
## Pran                    633    0.01497864 0.8016672
## Paresh Rawal            625    0.01481343 0.7927105
## Amrish Puri             591    0.01458164 0.7898096
## Jackie Shroff           586    0.01362665 0.7886611
## Naseeruddin Shah        583    0.01347162 0.7875499
## Satyendra Kapoor        571    0.01282562 0.7858475
## Johnny Lever            570    0.01215650 0.7852549
## Rishi Kapoor            566    0.01104178 0.7800349
## Jeetendra               565    0.01039253 0.7749431
## Kader Khan              563    0.01029858 0.7732780
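
Since `data.frame()` takes its row names from the first vector, the names in the table above belong to the degree ranking only. A possible way to keep each ranking's own names is sketched below on toy values. The helper `top_named` and the numbers are made up for illustration.

```r
# Hypothetical helper: top k entries of a named numeric vector, names kept as a column
top_named <- function(x, k = 20) {
  s <- sort(x, decreasing = TRUE)[seq_len(min(k, length(x)))]
  data.frame(name = names(s), value = unname(s), stringsAsFactors = FALSE)
}

# Toy centrality scores for three people
deg <- c(A = 5, B = 9, C = 7)
bw  <- c(A = 0.6, B = 0.1, C = 0.3)

d <- top_named(deg, 2)
b <- top_named(bw, 2)

top.toy <- data.frame(
  by_degree = d$name, degree = d$value,
  by_betweenness = b$name, betweenness = b$value,
  stringsAsFactors = FALSE
)
top.toy
```

With the real data, the same helper could be applied to `main.deg`, `main.bw` and `eigen.main$vector`.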

The farthest or the longest connection in the network is between Ivan Ayr and Adiba Bhat, at a distance of 7, i.e. with 6 individuals in between.

farthest_vertices(gra.main)

## $vertices
## + 2/7436 vertices, named, from edb7f57:
## [1] Ivan Ayr   Adiba Bhat
## 
## $distance
## [1] 7

get_diameter(gra.main)

## + 8/7436 vertices, named, from edb7f57:
## [1] Ivan Ayr         Himanshu Kohli   Barkha Madan     Rekha           
## [5] Amitabh Bachchan Prateik          Nitin Kakkar     Adiba Bhat

The assortativity based on degree, i.e. the tendency of individuals to work with other individuals of similar degree (number of connections), is close to 0. High-degree and low-degree individuals mix almost evenly.

assortativity.degree(gra.main, directed= FALSE)

## [1] 0.0175864
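
For reference, the two extremes of degree assortativity can be sketched on toy graphs. The graphs below are made up for illustration. The sketch uses `assortativity_degree()`, the current igraph name for the function called above.

```r
library(igraph)

# One hub connected to nine leaves: high degree only ever meets low degree
star <- make_star(10, mode = "undirected")

# A group of four who all know each other plus a separate pair:
# everyone is connected only to people with the same degree
like.with.like <- disjoint_union(make_full_graph(4), make_full_graph(2))

a.star <- assortativity_degree(star)
a.like <- assortativity_degree(like.with.like)
c(star = a.star, like.with.like = a.like)
```

The star sits at the negative extreme and the second graph at the positive extreme, so a value near 0, as in the Bollywood network, indicates neither tendency.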

A transitivity of 0.23 is much higher than that of a randomly generated network with similar properties. However, it is not uncommon for social networks to have a transitivity between 0.3 and 0.6.² Transitivity measures how likely it is that two individuals who have both worked with a third have also worked with each other. (Oversimplification)

# creating random graphs (Erdős-Rényi) for comparison
# *****Requires substantial computational power*****

rnd.main <- vector('list',500)
dens.main<-edge_density(gra.main)
n<-gorder(gra.main)

for(i in 1:500){
  rnd.main[[i]] <- erdos.renyi.game(n=n, p.or.m = dens.main, type = "gnp")
}


tra.main<-transitivity(gra.main)
tra.rnd <- unlist(lapply(rnd.main, transitivity))

par(mfrow=c(1,2))
hist(tra.rnd, main="Transitivity")
abline(v=tra.main)

hist(tra.rnd, main="Transitivity, x-axis extended", xlim = c(0,0.3))
abline(v=tra.main)

rm(rnd.main,tra.rnd)
# A similar test can be performed for other properties such as diameter, max cliques etc.
# (run these before removing rnd.main)
# dia.main<- diameter(gra.main, directed = FALSE)
# dia.rnd <- unlist(lapply(rnd.main, diameter, directed = FALSE))


# max.c.main<-max_cliques(gra.main)
# lar.c.main<-largest_cliques(gra.main)
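
The comparison with random graphs works because, in a G(n, p) random graph, every possible edge exists independently with probability p, so the expected transitivity is roughly p, i.e. roughly the edge density. The sketch below illustrates this on toy graphs. The sizes, the seed and the value of p are arbitrary choices for illustration.

```r
library(igraph)

# Extremes: a closed triangle versus an open chain of three people
tr.triangle <- transitivity(make_full_graph(3))
tr.path <- transitivity(make_ring(3, circular = FALSE))

# A random graph: transitivity stays close to the edge density
set.seed(1)
p <- 0.05
g <- sample_gnp(200, p)
c(triangle = tr.triangle, path = tr.path,
  density = edge_density(g), transitivity = transitivity(g))
```

A real collaboration network whose transitivity is far above its edge density therefore clusters much more than chance alone would produce.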

Understanding communities/clusters in the network

The Fast Greedy algorithm identifies several segments, the five largest of which are as follows.

# Fast Greedy
com.fg<-fastgreedy.community(gra.main)
sort(sizes(com.fg), decreasing = TRUE)[1:5]

## Community sizes
##    3    1    4    2   16 
## 3312 1992  617  347   75

# To check membership

#membership(com.fg)
#membership(com.fg)[membership(com.fg)==1]
#membership(com.fg)[names(membership(com.fg))=="Amitabh Bachchan"]
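
To see what the algorithm does on a network small enough to check by eye, here is a sketch on a made-up graph: two groups of five who have all worked with each other, joined by a single cross-group collaboration. It uses `cluster_fast_greedy()`, the current igraph name for `fastgreedy.community()`.

```r
library(igraph)

# Two fully connected groups of five, plus one link between them
toy <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy <- add_edges(toy, c(1, 6))
V(toy)$name <- LETTERS[1:10]

com <- cluster_fast_greedy(toy)

sizes(com)   # number of members per community
com[[1]]     # names of the members of community 1
```

Indexing the result with `[[ ]]` returns the member names of one community, which is also how the cluster subgraphs are extracted from `com.fg`.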

Following is the plot of cluster 4, the third largest community identified by the algorithm.

# gra.main is already undirected, so no conversion is needed
comm.fg.4 <- induced_subgraph(gra.main, com.fg[[4]])
comm.fg.4.deg<-degree(comm.fg.4, mode = "all")


par(bg="black", mfrow=c(1,1))
plot(comm.fg.4,
     rescale= TRUE,
     vertex.label = ifelse(comm.fg.4.deg >= 20, names(V(comm.fg.4)), NA),
     vertex.color = adjustcolor("gold", alpha.f = .5),
     vertex.size = sqrt(comm.fg.4.deg),
     layout = layout_with_lgl(comm.fg.4),
     vertex.label.cex= 0.75,
     vertex.label.degree=pi/2,
     vertex.label.dist=1.5,
     vertex.label.color="white",
     edge.curved=0.5,
     edge.width= 0.5,
     # colour each edge by its own count of joint works (edge attribute `works`)
     edge.color = ifelse(E(comm.fg.4)$works>25, "dark green", "dark red"))
title("Network of cluster 4",cex.main=1,col.main="white")
legend("topright", c("<=25 times",">25 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

A central figure cannot be identified from this graph. There seem to be several who can claim to be ‘central’. Eminent personalities such as Kamal Hassan, Girish Karnad and Smita Patil are present in this segment, as are Satyajit Ray and Anil Chatterjee. Interestingly, they are some of the major figures of parallel cinema.

Mahabanoo Mody Kotwal seems to be the central figure in cluster 3. This group does not seem to consist of the creators of Bollywood blockbusters. However, its members have gained popularity in regional movies and television. Some of them are foreigners too.

# gra.main is already undirected, so no conversion is needed
comm.fg.3 <- induced_subgraph(gra.main, com.fg[[3]])

comm.fg.3.deg<-degree(comm.fg.3, mode = "all")


par(bg="black", mfrow=c(1,1))
plot(comm.fg.3,
     rescale= TRUE,
     vertex.label = ifelse(comm.fg.3.deg >= 11, names(V(comm.fg.3)), NA),
     vertex.color = adjustcolor("gold", alpha.f = .5),
     vertex.size = comm.fg.3.deg^(1/5),
     layout = layout_with_lgl(comm.fg.3),
     vertex.label.cex= 0.75,
     vertex.label.degree=pi/2,
     vertex.label.dist=1,
     vertex.label.color="white",
     edge.curved=0.5,
     edge.width= 0.5,
     # colour each edge by its own count of joint works (edge attribute `works`)
     edge.color = ifelse(E(comm.fg.3)$works>25, "dark green", "dark red"),
     xlim = c(-1,1.1),
     asp=-0.5)
title("Network of cluster 3",cex.main=1,col.main="white")
legend("topright", c("<=25 times",">25 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

Clusters 1 and 2 are too big and complex to show anything meaningful in a network plot. They need to be broken down further, or other segmentation methods need to be used to capture different segments.

# gra.main is already undirected, so no conversion is needed
comm.fg.1 <- induced_subgraph(gra.main, com.fg[[1]])
comm.fg.1.deg<-degree(comm.fg.1, mode = "all")

comm.fg.2 <- induced_subgraph(gra.main, com.fg[[2]])
comm.fg.2.deg<-degree(comm.fg.2, mode = "all")

# Edges are coloured by their own count of joint works (edge attribute `works`).
# Labels are drawn in white so that they remain visible on the black background.

# cluster 1
par(bg="black", mfrow=c(2,2))
plot(comm.fg.1,
     rescale= TRUE,
     vertex.label = NA, #ifelse(comm.fg.1.deg >= 350, names(V(comm.fg.1)), NA),
     vertex.color = adjustcolor("gold", alpha.f = .5),
     vertex.size = ifelse(comm.fg.1.deg<50,0.1,(comm.fg.1.deg)^(1/4)),
     layout = layout_with_lgl(comm.fg.1),
     vertex.label.cex= 0.75,
     vertex.label.degree=pi/2,
     vertex.label.dist=1,
     vertex.label.color="white",
     edge.curved=0.5,
     edge.width= 0.5,
     edge.color = adjustcolor(ifelse(E(comm.fg.1)$works>10, "dark green", "dark red"),alpha.f=0.3),
     xlim = c(-1,1.1),
     asp=-1,
     axes = FALSE)
title("Network of cluster 1",cex.main=1,col.main="white")
legend("topright", c("<=10 times",">10 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

# cluster 1, zoomed
plot(comm.fg.1,
     rescale= TRUE,
     vertex.label = ifelse(comm.fg.1.deg >= 350, names(V(comm.fg.1)), NA),
     vertex.color = adjustcolor("gold", alpha.f = .3),
     vertex.size = ifelse(comm.fg.1.deg<50,0.1,(comm.fg.1.deg)^(1/5)),
     layout = layout_with_lgl(comm.fg.1),
     vertex.label.cex= 0.75,
     vertex.label.degree=pi/2,
     vertex.label.dist=1,
     vertex.label.color="white",
     edge.curved=0.5,
     edge.width= 0.5,
     edge.color = adjustcolor(ifelse(E(comm.fg.1)$works>10, "dark green", "dark red"),alpha.f=0.2),
     xlim = c(-0.025,0.025),
     ylim = c(-0.025,0.025),
     asp=-1,
     axes = FALSE)
title("Network of cluster 1, zoomed",cex.main=1,col.main="white")
legend("topright", c("<=10 times",">10 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

# cluster 2
plot(comm.fg.2,
     rescale= TRUE,
     vertex.label = NA, #ifelse(comm.fg.2.deg >= 350, names(V(comm.fg.2)), NA),
     vertex.color = adjustcolor("gold", alpha.f = .5),
     vertex.size = ifelse(comm.fg.2.deg<50,0.1,(comm.fg.2.deg)^(1/4)),
     layout = layout_with_lgl(comm.fg.2),
     vertex.label.cex= 0.75,
     vertex.label.degree=pi/2,
     vertex.label.dist=1,
     vertex.label.color="white",
     edge.curved=0.5,
     edge.width= 0.5,
     edge.color = adjustcolor(ifelse(E(comm.fg.2)$works>10, "dark green", "dark red"),alpha.f=0.3),
     xlim = c(-1,1.1),
     asp=-1,
     axes = FALSE)
title("Network of cluster 2",cex.main=1,col.main="white")
legend("topright", c("<=10 times",">10 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

# cluster 2, zoomed
plot(comm.fg.2,
     rescale= TRUE,
     vertex.label = ifelse(comm.fg.2.deg >= 350, names(V(comm.fg.2)), NA),
     vertex.color = adjustcolor("gold", alpha.f = .3),
     vertex.size = ifelse(comm.fg.2.deg<50,0.1,(comm.fg.2.deg)^(1/5)),
     layout = layout_with_lgl(comm.fg.2),
     vertex.label.cex= 0.75,
     vertex.label.degree=pi/2,
     vertex.label.dist=1,
     vertex.label.color="white",
     edge.curved=0.5,
     edge.width= 0.5,
     edge.color = adjustcolor(ifelse(E(comm.fg.2)$works>10, "dark green", "dark red"),alpha.f=0.2),
     xlim = c(-0.025,0.025),
     ylim = c(-0.025,0.025),
     asp=-1,
     axes = FALSE)
title("Network of cluster 2, zoomed",cex.main=1,col.main="white")
legend("topright", c("<=10 times",">10 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

Comparing the clusters

l1<-mapping_monkey(comm.fg.1)
l2<-mapping_monkey(comm.fg.2)
l3<-mapping_monkey(comm.fg.3)
l4<-mapping_monkey(comm.fg.4)

Comparing the histograms of degree, betweenness and eigenvector centrality does not show any significant difference or unexpected outcome. Only the histogram of eigenvector centrality of segment 3 shows a different pattern: there are more individuals with higher scores. This is natural in a smaller group (segment 3 is very small). People in a smaller group tend to connect to each other more than those in a larger group.

par(mfrow=c(3,4))
hist(l1[[1]], breaks = 50, col=adjustcolor("black", alpha=0.3), main = "Degrees, Segment 1", xlab = "Degrees")
hist(l2[[1]], breaks = 50, col=adjustcolor("red", alpha=0.3), main = "Degrees, Segment 2", xlab = "Degrees")
hist(l3[[1]], breaks = 50, col=adjustcolor("green", alpha=0.3), main = "Degrees, Segment 3", xlab = "Degrees")
hist(l4[[1]], breaks = 50, col=adjustcolor("blue", alpha=0.3), main = "Degrees, Segment 4", xlab = "Degrees")
hist(l1[[2]], breaks = 50, col=adjustcolor("black", alpha=0.3), main = "Betweenness, Segment 1", xlab = "Betweenness")
hist(l2[[2]], breaks = 50, col=adjustcolor("red", alpha=0.3), main = "Betweenness, Segment 2", xlab = "Betweenness")
hist(l3[[2]], breaks = 50, col=adjustcolor("green", alpha=0.3), main = "Betweenness, Segment 3", xlab = "Betweenness")
hist(l4[[2]], breaks = 50, col=adjustcolor("blue", alpha=0.3), main = "Betweenness, Segment 4", xlab = "Betweenness")
hist(l1[[3]], breaks = 50, col=adjustcolor("black", alpha=0.3), main = "Eigen vectors, Segment 1", xlab = "Eigen Vector")
hist(l2[[3]], breaks = 50, col=adjustcolor("red", alpha=0.3), main = "Eigen vectors, Segment 2", xlab = "Eigen Vector")
hist(l3[[3]], breaks = 50, col=adjustcolor("green", alpha=0.3), main = "Eigen vectors, Segment 3", xlab = "Eigen Vector")
hist(l4[[3]], breaks = 50, col=adjustcolor("blue", alpha=0.3), main = "Eigen vectors, Segment 4", xlab = "Eigen Vector")

Edge density varies across the segments. Segment 3, being the smallest, can be expected to have a higher density (more people connected to each other). Comparing the density of segments 2 and 4 is interesting. The population of segment 4 is much smaller, yet its density is lower than that of segment 2. This indicates that individuals in segment 2 are more connected to (have worked with) each other than those in segment 4. This is reinforced by the diameter. The diameter of segment 4 is 8, compared to 4 for segment 2. This means that the farthest points of segment 2 are 4 steps apart, compared to 8 steps in segment 4 (despite its substantially lower population).

Another interesting observation is the assortativity of segment 3, which is the highest. The tendency to work with people who have a similar number of connections is higher in segment 3.

seg1<-c(l1[[4]],l1[[5]],l1[[8]],l1[[9]])
seg2<-c(l2[[4]],l2[[5]],l2[[8]],l2[[9]])
seg3<-c(l3[[4]],l3[[5]],l3[[8]],l3[[9]])
seg4<-c(l4[[4]],l4[[5]],l4[[8]],l4[[9]])
rname<-c("density", "diameter","assortativity", "transitivity")

tba<-data.frame(`segmnt 1`= seg1,`segmnt 2`= seg2,`segmnt 3`= seg3,`segmnt 4`= seg4)
rownames(tba)<-rname
tba

##                  segmnt.1    segmnt.2    segmnt.3     segmnt.4
## density        0.02286444  0.02057270  0.01073862  0.012139805
## diameter       5.00000000 13.00000000  5.00000000 12.000000000
## assortativity -0.10422588 -0.03089446 -0.04228715  0.008327399
## transitivity   0.21290278  0.21290278  0.21290278  0.212902779
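
As a reminder of what the first two rows measure: for an undirected graph with n vertices and m edges, edge density is m divided by the n(n-1)/2 possible edges, and the diameter is the length of the longest shortest path. A small sketch on made-up graphs (not the segments themselves):

```r
library(igraph)

path5 <- make_ring(5, circular = FALSE)  # five people in a chain, 4 edges
full5 <- make_full_graph(5)              # everyone has worked with everyone

data.frame(
  graph    = c("chain", "complete"),
  density  = c(edge_density(path5), edge_density(full5)),
  diameter = c(diameter(path5), diameter(full5))
)
```

With the same number of people, the sparser chain has the lower density and the larger diameter, which is the pattern used above to compare segments.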

Final plot (for fun)

For fun, the graph of the 50 individuals with the highest eigenvector scores is plotted to visualize the network among themselves.

ev<-sort(eigen.main$vector, decreasing = TRUE)[1:50]
ename<-c(names(ev))
n<-c(match(ename,V(gra.main)$name))

rock<-induced_subgraph(gra.main,vids=n)
rock.d<-degree(rock,mode = "all")
par(bg="black")
plot(rock,
     rescale= TRUE,
     vertex.label = ifelse(rock.d >= 11, names(V(rock)), NA),
     vertex.color = adjustcolor("gold", alpha.f = .5),
     vertex.size = ifelse(rock.d<10,0.1,sqrt(rock.d)),
     layout = layout_with_lgl(rock),
     vertex.label.cex= 0.75,
     vertex.label.degree=pi/2,
     vertex.label.dist=1,
     vertex.label.color="white",
     edge.curved=0.5,
     edge.width= 0.5,
     # colour each edge by its own count of joint works (edge attribute `works`)
     edge.color = adjustcolor(ifelse(E(rock)$works>8, "dark green", "dark red"),alpha.f=0.9),
     xlim = c(-1,1),
     #asp=-1,
     axes = FALSE)
title("Network of top guns",cex.main=1,col.main="white")
legend("topright", c("<=8 times",">8 times"), pch=21, col="black", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

This analysis can be extended further, especially by using clustering algorithms other than fast greedy. Moreover, a detailed analysis of ego graphs may reveal interesting insights. Don’t forget to share your results if you do any of it.

Contact me if

You want to understand how network analysis can help in your sales and marketing efforts.
You are looking to collaborate for some investigation/research.

Sources