Visualizing threaded conversation volume and intensity

[This article was first published on SoMe Lab » r-project, and kindly contributed to R-bloggers.]

[Figure: first plot (red), a stacked spline plot of thread volume shaded by discussion intensity]

As a researcher who studies information flows in digital environments, I’m often looking for patterns in social trace data. For this discussion we can think of digital social trace data as the text that people post into threaded topics on forums, such as Reddit or a Talk page on Wikipedia. One way to find patterns in this kind of data is to make visualizations based on different quantifiable dimensions in the data, for example total topic volume per day, volume per thread per day, and, possibly, the intensity of the discussion (as interpreted by qualitative researchers). In the remainder of this post I will note what we can learn from our visualization, as well as its limitations, and then post the R code I used to make the plot.

This type of plot is called a stacked spline plot. The curved lines are known as “splines” (don’t know why) and each one encloses an area, in this case shaded with some value between red and white. The total number of posts on a given day is represented by the top spline in the plot. The volume for a given thread is the vertical (y-axis) distance between its spline and the spline below it (the top of a different thread), so each thread is stacked on the one below it. If thread X had 0 posts on a given day, and thread Y had 3 posts on that day, we would see thread Y mound upward, but all we would see for thread X is its grey spline. In fact, the first limitation of this kind of plot is that it is hard to differentiate between specific threads. You can work around this by assigning threads specific colors. We use colors differently here.
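To make the stacking concrete, here is a minimal sketch using a small made-up matrix (the counts and the thread names X, Y, and Z are hypothetical, not the data behind the plots). A cumulative sum across each day gives the height of every thread's upper boundary, which is the same idea the plotting function further down relies on.

```r
# Hypothetical counts: 3 days (rows) x 3 threads (columns)
posts <- matrix(c(2, 0, 1,
                  0, 3, 1,
                  4, 1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("day1", "day2", "day3"),
                                c("X", "Y", "Z")))

# Cumulative sum along each row: column k holds the top of thread k,
# i.e. the posts in thread k plus everything stacked below it.
stacked <- t(apply(posts, 1, cumsum))
print(stacked)

# The top boundary (last column) is the total number of posts per day.
print(stacked[, "Z"])
print(rowSums(posts))
```

On day2, thread X has 0 posts and thread Y has 3, so X's boundary sits at 0 while Y's rises to 3, which is the "mound" described above.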

In this plot I have shaded threads by how intense the discussion in the thread was. How do you do that? You could have qualitative researchers rate each post against some agreed-upon criteria for the vehemence of the discussion. I include this to show that we can combine qualitatively coded data with our quantitative representation of an information flow to give the data visualization more depth and meaning. It doesn’t have to be intensity. It could be any measure that researchers are interested in.
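As a rough sketch of how coded scores could become shading, the snippet below maps one score per thread to a color between white and red. The 1 to 5 scale and the scores themselves are assumptions made up for illustration; any coding scheme that can be rescaled to 0 to 1 would work the same way.

```r
# Hypothetical coded intensity, one score per thread, on an assumed 1-5 scale
intensity.score <- c(1, 3, 5, 2)

# Rescale to 0-1 so the scores can be used as color saturation
saturation <- (intensity.score - 1) / (5 - 1)

# h = 0 is red; with v = 1, saturation 0 gives white and 1 gives full red
thread.colors <- hsv(h = 0, s = saturation, v = 1)
print(thread.colors)
```

A vector like `thread.colors` is the kind of value the `col.vec` argument of the plotting function below expects.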

[Figure: second plot (blue), the same data drawn with polygons instead of splines]

One serious limitation of using splines is that the data points are only targets for the lines; the spline may not go all the way up or down to a point, so the exact value may not be faithfully represented. This happens because the spline algorithm smooths the points out. I’ve made a second version with polygons instead of splines to highlight this: compare October 24th and 25th in the first (red) and second (blue) plots. The data are the same, but the spline fails to capture the depth of the dip or the height of the second peak.
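The smoothing effect can be checked numerically. The sketch below is an illustration on made-up counts rather than the post's data, and it uses an open curve for simplicity where the plotting function closes the shape. It asks base R's `xspline()` for the curve coordinates instead of drawing them, then compares the highest data point with the highest point the spline reaches.

```r
# Hypothetical daily post counts for one thread (not the post's data)
y <- c(0, 2, 10, 1, 8, 0)
x <- seq_along(y)

# xspline() needs an active plot, so open a null PDF device (no file is
# written) and set up a coordinate system that covers the data.
pdf(file = NULL)
plot.new()
plot.window(xlim = range(x), ylim = range(y))

# Shape 0 at the two ends, .9 in between. draw = FALSE returns the
# curve's coordinates instead of drawing the curve.
curve.xy <- xspline(x, y, shape = c(0, rep(.9, length(y) - 2), 0),
                    open = TRUE, draw = FALSE)
invisible(dev.off())

peak.data   <- max(y)
peak.spline <- max(curve.xy$y)
cat("highest data point:         ", peak.data, "\n")
cat("highest point on the spline:", round(peak.spline, 2), "\n")
```

With a positive shape value the spline only approaches each point, so the spline's peak comes out below the data's peak; the polygon version has no such gap because it joins the points with straight lines.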

Still, this kind of visualization is good for giving a sense of the patterns of exchange in online forums, as well as for comparing one topic to another in the same forum. We may not be interested in exact numbers, but rather in exploring the data and making some initial observations. The third plot (orange) shows an alternate data set over the same time frame. Why does this second topic have such a different signature and much lower overall volume? Why the large volume right at the start of the first plot? Which are we more likely to see in most topics? Will it differ based on some sort of category of topics? If we knew the topics of discussion for these plots we could generate some observations and explanations. In this case, both data sets were generated for this example (the first is based on data from a real, controversial Wikipedia Talk page).

[Figure: third plot (orange), an alternate data set over the same time frame]

How did I make the plot? Below is the code I wrote in R to make it. Instead of writing about the code, I have inserted comments, so that if you download the script you will have the explanations handy and can modify it with an understanding of what I had intended. Let me know if you come up with something cool, and please acknowledge the Social Media Lab at UW if you use it as is.

First, the plotting function, then below, the calls to it.

# Author: Jeff Hemsley jhemsley at uw dot edu
# Twitter: @JeffHemsley
# Created: Sometime in 2010
#
# Function creates a stacked spline from a matrix of data. Can also use polygons
#   for straight lines instead of splines. The function is not mature in that it
#   hasn't been tested (or even written) to deal with a broad range of cases. I have
#   used it to plot a number of threads over time (days), but other applications
#   are certainly possible.
#
#  mat.in: matrix where rows are x-axis and columns are y-axis. Uses rownames
#          for x-axis labels. The cell values indicate the y value for a given
#          column (thread) at a point in time (day).
#  seriesborders: (FALSE) to print borders around the spline or just the fill.
#          default is NO borders.
#  splineseries: (TRUE) if true then spline else polygon
#  col.vec: vector of colors. If null, random colors are assigned to the (columns)
#          threads.
#  col.fg: The color for the spline border. Default is grey.
#  weekend: a vector with the same length as the number of rows in mat.in. Function
#          expects a 0 (no shading) or 1 (vertical shaded block). If the i-th value
#          in the vector is 1, then where x == i the function will shade. Used to
#          note weekends. Alternate colors currently not supported.
#
plotstackedseries <- function(mat.in, seriesborders=FALSE, splineseries=TRUE
      , col.vec=NULL, col.fg=NULL, weekend=NULL
      , f.main="Stacked Spline Plot", f.ylab="Counts", f.xlab="Days") {

  # accept a data frame (e.g. straight from read.csv) as well as a matrix
  mat.in <- as.matrix(mat.in)

  myfg <- par("fg")
  mybg <- par("bg")
  rows <- nrow(mat.in)
  cols <- ncol(mat.in)

  # Since we are plotting a shape instead of a line, we need an additional
  # start and end point for each thread. This allows us to close the shape.
  # We can adjust how curved the spline is relative to the original points.
  # At 0 there is no curve (sharp corners); below 0 the spline passes through
  # the points and can overshoot between them; above 0 it only approaches
  # them. I like .9.
  spline.shape <- c(0, rep(.9, rows), 0)

  # new matrix with a first and last row of zeros to close the splines
  zip.vec <- rep(0, cols)
  m <- rbind(zip.vec, mat.in, zip.vec)

  # Now. The way to do this is to over plot each thread on top of the last.
  # So the thread at the top has to be shaped according to its own cell
  # values as well as all the cells below it. In other words, we need to
  # do a cumulative sum on the rows (we expect these to represent days).
  m <- t(apply(m, 1, cumsum))

  # now we plot an empty plot that we'll put splines on later
  y.max <- max(m)
  x <- seq_len(rows)
  # all of our splines are x y plots, so we need xs.
  x.vec <- c(0, x, max(x))

  plot(x, mat.in[,1], type="n", ylim=c(0, y.max)
      , xaxt="n"
      , yaxt="n"
      , main=f.main
      , ylab=f.ylab
      , xlab=f.xlab
      , bty="n"
  )

  # custom axis stuff.
  my.at <- axTicks(1)
  if (my.at[1] == 0) {
    my.at[1] <- 1
  }
  # bottom axis labels are the row names of the matrix that was passed in
  axis(1, at=my.at, labels=rownames(mat.in)[my.at], line=-.7
       , las=0, tck=-.02, cex.axis=1.1, lwd=1) # las=2 for perpendicular
  # left hand y axis
  axis(2, at=axTicks(2), labels=axTicks(2), line=-.7
       , las=2, tck=0, cex.axis=1.1, lwd=0)

  # if no color vector is passed in, use random heat colors
  if (is.null(col.vec)) {
    col.vec <- sample(heat.colors(cols), size=cols, replace=FALSE)
  }

  # deal with borders.
  if (seriesborders==FALSE) {
    col.fg <- col.vec
  } else if (is.null(col.fg)) {
    col.fg <- myfg
  }

  # print weekends?
  if (!is.null(weekend)) {
    y.label.max <- max(axTicks(2))
    for (i in 1:rows) {
      if (weekend[i] > 0) {
        x.weekend <- c(i, i, i+1, i+1)
        y.weekend <- c(0, y.label.max, y.label.max, 0)
        polygon(x.weekend, y.weekend, col=rgb(0,0,1,.1), border=NA)
      }
    }
  }

  # for the series, we work from back to front. We plot the
  # tallest filled in spline shape first, followed by the next
  # tallest and so on. So each spline series is plotted over the
  # last one. Nifty.
  for (i in cols:1) {

    if (length(col.fg)==1) {
      local.fg <- col.fg
    } else {
      local.fg <- col.fg[i]
    }

    # we can plot a spline or a polygon
    if (splineseries==TRUE) {
      xspline(x.vec, m[,i], col=col.vec[i], shape=spline.shape, open=FALSE
              , border=local.fg, lty=1, lwd=.5)
    } else {
      polygon(x.vec, m[,i], col=col.vec[i], border=local.fg)
    }
  }

  par(fg=myfg, bg=mybg)
}

Below is the code I used to call the function.

# Author: Jeff Hemsley jhemsley at uw dot edu
# Twitter: @JeffHemsley
# Created: Sometime in 2010
#
# the location of the files I used are at:
# http://somelab.net/wp-content/uploads/2013/01/thread_data_v7.csv
# http://somelab.net/wp-content/uploads/2013/01/thread_data_v6.csv
# note that rows are dates and cols are threads: series in cols.
# Grab these files or make your own and fix the path info below
#
f.path <- "c:/r/RUserGrp/"
f.name <- "thread_data_v7.csv"
f.raw.thread.mat <- paste(f.path, f.name, sep="")
plot.file.name <- paste(f.path, "topic5.png", sep="")

# read.csv already returns a data frame; the first column holds the row names
raw.thread.mat <- read.csv(file=f.raw.thread.mat, header=TRUE, row.names=1)
num.conversations <- ncol(raw.thread.mat)
num.posts <- sum(raw.thread.mat)

# and, I already checked when the weekend dates are, so this vector
# will highlight the weekends in the plot: 1: time off, 0: work
# (one entry per row of the data)
week.end <- c(1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1
              , 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0)

# for these examples, the intensity is random binomial distributed
# because it seems to loosely resemble actual data
conversation.intensity <- rbinom(num.conversations, 10, 0.3)
# since we are going to color the threads based on this, let's
# normalize the values 0 - 1. max(..., 1) avoids dividing by zero
# in the unlikely case that every draw is 0.
conversation.intensity <- conversation.intensity/max(conversation.intensity, 1)

# I usually use rgb colors, but hsv allows me to work in monochrome
# h=0: red, .1: gold/yellow, .3: green, .6: blue, .7: purplish
# v: 1 is brightest
# s (saturation): conversation.intensity. 0 to 1, with 0 being none
conversation.intensity.color <- hsv(h=.1, s=conversation.intensity, v=1, alpha=1)

# I sent this out to a png, but it looks pretty nifty sent to pdf or svg
png(filename=plot.file.name, width=1024, height=768)
plotstackedseries(raw.thread.mat, seriesborders=TRUE, splineseries=TRUE,
                  col.vec=conversation.intensity.color,
                  col.fg=rgb(.5,.5,.5,.5), weekend=week.end,
                  f.main="Talk threads by volume and intensity",
                  f.ylab="Posts", f.xlab="Days"
)

# paste0 so the pieces are not separated by extra spaces
sub.title <- paste0("1 topic, ", num.conversations, " threads, ", num.posts, " posts")
mtext(sub.title, side=3)
dev.off()
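
The `week.end` vector in the script is typed in by hand. If your row names happen to be dates that R can parse, the flags could be derived instead. The sketch below is only an illustration: the dates are made up, and the ISO `YYYY-MM-DD` format is an assumption, so check how the dates in your own file are written.

```r
# Hypothetical row names: one ISO-formatted date per day (assumed format)
day.labels <- format(seq(as.Date("2013-01-01"), by = "day", length.out = 7))
day.dates  <- as.Date(day.labels)

# POSIXlt weekdays run from 0 (Sunday) to 6 (Saturday)
week.end <- as.integer(as.POSIXlt(day.dates)$wday %in% c(0, 6))

print(data.frame(day = day.labels, weekend = week.end))
```

With real data, `day.labels` would be replaced by something like `rownames(raw.thread.mat)`, provided those row names parse as dates.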
