Using FAFSA Data to study Competitors – Part 2

btibert3

9 years ago

[This article was first published on Data Twirling » R, and kindly contributed to R-bloggers.]

I wanted to build upon my previous post and dive a little deeper into the sorts of questions we can answer using the FAFSA data supplied to us by applicants.

As a quick overview, students completing the FAFSA for student aid can list up to ten institutions on the form. I consider this the student’s consideration set. When aggregating these data, we can start to get a sense of the most frequently listed schools and how these institutions may be related.

With these data, you can manipulate the structure to answer a wide range of questions. One approach is to coerce the data into a network. For this task, I am going to use the statistical programming language R and the igraph library. The resulting network includes all schools listed (excluding the host institution), with weighted edges representing the number of co-occurrences.
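As a minimal sketch of that coercion, the snippet below turns a few made-up FAFSA lists into a weighted edge list, where each weight counts how many forms listed both schools. The school names, the `fafsa_lists` object, and the `pair_counts` helper are hypothetical; they are not the data or code behind this post.

```r
## hypothetical consideration sets: one character vector per FAFSA form,
## with the host institution already removed
fafsa_lists <- list(
  c("School A", "School B", "School C"),
  c("School A", "School B"),
  c("School B", "School C", "School D")
)

## count how many forms list each unordered pair of schools
pair_counts <- function(lists) {
  pairs <- lapply(lists, function(schools) {
    schools <- sort(unique(schools))
    if (length(schools) < 2) return(NULL)
    t(combn(schools, 2))            # one row per pair, in alphabetical order
  })
  pairs <- do.call(rbind, pairs)
  edges <- as.data.frame(table(from = pairs[, 1], to = pairs[, 2]),
                         stringsAsFactors = FALSE)
  names(edges)[3] <- "weight"
  edges[edges$weight > 0, ]         # drop pairs that never co-occur
}

edges <- pair_counts(fafsa_lists)
edges

## the edge list can then be handed to igraph, if it is installed;
## the third column becomes the "weight" edge attribute
if (requireNamespace("igraph", quietly = TRUE)) {
  g_sketch <- igraph::graph_from_data_frame(edges, directed = FALSE)
}
```

Sorting each list before pairing keeps "School A / School B" and "School B / School A" from being counted as two different edges.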

Listed below are some quick stats on my undirected network from the last few years:

Graph density: 0.05108093
Diameter: 5
Average Path Length: 2.418751
Transitivity (clustering coefficient): 0.3390529

Graph density is the ratio of the edges present to the total number of possible edges. For context, an edge is a connection between two schools. If you think of Facebook, you and your friends are connected by an edge. Diameter is the number of steps (edges) on the shortest path between the two farthest-apart nodes in the network. The Average Path Length is the average number of steps on the shortest path between any two schools. The clustering coefficient measures how much the nodes tend to cluster together: how often two schools that each co-occur with a third school also co-occur with each other.
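To make these definitions concrete, here is a small worked example on a made-up four-school network, where each value can be checked by hand. The `toy` graph is purely illustrative, and the function names are the current igraph equivalents of the ones used for the stats above (for example, `edge_density()` for `graph.density()` and `mean_distance()` for `average.path.length()`).

```r
library(igraph)

## hypothetical toy network: four schools, four co-occurrence edges
toy <- graph_from_literal(A - B, B - C, A - C, C - D)

n <- vcount(toy)                      # 4 schools
m <- ecount(toy)                      # 4 edges

## density: edges present / edges possible = 4 / 6
edge_density(toy)
m / (n * (n - 1) / 2)                 # same ratio, computed by hand

## diameter: the longest shortest path (A to D, or B to D, takes 2 steps)
diameter(toy, directed = FALSE)

## average path length: mean shortest path over all 6 pairs = 8 / 6
mean_distance(toy, directed = FALSE)

## transitivity: 3 * triangles / connected triples = 3 * 1 / 5
transitivity(toy, type = "global")
```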

Shown below is a plot of the graph, with each school sized by its PageRank score (computed with a function built into igraph).

It’s easy to see that there are a few key players in the FAFSA network; I consider these “core” competitors. More interesting to me, however, are the schools at the outer edge, as they are less common and speak to the choice set of an applicant.
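One way to look at the same split without a plot is to sort schools by their PageRank score. The sketch below does this on a made-up weighted edge list; the `edges_toy` data and school labels are hypothetical, and passing the co-occurrence counts as weights is an assumption rather than something the original analysis is known to have done.

```r
library(igraph)

## hypothetical weighted edge list (weight = number of shared FAFSA forms)
edges_toy <- data.frame(
  from   = c("A", "A", "B", "B", "C"),
  to     = c("B", "C", "C", "D", "E"),
  weight = c(5, 3, 4, 1, 1)
)
g_toy <- graph_from_data_frame(edges_toy, directed = FALSE)

## PageRank scores, with co-occurrence counts as edge weights (an assumption)
pr <- page_rank(g_toy, weights = E(g_toy)$weight)$vector

## schools sorted from most to least central; scores sum to 1
sort(pr, decreasing = TRUE)
```

Schools near the top of the sorted vector play the role of the "core" competitors, while those at the bottom are the less commonly listed schools at the outer edge.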

In summary, this post was intended to be a quick overview of how one might employ network analysis to study the schools commonly listed on the FAFSA form for your institution. In the future, I will take the same data and use association rules to find common patterns of school listings.

EDIT: Here are the code snippets that I used to generate the data and plot above:

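## assumes g is the undirected igraph object built from the FAFSA data
library(igraph)

## basic stats:
## density (graph.density)
graph.density(g)
## diameter
diameter(g, directed=FALSE)
## average path length (shortest.paths)
average.path.length(g, directed=FALSE)
## transitivity (clustering coefficient)
transitivity(g)
## radius
radius(g)
## degree distribution
plot(1-degree.distribution(g, cumulative=TRUE), type="l",
     xlab="degree", ylab="Cume Distribution", main="FAFSA Network")

## layout and PageRank scores
## NOTE: the right-hand sides of these two assignments are assumptions
## (layout.auto() and page.rank()$vector); the originals are not recoverable
g$layout <- layout.auto(g)
pagerank <- page.rank(g)$vector

## plot the network, sizing each school by its PageRank score
plot(g,
     vertex.size=pagerank*150,
     vertex.label=NA,
     vertex.color="red",
     vertex.frame.color="black",
     edge.arrow.size=0,
     edge.color=colors()[239],
     edge.width=.5,
     edge.curved=TRUE,
     layout=layout.auto(g))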
