Student Performance Indicators
Source: http://archive.ics.uci.edu/ml/datasets/Student+Performance
This project is based on two datasets of the academic performance of Portuguese students in two subjects: Math and Portuguese. Initially, I show how simple it is to predict student performance using linear regression. Later, I show that it is still possible, though more difficult, to predict the final grade without the Period 1 and Period 2 grades, and that what we learn from those predictions provides much deeper insight. I ask deeper questions about the mathematical structure of student performance and the potential indicators that could be used for early support and intervention.
Preparation
Load R and packages.

```python
%load_ext rpy2.ipython
```

```r
%%R
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(MASS))
suppressPackageStartupMessages(library(leaps))
suppressPackageStartupMessages(library(relaimpo))
suppressPackageStartupMessages(library(mgcv))
```

Read in data.

```r
%%R
student.mat <- read.csv("student-mat.csv", sep = ";")
student.por <- read.csv("student-por.csv", sep = ";")
head(student.mat)
```
Linear Model
To determine the best linear model, we will use student.mat as a training set and student.por as a test set.

```r
%%R
train <- student.mat
test <- student.por
```
Saturated Model
Let's fit a linear model with all of the variables. The saturated model will overfit the data, but it provides a baseline to test against.

```r
%%R
fit <- lm(G3 ~ ., train)
```
Compare Adjusted R², BIC, and Mallows' Cp With Best Subsets
Five variables give the lowest BIC and Mallows' Cp while keeping the adjusted R² near its optimum.

```r
%%R
subs <- regsubsets(G3 ~ ., data = train)
df <- data.frame(est = c(summary(subs)$adjr2, summary(subs)$cp, summary(subs)$bic),
                 x = rep(1:8, 3),
                 type = rep(c("adjr2", "cp", "bic"), each = 8))
qplot(x, est, data = df, geom = "line") +
  theme_bw() +
  facet_grid(type ~ ., scales = "free_y")
```

From the summary, we need to pick the top five variables. G1, G2, absences, and famrel will be the first four; the fifth will be either age or activities.

```r
%%R
fit <- lm(formula = G3 ~ ., data = train)
summary(fit)
```
ANOVA
The ANOVA test tells us that the best model is the one with age.

```r
%%R
model1 <- lm(G3 ~ G1 + G2 + absences + famrel + age, data = train)
model2 <- lm(G3 ~ G1 + G2 + absences + famrel + activities, data = train)
anova(fit, model1, model2)
```
Test Set
Very quickly, we have an accurate model that does a good job of predicting our test set. Notice how the darker, overplotted areas hug the line.
We can visually compare the final model against the saturated model by plotting predicted values versus actual values. The line represents a perfect model.
Note the outliers around actual values of 0. I will go into more detail about this group later in the project.

```r
%%R
# Saturated Model
control.model <- lm(G3 ~ ., data = test)
control.graph <- qplot(G3, predict(control.model), data = test, geom = "point",
                       position = "jitter", alpha = .5, main = "Saturated Model") +
  geom_abline(intercept = 0, slope = 1) +
  theme(legend.position = "none")

# Final Model
final.model <- lm(G3 ~ G1 + G2 + absences + famrel + age, data = test)
final.graph <- qplot(G3, predict(final.model), data = test, geom = "point",
                     position = "jitter", alpha = .5, main = "Final Model") +
  geom_abline(intercept = 0, slope = 1) +
  theme(legend.position = "none")

grid.arrange(control.graph, final.graph, nrow = 2)
```
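To put a number on "accurate", a simple RMSE helper can back up the visual comparison. This is a sketch: the commented lines assume the `test`, `control.model`, and `final.model` objects defined above; the live lines use toy values only.

```r
# Root-mean-square error: typical size of the prediction errors,
# in the same units as G3 (grade points)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Worked example on toy values: errors are 0, 0, 2, so
# RMSE = sqrt((0 + 0 + 4) / 3) ~ 1.155
rmse(c(1, 2, 3), c(1, 2, 5))

# Applied to the models above (assumes the objects from this notebook):
# rmse(test$G3, predict(control.model))  # saturated model
# rmse(test$G3, predict(final.model))    # final model
```

A lower RMSE on the test set means tighter clustering around the perfect-model line in the plots.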
Diagnostics
Overall, our model looks good. The main issue is the cluster where G3 is 0, which distorts the residuals at the lower end of the distribution.

```r
%%R
plot(final.model)
```
The 0-Cluster
Upon further inspection of the data, it becomes clear that this cluster most likely belongs to students who dropped the course.
- They have G1 and/or G2 grades but final grades of 0.
- There are no G1s of 0, but there are G2s with a value of 0.
- The exploratory model predicts these students as scoring between 0 and 10, which would constitute failing grades.
As a result, we should drop these data points before continuing, since they are not useful for the question we are researching.

```r
%%R
score0 <- subset(student.por, G3 == 0)
score0
```
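The bullet points above can be checked directly with simple counts. Here is the pattern on a small, made-up stand-in frame; the commented lines show how the same checks would run against `student.por`.

```r
# Hypothetical stand-in for the student data, just to show the checks
g <- data.frame(G1 = c(8, 10, 7), G2 = c(9, 0, 6), G3 = c(10, 0, 0))

sum(g$G1 == 0)      # count of first-period zeros (0 here)
sum(g$G2 == 0)      # count of second-period zeros (1 here)
subset(g, G3 == 0)  # the 0-cluster rows

# Against the real data:
# sum(student.por$G1 == 0)
# sum(student.por$G2 == 0)
```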
Final Model
Here is the final model for students who finish the course.

```r
%%R
# Final Model, fit to the math data with the 0-cluster removed
mat.no0 <- subset(train, G3 != 0)
final.model.no0 <- lm(G3 ~ G1 + G2 + absences + famrel + age, data = mat.no0)
qplot(G3, predict(final.model.no0), data = mat.no0, geom = "point",
      position = "jitter", alpha = .5, main = "Final Model") +
  geom_abline(intercept = 0, slope = 1) +
  theme(legend.position = "none")
```
Deeper Questions and Analysis
Our model does a great job of predicting student success; however, there are deeper questions it doesn't address. In particular, it doesn't show how to identify, early on, the students most likely to fail, before the best predictors in this model are available.
As we've seen, the best predictors of success are current grades within the course (G1 and G2), age, quality of family relationships, and absences.
Current grades, however, only become available once a problem already exists.
Let's see if we can determine which factors are more useful for preventing student failure and promoting academic success.
We'll start by fitting a linear model with all of the variables but removing our strongest indicators, G1 and G2, which overshadow other potential factors.

```r
%%R
fit <- lm(G3 ~ . - G1 - G2, student.mat)
```
Our predictions stop at 15, but actual scores rise to 20. Without G1 and G2, our model is unable to predict anything higher.
A score of 15 marks a clear dividing line where "potential" futures merge into current academic success. This line is important because it can help us determine what deeper differences separate successful students from their peers, and it gives us a working definition of a "successful" student.
For this section, it becomes clear that two models need to be analyzed: one for grades below 15 and another for grades of 15 and above.

```r
%%R
qplot(G3, predict(fit), data = student.mat, geom = "point",
      position = "jitter", alpha = .8) +
  geom_abline(intercept = 0, slope = 1) +
  theme(legend.position = "none")
```
Breaking Up the Analysis
So far, the data has shown that it should be broken into three parts in order to analyze deeper predictors of future success.

Students who drop
1. The first part isolates students who drop a course. Their final outcome is 0 even though they should have a higher predicted outcome. These students have predicted scores below 10.

Students who finish
2. Between 0 and 15, one set of predictors (one model) will be used to predict student outcomes.
3. Between 15 and 20, a different set of predictors (a different model) will be used.

```r
%%R
# Prep Data
score0 <- subset(student.mat, G3 == 0)
score.no0 <- subset(student.mat, G3 != 0)
score14 <- subset(score.no0, G3 < 15)
score15 <- subset(score.no0, G3 > 14)
```
Students Scoring 15 and Above
Students in this group have three things that stand out:
1. All of them have parents who live together.
2. None of them has had a past class failure.
3. All of them plan on seeking higher education.

```r
%%R
score15
```
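Rather than eyeballing the printed rows, these three observations can be verified with `all()`. Sketched on a stand-in frame with the same columns (in this dataset, `Pstatus == "T"` means the parents live together); the commented line shows the real checks against `score15`.

```r
# Hypothetical stand-in frame mimicking score15's columns
s <- data.frame(Pstatus = c("T", "T", "T"),
                failures = c(0, 0, 0),
                higher = c("yes", "yes", "yes"))

all(s$Pstatus == "T")    # parents live together -> TRUE
all(s$failures == 0)     # no past failures -> TRUE
all(s$higher == "yes")   # plan on higher education -> TRUE

# Real checks:
# all(score15$Pstatus == "T"); all(score15$failures == 0); all(score15$higher == "yes")
```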
Students Scoring 14 and Below
Create a training and test set for this group.

```r
%%R
set.seed(123)
inTraining <- createDataPartition(score14$G3, p = .75, list = FALSE)
training <- score14[inTraining, ]
testing <- score14[-inTraining, ]
```
Saturated Model
Below is a general model with all of our variables, fit on the training set. This helps determine which predictors are statistically significant.

```r
%%R
saturated14 <- lm(G3 ~ . - G1 - G2, data = training)
summary(saturated14)
```

Let's use the step function to find a cut-down version of this saturated model (Model 1) that removes unnecessary predictors.

```r
%%R
step(saturated14)
```
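For intuition, here is `step()` on a small simulated example (all names and data below are made up for illustration): backward search by AIC tends to drop predictors that carry no signal, like `x3` here.

```r
set.seed(7)
# y depends on x1 and x2; x3 is pure noise
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 1.5 * d$x1 - 2 * d$x2 + rnorm(80, sd = 0.2)

# trace = 0 suppresses the step-by-step printout
reduced <- step(lm(y ~ ., data = d), trace = 0)
names(coef(reduced))  # x1 and x2 survive; x3 is usually dropped
```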
Model 2
Model 2 is the output of the step function.

```r
%%R
model2 <- lm(formula = G3 ~ sex + address + studytime + failures + goout + absences,
             data = training)
```

```r
%%R
subs <- regsubsets(G3 ~ sex + address + studytime + failures + goout + absences,
                   data = training)
df <- data.frame(est = c(summary(subs)$adjr2, summary(subs)$bic),
                 x = rep(1:6, 2),
                 type = rep(c("adjr2", "bic"), each = 6))
qplot(x, est, data = df, geom = "line") +
  theme_bw() +
  facet_grid(type ~ ., scales = "free_y")
```

```r
%%R
summary(model2)
```
Model 3
Model 3 will be our final model.

```r
%%R
model3 <- lm(formula = G3 ~ sex + failures, data = training)
summary(model3)
```
ANOVA
We can now compare the three models using ANOVA.

```r
%%R
anova(saturated14, model2, model3)
```
In this case, ANOVA isn't very useful since the strongest predictors from the original model have been cut out. Comparing the models graphically gives a better idea of what's going on.
With the strong predictors removed, single predictors become less important and holistic models become more accurate. Below, we see that Model 1 performs the best on the test set.
This gives insight into how we should approach these students early on. No single indicator will make or break a child, but the overall profile can still be a strong signal.

```r
%%R
# Models (each formula refit on the test set)
final1 <- lm(G3 ~ . - G1 - G2, data = testing)
final2 <- lm(G3 ~ sex + address + studytime + failures + goout + absences, data = testing)
final3 <- lm(G3 ~ sex + failures, data = testing)

# Graphs
plot1 <- qplot(G3, predict(final1), data = testing, geom = "point",
               position = "jitter", alpha = .8, main = "Model 1") +
  geom_abline(intercept = 0, slope = 1) +
  theme(legend.position = "none")
plot2 <- qplot(G3, predict(final2), data = testing, geom = "point",
               position = "jitter", alpha = .8, main = "Model 2") +
  geom_abline(intercept = 0, slope = 1) +
  theme(legend.position = "none")
plot3 <- qplot(G3, predict(final3), data = testing, geom = "point",
               position = "jitter", alpha = .8, main = "Model 3") +
  geom_abline(intercept = 0, slope = 1) +
  theme(legend.position = "none")

grid.arrange(plot1, plot2, plot3, nrow = 2, top = "3 Models")
```
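An alternative way to compare models on held-out data is to keep the fits from the training set and only predict on the test set, so the comparison measures generalization rather than refit quality. The pattern, sketched on the built-in mtcars data standing in for the student data:

```r
set.seed(42)
# Split mtcars (32 rows) into a 24-row training set and an 8-row test set
idx <- sample(nrow(mtcars), 24)
tr <- mtcars[idx, ]
te <- mtcars[-idx, ]

# Fit on the training split only
m.full  <- lm(mpg ~ ., data = tr)        # saturated analogue
m.small <- lm(mpg ~ wt + hp, data = tr)  # reduced analogue

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Score both on the held-out rows
rmse(te$mpg, predict(m.full,  newdata = te))
rmse(te$mpg, predict(m.small, newdata = te))
```

The same pattern would apply here with `saturated14`, `model2`, and `model3` predicted on `testing` via `newdata`.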
The most important influencers of the holistic model are:
- The school the student attends
- Extra educational support
- Past failures
- Absences
- How often the student goes out
```r
%%R
tester <- lm(G3 ~ . - G1 - G2, data = ...
```