Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Abstract
The California Deparment of Education recently (June 2012) had a news release on the increase in high school (grades 9-12) graduation rates and decrease in dropout rates. The data used by the Department was from two cohorts (4-year periods) on a statewide basis. While the statewide numbers show an increase (decrease) in graduation (dropout) rates, a particular school district’s numbers do not necessarily follow the statewide trend. Moreover, the longer-term data (about 20 years) on graduation and dropout rates has been ignored by the Department in their analyses. Following is my attempt to dig deeper into the data – just for kicks!
Overview
The California Deparment of Education recently (June 2012, news article) publicized the increase in high school graduation rates and decrease in dropout rates. Newspapers have also reported on this topic:
However, the data is available for only two cohorts (‘09 and ‘10). Moreover, the available cohort data is at an aggregated scale – school district or state, and not for each individual school.
I am interested in individual school performance – particulalry the five high schools in the Fremont Unified district, since I currently live here. The following is my attempt to dig deeper into this data, of course, using R. Below are the data sources. I downloaded all the files (2 on cohort rates, 1 on school list and 20 on dropout rates) before running the following R code.
Reproduce Dept of Education’s news release data
First, I try to reproduce the results from the news release.
rm(list = ls()) workDir <- "C:/Rstuff/DATA/CA Schools/" setwd(workDir) # read data coh09 <- read.csv("cohort09.txt", header = TRUE, sep = "\t", as.is = TRUE) coh10 <- read.csv("cohort10.txt", header = TRUE, sep = "\t", as.is = TRUE)
News release numbers are based on “Statewide” data. Extract only this subset and produce a table similar to the one in the news release.
subsetName <- "Statewide" # cohort09 data09 <- subset(coh09, Name == subsetName & (Subgroup == "All" | Subgrouptype == "All")) # cohort10 data10 <- subset(coh10, Name == subsetName & (Subgroup == "All" | Subgrouptype == "All"))
Reproduce table of Graduation Rates
grad09 <- data09[, c("Subgroup", "Subgrouptype", "NumGraduates", "Cohort.Graduation.Rate")] colnames(grad09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910") grad10 <- data10[, c("Subgroup", "Subgrouptype", "NumGraduates", "Cohort.Graduation.Rate")] colnames(grad10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011") # create a table similar to the news release gradTable <- merge(grad09, grad10) gradTable$Type <- c(gradTable$Subgrouptype[1:9], gradTable$Subgroup[10:16]) gradTable <- gradTable[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")] row.names(gradTable) <- gradTable$Type gradTable <- gradTable[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", "EL", "MIG", "SE", "SD"), ] #re-order to match news release order gradTable$Change <- as.numeric(gradTable$Per1011 - gradTable$Per0910) gradTable ## Type Per0910 Num0910 Per1011 Num1011 Change ## All All 74.72 378976 76.26 382558 1.54 ## 5 5 68.08 159364 70.41 167886 2.33 ## 1 1 67.32 2933 68.00 2692 0.68 ## 2 2 89.01 40384 89.66 39717 0.65 ## 3 3 72.33 2520 74.26 2432 1.93 ## 4 4 87.34 12002 89.00 12104 1.66 ## 6 6 60.54 25224 62.76 24917 2.22 ## 7 7 83.52 130539 85.42 124863 1.90 ## 9 9 82.80 4534 81.36 5311 -1.44 ## 0 0 53.81 1476 46.26 2636 -7.55 ## EL EL 56.39 53015 60.21 60280 3.82 ## MIG MIG 71.14 10790 71.86 9794 0.72 ## SE SE 56.72 34385 59.04 34156 2.32 ## SD SD 68.04 204189 69.94 219856 1.90
Reproduce table of Dropout Rates
drop09 <- data09[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")] colnames(drop09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910") drop10 <- data10[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")] colnames(drop10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011") # create a table similar to the news release dropTable <- merge(drop09, drop10) dropTable$Type <- c(dropTable$Subgrouptype[1:9], dropTable$Subgroup[10:16]) dropTable <- dropTable[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")] row.names(dropTable) <- dropTable$Type dropTable <- dropTable[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", "EL", "MIG", "SE", "SD"), ] #re-order to match news release order dropTable$Change <- as.numeric(dropTable$Per1011 - dropTable$Per0910) dropTable ## Type Per0910 Num0910 Per1011 Num1011 Change ## All All 16.6 84298 14.4 72314 -2.2 ## 5 5 20.8 48706 17.7 42126 -3.1 ## 1 1 22.1 962 20.7 818 -1.4 ## 2 2 7.2 3270 6.2 2764 -1.0 ## 3 3 19.6 683 17.4 571 -2.2 ## 4 4 7.8 1078 6.7 906 -1.1 ## 6 6 26.7 11143 24.7 9791 -2.0 ## 7 7 10.7 16759 8.9 12980 -1.8 ## 9 9 10.2 556 11.2 730 1.0 ## 0 0 41.6 1141 28.6 1628 -13.0 ## EL EL 29.0 27264 24.8 24858 -4.2 ## MIG MIG 18.8 2845 17.3 2352 -1.5 ## SE SE 21.9 13296 18.4 10628 -3.5 ## SD SD 20.1 60309 17.6 55483 -2.5
Note
- Data used by the news release appears to be similar, but not identical, to the data currently available on the web. The discrepancies in the above numbers and those in the news article are attrbuted to the slightly different datasets.
- Also, the sum of graduation percentage and dropout percentage for a category need not add up to 100 because some students take longer than 4 years to graduate and are not (yet) accounted here.
Repeat the analysis on Fremont Unified School District
subsetName <- "Fremont Unified" # cohort09 data09 <- subset(coh09, Name == subsetName & (Subgroup == "All" | Subgrouptype == "All")) # cohort10 data10 <- subset(coh10, Name == subsetName & (Subgroup == "All" | Subgrouptype == "All"))
Table of Graduation Rates
grad09 <- data09[, c("Subgroup", "Subgrouptype", "NumGraduates", "Cohort.Graduation.Rate")] colnames(grad09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910") grad10 <- data10[, c("Subgroup", "Subgrouptype", "NumGraduates", "Cohort.Graduation.Rate")] colnames(grad10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011") # create a table similar to the news release gradTable <- merge(grad09, grad10) gradTable$Type <- c(gradTable$Subgrouptype[1:9], gradTable$Subgroup[10:16]) gradTable <- gradTable[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")] row.names(gradTable) <- gradTable$Type gradTable <- gradTable[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", "EL", "MIG", "SE", "SD"), ] #re-order to match news release order gradTable$Change <- as.numeric(gradTable$Per1011 - gradTable$Per0910) gradTable ## Type Per0910 Num0910 Per1011 Num1011 Change ## All All 82.74 2177 89.34 1952 6.60 ## 5 5 69.06 279 79.42 220 10.36 ## 1 1 90.00 * 100.00 * 10.00 ## 2 2 89.63 1055 93.49 977 3.86 ## 3 3 80.95 17 68.42 13 -12.53 ## 4 4 80.37 131 92.25 119 11.88 ## 6 6 67.86 76 82.52 85 14.66 ## 7 7 82.19 586 87.98 505 5.79 ## 9 9 81.48 22 85.71 24 4.23 ## 0 0 50.00 * 83.33 * 33.33 ## EL EL 68.56 205 73.06 179 4.50 ## MIG MIG 71.43 * 85.71 * 14.28 ## SE SE 55.47 137 65.41 121 9.94 ## SD SD 72.69 503 79.42 436 6.73
Table of Dropout Rates
drop09 <- data09[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")] colnames(drop09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910") drop10 <- data10[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")] colnames(drop10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011") # create a table similar to the news release dropTable <- merge(drop09, drop10) dropTable$Type <- c(dropTable$Subgrouptype[1:9], dropTable$Subgroup[10:16]) dropTable <- dropTable[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")] row.names(dropTable) <- dropTable$Type dropTable <- dropTable[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", "EL", "MIG", "SE", "SD"), ] #re-order to match news release order dropTable$Change <- as.numeric(dropTable$Per1011 - dropTable$Per0910) dropTable ## Type Per0910 Num0910 Per1011 Num1011 Change ## All All 5.1 133 5.5 120 0.4 ## 5 5 9.7 39 13.4 37 3.7 ## 1 1 0.0 * 0.0 * 0.0 ## 2 2 1.7 20 2.5 26 0.8 ## 3 3 4.8 * 10.5 * 5.7 ## 4 4 4.9 * 6.2 * 1.3 ## 6 6 17.0 19 9.7 * -7.3 ## 7 7 5.8 41 5.9 34 0.1 ## 9 9 11.1 * 10.7 * -0.4 ## 0 0 50.0 * 0.0 * -50.0 ## EL EL 11.4 34 16.3 40 4.9 ## MIG MIG 7.1 * 14.3 * 7.2 ## SE SE 12.6 31 11.4 21 -1.2 ## SD SD 9.4 65 11.8 65 2.4
Some Observations
- Statewide graduation rates, consistent with the news release, have increased across the different ethinic and other demographic groups considered.
The Fremont Unified School District’s graduation rates show a similar pattern – increase in graduation rates across the groups.
Statewide dropout rates have decreased across most categories. However, Fremont Unified’s dropout rates have increased across most categories.
For whatever reason the results in the news release are not presented by gender.
Closer look at Fremont Unified’s High Schools
Now I use the enrollment and dropout numbers by year data, 1992-2011, to see the time series of dropout rates for the 5 high schools in Fremont Unified.
Identify high schools in Fremont Unified – I look at only public schools.
# school info from http://www.cde.ca.gov/ds/si/ds/pubschls.asp allSchools <- read.csv("pubschls.txt", header = TRUE, sep = "\t", as.is = TRUE, colClasses = "character") # Fremont specific data freSchools <- subset(allSchools, County == "Alameda" & District == "Fremont Unified" & StatusType == "Active" & SOC == 66) freSchools$School ## [1] "American High" "Irvington High" "John F. Kennedy High" ## [4] "Mission San Jose High" "Washington High"
For each year, extract the total dropouts (“DTOT”) and total enrollment (“ETOT”) for 9,10,11 and 12 grades. Estimate the average dropout rate as the fraction DTOT/ETOT.
# dropout data from http://www.cde.ca.gov/ds/sd/sd/filesdropouts.asp extns <- substr(as.character(seq(1992, 2011, by = 1)), 3, 4) freData <- NULL #stores summary for (eachYr in extns) { dropData <- read.csv(paste("dropouts", eachYr, ".txt", sep = ""), header = TRUE, sep = "\t", as.is = TRUE, colClasses = "character") # Fremont specific data yearData <- subset(dropData, CDS_CODE %in% freSchools$CDSCode) yearData$ETOT <- as.numeric(yearData$ETOT) yearData$DTOT <- as.numeric(yearData$DTOT) # total dropouts and total enrolled for each school by year yearSumm <- aggregate(cbind(DTOT, ETOT) ~ CDS_CODE + YEAR, data = yearData, FUN = sum) # update output freData <- rbind(freData, yearSumm) }
Plot the results.
schoolCodes <- levels(as.factor(freData$CDS_CODE)) schoolNames <- freSchools$School[schoolCodes %in% freSchools$CDSCode] png(filename = "fredrop.png") par(mfrow = c(5, 1), mar = c(1, 4, 3, 2)) yearNames <- seq(1992, 2011, by = 1) for (eachSchool in 1:length(schoolCodes)) { subData <- subset(freData, CDS_CODE == schoolCodes[eachSchool]) barplot(subData$DTOT * 100/subData$ETOT, xlab = "", ylim = c(0, 5), ylab = "drop %", names.arg = yearNames, main = schoolNames[eachSchool]) } garbage <- dev.off()
From the graph it is evident that between 2011 and 2010 American and Irvington’s droprates have decreased, while Kennedy, Mission and Washington’s droprates have increased. However, both increases and decreases in recent droprates are relatively small. Fremont schools have pretty low dropout rates!
Summary
While what I did here sheds light on some of the issues with the Department’s analyis, it is by no means complete. The University of California in Santa Barbara has an entire project dedicated to analyze the dropout rates – UCSB Dropout Research Project.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.