[This article was first published on Jeromy Anglim's Blog: Psychology and Statistics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The Winter Olympics are on. The Guardian’s DataBlog has graciously compiled a database on Winter Olympic Medals. Thus, I thought I’d run a few quick analyses on the data in R. In this post I was hoping to show how one could quickly churn out some basic analyses (and answer some interesting questions) using R.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
First, a disclaimer: I ran these analyses in about 45 minutes. Thus, I make no claims of perfect accuracy or in the source data provided by the Guardian. The data also does not include 2010 medals.
Below you will see:
- The R Console input and output
- The plots
- The source code on its own
CONSOLE INPUT AND OUTPUT
> # tips on reading a Google Spreadsheet: > # http://blog.revolution-computing.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html > # Data taken from:"https://spreadsheets.google.com/ccc?key=0AgdO92JOXxAOdDVlaUpkNlB2WERtV3l1ZVFYbzllQWc" > # http://www.guardian.co.uk/news/datablog/2010/feb/11/winter-olympics-medals-by-country > > googleLink <- "http://spreadsheets.google.com/pub?key=tsddww6vOYePkhPSxRpDeYw&single=true&gid=1&output=csv" > medals <- read.csv(googleLink, stringsAsFactors = FALSE) > savePlot <- TRUE # optional variable used to save or not save plots in code > > # remove rows that do not contain data > medals$Year <- as.numeric(medals$Year) Warning message: NAs introduced by coercion > medals <- medals[!is.na(medals$Year), ] > > > # Quick look at data > head(medals) Year City Sport Discipline NOC Event Event.gender 1 1924 Chamonix Skating Figure skating AUT individual M 2 1924 Chamonix Skating Figure skating AUT individual W 3 1924 Chamonix Skating Figure skating AUT pairs X 4 1924 Chamonix Bobsleigh Bobsleigh BEL four-man M 5 1924 Chamonix Ice Hockey Ice Hockey CAN ice hockey M 6 1924 Chamonix Biathlon Biathlon FIN military patrol M Medal 1 Silver 2 Gold 3 Gold 4 Bronze 5 Gold 6 Silver > sapply(medals, function(x) cbind(sort(table(x), decreasing = TRUE))) $Year [,1] 2006 252 2002 234 1998 205 1994 183 1992 171 1988 138 1984 117 1980 115 1976 111 1968 106 1972 105 1964 103 1960 81 1956 72 1948 68 1952 67 1936 51 1924 49 1932 42 1928 41 $City [,1] Turin 252 Salt Lake City 234 Innsbruck 214 Nagano 205 Lillehammer 183 Albertville 171 Lake Placid 157 Calgary 138 Sarajevo 117 St. Moritz 109 Grenoble 106 Sapporo 105 Squaw Valley 81 Cortina d'Ampezzo 72 Oslo 67 Garmisch-Partenkirchen 51 Chamonix 49 $Sport [,1] Skiing 1060 Skating 758 Biathlon 162 Bobsleigh 133 Luge 108 Ice Hockey 69 Curling 21 $Discipline [,1] Speed skating 455 Cross Country S 399 Alpine Skiing 367 Figure skating 207 Biathlon 162 Bobsleigh 115 Ski Jumping 114 Luge 108 Short Track S. 96 Nordic Combined 84 Ice Hockey 69 Freestyle Ski. 54 Snowboard 42 Curling 21 Skeleton 18 $NOC [,1] NOR 280 USA 216 URS 194 AUT 185 GER 158 FIN 151 CAN 119 SUI 118 SWE 118 GDR 110 ITA 101 FRA 83 NED 78 RUS 76 FRG 41 CHN 33 JPN 32 KOR 31 TCH 25 EUN 23 GBR 21 EUA 19 CZE 10 LIE 9 POL 8 CRO 7 AUS 6 BLR 6 BUL 6 EST 6 HUN 6 BEL 5 KAZ 5 UKR 5 SLO 4 YUG 4 ESP 2 LUX 2 PRK 2 DEN 1 LAT 1 NZL 1 ROU 1 SVK 1 UZB 1 $Event [,1] individual 195 500m 133 1500m 111 downhill 97 slalom 96 1000m 94 giant slalom 90 5000m 78 singles 72 ice hockey 69 10km 60 50km 60 K90 individual (70m) 60 pairs 60 10000m 57 two-man 57 four-man 55 4x10km relay 51 15km 48 20km 45 4x7.5km relay 42 alpine combined 42 3000m 39 30km mass start 39 doubles 36 K120 individual (90m) 36 super-G 36 5km 30 moguls 30 4x5km relay 27 ice dancing 27 aerials 24 curling 21 10km pursuit 18 18km 18 Half-pipe 18 K120 team (90m) 18 Team 18 15km mass start 15 3000m relay 15 30km 15 3x5km relay 15 5000m relay 15 7.5km 15 Giant parallel slalom 12 Combined 10km + 15km pursuit 9 Combined 5km + 10km pursuit 9 12.5km pursuit 6 Alpine combined 6 giant-slalom 6 Snowboard Cross 6 Sprint 1,5km 6 sprint 1.5km 6 Team pursuit 6 Team sprint 6 12,5km mass start 3 3x7.5km relay 3 4x6km relay 3 5km pursuit 3 combined (4 events) 3 Combined 15 + 15km mass start 3 Combined 7.5 + 7.5km mass start 3 five-man 3 Individual 3 Individual sprint 3 military patrol 3 sprint 3 $Event.gender [,1] M 1386 W 802 X 123 $Medal [,1] Gold 774 Silver 773 Bronze 764 > > > # How many medals have been awarded in each Olympics? > medalsByYear <- aggregate(medals$Year, list(Year = medals$Year), length) > if (savePlot == TRUE) png("fig1.png") > plot(x ~ Year, medalsByYear, ylim = c(0,max(x)), + ylab = "Total Medals Awarded", bty="l", + main = "Total Medals Awarded in Winter Olympics by Year") > if (savePlot == TRUE) dev.off() windows 2 > > # How has the amount of medals awarded to males and females changed over the years? > # Get data. > medalsByYearByGender <- aggregate(medals$Year, + list(Year = medals$Year, Event.gender = medals$Event.gender), length) > medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender != "X", ] > > # Plot results. > if (savePlot == TRUE) png("fig2.png") > plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender == "M", ], + ylim = c(0,max(x)), pch = "m", col = "blue", + ylab = "Total Medals Awarded", bty="l", + main = "Total Medals Awarded in Winter Olympicsn by Gender and by Year") > points(medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"], + medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"], + col = "red", pch = "f") > if (savePlot == TRUE) dev.off() windows 2 > > # Table of proportion female > propFemalePerYear <- medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] / ( + medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] + + medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"]) > propFemalePerYear <- round(propFemalePerYear, 2) > cbind(Year = medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"], + PropFemale = propFemalePerYear) Year PropFemale [1,] 1924 0.07 [2,] 1928 0.08 [3,] 1932 0.08 [4,] 1936 0.12 [5,] 1948 0.18 [6,] 1952 0.23 [7,] 1956 0.26 [8,] 1960 0.38 [9,] 1964 0.37 [10,] 1968 0.37 [11,] 1972 0.36 [12,] 1976 0.35 [13,] 1980 0.34 [14,] 1984 0.36 [15,] 1988 0.37 [16,] 1992 0.43 [17,] 1994 0.43 [18,] 1998 0.44 [19,] 2002 0.45 [20,] 2006 0.46 > > > # Which countries have won the most medals? > sort(table(medals$NOC), dec = TRUE) NOR USA URS AUT GER FIN CAN SUI SWE GDR ITA FRA NED RUS FRG CHN JPN KOR TCH EUN 280 216 194 185 158 151 119 118 118 110 101 83 78 76 41 33 32 31 25 23 GBR EUA CZE LIE POL CRO AUS BLR BUL EST HUN BEL KAZ UKR SLO YUG ESP LUX PRK DEN 21 19 10 9 8 7 6 6 6 6 6 5 5 5 4 4 2 2 2 1 LAT NZL ROU SVK UZB 1 1 1 1 1 > > > # Of the countries that have won more than 50 medals, > # which have the highest percentage of gold medals? > NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50]) > medalsSubset <- medals[medals$NOC %in% NOC50Plus, ] > medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal), margin = 1) > medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"], + decreasing = TRUE), c("Gold", "Silver", "Bronze")] > round(medalsByMedalByNOC, 2) Gold Silver Bronze RUS 0.43 0.32 0.25 URS 0.40 0.29 0.30 GER 0.37 0.37 0.26 SWE 0.36 0.26 0.37 USA 0.36 0.37 0.27 ITA 0.36 0.31 0.34 GDR 0.35 0.33 0.32 NOR 0.35 0.35 0.30 SUI 0.32 0.31 0.36 NED 0.32 0.38 0.29 CAN 0.32 0.32 0.36 FRA 0.30 0.29 0.41 AUT 0.28 0.35 0.38 FIN 0.27 0.38 0.34 > > > # How many different countries have won medals by year? > listOfYears <- unique(medals$Year) > names(listOfYears) <- unique(medals$Year) > totalNocByYear <- sapply(listOfYears, function(X) + length(table(medals[medals$Year == X, "NOC"]))) > > # Table > totalNocByYear 1924 1928 1932 1936 1948 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988 1992 10 12 10 11 13 14 13 14 14 15 17 16 19 17 17 20 1994 1998 2002 2006 22 24 24 26 > > # Plot > if (savePlot == TRUE) png("fig3.png") > plot(x= names(totalNocByYear), totalNocByYear, + ylim = c(0, max(totalNocByYear)), + xlab = "Year", + ylab = "Total Number of Countries", + bty = "l", + main = "Total Number of Countriesn Winning Medals By Year") > if (savePlot == TRUE) dev.off() windows 2 > > # Which Countries have won a medal at every Olympics? > propYearsOnePlusMedals <- apply(table(medals$NOC, medals$Year) > 0, 1, mean) > > #Answer > names(propYearsOnePlusMedals[propYearsOnePlusMedals == 1.0]) [1] "AUT" "CAN" "FIN" "NOR" "SWE" "USA" > > # Table Sorted by Proportion of Olympics with a Medal > cbind(sort(propYearsOnePlusMedals, decreasing = TRUE)) [,1] AUT 1.00 CAN 1.00 FIN 1.00 NOR 1.00 SWE 1.00 USA 1.00 FRA 0.95 SUI 0.95 ITA 0.80 GBR 0.65 NED 0.65 TCH 0.55 JPN 0.50 GER 0.45 URS 0.45 FRG 0.35 GDR 0.30 HUN 0.30 CHN 0.25 KOR 0.25 POL 0.25 AUS 0.20 BEL 0.20 BLR 0.20 BUL 0.20 LIE 0.20 RUS 0.20 CZE 0.15 EUA 0.15 UKR 0.15 CRO 0.10 ESP 0.10 EST 0.10 KAZ 0.10 PRK 0.10 SLO 0.10 YUG 0.10 DEN 0.05 EUN 0.05 LAT 0.05 LUX 0.05 NZL 0.05 ROU 0.05 SVK 0.05 UZB 0.05
THE PLOTS
THE R SOURCE CODE
# tips on reading a Google Spreadsheet: # http://blog.revolution-computing.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html # Data taken from:"https://spreadsheets.google.com/ccc?key=0AgdO92JOXxAOdDVlaUpkNlB2WERtV3l1ZVFYbzllQWc" # http://www.guardian.co.uk/news/datablog/2010/feb/11/winter-olympics-medals-by-country googleLink <- "http://spreadsheets.google.com/pub?key=tsddww6vOYePkhPSxRpDeYw&single=true&gid=1&output=csv" medals <- read.csv(googleLink, stringsAsFactors = FALSE) savePlot <- TRUE # optional variable used to save or not save plots in code # remove rows that do not contain data medals$Year <- as.numeric(medals$Year) medals <- medals[!is.na(medals$Year), ] # Quick look at data head(medals) sapply(medals, function(x) cbind(sort(table(x), decreasing = TRUE))) # How many medals have been awarded in each Olympics? medalsByYear <- aggregate(medals$Year, list(Year = medals$Year), length) if (savePlot == TRUE) png("fig1.png") plot(x ~ Year, medalsByYear, ylim = c(0,max(x)), ylab = "Total Medals Awarded", bty="l", main = "Total Medals Awarded in Winter Olympics by Year") if (savePlot == TRUE) dev.off() # How has the amount of medals awarded to males and females changed over the years? # Get data. medalsByYearByGender <- aggregate(medals$Year, list(Year = medals$Year, Event.gender = medals$Event.gender), length) medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender != "X", ] # Plot results. if (savePlot == TRUE) png("fig2.png") plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender == "M", ], ylim = c(0,max(x)), pch = "m", col = "blue", ylab = "Total Medals Awarded", bty="l", main = "Total Medals Awarded in Winter Olympicsn by Gender and by Year") points(medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"], medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"], col = "red", pch = "f") if (savePlot == TRUE) dev.off() # Table of proportion female propFemalePerYear <- medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] / ( medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] + medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"]) propFemalePerYear <- round(propFemalePerYear, 2) cbind(Year = medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"], PropFemale = propFemalePerYear) # Which countries have won the most medals? sort(table(medals$NOC), dec = TRUE) # Of the countries that have won more than 50 medals, # which have the highest percentage of gold medals? NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50]) medalsSubset <- medals[medals$NOC %in% NOC50Plus, ] medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal), margin = 1) medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"], decreasing = TRUE), c("Gold", "Silver", "Bronze")] round(medalsByMedalByNOC, 2) # How many different countries have won medals by year? listOfYears <- unique(medals$Year) names(listOfYears) <- unique(medals$Year) totalNocByYear <- sapply(listOfYears, function(X) length(table(medals[medals$Year == X, "NOC"]))) # Table totalNocByYear # Plot if (savePlot == TRUE) png("fig3.png") plot(x= names(totalNocByYear), totalNocByYear, ylim = c(0, max(totalNocByYear)), xlab = "Year", ylab = "Total Number of Countries", bty = "l", main = "Total Number of Countriesn Winning Medals By Year") if (savePlot == TRUE) dev.off() # Which Countries have won a medal at every Olympics? propYearsOnePlusMedals <- apply(table(medals$NOC, medals$Year) > 0, 1, mean) #Answer names(propYearsOnePlusMedals[propYearsOnePlusMedals == 1.0]) # Table Sorted by Proportion of Olympics with a Medal cbind(sort(propYearsOnePlusMedals, decreasing = TRUE))
To leave a comment for the author, please follow the link and comment on their blog: Jeromy Anglim's Blog: Psychology and Statistics.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.