Abstract:
We use the RMariaDB and dbplyr packages to analyze the results database of the World Cube Association. In particular we are interested in finding unofficial world records for fastest 3x3x3 solves, countries with a large proportion of female cubers, as well as acceptable solving times before entering a WCA competition.
Motivation
An item on my bucket list of irrelevant-things-to-excel-in-during-parental-leave™ is to solve the Rubik’s cube again. In 2013 I learned the beginner’s method for solving the cube by reading Dan Harris’ book Speedsolving the Cube. However, any attempts to improve my times by learning CFOP failed miserably in 2013 due to lack of time. So in this 2019 attempt I just try to revive my beginner’s method and improve it slightly by 4LL. Nevertheless, my solves fluctuate in the 120-180s range. This has made me worry about the shame factor of pulling through the next item on the bucket list: competing in a World Cube Association (WCA) 3x3x3 competition. The motivating question of this post is therefore: how large a skill outlier would one be at such a competition with, say, a 180s average?
Mining the World Cube Association Database
Luckily, the WCA provides the results of all cubers and competitions in a downloadable SQL database (a 400 MB MySQL/MariaDB SQL text file). The downloadable version of 19 Apr 2019 contains tables with information on all 124,997 players and 6,014 competitions, as well as the 2,130,374 results obtained by players in competitions, starting with the Rubik’s Cube World Championship in 1982. Not big data, but big enough and structured in a way that justifies using a SQL database backend for the analysis instead of just a flat data.frame in R.
With a little bit of effort, one installs a MariaDB server together with the MariaDB Connector/C and imports the WCA data into the database. Subsequently, the RMariaDB (Müller et al. 2018) and DBI (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2018) R packages allow one to connect directly to the DB. Furthermore, the dbplyr (Wickham and Ruiz 2019) package allows one to write dplyr code, which is then translated into and evaluated as SQL queries behind the scenes.
We connect to the database as follows:
library(DBI)
library(RMariaDB)
con <- DBI::dbConnect(RMariaDB::MariaDB(), group="my-db")
where, as suggested in the RMariaDB package documentation, my-db is a custom group defined in the option file ~/.my.cnf containing the user, password, protocol type and the database to use (wca). This is preferred to calling dbConnect directly with options such as socket, user, password and dbname.
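For illustration, such a group in the option file and the explicit-argument alternative might look roughly like the sketch below; the user name, password and host are placeholders, and the exact argument names should be checked against ?RMariaDB::dbConnect.

## Hypothetical contents of ~/.my.cnf:
##   [my-db]
##   user     = wca_user
##   password = secret
##   database = wca
## Explicit-argument alternative (credentials end up in the script, which is
## why the option-file approach above is preferred):
con <- DBI::dbConnect(RMariaDB::MariaDB(),
                      username = "wca_user", password = "secret",
                      dbname = "wca", host = "localhost")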
To import the data one can either open a mysql shell and type SQL commands or send the SQL query from R to the database:
SOURCE WCA_export.sql;
After such an import we can view the contents of the database:
dbListTables(con)
## [1] "eligible_country_iso2s_for_championship" "Formats" "Countries"
## [4] "RoundTypes" "Rounds" "Competitions"
## [7] "Continents" "Persons" "championships"
## [10] "Events" "Scrambles" "RanksAverage"
## [13] "Results" "RanksSingle"
Of particular interest is the Results table, where each row contains the results of a player (personId) in a round of a WCA competition for events such as the 2x2x2, 3x3x3, 4x4x4, etc. For the 3x3x3 event each round consists of five solves (value1–value5), of which the best is reported in the column best and the trimmed mean (removing the best and worst solve and computing the mean of the three other solves) is reported in the column average.
## [1] "competitionId" "eventId" "roundTypeId" "pos" "best" "average" ## [7] "personName" "personId" "personCountryId" "formatId" "value1" "value2" ## [13] "value3" "value4" "value5" "regionalSingleRecord" "regionalAverageRecord"
Furthermore, the table Persons contains additional basic information about each player:
## [1] "id" "subid" "name" "countryId" "gender"
Note: At WCA competitions there is no separate ranking according to gender.
After getting initially acquainted with the database, we make two small SQL-side modifications. The first is to create indexes for the tables, which will substantially reduce the join times later on. The second is to merge the separate year, month and day columns of the competition date in the Competitions table into a single ISO 8601 date. See the GitHub code for details.
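A rough sketch of what these two modifications could look like via DBI::dbExecute; the index and table names here are illustrative, the actual statements are in the GitHub code.

## 1. Add indexes on the columns used in the joins below.
dbExecute(con, "CREATE INDEX idx_results_person ON Results (personId);")
dbExecute(con, "CREATE INDEX idx_results_competition ON Results (competitionId);")
## 2. Create a copy of Competitions with year, month, day merged into one date column.
dbExecute(con, "CREATE TABLE Competitions2 AS
                SELECT c.*, STR_TO_DATE(CONCAT(year, '-', month, '-', day), '%Y-%m-%d') AS date
                FROM Competitions c;")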
As the next step we create tbl objects to use with dbplyr. These simply dispatch to the corresponding database tables. For now, we work only with the results of the 3x3x3 cube, but we join the Results, Persons and Competitions tables in order to get the gender of each person as well as the date of the competition together with the time of each solve.
suppressPackageStartupMessages(library(dbplyr))
suppressPackageStartupMessages(library(dplyr))

results <- tbl(con, "Results") %>% filter(eventId == "333")
persons <- tbl(con, "Persons")
competitions <- tbl(con, "Competitions2")

## JOIN the three tables together
allResults <- results %>%
  inner_join(persons, by=c("personId"="id")) %>%
  inner_join(competitions, by=c("competitionId"="id"))
Note that dbplyr uses lazy evaluation, i.e. calls are not executed before they are needed. However, one can use show_query to inspect the SQL that would be sent to the database upon evaluation. In this example the query to get the results of female cubers, i.e.:
fwr_single <- allResults %>% filter(gender == "f", best>0)
in SQL looks like
fwr_single %>% show_query()
## <SQL>
## SELECT *
## FROM (SELECT `LHS`.`competitionId` AS `competitionId`, `LHS`.`eventId` AS `eventId`, `LHS`.`roundTypeId` AS `roundTypeId`, `LHS`.`pos` AS `pos`, `LHS`.`best` AS `best`, `LHS`.`average` AS `average`, `LHS`.`personName` AS `personName`, `LHS`.`personId` AS `personId`, `LHS`.`personCountryId` AS `personCountryId`, `LHS`.`formatId` AS `formatId`, `LHS`.`value1` AS `value1`, `LHS`.`value2` AS `value2`, `LHS`.`value3` AS `value3`, `LHS`.`value4` AS `value4`, `LHS`.`value5` AS `value5`, `LHS`.`regionalSingleRecord` AS `regionalSingleRecord`, `LHS`.`regionalAverageRecord` AS `regionalAverageRecord`, `LHS`.`subid` AS `subid`, `LHS`.`name` AS `name.x`, `LHS`.`countryId` AS `countryId.x`, `LHS`.`gender` AS `gender`, `RHS`.`name` AS `name.y`, `RHS`.`cityName` AS `cityName`, `RHS`.`countryId` AS `countryId.y`, `RHS`.`information` AS `information`, `RHS`.`year` AS `year`, `RHS`.`month` AS `month`, `RHS`.`day` AS `day`, `RHS`.`endMonth` AS `endMonth`, `RHS`.`endDay` AS `endDay`, `RHS`.`eventSpecs` AS `eventSpecs`, `RHS`.`wcaDelegate` AS `wcaDelegate`, `RHS`.`organiser` AS `organiser`, `RHS`.`venue` AS `venue`, `RHS`.`venueAddress` AS `venueAddress`, `RHS`.`venueDetails` AS `venueDetails`, `RHS`.`external_website` AS `external_website`, `RHS`.`cellName` AS `cellName`, `RHS`.`latitude` AS `latitude`, `RHS`.`longitude` AS `longitude`, `RHS`.`date` AS `date`
## FROM (SELECT `LHS`.`competitionId` AS `competitionId`, `LHS`.`eventId` AS `eventId`, `LHS`.`roundTypeId` AS `roundTypeId`, `LHS`.`pos` AS `pos`, `LHS`.`best` AS `best`, `LHS`.`average` AS `average`, `LHS`.`personName` AS `personName`, `LHS`.`personId` AS `personId`, `LHS`.`personCountryId` AS `personCountryId`, `LHS`.`formatId` AS `formatId`, `LHS`.`value1` AS `value1`, `LHS`.`value2` AS `value2`, `LHS`.`value3` AS `value3`, `LHS`.`value4` AS `value4`, `LHS`.`value5` AS `value5`, `LHS`.`regionalSingleRecord` AS `regionalSingleRecord`, `LHS`.`regionalAverageRecord` AS `regionalAverageRecord`, `RHS`.`subid` AS `subid`, `RHS`.`name` AS `name`, `RHS`.`countryId` AS `countryId`, `RHS`.`gender` AS `gender`
## FROM (SELECT *
## FROM `Results`
## WHERE (`eventId` = '333')) `LHS`
## INNER JOIN `Persons` AS `RHS`
## ON (`LHS`.`personId` = `RHS`.`id`)
## ) `LHS`
## INNER JOIN `Competitions2` AS `RHS`
## ON (`LHS`.`competitionId` = `RHS`.`id`)
## ) `dbplyr_001`
## WHERE ((`gender` = 'f') AND (`best` > 0.0))
Analysing the data
We are now ready to perform some descriptive analyses of the data. The top-5 female cubers according to their best 3x3x3 single solve are obtained by:
fwr_single %>%
  group_by(personId) %>%
  top_n(-1, best) %>%
  ungroup %>%
  arrange(best) %>%
  select(competitionId, personName, personCountryId, best, date) %>%
  top_n(-5, best)
## # Source: lazy query [?? x 5]
## # Database: mysql 10.3.14-MariaDB-log [hoehle@localhost:/wca]
## # Ordered by: best
## competitionId personName personCountryId best date
## <chr> <chr> <chr> <int> <date>
## 1 SlowNSteadySummer2017 Dana Yi USA 537 2017-07-01
## 2 BarbyCubeOpen2018 Juliette Sébastien France 581 2018-11-03
## 3 FruitSalad2018 Katie Hull USA 606 2018-03-03
## 4 CubeduKorea2018 Yu Da-Hyun (유다현) Korea 629 2018-01-26
## 5 NewYorkCitySpring2018 Kymberlyn Calderon USA 641 2018-04-21
Note that the solve times are given in centiseconds (i.e. 1/100th of a second). The above result can be compared with the corresponding information at https://wcadb.net.
As a professional data scientist one knows how important it is to understand the data generating process and to know one's data. So this is how the 5.37s solve by Dana Yi in 2017 looked:
The evolution of the 3x3x3 single solve WR
At this point we force an execution of the allResults query and cache the result as an object in R. This feels slightly disappointing, because the hope was to leave the data in the database management system (DBMS) as long as possible, but it seemed like the most efficient way to compute sequential ranks. It might have been possible to perform the sequential ranking directly with SQL statements, but I did not succeed in finding the correct approach within the available time.1
Instead, the results are sorted according to their date in R and each result is then checked to see whether its sequential rank is 1, i.e. whether the time is lower than that of all previous results. For this purpose a fast Rcpp function sequential_is_rank1 is provided, which computes the sequential ranks of a vector of values (see the GitHub code for details). Note: If we had not pulled the data into R at this point, such a computation within R would not have been possible.
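As a plain R illustration of the idea (the actual Rcpp implementation is in the GitHub code), a result is a new record if it is strictly better than everything observed before it:

## Pure R sketch of the sequential-rank-1 computation; variable names are
## illustrative and the date/best tie-breaking is an assumption.
is_new_record <- function(x) c(TRUE, x[-1] < cummin(x)[-length(x)])

wr_evolution <- allResults %>%
  filter(best > 0, gender %in% c("m", "f")) %>%
  collect() %>%                      ## force execution and cache in R
  arrange(date, best) %>%
  group_by(gender) %>%
  mutate(isWR = is_new_record(best)) %>%
  ungroup() %>%
  filter(isWR)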
The evolution of the world record for males and females over time is then easily plotted. Note: As mentioned, the WCA doesn't officially distinguish between male and female results.
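A minimal ggplot2 sketch of such a plot, based on the wr_evolution object from the sketch above:

library(ggplot2)
ggplot(wr_evolution, aes(x = date, y = best / 100, colour = gender)) +
  geom_step() +
  labs(x = "Date", y = "3x3x3 single record (s)", colour = "Gender")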
Countries with highest proportion of female cubers
Since the overall fraction of female cubers is around 10%, we determine the top-5 countries (with at least 50 cubers) having the highest proportion of female cubers in the Persons table (a dbplyr sketch of the query is shown after the table):
countryId | n_total | n_male | n_female | frac_female (in %) |
---|---|---|---|---|
Algeria | 157 | 103 | 54 | 34.4 |
Mongolia | 373 | 278 | 95 | 25.5 |
Morocco | 67 | 53 | 14 | 20.9 |
Pakistan | 54 | 42 | 12 | 22.2 |
Tunisia | 211 | 171 | 40 | 19.0 |
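A possible dbplyr formulation of this query (a sketch; the exact computation is in the GitHub code):

persons %>%
  group_by(countryId) %>%
  summarise(n_total  = n(),
            n_male   = sum(gender == "m", na.rm = TRUE),
            n_female = sum(gender == "f", na.rm = TRUE)) %>%
  filter(n_total >= 50) %>%
  mutate(frac_female = 100 * n_female / n_total) %>%
  arrange(desc(frac_female)) %>%
  head(5) %>%
  collect()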
I find this list quite surprising and also encouraging!
Skill level before entering a WCA competition
Getting back to the motivating question of this post, we derive for each result how experienced the cuber was at the time of obtaining the result. We shall measure experience in terms of years since the cuber's first WCA competition. Special interest will then be in the results of cubers at their very first WCA competition – see the GitHub code for details.
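A hedged dplyr sketch of how the experience variable could be derived; the column names are illustrative and the actual derivation is in the GitHub code.

experience_df <- allResults %>%
  filter(average > 0) %>%
  collect() %>%
  group_by(personId) %>%
  mutate(experience = as.numeric(date - min(date)) / 365.25) %>%  ## years since first competition
  ungroup()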
We then use this information to create a scatter plot of the result average (in seconds) against the corresponding experience of the cuber (in years). For better visualization we plot the marginal 5%, 10%, 50%, 90%, 95% quantiles as smooth functions of experience – this is done with ggplot2's geom_quantile function together with the argument method="rqss", which uses the rqss function of the quantreg package (Koenker 2018) to compute smooth quantile curves:
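A sketch of such a plot, reusing the experience_df object from above (the smoothing details are assumptions, not the post's exact settings):

ggplot(experience_df, aes(x = experience, y = average / 100)) +
  geom_point(alpha = 0.05) +
  geom_quantile(method = "rqss",
                quantiles = c(0.05, 0.10, 0.50, 0.90, 0.95)) +
  labs(x = "Experience (years since first WCA competition)",
       y = "Average (s)")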
## 5% 10% 50% 90% 95% 99% 99.9%
## 16 19 36 77 93 136 208
This shows that with a 180s average one is located at the 99.7% quantile of all cubers entering a WCA competition. In other words: if the comfort zone is defined as being within the 95% envelope, then a ~90s average is needed before entering a WCA competition.
To further investigate how cubers of that skill level evolve over time, we study the solving skills of cubers entering their first WCA competition with an average between 180s and 240s. In order to reduce the potential effect of secular trends due to, e.g., better cubes, we consider the skill evolution of the cohort of first-time cubers from 2015 onwards.
The cohort inclusion criteria yield a total of 174 first-time competitors in this skill bracket. Only 27.6% of these cubers decide to participate in further WCA competitions! The further development of the averages of these cubers is best shown in a trajectory plot. Note that the end of a line does not necessarily mean that the cuber stopped cubing; it could also be due to right truncation, because only competitions until 2019-04-19 are available.
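A hedged sketch of the cohort selection and the trajectory plot, reusing experience_df from above; the precise inclusion logic is in the GitHub code.

cohort <- experience_df %>%
  group_by(personId) %>%
  filter(min(date) >= as.Date("2015-01-01"),
         any(experience == 0 & between(average, 180 * 100, 240 * 100))) %>%
  ungroup()

p <- ggplot(cohort, aes(x = experience, y = average / 100)) +
  geom_line(aes(group = personId), alpha = 0.3) +
  labs(x = "Experience (years since first WCA competition)",
       y = "Average (s)")
p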
Instead of the trajectories we can also overlay an expectation smoother on top of the data to see how the expected average progresses with time in the cohort. Note that this portrays the marginal expectation and is thus only based on cubers who are still cubing at that time. No adjustment for any potentially informative drop-out is made.
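One way to add such a marginal smoother on top of the trajectories sketched above (geom_smooth with its default smoother, no drop-out adjustment):

p + geom_smooth(se = TRUE)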
From the figure we notice a rapid improvement during the first 6 months after entering the first WCA competition. Thereafter, results improve only slowly.
Discussion
Through analysis of the WCA results database it became clear that participating in a 3x3x3 event with a 180s average is uncool. The data also show that cubers entering the world of WCA competitions with such an average are likely to never participate in another WCA event. In case they do, their times drop to 90-120s averages within 6 months, after which they are stuck – no one cracks the 20s barrier. To conclude: it seems wise to practice more before going to the first WCA competition.
From a data science perspective this post provided insights on using dbplyr as a tidyverse frontend for SQL queries. Being an SQL novice, I learned that generating indexes is a must if you do not want to wait forever for your INNER JOINs. Furthermore, I ran into some trouble with mutates and filters with the package, because the produced SQL code gave empty results. It remains unclear to me whether this was due to differences between SQL DBMSs (e.g. MariaDB being different from PostgreSQL), my lack of SQL knowledge, or a shortcoming of the dbplyr package. My conclusion is that performing the data wrangling within the DBMS was overkill for the medium-sized WCA data. For larger or more structured datasets such an approach can, however, be fruitful, because a DBMS's main purpose is to provide efficient solutions for working with your data. Furthermore, I like the idea of having a familiar dplyr frontend and auto-generating efficient backend code (here: SQL) to do the actual wrangling. This also provides an opportunity to get an outside-R implementation, which can be an important aspect in the quality assurance of statistical analyses.
Acknowledgments
The terms of use of the WCA database request that any use of it be accompanied by the following text:
This information is based on competition results owned and maintained by the World Cube Association, published at https://worldcubeassociation.org/results as of April 19, 2019.
Besides this formal note, I thank the WCA Results Team for providing the WCA data for download in this comprehensive form!
Literature
Koenker, Roger. 2018. Quantreg: Quantile Regression. https://CRAN.R-project.org/package=quantreg.
Müller, Kirill, Jeroen Ooms, David James, Saikat DebRoy, Hadley Wickham, and Jeffrey Horner. 2018. RMariaDB: Database Interface and ’Mariadb’ Driver. https://CRAN.R-project.org/package=RMariaDB.
R Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and Kirill Müller. 2018. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.
Wickham, Hadley, and Edgar Ruiz. 2019. Dbplyr: A ’Dplyr’ Back End for Databases. https://CRAN.R-project.org/package=dbplyr.
If you know how to do this efficiently in SQL, let me know!