literacy rates using semantics and R
Somehow I stumbled into the world of linked open data while trying to pull information off a Wikipedia page without having to write a custom scraper. Enter DBpedia, semantic technologies, and some wonderful R packages that take care of the back-end coding.
The Research Group Data and Web Science at the University of Mannheim has exposed a SPARQL endpoint for the CIA Factbook.
Using this endpoint and the following query, I was able to quickly pull the gender-specific literacy rates:
PREFIX db: <http://wifo5-04.informatik.uni-mannheim.de/factbook/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX d2r: <http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/config.rdf#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX map: <file:/var/www/wbsg.de/factbook/factbook.n3#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX factbook: <http://wifo5-04.informatik.uni-mannheim.de/factbook/ns#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?label ?litMale ?litFemale
       ((?litMale - ?litFemale) AS ?litDiff)
WHERE {
  ?resource factbook:literacy_female ?litFemale;
            factbook:literacy_male ?litMale;
            rdfs:label ?label .
}
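For readers who want to try the query from R, here is a minimal sketch of how it might be sent to the endpoint. It assumes the CRAN `SPARQL` package (the post does not name the packages it used) and an endpoint URL of `.../factbook/sparql`, which is a guess based on the resource prefix above and may no longer be live. `factbook_endpoint`, `literacy_query` and `fetch_literacy` are names made up for this illustration, and only the two prefixes the query actually uses are kept.

```r
# Assumed endpoint URL: inferred from the resource prefix, not stated in the post.
factbook_endpoint <- "http://wifo5-04.informatik.uni-mannheim.de/factbook/sparql"

# The query from the post, trimmed to the prefixes it actually uses.
literacy_query <- paste(
  "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
  "PREFIX factbook: <http://wifo5-04.informatik.uni-mannheim.de/factbook/ns#>",
  "SELECT DISTINCT ?label ?litMale ?litFemale",
  "       ((?litMale - ?litFemale) AS ?litDiff)",
  "WHERE {",
  "  ?resource factbook:literacy_female ?litFemale ;",
  "            factbook:literacy_male ?litMale ;",
  "            rdfs:label ?label .",
  "}",
  sep = "\n"
)

# Hypothetical helper: sends the query and returns the result table.
# SPARQL::SPARQL() returns a list; its $results element is a data frame.
fetch_literacy <- function(endpoint = factbook_endpoint, query = literacy_query) {
  if (!requireNamespace("SPARQL", quietly = TRUE)) {
    stop("Install the 'SPARQL' package to run this query.")
  }
  SPARQL::SPARQL(url = endpoint, query = query)$results
}

# Usage (needs network access and a live endpoint):
# literacy <- fetch_literacy()
# head(literacy)
```

Keeping the network call inside a function means the script can be sourced safely even when the endpoint is unreachable.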
What’s the next logical step after getting data back in tabular form?
Visualization using ggplot2!
Female literacy rates are on the x-axis and male literacy rates on the y-axis. The size of each country name represents the gap between the two rates, and the color of the name is based on the relative “strength” of the gender difference.
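A plot along the lines described above could be sketched roughly as follows. This is not the code from the repo: it assumes a data frame with the query's column names (`label`, `litMale`, `litFemale`), and it reads "strength" as the signed male-minus-female difference, which is only one plausible interpretation. `add_gap_columns` and `plot_literacy` are hypothetical helper names.

```r
# Hypothetical helper: derive the columns that drive text size and colour.
add_gap_columns <- function(df) {
  # Results may arrive as character or factor, so coerce defensively.
  male   <- as.numeric(as.character(df$litMale))
  female <- as.numeric(as.character(df$litFemale))
  df$litMale   <- male
  df$litFemale <- female
  df$litDiff   <- male - female      # signed gap (assumed meaning of "strength")
  df$gapSize   <- abs(df$litDiff)    # magnitude of the gap, used for text size
  df
}

# Hypothetical helper: country names placed at (female, male) literacy.
plot_literacy <- function(df) {
  df <- add_gap_columns(df)
  ggplot2::ggplot(
    df,
    ggplot2::aes(x = litFemale, y = litMale, label = label)
  ) +
    # Dashed diagonal marks equal male and female rates.
    ggplot2::geom_abline(slope = 1, intercept = 0,
                         linetype = "dashed", colour = "grey60") +
    ggplot2::geom_text(ggplot2::aes(size = gapSize, colour = litDiff)) +
    ggplot2::scale_colour_gradient2(low = "blue", mid = "grey50",
                                    high = "red", midpoint = 0) +
    ggplot2::labs(x = "Female literacy rate", y = "Male literacy rate",
                  size = "Gap", colour = "Male - female")
}

# Usage, given a result table from the query:
# print(plot_literacy(literacy))
```

Names that sit far above the dashed diagonal are countries where male literacy exceeds female literacy by the widest margin.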
Full code is available in a GitHub repo: dataparadigms – SemanticR.