R in Open Data: Complaints in The Field of Freedom of Information data set from data.gov.rs
The notebooks (R, Rmd, and HTML files, all provided in my GitHub repository) focus on an exploratory analysis of the open data set on complaints in the field of freedom of information, available from the Open Data Portal of the Republic of Serbia, which is currently under development. The data set was kindly provided to the Open Data Portal by the Commissioner for Information of Public Importance and Personal Data Protection of the Republic of Serbia. Many more open data sets will be indexed and uploaded to the Open Data Portal of the Republic of Serbia in the forthcoming weeks and months.
You should view this primarily as an exercise in data wrangling and visualization with {ggplot2} and {igraph}. As for the data set: (a) no metadata and no documentation were provided; (b) the translation of legal terms from Serbian to English is mine, meaning that a lot of Google Translate suggestions were used (I’m a psychologist, not a lawyer or a legal expert); (c) a mixture of the Latin and Cyrillic alphabets was detected in the data; (d) thorough cleaning takes place in Part A, while exploratory analysis and data visualizations are presented in Part B. The R programming language provides a fantastic infrastructure for data wrangling (cleaning, preparation, re-structuring; in a nutshell, all the data management processes that need to be taken care of before any attempt at EDA or statistical modeling). In Part A I used {dplyr} in combination with {tidyr} and {base} functions to inspect and clean up the data set (as much as I could); in Part B, {ggplot2} and {igraph} functionality was added to visualize some of the interesting patterns in the data set.
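To give a flavor of the mixed-alphabet problem, here is a minimal base R sketch. It is not the code from the notebooks: the sample values, the `cyr_to_lat` mapping, and the `to_latin()` helper are all illustrative assumptions, and the script assumes a UTF-8 R session.

```r
# Hypothetical sample: the same authority name typed in both scripts.
# These values are invented for illustration, not taken from the data set.
authority <- c("Ministarstvo pravde", "Министарство правде", "Grad Beograd")

# Flag entries that contain at least one Cyrillic character.
has_cyrillic <- grepl("\\p{Cyrillic}", authority, perl = TRUE)

# Serbian Cyrillic to Latin map. Three letters become digraphs (lj, nj, dž),
# so chartr(), which only swaps single characters, is not enough.
cyr_to_lat <- c(
  "а" = "a", "б" = "b", "в" = "v", "г" = "g", "д" = "d", "ђ" = "đ",
  "е" = "e", "ж" = "ž", "з" = "z", "и" = "i", "ј" = "j", "к" = "k",
  "л" = "l", "љ" = "lj", "м" = "m", "н" = "n", "њ" = "nj", "о" = "o",
  "п" = "p", "р" = "r", "с" = "s", "т" = "t", "ћ" = "ć", "у" = "u",
  "ф" = "f", "х" = "h", "ц" = "c", "ч" = "č", "џ" = "dž", "ш" = "š",
  "А" = "A", "Б" = "B", "В" = "V", "Г" = "G", "Д" = "D", "Ђ" = "Đ",
  "Е" = "E", "Ж" = "Ž", "З" = "Z", "И" = "I", "Ј" = "J", "К" = "K",
  "Л" = "L", "Љ" = "Lj", "М" = "M", "Н" = "N", "Њ" = "Nj", "О" = "O",
  "П" = "P", "Р" = "R", "С" = "S", "Т" = "T", "Ћ" = "Ć", "У" = "U",
  "Ф" = "F", "Х" = "H", "Ц" = "C", "Ч" = "Č", "Џ" = "Dž", "Ш" = "Š"
)

# Replace every Cyrillic letter with its Latin counterpart, one letter at a time.
to_latin <- function(x) {
  for (ch in names(cyr_to_lat)) {
    x <- gsub(ch, cyr_to_lat[[ch]], x, fixed = TRUE)
  }
  x
}

authority_clean <- to_latin(authority)

# After transliteration the two spellings collapse into a single category.
print(table(authority_clean))
```

Normalizing to one script before grouping matters because otherwise the same authority is counted as two different categories.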
This is also a good reality check for all those who are contemplating a Data Science career. Similarly to what I had to do here, you will often face data sets with no documentation and no metadata, and then you will need to explore them before doing real EDA, just to figure out the semantics of the data. Many times you will be forced to combine structured and abstract approaches to data cleaning with manual procedures. You will be driven mad by inconsistencies and missing information, but it will still be up to you to squeeze something useful out of the data that you have at your disposal. It’s no joke: data wrangling and related procedures will steal a huge amount of your time. Statistical modeling comes almost as a reward after everything you’ve been through since you were first introduced to the data set…
Here are some examples with {ggplot2} and {igraph} from this case study.
Figure 1. Number of complaints filed per applicant group, 2005–2016. {ggplot2} with facet_wrap().
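A plot of this kind can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook code: the `complaints` data frame, its column names, and its values are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per complaint. Column names and values are hypothetical.
complaints <- data.frame(
  year = c(2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007),
  applicant_group = c("Citizens", "Media", "Citizens", "Citizens",
                      "NGOs", "Media", "NGOs", "Citizens")
)

# Count complaints per applicant group and year.
per_group <- complaints %>%
  count(applicant_group, year, name = "n_complaints")

# One panel per applicant group, sharing the same axes.
p <- ggplot(per_group, aes(x = year, y = n_complaints)) +
  geom_col() +
  facet_wrap(~ applicant_group) +
  labs(x = "Year", y = "Number of complaints")

print(p)
```

Faceting keeps the groups on a common scale, which makes differences in volume between applicant groups easy to compare at a glance.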
Figure 2. Each applicant group (blue circles) in this directed graph points towards the domains (gold circles) with respect to which it has sent complaints to the Commissioner for Information of Public Importance and Personal Data Protection. {igraph}.
Figure 3. Each applicant group in this directed graph points towards the three authority groups about which it has sent the most complaints to the Commissioner for Information of Public Importance and Personal Data Protection (applicant groups are represented by blue circles and authority groups by red circles). {igraph}.
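The "top three authority groups per applicant group" graph can be approximated along these lines. Again, this is only a sketch under stated assumptions: the `counts` data frame, its column names, and the numbers in it are invented, and the notebooks may build the graph differently.

```r
library(dplyr)
library(igraph)

# Toy complaint counts per (applicant group, authority group) pair.
# All names and numbers are hypothetical.
counts <- data.frame(
  applicant = rep(c("Citizens", "Media"), each = 4),
  authority = rep(c("Ministries", "Courts", "Municipalities",
                    "Public companies"), times = 2),
  complaints = c(40, 25, 10, 5, 12, 30, 8, 20)
)

# Keep the three most-complained-about authority groups per applicant group.
top3 <- counts %>%
  group_by(applicant) %>%
  slice_max(complaints, n = 3, with_ties = FALSE) %>%
  ungroup()

# The first two columns become the edge list (from -> to);
# the remaining column is stored as an edge attribute.
g <- graph_from_data_frame(top3, directed = TRUE)

# Blue for applicant groups, red for authority groups.
V(g)$color <- ifelse(V(g)$name %in% top3$applicant, "steelblue", "firebrick")

plot(g, edge.arrow.size = 0.5, vertex.label.color = "black")
```

Dropping the filtering step and swapping the authority column for a domain column would give the same kind of applicant-to-domain graph as in Figure 2.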