A real-world, messy dataset to practice on

[This article was first published on R programming – Oscar Baruffa, and kindly contributed to R-bloggers.]

At some point you may be looking for a “real world” dataset to practice analysis on or to give to students.

The value of such data is that it gives analysts a chance to develop skills that they need for their work but that are hard to master when given “clean” datasets, especially inside a guided course.

I’ve found the dataset below which, apart from being actual, real-life data, has a few characteristics that make it a good set for learning data cleaning and further analysis.

The data is a Salary Survey from AskAManager.org. It’s US-centric-ish but does allow for a range of country inputs. I find salary surveys inherently interesting, but here are some other notable aspects of this dataset.

A spreadsheet showing salary survey responses. Column headers include age, industry, job title, annual salary, currency, country, city, years of experience, gender and race amongst others.
  • There are 17 variables, so it’s not too overwhelming.
  • Six of the variables are free-form text entries, which always results in lots of data cleaning to be done!
  • All variables make intuitive sense, so you don’t need any domain expertise to figure out what they are. However...
  • You can apply some domain expertise to a subset of the data that you are familiar with, be it country, state, job title or sector knowledge.
  • The dataset is “live” and constantly growing. In the time it’s taken me to write the first lines of this post, the responses grew from 11,588 to 11,603. This means that fixes you made in an earlier analysis may not hold for all new entries.
  • The downloaded dataset also includes a “timestamp” variable (column A), so if the survey is no longer receiving updates you can still simulate a growing list by filtering the data over longer and longer timespans.
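Two of the points above, cleaning free-form text and simulating a growing list, can be sketched in base R. The data frame here is a tiny invented stand-in, not real survey rows. The column names `timestamp`, `country` and `annual_salary` and the helpers `clean_country()` and `snapshot()` are assumptions made for illustration, and the real sheet's headers will differ.

```r
# Toy stand-in for the survey: every row and column name is invented.
survey <- data.frame(
  timestamp = as.POSIXct(
    c("2021-04-27 11:02:10", "2021-04-27 11:05:42",
      "2021-04-28 09:15:00", "2021-04-29 16:30:21"),
    tz = "UTC"
  ),
  country = c("United States", "usa ", "U.S.", "Canada"),
  annual_salary = c(55000, 62000, 48000, 70000),
  stringsAsFactors = FALSE
)

# Free-form text: lower-case, keep letters only, then map known variants
# to a single label. The lookup is deliberately incomplete; entries it
# does not recognise stay NA so they can be reviewed by hand.
clean_country <- function(x) {
  key <- gsub("[^a-z]", "", tolower(x))
  variants <- c(
    unitedstates = "United States",
    usa          = "United States",
    us           = "United States",
    canada       = "Canada"
  )
  unname(variants[key])
}

# Growing list: keep only the responses received up to a cutoff time.
snapshot <- function(data, cutoff) {
  data[data$timestamp <= as.POSIXct(cutoff, tz = "UTC"), , drop = FALSE]
}

survey$country_clean <- clean_country(survey$country)

day1 <- snapshot(survey, "2021-04-27 23:59:59")  # first day only
day2 <- snapshot(survey, "2021-04-28 23:59:59")  # first two days
```

Re-running the cleaning step on each successively larger snapshot is one way to check whether a fix that worked on the early responses still holds as new spellings arrive.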

If you’re using R, you can read the sheet using the googlesheets4 package.
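A minimal sketch of what that could look like follows. It assumes the sheet is shared publicly, so no Google login is needed, and the wrapper name `read_survey()` and its `sheet_url` argument are placeholders of mine; paste in the sheet's actual URL or ID yourself.

```r
# Sketch only: needs the googlesheets4 package and network access.
# `sheet_url` is a placeholder for the survey sheet's URL or ID.
read_survey <- function(sheet_url) {
  # Skip the Google sign-in flow (assumes the sheet is publicly readable)
  googlesheets4::gs4_deauth()
  # Read the first worksheet into a tibble
  googlesheets4::read_sheet(sheet_url)
}

# survey <- read_survey("<URL of the salary survey sheet>")
```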

You can of course make a copy of the sheet directly in Google Sheets, or you can download it in multiple formats.

File menu from Google Sheets showing the dropdowns to download the data, in order: File, Download and then a selection of format options like XLS, CSV etc.
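If you go the download route and pick CSV, a base R sketch for reading the file might look like this. The function name `read_survey_csv()` and the file name in the commented call are hypothetical.

```r
# Read a CSV exported via File > Download.
read_survey_csv <- function(path) {
  read.csv(
    path,
    stringsAsFactors = FALSE,  # keep free-text answers as character
    check.names = FALSE,       # preserve the original column headers
    na.strings = c("", "NA")   # treat blank cells as missing
  )
}

# survey <- read_survey_csv("salary_survey.csv")  # hypothetical file name
```

Setting `check.names = FALSE` keeps the headers exactly as they appear in the sheet, which makes it easier to match columns back to the original file.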

Happy analyzing!

Social media preview image: Photo by Wonderlane on Unsplash


