[This article was first published on R - Blendo Blog, and kindly contributed to R-bloggers].
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It is often said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it. At the same time, deadlines and management demands keep you up at night. This is one reason data analysts and data scientists regularly scour the web looking for anything that could help: tools, tutorials, resources. I have stumbled upon many posts about general Data Science MOOCs or tutorials, but never one that lists resources on one of the most time-consuming processes in the data pipeline: data cleaning. In this post, I did my best to gather everything there is online. If you find a resource that I missed, please let me know in the comments below. Let’s start with the basics…
What is data cleaning?
Data cleaning, data cleansing or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
Source: Wikipedia
Note: Some of the courses below belong to specializations or batches of courses. For example, Coursera has a Data Science Specialization and Udacity has a Nanodegree Program, but you may also take each course individually. If you are interested in a certificate, there is usually a fee. If not (for Coursera at least), you may “audit” the course. Some courses are free and others are subscription-based services.
Data Cleaning in R
Getting and Cleaning Data (Coursera)
- Course Name: Getting and Cleaning Data
- Institution: Johns Hopkins University
- Coursera Specialization: Data Science Specialization
- Price: Free
- It belongs to Coursera’s Data Science Specialization from Johns Hopkins University and is one of the best Data Cleaning courses out there. The course covers the basics needed for collecting, cleaning, and sharing data.
Data Science and Machine Learning Essentials (edX)
- Course Name: Data Science and Machine Learning Essentials
- Institution: Microsoft
- Price: Free, paid for certificate
- Another one of the best Data Science MOOCs. It covers tools like R, Python and SQL and, among other topics, data acquisition, ingestion, sampling, quantization, cleaning, and transformation.
Data Science with R (O’Reilly)
- Course Name: Data Science with R
- Price: Paid
- It is part of one of O’Reilly’s Learning Paths. It goes from the basics to more advanced techniques, including R Graph and machine learning. It contains an intro to Data Science with R, a section on how to manipulate data sets, and expert Data Wrangling with R.
Cleaning Data in R (DataCamp)
- Course Name: Learn How to Clean Your Data Using R
- Price: Free (some chapters), Subscription based.
- Provides a basic intro to cleaning data in R.
Foundations of Data Science (Springboard)
- Course Name: Foundations of Data Science
- Price: Free (some chapters), Subscription based or one-time payment
- It has a unit about Data Wrangling and data cleaning with R.
Udemy Courses
You may want to take a look at the list of resources about data cleaning and R on Udemy. There are a lot to choose from, but it might require some searching to find the one that is valuable to you.
Data Cleaning in Python
Data Science and Machine Learning Essentials
See the Data Science and Machine Learning Essentials (edX) course above.
Intro to Data Analysis – Data Analysis Using NumPy and Pandas (Udacity)
- Course Name: Intro to Data Analysis – Data Analysis Using NumPy and Pandas
- Udacity Nanodegree Program: Data Analyst Nanodegree
- Price: Free.
- It belongs to Udacity’s Data Analyst Nanodegree and covers the entire data analysis process, including wrangling, exploring, and cleaning data, as well as how to use Python libraries like NumPy, Pandas, and Matplotlib.
Data Wrangling with MongoDB – Data Manipulation and Retrieval (Udacity)
- Course Name: Data Wrangling with MongoDB – Data Manipulation and Retrieval
- Udacity Nanodegree Program: Data Analyst Nanodegree
- Institution: Udacity + MongoDB
- Price: Free.
- It belongs to Udacity’s Data Analyst Nanodegree. It provides information on how to gather and extract data in widely used data formats, how to assess the quality of data, and best practices for data cleaning. It also covers the essentials of storing data, the MongoDB query language and how to perform exploratory analysis using the MongoDB aggregation framework.
Python for Data Analysis (Big Data University)
- Course Name: Python for Data Analysis (course is under development)
- Price: Free.
- Although the course is under development, it seems to contain a great syllabus about loading data, data wrangling, data cleaning, transformation and visualization of data.
Intermediate Python and Pandas (DataQuest)
- Course Name: Intermediate Python and Pandas
- Price: Free (some chapters), Subscription based.
- It helps you acquire more advanced Python and Pandas skills that, among other things, will help you improve your data munging and data cleaning skills.
Data Analysis and Visualization (DataQuest)
- Course Name: Data Analysis and Visualization
- Price: Free (some chapters), Subscription based.
- Play with NumPy, Pandas and Jupyter while learning how to clean your data.
Data Science Intensive (Springboard)
- Course Name: Data Science Intensive
- Price: Free (some chapters), Subscription based or one-time payment
- It has a unit about Data Wrangling and data cleaning with Python.
Big Data Science with BD2K-LINCS (Coursera)
- Course Name: Big Data Science with the BD2K-LINCS Data Coordination and Integration Center
- Institution: BD2K-LINCS Data Coordination and Integration Center
- Coursera Specialization: None
- Price: Free
- This is a life-science-related statistics course, but it provides info on how to collect data, basic data processing and data normalization methods that can be used for data cleaning. Basic courses in statistics and molecular biology are useful but not required. The ability to write short scripts in languages such as Python would be useful.
Exploring CO2 Emissions Data using Pandas data frames in Python (Big Data University)
- Course Name: Exploring CO2 Emissions Data using Pandas data frames in Python
- Price: Free.
- Learn to explore, clean and visualize a dataset of CO2 emissions from the United Nations’ sustainable goals.
Python Applications (DataQuest)
- Course Name: Python Applications
- Price: Free (some chapters), Subscription based.
- Learn how to use Python to visualize, explore and clean data using real datasets.
Python for Business Analysts (DataQuest)
- Course Name: Python for Business Analysts
- Price: Free (some chapters), Subscription based.
- Use Python to clean, visualize, and explore datasets.
Udemy Courses
You may want to take a look at the list of resources about data cleaning and Python on Udemy. There are a lot to choose from, but it might require some searching to find the one that is valuable to you.
Data Cleaning (SQL, Spark etc.)
Introduction to Big Data Analytics (Coursera)
- Course Name: Introduction to Big Data Analytics
- Institution: University of California, San Diego
- Coursera Specialization: Big Data Specialization
- Price: Free, paid for certificate
- This is a (really) quick intro to Big Data query interfaces, environments, and tools like HBase, Hive, Pig or Spark. Some parts focus on data exploration and data cleaning with Spark.
Working With Large Datasets (DataQuest)
- Course Name: Working With Large Datasets
- Price: Free (some chapters), Subscription based.
- Work with MapReduce and Spark to clean, process and analyze large datasets.
Data Cleaning (OpenRefine, Tableau, Excel or other tools)
Introduction to OpenRefine (Big Data University)
- Course Name: Introduction to OpenRefine
- Price: Free.
- It covers the basics of OpenRefine and its scripting language GREL and provides info on data cleaning.
How to clean your data (European Data Portal)
- Course Name: How to clean your data
- Price: Free.
- It covers the topic of cleaning up data and explores common errors found in open datasets and how they affect the way we work with this data. You can find more training material at the European Data Portal.
Data, Analytics and Learning (edX)
- Course Name: Data, Analytics and Learning
- Institution: University of Texas at Arlington + Tableau Software
- Price: Free
- The course provides a great overview of the field, suitable for a broad audience. Explore the logic of analytics and the basics of finding, cleaning, and using educational data to build predictive models and perform text analysis.
Data Analysis for your Business (edX)
- Course Name: Data Analysis for your Business
- Institution: DelftX
- Specialization: XSeries
- Price: Paid for certificate
- Use Excel for importing data, data cleaning, data wrangling, interpreting and visualizing, with special emphasis on real-time dashboards.
Videos
When I was searching for these courses I stumbled upon some great videos of presentations at conferences. I added them here in case anybody is interested.
- UC Berkeley AMP Camp – Data Cleaning by Sanjay Krishnan.
- Data Cleaning on Text to Prepare for Analysis and Machine Learning @ EuroSciPy 2015 by Ian Ozsvald.
- Data Cleansing and Analysis with Cross-Filters by Salesforce Customer Success Team.
- Introduction to data cleaning with Jupyter and pandas @ PyCaribbean 2016 by Melissa Lewis.
- Introduction to R Data Analysis: Data Cleaning.
Closure
I hope this list will help anyone who is looking to clean her data or is looking for a smooth start with the subject of data wrangling. If you know of any course that I missed, or if any of the above does not fit the list, please let me know in the comments or on Twitter.