Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As a biostatistician, my preferred way to receive data from a client is a csv extract from a well-designed database that included validation with the data-entry process. However, 95% of the data I see comes as an Excel file.
I have just sent an email to a client with some tips for using Excel to collect and store their research data. These 5 steps are my recommendations that save me a lot of headache when importing data from .xlsx files into R for analysis:
- (basic) Only enter a single piece of information into a cell. For example, if the column is “number”, then 365 is the only thing in the cell, do not do this: “365 (children only)”. If you need to leave a comment, then put it in another column.
- (basic) Keep column names simple, my preference is up to a maximum of 3 words, separated by underscores, all lower caps. If you can, keep column names meaningful to a human reader. Also, only use Row 1 for column names/information. Extra information about the data contained within a specific column should be placed in the data dictionary, which is in a different worksheet.
- (basic) Include a data dictionary that should have at least 5 columns: 1| column name (point 2), 2| Description, 3| data type (e.g. dichotomous (yes/no, true/false), categorical (list the categories), numeric, text), 4| valid ranges of data (for example, valid entries for age could be between 0 and 110 years, if 200 was entered then I know it’s likely an error), and 5| if the value is numeric, provide the unit of measurement (e.g. years, days, cm, kg, …).
- (advanced) Use Excel’s data validation functions – especially to make drop-down lists for categorical responses.
- (intermediate) If you have any date columns, change the cell data type to “Text” (the same for phone numbers).
Note: Always, any information that can be used to identify an individual should be removed from the file before it gets sent to a statistician.
If everyone followed these 5 steps, then 80% of my work (and time) would be freed up for analysis and communicating results. I know there must be other simple pieces of advice. What have I missed? What other simple advice do you give your colleagues/clients about data entry into Microsoft Excel?
Photo by Scott Graham on Unsplash
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.