Data preparation for analysis, use VBA
[This article was first published on R in the Antipodes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data on daily nutrient intake information is stored within a database, and the users are extracting the relevant data on a nutrient-by-nutrient basis through an export procedure. The exported data is being stored in Excel 2010 data sets, one for each nutrient. There is a set of variables being saved in each export, as well as an optional variable that does not need passing to R, but – when it is present – is situated between subsets of relevant variables. The names of the variables are not exactly the same between the Excel data sets, nor is the starting row for the variable names, and a couple of variable names are saved across two rows instead of one. This creates a bit of a headache for creating a .csv file to import into R.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
For one file, the starting rows look like:
Energy, including dietary fibre. 2007 ANCNPAS. Including supps. | ||||||||
nutrient_code | Respondent ID | Gender | Age | Body weight | samplewt | Intake- day1 | Intake- day2 | Units |
267 | 100013 | 2 | 15 | 59.4 | 0.333552084 | 8591.5354 | 8747.9084 | KJ |
267 | 100020 | 1 | 12 | 51.6 | 0.495283471 | 12145.8524 | 13495.798 | KJ |
For another file, the starting rows look like:
iodine | ||||||||||
with supplements | ||||||||||
day1 | day 2 | |||||||||
nut_code | id | sex | age | Body weight | seifa | samplewt | Intake | Intake | Units | Upper safe level (UL) |
315 | 100013 | 2 | 15 | 59.4 | 0.33355208 | 103.7881 | 128.07576 | ug | 900 | |
315 | 100020 | 1 | 12 | 51.6 | 0.49528347 | 140.46202 | 218.31528 | ug | 600 | |
315 | 100050 | 2 | 15 | 62.1 | 0.33355208 | 118.21546 | 184.33722 | ug | 900 |
Because there are multiple data sets that will need conversion to R, the best method is to construct some VBA code in an Excel workbook to handle the data cleaning and preparation. The saving grace is that the relative order of the variables appears to be the same across the different data sets.
The code below cleans both files using a button click event in Excel 2010. The cleaned data is saved as a .csv file, using the same file name as the input .xlsx file and into the same directory. The Respondent IDs are integer, but must be dimmed as Long because they exceed the maximum value that can be stored in an Integer type. The variables are defined as arrays, and then redimmed to the correct length just before they have values allocated, so the method will work with data sets of varying numbers of observations. For those unfamiliar with VBA, the ‘ is used to comment code.
The Open file has been restricted to only Excel 2010 workbooks – this should save the user from scrolling through irrelevant non-Excel files. Because we have constructed the .csv filename to have the correct file extension, we just need to put in the file name at the end but we do need to amend the type.
Sub Button1_Click()
Dim FilePath As String
Dim FileName As String
Dim CSVFile As String
Dim WB As Workbook
Dim StartRow As Range
Dim Count As Integer ‘Used to store the size of the array for each variable, can only be a whole number
Dim i As Integer ‘Count in the values to each array
Dim NutrientCode() As Integer ‘Numeric code for nutrient, same value for all rows in the same sheet
Dim Respondent() As Long ‘Respondent IDs are larger than 32K so need to make these long rather than integer
Dim Gender() As Integer ‘Sex, codes are 1 or 2
Dim Age() As Integer
Dim BodyWeight() As Double
Dim SampleWeight() As Double
Dim Day1Intake() As Double
Dim Day2Intake() As Double
On Error GoTo Err_Clr
‘From http://msdn.microsoft.com/en-us/library/ff834966.aspx
‘Get workbook name
FilePath = Application _
.GetOpenFilename(“Excel Files (*.xlsx), *.xlsx”)
If FilePath <> False Then
‘MsgBox “Open ” & FilePath
End If
‘Save filename and path for output to csv
FileName = Dir(FilePath)
‘MsgBox “The filename is ” & FileName
CSVFile = Left(FileName, Len(FileName) – 4) & “csv”
‘MsgBox “The filename is ” & CSVFile
‘Open workbook
Set WB = Workbooks.Open(FilePath)
WB.Activate
‘Locate variable name row
Range(“A1”).Select
Do Until (ActiveCell.Value = “nutrient_code”) Or (ActiveCell.Value = “nut_code”)
ActiveCell.Offset(1, 0).Select
Loop
Set StartRow = ActiveCell
‘MsgBox “Startrow is with cell ” & StartRow
‘Get Count value
Count = 0
Do While ActiveCell.Value <> “”
ActiveCell.Offset(1, 0).Select
Count = Count + 1
Loop
‘Count is 1 too high due to looping to empty row, reduce by 1
Count = Count – 1
‘MsgBox “The number of observations is ” & Count
‘Go back to variable name row
StartRow.Select
‘Nutrient intakes
ReDim NutrientCode(Count)
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
NutrientCode(i) = ActiveCell.Value
Next
‘Respondent IDs
StartRow.Select
ActiveCell.Offset(0, 1).Select
ReDim Respondent(Count)
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
Respondent(i) = ActiveCell.Value
Next
‘Gender
StartRow.Select
ActiveCell.Offset(0, 2).Select
ReDim Gender(Count)
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
Gender(i) = ActiveCell.Value
Next
‘Age
StartRow.Select
ActiveCell.Offset(0, 3).Select
ReDim Age(Count)
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
Age(i) = ActiveCell.Value
Next
‘Body Weight
StartRow.Select
ActiveCell.Offset(0, 4).Select
ReDim BodyWeight(Count)
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
BodyWeight(i) = ActiveCell.Value
Next
‘need to check for seifa column at this point
‘Sample Weight
StartRow.Select
ActiveCell.Offset(0, 5).Select
If ActiveCell.Value = “seifa” Then ActiveCell.Offset(0, 1).Select
ReDim SampleWeight(Count)
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
SampleWeight(i) = ActiveCell.Value
Next
‘Day 1 intake
StartRow.Select
ActiveCell.Offset(0, 5).Select
If ActiveCell.Value = “seifa” Then
ActiveCell.Offset(0, 2).Select
Else
ActiveCell.Offset(0, 1).Select
End If
ReDim Day1Intake(Count)
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
Day1Intake(i) = ActiveCell.Value
Next
‘Day 2 intake
StartRow.Select
ActiveCell.Offset(0, 5).Select
If ActiveCell.Value = “seifa” Then
ActiveCell.Offset(0, 3).Select
Else
ActiveCell.Offset(0, 2).Select
End If
ReDim Day2Intake(Count)
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
Day2Intake(i) = ActiveCell.Value
Next
‘Save all the array values to a new workbook
Workbooks.Add
‘Output Nutrient ID
ActiveCell.Value = “NutrientID”
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
ActiveCell.Value = NutrientCode(i)
Next
‘Output Respondent
Range(“B1”).Select
ActiveCell.Value = “RespondentID”
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
ActiveCell.Value = Respondent(i)
Next
‘Output Gender
Range(“C1”).Select
ActiveCell.Value = “Gender”
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
ActiveCell.Value = Gender(i)
Next
‘Output Age
Range(“D1”).Select
ActiveCell.Value = “Age”
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
ActiveCell.Value = Age(i)
Next
‘Output Body Weight
Range(“E1”).Select
ActiveCell.Value = “BodyWeight”
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
ActiveCell.Value = BodyWeight(i)
Next
‘Output SampleWeight
Range(“F1”).Select
ActiveCell.Value = “SampleWeight”
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
ActiveCell.Value = SampleWeight(i)
Next
‘Output Day 1 Intake
Range(“G1”).Select
ActiveCell.Value = “Day1Intake”
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
ActiveCell.Value = Day1Intake(i)
Next
‘Output Day 2 Intake
Range(“H1”).Select
ActiveCell.Value = “Day2Intake”
For i = 1 To Count
ActiveCell.Offset(1, 0).Select
ActiveCell.Value = Day2Intake(i)
Next
‘Save the worksheet as a csv file for R
ActiveWorkbook.SaveAs FileName:=CSVFile, FileFormat:=xlCSV
Err_Clr:
If Err <> 0 Then
Err.Clear
Resume Next
End If
End Sub
To leave a comment for the author, please follow the link and comment on their blog: R in the Antipodes.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.