How to extract a data.frame from string data
A guest article by Asher Raz, PhD, CareerHarmony
Sometimes, data of subjects are recorded on a server (e.g. SQL server) as string data records for each subject. In some cases we need only a part of those string data for each subject and we need it as numerical data (e.g. as a data.frame). How can we get the required data?
In the following post I would like to share my experience with this issue.
Below is a sample of string data:
"03F05DCACF-15BF-4328-BF1B-5D2503B4A18D00004|||03889D4711-4968-45DA-B1EF-E8559EEE43B400001|||03E56E89E7-5EA3-4A5A-B945-BC6B94D982EE00003|||03FEC7049F-B4E2-4D65-833D-59478CC7780D00003|||039BC5FC41-6E83-4880-8531-F0E38A892C2D00002|||035F88A090-E28F-4E6E-B680-33502F95C41D00003|||",20,DD31036F-CB38-4FF6-8DD8-495F42F29185,"032D6E0DDF-B553-4150-83D9-241DFD401E6D00001|||031228E393-88BC-4550-83D2-FBF3427D483600003|||035FA90EE3-8C51-48B9-91A0-8F5D163D6A7A00004|||03A564C47F-0A5C-4DA4-B94E-CEB55AEA947400002|||03BAA8F0A8-5BE3-4BC1-9DAF-C69ECECF1F7400002|||03C7CD867A-6557-4315-AF34-4B3938D6E81700003|||"Each string data is organized according to rules. It could be that the required data is organized according to one rule at the beginning of the sting data and according to another rule in the rest of the string data. Firstly, we need to find the rules that organize the required data. That is done by checking the data thoroughly. For instance, in the sample string data mentioned above, the required data are numbers that appears after four 0’s. The first datum we need is 4, the second is 1, and so on. The first datum appears in the 45th place in the string data, the second in 46th place after the first datum and so on. In order to record those rules, we can create a vector that contains the places of all the required data in the string:
#Creating the counter vector. counter <- c(45) i <- 45 for(n in 1:6) { if (n<6) { i <- i+46 } else { i <- i+91 } countert <- c(i) counter <- rbind(counter, countert) } counter <- as.vector (counter) for(n in 1:30) { for(n in 1:6) { if (n<6) { i <- i+46 } else { i <- i+91 } countert <- c(i) counter <- c(counter, countert) } }Secondly, we need to input into R the string data (in this example from SQL server) and then prepare the data.frame. That can be done by the following code:
#Read the answers from the text files. library (RODBC) myconn <-odbcConnect("Subjects_Data", uid="ab", pwd="cde") NEW-DATA_sql_data_RawAnswer <- sqlQuery(myconn, "select top 1000 RawAnswer from New-Data",stringsAsFactors = FALSE) NEW-DATA_all_sample <-c() list_item <- 14 for(RawAnswer in NEW-DATA_sql_data_RawAnswer$RawAnswer) { save(NEW-DATA_sql_data_RawAnswer, file=" C:/R_DATA/NEW-DATAt.txt ",ascii=T) NEW-DATA_string <- readLines("C:/R_DATA/NEW-DATAt.txt") list_item <- list_item + 3 NEW-DATA_stringt <- NEW-DATA_string[c(list_item)] NEW-DATA_one_subject <- c() for(n in counter) { NEW-DATAt <- substring(NEW-DATA_stringt, n,n) NEW-DATA_one_subject <- rbind(NEW-DATA_one_subject, NEW-DATAt) } NEW-DATA_one_subject <- as.numeric(NEW-DATA_one_subject) NEW-DATA_one_subject <- NEW-DATA_one_subject [c(-187)] NEW-DATA_all_sample <- rbind(NEW-DATA_all_sample, NEW-DATA_one_subject) } NEW-DATA_all_sample_df<-as.data.frame(apply(NEW-DATA_all_sample, 2, as.numeric)) names(NEW-DATA_all_sample_df)<-sprintf("q%d",1:186) NEW-DATA_sql_data_CandidateID_Username <- sqlQuery(myconn, "select top 1000 CandidateID,Username from New-Data",stringsAsFactors = FALSE) NEW-DATA_sql_data_CandidateID_Username_df<-as.data.frame(NEW-DATA_sql_data_CandidateID_Username) NEW-DATA_all_sample_df$CandidateID <- NEW-DATA_sql_data_CandidateID_Username_df$CandidateID NEW-DATA_all_sample_df$Username <- NEW-DATA_sql_data_CandidateID_Username_df$Username NEW-DATA_all_sample_df <- NEW-DATA_all_sample_df[,c(187,188,1:186)] NEW-DATA_all_sample_df write.csv(NEW-DATA_all_sample_df,file = "C:/R_DATA/NEW-DATA_all_sample.csv")It should be noted that the string data that was imported to R had to be saved into a text file, before it can be read by R (see above the commands before the command NEW-DATA_string <- readLines("C:/R_DATA/NEW-DATAt.txt")). This command can’t work directly on the data that was imported from SQL server.