How to extract a data.frame from string data

Posted on December 28, 2014 by Tal Galili in R guest posts | 0 Comments

A guest article by Asher Raz, PhD, CareerHarmony Sometimes, data of subjects are recorded on a server (e.g. SQL server) as string data records for each subject. In some cases we need only a part of those string data for each subject and we need it as numerical data (e.g. as a data.frame). How can we get the required data? In the following post I would like to share my experience with this issue. Below is a sample of string data:

"03F05DCACF-15BF-4328-BF1B-5D2503B4A18D00004|||03889D4711-4968-45DA-B1EF-E8559EEE43B400001|||03E56E89E7-5EA3-4A5A-B945-BC6B94D982EE00003|||03FEC7049F-B4E2-4D65-833D-59478CC7780D00003|||039BC5FC41-6E83-4880-8531-F0E38A892C2D00002|||035F88A090-E28F-4E6E-B680-33502F95C41D00003|||",20,DD31036F-CB38-4FF6-8DD8-495F42F29185,"032D6E0DDF-B553-4150-83D9-241DFD401E6D00001|||031228E393-88BC-4550-83D2-FBF3427D483600003|||035FA90EE3-8C51-48B9-91A0-8F5D163D6A7A00004|||03A564C47F-0A5C-4DA4-B94E-CEB55AEA947400002|||03BAA8F0A8-5BE3-4BC1-9DAF-C69ECECF1F7400002|||03C7CD867A-6557-4315-AF34-4B3938D6E81700003|||"

Each string data is organized according to rules. It could be that the required data is organized according to one rule at the beginning of the sting data and according to another rule in the rest of the string data. Firstly, we need to find the rules that organize the required data. That is done by checking the data thoroughly. For instance, in the sample string data mentioned above, the required data are numbers that appears after four 0’s. The first datum we need is 4, the second is 1, and so on. The first datum appears in the 45^th place in the string data, the second in 46^th place after the first datum and so on. In order to record those rules, we can create a vector that contains the places of all the required data in the string:

#Creating the counter vector.
 
counter <- c(45) 
i <- 45 
for(n in 1:6) { 
  if (n<6) { 
    i <- i+46 
  } else { 
    i <- i+91 
  } 
  countert <- c(i) 
  counter <- rbind(counter, countert) 
}
 
counter <- as.vector (counter) 
for(n in 1:30) { 
  for(n in 1:6) { 
    if (n<6) { 
      i <- i+46 
    } else { 
      i <- i+91 
    } 
    countert <- c(i) 
    counter <- c(counter, countert) 
  } 
}

Secondly, we need to input into R the string data (in this example from SQL server) and then prepare the data.frame. That can be done by the following code:

#Read the answers from the text files.
library (RODBC)
myconn <-odbcConnect("Subjects_Data", uid="ab", pwd="cde")
NEW-DATA_sql_data_RawAnswer <- sqlQuery(myconn, "select top 1000 RawAnswer from New-Data",stringsAsFactors = FALSE)
 
NEW-DATA_all_sample <-c()
list_item <- 14
 
for(RawAnswer in NEW-DATA_sql_data_RawAnswer$RawAnswer) {
save(NEW-DATA_sql_data_RawAnswer, file=" C:/R_DATA/NEW-DATAt.txt ",ascii=T)
NEW-DATA_string <- readLines("C:/R_DATA/NEW-DATAt.txt")
list_item <- list_item + 3
NEW-DATA_stringt <- NEW-DATA_string[c(list_item)]
NEW-DATA_one_subject <- c()
for(n in counter) {
  NEW-DATAt <- substring(NEW-DATA_stringt, n,n)
  NEW-DATA_one_subject <- rbind(NEW-DATA_one_subject, NEW-DATAt)
}
 
NEW-DATA_one_subject <- as.numeric(NEW-DATA_one_subject)
NEW-DATA_one_subject <- NEW-DATA_one_subject [c(-187)]
NEW-DATA_all_sample <- rbind(NEW-DATA_all_sample, NEW-DATA_one_subject)
}
 
NEW-DATA_all_sample_df<-as.data.frame(apply(NEW-DATA_all_sample, 2, as.numeric))
 
names(NEW-DATA_all_sample_df)<-sprintf("q%d",1:186)
 
NEW-DATA_sql_data_CandidateID_Username <- sqlQuery(myconn, "select top 1000 CandidateID,Username from New-Data",stringsAsFactors = FALSE)
 
NEW-DATA_sql_data_CandidateID_Username_df<-as.data.frame(NEW-DATA_sql_data_CandidateID_Username)
 
NEW-DATA_all_sample_df$CandidateID <- NEW-DATA_sql_data_CandidateID_Username_df$CandidateID
 
NEW-DATA_all_sample_df$Username <- NEW-DATA_sql_data_CandidateID_Username_df$Username
 
NEW-DATA_all_sample_df <- NEW-DATA_all_sample_df[,c(187,188,1:186)]
 
NEW-DATA_all_sample_df
 
write.csv(NEW-DATA_all_sample_df,file = "C:/R_DATA/NEW-DATA_all_sample.csv")

It should be noted that the string data that was imported to R had to be saved into a text file, before it can be read by R (see above the commands before the command NEW-DATA_string <- readLines("C:/R_DATA/NEW-DATAt.txt")). This command can’t work directly on the data that was imported from SQL server.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

How to extract a data.frame from string data

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)