[This article was first published on R snippets, and kindly contributed to R-bloggers].
It's time for some fun today, because it's Friday, as David Smith says :).
There are many code golf sites, and some even support R. However, most of them are algorithm oriented. A true RGolf competition should involve transforming a source data frame into a data frame in some target format.
So today's challenge is to write the shortest R code that performs a required data transformation.
Let’s start with the data transformation task (actually the problem was taken from a real data set I have recently analyzed).
We are running a survey. Each respondent is asked some subset of the possible questions (labelled by letters) and answers each one positively (1) or negatively (0). As input we are given a data frame with two columns: the labels of the questions asked (a string of letters) and the sequence of answers given to them (a string of 0's and 1's). A good R example is better than 1000 words :):
> set.seed(1)
> questions <- replicate(1000, paste(sample(letters[1:10],
      sample.int(4) + 2), collapse = ""))
> answers <- sapply(questions, function(x) {
      paste(as.character(rbinom(nchar(x), 1, 0.5)),
            collapse = "") })
> dataset <- data.frame(questions, answers,
      stringsAsFactors = FALSE)
> head(dataset)
  questions answers
1      cihe    1100
2     gdjie   01100
3    cfbhja  001000
4      febj    1110
5     ehfid   01101
6     hgdic   10010
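We want to turn it into a data frame with one column per question, holding each respondent's answer to that question, or NA if the question was not asked:
> head(dataset2)
   a  b  c  d  e  f  g  h  i  j
1 NA NA  1 NA  0 NA NA  0  1 NA
2 NA NA NA  1  0 NA  0 NA  0  1
3  0  1  0 NA NA  0 NA  0 NA  0
4 NA  1 NA NA  1  1 NA NA NA  0
5 NA NA NA  1  0  1 NA  1  0 NA
6 NA NA  0  0 NA NA  0  1  1 NA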
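To make the target format concrete, here is a minimal sketch that builds the first row of dataset2 by hand. The fixed column set `cols` and the variable names are assumptions made for illustration only; in the challenge the columns must be derived from the data.

```r
# One respondent, taken from the first row of the example data.
q <- "cihe"   # questions asked, in the order they were asked
a <- "1100"   # answers, aligned position by position with q

qs  <- strsplit(q, "")[[1]]              # "c" "i" "h" "e"
ans <- as.numeric(strsplit(a, "")[[1]])  # 1 1 0 0

# Assumed, hard-coded set of columns (illustration only).
cols <- letters[1:10]

# Start with NA everywhere, then fill in the questions that were asked.
out <- setNames(rep(NA_real_, length(cols)), cols)
out[qs] <- ans   # assignment by name, so the order of qs does not matter
out
```

Questions c and i get 1, h and e get 0, and everything else stays NA, which is exactly row 1 of the output above.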
The challenge is to transform dataset into dataset2 in as few keystrokes as possible, assuming that the number of questions and the number of respondents (10 and 1000, respectively, in the example data set) are unknown. The constraints are that no line of code may be longer than 80 characters and that the solution must use base R only (no package loading is allowed).
Here is my attempt:
d<-dataset;y<-sort(unique(strsplit(paste(d[[1]],collapse=""),"")[[1]]))
d2<-data.frame(t(mapply(function(q,a){r<-rep(NA,length(y))
r[grepl(paste("[",q,"]",sep=""),y)]<-as.numeric(strsplit(a,split="")[[1]][
order(strsplit(q,split="")[[1]])]);names(r)<-y;r},d[[1]],d[[2]],USE.NAMES=F)))
It has 284 characters (including 3 newline characters). If you take the challenge and have a shorter solution that produces exactly the same dataset2 for a given input, post a comment ;). For a comment to be accepted, the solution must be robust to changes in the generated data set (a different number of possible questions and answers).
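If you want to score your own attempt the same way, the sketch below counts characters and checks the 80-character line limit. The helper `golf_score` is hypothetical, and it assumes the solution is held in a single string whose lines are separated by "\n", with those newlines counted toward the total.

```r
# Hypothetical helper: score a solution stored as one string.
golf_score <- function(code) {
  lines <- strsplit(code, "\n", fixed = TRUE)[[1]]
  list(
    chars    = nchar(code),              # every character, newlines included
    max_line = max(nchar(lines)),        # length of the longest line
    valid    = all(nchar(lines) <= 80)   # the 80-character rule
  )
}

# Toy two-line "solution", just to show the call.
score <- golf_score("x<-1\ny<-x+1")
score$chars   # 11: 4 + 1 newline + 6
score$valid   # TRUE
```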
Before I go, here is the same code in a slightly more readable, commented form:
# extract all classes that exist in dataset$questions
# and sort them
classes <- sort(unique(strsplit(paste(dataset$questions,
    collapse = ""), "")[[1]]))
# change one pair of questions and answers into
# a full vector containing all classes sorted
process.qa <- function(q, a) {
    res <- rep(NA, length(classes)) # initially no classes are set
    qs <- strsplit(q, split = "")[[1]] # extract question classes
    # extract answers and sort them in order of question classes
    as <- as.numeric(strsplit(a, split = "")[[1]][order(qs)])
    # update result with answers for existing questions
    res[grepl(paste("[", q, "]", sep = ""), classes)] <- as
    names(res) <- classes
    res
}
dataset2 <- data.frame(t(mapply(process.qa,
    dataset$questions, dataset$answers, USE.NAMES = FALSE)))
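The `order(qs)` step is the least obvious part, so here is a small sketch of why it is needed, using the second respondent from the example. The `classes` vector is hard-coded here as an assumption; the real code derives it from the data.

```r
classes <- letters[1:10]   # assumed fixed, for illustration only
q <- "gdjie"               # questions in the order they were asked
a <- "01100"               # answers in that same order

qs  <- strsplit(q, "")[[1]]
ans <- as.numeric(strsplit(a, "")[[1]])

# The logical mask selects positions in alphabetical (column) order ...
asked <- grepl(paste("[", q, "]", sep = ""), classes)
classes[asked]          # "d" "e" "g" "i" "j"

# ... so the answers have to be put into that order before assignment.
order(qs)               # 2 5 1 4 3
sorted_ans <- ans[order(qs)]
sorted_ans              # 1 0 0 0 1

res <- rep(NA, length(classes))
res[asked] <- sorted_ans
names(res) <- classes
res
```

Without the reordering, the answers would be written to the d, e, g, i and j columns in the order the questions were asked, which is wrong whenever that order is not alphabetical.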