Introduction
Many R packages, such as data.table, tables and psych, provide group-wise (factor-wise) descriptive statistics like the mean and standard deviation for numeric variables. This article attempts to generate similar tabulated results using only the functions available in base R together with R's object-oriented system. The main purpose of the exercise is to illustrate how the object-oriented system in R can be used to produce output in the form we require. For illustration, the iris data is used, which contains four variables (Sepal Length, Sepal Width, Petal Length and Petal Width) for three species (the factor levels). The following algorithm and R code calculate the mean and standard deviation of these four variables for each species and generate tabulated results similar to those obtained from the above packages.
Algorithm and Code
1. To generate the mean and standard deviation of any numeric vector, a user-defined function meansd() is defined as follows:
meansd<-function(x) {
l<-list()
l$Mean<-mean(x)
l$SD<-sd(x)
return(l)
}
The above function receives any vector as input, calculates the mean and standard deviation and returns the results as a list l.
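As a quick check, meansd() can be tried on a small vector before applying it to a data frame. The sample values below are an illustrative assumption and are not part of the iris example.

```r
# meansd() as defined above: mean and standard deviation returned as a list
meansd <- function(x) {
  l <- list()
  l$Mean <- mean(x)
  l$SD <- sd(x)
  return(l)
}

# Hypothetical sample values, chosen only for illustration
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
out <- meansd(v)
str(out)   # a list with two numeric components, Mean and SD
```

Returning a named list rather than a bare vector keeps the labels Mean and SD attached to the values, which is what later gives the result table its column names.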
2. Execution starts by calling a function with just two arguments: i) the data frame containing all the variables for which the mean and standard deviation are required, and ii) a vector containing the factor variable. So a new function basstat() is defined with just these two arguments. For the iris data, this function can be called as given below.
res<-basstat(iris[,1:4],iris[,5])
3. The basstat() function splits the iris data species-wise into a list of three sub data frames, one for each species. It then uses the lapply() function, which in turn calls another function, result(), for each of these sub data frames and collects the aggregated results in the variable res. To print these aggregated results in a neat tabular fashion, we take the help of R's object-oriented programming concepts: the class of res is changed to "myclass" and the res object is returned. The code of the basstat() function is given below:
basstat<-function(df,f) {
l<-split(df,f)
res<-lapply(l,result)
class(res)<-"myclass"
return(res)
}
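To make the splitting step in basstat() concrete, the following small sketch runs split() on its own and inspects the pieces. The variable name parts is an assumption used only for this illustration.

```r
# split() divides the rows of a data frame according to the levels of a factor
parts <- split(iris[, 1:4], iris[, 5])

names(parts)          # one list element per species
sapply(parts, nrow)   # number of rows in each sub data frame
class(parts[[1]])     # each element is itself a data frame
```

Because each element is still a data frame, any function that works column by column can be applied to it directly with lapply().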
4. The lapply() function in step 3 in turn calls the function result(), using each sub data frame as the input argument. The result() function contains a call to sapply(), which calls meansd() on each variable (column) of the sub data frame and receives the mean and standard deviation for all the variables in that sub data frame. It captures them in the object tres and returns these results to the calling function lapply(). The code of the result() function is given below:
result<-function(x) {
tres<-sapply(x,meansd)
return(tres)
}
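It may help to look at what a single call to sapply() with meansd() returns. The sketch below runs it on the setosa rows only, using a compact but equivalent form of meansd(); the name setosa is an assumption introduced for illustration.

```r
# Compact, equivalent form of meansd() from step 1
meansd <- function(x) list(Mean = mean(x), SD = sd(x))

# One sub data frame, as split() would produce it for the first species
setosa <- iris[iris$Species == "setosa", 1:4]

tres <- sapply(setosa, meansd)
dim(tres)        # 2 rows (Mean, SD) by 4 columns (one per variable)
dimnames(tres)
typeof(tres)     # "list": every cell holds a single numeric value

# A single cell is extracted with [[1]] because the matrix is of mode list
tres["Mean", "Sepal.Length"][[1]]
```

Since meansd() returns a list, sapply() simplifies the results into a matrix of mode list rather than a plain numeric matrix, which is why the print method in step 5 works with t() and cbind() on list cells.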
Through all these function calls, all the results are now available in the object res, which is of class "myclass". At this stage the results are available species-wise, but not in a neat, compact tabular form.
5. To facilitate printing in a compact, neat tabular fashion, a print method for myclass is defined as follows. This function cbinds all the results, does the required string manipulation and finally prints the results in a neat tabular fashion. The code of the print.myclass() function is given below:
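print.myclass<-function(x,...) {
# group (factor level) names, used as headers above each Mean/SD pair
nm<-names(x)
# print with 4 significant digits and restore the previous setting on exit
old<-options(digits=4)
on.exit(options(old))
# bind the transposed Mean/SD matrices of all groups side by side
finres<-vector()
for(i in seq_along(x)) {
finres<-cbind(finres,t(x[[i]]))
}
# leading blanks before the first header (width of the longest group name)
cat(" ")
tsp<-max(nchar(names(x)))
isp<-paste(rep(" ",tsp),collapse="")
cat(isp)
# pad each group name to about 12 characters and print it
for(i in seq_along(nm)) {
tt<-nchar(nm[i])
esp<-if(tt<12) 12-tt else 1
rsp<-paste(rep(" ",esp),collapse="")
nm[i]<-paste(nm[i],rsp)
cat(nm[i])
}
cat("\n")
print(finres)
invisible(x)
}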
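The reason a function named print.myclass() is picked up automatically is S3 method dispatch: print() is a generic function, and for an object whose class attribute is "myclass" it looks for a method called print.myclass(). The toy class below is hypothetical and unrelated to the statistics code; it is only a minimal sketch of the same mechanism.

```r
# A toy S3 object; the class name "toyclass" is an assumption for illustration
obj <- structure(list(a = 1, b = 2), class = "toyclass")

# Method for the print() generic, found through the class attribute of obj
print.toyclass <- function(x, ...) {
  cat("toyclass object with components:", paste(names(x), collapse = ", "), "\n")
  invisible(x)   # print methods conventionally return their argument invisibly
}

print(obj)   # dispatches to print.toyclass()
obj          # typing the name at the console uses the same method
```

Keeping the computation (basstat()) separate from the presentation (the print method) means the stored results remain an ordinary list that can still be indexed and reused.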
6. These results can be printed by just typing the name of the res object from step 2:
>res
The code for using the basstat() function and the results obtained are given below :
Some More Results
i) Descriptive statistics of six variables (mpg, disp, hp, drat, wt and qsec) for the factor cyl, consisting of the levels/groups 4, 6 and 8 cylinders, from the mtcars dataset of the datasets package
res1<-basstat(mtcars[,c(1,3,4,5,6,7)],mtcars[,2])
res1
ii) Descriptive statistics of two variables, Prewt and Postwt, for the groups CBT, Cont and FT of the dataset anorexia of the MASS package
iii) Descriptive statistics of three variables, Price, MPG.city and MPG.highway, for the groups Compact, Large, Midsize, Small, Sporty and Van of the dataset Cars93 of the MASS package
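The article gives the call only for the mtcars case. A possible sketch of the corresponding calls for the two MASS datasets is shown below, assuming the grouping columns are Treat (anorexia) and Type (Cars93) and that the MASS package is installed. The object names res2 and res3 are assumptions, and the helper functions are repeated in compact form so the block runs on its own.

```r
library(MASS)   # provides the anorexia and Cars93 datasets

# Compact versions of the functions defined in steps 1 to 4
meansd <- function(x) list(Mean = mean(x), SD = sd(x))
result <- function(x) sapply(x, meansd)
basstat <- function(df, f) {
  res <- lapply(split(df, f), result)
  class(res) <- "myclass"
  res
}

# Assumed grouping columns: Treat for anorexia, Type for Cars93
res2 <- basstat(anorexia[, c("Prewt", "Postwt")], anorexia$Treat)
res3 <- basstat(Cars93[, c("Price", "MPG.city", "MPG.highway")], Cars93$Type)

names(res2)   # treatment groups
names(res3)   # car types
```

With print.myclass() from step 5 defined in the session, typing res2 or res3 would print the corresponding table.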
Conclusions
The tabulated output obtained from all the above examples is similar to that obtained from the data.table and tables packages. We could achieve this by using the object-oriented concepts of the R language. In this exercise, I have obtained the mean and standard deviation of a number of variables group-wise. It is also possible to modify the program to obtain other statistics, such as the min, max, median, and 1st and 3rd quartiles, for a number of variables group-wise.
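One way such an extension might look is sketched below. The function name fullstats() and the object name bystat are assumptions and not part of the original program; the split/lapply/sapply pattern is the same as in basstat().

```r
# Hypothetical replacement for meansd() that returns seven statistics
fullstats <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  list(Min = min(x), Q1 = q[1], Median = q[2], Q3 = q[3], Max = max(x),
       Mean = mean(x), SD = sd(x))
}

# Same pattern as basstat(): split by factor, then summarise each column
bystat <- lapply(split(iris[, 1:4], iris[, 5]),
                 function(d) sapply(d, fullstats))

dim(bystat$setosa)        # 7 statistics by 4 variables
rownames(bystat$setosa)   # names of the statistics
```

The header layout in print.myclass() is tuned to two columns (Mean and SD) per group, so the padding of the group names would need adjusting before it could print these wider tables neatly.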