translate2R: Easy switch from SPSS to R by using common concepts like temporary and column-wise missing values
If you translate scripts from one language to another, you usually encounter conceptual differences between the two languages. Typically, you start with a new language by adapting the concepts you are used to, and then you expand your skills step by step by using the concepts of the other language.
Part of the idea behind our translateSPSS2R approach is to help you, as an SPSS user, to keep working with familiar SPSS concepts and gain quick wins. Then you are free to explore the new world of R, and maybe you will find it useful to combine ideas from both languages, as we did in our translateSPSS2R package.
The challenge: SPSS meta information differs from R – How can I temporarily split my data set?
Compared to R, a data set in SPSS holds additional structured meta information. This information can be used directly in data management and analysis functions.
Roughly, two classes of meta information can be distinguished in SPSS.
- Applied filters, temporary select-if structures, split file and the do if – end if bracket refer to the whole data set.
- Variable names, variable labels, value labels and missing values belong to single columns.
The meta information related to the whole data set in particular differs strongly from conventional R logic and behavior.
For a deeper understanding of the structural differences between SPSS and R we will elucidate two of them.
Structural difference No. 1: Subsetting
For subsetting data, SPSS lets us switch a filter on and off without changing the number of cases and columns in the data set. After a filter is applied, some of the cases become passive and all subsequent functions use only the filtered cases. Turning the filter off makes the passive cases active again, so that functions applied later consider all cases of the data set. R does not provide this approach out of the box. Creating a subset means either removing the cases physically from the R object or using the indexing operators each time the object is named in a function.
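The two base R options can be sketched as follows. This is only an illustration; the data frame `survey` and its columns are made up for the example.

```r
# Hypothetical example data; all names are illustrative only.
survey <- data.frame(
  id  = 1:6,
  age = c(23, 41, 35, 67, 19, 52),
  sex = c("f", "m", "f", "f", "m", "m")
)

# Option 1: remove the cases physically by creating a new, smaller object.
women <- survey[survey$sex == "f", ]
mean_age_women <- mean(women$age)

# Option 2: keep the full object and repeat the index in every call.
is_f <- survey$sex == "f"
mean_age_indexed <- mean(survey$age[is_f])

# The original object still holds all cases in both options.
n_total <- nrow(survey)
```

Both options give the same result, but neither has an "off" switch: with option 1 you have to go back to the original object, with option 2 you have to drop the index from every call.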
Structural difference No. 2: User defined missing values
As mentioned above, the column-related concept of user-defined missing values is not implemented in R. In SPSS any value can be specified as missing, so that subsequent analysis functions treat it as non-existent, whereas the true value is still visible in the SPSS data viewer. Despite its risks and lack of transparency, this is a common operating principle in SPSS. Undoing the missing specification turns the value back into a valid one that subsequent functions take into account. In R we would declare an existing value as missing by converting it explicitly to NA. Turning NAs back into valid values means, first, backing up the values and their positions in the original data set and, second, replacing the NAs in the current working data set with the externally stored data.
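A minimal base R sketch of this manual round trip might look like the following. The data frame `dat`, the column `income` and the missing code 99 are assumptions made for the example.

```r
# Hypothetical example: 99 stands for "no answer" in a made-up income column.
dat <- data.frame(id = 1:5, income = c(2100, 99, 3400, 99, 1800))

# Step 1: back up the positions and the original values before recoding.
miss_pos <- which(dat$income == 99)
miss_val <- dat$income[miss_pos]

# Step 2: declare the value missing by converting it explicitly to NA.
dat$income[miss_pos] <- NA
mean_valid <- mean(dat$income, na.rm = TRUE)

# Step 3: undo the declaration by writing the stored values back.
dat$income[miss_pos] <- miss_val
mean_all <- mean(dat$income)
```

The backup objects `miss_pos` and `miss_val` live outside the data set, so it is up to the user to keep them in sync with it.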
The solution: The xpssFrame as a temp object that holds all meta data
Building R functions that can simulate SPSS means attaching the same information to the data as SPSS does. Therefore, we created the xpssFrame object that holds the data.
In our translateSPSS2R package the xpssFrame object behaves like an SPSS data set and holds all the additional information. This object unites the original data and the reverse-engineered meta information belonging to the data set, as is the case in SPSS.
The internal workings of SPSS are bridged to R by means of native R attributes. Relevant information such as filtered cases or user-defined missing values is thereby stored in the background and can be brought back by the user whenever requested.
The base R function attributes(data_set) shows all attributes of a data set. Attributes of the variables are stored in the variables themselves and can be shown with attributes(data_set$varname). Thus, the conditions and behavior of subsequently applied functions can be traced. By giving an overview of all active and inactive subset conditions, this procedure makes the analysis process more transparent. In SPSS the information about applied meta instructions is only available through the source code itself, so the analyst has to look for the line with the respective command to see which data the syntax addresses. One current feature of translateSPSS2R is therefore the ability to show, via the attributes function, which subsetting operations have already been applied or will be applied to the data.
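To illustrate the two levels with base R alone: the attribute names `filter` and `label` below are invented for this sketch and are not necessarily the names translateSPSS2R uses on an xpssFrame.

```r
# Base R only; "filter" and "label" are made-up attribute names.
data_set <- data.frame(varname = c(1, 2, 99), other = c("a", "b", "c"))

# Attach meta information to a single variable and to the data set.
attr(data_set$varname, "label") <- "Example variable"
attr(data_set, "filter") <- "varname < 99"

# Data-set level: the usual data frame attributes plus the custom entry.
ds_attrs <- attributes(data_set)
names(ds_attrs)

# Variable level: only the custom entry, as a plain numeric has no others.
var_attrs <- attributes(data_set$varname)
var_attrs$label
```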
Same but different – Using procedures analogous to SPSS, like a temporary switch, in R
In the translateSPSS2R package, R functions work analogously to their SPSS counterparts. Taking the SPSS filter example discussed above, we can use xpssFilter() to select a subset of cases for all subsequent functions. After xpssFilterOff() is called, all cases are part of the data object again.
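The idea of a switchable filter can be imitated in base R by keeping the condition in an attribute. The helpers `set_filter()`, `clear_filter()` and `active_cases()` below are hypothetical and written only for this sketch; they are not the package functions and do not reflect how xpssFilter() is implemented.

```r
# Hypothetical helpers; they only mimic the idea and are NOT the package API.
set_filter <- function(df, keep) {
  attr(df, "filter") <- keep   # logical vector, one entry per case
  df
}
clear_filter <- function(df) {
  attr(df, "filter") <- NULL
  df
}
active_cases <- function(df) {
  keep <- attr(df, "filter")
  if (is.null(keep)) df else df[keep, , drop = FALSE]
}

dat <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))

# Filter on: analyses see the active cases only, all cases stay stored.
dat <- set_filter(dat, dat$score > 25)
mean_filtered <- mean(active_cases(dat)$score)
n_rows_stored <- nrow(dat)

# Filter off: every case is active again.
dat <- clear_filter(dat)
mean_all <- mean(active_cases(dat)$score)
```

In this sketch every analysis has to go through `active_cases()`; the point of the package is that its functions respect the stored filter on their own.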
The strategy for user-defined missing values works comparably to SPSS as well. We can define missing values with xpssMissingValues() so that the defined values are not used in calculations applied afterwards. The user-defined values are saved with the original data set in the defined.Miss attribute. To transform them back into usable values, translateSPSS2R requires xpssMissingValues() to be executed again without the specification of the missing values. The defined.Miss attribute is then read out so that the NAs can be replaced by the original values. This strategy has three advantages. First, the user no longer has the overhead of creating backups of the original values. Second, SPSS users switching to R can do so step by step and don't have to struggle with the different data management logic of R from the beginning. Third, this strategy allows us to translate most SPSS operations automatically without any loss of information. Every function applied in R generates the same data and data structure that SPSS provides. A web interface has been developed to automate the translation (http://blog.eoda.de/2014/10/22/translate2r-migrate-spss-scripts-to-r-at-the-push-of-a-button/).
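The attribute-based round trip can be imitated in base R as follows. The helpers `declare_missing()` and `restore_missing()` are hypothetical; the attribute name defined.Miss is taken from the text above, but the layout stored in it here (positions plus values) is an assumption and may differ from what the package actually stores.

```r
# Hypothetical helpers that mimic the idea; NOT the translateSPSS2R API.
# The attribute name defined.Miss is from the text, its layout is assumed.
declare_missing <- function(x, values) {
  pos <- which(x %in% values)
  attr(x, "defined.Miss") <- list(pos = pos, val = x[pos])
  x[pos] <- NA                 # element assignment keeps the attribute
  x
}
restore_missing <- function(x) {
  backup <- attr(x, "defined.Miss")
  if (!is.null(backup)) {
    x[backup$pos] <- backup$val
    attr(x, "defined.Miss") <- NULL
  }
  x
}

income <- c(2100, 99, 3400, 99, 1800)

# Declare 99 as missing: the backup travels with the variable itself.
income <- declare_missing(income, 99)
n_missing <- sum(is.na(income))
mean_valid <- mean(income, na.rm = TRUE)

# Undo the declaration: the original values come back from the attribute.
income <- restore_missing(income)
```

Because the backup is attached to the variable, no separate object has to be kept in sync by hand.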
translateSPSS2R: Usual SPSS features and more!
A common challenge in R is keeping attributes alive when the structure of a data set changes, which is essentially what happens in subsetting operations that depend on meta information. The translateSPSS2R package offers a two-step solution to this challenge. First, “attributesBackup” is executed, which preserves all variable attributes in an external object; then “applyAttributes” attaches all saved attributes to the newly created subset.
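The problem and the two-step idea can be shown with base R alone. The sketch below does not call the package functions; it only imitates the backup and re-apply steps, and the attribute name `label` is made up.

```r
# Base R sketch; "label" is a made-up attribute name and the two steps only
# imitate the backup / re-apply idea, they are not the package functions.
dat <- data.frame(id = 1:4, score = c(10, 20, 30, 40))
attr(dat$score, "label") <- "Test score"

# Row subsetting drops the custom attribute of the column.
sub <- dat[dat$score > 15, ]
lost <- is.null(attr(sub$score, "label"))

# Step 1: back up the attributes of every variable in an external object.
backup <- lapply(dat, attributes)

# Step 2: re-attach them to the matching variables of the subset.
for (v in names(sub)) {
  for (a in names(backup[[v]])) {
    attr(sub[[v]], a) <- backup[[v]][[a]]
  }
}
```

Note that this naive loop is only safe for attributes that do not depend on the length of the variable; attributes such as names would need to be subset along with the data.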
Summary
Meta information plays a key role in the data transformation and data analysis process. Hence, one of the success factors in building analogues of SPSS procedures in the R environment was a reverse-engineering approach based on native R attributes. By their nature, R attributes also make data manipulations more transparent, because filter and subset actions remain traceable.
You can now download the translateSPSS2R package by following this link:
http://cran.r-project.org/web/packages/translateSPSS2R/index.html
For more information about translate2R, visit the homepage of eoda.