I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected. Here is an example involving the built-in data set SASHELP.CLASS.
Here is the code:
data c1;
set sashelp.class;
* define a new character variable to classify someone as tall or short;
if height > 60
then height_class = 'Tall';
else height_class = 'Short';
run;
* print the results for the first 5 rows;
proc print
data = c1 (obs = 5);
run;
Here is the result:
| Obs | Name | Sex | Age | Height | Weight | height_class |
|---|---|---|---|---|---|---|
| 1 | Alfred | M | 14 | 69.0 | 112.5 | Tall |
| 2 | Alice | F | 13 | 56.5 | 84.0 | Shor |
| 3 | Barbara | F | 13 | 65.3 | 98.0 | Tall |
| 4 | Carol | F | 14 | 62.8 | 102.5 | Tall |
| 5 | Henry | M | 14 | 63.5 | 102.5 | Tall |
What happened? Why does the word “Short” render as “Shor”?
This occurred because SAS fixes the length of a new character variable when it compiles the DATA step, using the first statement that assigns it a value. My code assigned “Tall” first, which has a length of 4, so “height_class” became a character variable with a length of 4. Any longer value assigned afterwards is truncated to 4 characters, which is why “Short” became “Shor”.
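One way to confirm the stored length is to look at the variable's attributes. The following is a minimal sketch, assuming the data set c1 created above still exists in the WORK library; PROC CONTENTS is not part of the original code, and it is used here only to inspect the result.

```sas
* list the attributes of every variable in c1;
* height_class should be listed as a character variable with a length of 4;
proc contents
     data = c1;
run;
```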
How can we circumvent this? You can pre-set the length of a new character variable with the LENGTH statement, placed before the first statement that assigns it a value. In the revised code below, I correct the problem by setting the length of “height_class” to 5 before defining its possible values.
data c2;
set sashelp.class;
* define a new character variable to classify someone as tall or short;
length height_class $ 5;
if height > 60
then height_class = 'Tall';
else height_class = 'Short';
run;
* print the results for the first 5 rows;
proc print
data = c2 (obs = 5);
run;
Here is the result:
| Obs | Name | Sex | Age | Height | Weight | height_class |
|---|---|---|---|---|---|---|
| 1 | Alfred | M | 14 | 69.0 | 112.5 | Tall |
| 2 | Alice | F | 13 | 56.5 | 84.0 | Short |
| 3 | Barbara | F | 13 | 65.3 | 98.0 | Tall |
| 4 | Carol | F | 14 | 62.8 | 102.5 | Tall |
| 5 | Henry | M | 14 | 63.5 | 102.5 | Tall |
Notice that “height_class” for Alice is “Short”, as it should be.
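The printout above covers only the first 5 rows. As a fuller check, a frequency table lists every distinct value of “height_class”. The sketch below assumes the data set c2 created above and uses PROC FREQ, which is not part of the original code.

```sas
* tabulate every distinct value of height_class in c2;
* a truncated value such as 'Shor' would appear as its own row;
proc freq
     data = c2;
     tables height_class;
run;
```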
An alternative solution is to rewrite the code so that the first assignment to “height_class” uses the longest possible value. This does not require the LENGTH statement, but it depends on the order of the statements.
data c3;
set sashelp.class;
* define a new character variable to classify someone as tall or short;
if height <= 60
then height_class = 'Short';
else height_class = 'Tall';
run;
By the way, I don’t notice this problem in R. Here is some code to illustrate this observation.
> set.seed(235)
>
> # randomly generate 3 values
> x = rnorm(3, 60, 5)
>
> # add a value to the beginning of "x" so that the first value is above 60
> # add a value to the end of "x" so that the last value is below 60
> x = c(63, x, 57)
> x
[1] 63.00000 70.68902 61.36082 56.62601 57.00000
>
> # pre-allocate a character vector for classifying "x" as "tall" or "short"
> y = character(length(x))
>
>
> for (i in 1:length(x))
+ {
+ if (x[i] > 60)
+ {
+ y[i] = 'Tall'
+ }
+ else
+ {
+ y[i] = 'Short'
+ }
+ }
>
>
> # display "y"
> y
[1] "Tall" "Tall" "Tall" "Short" "Short"
Notice that the value “Short” is stored in full, with all 5 characters. R does not fix the length of the strings in a character vector, so I did not need to pre-set a length for “y” first.
