1. Vectors Do Not Have Defined "Dimensions"
Strangely, the dim ("dimension") function does not work on vectors, though it does work on matrices and higher-dimensional arrays. A vector has no dim attribute, so dim simply returns NULL.
dim(1:10)
## NULL
If you want to figure out how long a vector is, you need to use the length function.
length(1:10)
## [1] 10
Which would be fine, except that length also works on matrices, where it counts the total number of elements rather than returning a "length," which is no longer well defined.
length(matrix(1, nrow = 10, ncol = 10))
## [1] 100
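A useful middle ground, not covered above, is base R's NROW and NCOL (note the capitals), which treat a vector as a one-column matrix and so behave consistently on both. A quick sketch:
NROW(1:10)
## [1] 10
NCOL(1:10)
## [1] 1
NROW(matrix(1, nrow = 10, ncol = 10))
## [1] 10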
So: length for vectors, dim for matrices. Easy, right? Wrong. Which brings me to my next quirk.
2. Class Dropping
Matrices that are reduced to a single row (or column) silently become vectors, and some matrix functions no longer work on them.
mymatrix <- matrix(rnorm(12), nrow = 3, ncol = 4)
mymatrix
##         [,1]    [,2]    [,3]    [,4]
## [1,]  1.3941 -0.7149 -1.7237 -1.6695
## [2,]  0.6882  1.4039 -2.2238 -0.3019
## [3,] -0.2032  1.3995 -0.3562 -0.3349
dim(mymatrix[1:2, ])
## [1] 2 4
dim(mymatrix[1, ])
## NULL
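To make the "functions no longer work" point concrete, here is a small sketch: rowSums demands at least two dimensions, so it fails once the row has been dropped to a vector (the exact error text may vary slightly by R version).
rowSums(mymatrix[1, ])
## Error: 'x' must be an array of at least two dimensions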
Fortunately, there is a kind of fix for this: specifying drop = F when taking a subset of a matrix will preserve its class as a matrix.
dim(mymatrix[1, , drop = F])
## [1] 1 4
You do not, however, need to worry about your matrix becoming a vector if you subset down to zero rows or columns.
dim(mymatrix[-1:-3, ])
## [1] 0 4
3. Negative Subscripts
These are nothing to be bothered by, since they are entirely optional. However, if you run into them they can certainly throw you for a loop. Negative subscripts subset a matrix or vector by removing the specified elements, rows, or columns.
For example:
myvect <- -4:5
myvect
##  [1] -4 -3 -2 -1  0  1  2  3  4  5
They can be a little tricky to work with. Say we want to remove every other number in our ten-element vector.
myvect[c(-1, -3, -5, -7, -9)]
## [1] -3 -1  1  3  5
Or more easily
myvect[-seq(1, 9, 2)]
## [1] -3 -1  1  3  5
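A related wrinkle worth knowing: positive and negative subscripts cannot be mixed in a single subset (only zeros may accompany negatives), so an attempt like the following fails.
myvect[c(-1, 2)]
## Error: only 0's may be mixed with negative subscripts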
4. NA Values Specifying Missing Data
These special values can be quite challenging to work with. The biggest challenge for me is detecting when one of them pops up and adjusting my code accordingly.
NA indicates a value that is "Not Available," or missing. Having an indicator for missingness is standard: Stata uses a ".", while other programs use -9999 or some other unlikely number. The problem with NAs is not that they exist, since almost every data set has missing values at some point, but the frequency with which they interrupt normal function behavior. For instance, with a long vector such as 1:100, if a single value is NA then most functions you attempt on it will simply return NA.
a <- c(1:20, NA)
sum(a)
## [1] NA
max(a)
## [1] NA
min(a)
## [1] NA
cor(a, a)
## [1] NA
Once again, this is not necessarily a problem, but it is annoying and can easily create unexpected behavior. There are several solutions. One is to remove any observations with missing values from your data. This can create its own issues, since NAs do not conform to logical operators (in contrast with Stata, SPSS, and every other statistical language I know of). Thus
a2 <- a[a != NA]
a2
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
fails, while
a3 <- a[!is.na(a)]
a3
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
does the job.
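For whole data sets, base R has helpers that do this row filtering for you. A minimal sketch with a made-up data frame:
df <- data.frame(x = c(1, 2, NA), y = c(NA, 5, 6))
na.omit(df)  # drops every row containing an NA
##   x y
## 2 2 5
complete.cases(df)  # one logical per row, usable as a subscript
## [1] FALSE  TRUE FALSE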
Alternatively, many commonly used functions have special arguments devoted expressly to handling missing data.
sum(a, na.rm = T)
## [1] 210
max(a, na.rm = T)
## [1] 20
min(a, na.rm = T)
## [1] 1
cor(a, a, use = "pairwise.complete.obs")
## [1] 1
5. "Empty Values": NULL, integer(0), numeric(0), logical(0)
These four values are several of R's empty-value indicators. I don't believe they exhaust the list, and I am sure each has its own unique properties, which I cannot say I fully understand. The challenge usually lies in detecting when they occur. As with NAs, they resist logical comparisons:
a == NULL
## logical(0)
logical(0)
## logical(0)
# Even:
(a == NULL) == logical(0)
## logical(0)
Instead, we are forced to check the length of these values.
length(NULL)
## [1] 0
length(integer(0))
## [1] 0
length(numeric(0))
## [1] 0
length(logical(0))
## [1] 0
This trick works for an empty matrix as well.
length(mymatrix[!1:3, !1:4])
## [1] 0
Though dim is not a bad choice either.
dim(mymatrix[!1:3, !1:4])
## [1] 0 0
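Since these zero-length values usually show up as the result of filtering or matching, a length check makes a dependable guard. A small sketch with made-up values:
x <- c(2, 4, 6)
hits <- which(x > 10)  # nothing qualifies
hits
## integer(0)
if (length(hits) == 0) message("nothing found")
## nothing found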
6. Attach Does Not Do What You Want It to Do
Attach and detach are functions that ostensibly promise to let a user rapidly reference, use, and modify an active data set, much as in Stata or SPSS.
However, this is a mistake. IT IS NOT the solution to making R work more like Stata. When a data frame is "attached" in R, its columns become directly accessible by name.
mydata <- data.frame(a1 = 1:30, b1 = 31:60)
attach(mydata)
b1
##  [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## [24] 54 55 56 57 58 59 60
However, say you want to create a new variable in the data set:
c1 <- a1 + b1
Then you are going to have a problem: sure, you just created c1, but c1 is not part of mydata.
names(mydata)
## [1] "a1" "b1"
You may add c1 to mydata:
mydata$c1 <- c1
names(mydata)
## [1] "a1" "b1" "c1"
But the c1 in mydata is a different object from the c1 in working memory.
mydata$c1 <- mydata$c1 * -1
head(mydata)
##   a1 b1  c1
## 1  1 31 -32
## 2  2 32 -34
## 3  3 33 -36
## 4  4 34 -38
## 5  5 35 -40
## 6  6 36 -42
head(c1)
## [1] 32 34 36 38 40 42
Which is not necessarily a problem, since we can drop the working-memory c1.
rm(c1)
But c1 is still not available, even though mydata is still attached.
try(c1)
Which means we "attached" our mydata object back when mydata did not include c1; attach works on a copy, so the attached version never sees later changes. To update the values we need to reattach mydata, but first we need to detach the old copy:
detach(mydata)
attach(mydata)
head(c1)
## [1] -32 -34 -36 -38 -40 -42
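If you want to see where that attached copy lives, search() lists it on R's search path; a hedged sketch (output abridged, and the exact packages shown will vary):
search()
## [1] ".GlobalEnv"   "mydata"   ...   "package:base"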
Thus we realize that attach is not a workable solution for handling data. Instead we are forced to use R's more elaborate subscripting to manipulate our data, such as:
mydata$c1 <- mydata$a1 + mydata$b1
This is not unworkable, though it can be annoying. There are shortcuts in R for specifying a data frame or list to work from (sketched below), but they end up facing challenges similar to attach and can ultimately produce unnecessarily complex code.
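For reference, here is a hedged sketch of the kind of shortcut meant here, using base R's with and transform and reusing the example data from above:
# evaluate an expression with mydata's columns in scope, without attach
with(mydata, a1 + b1)
# build (or rebuild) a column and return the modified data frame in one step
mydata <- transform(mydata, c1 = a1 + b1)
names(mydata)
## [1] "a1" "b1" "c1"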
7. R Function Help Is Grouped
For instance, ?gsub returns a single help file covering seven related functions: grep, grepl, sub, regexpr, gregexpr, and regexec, as well as gsub itself. All of them share some commonalities with gsub, yet each does something somewhat different. I can understand why the documentation might have been organized this way originally, since I imagine it was rather terse at first (as some help files persist in being). As the help files got fleshed out, the functions just never got broken out of their clusters.
It is worth noting that I have never seen this kind of function clumping in the help files of contributed R packages. For example, library(ggplot2); ?ggplot; ?geom_point all return help files specific to a single function apiece.