[This article was first published on
Just Another R Blog, and kindly contributed to
R-bloggers]. (You can report issue about the content on this page
here)
Want to share your content on R-bloggers?
click here if you have a blog, or
here if you don't.
< !DOCTYPE html>< !-- saved from url=(0014)about:internet -->
as.character() is your friend < !-- Styles for R syntax highlighter --> < !-- R syntax highlighter -->
as.character() is your friend
Sometimes when you open a data file (lets say a .csv), variables will be recognized as factor whereas it should be numeric. It is therefore tempting to simply convert the variable to numeric using as.numeric(). Big mistake! If you use directly as.numeric() on a variable of the type factor, R will returns the levels of the factor rather the actual values. To overcome this hassle, you should first convert the variable into a string with as.character() and thereafter use as.numeric().
x = as.factor(rnorm(5))
x
## [1] 1.641794358679 0.963356182913528 -0.752806105114263
## [4] 0.278437879536854 -0.25115986471113
## 5 Levels: -0.752806105114263 -0.25115986471113 ... 1.641794358679
## Wrong way
as.numeric(x)
## [1] 5 4 1 3 2
## Right way
as.numeric(as.character(x))
## [1] 1.6418 0.9634 -0.7528 0.2784 -0.2512
Another way to prevent numeric to be converted into factor would be to use as.is = T when opening the data file with read.csv.
Remove all objects in the workspace
rm(list = ls())
Replace values of a data frame with NA
library(xtable)
## Random matrix with with some (50) zeros
df = matrix(rnorm(100), ncol = 10)
df[sample(100, 50)] = 0
df = data.frame(df)
print(xtable(df), type = "html")
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
1 | 0.00 | 2.27 | 0.00 | -0.41 | -0.51 | 0.62 | 0.00 | 0.64 | 0.00 | -0.62 |
2 | 1.46 | 0.00 | 0.90 | 1.82 | 0.31 | 0.00 | -0.37 | -1.55 | 0.78 | 0.55 |
3 | 0.00 | 0.09 | -0.00 | 0.28 | 0.00 | -0.59 | 0.00 | 0.00 | 0.00 | -0.09 |
4 | -0.38 | 0.00 | 0.00 | -2.12 | 0.00 | 1.09 | -0.77 | 0.00 | 0.20 | 0.00 |
5 | 0.98 | 1.21 | 1.03 | 0.00 | -1.50 | 0.00 | 1.86 | 1.23 | -0.75 | 0.00 |
6 | 1.24 | 0.00 | -1.29 | 0.00 | 1.73 | -0.14 | 1.56 | 0.00 | -0.09 | 0.00 |
7 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.09 | 0.00 | 0.00 |
8 | 0.00 | 0.00 | -0.86 | 0.00 | 1.73 | 0.00 | 0.00 | 0.00 | 0.22 | 0.00 |
9 | 1.27 | -2.36 | -0.62 | 0.00 | -1.31 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
10 | 0.00 | -0.45 | 0.00 | 0.00 | 0.00 | -0.44 | -0.27 | 0.00 | -0.30 | -0.84 |
## Replace the 0 with NA
df[df == 0] = NA
print(xtable(df), type = "html", NA.string = "NA")
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
1 | NA | 2.27 | NA | -0.41 | -0.51 | 0.62 | NA | 0.64 | NA | -0.62 |
2 | 1.46 | NA | 0.90 | 1.82 | 0.31 | NA | -0.37 | -1.55 | 0.78 | 0.55 |
3 | NA | 0.09 | -0.00 | 0.28 | NA | -0.59 | NA | NA | NA | -0.09 |
4 | -0.38 | NA | NA | -2.12 | NA | 1.09 | -0.77 | NA | 0.20 | NA |
5 | 0.98 | 1.21 | 1.03 | NA | -1.50 | NA | 1.86 | 1.23 | -0.75 | NA |
6 | 1.24 | NA | -1.29 | NA | 1.73 | -0.14 | 1.56 | NA | -0.09 | NA |
7 | NA | NA | NA | NA | NA | NA | NA | 1.09 | NA | NA |
8 | NA | NA | -0.86 | NA | 1.73 | NA | NA | NA | 0.22 | NA |
9 | 1.27 | -2.36 | -0.62 | NA | -1.31 | NA | NA | NA | NA | NA |
10 | NA | -0.45 | NA | NA | NA | -0.44 | -0.27 | NA | -0.30 | -0.84 |
Count % of missing values per column
The other day I was working with a large database. I was looking for a way to calculate % of missing values for each variable. Here is how I calculated it.
## Generate some data
df = matrix(rnorm(100), ncol = 10)
df[sample(100, 50)] = NA
df = data.frame(df)
print(xtable(df), type = "html", NA.string = "NA")
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
1 | NA | NA | NA | -1.04 | NA | -0.19 | NA | NA | NA | NA |
2 | NA | NA | NA | 0.15 | 0.14 | 2.48 | NA | NA | -1.04 | 1.51 |
3 | NA | NA | 0.01 | NA | -0.58 | -0.27 | NA | NA | 1.09 | -0.36 |
4 | 0.83 | 0.23 | NA | -0.07 | 1.04 | NA | -1.97 | 0.56 | 0.71 | -0.03 |
5 | NA | NA | NA | 0.80 | -0.98 | -0.97 | NA | -1.30 | NA | -1.19 |
6 | NA | NA | -0.56 | NA | 0.12 | -0.32 | -0.58 | 0.91 | NA | NA |
7 | NA | 0.78 | NA | -0.55 | -0.80 | NA | NA | -1.24 | NA | NA |
8 | NA | -0.04 | NA | NA | 0.80 | NA | NA | NA | NA | -0.19 |
9 | 0.51 | -0.75 | -1.33 | -0.68 | 0.41 | -0.25 | NA | NA | 1.47 | NA |
10 | 0.53 | NA | -0.11 | NA | -0.04 | NA | NA | -0.34 | -1.48 | -1.82 |
## Find % of missing values in each column with apply
missing = apply(df, 2, function(x) sum(is.na(x)))/nrow(df) * 100
## Print the result
df2 = data.frame(variable = names(df), missing = missing)
print(xtable(df2), type = "html", include.rownames = FALSE)
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
variable | missing |
X1 | 70.00 |
X2 | 60.00 |
X3 | 60.00 |
X4 | 40.00 |
X5 | 10.00 |
X6 | 40.00 |
X7 | 80.00 |
X8 | 50.00 |
X9 | 50.00 |
X10 | 40.00 |
Grouping data frame
There is various way to merge data frame in R. Here I present some variant of the simplest way using merge
df1 = data.frame(ID = sample(LETTERS[1:20], 10, replace = F), x = rnorm(10))
df2 = data.frame(ID = sample(LETTERS[1:20], 10, replace = F), y = rnorm(10))
print(xtable(df1), type = "html", include.rownames = FALSE)
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
ID | x |
P | 0.51 |
A | 1.25 |
F | -0.25 |
I | 0.57 |
D | 0.25 |
C | -0.56 |
E | -1.30 |
H | 0.34 |
S | -1.08 |
O | 0.01 |
print(xtable(df2), type = "html", include.rownames = FALSE)
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
ID | y |
G | -1.12 |
N | 0.47 |
A | -1.64 |
O | 2.02 |
S | -0.71 |
T | -0.48 |
F | -1.63 |
B | -0.59 |
D | -0.80 |
J | 1.14 |
df_exact_match = merge(df1, df2, by = "ID") ## Exact match
print(xtable(df_exact_match), type = "html", include.rownames = FALSE, NA.string = "NA")
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
ID | x | y |
A | 1.25 | -1.64 |
D | 0.25 | -0.80 |
F | -0.25 | -1.63 |
O | 0.01 | 2.02 |
S | -1.08 | -0.71 |
df_all_x = merge(df1, df2, by = "ID", all.x = T) ## Keep all x (discard y that are not matched)
print(xtable(df_all_x), type = "html", include.rownames = FALSE, NA.string = "NA")
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
ID | x | y |
A | 1.25 | -1.64 |
C | -0.56 | NA |
D | 0.25 | -0.80 |
E | -1.30 | NA |
F | -0.25 | -1.63 |
H | 0.34 | NA |
I | 0.57 | NA |
O | 0.01 | 2.02 |
P | 0.51 | NA |
S | -1.08 | -0.71 |
df_all_y = merge(df1, df2, by = "ID", all.y = T) ## Keep all y (discard x that are not matched)
print(xtable(df_all_y), type = "html", include.rownames = FALSE, NA.string = "NA")
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
ID | x | y |
A | 1.25 | -1.64 |
D | 0.25 | -0.80 |
F | -0.25 | -1.63 |
O | 0.01 | 2.02 |
S | -1.08 | -0.71 |
B | NA | -0.59 |
G | NA | -1.12 |
J | NA | 1.14 |
N | NA | 0.47 |
T | NA | -0.48 |
df_all = merge(df1, df2, by = "ID", all = T) ## Keep everything
print(xtable(df_all), type = "html", include.rownames = FALSE, NA.string = "NA")
< !-- html table generated in R 2.15.1 by xtable 1.7-0 package --> < !-- Mon Mar 11 21:06:27 2013 -->
ID | x | y |
A | 1.25 | -1.64 |
C | -0.56 | NA |
D | 0.25 | -0.80 |
E | -1.30 | NA |
F | -0.25 | -1.63 |
H | 0.34 | NA |
I | 0.57 | NA |
O | 0.01 | 2.02 |
P | 0.51 | NA |
S | -1.08 | -0.71 |
B | NA | -0.59 |
G | NA | -1.12 |
J | NA | 1.14 |
N | NA | 0.47 |
T | NA | -0.48 |