R 101

Posted on March 11, 2013 by Phil in R bloggers | 0 Comments

[This article was first published on Just Another R Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

<strong>as.character() is your friend</strong>

as.character() is your friend

Sometimes when you open a data file (lets say a .csv), variables will be recognized as factor whereas it should be numeric. It is therefore tempting to simply convert the variable to numeric using as.numeric(). Big mistake! If you use directly as.numeric() on a variable of the type factor, R will returns the levels of the factor rather the actual values. To overcome this hassle, you should first convert the variable into a string with as.character() and thereafter use as.numeric().

x = as.factor(rnorm(5))
x

## [1] 1.641794358679     0.963356182913528  -0.752806105114263
## [4] 0.278437879536854  -0.25115986471113 
## 5 Levels: -0.752806105114263 -0.25115986471113 ... 1.641794358679


## Wrong way
as.numeric(x)

## [1] 5 4 1 3 2


## Right way
as.numeric(as.character(x))

## [1]  1.6418  0.9634 -0.7528  0.2784 -0.2512

Another way to prevent numeric to be converted into factor would be to use as.is = T when opening the data file with read.csv.

Remove all objects in the workspace

rm(list = ls())

Replace values of a data frame with NA

library(xtable)

## Random matrix with with some (50) zeros
df = matrix(rnorm(100), ncol = 10)
df[sample(100, 50)] = 0
df = data.frame(df)

print(xtable(df), type = "html")

	X1	X2	X3	X4	X5	X6	X7	X8	X9	X10
1	0.00	2.27	0.00	-0.41	-0.51	0.62	0.00	0.64	0.00	-0.62
2	1.46	0.00	0.90	1.82	0.31	0.00	-0.37	-1.55	0.78	0.55
3	0.00	0.09	-0.00	0.28	0.00	-0.59	0.00	0.00	0.00	-0.09
4	-0.38	0.00	0.00	-2.12	0.00	1.09	-0.77	0.00	0.20	0.00
5	0.98	1.21	1.03	0.00	-1.50	0.00	1.86	1.23	-0.75	0.00
6	1.24	0.00	-1.29	0.00	1.73	-0.14	1.56	0.00	-0.09	0.00
7	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.09	0.00	0.00
8	0.00	0.00	-0.86	0.00	1.73	0.00	0.00	0.00	0.22	0.00
9	1.27	-2.36	-0.62	0.00	-1.31	0.00	0.00	0.00	0.00	0.00
10	0.00	-0.45	0.00	0.00	0.00	-0.44	-0.27	0.00	-0.30	-0.84

## Replace the 0 with NA
df[df == 0] = NA

print(xtable(df), type = "html", NA.string = "NA")

	X1	X2	X3	X4	X5	X6	X7	X8	X9	X10
1	NA	2.27	NA	-0.41	-0.51	0.62	NA	0.64	NA	-0.62
2	1.46	NA	0.90	1.82	0.31	NA	-0.37	-1.55	0.78	0.55
3	NA	0.09	-0.00	0.28	NA	-0.59	NA	NA	NA	-0.09
4	-0.38	NA	NA	-2.12	NA	1.09	-0.77	NA	0.20	NA
5	0.98	1.21	1.03	NA	-1.50	NA	1.86	1.23	-0.75	NA
6	1.24	NA	-1.29	NA	1.73	-0.14	1.56	NA	-0.09	NA
7	NA	NA	NA	NA	NA	NA	NA	1.09	NA	NA
8	NA	NA	-0.86	NA	1.73	NA	NA	NA	0.22	NA
9	1.27	-2.36	-0.62	NA	-1.31	NA	NA	NA	NA	NA
10	NA	-0.45	NA	NA	NA	-0.44	-0.27	NA	-0.30	-0.84

Count % of missing values per column

The other day I was working with a large database. I was looking for a way to calculate % of missing values for each variable. Here is how I calculated it.

## Generate some data
df = matrix(rnorm(100), ncol = 10)
df[sample(100, 50)] = NA
df = data.frame(df)
print(xtable(df), type = "html", NA.string = "NA")

	X1	X2	X3	X4	X5	X6	X7	X8	X9	X10
1	NA	NA	NA	-1.04	NA	-0.19	NA	NA	NA	NA
2	NA	NA	NA	0.15	0.14	2.48	NA	NA	-1.04	1.51
3	NA	NA	0.01	NA	-0.58	-0.27	NA	NA	1.09	-0.36
4	0.83	0.23	NA	-0.07	1.04	NA	-1.97	0.56	0.71	-0.03
5	NA	NA	NA	0.80	-0.98	-0.97	NA	-1.30	NA	-1.19
6	NA	NA	-0.56	NA	0.12	-0.32	-0.58	0.91	NA	NA
7	NA	0.78	NA	-0.55	-0.80	NA	NA	-1.24	NA	NA
8	NA	-0.04	NA	NA	0.80	NA	NA	NA	NA	-0.19
9	0.51	-0.75	-1.33	-0.68	0.41	-0.25	NA	NA	1.47	NA
10	0.53	NA	-0.11	NA	-0.04	NA	NA	-0.34	-1.48	-1.82

## Find % of missing values in each column with apply
missing = apply(df, 2, function(x) sum(is.na(x)))/nrow(df) * 100

## Print the result
df2 = data.frame(variable = names(df), missing = missing)
print(xtable(df2), type = "html", include.rownames = FALSE)

variable	missing
X1	70.00
X2	60.00
X3	60.00
X4	40.00
X5	10.00
X6	40.00
X7	80.00
X8	50.00
X9	50.00
X10	40.00

Grouping data frame

There is various way to merge data frame in R. Here I present some variant of the simplest way using merge

df1 = data.frame(ID = sample(LETTERS[1:20], 10, replace = F), x = rnorm(10))
df2 = data.frame(ID = sample(LETTERS[1:20], 10, replace = F), y = rnorm(10))

print(xtable(df1), type = "html", include.rownames = FALSE)

ID	x
P	0.51
A	1.25
F	-0.25
I	0.57
D	0.25
C	-0.56
E	-1.30
H	0.34
S	-1.08
O	0.01

print(xtable(df2), type = "html", include.rownames = FALSE)

ID	y
G	-1.12
N	0.47
A	-1.64
O	2.02
S	-0.71
T	-0.48
F	-1.63
B	-0.59
D	-0.80
J	1.14

df_exact_match = merge(df1, df2, by = "ID")  ## Exact match
print(xtable(df_exact_match), type = "html", include.rownames = FALSE, NA.string = "NA")

ID	x	y
A	1.25	-1.64
D	0.25	-0.80
F	-0.25	-1.63
O	0.01	2.02
S	-1.08	-0.71

df_all_x = merge(df1, df2, by = "ID", all.x = T)  ## Keep all x (discard y that are not matched)
print(xtable(df_all_x), type = "html", include.rownames = FALSE, NA.string = "NA")

ID	x	y
A	1.25	-1.64
C	-0.56	NA
D	0.25	-0.80
E	-1.30	NA
F	-0.25	-1.63
H	0.34	NA
I	0.57	NA
O	0.01	2.02
P	0.51	NA
S	-1.08	-0.71

df_all_y = merge(df1, df2, by = "ID", all.y = T)  ## Keep all y (discard x that are not matched)
print(xtable(df_all_y), type = "html", include.rownames = FALSE, NA.string = "NA")

ID	x	y
A	1.25	-1.64
D	0.25	-0.80
F	-0.25	-1.63
O	0.01	2.02
S	-1.08	-0.71
B	NA	-0.59
G	NA	-1.12
J	NA	1.14
N	NA	0.47
T	NA	-0.48

df_all = merge(df1, df2, by = "ID", all = T)  ## Keep everything
print(xtable(df_all), type = "html", include.rownames = FALSE, NA.string = "NA")

ID	x	y
A	1.25	-1.64
C	-0.56	NA
D	0.25	-0.80
E	-1.30	NA
F	-0.25	-1.63
H	0.34	NA
I	0.57	NA
O	0.01	2.02
P	0.51	NA
S	-1.08	-0.71
B	NA	-0.59
G	NA	-1.12
J	NA	1.14
N	NA	0.47
T	NA	-0.48

To leave a comment for the author, please follow the link and comment on their blog: Just Another R Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

R 101

as.character() is your friend

Remove all objects in the workspace

Replace values of a data frame with NA

Count % of missing values per column

Grouping data frame

Related

as.character() is your friend

Remove all objects in the workspace

Replace values of a data frame with NA

Count % of missing values per column

Grouping data frame

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)