R Data types 101, or What kind of data do I have?

[This article was first published on R on R (for ecology), and kindly contributed to R-bloggers].

Most of us are pretty familiar with data types in our daily lives — we can easily tell that things like 1, 2, 3, and 4 are numbers (in this case, integers). 15.7 is still a number, but has a decimal. We know that every single word I’m typing in this sentence is composed of characters, and we know that in math, “true” and “false” are the answers to logical statements.

Just as we do in our heads, R also categorizes our data into different classes. These categories are similar to the real-life ones I described above, but can be a little different in terms of syntax and things to watch out for in your code.

To work in R and perform data analyses, you’ll need to have a solid understanding of data types. In this tutorial, I’m going to introduce several different types of data, explain how to use and manipulate each of them, and show you how to check what type of data you have. Let’s dive in.

Types of data

There are five main types of data in R that you’d come across as an ecologist. I’ll discuss all of them below except complex numbers, which are rarely used for data analysis in R.

  1. Numeric (1.2, 5, 7, 3.14159)

  2. Integer (1, 2, 3, 4, 5)

  3. Complex (4 + 1i)

  4. Logical (TRUE / FALSE)

  5. Character ("a", "apple")

I’m also going to discuss a sixth, related category that helps you work with categorical variables:

  6. Factor

Numeric

Numeric data types are pretty straightforward. These are just numbers, written as either integers or decimals. We can check if our vector is numeric by using the function is.numeric().

# Create a numeric vector
x <- c(3, 5, 6, 10.7)
# Is our vector numeric? Yes!
is.numeric(x)

## [1] TRUE

We can check our data type by using the functions class() and typeof(). class() tells us that we’re working with numeric values, while typeof() is more specific and tells us how R stores them internally: as doubles (double-precision floating-point numbers, which can hold decimals).

# Check the class of our data
class(x)

## [1] "numeric"

# Check the specific type of data that you have
typeof(x)

## [1] "double"

You can, of course, perform mathematical operations with numeric values.

# Add 4 to all the values in the vector
x + 4

## [1] 7.0 9.0 10.0 14.7

Integer

You can also do math with integers, which represent numbers without decimal places. These are usually used if you’re counting something — for example, you can observe 7 butterflies in a plot, but you can’t observe 7.2 butterflies (or at least I hope not!).

If you create a vector manually and don’t have any decimal values, R will still identify your vector as the class “numeric”.

# Create a vector with only integers
x <- c(1, 4, 2, 7, 8)
# Look at the class
class(x)

## [1] "numeric"

You can change this vector to be an integer by using the function as.integer().

# Change the vector class
x <- as.integer(x)
# Look at the class
class(x)

## [1] "integer"
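
One thing to watch for with as.integer(): it drops the decimal part rather than rounding. Here is a minimal sketch using a made-up vector (counts is a hypothetical name, not part of the examples above).

```r
# Hypothetical measurements that should have been whole-number counts
counts <- c(7.2, 7.9, -1.6)

# as.integer() truncates toward zero; it does not round
truncated <- as.integer(counts)
truncated

# Round first if you want the nearest whole number instead
rounded <- as.integer(round(counts))
rounded
```

The first result keeps only the whole-number part of each value (7.9 becomes 7), while the second converts each value to its nearest integer.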

Alternatively, you can generate an integer vector like this. The “L” after each number tells R that you want it to be an integer.

# Create an integer vector
x <- c(1L, 2L, 5L, 3L, 10L)
# View vector
x

## [1] 1 2 5 3 10

# View class
class(x)

## [1] "integer"

You could also create an integer vector like this. The colon (:) tells R to generate a sequence of integers from 1 to 10, going up by 1 each time.

# Create a sequence of integers
x <- 1:10
# View vector
x

## [1] 1 2 3 4 5 6 7 8 9 10

# View data class
class(x)

## [1] "integer"

Some functions will also return integer vectors, like sample() when it is given integers to draw from. This function randomly draws a specified number of values from a vector. Here I asked sample() to draw ten values from the integers 1 to 10 (without replacement, so it returns all ten in a shuffled order).

# Create a random sequence of integers from 1 to 10:
set.seed(123) # use set.seed to get the same random values as me
x <- sample(1:10, 10)
# View vector
x

## [1] 3 10 2 8 6 9 1 7 5 4

# View data class
class(x)

## [1] "integer"
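
To pull the numeric and integer sections together, here is a small sketch comparing how class(), typeof(), and the is. checks treat the two kinds of numbers. The vector names dbl and int are just illustrative.

```r
dbl <- c(1, 4, 2)      # typed without L, so stored as double
int <- c(1L, 4L, 2L)   # typed with L, so stored as integer

class(dbl)       # "numeric"
typeof(dbl)      # "double"
class(int)       # "integer"
typeof(int)      # "integer"

# Integers still count as numeric, but doubles do not count as integers
is.numeric(int)  # TRUE
is.integer(dbl)  # FALSE

# The values themselves still compare as equal
all(dbl == int)  # TRUE
```

So is.numeric() is the broader check: it returns TRUE for both doubles and integers, while is.integer() only returns TRUE for vectors stored as integers.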

Complex

I’m not going to discuss this one because complex numbers aren’t used much in R for data analysis, though they exist. These are just numbers with real and imaginary components (containing the number i, or the square root of -1).

Character

Characters are another common data type. They are used to store text (also called “strings”) in R. To indicate that something is a character, we put quotation marks ("") around it.

# Create a vector of characters
x <- c("These", "are", "characters")
# View class
class(x)

## [1] "character"

Putting quotation marks around numbers will also turn them into characters, which can get confusing.

# Create a vector of characters
x <- c("1", "4", "5", "7", "8")
# View vector
x

## [1] "1" "4" "5" "7" "8"

You can’t do math with a vector of numbers that are classed as characters.

# Try to do math
mean(x)

## Warning in mean.default(x): argument is not numeric or logical: returning NA

## [1] NA

Why? Because R views them as text!

# View class
class(x)

## [1] "character"

You can turn this character vector of numbers into a numeric vector using the as.numeric() function.

Note: this commonly happens when a column of values that are supposed to be numeric accidentally contains a character value (i.e., a letter or symbol). Adding a space to a number or empty cell might have the same effect. This can happen accidentally (and so easily!) during data entry, so using as.numeric() is one way to resolve that issue. Any values that can’t be read as numbers will be converted to NAs, and R will warn you that this happened. In that scenario you’ll probably want to go back and fix your raw CSV file, but at least now the NAs will help you find where the problem was.

# Turn it into a numeric vector
x <- as.numeric(x)
# View vector
x

## [1] 1 4 5 7 8

# View class
class(x)

## [1] "numeric"

And then you can turn it back into a character using as.character().

# Turn it back into a character
x <- as.character(x)
# View vector
x

## [1] "1" "4" "5" "7" "8"

# View class
class(x)

## [1] "character"
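
The data-entry problem described in the note above can be sketched like this. The vector raw is made up for illustration: it stands in for a column that should be numeric but contains two bad entries.

```r
# Hypothetical column: "1O" contains the letter O, and "n/a" is plain text
raw <- c("12", "7", "1O", "n/a", "9")

# Entries that can't be read as numbers become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning here.
converted <- suppressWarnings(as.numeric(raw))
converted

# Positions and original values of the entries that could not be converted
bad <- which(is.na(converted))
bad
raw[bad]
```

Looking up raw[bad] shows exactly which entries need fixing in the original file.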

Logical

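The logical class has only two possible values: TRUE or FALSE (these can also be written T / F, but never true / false or t / f). Spelling out TRUE and FALSE is the safer habit, because T and F are ordinary variables that can be overwritten.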

These values result from any logical statements that are made. For example, in the code below I asked R if the elements of my vector were greater than 5. This returns a logical vector where each element is either TRUE or FALSE.

# Create a vector
x <- c(1, 5, 6, 7, 2, 8)
# Are the elements of vector x greater than 5? Store results in vector y
y <- x > 5
# View y
y

## [1] FALSE FALSE TRUE TRUE FALSE TRUE

# View class
class(y)

## [1] "logical"

You can also create a vector of logical values directly.

# Create logical vector
x <- c(T, F, T, F, F, T)
# View vector
x

## [1] TRUE FALSE TRUE FALSE FALSE TRUE
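
A word of caution about the T / F shorthand: TRUE and FALSE are reserved words in R, but T and F are ordinary variables that merely start out equal to TRUE and FALSE. The sketch below overwrites T on purpose to show what can go wrong, then removes the copy again.

```r
# T is an ordinary variable, so it can be overwritten (by accident or on purpose)
T <- 0
overwritten <- as.logical(T)
overwritten   # FALSE, because T now holds 0

# Remove our copy so that T refers to the built-in value again
rm(T)
restored <- T
restored      # TRUE
```

Writing TRUE and FALSE out in full avoids this problem entirely.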

And you can convert logical values to numeric values, and back. FALSE is the same as 0, while TRUE is the same as 1.

# Convert to numeric vector
x <- as.numeric(x)
# View vector
x

## [1] 1 0 1 0 0 1

# Convert back to logical vector
x <- as.logical(x)
# View vector again
x

## [1] TRUE FALSE TRUE FALSE FALSE TRUE

This also means that you can do math with logical values. This is useful if, for example, you’re trying to see how many TRUE values you have in your vector. In fact, applying any math operations to a logical vector will automatically convert the values to 1s and 0s.

# View vector
x

## [1] TRUE FALSE TRUE FALSE FALSE TRUE

# Count how many "TRUE" values there are. There are 3!
sum(x)

## [1] 3
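
The same trick works for proportions: because TRUE counts as 1 and FALSE as 0, the mean of a logical vector is the fraction of TRUE values. A short sketch with made-up measurements (widths is a hypothetical vector):

```r
# Hypothetical measurements
widths <- c(37, 42, 25, 32, 22)

# Which values are greater than 30?
wide <- widths > 30

# How many are greater than 30?
sum(wide)    # 3

# What proportion are greater than 30?
mean(wide)   # 0.6
```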

Factor

Factors are a special data type that is primarily used to represent repeating categories (i.e., categorical variables). When you specify an object as a factor, you’re telling R to think of it as a categorical variable, with different levels. This can be helpful when analyzing your data, as categorical variables and continuous variables are often handled differently in statistical analyses.

In the code below, I created a data frame showing the height and sex of five individuals.

# Create an example data frame
example <- data.frame(indiv = c("A", "B", "C", "D", "E"),
                      height = c(15, 10, 12, 9, 17),
                      sex = c("female", "female", "female", "male", "female"))
# View structure of data frame
str(example)

## 'data.frame': 5 obs. of 3 variables:
## $ indiv : chr "A" "B" "C" "D" ...
## $ height: num 15 10 12 9 17
## $ sex : chr "female" "female" "female" "male" ...

Right now, the sex column is a character vector because I entered the data in quotation marks. But really what I want to do is tell R that sex is a categorical variable, with “female” and “male” as levels. To do that, all I have to do is use the as.factor() function.

# Change the sex column to be a factor
example$sex <- as.factor(example$sex)
# View the factor
example$sex

## [1] female female female male female
## Levels: female male

You can see that R listed the vector and then beneath that, has figured out on its own that the levels are “female” and “male”. When writing the levels, R will sort them in alphabetical order. That’s why the levels are female male instead of male female.
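
Once a variable is a factor, a few base R functions make it easy to inspect its levels. This sketch rebuilds the sex column as a standalone vector so that it runs on its own.

```r
# The same values as the sex column above, as a standalone factor
sex <- as.factor(c("female", "female", "female", "male", "female"))

# What are the levels, and how many are there?
levels(sex)    # "female" "male"
nlevels(sex)   # 2

# How many observations fall into each level?
table(sex)
```

table() is a quick way to spot typos in categorical data, since a misspelled category shows up as its own unexpected level.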

You may want to change the order of your factor levels (this can be useful when plotting your data and determining the order in which they appear).

For example, you might have a vector like this:

# Create vector
places <- factor(c("first", "first", "second", "third", "fifth", "fourth", "second"))
# View factor
places

## [1] first first second third fifth fourth second
## Levels: fifth first fourth second third

The order of the levels doesn’t make sense. We want it to go from first through fifth in the implied numeric order — not alphabetically. So let’s change the order using factor(vector, levels = c("first", "second", "third", etc.)).

# Change level order
places <- factor(places, levels = c("first", "second", "third", "fourth", "fifth"))
# View factor
places

## [1] first first second third fifth fourth second
## Levels: first second third fourth fifth

Much better!
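
The level order is not just cosmetic. Functions such as table() and sort() follow it, and each value is stored internally as an integer code giving its position among the levels. A small sketch that recreates the reordered factor so it can run on its own:

```r
# Recreate the factor with the levels in their intended order
places <- factor(c("first", "first", "second", "third", "fifth", "fourth", "second"),
                 levels = c("first", "second", "third", "fourth", "fifth"))

# Counts are reported in level order, not alphabetical order
table(places)

# Sorting also follows the level order
sort(places)

# Each value is stored as the position of its level
as.integer(places)   # 1 1 2 3 5 4 2
```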

Factors don’t just have to be text. They can also be created from numbers. For example, in the code below I created a data frame describing the stream width and order of several stream sites. Stream order is not a continuous variable, even though it’s represented by numbers. It would be best to convert stream order to a factor.

# Create data frame
example2 <- data.frame(stream = c("Patuxent", "Patapsco", "Deer Creek", "Town Creek", "Browns Branch"),
                       width = c(37, 42, 25, 32, 22),
                       order = c(6, 6, 4, 5, 3))
# View data frame structure
str(example2)

## 'data.frame': 5 obs. of 3 variables:
## $ stream: chr "Patuxent" "Patapsco" "Deer Creek" "Town Creek" ...
## $ width : num 37 42 25 32 22
## $ order : num 6 6 4 5 3

R sees stream order as being numeric, which makes sense. But let’s tell R that stream order is a factor.

# Change stream order to a factor
example2$order <- as.factor(example2$order)
# View stream order
example2$order

## [1] 6 6 4 5 3
## Levels: 3 4 5 6

Looks good. Since these are numbers, R just orders the levels in ascending order.
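
One pitfall to be aware of with factors made from numbers: calling as.numeric() directly on the factor returns the internal level codes, not the original values. To get the original numbers back, convert to character first. The sketch below uses a standalone copy of the stream order values (stream_order is an illustrative name).

```r
# Stream order values from the example above, stored as a factor
stream_order <- factor(c(6, 6, 4, 5, 3))

# Direct conversion returns the position of each level (3 = 1, 4 = 2, 5 = 3, 6 = 4)
codes <- as.numeric(stream_order)
codes       # 4 4 2 3 1

# Converting to character first recovers the original numbers
original <- as.numeric(as.character(stream_order))
original    # 6 6 4 5 3
```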

How to check and manipulate data types

As demonstrated throughout this tutorial, it can be useful to check the type of data you’re working with and be able to change it to another type if you need. You might need this especially in situations where you’re reading in data from a .csv, and need to check that all your numbers are numeric instead of characters.

The main way to check your data type is to use the function class(). If you have a data frame, another easy way to check data types is to use the str() function. This displays the structure of your data frame and tells you what data type each of your columns is. The example below lists heights over time for five individuals.

# Create an example data frame
example <- data.frame(indiv = c("A", "B", "C", "D", "E"),
                      height_0 = c(15, 10, 12, 9, 17),
                      height_10 = c(20, 18, 14, 15, 19),
                      height_20 = c(23, 24, 18, 17, 26))
str(example)

## 'data.frame': 5 obs. of 4 variables:
## $ indiv : chr "A" "B" "C" "D" ...
## $ height_0 : num 15 10 12 9 17
## $ height_10: num 20 18 14 15 19
## $ height_20: num 23 24 18 17 26

You can see that the column indiv is a character vector (abbreviated “chr”), while each successive column is numeric (abbreviated “num”).

You may also have noticed me using functions like is.numeric() or as.character(). All of the data types have is. and as. functions. The is. functions are logical tests that ask “is this object of the class XXX?” and return TRUE or FALSE. The as. functions are actions that convert objects into a new data type. You may find yourself using these often when you’re first formatting your data and preparing it for analysis.
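
Here is one way the is. and as. functions might be combined when cleaning up a data frame. This is a sketch using a made-up data frame (df) whose height column was entered as text; stringsAsFactors = FALSE is set explicitly so the text columns stay as characters regardless of R version.

```r
# Hypothetical data frame where height was accidentally stored as text
df <- data.frame(indiv = c("A", "B", "C"),
                 height = c("15", "10", "12"),
                 stringsAsFactors = FALSE)

# Check the class of every column at once
before <- sapply(df, class)
before

# Convert height only if it is not already numeric
if (!is.numeric(df$height)) {
  df$height <- as.numeric(df$height)
}

# Check again, then do some math with the repaired column
after <- sapply(df, class)
after
mean(df$height)
```

sapply(df, class) returns the same class information that str() prints, but as a named vector you can use in further code.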

That’s it for data types in R! Keep an eye out for our next tutorial, which will go over different data structures in R like vectors, lists, data frames, and tibbles. I hope this tutorial was helpful! Happy coding!



If you enjoyed this tutorial and want to learn more about data types and how to use them, you can check out Luka Negoita’s full course on the complete basics of R for ecology here:
