Build a data frame from vectors
Tabular data is the most common format used by data scientists. In R, tables are represented through data frames. They can be inspected by printing them to the console.
- Understand why data frames are important
- Interpret console output created by a data frame
- Create a new data frame using the data.frame() function
- Define vectors to be used for single columns
- Specify names of data frame columns
data.frame(___ = ___, ___ = ___, ...)
Introduction to Data Frames
In analysis and statistics, tabular data is the most important data structure. It is present in many common formats like Excel files, comma separated values (CSV) or databases. R integrates tabular data objects as first-class citizens into the language through data frames. Data frames allow users to easily read and manipulate tabular data within the R language.
Let’s take a look at a data frame object named Davis, from the package carData, which includes height and weight measurements for 200 men and women:
Davis
  sex weight height repwt repht
1   M     77    182    77   180
2   F     58    161    51   159
3   F     53    161    54   158
 [ reached 'max' / getOption("max.print") -- omitted 197 rows ]
From the printed output we can see that the data frame spans 200 rows (3 printed, 197 omitted) and 5 columns. In the example above, each row contains the data of one person through attributes, which correspond to the columns sex, weight, height, reported weight repwt and reported height repht.
For example, the first row in the table describes a male (M) who weighs 77 kg and has a height of 182 cm. The reported values are very close, with 77 kg and 180 cm, respectively.
The rows in a data frame are further identified by row names on the left, which are simply the row numbers by default. In the case of the Davis dataset above, the row names range from 1 to 200.
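The row and column counts described above can also be queried directly. The following is a minimal sketch that rebuilds only the three printed rows of Davis by hand, so the counts refer to this small copy and not to all 200 rows. The variable name davis_head is an assumption made for illustration; the full dataset lives in the carData package.

```r
# Hand-built copy of the three rows of Davis printed above
# (davis_head is a hypothetical name, not part of carData)
davis_head <- data.frame(
  sex    = c("M", "F", "F"),
  weight = c(77, 58, 53),
  height = c(182, 161, 161),
  repwt  = c(77, 51, 54),
  repht  = c(180, 159, 158)
)

nrow(davis_head)      # number of rows: 3
ncol(davis_head)      # number of columns: 5
names(davis_head)     # column (attribute) names
rownames(davis_head)  # default row names: "1" "2" "3"
```

Applied to the full Davis data frame, the same functions should report 200 rows and 5 columns, matching the printed output.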
Quiz: Data Frame Output
      rank discipline yrs.since.phd yrs.service  sex salary
1     Prof          B            19          18 Male 139750
2     Prof          B            20          16 Male 173200
3 AsstProf          B             4           3 Male  79750
 [ reached 'max' / getOption("max.print") -- omitted 394 rows ]
The data frame above shows the nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.
Which answers about the data frame printed above are correct?
- The data frame has 3 rows.
- The data frame has 394 rows.
- The data frame has 397 rows.
- The data frame has 6 attributes.
- The attribute names contain Prof and AsstProf.
Quiz: Data Frame Output (2)
      rank discipline yrs.since.phd yrs.service  sex salary
1     Prof          B            19          18 Male 139750
2     Prof          B            20          16 Male 173200
3 AsstProf          B             4           3 Male  79750
 [ reached 'max' / getOption("max.print") -- omitted 394 rows ]
The data frame above shows the nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.
Which answers about the first three faculty members are correct?
- All three are male.
- The salaries of all three members are about the same.
- The Professor in row three is most probably the oldest.
- All shown professors are from the same discipline.
- The highest salary amongst the three Professors is $139,750.
Creating Data Frames
data.frame(___ = ___, ___ = ___, ...)
Data frames hold tabular data in various columns or attributes. Each column is represented by a vector, and different columns can have different data types such as numbers or characters. The data.frame() function supports the construction of data frame objects by combining different vectors into a table. To form a table, the vectors are required to have equal lengths. A data frame can also be seen as a collection of vectors connected together to form a table.
Let’s create our first data frame with four different persons, including their ids, names and indicators of whether they are female or not. Each of these attributes is created by a different vector with a different data type (numeric, character and logical). The attributes are finally combined into a table using the data.frame() function:
data.frame(
  c(1, 2, 3, 4),
  c("Louisa", "Jonathan", "Luigi", "Rachel"),
  c(TRUE, FALSE, FALSE, TRUE)
)
  c.1..2..3..4. c..Louisa....Jonathan....Luigi....Rachel..
1             1                                     Louisa
2             2                                   Jonathan
3             3                                      Luigi
4             4                                     Rachel
  c.TRUE..FALSE..FALSE..TRUE.
1                        TRUE
2                       FALSE
3                       FALSE
4                        TRUE
The resulting data frame stores the values of each vector in a different column. It has four rows and three columns. However, the column names printed on the first line seem to include the column values separated by dots which is a very strange naming scheme!
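The odd names have a plausible explanation, sketched here under a stated assumption: when no argument names are supplied, data.frame() appears to use the text of each expression as the column name and then make it syntactically valid, which is what the base R function make.names() does by replacing characters that are not allowed in names with dots. The variable names expr_text, auto_name and df are assumptions made for illustration.

```r
# Text of the first unnamed argument, exactly as typed in the call
expr_text <- "c(1, 2, 3, 4)"

# make.names() turns every character that is invalid in a name into a dot
auto_name <- make.names(expr_text)
auto_name  # "c.1..2..3..4."

# Compare with the name data.frame() assigns to an unnamed column
df <- data.frame(c(1, 2, 3, 4))
names(df)
identical(names(df), auto_name)
```

The parentheses, commas and spaces of the original expression each become one dot, which accounts for the runs of dots in the printed header.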
Column names can be included in the data.frame() construction as argument names preceding the values of the column vectors. To improve the column naming of the previous data frame we can write:
data.frame(
  id = c(1, 2, 3, 4),
  name = c("Louisa", "Jonathan", "Luigi", "Rachel"),
  female = c(TRUE, FALSE, FALSE, TRUE)
)
  id     name female
1  1   Louisa   TRUE
2  2 Jonathan  FALSE
3  3    Luigi  FALSE
4  4   Rachel   TRUE
The resulting data frame includes the column names needed to see the actual meaning of the different columns.
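The same table can also be assembled from vectors that were defined beforehand, which makes the idea of "a collection of vectors" easier to see. This is a minimal sketch: the variable name persons is an assumption, and stringsAsFactors = FALSE is passed explicitly because R versions before 4.0 would otherwise convert the name column to a factor.

```r
# One vector per column, each of length four
id     <- c(1, 2, 3, 4)
name   <- c("Louisa", "Jonathan", "Luigi", "Rachel")
female <- c(TRUE, FALSE, FALSE, TRUE)

# Combine the vectors; argument names become column names
persons <- data.frame(
  id = id,
  name = name,
  female = female,
  stringsAsFactors = FALSE  # keep name as character in older R versions too
)

names(persons)           # "id" "name" "female"
sapply(persons, class)   # data type of each column
sapply(persons, length)  # every column has the same length
```

Each column keeps the data type of the vector it was built from: numeric, character and logical.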
Exercise: Creating Your First Data Frame
weekday | temperature | hot |
---|---|---|
Monday | 28 | FALSE |
Tuesday | 31 | TRUE |
Wednesday | 25 | FALSE |
Let’s create a data frame as shown above using the data.frame() function. The resulting data frame should consist of the three columns weekday, temperature and hot:
- The first column named weekday contains the weekday names "Monday", "Tuesday", "Wednesday".
- The second column named temperature contains the temperatures (in degrees Celsius) as 28, 31, 25.
- The third column named hot contains the logical values FALSE, TRUE, FALSE.
Store the final data frame in the variable temp and print its output to the console.
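One possible solution is sketched below. It follows the column names and values listed in the exercise; other ways of writing it, such as defining the vectors first, would work equally well.

```r
# Build the data frame column by column, following the exercise text
temp <- data.frame(
  weekday = c("Monday", "Tuesday", "Wednesday"),
  temperature = c(28, 31, 25),  # degrees Celsius
  hot = c(FALSE, TRUE, FALSE)
)

# Typing the variable name prints the data frame to the console
temp
```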
Quiz: Which statements are true about this data frame?
price <- c(28, 31, 25)
data.frame(
  weekday = c("Monday", "Tuesday", "Wednesday", "Thursday"),
  price = price,
  expensive = price > 30
)
Which statements are true about the data frame above?
- The data.frame() function will fail because the column expensive is not a vector.
- The data.frame() function will not fail.
- The data.frame() function fails because the lengths of the vectors are different.
- The command would work if weekday had the values c("Monday", "Tuesday", "Wednesday").
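The quiz statements can be checked by running the code. The sketch below wraps the original call in tryCatch() so that an error, if one occurs, is captured as a message instead of stopping the script; it then repeats the call with three weekdays. The variable names result and fixed are assumptions made for illustration.

```r
price <- c(28, 31, 25)

# Original call: weekday has four values, price and expensive have three
result <- tryCatch(
  data.frame(
    weekday = c("Monday", "Tuesday", "Wednesday", "Thursday"),
    price = price,
    expensive = price > 30
  ),
  error = function(e) conditionMessage(e)  # return the error text on failure
)
result

# Same call with a weekday vector of length three
fixed <- data.frame(
  weekday = c("Monday", "Tuesday", "Wednesday"),
  price = price,
  expensive = price > 30  # logical vector derived from price
)
fixed
```

Note that price > 30 is itself a logical vector of length three, so the expensive column is not what causes any problem.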
Build a data frame from vectors is an excerpt from the course Introduction to R, which is available for free at https://www.quantargo.com