R XML: How to Work With XML Files in R
The R programming language can read all sorts of data, and XML is no exception. There are many ways to read, parse, and manipulate these markup language files in R, and today we’ll explore two. By the end of the article, you’ll know how to use two R packages to work with XML.
We’ll kick things off with an R XML introduction – you’ll get a sense of what XML is, and we’ll also write an XML dataset from scratch. Then, you’ll learn how to access individual elements, convert XML files to an R tibble and a data.frame, and much more.
Table of contents:
- Introduction to R XML
- R XML Basics – How to Read and Parse XML Files
- How to Convert XML Data to tibble and data.frame
- Summary of R XML
Introduction to R XML
First, let’s answer one important question: What is XML? The acronym stands for Extensible Markup Language. It’s similar to HTML since they’re both markup languages, but XML is used for storing and transmitting data rather than displaying it. By convention, XML files have an .xml file extension.
When you first start working with XML files you’ll immediately appreciate the structure. It’s human-readable, and there aren’t a gazillion brackets as with JSON. Unlike HTML, XML has no predefined tags. You can name your tags however you want, but it’s best to name them around the business logic.
XML documents typically start with the following line, the XML prolog. It’s optional, but including it is good practice:
<?xml version="1.0" encoding="UTF-8"?>
Each XML file must also have exactly one root element, which can have one or many child nodes. Child nodes may have children of their own.
Let’s see this in action! The following code snippet declares an XML dataset containing employees. There’s one root element, <records>, and each <employee> child has its own children, such as <last_name>:
<?xml version="1.0" encoding="UTF-8"?>
<records>
  <employee>
    <id>1</id>
    <first_name>John</first_name>
    <last_name>Smith</last_name>
    <position>CEO</position>
    <salary>10000</salary>
    <hire_date>2022-1-1</hire_date>
    <department>Management</department>
  </employee>
  <employee>
    <id>2</id>
    <first_name>Jane</first_name>
    <last_name>Sense</last_name>
    <position>Marketing Associate</position>
    <salary>3500</salary>
    <hire_date>2022-1-15</hire_date>
    <department>Marketing</department>
  </employee>
  <employee>
    <id>3</id>
    <first_name>Frank</first_name>
    <last_name>Brown</last_name>
    <position>R Developer</position>
    <salary>6000</salary>
    <hire_date>2022-1-15</hire_date>
    <department>IT</department>
  </employee>
  <employee>
    <id>4</id>
    <first_name>Judith</first_name>
    <last_name>Rollers</last_name>
    <position>Data Scientist</position>
    <salary>6500</salary>
    <hire_date>2022-3-1</hire_date>
    <department>IT</department>
  </employee>
  <employee>
    <id>5</id>
    <first_name>Karen</first_name>
    <last_name>Switch</last_name>
    <position>Accountant</position>
    <salary>4000</salary>
    <hire_date>2022-1-10</hire_date>
    <department>Accounting</department>
  </employee>
</records>
Copy this file and save it locally – we’ve named it data.xml. You’ll need it in the following section, where we’ll work with XML in R.
But before we can do that, you’ll have to install two R packages:
install.packages("xml2")
install.packages("XML")
Both are used to work with XML, and you can get by with only the first for most tasks. The second one has a couple of convenient functions for converting XML files, which we’ll cover later.
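If you would rather build the dataset from R than copy it by hand, xml2 can also create documents. The following is a minimal sketch, not part of the original workflow: it writes only the first employee for brevity, and it saves to a temporary file (the `out_file` path is an assumption) so it won’t overwrite your data.xml.

```r
library(xml2)

# Build the document in memory: a <records> root with one <employee> child
doc <- xml_new_root("records")
employee <- xml_add_child(doc, "employee")

# Unnamed arguments after the tag name become the text of the new node
xml_add_child(employee, "id", "1")
xml_add_child(employee, "first_name", "John")
xml_add_child(employee, "last_name", "Smith")
xml_add_child(employee, "position", "CEO")
xml_add_child(employee, "salary", "10000")
xml_add_child(employee, "hire_date", "2022-1-1")
xml_add_child(employee, "department", "Management")

# Write to a temporary file (hypothetical path) and read it back as a check
out_file <- tempfile(fileext = ".xml")
write_xml(doc, out_file)
first_names <- xml_text(xml_find_all(read_xml(out_file), ".//first_name"))
```

Repeating the `xml_add_child()` calls for the remaining employees, or wrapping them in a loop over a data frame, would produce the full file.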
First things first, let’s see how you can read and parse XML files in R.
R XML Basics – How to Read and Parse XML Files
By now you should have the dataset downloaded and R packages installed. Create a new R script and use the following code to load in the packages and read the XML file:
library(xml2)
library(XML)

employee_data <- read_xml("data.xml")
employee_data
Here’s what it looks like:
The data is all there, but it’s unusable. You can make it usable by parsing the entire document or reading individual elements.
Let’s explore the parsing option first. Call the xmlParse() function from the XML package and pass in employee_data:
employee_xml <- xmlParse(employee_data)
employee_xml
The contents now look like our source file:
Pro tip: if you don’t care about the data, you can print the structure only. That’s done with the xml_structure() function:
xml_structure(employee_data)
If you want to access all elements with the same tag, you can use the xml_find_all() function with an XPath expression. It returns a node set, and each printed node shows the opening and closing tags along with any content between them:
xml_find_all(employee_data, ".//position")
If you only want the content, use the xml_text(), xml_integer(), or xml_double() function, depending on the underlying data type. The first one makes the most sense here:
xml_text(xml_find_all(employee_data, ".//position"))
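The same functions also cover single nodes and numeric values. Here’s a small sketch under a few assumptions: it uses an inline copy of two records (so it runs without data.xml), and the XPath predicate on department is our own illustration rather than something the dataset requires.

```r
library(xml2)

# Inline copy of two records so the sketch runs without data.xml
doc <- read_xml('<records>
  <employee><id>3</id><last_name>Brown</last_name><salary>6000</salary><department>IT</department></employee>
  <employee><id>5</id><last_name>Switch</last_name><salary>4000</salary><department>Accounting</department></employee>
</records>')

# xml_find_first() returns a single node instead of a node set
first_employee <- xml_find_first(doc, ".//employee")
first_id <- xml_integer(xml_find_first(first_employee, "./id"))

# An XPath predicate filters employees by the text of a child element
it_names <- xml_text(xml_find_all(doc, ".//employee[department='IT']/last_name"))
it_salaries <- xml_double(xml_find_all(doc, ".//employee[department='IT']/salary"))
```

With the full data.xml you would pass employee_data instead of `doc`; the XPath expressions stay the same.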
You now know how to do some basic R XML operations, but most of the time you want to convert these files to either a tibble or a data frame for easier access and manipulation. Let’s see how to do that next.
How to Convert XML Data to tibble and data.frame
Most of the time with R and XML you’ll want to extract either all or a couple of features and turn them into a more readable format. We’ve already shown you how to use xml_text() to extract text from a specific element, and now we’ll do a similar thing with integers. Then, we’ll format these two fields as a tibble.
Here’s the entire code snippet:
library(tibble)

# Extract department and salary info
dept <- xml_text(xml_find_all(employee_data, ".//department"))
salary <- xml_integer(xml_find_all(employee_data, ".//salary"))

# Format as a tibble
df_dept_salary <- tibble(department = dept, salary = salary)
df_dept_salary
Now we have the department names and salaries for all employees. From here, it’s easy to calculate the average salary per department (note that only the IT department occurs twice):
library(dplyr)

# Group by department name to get average salary by department
df_dept_salary %>%
  group_by(department) %>%
  summarise(salary = mean(salary))
If you want to convert the entire XML document to an R data.frame, look no further than the XML package. It has a convenient xmlToDataFrame() function that does the job in a single call:
df_employees <- xmlToDataFrame(nodes = getNodeSet(employee_xml, "//employee"))
df_employees
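One thing to watch for: as far as we can tell, xmlToDataFrame() reads values as text unless you tell it otherwise, so numeric columns typically need a conversion step before analysis. Below is a minimal sketch that assumes an inline two-record document (`xml_string` is a stand-in for data.xml) and converts the columns by hand.

```r
library(XML)

# Inline stand-in for data.xml, trimmed to three fields per employee
xml_string <- '<records>
  <employee><id>1</id><first_name>John</first_name><salary>10000</salary></employee>
  <employee><id>2</id><first_name>Jane</first_name><salary>3500</salary></employee>
</records>'

parsed <- xmlParse(xml_string, asText = TRUE)
df <- xmlToDataFrame(
  nodes = getNodeSet(parsed, "//employee"),
  stringsAsFactors = FALSE  # keep text as character so conversions are safe
)

# Convert the numeric columns explicitly
df$id <- as.integer(df$id)
df$salary <- as.numeric(df$salary)
```

Setting stringsAsFactors explicitly avoids surprises on older package versions, where converting a factor with as.integer() would return level codes instead of the values.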
That’s all the loading and preprocessing needed before you can start analyzing and visualizing datasets. It’s also the most common pipeline you’ll have for loading XML files, so we’ll end today’s article here.
Summary of R XML
XML files are common in 2022 and you as a data professional must know how to work with them. Almost all R XML-related work you’ll do boils down to loading and parsing XML documents and converting them to an analysis-friendly format. Today you’ve learned how to do that with two excellent R packages.
For a homework assignment, try to read only the <hire_date> element, and make sure to parse it as a date. Is there a built-in function, or do you need to take an extra step? Make sure to let us know in the comment section below.
The post R XML: How to Work With XML Files in R appeared first on Appsilon | Enterprise R Shiny Dashboards.