Dealing with a Byte Order Mark (BOM)

Posted on March 11, 2015 by andrew in R bloggers

[This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers.]

I have just been trying to import some data into R. The data were exported from a SQL Server client in tab-separated values (TSV) format. However, reading the data into R the “usual” way produced unexpected results:

> data <- read.delim("sample-query.tsv", header = FALSE, stringsAsFactors = FALSE)
> head(data)
                                   V1    V2
1 ï»¿7E51B3EC4263438B22811BE78391A823  2129
2    8617E5E557903C7FAF011FBE2DFCED1D  3518
3    1E8B37DFB143BEEEE052516D2F3B58F5  6018
4    60B8AA536CFD26C5B5CF5BA6D7B7893C  7811
5    5A3BA8589DCD62B31948DC2715CA3ED9 12850
6    3552BF8AF58A58C794A43D4AA21F4FBA 13284

Those weird characters in the first record… where did they come from? They don’t show up in a text editor, so they’re not easy to edit out.

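Googling ensued and revealed that those weird characters were in fact the byte order mark (BOM). In UTF-16 and UTF-32 the BOM indicates the endianness of the file; in UTF-8 it is simply the three-byte signature EF BB BF, which shows up as ï»¿ when the file is read as Latin-1 or Windows-1252. This was quickly confirmed using Cygwin. (Yes, shamefully, I am working under Windows at the moment!)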
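The same check can also be done from within R, without Cygwin. What follows is only a minimal sketch: it relies on the fact that a UTF-8 BOM is the byte sequence EF BB BF at the very start of the file. The helper name has_utf8_bom and the temporary demo file are made up for illustration and are not part of the original workflow.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Build two small demo files (illustrative data) so the sketch is self-contained:
# one that starts with a BOM and one that does not.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
payload <- charToRaw("7E51B3EC4263438B22811BE78391A823\t2129\n")

with_bom <- tempfile(fileext = ".tsv")
without_bom <- tempfile(fileext = ".tsv")
writeBin(c(bom, payload), with_bom)
writeBin(payload, without_bom)

has_utf8_bom(with_bom)
has_utf8_bom(without_bom)
```

On a real export you would call has_utf8_bom("sample-query.tsv") instead of using the demo files.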

The solution is remarkably simple: just specify the correct character encoding.

> data <- read.delim("sample-query.tsv", header = FALSE, stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
> head(data)
                                V1    V2
1 7E51B3EC4263438B22811BE78391A823  2129
2 8617E5E557903C7FAF011FBE2DFCED1D  3518
3 1E8B37DFB143BEEEE052516D2F3B58F5  6018
4 60B8AA536CFD26C5B5CF5BA6D7B7893C  7811
5 5A3BA8589DCD62B31948DC2715CA3ED9 12850
6 3552BF8AF58A58C794A43D4AA21F4FBA 13284

Problem solved.
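If you want to try the fix without access to a BOM-prefixed export, the following sketch writes a tiny TSV file with a UTF-8 BOM to a temporary location and reads it back with fileEncoding = "UTF-8-BOM". The file contents are copied from the first two records above; the temporary file and the variable names are assumptions made for the demo.

```r
# Write a two-record TSV file that starts with the UTF-8 BOM (EF BB BF).
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
rows <- c("7E51B3EC4263438B22811BE78391A823\t2129",
          "8617E5E557903C7FAF011FBE2DFCED1D\t3518")
demo_file <- tempfile(fileext = ".tsv")
writeBin(c(bom, charToRaw(paste0(rows, "\n", collapse = ""))), demo_file)

# Read it back, asking R to drop the BOM while decoding.
data <- read.delim(demo_file, header = FALSE, stringsAsFactors = FALSE,
                   fileEncoding = "UTF-8-BOM")

# The first value should be the plain 32-character hash, with no stray prefix.
data$V1[1]
nchar(data$V1[1])
```

The "UTF-8-BOM" encoding name is accepted by R's connections for reading (R 3.0.0 and later), so the same argument should also work with read.csv() and read.table().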
