Fixing aberrant files using R and the shell: a case study
Once in a while, you embark on what looks like a simple computational procedure only to encounter frustration very early on. “I can’t even read my file into R!” you cry.
Step back, take a deep breath and take note of what the software is trying to tell you. Most times, you’ve just missed something very straightforward. Here’s an example.
Recently, I was trying to retrieve some data describing characteristics of microbial genomes from the NCBI FTP site. The file, lproks_0.txt, looked like a regular tab-delimited file with a couple of header lines:
head lproks_0.txt

## Microbial Organism Information Page
## Columns: "RefSeq project ID" "Project ID" "Taxonomy ID" "Organism Name" "Super Kingdom" "Group" "Sequence Status" "Genome Size" "GC Content" "Gram Stain" "Shape" "Arrangment" "Endospores" "Motility" "Salinity" "Oxygen Req" "Habitat" "Temp. range" "Optimal temp." "Pathogenic in" "Disease" "Genbank accessions" "Refseq accessions"
49725  30807  551115  'Nostoc azollae' 0708  Bacteria  Cyanobacteria  complete  5.532  38.3  Filaments  Filaments  Yes  Aerobic  Multiple  Mesophilic  -  CP002059,CP002060,CP002061  -
55729  33011  592010  Abiotrophia defectiva ATCC 49176  Bacteria  Firmicutes  assembly  3.4774  37.1  +  Cocci  No  No  Facultative  -  ACIN00000000  -
Sharp eyes will notice a problem right there, on the first line of data. Less sharp-eyed users like me will open an R console to read the file, expecting no issues:
genomes <- read.table("lproks_0.txt", header = T, skip = 2, stringsAsFactors = F, sep = "\t")

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 376 did not have 23 elements
“I can’t even read my file into R!”
My first mistake: reaching for the tool I know best, rather than the tool most appropriate for the job. In this case, my instinct was to count the fields in a line using awk. Since we told R to skip the first 2 lines, the line that R calls 376 is line 378 in the original file:
awk '{FS="\t"; print NF}' lproks_0.txt | head -n 378 | nl | tail -n 1

   378  23
Sure looks like line 378 has 23 elements to me. Replacing “NF” with “$0”, we can examine the line content:
awk '{FS="\t"; print $0}' lproks_0.txt | head -n 378 | tail -n 1

49969  40633  376219  Arthrospira sp. PCC 8005  Bacteria  Cyanobacteria  unfinished  6.14555  44.7  Spiral  Yes  Facultative  Aquatic  Mesophilic  -  -  -
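As a quick cross-check from within R (a sketch of mine, not part of the original post), we can split the raw line on tab characters ourselves:

# a sketch: count the tab-separated fields of file line 378 from within R
line378 <- readLines("lproks_0.txt")[378]
length(strsplit(line378, "\t", fixed = TRUE)[[1]])  # 23, agreeing with awk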
Looks fine. What is the problem?
As it happens, R comes with several useful functions for examining the structure of files. One of these is count.fields(), which does what it says on the tin. We can combine it with table() to tabulate how many lines have each field count:
> table(count.fields("lproks_0.txt", skip = 2, sep = "\t"))

   4    6   21   23 
  15    1   15 6753 
There we are: indeed, R sees 31 lines (15 + 1 + 15) that do not have 23 fields.
Which lines are causing problems? We can find the indices of the problem lines like this:
> which(count.fields("lproks_0.txt", skip = 2, sep = "\t") != 23)
 [1]  377  684  685  830  831  835  836  837  838  839  840  841  842  916  918
[16] 1995 2709 2710 3248 3254 3255 3256 3257 3262 4086 4105 4679 6125 6293 6294
[31] 6556
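These indices count from the first line after the two we skipped, so (a small sketch of mine, not in the original post) the corresponding line numbers in the file itself are obtained by adding 2:

# a sketch: map count.fields() indices back to line numbers in the original file
bad <- which(count.fields("lproks_0.txt", skip = 2, sep = "\t") != 23)
bad + 2  # 379, 686, 687, ... line numbers in lproks_0.txt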
Now we can go back to the original file and examine some of those problem lines:
awk '{FS="\t"; print $0}' lproks_0.txt | head -n 379 | tail -n 1

58297  13478  322098  Aster yellows witches'-broom phytoplasma AYWB  Bacteria  Firmicutes  complete  0.727401  26.8  Sphere  Aerobic  Host-associated  Mesophilic  -  Plant  Aster yellows witches'-broom  CP000061,CP000063,CP000064,CP000065,CP000062  -

awk '{FS="\t"; print $0}' lproks_0.txt | head -n 844 | tail -n 1

47787    535329  Brucella abortus bv. 1 str. 8-953#1146  Bacteria  Alphaproteobacteria  unfinished  -  Rod  -  -
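The same inspection works without leaving R at all (a sketch of mine, run against the unmodified file):

# a sketch: view the two offending lines directly in R
readLines("lproks_0.txt")[c(379, 844)]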
Problem found. Some of the lines contain a single-quote character, which read.table() treats by default as a quoting character. Others contain the “#” symbol, which read.table() interprets by default as a comment character. No wonder the file could not be read correctly.
How to fix the file? Ideally, you would find a way to read the file “as is”, and read.table() has arguments for exactly that.
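A sketch of that approach (my addition, not shown in the original post): passing quote = "" and comment.char = "" disables quote and comment handling, so both offending characters are read literally.

# a sketch: read the unmodified file by disabling quote and comment handling
genomes <- read.table("lproks_0.txt", header = TRUE, skip = 2,
                      stringsAsFactors = FALSE, sep = "\t",
                      quote = "", comment.char = "")

Alternatively, if altering the file contents is not an issue, we can bring sed into play: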
# back up the file, then remove single quotes
cp lproks_0.txt lproks_0.txt.orig
sed -i "s/'//g" lproks_0.txt
# replace # with -
sed -i "s/#/-/g" lproks_0.txt
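Before re-reading the file into R, a quick sanity check (my addition, a sketch) confirms the substitutions worked; the field count should now collapse to a single value of 23:

# a sketch: after the edits, every data line should report 23 fields
table(count.fields("lproks_0.txt", skip = 2, sep = "\t"))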
And back to the beginning:
genomes <- read.table("lproks_0.txt", header = T, skip = 2, stringsAsFactors = F, sep = "\t")
dim(genomes)

[1] 6783   23
All of which illustrates that 80% or more of my data analysis time is spent simply getting the data into a state where it can be analysed.