Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
“Happy families are all alike; every unhappy family is unhappy in its own way” runs the opening sentence of Anna Karenina. Hadley Wickham echoes the sentiment in a somewhat different context: “Tidy datasets are all alike, but every messy dataset is messy in its own way”. Data analysis is mostly data wrangling. That is, before you can do anything at all with your data, you need to get it into a format that your software can read. More often than not, the stage between having collected (or found) some data and being able to analyze it is frustrating, awkward, and filled with difficulties particular to the data you are working with. Things are encoded this way rather than that; every fifth line has an extra column; the data file contains subtotals and running headers; the tables are only available as PDFs; the source website has no API, and so on.
This makes teaching data wrangling a little awkward, too. On the one
hand, R’s tidyr
package has a
bomb-disposal squad of functions designed to defuse these problems. But
looking at them piecemeal might make any particular one seem highly
specialized and hard to motivate in general. On the other hand, any
dataset in need of wrangling will likely have all kinds of idiosyncratic
problems, so a worked example may end up seeming far too specific.
In practice we bridge the two extremes by repeatedly showing tools in use, moving back and forth between some specific file problem and more general heuristics for diagnosing and solving problems; or between some specific function and the more general theory of data that it is trying to help you apply.
Here’s a real life case that I had fun with this week. I learned more about a general idea in the process of trying to solve a particular problem that had come up in the middle of trying to transform a PDF of more-or-less formatted tables into a tidy dataset.
Run-Length Encoding
The general idea I learned more about was the notion of run-length encoding. This is
a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run.
Wikipedia’s example comes from the transmission of TV signals:
Consider a screen containing plain black text on a solid white background, over hypothetical scan line, it can be rendered as follows:
12W1B12W3B24W1B14W
. This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, etc., and represents the original 67 characters in only 18. While the actual format used for the storage of images is generally binary rather than ASCII characters like this, the principle remains the same.
Though primarily a way of compressing data, we can use a run-length
encoding to keep track of how a sequence unfolds. For example, here’s a
vector of TRUE
and FALSE
values:
1 2 3 4 5 6 |
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE) length(x) ## [1] 14 |
In R we can get a run-length encoding of this vector with the rle()
function. It returns an object with two pieces, the lengths and the
values.
1 2 3 4 5 |
rle(x) ## Run Length Encoding ## lengths: int [1:10] 2 1 1 3 1 1 1 2 1 1 ## values : logi [1:10] TRUE FALSE TRUE FALSE TRUE FALSE ... |
So here we have ten runs. TRUE
twice, then FALSE
once, then TRUE
once, then FALSE
three times, and so on. We can get the run lengths
alone with
1 2 3 |
rle(x)$lengths ## [1] 2 1 1 3 1 1 1 2 1 1 |
If we use sequence()
to count the runs back out, the result will be
equal to the length of the original vector:
1 2 3 |
sequence(rle(x)$lengths) ## [1] 1 2 1 1 1 2 3 1 1 1 1 2 1 1 |
This makes a sequence that starts counting at 1 and resets to 1 whenever
a new value is seen. Because the values are TRUE
and FALSE
only,
once we know the start value we can reconstitute the sequence knowing
just it and the runs. We can take advantage of the binary
TRUE
/FALSE
data (implicitly 1/0 numerically) in a different way,
too. With the original vector and the sequenced run, we can get
different ways of counting the sequence depending on how we multiply it
by the values of x
, or negate those values before multiplying them.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
tibble(x = x, seq1 = sequence(rle(x)$lengths), seq2 = sequence(rle(!x)$lengths) * x, seq3 = sequence(rle(!x)$lengths) * !x) ## # A tibble: 14 × 4 ##x seq1 seq2 seq3 ##<lgl> <int> <int> <int> ## 1 TRUE 1 1 0 ## 2 TRUE 2 2 0 ## 3 FALSE 1 0 1 ## 4 TRUE 1 1 0 ## 5 FALSE 1 0 1 ## 6 FALSE 2 0 2 ## 7 FALSE 3 0 3 ## 8 TRUE 1 1 0 ## 9 FALSE 1 0 1 ## 10 TRUE 1 1 0 ## 11 FALSE 1 0 1 ## 12 FALSE 2 0 2 ## 13 TRUE 1 1 0 ## 14 FALSE 1 0 1 |
So for logical vectors we can make it so that our counter starts at
either 1 or 0 for TRUE
, and counts up from there. We can also have
FALSE
always coded as 0 (no matter how many times it’s repeated) but
TRUE
always coded as 1 and then counted up from there. Different
indexes for the sequence can be more or less convenient depending on
what we might need to keep track of in some vector.
A PDF of almost-regular tables
That’s the general bit. Now for the idiosyncratic one. This week I was working with about a hundred pages of legal billing data. Every page of the PDF had information about the date, rate, and hours of service billed, with a description of the service that could be anywhere from one to ten or so lines long. I wanted to get this PDF into R to do look at the records more systematically.
Now, In many cases, the most straightforward way to get a PDF table into some sort of plain-text format will not be to write any code at all. Instead you can use Excel, for instance, to read PDF tables a page at a time, or try Adobe’s PDF to Excel service if you’re a subscriber. These are useful tools, and if your job is straightforward you should consider starting with them.
We’re going to use R, because that’s how we roll around here. Even
within R there are some decisions to make. For example, I might have
used tabula, the Java-driven PDF-to-text
engine, via the tabulizer
package. But I didn’t have tabula installed, so instead to get the PDF
file into R we’ll use pdftools,
which depends on the poppler
rendering library, which I did have installed. From there we use some
standard tidyverse packages, notably dplyr
and tidyr
.
One thing to bear in mind is that, as often happens, the wrangling and cleaning sequence I’m about to show here developed fairly organically. That is, I was trying to get the hang of the data and this is how I proceeded. Some steps were driven partly by a desire to keep seeing the data in a way that was immediately comprehensible, just so that I could get a quick sense of whether I was making any big mistakes. If this were the sort of problem where doing it as efficiently as possible really mattered, the next step would be to investigate various “roads not taken” a bit more thoroughly. I guarantee you that there are better ways to this. But for my purposes, this way was good enough.
Ultimately we’re going to end up with a table containing columns with these names:
1 2 |
varnames <- c("date", "timekeeper", "description", "hours", "rate", "fees", "notes") |
Getting there will take a little while. A PDF file is an unfriendly thing to try to extract plain-text data
from. (Turning typeset tables or text back into plain-text is sometimes
likened to trying to reconstitute a pig from a packet of sausages.) In
this case, the data is all in there, which is to say that the numbers
and the text is all there in a way we can get to, but when we import it
into R we will lose all the tabular formatting. All we will have left is
a stream of text with lines designated by the special \n
character.
1 2 3 4 5 6 7 |
pgs <- pdftools::pdf_text("data/appendix_a.pdf") pgs <- set_names(pgs, nm = 1:length(pgs)) length(pgs) ## [1] 92 |
In this next part, we’ll use some regular expressions inside stringr
functions to strip some lines we don’t need (such as the “Page n of 92”
header lines); trim the strings of excess whitespace at the start and
end; and finally anonymize the data for the purposes of this post. I do
this last bit by looking for every word and replacing it with a random
word. The words come from stringr
’s built-in vector words
, which it
uses for examples. It has about a thousand miscellaneous words. The
regular expression \p{L}
will match any single unicode character that
is a letter. Writing \p{L}+
will make the expression keep matching
until a non-letter is encountered (e.g. a punctuation mark or a space).
Finally, because we are inside R, we need to double-escape the special
\p
code so that it is \\p
, in order for the backslash to be
preserved.
1 2 3 4 |
eg <- pgs |> str_remove_all("Page \\d{1,2} of 92") |> str_trim() |> str_replace_all("\\p{L}+", function(x) sample(stringr::words, 1)) |
OK, so now we can look at what we have. It’s a vector with 92 elements:
1 2 3 |
length(eg) ## [1] 92 |
Each of these elements is a single long, messy string that contains all
the characters from one page, including special ones like \n
for
newline. The newline characters are represented by the text \n
rather
than being interpreted as actual new lines. Here, for example, is page
four:
1 2 3 |
eg[4] ## [1] "4/22/14 stickcontact for world insure fun real (.3); correct shut‐realise cat, apart 1.2635762.00\nexample but trust invest, over door head responsible introduce involve, figure\naddress cheap english company (.9)\n4/22/14 wage at soon‐lose commit television break care'try lay (.2); admit hullo whole luck 0.4749299.60\nview strong involve\n4/27/14 charactershould between field double air look' clean‐week occasion 1.3635 825.50\n4/28/14 produceguess busy collect tree kill double' deep‐introduce lady 5.4635 3,429.00\n4/29/14 ideacall wee person common succeed catch' system‐big house 9.1635 5,778.50\n5/2/14available product hit worse stage' seem true can; house begin company 0.4749 299.60 0.8 product seem call/ year hot\nnotice presume\n5/9/14nationseat east operate farm common next0.1635 63.50\n5/15/14 individualnever not jump refer document monday consider (.5); new original then door chance 0.4635254.00 0.9 dog indeed bother/ function bad\nintroduce drop talk (.4)\n5/15/14 word little such tell paper knock lie town hundred with type bother 0.2749149.80 0.5 shop if council/ enough cake\n\n5/20/14 cup world right like top (1); clean total family (1.5); help 3.1749 2,321.90 6.2 confer hot sing/ switch a\ncar lead cup (3.7)\n5/23/14 poundprogramme rule stupid 0.1635 63.50\n5/23/14 question vote plus hospital what suit. top 0.1749 74.90\n6/9/14issue bed system now discuss converse'age though eight round (hospital kitchen2.7749 2,022.30 5.5 study large resource/ system arm\nlabour country); receive figure'meet he speak afternoon, specific common stand\nwhat list aware; television further, join side bar tell\n\n6/10/14 servicealthough front view pair occasion 0.3635190.50\n6/11/14 blue report day town too garden line call wear attend enough0.2749149.80\n6/16/14 amount to hell list simple life leg encourage; as contract 4.2749 3,145.80 8.5 post nine friday/ wrong honest\ncommon brief night student (.4); date 2elect fair beauty couple suggest unless function 2always finish\nmay type trouble (1.7); matter britain oppose lunch america out;\nscotland consider many without represent house during 2nine colleague tape\n(2.1); the down lead clock opportunity water individual thank simple bit\nhot (2.5); clothe balance suggest never moment company (1.8)\n\n6/17/14 class might bottom work afternoon yes hope 0.6749 449.40 1.3 true income air/ hall doubt\n6/22/14 worth educate air quick save, health six slow hospital share 3.2749 2,396.80 6.5 happen million lady/ sister field\ndrop choose regard yet\n6/23/14 createscotland next amount love favour2.9635 1,841.50 5.8 bar week cross/ service turn" |
It goes on for a long way off to the right there.
Breaking up the lines
The first thing we will do is take each element in the vector (i.e. the long, long string that is each
page), and break it on every newline character. That is, we’ll split the
string so that every time we encounter the sequence \n
we will make it
an actual new line. We use the
1 |
eg <- str_split(eg, pattern = "\\n") |
Because we have split each \n
character into an actual newline, the
structure of our eg
data object has changed. It used to be a vector.
But now it is a list:
1 2 3 4 5 6 7 |
class(eg) ## [1] "list" length(eg) ## [1] 92 |
It’s a list of the same length the vector was, but now each list element contains a bunch of lines. Here is page 4 again:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
eg[4] ## [[1]] ## [1] "4/22/14 stickcontact for world insure fun real (.3); correct shut‐realise cat, apart 1.2635762.00" ## [2] "example but trust invest, over door head responsible introduce involve, figure" ## [3] "address cheap english company (.9)" ## [4] "4/22/14 wage at soon‐lose commit television break care'try lay (.2); admit hullo whole luck 0.4749299.60" ## [5] "view strong involve" ## [6] "4/27/14 charactershould between field double air look' clean‐week occasion 1.3635 825.50" ## [7] "4/28/14 produceguess busy collect tree kill double' deep‐introduce lady 5.4635 3,429.00" ## [8] "4/29/14 ideacall wee person common succeed catch' system‐big house 9.1635 5,778.50" ## [9] "5/2/14available product hit worse stage' seem true can; house begin company 0.4749 299.60 0.8 product seem call/ year hot" ## [10] "notice presume" ## [11] "5/9/14nationseat east operate farm common next0.1635 63.50" ## [12] "5/15/14 individualnever not jump refer document monday consider (.5); new original then door chance 0.4635254.00 0.9 dog indeed bother/ function bad" ## [13] "introduce drop talk (.4)" ## [14] "5/15/14 word little such tell paper knock lie town hundred with type bother 0.2749149.80 0.5 shop if council/ enough cake" ## [15] "" ## [16] "5/20/14 cup world right like top (1); clean total family (1.5); help 3.1749 2,321.90 6.2 confer hot sing/ switch a" ## [17] "car lead cup (3.7)" ## [18] "5/23/14 poundprogramme rule stupid 0.1635 63.50" ## [19] "5/23/14 question vote plus hospital what suit. top 0.1749 74.90" ## [20] "6/9/14issue bed system now discuss converse'age though eight round (hospital kitchen2.7749 2,022.30 5.5 study large resource/ system arm" ## [21] "labour country); receive figure'meet he speak afternoon, specific common stand" ## [22] "what list aware; television further, join side bar tell" ## [23] "" ## [24] "6/10/14 servicealthough front view pair occasion 0.3635190.50" ## [25] "6/11/14 blue report day town too garden line call wear attend enough0.2749149.80" ## [26] "6/16/14 amount to hell list simple life leg encourage; as contract 4.2749 3,145.80 8.5 post nine friday/ wrong honest" ## [27] "common brief night student (.4); date 2elect fair beauty couple suggest unless function 2always finish" ## [28] "may type trouble (1.7); matter britain oppose lunch america out;" ## [29] "scotland consider many without represent house during 2nine colleague tape" ## [30] "(2.1); the down lead clock opportunity water individual thank simple bit" ## [31] "hot (2.5); clothe balance suggest never moment company (1.8)" ## [32] "" ## [33] "6/17/14 class might bottom work afternoon yes hope 0.6749 449.40 1.3 true income air/ hall doubt" ## [34] "6/22/14 worth educate air quick save, health six slow hospital share 3.2749 2,396.80 6.5 happen million lady/ sister field" ## [35] "drop choose regard yet" ## [36] "6/23/14 createscotland next amount love favour2.9635 1,841.50 5.8 bar week cross/ service turn" |
You can see from that double-bracketed [[1]]
at the top (“The first
element you asked for”) that we’re looking at a list. Inside is a series
of vectors, numbering however many lines we got on that particular page.
We could keep working with this list just as a list. But to make things
a little more convenient—really just to make it easier to look at its
contents as we go—we’re going to convert the series of lines inside each
list element to a tibble (i.e. a nice data table). To begin with, it
will have just three columns: text
, which will contain each line of
text on the page; line
, which numbers the line, and page
for the
page. We’re doing the same thing to each element of this vector, so that
is a kind of iteration. But rather than write a loop, the way we do this
in a pipeline is to apply or map a function or series of actions to
each element of the list. This lets us keep things in a pipeline. To
generate the page
variable we use imap()
, which can access the name
of the current list element. In this case that’s just a number corresponding to the page.
If you’re not used to reading a pipeline like this, read |>
as “and
then”. Imagine starting with the data, eg
, and then doing a series
of things to it. Each line hands on the result to the next line, which
takes it as input.
1 2 3 4 |
eg <- eg |> map(~ tibble(text = unlist(.x), line = 1:length(text))) |> imap(~ .x |> mutate(page = .y)) |
So what did that do? Here’s page 4 again:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
eg[4] ## [[1]] ## # A tibble: 36 × 3 ## text line page ## <chr> <int> <int> ## 1 "4/22/14 stickcontact for world insure fun real (.3); corr… 1 4 ## 2 "example but trust invest, over door head respon… 2 4 ## 3 "address cheap english company (.9)" 3 4 ## 4 "4/22/14 wage at soon‐lose commit television break care'try … 4 4 ## 5 "view strong involve" 5 4 ## 6 "4/27/14 charactershould between field double air look' cl… 6 4 ## 7 "4/28/14 produceguess busy collect tree kill double' deep‐… 7 4 ## 8 "4/29/14 ideacall wee person common succeed catch' system‐… 8 4 ## 9 "5/2/14available product hit worse stage' seem true can; h… 9 4 ## 10 "notice presume" 10 4 ## # … with 26 more rows |
As you can see, page 4 has 36 lines of text. All our data is in that
text
column. We’re beginning to see what it’s going to look like as a
table with proper columns.
Let’s do a little preliminary cleaning and reorganization before we try
splitting up text
into separate columns. Again we are using map()
because we have to do this to each page individually. Some of these
lines (e.g. removing section headers) deal with things that I know are
in the data but which I don’t need, because I looked at the PDF. At the
end of this process we bind all the list elements together by row, so
that they become a single big tibble.
1 2 3 4 5 6 7 |
eg <- eg |> map(~ .x |> relocate(page, line) |> # move page and line to the left filter(!str_detect(text, '^$')) |> # Remove any blank lines filter(!str_detect(text, "^\\d{1,2}\\. ")) # Remove section headers ) |> bind_rows() |> # Convert to single tibble tail(-2) # Strip very first two lines (from page 1), they're not needed |
Now we have a single tibble, rather than a list of tibbles.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
eg ## # A tibble: 3,158 × 3 ## page line text ##<int> <int> <chr> ## 1 1 4 "6/25/13 photographmust: germany allow direct accou… ## 2 1 5 "6/26/13 settleoppose paragraph wish what feed defi… ## 3 1 6 "6/27/13 strongmight new use course full afternoon … ## 4 1 7 "6/28/13 experiencegrow well improve need bet … ## 5 1 8 "7/1/13 middledanger bother okay bloke per… ## 6 1 9 "7/2/13 eggthis across also play toward (1.6); hun… ## 7 110 "7/2/13 morning radio pair stuff effect name friday … ## 8 111 " god south authority (.3)" ## 9 112 "7/3/13 carry instead yet private hospital responsib… ## 10 113 " bear once king fit link" ## # … with 3,148 more rows |
This means that if we want to see page 4 now, we filter on the page
column:
1 2 |
eg |> filter(page == 4) |
# A tibble: 33 × 3
page line text
##< int> < int> < chr>
1 4 1 “4/22/14 stickcontact for world insure fun real (.3); corr…
2 4 2 “example but trust invest, over door head respon…
3 4 3 “address cheap english company (.9)”
4 4 4 “4/22/14 wage at soon‐lose commit television break care’try …
5 4 5 “view strong involve”
6 4 6 “4/27/14 charactershould between field double air look’ cl…
7 4 7 “4/28/14 produceguess busy collect tree kill double’ deep‐…
8 4 8 “4/29/14 ideacall wee person common succeed catch’ system‐…
9 4 9 “5/2/14available product hit worse stage’ seem true can; h…
10 410 “notice presume”
# … with 23 more rows
At this point we can see that we are in for a little trouble. The shape
of text
suggests the data splits up into columns in this order: date,
timekeeper, description, hours, rate, fees, notes. (They may be out of
view here but the ends of the lines of text
have the numeric
corresponding to fees and so on.) Unfortunately, we can see that,
e.g. on lines 2 and 3 on page 4, that some rows are blank except for
text and also do not start with anything that looks like a date. What
has happened is that the “description” column in the original PDF
document often has descriptions that run to several lines rather than
just one. When that happens, either the original data entry person or
(more likely) the PDF-generating software has inserted newline so that
the full description will display on the page in its table cell. That’s
annoying, because the effect—after we have split our file on \n
characters—is that some records to have part of their “description”
content moved to a new line, or series of lines. If the description had
been the last column in the original PDF then rejoining these
description-fragments to their correct row would have been easier. But
because it’s in the middle of the table things are more difficult. The
number of added description-fragments varies irregularly, too, anywhere
from one to eight or nine additional lines.
As noted earlier, if this were the sort of problem where I knew I’d be encountering data in just this form often (a new PDF of a hundred-odd pages of billing data coming in every week, or something) then this would be the ideal spot to stop and ask, “How can I avoid getting myself into this situation in the first place?” I’d go back to the initial reading-in stage and try to see if there was something I could do to immediately distinguish the description-fragment lines from “proper” lines. I did think about it briefly, but no obvious (to me) solution immediately presented itself. So, instead, I’m just going to keep going and solve the problem as it stands.
As a first move towards fixing this problem, we can create a new column that flags whether a line of text begins with something that looks like a date. I know from the original records that every distinct billing entry does in fact begin with a date, so this will be helpful.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
eg <- eg |> mutate(has_date = str_detect(text, "^\\d{1,2}/\\d{1,2}/\\d{1,2}")) eg ## # A tibble: 3,158 × 4 ## page line text has_date ##<int> <int> <chr><lgl> ## 1 1 4 "6/25/13 photographmust: germany allow dir… TRUE ## 2 1 5 "6/26/13 settleoppose paragraph wish what … TRUE ## 3 1 6 "6/27/13 strongmight new use course full a… TRUE ## 4 1 7 "6/28/13 experiencegrow well improve need … TRUE ## 5 1 8 "7/1/13 middledanger bother okay bloke pe… TRUE ## 6 1 9 "7/2/13 eggthis across also play toward (… TRUE ## 7 110 "7/2/13 morning radio pair stuff effect nam… TRUE ## 8 111 " god south authority (.3)"FALSE ## 9 112 "7/3/13 carry instead yet private hospital … TRUE ## 10 113 " bear once king fit link" FALSE ## # … with 3,148 more rows |
Separating out the columns
So, with the foreknowledge that this is not going to work properly, we
trim the front and end of each line of text. Then we separate out all
the columns into what ought to be the correct series of variable names
(which we wrote down above as varnames
). We tell the separate()
function to split wherever it encounters more than two spaces in a row,
and to name the new columns. (I know, because I checked, that there
aren’t any sentences where a period is followed by two spaces.
Two-spacers are moral monsters, by the way.) If there’s any extra
material we haven’t seen it will get filled to the right. In this step
we also add an explicit row id, just to help us keep track of things.
1 2 3 4 5 6 |
eg <- eg |> mutate(text = str_trim(text)) |> separate(text, sep = "\\s{2,}", into = varnames, fill = "right", extra = "merge") |> rowid_to_column() |
Now what we have is finally starting to look more like a dataset:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
eg |> filter(page == 4) ## # A tibble: 33 × 11 ##rowid page line date timekeeper description hours rate fees notes ##<int> <int> <int> <chr><chr> <chr> <chr> <chr> <chr> <chr> ## 1 100 4 1 4/22/14 stick contact fo… 1.2 635 762.… <NA> ## 2 101 4 2 example but… <NA> <NA><NA> <NA> <NA> <NA> ## 3 102 4 3 address che… <NA> <NA><NA> <NA> <NA> <NA> ## 4 103 4 4 4/22/14 wage at soon‐lo… 0.4 749 299.… <NA> ## 5 104 4 5 view strong… <NA> <NA><NA> <NA> <NA> <NA> ## 6 105 4 6 4/27/14 character should bet… 1.3 635 825.… <NA> ## 7 106 4 7 4/28/14 produceguess busy… 5.4 635 3,42… <NA> ## 8 107 4 8 4/29/14 idea call wee p… 9.1 635 5,77… <NA> ## 9 108 4 9 5/2/14 available product hi… 0.4 749 299.… <NA> ## 10 109 410 notice pres… <NA> <NA><NA> <NA> <NA> <NA> ## # … with 23 more rows, and 1 more variable: has_date <lgl> |
You can see that, as expected, things did not go exactly as we would
have liked, thanks to the line fragments from the description
column.
Each time we hit one of those we get a new line where the description
content ends up in the date
column and all the other columns are
empty, hence the <NA>
missing value designation. We need to fix this.
Dealing with the description-fragments
Our goal is to get each description fragment re-attached to the
description
field in the correct row, in the right order. How to do
this? The answer is always “Well there’s more than one way to do it.”
But here’s what I did. I know that if I can reliably group lines with
an identifier that says “These lines are all really from the same
billing record” I can solve my problem. So the problem becomes how to do
that. To begin with, I know that every row that starts with a date—where
has_date
is TRUE
—is the first line of a valid record. Many records
have only one line. But many others are followed by some number of
description-fragments, which continue for however long as they do. Then
we move on to the next record. So we need to distinguish the boundaries.
This is what led me to mess around with run-length encoding. In the end,
I didn’t actually need it to solve the problem, but I am going to keep
it here anyway. First I’ll show the simpler way to get the group id we
need. I’ll pick out just a few columns to make things easier to read. To
get a groupid
, we first assign the rowid
to every row that has a
date:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
eg |> select(page, rowid, has_date, description) |> mutate(groupid = ifelse(has_date == TRUE, rowid, NA), .after = has_date) |> filter(page == 4) ## # A tibble: 33 × 5 ## page rowid has_date groupid description ##<int> <int> <lgl> <int> <chr> ## 1 4 100 TRUE 100 contact for world insure fun real (.3); correct… ## 2 4 101 FALSE NA <NA> ## 3 4 102 FALSE NA <NA> ## 4 4 103 TRUE 103 at soon‐lose commit television break care'try l… ## 5 4 104 FALSE NA <NA> ## 6 4 105 TRUE 105 should between field double air look' clean‐wee… ## 7 4 106 TRUE 106 guess busy collect tree kill double' deep‐intro… ## 8 4 107 TRUE 107 call wee person common succeed catch' system‐bi… ## 9 4 108 TRUE 108 product hit worse stage' seem true can; house b… ## 10 4 109 FALSE NA <NA> ## # … with 23 more rows |
Now, for every NA
value of groupid
we encounter, we use fill()
to
copy down the nearest previous valid groupid
. And those are our
groups.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
eg |> select(page, rowid, has_date, description) |> mutate(groupid = ifelse(has_date == TRUE, rowid, NA), .after = has_date) |> fill(groupid) |> filter(page == 4) ## # A tibble: 33 × 5 ## page rowid has_date groupid description ##<int> <int> <lgl> <int> <chr> ## 1 4 100 TRUE 100 contact for world insure fun real (.3); correct… ## 2 4 101 FALSE100 <NA> ## 3 4 102 FALSE100 <NA> ## 4 4 103 TRUE 103 at soon‐lose commit television break care'try l… ## 5 4 104 FALSE103 <NA> ## 6 4 105 TRUE 105 should between field double air look' clean‐wee… ## 7 4 106 TRUE 106 guess busy collect tree kill double' deep‐intro… ## 8 4 107 TRUE 107 call wee person common succeed catch' system‐bi… ## 9 4 108 TRUE 108 product hit worse stage' seem true can; house b… ## 10 4 109 FALSE108 <NA> ## # … with 23 more rows |
That’s all we need to proceed. But because I experimented with it first, here are two other variables that might be useful under other circumstances. The first is the run-length sequence, written in a way where every row that’s the start of a record has a value of zero, and every description-fragment has a counter starting from one. The second, calculated from that, is a variable that flags whether a line is a single-line record or part of a multi-line record. We get this by flagging it as “Multi” if either the current row’s run-length counter is greater than zero or the run-length counter of the row below it is greater than zero.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
eg |> mutate(groupid = ifelse(has_date == TRUE, rowid, NA), rlecount = sequence(rle(!has_date)$lengths) * !has_date, one_or_multi = if_else(rlecount > 0 | lead(rlecount > 0), "Multi", "One")) |> fill(groupid) |> relocate(has_date, one_or_multi, rlecount, groupid, .after = line) |> filter(page == 4) ## # A tibble: 33 × 14 ##rowid page line has_date one_or_multi rlecount groupid date timekeeper ##<int> <int> <int> <lgl><chr> <int> <int> <chr> <chr> ## 1 100 4 1 TRUE Multi 0 100 4/22/14 stick ## 2 101 4 2 FALSEMulti 1 100 example … <NA> ## 3 102 4 3 FALSEMulti 2 100 address … <NA> ## 4 103 4 4 TRUE Multi 0 103 4/22/14 wage ## 5 104 4 5 FALSEMulti 1 103 view str… <NA> ## 6 105 4 6 TRUE One 0 105 4/27/14 character ## 7 106 4 7 TRUE One 0 106 4/28/14 produce ## 8 107 4 8 TRUE One 0 107 4/29/14 idea ## 9 108 4 9 TRUE Multi 0 108 5/2/14available ## 10 109 410 FALSEMulti 1 108 notice p… <NA> ## # … with 23 more rows, and 5 more variables: description <chr>, hours <chr>, ## # rate <chr>, fees <chr>, notes <chr> |
Let’s put all that into an object called eg_gid
for the whole data
set.
1 2 3 4 5 6 7 8 |
eg_gid <- eg |> mutate(groupid = ifelse(has_date == TRUE, rowid, NA), rlecount = sequence(rle(!has_date)$lengths) * !has_date, one_or_multi = if_else(rlecount > 0 | lead(rlecount > 0), "Multi", "One")) |> fill(groupid) |> relocate(has_date, one_or_multi, rlecount, groupid, .after = line) |
Next, we extract all the “true” first rows. We can do this by filtering
on has_date
…
1 2 |
eg_first <- eg_gid |> filter(has_date == TRUE) |
… or, for the same effect, group by groupid
and slice out the first
row of each group:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
eg_first <- eg_gid |> group_by(groupid) |> slice_head(n = 1) eg_first |> filter(page == 4) ## # A tibble: 19 × 14 ## # Groups: groupid [19] ##rowid page line has_date one_or_multi rlecount groupid datetimekeeper ##<int> <int> <int> <lgl><chr> <int> <int> <chr> <chr> ## 1 100 4 1 TRUE Multi 0 100 4/22/14 stick ## 2 103 4 4 TRUE Multi 0 103 4/22/14 wage ## 3 105 4 6 TRUE One 0 105 4/27/14 character ## 4 106 4 7 TRUE One 0 106 4/28/14 produce ## 5 107 4 8 TRUE One 0 107 4/29/14 idea ## 6 108 4 9 TRUE Multi 0 108 5/2/14 available ## 7 110 411 TRUE One 0 110 5/9/14 nation ## 8 111 412 TRUE Multi 0 111 5/15/14 individual ## 9 113 414 TRUE One 0 113 5/15/14 word ## 10 114 416 TRUE Multi 0 114 5/20/14 cup ## 11 116 418 TRUE One 0 116 5/23/14 pound ## 12 117 419 TRUE One 0 117 5/23/14 question ## 13 118 420 TRUE Multi 0 118 6/9/14 issue ## 14 121 424 TRUE One 0 121 6/10/14 service ## 15 122 425 TRUE One 0 122 6/11/14 blue ## 16 123 426 TRUE Multi 0 123 6/16/14 amount ## 17 129 433 TRUE One 0 129 6/17/14 class ## 18 130 434 TRUE Multi 0 130 6/22/14 worth ## 19 132 436 TRUE One 0 132 6/23/14 create ## # … with 5 more variables: description <chr>, hours <chr>, rate <chr>, ## # fees <chr>, notes <chr> |
Bear in mind that eg_first
contains the first rows of all the
records in the data. It includes both the records that are just one row
in length and the first line only of any record that also has
description-fragments on subsequent lines.
With this table in hand, we can take advantage of dplyr
’s core
competence of summarizing tables. We start with all the data, but this
time we use filter
to get all those rows where rlecount
is not zero.
Then we group those rows by groupid
and summarize their date
field
(which is where all the decription-fragments are, remember). How do we
summarize this text? In this case, by creating a new column called
extra_text
and pasting all the date
text from any rows within the
group into a single string:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
eg_gid |> filter(rlecount != 0) |> group_by(groupid) |> summarize(extra_text = paste0(date, collapse = "")) ## # A tibble: 891 × 2 ##groupid extra_text ## <int> <chr> ## 1 7 god south authority (.3) ## 2 9 bear once king fit link ## 3 20 visit base club ## 4 22 (.3); beat occasion place send (.3) ## 5 24 month seven ring. apparent father raise ## 6 28 write show stuff shall; identify issue account jump please; figureim… ## 7 31 each major plan welcome (.7); situate scotland region real trust ove… ## 8 35 strike two up; god morning even amount require extra; telephone ## 9 44 play; nice tax situate stop rule chairman sing floor call arrange; t… ## 10 46 lady close plus; sudden health music collect gas (1.1); sign departm… ## # … with 881 more rows |
Now we now we can get all the description-fragments down to one-row per
record, and have them identified by their groupid
. This means we can
join them to eg_first
, the table of all record first-rows. So we go
back to eg_gid
, do our summarizing, then right-join this result to the
true first row table, eg_first
, joining by groupid
. Then we make a
new column called full_description
and paste the description
and
extra_text
fields together.
1 2 3 4 5 6 7 8 |
eg_clean <- eg_gid |> filter(rlecount != 0) |> group_by(groupid) |> summarize(extra_text = paste0(date, collapse = "")) |> right_join(eg_first, by = "groupid") |> mutate(full_description = paste(description, extra_text)) |> select(rowid, groupid, page:timekeeper, hours:fees, full_description) |
We’re nearly finished. How is page 4 doing?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
eg_clean |> select(page, line, groupid, one_or_multi, full_description) |> filter(page == 4) ## # A tibble: 19 × 5 ## page line groupid one_or_multi full_description ##<int> <int> <int> <chr><chr> ## 1 4 1 100 Multicontact for world insure fun real (.3); cor… ## 2 4 4 103 Multiat soon‐lose commit television break care't… ## 3 4 9 108 Multiproduct hit worse stage' seem true can; hou… ## 4 412 111 Multinever not jump refer document monday consid… ## 5 416 114 Multiworld right like top (1); clean total famil… ## 6 420 118 Multibed system now discuss converse'age though … ## 7 426 123 Multito hell list simple life leg encourage; as … ## 8 434 130 Multieducate air quick save, health six slow hos… ## 9 4 6 105 One should between field double air look' clean… ## 10 4 7 106 One guess busy collect tree kill double' deep‐i… ## 11 4 8 107 One call wee person common succeed catch' syste… ## 12 411 110 One seat east operate farm common next NA ## 13 414 113 One little such tell paper knock lie town hundr… ## 14 418 116 One programme rule stupid NA ## 15 419 117 One vote plus hospital what suit. top NA ## 16 424 121 One although front view pair occasion NA ## 17 425 122 One report day town too garden line call wear a… ## 18 433 129 One might bottom work afternoon yes hope NA ## 19 436 132 One scotland next amount love favour NA |
Looking good. You can see that all the Multi
rows are in one bloc and
followed by all the One
rows, after the join. Now all that remains is
for us to clean up the remaining columns. For example, some entries in
the rate
and fee
columns have additional notes after their numbers,
so we separate
those out on the space character. We strip unnecessary
commas from the numeric columns. And then we fix their types, turning
the rate
and fee
columns from character to numeric and making the
date
column a proper date.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
eg_clean <- eg_clean |> separate(fees, sep = "\\s{1}", into = c("fees", "fee_note"), extra = "merge", fill = "right") |> separate(rate, sep = "\\s{1}", into = c("rate", "rate_note"), extra = "merge", fill = "right") |> mutate(fees = str_remove_all(fees, ","), # strip commas from numbers rate = str_remove_all(rate, ","), fees = as.numeric(fees), rate = as.numeric(rate), hours = as.numeric(hours), date = lubridate::mdy(date), full_description = str_remove(full_description, " NA$")) |> arrange(rowid) |
And we’re done:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
eg_clean |> filter(page == 4) |> select(date, hours, rate, full_description) ## # A tibble: 19 × 4 ##date hours rate full_description ##<date> <dbl> <dbl> <chr> ## 1 2014-04-22 1.2 635 contact for world insure fun real (.3); correct shut‐… ## 2 2014-04-22 0.4 749 at soon‐lose commit television break care'try lay (.2… ## 3 2014-04-27 1.3 635 should between field double air look' clean‐week occa… ## 4 2014-04-28 5.4 635 guess busy collect tree kill double' deep‐introduce l… ## 5 2014-04-29 9.1 635 call wee person common succeed catch' system‐big house ## 6 2014-05-02 0.4 749 product hit worse stage' seem true can; house begin c… ## 7 2014-05-09 0.1 635 seat east operate farm common next ## 8 2014-05-15 0.4 635 never not jump refer document monday consider (.5); n… ## 9 2014-05-15 0.2 749 little such tell paper knock lie town hundred with ty… ## 10 2014-05-20 3.1 749 world right like top (1); clean total family (1.5); h… ## 11 2014-05-23 0.1 635 programme rule stupid ## 12 2014-05-23 0.1 749 vote plus hospital what suit. top ## 13 2014-06-09 2.7 749 bed system now discuss converse'age though eight roun… ## 14 2014-06-10 0.3 635 although front view pair occasion ## 15 2014-06-11 0.2 749 report day town too garden line call wear attend enou… ## 16 2014-06-16 4.2 749 to hell list simple life leg encourage; as contract c… ## 17 2014-06-17 0.6 749 might bottom work afternoon yes hope ## 18 2014-06-22 3.2 749 educate air quick save, health six slow hospital shar… ## 19 2014-06-23 2.9 635 scotland next amount love favour |
A clean and tidy dataset ready to be investigated properly. Unhappy in its own way, but now at least in a form where we can ask it some questions.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.