
Original Text

[This article was first published on Matias Andina, and kindly contributed to R-bloggers].

Lately I have been noticing a glaring trend in some of the non-fiction books that I read: the use and abuse of verbatim quotes. They come in the shape of:

“As such and such said: INSERT LONG VERBATIM TEXT HERE”

Of course, there are no rules regarding the use of verbatim text[1]. But if I can get a sense of overuse just from reading the book, it makes me curious enough to go look at the data.

How much of the book is actually a verbatim text dump? Would I bet it is 10%? Maybe 20%? Would lower percentages make me go easier on the author, or is this a lost cause (i.e., if I notice the overuse by reading, all hope is lost)?

< section id="example-book" class="level2">

Example Book

Enough chatter. Let’s try to answer this by analyzing one of the books in question: “Do Nothing” by Celeste Headlee. Reading the book into R with the epubr package gives us this table:

< details> < summary>Code
library(tidyverse)  # dplyr, stringr, purrr, forcats and ggplot2 are used throughout

# read the epub
book_text <- epubr::epub("Do Nothing - Celeste Headlee.epub")
book_text$data[[1]] %>% 
  # truncate the text strings so the table stays readable
  mutate(text = str_sub(text, 1, 20),
         text = paste(text, "...")) %>% 
  gt::gt()
section         text                     nword  nchar
titlepage                                    0      0
part0001.xhtml                               0      0
part0002.xhtml                               0      0
part0003.xhtml  Copyright © 2020 by …      111   1014
part0004.xhtml  CONTENTSCoverTitle P …      93    685
part0005.xhtml  INTRODUCTIONIt will …     3242  18783
part0006.xhtml  PART IThe Cult of Ef …       5     28
part0007.xhtml  Chapter 1MIND THE GA …    3270  18268
part0008.xhtml  Chapter 2IT STARTS W …    4880  28878
part0009.xhtml  Chapter 3WORK ETHICI …    3812  22815
part0010.xhtml  Chapter 4TIME BECOME …    8804  52065
part0011.xhtml  Chapter 5WORK COMES …     4271  25298
part0012.xhtml  Chapter 6THE BUSIEST …    4773  28006
part0013.xhtml  Chapter 7DO WE LIVE …     5022  28949
part0014.xhtml  Chapter 8UNIVERSAL H …    5930  35332
part0015.xhtml  Chapter 9IS TECH TO …     6036  35402
part0016.xhtml  PART IILeaving the C …      12     61
part0017.xhtml  Life-Back OneCHALLEN …    2345  13558
part0018.xhtml  Life-Back TwoTAKE TH …    2570  15330
part0019.xhtml  Life-Back ThreeSTEP …     3933  22790
part0020.xhtml  Life-Back FourINVEST …    1925  10856
part0021.xhtml  Life-Back FiveMAKE R …    2857  16804
part0022.xhtml  Life-Back SixTAKE TH …    2161  12337
part0023.xhtml  CONCLUSIONWe have ch …    2084  12378
part0024.xhtml  For Theresa, who has …      14     72
part0025.xhtml  ACKNOWLEDGMENTSI WOR …     355   1995
part0026.xhtml  NOTESIntroduction”Ou …    4067  28683
part0027.xhtml  ABOUT THE AUTHORCELE …     318   1971
part0028.xhtml  What’s next onyour r …      19    139

We can get rid of the legal stuff that normally goes before the text and everything that comes after the content (i.e., acknowledgements and references).

< details> < summary>Code
# A simple slice operation would do
book_txt <- book_text$data[[1]] %>% 
  slice(6:24)

We can also get some metadata from the text (it will come in handy later).

< details> < summary>Code
# keep the word and character counts per part
# (stored as `meta`, which the later code blocks refer to)
meta <- book_txt %>% 
  select(section, nword, nchar) %>% 
  mutate(part = paste0("part", 5:23)) %>% 
  select(-section)

Now that we have the text, we can find all the instances of "something in between these quotes here" using stringr::str_locate_all():

< details> < summary>Code
# extract the text
match_df <- stringr::str_locate_all(book_txt$text, '"(.*?)"') %>% 
  # give names for future binding
  # parts go from 5 to 23 (idx goes 6:24)
  set_names(nm = paste0("part", 5:23)) %>% 
  # convert into tibble for easy binding
  map(as_tibble) %>% 
  bind_rows(.id =  "part")
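
To make the output of str_locate_all() concrete, here is a minimal sketch on a made-up input. The `toy` vector is an assumption for illustration, not text from the book. It shows that the function returns one start/end matrix per input string, with inclusive character positions.

```r
library(stringr)

# Hypothetical toy input, not taken from the book
toy <- c('He said "hello" and then "goodbye".', "No quotes here.")

# The lazy quantifier (.*?) stops at the nearest closing quote,
# so each quoted span is matched separately
loc <- str_locate_all(toy, '"(.*?)"')

# One integer matrix per input string, with columns `start` and `end`
first_matches <- loc[[1]]

# Recover the matched text from the inclusive positions
quotes <- str_sub(toy[1], first_matches[, "start"], first_matches[, "end"])
```

The second element of `loc` is a zero-row matrix, so a part without any quotes simply contributes no rows after binding. One caveat: by default `.` does not match line breaks, so a quote that spans a newline would be missed with this pattern.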

Below, I’m showing a slice with an example of matched character positions and what they look like in the text. I want to direct your attention to the second and third rows. I hope you notice that these two quotes are, in fact, one single quote that was split in two.

< details> < summary>Code
# This is an example
match_df %>% 
  slice(8:10) %>% 
  mutate(quote = map2_chr(
    start, end,  
    function(.x, .y) str_sub(book_txt$text[[1]], .x, .y)
  )) %>% 
  gt::gt() %>%
  gt::tab_style(
    style = gt::cell_text(weight = "bold"),
    locations = gt::cells_column_labels()
        )
part start end quote
part5 14893 14905 “inefficient”
part5 16002 16132 “I can hunch over my computer screen for half the day churning frenetically through emails without getting much of substance done,”
part5 16186 16336 “all the while telling myself what a loser I am, and leave at 6:00 p.m. feeling like I put in a full day. And given my level of mental fatigue, I did!”
< section id="merging-quotes" class="level3">

Merging Quotes

The issue of quotes being split arises not because of a bug in the code, but because of how the author writes. She would do something like:

“A palm tree”, somebody said, “belongs to the Plant Kingdom.”

These stylistic choices will modify the statistics for the direct quotes (e.g., the average length of a quote will be much lower than if these quotes were kept whole). I decided to merge quotes that are too close to each other (I will try a threshold of 100 characters[2]). This will slightly inflate my percentage counts, since I’m attributing characters that are not direct quotes to actual quotes. Thus, when I calculate percentages, I will do so without merging (see the section “Percentages with no merges”).

There’s one neat trick using lag and cumsum with a condition to achieve conditional grouping. We can see that rows 9 and 10 are marked as belonging to the same group now 🎉.

< details> < summary>Code
threshold <- 100  # Define your threshold

merged_quotes <- match_df %>%
  mutate(
    .by = part, 
    prev_end = lag(end),
    distance = start - prev_end,
    merge_group = cumsum(ifelse(is.na(distance) | distance > threshold, 1, 0))
  ) 

# inspect the first rows
head(merged_quotes, n = 10)
# A tibble: 10 × 6
   part  start   end prev_end distance merge_group
   <chr> <int> <int>    <int>    <int>       <dbl>
 1 part5   574   597       NA       NA           1
 2 part5  1342  1361      597      745           2
 3 part5  1876  1904     1361      515           3
 4 part5  6036  6051     1904     4132           4
 5 part5  8751  8944     6051     2700           5
 6 part5  9276  9373     8944      332           6
 7 part5 13258 13265     9373     3885           7
 8 part5 14893 14905    13265     1628           8
 9 part5 16002 16132    14905     1097           9
10 part5 16186 16336    16132       54           9
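
The lag/cumsum trick can be easier to follow on a tiny made-up data set. The sketch below uses hypothetical spans and a hypothetical `toy_threshold` (none of these values come from the book): a row starts a new group when it has no previous row or when the gap to the previous span exceeds the threshold, and cumsum() turns those flags into running group ids. It assumes dplyr 1.1.0 or later for the `.by` argument.

```r
library(dplyr)

# Hypothetical spans: two "parts", positions in characters
toy <- tibble(
  part  = c("a", "a", "a", "b", "b"),
  start = c(10, 100, 130, 5, 400),
  end   = c(20, 120, 150, 50, 420)
)
toy_threshold <- 25

toy_groups <- toy %>%
  mutate(
    .by = part,
    # distance from the end of the previous span (NA for the first row of a part)
    gap = start - lag(end),
    # TRUE marks the first row of a new group
    new_group = is.na(gap) | gap > toy_threshold,
    # running count of TRUE flags gives the group id, restarting within each part
    merge_group = cumsum(new_group)
  )
```

In part "a" the third span starts only 10 characters after the second one ends, so both share a group id, while every other span gets its own.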

This intermediate step also gives us the answer to a new question:

What is the average distance between quotes?

The answer is x̄ = 740 characters (sd = 970). At roughly 5.8 characters per word (the nchar-to-nword ratio in the table above), a new quote starts, on average, after about 130 words of original content. Is that a lot? Is that too little?
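
As a sketch of how such a summary can be computed, the snippet below uses only the ten distances printed in the head() output above, so its numbers will differ from the whole-book figures. The characters-per-word conversion is my assumption, using the introduction's counts from the first table.

```r
# Distances (in characters) copied from the ten rows printed above;
# the first row has no previous quote, hence NA
distance <- c(NA, 745, 515, 4132, 2700, 332, 3885, 1628, 1097, 54)

# na.rm = TRUE drops the first quote of each part, which has no predecessor
mean_dist <- mean(distance, na.rm = TRUE)
sd_dist   <- sd(distance, na.rm = TRUE)

# Rough conversion to words, assuming the introduction's
# characters-per-word ratio (nchar / nword from the first table)
chars_per_word  <- 18783 / 3242
mean_dist_words <- mean_dist / chars_per_word
```

On the full data, the same two calls applied to `merged_quotes$distance` would give the book-wide mean and standard deviation.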

To be honest, it feels true to the reading experience. My impression was that the author was using verbatim quotes with high frequency, and the data seem to align with that. But don’t take my word for it; let’s try to visualize it.

We are two steps away from the viz.

  1. Do the actual merge
  2. Add the end of each chapter

We can do Step 1 using the code below:

< details> < summary>Code
merged_quotes <- merged_quotes %>%
  # `part` is already a grouping column via .by, so it is kept automatically
  summarize(
    .by = c(merge_group, part),
    start = first(start),
    end = last(end)
  ) %>% 
  # add the lag again to see where the original text starts
  mutate(text_start = lag(end, default = 0), .by = part)
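
To see what the merge does, here is a small sketch on the three example rows shown earlier (rows 8 to 10 of the match table, with the group ids from the head() output). The object names `rows` and `collapsed` are mine, introduced only for this illustration.

```r
library(dplyr)

# Rows 8-10 from the tables above: the last two share merge_group 9
rows <- tibble(
  part        = "part5",
  start       = c(14893L, 16002L, 16186L),
  end         = c(14905L, 16132L, 16336L),
  merge_group = c(8, 9, 9)
)

collapsed <- rows %>%
  # one row per group: earliest start, latest end
  summarise(
    .by   = c(part, merge_group),
    start = first(start),
    end   = last(end)
  ) %>%
  # original text runs from the end of the previous quote (or 0) to `start`
  mutate(.by = part, text_start = lag(end, default = 0L))
```

The split quote becomes a single span from 16002 to 16336, which now also covers the 54 characters of attribution text between the two halves.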

Right now, we have the start of the original text in text_start and the start and end of each verbatim quote. We need to use the metadata stored in meta to add the end of the original content for each chapter. This only matters for the very last segment that we are going to plot, so I will make a new data set that contains those values instead of merging everything together. To visualize it, I’m going to use a package I developed called ggethos. You can check it out here or adapt the code to work with geom_segment().

< details> < summary>Code
library(ggethos)  # provides geom_ethogram(), used in the plot below

# pad parts for plotting
format_part <- function(part_name) {
  # Extract the numeric part
  part_number <- as.integer(str_extract(part_name, "\\d+"))

  # Pad the number with zeros and prepend 'part'
  formatted_part <- str_c("part", str_pad(part_number, width = 2, pad = "0"))
  
  return(formatted_part)
}

# make tail end segments
tail_data <- merged_quotes %>% 
  summarise(.by = part, last_quote_end = max(end)) %>% 
  left_join(meta, by='part') %>% 
  # fix the padding after merging
  mutate(part = format_part(part))

# fix the padding here too
merged_quotes <- merged_quotes  %>% mutate(part = format_part(part))


ggplot(data=merged_quotes) + 
  geom_ethogram(aes(x=text_start, xend=start, y = part), color ="gray30") +
  geom_ethogram(data=tail_data, aes(x=last_quote_end, 
                                    xend=nchar, y = part), color ="gray30") +
  geom_ethogram(aes(x=start, xend=end, y = part), color = "red")+
  cowplot::theme_nothing() +
  labs(title = "'Do Nothing' is Peppered with Quotes",
       subtitle = "<span style = 'color:gray30'>Original Text</span> and <span style = 'color:red'>Verbatim quotes</span>",
       caption = "Viz: Matias Andina",
       y = "Chapter") +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = ggtext::element_markdown(hjust = 0.5),
    plot.background = element_rect(fill = "black"),
    text = element_text(color = 'gray80'),
    axis.title.y = element_text(angle = 90),
    plot.caption = element_text(size = 8, hjust = .95))
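
For readers without ggethos, the same kind of picture can be approximated with plain ggplot2. This is a minimal sketch of the geom_segment() adaptation mentioned above, on made-up segments (the `original` and `quoted` data frames are hypothetical). It is not a reimplementation of geom_ethogram(), only horizontal segments drawn per chapter.

```r
library(ggplot2)

# Hypothetical segments for two chapters (positions in characters)
original <- data.frame(
  part = c("part05", "part05", "part07"),
  x    = c(0, 600, 0),
  xend = c(574, 1342, 300)
)
quoted <- data.frame(
  part = c("part05", "part07"),
  x    = c(574, 300),
  xend = c(597, 420)
)

p <- ggplot() +
  # original text in gray: one horizontal segment per stretch
  geom_segment(data = original,
               aes(x = x, xend = xend, y = part, yend = part),
               color = "gray30", linewidth = 3) +
  # verbatim quotes in red, drawn on the same rows
  geom_segment(data = quoted,
               aes(x = x, xend = xend, y = part, yend = part),
               color = "red", linewidth = 3) +
  labs(x = "Character position", y = "Chapter")
```

Printing `p` draws the plot. With the real data, the three layers would use text_start to start, last_quote_end to nchar, and start to end, as in the code above.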

I believe this plot conveys a good mental image of what reading the book feels like in terms of verbatim text usage.

< section id="percentages-with-no-merges" class="level3">

Percentages with no merges

As mentioned at the beginning of the article, I was curious about how much verbatim text there was. Again, using the number of characters in each chapter stored in the meta object, we can easily calculate the percentage of all characters that are directly quoted:

< details> < summary>Code
match_df %>% 
  # positions from str_locate_all() are inclusive, hence the + 1
  mutate(quote_chars = end - start + 1) %>% 
  summarise(.by = part, 
            quote_chars = sum(quote_chars)) %>% 
  left_join(meta, by = "part") %>% 
  mutate(quote_frac = quote_chars / nchar,
         part = fct_reorder(part, quote_frac)) %>% 
  ggplot() +
  geom_hline(aes(yintercept = mean(quote_frac)), lty = 4) +
  geom_point(aes(as.numeric(part), quote_frac), 
             size = 4, alpha = 0.9, color = "darkred") +
  geom_label(aes(x = 15.5, y = 0.17,
                 label = paste(part[which.max(quote_frac)],
                               scales::percent(max(quote_frac)),
                               sep="\n"
                 ))) +
  scale_y_continuous(labels = scales::label_percent(),
                     expand = expansion(add = c(0.01, 0.05)))+
  labs(y = "Verbatim Quotes",
       x = "Book Part\n(ascending quote % order)",
       title = "'Do Nothing' contains ~10% verbatim quoted text",
       subtitle = "Some parts are as high as 17%!")+
  cowplot::theme_minimal_hgrid()
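
The arithmetic behind the plot can be checked by hand on the three example quotes shown earlier. The sketch below uses a one-row stand-in for meta (`meta_toy`, a name I made up) with the introduction's character count from the first table. Since str_locate_all() reports inclusive positions, a span's length is end - start + 1. That differs from end - start by one character per quote, which barely moves the percentages.

```r
library(dplyr)

# The three example quotes from part5 shown earlier
quotes <- tibble(
  part  = "part5",
  start = c(14893L, 16002L, 16186L),
  end   = c(14905L, 16132L, 16336L)
)

# Hypothetical one-row stand-in for `meta` (nchar of the introduction)
meta_toy <- tibble(part = "part5", nchar = 18783L)

quote_share <- quotes %>%
  # inclusive positions: “inefficient” spans 13 characters including both quote marks
  mutate(quote_chars = end - start + 1L) %>%
  summarise(.by = part, quote_chars = sum(quote_chars)) %>%
  left_join(meta_toy, by = "part") %>%
  mutate(quote_frac = quote_chars / nchar)
```

These three quotes alone add up to 295 characters; the real figure for the part is higher because it sums over every match in that part.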

< section id="a-silver-lining" class="level2">

A silver lining

Most non-fiction books are a regurgitation of something somebody else said a long time ago (there’s nothing new under the sun). In a sense, then, it’s more truthful for an author to quote verbatim from the original source than to paraphrase whatever they took out of it and hide the initial message under a footnote[3].

< section id="footnotes" class="footnotes footnotes-end-of-document">

Footnotes

  1. But I’m sure a copyright lawyer would know much more than I do regarding how much verbatim text you can include and still claim ownership of your work.↩︎

  2. Of course, this threshold is arbitrary. How did I come up with it? I asked ChatGPT to come up with 10 interjections that were a bit longer than “they said”, and the phrases were sitting comfortably around 50 characters. I doubled it to be super sure that we were not missing instances.↩︎

  3. This paragraph was indeed a paraphrase of my editor’s (read: wife’s) reaction to my article. Talking to her is a great exercise in positive reframing.↩︎

< section class="quarto-appendix-contents">

Reuse

https://creativecommons.org/licenses/by/4.0/
< section class="quarto-appendix-contents">

Citation

BibTeX citation:
@online{andina2023,
  author = {Andina, Matias},
  title = {Original {Text}},
  date = {2023-11-07},
  url = {https://matiasandina.com/posts/2023-11-07-original-text},
  langid = {en}
}
For attribution, please cite this work as:
Andina, Matias. 2023. “Original Text.” November 7, 2023. https://matiasandina.com/posts/2023-11-07-original-text.
