Original Text

Posted on November 6, 2023 by Matias Andina in R bloggers | 0 Comments

[This article was first published on Matias Andina, and kindly contributed to R-bloggers].

Lately, I have been noticing a glaring trend in some of the non-fiction books that I read: the use and abuse of verbatim quotes. They come in the shape of:

“As such and such said: INSERT LONG VERBATIM TEXT HERE”

Of course, there are no rules regarding the use of verbatim text¹. But if I can get a sense of overuse just from reading the book, it makes me curious enough to go look at the data.

How much of the book is actually a verbatim text dump? Would I bet it is 10%? Maybe 20%? Would lower percentages make me go easier on the author, or is this a lost cause (i.e., if I notice the overuse by reading, all hope is lost)?

Example Book

Enough chatter. Let’s try to answer this by analyzing one of the books in question: “Do Nothing” by Celeste Headlee. Reading the book in R using the epubr package gives us this table:

Code

# load the tidyverse (dplyr verbs, stringr helpers, %>%)
library(tidyverse)

# read the epub
book_text <- epubr::epub("Do Nothing - Celeste Headlee.epub")
book_text$data[[1]] %>% 
  # truncate the text strings so the table stays readable
  mutate(text = str_sub(text, 1, 20),
         text = paste(text, "...")) %>% 
  gt::gt()

section	text	nword	nchar
titlepage	…	0	0
part0001.xhtml	…	0	0
part0002.xhtml	…	0	0
part0003.xhtml	Copyright © 2020 by …	111	1014
part0004.xhtml	CONTENTSCoverTitle P …	93	685
part0005.xhtml	INTRODUCTIONIt will …	3242	18783
part0006.xhtml	PART IThe Cult of Ef …	5	28
part0007.xhtml	Chapter 1MIND THE GA …	3270	18268
part0008.xhtml	Chapter 2IT STARTS W …	4880	28878
part0009.xhtml	Chapter 3WORK ETHICI …	3812	22815
part0010.xhtml	Chapter 4TIME BECOME …	8804	52065
part0011.xhtml	Chapter 5WORK COMES …	4271	25298
part0012.xhtml	Chapter 6THE BUSIEST …	4773	28006
part0013.xhtml	Chapter 7DO WE LIVE …	5022	28949
part0014.xhtml	Chapter 8UNIVERSAL H …	5930	35332
part0015.xhtml	Chapter 9IS TECH TO …	6036	35402
part0016.xhtml	PART IILeaving the C …	12	61
part0017.xhtml	Life-Back OneCHALLEN …	2345	13558
part0018.xhtml	Life-Back TwoTAKE TH …	2570	15330
part0019.xhtml	Life-Back ThreeSTEP …	3933	22790
part0020.xhtml	Life-Back FourINVEST …	1925	10856
part0021.xhtml	Life-Back FiveMAKE R …	2857	16804
part0022.xhtml	Life-Back SixTAKE TH …	2161	12337
part0023.xhtml	CONCLUSIONWe have ch …	2084	12378
part0024.xhtml	For Theresa, who has …	14	72
part0025.xhtml	ACKNOWLEDGMENTSI WOR …	355	1995
part0026.xhtml	NOTESIntroduction”Ou …	4067	28683
part0027.xhtml	ABOUT THE AUTHORCELE …	318	1971
part0028.xhtml	What’s next onyour r …	19	139

We can get rid of the legal stuff that normally goes before the text and everything that comes after the content (i.e., acknowledgements and references).

Code

# A simple slice operation would do
book_txt <- book_text$data[[1]] %>% 
  slice(6:24)

We can also get some metadata from the text (it will come in handy later).

Code

# bind previous word and character counts
meta <- book_txt %>% 
  select(section, nword, nchar) %>% 
  mutate(part = paste0("part", 5:23)) %>% 
  select(-section)

Now that we have the text, we can find all the instances of "something in between these quotes here" using stringr::str_locate_all():

Code

# locate the start and end position of every quoted span
match_df <- stringr::str_locate_all(book_txt$text, '"(.*?)"') %>% 
  # give names for future binding
  # parts go from 5 to 23 (idx goes 6:24)
  set_names(nm = paste0("part", 5:23)) %>% 
  # convert into tibble for easy binding
  map(as_tibble) %>% 
  bind_rows(.id =  "part")
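
To see what str_locate_all() hands back, here is a minimal sketch on a made-up sentence (the sentence and the `toy` name are illustrative and do not come from the book). The positions are 1-based and inclusive on both ends, so the length of a match is end - start + 1.

```r
library(stringr)

# Made-up sentence with two quoted spans (straight double quotes)
toy <- 'She said "hello" and then "goodbye" before leaving.'

# One matrix per input string, with columns `start` and `end`
loc <- str_locate_all(toy, '"(.*?)"')[[1]]
loc

# Both ends are inclusive, so the matched length is end - start + 1
quote_len <- loc[, "end"] - loc[, "start"] + 1
quote_len

# Recover the matched text (quote marks included) from the positions
quotes <- str_sub(toy, loc[, "start"], loc[, "end"])
quotes
```

Note that the matched spans include the quotation marks themselves, so any character count built on these positions counts them too.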

Below, I’m showing a slice with an example of matched character positions and what they look like in the text. I want to direct your attention to the second and third rows. I hope you notice that these two quotes are, in fact, one single quote that was split into two.

Code

# This is an example
match_df %>% 
  slice(8:10) %>% 
  mutate(quote = map2_chr(
    start, end,  
    # rows 8:10 all belong to part5, the first element of book_txt$text
    function(.x, .y) str_sub(book_txt$text[[1]], .x, .y)
  )) %>% 
  gt::gt() %>%
  gt::tab_style(
    style = gt::cell_text(weight = "bold"),
    locations = gt::cells_column_labels()
  )

part	start	end	quote
part5	14893	14905	“inefficient”
part5	16002	16132	“I can hunch over my computer screen for half the day churning frenetically through emails without getting much of substance done,”
part5	16186	16336	“all the while telling myself what a loser I am, and leave at 6:00 p.m. feeling like I put in a full day. And given my level of mental fatigue, I did!”

Merging Quotes

The issue of quotes being split arises not because of a bug in the code, but because the author writes this way. She would do something like:

“A palm tree”, somebody said, “belongs to the Plant Kingdom.”

These stylistic choices will modify the statistics for the direct quotes (e.g., the average length of a quote will be much lower than if these quotes were kept verbatim). I decided to merge quotes if they are too close to each other (I will try 100 characters²). This will slightly inflate my % counts, since I’m attributing characters that are not direct quotes to actual quotes. Thus, when I calculate percentages, I will do so without merging (see the section “Percentages with no merges”).
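
As a rough check of how far apart the two halves of a split quote sit, here is a small sketch that runs the same pattern on the palm tree sentence. It assumes the sentence is typed with straight double quotes, and the names `sentence` and `gap` are just for illustration.

```r
library(stringr)

# The palm tree sentence from above, typed with straight quotes
sentence <- '"A palm tree", somebody said, "belongs to the Plant Kingdom."'

loc <- str_locate_all(sentence, '"(.*?)"')[[1]]

# Gap measured the way this post measures it:
# start of the second match minus end of the first match
gap <- loc[2, "start"] - loc[1, "end"]
gap

# TRUE means the two halves fall within the 100-character threshold
gap <= 100
```

A short interjection like this one leaves a gap far below 100, so both halves would end up in the same merged quote.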

There’s one neat trick using lag and cumsum with a condition to achieve conditional grouping. We can see that rows 9 and 10 are marked as belonging to the same group now 🎉.

Code

threshold <- 100  # Define your threshold

merged_quotes <- match_df %>%
  mutate(
    .by = part, 
    prev_end = lag(end),
    distance = start - prev_end,
    merge_group = cumsum(ifelse(is.na(distance) | distance > threshold, 1, 0))
  ) 

# inspect the first ten rows
head(merged_quotes, n = 10)

# A tibble: 10 × 6
   part  start   end prev_end distance merge_group
   <chr> <int> <int>    <int>    <int>       <dbl>
 1 part5   574   597       NA       NA           1
 2 part5  1342  1361      597      745           2
 3 part5  1876  1904     1361      515           3
 4 part5  6036  6051     1904     4132           4
 5 part5  8751  8944     6051     2700           5
 6 part5  9276  9373     8944      332           6
 7 part5 13258 13265     9373     3885           7
 8 part5 14893 14905    13265     1628           8
 9 part5 16002 16132    14905     1097           9
10 part5 16186 16336    16132       54           9
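
The lag and cumsum trick is easier to follow on a handful of rows. The sketch below uses hypothetical positions for a single part (the `toy` data and the intermediate `new_group` column are illustrative): a row opens a new group when it has no predecessor or when it sits farther than the threshold from the previous quote, and cumsum() turns those flags into running group ids.

```r
library(dplyr)

# Hypothetical quote positions within a single part
toy <- tibble(
  start = c(10, 200, 260, 900),
  end   = c(50, 240, 300, 950)
)

threshold <- 100

toy_groups <- toy %>%
  mutate(
    distance    = start - lag(end),
    # TRUE marks the first row of a new group
    new_group   = is.na(distance) | distance > threshold,
    # running count of TRUE flags gives the group id
    merge_group = cumsum(new_group)
  )

toy_groups
```

Here the second and third rows are only 20 characters apart, so they share a group id, while the other rows each get their own.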

This intermediate step also gives us the answer to a new question:

What is the average distance between quotes?

The answer is x̄ = 740 ± 970 characters (mean ± sd). On average, you start a new quote after about 130 words of original content. Is that a lot? Is that too little?
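
One way to sanity-check the conversion from characters to words is a back-of-the-envelope sketch. This is not necessarily how the figure above was computed: it assumes the average characters-per-word ratio of the 19 sections kept by slice(6:24), using the nword and nchar columns copied from the table earlier in the post.

```r
# nword and nchar for the 19 kept sections, copied from the table above
nword <- c(3242, 5, 3270, 4880, 3812, 8804, 4271, 4773, 5022, 5930,
           6036, 12, 2345, 2570, 3933, 1925, 2857, 2161, 2084)
nchar_section <- c(18783, 28, 18268, 28878, 22815, 52065, 25298, 28006,
                   28949, 35332, 35402, 61, 13558, 15330, 22790, 10856,
                   16804, 12337, 12378)

# Average number of characters per word across the kept sections
chars_per_word <- sum(nchar_section) / sum(nword)
chars_per_word

# Convert the reported mean distance (740 characters) into words
mean_distance_chars <- 740
words_between <- mean_distance_chars / chars_per_word
words_between
```

Under that assumption the result lands in the same ballpark as the roughly 130 words quoted above.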

To be honest, it feels true to the reading experience. My impression was that the author was using verbatim quotes with high frequency, and the data seems to align with that. But don’t take my word for it; let’s try to visualize it.

We are two steps away from the viz:

1. Do the actual merge
2. Add the end of each chapter

We can do Step 1 using the code below:

Code

merged_quotes <- merged_quotes %>%
  summarize(
    # part is already a grouping column, so it is kept without re-assigning it
    .by = c(merge_group, part),
    start = first(start),
    end = last(end)
  ) %>% 
  # add the lag again to see where the original text starts
  mutate(text_start = lag(end, default = 0), .by = part)
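
To make the collapse step concrete, here is a small sketch on hypothetical rows that already carry a merge_group id (the `toy` data is made up). Each group becomes one row spanning from its first start to its last end, and text_start records where the preceding stretch of original text begins.

```r
library(dplyr)

# Hypothetical rows that already carry a merge_group id
toy <- tibble(
  part        = c("part5", "part5", "part5", "part6"),
  start       = c(10, 200, 260, 40),
  end         = c(50, 240, 300, 90),
  merge_group = c(1, 2, 2, 1)
)

toy_merged <- toy %>%
  # one row per (part, merge_group): first start, last end
  summarise(.by = c(part, merge_group),
            start = first(start),
            end   = last(end)) %>%
  # original text runs from the previous quote's end (or 0) to this quote's start
  mutate(text_start = lag(end, default = 0), .by = part)

toy_merged
```

Grouping the lag by part matters: without it, the first quote of part6 would inherit the end position of the last quote in part5.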

Right now, we have the start of the original text in text_start and the start and end of each verbatim quote. We need to use the metadata stored in meta to add the end of the original content of each chapter. This only matters for the very last portion that we are going to plot, so I will make a new data set that contains those values instead of merging everything together. To visualize it, I’m going to use a package I developed called ggethos. You can check it out here or adapt the code to work with geom_segment().

Code

library(ggethos)  # provides geom_ethogram()

# pad part numbers for plotting
format_part <- function(part_name) {
  # Extract the numeric part
  part_number <- as.integer(str_extract(part_name, "\\d+"))

  # Pad the number with zeros and prepend 'part'
  formatted_part <- str_c("part", str_pad(part_number, width = 2, pad = "0"))
  
  return(formatted_part)
}

# make tail end segments
tail_data <- merged_quotes %>% 
  summarise(.by = part, last_quote_end = max(end)) %>% 
  left_join(meta, by='part') %>% 
  # fix the padding after merging
  mutate(part = format_part(part))

# fix the padding here too
merged_quotes <- merged_quotes  %>% mutate(part = format_part(part))


ggplot(data=merged_quotes) + 
  geom_ethogram(aes(x=text_start, xend=start, y = part), color ="gray30") +
  geom_ethogram(data=tail_data, aes(x=last_quote_end, 
                                    xend=nchar, y = part), color ="gray30") +
  geom_ethogram(aes(x=start, xend=end, y = part), color = "red")+
  cowplot::theme_nothing() +
  labs(title = "'Do Nothing' is Peppered by Quotes",
       subtitle = "<span style = 'color:gray30'>Original Text</span> and <span style = 'color:red'>Verbatim quotes</span>",
       caption = "Viz: Matias Andina",
       y = "Chapter") +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = ggtext::element_markdown(hjust = 0.5),
    plot.background = element_rect(fill = "black"),
    text = element_text(color = 'gray80'),
    axis.title.y = element_text(angle = 90),
    plot.caption = element_text(size = 8, hjust = .95))
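
For readers who would rather not install ggethos, the same three layers can be approximated with geom_segment(). This is only a sketch on made-up positions (the `toy_quotes` and `toy_tail` data frames are hypothetical stand-ins for merged_quotes and tail_data), and the segment width is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical merged quotes for two parts (positions are made up)
toy_quotes <- data.frame(
  part       = c("part05", "part05", "part06"),
  text_start = c(0, 300, 0),
  start      = c(100, 800, 250),
  end        = c(300, 900, 400)
)

# Hypothetical chapter lengths, used to draw the last stretch of original text
toy_tail <- data.frame(
  part           = c("part05", "part06"),
  last_quote_end = c(900, 400),
  nchar          = c(1200, 700)
)

p <- ggplot(toy_quotes) +
  # original text leading up to each quote
  geom_segment(aes(x = text_start, xend = start, y = part, yend = part),
               color = "gray30", linewidth = 3) +
  # original text after the last quote of each part
  geom_segment(data = toy_tail,
               aes(x = last_quote_end, xend = nchar, y = part, yend = part),
               color = "gray30", linewidth = 3) +
  # the quotes themselves
  geom_segment(aes(x = start, xend = end, y = part, yend = part),
               color = "red", linewidth = 3)

p
```

Unlike the ggethos layer, geom_segment() needs yend spelled out; setting it equal to y keeps each segment horizontal. The linewidth argument assumes a reasonably recent ggplot2.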

I believe this plot conveys a good mental image of what reading the book feels like in terms of verbatim text usage.

Percentages with no merges

As mentioned at the beginning of the article, I was curious about how much verbatim text there was. Again, using the number of characters in each chapter stored in the meta object, we can easily calculate the percentage of all characters that are directly quoted:

Code

match_df %>% 
  # start and end are both inclusive, hence the + 1
  mutate(quote_chars = end - start + 1) %>% 
  summarise(.by = part, 
            quote_chars = sum(quote_chars)) %>% 
  left_join(meta, by = "part") %>% 
  mutate(quote_frac = quote_chars / nchar,
         part = fct_reorder(part, quote_frac)) %>% 
  ggplot() +
  geom_hline(aes(yintercept = mean(quote_frac)), lty = 4) +
  geom_point(aes(as.numeric(part), quote_frac), 
             size = 4, alpha = 0.9, color = "darkred") +
  geom_label(aes(x = 15.5, y = 0.17,
                 label = paste(part[which.max(quote_frac)],
                               scales::percent(max(quote_frac)),
                               sep="\n"
                 ))) +
  scale_y_continuous(labels = scales::label_percent(),
                     expand = expansion(add = c(0.01, 0.05)))+
  labs(y = "Verbatim Quotes",
       x = "Book Part\n(ascending quote % order)",
       title = "'Do Nothing' contains ~10% verbatim quoted text",
       subtitle = "Some parts are as high as 17%!")+
  cowplot::theme_minimal_hgrid()
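
The dashed line in the plot is mean(quote_frac), an unweighted mean in which every part counts the same regardless of its length. A pooled percentage (all quoted characters over all characters) can differ from it. The sketch below illustrates the difference on hypothetical totals, not the book's numbers.

```r
library(dplyr)

# Hypothetical per-part totals (not the book's numbers)
toy <- tibble(
  part        = c("part5", "part6", "part7"),
  quote_chars = c(1000, 5, 3000),
  nchar       = c(20000, 25, 20000)
)

res <- toy %>%
  mutate(quote_frac = quote_chars / nchar) %>%
  summarise(
    # unweighted: every part counts the same, however short
    mean_of_parts = mean(quote_frac),
    # pooled: quoted characters over all characters
    pooled = sum(quote_chars) / sum(nchar)
  )

res
```

In this toy case the very short part pulls the unweighted mean above the pooled figure; which summary is more appropriate depends on whether the question is about a typical chapter or about the book as a whole.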

A silver lining

Most non-fiction books are a regurgitation of something somebody else said a long time ago (there’s nothing new under the sun). In a sense then, it’s more truthful for an author to quote verbatim from the original source than to paraphrase whatever they took out of it and hide the initial message under a footnote³.

Footnotes

¹ But I’m sure a copyright lawyer would know much more than I do regarding how much verbatim text you can include and still claim ownership of your work.
² Of course, this threshold is arbitrary. How did I come up with it? I asked ChatGPT to come up with 10 interjections that were a bit longer than “they said”, and the phrases were sitting comfortably around 50 characters. I doubled it to be extra sure that we were not missing instances.
³ This paragraph was indeed a paraphrase of my editor’s (read: wife’s) reaction to my article. Talking to her is a great exercise in positive reframing.

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{andina2023,
  author = {Andina, Matias},
  title = {Original {Text}},
  date = {2023-11-07},
  url = {https://matiasandina.com/posts/2023-11-07-original-text},
  langid = {en}
}

For attribution, please cite this work as:

Andina, Matias. 2023. “Original Text.” November 7, 2023. https://matiasandina.com/posts/2023-11-07-original-text.
