Original Text
Lately I have been noticing a glaring trend in some of the non-fiction books that I read: the use and abuse of verbatim quotes. They come in the shape of:
“As such and such said: INSERT LONG VERBATIM TEXT HERE”
Of course, there are no rules regarding the use of verbatim text¹. But if I can get a sense of overuse just from reading the book, it makes me curious enough to go look at the data.
How much of the book is actually a verbatim text dump? Would I bet it is 10%? Maybe 20%? Would lower percentages make me go easier on the author, or is this a lost cause (i.e., if I notice the overuse by reading, all hope is lost)?
Example Book
Enough chatter. Let’s try to answer this by analyzing one of the books in question: “Do Nothing” by Celeste Headlee. Reading the book in R using the `epubr` package gives us this table:
Code
    # assumed setup: the code in this post relies on tidyverse verbs and the pipe
    library(tidyverse)

    # read the epub
    book_text <- epubr::epub("Do Nothing - Celeste Headlee.epub")

    book_text$data[[1]] %>%
      # the text strings need to be truncated for display
      mutate(text = str_sub(text, 0, 20),
             text = paste(text, "...")) %>%
      gt::gt()
section | text | nword | nchar |
---|---|---|---|
titlepage | … | 0 | 0 |
part0001.xhtml | … | 0 | 0 |
part0002.xhtml | … | 0 | 0 |
part0003.xhtml | Copyright © 2020 by … | 111 | 1014 |
part0004.xhtml | CONTENTSCoverTitle P … | 93 | 685 |
part0005.xhtml | INTRODUCTIONIt will … | 3242 | 18783 |
part0006.xhtml | PART IThe Cult of Ef … | 5 | 28 |
part0007.xhtml | Chapter 1MIND THE GA … | 3270 | 18268 |
part0008.xhtml | Chapter 2IT STARTS W … | 4880 | 28878 |
part0009.xhtml | Chapter 3WORK ETHICI … | 3812 | 22815 |
part0010.xhtml | Chapter 4TIME BECOME … | 8804 | 52065 |
part0011.xhtml | Chapter 5WORK COMES … | 4271 | 25298 |
part0012.xhtml | Chapter 6THE BUSIEST … | 4773 | 28006 |
part0013.xhtml | Chapter 7DO WE LIVE … | 5022 | 28949 |
part0014.xhtml | Chapter 8UNIVERSAL H … | 5930 | 35332 |
part0015.xhtml | Chapter 9IS TECH TO … | 6036 | 35402 |
part0016.xhtml | PART IILeaving the C … | 12 | 61 |
part0017.xhtml | Life-Back OneCHALLEN … | 2345 | 13558 |
part0018.xhtml | Life-Back TwoTAKE TH … | 2570 | 15330 |
part0019.xhtml | Life-Back ThreeSTEP … | 3933 | 22790 |
part0020.xhtml | Life-Back FourINVEST … | 1925 | 10856 |
part0021.xhtml | Life-Back FiveMAKE R … | 2857 | 16804 |
part0022.xhtml | Life-Back SixTAKE TH … | 2161 | 12337 |
part0023.xhtml | CONCLUSIONWe have ch … | 2084 | 12378 |
part0024.xhtml | For Theresa, who has … | 14 | 72 |
part0025.xhtml | ACKNOWLEDGMENTSI WOR … | 355 | 1995 |
part0026.xhtml | NOTESIntroduction”Ou … | 4067 | 28683 |
part0027.xhtml | ABOUT THE AUTHORCELE … | 318 | 1971 |
part0028.xhtml | What’s next onyour r … | 19 | 139 |
We can get rid of the legal stuff that normally goes before the text and everything that comes after the content (i.e., acknowledgements and references).
Code
    # A simple slice operation would do
    book_txt <- book_text$data[[1]] %>%
      slice(6:24)
We can also get some metadata from the text (it will come in handy later).
Code
    # bind previous word and character counts
    meta <- book_txt %>%
      select(section, nword, nchar) %>%
      mutate(part = paste0("part", 5:23)) %>%
      select(-section)
Now that we have the text, we can find all the instances of `"something in between these quotes here"` using `stringr::str_locate_all()`:
Code
    # extract the text
    match_df <- stringr::str_locate_all(book_txt$text, '"(.*?)"') %>%
      # give names for future binding
      # parts go from 5 to 23 (idx goes 6:24)
      set_names(nm = paste0("part", 5:23)) %>%
      # convert into tibble for easy binding
      map(as_tibble) %>%
      bind_rows(.id = "part")
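To make the shape of `match_df` concrete, here is a minimal sketch of the same pipeline on three made-up strings (the `toy_text` vector and the `part1` to `part3` names are invented for illustration, not taken from the book). It shows that `str_locate_all()` returns one start/end matrix per string, and that a string with no quotes contributes no rows after binding.

```r
library(stringr)
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical stand-ins for three book sections
toy_text <- c(
  'She said "hello" and left.',
  'No quotes here.',
  'He wrote "a" and "b".'
)

match_toy <- str_locate_all(toy_text, '"(.*?)"') %>%
  # one matrix per string; name them so bind_rows() can label the rows
  set_names(paste0("part", 1:3)) %>%
  map(as_tibble) %>%
  bind_rows(.id = "part")

match_toy
```

Note that `part2` is absent from the result because it has no matches. Also, this pattern only matches straight double quotes; if an e-book stores curly quotes, the pattern would need to be adapted.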
Below, I’m showing a slice with an example of matched character positions and what they look like in the text. I want to direct your attention to the second and third rows. I hope you notice that these two quotes are, in fact, one single quote that was split in two.
Code
    # This is an example
    match_df %>%
      slice(8:10) %>%
      mutate(quote = map2_chr(
        start, end,
        function(.x, .y) str_sub(book_txt$text[[1]], .x, .y)
      )) %>%
      gt::gt() %>%
      gt::tab_style(
        style = gt::cell_text(weight = "bold"),
        locations = gt::cells_column_labels()
      )
part | start | end | quote |
---|---|---|---|
part5 | 14893 | 14905 | “inefficient” |
part5 | 16002 | 16132 | “I can hunch over my computer screen for half the day churning frenetically through emails without getting much of substance done,” |
part5 | 16186 | 16336 | “all the while telling myself what a loser I am, and leave at 6:00 p.m. feeling like I put in a full day. And given my level of mental fatigue, I did!” |
Merging Quotes
The issue of quotes being split arises not because of a bug in the code, but because of the way the author writes. She would do something like:
“A palm tree”, somebody said, “belongs to the Plant Kingdom.”
These stylistic choices will modify the statistics for the direct quotes (e.g., the average length of a quote will be much shorter than if these quotes were kept whole). I decided to merge quotes if they are too close to each other (I will try 100 characters²). This will slightly inflate my % counts, since I’m attributing characters that are not direct quotes to actual quotes. Thus, when I calculate percentages, I will do so without merging (see “Percentages with no merges” below).
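As a quick illustration of why a distance threshold works, the sketch below runs the same regular expression on the palm tree sentence (typed here with straight quotes, which is an assumption about how the text is stored) and measures the gap between the two matches.

```r
library(stringr)

# The example sentence, with straight double quotes
sentence <- '"A palm tree", somebody said, "belongs to the Plant Kingdom."'

# One row per matched quote, with start and end character positions
locs <- str_locate_all(sentence, '"(.*?)"')[[1]]

# Gap between the end of the first quote and the start of the second
gap <- locs[2, "start"] - locs[1, "end"]
gap
```

The regex sees two separate quotes, and the interjection between them spans far fewer than 100 characters, so a threshold of that size would treat them as one quote.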
There’s one neat trick using `lag` and `cumsum` with a condition to achieve conditional grouping. We can see that rows 9 and 10 are now marked as belonging to the same group 🎉.
Code
    threshold <- 100 # Define your threshold

    merged_quotes <- match_df %>%
      mutate(
        .by = part,
        prev_end = lag(end),
        distance = start - prev_end,
        merge_group = cumsum(ifelse(is.na(distance) | distance > threshold, 1, 0))
      )

    # head(merged_quotes, n = 10)
    # A tibble: 10 × 6
       part  start   end prev_end distance merge_group
       <chr> <int> <int>    <int>    <int>       <dbl>
     1 part5   574   597       NA       NA           1
     2 part5  1342  1361      597      745           2
     3 part5  1876  1904     1361      515           3
     4 part5  6036  6051     1904     4132           4
     5 part5  8751  8944     6051     2700           5
     6 part5  9276  9373     8944      332           6
     7 part5 13258 13265     9373     3885           7
     8 part5 14893 14905    13265     1628           8
     9 part5 16002 16132    14905     1097           9
    10 part5 16186 16336    16132       54           9
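The `lag` plus `cumsum` idea is easier to see in isolation. The sketch below uses four invented start/end positions (not from the book): a row opens a new group when it is the first row or when it sits further than the threshold from the previous row, and the running sum of those flags becomes the group id.

```r
library(dplyr)

# Hypothetical quote positions; rows 2 and 3 are only 20 characters apart
toy <- tibble(
  start = c(10, 200, 260, 900),
  end   = c(50, 240, 300, 950)
)

threshold <- 100

grouped <- toy %>%
  mutate(
    distance    = start - lag(end),                       # gap to the previous quote
    new_group   = is.na(distance) | distance > threshold, # TRUE opens a new group
    merge_group = cumsum(new_group)                       # running count of opened groups
  )

grouped
```

Rows 2 and 3 end up sharing a `merge_group`, while the other rows each get their own.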
This intermediate step also gives us the answer to a new question:
What is the average distance between quotes?
The answer is x̄ = 740 characters (sd = 970). On average, a new quote starts after roughly 130 words of original content. Is that a lot? Is that too little?
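The post does not say how characters were converted to words, so the following is only a sanity check under one assumption: that the conversion uses the book’s own average characters per word, taken from the `nword` and `nchar` columns of the table above for parts 5 to 23 (the variable names are invented here).

```r
# Word and character counts for parts 5 to 23, copied from the table above
nword <- c(3242, 5, 3270, 4880, 3812, 8804, 4271, 4773, 5022, 5930,
           6036, 12, 2345, 2570, 3933, 1925, 2857, 2161, 2084)
nchar_part <- c(18783, 28, 18268, 28878, 22815, 52065, 25298, 28006, 28949,
                35332, 35402, 61, 13558, 15330, 22790, 10856, 16804, 12337, 12378)

# Average characters per word across the analysed text
chars_per_word <- sum(nchar_part) / sum(nword)

# Mean distance between quotes reported in the text, in characters
mean_distance_chars <- 740

# Approximate number of words between consecutive quotes
words_between_quotes <- mean_distance_chars / chars_per_word

round(chars_per_word, 1)
round(words_between_quotes)
```

Under that assumption the ratio comes out a little under six characters per word, which puts the estimate in the same ballpark as the figure quoted above.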
To be honest, it feels true to the reading experience. My sensation was that the author was using the verbatim quotes with high frequency, and the data seems to align with that. But don’t take my word for it, let’s try to visualize it.
We are two steps away from the viz.
1. Do the actual merge
2. Add the end of each chapter
We can do Step 1 using the code below:
Code
    merged_quotes <- merged_quotes %>%
      # part is a grouping column, so it is carried over automatically
      summarize(
        .by = c(merge_group, part),
        start = first(start),
        end = last(end)
      ) %>%
      # add the lag again to see where the original text starts
      mutate(text_start = lag(end, default = 0), .by = part)
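As a small worked example of the merge step, the sketch below collapses four invented rows (positions and group ids are made up) into one row per `merge_group`, keeping the first `start` and the last `end`. It assumes dplyr 1.1.0 or later for the `.by` argument.

```r
library(dplyr)

# Hypothetical matches in one part; rows 2 and 3 share a merge group
toy <- tibble(
  part        = "part5",
  start       = c(10, 200, 260, 900),
  end         = c(50, 240, 300, 950),
  merge_group = c(1, 2, 2, 3)
)

merged <- toy %>%
  # collapse each group into a single span
  summarise(
    .by   = c(part, merge_group),
    start = first(start),
    end   = last(end)
  ) %>%
  # original text runs from the end of the previous quote (or 0) to this start
  mutate(text_start = lag(end, default = 0), .by = part)

merged
```

The two close rows become a single span from 200 to 300, and `text_start` marks where the author’s own text resumes before each quote.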
Right now, we have the start of the original text in `text_start` and the `start` and `end` of each verbatim quote. We need to make use of the metadata stored in `meta` to add the end of the original content of each chapter. This only matters for the very last portion that we are going to plot, so I will make a new data set that contains those values instead of merging everything together. To visualize it, I’m going to make use of a package I developed called `ggethos`. You can check it out here or adapt the code to work with `geom_segment()`.
Code
    library(ggethos) # provides geom_ethogram()

    # pad parts for plotting
    format_part <- function(part_name) {
      # Extract the numeric part
      part_number <- as.integer(str_extract(part_name, "\\d+"))
      # Pad the number with zeros and prepend 'part'
      formatted_part <- str_c("part", str_pad(part_number, width = 2, pad = "0"))
      return(formatted_part)
    }

    # make tail end segments
    tail_data <- merged_quotes %>%
      summarise(.by = part, last_quote_end = max(end)) %>%
      left_join(meta, by = "part") %>%
      # fix the padding after merging
      mutate(part = format_part(part))

    # fix the padding here too
    merged_quotes <- merged_quotes %>%
      mutate(part = format_part(part))

    ggplot(data = merged_quotes) +
      # original text before each quote
      geom_ethogram(aes(x = text_start, xend = start, y = part), color = "gray30") +
      # original text after the last quote of each part
      geom_ethogram(data = tail_data,
                    aes(x = last_quote_end, xend = nchar, y = part),
                    color = "gray30") +
      # verbatim quotes
      geom_ethogram(aes(x = start, xend = end, y = part), color = "red") +
      cowplot::theme_nothing() +
      labs(title = "'Do Nothing' is Peppered by Quotes",
           subtitle = "<span style = 'color:gray30'>Original Text</span> and <span style = 'color:red'>Verbatim quotes</span>",
           caption = "Viz: Matias Andina",
           y = "Chapter") +
      theme(
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = ggtext::element_markdown(hjust = 0.5),
        plot.background = element_rect(fill = "black"),
        text = element_text(color = "gray80"),
        axis.title.y = element_text(angle = 90),
        plot.caption = element_text(size = 8, hjust = .95))
I believe this plot conveys a good mental image of what reading the book feels like in terms of verbatim text usage.
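For readers who would rather not install `ggethos`, here is a rough sketch of the `geom_segment()` route mentioned earlier. The data frame is invented for illustration, the styling is much simpler than in the plot above, and the `linewidth` argument assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

# Hypothetical spans: each row is a stretch of original text or a quote
segments <- data.frame(
  part = c("part05", "part05", "part05", "part06", "part06"),
  x    = c(0, 50, 300, 0, 120),
  xend = c(50, 300, 400, 120, 180),
  kind = c("original", "quote", "original", "original", "quote")
)

p <- ggplot(segments) +
  # one horizontal bar per span; y and yend are the same so the bar stays flat
  geom_segment(
    aes(x = x, xend = xend, y = part, yend = part, colour = kind),
    linewidth = 4
  ) +
  scale_colour_manual(values = c(original = "gray30", quote = "red")) +
  labs(x = "Character position", y = "Chapter", colour = NULL)

p
```

Putting both kinds of span in one long data frame with a `kind` column means a single layer and a colour scale can replace the three separate layers.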
Percentages with no merges
As mentioned at the beginning of the article, I was curious about how much verbatim text there was. Again, using the number of characters in each chapter stored in the `meta` object, we can easily calculate the percentage of all characters that are directly quoted:
Code
    match_df %>%
      mutate(quote_chars = end - start) %>%
      summarise(.by = part, quote_chars = sum(quote_chars)) %>%
      left_join(meta, by = "part") %>%
      mutate(quote_frac = quote_chars / nchar,
             part = fct_reorder(part, quote_frac)) %>%
      ggplot() +
      # dashed line at the mean fraction across parts
      geom_hline(aes(yintercept = mean(quote_frac)), lty = 4) +
      geom_point(aes(as.numeric(part), quote_frac),
                 size = 4, alpha = 0.9, color = "darkred") +
      # label the part with the highest fraction
      geom_label(aes(x = 15.5, y = 0.17,
                     label = paste(part[which.max(quote_frac)],
                                   scales::percent(max(quote_frac)),
                                   sep = "\n"))) +
      scale_y_continuous(labels = scales::label_percent(),
                         expand = expansion(add = c(0.01, 0.05))) +
      labs(y = "Verbatim Quotes",
           x = "Book Part\n(ascending quote % order)",
           title = "'Do Nothing' contains ~10% verbatim quoted text",
           subtitle = "Some parts are as high as 17%!") +
      cowplot::theme_minimal_hgrid()
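To check the arithmetic behind the fraction by hand, here is a toy version with invented positions and chapter lengths. One deliberate difference, flagged as a choice of this sketch: it counts `end - start + 1` characters per quote, since both positions returned by `str_locate_all()` are inclusive. The code above uses `end - start`, which is one character shorter per quote and makes a negligible difference at book scale.

```r
library(dplyr)

# Hypothetical quote positions and chapter lengths
matches <- tibble(
  part  = c("part5", "part5", "part6"),
  start = c(11, 101, 21),
  end   = c(30, 150, 60)
)
chapter_meta <- tibble(
  part  = c("part5", "part6"),
  nchar = c(1000, 400)
)

quote_share <- matches %>%
  # inclusive character count for each quote
  mutate(quote_chars = end - start + 1) %>%
  summarise(.by = part, quote_chars = sum(quote_chars)) %>%
  left_join(chapter_meta, by = "part") %>%
  mutate(quote_frac = quote_chars / nchar)

quote_share
```

Here `part5` has 70 quoted characters out of 1000 and `part6` has 40 out of 400.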
A silver lining
Most non-fiction books are a regurgitation of something somebody else said a long time ago (there’s nothing new under the sun). In a sense then, it’s more truthful for an author to quote verbatim from the original source than to paraphrase whatever they took out of it and hide the initial message under a footnote³.
Footnotes
1. But I’m sure a copyright lawyer would know much more than I do regarding how much verbatim text you can include and still claim ownership of your work.↩︎
2. Of course, this threshold is arbitrary. How did I come up with it? I asked ChatGPT to come up with 10 interjections that were a bit longer than “they said”, and the phrases were sitting comfortably around 50 characters. I doubled that to be super sure that we were not missing instances.↩︎
3. This paragraph was indeed a paraphrase of my editor’s (read wife’s) reaction to my article. Talking to her is a great exercise in positive reframing.↩︎
Citation
    @online{andina2023,
      author = {Andina, Matias},
      title = {Original {Text}},
      date = {2023-11-07},
      url = {https://matiasandina.com/posts/2023-11-07-original-text},
      langid = {en}
    }