Retrospective: Writing an O’Reilly Book

[This article was first published on R-Chart, and kindly contributed to R-bloggers.]

A number of people have asked me about the time and effort involved in writing a book, as I just completed one for O’Reilly (published in April and available on Amazon). The process at O’Reilly is unique in that a book is written and initially formatted in an application called Atlas. Ilya Grigorik also wrote an O’Reilly book and kept a detailed log of time spent. I cannot claim the same, but I have enough data available from Atlas and email to reconstruct the schedule and areas of focus throughout the project.

O’Reilly Atlas

Atlas is O’Reilly’s Git-backed, web-based platform for publishing. Authors enter markup, which is persisted and managed in a Git repository. The author can choose to “build” a book in its current state to PDF, HTML, mobi and/or epub formats. It was introduced at FluentConf 2013 under the direction of Rune Skjoldborg Madsen.

The fact that Atlas is web-based makes collaboration and access trivial. Authors and editors can work on the book from different geographic locations through a web browser. The browser-based application is lightweight and does not require special software installations or the associated licensing.

The fact that it is Git-backed means all of the benefits of version control are available to writers. Version control is taken for granted by programmers, but most of the rest of the world is largely unaware of its benefits. Imagine if a writer could not only retrieve a document in its current state, but also see the changes that were made over time. If needed, they could revert a document to a previous version. This is not the stuff of imagination; it is functionality that has been readily available in version control systems, in various forms, for decades. Systems like Atlas are extending version control’s usage to a wider audience that includes writers and publishers.

Project Analysis

Summary data from Atlas indicates that the primary work of writing the book spanned 331 days. A timeline that encompasses the entire project (derived from emails) indicates that an additional 6 months or so passed outside of Atlas. The entire project spanned 523 days. (R note: I don’t know of any timeline chart library in R, so I wrote a function that takes a data frame containing a Date and an Event column and plots them on a timeline. All R code related to charts that appear in the article is listed below.)
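As a rough illustration of the input such a timeline function expects, the sketch below builds a small data frame with a Date and an Event column and computes the span between the first and last event. The events and dates are invented placeholders, not the book's actual milestones.

```r
# Hypothetical events and dates, for illustration only (not the real project data)
events <- data.frame(
  Date  = as.Date(c("2020-01-01", "2020-03-01", "2020-12-31")),
  Event = c("Proposal sent", "Writing begins", "Book published"),
  stringsAsFactors = FALSE
)

# The same kind of column checks a timeline function would run on its input
stopifnot("Date" %in% names(events), "Event" %in% names(events))

# Subtracting Date objects yields a difftime measured in days
span <- max(events$Date) - min(events$Date)
print(span)
```

Storing the dates as Date objects rather than strings is what makes both the subtraction and a date-scaled x axis possible.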



The most active periods within the total 523 days were the 331 days in Atlas and the final weeks before publication, which were spent working with editors. This was an unsolicited proposal, so the last weeks of 2012 included relatively little active work. Once the proposal was accepted, writing kicked into high gear. Not included in this time is the construction of the initial proposal and related projects. I don’t have any detailed information about this activity, but my sense is that it involved about 6 weeks of relatively relaxed creation of demo programs and formulation of an overall plan and outline for the book.




The commits, files and lines per month also substantiate my recollection of the way work proceeded. After an initial three-month burst of work, I slowed down a bit in early summer. Then in August, my commit count dropped, but the lines and files increased because I was committing larger chunks of work during this period. September was a ridiculously busy month due to matters unrelated to the book, so productivity dropped off quite a bit at that point. By October, we started discussing technical review, which kicked me into high gear to get the final portion of the book completed. Work was largely done by December, but edits continued based on feedback from reviewers. The overall downward trend in lines per month also fits my recollection: lots of original writing early in the project, and lots of touch-up, corrections, and editing at the end.
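The per-month file and line figures discussed above are derived from running totals, since files and lines were reported cumulatively. A minimal sketch of that step, using made-up numbers rather than the real Atlas statistics:

```r
# Made-up cumulative totals by month (placeholders, not the actual Atlas data)
cumulative <- data.frame(
  Files = c(5, 12, 20, 23),
  Lines = c(300, 900, 1400, 1600),
  row.names = c("Month 1", "Month 2", "Month 3", "Month 4")
)

# diff() subtracts each month's total from the next one, giving per-month deltas;
# the first month has no predecessor, so the result has one row fewer
per.month <- apply(cumulative, 2, diff)
colnames(per.month) <- c("Files.per.month", "Lines.per.month")
print(per.month)
```

Because the first month is dropped by the differencing, any merge back onto the original table keeps only the months that have a delta.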

Conclusion

Stephen Wolfram claims that “One day I’m sure everyone will routinely collect all sorts of data about themselves.” Without my making any specific attempt to collect data, Atlas and email provided a good deal of information about the overall schedule and amount of effort required to write a book. Presenting the data with R and ggplot2 gives a clear picture that confirms my personal recollection, and checks it in a few cases as well.

What is not reflected in the data are the ups and downs of a project that spans this amount of time. I have written many articles, but maintaining continuity and consistency is very challenging at book scale. Deadlines are essential for maintaining focus, but at times writing is a bit more inspired if you are able to relax without the specter of a missed deadline hovering. Writing, like many activities, requires attentiveness on many different levels, not all of which you can keep in view at the same time. The team at O’Reilly, the technical reviewers and others who read early versions of chapters were essential to producing a quality final project. Paradoxically, I am amazed both at how many people were required to create the final project and at how much work it was for me personally. I have a new appreciation for anyone who authors a book, and great admiration for those who do it well.

So to those considering writing a book: go for it! It is an exciting journey quite unlike any other in very subtle ways. Of course, your mileage will vary based on the type of writing involved and your own personal skills and situation. In case you were wondering, the book I wrote is on recent trends in web application development and is directed towards developers using Java/JVM languages on the server and modern JavaScript frameworks browser-side. My O’Reilly animal is a Large Indian Civet, which coincidentally looks quite a bit like our family cat. You don’t get those with other publishers :). Tools like Atlas speed up collaboration, and short feedback loops make the process far easier than in the past. Just remember, it is a marathon, not a sprint. Or better yet, an endurance race that requires the support of a dedicated team to cross the finish line.



#############
# Related Code Below: or see the version of this article at RPubs.
#############
total.files     <- 96
total.lines     <- 5748
total.commits   <- 2820
first.commit    <- as.Date("...")
last.commit     <- as.Date("...")
stats.generated <- as.Date("...")

last.commit - first.commit

#############
library(ggplot2)
library(grid)

timeline <- function(df){
  
  if (!"Date" %in% names(df)) {
    stop('DataFrame passed as arg 1 must contain a column named Date with data format YYYY-MM-DD')
  }
  
  if (! "Event" %in% names(df)) {
    stop('DataFrame passed as arg 1 must contain a column of character type named Event')
  }
  
  ggplot(df, aes(x=Date, y=0, color=Event)) + 
    geom_segment(aes(x    = min(Date)-60,    y = 0, 
                     xend = max(Date)+60, yend = 0), 
                 arrow = arrow(ends="both",length = unit(0.5, "cm")), color='black')+
    geom_point(size=3) + 
    geom_segment(aes(x=Date, y=0, xend=Date, yend= labelPos)) +
    geom_text(label=df$Event, size=4, hjust=0, vjust=0, y=df$labelPos ) +
    theme(panel.background = element_rect(fill = "white"), 
          panel.grid.major = element_line(size = 0, linetype = "dotted"),
          axis.title.y=element_blank(),
          axis.text.y = element_blank(),
          axis.ticks.y= element_blank(),
          legend.position="none"
    )
}

labelPosGen <- function(size){
  step <- round(75/size, 2)
  seq(75, 1, -step)/ 250 - 0.1
}

# randLabelPosGen <- function(size){(sample(180,size)/250) - 0.25}

oreilly.dates <- read.csv("~/book-writing-statistics/oreilly-dates.csv", stringsAsFactors=FALSE)
oreilly.dates$Date <- as.Date(oreilly.dates$Date)
oreilly.dates$labelPos <- labelPosGen(length(oreilly.dates$Date))
timeline(oreilly.dates)


max(oreilly.dates$Date)-min(oreilly.dates$Date)



#############
library(ggplot2)
library(gridExtra)

# Import
writing.statistics <- read.csv("writing-statistics.csv", 
                               row.names=1, 
                               as.is=TRUE)

# Pre-process the data
# Files and lines were reported as cumulative, so we need per-month deltas
per.month <- apply(writing.statistics[-1], 2, diff)

colnames(per.month) <- c('Files.per.month', 
                         'Lines.per.month')

# Merge these derived columns back into the original data frame
df <- merge(writing.statistics, 
            as.data.frame(per.month), 
            by="row.names")

# Write out the plot
plot1 <- ggplot(df, aes(x=as.character(Row.names), y=Commits, group=1)) + 
  geom_line() + 
  geom_smooth(method="lm", se = FALSE) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_x_discrete("Month")

plot2 <- ggplot(df, aes(x=as.character(Row.names), y=Files.per.month, group=1)) + 
  geom_line() + 
  geom_smooth(method="lm", se = FALSE) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_x_discrete("Month")

plot3 <- ggplot(df, aes(x=as.character(Row.names), y=Lines.per.month, group=1)) + 
  geom_line() + 
  geom_smooth(method="lm", se = FALSE) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_x_discrete("Month")


grid.arrange(plot1, plot2, plot3, ncol=1)
