Linking text, results, and analyses: Increasing transparency and efficiency

Jeromy Anglim

13 years ago

[This article was first published on Jeromy Anglim's Blog: Psychology and Statistics, and kindly contributed to R-bloggers].

I have recently been thinking about the relationship between text in a final report and data analysis. The broader concern is with making the conduct and reporting of statistical analyses more transparent. I am inspired by the ideas of literate programming, Sweave, and open access to data.

Something to aspire to:

Raw data is shared (ethics, copyright, and other considerations permitting).
Code is shared that shows how the data was imported, transformed, and analysed. This code is well written, commented, and documented.
The report is shared openly rather than placed behind a paid subscription.
Report output including tables, figures, and some text is linked directly to the analyses in code.

While the aspirations transcend R, I like the prospect of having analyses in R integrated with a final report. Including tables and figures is, at least conceptually, a straightforward idea. Including text from a results section is a little fuzzier, because such text (I’ll call it “results text” for short) varies in how it relates to the actual analyses. Thus, I had the following questions: 1) What is the unit of results text? 2) How does results text vary, and what should be supplied automatically by R? 3) For results text that should not be supplied by R, how should it be integrated into the analysis process?

Initial thoughts: After a little reflection, I arrived at the following:

A unit of results text is any continuous string of text. For example, “r = .23” and “F(2, 23) = 7.89” are both continuous strings of text. Such a unit contains multiple elements of information, but it could be imported from R as a single string, and only one additional piece of information would be required: the text’s location in the report.
Results text can be classified as either numeric or qualitative. Numeric results text includes correlations, means, percentages, significance values, effect size measures, and so on, along with any standardised reporting text that surrounds their presentation (e.g., “r = ” in “r = .23”, or the F, brackets, and equals sign in an F test). Qualitative results text includes a wide range of content: a) description of analysis steps; b) justification of analyses; c) general comments about the pattern of results; d) non-numeric statements relating to statistical significance, direction of effect, and effect size; e) statements about the relationship between results and expectations, possibly with some explanation.
Results text varies in the degree to which it is contingent on the actual results of data analyses. At one end there is text that is not influenced (e.g., text introducing a table or figure; text justifying an analysis strategy; text setting out the steps taken to produce the results). Numeric results text is at the other end of the continuum and is altered by the slightest of changes to the data or analytic approach (e.g., the sample size or exact correlation will change after a case is deleted). There is also a wide variety of contingent qualitative results text (e.g., comments on the general pattern of results; comments about the size of a relationship).
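
To make the idea of a numeric unit of results text concrete, here is a minimal sketch of how an analysis script might build strings such as “r = .23” and “F(2, 23) = 7.89” from computed values. It is written in Python purely for illustration; the same idea carries over to R. The function names (drop_leading_zero, format_r, format_f) are hypothetical, and the convention of dropping the leading zero for correlations is an assumption based on the examples above.

```python
def drop_leading_zero(value: float, digits: int = 2) -> str:
    """Round a value and remove the zero before the decimal point."""
    text = f"{value:.{digits}f}"  # e.g. "0.23" or "-0.46"
    if text.startswith("0."):
        return text[1:]
    if text.startswith("-0."):
        return "-" + text[2:]
    return text


def format_r(r: float) -> str:
    """Hypothetical helper: one unit of results text for a correlation."""
    return f"r = {drop_leading_zero(r)}"


def format_f(df1: int, df2: int, f: float) -> str:
    """Hypothetical helper: one unit of results text for an F test."""
    return f"F({df1}, {df2}) = {f:.2f}"


if __name__ == "__main__":
    print(format_r(0.234))         # r = .23
    print(format_f(2, 23, 7.891))  # F(2, 23) = 7.89
```

Each call returns a single continuous string, so the report only needs to know where that string belongs. The surrounding standardised text (“r = ”, the brackets around the degrees of freedom) travels with the number and cannot drift out of step with it.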

Implications for integrating statistical analyses (and R in particular) with report writing:

Numeric results text should be integrated automatically into the final report.
Qualitative results text should be distinguished based on whether it is contingent on the results or not.
Noncontingent qualitative results text should be written up first.
Contingent qualitative results text should be written up after examining the contingent analysis output.
Contingent qualitative results text should be flagged in the word processor.
Contingent qualitative results text is based on underlying data and output. Whenever the data or output change, the text should be audited to see whether it needs to change as a consequence.
Placeholders for contingent results text (numeric or qualitative) can be placed in the document in preparation for completion of analyses.
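
The points above about placeholders and auditing can be sketched in a few lines. This is only one possible approach, shown in Python for illustration rather than as a description of any existing tool. All names (ContingentText, needs_audit, render, the “[AUDIT]” marker, and the example values) are hypothetical. The assumption is that each piece of contingent qualitative text records the results it was written against, so that a later change in those results can be detected.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class ContingentText:
    """Qualitative text plus the results it was written against (hypothetical)."""
    text: str
    written_against: dict


def needs_audit(item: ContingentText, results: dict) -> bool:
    """True if any result the text depends on has changed since it was written."""
    return any(results.get(name) != value
               for name, value in item.written_against.items())


def render(template: Template, results: dict, comment: ContingentText) -> str:
    """Fill numeric placeholders automatically; flag stale qualitative text."""
    flag = " [AUDIT]" if needs_audit(comment, results) else ""
    return template.substitute(results, comment=comment.text + flag)


# Placeholders can be put in the report before the analyses are finished.
report = Template("$comment The correlation was $r_scales (N = $n).")

# Written by hand after looking at the first set of results.
comment = ContingentText(
    text="The scales were weakly related.",
    written_against={"r_scales": "r = .23"},
)

results_v1 = {"r_scales": "r = .23", "n": 120}
results_v2 = {"r_scales": "r = .41", "n": 119}  # after a case is deleted

if __name__ == "__main__":
    print(render(report, results_v1, comment))
    print(render(report, results_v2, comment))
```

With the first set of results the sentence is rendered unchanged. With the second, the numeric text updates by itself, while the hand-written comment is kept but marked for review, which mirrors the distinction between automatically integrated numeric text and audited qualitative text.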

Final comments:

I plan to write more at a later point on how this integration could be achieved.
