Writing reproducibly in the open with knitr

Sweave is something of a gold standard in reproducible research. It creates a dynamic document, written in a mix of LaTeX and R code where the results of the analysis (numbers, figures, tables) are automatically generated from the code and inserted into the resulting pdf document, making them easy to update if the data or methods change. It’s a nice idea, in principle.

However, the practical troubles are many: coauthors who don’t know LaTeX, and publishers who don’t accept LaTeX or pdfs. There is the LaTeX myth that you are freed from thinking about formatting, when in fact you have to fill your document with LaTeX-specific markup that makes the source a burden both to type and to read. There is compiling and debugging your text. And then the reproducibility comes from sharing that Sweave file – a mix of LaTeX and R that almost no one can read easily. Where’s the elegance in that? ((I’m glossing over the additional challenges of highlighting, caching, and formatting on the R code side, which have been largely addressed by additional packages and are elegantly solved in knitr.)) Sure, none of these are show-stoppers – I’ve been content with LaTeX for years – but suddenly there’s a better way.

Thanks to knitr, a successor of Sweave, I can write my publications in markdown. Unlike LaTeX, HTML, or other markup languages, markdown is designed to be easily read as plain text, but can also be interpreted into pretty HTML, and now into almost any other format thanks to pandoc. All of which is to say that writing and sharing just got a lot easier.

As I have written previously, I already use this markdown format for my notes and code, so there’s no re-typing required. When working on the paper, I can just write. I can edit the code without flipping back and forth between files. Knitr can run the code blocks, caching parts that have already run for efficiency, and upload the resulting figures in png format automatically to the Internet. Github displays the resulting document (https://github.com/ropensci/rfishbase/blob/master/inst/doc/rfishbase/source/rfishbase.md), while also tracking the versions as my writing progresses.

Different output formats for the manuscript

Pandoc allows me to transform these notes into a LaTeX file that can generate professional-looking pdfs with a given journal’s .cls files by using a custom LaTeX template. Pandoc can also generate the less pretty but often required Word documents. A separate R script works with a Makefile to control the relevant formatting: for LaTeX output I want high-quality pdf graphics; for Word output I want eps graphics, which are created but not pasted into the Word file; for the drafts I want png graphics stored online for easy sharing. Pandoc allows citations to be extracted from my Mendeley library (via BibTeX files) and inserted into each of these output formats (doc, pdf, github markdown).
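To make the per-format logic concrete, here is a rough sketch of how the choices might be tabulated. It is written in Python purely for illustration; it is not the R script or Makefile used for the manuscript. The file names (manuscript.md, refs.bib, journal.latex) are hypothetical, and the pandoc options assume a reasonably recent pandoc (2.11 or later, where `--citeproc` and the `gfm` writer are available). The sketch only assembles the command lines and does not run pandoc.

```python
# Illustrative sketch only: file names below are hypothetical placeholders.

# Figure format the knit step should produce for each build target,
# following the description in the text (pdf for LaTeX, eps for Word,
# png for the online draft).
FIGURE_FORMAT = {"pdf": "pdf", "docx": "eps", "github": "png"}

# File that pandoc would write for each target (hypothetical names).
PANDOC_OUTPUT = {
    "pdf": "manuscript.tex",         # LaTeX source, later compiled to pdf
    "docx": "manuscript.docx",       # Word copy
    "github": "manuscript_github.md" # markdown copy with resolved citations
}


def pandoc_command(target, source="manuscript.md",
                   bibliography="refs.bib", template="journal.latex"):
    """Return the pandoc argument list for one build target."""
    if target not in PANDOC_OUTPUT:
        raise ValueError(f"unknown target: {target!r}")
    # Citation handling is shared by every output format.
    cmd = ["pandoc", source, "--citeproc",
           f"--bibliography={bibliography}",
           "-o", PANDOC_OUTPUT[target]]
    if target == "pdf":
        # A custom LaTeX template is what pulls in the journal's .cls file.
        cmd.append(f"--template={template}")
    elif target == "github":
        # Write GitHub-flavoured markdown instead of guessing from the extension.
        cmd += ["-t", "gfm"]
    return cmd


if __name__ == "__main__":
    for target in FIGURE_FORMAT:
        print(target, FIGURE_FORMAT[target], " ".join(pandoc_command(target)))
```

Keeping the figure format and the pandoc options in one lookup per target means a new output format needs only one more entry, which is roughly the role the Makefile targets play in the real pipeline.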

Getting the LaTeX template, Makefile, and knit script set up for this pipeline takes a little care – mostly to ensure figures and tables look appropriate in all outputs. Once these files are created, though, they can be easily reused on other manuscripts. A simple make pdf builds the pdf copy, make docx builds an MS Word copy, ((though these binary files aren’t stored in the git repository)) and make github builds the copy that displays with images on Github.

The links in this post point to what is an active draft of a little manuscript at the time of this writing. In addition to making the final result reproducible, Github captures the provenance or history of the research and writing process. It’s not a perfect system, but it’s a nice step.