Some thoughts on collaborating with markdown / dynamic document workflow
Collaborating on manuscripts has been a perpetual nuisance for those of us not writing in MS Word (no doubt the Word users might say the same). When I first began writing papers I worked in LaTeX, and at that time I collaborated largely with others who already knew TeX (e.g. my mentors), so this wasn’t much of a problem. Moving my workflow into markdown has simplified collaborations with (often junior) researchers who know markdown better than TeX, but it has made the potential for mismatches even greater, as it creates a barrier both for co-authors working in LaTeX and for those working in Word.
This has always been more of a nuisance than a real problem. I’ve usually just sent co-authors some derived copy (e.g. a pdf, or sometimes a Word or TeX document created from the markdown using pandoc). This means I have to transcribe or at least copy and paste their edits back, though that’s never a particularly time-consuming process.
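One way to produce that Word copy directly from R, rather than calling pandoc by hand; a minimal sketch, where the file name is a placeholder (requires the rmarkdown package and pandoc):

```r
# Render the markdown/Rmd source to a .docx that co-authors can edit
rmarkdown::render("manuscript.Rmd", output_format = "word_document")
```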
I still have high hopes that RStudio’s `rmarkdown` format will make it practical for co-authors to edit and compile my `.Rmd` files directly. Meanwhile, a mentor who frequently uses LaTeX in collaborating with Word users suggested a much simpler solution that has proven very practical for me so far.
A simple solution
Based on his suggestion, I just paste the contents of the `.Rmd` file into a Word (well, LibreOffice) document and send that. Most collaborators can simply ignore the code blocks, LaTeX equations, etc., and edit the text directly. I send the compiled pdf as well for the figures and rendered equations. A collaborator cannot easily re-compile their changes, but I can by copy-pasting back into the `.Rmd` file. They can track changes via Word, and I can track the same changes through version control when I paste their changes back in. It’s not perfect, but it’s simple.
Challenges in submission systems
I use semi-transparent graphs to show ensemble distributions of stochastic trajectories. First, a quick example as to why:
```r
ggplot(sims_data) +
  geom_line(aes(time, fishstock, group = interaction(reps, method), color = method),
            alpha = 0.1) +
  scale_colour_manual(values = colorkey,
                      guide = guide_legend(override.aes = list(alpha = 1))) +
  facet_wrap(~method) +
  guides(legend.position = "none")
```
Statistical summaries offer a way around this approach, but don’t really reflect the rather binary nature of the ensemble (some trajectories go to zero while others remain around a constant level):
```r
ggplot(sims_data) +
  stat_summary(aes(time, fishstock), fun.data = "mean_sdl",
               geom = "ribbon", fill = "grey80", col = "grey80") +
  stat_summary(aes(time, fishstock, col = method), fun.y = "mean", geom = "line") +
  # geom_line(aes(time, fishstock, group = interaction(reps, method), color = method), alpha = 0.1) +
  scale_colour_manual(values = colorkey,
                      guide = guide_legend(override.aes = list(alpha = 1))) +
  facet_wrap(~method) +
  guides(legend.position = "none")
```
Tweaking the statistical definitions can remove the more obvious errors from this, but still give the wrong impression:
```r
mymin <- function(x) mean(x) - sd(x)

ggplot(sims_data) +
  stat_summary(aes(time, fishstock), fun.y = "mean", fun.ymin = "mymin",
               fun.ymax = "max", geom = "ribbon", fill = "grey80", col = "grey80") +
  stat_summary(aes(time, fishstock, col = method), fun.y = "mean", geom = "line") +
  # geom_line(aes(time, fishstock, group = interaction(reps, method), color = method), alpha = 0.1) +
  scale_colour_manual(values = colorkey,
                      guide = guide_legend(override.aes = list(alpha = 1))) +
  facet_wrap(~method) +
  guides(legend.position = "none")
```
Technical challenges in submitting transparent figures
All this works great with R + ggplot2, generating the svg versions shown here or pdfs for the final manuscript. Try to upload those pdfs to manuscriptcentral though (or arXiv, actually) and they render improperly or not at all. Whatever happened to the “portable” in “portable document format”? (Note that submission systems can take EPS/PS, though we’d need to change our LaTeX flavor for it, and R’s graphic devices for those formats don’t seem to support transparency.)
It seems the problem for pdfs arises from differences between pdf versions (thanks to Rich FitzJohn for figuring this out; I would never have managed). Transparency is natively supported in pdf >= 1.4, while in earlier versions it is merely emulated. R can generate version 1.3 pdfs (using `dev.args = list(version = "1.3")` as the knitr chunk option), but unfortunately ggplot promotes them to version 1.4:
```
Saving 7 x 6.99 in image
Warning message:
In grid.Call.graphics(L_lines, x$x, x$y, index, x$arrow) :
  increasing the PDF version to 1.4
```
I’m not quite clear on what part of the ggplot command triggers this, as some ggplot figures do render in version 1.3. To add one more gotcha, RStudio’s `rmarkdown` by default runs pdfcrop, which also promotes the pdfs to version 1.4.
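Outside of knitr, the same version pinning can be done directly on R’s pdf device; a minimal sketch, with the file name as a placeholder:

```r
# Open a pdf device pinned to version 1.3, where transparency is
# emulated rather than natively supported
pdf("fig-transparent.pdf", version = "1.3")
plot(rnorm(100), col = rgb(0, 0, 1, 0.3), pch = 16)
dev.off()
```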
It seems that pdf 1.5 works, however: opening the v1.4 pdf in Inkscape and saving as v1.5 seems to do the trick. (The current pdf standard is version 1.7, though R supports only up to 1.4.) This is the route I took for the time being with manuscriptcentral, though it is frustrating that it requires a step external to the R workflow.
These submission systems can also take TIFF graphics (though (pdf)LaTeX can’t). I suppose one could submit jpg/png images as supplementary files for the tex compilation, which would be a workable if annoying solution, using rasters where a vector graphic is preferred (and much smaller; I can’t understand why manuscriptcentral takes 10^4 times as long to upload a document as arXiv or other platforms).
12 Jun 2014
Used some down-time while traveling to hammer out a long overdue update to my knitcitations package.
I followed this with a ground-up rewrite, as I summarize in NEWS:
This version is a ground-up rewrite of knitcitations, providing a more powerful interface while also streamlining the back end, mostly by relying more on external libraries for knitty gritty. While an effort has been made to preserve the most common uses, some lesser-used functions or function arguments have been significantly altered or removed. Bug reports greatly appreciated.
- `citep` and `citet` now accept more options. In addition to the four previously supported options (DOI, URL, bibentry, or bibkey of a previously cited work), these now accept a plain text query (used in a CrossRef search) or a path to a PDF file (which attempts metadata extraction).
- Citation key generation is now handled internally, and cannot be configured just by providing a named argument to `citep`.
- The `cite` function is replaced by `bib_metadata`. This function takes any argument to `citep` as before (including the new arguments); see the docs.
- Linked inline citations are now handled through the configuration: `cite_options(style = "markdown", hyperlink = "to.doc")` provides a link to the DOI or URL of the document, using markdown format.
- Support for cito and tooltips has been removed. These may be restored at a later date. (The earlier implementation did not appropriately abstract the use of these features from the style/formatting of printing the citation, making generalization hard.)
- `bibliography` now includes CSL support directly for entries with a DOI, using the `style =` argument. No need to provide a CSL file itself, just the name of the journal (or rather, the name of the corresponding csl file: full journal name, all lower case, spaces as dashes). See https://github.com/cboettig/knitcitations/issues/38
- `bibliography` formatting has otherwise been completely rewritten and no longer uses `print_rdfa` methods. rdfa output is no longer available, and other formats are controlled through `cite_options`. For formal publication, pandoc mode is recommended instead.
This version was developed on a separate branch (`v1`), and has only just been merged back into master. CRAN doesn’t like getting multiple updates in the same month or so, but hopefully waiting a bit longer will give users and me a chance to shake out bugs anyway. Meanwhile, grab it from Github with:
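Something like the usual devtools call should work (devtools must be installed; the repository name is taken from the issue link above):

```r
# install the development version of knitcitations from Github
devtools::install_github("cboettig/knitcitations")
```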
You can see this package in use, for instance, in providing dynamic citations for my RNeXML manuscript draft.
04 Jun 2014
Ben Bolker has an excellent post on this complex issue over at Dynamic Ecology, which got me thinking about writing my own thoughts on the topic in reply.
Google recently announced that it will be making its own self-driving cars, rather than modifying those of others: cars that won’t have steering wheels and pedals, just a button that says “stop.” What does this tell us about the future of user-friendly complex statistical software?
Ben quotes prominent statisticians voicing fears that echo common concerns about self-driving cars:
Andrew Gelman attributes to Brad Efron the idea that “recommending that scientists use Bayes’ theorem is like giving the neighbourhood kids the key to your F-16”.
I think it is particularly interesting and instructive that the quote Gelman attributes to Efron is about a mathematical theorem rather than about software (e.g. Bayes Theorem, not WinBUGS). Even relatively simple statistical concepts like \(p\) values can cause plenty of confusion, statistical package or no. The concerns are not unique to software, so the solutions cannot come through limiting access to software.
I am very wary of the suggestion that we should address concerns of appropriate application by raising barriers to access. Those arguments have been made about knowledge of all forms, from access to publications, to raw data, to things as basic as education and democratic voting.
There are many good reasons for not creating a statistical software implementation of a new method, but I argue here that fear of misuse just is not one of them.
- The barriers created by not having a convenient software implementation are not an appropriate filter to keep out people who might misinterpret or misuse the software. As you know, a fundamentally different skillset is required to program a published algorithm (say, MCMC) than to correctly interpret the statistical consequences.
We must be wary of a different kind of statistical machismo, in which we use the ability to implement a method by one’s self as a proxy for interpreting it correctly.
1a) One immediate corollary of (1) is that, like it or not, someone is going to build a method that is “easy to use”, i.e. remove the programming barriers.

1b) The second corollary is that individuals with an excellent understanding of the proper interpretation/statistics will frequently make mistakes in the computational implementation.
Both mistakes will happen. And both are much more formidable problems in the complex methodology of today than when computer was a job description.
So, what do we do? I think we should abandon the false dichotomy between “usability” and “correctness.” Just because software that is easy to use is easy to misuse does not imply that decreasing usability increases correctness. I think that is a dangerous fallacy.
A software implementation should aim first to remove programming barriers rather than statistical knowledge barriers. Best practices such as modularity and documentation should make it easy for users and developers to understand and build upon it. I agree with Ben that software error messages are poor teachers. I agree that a tool cannot be foolproof; no tool ever has been.
Someone does not misuse a piece of software merely because they do not understand it. Misuse comes from mistakenly thinking you understand it. The premise that most researchers will use something they do not understand just because it is easy to use is distasteful.
Kevin Slavin gives a fantastic TED talk on the ubiquitous role of algorithms in today’s world. His conclusion is neither panacea nor doom, but rather that we should seek to understand and characterize them, learning their strengths and weaknesses the way a naturalist studies a new species.
More widespread adoption of software such as BUGS & relatives has indeed increased the amount of misuse and false conclusions. But it has also dramatically increased awareness of issues ranging from computational aspects peculiar to particular implementations to general understanding and discourse about Bayesian methods. Like Kevin, I don’t think we can escape the algorithms, but I do think we can learn to understand and live with them.
30 May 2014
PLOS has posted an excellent update reflecting on their experiences a few months into their new data-sharing policy, which requires authors to include a statement of where the data can be obtained rather than promising it upon request. They do a rather excellent job of highlighting common concerns and offering well-justified and clearly explained replies where appropriate.
At the end of the piece they pose several excellent questions, which I reflect on here (mostly as a way of figuring out my own thoughts on these issues).
- When should an author choose Supplementary Files vs a repository vs figures and tables?
To me, repositories should always be the default. Academic repositories provide robust permanent archiving (such as CLOCKSS backup), independent DOIs to content, often tracking of use metrics, enhanced discoverability, clear and appropriate licenses, richer metadata, as well as frequently providing things like API access and easy-to-use interfaces. They are the Silicon Valley of publishing innovation today.
Today I think it is much more likely that some material is not appropriate for a ‘journal supplement’ rather than not being able to find an appropriate repository (enough are free, subject agnostic and accept almost any file types). In my opinion the primary challenge is for publishers to tightly integrate the repository contents with their own website, something that the repositories themselves can support with good APIs and embedding tools (many do, PLOS’s coordination with figshare for individual figures being a great example).
I’m not clear on “vs figures and tables”, as this seems like a content question of “What” should be archived rather than “Where” (unless it is referring to separately archiving the figures and tables of the main text, which sounds like a great idea to me).
- Should software/code be treated any differently from ‘data’? How should materials-sharing differ?
At the highest level I think it is possible to see software as a ‘type’ of data. Like other data, it needs appropriate licensing, a management plan, documentation/metadata, conformance to appropriate standards, and placement in appropriate repositories. Of course what counts as “appropriate” differs, but that is also true between other types of data. The same motivations for requiring data sharing (understanding and replicating the work, facilitating future work, increasing impact) apply.
I think we as a scientific community (or rather, many loosely federated communities) are still working out just how best to share scientific code and address the unique challenges that it raises. Traditional scientific data repositories are well ahead in establishing best practices for other data, but are rapidly working out approaches to code. The guidelines for the Journal of Open Research Software from the UK Software Sustainability Institute are a great example. (I’ve written on this topic before, such as what I look for in software papers and on the topic of the Mozilla Science code review pilot.)
I’m not informed enough to speak to sharing of non-digital material.
- What does peer review of data mean, and should reviewers and editors be paying more attention to data than they did previously, now that they can do so?
In as much as we are satisfied with the current definition of peer review for journal articles, I think this is a false dichotomy. Landmark papers, at least in my field, five or six decades ago (i.e. about as old as the current peer review system) frequently contained all the data in the paper (papers were longer and data was smaller). Somehow the data outgrew the paper and it just became okay to omit it, just as methods have gotten more complex and papers today frequently gloss over methodological details. The problem, then, is not one of type but one of scale: how do you review data when it takes up more than half a page of printed text?
The problem of scale is of course not limited to data. Papers frequently have many more authors than reviewers, often representing disparate and highly specialized expertise over possibly years of work; they may depend upon more than 100 citations and be accompanied by as many pages of supplemental material. To the extent that we’re satisfied with how reviewers and editors have coped with these trends, we can hope for the same for data.
Meanwhile, data transparency and data reuse may be more effective safeguards. Yes, errors in the data may cause trouble before they can be brought to light, just like bugs in software. But in this way they do eventually come to light, and that is somewhat less worrying if we view data the way we currently view publications (e.g. as fundamental building blocks of research) and publications the way we currently view data (e.g. as a means to an end, illustrated in the idea that it is okay to have mistakes in the data as long as they don’t change the conclusions). Jonathan Eisen has some excellent examples in which openly sharing the data led to rapid discovery and correction of errors that might have been difficult to detect otherwise.
- And getting at the reason why we encourage data sharing: how much data, metadata, and explanation is necessary for replication?
I agree that the “What” question is a crux issue, and one we are still figuring out community by community. There are really two issues here: what data to include, and what metadata (which to me includes any explanation or other documentation of the data) to provide for whatever data is included.
On appropriate metadata, we’ll never have a one-size-fits-all answer, but I think the key is to at least uphold current community best practices (best != mode), whatever they may be. Parts of this are easy: scholarly archives everywhere include basic Dublin Core Elements metadata such as title, author, date, subject and unique identifier, and most data repositories will attach this information in a machine-readable metadata format with minimal burden on the author (e.g. Dryad, or to a lesser extent, figshare). Many fields already have well-established and tested standards for data documentation, such as the Ecological Metadata Language, which helps ecologists document things like column names and units in an appropriate and consistent way without constraining how the data is collected or structured.
What data we include in the first place is more challenging, particularly as there is no good definition of ‘raw data’ (one person’s raw data being another person’s highly processed data). I think a useful minimum might be to provide any data shown in a figure or used in a statistical test that appears in the paper.
Journal policies can help most in each of these cases by pointing authors to the policies of repositories and to subject-specific publications on these best practices.
- A crucial issue that is much wider than PLOS is how to cite data and give academic credit for data reuse, to encourage researchers to make data sharing part of their everyday routine.
Again, I agree that credit for data reuse is an important and largely cultural issue. Certainly editors can play their part, as they already do in encouraging authors to cite the corresponding papers on the methods used, etc.
I think the cultural challenge is much greater for the “long tail” of content than it is for the most impactful data. I think most of the top-cited papers over the last two decades have been methods papers (or are cited for the use of a method that has become the standard of a field, often as software). As with any citation, there’s a positive feedback as more people become aware of it. I suspect that the papers announcing the first full genomes of commonly studied organisms (essentially data papers, though published in the most exclusive journals) did not lack citations. For data (or methods, for that matter) that cannot anticipate that level of reuse, the concern about appropriate credit is more real. Even if researchers can assume they will be cited by future reuse of their data, they may not feel that is sufficient compensation if it means one less paper to their name.
Unfortunately I think these are not issues unique to data publication but germane to academic credit in general. Citations, journal names, and so forth are not meaningless metrics, but very noisy ones. I think it is too easy to fool ourselves by looking only at cases where statistical averages are large enough to see the signal – datasets like the human genome and algorithms like BLAST we know are impactful, and the citation record bears this out. Really well cited papers or well-cited journals tend to coincide with our notions of impact, so it is easy to overestimate the fidelity of citation statistics when the sample size is much smaller. Besides, academic work is a high-dimensional creature not easily reduced to a few scalar metrics.
- And for long-term preservation, we must ask who funds the costs of data sharing? What file formats should be acceptable and what will happen in the future with data in obsolete file formats? Is there likely to be universal agreement on how long researchers should store data, given the different current requirements of institutions and funders?
I think these are questions for the scientific data repositories and the fields they serve, rather than the journals, and for the most part they are handling them well.
Repositories like Dryad have clear pricing schemes closely integrated with other publication costs, and at a price at least an order of magnitude below most journal publication fees they look like a bargain. (Not so if you publish in subscription journals, I hear you say. Well, I would not be surprised if we start seeing such repositories negotiate institutional subscriptions to cover the costs of their authors.)
I think the question of data formats is closely tied to that of metadata, as both are topics of archiving best practices. Scientific data repositories have usually put a lot of thought into these issues, weighing them against the needs and ease of use of the communities they serve. Journal data archiving policies can play their part by pointing authors to repository guidelines as well as to published articles from their community (such as the Nine Simple Ways paper by White et al.).
I feel the almost rhetorical question about ‘universal agreement’ is unnecessarily pessimistic. I suspect that much of the variance in recommendations for how long a researcher should archive their own work predates the widespread emergence of data repositories, which have vastly simplified the issue from when it was left up to each individual lab. Do we ask this question of the scientific literature? No, largely because many major journals have already provided robust long-term archiving through CLOCKSS/LOCKSS agreements. Likewise, scientific data repositories seem to have settled on indefinite archiving. It seems both reasonable and practical that data archiving be held to the same standard as the journal article itself. (Sure, there are lots of challenging issues to be worked out here; the key is only to leave them in the hands of those already leading the way and not re-invent the wheel.)
28 May 2014
I’m pretty happy with the way `rmarkdown` looks like it can pretty much replace my Makefile approach with a simple R command, `rmarkdown::render()`. Notably, a lot of the pandoc configuration can already go into the document’s yaml header (bibliography, csl, template, documentclass, etc.), avoiding any messing around with the Makefile.
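A header along these lines carries the pandoc options that would otherwise live in a Makefile; this is just a sketch, and the file names are placeholders:

```yaml
---
title: "Manuscript title"
author: "Carl Boettiger"
bibliography: refs.bib
csl: ecology.csl
documentclass: article
output:
  pdf_document:
    template: manuscript.latex
---
```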
Even more exciting is the pending RStudio integration with pandoc. This exposes the features of the `rmarkdown` package through the RStudio IDE buttons but, more importantly, seems like it will simplify the pandoc/latex dependency issues cross-platform.
In light of these developments, I wonder if I should separate my manuscripts from their corresponding R packages entirely (and/or treat them as vignettes?). I think it would be ideal to point people to a single `.Rmd` file and say “load this in RStudio,” rather than passing along a whole working directory.
The `rmarkdown::render` workflow doesn’t cover installing the dependencies or downloading a pre-built cache. I’ve been relying on the R package mechanism itself to handle dependencies, though I list all packages loaded by the manuscript but not needed by the package functions themselves under `Suggests`, as one would do with a vignette. Consequently, I’ve had to add an install.R script to my template to make sure these packages are installed before a user attempts to run the document. The install script feels like a bit of a hack, and makes me think that RStudio’s packrat may be what I actually want for this. So I finally got around to playing with packrat.
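Such an install.R script needs only a few lines; a minimal sketch, where the package list is a placeholder for whatever the manuscript actually loads:

```r
# install.R -- ensure manuscript-only (Suggests) dependencies are present
deps <- c("ggplot2", "knitr")  # placeholder list of Suggests packages
missing <- setdiff(deps, rownames(installed.packages()))
if (length(missing) > 0)
  install.packages(missing)
```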
Packrat isn’t yet on CRAN, and for an RStudio package I admit that it still feels a bit clunky. Having a single `packrat.lock` file (think `Gemfile.lock`, I suppose) seems like a great idea. Carting around the hidden `.Renviron` files and the `tar.gz` sources for all the dependencies (in `packrat.sources`) seems heavy and clunky, and logging in and out all the time feels like a hack.
- Am I really supposed to commit the `.tar.gz` files? packrat/issues/59 (Summary: an option is coming.)
- Do I really need to restart R? packrat/issues/60 (Summary: yes.)
The first discussion led to an interesting question: just how big are CRAN packages these days, anyhow? Thanks to this clever `rsync` trick from Duncan, I could quickly explore this:
```r
txt <- system("rsync --list-only cran.r-project.org::CRAN/src/contrib/ | grep .tar.gz",
              intern = TRUE)
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))
ans <- read.table(textConnection(txt),
                  colClasses = c("character", "num.with.commas", "Date", "character"))
ggplot(ans, aes(V3, V2)) + geom_point()
sum(ans$V2 > 1e6)
sum(ans$V2 / 1e6)
```
Note that there are 711 packages over 1 MB, for a total weight of over 2.8 GB. Not huge but more than you might want in a Git repo all the same.
Nevertheless, packrat works pretty well. Using a bit of a hack, we can just version-manage and ship the `packrat.lock` file and let packrat try to restore the rest:
```r
packrat::packify()
source(".Rprofile"); readRenviron(".Renviron")
packrat::restore()
source(".Rprofile"); readRenviron(".Renviron")
```
The `readRenviron` calls should really be restarts of R. I tried replacing this with calls to `Rscript -e "packrat::packify()"` etc., but that fails to find `packrat` on the second call. (Attempting to reinstall it doesn’t work either.)
Provided the sources haven’t disappeared from their locations on Github, CRAN, etc., I think this strategy should work just fine. More long-term, we would want to archive a tarball with the `packrat.sources`, perhaps downloading it from a script as I currently do with the cache archive.
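Such a script might look something like this; the URL is a placeholder assumption, not a real archive location:

```r
# Fetch and unpack an archived copy of the packrat sources, then restore
url <- "https://example.org/archives/packrat.sources.tar.gz"
download.file(url, "packrat.sources.tar.gz")
untar("packrat.sources.tar.gz")
packrat::restore()
```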
Debugging `R CMD check` reveals some pretty tricky behavior on R’s part: it wants to check the R code in my vignette even though it’s not building the vignette. It does this by tangling out the code chunks, which ignores in-line code. Not sure if this should count as a bug in knitr or in R, but it can’t be my fault ;-). See knitr/issues/784
With checks passing, have sent v0.6 to CRAN. Fingers crossed…
Milestones for version 0.7 should address the print formatting issues, hopefully as a new `citation_format` option and without breaking backwards compatibility.
07 May 2014
For a while now, most of my active research is developed through `.Rmd` scripts connected to a particular project repository (something I discuss at length in “deep challenges with knitr workflows”). In the previous post I discussed creating a template package with a more transparent organization of files, such as moving manuscripts from `inst/doc/` to simply `manuscripts/`. This left the exploratory analysis scripts in `inst/examples`, a similarly unintuitive place. Though I like having these scripts as part of the repository (which keeps everything for a project in one place, as it were), like the manuscript they aren’t really part of the R package, particularly as I have gotten better at creating proper unit tests in place of just rerunning dynamic scripts occasionally.
I’ve also been nagged by the idea of having to always just link to these nice dynamic documents from my lab notebook. Sure, Github renders the markdown so that it’s easy enough to see highlighted code and figures etc., but it still makes them seem rather external. Occasionally I would copy the complete `.md` file into a notebook post, but this divorced it of its original version history and associated commits. One option would be to move them all directly into my lab notebook, rather than just linking to the `.md` file on Github.
In the recent ropensci/docs project we are exploring a way to have Jekyll automatically compile (potentially with caching) a site that uses `.Rmd` posts and deploy to Github, all using travis, but we’re not quite finished, and this approach is potentially fragile, particularly with the hundreds of posts in this notebook. Besides this, the notebook structure is rather temporally oriented (posts are chronological, reflected in my URL structure), while these scripts are largely project-oriented. (Consistent use of categories and tags would ameliorate this.)
Embedding images
A persistent challenge has been how best to deal with images created by these scripts, some of which I may run many times. By default, knitr produces `png` images, which as binary files are ill suited to committing to Github and could bloat a repository rather quickly. For a long while I have used custom hooks to push these images to flickr (see flickr.com/cboettig), inserting the permanent flickr URL into the output markdown.
Recently Martin Fenner convinced me that `svg` files would both render more nicely across a range of devices (being vector graphics) and could easily be committed to Github, as they are text-based (XML) files, so reproducing the same image in repeated runs wouldn’t take up any more space. We can then browse a nice version history of any particular figure, and this also keeps all the output material together, making it easier to archive permanently (certainly nicer than my old archiving solution using data URIs). Lastly, `svg` is both web native, being a standard namespace of HTML5, and potentially interactive, as the SVGAnnotation R package illustrates. So, lots of advantages in using `svg`.

However, `svg` files also bring some unique challenges. Unlike `png` files added to Github, webpages cannot directly link them, since Github enforces rendering them as text instead of an image through its choice of HTML headers, for security reasons. This means the only way to link to an `svg` file on Github is to have that file on a `gh-pages` branch, where it can be rendered as part of a website. A distinct disadvantage of this approach is that while we can link to a specific version of any file on Github, we see only the most recent version rendered on the website created by a `gh-pages` branch.
On the other hand, having the `svg` files on the `gh-pages` branch further keeps down the footprint of the project’s `master` branch. This leads rather naturally to the idea that the `.Rmd` files and their `.md` outputs should also appear on the `gh-pages` branch, which removes them from their awkward home in `inst/examples`.
To provide a consistent look and feel, I merely copied over the `_includes` from my lab notebook, tweaking them slightly to use the assets already hosted there. I added a custom domain name for all my `gh-pages` as a sub-domain, `io.carlboettiger.info`1, and now instead of having script output appear like so:
I have the same page rendered on my own sub-domain, with its mathjax, disqus, matching css, URL, and nav elements.
An obvious extension of this approach is to grab a copy of the repository README, rename it index.md, and add a yaml header so that it serves as a landing page for the repository. A few lines of Liquid code can then generate the links to the other output scripts, as in this example:
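A minimal sketch of such a Liquid loop, assuming a standard Jekyll site on the gh-pages branch (the page fields shown are the defaults Jekyll exposes, not my exact code):

```liquid
<ul>
{% for p in site.pages %}
  <li><a href="{{ p.url }}">{{ p.title }}</a></li>
{% endfor %}
</ul>
```

Filtering on a page attribute (e.g. a `category` set in each page’s yaml header) keeps the listing to just the script outputs.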
I have added a
gh-pages branch with this setup to my new
template repository, with some more basic documentation and examples.
There’s no need to use a different sub-domain from the rest of my website, other than that using the same one would require my notebook be hosted on the cboettig.github.com repo instead of labnotebook. However, I prefer keeping my hosting on the repository I already have, and it also seems a bit unorthodox to host all my repositories on my main domain. In particular, it increases the chance of URL collisions if I create a repository with the same name as a page or directory on my website. Having gh-pages on the io sub-domain feels like just the right amount of separation to me.↩
06 May 2014
While I have made my workflow for most of my ongoing projects available on Github for some time, this does not mean that it has been particularly easy to follow. Further, as I move from project to project I have slowly improved how I handle them. For instance, I have since added unit tests (with
testthat) and continuous integration (with travis-ci) to my repositories, and my handling of manuscripts has gotten more automated, with richer latex templates, yaml metadata, and simpler and more powerful makefiles.
Though I have typically used my most recent project as a template for my next one (not so trivial as I work on several at a time), I realized it would make sense to just maintain a general template repo with all the latest goodies. I have now launched my template on Github.
I toyed with the idea of just treating the manuscript as a standard vignette, but this would make
pandoc an external dependency for the package, putting an unnecessary burden on
travis and users. I settled on creating a
manuscripts directory in the project root folder as the most semantically obvious place. This is added to
.Rbuildignore as it doesn’t fit the standard structure of an R package, but since it is not a vignette and cannot be built with the package dependencies anyhow, this seems to make sense to me.1
The manuscript itself is written in
.Rmd, with a
yaml header for the usual metadata of authors, affiliations, and so forth. Pandoc’s recent support for yaml metadata makes it much easier to use
.Rmd with a LaTeX template, making
.Rnw rather unnecessary. My template includes a custom
LaTeX template providing pandoc’s macros for inserting authors, affiliations, and so forth into the correct LaTeX elements, though pandoc’s default template is rather good and already has macros for most things in place (meaning you can merely declare the layout or font in the yaml header and magically see the tex reflect it).
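For illustration, such a yaml header might look roughly like this; the exact field names depend on the macros the LaTeX template defines, so treat these as placeholders rather than my template’s actual schema:

```yaml
---
title: "A descriptive manuscript title"
author:
- name: "First Author"
  affiliation: "Some University"
- name: "Second Author"
  affiliation: "Another Institute"
fontsize: 10pt
documentclass: article
---
```

Pandoc reads this metadata block from the top of the markdown file and substitutes the values into the template’s corresponding variables.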
I have tried to keep the
manuscripts directory relatively clean, placing
cache/ and other such files in a
components/ sub-directory. I have also tried to keep the
Makefile as platform-independent as possible by having it call little Rscripts (also housed in
components/) rather than command-line utilities like
sed -i and
wget that may not behave the same way on all platforms.
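A sketch of what I mean, with illustrative file names: each rule shells out to Rscript so the real logic lives in R, which behaves the same across platforms:

```makefile
manuscript.md: manuscript.Rmd
	Rscript -e 'knitr::knit("manuscript.Rmd")'

manuscript.pdf: manuscript.md
	pandoc manuscript.md --template=components/template.tex -o manuscript.pdf
```

Anything more involved (downloading a file, editing text in place) goes into a small script under components/ rather than into the Makefile itself.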
Lastly, Ryan Batts recently convinced me that providing binary cache files of results was an important way to allow a reader to quickly engage in exploring an analysis without having to first let potentially long-running code execute.
knitr provides an excellent way to create and manage this caching on a chunk-by-chunk level, which is also crucial when editing a dynamic document with intensive code (no one wants to rerun your MCMC just to rebuild the pdf). Since git/Github seems like a poor option for distributing binaries, I have for the moment just archived the cache on a (university) web server and added a Make/Rscript line that can restore it from that location. Upon publication this cache could be permanently archived (along with plain text tables of the graphs) and then installed from that archive instead.
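The restore step need only be a couple of lines of R; a sketch, with a hypothetical URL and paths standing in for the real archive location:

```r
# Restore the knitr cache from a web archive so cached chunks are reused
# rather than re-run. URL and paths below are placeholders, not the real ones.
cache_url <- "https://example.edu/archive/project-cache.tar.gz"
download.file(cache_url, destfile = "cache.tar.gz")
untar("cache.tar.gz", exdir = "manuscripts/components")
```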
I have also added a separate README in the manuscripts directory to provide some guidance to a user seeking to build the manuscript.
Perhaps I should not have the manuscript on the master branch at all, but putting it on another branch would defeat the purpose of having it in an obviously-named directory of the repository home page, where it is easiest to discover.↩
05 May 2014
We often discuss dynamic documents such as
knitr in reference to final products such as publications or software package vignettes. In this case, all the elements involved are already fixed: external functions, code, text, and so forth. The dynamic documentation engine is really just a tool to combine them (knit them together). Using dynamic documentation on a day-to-day basis on ongoing research presents a compelling opportunity but a rather more complex challenge as well. The code base grows, some of it gets turned into external custom functions where it continues to change. One analysis script branches into multiple that vary this or that. The text and figures are likewise subject to the same revision as the code, expanding and contracting, or being removed or shunted off into an appendix.
Structuring a dynamic document when all the parts are morphing and moving is one of the major opportunities for the dynamic approach, but also the most challenging. Here I describe some of those challenges along with various tricks I have adopted to deal with them, mostly in hopes that someone with a better strategy might be inspired to fill me in.
The old way
For a while now I have been using the knitr dynamic documentation/reproducible research software for my project workflow. Most discussion of dynamic documentation focuses on ‘finished’ products such as journal articles or reports. Over the past year, I have found the dynamic documentation framework to be particularly useful as I develop ideas, and remarkably more challenging to then integrate into a final paper in a way that really takes advantage of its features. I explain both in some detail here.
My former workflow followed a pattern no doubt familiar to many:
- Bash away in an R terminal, paste useful bits into an R script…
- Write manuscript separately, pasting in figures, tables, and in-line values returned from R.
This doesn’t leave much of a record of what I did or why, which is particularly frustrating when some discussion reminds me of an earlier idea.
When I begin a new project, I now start off writing a
.Rmd file, intermixing notes to myself and code chunks. Chunks break up the code into conceptual elements, and markdown gives me a more expressive way to write notes than comment lines do. Output figures, tables, and in-line values are inserted automatically. So far so good. I version manage this creature in git/Github. Great, now I have a trackable history of what is going on, and all is well:
- Document my thinking and code as I go along in a single-file scratch-pad
- Version-stamped history of what I put in and what I got out at each step of the way
- Rich markup with equations, figures, and tables embedded
- Caching of script chunks, allowing me to tweak and rerun an analysis without having to execute the whole script. While we can of course duplicate that behavior with careful save and load commands in a script, in knitr this comes for free.
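The caching just needs the chunk option to be set; a sketch, where the chunk label and the long-running function are hypothetical:

````
```{r model-fit, cache=TRUE}
fit <- run_mcmc(df, iter = 1e6)  # hypothetical long-running step, cached
```
````

On later knits, this chunk is re-executed only if its code changes; otherwise the cached `fit` object is loaded from disk.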
Limitations to .Rmd alone
As I go along, the .Rmd file starts getting too big and cluttered to easily follow the big picture of what I’m trying to do.
Before long, my investigation branches. Having followed one .Rmd script to some interesting results, I start a new .Rmd script representing a new line of investigation. This new direction will nevertheless want to re-use large amounts of code from the first file.
A solution? The R package “research compendium” approach
I start abstracting tasks performed in chunks into functions, so I can re-use these things elsewhere, loop over them, and document them carefully somewhere I can reference that won’t be in the way of what I’m thinking. I start to move these functions into the R/ directory of an R package structure, documenting with
Roxygen. I write unit tests for these functions (in
inst/tests) to have quick tests to check their sanity without running my big scripts (recent habit). The package structure helps me:
- Reuse the same code between two analyses without copy-paste or getting out of sync
- Document complicated algorithms outside of my working scripts
- Test complicated algorithms outside of my working scripts (
devtools::check and/or unit tests)
- Manage dependencies on other packages (DESCRIPTION, NAMESPACE), including other projects of mine
This runs into trouble in several ways.
Problem 1: Reuse of code chunks
What to do with code I want to reuse across blocks but do not want to write as a function, document, or test?
Perhaps this category of problem doesn’t exist, except in my laziness.
This situation arises all the time, usually through the following mechanism: almost any script performs several steps that are best represented as chunks calling different functions, such as
plot_fits, etc. I then want to re-run almost the same script but with a slightly different configuration (such as a different data set, or extra iterations at fixed parameters). For just a few such cases it doesn’t make sense to write these into a single function,1 so instead I copy the script to a new file and make the changes there.
This is great until I want to change something about the way both scripts behave that cannot be handled just by changing the
R/ functions they share. Plotting options are a good example of this (I tend to avoid wrapping
ggplot calls as separate functions, as it seems to obfuscate what is otherwise a rather semantic and widely recognized, if sometimes verbose, function call).
I have explored using
knitr’s support for external chunk inclusion, which allows me to maintain a single R script with all commonly used chunks, and then import these chunks into multiple
.Rmd files. An example of this can be seen in my
nonparametric-bayes repo, where several files (in the same directory) draw most of their code from external-chunks.R.
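The mechanism, roughly (file name and chunk labels here are illustrative, not the repo’s actual ones): the shared script delimits chunks with labeled comments,

```r
## In external-chunks.R, chunks are marked off by "## ---- label ----" comments:

## ---- fit-model ----
fit <- fit_model(df)   # hypothetical shared fitting step

## ---- plot-fits ----
plot_fits(fit)         # hypothetical shared plotting step
```

and each .Rmd file calls `knitr::read_chunk("external-chunks.R")` in a setup chunk, after which an empty chunk whose name matches a label (e.g. `plot-fits`) executes the corresponding code.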
Problem 2: package-level reproducibility
Minor/relatively easy to fix.
Separate files can frustrate reproducibility of a given commit. As I change the functions in the package’s R/ directory, an unchanged .Rmd file can give different results (or fail to reflect changes, because cached chunks do not recognize that the function definitions have changed underneath them). Git provides a solution to this: since the
.Rmd file lives in the same git repository (
inst/examples) as the package, I can make sure the whole repository matches the hash of a given commit, and install the package at that hash with install_github("packagename", "cboettig", "hash").
This solution is not fail-safe: the installed version, the potentially uncommitted (but possibly installed) version of the R functions in the working directory, and the R functions present at the commit of the
.Rmd file (and thus matching the hash) could all be different. If we commit and install before every
knit, we can avoid these potential errors (at the cost of some computational overhead), restoring reproducibility to the chain.
Problem 3: Synthesizing results into a manuscript
In some ways this is the easiest part, since the code-base is relatively static and it is just a matter of selecting which results and figures to include and what code is necessary to generate it. A few organizational challenges remain:
While we generally want
knitr code chunks for the figures and tables that will appear, we usually aren’t interested in displaying much, if any, of the actual code in the document text (unlike the examples until this point, where this was a major advantage of the knitr approach). In principle, this is as simple as setting
echo=FALSE in the global chunk options. In practice, it means there is little benefit to having the chunks interwoven in the document. What I tend to want is to have all the chunks run at the beginning, so that any variables or results can easily be added (and their appearance tweaked by editing the code) as figure chunks or in-line expressions. The only purpose of maintaining chunks instead of a simple script is the piecewise caching of chunk dependencies, which can help debugging.
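Suppressing the code globally is a one-liner in a setup chunk; this uses knitr’s standard chunk-option interface:

```r
library(knitr)
# Hide code (and console chatter) in every chunk; individual chunks
# can still override these options locally.
opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```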
Since displaying the code is suppressed, we are then left with the somewhat ironic challenge of how best to present code as a supplement. One option is simply to point to the source
.Rmd, another is to use the
tangle() option to extract all the code as a separate
.R file. In either case, the user must also identify the correct version of the R package itself for the external code to run against.
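In knitr the tangle step is `purl()`; a sketch with an illustrative file name:

```r
library(knitr)
# Extract all code chunks from the manuscript into a standalone R script
purl("manuscript.Rmd", output = "manuscript.R")
```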
Problem 4: Branching into other projects
Things get most complicated when projects begin to branch into other projects. In an ideal world this is simple: a new idea can be explored on a new branch of the version control system and merged back in when necessary, and an entirely new project can be built as a new R package in a different repo that depends on the existing project. After several examples of each, I have learned that it is not so simple. Despite the nice tools, I’ve learned I still need to be careful in managing my workflows in order to leave behind material that is understandable, reproducible, and reflects clear provenance. So far, I’ve learned this the hard way. I use this last section of the post to reflect on two of my own examples, as writing this helps me work through what I should have done differently.
example: warning-signals project
For instance, my work on early warning signals dates back to the start of my open notebook on openwetware, when my code lived on a Google code page which seems to have disappeared. (At the time it was part of my ‘stochastic population dynamics’ project.) When I moved to Github, this project got its own repository, warningsignals, though after a major re-factoring of the code I moved to a new repository, earlywarning. Okay, so far that was due to me not really knowing what I was doing.
My first paper on this topic was based on the master branch of that repository, which still contains the required code. When one of the R dependencies was removed from CRAN, I was able to update the codebase to use a replacement package (see issue #10). Even before that paper appeared I had started exploring other issues on different branches, with the prosecutor branch eventually becoming its own paper, and then its own repository.
That paper sparked a comment letter in response to it, and the analysis involved in my reply piece was just developed on the same master branch of the prosecutor-fallacy repository. This leaves me with a total of three repositories across four branches, with one repo that corresponds more-or-less directly to a paper, one to two papers, and one to no papers.
All four branches have diverged into unmergeable code. Despite sharing and reusing functions across these projects, I often found it easier to simply change a function on the new branch or in the new repo as the new work required. These changes could not easily be merged back, as they broke the original function calls of the earlier work.
Hindsight being 20-20, it would have been preferable to maintain one repository, perhaps developing each paper on a different branch and clearly tagging the commit corresponding to the submission of each publication. Ideally these could be merged back into a master branch where possible. Tagged commits provide a more natural solution than unmerged branches for dealing with changes to the package that would break methods from earlier publications.
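The tagging itself is cheap; a sketch, demonstrated in a throwaway repository (the tag name, file, and commit message are all illustrative):

```shell
set -e
# Demo inside a temporary repo so the commands are self-contained
tmp=$(mktemp -d); cd "$tmp"
git init -q
git config user.email "demo@example.com"; git config user.name "demo"
echo "analysis" > ms.Rmd
git add ms.Rmd && git commit -qm "version submitted for review"
# Tag the exact commit that was submitted
git tag -a submission-v1 -m "Tag the submitted version"
# Later, recover exactly that version and confirm where we are:
git checkout -q submission-v1
git describe --tags
```

Because the tag pins a commit rather than a branch tip, later breaking changes on master never disturb what the tag points to.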
example: optimal control projects
A different line of research began through a NIMBioS working group called “Pretty Darn Good Control”, beginning its digital life in my pdg_control repository. Working in different break-out groups, as well as further investigation on my own, soon created several different projects. Some of these have continued toward publication, others have terminated in dead ends, and still others have become completely separate lines of work. Later work I have done in optimal control, such as nonparametric-bayes and multiple_uncertainty, depends on this package for certain basic functions, though both also contain their own diverged versions of functions that first appeared in pdg_control.
Because the topics are rather different and the shared code footprint is quite small, separate repositories probably make more sense here. Still, managing code dependencies across separate repositories requires extra care, as checking out the right version of the focal repository does not guarantee that one will also have the right version of the pdg_control repository. Ideally I should note the hash of pdg_control on which I depend, and preferably install that package at that hash (easy enough thanks to devtools), since depending on a separate project that is also still changing can be troublesome. Alternatively it might make more sense to just duplicate the original code and remove this potentially frail dependency. After all, documenting the provenance need not rely on the dependency, and it is more natural to think of these separate repos as divergent forks.
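A sketch of pinning that dependency with devtools, using the same call style as above (the hash below is a placeholder, not a real commit):

```r
library(devtools)
# Install pdg_control at the exact commit this analysis depends on,
# rather than whatever the branch tip happens to be.
# "abc1234" is a placeholder hash for illustration.
install_github("pdg_control", username = "cboettig", ref = "abc1234")
```

Recording that hash alongside the analysis makes the dependency reproducible even as pdg_control continues to change.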
If I have a lot of different configurations, it may make sense to wrap all these steps into a single function that takes input data and/or parameters as its arguments and outputs a data frame with the results and inputs.↩
04 May 2014
For the past four years I have made an effort to sign all my reviews (which I try to keep to about one a month). It isn’t because I believe in radical openness or anything like that. It’s really just self-interest – at least mostly. Writing a review is an incredibly time-consuming and largely thankless task. Anonymous peer review is supposed to protect the reviewer, particularly in the scenario of a less established scientist critiquing the work of a more established one. I am sure it occasionally serves that purpose. On the other hand, that very scenario can be the most profitable time to sign a review. Really, when are you more likely to get an esteemed colleague to closely read your every argument than when you’re holding up their publication?
The possibility of a vindictive and powerful author sounds daunting, but it is rather inconsistent with my impression of most scientists, who are more apt to be impressed by an intelligent even if flawed critique than by simple praise. I find it hardest to sign a review in which I have found very little constructive criticism to offer, though after a decade of being trained to critique science one can always find something. (Of course signing can be hard on the occasional terrible paper for which it is hard to offer much constructive criticism, but fortunately that has been very rare.) Both authors and other reviewers (who are sometimes sent the other reviews, a practice I find very educational as a reviewer) have on occasion commented on or complimented my reviews, or acknowledged me in their papers, suggesting that the practice does indeed provide some simple recognition. At times, it may sow seeds for future collaboration.
Signing my reviews has on occasion given the author a chance to follow up with me directly. While I’m not certain about journal policies in this regard, I suspect we can assume that we’re all adults capable of civil discussion. In any event, a phone call or even a few back-and-forth emails can be immensely more efficient in allowing an author to clarify elements that I have sometimes misunderstood or been unable to follow from the text, as well as in making it easier to communicate my difficulties with the paper. In my experience this has resulted in both faster and more satisfactory resolutions of issues, letting some papers be published more quickly and without as many tedious rounds of revision. Given that many competitive journals simply cut off papers that might otherwise succeed with a bit more dialog between reviewer and author, because multiple “revise and resubmits” put too much demand on editors, this seems like a desirable outcome for all involved. I’m not suggesting that such direct dialog is always desirable, only that many of us have no doubt been in a position where a little dialog might have resolved issues more satisfactorily.