Lab Notebook

Coding

  • cboettig pushed to master at cboettig/labnotebook: notes random thoughts 07:30 2014/07/24
  • cboettig pushed to master at cboettig/pdg_control: data from supp figures updated manuscript, draft sent to paul 06:19 2014/07/24
  • cboettig starred nimble-dev/nimble: 04:52 2014/07/24
  • cboettig commented on issue swcarpentry/bc#524: Sorry I don't think I quite got what you wanted to do. Using base.dir and base.url etc should be able to handle knitting from a different location, … 03:27 2014/07/24
  • cboettig commented on issue swcarpentry/bc#524: All this is possible just by configuring knitr appropriately. See package options like base.dir and base.url: http://yihui.name/knitr/options#package… 12:44 2014/07/24

Reading

  • A generalized perturbation approach for exploring stock recruitment relationships: Theoretical Ecology (2014). Justin D. Yeakel, Marc Mangel et al. 10:44 2014/07/09
  • Temporal variability of forest communities: empirical estimates of population change in 4000 tree species: Ecology Letters (2014). Pages: n/a-n/a. Ryan A. Chisholm, Richard Condit, K. Abd. Rahman, Patrick J. Baker, Sarayudh Bunyavejchewin, Yu-Yun Chen, George Chuyong, H. S. Dattaraja, Stuart Davies, Corneille E. N. Ewango, C. V. S. Gunatilleke, I. A. U. Nimal Gunatilleke, Stephen Hubbell, David Kenfack, Somboon Kiratiprayoon, Yiching Lin, Jean-Remy Makana, Nantachai Pongpattananurak, Sandeep Pulla, Ruwan Punchi-Manage, Raman Sukumar, Sheng-Hsin Su, I-Fang Sun, H. S. Suresh, Sylvester Tan, Duncan Thomas, Sandra Yap et al. 09:51 2014/07/09
  • The importance of individual developmental variation in stage-structured population models: Ecology Letters (2014). Pages: n/a-n/a. Perry de Valpine, Katherine Scranton, Jonas Knape, Karthik Ram, Nicholas J. Mills et al. 09:51 2014/07/09
  • A semiparametric Bayesian method for detecting Allee effects: Ecology (2013). Volume: 94, Issue: 5. Pages: 1196-1204. Masatoshi Sugeno, Stephan B Munch et al. 09:51 2014/07/09

Entries

Notes

21 Jul 2014

Reading

I was looking for this in my notes and couldn’t find it: an excellent paper from the Software Sustainability Institute and friends outlining the need for, and a possible structure for, sustainable career paths for software developers. “The research software engineer”, Dirk Gorissen. Provides a good response to the software issues highlighted by climategate, etc.

Remote conferencing

(Based on earlier unposted notes.) With so much going on, it’s nice to be able to follow highlights from some conferences remotely.

Misc code-tricks

For a question raised during the Mozilla sprint: had to remember how to write custom hooks for knitr (e.g. for kramdown compatibility):

# Wrap output, warnings, errors, and messages in plain kramdown ~~~ fences
hook.t <- function(x, options) paste0("\n\n~~~\n", paste0(x, collapse="\n"), "~~~\n\n")
# Wrap source chunks in a fence tagged with the chunk's language
hook.r <- function(x, options) {
  paste0("\n\n~~~ ", tolower(options$engine), "\n", paste0(x, collapse="\n"), "\n~~~\n\n")
}
knitr::knit_hooks$set(source=hook.r, output=hook.t, warning=hook.t,
                      error=hook.t, message=hook.t)




UPS And Data Vs Optimal Control

21 Jul 2014

Random idea for possible further exploration:

The use of ‘big data’ by UPS to perform lots of small efficiency gains seems to be everybody’s favorite example (NPR, The Economist). During a fairly typical talk yesterday on applications of optimal control to ecological conservation, I couldn’t help thinking back to that story. The paradigm shift is not so much the kind or amount of data being used as it is the control levers themselves. As the Economist (rightly) argues, everyone typically assumes that a few principal actions are responsible for 80% of the possible improvement.

Optimal control tends to focus on these big things, which are also usually particularly thorny optimizations. Most of the classic textbook hard optimization problems could have come right from the UPS case: the traveling salesman, the inventory packing/set cover problems, and so forth. Since these are impossible to solve exactly on large networks, approximate dynamic programming approaches have become the standard work-around. Yet the “Big Data” approach takes a rather different strategy altogether, tackling many small problems instead of one big one. Our typical approach of theoretical abstraction to simple models is designed to focus on these big overarching problems. In abstracting the problem, we focus on the big-picture stuff that should matter most – stuff like figuring out the optimal route to travel, and so forth. But when the gains from further optimizing these things are marginal, focusing on the “other 20%” can make more sense. However, that means abandoning the abstraction and going back to the original messy problem. It means knowing about all the other little levers and switches we can control. In the UPS context, this means thinking about how many times a truck backs up, or idles at a stop light, or what hand the deliveryman holds the pen in. Given both the data and the ability to control so many of these little things, optimizing each one individually can be more valuable than focusing on the big abstract optimizations.

So, does this work only once the heuristic solutions to the big problems are nearly optimal, so improved approximations have very limited gains? Or can this also be a route forward when the big problems are primarily intractable as well? The former certainly seems the more likely, but if the latter is true, it could prove very interesting.

So this got me thinking – if we accept the latter premise, we find a case closely analogous to the very messy optimizations we face in conservation decision-making. Could the many little levers be an alternative? It’s unlikely, given both the need for the kind of arbitrarily detailed data available at almost no cost in the UPS problem, and the kind of totalitarian control UPS can apply to all the little levers, while the conservation problem more frequently has nothing but a scrawny blunt stick to toggle in the first place. Nevertheless, it’s hard to know what possible gains we have already excluded when we focus only on the big abstractions and the controls relevant to them. Could conservation decision-making think more outside the box about the many little things we might be able to influence more effectively?




Notes

18 Jul 2014

CRAN trivia

When should you bump version for a (rejected) resubmission?

Once accepted, any change other than to the metadata (essentially the DESCRIPTION file) needs an increased version. For submissions, we prefer (but do not insist on) a new number for each attempt.

–Prof Brian Ripley

Using non-CRAN repositories in SUGGESTS or ENHANCES

More dubious tricks, from Yihui:

FYI, here is how R core checks dependencies: https://github.com/wch/r-source/blob/trunk/src/library/tools/R/QC.R#L5195

Because I know this, sometimes I intentionally use something like (function(pkg) library(pkg, character.only = TRUE))("foo") to silence R CMD check and cheat when I (optionally) need a package but do not want CRAN maintainers to know it.
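
Spelled out as runnable code, the trick looks like the following sketch (“foo” is a placeholder for whatever optional package you don’t want to declare):

# Indirect call: the literal text "library(foo)" never appears in the source,
# so the static dependency scan in QC.R does not flag the undeclared package.
# "foo" is a placeholder package name and must be installed for this to run.
(function(pkg) library(pkg, character.only = TRUE))("foo")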

Package maintenance

  • knitcitations 0.1.1 on CRAN now.

  • See RNeXML check results: http://cran.r-project.org/web/checks/check_results_RNeXML.html

  • RNeXML: submitted a series of patches that allow tests not to fail when external resources (packages, web APIs) are not available.

  • rfigshare updated #84. Occasional errors were triggered by the API failing, so most tests are now skipped if the authentication call fails. (Reworked authentication a bit.)

  • bugfix: knitcitations/#63

Misc

Taking a look at AUTO for bifurcation diagrams (ht Noam, who is using the XPP wrapper).




Notes On Tricks In Manuscript Submission And Collaboration

10 Jul 2014

Some thoughts on collaborating with markdown / dynamic document workflow

Collaborating on manuscripts with other researchers has been a perpetual nuisance for those of us not writing in MS Word (no doubt the Word users might say the same). When I first began writing papers I worked in LaTeX, and at that time I collaborated largely with others who already knew TeX (e.g. my mentors), so this wasn’t much of a problem. While moving my workflow into markdown has simplified collaborations with (often junior) researchers who know markdown better than TeX, it has made the potential for mismatches even greater, as it creates a barrier both for co-authors working in LaTeX and for those working in Word.

This has always been more of a nuisance than a real problem. I’ve usually just sent co-authors some derived copy (e.g. a pdf, or sometimes creating a Word or TeX document from the markdown using pandoc and sending that). This means I have to transcribe or at least copy and paste the edits, though that’s never all that time consuming a process.

I still have high hopes that RStudio’s rmarkdown format will make it practical for co-authors to edit and compile my .Rmd files directly. Meanwhile, a mentor who frequently uses LaTeX when collaborating with Word users suggested a much simpler solution that has proven very practical for me so far.

A simple solution

Based on his suggestion, I just paste the contents of the .Rmd file into a Word (well, LibreOffice) document and send that. Most collaborators can just ignore the code blocks and LaTeX equations, etc, and edit the text directly. I send the compiled pdf as well for the figures and rendered equations. A collaborator cannot easily re-compile their changes, but I can by copy-pasting back into the .Rmd file. They can track changes via Word, and I can track the same changes through the version control when I paste their changes back in. It’s not perfect, but it’s simple.

Challenges in submission systems

Transparent figures

I use semi-transparent graphs to show ensemble distributions of stochastic trajectories. First, a quick example as to why:

library(ggplot2)
# sims_data and colorkey come from simulation code not shown in this excerpt
ggplot(sims_data) +
  geom_line(aes(time, fishstock, group=interaction(reps,method), color=method), alpha=0.1) +
  scale_colour_manual(values=colorkey, guide = guide_legend(override.aes = list(alpha = 1))) +
  facet_wrap(~method) + guides(legend.position="none")

Statistical summaries could avoid the need for transparency, but they don’t really reflect the rather binary nature of the ensemble (that some trajectories go to zero while others remain around a constant level):

ggplot(sims_data) +
  stat_summary(aes(time, fishstock), fun.data = "mean_sdl", geom="ribbon", fill='grey80', col='grey80') +
  stat_summary(aes(time, fishstock, col=method), fun.y = "mean", geom="line") +
#  geom_line(aes(time, fishstock, group=interaction(reps,method), color=method), alpha=0.1) +
  scale_colour_manual(values=colorkey, guide = guide_legend(override.aes = list(alpha = 1))) +
  facet_wrap(~method) + guides(legend.position="none")

Tweaking the statistical definitions can remove the more obvious errors from this, but still give the wrong impression:

mymin <- function(x) mean(x) - sd(x)
ggplot(sims_data) +
  stat_summary(aes(time, fishstock), fun.y = "mean", fun.ymin = "mymin", fun.ymax="max", geom="ribbon", fill='grey80', col='grey80') +
  stat_summary(aes(time, fishstock, col=method), fun.y = "mean", geom="line") +
#  geom_line(aes(time, fishstock, group=interaction(reps,method), color=method), alpha=0.1) +
  scale_colour_manual(values=colorkey, guide = guide_legend(override.aes = list(alpha = 1))) +
  facet_wrap(~method) + guides(legend.position="none")

Technical challenges in submitting transparent figures

All this works great with R + ggplot2, whether generating the svg versions shown here or the pdfs for the final manuscript. Try to upload those pdfs to manuscriptcentral though (or arXiv, actually) and they will render improperly or not at all. Whatever happened to the “portable” in “portable document format”? (Note that submission systems can take EPS/PS, though we’d need to change our LaTeX flavor for it, and R’s graphic devices for those formats don’t seem to support transparency.)

It seems the problem for pdfs arises from different pdf versions (thanks to Rich FitzJohn for figuring this out; I would never have managed). Transparency is natively supported in pdf >= 1.4, while in earlier versions it is just emulated. R can generate pdfs in version 1.3 (using dev.args = list(version="1.3") as the knitr chunk option), but unfortunately ggplot promotes pdfs to version 1.4:

Saving 7 x 6.99 in image
Warning message:
In grid.Call.graphics(L_lines, x$x, x$y, index, x$arrow) :
  increasing the PDF version to 1.4

I’m not quite clear what part of the ggplot command triggers this, as some ggplot figures do render in version 1.3. To add one more gotcha, RStudio’s rmarkdown by default runs pdfcrop which also promotes the pdfs to version 1.4.
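
For reference, the chunk option mentioned above can also be set globally from a setup chunk; a minimal sketch:

# Ask knitr to use the pdf device at version 1.3 for every chunk; per chunk,
# the equivalent is dev="pdf", dev.args=list(version="1.3") in the chunk header.
knitr::opts_chunk$set(dev = "pdf", dev.args = list(version = "1.3"))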

It seems that pdf 1.5 works, however – opening the v1.4 pdf in Inkscape and saving as v1.5 seems to do the trick. (The current pdf standard is 1.7, though R’s pdf device supports only up to 1.4.) This is the route I took for the time being with manuscriptcentral, though it is frustrating that it requires a step external to the rmarkdown::render process.

They can also take TIFF for graphics (though (pdf)LaTeX can’t). I suppose one could submit jpg/png images as supplementary files for the tex compilation, which would be a workable if annoying solution, using rasters where a vector graphic is preferred (and much smaller – I can’t understand why manuscriptcentral takes 10^4 times as long to upload a document as arXiv or other platforms).




Knitcitations Updates

12 Jun 2014

Used some down-time while traveling to hammer out a long overdue update to my knitcitations package.

My first task involved a backwards-compatible update fixing a few minor issues (see NEWS) and providing pandoc-style inline citations: v0.6-2, on CRAN.

I followed this with a ground-up rewrite, as I summarize in NEWS:

v1.0-1

This version is a ground-up rewrite of knitcitations, providing a more powerful interface while also streamlining the back end, mostly by relying more on external libraries for the knitty gritty. While an effort has been made to preserve the most common uses, some lesser-used functions or function arguments have been significantly altered or removed. Bug reports greatly appreciated.

  • citet/citep now accept more options. In addition to the four previously supported options (DOI, URL, bibentry or bibkey (of a previously cited work)), these now accept a plain text query (used in a CrossRef Search), or a path to a PDF file (which attempts metadata extraction).

  • Citation key generation is now handled internally, and cannot be configured just by providing a named argument to citet/citep.

  • The cite function is replaced by bib_metadata. This function takes any argument to citet/citep as before (including the new arguments), see docs.

  • Linked inline citations now use the configuration: cite_options(style="markdown", hyperlink="to.doc") provides a link to the DOI or URL of the document, using markdown format.

  • Support for cito and tooltip has been removed. These may be restored at a later date. (The earlier implementation did not appropriately abstract the use of these features from the style/formatting of printing the citation, making generalization hard.)

  • bibliography now includes CSL support directly for entries with a DOI using the style= argument. No need to provide a CSL file itself, just the name of the journal (or rather, the name of the corresponding csl file: full journal name, all lower case, spaces as dashes). See https://github.com/cboettig/knitcitations/issues/38 and the usage sketch after this list.

  • bibliography formatting has otherwise been completely rewritten, and no longer uses print_markdown, print_html, and print_rdfa methods. rdfa is no longer available, and other formats are controlled through cite_options. For formal publication pandoc mode is recommended instead of bibliography.
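
A usage sketch pulling a few of these pieces together (the DOI is a placeholder and the journal style name is only illustrative):

library(knitcitations)
cite_options(style = "markdown", hyperlink = "to.doc")
citep("10.1000/example-doi")     # cite by DOI (placeholder value)
bibliography(style = "ecology")  # CSL style named after the journal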

This version was developed on a separate branch (v1), and has only just been merged back into master. CRAN doesn’t like getting multiple updates in the same month or so, but hopefully waiting a bit longer will give users and me a chance to shake out bugs anyway. Meanwhile, grab it from github with:

devtools::install_github("cboettig/knitcitations@v1")

You can see this package in use, for instance, in providing dynamic citations for my RNeXML manuscript draft.




Is statistical software harmful?

04 Jun 2014

Ben Bolker has an excellent post on this complex issue over at Dynamic Ecology, which got me thinking about writing my own thoughts on the topic in reply.


Google recently announced that it will be making its own self-driving cars, rather than modifying those of others. Cars that won’t have steering wheels and pedals. Just a button that says “stop.” What does this tell us about the future of user-friendly complex statistical software?

Ben quotes prominent statisticians voicing fears that echo common concerns about self-driving cars:

Andrew Gelman attributes to Brad Efron the idea that “recommending that scientists use Bayes’ theorem is like giving the neighbourhood kids the key to your F-16”.

I think it is particularly interesting and instructive that the quote Gelman attributes to Efron is about a mathematical theorem rather than about software (e.g. Bayes Theorem, not WinBUGS). Even relatively simple statistical concepts like \(p\) values can cause plenty of confusion, statistical package or no. The concerns are not unique to software, so the solutions cannot come through limiting access to software.

I am very wary of the suggestion that we should address concerns of appropriate application by raising barriers to access. Those arguments have been made about knowledge of all forms, from access to publications, to raw data, to things as basic as education and democratic voting.

There are many good reasons for not creating a statistical software implementation of a new method, but I argue here that fear of misuse just is not one of them.

  1. The barriers created by not having a convenient software implementation are not an appropriate filter to keep out people who can misinterpret or misuse the software. As you know, a fundamentally different skillset is required to program a published algorithm (say, MCMC) than to correctly interpret the statistical consequences.

We must be wary of a different kind of statistical machismo, in which we use the ability to implement a method by one’s self as a proxy for interpreting it correctly.

1a) One immediate corollary of (1): like it or not, someone is going to make the method “easy to use”, i.e. remove the programming barriers.

1b) The second corollary: individuals with an excellent understanding of the proper interpretation/statistics will frequently make mistakes in the computational implementation.

Both mistakes will happen. And both are much more formidable problems in the complex methodology of today than when computer was a job description.

So, what do we do? I think we should abandon the false dichotomy between “usability” and “correctness”. Just because software that is easy to use is easy to misuse does not imply that decreasing usability increases correctness. I think that is a dangerous fallacy.

A software implementation should aim first to remove the programming barriers rather than the statistical knowledge barriers. Best practices such as modularity and documentation should make it easy for users and developers to understand and build upon it. I agree with Ben that software error messages are poor teachers. I also agree that a tool cannot be foolproof; no tool ever has been.

Someone does not misuse a piece of software merely because they do not understand it. Misuse comes from mistakenly thinking you understand it. The premise that most researchers will use something they do not understand just because it is easy to use is distasteful.

Kevin Slavin gives a fantastic TED talk on the ubiquitous role of algorithms in today’s world. His conclusion is one of neither panacea nor doom, but rather that we should seek to understand and characterize them, learning their strengths and weaknesses the way a naturalist studies a new species.

More widespread adoption of software such as BUGS & relatives has indeed increased the amount of misuse and false conclusions. But it has also dramatically increased awareness of issues ranging from computational aspects peculiar to particular implementations to general understanding and discourse about Bayesian methods. Like Kevin, I don’t think we can escape the algorithms, but I do think we can learn to understand and live with them.




PLOS Data Sharing Policy Reflections

30 May 2014

PLOS has posted an excellent update reflecting on their experiences a few months into their new data sharing policy, which requires authors to include a statement of where the data can be obtained rather than providing it upon request. They do a rather excellent job of highlighting common concerns and offering well-justified and well-explained replies where appropriate.

At the end of the piece they pose several excellent questions, which I reflect on here (mostly as a way of figuring out my own thoughts on these issues).


  • When should an author choose Supplementary Files vs a repository vs figures and tables?

To me, repositories should always be the default. Academic repositories provide robust permanent archiving (such as CLOCKSS backup), independent DOIs to content, often tracking of use metrics, enhanced discoverability, clear and appropriate licenses, richer metadata, as well as frequently providing things like API access and easy-to-use interfaces. They are the Silicon Valley of publishing innovation today.

Today I think it is much more likely that some material is not appropriate for a ‘journal supplement’ than that one cannot find an appropriate repository (enough are free, subject-agnostic, and accept almost any file type). In my opinion the primary challenge is for publishers to tightly integrate the repository contents with their own website, something that the repositories themselves can support with good APIs and embedding tools (many do, PLOS’s coordination with figshare for individual figures being a great example).

I’m not clear on “vs figures and tables”, as this seems like a content question of “What” should be archived rather than “Where” (unless it is referring to separately archiving the figures and tables of the main text, which sounds like a great idea to me).

  • Should software/code be treated any differently from ‘data’? How should materials-sharing differ?

At the highest level I think it is possible to see software as a ‘type’ of data. Like other data, it is in need of appropriate licensing, a management plan, documentation/metadata, and conforming to appropriate standards and placed in appropriate repositories. Of course what is meant by “appropriate” differs, but that is also true between other types of data. The same motivations for requiring data sharing (understanding and replicating the work, facilitating future work, increasing impact) apply.

I think we as a scientific community (or rather, many loosely federated communities) are still working out just how best to share scientific code and the unique challenges that it raises. Traditional scientific data repositories are well ahead in establishing best practices for other data, but are rapidly working out approaches to code. The guidelines from the Journal of Open Research Software from the UK Software Sustainability Institute are a great example. (I’ve written on this topic before, such as what I look for in software papers and on the topic of the Mozilla Science Code review pilot.)

I’m not informed enough to speak to sharing of non-digital material.

  • What does peer review of data mean, and should reviewers and editors be paying more attention to data than they did previously, now that they can do so?

In as much as we are satisfied with the current definition of peer review for journal articles, I think this is a false dichotomy. Landmark papers, at least in my field, from five or six decades ago (i.e. about as old as the current peer review system) frequently contained all the data in the paper (papers were longer and data was smaller). Somehow the data outgrew the paper and it just became okay to omit it, just as methods have gotten more complex and papers today frequently gloss over methodological details. The problem, then, is not one of type but one of scale: how do you review data when it takes up more than half a page of printed text?

The problem of scale is of course not limited to data. Papers frequently have many more authors than reviewers, often representing disparate and highly specialized expertise over possibly years of work; they may depend upon more than 100 citations and be accompanied by as many pages of supplemental material. To the extent that we’re satisfied with how reviewers and editors have coped with these trends, we can hope for the same for data.

Meanwhile, data transparency and data reuse may be more effective safeguards. Yes, errors in the data may cause trouble before they can be brought to light, just like bugs in software. But in this way they do eventually come to light, and that is somewhat less worrying if we view data the way we currently view publications (e.g. as fundamental building blocks of research) and publications as we currently view data (e.g. as a means to an end, illustrated in the idea that it is okay to have mistakes in the data as long as they don’t change the conclusions). Jonathan Eisen has some excellent examples in which openly sharing the data led to rapid discovery and correction of errors that might have been difficult to detect otherwise.

  • And getting at the reason why we encourage data sharing: how much data, metadata, and explanation is necessary for replication?

I agree that the “What” question is a crux issue, and one we are still figuring out by community. There are really two issues here: what data to include, and what metadata (which to me includes any explanation or other documentation of the data) to provide for whatever data is included.

On appropriate metadata, we’ll never have a one-size-fits-all answer, but I think the key is to at least uphold current community best practices (best != mode), whatever they may be. Parts of this are easy: scholarly archives everywhere include basic Dublin Core Elements metadata like title, author, date, subject and unique identifier, and most data repositories will attach this information in a machine-readable metadata format with minimal burden on the author (e.g. Dryad, or to a lesser extent, figshare). Many fields already have well-established and tested standards for data documentation, such as the Ecological Metadata Language, which helps ecologists document things like column names and units in an appropriate and consistent way without constraining how the data is collected or structured.

What data we include in the first place is more challenging, particularly as there is no good definition of ‘raw data’ (one person’s raw data being another person’s highly processed data). I think a useful minimum might be to provide any data shown in a figure or used in a statistical test that appears in the paper.

Journal policies can help most in each of these cases by pointing authors to the policies of repositories and to subject-specific publications on these best practices.

  • A crucial issue that is much wider than PLOS is how to cite data and give academic credit for data reuse, to encourage researchers to make data sharing part of their everyday routine.

Again I agree that credit for data reuse is an important and largely cultural issue. Certainly editors can play their part, as they already do in encouraging authors to cite the corresponding papers on the methods used, etc.

I think the cultural challenge is much greater for the “long tail” content than it is for the most impactful data. I think most of the top-cited papers over the last two decades have been methods papers (or are cited for the use of a method that has become the standard of a field, often as software). As with any citation, there’s a positive feedback as more people become aware of it. I suspect that the papers announcing the first full genomes of commonly studied organisms (essentially data papers, though published by the most exclusive journals) did not lack citations. For data (or methods, for that matter) that do not anticipate that level of reuse, the concern about appropriate credit is more real. Even if researchers can assume they will be cited by future reuse of their data, they may not feel that is sufficient compensation if it means one less paper to their name.

Unfortunately I think these are not issues unique to data publication but germane to academic credit in general. Citations, journal names, and so forth are not meaningless metrics, but very noisy ones. I think it is too easy to fool ourselves by looking only at cases where statistical averages are large enough to see the signal – datasets like the human genome and algorithms like BLAST we know are impactful, and the citation record bears this out. Really well cited papers or well-cited journals tend to coincide with our notions of impact, so it is easy to overestimate the fidelity of citation statistics when the sample size is much smaller. Besides, academic work is a high-dimensional creature not easily reduced to a few scalar metrics.

  • And for long-term preservation, we must ask who funds the costs of data sharing? What file formats should be acceptable and what will happen in the future with data in obsolete file formats? Is there likely to be universal agreement on how long researchers should store data, given the different current requirements of institutions and funders?

I think these are questions for the scientific data repositories and the fields they serve, rather than the journals, and for the most part they are handling them well.

Repositories like Dryad have clear pricing schemes closely integrated with other publication costs, and standing at least an order of magnitude below most journal publication fees, they look like a bargain. (Not so if you publish in subscription journals, I hear you say. Well, I would not be surprised if we start seeing such repositories negotiate institutional subscriptions to cover the costs of their authors.)

I think the question of data formats is closely tied to that of metadata, as they are all topics of best practices in archiving. Many scientific data repositories have put a lot of thought into these issues and also weigh them against the needs and ease-of-use of the communities they serve. Journal data archiving policies can play their part by encouraging best practices and pointing authors to repository guidelines as well as published articles from their community (such as the Nine Simple Ways paper by White et al.).

I feel the almost rhetorical question about ‘universal agreement’ is unnecessarily pessimistic. I suspect that much of the variance in recommendations for how long a researcher should archive their own work predates the widespread emergence of data repositories, which have vastly simplified the issue from when it was left up to each individual lab. Do we ask this question of the scientific literature? No, largely because many major journals have already provided robust long-term archiving with CLOCKSS/LOCKSS backup agreements. Likewise, scientific data repositories seem to have settled on indefinite archiving. It seems both reasonable and practical that data archiving be held to the same standard as the journal article itself. (Sure, there are lots of challenging issues to be worked out here; the key is only to leave it in the hands of those already leading the way and not re-invent the wheel.)




packrat and rmarkdown

28 May 2014

I’m pretty happy with rmarkdown: it looks like it can pretty much replace my Makefile approach with a single R command, rmarkdown::render(). Notably, a lot of the pandoc configuration can already go into the document’s yaml header (bibliography, csl, template, documentclass, etc.), avoiding any messing around with the Makefile.
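
For example, a yaml header along these lines (file names and values are placeholders) keeps all of that configuration inside the document itself:

---
title: "Manuscript title"
bibliography: refs.bib
csl: ecology.csl
documentclass: article
output:
  pdf_document:
    template: template.tex
---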

Even more exciting is the pending RStudio integration with pandoc. This exposes the features of the rmarkdown package to the RStudio IDE buttons, but more importantly, seems like it will simplify the pandoc/latex dependency issues cross-platform.

In light of these developments, I wonder if I should separate my manuscripts from their corresponding R packages entirely (and/or treat them as vignettes?) I think it would be ideal to point people to a single .Rmd file and say “load this in RStudio” rather than passing along a whole working directory.

The rmarkdown::render workflow doesn’t cover installing the dependencies or downloading a pre-built cache. I’ve been relying on the R package mechanism itself to handle dependencies, though I list all packages loaded by the manuscript but not needed by the package functions themselves under SUGGESTS, as one would do with a vignette. Consequently, I’ve had to add an install.R script to my template to make sure these packages are installed before a user attempts to run the document. The install script feels like a bit of a hack, and makes me think that RStudio’s packrat may be what I actually want for this. So I finally got around to playing with packrat.
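
Such an install.R script can be as simple as the following sketch (the package names are illustrative, not the manuscript’s actual dependency list):

# Install any SUGGESTS-style packages the manuscript loads but the package
# functions themselves do not require.
pkgs <- c("ggplot2", "knitcitations", "rmarkdown")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)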

packrat

Packrat isn’t yet on CRAN, and for an RStudio package I admit that it still feels a bit clunky. Having a single packrat.lock file (think Gemfile.lock, I suppose) seems like a great idea. Carting around the hidden files .Rprofile and .Renviron, plus the tar.gz sources for all the dependencies (in packrat.sources), seems heavy and clunky, and logging in and out all the time feels like a hack.

The first discussion led to an interesting question: just how big are CRAN packages these days anyhow? Thanks to this clever rsync trick from Duncan, I could quickly explore this:

# List every package tarball on CRAN; rsync reports sizes with thousands separators
txt <- system("rsync --list-only cran.r-project.org::CRAN/src/contrib/ | grep .tar.gz", intern = TRUE)
# Coerce the comma-formatted size column to numeric while reading the listing
setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", from)))
ans <- read.table(textConnection(txt), colClasses = c("character", "num.with.commas", "Date", "character"))
library(ggplot2)
ggplot(ans, aes(V3, V2)) + geom_point()  # tarball size (V2, bytes) against date (V3)
sum(ans$V2 > 1e6)  # number of packages over 1 MB
sum(ans$V2 / 1e6)  # total size in MB
(Plots of CRAN package tarball size against date omitted.)

Note that there are 711 packages over 1 MB, for a total weight of over 2.8 GB. Not huge but more than you might want in a Git repo all the same.

Nevertheless, packrat works pretty well. Using a bit of a hack we can just version manage/ship the packrat.lock file and let packrat try and restore the rest.

packrat::packify()                              # set up packrat infrastructure in the project
source(".Rprofile"); readRenviron(".Renviron")  # stand-in for restarting R
packrat::restore()                              # reinstall dependencies listed in packrat.lock
source(".Rprofile"); readRenviron(".Renviron")  # stand-in for restarting R

The source/readRenviron calls should really be restarts of R. I tried replacing this with calls to Rscript -e "packrat::packify()" etc., but that fails to find packrat on the second call. (Attempting to reinstall it doesn’t work either.)

Provided the sources haven’t disappeared from their locations on Github, CRAN, etc., I think this strategy should work just fine. More long-term, we would want to archive a tarball with the packrat.sources, perhaps downloading it from a script as I currently do with the cache archive.

knitcitations

Debugging R CMD check reveals some pretty tricky behavior on R’s part: it wants to check the R code in my vignette even though it’s not building the vignette. It does this by tangling out the code chunks, which ignores inline code. Not sure whether this should count as a bug in knitr or in R, but it can’t be my fault ;-). See knitr/issues/784

With checks passing, have sent v0.6 to CRAN. Fingers crossed…

Milestones for version 0.7 should be able to address the print formatting issues, hopefully as a new citation_format option and without breaking backwards compatibility.




Notes

27 May 2014

nonparametric-bayes

  • nonparametric-bayes sensitivity-trends runs.
  • nonparametric-bayes sensitivity.R runs: explore facets to see what setting causes the below-perfect performance cluster.

knitcitations

  • updates to knitcitations (pasted from NEWS)
  • Looking at adapting CSL for inline text formatting csl/issues/33
  • Implemented pandoc rendering

ropensci

  • Use of .Renviron vs .Rprofile for API keys

Seems the difference between .Rprofile and .Renviron is that (a) the latter is just a named character vector, and (b) the latter is accessed with Sys.getenv(). This keeps the workspace clean (so does our use of options, instead of just writing the API key into the .Rprofile directly). Sys.getenv automatically loads the environment variables of the shell (for instance, Sys.getenv("USER") returns the username of the computer/active shell, even if no .Renviron file exists). This is kinda convenient, e.g. with travis: if you were encrypting your API keys you could load them with Sys.getenv() without any further step. I guess it makes sense to think of security credentials as environment variables. In principle (e.g. in the travis case) a user might do this when accessing the same keys across different software, rather than storing different files for R vs python etc.
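
As a concrete sketch (the key name and value are placeholders, not real credentials):

# A line like this in ~/.Renviron ...
#   MYSERVICE_API_KEY=abc123
# ... is then available to every R session without ever appearing in a script:
key <- Sys.getenv("MYSERVICE_API_KEY")
if (identical(key, "")) message("MYSERVICE_API_KEY is not set")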

Note that a user could have a different .Renviron file in each working directory which is loaded first, which could allow separate projects in separate working directories to only load their own keys.

I was wondering if we should provide helper functions that would write the keys to .Rprofile or .Renviron or wherever they should go from R, rather than asking the user to locate these hidden files?
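
One possible shape for such a helper, purely as a sketch (not part of any existing ropensci package):

# Append a key to the user's .Renviron so they never have to open the hidden
# file themselves; the function name and its default path are hypothetical.
set_renviron_key <- function(name, value, path = "~/.Renviron") {
  cat(sprintf("%s=%s\n", name, value), file = path, append = TRUE)
  message("Restart R (or call readRenviron(path)) for the new key to take effect")
}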

rmarkdown exploration

  • Yes, you need to write a custom template to have multiple pdf templates. rmarkdown/issues/113

  • Whoops, for RStudio 0.98b (preview) I do need to apt-get install texlive-fonts-recommended even though my command-line pandoc/latex has no trouble finding some other copy of these fonts on my system.

  • Who knew? rmarkdown sets dev=png for html and dev=pdf for pdf automatically. rmarkdown/issues/111
