Data Management Plan

Writing out a data management plan for myself. Suggestions and feedback welcome.

Data Types and Structure

All source code, documentation, scripts, and data for the analyses performed in the course of this research shall be maintained in a digital compendium using the R package structure, as recommended in Gentleman and Temple Lang (2007). The progress and results of this research shall be regularly chronicled in an electronic lab notebook, maintained openly at carlboettiger.info/lab-notebook.html. Figures and other results are kept together with the source code required to reproduce them by writing the analyses as knitr dynamic documents.

Data Acquisition, Integrity and Quality

The lab notebook is written and maintained in plain text (UTF-8) and rendered to HTML5. Likewise, the R package compendium will maintain all source code, scripts, and documentation in plain-text (UTF-8) files. Plain-text files with standard encodings remain readable independent of any particular software. The notebook entries and the compendium will each be maintained in a separate git repository.
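The encoding commitment above can be checked mechanically. The following is a minimal sketch, not part of the plan itself: the function name `non_utf8_files` and the list of file suffixes are assumptions chosen for illustration, and a real compendium may contain other text file types.

```python
from pathlib import Path

def non_utf8_files(root, suffixes=(".md", ".R", ".Rmd", ".csv")):
    """Return the text files under `root` that fail to decode as UTF-8.

    The suffix list is a hypothetical default; adjust it to the file
    types actually kept in the notebook or compendium.
    """
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return bad
```

Calling `non_utf8_files(".")` from the top of a repository returns an empty list when every matching file is valid UTF-8, so it could run before each commit or archive deposit.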

Git identifies every stored object by a SHA hash of its contents, so corruption of the repository can be detected. Synchronized backups of the git repositories are maintained on local and remote servers (RAID 6) to protect against hardware failures, as well as on the public international software repository GitHub (github.com/cboettig). Version history preserves a timeline of changes and protects against user error.
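To illustrate why content hashes guard against corruption, here is a minimal sketch of how git derives the identifier of a file (a "blob") in a repository that uses the default SHA-1 object format. The function name `git_blob_sha1` is an assumption for illustration; git's own `git hash-object` command performs the same computation.

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte)
    # followed by the raw file bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

original = b"hello world\n"
corrupted = b"hello w0rld\n"  # a single changed character

recorded_id = git_blob_sha1(original)
print(recorded_id)
# Any change to the bytes yields a different ID, so the stored
# content no longer matches the ID recorded in the history.
print(git_blob_sha1(corrupted) == recorded_id)
```

Because every commit records the IDs of the files it contains, a file altered on disk no longer matches its recorded ID, which is what `git fsck` checks for.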

Archival copies of notebook entries shall be published annually to figshare, where they are assigned DOIs and preserved by CLOCKSS, a geopolitically distributed, 12-node global archive. Likewise, an archival copy of the R compendium shall be published to figshare at the time of each peer-reviewed publication.

Any data produced by simulations used in these analyses will be archived in the compendium alongside the code that generated it. Any data associated with a peer-reviewed publication will be deposited as plain-text CSV files in the Dryad digital repository for the biosciences, unless the data are provided by a third party whose terms of use forbid this. No generation of empirical data is anticipated.
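As a rough sketch of what a plain-text CSV deposit involves, the helper below writes simulation output as UTF-8 CSV with a header row. The function name `write_simulation_csv` and the column names in the usage note are hypothetical and stand in for whatever a given analysis produces.

```python
import csv

def write_simulation_csv(path, rows, fieldnames):
    """Write a list of dict rows to a UTF-8 CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_simulation_csv("sim.csv", [{"time": 0, "population": 10}], ["time", "population"])` produces a file that any spreadsheet or statistics package can read without special software.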

Rights Management & Dissemination

All products generated by this research will be released under permissive licenses that allow reuse, redistribution, and derivative works, free of charge and for any purpose. Each product will be available without request from a major online repository (Table 1).

Licenses reflect the Panton Principles for open data and the Science Code Manifesto. Distribution uses internationally recognized and freely accessible public repositories best suited for the dissemination of each product.
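
Table 1. License and distribution channel for each research product.

Product            License   Distribution
-----------------  --------  ----------------
publications       CC BY     arXiv
software           FreeBSD   GitHub, figshare
data               CC0       Dryad
research notebook  CC0       GitHub, figshare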

Peer-reviewed publications will target preprint-friendly publishers, and an author’s preprint of each will be posted on the arXiv under a Creative Commons Attribution license to facilitate access and distribution.

References

Gentleman R and Temple Lang D (2007). “Statistical Analyses and Reproducible Research.” Journal of Computational and Graphical Statistics, 16. ISSN 1061-8600, doi: 10.1198/106186007X178663