Assembling various loose ends and some ropensci work.

Misc things to work out and write at some date

  • Catch up on literature reading (unsorted list in Mendeley). Get entries into notebook.
  • Write thoughts on selecting links anchor text in the notebook
  • knitr script workflow: externalizing chunks, cache and figure prefixes.
  • branching projects and unique repositories
  • source code for all manuscripts into repositories
  • clean up directories. in particular, establish separate prosecutors-fallacy repository
  • Comments on workflow: Date-oriented (notebook) vs File-oriented (Github) explanation. (Or is Github both, via history button, and the notebook just the summary that links to it?)


As per yesterday’s announcement, we are officially funded by a generous grant from the Alfred P. Sloan Foundation!


  • Writing feedback on goals, comments to advisors


ropensci website

Scott has a good question about automatic TOC sidebars.

Can generate TOC with redcarpet filter, but it grabs all headers. Unclear how to modify so that it grabs only headers above a certain level. See my attempt as an issue in redcarpet repo. Similarly pandoc only provides --toc as a compile flag, which does not give us the same control over it’s placement in the HTML.

misc semantics

  • Interesting example of RDF visualization and linking to external linked-data resources. In particular, good use/example of linking to DBpedia information. See issue #63


How I use git with svn repositories

I don’t use SVN much, but in my opinion “commits” mean something different in SVN then in Git anyway, so I’d rather not entangle the commit history. Because merging is hard in SVN and all local changes are slave to the master central repository, commits tend to be larger, more polished things. Commits and merges are cheap and easy in git, so I commit much more often. By having to commit twice, I am deliberate about whether I’m committing to SVN, with all the overhead that involves, or whether I’m just committing to git.

I think my approach is much easier than this

My replies to data quality survey

What constitutes data quality?

Statistical quality Measurement error, the degree to which potential covariates are either controlled for or co-measured, and the sample size.

Computational quality Open and standardized format (csv, xml in defined schema of the field, etc) Thorough metadata conforming to tightly defined and verifiable schema. Terms defined as URIs / linked data. Openly accessible under permissive licensing.

How might data quality be assessed generally?

Statistical quality depends on what the researcher wants to do with it, and is generally easy for the researcher to assess. All the elements listed under statistical quality should simply be included in metadata, and researchers can draw their own conclusions about sample size, covariates, etc.

Computational quality measurement should reflect data management practices more broadly. Something like Tim Berners Lee’s five-star mechanism for linked data, or badges that indicate if data includes verification tools, etc.

DataCite’s Role?

Good quality data should be incentivised the same way good publications are. If good quality data can make a career, researchers will rise to the challenge with a little help for proofreading and formatting. Ease of depositing data in a useful format and sharing that data, along with metrics to reflect the impact it has, can go a long way in this regard.