# Archiving the lab notebook on figshare

## Robust archiving through CLOCKSS

One of the most comprehensive approaches I have come across so far uses figshare. This offers the most promising avenue for content preservation, but is weakest in managing the URIs and associating them with the original content. All figshare content is archived by CLOCKSS, an international cooperation of libraries that provides redundant, geopolitically distributed backups of its archives around the world (and is used by many academic journals, both subscription-based and open access). Should figshare vanish from the face of the planet, its disappearance will trigger the release of all of its content through the CLOCKSS servers, with the same appearance and resolving at the same URLs as the original figshare content. Presumably the DOIs assigned to figshare content will also continue to resolve there.

## Challenges

It would certainly be preferable to have the notebook archived by CLOCKSS directly, since the association between the original online content at carlboettiger.info and its archived copy is lost when the entries are archived on figshare. More problematically, the content as archived on figshare is not recognized by search engines, etc., as separate HTML pages to index, but merely as a bundle of attached text files. On the upside, the content becomes part of the global scientific datasets preserved and indexed by figshare with appropriate metadata, etc., increasing the chances for discovery through that venue. Also, figshare provides a convenient API that can help automate deposition.
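
As a rough illustration of what automated deposition could look like, here is a minimal sketch that only builds (and does not send) the HTTP request for creating a new figshare item. It assumes the figshare v2 REST API (`POST /account/articles` with a personal token in the `Authorization` header), which may differ from the API version available when this entry was written. The function name, the choice of metadata fields, and the example values are hypothetical.

```python
import json
import urllib.request

# Assumption: base URL of the figshare v2 REST API.
FIGSHARE_API = "https://api.figshare.com/v2"


def build_create_article_request(title, description, tags, token):
    """Build (but do not send) a request that would create a private figshare item.

    `token` is a personal access token; the metadata fields used here
    (title, description, tags) are an illustrative subset.
    """
    payload = {
        "title": title,
        "description": description,
        "tags": list(tags),
    }
    return urllib.request.Request(
        url=f"{FIGSHARE_API}/account/articles",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_create_article_request(
        title="Lab notebook archive (example)",
        description="Hypothetical snapshot of notebook entries.",
        tags=["lab notebook", "open science"],
        token="REPLACE_WITH_TOKEN",
    )
    # Inspect the request; pass it to urllib.request.urlopen() to actually deposit.
    print(request.get_method(), request.full_url)
```

Uploading the files themselves is a separate, multi-step process that this sketch deliberately leaves out.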

### What content? What format?

Deciding just what to archive in the figshare database is also less straightforward than it may seem. I have gone through a few iterations:

1. Archiving the markdown.
2. Archiving external images with Data URIs.
3. Archiving the HTML versions of pages alone.
4. Archiving the whole git repository, `_site` HTML included (?)
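
The Data URI option in the list above amounts to replacing an external image reference with the image bytes themselves, base64-encoded inline. A minimal sketch of that encoding step, using only the Python standard library, might look like the following; the function names are hypothetical and this is not necessarily how the notebook itself performs the embedding.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI of the given MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_uri(path: str) -> str:
    """Read a local file and return it as a data URI.

    The MIME type is guessed from the file extension, falling back to a
    generic binary type when the extension is unknown.
    """
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        mime_type = "application/octet-stream"
    return to_data_uri(Path(path).read_bytes(), mime_type)


if __name__ == "__main__":
    print(to_data_uri(b"hello", "text/plain"))
    # prints: data:text/plain;base64,aGVsbG8=
```

The resulting string can stand in for the `src` attribute of an image, so the archived page no longer depends on an external file, at the cost of a larger document.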

Lastly there is the concern of preserving the version history of entries. Though figshare provides versioning of its content, this doesn’t capture the finer-grained history of individual page changes available through the GitHub repository. At the expense of creating an ever more cumbersome archival object, one could include the .git history, either for the rendered HTML version (which lives at cboettig.github.com) or for the source files used to create it (labnotebook).

Of course this fails to address the preservation of externally linked content. The most frequent outbound links point to other publications, usually through their DOIs, which we hope will take care of themselves. The most important externally linked content in the notebook entries is the set of links to scripts, functions, and manuscripts in the various project repositories on GitHub. The simplest solution is to embed the most important scripts in the notebook entries themselves. Archiving the project repositories is an additional challenge, but if a user can recover a copy of the project repository (along with its .git history) then it would be possible to identify the linked file using the SHA hash from these links (by matching it against the SHAs in the log). See my entry on SHA hashes for more on this topic.
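
To make the SHA-based recovery concrete, here is a small sketch that pulls the commit SHA and file path out of a link and builds the `git show` command that would print that file from a recovered clone. It assumes the links follow the usual GitHub pattern `github.com/<owner>/<repo>/blob/<sha>/<path>`; the example URL, owner, and repository names are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional
from urllib.parse import urlparse


class PinnedLink(NamedTuple):
    owner: str
    repo: str
    sha: str
    path: str


# Assumed link layout: /<owner>/<repo>/(blob|raw|tree)/<hex sha>/<path>
_PATTERN = re.compile(r"^/([^/]+)/([^/]+)/(?:blob|raw|tree)/([0-9a-f]{7,40})/(.+)$")


def parse_pinned_link(url: str) -> Optional[PinnedLink]:
    """Return the SHA and path from a SHA-pinned GitHub link, or None.

    Links that point at a branch name such as `master` do not match,
    unless the branch name happens to look like a hex string.
    """
    parts = urlparse(url)  # the path excludes any "#L10"-style fragment
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    match = _PATTERN.match(parts.path)
    if match is None:
        return None
    return PinnedLink(*match.groups())


def git_show_command(link: PinnedLink) -> List[str]:
    """Build the git command that prints the linked file at that commit."""
    return ["git", "show", f"{link.sha}:{link.path}"]


if __name__ == "__main__":
    url = ("https://github.com/someuser/someproject/blob/"
           "0123456789abcdef0123456789abcdef01234567/R/analysis.R#L10")
    link = parse_pinned_link(url)
    if link is not None:
        print(" ".join(git_show_command(link)))
```

Running the resulting command inside the recovered repository only succeeds if that commit is actually present in its history, which is exactly the check described above.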