Notebook Maintenance And Scaling

Electronic notebooks may not run out of pages like a paper notebook, but with five years of entries (963 posts, with a repository size approaching half a gigabyte), together with various files, layouts, experimentation and version history, some thought must be given to scale. Two closely related considerations add to this further: dynamic builds with knitr from .Rmd versions and hosting image files directly in the notebook repository rather than uploading to an external site (previously flickr or on the gh-pages of other project repositories). This has several advantages (more on that later) but in the immediate term it makes building the repository potentially slower (though knitr’s caching helps) and increases the repository size more rapidly (even with text-based svg images).

The current Jekyll system keeps all posts in a single repository and rebuilds the HTML files for each every time. This is already showing some strains: for instance, for some reason the git hashes when generating the site automatically on Travis cease updating for older posts, though this problem doesn’t occur when building locally. Overall, the Jekyll platform is rather snappy so this isn’t an unmanageable size, but is sufficient to demonstrate that the approach isn’t able to scale indefinitely either.

So, as with the paper notebook whose pages are filled, it’s time to crack open a new binding and shelve the old notebooks – somewhere handy to be sure, but no longer one voluminous tome on the desk.

A multi-repository approach

To address this, I’m am trying out breaking the notebook over multiple repositories: using a new repository for each year’s worth of entries, and an additional repository to provide the basic pages (home, teaching, vita, etc. from the navbar) along with the assets used by all the other sites (css, fonts, javascript, etc). This avoids rebuilding the posts of notebooks from all previous years every time the Jekyll site is compiled, keeping the repositories smaller, the site more modular and more easy to scale.

This raises some challenges such as keeping the layout and appearance consistent without maintaining copies of layout files across multiple repositories; managing URLs and paths across different repositories, and aggregating metadata (posts, tags, categories).

Repos, Paths, and URLs for the multi-notebook

Even with the source files (such as .md entries, templates, etc.) in different repositories it would be simple enough to combine the generated HTML files from each repository into a single output directory serving the site (on Github or elsewhere). However, GitHub’s gh-pages provide an elegantly more modular way to do this already. GitHub uses the URL of the user’s repository (the repo named username.github.io, which also serves as the site URL unless a different domain is specified using a CNAME file) as the root domain for all other gh-pages branches on the Github repo.

Thus, I have created repositories named 2015, 2014, etc, which will serve the notebooks for the corresponding year from their own gh-pages branch. Moving my www.carlboettiger.info (the use of a subdomain such as www is required in order to benefit from Github’s CDN, though if it is omitted the domain provider will add it) from my labnotebook repo to my cboettig.github.io repository means that the annual repositories now have base URLs such as www.carlboettiger.info/2015, www.carlboettiger.info/2014. Adjusting the _config.yml to omit /year: from the permalink, since it is already in the base URL, is all that is needed to ensure that the posts of all my old URLs will still resolve to the same pages. Excellent.

Dealing with the site pages is more tricky than dealing with the posts. Pages come in two variates: some, like index.html, research.html, vita.html, contain only content that is independent of whatever is in the notebook pages and thus can live quite happily in the cboettig.github.io repository. Others, like tags.html, categories.html, archive.html, lab-notebook.html, atom.xml and other tag-specific RSS feeds are dynamically generated by Jekyll using the metadata of the posts, and thus need to live in the individual notebook repositories instead.

This instead of just having the page: carlboettiger.info/tags, each year begins a new notebook with it’s own tags, categories, etc: carlboettiger.info/2014/tags, carlboettiger.info/2013/tags. For tags, categories,it makes some sense to have this information aggregated by year, avoiding the clutter of too many or too stale tags or categories (though perhaps something is lost by not being able to see this in aggregate across all years, at least not without some effort). Likewise for the list of posts by date (previously at archive.html, now just turned into index.html) is produced for each annual notebook, such that carlboettiger.info/2014 resolves a reverse-chronological list of posts for that year alone.

I must then address what to do about the original URLs such as carlboettiger.info/tags. Using a Jekyll liquid filter it is easy to define automatic redirects for /tags.html and /categories.html that will forward to the current year’s tag’s and categories, though perhaps an aggregated view would be preferable. For carlboettiger.info/archive I have provided manual links to the index of each annual notebook rather than a redirect to the index of only the most current notebook. Likewise for one of my most popular pages, carlboettiger.info/lab-notebook, I have retained the automated feeds from Github, Twitter, and Mendeley, but replaced the previews of the most recent posts with the less aesthetic link to the notebook by year. Meanwhile, I have provided each notebook with it’s own nine-panel preview page such as carlboettiger.info/2014/lab-notebook, which has the preview but not the network feeds (Perhaps it would be better to move this to the index page). In this way, the social feeds can be updated merely by updating the cboettig.github.io repo (since these are rendered as static text rather than javascript, written using the relevant API at the time the site is built.)

A more tricky case is that of the atom feeds. It doesn’t really make sense to subscribe to a carlboettiger.info/2015/blog.xml feed that will be inactive in a year. Using HTML redirects in a .xml file doesn’t make too much sense, so I will try the RSS-flavor redirect:

<newLocation>
https://www.carlboettiger.info/2015/blog.xml
</newLocation>

though this seems less than ideal.

Automated deploy

As I use the jeykll-pandoc gem to have pandoc render the markdown, along with a few other custom plugins, I cannot take advantage of Github’s automated build for Jekyll and have instead relied on the trick of having Travis-CI build and deploy the site. Adding automated knitr building to the mix will make this too heavy for travis, even for more modular notebooks. Instead, I am relying on local building, together with automated builds from my own server running a Drone CI instance. More on this in a separate post.

Site assets, templates

Individual notebook repositories are thus much more light-weight. All css assets are in the root cboettig.github.io repository or already provided by external CDNs (such as the FontAwesome icons or MathJax, and Bootstrap javascript). However, it is necessary that both all annual notebook repositories and the base repo have the Jekyll _layouts and _includes files required to template and build the pages. This is unfortunate, since it means maintaining multiple copies of the same file, but I haven’t figured out an easy way around it.

Pruning history

In breaking labnotebook into component repos by year, I only want to preserve the history of that year, thus keeping the repositories small. This is particularly important for the root repo, cboettig.github.io, since it will remain active.

  • edit _config.yml to remove /:year from _config.yml (the repository name will automatically be used as part of the URL)
  • delete all posts from different years (preferable to just wait until deleting their history, which will remove the files as well), e.g. for 2014:
files=`echo {_posts/2008-*,_posts/2009-*,_posts/2010-*,_posts/2011-*,_posts/2012-*,_posts/2013-*}`
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

and remove the temporary backups immediately so that repository actually shrinks in size:

git update-ref -d refs/original/refs/heads/master
git reflog expire --expire=now --all
git gc --prune=now

This is more important in the root repository, since this will remain active. If the annual notebook entry repositories have some extra stuff in their .git history it isn’t such an issue since they no longer need to grow or be moved around as much. (See this SO on rewriting git history.)

My progress notes during the remapping:

  • [x] delete the CNAME file.

  • [x] delete all the relatively static pages files that will be hosted directly from cboettig.github.io (index.html, research.md, etc., but not dynamically created tags.html etc).

  • [x] adjust repo: in _config.yml to match the repository year. This will automatically fix the sha and history links in the sidebar.

  • [x] Other tweaks to the sidebar: site.repo liquid must be added to categories, tags, next, and previous links.

  • [x] Automated deploy for active and root repositories.

  • [x] Plan for labnotebook repo. History is preserved, but issues, github stars, etc. Use as template for the new years?

  • [x] Activate! Remove CNAME from labnotebook repo, add www CNAME to cboettig.github.io. Consider removing gh-pages branch of lab-notebook?

  • [x] Fix / workaround for the root atom feeds.

  • [x] Syncing assets, layout, and deploy scripts? Perhaps it is best to allow these to diverge and newer notebooks to look different than older ones?