Lab Notebook





Rocker Versioning

14 Oct 2014

I have been looking into building versioned images for previous R releases using Docker, in response to fairly common requests to our recently begun rocker project. Versioning is under early development and the best way to go about it is not yet clear. Getting the correct version of R installed is not always trivial but is relatively straightforward, and I outline two approaches below.

Getting the correct version of packages (or even merely any compatible version of the package) to install is a considerably more difficult problem, which I’ll discuss later.

We are considering two different strategies, each with strengths and weaknesses:

Compiled builds

The most straightforward recipe seems to be to adapt the rocker/r-devel file to compile the desired version by pulling from the appropriate tag in the R SVN repository, as suggested by Dirk (@eddelbuettel).

Occasionally this needs a few extra packages added to the list. I was able to get R 2.0.0 to compile, but not R 1.0.0. Versions more recent than 2.0.0 don’t seem to pose any difficulty for compiling. Nonetheless, installing additional packages is still an issue.
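
A minimal sketch of the tag lookup, assuming release tags in the R SVN repository are named R-X-Y-Z under https://svn.r-project.org/R/tags/ (the helper name r_tag_url is hypothetical and is not part of the rocker recipes):

```shell
# Map an R version such as 2.0.0 to the matching tag URL in the R SVN repository.
# Assumption: release tags are named R-X-Y-Z under /R/tags/.
r_tag_url() {
  version="$1"
  # Replace the dots in the version with dashes: 2.0.0 -> R-2-0-0
  tag="R-$(printf '%s' "$version" | tr '.' '-')"
  printf 'https://svn.r-project.org/R/tags/%s\n' "$tag"
}

# Usage (needs subversion installed; not run here):
#   svn checkout "$(r_tag_url 2.0.0)" R-2.0.0
```

The checked-out source would then be configured and compiled much as in the r-devel recipe.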

Binary builds

A different approach is to use the binary versions from old Debian images. This works rather well when docker images are available for earlier Debian releases (oldstable and stable, which currently go back to Debian 6.0 and 7.0, while the main rocker release builds on Debian testing, which is at 8.0). Merely by using the earlier Debian releases, we can jump back to certain versions of R:

  • 6.0 : R 2.11.1
  • 7.0 : R 2.15.1

The advantage of this is that binary forms of many common R packages may also be available from the same repositories.

Dirk @eddelbuettel also suggests looking at Debian snapshot archive binaries. This allows us to install intermediate versions of R in binary form, as well as specific versions of any package for which debian binaries have been built. The brilliant bit about this is that we can add any particular snapshot time-period as a normal repository, e.g.

deb lenny main
deb-src lenny main
deb lenny/updates main
deb-src lenny/updates main

and the package manager can thus handle all the dependencies. As noted, this works only for those packages for which we have Debian binaries available in the release.
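
As a rough sketch of how such repository lines could be generated, assuming the snapshot archive serves each timestamp under http://snapshot.debian.org/archive/debian/ (the helper name snapshot_sources and the timestamp in the usage line are hypothetical, not the ones we actually used):

```shell
# Print apt source lines for one Debian snapshot.
# Assumption: snapshots are addressed as
#   http://snapshot.debian.org/archive/debian/<YYYYMMDDTHHMMSSZ>/
snapshot_sources() {
  stamp="$1"   # snapshot timestamp (hypothetical example: 20100101T000000Z)
  suite="$2"   # Debian release codename, e.g. lenny
  base="http://snapshot.debian.org/archive/debian/$stamp/"
  printf 'deb %s %s main\n' "$base" "$suite"
  printf 'deb-src %s %s main\n' "$base" "$suite"
}

# Usage (not run here):
#   snapshot_sources 20100101T000000Z lenny | sudo tee /etc/apt/sources.list.d/snapshot.list
```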

This is limited by what we can use as a base image, particularly for old versions of R where the binaries are only available for i386 architectures (while there are some Docker images providing i386 architectures, we’ve so far used only amd64 versions). Given the rapid growth of R, however, it is likely that the preponderance of use-cases will focus on relatively recent R versions.

Unfortunately, I haven’t gotten this working yet (See issue #2).

Installing R packages

Installing packages directly from CRAN is more dubious. We may be able to install earlier versions of particular packages from the CRAN archives using the CRAN data from metacran/crandb as @hadley recommended.

Hoping that we can do this more generally by building on @gmbecker ’s work, which does just this using R versions built as Amazon EC2 AMIs. (See issue #1).
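
For a single package, a minimal sketch of pulling an old source version from the CRAN archive, assuming the usual Archive layout on cran.r-project.org (the helper name cran_archive_url is hypothetical, and the package and version in the usage line are only an illustration):

```shell
# Build the URL of an archived source tarball on CRAN.
# Assumption: old versions live under src/contrib/Archive/<pkg>/<pkg>_<version>.tar.gz;
# the current release of a package is not in Archive.
cran_archive_url() {
  pkg="$1"
  ver="$2"
  printf 'https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz\n' "$pkg" "$pkg" "$ver"
}

# Usage (not run here):
#   curl -fsSLO "$(cran_archive_url ggplot2 0.9.3)" && R CMD INSTALL ggplot2_0.9.3.tar.gz
```

This does nothing about dependencies, which is exactly the hard part: each dependency would need a version compatible with both the package and the installed R.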


Response To Software Discovery Index Report

08 Oct 2014

The NIH has recently announced the report of a landmark meeting which presents a vision for a Software Discovery Index (SDI). The report is both timely and focused on the key issues of locating, citing, and reusing software:

Software developers face challenges disseminating their software and measuring its adoption. Software users have difficulty identifying the most appropriate software for their work. Journal publishers lack a consistent way to handle software citations or to ensure reproducibility of published findings. Funding agencies struggle to make informed funding decisions about which software projects to support, while reviewers have a hard time understanding the relevancy and effectiveness of proposed software in the context of data management plans and proposed analysis.

To address this, they propose an Index which would do three things:

  1. to assign standard and unambiguous identifiers to reference all software,
  2. to track specific metadata features that describe that software, and
  3. to enable robust querying of all relevant information for users.

The appendices do an excellent job in outlining key metadata, metrics, and use cases which help frame the discussion. The proposal does well to focus on the importance of identifiers and the creation of a query-able metadata index for research software, but leaves out an essential element necessary to make this work.

This proposal sounds very much like the CrossRef and DataCite infrastructure already in place for academic literature and data, respectively; and indeed this is an excellent model to follow. However, a key piece of that infrastructure is missing from the present proposal – the social contract between repository or publisher and the index itself.

CrossRef provides unique identifiers for the academic literature (CrossRef DOIs), but it also defines specific metadata that describe that literature (as well as metrics of its use), and embed that information into a robust, query-able, machine-readable format. DataCite does the same for scientific data. These are exactly the features that the authors of the report seek to emulate.

Just as CrossRef itself does not host academic papers but only the metadata records, the SDI does not propose to host software itself. This introduces a substantial challenge in maintaining the link between the metadata and the software itself. The authors have simply proposed that the metadata include “Links to the code repository.” If CrossRef or DataCite DOIs worked in this way, we would soon lose all ability to recover many of the papers or the data itself, and we would be left with only access to the metadata record and a broken link. DOIs were created explicitly to solve this problem, not through technology, but through a social contract.

The scientific publishers who host the actual publications are responsible for ensuring that this link is always maintained when they change names, etc. Should the publisher go out of business, these links may be adjusted to point to a new home, such as CLOCKSS. This guarantees that the DOI always resolves to the resource in question, regardless of where it moves. Should a publisher fail to maintain these links, CrossRef may refuse to provide the publisher any additional DOIs, cutting it off from this key index. This is the social contract. Data repositories work in exactly the same way, purchasing their DOIs from DataCite. (While a financial transaction isn’t strictly necessary for the social contract, it provides a clear business model for maintaining the key organization responsible for the index.)

Without such a mechanism, links in the SDI would surely rot away, all the more rapidly in the fast-moving world of software. Without links to the software itself, the function of the index would be purely academic. Yet such a mechanism requires that the software repositories, not the individual software authors, be willing to accept the same social contract, receiving (and possibly paying for) identifiers on the condition that they assume the responsibility of maintaining the links. It is unclear whether the primary software repositories in use today (Sourceforge, Github, Bitbucket, etc) would be willing to accept this.

Data repositories already offer many of the compelling features of this proposal. Many data repositories accept a wide array of file formats including software packages, and would provide such software with a permanent unique identifier in the form of a DataCite DOI, as well as collecting much of the essential metadata listed in the report’s Appendix 1, which would then already be accessible through the DataCite API in a nice machine-readable format. Even so, this strategy leaves several aspects wanting.

The primary barrier to using data repositories indexed by DataCite arises from the dynamic nature of software relative to data. Data repositories are designed to serve relatively static content with few versions. Software repositories, by contrast, are usually built upon version control platforms such as Git or Subversion, designed explicitly for handling continual changes to software code, including branches and merges. The report discusses the challenges of software versions as a reason that citing a software paper as a proxy for citing software is not ideal: the citation to the paper does not convey what version was used. Rapid versioning creates other problems though, both in the number of identifiers that might be created (is each commit a new identifier?) and in defining the relationship between different versions of the same software. Branches and merges exacerbate this problem. Existing approaches that provide the user a one-time way to import software from a software repository to a data repository, such as those cited in the report (“One significant initiative is a collaboration between Mozilla, figshare, GitHub, and Zenodo”), do nothing to address these issues.

Less challenging issues involve resolving differences between DataCite metadata and the proposed metadata records for software. Most obviously, the metadata would need a way to declare that the object involved is software rather than data per se, which would thus allow queries to restrict results to ‘software’ objects to avoid cluttering searches. Ideally, one would also create tools that can import such metadata from the format in which it is usually already defined in software, into the desired format of the index, rather than requiring manual double-entry of this information. These are important but more straightforward problems which the report already seeks to address.



02 Oct 2014



  • Discussion with Dirk on repositories, library paths, versions.
  • library paths: apt-get uses the /usr/lib path, while user-run install commands (e.g. install.packages) use the /usr/local/lib/ path. Dirk recommends that /usr/local/lib/R/site-library be configured to be user-writable for package installation, rather than installing into home.
  • building directly from CRAN
  • building dependencies: apt-get build-dep, needs the corresponding deb-src lines.
  • issues and tweaks to littler see PR #2
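
On the build-dep point, a small sketch of deriving the deb-src lines from the deb lines already in a sources list (the helper name deb_src_lines is hypothetical; the file paths in the usage line are the standard apt locations):

```shell
# Emit a deb-src line for every deb line read on stdin.
# Comments and existing deb-src lines are skipped.
deb_src_lines() {
  sed -n 's/^deb[[:space:]][[:space:]]*/deb-src /p'
}

# Usage (not run here):
#   deb_src_lines < /etc/apt/sources.list | sudo tee /etc/apt/sources.list.d/src.list
#   sudo apt-get update && sudo apt-get build-dep -y r-base
```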


  • Discussion on minimal images
  • Discussion on analogsea + docker
  • Blog coverage of Dirk’s talk on our Docker work.

  • boot2docker-cli includes linux flavors, so I might get a look at what the docker experience feels like for those poor souls who can only live it through full virtualization. No go on the install methods listed, but the binary seems to work. Unfortunately my laptop cannot run 64 bit virtualbox…



  • Carsten’s suggestion re docker registries: is this something that scientific repositories might one day support? (Excerpt from my reply post):

While I see your point that the Docker Hub might not be ideal for all cases, I think the most important attribute of a repository should be longevity. Certainly Docker Hub won’t be around forever, but at this stage with 60 million in its latest VC round it’s likely to be more long-lasting than anything that a small organization like rOpenSci would host. It would be great to see existing scientific repositories show an interest in archiving images in this way though, since organizations like DataONE and Dryad already have recognition in the scientific community and better discoverability / search / metadata features. Building off of the docker registry technology would of course make a lot more sense than just archiving static binary docker images, which lack both the space saving features and the ease of download / integration that examples like the docker registry have.

I think it would be interesting to bring that discussion to some of the scientific data repositories and see what they say. Of course there’s the chicken & egg problem in that most researchers have never heard of docker. Would be curious what others think of this.




24 Sep 2014

rocker / docker

  • Talking over strategy with Dirk, see summary in #1 and follow-up issues, #3, #4, #5.

rdataone & EML

Trying to fix travis issues. Much craziness. doesn’t provide notes on setup for repos where the R package is in a subdirectory. Looks like cd commands are persistent though throughout a travis file. okay.

  • dataone is imported by EML, but EML is suggested by dataone. Since install_github likes to install the suggests list, this creates problems:

  • install_github has no way to indicate that I don’t want to install the suggests list.

  • install_github("DataONEorg/dataone/dataone", dependencies=NA) should get around this by not installing the suggested EML package when trying to install dataone. For reasons inexplicable to me, that doesn’t seem to work(!), at least on travis. I’ve had to remove the package from the suggests list.

The build environment often seems a lot more fragile than the package itself. EML travis builds were rather badly broken with both dataone and rrdf disappearing from CRAN. If we were installing them from the ubuntu binaries, of course this wouldn’t be quite as common a problem. Or even better, if our build environment came as a custom docker image. Fixing this is not completely trivial: we now have to install these packages from github, which in the case of rrdf means installing the latex build environment simply to build the rrdf vignette that we don’t need.

While these issues can no doubt frustrate users as well, I’m not convinced that CI should really be testing build environment problems when I want it to be testing changes I’m making to my package. In the big picture, we need more stable build environments, and of course I’m asking for trouble by depending on lots of packages, particularly new, complex and otherwise fragile packages, so this testing is valuable. But on the other hand, this mostly just gets in the way. Ideally I should be able to point to a stable build environment and just ignore changes to the later packages until I want to deal with them. That’s what most users do with their own systems – not upgrading their personal libraries, distributions, etc, until they are ready to deal with anything that breaks. Being forced onto the bleeding edge all the time forces me to waste considerable time or accept a broken CI state that need not actually be broken.


Containerizing My Development Environment

22 Sep 2014

A key challenge for reproducible research is developing solutions that integrate easily into a researcher’s existing workflow. Having to move all of one’s development onto remote machines, into a particular piece of workflow software or IDE, or even just constrained to a window running a local virtual machine in an unfamiliar or primitive environment isn’t particularly appealing. In my experience this doesn’t reflect the workflow of even those folks already concerned about reproducibility, and is, I suspect, a major barrier in adoption of such tools.

One reason I find docker particularly attractive for reproducible research is the idea of containerizing my development environment into something I can transport or recreate identically anywhere, particularly on another Linux machine. This also provides a convenient backup system for my development environment: no need to remember each different program or how I installed or configured it when moving to a new machine.

Using aliases

For me, a convenient way to do this involves creating a simple alias for running a container. This lets me run any software in the container while still managing my files and data through my native operating system tools. I’ve set the following alias in my bashrc.

alias c='docker run --rm -it -v $(pwd):/home/$USER/`basename $PWD` -w /home/$USER/`basename $PWD` -e HOME=$HOME -e USER=$USER --user=$USER strata'

I can then just do c R (think c for container) to get R running in a container, c bash to drop into a bash shell on the container, c pandoc --version echoes the version of pandoc available on our container (or otherwise execute the container version of pandoc), and so forth.

explanation: a non-root container

The trick here is primarily to handle permissions appropriately. Docker runs as the root user by default, so any files created or modified become owned by root instead of the user, which is clearly not desirable. Getting around this requires quite a bit of trickery. The breakdown of each of these arguments is as follows:

  • --rm remove this container when we quit, we don’t need to let it persist as a stopped container we could later return to.
  • -it Interactive terminal
  • -v binds a host volume to the container. Files on the host working directory (pwd) will be available on the container, and changes made on the container are immediately written to the host directory:
-v $(pwd):/home/$USER/`basename $PWD`

The path after the colon specifies where this directory should live on the container: we specify in a directory that has the same name as the current working directory basename $PWD, located in the home directory of the user (e.g. where the user has write permissions).

  • -w specifies the working directory we should drop into when our session on the container starts. We set this to match the path where we have just mounted our current working directory:
-w /home/$USER/`basename $PWD`
  • -e HOME=$HOME sets the value of the environmental variable HOME to whatever it is on the host machine (e.g. /home/username), so that when R tries to access ~/, it gets the user’s directory and not the root directory.

  • -e USER=$USER though this seems redundant, we set the user environmental variable by default in the cboettig/rstudio image, so this overrides that environmental variable with the current user.

  • --user=$USER Specifies the user we log in as. This is rather important; otherwise we find that we are root (or whatever user has been set in the Dockerfile). That would cause any files we generate from the container to be owned by the root user, not our local user. Note that this only works if the specific user has already been created (e.g. by adduser) on the container, otherwise this will fail.

  • strata the name of the image (could be cboettig/ropensci, but my strata image provides a few additional customizations, created by its own Dockerfile). That Dockerfile (and its FROM dependencies) specifies all the software available on this container. Importantly, it also already creates my username. Otherwise, the argument given above should use --user=rstudio, since the rstudio user is already created by the base image cboettig/rstudio, and thus available in cboettig/ropensci and strata. Note that this user can be created interactively by passing the environmental variable -e USER=$USER when running in daemon mode, since the user is then created by the start-up script. However, when we provide a custom command (like /usr/bin/R in this example), the CMD from the Dockerfile is overridden and the user isn’t created.

A stricter alias I considered first enforces running R as a container rather than a local operation:

alias R='docker run --rm -it -v $(pwd):/home/$USER/`basename $PWD` -w /home/$USER/`basename $PWD` -e HOME=$HOME --user=$USER strata /usr/bin/R'
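
The same idea can be written as a shell function, which makes the quoting easier to get right when the working directory contains spaces. This is only a sketch under the assumptions described above: an image named strata exists and already contains a user matching $USER.

```shell
# Run a command in the container with the current directory mounted.
# Assumptions: an image named "strata" exists and already has a user
# matching $USER, as described above.
c() {
  # Mount point inside the container: /home/<user>/<name of current directory>
  dir="/home/$USER/$(basename "$PWD")"
  docker run --rm -it \
    -v "$PWD:$dir" -w "$dir" \
    -e HOME="$HOME" -e USER="$USER" --user="$USER" \
    strata "$@"
}

# Usage (needs docker and the image; not run here):
#   c R
#   c pandoc --version
```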

Why not separate containers per application?

A more natural / more docker-esque approach might simply be to have separate containers for each application (R, pandoc, etc). This idealism belies the fact that I already need many of these tools installed on the same container, as they regularly interact in a deep way (e.g. R packages like rmarkdown already depend on pandoc), so these should really be thought of as a single development environment.


Server Backups

09 Sep 2014

Digital Ocean Snapshots

At $0.02 per gig per month, this looks like the cheapest way to make complete backups.

The process is rather manual: we have to sudo poweroff the droplet and then trigger the snapshot (the droplet will come back online after that, though we have to restart the services / active docker containers). We also have to delete old snapshots manually. Some of this can be automated from the API. DigitalOcean uses redundant storage for these (paying $0.01/month/gigabyte to Amazon Glacier), but at the moment we can’t export these images. Snapshots are also handy to deploy to a larger (but not smaller) droplet.
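
A hedged sketch of triggering the snapshot from the API rather than the web console, assuming the v2 API's droplet actions endpoint and a personal access token in $DO_TOKEN (the function names and the DO_TOKEN variable are hypothetical; I have not scripted this myself yet):

```shell
# Build the JSON body for a "snapshot" droplet action.
snapshot_payload() {
  printf '{"type":"snapshot","name":"%s"}\n' "$1"
}

# Request a snapshot of a (powered-off) droplet.
# Assumptions: API v2 endpoint, token in $DO_TOKEN.
#   $1 = droplet id, $2 = name for the snapshot
snapshot_droplet() {
  curl -fsS -X POST \
    -H "Authorization: Bearer $DO_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(snapshot_payload "$2")" \
    "https://api.digitalocean.com/v2/droplets/$1/actions"
}

# Usage (not run here):
#   snapshot_droplet <droplet-id> "backup-$(date +%F)"
```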

Digital Ocean Backups

These backups are an automated, always-online alternative to snapshots but must be initialized when the droplet is created and cost more (20% of server cost).

Manually configuring backups

To have the flexibility to restore individual pieces, to move between machines, etc., we need a different approach.

Container backups

Docker containers, including running containers, should be effectively backed up by either of these approaches to the state we would be in after a power cycle (e.g. we may need to start stopped containers, but not rebuild them from scratch).

Nevertheless we may want to back up containers themselves. For many containers this is trivial (e.g. our ssh container): we can just commit the running container to an image and save that as a tar archive (or equivalently, just export the container to a tarball).
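
A minimal sketch of the commit-and-save route for a container without volumes (the function name and the image and archive names in the usage line are hypothetical):

```shell
# Commit a container to an image, then save that image as a tar archive.
#   $1 = container name or id, $2 = image name to create, $3 = archive path
backup_container() {
  container="$1"
  image="$2"
  archive="$3"
  docker commit "$container" "$image" &&
    docker save -o "$archive" "$image"
}

# Usage (not run here):
#   backup_container ssh ssh-backup /tmp/ssh-backup.tar
# The archive can later be restored with: docker load -i /tmp/ssh-backup.tar
```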

If the containers have a VOLUME command in their dockerfile or in their execution however, this is insufficient. Containers using volumes (such as sameersbn/gitlab and mattgruter/drone) need four things to be backed up:

  • Dockerfile (or container image, from save or commit)
  • volume
  • volume path in container
  • name of the container the volume belongs to

A utility makes this easier.
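
Without the utility, the volume itself can be archived by hand with the --volumes-from pattern. This is a sketch of that alternative, not of the utility; the function name, container name and volume path in the usage line are hypothetical.

```shell
# Archive one volume of a container into a tarball on the host.
#   $1 = container that owns the volume
#   $2 = volume path inside the container
#   $3 = host directory that receives the tarball
backup_volume() {
  container="$1"
  volpath="$2"
  outdir="$3"
  docker run --rm --volumes-from "$container" -v "$outdir:/backup" ubuntu \
    tar cf "/backup/$container-volume.tar" "$volpath"
}

# Usage (not run here):
#   backup_volume gitlab /home/git/data /srv/backups
```

The container name and volume path still have to be recorded alongside the tarball, since they are needed to restore it.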


Sparkleshare is a git-backed dropbox alternative. With binaries for most major platforms (Windows, Mac, Ubuntu/Linux) it’s pretty easy to set up and acts in much the same way, with automated sync and notifications. The backend just needs a server running git; Gitlab is a great way to set this up to permit relatively easy sharing / user management. (Ignore the information about setting up separately on a server, Gitlab is much easier. Also ignore advice about building from source on Ubuntu; installing the binary is far more straightforward: apt-get install sparkleshare.) Certainly it is not as feature rich as dropbox (e.g. email links to add users, web links to share individual files), but it offers easy sharing over the server at no extra cost. The Sparkleshare directory is also a fully functional git repo.

Encrypted backup of filesystem with duplicity

See Duplicity setup

Good for backing up to another host for which we have ssh access, or to an Amazon S3 bucket, etc. (Unclear if this works with Glacier due to its upload-only setup.)

Some other rates for data storage:

  • Compare to S3 ($0.03 /gig/month)
  • EBS ($0.12 /gig/month) (really for computing I/O, not storage).
  • Remarkably, Google Drive and Dropbox now offer 1 TB at $10 / mo. Clearly a lot can be saved by ‘overselling’ (most users will not use their capacity) and by shared files (counting against the space for all users but requiring no more storage capacity). Nonetheless, impressive, on par with Glacier (without the bandwidth charges or delay).
  • For comparison, (non-redundant, non-enterprise, disk-based) storage is roughly $100/TB, or on order of that annual cost.


Server Security Basics

08 Sep 2014

Security configuration

We set up SSH key-only login on non-standard port, with root login forbidden. We then set up ufw firewall, fail2ban, and tripwire.

  1. Configure an SSH key login. Next, create a user, add it to sudoers, and then disable root login. This edits /etc/ssh/sshd_config:
  • Disable root logins. (We’ll need to add ourselves to sudo first: adduser, then edit /etc/sudoers.)
  • Change ssh port from default to something else.
  • Whitelist user login ids

Additionally, let’s be sure to disable password authentication: Add PasswordAuthentication no to /etc/ssh/sshd_config. (editing PermitRootLogin only doesn’t do this).

Locally add an entry in ~/.ssh/config to alias the host and port to avoid having to remember these numbers for login. Run ssh-copy-id <droplet-ip> to enable key-based login for the user.
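
A sketch of both sides of this step. The sshd_config directives in the comments are standard OpenSSH options; the function name and all host, port and user values are hypothetical placeholders.

```shell
# Server side, in /etc/ssh/sshd_config (then restart the ssh service):
#   Port 2222
#   PermitRootLogin no
#   PasswordAuthentication no
#   AllowUsers alice

# Client side: append a host alias to an ssh config file.
#   $1 = alias, $2 = host or IP, $3 = port, $4 = user,
#   $5 = config file (defaults to ~/.ssh/config)
add_ssh_alias() {
  alias_name="$1"
  host="$2"
  port="$3"
  user="$4"
  config="${5:-$HOME/.ssh/config}"
  cat >> "$config" <<EOF

Host $alias_name
    HostName $host
    Port $port
    User $user
EOF
}

# Usage (not run here):
#   add_ssh_alias droplet 203.0.113.10 2222 alice
#   ssh droplet
```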

  2. Install and configure ufw firewall. As we’re not using the default ssh port, we need to explicitly tell ufw which ssh port to allow.
sudo ufw allow <PORT>/tcp

(The /tcp part is optional, saying only allow tcp protocol over that port, not other protocols.)

We must also tell ufw to allow Docker: In /etc/default/ufw change DEFAULT_FORWARD_POLICY to ACCEPT, then:

sudo ufw reload
sudo ufw allow 2375/tcp

and similarly allow any ports we export for our various services (Gitlab, Drone, etc).

  3. Install and configure fail2ban. Prevents brute force password attacks. Be sure to set the config to match the chosen ssh port.

  4. Install and configure tripwire (intrusion detection).

  5. Update software:

sudo apt-get -q update && sudo apt-get -qy dist-upgrade

and then also update tripwire log:

sudo tripwire --check --interactive

Note: Clearly all these steps need to be run on the server itself, not merely in a container image deployed on the server, so that they are securing access to the actual host.
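
For the fail2ban step, a sketch of a jail override that matches the non-standard ssh port. The function name and the maxretry value are hypothetical choices, and the jail name is an assumption: recent fail2ban releases call the jail sshd, while older ones (such as the version on Ubuntu 14.04) call it ssh, so check the shipped jail.conf.

```shell
# Print a jail.local fragment enabling the ssh jail on a custom port.
#   $1 = ssh port, $2 = jail name (assumption: "sshd"; older releases use "ssh")
fail2ban_ssh_jail() {
  port="$1"
  jail="${2:-sshd}"
  cat <<EOF
[$jail]
enabled = true
port = $port
maxretry = 5
EOF
}

# Usage (not run here; overwrites any existing jail.local):
#   fail2ban_ssh_jail 2222 ssh | sudo tee /etc/fail2ban/jail.local
#   sudo service fail2ban restart
```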

Additional configuration

While we’re doing things, add user to the docker group for convenience: sudo addgroup cboettig docker

Enable swap on a small instance. Here we set up 1GB of swap (setting swap at twice the available RAM is the recommended rule-of-thumb, though makes less sense once RAM is large)

sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

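To make this persistent on reboot, edit /etc/fstab (sudo echo with a >> redirect fails here, since the redirect is performed by the unprivileged shell, so we pipe through sudo tee instead):

echo "/swapfile       none    swap    sw      0       0" | sudo tee -a /etc/fstab

For better performance, we might tweak swappiness to 10 (default is 60 out of 100, where 0 is never swap and 100 is swap frequently):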

echo 10 | sudo tee /proc/sys/vm/swappiness
echo vm.swappiness = 10 | sudo tee -a /etc/sysctl.conf

Set ownership

sudo chown root:root /swapfile
sudo chmod 0600 /swapfile

Server modules

Running different services as their own docker containers offers several advantages:

  • Containers often make it easier to install and deploy existing services, since the necessary configuration is scripted in the Dockerfile and we can often find Dockerfiles already made on Docker Hub for common services. This note illustrates several examples.

  • Containers may provide an added level of stability, since they run largely in isolation from each other.

  • Containers can be resource limited, e.g.

docker run -it -m 100m -c 100 ubuntu /bin/bash

would provide the container with 100 MB of RAM and 100 “shares” of CPU (this acts kind of like a niceness, where the default share of a container is 1024). On multicore machines you can also pass --cpuset "0" or --cpuset "0,1" etc, which is a list of which cpus (numbered 0 to n-1, as in /proc/cpuinfo) the container is permitted to use.

As noted in the link, restricting disk space is more tricky, though might become easier down the road.

ssh server:

Permit users to ssh directly into a container rather than access the server itself. Despite its simplicity, I found this a bit tricky to set up correctly, particularly in managing users.

Here’s the basic Dockerfile for an image we’ll call ssh. This creates a user given by the environmental variable. A few tricks:

  • We use adduser instead of useradd so that we get the home directory for the user created and granted the correct permissions automatically. We need the --gecos information so that we’re not prompted to enter the user’s full name etc. We use --disabled-password rather than set a password here.
  • Login is still possible through ssh key (as well as through nsenter on the host machine). We go ahead and add the ssh key now, though this could be done after the container is running by using nsenter.
  • In this dockerfile, we’ve added the user to sudoers group for root access on the container (installing software, etc). This won’t be active until the user has a password.
FROM     ubuntu:14.04
ENV USER cboettig
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN adduser --disabled-password --gecos "" $USER
RUN adduser $USER sudo
ADD authorized_keys /home/$USER/.ssh/authorized_keys
RUN chown $USER /home/$USER/.ssh/authorized_keys
CMD    ["/usr/sbin/sshd", "-D"]

When building the image, note that a copy of the authorized_keys file (containing the public key) must be found in the same directory as the Dockerfile so that it can be added to the image.

Start the ssh server on port 2200:

docker run -d -p 2200:22 --name="ssh" ssh

Add to the firewall permissions

sudo ufw allow 2200/tcp

From here I can now ssh in from the computer holding the private key that pairs with the public key added to the image. However, that user doesn’t have root access since we haven’t provided a password.

Use nsenter to enter the instance:

docker run -v /usr/local/bin:/target jpetazzo/nsenter
nsenter -m -u -n -i -p -t `docker inspect --format '{{ .State.Pid }}' ssh` /bin/bash

Create a password for the user to enable root access:

echo '<username>:<password>' | chpasswd

We can create other users and add them to sudoers or not as desired, e.g. add interactively using:

useradd <username>

users can later change their passwords once they log in.

Restoring containers when restarting

At certain times we need to power cycle the server (e.g. after certain software updates), using the sudo reboot now command or the DigitalOcean console. Any running containers will be automatically stopped. Once the machine is back up and we log back in, these containers can be brought back up with docker restart <container_id> (or <container_name>), and then everything is back to normal.

Note that while we can stop and restart a container, it seems we cannot simply save a container (e.g. with docker commit or docker save), re-run it, and expect the server functionality to be restored after the original container is destroyed (e.g. by docker rm -f). See previous notes (2014-09-05) for an illustration of this problem. This occurs because the container image does not include the volume where it writes its data, and that volume address is generated uniquely each time a container is run.

Consequently, a different (probably more traditional) approach is needed to backup the configuration of a server such as Gitlab or Drone-CI even when running in a container. Will explore this later.
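One option worth trying is to bind the container's data directory to a directory on the host, so that the data outlive any single container. A rough sketch, assuming the service keeps its state under one directory; the helper name and both paths are hypothetical placeholders, not the real locations used by the Gitlab or Drone images:

```shell
# Build (but do not run) a docker command that bind-mounts a host directory
# over the directory where the service keeps its state.
# host_dir and data_dir are hypothetical; check the image's documentation.
persistent_run_cmd() {
  name=$1
  image=$2
  host_dir=$3
  data_dir=$4
  printf 'docker run -d --name=%s -v %s:%s %s\n' \
    "$name" "$host_dir" "$data_dir" "$image"
}

# Print the command for inspection; other flags (ports, --privileged)
# would still need to be added by hand.
persistent_run_cmd drone mattgruter/drone /srv/drone-data /var/lib/drone
```

Printing the command rather than running it keeps the sketch harmless until the real data path has been confirmed.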

Meanwhile, remember to remove unneeded containers with

docker rm $(docker ps -a | grep Exited | awk '{print $1}')

and not with -f (destroying running containers!)
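A plain grep for Exited would also match a running container whose image or name happens to contain that word. A slightly stricter sketch that only matches the "Exited (<code>)" status text; the helper name is my own:

```shell
# Print the IDs of exited containers, reading `docker ps -a` style
# output on stdin. The header row is skipped; only rows whose STATUS
# column reads "Exited (<code>) ..." are matched.
exited_ids() {
  awk 'NR > 1 && /Exited \([0-9]+\)/ { print $1 }'
}

# Typical use (not run here; -r is a GNU xargs flag that skips the
# call when there is nothing to remove):
#   docker ps -a | exited_ids | xargs -r docker rm
```

Docker's own status filter (docker ps -a -q -f status=exited) should give the same list without any text parsing, where it is available.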

Key security

We can largely avoid needing a private ssh key for the server, though we may use https authentication to let us use git (rather than, say, rsync) to develop and save changes made remotely (say, through RStudio server).

Backing up keys

Probably unnecessary to have a backup of the ssh private RSA key, as we can access the DigitalOcean Server or Github through the web consoles and add a new public key and replace our private key.


Drone CI and Docker

05 Sep 2014

Drone CI: Continuous integration in custom docker environments

Having gotten accustomed to Docker, configuring the appropriate build environment for a Continuous Integration system like Travis CI or Shippable CI starts to feel incredibly tedious and archaic (particularly if you work primarily in a language like R or Haskell that usually isn’t supported out of the box). Drone CI instead lets us specify a custom docker image as the build environment, which has several advantages:

  • We do not have to hack together a custom image environment
  • We can build and test our environment locally instead of having to rely on trial-and-error pushes to the CI server
  • We do not have to download, compile and install the development environment each time, (which frequently takes longer than the CI checks themselves and can break)

(Shippable provides a persistent environment too, by preserving the state of your ‘minion’. But unlike with Shippable, I believe the Drone approach makes it unlikely that you create troublesome side-effects in your environment, such as removing a necessary dependency from the shippable.yml and yet not catching it since the dependency is still available on the minion from before. In the Drone approach, we start on the same docker image each time, but merely avoid the few minutes it might take to download that image.)

Unfortunately, custom images are not available on the fully hosted system (though perhaps they’d accept pull requests that would add an R environment to their image library). Fortunately, the Drone team kindly provides an open source version of their platform that can be run on a self-hosted or private server (such as the new web darling DigitalOcean or Amazon’s EC2). This has other advantages as well, such as using privately hosted repositories (it also integrates with BitBucket and GitLab) or running very long tests or simulations (since we’re now paying for the server time ourselves, after all).

The easy way: use docker

We can deploy the Drone CI server somewhat more seamlessly by running it in a container itself. Rather than worry about configuring Drone manually, we can simply launch an existing docker image for Drone, rather cleverly created by Matt Gruter:

docker run --name='drone' -d -p 8080:80 --privileged mattgruter/drone
  • Now we can follow the setup instructions. Be sure to use the matching case in the application name (Drone not drone) and the appropriate URLs for the authorization call back.

Note that we must use a different port than 80, and that we must give this port explicitly in the Authorization callback URL: http://localhost:8080/auth/login/github in order to authenticate.

Also note that in this approach, the Drone CI’s docker image library will be separate from the host’s docker image library. To manage or update the images available, we have to first nsenter into the Drone CI container.

This runs rather nicely on a tiny DigitalOcean droplet. Bear in mind that the tiny droplet has only 20 GB of disk space though, which can be consumed rather quickly with too many images. If many of the images use the same base templates, the total disk space required will fortunately be much lower than the sum of their virtual sizes.

Experimenting with saving images

Since Drone is running as a docker container, we can snapshot and export it for later use, and meanwhile can even destroy our server instance.

docker export drone > dronedroplet.tar

Not clear that this works. Consider saving an image instead? Save container named drone as image named drone:droplet

docker commit drone drone:droplet
docker save drone:droplet > dronedroplet.tar

Are these identical?

Hmm, doesn’t seem to store configuration, login is no longer valid. Starting a stopped container maintains the configuration of course, but not launching from scratch (e.g. the sqlite database is local to the container, not accessible through an externally linked volume).

Note that this tarball does not include the Drone CI image library itself, which is not part of the container but rather connected as a volume. This makes it quite a bit smaller, and that library can presumably be reconstructed from the docker hub.
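On the question of whether the export and commit/save tarballs are identical: as far as I understand, they are not. docker export writes out a container's flattened filesystem, while docker save writes an image's layers plus metadata. A quick way to check is to compare the top-level entries of the two tarballs. A minimal sketch; the helper name and the example filenames are my own:

```shell
# List the distinct top-level entries of a tar archive, one per line.
top_level_entries() {
  tar -tf "$1" \
    | sed -e 's|^\./||' -e '/^$/d' \
    | cut -d/ -f1 \
    | sort -u
}

# Typical use (not run here), writing the two tarballs to different
# files so that they can be compared:
#   docker export drone > drone-export.tar
#   docker save drone:droplet > drone-save.tar
#   top_level_entries drone-export.tar
#   top_level_entries drone-save.tar
```

I would expect the export listing to look like a root filesystem (bin, etc, usr, ...) and the save listing to show layer directories and metadata files instead.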

Configuring Drone CI: the hard way

  • Install and launch drone: (see drone/README)
  • Add DOCKER_OPTS="-H -d" to /etc/default/docker
  • Kill the docker daemon and restart docker. Or run docker with the explicit binding:
sudo docker -d -H &

Configuring an already-running docker session

Launch a named container in daemon mode:

docker run -d -p 8787:8787 --name='drone' mattgruter/drone

Use a docker-based install to add nsenter into your executable path:

docker run -v /usr/local/bin:/target jpetazzo/nsenter

Run nsenter to log into the docker container:

nsenter -m -u -n -i -p -t `docker inspect --format '{{ .State.Pid }}' drone` /bin/bash

Now we can update or delete images with docker pull, docker rmi, etc.

This is useful with many containers, for instance, with our ssh container or rstudio container we may want to modify usernames and passwords, etc:

useradd -m $USER && echo "$USER:$PASSWORD" | chpasswd

Making this easier:

Add to .bashrc:

function dock { sudo nsenter -m -u -n -i -p -t `docker inspect --format '{{ .State.Pid }}' "$1"` /bin/bash; }

This defines the function dock such that dock <name> will enter a running container named <name>. Note that we have to have nsenter bound to the executable path as indicated above. Yay less typing.


Docker tricks of the trade and best practices thoughts

29 Aug 2014

Best practices questions

Here are some tricks that may or may not be in keeping with best practices. Input would be appreciated.

  • Keep images small: use the --no-install-recommends option for apt-get, install true dependencies rather than big metapackages (like texlive-full).
  • Avoid creating additional AUFS layers by combining RUN commands, etc? (limit was once 42, but is now at least 127).
  • Can use RUN git clone ... in place of ADD to add data to a container, since ADD invalidates the build cache.

  • Use automated builds linked to Github-based Dockerfiles rather than pushing local image builds. Not only does this make the Dockerfile transparently available and provide a link to the repository where one can file issues, but it also helps ensure that the image available on the hub gets its base image (FROM entry) from the hub instead of whatever was available locally. This can help avoid various out-of-sync errors that might otherwise emerge.
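The first two points can be combined in a single RUN instruction. A minimal sketch that writes out an example Dockerfile; the helper name, the base image and the package are illustrative assumptions rather than a recommendation:

```shell
# Write a small example Dockerfile into the given directory.
# The base image and package are illustrative assumptions.
write_example_dockerfile() {
  cat > "$1/Dockerfile" <<'EOF'
FROM debian:testing
# One RUN instruction means one layer: update, install without
# recommended extras, and clean the apt lists in the same step.
RUN apt-get update \
  && apt-get install -y --no-install-recommends r-base \
  && rm -rf /var/lib/apt/lists/*
EOF
}

# Typical use (not run here):
#   write_example_dockerfile . && docker build -t example .
```

Cleaning the apt lists in the same RUN step matters, because files deleted in a later layer still take up space in the earlier one.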

Docker’s use of tags

Unfortunately, Docker seems to use the term tag in two senses. It refers to the label applied to an image (e.g. in docker build -t imagelabel . the -t argument “tags” the image as ‘imagelabel’ so we need not remember its hash), and it also refers to the string appended to an image name after a colon, e.g. latest in ubuntu:latest. The latter is the definition of “tags” as listed under the “tags” tab on the Docker Hub. Best practices for this kind of tag (which I’ll arbitrarily refer to as a ‘version tag’ to distinguish it) are unclear.
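To make the 'version tag' sense concrete, here is a small sketch that pulls the part after the colon out of an image reference. The helper name is my own, and the fallback to latest reflects my understanding of what docker assumes when no tag is given:

```shell
# Print the version tag of an image reference such as ubuntu:latest.
image_tag() {
  ref=$1
  # Drop any registry or namespace path, so that a registry port
  # (as in localhost:5000/r-base) is not mistaken for a tag.
  last=${ref##*/}
  case $last in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf 'latest\n' ;;  # no explicit tag given
  esac
}

# Usage:
#   image_tag eddelbuettel/docker-ubuntu-r:add-r-devel
```

Only the text after the last slash is inspected, which is why a colon in the registry part does not count as a tag.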

One case that is clear is tagging specific versions. Docker’s automated builds let a user link a “version tag” either to a branch or to a tag in the git history. A “branch” in this case can refer either to a different git branch or merely to a different sub-directory. Matching to a Git tag provides the most clear-cut use of the docker version tag, giving a relatively static, version-stable link. (I say “relatively” static because even when we do not change the Dockerfile, if we re-build it we may get a new image due to the presence of newer versions of the software included. This can be good with respect to fixing security vulnerabilities, but may also break a previously valid environment.)

The use case that is less clear is the practice of using these “version tags” in Docker to indicate other differences between related images, such as eddelbuettel/docker-ubuntu-r:add-r and eddelbuettel/docker-ubuntu-r:add-r-devel. Why these are different tags instead of different roots is unclear, unless it is for the convenience of multiple docker files in a single Github repository. Still, it is perfectly possible to configure completely separate docker hub automated builds pointing at the same Github repo, rather than adding additional builds as tags in the same docker hub repo.

Docker linguistics borrow from git terminology, but it’s rather dangerous to interpret these too literally.

Keeping a clean docker environment

  • Run interactive containers with the --rm flag to avoid having to remove them later.

  • Remove all stopped containers:

docker rm $(docker ps -a | grep Exited | awk '{print $1}')
  • Clean up un-tagged docker images:
docker rmi $(docker images -q --filter "dangling=true")
  • Stop and remove all containers (including running containers!)
docker rm -f $(docker ps -a -q)

Docker and Continuous Integration

  • We can install but cannot run Docker on Travis-CI at this time. It appears the linux kernel available there is much too old. Maybe when they upgrade to Ubuntu 14.04 images…

  • We cannot run Docker on the docker-based Shippable-CI (at least without a vagrant/virtualbox layer in-between). Docker on Docker is not possible (see below).

  • For the same reason, we cannot run Docker on the hosted Drone CI. However, Drone provides an open-source version of its system that can be run on your own server, which, unlike the fully hosted offering, permits custom images. Unfortunately I cannot get it working at this time.

Docker inside docker:

We cannot directly install docker inside a docker container. We can get around this by adding a complete virtualization layer – e.g. docker running in vagrant/virtualbox running in docker.

Alternatively, we can be somewhat more clever and tell our docker to simply use a different volume to store its AUFS layers. Matt Gruter has a very clever example of this, which can be used, e.g. to run a Drone server (which runs docker) inside a Docker container (mattgruter/drone).

I believe this only works if we run the outer docker image with --privileged permissions, so we cannot use this approach on a server like Shippable that drops us into a prebuilt docker container.
