# Lab Notebook

## (Introduction)

#### Coding

• cboettig pushed to master at cboettig/labnotebook: notes 11:07 2014/09/24
• cboettig commented on issue ropensci/ecoretriever#38: yup; some grepping could then format that appropriately On Wed, Sep 24, 2014 at 3:58 PM, Scott Chamberlain 11:01 2014/09/24
• cboettig commented on issue ropensci/ecoretriever#38: Sure, but it would also be nice to have this information (groups, datasets, etc) returned as R objects that a user can then manipulate, rather than … 10:52 2014/09/24
• cboettig opened issue weecology/retriever#210: Content and formatting in the output of retriever ls 10:23 2014/09/24
• cboettig commented on issue eddelbuettel/rocker#4: Right, I could see it either way. Also wonder if our directory structure should be modified to reflect the difference between base images and speci… 10:19 2014/09/24

#### Discussing

• Collapse of an ecological network in Ancient Egypt: Pages: 1-6. Justin D Yeakel, Mathias M Pires, Lars Rudolf, Nathaniel J Dominy, Paul L Koch et al. 07:35 2014/09/11
• Regime shifts in models of dryland vegetation: Phil Trans R Soc A. Yuval R Zelnik, Shai Kinast, Hezi Yizhaq, Golan Bel, Ehud Meron et al. 07:35 2014/09/11
• Temporal ecology in the Anthropocene: Ecology Letters (2014). Pages: n/a-n/a. E. M. Wolkovich, B. I. Cook, K. K. McLauchlan, T. J. Davies et al. 07:35 2014/09/11
• Uncertainty, learning, and the optimal management of wildlife: Environmental and Ecological Statistics (2001). Volume: 8, Issue: 3. Pages: 269-288. Byron K. Williams et al. 08:57 2014/08/06

24 Sep 2014

## rocker / docker

• Talking over strategy with Dirk, see summary in #1 and follow-up issues, #3, #4, #5.

## rdataone & EML

Trying to fix travis issues. Much craziness.

travis.sh doesn’t document setup for repos where the R package lives in a subdirectory. It looks like cd commands persist throughout a travis file, though, so that much is okay.

• dataone is imported by EML, but EML is in turn suggested by dataone. Since install_github likes to install the suggests list, this circularity creates problems:

• travis.sh install_github has no way to indicate that I don’t want to install suggests list.

• install_github("DataONEorg/dataone/dataone", dependencies=NA) should get around this by not installing the suggested EML package when trying to install dataone. For reasons inexplicable to me, that doesn’t seem to work(!), at least on travis. I’ve had to remove the package from the suggests list.

The build environment often seems a lot more fragile than the package itself. EML travis builds were rather badly broken with both dataone and rrdf disappearing from CRAN. If we were installing them from the ubuntu binaries, of course this wouldn’t be quite as common a problem. Or even better, if our build environment came as a custom docker image. Fixing this is not completely trivial: we now have to install these packages from github, which in the case of rrdf means installing the latex build environment simply to build the rrdf vignette that we don’t need.

While these issues can no doubt frustrate users as well, I’m not convinced that CI should really be testing build environment problems when I want it to be testing changes I’m making to my package. In the big picture, we need more stable build environments, and of course I’m asking for trouble by depending on lots of packages, particularly new, complex and otherwise fragile packages, so this testing is valuable. But on the other hand, this mostly just gets in the way. Ideally I should be able to point to a stable build environment and just ignore changes to the later packages until I want to deal with them. That’s what most users do with their own systems – not upgrading their personal libraries, distributions, etc, until they are ready to deal with anything that breaks. Being forced onto the bleeding edge all the time forces me to waste considerable time or accept a broken CI state that need not actually be broken.

#### Containerizing My Development Environment

22 Sep 2014

A key challenge for reproducible research is developing solutions that integrate easily into a researcher’s existing workflow. Having to move all of one’s development onto remote machines, into a particular piece of workflow software or IDE, or even just constrained to a window running a local virtual machine in an unfamiliar or primitive environment isn’t particularly appealing. In my experience this doesn’t reflect the workflow of even those folks already concerned about reproducibility, and is, I suspect, a major barrier in adoption of such tools.

One reason I find docker particularly attractive for reproducible research is the idea of containerizing my development environment into something I can transport or recreate identically anywhere, particularly on another Linux machine. This also provides a convenient backup system for my development environment: no need to remember each different program or how I installed or configured it when moving to a new machine.

## Using aliases

For me, a convenient way to do this involves creating a simple alias for running a container. This lets me choose between running a piece of software natively and running it in the container, while managing my files and data through my native operating system tools. I’ve set the following alias in my bashrc.

alias c='docker run --rm -it -v $(pwd):/home/$USER/$(basename $PWD) -w /home/$USER/$(basename $PWD) -e HOME=$HOME -e USER=$USER --user=$USER strata'

I can then just do c R (think c for container) to get R running in a container, c bash to drop into a bash shell on the container, c pandoc --version echoes the version of pandoc available on our container (or otherwise execute the container version of pandoc), and so forth.

### explanation: a non-root container

The trick here is primarily to handle permissions appropriately. Docker runs as the root user by default, so any files created or modified from the container become owned by root instead of the user, which is clearly not desirable. Getting around this requires quite a bit of trickery. The breakdown of each of these arguments is as follows:

• --rm removes this container when we quit; we don’t need to let it persist as a stopped container we could later return to.
• -it Interactive terminal
• -v binds a host volume to the container. Files in the host working directory (pwd) will be available on the container, and changes made on the container are immediately written to the host directory:
-v $(pwd):/home/$USER/$(basename $PWD)
The path after the colon specifies where this directory should live on the container: we place it in a directory that has the same name as the current working directory, $(basename $PWD), located in the home directory of the user (i.e. where the user has write permissions).

• -w specifies the working directory we should drop into when our session on the container starts. We set this to match the path where we have just mounted our current working directory:
-w /home/$USER/$(basename $PWD)
• -e HOME=$HOME sets the value of the environmental variable HOME to whatever it is on the host machine (e.g. /home/username), so that when R tries to access ~/, it gets the user’s directory and not the root directory.
• -e USER=$USER Though this seems redundant, we set the user environmental variable by default in the cboettig/rstudio image, so this overrides that environmental variable with the current user.

• --user=$USER Specifies the user we log in as. This is rather important, otherwise we find that we are root (or whatever user has been set in the Dockerfile). That would cause any files we generate from the container to be owned by the root user, not our local user. Note that this only works if the specific user has already been created (e.g. by adduser) on the container, otherwise this will fail.
• strata The name of the image to run (could be cboettig/ropensci, but my strata image provides a few additional customizations, created by its own Dockerfile). That Dockerfile (and its FROM dependencies) specifies all the software available on this container. Importantly, it also already creates my username in its Dockerfile. Otherwise, the argument given above should use --user=rstudio, since the rstudio user is already created by the base image cboettig/rstudio, and thus available in cboettig/ropensci and strata. Note that this user can be created at run time by passing the environmental variable -e USER=$USER when running in daemon mode, since the user is then created by the start-up script. However, when we provide a custom command (like /usr/bin/R in this example), the CMD from the Dockerfile is overridden and the user isn’t created.
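
The same pieces can also be collected into a shell function, which is a little easier to read and to quote safely than a one-line alias. This is only a sketch: the DOCKER and IMAGE variables are hypothetical overrides added here for illustration (they are not part of the alias above), and setting DOCKER=echo prints the command instead of running it.

```shell
# Function form of the `c` alias (a sketch, not what is in my bashrc).
# DOCKER and IMAGE are hypothetical overrides introduced for illustration.
c() {
  local dir
  # Mirror the current directory under the user's home on the container.
  dir="/home/$USER/$(basename "$PWD")"
  "${DOCKER:-docker}" run --rm -it \
    -v "$PWD:$dir" \
    -w "$dir" \
    -e HOME="$HOME" \
    -e USER="$USER" \
    --user="$USER" \
    "${IMAGE:-strata}" "$@"
}
```

With this defined, c R and c bash should behave like the alias, and DOCKER=echo c R shows the full docker run command without starting a container.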

A stricter alias I considered first enforces running R as a container rather than a local operation:

The process is rather manual: we have to sudo poweroff the droplet and then trigger the snapshot (the container will come back online after that, though we have to restart the services / active docker containers). We also have to delete old snapshots manually. Some of this can be automated from the API. DigitalOcean uses redundant storage for these (paying $0.01/month/gigabyte to Amazon Glacier), but at the moment we can’t export these images. Snapshots are also handy to deploy to a larger (but not smaller) droplet.

### Digital Ocean Backups

These backups are an automated, always-online alternative to snapshots but must be initialized when the droplet is created and cost more (20% of server cost).

## Manually configuring backups

To have the flexibility to restore individual pieces, to move between machines, etc. we need a different approach.

### Container backups

Docker containers, including running containers, should be effectively backed up by either of these approaches to the state we would be in after a power cycle (e.g. we may need to start stopped containers, but not rebuild them from scratch). Nevertheless we may want to back up containers themselves. For many containers this is trivial (e.g. our ssh container): we can just commit the running container to an image and save that as a tar archive (or equivalently, just export the container to a tarball). If the containers have a VOLUME command in their dockerfile or in their execution, however, this is insufficient.

Containers using volumes (such as sameersbn/gitlab and mattgruter/drone) need four things to be backed up:

• Dockerfile (or container image, from save or commit)
• volume
• volume path in container
• name of the container the volume belongs to

A utility makes this easier.

### Sparkleshare

Sparkleshare is a git-backed dropbox alternative. With binaries for most major platforms (Windows, Mac, Ubuntu/Linux) it’s pretty easy to set up and acts in much the same way, with automated synch and notifications.
The backend just needs a server running git – Gitlab is a great way to set this up to permit relatively easy sharing / user management. (Ignore the information about setting up separately on a server, Gitlab is much easier. Also ignore advice about building from source on Ubuntu, installing the binary is far more straightforward: apt-get install sparkleshare.) Certainly it is not as feature rich as dropbox (e.g. email links to add users, web links to share individual files), but it offers easy sharing over the server at no extra cost. The Sparkleshare directory is also a fully functional git repo.

### Encrypted backup of filesystem with duplicity

Good for backing up to another host for which we have ssh access, or to an Amazon S3 bucket, etc. (Unclear if this works with Glacier due to upload-only set-up).

### Some other rates for data storage:

• Compare to S3 ($0.03 /gig/month)
• EBS ($0.12 /gig/month) (really for computing I/O, not storage).
• Remarkably, Google Drive and Dropbox now offer 1 TB at $10 / mo. Clearly a lot can be saved by ‘overselling’ (most users will not use their capacity) and by shared files (counting against the space for all users but requiring no more storage capacity). Nonetheless, impressive, on par with Glacier (without the bandwidth charges or delay).
• For comparison, (non-redundant, non-enterprise, disk-based) storage is roughly $100/TB, or on order of that annual cost.

#### Server Security Basics

08 Sep 2014

## Security configuration

We set up SSH key-only login on a non-standard port, with root login forbidden. We then set up ufw firewall, fail2ban, and tripwire.

1. Configure an SSH key login. Next, create a user, add to sudoers, and then disable root login. Edit /etc/ssh/sshd_config:

• Disable root logins. (We’ll need to add ourselves to sudo first: adduser, edit /etc/sudoers)
• Change ssh port from default to something else.
• Whitelist user login ids

Additionally, let’s be sure to disable password authentication: add PasswordAuthentication no to /etc/ssh/sshd_config. (Editing PermitRootLogin only doesn’t do this.)

Locally add an entry in ~/.ssh/config to alias the host and port to avoid having to remember these numbers for login. Run ssh-copy-id <droplet-ip> to enable key-based login for the user.

2. Install and configure ufw firewall. As we’re not using the default ssh port, we need to explicitly tell ufw which ssh port to allow.

sudo ufw allow <PORT>/tcp

(The /tcp part is optional, saying only allow tcp protocol over that port, not other protocols.)

We must also tell ufw to allow Docker: in /etc/default/ufw change DEFAULT_FORWARD_POLICY to ACCEPT, then:

sudo ufw reload
sudo ufw allow 2375/tcp

and similarly allow any ports we export for our various services (Gitlab, Drone, etc).

3. Install and configure fail2ban. Prevents brute force password attacks. Be sure to adjust the config to match the chosen ssh port.
4. Install and configure tripwire (intrusion detection).
5. Update software:

sudo apt-get -q update && sudo apt-get -qy dist-upgrade

and then also update tripwire log:

sudo tripwire --check --interactive

Note: Clearly all these steps need to be run on the server itself, not merely in a container image deployed on the server, so that they are securing access to the actual host.
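
The sshd_config edits from step 1 and the matching client-side ~/.ssh/config entry can be sketched as two small functions that simply print the relevant lines. The port 2222, the host alias droplet and the address 203.0.113.10 are placeholders chosen for illustration, not values from my server, and both function names are hypothetical; the keywords themselves (Port, PermitRootLogin, PasswordAuthentication, AllowUsers, Host, HostName, User) are standard OpenSSH options.

```shell
# Print the server-side settings described above. These should replace any
# existing lines for the same keywords in /etc/ssh/sshd_config; nothing here
# modifies the system.
sshd_settings() {
  local port="$1" user="$2"
  printf 'Port %s\n' "$port"
  printf 'PermitRootLogin no\n'
  printf 'PasswordAuthentication no\n'
  printf 'AllowUsers %s\n' "$user"
}

# Print a client-side ~/.ssh/config stanza so that `ssh <name>` remembers
# the address, port and user for us.
ssh_client_entry() {
  local name="$1" host="$2" port="$3" user="$4"
  printf 'Host %s\n' "$name"
  printf '    HostName %s\n' "$host"
  printf '    Port %s\n' "$port"
  printf '    User %s\n' "$user"
}

# Example with placeholder values:
sshd_settings 2222 cboettig
ssh_client_entry droplet 203.0.113.10 2222 cboettig
```

After adding the second stanza to ~/.ssh/config, ssh droplet would connect without needing the port or address on the command line.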
## Additional configuration

While we’re doing things, add user to the docker group for convenience:

sudo addgroup cboettig docker

Enable swap on a small instance. Here we set up 1GB of swap (setting swap at twice the available RAM is the recommended rule-of-thumb, though this makes less sense once RAM is large):

sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

To make this persistent on reboot, add an entry to /etc/fstab:

echo "/swapfile none swap sw 0 0" | sudo tee -a /etc/fstab

For better performance, we might tweak swappiness to 10 (default is 60 out of 100, where 0 is never swap and 100 is swap frequently):

echo 10 | sudo tee /proc/sys/vm/swappiness
echo vm.swappiness = 10 | sudo tee -a /etc/sysctl.conf

Set ownership:

sudo chown root:root /swapfile
sudo chmod 0600 /swapfile

## Server modules

Running different services as their own docker containers offers several advantages:

• Containers often make it easier to install and deploy existing services, since the necessary configuration is scripted in the Dockerfile and we can often find Dockerfiles already made on Docker Hub for common services. This note illustrates several examples.
• Containers may provide an added level of stability, since they run largely in isolation from each other.
• Containers can be resource limited, e.g.
docker run -it -m 100m -c 100 ubuntu /bin/bash
would provide the container with 100 MB of RAM and 100 “shares” of CPU (acts kind of like a niceness, where the default share of a container is 1024). On multicore machines you can also pass --cpuset "0" or --cpuset "0,1" etc, which is a list of which cpus (numbered 0 to n-1, as in /proc/cpuinfo) the container is permitted to use.

As noted in the link, restricting disk space is more tricky, though might become easier down the road.

### ssh server:

Permit users to ssh directly into a container rather than access the server itself.
Despite its simplicity, I found this a bit tricky to set up correctly, particularly in managing users. Here’s the basic Dockerfile for an image we’ll call ssh. This creates a user given by the environmental variable. A few tricks:

• We use adduser instead of useradd so that we get the home directory for the user created and granted the correct permissions automatically. We need the --gecos information so that we’re not prompted to enter the user’s full name etc. We use --disabled-password rather than set a password here.
• Login is still possible through ssh key (as well as through nsenter on the host machine). We go ahead and add the ssh key now, though this could be done after the container is running by using nsenter.
• In this dockerfile, we’ve added the user to sudoers group for root access on the container (installing software, etc). This won’t be active until the user has a password.

FROM ubuntu:14.04
ENV USER cboettig
RUN apt-get update && apt-get install -y openssh-server
# sshd expects this directory to exist
RUN mkdir /var/run/sshd
# create the user without a password and without interactive prompts
RUN adduser --disabled-password --gecos "" $USER
RUN adduser $USER sudo
# authorized_keys must sit next to the Dockerfile at build time
ADD authorized_keys /home/$USER/.ssh/authorized_keys
RUN chown $USER /home/$USER/.ssh/authorized_keys
EXPOSE 22
CMD    ["/usr/sbin/sshd", "-D"]

When building the image, note that a copy of the authorized_keys file (containing the contents of the id_rsa.pub public key) must be found in the same directory as the Dockerfile so that it can be added to the image.
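
Preparing the build directory might look like the following sketch. The helper name prepare_ssh_context and its arguments are hypothetical, and the ./ssh directory in the usage comment is an assumed location for the Dockerfile; the function only copies an existing public key into place as authorized_keys.

```shell
# Copy a public key into a Docker build directory as `authorized_keys`.
# prepare_ssh_context is a hypothetical helper, not part of Docker.
prepare_ssh_context() {
  local pubkey="$1" context="${2:-.}"
  if [ ! -f "$pubkey" ]; then
    echo "public key not found: $pubkey" >&2
    return 1
  fi
  mkdir -p "$context"
  cp "$pubkey" "$context/authorized_keys"
  # Readable by everyone, writable only by the owner.
  chmod 644 "$context/authorized_keys"
}

# Typical use, assuming the Dockerfile above lives in ./ssh:
#   prepare_ssh_context ~/.ssh/id_rsa.pub ./ssh
#   docker build -t ssh ./ssh
```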

Start the ssh server on port 2200:

docker run -d -p 2200:22 --name="ssh" ssh

sudo ufw allow 2200/tcp

From here I can ssh in from the computer holding the private key that pairs with the public key added to the image. However, that user doesn’t have root access since we haven’t provided a password.

Use nsenter to enter the instance:

docker run -v /usr/local/bin:/target jpetazzo/nsenter
nsenter -m -u -n -i -p -t $(docker inspect --format '{{ .State.Pid }}' ssh) /bin/bash

Create a password for the user to enable root access:

### Making this easier:

Add to .bashrc:

function dock { sudo nsenter -m -u -n -i -p -t $(docker inspect --format '{{ .State.Pid }}' "$1") /bin/bash; }

This defines the function dock such that dock <name> will enter a running container named <name>. Note that we have to have nsenter bound to the executable path as indicated above. Yay less typing.

#### Docker tricks of the trade and best practices thoughts

29 Aug 2014

## Best practices questions

Here are some tricks that may or may not be in keeping with best practices; input would be appreciated.

• Keep images small: use the --no-install-recommends option for apt-get, install true dependencies rather than big metapackages (like texlive-full).
• Avoid creating additional AUFS layers by combining RUN commands, etc? (limit was once 42, but is now at least 127).
• Can use RUN git clone ... to add data to a container in place of ADD, which invalidates caching.
• Use automated builds linked to Github-based Dockerfiles rather than pushing local image builds. Not only does this make the Dockerfile transparently available and provide a link to the repository where one can file issues, but it also helps ensure that the image available on the hub gets its base image (FROM entry) from the hub instead of whatever was available locally. This can help avoid various out-of-sync errors that might otherwise emerge.

### Docker’s use of tags

Unfortunately, Docker seems to use the term tag to refer both to the label applied to an image (e.g. in docker build -t imagelabel . the -t argument “tags” the image as ‘imagelabel’ so we need not remember its hash), and to the string applied to the end of an image name after a colon, e.g. latest in ubuntu:latest. The latter is the definition of “tags” as listed under the “tags” tab on the Docker Hub. Best practices for this kind of tag (which I’ll arbitrarily refer to as a ‘version tag’ to distinguish it) are unclear. One case that is clear is tagging specific versions.
Docker’s automated builds let a user link a “version tag” to either a branch or a tag in the git history. A “branch” in this case can refer either to a different git branch or merely a different sub-directory. Matching to a Git tag provides the most clear-cut use of the docker version-tag, providing a relatively static, version-stable link. (I say “relatively” static because even when we do not change the Dockerfile, if we re-build the Dockerfile we may get a new image due to the presence of newer versions of the software included. This can be good with respect to fixing security vulnerabilities, but may also break a previously valid environment).

The use case that is less clear is the practice of using these “version tags” in Docker to indicate other differences between related images, such as eddelbuettel/docker-ubuntu-r:add-r and eddelbuettel/docker-ubuntu-r:add-r-devel. Why these are different tags instead of different roots is unclear, unless it is for the convenience of multiple docker files in a single Github repository. Still, it is perfectly possible to configure completely separate docker hub automated builds pointing at the same Github repo, rather than adding additional builds as tags in the same docker hub repo.

Docker linguistics borrow from git terminology, but it’s rather dangerous to interpret these too literally.

## Keeping a clean docker environment

• run interactive containers with --rm flag to avoid having to remove them later.
• Remove all stopped containers:
docker rm $(docker ps -a | grep Exited | awk '{print $1}')
• Clean up un-tagged docker images:
docker rmi $(docker images -q --filter "dangling=true")
• Stop and remove all containers (including running containers!)
docker rm -f $(docker ps -a -q)

## Docker and Continuous Integration

• We can install but cannot run Docker on Travis-CI at this time. It appears the linux kernel available there is much too old. Maybe when they upgrade to Ubuntu 14.04 images…
• We cannot run Docker on the docker-based Shippable-CI (at least without a vagrant/virtualbox layer in-between). Docker on Docker is not possible (see below).
• For the same reason, we cannot run Docker on drone.io CI. However, Drone provides an open-source version of its system that can be run on your own server, which unlike the fully hosted offering, permits custom images. Unfortunately I cannot get it working at this time.

## Docker inside docker:

We cannot directly install docker inside a docker container. We can get around this by adding a complete virtualization layer – e.g. docker running in vagrant/virtualbox running in docker. Alternatively, we can be somewhat more clever and tell our docker to simply use a different volume to store its AUFS layers. Matt Gruter has a very clever example of this, which can be used, e.g. to run a Drone server (which runs docker) inside a Docker container (mattgruter/drone). I believe this only works if we run the outer docker image with --privileged permissions, e.g. we cannot use this approach on a server like Shippable that is dropping us into a prebuilt docker container.

#### Pdg Controlfest Notes

14 Aug 2014

Just wanted to give a quick update on stuff relevant to our adjustment costs paper in events of this week.

I think the talk on Tuesday went all right, (though thanks to a technology snafu going from reveal.js to pdf my most useful figure actually showing the bluefin tuna didn’t display – I tried not to let on). I tried to keep the focus pretty big-picture throughout (we ignore these costs when we model, they matter) and avoid being too bold / prescriptive (e.g. not suggesting we found the ‘right’ way to model these costs).
I also could not stop myself from calling the adjustment cost models L1 L2 L3 instead of “linear” “quadratic” and “fixed”, or _1,2,3. whoops.

One question asked about asymmetric costs. You may recall we started off doing this but ran into some unexpected results where they just looked like the cost free case, possibly due to problems with the code. We should probably at least say this is an area for further study. Another question asked about just restricting the period of adjustment, say, once every 5 years or so. I answered that we hoped to see what cost structures “induced” that behavior rather than enforcing it explicitly; but I should probably add some mention of this to the text as well. I think the other questions were more straightforward but don’t remember any particulars.

The Monday meeting was very helpful for me in framing the kind of big questions around the paper:

1. Can we make this story about more than TAC-managed fisheries? My ideal paper would be something people could cite to show that simply using profit functions with diminishing returns is not a sufficient way to reflect this reality (could be the opposite if reality is more like a transaction fee), and that this mistake can be large. But all our examples are in the fisheries context, so this may take some finesse. (Since we’re aiming for Eco Apps rather than, say, Can Jor Fisheries)
2. Emphasizing the “Pretty Darn Good” angle – thinking of the policies we derive with adjustment costs not as the “True optimum” but as a “Pretty Darn Good” policy that can be more robust to adjustment costs – (Provided you have intuition to know if those costs are more like a fixed transaction fee or some proportional cost). The last two figures help with this, since they show using policies under different cost regimes than those under which they were computed to be optimal.
3. Need to figure out what to say about policies that can ‘self-adjust’, e.g. when you don’t have to change the law to respond to the fluctuations. (Jim pointed out that Salmon are the best/only case where you can actually manage by “escapement” since you get a complete population census from the annual runs).

• Stripping down the complexity of the charts
• Conversely, may need to show some examples of the fish stock dynamics (In search for simplicity I’ve focused almost all the graphs on harvest dynamics).
• Calibrating and running the case of quadratic control term for comparison

As a bonus, I quickly ran the tipping point models, and it looks like these stay really close to the Reed solution – e.g. relative to the safer Beverton Holt world, they are much happier to pay whatever the adjustment cost might be to stick with the optimal than they are to risk total collapse. Not sure but maybe should add this into the paper…

#### Docker Notes

14 Aug 2014

Ticking through a few more of the challenges I raised in my first post on docker; here I explore some of the issues about facilitating interaction with a docker container so that a user’s experience is more similar to working in their own environment and less like working on a remote terminal over ssh. While technically minor, these issues are probably the first stumbling blocks in making this a valid platform for new users.

## Sharing a local directory

Launch a bash shell on the container that shares the current working directory of the host machine (from pwd) with the /host directory on the container (thanks to Dirk for this solution):

docker run -it -v $(pwd):/host cboettig/ropensci-docker /bin/bash

This allows a user to move files on and off the container, use a familiar editor and even handle things like git commits / pulls / pushes in their working directory as before. Then the code can be executed in the containerized environment which handles all the dependencies. From the terminal docker opens, we just cd /host where we find our working directory files, and can launch R and run the scripts. A rather clean way of maintaining the local development experience but containerizing execution.
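
The same volume mapping also works non-interactively, which may be closer to what a script-driven workflow needs. A minimal sketch, assuming the image provides Rscript: the function name run_in_container and the DOCKER and IMAGE override variables are hypothetical and exist only so the command can be previewed with DOCKER=echo, and analysis.R is a placeholder file name.

```shell
# Run a single command in the container against the current directory.
# run_in_container, DOCKER and IMAGE are hypothetical names for illustration.
run_in_container() {
  # Mount the current directory at /host, start there, and run the command.
  "${DOCKER:-docker}" run --rm \
    -v "$PWD:/host" \
    -w /host \
    "${IMAGE:-cboettig/ropensci-docker}" "$@"
}

# e.g. run_in_container Rscript analysis.R
```

Output files written by the script land in the host working directory, so editing and git operations stay on the host as described above.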

In particular, this frees us from having to pass our git credentials etc to the container, though is not so useful if we’re wanting to interact with the container via the RStudio server instead of R running in the terminal. (More on getting around this below).

Unfortunately, Mac and Windows users have to run Docker inside an already-virtualized environment such as that provided by boot2docker or vagrant. This means that only the directories on the virtualized environment, not those on the native OS, can be shared in this way. While one could presumably keep a directory synced between this virtual environment and the native OS (standard in vagrant), this is a problem for the easier-to-use boot2docker at this time: (docker/issues/7249).

## A Docker Desktop

Dirk brought this docker-desktop to my attention, which uses Xpra (in place of X11 forwarding) to provide a window with fluxbox running on Ubuntu along with common applications like libreoffice, firefox, and rox file manager. Pretty clever, and worked just fine for me, but needs Xpra on the client machine and requires some extra steps (run the container, query for passwords and ports, run ssh to connect, then run Xpra to launch the window). The result is reasonably responsive but still slower than virtualbox, and probably too slow for real work.

## Base images?

The basic Ubuntu:14.04 seems like a good lightweight base image (at 192 MB), but other images try to give more useful building blocks, like phusion/baseimage (423 MB). Their docker-bash script and other utilities provide some handy features for managing / debugging containers.

## Other ways to share files?

Took a quick look at this Dockerfile for running dropbox, which works rather well (at least on a linux machine, since it requires local directory sharing). This could probably be done without explicit linking to local directories, to facilitate moving files on and off the container. Of course one can always scp/rsync files on and off containers if ssh is set up, but that is unlikely to be a popular solution for students.

While we have rstudio server running nicely in a Docker container for local or cloud use, it’s still an issue getting Github ssh keys set up to be able to push changes to a repo. We can get around this by linking to our keys directory with the same -v option shown above. We still need a few more steps: setting the Git username and email, and running ssh-add for the key. Presumably we could do this with environmental variables and some adjustment to the Dockerfile:

docker run -it -v /path/to/keys:/home/rstudio/.ssh/ -e "USERNAME=Carl Boettiger" -e "EMAIL=cboettig@example.org" cboettig/ropensci-docker

which would prevent storing these secure values on the image itself.
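
The “adjustment to the Dockerfile” could be a small start-up script that reads those variables. This is a guess at what that might look like rather than something the image does today: configure_git_identity is a hypothetical name, the id_rsa key file in the usage comment is an assumption, and the function relies only on git config --global.

```shell
# Hypothetical start-up helper: set the git identity from the USERNAME and
# EMAIL environment variables passed with `docker run -e ...`.
configure_git_identity() {
  if [ -n "${USERNAME:-}" ]; then
    git config --global user.name "$USERNAME"
  fi
  if [ -n "${EMAIL:-}" ]; then
    git config --global user.email "$EMAIL"
  fi
}

# Inside the container, once the keys directory is mounted at ~/.ssh
# (id_rsa is an assumed key name):
#   configure_git_identity
#   eval "$(ssh-agent -s)" && ssh-add ~/.ssh/id_rsa
```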

#### An appropriate amount of fun with docker?

08 Aug 2014

An update on my exploration with Docker. Title courtesy of Ted, with my hopes that this really does move us in a direction where we can spend less time thinking about the tools and computational environments. Not there yet though.

I’ve gotten RStudio Server working in the ropensci-docker image (Issues/pull requests welcome!).

docker run -d -p 8787:8787 cboettig/ropensci-docker

will make an RStudio server instance available to you in your browser at localhost:8787. (Change the first number after the -p to have a different address). You can log in with username:pw rstudio:rstudio and have fun.
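
For example, to serve RStudio on a different host port only the first number changes. A small sketch: rstudio_server is a hypothetical wrapper, and the DOCKER variable is an illustrative override so that DOCKER=echo previews the command rather than running it.

```shell
# Start the RStudio Server container on a chosen host port (default 8787).
# rstudio_server and DOCKER are hypothetical names for illustration.
rstudio_server() {
  local port="${1:-8787}"
  # host port : container port (RStudio Server listens on 8787 in the image)
  "${DOCKER:-docker}" run -d -p "$port:8787" cboettig/ropensci-docker &&
    echo "RStudio should be available at http://localhost:$port" >&2
}

# e.g. rstudio_server 8080
```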

One thing I like about this is the ease with which I can now get an RStudio server up and running in the cloud (e.g. I took this for a sail on DigitalOcean.com today). This means that in a few minutes and for 1 penny you have a URL that you and any collaborators could use to interact with R using the familiar RStudio interface, already provisioned with your data and dependencies in place.

For me this is a pretty key development. It replaces a lot of command-line only interaction with probably the most familiar R environment out there, online or off. For more widespread use or teaching this probably needs to get simpler still. I’m still skeptical that this will make it out beyond the crazies, but I’m less skeptical than I was when starting this out.

The ropensci-docker image could no doubt be more modular (and better documented). I’d be curious to hear if anyone has had success or problems running docker on windows / mac platforms. Issues or pull requests on the repo would be welcome! https://github.com/ropensci/docker-ubuntu-r/blob/master/add-r-ropensci/Dockerfile (maybe the repo needs to be renamed from its original fork now too…)

Rich et al highlighted several “remaining challenges” in their original post. Here’s my take on where those stand in the Docker framework, though I’d welcome other impressions:

1. dependencies could still be missed by incomplete documentation

I think this one is largely addressed, at least assuming a user loads the Docker image. I’m still concerned that later builds of the docker image could simply break the build (though earlier images may still be available). Does anyone know how to roll back to earlier images in docker?

2. The set of scripts for managing reproducibility are at least as complex as the analysis itself

I think a lot of that size is due to the lack of an R image for Travis and the need to install many common tools from scratch. Because docker is both modular and easily shared via docker hub, it’s much easier to write a really small script that builds on existing images, (as I show in cboettig/rnexml)

3. Travis.org CI constraints: public/open github repository with analyses that run in under 50 minutes.

Docker has two advantages and also some weaknesses here: (1) it should be easy to run locally, while accomplishing much of the same thing as running on travis (though clearly that’s not as nice as running automatically & in the cloud on every push). (2) It’s easier to take advantage of caching – for instance, cboettig/rnexml provides the knitr cache files in the image so that a user can start exploring without waiting for all the data to download and code to run.

It seems that Travis CI doesn’t currently support docker since the linux kernel they use is too old. (Presumably they’ll update one day. Anyone try Shippable CI? (which supports docker))

4. The learning curve is still prohibitive

I think that’s still true. But what surprised me is that I’m not sure that it’s gotten any worse by adding docker than it was to begin with using Travis CI. Because the approach can be used both locally and for scaling up in the cloud, I think it offers some more immediate payoffs to users than learning a Github+CI approach does. (Notably it doesn’t require any git just to deploy something ‘reproducible’, though of course it works nicely with git.)