Lab Notebook



  • cboettig pushed to master at cboettig/labnotebook: notes 11:07 2014/09/24
  • cboettig commented on issue ropensci/ecoretriever#38: yup; some grepping could then format that appropriately On Wed, Sep 24, 2014 at 3:58 PM, Scott Chamberlain 11:01 2014/09/24
  • cboettig commented on issue ropensci/ecoretriever#38: Sure, but it would also be nice to have this information (groups, datasets, etc) returned as R objects that a user can then manipulate, rather than … 10:52 2014/09/24
  • cboettig opened issue weecology/retriever#210: Content and formatting in the output of `retriever ls` 10:23 2014/09/24
  • cboettig commented on issue eddelbuettel/rocker#4: Right, I could see it either way. Also wonder if our directory structure should be modified to reflect the difference between base images and speci… 10:19 2014/09/24



  • Collapse of an ecological network in Ancient Egypt: Pages: 1-6. Justin D Yeakel, Mathias M Pires, Lars Rudolf, Nathaniel J Dominy, Paul L Koch et al. 07:35 2014/09/11
  • Regime shifts in models of dryland vegetation Regime shifts in models of dryland vegetation: Yuval R Zelnik, Shai Kinast, Hezi Yizhaq, Golan Bel, Ehud Meron, Phil Trans R Soc A et al.Published using Mendeley: Academic software for researchers 07:35 2014/09/11
  • Temporal ecology in the Anthropocene: Ecology Letters (2014). Pages: n/a-n/a. E. M. Wolkovich, B. I. Cook, K. K. McLauchlan, T. J. Davies et al. 07:35 2014/09/11
  • Uncertainty , learning , and the optimal management of wildlife: Environmental and Ecological Statistics (2001). Volume: 8, Issue: 3. Pages: 269-288. Byron K. Williams et al. 08:57 2014/08/06



24 Sep 2014

rocker / docker

  • Talking over strategy with Dirk, see summary in #1 and follow-up issues, #3, #4, #5.

rdataone & EML

Trying to fix travis issues. Much craziness. doesn’t provide notes on setup for repos where the R package is in a subdirectory. Looks like cd commands are persistent though throughout a travis file. okay.

  • dataone is imported by EML but suggested by dataone. Since install_github likes to install the suggests list, this creates problems:

  • install_github has no way to indicate that I don’t want to install suggests list.

  • install_github("DataONEorg/dataone/dataone", dependencies=NA) should get around this by not installing the suggested EML package when trying to install dataon. For reasons inexplicable to me, that doesn’t seem to work(!), at least on travis. I’ve had to remove the package from the suggests list.

The build environment often seems a lot more fragile than the package itself. EML travis builds were rather badly broken with both dataone and rrdf disappearing from CRAN. If we were installing them from the ubuntu binaries, of course this wouldn’t be quite as common a problem. Or even better, if our build environment came as a custom docker image. Fixing this is not completely trivial: we now have to install these packages from github, which in the case of rrdf means installing the latex build environment simply to build the rrdf vignette that we don’t need.

While these issues can no doubt frustrate users as well, I’m not convinced that CI should really be testing build environment problems when I want it to be testing changes I’m making to my package. In the big picture, we need more stable build environments, and of course I’m asking for trouble by depending on lots of packages, particularly new, complex and otherwise fragile packages, so this testing is valuable. But on the other hand, this mostly just gets in the way. Ideally I should be able to point to a stable build environment and just ignore changes to the later packages until I want to deal with them. That’s what most users do with their own systems – not upgrading their personal libraries, distributions, etc, until they are ready to deal with anything that breaks. Being forced onto the bleeding edge all the time forces me to waste considerable time or accept a broken CI state that need not actually be broken.

Read more

Containerizing My Development Environment

22 Sep 2014

A key challenge for reproducible research is developing solutions that integrate easily into a researcher’s existing workflow. Having to move all of one’s development onto remote machines, into a particular piece of workflow software or IDE, or even just constrained to a window running a local virtual machine in an unfamiliar or primitive environment isn’t particularly appealing. In my experience this doesn’t reflect the workflow of even those folks already concerned about reproducibility, and is, I suspect, a major barrier in adoption of such tools.

One reason I find docker particularly attractive for reproducible research it the idea of containerizing my development environment into something I can transport or recreate identically anywhere, particularly on another Linux machine. This also provides a convenient backup system for my development environment, no need to remember each different program or how I installed or configured it when moving to a new machine.

Using aliases

For me, a convenient way to do this involves creating a simple alias for running a container. This allows me to distinguish between running any software and the container, while managing my files and data through my native operating system tools. I’ve set the following alias in my bashrc.

alias c='docker run --rm -it -v $(pwd):/home/$USER/`basename $PWD` -w /home/$USER/`basename $PWD` -e HOME=$HOME -e USER=$USER --user=$USER strata'

I can then just do c R (think c for container) to get R running in a container, c bash to drop into a bash shell on the container, c pandoc --version echoes the version of pandoc available on our container (or otherwise execute the container version of pandoc), and so forth.

explanation: a non-root container

The trick here is primarily to handle permissions appropriately. Docker is run as a root user by default, which results in any files created or modified become owned by root instead of the user, which is clearly not desirable. Getting around this requires quite a bit of trickery. The break down of each of these arguments is as follows:

  • --rm remove this container when we quit, we don’t need to let it persist as a stopped container we could later return to.
  • -it Interactive terminal
  • -v binds a host volume to the container. Files on the host working directory (pwd) will be available on the container, and changes made on the container are immediately written to the host directory:
-v $(pwd):/home/$USER/`basename $PWD`

The path after the colon specifies where this directory should live on the container: we specify in a directory that has the same name as the current working directory basename $PWD, located in the home directory of the user (e.g. where the user has write permissions).

  • -w specifies the working directory we should drop into when our session on the container starts. We set this to match the path where we have just mounted our current working directory:
-w /home/$USER/`basename $PWD`
  • -e HOME=$HOME sets the value of the environmental variable HOME to whatever it is on the host machine (e.g. /home/username), so that when R tries to access ~/, it gets the user’s directory and not the root directory.

  • -e USER=$USER though this seems redundant, we set the user environmental variable by default in the cboettig/rstudio image, so this overrides that environmental variable with the current user.

  • --user=$USER Specifies the user we log in as. This is rather important, otherwise the we find that we are the root (or whatever user has been set in the Dockerfile). That would cause any files we generate from the container to be owned by the root user, not our local user. Note that this only works if the specific user has already been created (e.g. by adduser) on the container, otherwise this will fail.

  • strata the name of the container (could be cboettig/ropensci, but my strata image provides a few additional customizations, created by it’s own Dockerfile. That Dockerfile (and its FROM dependencies) specify all the software available on this container. Importantly, it also already creates my username in it’s Dockerfile. Otherwise, the argument given above should use --user=rstudio, since the rstudio user is already created by the base image cboettig/rstudio, and thus available in cboettig/ropensci and strata. Note that this user can be created interactively by passing the environmental variable -e USER=$USER when running in deamon mode, since the user is then created by the start-up script. However, when we provide a custom command (like /usr/bin/R in this example, the CMD from the Dockerfile is overriden and the user isn’t created.

A stricter alias I considered first enforces running R as a container rather than a local operation:

alias R='docker run --rm -it -v $(pwd):/home/$USER/`basename $PWD` -w /home/$USER/`basename $PWD` -e HOME=$HOME --user=$USER strata /usr/bin/R'

Why not separate containers per application?

A more natural / more docker-esque approach might simply be to have separate containers for each application (R, pandoc, etc). This idealism belies the fact that I already need many of these tools installed on the same container, as they regularly interact in a deep way (e.g. R packages like rmarkdown already depend on pandoc), so these should really be thought of as a single development environment.

Read more

Server Backups

09 Sep 2014

Digital Ocean Snapshots

At $0.02 per gig per month, this looks like this is the cheapest way to make complete backups.

The process is rather manual: we have to sudo poweroff the droplet and then trigger the snapshot (the container will come back online after that, though we have to restart the services / active docker containers). We also have to delete old snapshots manually. Some of this can be automated from the API. DigitalOcean uses redundant storage for these (paying $0.01/month/gigabyte to Amazon Glacier), but at the moment we can’t export these images. Snapshots are also handy to deploy to a larger (but not smaller) droplet.

Digital Ocean Backups

These backups are an automated, always-online alternative to snapshots but must me initialized when the droplet is created and cost more (20% of server cost).

Manually configuring backups

To have the flexibility to restore individual pieces, to move between machines, etc we need a different approach.

Container backups

Docker containers, including running containers, should be effectively backed up by either of these approaches to the state we would be in after a power cycle (e.g. we may need to start stopped containers, but not rebuild them from scratch).

Nevertheless we may want to back up containers themselves. For many containers this is trivial (e.g. our ssh container): we can just commit the running container to an image and save that as a tar archive (or equivalently, just export the container to a tarball).

If the containers have a VOLUME command in their dockerfile or in their execution however, this is insufficient. Containers using volumes (such as sameersbn/gitlab and mattgruter/drone) need four things to be backed up:

  • Dockerfile (or container image, from save or commit)
  • volume
  • volume path in container
  • name of the container the volume belongs to

A utility makes this easier.


Sparkleshare is a git-backed dropbox alterantive. With binaries for most major platforms (Windows, Mac, Ubuntu/Linux) it’s pretty easy to set up and acts in much the same way, with automated synch and notifications. The backend just needs a server running git – Gitlab is a great way to set this up to permit relatively easy sharing / user management. (Ignore the information about setting up separately on a server, Gitlab is much easier. Also ignore advice about building from source on Ubuntu, installing the binary is far more straight forward: apt-get install sparkleshare. Certainly it is not as feature rich as dropbox (e.g. email links to add users, web links to share individual files), but easy sharing over the server at no extra cost. The Sparkleshare directory is also a fully functional git repo.

Encrypted backup of filesystem with duplicity

See Duplicity setup

Good for backing up to another host for which we have ssh access, or to an Amazon S3 bucket, etc. (Unclear if this works with Glacier due to upload-only et-up).

Some other rates for data storage:

  • Compare to S3 ($0.03 /gig/month)
  • EBS ($0.12 /gig/month) (really for computing I/O, not storage).
  • Remarkably, Google Drive and Dropbox now offer 1 TB at $10 / mo. Clearly a lot can be saved by ‘overselling’ (most users will not use their capacity) and by shared files (counting against the space for all users but requiring no more storage capacity). Nonetheless, impressive, on par with Glacier (without the bandwidth charges or delay).
  • For comparison, (non-redundant, non-enterprise, disk-based) storage is roughly $100/TB, or on order of that annual cost.

Read more

Server Security Basics

08 Sep 2014

Security configuration

We set up SSH key-only login on non-standard port, with root login forbidden. We then set up ufw firewall, fail2ban, and tripwire.

  1. Configure an SSH key login. Next, Create a user, add to sudoers, and then disable root login.. Edits /etc/ssh/sshd_config:
  • Disabling root logins. (We’ll need to add ourselves to sudo first: (adduser, edit /etc/sudoers)
  • Change ssh port from default to something else.
  • Whitelist user login ids

Additionally, let’s be sure to disable password authentication: Add PasswordAuthentication no to /etc/ssh/sshd_config. (editing PermitRootLogin only doesn’t do this).

Locally add an entry in ~/.ssh/config to alias the host and port to avoid having to remember these numbers for login. Run ssh-copy-id <droplet-ip> to enable key-based login for the user.

  1. Install and configure ufw firewall. As we’re not using the default ssh port, we need to explicitly tell ufw which ssh port to allow.
sudo ufw allow <PORT>/tcp

(The /tcp part is optional, saying only allow tcp protocol over that port, not other protocols.)

We must also tell ufw to allow Docker: In /etc/default/ufw change DEFAULT_FORWARD_POLICY to ACCEPT, then:

sudo ufw reload
sudo ufw allow 2375/tcp

and similarly allow any ports we export for our various services (Gitlab, Drone, etc).

  1. Install and configure fail2ban. Prevents brute force password attacks. Be sure to assign the config to match chosen ssh port.

  2. Install and configure tripwire (intrusion detection).

  3. Update software:

sudo apt-get -q update && sudo apt-get -qy dist-upgrade

and then also update tripwire log:

sudo tripwire --check --interactive

Note: Clearly all these steps need to be running on the server itself, not merely in a container image deployed on server so that they are securing access to the actual host.

Additional configuration

While we’re doing things, add user to the docker group for convenience: sudo addgroup cboettig docker

Enable swap on a small instance. Here we set up 1GB of swap (setting swap at twice the available RAM is the recommended rule-of-thumb, though makes less sense once RAM is large)

sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

To make this persistant on reboot edit /etc/fstab:

sudo echo "/swapfile       none    swap    sw      0       0" >> /etc/fstab

For better performance, we might tweak swappiness to 10 (default is 60 out of 100, where 0 is never swap and 1 is swap frequently):

echo 10 | sudo tee /proc/sys/vm/swappiness
echo vm.swappiness = 10 | sudo tee -a /etc/sysctl.conf

Set ownership

sudo chown root:root /swapfile
sudo chmod 0600 /swapfile

Server modules

Running different services as their own docker containers offers serveral advantages:

  • Containers often make it easier to install and deploy existing services, since the necessary configuration is scripted in the Dockerfile and we can often find Dockerfiles already made on Docker Hub for common services. This note illustrates several examples.

  • Containers may provide an added level of stability, since they run largely in isolation from each other.

  • Containers can be resource limited, e.g.

docker run -it -m 100m -c 100 ubuntu /bin/bash

would provide the container with 100 MB of RAM and 100 “shares” of CPU (acts kind of like a niceness, where the default share of a container is 1024. On multicore machines you can also pass --cpuset "0" or --cpuset "0,1" etc, which is a list of which cpus (numbered 0 to n-1, as in /proc/cpuinfo) the container is permitted to use.

As noted in the link, restricting disk space is more tricky, though might become easier down the road.

ssh server:

Permit users to ssh directly into a container rather than access the server itself. Despite its simplicity, I found this a bit tricky to set up correctly, particularly in managing users.

Here’s the basic Dockerfile for an image we’ll call ssh. This creates a user given by the environmental variable. A few tricks:

  • We use adduser instead of useradd so that we get the home directory for the user created and granted the correct permissions automaticalliy. We need the --gecos information so that we’re not prompted to enter the user’s full name etc. We use --disabled-password rather than set a password here.
  • Login is still possible through ssh key (as well as through nsenter on the host machine). We go ahead and add the ssh key now, though this could be done after the container is running by using nsenter.
  • In this dockerfile, we’ve added the user to sudoers group for root access on the container (installing software, etc). This won’t be active until the user has a password.
FROM     ubuntu:14.04
ENV USER cboettig
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN adduser --disabled-password --gecos "" $USER
RUN adduser $USER sudo
ADD authorized_keys /home/$USER/.ssh/authorized_keys
RUN chown $USER /home/$USER/.ssh/authorized_keys
CMD    ["/usr/sbin/sshd", "-D"]

When building the image, note that a copy of authorized_keys (contains the contents of the public key) file must be found in the same directory as the Dockerfile so that it can be added to the image.

Start the ssh server on port 2200:

docker run -d -p 2200:22 --name="ssh" ssh

Add to the firewall permissions

sudo ufw add 2200/tcp

From here I can now ssh in from the computer housing the private key pair to the public key that is added to the image here. However, that user doesn’t have root access since we haven’t provided a password.

Use nsenter to enter the instance:

docker run -v /usr/local/bin:/target jpetazzo/nsenter
nsenter -m -u -n -i -p -t `docker inspect --format '' ssh` /bin/bash

Create a password for the user to enable root access:

echo '$USER:<password>' | chpasswd

We can create other users and add them to sudoers or not as desired, e.g. add interactively using:

useradd <username>

users can later change their passwords once they log in.

Restoring containers when restarting

A certain times we need to power cycle the server (e.g. after certain software updates), using the sudo reboot now command or the DigitalOcean console. Any running containers will be automatically stopped. Once the machine is back up and we log back in, these containers can be brought back up with docker restart <container_id> (or <container_name>), and then everything is back to normal.

Note that while we can stop and restart a container, it seems we cannot simply save a container (e.g. with docker commit or docker save and re-run it and expect the server functionality to be restored after the container is destroyed (e.g. by docker rm -f). (See previous notes 2014-09-05) for an illustration of this problem. This occurs because the container image does not include the volume where it writes its data, and that volume address is generated uniquely each time a container is run.

Consequently, a different (probably more traditional) approach is needed to backup the configuration of a server such as Gitlab or Drone-CI even when running in a container. Will explore this later.

Meanwhile, remember remove unneeded containers with

docker rm $(docker ps -a | grep Exited | awk '{print $1}')

and not with -f (destroying running containers!)

Key security

We can largely avoid needing a private ssh key for the server, though may use https authentication to let us use git (rather than, say, rsync) to develop and save changes made remotely (say, through RStudio server).

Backing up keys

Probably unnecessary to have a backup of the ssh private RSA key, as we can access the DigitalOcean Server or Github through the web consoles and add a new public key and replace our private key.

Read more

Drone Ci And Docker

05 Sep 2014

Drone CI: Continous integration in custom docker environments

Having gotten accustomed to Docker, configuring the appropriate build environment for a Continuous Integration system like Travis CI or Shippable CI starts to feel incredibly tedious and archaic (particularly if you work primarily in a language like R or haskell that usually isn’t supported out of the box).

  • We do not have to hack together a custom image environment
  • We can build and test our environment locally instead of having to rely on trial-and-error pushes to the CI server
  • We do not have to download, compile and install the development environment each time, (which frequently takes longer than the CI checks themselves and can break)

(Shippable provides a persistent environment too, by preserving the state of your ‘minion’. But unlike Shippable, I believe the Drone approach is unlikely that you can create troublesome side-effects in your environment, such as removing a necessary dependency from the shippable.yml and yet not catching it since the dependency is still available on the minion from before. In the Drone approach, we start on the same docker image each time, but merely avoid the few minutes it might take to download that image).

Unfortunately, custom images are not available on the fully hosted system. (Though perhaps they’d accept pull requests that would add an R environment to their image library). Fortunately, the Drone team kindly provides an open source version of their platform that can be hosted on a self-hosted / private server (such as the new web darling DigitalOcean or Amazon’s EC2). This has other advantages as well – such as using privately hosted repositories (it also integrates with BitBucket and GitLab) or running very long tests / simulations (since we’re now paying for the server time ourselves, after all).

The easy way: use docker

We can deploy the Drone CI server somewhat more seamlessly by running it in a container itself. Rather than worry about the above configuration, we can simply launch an existing docker image for Drone, rather cleverly created by Matt Gruter:

docker run --name='drone' -d -p 8080:80 --privileged mattgruter/drone
  • Now we can follow the setup instructions. Be sure to use the matching case in the application name (Drone not drone) and the appropriate URLs for the authorization call back.

Note that we must use a different port than 80, and that we must give this port explicitly in the Authorization callback URL: http://localhost:8080/auth/login/github in order to authenticate.

Also note that in this approach, the Drone CI’s docker image library will be separate from the docker image library. To manage or update the images available, we have to first nsenter into the Drone CI container.

This runs rather nicely on a tiny DigitalOcean droplet. Bare in mind that the tiny droplet has only 20 GB of disk space though, which can be consumed rather quickly with too many images. If many of the images use the same base templates, the total disk space required will fortunately be much lower than the sum of their virtual sizes.

experimenting with saving images

Being a docker image, we can snapshot and export it for later use, and meanwhile can even destroy our server instance.

docker export drone > dronedroplet.tar

Not clear that this works. Consider saving an image instead? Save container named drone as image named drone:droplet

docker commit drone drone:droplet
docker save drone:droplet > dronedroplet.tar

are these identical?

Hmm, doesn’t seem to store configuration, login is no longer valid. Starting a stopped container maintains the configuration of course, but not launching from scratch (e.g. the sqlite database is local to the container, not accessible through an externally linked volume).

Note that this tarball does not include the Drone CI image library itself, which is not part of the container but rather connected as a volume. This makes it quite a bit smaller, and that library can presumably be reconstructed from the docker hub.

Configuring Drone CI: the hard way

  • Install and launch drone: (see drone/README)
  • Add DOCKER_OPTS="-H -d" to /etc/default/docker
  • Kill the docker deamon and restart docker. Or run docker with the explicit binding:
sudo docker -d -H &

Configuring an already-running docker session

Launch a named repository in deamon mode:

docker run -d -p 8787:8787 --name='drone' mattgruter/drone

Use a docker-based install to add nsenter into your executable path:

docker run -v /usr/local/bin:/target jpetazzo/nsenter

Run nsenter to log into the docker image:

nsenter -m -u -n -i -p -t `docker inspect --format '{{ .State.Pid }}' drone` /bin/bash

Now we can update or delete images with docker pull, docker rmi, etc.

This is useful with many containers, for instance, with our ssh container or rstudio container we may want to modify usernames and passwords, etc:

useradd -m $USER && echo "$USER:$PASSWORD" | chpasswd

Making this easier:

Add to .bashrc:

function dock { sudo nsenter -m -u -n -i -p -t `docker inspect --format  "$1"` /bin/bash; }

This defines the function dock such that dock <name> will enter a running container named <name>. Note that we have to have nsenter bound to the executable path as indicated above. Yay less typing.

Read more

Docker tricks of the trade and best practices thoughts

29 Aug 2014

Best practices questions

Here are some tricks that may or may not be in keeping with best practices, input would be appreciated.

  • Keep images small: use the --no-install-recommends option for apt-get, install true dependencies rather than big metapackages (like texlive-full).
  • Avoid creating additional AUFS layers by combining RUN commands, etc? (limit was once 42, but is now at least 127).
  • Can use RUN git clone ... to add data to a container in place of ADD, which invalidates caching.

  • Use automated builds linked to Github-based Dockerfiles rather than pushing local image builds. Not only does this make the Dockerfile transparently available and provide a link to the repository where one can file issues, but it also helps ensure that the image available on the hub gets its base image (FROM entry) from the hub instead of whatever was available locally. This can help avoid various out-of-sync errors that might otherwise emerge.

Docker’s use of tags

Unfortunately, Docker seems to use the term tag to refer both to the label applied to an image (e.g. in docker build -t imagelabel . the -t argument “tags” the image as ‘imagelabel’ so we need not remember its hash), but also uses tag to refer to the string applied to the end of an image name after a colon, e.g. latest in ubuntu:latest. The latter is the definition of “tags” as listed under the “tags” tab on the Docker Hub. Best practices for this kind of tag (which I’ll arbitrarily refer to as a ‘version tag’ to distinguish it) are unclear.

One case that is clear is tagging specific versions. Docker’s automated builds lets a user link a “version tag” to either to a branch or a tag in the git history. A “branch” in this case can refer either to a different git branch or merely a different sub-directory. Matching to a Git tag provides the most clear-cut use of the docker version-tag; providing a relatively static version stable link. (I say “relatively” static because even when we do not change the Dockerfile, if we re-build the Dockerfile we may get a new image due the presence of newer versions of the software included. This can be good with respect to fixing security vulnerabilities, but may also break a previously valid environment).

The use case that is less clear is the practice of using these “version tags” in Docker to indicate other differences between related images, such as eddelbuettel/docker-ubuntu-r:add-r and eddelbuettel/docker-ubuntu-r:add-r-devel. Why these are different tags instead of different roots is unclear, unless it is for the convenience of multiple docker files in a single Github repository. Still, it is perfectly possible to configure completely separate docker hub automated builds pointing at the same Github repo, rather than adding additional builds as tags in the same docker hub repo.

Docker linguistics borrow from git terminology, but it’s rather dangerous to interpret these too literally.

Keeping a clean docker environment

  • run interactive containers with --rm flag to avoid having to remove them later.

  • Remove all stopped containers:

docker rm $(docker ps -a | grep Exited | awk '{print $1}')
  • Clean up un-tagged docker images:
docker rmi $(docker images -q --filter "dangling=true")
  • Stop and remove all containers (including running containers!)
docker rm -f $(docker ps -a -q)

Docker and Continuous Integration

  • We can install but cannot run Docker on Travis-CI at this time. It appears the linux kernel available there is much too old. Maybe when they upgrade to Ubuntu 14:04 images…

  • We cannot run Docker on the docker-based Shippable-CI (at least without a vagrant/virtualbox layer in-between). Docker on Docker is not possible (see below).

  • For the same reason, we cannot run Docker on CI. However, Drone provides an open-source version of it’s system that can be run on your own server, which unlike the fully hosted offering, permits custom images. Unfortunately I cannot get it working at this time.

Docker inside docker:

We cannot directly install docker inside a docker container. We can get around this by adding a complete virtualization layer – e.g. docker running in vagrant/virtualbox running in docker.

Alternatively, we can be somewhat more clever and tell our docker to simply use a different volume to store its AUFS layers. Matt Gruter has a very clever example of this, which can be used, e.g. to run a Drone server (which runs docker) inside a Docker container (mattgruter/drone).

I believe this only works if we run the outer docker image with --privileged permissions, e.g. we cannot use this approach on a server like Shippable that is dropping us into a prebuilt docker container.

Read more

Pdg Controlfest Notes

14 Aug 2014

Just wanted to give a quick update on stuff relevant to our adjustment costs paper in events of this week.

I think the talk on Tuesday went all right, (though thanks to a technology snafu going from reveal.js to pdf my most useful figure actually showing the bluefin tuna didn’t display – I tried not to let on). I tried to keep the focus pretty big-picture throughout (we ignore these costs when we model, they matter) and avoid being too bold / prescriptive (e.g. not suggesting we found the ‘right’ way to model these costs). I also could not stop myself from calling the adjustment cost models L1 L2 L3 instead of “linear” “quadratic” and “fixed”, or _1,2,3. whoops.

One question asked about asymmetric costs. You may recall we started off doing but ran into some unexpected results where they just looked like the cost free case, possibly due to problems with the code. We should probably at least say this is an area for further study.

Another question asked about just restricting the period of adjustment, say, once every 5 years or so. I answered that we hoped to see what cost structures “induced” that behavior rather than enforcing it explicitly; but I should probably add some mention of this to the text as well.

I think the other questions were more straight forward but don’t remember any particulars.

The Monday meeting was very helpful for me in framing the kind of big questions around the paper:

  1. Can we make this story about more than TAC-managed fisheries? My ideal paper would be something people could cite to show that simply using profit functions with diminishing returns is not a sufficient way to reflect this reality (could be the opposite if reality is more like a transaction fee), and that this mistake can be large. But all our examples are in the fisheries context, so this may take some finesse. (Since we’re aiming for Eco Apps rather than, say, Can Jor Fisheries)

  2. Emphasizing the “Pretty Darn Good” angle – thinking of the policies we derive with adjustment costs not as the “True optimum” but as a “Pretty Darn Good” policy that can be more robust to adjustment costs – (Provided you have intuition to know if those costs are more like a fixed transaction fee or some proportional cost). The last two figures help with this, since they show using policies under different cost regimes than those under which they were computed to be optimal.

  3. Need to figure out what to say about policies that can ‘self-adjust’, e.g. when you don’t have to change the law to respond to the fluctuations. (Jim pointed out that Salmon are the best/only case where you can actually manage by “escapement” since you get a complete population census from the annual runs).

  • Stripping down the complexity of the charts
  • Conversely, may need to show some examples of the fish stock dynamics (In search for simplicity I’ve focused almost all the graphs on harvest dynamics).
  • Calibrating and running the case of quadratic control term for comparison

As a bonus, I quickly ran the tipping point models, and it looks like these stay really close to the Reed solution – e.g. relative to the safer Beverton Holt world, they are much happier to pay whatever the adjustment cost might be to stick with the optimal than they are to risk total collapse. Not sure but maybe should add this into the paper…

Read more

Docker Notes

14 Aug 2014

Ticking through a few more of the challenges I raised in my first post on docker; here I explore some of the issues about facilitating interaction with a docker container so that a user’s experience is more similar to working in their own environment and less like working on a remote terminal over ssh. While technically minor, these issues are probably the first stumbling blocks in making this a valid platform for new users.

Sharing a local directory

Launch a bash shell on the container that shares the current working directory of the host machine (from pwd) with the /host directory on the container (thanks to Dirk for this solution):

docker run -it -v $(pwd):/host cboettig/ropensci-docker /bin/bash

This allows a user to move files on and off the container, use a familiar editor and even handle things like git commits / pulls / pushes in their working directory as before. Then the code can be executed in the containerized environment which handles all the dependencies. From the terminal docker opens, we just cd /host where we find our working directory files, and can launch R and run the scripts. A rather clean way of maintaining the local development experience but containerizing execution.

In particular, this frees us from having to pass our git credentials etc to the container, though is not so useful if we’re wanting to interact with the container via the RStudio server instead of R running in the terminal. (More on getting around this below).

Unfortunately, Mac and Windows users have to run Docker inside an already-virualized environment such as provided by boot2docker or vagrant. This means that it is only the directories on the virtualized environment, not those on the native OS, can be shared in this way. While one could presumably keep a directory synced between this virtual environment and the native OS, (standard in in vagrant), this is a problem for the easier-to-use boot2docker at this time: (docker/issues/7249).

A Docker Desktop

Dirk brought this docker-desktop to my attention; which uses Xpra (in place of X11 forwarding) to provide a window with fluxbox running on Ubuntu along with common applications like libreoffce, firefox, and rox file manager. Pretty clever, and worked just fine for me, but needs Xpra on the client machine and requires some extra steps (run the container, query for passwords and ports, run ssh to connect, then run Xpra to launch the window). The result is reasonably responsive but still slower than virtualbox, and probably too slow for real work.

Base images?

The basic Ubuntu:14.04 seems like a good lightweight base image (at 192 MB), but other images try to give more useful building blocks, like phusion/baseimage (423 MB). Their docker-bash script and other utilities provide some handy features for managing / debugging containers.

Other ways to share files?

Took a quick look at this Dockerfile for running dropbox, which works rather well (at least on a linux machine, since it requires local directory sharing). Could probably be done without explicit linking to local directories to faciliate moving files on and off the container. Of course one can always scp/rsync files on and off containers if ssh is set up, but that is unlikely to be a popular solution for students.

While we have rstudio server running nicely in a Docker container for local or cloud use, it’s still an issue getting Github ssh keys set up to be able to push changes to a repo. We can get around this by linking to our keys directory with the same -v option shown above. We still need a few more steps: setting the Git username and email, and running ssh-add for the key. Presumably we could do this with environmental variables and some adjustment to the Dockerfile:

docker run -it -v /path/to/keys:/home/rstudio/.ssh/ -e "USERNAME=Carl Boettiger" -e "" cboettig/ropensci-docker

which would prevent storing these secure values on the image itself.

Read more

An appropriate amount of fun with docker?

08 Aug 2014

An update on my exploration with Docker. Title courtesy of Ted, with my hopes that this really does move us in a direction where we can spend less time thinking about the tools and computational environments. Not there yet though

I’ve gotten RStudio Server working in the ropensci-docker image (Issues/pull requests welcome!).

docker run -d -p 8787:8787 cboettig/ropensci-docker

will make an RStudio server instance available to you in your browser at localhost:8787. (Change the first number after the -p to have a different address). You can log in with username:pw rstudio:rstudio and have fun.

One thing I like about this is the ease with which I can now get an RStudio server up and running in the cloud (e.g. I took this for sail on today). This means in few minutes and 1 penny you have a URL that you and any collaborators could use to interact with R using the familiar RStudio interface, already provisioned with your data and dependencies in place.

For me this is a pretty key development. It replaces a lot of command-line only interaction with probably the most familiar R environment out there, online or off. For more widespread use or teaching this probably needs to get simpler still. I’m still skeptical that this will make it out beyond the crazies, but I’m less skeptical than I was when starting this out.

The ropensci-docker image could no doubt be more modular (and better documented). I’d be curious to hear if anyone has had success or problems running docker on windows / mac platforms. Issues or pull requests on the repo would be welcome! (maybe the repo needs to be renamed from it’s original fork now too…)

Rich et al highlighted several “remaining challenges” in their original post. Here’s my take on where those stand in the Docker framework, though I’d welcome other impressions:

  1. dependencies could still be missed by incompletely documentation

I think this one is largely addressed, at least assuming a user loads the Docker image. I’m still concerned that later builds of the docker image could simply break the build (though earlier images may still be available). Does anyone know how to roll back to earlier images in docker?

  1. The set of scripts for managing reproducibility are at least as complex as the analysis itself

I think a lot of that size is due to the lack of an R image for Travis and the need to install many common tools from scratch. Because docker is both modular and easily shared via docker hub, it’s much easier to write a really small script that builds on existing images, (as I show in cboettig/rnexml)

  1. CI constraints: public/open github repository with analyses that run in under 50 minutes.

Docker has two advantages and also some weaknesses here: (1) it should be easy to run locally, while accomplishing much of the same thing as running on travis (though clearly that’s not as nice as running automatically & in the cloud on every push). (2) It’s easier to take advantage of caching – for instance, cboettig/rnexml provides the knitr cache files in the image so that a user can start exploring without waiting for all the data to download and code to run.

It seems that Travis CI doesn’t currently support docker since the linux kernel they use is too old. (Presumably they’ll update one day. Anyone try Shippable CI? (which supports docker))

  1. The learning curve is still prohibitive

I think that’s still true. But what surprised me is that I’m not sure that it’s gotten any worse by adding docker than it was to begin with using Travis CI. Because the approach can be used both locally and for scaling up in the cloud, I think it offers some more immediate payoffs to users than learning a Github+CI approach does. (Notably it doesn’t require any git just to deploy something ‘reproducible’, though of course it works nicely with git.

Read more