Lab Notebook

(Introduction)

Coding

  • cboettig pushed to master at cboettig/labnotebook: update vita 05:42 2014/09/19
  • cboettig commented on issue ropensci/RNeXML#86: Good question. Wondering if there's a good way to coordinate thinking about this kind of thing at a higher level so at least whatever we do is cons… 04:50 2014/09/19
  • cboettig commented on issue eddelbuettel/rocker#1: Yup. So we don't have an image that provides the r-devel pre-release version. Should I add that to the r image, aliased as Rdevel? When you get a c… 10:27 2014/09/18
  • cboettig closed issue cboettig/pdg_control#42: possible literature tie-in 09:16 2014/09/18
  • cboettig closed issue cboettig/pdg_control#51: Emphasis on generalities or differences? 09:16 2014/09/18

Discussing

Reading

  • Collapse of an ecological network in Ancient Egypt: Pages: 1-6. Justin D Yeakel, Mathias M Pires, Lars Rudolf, Nathaniel J Dominy, Paul L Koch et al. 07:35 2014/09/11
  • Regime shifts in models of dryland vegetation: Phil Trans R Soc A. Yuval R Zelnik, Shai Kinast, Hezi Yizhaq, Golan Bel, Ehud Meron et al. 07:35 2014/09/11
  • Temporal ecology in the Anthropocene: Ecology Letters (2014). Pages: n/a-n/a. E. M. Wolkovich, B. I. Cook, K. K. McLauchlan, T. J. Davies et al. 07:35 2014/09/11
  • Uncertainty, learning, and the optimal management of wildlife: Environmental and Ecological Statistics (2001). Volume: 8, Issue: 3. Pages: 269-288. Byron K. Williams et al. 08:57 2014/08/06

Entries

Server Backups

09 Sep 2014

Digital Ocean Snapshots

At $0.02 per gig per month, this looks like the cheapest way to make complete backups.

The process is rather manual: we have to sudo poweroff the droplet and then trigger the snapshot (the droplet will come back online after that, though we have to restart the services / active docker containers). We also have to delete old snapshots manually. Some of this can be automated from the API. DigitalOcean uses redundant storage for these (paying $0.01/month/gigabyte to Amazon Glacier), but at the moment we can't export these images. Snapshots are also handy to deploy to a larger (but not smaller) droplet.
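As a rough sketch, the power-off and snapshot steps look something like this through the v2 API; the endpoint and payload details here are from memory and the droplet id, token, and snapshot name are placeholders, so double-check against the current API docs:

# power off the droplet, then request a snapshot (id, token, and name are placeholders)
curl -X POST "https://api.digitalocean.com/v2/droplets/$DROPLET_ID/actions" \
  -H "Authorization: Bearer $DO_TOKEN" -H "Content-Type: application/json" \
  -d '{"type":"power_off"}'

curl -X POST "https://api.digitalocean.com/v2/droplets/$DROPLET_ID/actions" \
  -H "Authorization: Bearer $DO_TOKEN" -H "Content-Type: application/json" \
  -d '{"type":"snapshot","name":"weekly-backup"}'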

Digital Ocean Backups

These backups are an automated, always-online alternative to snapshots, but they must be enabled when the droplet is created and cost more (20% of server cost).

Manually configuring backups

To have the flexibility to restore individual pieces, to move between machines, etc we need a different approach.

Container backups

Either of these approaches should effectively back up Docker containers, including running containers, to the state we would be in after a power cycle (e.g. we may need to start stopped containers, but not rebuild them from scratch).

Nevertheless, we may want to back up containers themselves. For many containers this is trivial (e.g. our ssh container): we can just commit the running container to an image and save that as a tar archive (or, equivalently, just export the container to a tarball).
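For instance, a minimal sketch of the two routes (the image and file names here are just illustrative):

# commit the running ssh container to an image and save that image as a tar archive
docker commit ssh ssh-backup
docker save ssh-backup > ssh-backup.tar

# or export the container's filesystem directly to a tarball
docker export ssh > ssh-container.tar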

If the containers have a VOLUME command in their Dockerfile or in their execution, however, this is insufficient. Containers using volumes (such as sameersbn/gitlab and mattgruter/drone) need four things to be backed up:

  • Dockerfile (or container image, from save or commit)
  • volume
  • volume path in container
  • name of the container the volume belongs to

A utility makes this easier.
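Without such a utility, a rough sketch of the manual route is the usual volumes-from pattern: mount the container's volumes into a throwaway container and tar them onto the host (the container name and volume path here are illustrative):

# tar up the gitlab container's volume onto the host
docker run --rm --volumes-from gitlab -v $(pwd):/backup ubuntu \
  tar cvf /backup/gitlab-volumes.tar /path/to/volume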

Sparkleshare

Sparkleshare is a git-backed Dropbox alternative. With binaries for most major platforms (Windows, Mac, Ubuntu/Linux) it's pretty easy to set up and acts in much the same way, with automated sync and notifications. The backend just needs a server running git – Gitlab is a great way to set this up to permit relatively easy sharing / user management. (Ignore the information about setting up separately on a server; Gitlab is much easier. Also ignore advice about building from source on Ubuntu; installing the binary is far more straightforward: apt-get install sparkleshare.) Certainly it is not as feature-rich as Dropbox (e.g. email links to add users, web links to share individual files), but it offers easy sharing over the server at no extra cost. The Sparkleshare directory is also a fully functional git repo.

Encrypted backup of filesystem with duplicity

See Duplicity setup

Good for backing up to another host for which we have ssh access, or to an Amazon S3 bucket, etc. (Unclear if this works with Glacier due to its upload-only setup.)
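A minimal sketch of the usage (the host and paths are placeholders; duplicity encrypts the archives with a GPG passphrase it prompts for):

# encrypted, incremental backup of /home to another host over ssh
duplicity /home sftp://user@backuphost//backups/home

# restore from the remote archive into a scratch directory
duplicity restore sftp://user@backuphost//backups/home /tmp/restore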

Some other rates for data storage:

  • Compare to S3 ($0.03 /gig/month)
  • EBS ($0.12 /gig/month) (really for computing I/O, not storage).
  • Remarkably, Google Drive and Dropbox now offer 1 TB at $10 / mo. Clearly a lot can be saved by ‘overselling’ (most users will not use their capacity) and by shared files (counting against the space for all users but requiring no more storage capacity). Nonetheless, impressive, on par with Glacier (without the bandwidth charges or delay).
  • For comparison, (non-redundant, non-enterprise) disk-based storage is roughly $100/TB to buy outright, which is on the order of the annual cost of the options above.




Server Security Basics

08 Sep 2014

Security configuration

We set up SSH key-only login on a non-standard port, with root login forbidden. We then set up the ufw firewall, fail2ban, and tripwire.

  1. Configure an SSH key login. Create a user, add it to sudoers, and then disable root login. Edit /etc/ssh/sshd_config to:
  • Disable root logins. (We'll need to add ourselves to sudo first: adduser, then edit /etc/sudoers.)
  • Change the ssh port from the default to something else.
  • Whitelist user login ids.

Additionally, let's be sure to disable password authentication: add PasswordAuthentication no to /etc/ssh/sshd_config. (Editing PermitRootLogin alone doesn't do this.)
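Taken together, the relevant /etc/ssh/sshd_config entries look something like this (the port number and username are just illustrative):

# non-standard ssh port
Port 2222
# no direct root logins
PermitRootLogin no
# keys only, no passwords
PasswordAuthentication no
# whitelist of login ids
AllowUsers cboettig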

Locally add an entry in ~/.ssh/config to alias the host and port to avoid having to remember these numbers for login. Run ssh-copy-id <droplet-ip> to enable key-based login for the user.
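A sketch of the corresponding local ~/.ssh/config entry (the alias, IP, and port are placeholders), after which ssh droplet suffices:

Host droplet
    HostName 198.51.100.1
    Port 2222
    User cboettig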

  2. Install and configure the ufw firewall. As we're not using the default ssh port, we need to explicitly tell ufw which ssh port to allow:
sudo ufw allow <PORT>/tcp

(The /tcp part is optional, saying only allow tcp protocol over that port, not other protocols.)

We must also tell ufw to allow Docker: In /etc/default/ufw change DEFAULT_FORWARD_POLICY to ACCEPT, then:

sudo ufw reload
sudo ufw allow 2375/tcp

and similarly allow any ports we export for our various services (Gitlab, Drone, etc).
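For example, using the ports mapped for the services discussed below:

sudo ufw allow 8080/tcp   # Drone CI
sudo ufw allow 2200/tcp   # ssh container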

  3. Install and configure fail2ban. Prevents brute-force password attacks. Be sure to adjust the config to match the chosen ssh port.

  4. Install and configure tripwire (intrusion detection).

  5. Update software:

sudo apt-get -q update && sudo apt-get -qy dist-upgrade

and then also update tripwire log:

sudo tripwire --check --interactive

Note: clearly all of these steps need to be run on the server itself, not merely in a container image deployed on the server, so that they secure access to the actual host.

Additional configuration

While we're doing things, add the user to the docker group for convenience: sudo addgroup cboettig docker

Enable swap on a small instance. Here we set up 1 GB of swap (setting swap at twice the available RAM is the recommended rule of thumb, though this makes less sense once RAM is large):

sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

To make this persistent across reboots, append a line to /etc/fstab:

# note: sudo echo ... >> /etc/fstab would fail, since the redirect runs without sudo; use tee instead
echo "/swapfile       none    swap    sw      0       0" | sudo tee -a /etc/fstab

For better performance, we might tweak swappiness to 10 (the default is 60 on a 0-100 scale, where 0 means avoid swapping and 100 means swap aggressively):

echo 10 | sudo tee /proc/sys/vm/swappiness
echo vm.swappiness = 10 | sudo tee -a /etc/sysctl.conf

Set ownership

sudo chown root:root /swapfile
sudo chmod 0600 /swapfile

Server modules

Running different services as their own docker containers offers several advantages:

  • Containers often make it easier to install and deploy existing services, since the necessary configuration is scripted in the Dockerfile and we can often find Dockerfiles already made on Docker Hub for common services. This note illustrates several examples.

  • Containers may provide an added level of stability, since they run largely in isolation from each other.

  • Containers can be resource limited, e.g.

docker run -it -m 100m -c 100 ubuntu /bin/bash

would provide the container with 100 MB of RAM and 100 "shares" of CPU (this acts kind of like niceness, where the default share of a container is 1024). On multicore machines you can also pass --cpuset "0" or --cpuset "0,1" etc., which is a list of which cpus (numbered 0 to n-1, as in /proc/cpuinfo) the container is permitted to use.

As noted in the link, restricting disk space is more tricky, though might become easier down the road.

ssh server:

Permit users to ssh directly into a container rather than access the server itself. Despite its simplicity, I found this a bit tricky to set up correctly, particularly in managing users.

Here's the basic Dockerfile for an image we'll call ssh. This creates a user given by the USER environment variable. A few tricks:

  • We use adduser instead of useradd so that the home directory for the user is created and granted the correct permissions automatically. We need the --gecos information so that we're not prompted to enter the user's full name etc. We use --disabled-password rather than set a password here.
  • Login is still possible through ssh key (as well as through nsenter on the host machine). We go ahead and add the ssh key now, though this could be done after the container is running by using nsenter.
  • In this Dockerfile, we've added the user to the sudo group for root access on the container (installing software, etc). This won't be active until the user has a password.
FROM     ubuntu:14.04
ENV USER cboettig
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN adduser --disabled-password --gecos "" $USER
RUN adduser $USER sudo
ADD authorized_keys /home/$USER/.ssh/authorized_keys
RUN chown $USER /home/$USER/.ssh/authorized_keys
EXPOSE 22
CMD    ["/usr/sbin/sshd", "-D"]

When building the image, note that a copy of the authorized_keys file (containing the contents of the id_rsa.pub public key) must be in the same directory as the Dockerfile so that it can be added to the image.
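Then build and tag the image from that directory:

docker build -t ssh .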

Start the ssh server on port 2200:

docker run -d -p 2200:22 --name="ssh" ssh

Add the port to the firewall permissions:

sudo ufw allow 2200/tcp

From here I can now ssh in from the computer housing the private key paired with the public key added to the image here. However, that user doesn't have root access, since we haven't provided a password.

Use nsenter to enter the instance:

docker run -v /usr/local/bin:/target jpetazzo/nsenter
nsenter -m -u -n -i -p -t `docker inspect --format '{{ .State.Pid }}' ssh` /bin/bash

Create a password for the user to enable root access:

echo "$USER:<password>" | chpasswd

We can create other users and add them to sudoers or not as desired, e.g. add interactively using:

useradd <username>

Users can later change their passwords once they log in.

Restoring containers when restarting

At certain times we need to power cycle the server (e.g. after certain software updates), using the sudo reboot now command or the DigitalOcean console. Any running containers will be automatically stopped. Once the machine is back up and we log back in, these containers can be brought back up with docker restart <container_id> (or <container_name>), and then everything is back to normal.
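For example (the container name is illustrative):

docker ps -a            # list all containers, including stopped ones
docker restart gitlab   # bring a stopped container back up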

Note that while we can stop and restart a container, it seems we cannot simply save a container (e.g. with docker commit or docker save), re-run it, and expect the server functionality to be restored after the original container is destroyed (e.g. by docker rm -f). (See the previous notes from 2014-09-05 for an illustration of this problem.) This occurs because the container image does not include the volume where it writes its data, and that volume address is generated uniquely each time a container is run.

Consequently, a different (probably more traditional) approach is needed to backup the configuration of a server such as Gitlab or Drone-CI even when running in a container. Will explore this later.

Meanwhile, remember to remove unneeded containers with

docker rm $(docker ps -a | grep Exited | awk '{print $1}')

and not with -f (which would destroy running containers!).

Key security

We can largely avoid needing a private ssh key for the server, though we may use https authentication to let us use git (rather than, say, rsync) to develop and save changes made remotely (say, through RStudio server).

Backing up keys

It is probably unnecessary to keep a backup of the server's private RSA key, as we can access the DigitalOcean server or Github through their web consoles, add a new public key, and replace our private key.




Drone Ci And Docker

05 Sep 2014

Drone CI: Continuous integration in custom docker environments

Having gotten accustomed to Docker, configuring the appropriate build environment for a Continuous Integration system like Travis CI or Shippable CI starts to feel incredibly tedious and archaic (particularly if you work primarily in a language like R or Haskell that usually isn't supported out of the box). Because Drone runs each build inside a Docker image that we specify:

  • We do not have to hack together a custom image environment
  • We can build and test our environment locally instead of having to rely on trial-and-error pushes to the CI server
  • We do not have to download, compile and install the development environment each time, (which frequently takes longer than the CI checks themselves and can break)

(Shippable provides a persistent environment too, by preserving the state of your 'minion'. But unlike Shippable, the Drone approach makes it unlikely that you can create troublesome side-effects in your environment, such as removing a necessary dependency from the shippable.yml and yet not catching it because the dependency is still available on the minion from before. In the Drone approach, we start from the same docker image each time, but merely avoid the few minutes it might take to download that image.)

Unfortunately, custom images are not available on the fully hosted drone.io system. (Though perhaps they'd accept pull requests that would add an R environment to their image library.) Fortunately, the Drone team kindly provides an open source version of their platform that can be run on a self-hosted / private server (such as the new web darling DigitalOcean or Amazon's EC2). This has other advantages as well – such as using privately hosted repositories (it also integrates with BitBucket and GitLab) or running very long tests / simulations (since we're now paying for the server time ourselves, after all).

The easy way: use docker

We can deploy the Drone CI server somewhat more seamlessly by running it in a container itself. Rather than worry about the configuration detailed below (the hard way), we can simply launch an existing docker image for Drone, rather cleverly created by Matt Gruter:

docker run --name='drone' -d -p 8080:80 --privileged mattgruter/drone
Now we can follow the setup instructions. Be sure to use the matching case in the application name (Drone not drone) and the appropriate URLs for the authorization callback.

Note that we must use a different port than 80, and that we must give this port explicitly in the Authorization callback URL: http://localhost:8080/auth/login/github in order to authenticate.

Also note that in this approach, the Drone CI container's docker image library will be separate from the host's docker image library. To manage or update the images available, we have to first nsenter into the Drone CI container.

This runs rather nicely on a tiny DigitalOcean droplet. Bear in mind that the tiny droplet has only 20 GB of disk space though, which can be consumed rather quickly with too many images. If many of the images use the same base templates, the total disk space required will fortunately be much lower than the sum of their virtual sizes.

experimenting with saving images

Being a docker image, we can snapshot and export it for later use, and meanwhile can even destroy our server instance.

docker export drone > dronedroplet.tar

Not clear that this works. Consider saving an image instead? Save container named drone as image named drone:droplet

docker commit drone drone:droplet
docker save drone:droplet > dronedroplet.tar

are these identical?

Hmm, this doesn't seem to store the configuration; the login is no longer valid. Starting a stopped container maintains the configuration, of course, but launching from scratch does not (e.g. the sqlite database is local to the container, not accessible through an externally linked volume).

Note that this tarball does not include the Drone CI image library itself, which is not part of the container but rather connected as a volume. This makes it quite a bit smaller, and that library can presumably be reconstructed from the docker hub.

Configuring Drone CI: the hard way

  • Install and launch drone: (see drone/README)
  • Add DOCKER_OPTS="-H 127.0.0.1:4243 -d" to /etc/default/docker
  • Kill the docker daemon and restart docker, or run docker with the explicit binding:
sudo docker -d -H 127.0.0.1:4243 &

Configuring an already-running docker session

Launch a named container in daemon mode:

docker run -d -p 8787:8787 --name='drone' mattgruter/drone

Use a docker-based install to add nsenter into your executable path:

docker run -v /usr/local/bin:/target jpetazzo/nsenter

Run nsenter to log into the docker image:

nsenter -m -u -n -i -p -t `docker inspect --format '{{ .State.Pid }}' drone` /bin/bash

Now we can update or delete images with docker pull, docker rmi, etc.

This is useful with many containers: for instance, with our ssh or rstudio containers we may want to modify usernames and passwords, etc.:

useradd -m $USER && echo "$USER:$PASSWORD" | chpasswd

Making this easier:

Add to .bashrc:

function dock { sudo nsenter -m -u -n -i -p -t `docker inspect --format '{{ .State.Pid }}' "$1"` /bin/bash; }

This defines the function dock such that dock <name> will enter a running container named <name>. Note that we have to have nsenter bound to the executable path as indicated above. Yay less typing.




Docker tricks of the trade and best practices thoughts

29 Aug 2014

Best practices questions

Here are some tricks that may or may not be in keeping with best practices; input would be appreciated.

  • Keep images small: use the --no-install-recommends option for apt-get, and install true dependencies rather than big metapackages (like texlive-full).
  • Avoid creating additional AUFS layers by combining RUN commands, etc. (the limit was once 42, but is now at least 127). See the sketch after this list.
  • We can use RUN git clone ... to add data to a container in place of ADD, since ADD invalidates the cache.

  • Use automated builds linked to Github-based Dockerfiles rather than pushing local image builds. Not only does this make the Dockerfile transparently available and provide a link to the repository where one can file issues, but it also helps ensure that the image available on the hub gets its base image (FROM entry) from the hub instead of whatever was available locally. This can help avoid various out-of-sync errors that might otherwise emerge.
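A minimal sketch illustrating the first couple of points above (the packages are just examples):

FROM ubuntu:14.04
# a single RUN layer: update, install only true dependencies, then clean the apt cache
RUN apt-get update \
 && apt-get install -y --no-install-recommends git pandoc \
 && rm -rf /var/lib/apt/lists/*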

Docker’s use of tags

Unfortunately, Docker seems to use the term tag to refer both to the label applied to an image (e.g. in docker build -t imagelabel . the -t argument "tags" the image as 'imagelabel' so we need not remember its hash), and to the string applied to the end of an image name after a colon, e.g. latest in ubuntu:latest. The latter is the definition of "tags" as listed under the "tags" tab on the Docker Hub. Best practices for this kind of tag (which I'll arbitrarily refer to as a 'version tag' to distinguish it) are unclear.

One case that is clear is tagging specific versions. Docker's automated builds let a user link a "version tag" either to a branch or to a tag in the git history. A "branch" in this case can refer either to a different git branch or merely a different sub-directory. Matching to a Git tag provides the most clear-cut use of the docker version-tag, providing a relatively static, version-stable link. (I say "relatively" static because even when we do not change the Dockerfile, if we re-build it we may get a new image due to the presence of newer versions of the software included. This can be good with respect to fixing security vulnerabilities, but may also break a previously valid environment.)

The use case that is less clear is the practice of using these “version tags” in Docker to indicate other differences between related images, such as eddelbuettel/docker-ubuntu-r:add-r and eddelbuettel/docker-ubuntu-r:add-r-devel. Why these are different tags instead of different roots is unclear, unless it is for the convenience of multiple docker files in a single Github repository. Still, it is perfectly possible to configure completely separate docker hub automated builds pointing at the same Github repo, rather than adding additional builds as tags in the same docker hub repo.

Docker linguistics borrow from git terminology, but it’s rather dangerous to interpret these too literally.

Keeping a clean docker environment

  • Run interactive containers with the --rm flag to avoid having to remove them later.

  • Remove all stopped containers:

docker rm $(docker ps -a | grep Exited | awk '{print $1}')
  • Clean up un-tagged docker images:
docker rmi $(docker images -q --filter "dangling=true")
  • Stop and remove all containers (including running containers!)
docker rm -f $(docker ps -a -q)

Docker and Continuous Integration

  • We can install but cannot run Docker on Travis-CI at this time. It appears the linux kernel available there is much too old. Maybe when they upgrade to Ubuntu 14.04 images…

  • We cannot run Docker on the docker-based Shippable-CI (at least without a vagrant/virtualbox layer in-between). Docker on Docker is not possible (see below).

  • For the same reason, we cannot run Docker on drone.io CI. However, Drone provides an open-source version of its system that can be run on your own server, which, unlike the fully hosted offering, permits custom images. Unfortunately I cannot get it working at this time.

Docker inside docker:

We cannot directly install docker inside a docker container. We can get around this by adding a complete virtualization layer – e.g. docker running in vagrant/virtualbox running in docker.

Alternatively, we can be somewhat more clever and tell our docker to simply use a different volume to store its AUFS layers. Matt Gruter has a very clever example of this, which can be used, e.g. to run a Drone server (which runs docker) inside a Docker container (mattgruter/drone).

I believe this only works if we run the outer docker image with --privileged permissions, e.g. we cannot use this approach on a server like Shippable that is dropping us into a prebuilt docker container.




Pdg Controlfest Notes

14 Aug 2014

Just wanted to give a quick update on stuff relevant to our adjustment costs paper in events of this week.

I think the talk on Tuesday went all right (though thanks to a technology snafu going from reveal.js to pdf, my most useful figure, actually showing the bluefin tuna, didn't display – I tried not to let on). I tried to keep the focus pretty big-picture throughout (we ignore these costs when we model; they matter) and avoid being too bold / prescriptive (e.g. not suggesting we found the 'right' way to model these costs). I also could not stop myself from calling the adjustment cost models L1, L2, L3 instead of "linear", "quadratic", and "fixed", or _1,2,3. Whoops.

One question asked about asymmetric costs. You may recall we started off doing this but ran into some unexpected results where they just looked like the cost-free case, possibly due to problems with the code. We should probably at least say this is an area for further study.

Another question asked about just restricting the period of adjustment, say, once every 5 years or so. I answered that we hoped to see what cost structures “induced” that behavior rather than enforcing it explicitly; but I should probably add some mention of this to the text as well.

I think the other questions were more straightforward, but I don't remember any particulars.

The Monday meeting was very helpful for me in framing the kind of big questions around the paper:

  1. Can we make this story about more than TAC-managed fisheries? My ideal paper would be something people could cite to show that simply using profit functions with diminishing returns is not a sufficient way to reflect this reality (could be the opposite if reality is more like a transaction fee), and that this mistake can be large. But all our examples are in the fisheries context, so this may take some finesse. (Since we’re aiming for Eco Apps rather than, say, Can Jor Fisheries)

  2. Emphasizing the "Pretty Darn Good" angle – thinking of the policies we derive with adjustment costs not as the "true optimum" but as a "Pretty Darn Good" policy that can be more robust to adjustment costs (provided you have the intuition to know whether those costs are more like a fixed transaction fee or some proportional cost). The last two figures help with this, since they show the use of policies under different cost regimes than those under which they were computed to be optimal.

  3. Need to figure out what to say about policies that can ‘self-adjust’, e.g. when you don’t have to change the law to respond to the fluctuations. (Jim pointed out that Salmon are the best/only case where you can actually manage by “escapement” since you get a complete population census from the annual runs).

  • Stripping down the complexity of the charts
  • Conversely, may need to show some examples of the fish stock dynamics (in the search for simplicity I've focused almost all the graphs on harvest dynamics).
  • Calibrating and running the case of a quadratic control term for comparison.

As a bonus, I quickly ran the tipping point models, and it looks like these stay really close to the Reed solution – e.g. relative to the safer Beverton-Holt world, they are much happier to pay whatever the adjustment cost might be to stick with the optimal than they are to risk total collapse. Not sure, but maybe I should add this to the paper…




Docker Notes

14 Aug 2014

Ticking through a few more of the challenges I raised in my first post on docker; here I explore some of the issues about facilitating interaction with a docker container so that a user’s experience is more similar to working in their own environment and less like working on a remote terminal over ssh. While technically minor, these issues are probably the first stumbling blocks in making this a valid platform for new users.

Sharing a local directory

Launch a bash shell on the container that shares the current working directory of the host machine (from pwd) with the /host directory on the container (thanks to Dirk for this solution):

docker run -it -v $(pwd):/host cboettig/ropensci-docker /bin/bash

This allows a user to move files on and off the container, use a familiar editor and even handle things like git commits / pulls / pushes in their working directory as before. Then the code can be executed in the containerized environment which handles all the dependencies. From the terminal docker opens, we just cd /host where we find our working directory files, and can launch R and run the scripts. A rather clean way of maintaining the local development experience but containerizing execution.

In particular, this frees us from having to pass our git credentials etc to the container, though is not so useful if we’re wanting to interact with the container via the RStudio server instead of R running in the terminal. (More on getting around this below).

Unfortunately, Mac and Windows users have to run Docker inside an already-virtualized environment such as that provided by boot2docker or vagrant. This means that only the directories on the virtualized environment, not those on the native OS, can be shared in this way. While one could presumably keep a directory synced between this virtual environment and the native OS (standard in vagrant), this is a problem for the easier-to-use boot2docker at this time: (docker/issues/7249).

A Docker Desktop

Dirk brought this docker-desktop to my attention, which uses Xpra (in place of X11 forwarding) to provide a window with fluxbox running on Ubuntu along with common applications like libreoffice, firefox, and the rox file manager. Pretty clever, and it worked just fine for me, but it needs Xpra on the client machine and requires some extra steps (run the container, query for passwords and ports, run ssh to connect, then run Xpra to launch the window). The result is reasonably responsive but still slower than virtualbox, and probably too slow for real work.

Base images?

The basic ubuntu:14.04 seems like a good lightweight base image (at 192 MB), but other images try to give more useful building blocks, like phusion/baseimage (423 MB). Their docker-bash script and other utilities provide some handy features for managing / debugging containers.

Other ways to share files?

Took a quick look at this Dockerfile for running dropbox, which works rather well (at least on a linux machine, since it requires local directory sharing). It could probably be done without explicit linking to local directories, to facilitate moving files on and off the container. Of course one can always scp/rsync files on and off containers if ssh is set up, but that is unlikely to be a popular solution for students.

While we have rstudio server running nicely in a Docker container for local or cloud use, it's still an issue getting Github ssh keys set up to be able to push changes to a repo. We can get around this by linking to our keys directory with the same -v option shown above. We still need a few more steps: setting the Git username and email, and running ssh-add for the key. Presumably we could do this with environment variables and some adjustment to the Dockerfile:

docker run -it -v /path/to/keys:/home/rstudio/.ssh/ -e "USERNAME=Carl Boettiger" -e "EMAIL=cboettig@example.org" cboettig/ropensci-docker

which would prevent storing these secure values on the image itself.
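Inside the container that might look something like the following sketch; the key filename and the way the Dockerfile would consume those variables are assumptions:

# configure git from the environment variables passed to docker run
git config --global user.name "$USERNAME"
git config --global user.email "$EMAIL"

# start an agent and add the mounted key (assuming it is named id_rsa)
eval "$(ssh-agent -s)"
ssh-add /home/rstudio/.ssh/id_rsa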




An appropriate amount of fun with docker?

08 Aug 2014

An update on my exploration with Docker. Title courtesy of Ted, with my hopes that this really does move us in a direction where we can spend less time thinking about the tools and computational environments. Not there yet, though.

I’ve gotten RStudio Server working in the ropensci-docker image (Issues/pull requests welcome!).

docker run -d -p 8787:8787 cboettig/ropensci-docker

will make an RStudio server instance available to you in your browser at localhost:8787. (Change the first number after the -p to use a different port.) You can log in with username:password rstudio:rstudio and have fun.

One thing I like about this is the ease with which I can now get an RStudio server up and running in the cloud (e.g. I took this for a sail on DigitalOcean.com today). This means that in a few minutes and for 1 penny you have a URL that you and any collaborators could use to interact with R using the familiar RStudio interface, already provisioned with your data and dependencies in place.


For me this is a pretty key development. It replaces a lot of command-line only interaction with probably the most familiar R environment out there, online or off. For more widespread use or teaching this probably needs to get simpler still. I’m still skeptical that this will make it out beyond the crazies, but I’m less skeptical than I was when starting this out.

The ropensci-docker image could no doubt be more modular (and better documented). I'd be curious to hear if anyone has had success or problems running docker on windows / mac platforms. Issues or pull requests on the repo would be welcome! https://github.com/ropensci/docker-ubuntu-r/blob/master/add-r-ropensci/Dockerfile (maybe the repo needs to be renamed from its original fork now too…)

Rich et al. highlighted several "remaining challenges" in their original post. Here's my take on where those stand in the Docker framework, though I'd welcome other impressions:

  1. Dependencies could still be missed by incomplete documentation

I think this one is largely addressed, at least assuming a user loads the Docker image. I’m still concerned that later builds of the docker image could simply break the build (though earlier images may still be available). Does anyone know how to roll back to earlier images in docker?

  2. The set of scripts for managing reproducibility are at least as complex as the analysis itself

I think a lot of that size is due to the lack of an R image for Travis and the need to install many common tools from scratch. Because docker is both modular and easily shared via docker hub, it's much easier to write a really small script that builds on existing images (as I show in cboettig/rnexml).

  3. Travis CI constraints: public/open github repository with analyses that run in under 50 minutes.

Docker has two advantages and also some weaknesses here: (1) it should be easy to run locally, while accomplishing much of the same thing as running on travis (though clearly that’s not as nice as running automatically & in the cloud on every push). (2) It’s easier to take advantage of caching – for instance, cboettig/rnexml provides the knitr cache files in the image so that a user can start exploring without waiting for all the data to download and code to run.

It seems that Travis CI doesn't currently support docker since the linux kernel they use is too old. (Presumably they'll update one day. Has anyone tried Shippable CI, which supports docker?)

  4. The learning curve is still prohibitive

I think that's still true. But what surprised me is that I'm not sure that it's gotten any worse by adding docker than it was to begin with using Travis CI. Because the approach can be used both locally and for scaling up in the cloud, I think it offers some more immediate payoffs to users than learning a Github+CI approach does. (Notably it doesn't require any git just to deploy something 'reproducible', though of course it works nicely with git.)




Too Much Fun With Docker

07 Aug 2014

NOTE: This post was originally drafted as a set of questions to the revived ropensci-discuss list; hopefully readers might join the discussion from there.

Been thinking about Docker and the discussion about reproducible research in the comments of Rich et al.'s recent post on the rOpenSci blog, where quite a few people mentioned the potential for Docker as a way to facilitate this.

I’ve only just started playing around with Docker, and though I’m quite impressed, I’m still rather skeptical that non-crazies would ever use it productively. Nevertheless, I’ve worked up some Dockerfiles to explore how one might use this approach to transparently document and manage a computational environment, and I was hoping to get some feedback from all of you.

For those of you who are already much more familiar with Docker than me (or are looking for an excuse to explore!), I’d love to get your feedback on some of the particulars. For everyone, I’d be curious what you think about the general concept.

So far I've created a Dockerfile and image.

If you have docker up and running, perhaps you can give it a test drive:

docker run -it cboettig/ropensci-docker /bin/bash

You should find R installed with some common packages. This image builds on Dirk Eddelbuettel’s R docker images and serves as a starting point to test individual R packages or projects.

For instance, my RNeXML manuscript draft is a bit more of a bear than usual to run, since it needs rJava (requires external libs), Sxslt (only available on Omegahat and requires extra libs), and the latest phytools (a tar.gz file from Liam's website), along with the usual mess of a pandoc/latex environment to compile the manuscript itself. By building on ropensci-docker, we need a pretty minimal docker file to compile this environment:

You can test drive it (docker image here):

docker run -it cboettig/rnexml /bin/bash

Once in bash, launch R and run rmarkdown::render("manuscript.Rmd"). This will recompile the manuscript from cache and leave you to interactively explore any of the R code shown.

Advantages / Goals

Being able to download a pre-compiled image means a user can run the code without dependency hell (often not as much an R problem as it is in Python, but nevertheless one that I hit frequently, particularly as my projects age), and also without altering their personal R environment. And third, in principle, this makes it easy to run the code on a cloud server, scaling the computing resources appropriately.

I think the real acid test for this is not merely that it recreates the results, but that others can build and extend on the work (with fewer rather than more barriers than usual). I believe most of that has nothing to do with this whole software image thing – providing the methods you use as general-purpose functions in an R package, or publishing the raw (& processed) data to Dryad with good documentation will always make work more modular and easier to re-use than cracking open someone’s virtual machine. But that is really a separate issue.

In this context, we look for an easy way to package up whatever a researcher or group is already doing into something portable and extensible. So, is this really portable and extensible?

Concerns:

  1. This presupposes someone can run docker on their OS – and from the command line at that. Perhaps that's the biggest barrier to entry right now (though given docker's virulent popularity, maybe that's something smart people with big money might soon solve).

  2. The only way to interact with the thing is through a bash shell running on the container. An RStudio server might be much nicer, but I haven't been able to get that running. Anyone know how to run RStudio server from docker?

(I tried & failed)

  3. I don't see how users can move local files on and off the docker container. In some ways this is a great virtue – forcing all code to use fully resolved paths, like pulling data from Dryad instead of from their hard drive, and pushing results to a (possibly private) online site to view them. But it's obviously a barrier to entry. Is there a better way to do this?

Alternative strategies

  1. Docker is just one of many ways to do this (particularly if you’re not concerned about maximum performance speed), and quite probably not the easiest. Our friends at Berkeley D-Lab opted for a GUI-driven virtual machine instead, built with Packer and run in Virtualbox, after their experience proved that students were much more comfortable with the mouse-driven installation and a pixel-identical environment to the instructor’s (see their excellent paper on this).

  2. Will/should researchers be willing to work and develop in virtual environments? In some cases, the virtual environment can be closely coupled to the native one – you use your own editors etc. to do all the writing, and then execute in the virtual environment (it seems this is easier in the docker/vagrant approach than in the BCE).




Notes

06 Aug 2014

Writing

Exploring

Reading

Fun piece in the Guardian from Digital Science manager Timo Hannay on the future of scientific publishing. I think (or choose to believe that) the thesis is at the end rather than in the title.

Also commented here on DocZen’s post
