Lab Notebook

Coding

  • cboettig pushed to master at cboettig/labnotebook: link to ropensci 10:20 2014/11/26
  • cboettig pushed to master at rocker-org/ropensci: chunk into more AUFS layers 10:26 2014/11/25
  • cboettig pushed to gh-pages at cboettig/labnotebook: Updating to cboettig/labnotebook@4f42e64. 09:48 2014/11/25
  • cboettig pushed to sandbox at rocker-org/rocker: image name had somehow been omitted, whoops 09:42 2014/11/25
  • cboettig pushed to master at cboettig/labnotebook: post new notes 09:39 2014/11/25

Discussing

Reading

  • Assessing trade-offs to inform ecosystem-based fisheries management of forage fish.: Scientific reports (2014). Pages: 7110. Andrew Olaf Shelton, Jameal F Samhouri, Adrian C Stier, Philip S Levin et al. 07:08 2014/11/25
  • The Southern Kalahari: a potential new dust source in the Southern Hemisphere?: Environmental Research Letters (2012). Volume: 7, Issue: 2. Pages: 024001. Abinash Bhattachan, Paolo D’Odorico, Matthew C Baddock, Ted M Zobeck, Gregory S Okin, Nicolas Cassar et al. 11:45 2014/11/04
  • Resilience and recovery potential of duneland vegetation in the southern Kalahari: Ecosphere (2014). Volume: 5, Issue: January. Pages: 1-14. A Bhattachan, P D'Odorico, K Dintwe et al. 11:45 2014/11/04
  • Potential dust emissions from the southern Kalahari's dunelands: Journal of Geophysical Research: Earth Surface (2013). Volume: 118, Issue: 1. Pages: 307-314. Abinash Bhattachan, Paolo D'Odorico, Gregory S. Okin, Kebonyethata Dintwe et al. 11:45 2014/11/04

Entries

Coreos Docker Registries Etc

24 Nov 2014

A secure docker registry

Running one’s own docker registry is far more elegant than moving tarballs between machines (e.g. when migrating between servers, particularly for images that may contain sensitive data such as security credentials). While it’s super convenient to have a containerized version of the Docker registry ready for action, it doesn’t do much good without putting it behind an HTTPS server (otherwise we have to restart our entire docker service with the insecure flag to permit communication with an unauthenticated registry – doesn’t sound like a good idea). So this meant learning how to use nginx load balancing, which I guess is useful to know more generally.

First pass: nginx on ubuntu server

After a few false starts, I decided the digitalocean guide is easily the best (though steps 1-3 can be skipped by using a containerized registry instead). This runs nginx directly from the host OS, which is in some ways more straightforward but less portable. A few notes-to-self from working through the tutorial:

  • Note: At first, nginx refuses to run because there is a default configuration in /etc/nginx/sites-enabled that creates a conflict. Remove this and things go pretty nicely.

  • Note: Binding the registry container explicitly to 127.0.0.1 provides an internal-only address that we can then point to from nginx. (Actually this will no longer matter when we use a containerized nginx, since we will simply not export these ports at all, but only expose the port of the nginx load balancer). Still, good to finally be aware of the difference between 127.0.0.1 (the loopback address, reachable only from the host itself) and 0.0.0.0 (all interfaces, and the default if we supply only a port) in this context.
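
A minimal sketch of what that binding looks like in the host-level nginx setup (assuming the registry's default port 5000; the container name is just illustrative):

# publish the registry only on the loopback interface; nginx proxies to it
docker run -d --name=registry -p 127.0.0.1:5000:5000 registry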

  • Note: Running and configuring nginx: keys are specific to the URL. This is necessary for the server signing request, but I believe could have been omitted in the root certificate. Here's how we go about creating a root key and certificate (crt), a server key and server signing request (csr), and then signing the latter with the former to get the server certificate.

openssl genrsa -out dockerCA.key 2048
openssl req -x509 -new -nodes -key dockerCA.key -days 10000 -out dockerCA.crt -subj '/C=US/ST=Oregon/L=Portland/CN=coreos.carlboettiger.info'
openssl genrsa -out docker-registry.key 2048
openssl req -new -key docker-registry.key -out docker-registry.csr -subj '/C=US/ST=Oregon/L=Portland/CN=coreos.carlboettiger.info'
openssl x509 -req -in docker-registry.csr -CA dockerCA.crt -CAkey dockerCA.key -CAcreateserial -out docker-registry.crt -days 10000

Note that we also need the htpasswd file from above, which needs apache2-utils and so cannot be generated directly from the CoreOS terminal (though the openssl certs can):

sudo htpasswd -bc /etc/nginx/docker-registry.htpasswd $USERNAME $PASSWORD

Having created these ahead of time, I end up just copying my keys into the Dockerfile for my nginx instance (if we generated them on the container, we'd still need to get dockerCA.crt off the container to authenticate the client machines). Makes for a simple Dockerfile that we then build locally:

FROM ubuntu:14.04
RUN apt-get update && apt-get install -y apache2-utils curl nginx openssl supervisor
COPY docker-registry /etc/nginx/sites-available/docker-registry
RUN ln -s /etc/nginx/sites-available/docker-registry /etc/nginx/sites-enabled/docker-registry

## Copy over certificates ##
COPY docker-registry.crt /etc/ssl/certs/docker-registry 
COPY docker-registry.key /etc/ssl/private/docker-registry 
COPY docker-registry.htpasswd /etc/nginx/docker-registry.htpasswd


EXPOSE 8080

## use supervisord to persist
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
CMD ["/usr/bin/supervisord"]

Note that we need to install the dockerCA.crt certificate on any client that wants to access the private registry. On Ubuntu this looks like:

sudo mkdir /usr/local/share/ca-certificates/docker-dev-cert
sudo cp dockerCA.crt /usr/local/share/ca-certificates/docker-dev-cert
sudo update-ca-certificates 
sudo service docker restart

But on CoreOS we use a different directory (and restarting the docker service doesn’t seem possible or necessary):

sudo cp dockerCA.crt /etc/ssl/certs/docker-cert
sudo update-ca-certificates

  • Note: Could not get the official nginx container to run the docker-registry config file as /etc/nginx/nginx.conf, either with or without adding daemon off; at the top of /etc/nginx/nginx.conf. With it, nginx complains this is a duplicate (despite being recommended in the nginx container documentation, though admittedly it already appears in the default command ["nginx", "-g", "daemon off;"]). Without it, the error says that the upstream directive is not allowed here. Not sure what to make of these errors, so I ended up running an ubuntu container and just installing nginx etc. following the digitalocean guide. I also ended up dropping the daemon off; from the config file and running service nginx start through supervisord to ensure that the container stays up. Oh well.

  • Note: I got a 502 error when calling curl against the nginx container-provided URL (with or without SSL enabled), since from inside the nginx container we cannot access the host addresses. The simplest solution is to add --net="host" when we docker run the nginx container, but this isn't particularly secure. Instead, we'll link directly to the ports of the registry container like this:

docker run --name=registry -p 8080:8080 registry
docker run --name=nginx --net=container:registry nginx

Note that we do not need to export the registry port (e.g. -p 5000:5000) at all, but we do need to export the nginx load-balancer port from the registry container first, since we will simply be linking its network with the special --net=container:registry.

Note that we would probably want to link a local directory to provide persistent storage for the registry; in the above example images committed to registry are lost when the container is destroyed.
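
Something along these lines should do it (a sketch: the host path is arbitrary, and I'm assuming the registry image's default dev flavor keeps its data under /tmp/registry):

# keep pushed images on the host so they survive the container
docker run --name=registry -p 8080:8080 -v /home/core/registry-data:/tmp/registry registry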

We can now log in:

docker login https://<YOUR-DOMAIN>:8080

We can now reference our private registry by using its full address in the namespace of the image in commands to docker pull, push, run etc.
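
For example (the image name here is hypothetical):

docker tag my-image <YOUR-DOMAIN>:8080/my-image
docker push <YOUR-DOMAIN>:8080/my-image
docker pull <YOUR-DOMAIN>:8080/my-image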

Migrating gitlab between servers

This migration was my original motivation to configure the private docker registry; ironically it isn’t necessary for this case (though it’s useful for the drone image, for instance).

Note that there is no need to migrate the redis and postgresql containers manually. Migrating the backup file over to the corresponding location in the linked volume and then running the backup-restore is sufficient. Upgrading is also surprisingly smooth; we can backup (just in case), then stop and remove the container (leaving the redis and postgresql containers running), pull and relaunch with otherwise matched option arguments and the upgrade runs automatically.

When first launching the gitlab container on a tiny droplet running coreos, my droplet seems invariably to hang. Rebooting from the digitalocean terminal seems to fix this. A nice feature of fleet is that all the containers are restarted automatically after reboot, unlike when running these directly from docker on my ubuntu machine.

Notes on fleet unit files

Fleet unit files are actually pretty handy and straightforward. One trick is that we must quote commands in which we want to make use of environment variables. For instance, one must write:

Environment="VERSION=1.0"
ExecStart=/bin/bash -c "/usr/bin/docker run image:${VERSION}"

in a Service block, rather than ExecStart=/usr/bin/docker run ... directly, for the substitution to work. It seems that if we use the more standard practice of environment files (which is the necessary approach anyway if we want to avoid editing the unit file directly), we can drop the /bin/bash wrapper and insert the environment reference directly.
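
Something like this, I believe (an untested sketch; the environment-file path and variable name are made up):

[Service]
EnvironmentFile=/etc/myapp.env
ExecStart=/usr/bin/docker run image:${VERSION}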

If we’re not doing anything fancy wrt load balancing between different servers, we don’t have that much use for the corresponding “sidekick” unit files that keep our global etcd registry up to date. Perhaps these will see more use later.

Cloud-config

Note that we need to refresh the discovery url pretty-much anytime we completely destroy the cluster.

A few edits to my cloud-config to handle initiating swap, essential for running most things (gitlab, rstudio) on tiny droplets. Still requires one manual reboot for the allocation to take effect. Add this to the units section of #cloud-config:

    ## Configure SWAP as per https://github.com/coreos/docs/issues/52
    - name: swap.service
      command: start
      content: |
        [Unit]
        Description=Turn on swap

        [Service]
        Type=oneshot
        Environment="SWAPFILE=/1GiB.swap"
        RemainAfterExit=true
        ExecStartPre=/usr/sbin/losetup -f ${SWAPFILE}
        ExecStart=/usr/bin/sh -c "/sbin/swapon $(/usr/sbin/losetup -j ${SWAPFILE} | /usr/bin/cut -d : -f 1)"
        ExecStop=/usr/bin/sh -c "/sbin/swapoff $(/usr/sbin/losetup -j ${SWAPFILE} | /usr/bin/cut -d : -f 1)"
        ExecStopPost=/usr/bin/sh -c "/usr/sbin/losetup -d $(/usr/sbin/losetup -j ${SWAPFILE} | /usr/bin/cut -d : -f 1)"

        [Install]
        WantedBy=local.target

    - name: swapalloc.service
      command: start
      content: |
        [Unit]
        Description=Allocate swap

        [Service]
        Type=oneshot
        ExecStart=/bin/sh -c "sudo fallocate -l 1024m /1GiB.swap && sudo chmod 600 /1GiB.swap && sudo chattr +C /1GiB.swap && sudo mkswap /1GiB.swap"

Could probably be structured more elegantly, but it works. (Not much luck trying to tweak this into a bunch of ExecStartPre commands, though.)

NFS sharing on CoreOS?

Couldn’t figure this one out. My SO Q here




Coreos And Other Infrastructure Notes

19 Nov 2014

CoreOS?

Security model looks excellent. Some things not so clear:

  • In a single node setup, what happens with updates? Would containers being run directly come down and not go back up automatically? In general, how effective or troublesome is it to run a single, low-demand app on a single node CoreOS rather than, say, an ubuntu image (e.g. just to benefit from the security updates model)? For instance, would an update cause a running app to exit in this scenario? (Say, if the container is launched directly with docker and not through fleet?) (Documentation merely notes that cluster allocation / fleet algorithm is fastest with between 3 & 9 nodes).

  • If I have a heterogeneous cluster with one more powerful compute node, is there a way to direct that certain apps are run on that node and that other apps are not?

  • Looks like one needs a load-balancer to provide a consistent IP for containers that might be running on any node of the cluster?

  • Enabling swap. Works, but is there a way to do this completely in cloud-config?

Setting up my domain names for DigitalOcean

In Dreamhost DNS management:

  • I have my top-level domain registered through Dreamhost, using Dreamhost's nameservers.
  • A-level entry for top level domain points to (the new) Github domain IP address
  • Have CNAME entries for www and io pointing to cboettig.github.io

First step

  • Add an A-level entry, server.carlboettiger.info, pointing to DigitalOcean server IP

Then go over to DigitalOcean panel.

From DigitalOcean DNS management:

  • add new (A level) DNS entry as server.carlboettiger.info pointing to DO server IP
  • Delete the existing three NS entries ns1.digitalocean.com etc.
  • Add three new NS entries using ns1.dreamhost.com etc

Things should be good to go!




Nimble Explore

14 Nov 2014

A quick first exploration of NIMBLE and some questions.

library("nimble")
library("sde")

Let’s simulate from a simple OU process: \(dX = \alpha (\theta - X) dt + \sigma dB_t\)

set.seed(123)
d <- expression(0.5 * (10-x))
s <- expression(1) 
data <- as.data.frame(sde.sim(X0=6,drift=d, sigma=s, T=100, N=400))
## sigma.x not provided, attempting symbolic derivation.

i.e. \(\alpha = 0.5\), \(\theta = 10\), \(\sigma=1\), starting at \(X_0 = 6\) and running for 100 time units with a dense sampling of 400 points.

Le’t now estimate a Ricker model based upon (set aside closed-form solutions to this estimate for the moment, since we’re investigating MCMC behavior here).

code <- modelCode({
      K ~ dunif(0.01, 40.0)
      r ~ dunif(0.01, 20.0)
  sigma ~ dunif(1e-6, 100)

  iQ <- 1 / (sigma * sigma)

  x[1] ~ dunif(0, 10)

  for(t in 1:(N-1)){
    mu[t] <- log(x[t]) + r * (1 - x[t]/K) 
    x[t+1] ~ dlnorm(mu[t], iQ) 
  }
})

constants <- list(N = length(data$x))
inits <- list(K = 6, r = 1, sigma = 1)

Rmodel <- nimbleModel(code=code, constants=constants, data=data, inits=inits)

NIMBLE certainly makes for a nice syntax so far. Here we go now: create MCMC specification and algorithm

mcmcspec <- MCMCspec(Rmodel)
Rmcmc <- buildMCMC(mcmcspec)

Note that we can also query some details regarding our specification (set by default)

mcmcspec$getSamplers()
## [1] RW sampler;   targetNode: K,  adaptive: TRUE,  adaptInterval: 200,  scale: 1
## [2] RW sampler;   targetNode: r,  adaptive: TRUE,  adaptInterval: 200,  scale: 1
## [3] RW sampler;   targetNode: sigma,  adaptive: TRUE,  adaptInterval: 200,  scale: 1
mcmcspec$getMonitors()
## thin = 1: K, r, sigma, x

Now we’re ready to compile model and MCMC algorithm

Cmodel <- compileNimble(Rmodel)
Cmcmc <- compileNimble(Rmcmc, project = Cmodel)

Note we could have specified the Rmodel as the “project” (as shown in the example from the Nimble website), but this is more explicit. Rather convenient way to add to an existing model in this manner.

And now we can execute the MCMC algorithm in blazing fast C++ and then extract the samples:

Cmcmc(10000)
## NULL
samples <- as.data.frame(as.matrix(nfVar(Cmcmc, 'mvSamples')))

How do these estimates compare to theory:

mean(samples$K)
## [1] 10.05681
mean(samples$r)
## [1] 0.180207

Some quick impressions:

  • Strange that Rmodel call has to be repeated before we can set up a custom MCMC (nimble docs). How/when was this object altered since it was defined in the above code? Seems like this could be problematic for interpreting / reproducing results?

  • What’s going on with getSamplers() and getMonitors()? Guessing these are in there just to show us what the defaults will be for our model?

  • Why do we assign Cmodel if it seems we don't use it? (The compilation needs to be done but isn't explicitly passed to the next step.) It seems we can use Cmodel instead of Rmodel in Cmcmc <- compileNimble(Rmcmc, project = Cmodel), which at least makes the dependency more explicit in the notation. Seems like it should be possible to omit the first compileNimble() and have the second call compileNimble automatically if it gets an object whose class is that of Rmodel instead?

  • Repeated calls to Cmcmc seem not to give the same results. Are we adding additional mcmc steps by doing this?

  • Thinking an as.data.frame would be nicer than as.matrix in the nfVar mvSamples coercion.

  • Don’t understand what simulate does (or why it always returns NULL?).




Dear DockerHub users: please configure your repository links

(for security's sake!)

07 Nov 2014

The DockerHub is a great resource for discovering and distributing Dockerfiles. Many users sharing public images take advantage of the Docker Hub's Automated Build configuration, which is excellent: it allows the Hub to display the Dockerfile and provides some measure of security beyond simply downloading and running an untrusted binary black box.

Unfortunately, far fewer users configure Repository Links, which trigger rebuilds even when the Dockerfile itself is unchanged. As a result, many excellent Docker containers that are not under active development have not been rebuilt in several months, meaning that they still contain widely known, dangerous security flaws such as Shellshock (September 2014).

This problem is easily avoided by configuring the Repository Links setting to point to the repository being used as a base image in FROM. The official base images such as debian and ubuntu (e.g. the images with no additional namespace) are regularly updated to patch security vulnerabilities as soon as they are discovered, resulting in updates being made every few days on average. Setting the repository link to the FROM source allows your repository to be rebuilt as soon as its base image has been updated, ensuring that you inherit those updates.

Naturally this strategy does not help if your FROM image isn’t an official base image and hasn’t configured Repository Links (or if such a break in the chain appears anywhere along the FROM recursion). In such cases, having a RUN apt-get update && apt-get upgrade -y command (or equivalent option for your distribution) might be a good idea to make sure that your image at least gets the latest updates, but you’ll still need to set up some automatic or manual Build Triggers to ensure this is run regularly; or better yet, just avoid building on or using stale images.

If you do have a reliable Repository Links chain to an official image, then apt-get upgrade is not necessary (and in fact is not advised in Best Practices). Instead, make sure all images in the chain call apt-get update in the same RUN line as apt-get install -y ..., which will ensure that cache is broken and the latest versions of the packages are installed. See the official Dockerfile Best Practices for more information.
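
For instance, a minimal sketch of that pattern (the package names are just illustrative):

FROM ubuntu:14.04
## update and install in a single RUN so a stale cached apt index is never
## reused when the package list changes or the image is rebuilt
RUN apt-get update && apt-get install -y \
    curl \
    git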


NB: I’m not a security professional; this just looks like common sense usage




linking binaries from other containers

05 Nov 2014

Been thinking about this for a while, but @benmarwick's examples with --volumes-from convinced me to give this a try.

While there’s an obvious level of convenience in having something like LaTeX bundled into the hadleyverse container so that users can build nice pdfs, it often feels not very docker-esque to me to just throw the kitchen sink into a container. At the risk of some added complexity, we can provide LaTeX from a dedicated TeX container to a container that doesn’t have it built in, like rocker/rstudio. Check this out:

First, we run the docker container providing the texlive binaries as a linked volume. Note that even after the 4 GB texlive container has been downloaded, this is slow to execute due to the volume linking flag (not really sure why that is).

docker run --name tex -v /usr/local/texlive leodido/texlive true

Once the above task is complete, we can run the rstudio container, which doesn’t have tex installed by itself, and access tex by linking:

docker run -dP --volumes-from tex \
 -e PATH=$PATH:/usr/local/texlive/2014/bin/x86_64-linux/ \
 rocker/rstudio

We can now log into RStudio, create a new Rnw file and presto, RStudio discovers the tex compilers and builds us a pdf. This does make our Docker execution lines a bit long, but that’s what fig is for. (Or a good ole Makefile).
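
A rough fig.yml equivalent might look something like this (an untested sketch; I spell out PATH in full rather than appending to the container's default, since fig won't expand $PATH the way the host shell does above):

tex:
  image: leodido/texlive
  command: "true"
  volumes:
    - /usr/local/texlive

rstudio:
  image: rocker/rstudio
  ports:
    - "8787:8787"
  volumes_from:
    - tex
  environment:
    PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/texlive/2014/bin/x86_64-linux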

Note this requires we build texlive in a way that isolates it to its own path (e.g. /usr/local/texlive). The default installation with apt-get installs everything in separate locations that overlap with existing directories (like /usr/bin), which makes linking clumsy or impossible (we would need separate paths for all the components, e.g. since shared libraries aren’t found under the bin path, and we cannot link such a volume to another container without destroying everything in its /usr/bin, clearly not a good idea). Instead, if we use the standard texlive install script from https://www.tug.org/texlive/, this installs everything into /usr/local/texlive, which is much more portable as illustrated above. Not quite sure if it’s actually a good idea to build containers this way or not.

I’ll keep shipping latex inside the hadleyverse container (it has about 300 MB of texlive that covers the most common use cases), but this is certainly an intriguing recipe for mixing and matching.




Three Interfaces For Docker

03 Nov 2014

Here I outline three broad, different strategies for incorporating Docker into a user’s workflow, particularly from the perspective of an instructor getting a group of students up and running in a containerized environment, but also in the context of more generic collaborations. The options require progressively more setup and result in a progressively more ‘native’ feel to running Docker. My emphasis is on running Dockerized R applications and RStudio, though much the same thing can be accomplished with iPython notebooks and many other web apps.

Of course the great strength of Docker is the relative ease with which one can move between these three strategies while using the identical container, maintaining a consistent computational environment in each case.

Web-hosted Docker

In this approach, RStudio-server is deployed on a web server and accessed through the browser. The use of Docker containers makes it easier for an instructor to deploy a consistent environment quickly with the desired software pre-installed and pre-configured.

Advantages:

  • A user just needs a web browser and the URL of the server.
  • No need to install any local software.
  • No need to download big files.
  • Should work with any device that supports a modern browser, including most tablets.
  • Convenient to temporarily scale computation onto a larger system.

Disadvantages:

  • requires a network connection (at all times)
  • requires access to a server with sufficient computational power for the task.
  • Someone has to manage user & network security (as with any web server).
  • Need additional mechanisms for moving files on and off the server, such as git.
  • No native interfaces available, must manage files, edit text etc. through the RStudio IDE

Setup:

A Docker container running RStudio can be deployed with a single command, see rocker wiki instructions on RStudio for details. The instructor or team-member responsible for the setup would simply need to install docker on the server. If multiple students will be accessing a single RStudio-server instance, it must be configured for multiple users. Alternatively, multiple containers can be run on different ports of the same server. (See wiki).
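
For the single-user case that boils down to something like the following (port 8787 is RStudio Server's default; see the wiki for setting a custom username and password):

docker run -d -p 8787:8787 rocker/rstudio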

Hint: Users can also take advantage of the new R package analogsea to quickly launch and manage an RStudio Server instance on the Digital Ocean cloud platform. analogsea can also facilitate transfers of code and other files onto and off of the server.

Self-hosted Docker

In this approach, the user installs docker (via boot2docker, if necessary) on their local machine, but still interacts with the container using the same web-based interface (e.g. rstudio-server, ipython-notebook) that one would use in the cloud-hosted model.

Advantages:

  • No need for a network connection (at least once the container image is downloaded / transferred)
  • No need to have a server available (with the associated cost and security overhead)

Disadvantages:

  • More initial setup: install docker locally, or install boot2docker for Mac/Windows users.
  • Need to use git or docker cp to move files from the container to the host or vice versa.

Hint: Users might also check out the R package harbor for interacting with Docker locally from R.

Setup:

Setup is much the same as on a remote server, though there is no need to set custom usernames or passwords since the instance will be accessible only to local users. See rocker wiki instructions on RStudio for details.

Integrated Docker

This approach is the same as the self-hosted approach, except that we link shared volumes with the host. At minimum this makes it easier to move files on and off the container without learning git.

An intriguing advantage of this approach is that it does not restrict the user to the RStudio IDE as a way of editing text, managing files and versions, etc. Most users do not rely exclusively on RStudio for these tasks, and may find that restriction limiting. The integrated approach may be better suited for experienced users who are set in their ways and do not need the pixel-identical RStudio environment that is useful for following directions in a classroom. In the integrated approach, a user can continue to rely on whatever their preferred native tools are, while ensuring that code execution occurs (invisibly) in a Dockerized container.

Advantages

  • Can use native OS tools (text editors, file browsers, version control front ends, etc) for all interactions
  • No network required (once the image is downloaded / transferred).
  • No servers required

Disadvantages

  • Additional setup beyond self-hosting: mapping shared volumes, managing user permissions.
  • Potentially less well suited for classroom use, which may benefit from everyone using the identical RStudio interface rather than a range of different text editors, etc. (Of course one can still share volumes while using RStudio as the IDE).
  • Cannot open external windows: e.g. if running R in a terminal instead of RStudio, the container running R cannot open an X11 window to display plots. Instead, a user must do something like ggsave() after plotting interactively and view the resulting graphic in the native file browser. (This is more tedious with base graphics, which need dev.off() etc.) Of course this is not an issue when using RStudio with linked volumes.

Setup

The key here is simply to link the working directory on the host to the file system on the container. That way any changes made to the host copy using the host OS tools are immediately available to the container, and vice versa. Setup requires a bit more effort on Windows at this time, though it is natively supported for Mac in Docker 1.3. Some care may also be necessary to avoid changing file permissions. See details in the rocker wiki on shared files.
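
As a sketch (assuming the rocker/rstudio image, whose default user's home directory is /home/rstudio):

# share the current working directory into the rstudio user's home
docker run -d -p 8787:8787 -v $(pwd):/home/rstudio rocker/rstudio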

aliases

The most aggressive form of the integrated approach is to literally alias common commands like R or rstudio as the corresponding docker calls in .bashrc, e.g.

alias R='docker run --rm -it --user docker -v $(pwd):/home/docker/`basename $PWD` -w /home/docker/`basename $PWD` rocker/hadleyverse R'

makes the command R launch an instance of the rocker/hadleyverse container sharing the current working directory. Clearly different containers could be substituted in place of rocker/hadleyverse, including custom extensions. This helps ensure that R is always run in the portable, Dockerized environment. Other than the lack of X11 display for plots, this works and feels identical to an interactive R terminal session.

Other tweaks

Mac/Windows users might also want to customize boot2docker’s resources to make more of the host computer’s memory and processors available to Docker.




Goodbye Jekyll?

28 Oct 2014

The great strength of Jekyll is in providing a really convenient HTML templating system through the _layouts and _includes directories and Liquid variables (including the auto-populated ones like page.previous.url).

For quickly deploying simple sites though, this is often unnecessary: one or two layout files will suffice, and an _includes directory is not so useful with only a single layout. The ease of maintenance by having a template divided into modular chunks is somewhat trumped by the greater simplicity of copying a single template or set of templates over into a new directory.

And deploying Jekyll could be easier, particularly with pandoc as the parser. Despite plugins that nicely let pandoc act just like the built-in parsers and a CI setup with Travis to support automated building of my site on push, setting these components up repeatedly on every new repository is a bit tedious. Occasional updates of Jekyll and related gems have also broken my build pipeline more than once, though this is less of an issue now that I’ve added bundler and a Gemfile to restrict gem versions and provide a Dockerized setup for local deploying. These things keep the overhead low for my main site, but are an overhead to replicate.

Meanwhile, I’ve found Pandoc’s templating system to be immensely powerful, particularly with the yaml headers now supported. To provide a lightweight way to deploy a website on a gh-pages branch of a new repository, I’ve found this system works quite well. I’ve illustrated this on my gh-pages branch of my template repository. Previously, this used Jekyll with the built-in redcarpet markdown parser to deploy markdown files in a style consistent with my notebook.

Now, I’ve stripped this down to simply use a pandoc template, pandoc YAML, and a Makefile to accomplish much the same thing.

I was disappointed to see that the _output.yaml used by rmarkdown for building multi-page websites did not leverage the generic metadata.yaml approach already built into pandoc. This prevents us from specifying custom generic metadata the way one does in Jekyll with _config.yml, as I describe in rmarkdown#297. I can work around this with the Makefile by calling pandoc manually with the additional metadata.yaml file, as follows:

%.html: %.Rmd
  R --vanilla --slave -e "knitr::knit('$<')"
  pandoc --template _layouts/default.html metadata.yaml -o $@ $(basename $<).md
  rm $(basename $<).md

%.html: %.md
  pandoc --template _layouts/default.html metadata.yaml -o $@ $< 

Note the Rmd building is somewhat more cumbersome since we have to bypass rmarkdown::render for this to work.

I had to collapse all my _includes and nested _layouts into a single layout, replace the Jekyll Liquid {{ }} blocks with pandoc-template $ $ ones, and write out a basic metadata.yaml file, and things are good to go.


Nonetheless, I sometimes wish I could break templates into more re-usable components, similar to the way _includes provides re-usable components for the templates specified in the _layouts directory of a jekyll site.

My first thought was to simply add the re-usable elements into a metadata block itself. (This seemed particularly promising since we can already have an external metadata.yaml to provide a metadata block we can use across multiple files.) However, it seems that Pandoc always escapes the HTML contents in my yaml metadata. For instance, if I add the block:

---
header: |
    <header class="something"><h1>$title$</h1></header>
---

and then in my template add $header$, I get the above block but with all the angle brackets escaped. I had thought that since I have denoted this as a literal block with ‘|’ in the yaml, I would get the block unaltered. How can I prevent pandoc from escaping the HTML? (I realize that still wouldn’t parse the $title$ metadata, but that’s a separate issue).

The other approach I considered is to exploit the --include-before-body and --include-after-body arguments. While more limited since I am restricted to these two variables, this approach does allow me to specify a file with a re-usable component block and avoids the issue of HTML escaping observed above. Other than the limit of two such variables, the other limit to this approach is that metadata elements like $title$ are processed only in templates, not in files.
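
That approach looks roughly like this (the file names here are hypothetical):

pandoc --template _layouts/default.html metadata.yaml \
  --include-before-body=_includes/header.html \
  --include-after-body=_includes/footer.html \
  -o index.html index.md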

It seems like pandoc is thus really close to being able to support templates that are made from re-usable blocks rather than completely specified from scratch, but not quite there. I realize pandoc isn’t trying to be a replacement for static website generation, but still feel that re-usable blocks would make the existing template system a bit more flexible and user-friendly.

Quite a few of pandoc’s current functions already approximate this behavior in a hard-wired fashion; e.g. $highlight-css$ uses the --highlight-style option to select among a bunch of pre-defined highlight blocks. Thus I suspect pandoc might be easier to extend in the future if such features could just be added through an include mechanism rather than this hardwired approach.

See this as a query to pandoc-discuss




Docker And User Permissions Craziness

21 Oct 2014

Lots of craziness getting to the bottom of permissions changes, as discussed in:

Long story short: docker cares only about UIDs, so we have to explicitly make sure these match. Some very good answers including from Docker core-team members on the discussion list. Overall approach outlined at the end of the rocker issues tracker.
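
A sketch of the idea (not the actual rocker script; the USERID variable name is illustrative): pass the host user's UID into the container and adjust the container user to match before switching to it.

# match bob's UID inside the container to the host user's UID, then switch to bob
docker run --rm -it -v $(pwd):/home/bob/subdir -e USERID=$(id -u) test \
  bash -c 'usermod -u $USERID bob && su - bob'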

Here’s the SO version of the question, for my reference:


Consider the following trivial Dockerfile:

FROM debian:testing
RUN  adduser --disabled-password --gecos '' docker
RUN  adduser --disabled-password --gecos '' bob 

in a working directory with nothing else. Build the docker image:

docker build -t test .

and then run a bash script on the container, linking the working directory into a new subdir on bob’s home directory:

docker run --rm -it -v $(pwd):/home/bob/subdir test 

Who owns the contents of subdir on the container? On the container, run:

cd /home/bob/subdir
ls -l

and we see:

-rw-rw-r-- 1 docker docker 120 Oct 22 03:47 Dockerfile

Holy smokes! docker owns the contents! Back on the host machine outside the container, we see that our original user still owns the Dockerfile. Let’s try and fix the ownership of bob’s home directory. On the container, run:

chown -R bob:bob /home/bob
ls -l 

and we see:

-rw-rw-r-- 1 bob bob 120 Oct 22 03:47 Dockerfile

But wait! outside the container, we now run ls -l

-rw-rw-r-- 1 1001 1001 120 Oct 21 20:47 Dockerfile

we no longer own our own file. Terrible news!


If we had only added one user in the above example, everything would have gone more smoothly. For some reason, Docker seems to make any home directory owned by the first non-root user it encounters (even if that user is declared on an earlier image). Likewise, this first user is the one that ends up with the same ownership permissions as my host user.

Question 1: Is that correct? Can someone point me to documentation of this? I’m just conjecturing based on the above experiment.

Question 2: Perhaps this is just because they both have the same numerical value on the kernel, and if I tested on a system where my home user was not id 1000 then permissions would get changed in every case?

Question 3: The real question is, of course, ‘what do I do about this?’ If bob is logged in as bob on the given host machine, he should be able to run the container as bob and not have file permissions altered under his host account. As it stands, he actually needs to run the container as user docker to avoid having his account altered.

I hear you asking: why do I have such a weird Dockerfile anyway? I wonder too sometimes. I am writing a container for a webapp (RStudio-server) that permits different users to log in, which simply uses the user names and credentials from the linux machine as the valid user names. This brings me the perhaps unusual motivation of wanting to create multiple users. I can get around this by creating the user only at runtime and everything is fine. However, I use a base image that has added a single docker user so that it can be used interactively without running as root (as per best practice). This ruins everything since that user becomes the first user and ends up owning everything, so attempts to log on as other users fail (the app cannot start because it lacks write permissions). Having the startup script run chown first solves this issue, but at the cost of linked volumes changing permissions (obviously only a problem if we are linking volumes).




Notes

20 Oct 2014

Keep thinking about this quote from Jeroen Ooms’s recent piece on the arXiv:

The role and shape of data is the main characteristic that distinguishes scientific computing. In most general purpose programming languages, data structures are instances of classes with well-defined fields and methods. […] Strictly defined structures make it possible to write code implementing all required operations in advance without knowing the actual content of the data. It also creates a clear separation between developers and users [emphasis added]. Most applications do not give users direct access to raw data. Developers focus in implementing code and designing data structures, whereas users merely get to execute a limited set of operations.

This paradigm does not work for scientific computing. Developers of statistical software have relatively little control over the structure, content, and quality of the data. Data analysis starts with the user supplying a dataset, which is rarely pretty. Real world data come in all shapes and formats. They are messy, have inconsistent structures, and invisible numeric properties. Therefore statistical programming languages define data structures relatively loosely and instead implement a rich lexicon for interactively manipulating and testing the data. Unlike software operating on well-defined data structures, it is nearly impossible to write code that accounts for any scenario and will work for every possible dataset. Many functions are not applicable to every instance of a particular class, or might behave differently based on dynamic properties such as size or dimensionality. For these reasons there is also less clear of a separation between developers and users in scientific computing. The data analysis process involves simultaneously debugging of code and data where the user iterates back and forth between manipulating and analyzing the data. Implementations of statistical methods tend to be very flexible with many parameters and settings to specify behavior for the broad range of possible data. And still the user might have to go through many steps of cleaning and reshaping to give data the appropriate structure and properties to perform a particular analysis.

Good inspiration for this week’s assignment:

Notes for SWC training: short motivational pitch on R

What’s the difference between a novice programmer and a professional programmer?
The novice pauses a moment before doing something stupid.

In this course, we’ll be learning about the R programming environment. You may already have heard of R or have used it before.

It’s the number one language for statistical programming. It’s used by most major companies, putting these skills in high demand.

Recently DICE Magazine showed that programmers with expertise in R topped the tech survey salary charts (at $115,531, above Hadoop, MapReduce, C, or Cloud, mobile or UI/UX design).

You may not (always) like the R programming environment. The syntax is often challenging, it can do counterintuitive things, and it’s terrible at mind reading.

Try not to take this out on your hardware. Hardware is expensive.

R’s strength lies in data.

No other major language has key statistical concepts like ‘missing data’ baked in at the lowest level.

We will not be approaching R as a series of recipes to perform predefined tasks.

In scientific research we can seldom predict just what our data will look like in advance or what our analysis will require. This makes it impossible to always rely on the pre-made software and graphical interfaces you may be familiar with, and blurs the line between users and developers. Unlike other languages such as C, Java or Python that are often used to build ‘end-user’ software in which the underlying language is completely invisible, R does not make this distinction. R gives a lot of power to the end user. And with great power comes great responsibility.
