Lab Notebook


Coding

  • cboettig pushed to master at cboettig/labnotebook: update Dockerfile, drop non-existant layout on atom feed update 02:22 2014/09/03
  • cboettig pushed to gh-pages at cboettig/labnotebook: Updating to cboettig/labnotebook@26552b2. 01:03 2014/09/03
  • cboettig pushed to master at cboettig/labnotebook: new post, updating Gemfile locale 12:56 2014/09/03
  • cboettig pushed to master at ropensci/docker: use the correct debian call for Rdevel install 10:57 2014/09/02
  • cboettig pushed to master at cboettig/collaboratool: add a Dockerfile providing RStudio-server and BCE image 10:56 2014/09/02

Discussing

  • "Early warnings of regime shifts: evaluation of spatial indicators from a whole-ecosystem experiment" http://t.co/sI7mbzv4AY

    05:04 2014/09/02
  • @emgbotany Others have also pointed that out for some time, but it's nice to see this review draw more attention to the issue [2]/[2]

    04:22 2014/09/01
  • @emgbotany Yup, I think spatial autocorrelation is routinely mistaken for environmental niche matching and too often ignored [1]/[2]

    04:21 2014/09/01
  • Mistaking geography for biology: inferring processes from species distributions: Trends in Ecology & Evolution http://t.co/7uXR4hca5X

    04:14 2014/08/30
  • RT @fperezorg: Git That Data: Berkeley students recount their experience in reproducible research #caldatasci @aculich @inundata http://t…

    08:47 2014/08/28

Reading

  • Uncertainty, learning, and the optimal management of wildlife: Environmental and Ecological Statistics (2001). Volume: 8, Issue: 3. Pages: 269-288. Byron K. Williams et al. 08:57 2014/08/06
  • Public goods in relation to competition, cooperation, and spite: Proceedings of the National Academy of Sciences (2014). S. A. Levin et al. 08:57 2014/08/06
  • Effects of Recent Environmental Change on Accuracy of Inferences of Extinction Status: Conservation Biology (2014). Volume: 28, Issue: 4. Pages: 971-981. Christopher F. Clements, Ben Collen, Tim M. Blackburn, Owen L. Petchey et al. 08:57 2014/08/06
  • Linking Indices for Biodiversity Monitoring to Extinction Risk Theory: Conservation Biology (2014). Pages: 1-9. Michael A. McCarthy, Alana L. Moore, Jochen Krauss, John W. Morgan, Christopher F. Clements et al. 08:57 2014/08/06

Entries

Docker tricks of the trade and best practices thoughts

29 Aug 2014

Best practices questions

Here are some tricks that may or may not be in keeping with best practices, input would be appreciated.

  • Keep images small: use the --no-install-recommends option for apt-get, and install true dependencies rather than big metapackages (like texlive-full).
  • Avoid creating additional AUFS layers by combining RUN commands, etc.? (The limit was once 42, but is now at least 127.) A sketch combining this and the previous point follows this list.
  • One can use RUN git clone ... to add data to a container in place of ADD, which invalidates caching.

  • Use automated builds linked to Github-based Dockerfiles rather than pushing local image builds. Not only does this make the Dockerfile transparently available and provide a link to the repository where one can file issues, but it also helps ensure that the image available on the hub gets its base image (FROM entry) from the hub instead of whatever was available locally. This can help avoid various out-of-sync errors that might otherwise emerge.
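As a concrete sketch of the first two points above, the apt-get pattern below is the sort of thing one would chain inside a single Dockerfile RUN instruction (shown here as plain shell; the package names are just placeholders):

# chain commands so they produce a single AUFS layer, skip recommended
# packages, and clear the apt cache so it never ends up in the image
apt-get update && \
  apt-get install -y --no-install-recommends \
    texlive-latex-base texlive-latex-recommended pandoc && \
  apt-get clean && \
  rm -rf /var/lib/apt/lists/*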

Docker’s use of tags

Unfortunately, Docker seems to use the term tag to refer both to the label applied to an image (e.g. in docker build -t imagelabel . the -t argument “tags” the image as ‘imagelabel’ so we need not remember its hash) and to the string appended to an image name after a colon, e.g. latest in ubuntu:latest. The latter is the sense of “tags” listed under the “tags” tab on the Docker Hub. Best practices for this kind of tag (which I’ll arbitrarily call a ‘version tag’ to distinguish it) are unclear.
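To make the two senses concrete (the repository and tag names here are purely illustrative):

# sense 1: -t "tags" (names) the image we build, so we need not remember its hash
docker build -t cboettig/labnotebook .

# sense 2: the string after the colon is the Hub-style "version tag"
docker pull ubuntu:14.04
docker tag cboettig/labnotebook cboettig/labnotebook:devel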

One case that is clear is tagging specific versions. Docker’s automated builds let a user link a “version tag” either to a branch or to a tag in the git history. A “branch” in this case can refer either to a different git branch or merely to a different sub-directory. Matching to a git tag provides the most clear-cut use of the docker version-tag, providing a relatively static, version-stable link. (I say “relatively” static because even if we do not change the Dockerfile, re-building it may produce a new image due to the presence of newer versions of the software it installs. This can be good with respect to fixing security vulnerabilities, but may also break a previously valid environment.)

The use case that is less clear is the practice of using these “version tags” to indicate other differences between related images, such as eddelbuettel/docker-ubuntu-r:add-r and eddelbuettel/docker-ubuntu-r:add-r-devel. Why these are different tags instead of different repositories is unclear, unless it is for the convenience of keeping multiple Dockerfiles in a single Github repository. Still, it is perfectly possible to configure completely separate Docker Hub automated builds pointing at the same Github repo, rather than adding additional builds as tags in the same Docker Hub repository.

Docker’s terminology borrows from git, but it’s rather dangerous to interpret the parallels too literally.

Keeping a clean docker environment

  • Run interactive containers with the --rm flag to avoid having to remove them later.
  • Clean up un-tagged docker images (a combined helper sketch follows this list):
docker rmi $(docker images -q --filter "dangling=true")
  • Stop and remove all containers:
docker rm -f $(docker ps -a -q)
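A combined helper along these lines saves retyping both commands (a convenience sketch only; the function name is arbitrary):

# remove all containers, then any image layers left dangling
docker_cleanup() {
  docker rm -f $(docker ps -a -q)
  docker rmi $(docker images -q --filter "dangling=true")
}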

Docker and Continuous Integration

  • We can install but cannot run Docker on Travis-CI at this time. It appears the linux kernel available there is much too old. Maybe when they upgrade to Ubuntu 14.04 images…

  • We cannot run Docker on the docker-based Shippable-CI (at least without a vagrant/virtualbox layer in-between). Docker on Docker is not possible (see below).

  • For the same reason, we cannot run Docker on drone.io CI. However, Drone provides an open-source version of its system that can be run on your own server, which, unlike the fully hosted offering, permits custom images. Unfortunately I have not been able to get it working at this time.

Docker inside docker

We cannot directly install docker inside a docker container. We can get around this by adding a complete virtualization layer – e.g. docker running in vagrant/virtualbox running in docker.

Alternatively, we can be somewhat more clever and tell our docker to simply use a different volume to store its AUFS layers. Matt Gruter has a very clever example of this, which can be used, e.g. to run a Drone server (which runs docker) inside a Docker container (mattgruter/drone).

I believe this only works if we run the outer docker container with --privileged permissions; e.g. we cannot use this approach on a service like Shippable that drops us into a prebuilt docker container.
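For instance, something along these lines should start the Drone image with the privileges the nested docker daemon needs (port mapping and Drone-specific configuration are omitted here; consult the image’s documentation):

# --privileged lets the docker daemon inside the container manage its own storage;
# without it the nested daemon will not start
docker run --privileged -d --name drone mattgruter/drone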

Read more



Pdg Controlfest Notes

14 Aug 2014

Just wanted to give a quick update on stuff relevant to our adjustment costs paper from the events of this week.

I think the talk on Tuesday went all right (though thanks to a technology snafu in going from reveal.js to pdf, my most useful figure, the one actually showing the bluefin tuna, didn’t display – I tried not to let on). I tried to keep the focus pretty big-picture throughout (we ignore these costs when we model, yet they matter) and to avoid being too bold / prescriptive (e.g. not suggesting we found the ‘right’ way to model these costs). I also could not stop myself from calling the adjustment cost models L1, L2, L3 instead of “linear”, “quadratic” and “fixed”, or _1,2,3. Whoops.

One question asked about asymmetric costs. You may recall we started off exploring this but ran into some unexpected results where the policies just looked like the cost-free case, possibly due to problems with the code. We should probably at least say this is an area for further study.

Another question asked about just restricting the period of adjustment, say, once every 5 years or so. I answered that we hoped to see what cost structures “induced” that behavior rather than enforcing it explicitly; but I should probably add some mention of this to the text as well.

I think the other questions were more straightforward, but I don’t remember any particulars.

The Monday meeting was very helpful for me in framing the kind of big questions around the paper:

  1. Can we make this story about more than TAC-managed fisheries? My ideal paper would be something people could cite to show that simply using profit functions with diminishing returns is not a sufficient way to reflect this reality (it could even be the opposite, if reality is more like a transaction fee), and that this mistake can be large. But all our examples are in the fisheries context, so this may take some finesse. (Since we’re aiming for Eco Apps rather than, say, Can. J. Fisheries.)

  2. Emphasizing the “Pretty Darn Good” angle – thinking of the policies we derive with adjustment costs not as the “true optimum” but as a “Pretty Darn Good” policy that can be more robust to adjustment costs (provided you have the intuition to know whether those costs are more like a fixed transaction fee or some proportional cost). The last two figures help with this, since they show policies being applied under different cost regimes than the ones under which they were computed to be optimal.

  3. Need to figure out what to say about policies that can ‘self-adjust’, e.g. when you don’t have to change the law to respond to the fluctuations. (Jim pointed out that Salmon are the best/only case where you can actually manage by “escapement” since you get a complete population census from the annual runs).

  • Stripping down the complexity of the charts
  • Conversely, may need to show some examples of the fish stock dynamics (In search for simplicity I’ve focused almost all the graphs on harvest dynamics).
  • Calibrating and running the case of quadratic control term for comparison

As a bonus, I quickly ran the tipping point models, and it looks like these stay really close to the Reed solution – e.g. relative to the safer Beverton-Holt world, they are much happier to pay whatever the adjustment cost might be to stick with the optimum than to risk total collapse. Not sure, but maybe I should add this to the paper…

Read more



Docker Notes

14 Aug 2014

Ticking through a few more of the challenges I raised in my first post on docker; here I explore some of the issues about facilitating interaction with a docker container so that a user’s experience is more similar to working in their own environment and less like working on a remote terminal over ssh. While technically minor, these issues are probably the first stumbling blocks in making this a valid platform for new users.

Sharing a local directory

Launch a bash shell on the container that shares the current working directory of the host machine (from pwd) with the /host directory on the container (thanks to Dirk for this solution):

docker run -it -v $(pwd):/host cboettig/ropensci-docker /bin/bash

This allows a user to move files on and off the container, use a familiar editor and even handle things like git commits / pulls / pushes in their working directory as before. Then the code can be executed in the containerized environment which handles all the dependencies. From the terminal docker opens, we just cd /host where we find our working directory files, and can launch R and run the scripts. A rather clean way of maintaining the local development experience but containerizing execution.
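A typical session then looks something like this (purely illustrative):

# inside the container launched above
cd /host     # the host working directory mounted at /host
ls           # our scripts and data are all here
R            # launch R and run the analysis in the containerized environment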

In particular, this frees us from having to pass our git credentials etc. to the container, though it is not so useful if we want to interact with the container via the RStudio server instead of R running in the terminal. (More on getting around this below.)

Unfortunately, Mac and Windows users have to run Docker inside an already-virtualized environment such as that provided by boot2docker or vagrant. This means that only directories in the virtualized environment, not those on the native OS, can be shared in this way. While one could presumably keep a directory synced between this virtual environment and the native OS (standard in vagrant), this is a problem for the easier-to-use boot2docker at this time: (docker/issues/7249).

A Docker Desktop

Dirk brought this docker-desktop to my attention, which uses Xpra (in place of X11 forwarding) to provide a window with fluxbox running on Ubuntu along with common applications like libreoffice, firefox, and the rox file manager. Pretty clever, and it worked just fine for me, but it needs Xpra on the client machine and requires some extra steps (run the container, query it for passwords and ports, run ssh to connect, then run Xpra to launch the window). The result is reasonably responsive but still slower than virtualbox, and probably too slow for real work.

Base images?

The basic ubuntu:14.04 image seems like a good lightweight base (at 192 MB), but other images try to give more useful building blocks, like phusion/baseimage (423 MB). Their docker-bash script and other utilities provide some handy features for managing / debugging containers.

Other ways to share files?

Took a quick look at this Dockerfile for running dropbox, which works rather well (at least on a linux machine, since it requires local directory sharing). It could probably be done without explicit linking to local directories to facilitate moving files on and off the container. Of course one can always scp/rsync files on and off containers if ssh is set up, but that is unlikely to be a popular solution for students.

While we have rstudio server running nicely in a Docker container for local or cloud use, it’s still an issue getting Github ssh keys set up to be able to push changes to a repo. We can get around this by linking to our keys directory with the same -v option shown above. We still need a few more steps: setting the Git username and email, and running ssh-add for the key. Presumably we could do this with environment variables and some adjustment to the Dockerfile:

docker run -it -v /path/to/keys:/home/rstudio/.ssh/ -e "USERNAME=Carl Boettiger" -e "EMAIL=cboettig@example.org" cboettig/ropensci-docker

which would prevent storing these secure values on the image itself.
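For instance, a small startup script baked into the image (hypothetical; the variable names match the -e flags above and the key filename is assumed) could apply those values at run time:

#!/bin/bash
# configure git from environment variables supplied via `docker run -e ...`,
# so the credentials never live in the image itself
git config --global user.name  "$USERNAME"
git config --global user.email "$EMAIL"
# add the ssh key mounted at /home/rstudio/.ssh
eval "$(ssh-agent -s)"
ssh-add /home/rstudio/.ssh/id_rsa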

Read more



An appropriate amount of fun with docker?

08 Aug 2014

An update on my exploration with Docker. Title courtesy of Ted, with my hopes that this really does move us in a direction where we can spend less time thinking about the tools and computational environments. Not there yet, though.

I’ve gotten RStudio Server working in the ropensci-docker image (Issues/pull requests welcome!).

docker run -d -p 8787:8787 cboettig/ropensci-docker

will make an RStudio server instance available to you in your browser at localhost:8787. (Change the first number after the -p to serve it on a different port.) You can log in with the username:password pair rstudio:rstudio and have fun.

One thing I like about this is the ease with which I can now get an RStudio server up and running in the cloud (e.g. I took this for a sail on DigitalOcean.com today). This means that in a few minutes and for 1 penny you have a URL that you and any collaborators could use to interact with R using the familiar RStudio interface, already provisioned with your data and dependencies in place.
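Roughly, on a fresh Ubuntu 14.04 droplet the whole provisioning step is something like the following (a sketch; docker.io is the Ubuntu 14.04 package name for docker):

# on the cloud server, over ssh
sudo apt-get update && sudo apt-get install -y docker.io
sudo docker run -d -p 8787:8787 cboettig/ropensci-docker
# RStudio server is now available at http://<server-ip>:8787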


For me this is a pretty key development. It replaces a lot of command-line only interaction with probably the most familiar R environment out there, online or off. For more widespread use or teaching this probably needs to get simpler still. I’m still skeptical that this will make it out beyond the crazies, but I’m less skeptical than I was when starting this out.

The ropensci-docker image could no doubt be more modular (and better documented). I’d be curious to hear if anyone has had success or problems running docker on windows / mac platforms. Issues or pull requests on the repo would be welcome! https://github.com/ropensci/docker-ubuntu-r/blob/master/add-r-ropensci/Dockerfile (maybe the repo needs to be renamed from its original fork now too…)

Rich et al. highlighted several “remaining challenges” in their original post. Here’s my take on where those stand in the Docker framework, though I’d welcome other impressions:

  1. Dependencies could still be missed by incomplete documentation

I think this one is largely addressed, at least assuming a user loads the Docker image. I’m still concerned that later builds of the docker image could simply break the build (though earlier images may still be available). Does anyone know how to roll back to earlier images in docker?
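One partial workaround I can imagine (an untested assumption on my part, not an established docker workflow) is to keep explicitly date-tagged copies alongside latest, so that an older environment remains pullable after a rebuild:

# snapshot the current image under a dated version tag before the next rebuild
docker tag cboettig/rnexml cboettig/rnexml:2014-08-07
docker push cboettig/rnexml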

  2. The set of scripts for managing reproducibility are at least as complex as the analysis itself

I think a lot of that size is due to the lack of an R image for Travis and the need to install many common tools from scratch. Because docker is both modular and easily shared via the Docker Hub, it’s much easier to write a really small script that builds on existing images (as I show in cboettig/rnexml).

  3. Travis.org CI constraints: public/open github repository with analyses that run in under 50 minutes.

Docker has two advantages and also some weaknesses here: (1) it should be easy to run locally, while accomplishing much of the same thing as running on travis (though clearly that’s not as nice as running automatically & in the cloud on every push). (2) It’s easier to take advantage of caching – for instance, cboettig/rnexml provides the knitr cache files in the image so that a user can start exploring without waiting for all the data to download and code to run.

It seems that Travis CI doesn’t currently support docker, since the linux kernel they use is too old. (Presumably they’ll update one day. Has anyone tried Shippable CI, which supports docker?)

  4. The learning curve is still prohibitive

I think that’s still true. But what surprised me is that I’m not sure that it’s gotten any worse by adding docker than it was to begin with using Travis CI. Because the approach can be used both locally and for scaling up in the cloud, I think it offers some more immediate payoffs to users than learning a Github+CI approach does. (Notably it doesn’t require any git just to deploy something ‘reproducible’, though of course it works nicely with git.

Read more



Too Much Fun With Docker

07 Aug 2014

NOTE: This post was originally drafted as a set of questions to the revived ropensci-discuss list, hopefully readers might join the discussion from there.

I’ve been thinking about Docker and the discussion about reproducible research in the comments of Rich et al.’s recent post on the rOpenSci blog, where quite a few people mentioned the potential for Docker as a way to facilitate this.

I’ve only just started playing around with Docker, and though I’m quite impressed, I’m still rather skeptical that non-crazies would ever use it productively. Nevertheless, I’ve worked up some Dockerfiles to explore how one might use this approach to transparently document and manage a computational environment, and I was hoping to get some feedback from all of you.

For those of you who are already much more familiar with Docker than me (or are looking for an excuse to explore!), I’d love to get your feedback on some of the particulars. For everyone, I’d be curious what you think about the general concept.

So far I’ve created a dockerfile and image

If you have docker up and running, perhaps you can give it a test drive:

docker run -it cboettig/ropensci-docker /bin/bash

You should find R installed with some common packages. This image builds on Dirk Eddelbuettel’s R docker images and serves as a starting point to test individual R packages or projects.

For instance, my RNeXML manuscript draft is a bit more of a bear than usual to run, since it needs rJava (requires external libs), Sxslt (only available on Omegahat and requires extra libs) and the latest phytools (a tar.gz file from Liam’s website), along with the usual mess of a pandoc/latex environment to compile the manuscript itself. By building on ropensci-docker, we need a pretty minimal Dockerfile to compile this environment:

You can test drive it (docker image here):

docker run -it cboettig/rnexml /bin/bash

Once in bash, launch R and run rmarkdown::render("manuscript.Rmd"). This will recompile the manuscript from cache and leave you to interactively explore any of the R code shown.

Advantages / Goals

Being able to download a pre-compiled image means, first, that a user can run the code without dependency hell (often not as much an R problem as it is in Python, but nevertheless one that I hit frequently, particularly as my projects age), and second, without altering their personal R environment. Third, (in principle) this makes it easy to run the code on a cloud server, scaling the computing resources appropriately.

I think the real acid test for this is not merely that it recreates the results, but that others can build and extend on the work (with fewer rather than more barriers than usual). I believe most of that has nothing to do with this whole software image thing – providing the methods you use as general-purpose functions in an R package, or publishing the raw (& processed) data to Dryad with good documentation will always make work more modular and easier to re-use than cracking open someone’s virtual machine. But that is really a separate issue.

In this context, we look for an easy way to package up whatever a researcher or group is already doing into something portable and extensible. So, is this really portable and extensible?

Concerns:

  1. This presupposes someone can run docker on their OS – and from the command line at that. Perhaps that’s the biggest barrier to entry right now (though given docker’s virulent popularity, maybe that’s something smart people with big money will soon solve).

  2. The only way to interact with the thing is through a bash shell running on the container. An RStudio server might be much nicer, but I haven’t been able to get that running. Anyone know how to run RStudio server from docker?

(I tried & failed)

  1. I don’t see how users can move local files on and off the docker container. In some ways this is a great virtue – forcing all code to use fully resolved paths like pulling data from Dryad instead of their hard-drive, and pushing results to a (possibly private) online site to view them. But obviously a barrier to entry. Is there a better way to do this?

Alternative strategies

  1. Docker is just one of many ways to do this (particularly if you’re not concerned about maximum performance speed), and quite probably not the easiest. Our friends at Berkeley D-Lab opted for a GUI-driven virtual machine instead, built with Packer and run in Virtualbox, after their experience proved that students were much more comfortable with the mouse-driven installation and a pixel-identical environment to the instructor’s (see their excellent paper on this).

  2. Will/should researchers be willing to work and develop in virtual environments? In some cases, the virtual environment can be closely coupled to the native one – you use your own editors etc. to do all the writing, and then execute in the virtual environment (it seems this is easier in the docker/vagrant approach than in the BCE).

Read more



Notes

06 Aug 2014

Writing

Exploring

Reading

Fun piece in the Guardian from Digital Science manager Timo Hannay on the future of scientific publishing. I think (or choose to believe that) the thesis is at the end rather than in the title.

Also commented here on DocZen’s post

Read more



Notes

31 Jul 2014

Berkeley Collaborative Environment

Trying out the Berkeley image (essentially Ubuntu 14.04 XFCE with IPython and RStudio installed, but finely tuned to improve the user experience; e.g. solid colors for faster remote window connections).

I highly recommend their paper for the first explanation I’ve actually been able to follow that provides a definition of “DevOps” (essentially, using scripts rather than documentation to manage consistent cross-platform installation) and explains the differences and similarities between the various programs operating in this sphere, sometimes as alternatives and sometimes in combination.

Virtual machines.

Their approach focuses on running on top of a complete virtual machine, primarily considering Oracle’s virtualbox for local use and Amazon’s AMI for cloud use. (For an emerging alternative to the full virtual machine approach, they discuss Docker.)

Configuration management (CM) tools:

  • Ansible (“playbooks”): used by the BCE to specify the complete software environment. Compares to Chef (Ruby-based, “recipes”), Salt (Python-based, “states”), and Puppet (“manifests”).

Provisioning tools:

  • Packer: used at build time to create a machine image for the VM. Packer can use Ansible/Chef/Salt/Puppet files to do this. The result is a nice AMI for the Amazon web console or an image for the virtualbox GUI.

  • Vagrant: offers a different approach. Rather than the developer creating a VM image with Packer and an Ansible script that is ready to run on virtualbox or Amazon, the end user installs vagrant instead of virtualbox. Vagrant also handles the job of Packer, preparing an environment to run (on Vagrant’s virtual machine, rather than on virtualbox). (Note that, conversely, Packer can create a Vagrant virtual machine just as easily as it can create the Oracle virtualbox image or Amazon AMI.) Vagrant feels a lot more native, as the user works within their familiar OS tools for editing etc., while vagrant makes sure that the execution of the software happens in a controlled, identical environment (see the sketch after this list).

  • Docker: offers a more modular alternative to a full virtual machine, with performance that is more like running on ‘bare metal’, since it shares the kernel of the native OS; at the moment this requires a linux kernel. Docker is typically deployed using Vagrant (though stand-alone setups are emerging).
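For concreteness, the Vagrant workflow sketched in the list above boils down to something like this (the box name is just an example):

# the end user installs vagrant, then from the project directory:
vagrant init ubuntu/trusty64   # write a Vagrantfile pointing at a base box
vagrant up                     # download, provision and boot the VM
vagrant ssh                    # do the actual work inside the controlled environment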

At first glance, Vagrant sounds like a more elegant approach than having end users install Oracle’s virtualbox and then learn to live in a virtual window emulating the Ubuntu-XFCE desktop (particularly if those users are developers!). However, it seems that the BCE team found that Vagrant was harder for students to work with, since it required knowledge of the command line.

The paper also provides a fabulous case study of a major scientific software project using the “DevOps” philosophy but without any of these new and emerging tools, relying only on scripts, makefiles, and Linux distribution package managers.

Test drive impressions

Unfortunately, no luck getting virtualization running on my laptop (due to BIOS issues). Testing on the Ubuntu desktop required a newer version of virtualbox than what’s in the 12.04 repos, but otherwise it just worked. It’s straightforward to install additional needed software, and virtualbox gives the option of preserving the machine state on exit, presumably along with that software. It’s not clear how that should be managed without re-creating the problems that using a consistent image set out to avoid in the first place; perhaps it requires re-provisioning a divergent image?

Nice experience testing out the Amazon machine image, but I think the workflow would be improved if it provided RStudio server for a more interactive interface.

Misc tasks

  • Working on reveal.js slides

  • RNeXML manuscript edits from Francois’s comments

Thoughts on namespaces (in R)

Chatting with Scott about namespace practices, thought I’d put some of this down.

I go back and forth on having a small namespace (particularly vs. convenience functions). It is certainly easier to maintain, but from the user’s perspective I think it’s pretty easy to just ignore extra functions, as long as it’s well documented which functions they need to know to get started.

I guess if a user needs to know too many functions to do anything, then it becomes hard to keep track of them. That’s partly why I started wrapping exposed functions. For instance, you can just use nexml_write all the time and never use add_characters, add_trees, etc., since nexml_write takes additional trees and characters as arguments.

But I like function calls to be as semantic as possible so the code is more self documenting. Sometimes add_characters(nex) is more self explanatory than nexml_write(nex, characters = characters).

The XML package really got me started on this. There’s usually something like 4 or 5 ways to do the same thing. Sometimes that’s really annoying, but sometimes it helps write more transparent or less verbose code (e.g. adding child nodes with addChildren vs. passing them as the .children argument to newXMLNode – the former tends to be more semantic, the latter often more concise).

Code tricks: vim pandoc

Vim pandoc syntax highlighting

  • Recognize .Rmd as a pandoc-syntax file extension. In .vimrc do:
au BufRead,BufNewFile *.Rmd set filetype=pandoc
  • Enable syntax highlighting inside code blocks, by language: in a vim session, do :PandocHighlight r. This then enables syntax highlighting, folding, etc. Unfortunately it doesn’t recognize the default .Rmd chunk format used by RStudio.

If I recall correctly, knitr’s default markdown syntax,

```{r}

was intended to be a valid markdown syntax. It seems Github Flavored markdown is happy to recognize this as markdown syntax for a code block in the R language, but while pandoc recognizes this as a code block, it does not recognize the language.

I believe Pandoc does recognize the format

```{.r options}

as the notation to specify the R language in this notation. I like this notation because it means that pandoc-aware syntax highlighters will highlight my code chunks as R code, which does not happen in the default syntax.

I realize I could define this as an input hook for knitr, but am reluctant to do so as it makes my code slightly less familiar/less portable to other users (e.g. RStudio expects and integrates with only the standard notation.)

Read more



Notes

21 Jul 2014

Reading

Looking for this in my notes and couldn’t find it: An excellent paper from the Software Sustainability Institute and friends outlining the need and possible structure for sustaining career paths for software developers. “The research software engineer”, Dirk Gorissen. Provides a good response to the software issues highlighted by climategate, etc.

Remote conferencing

(Based on earlier unposted notes.) With so much going on, it’s nice to be able to follow highlights from some conferences remotely.

Misc code-tricks

For a question raised during the Mozilla sprint: had to remember how to write custom hooks for knitr (e.g. for kramdown compatibility):

# wrap non-source output (output, warnings, errors, messages) in kramdown ~~~ fences
hook.t <- function(x, options)
  paste0("\n\n~~~\n", paste0(x, collapse = "\n"), "~~~\n\n")

# wrap source code in ~~~ fences labelled with the chunk's language
hook.r <- function(x, options) {
  paste0("\n\n~~~ ", tolower(options$engine), "\n", paste0(x, collapse = "\n"), "\n~~~\n\n")
}

knitr::knit_hooks$set(source = hook.r, output = hook.t, warning = hook.t,
                      error = hook.t, message = hook.t)
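A hypothetical invocation, assuming the hooks above are saved in a setup script that gets sourced before knitting (both file names are illustrative):

# knit a post so that its output appears in kramdown-style ~~~ fenced blocks
Rscript -e 'source("knit_hooks.R"); knitr::knit("2014-07-21-notes.Rmd")'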

Read more



UPS and data vs optimal control

21 Jul 2014

Random idea for possible further exploration:

The use of ‘big data’ by UPS to perform lots of small efficiency gains seems to be everybody’s favorite example (NPR, The Economist). During a typical applications of optimal control for ecological conservation talk yesterday I couldn’t help thinking back to that story. The paradigm shift is not so much the kind or amount of the data being used as it is the control levers themselves. As the Economist (rightly) argues, everyone typically assumes that a few principle actions are responsible for 80% of possible improvement.

Optimal control tends to focus on these big things, which are also usually particularly thorny optimizations. Most of the classic textbook hard optimization problems could have come right from the UPS case: the traveling salesman, the inventory packing/set cover problems, and so forth. Since these are impossible to solve exactly on large networks, approximate dynamic programming approaches have been the work-around. Yet the “Big Data” approach takes a rather different strategy altogether, tackling many small problems instead of one big one. Our typical approach of theoretical abstraction to simple models is designed to focus on these big overarching problems. In abstracting the problem, we focus on the big-picture stuff that should matter most – stuff like figuring out the optimal route to travel, and so forth. But when the gains from further optimizing these things are marginal, focusing on the “other 20%” can make more sense. However, that means abandoning the abstraction and going back to the original messy problem. It means knowing about all the other little levers and switches we can control. In the UPS context, this means thinking about how many times a truck backs up, or idles at a stop light, or what hand the deliveryman holds the pen in. Given both the data and the ability to control so many of these little things, optimizing each one can be more valuable than focusing on the big abstract optimizations.

So, does this work only once the heuristic solutions to the big problems are nearly optimal, so improved approximations have very limited gains? Or can this also be a route forward when the big problems are primarily intractable as well? The former certainly seems the more likely, but if the latter is true, it could prove very interesting.

So this got me thinking – if we accept the latter premise, we find a case closely analogous to the very messy optimizations we face in conservation decision-making. Could the many little levers be an alternative? It’s unlikely, given both the need for the kind of arbitrarily detailed, almost-free data available to the UPS problem, and the kind of totalitarian control UPS can apply to all the little levers, while the conservation problem more frequently has nothing but a scrawny blunt stick to toggle in the first place. Nevertheless, it’s hard to know what possible gains we have already excluded when we focus only on the big abstractions and the controls relevant to them. Could conservation decision-making think more outside the box about the many little things we might be able to more effectively influence?

Read more