Brian O'Meara discusses new algorithmic approaches on Phylo

Brian gave an excellent overview of Approximate Bayesian Computation (ABC) and described the TreEvo software he is developing with post-doc Barb Banbury. My notes from the seminar, with my own comments in italics and a few added references.

Intro / Motivation

  • The rate of model innovation is rather slow

  • Some quantities can be impossible to specify in closed form (e.g., there is no closed-form expression for the median of a binomial)

  • Perhaps it would be faster if we didn’t have to wait for someone to come up with all the likelihood functions

Reviews the ABC method:

  • Simulate

  • Define a summary statistic

  • Define a “success”: a distance metric on the summary statistic, used for acceptance-rejection sampling (a rough sketch follows this list)
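
My addition, not from the talk: a minimal sketch of the rejection-sampling loop just outlined, using a toy normal-mean problem in place of TreEvo's tree simulations (all names and numbers here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "observed" data; in TreEvo the simulator would be trait evolution on a tree.
observed = rng.normal(loc=2.0, scale=1.0, size=100)

def summary(x):
    # Summary statistic: here simply the sample mean.
    return np.mean(x)

def simulate(theta, size=100):
    # Forward simulator: draws a data set given a candidate parameter value.
    return rng.normal(loc=theta, scale=1.0, size=size)

obs_stat = summary(observed)
tolerance = 0.1    # defines "success" on the distance metric
accepted = []

for _ in range(20000):
    theta = rng.uniform(-10, 10)              # 1. draw from the prior
    sim_stat = summary(simulate(theta))       # 2. simulate and summarize
    if abs(sim_stat - obs_stat) < tolerance:  # 3. accept if close enough
        accepted.append(theta)

print(f"accepted {len(accepted)} draws; posterior mean ~ {np.mean(accepted):.2f}")
```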

Sufficient statistics

  • (examples from a coin flip: the actual string of heads/tails and the proportion of heads are sufficient; the previous toss alone is not; see the note after this list)

  • Problems: what if there are no sufficient statistics?

  • What if sufficiency hasn’t been proven?
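
A quick aside of mine (not from the talk) on why the coin-flip claims hold: by the factorization theorem, the likelihood of $n$ tosses $x_1, \dots, x_n$ with bias $p$ is

$$L(p; x_1, \dots, x_n) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum_i x_i}\,(1-p)^{n-\sum_i x_i},$$

which depends on the data only through the number of heads. So the full sequence and the proportion of heads (with $n$ known) are both sufficient, while the previous toss alone is not, since it does not determine that count.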

O’Meara’s approach: take 22 summary stats and prune the list (all the kinds of model parameters geiger will give you: lambda, kappa, etc.). No way that lambda/kappa/etc. are decent summary stats; some of these will even have identifiability issues on certain topologies. What about correlation among summary stats?
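
I don't know how the 22-statistic list is actually pruned in TreEvo, but one generic way to act on the correlation worry is to drop statistics that are nearly collinear with ones already kept, judged on simulated data. A toy sketch; the function and names are hypothetical, not TreEvo's API.

```python
import numpy as np

def prune_correlated(stat_matrix, names, threshold=0.95):
    """Keep only summary statistics not highly correlated with an earlier one.

    stat_matrix: (n_simulations, n_stats) array of summary statistics
    computed on simulated data sets; names: one label per column.
    """
    corr = np.abs(np.corrcoef(stat_matrix, rowvar=False))
    kept = []
    for j in range(stat_matrix.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Fake simulated statistics: the second column is nearly a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
sims = np.column_stack([a, 2 * a + rng.normal(scale=0.01, size=500),
                        rng.normal(size=500)])
print(prune_correlated(sims, ["stat1", "stat2", "stat3"]))  # ['stat1', 'stat3']
```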

Following Wegmann et al. (guessing the citation for this is Wegmann et al. 2009), a Box-Cox transformation is applied ahead of time. Not clear to me what the justification for this is. Transformations are evolutionary hypotheses; what is the biology?
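
For concreteness (my own aside, not TreEvo code): a Box-Cox transform power-transforms a strictly positive quantity toward approximate normality, with the power chosen by maximum likelihood. Assuming, as in Wegmann et al. 2009, that it is applied to each summary statistic across simulations, a minimal illustration with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# A skewed, strictly positive "summary statistic" collected across many
# simulated data sets (purely illustrative values).
raw_stat = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox picks the power parameter lambda by maximum likelihood and returns
# the transformed values along with it; lambda near 0 means the transform is
# essentially a log.
transformed, lmbda = stats.boxcox(raw_stat)
print(f"estimated Box-Cox lambda: {lmbda:.2f}")
```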

Describes Sequential Monte Carlo: start by sampling from the prior with a liberal acceptance threshold, then update the posterior estimate, treat it as the new prior, and repeat.
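
Again my own sketch rather than TreEvo's implementation: a bare-bones version of the sequential idea, starting from the prior under a loose tolerance, then resampling and perturbing the accepted particles under progressively tighter tolerances (a full ABC-SMC would also carry importance weights, omitted here). The toy target is the same normal-mean problem as above.

```python
import numpy as np

rng = np.random.default_rng(3)
observed = rng.normal(loc=2.0, scale=1.0, size=100)
obs_stat = np.mean(observed)

def simulate_stat(theta):
    # Forward-simulate a data set under theta and summarize it.
    return np.mean(rng.normal(loc=theta, scale=1.0, size=100))

n_particles = 500
tolerances = [1.0, 0.5, 0.25, 0.1]                   # progressively stricter "success"
particles = rng.uniform(-10, 10, size=n_particles)   # round 0: draws from the prior

for tol in tolerances:
    accepted = []
    while len(accepted) < n_particles:
        # Treat the previous round's particles as the new prior:
        # resample one and perturb it slightly before simulating.
        theta = rng.choice(particles) + rng.normal(scale=0.2)
        if abs(simulate_stat(theta) - obs_stat) < tol:
            accepted.append(theta)
    particles = np.array(accepted)
    print(f"tolerance {tol}: posterior mean ~ {particles.mean():.2f}")
```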

  • New Models: Bounded models, discrete models such as autoregressive, character displacement models, etc.

  • Priors: uses a library of a few common distributions.  Discusses influence of different priors, priors with zero probability ranges.

  • Examples: character displacement (non-independent evolution on branches) & the bounded case

  • Super concerned about identifiability issues, let alone sufficient statistics.

  • Some results look close to the prior.

O’Meara on parameter estimation vs model selection:  “Tells you more about your power than your biology.”  Hmm…

Challenges

Mention of Robert et al. (Robert et al. 2010) and a preprint; concerns about sufficient statistics.

  • Discusses examples of ways to fail: with too little data, you simply get the prior back

  • Overconfidence: run the analysis several times; each run returns a tight but mostly non-overlapping distribution for the same parameter.

Features

  • Auto-tuning: runs enough generations and then stops.

  • Plotting results

  • Model menagerie

  • Multiple-character models coming soon

In progress:

  • Check-pointing (restarting partial runs; needed so that large cluster runs can recover from killed or crashed jobs)

  • Speed optimization

  • Testing, testing, testing

Advantages/Disadvantages

  • No model adequacy assessment yet

  • The usual pluses and minuses of Bayesian methods (priors, computational time, etc.)

More resources

  • Brian recommends the Csilléry et al. TREE review of ABC (Csilléry et al. 2010)

  • The project is a Google Summer of Code option

  • Brian mentions post-doc and other opportunities.

  • No release candidate for the software yet, but Brian welcomes exploring the actively developed code on R-forge: TreEvo

Watch the talk

References