TreeBASE: exploratory exercises in database-driven phylogenetics

Taking a look at the TreeBASE database for a potential project with Gabe & Duncan’s Large Data seminar. It would be useful to get a working knowledge of the database and the API, since that would make large-scale meta-analyses easy. Many studies seem to select taxa based on what they have in their own lab rather than re-use existing phylogenies, not because those taxa are better suited to answering the question but because they are more familiar. Access to larger data sets than are typically available in any individual lab, and the ability to test hypotheses across multiple phylogenies, could be rather useful.

A nice tutorial on what can be done with the existing data could be quite useful; my cursory search for such a tutorial didn’t turn up much. The ability to grab and re-analyze any set of phylogenetic trees through a pipeline in R would be very useful.

Overview of capabilities

In addition to the web interface, TreeBASE can be accessed programmatically through a stateless web service interface and URL architecture. This interface can deliver data in several different formats, including NEWICK, NEXUS, JSON, NeXML.

A PhyloWS RESTful API is available. A detailed description of TreeBASE’s PhyloWS implementation is on the TreeBASE wiki.
An OAI-PMH metadata harvesting interface is available, though still under development. A detailed description is on the TreeBASE wiki.
SQL data dumps – coming soon.
TreeBASE claims to have character data as well as phylogenies, but there is no obvious way to query for the character data. It also has sequence data.
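As a first step toward scripting against the PhyloWS interface, here is a minimal sketch of how request URLs might be assembled. It is written in Python for illustration, though the eventual goal is an R pipeline. The base URL, the `find?query=` pattern, the `TB2:` id prefix, the `dcterms.contributor` field name and the format strings are all assumptions to be checked against the TreeBASE wiki, and the helper names `build_search_url` and `build_study_url` are hypothetical. Nothing here touches the network; it only builds strings.

```python
from urllib.parse import quote

# ASSUMPTION: base URL of the PhyloWS service; verify against the TreeBASE wiki.
PHYLOWS_BASE = "http://purl.org/phylo/treebase/phylows"


def build_search_url(resource, field, value, fmt="rss1"):
    """Build a search URL for a resource type such as 'study' or 'tree'.

    Hypothetical helper: the 'find?query=field=value&format=...' layout
    is an assumed pattern, not a confirmed specification.
    """
    # Percent-encode the query but keep '=' readable between field and value.
    query = quote(f"{field}={value}", safe="=")
    return f"{PHYLOWS_BASE}/{resource}/find?query={query}&format={fmt}"


def build_study_url(study_id, fmt="nexus"):
    """Build a URL for one study by id (e.g. 'S1925').

    ASSUMPTION: version 2 ids are addressed with a 'TB2:' prefix.
    """
    return f"{PHYLOWS_BASE}/study/TB2:{study_id}?format={fmt}"


if __name__ == "__main__":
    print(build_search_url("study", "dcterms.contributor", "Huelsenbeck"))
    print(build_study_url("S1925"))
```

Keeping URL construction separate from downloading makes it easy to inspect the requests by hand in a browser before committing to a particular query syntax.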

Potential exercises

Would be good to begin with a statistical summary of TreeBASE: number of data sets, average size, taxa, coverage, frequency with which various methods were used, fraction that are time-calibrated, fraction that are species trees vs gene trees, number of sequences used, etc.
No idea how hard it would be to establish an R implementation/interface to the API, but could be quite useful, particularly for comparative phylogenetics.
Would be useful to have a pipeline that could grab the phylogenies from within R given search terms or the study id, along with appropriate meta-data (publication information, taxa, etc).
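To make the statistical-summary exercise concrete, here is a rough sketch of how tree sizes could be tabulated once a set of trees has been retrieved as NEWICK strings. It is in Python for illustration only, and the function names `count_tips` and `summarize_tree_sizes` are hypothetical. The tip counter is a deliberately crude assumption: it counts every position after `(` or `,` that does not open a new clade, so it will miscount quoted labels containing commas or parentheses and returns 0 for a single-node tree. The example trees are made up, not taken from TreeBASE.

```python
import re
from statistics import mean, median


def count_tips(newick):
    """Count the tips (leaves) in a NEWICK string.

    Crude heuristic (an assumption, not a full parser): a tip starts
    right after '(' or ',' whenever the next non-space character is
    not another '('.
    """
    return len(re.findall(r"[(,](?!\s*\()", newick))


def summarize_tree_sizes(trees):
    """Return simple size statistics for a list of NEWICK strings."""
    sizes = [count_tips(t) for t in trees]
    return {
        "n_trees": len(sizes),
        "mean_tips": mean(sizes),
        "median_tips": median(sizes),
        "max_tips": max(sizes),
    }


if __name__ == "__main__":
    # Toy trees for illustration only.
    toy_trees = [
        "((A,B),C);",
        "(A:1,B:1);",
        "((A,B),(C,(D,E)));",
    ]
    print(summarize_tree_sizes(toy_trees))
```

A real pipeline would hand the parsing to an established phylogenetics library and add the other fields listed above (methods used, time calibration, gene vs species trees) from the study metadata rather than from the tree strings.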

Accompanying literature database:

There’s a rather nice database of all 2604 published papers whose phylogenetic trees have entries in TreeBASE, cataloged in the Mendeley group.
Each has tags to the entries in TreeBASE. Note that all objects have been given new ids in the current (version 2) TreeBASE, so searches should use those. Make sure to discard the current search results (button) before searching again.
The Mendeley collection can be searched; if I imported all the papers, I could also do full-text search of the articles.
Potential to interface with other databases, including Wikipedia (see next section, below).