A few large data problems in Ecology & Evolution

I’ve joined a student-led seminar with Gabriel Becker in the Statistics department on dealing with large data, primarily in R. I offered to spend a few minutes describing large data problems in my field, so I’m making some notes.

Large Databases

  • GenBank: $> 10^8$ sequences, a resource for phylogenetic inference

  • Morphological databases: Specialized databases with semantic tools (Dahdul et al. 2010)

  • NEON observatory data (Keller et al. 2008)

  • Satellite data (Overpeck et al. 2011)

  • Micro-tracker data

Distributed Data

  • Opportunity: The amount of available data is rapidly increasing under requirements of journals and funders (Whitlock, 2011; Reichman et al. 2011; Moore et al. 2010).

  • Citizen science (Silvertown, 2009)

  • Challenges: Decentralized, non-standard data formats.

  • Goal: From meta-analyses to synthetic analyses

Large Computation

  • Simulation studies – replicates and large sample-spaces

  • Model estimation – Bayesian estimation

  • Computing on large datasets – too large to download

  • Predictive modeling – real-time forecasting
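To make the simulation-replicates point concrete, here is a minimal sketch in Python of why replicates scale so easily: each one is independent, so they can be farmed out across processes. The random-walk model and the names `simulate_replicate` and `run_replicates` are hypothetical stand-ins for illustration, not taken from any particular study; the same pattern would apply in R.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor


def simulate_replicate(seed, n_steps=1000):
    """One replicate of a toy model (hypothetical): an unbiased +/-1 random walk."""
    rng = random.Random(seed)  # each replicate gets its own reproducible stream
    position = 0
    for _ in range(n_steps):
        position += rng.choice((-1, 1))
    return position


def run_replicates(n_replicates, n_steps=1000, workers=None):
    """Run replicates serially (workers=None) or spread across processes."""
    seeds = range(n_replicates)
    if workers is None:
        return [simulate_replicate(s, n_steps) for s in seeds]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Replicates share no state, so mapping over seeds is all that is needed.
        return list(pool.map(simulate_replicate, seeds, [n_steps] * n_replicates))


if __name__ == "__main__":
    finals = run_replicates(200, n_steps=1000, workers=4)
    print(statistics.mean(finals), statistics.pstdev(finals))
```

Seeding each replicate explicitly keeps the results reproducible whether the replicates run serially or in parallel, which is the property that makes large replicate counts tractable on a cluster.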

References