Notes On Leveraging The Ecological Metadata Language

This post creates a new tag, eml, for posts discussing the Ecological Metadata Language (EML) or the implementation of an R parser for it (reml).

Why EML?

Jones et al. (2005) give an excellent explanation of the relative advantages of the descriptive metadata approach over vertically integrated repositories (such as Genbank, TreeBase, or many of the other repositories that have been the focus of rOpenSci development to date).

In short, the EML approach allows archiving of very heterogeneous data without having to standardize everything into a narrow and pre-defined syntax.

Providers of EML

As far as I can tell, the Long-Term Ecological Research (LTER) program sites are responsible for almost all available EML files. XML files in specified schemas must meet strict criteria and are thus best generated by software.

Use cases of EML

These sites seem to do a better job of generating this data than of making any use of it; a review of the project mentions “a common metadata” standard in passing and without citation. While it provides nice examples of the breadth of ecological data as published by a range of researchers and journals, it offers few obvious examples of synthesis. I just reread the Jones et al. review to confirm that it provides no examples of papers involving data synthesis based on EML (instead citing a single example of manual data synthesis to illustrate the challenges involved).

Generating EML

Metacat, a GPL program written in Java by EML authors Matt and co, describes itself as a database that uses EML to store data. It might be a useful tool for generating EML, but looks pretty intimidating to me. It probably provides the “Web Software Interface” that generates rather minimal EML.

Morpho is aimed at documenting species trait data, and looks like a rather useful if tedious tool for generating EML files. Unfortunately, without the ability to script inputs or automatically detect existing data structures, we are forced through the rather arduous process of adding all metadata annotation each time.

What can a package offer?

Translating xpath commands into R function wrappers probably provides little utility. Do we really need a get_authors function in place of getNodeSet(doc, "//creator")? Of course we might want to convert the returned nodes into character strings or R person objects. A package could also extract larger chunks of metadata into a single R object or a text-based summary (e.g. markdown or yaml).
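As a rough sketch of that conversion step, assuming the XML package (which provides getNodeSet) is installed: the fragment below is hand-written and only imitates the creator/individualName layout, so it is not a valid EML file, and both the body of get_authors and the people in it are made up for illustration.

```r
library(XML)

# A tiny hand-written fragment (an assumption for illustration, not real EML)
eml_text <- '<eml>
  <dataset>
    <title>Example dataset</title>
    <creator>
      <individualName><givenName>Ada</givenName><surName>Example</surName></individualName>
    </creator>
    <creator>
      <individualName><givenName>Ben</givenName><surName>Sample</surName></individualName>
    </creator>
  </dataset>
</eml>'

doc <- xmlParse(eml_text, asText = TRUE)

# Hypothetical helper: turn each <creator> node into a utils::person object
get_authors <- function(doc) {
  creators <- getNodeSet(doc, "//creator")
  people <- lapply(creators, function(node) {
    # "./" keeps each query relative to the current creator node
    given  <- xpathSApply(node, "./individualName/givenName", xmlValue)
    family <- xpathSApply(node, "./individualName/surName", xmlValue)
    person(given = given, family = family)
  })
  do.call(c, people)  # combine into a single person vector
}

authors <- get_authors(doc)
print(authors)
```

The wrapper only earns its keep through the conversion to person objects; the node lookup itself is still the one-line getNodeSet call.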

A package could also provide utilities to generate EML from R objects, leveraging the metadata implicit in R objects that is not present in a CSV (in which there is no built-in notion of whether a column is numeric or a character string, what missing value characters it uses, or really whether it is consistent at all). Avoiding manual specification of these things makes the metadata annotation less tedious as well.
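To make that concrete, here is a minimal base-R sketch of what can be read off a data.frame without asking the user anything. The describe_columns helper, its column names, and the sample data are all hypothetical, and the result is a plain summary table rather than EML; serializing it into the schema would be a separate step.

```r
# Hypothetical helper: summarise the metadata R already holds about each column
describe_columns <- function(df) {
  data.frame(
    attributeName = names(df),
    # first class of each column, e.g. "factor", "integer", "numeric"
    r_class = vapply(df, function(col) class(col)[1], character(1),
                     USE.NAMES = FALSE),
    # number of missing values, which R tracks explicitly as NA
    n_missing = vapply(df, function(col) sum(is.na(col)), integer(1),
                       USE.NAMES = FALSE),
    # permitted values for categorical columns, NA otherwise
    levels = vapply(df, function(col) {
      if (is.factor(col)) paste(levels(col), collapse = ", ") else NA_character_
    }, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up example data
dat <- data.frame(
  site   = factor(c("A", "B", "A")),
  count  = c(12L, NA, 7L),
  temp_c = c(14.2, 15.1, NA)
)

meta <- describe_columns(dat)
print(meta)
```

Units and column definitions would still have to come from the researcher, but types, missing values, and factor levels come for free.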

Of course such ideas are just the beginning. Ideally a package could help tackle the grand challenge of heterogeneous data integration, though this might have to wait for richer semantics than are natively found in EML.

References

  • Eric H. Fegraus, Sandy Andelman, Matthew B. Jones, Mark Schildhauer, (2005) Maximizing The Value of Ecological Data With Structured Metadata: an Introduction to Ecological Metadata Language (EML) And Principles For Metadata Creation. Bulletin of The Ecological Society of America 86 158-168 10.1890/0012-9623(2005)86[158:MTVOED]2.0.CO;2
  • M.B. Jones, C. Berkley, J. Bojilova, M. Schildhauer, (2001) Managing Scientific Metadata. IEEE Internet Computing 5 59-68 10.1109/4236.957896
  • Matthew B. Jones, Mark P. Schildhauer, O.J. Reichman, Shawn Bowers, (2006) The New Bioinformatics: Integrating Ecological Data From The Gene to The Biosphere. Annual Review of Ecology, Evolution, And Systematics 37 519-544 10.1146/annurev.ecolsys.37.091305.110031
  • Joshua S. Madin, Shawn Bowers, Mark P. Schildhauer, Matthew B. Jones, (2008) Advancing Ecological Research With Ontologies. Trends in Ecology & Evolution 23 159-168 10.1016/j.tree.2007.11.007
  • William K. Michener, Matthew B. Jones, (2012) Ecoinformatics: Supporting Ecology as A Data-Intensive Science. Trends in Ecology & Evolution 27 85-93 10.1016/j.tree.2011.11.016
  • Felix Müller, Cornelia Baessler, Hendrik Schubert, Stefan Klotz, James R. Gosz, Robert B. Waide, John J. Magnuson, (unknown) Long-Term Ecological Research. Unknown