In order to facilitate the implementation of an Open Knowledge Biodiversity Management System, a total of four pilots are being conducted. These pilots deal with:
- Pilot 1 Interoperability model between taxon treatments from both legacy and prospective literature from three organismic domains (fungi, plants and animals)
- Pilot 2 Common query/response model for automated registration of higher plants (International Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)
- Pilot 3 Interoperability model between PLAZI and the EDIT Platform for Cybertaxonomy based on transformations between XML-repositories and CDM-stores
- Pilot 4 Revision of a tool (CharaParser) that generates identification keys by reusing morphological characters from published species descriptions
- 1 Taxa for pilots
- 2 Mistletoes (families Loranthaceae and Viscaceae)
- 3 Chenopodium (Goosefoots)
- 4 Fungi: Towards a (semi-) automated identification of European Agaricus species
- 5 Nephrolepis (Ferns)
- 6 Ants
- 7 Bryophyta - Campylopus
- 8 Chilopoda
- 9 Spiders
- 10 Roadmap
- 11 Mark-up of taxa
- 12 Mark-up requirements list
- 13 GoldenGATE training organised in 2013
Taxa for pilots
Animals, higher plants, fungi and bryophytes will be included in the pilots. The criteria used for selecting taxa are: a low number of species; ecological links to other organismal groups (e.g. parasites/hosts, symbiotic relationships, pollinators, mycorrhiza); availability of online links (to allow linking data with types, specimens, localities, etc.); and groups that require description of new species.
Mistletoes (families Loranthaceae and Viscaceae)
- Don Kirkup (specialist in Loranthaceae) - Royal Botanic Gardens, Kew (RBGK)
- Tony Walduck (RBGK)
- Eugenia Barnett (RBGK, Flora of Gabon)
- Mike Gilbert (RBGK, editor of Flora of China)
- Quentin Groom (Meise)
- Mohan Devkota
Purpose of pilot:
- 1. Workflow from legacy to prospective literature
- 2. Address the generation of interactive keys
- 3. Compare character parsing techniques
- 4. Study technical and semantic interoperability associated with the merging of separate floras, and dealing with associated organisms such as hosts, predators, dispersers and pollinators.
The mistletoe pilot will use several sources, most of them already in digital format but in varying states of mark-up: (i) databases compiled from legacy literature are available at RBGK (c. 300 African spp.); (ii) 230 spp. for the Flora Malesiana (FM) area have recently been marked up by Naturalis; (iii) complete digitised but unmarked text is available for the Flora of China; (iv) other publications with relevant treatments, such as Flore d’Afrique Centrale (c. 100 pages in total), may also be incorporated. All these regional sources will be combined, and the mark-up of geographical and morphological information will be refined as far as practicable. The resulting data will be used to study the technical and semantic interoperability issues associated with merging separate floras and with handling associated organisms such as hosts, predators, dispersers and pollinators.
The treatments for the genus Helixanthera (c. 60 spp.), which is common to the African and Indo-Malesian regions, will be further processed to develop an interactive key to the species, a precursor in the workflow towards a full monograph of the group. This exercise will also allow a comparison of character parsing techniques.
There are two further opportunities to explore how an e-infrastructure can contribute to existing Flora-writing initiatives through gap filling. The merged Flora Malesiana and Flora of China accounts will provide a basis for authoring the Flora of Nepal mistletoes (Mohan Devkota, in association with RBG Edinburgh), and the various African mistletoe accounts will provide the basis for the Flore du Gabon (Eugenia Barnett, in association with RBGK and WAG). The taxa will be re-described for the specific flora areas, and any new taxa will be published online. Interactive keys will be developed for the mistletoe species of both floras.
Chenopodium (Goosefoots)
- Quentin Groom (BGM), mark-up of legacy floras and herbaria.
- Susy Fuentes (Botanischer Garten und Botanisches Museum Berlin-Dahlem - FUB-BGBM), taxonomic expert of the genus Chenopodium.
- Patricia Kelbert (FUB-BGBM), import from TaxonX to CDM.
- Sabrina Eckert (FUB-BGBM), scans/pdfs from Floras and mark-up.
Purpose of pilot:
- 1. Give an overview of the whole process (i.e. digitization of different sources, mark-up, storage, import into the CDM database) to demonstrate the workflow from original publication to dissemination.
- 2. Mark up with the highest possible granularity level (including synonyms and localities).
- 3. Import treatments from the Plazi treatment repository into the Common Data Model (CDM) using the TaxonX schema; TaxonX is being tested as the input format for the CDM.
- 4. Use of CDM Data (from both marked up literature and unpublished data sources) to create new treatments.
- 5. Reuse of already marked up data (Botanic Garden, Meise).
The genus Chenopodium comprises about 100 species. We are focusing on marking up treatments from a variety of floras and on sourcing already marked-up data. We are analysing the technical problems encountered during both mark-up and processing in order to suggest improvements to the whole pipeline, from digitization to import into the CDM database. Different kinds of treatments are being tested to see which elements may be missing from the TaxonX schema and what is or is not compatible with the CDM. The mark-up has been done in cooperation with Plazi.
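One way to picture the check for missing elements is a small script that scans marked-up treatments and reports which expected sections are absent before import. This is only a sketch under stated assumptions: the element names (`treatment`, `nomenclature`, `description`, `distribution`), the sample XML and its placeholder content are hypothetical and do not reproduce the actual TaxonX schema or the CDM import code.

```python
import xml.etree.ElementTree as ET

# Hypothetical section names; the real TaxonX schema uses different elements.
EXPECTED_SECTIONS = ("nomenclature", "description", "distribution")

# Invented sample document with placeholder content, for illustration only.
SAMPLE = """
<document>
  <treatment id="t1">
    <nomenclature>Chenopodium vulvaria L.</nomenclature>
    <description>Placeholder description.</description>
    <distribution>Placeholder distribution.</distribution>
  </treatment>
  <treatment id="t2">
    <nomenclature>Chenopodium album L.</nomenclature>
    <description>Placeholder description.</description>
  </treatment>
</document>
"""


def missing_sections(xml_text):
    """Return {treatment id: [expected sections that are absent]}."""
    root = ET.fromstring(xml_text)
    report = {}
    for treatment in root.iter("treatment"):
        # Only direct children of a treatment count as its sections.
        present = {child.tag for child in treatment}
        missing = [s for s in EXPECTED_SECTIONS if s not in present]
        if missing:
            report[treatment.get("id")] = missing
    return report


report = missing_sections(SAMPLE)
print(report)
```

A report of this kind could be produced per flora to show which kinds of treatment lack which elements, which is the sort of evidence needed when suggesting improvements to the pipeline.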
An example of the data can be viewed on the Chenopodium CDM Data Portal.
To understand how these kinds of data can be reused, we are building a spatial time series of records for one particular species, Chenopodium vulvaria. This distinctly ugly, stinking weed is a red-listed species in some countries, an invasive alien in others and a common native weed in yet others. It has little economic importance, but for this reason it enables us to study the complex interaction between the presence and abundance of a species and the botanist's inclination to collect and record it. By gathering together as much distributional information as possible on one species, we will be able to assess how useful it would be to extract distributional information from legacy literature.
Two examples of the results produced so far are the posters presented at Biodiversity Informatics Horizons 2013.
Fungi: Towards a (semi-) automated identification of European Agaricus species
- József Geml (Naturalis)
- Luis A. Parra (President of the Mycological Association of Aranda de Duero, Spain)
- Hong Cui (U. of Arizona, USA)
- Thibaut De Meulemeester (Naturalis)
- Soraya Sierra (Naturalis)
Purpose of pilot:
- 1. Mark-up the most comprehensive treatments of European Agaricus species
- 2. Construct digital taxonomic databases by character parsing techniques
- 3. Facilitate identification of European Agaricus species by non-specialists
Agaricus L. is the type genus of the family Agaricaceae in the order Agaricales. It is estimated that ca. 100 Agaricus species occur in Europe alone, while the total number of described species is 386 worldwide (Zhao et al. 2011). However, the total number of species likely exceeds 400, as new species are still being described. Some species have economic importance, such as Agaricus bisporus (J.E. Lange) Imbach, which is the most widely cultivated edible mushroom. Taxonomic classification and identification of species in this diverse genus have been considered difficult by taxonomists and non-experts alike. However, in recent years, the incorporation of molecular tools in systematic studies has greatly improved species delimitation. In particular, works by Luis A. Parra have been fundamental in establishing a solid classification in Agaricus (Parra 2008, 2013) and provide valuable diagnostic morphological characters for taxa delimited by the combination of morphological and molecular phylogenetic data.
The goal of the fungi pilot is to provide tools that facilitate species identification by scientists and lay people alike by making comparisons of morphological data more efficient. For this purpose, we focus on the mark-up of the taxonomic literature of European Agaricus species, in particular sections Agaricus, Bivelares, Chitonioides, Sanguinolenti (with their subsections Bohusia and Sylvatici) and Spissicaules. These sections work well as a case study because of the manageable number of species (ca. 30) and the availability of a monograph. Following the mark-up of the literature, we are working on the implementation of character parsing software (CharaParser) to facilitate the development of (semi-)automated morphological comparisons of the focal taxa. We are incorporating a set of key characters from various genera of macrofungi to parse species descriptions into character matrices. To date, we have collected such information from the genera Agaricus and Lactarius. The pilot will result in databases that, coupled with morphometric software, will be suitable for use by scientists from a variety of fields, ranging from taxonomy to ecology and biotechnology, and by lay people alike.
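The idea of parsing species descriptions into a character matrix can be illustrated with a deliberately naive sketch. It is not CharaParser's algorithm: the list of structures, the clause-splitting rule and the two example descriptions are all hypothetical and invented for illustration, and real descriptions need far more robust language processing.

```python
import re

# Hypothetical vocabulary of structures expected in a description.
STRUCTURES = ("pileus", "lamellae", "stipe", "annulus")


def parse_description(text):
    """Map each known structure to the list of states given for it.

    Assumes clauses are separated by ';' or '.' and that each clause
    starts with the structure name, e.g. 'Pileus 5-10 cm, white'.
    """
    characters = {}
    for clause in re.split(r"[;.]", text):
        clause = clause.strip()
        if not clause:
            continue
        head, _, rest = clause.partition(" ")
        structure = head.lower()
        if structure in STRUCTURES:
            characters[structure] = [s.strip() for s in rest.split(",") if s.strip()]
    return characters


def build_matrix(descriptions):
    """Build a taxon-by-structure matrix; unstated characters become None."""
    parsed = {taxon: parse_description(text) for taxon, text in descriptions.items()}
    return {
        taxon: {s: chars.get(s) for s in STRUCTURES}
        for taxon, chars in parsed.items()
    }


# Invented example descriptions (not taken from any monograph).
descriptions = {
    "Taxon A": "Pileus 5-10 cm, white; stipe cylindrical, smooth.",
    "Taxon B": "Pileus 3-6 cm, brown; lamellae free, pink.",
}
matrix = build_matrix(descriptions)
for taxon, row in matrix.items():
    print(taxon, row)
```

Once descriptions are in this tabular form, taxa can be compared character by character, which is the step that makes (semi-)automated comparison and key generation possible.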
Parra LA (2008): Fungi Europaei, Volume 1, Agaricus L.: Allopsalliota, Nauta & Bas, Edizioni Candusso
Parra LA (2013): Fungi Europaei, Volume 1A, Agaricus L.: Allopsalliota, Nauta & Bas (Parte II), Candusso Edizioni
Zhao R, Karunarathna S, Raspé O et al. (2011): Major clades in tropical Agaricus. Fungal Diversity 31: 279-296.
Nephrolepis (Ferns)
- Peter Hovenkamp - Naturalis
Purpose of pilot:
Mark up one relatively recent article that has a very extensive list of scientific names, including hundreds of synonyms and combinations, and that contains treatments with a wide variety of detail (to accommodate treatments at levels ranging from species to cultivar). The aim is to assess the general usability of the tool used for the mark-up and for the delivery of data to Plazi, and from there to the CDM and EOL.
Results: The document has been marked up and uploaded to Plazi and the CDM. Mark-up refinements are being made more or less continuously.
Comments: Mark-up was a time-consuming process requiring a degree of attentiveness that is difficult to maintain for the time required for many of the tasks. This emphasizes the need for further investment in the user interface and usability of GoldenGATE. Further changes that offer users more flexibility in the revision or refinement of existing mark-up would be welcomed.
Additionally, this pilot demonstrated that much of the time is spent on solving problems caused by a lack of guidance on the required mark-up structure, e.g. with regard to the nesting of tags or treatments. Attention should be given to stricter specification of these in the mark-up schemas.
Ants
- Donat Agosti - Plazi
- Brian Fisher (ant specialist) - California Academy of Sciences
Purpose of pilot:
- 1. Demonstrate the workflow from original publication to dissemination
- 2. Demonstrate the extraction of characters from text
- 3. Fine-grained mark-up, providing distribution data to GBIF
- 4. Link to Species-ID and other users
- 5. Integration into existing data (all species and most specimens documented with standard digital images and keys, but not yet with descriptive data)
- Anochetus. 10 species (+ few new).
- Pyramica. 50 species (+ few new).
These two taxa are part of one of the world's largest and most exhaustive inventories and thus lend themselves to modeling.
An overview of the project is here.
Bryophyta - Campylopus
- Sylvia Mota de Oliveira (Bryophyte specialist and editor of the Flora of the Guianas) - Naturalis Biodiversity Center
- Renato Gama
Purpose of pilot: Address the integration of data from different taxonomic treatments.
A total of 13 species of the genus Campylopus occur in the Guianas region (Guyana, Suriname, French Guiana). The genus was recently treated in the Flora of the Guianas (Series C, fascicle 2, 2011), but some of the species descriptions contain only a reference to the Flora of Suriname (vol. VI, part I, Musci I and II, 1964), where the complete description can be found. Both Floras, however, are only available as hard copies, and the Flora of Suriname, which had a very limited distribution, is out of print. In this pilot we want to mark up the text files of both treatments and integrate the data, in order to obtain a single complete treatment for Campylopus and make it available online. Possibly, we will add molecular data from a revisionary work currently being done for the genus. At this phase (September 2013) the mark-up is finished and the resulting two documents have been uploaded to Plazi. The next steps are importing the data from Plazi into the Common Data Model and creating a portal where they can be accessed.
Chilopoda
- Pavel Stoev (Chilopoda specialist) - Pensoft
- Teodor Georgiev
- Jordan Bisserkov
- Lyubomir Penev
Purpose of pilot:
- 1. To mark up approximately 100 taxon treatments, comprising original descriptions and important nomenclatural acts, using GoldenGATE software
- 2. To describe a new cave-dwelling species from a cave in Croatia
- 3. To integrate taxon treatments deriving from legacy and prospective literature
- 4. To publish an enhanced taxonomic checklist of all taxa, including links to BHL, Plazi, GBIF, EoL, Species-ID, etc.
The lithobiid subfamily Ethopolyinae Chamberlin, 1915 (Chilopoda: Lithobiomorpha) is known to comprise only four (sub?)genera: Bothropolys Wood, 1862, with around 40 species distributed in North America and East Asia; Archethopolys Chamberlin, 1925 and Zygethopolys Chamberlin, 1925, each with three species from North America; and Eupolybothrus Verhoeff, 1907, with around 35 species in seven subgenera from Europe, Asia Minor and North Africa (Zapparoli and Edgecombe 2011). The primarily Mediterranean genus Eupolybothrus shows its highest species diversity in the Apennine and Balkan peninsulas (Zapparoli 2003). It has been reviewed recently, and a preliminary interactive key to all species in the genus was made with DELTA software (Stoev et al. 2010). Stoev et al.'s paper demonstrates several innovative methods of semantic tagging and semantic enhancement, used to exemplify Pensoft's Markup and Taxon profile tools. The new pilot aims to extend existing knowledge by describing a new cave-dwelling species from Croatia; by discussing the status of several old and uncertain taxa; and by providing a checklist of all taxa supported with links to original literature sources (BHL), taxon treatments (Plazi), occurrence data (GBIF), species descriptions (EoL, Species-ID), molecular data (GenBank), museum collections, etc.
Spiders
- Jeremy Miller (spider specialist) - Naturalis, Plazi
- Donat Agosti – Plazi
- Guido Sautter – Plazi
- Terry Catapano – Plazi
- Christian Kropf – Naturhistorisches Museum, Bern
- Wolfgang Nentwig – University of Bern
- David King – The Open University
- Serrano Pereira – Naturalis
- Rutger Vos – Naturalis
- Soraya Sierra – Naturalis
Purpose of the pilot:
- Apply fine-grained markup of taxonomic treatments to extract primary data from legacy publications. These data are then linked to a variety of resources, including the Global Biodiversity Information Facility (GBIF, http://www.gbif.org/), Encyclopedia of Life (EOL, http://www.eol.org), and the new World Spider Catalog database to be based in Bern, Switzerland. A series of standard charts are developed to visualize specimen data in treatments or groups of treatments using a dashboard on Plazi. This pilot has modest goals in terms of the number of documents to be marked up, but will develop workflows for future scaled-up efforts.
Digital Data and Megadiverse Taxa
The digital world is awash with biodiversity data. Why then are we advocating such an ambitious program for fine-grained markup of vast amounts of taxonomic literature? To illustrate the answer, let us consider GBIF (http://www.gbif.org/), a database dedicated to aggregating and serving specimen data through a common portal. A leading model for getting data into GBIF involves aggregating data from a network of large institutional collections, especially natural history museums. But if we break down GBIF data by taxon, some strong patterns and biases emerge. Birds, which represent about 1% of animal species, account for more than half of all GBIF records. Less than 20% of described animal species have any data representation on GBIF, which says nothing of how complete those records are. (Plants fare better, with nearly 60% of species represented.) So the density of data for the world's more diverse taxa, including spiders, is very low. This is not because megadiverse taxa are poorly represented in the world’s natural history collections (far from it); it is because most of these specimens are not available in digital form. This means that if we want to use the data available today in digital form to address important questions, such as setting conservation priorities or anticipating the effects of climate change, we are limited by biases in the currently available data. The collections-based model of data aggregation clearly serves some taxonomic groups very well. But the complementary approach of using the taxonomic literature as a source of specimen data may be more successful for megadiverse taxa.
From Catalog to Digital Fauna
The World Spider Catalog (http://research.amnh.org/iz/spiders/catalog/) is arguably the most essential and useful resource for araneological research. Indexing all taxonomic acts and literature and presenting them as a single exhaustive online resource has promoted rigorous scholarship and magnified productivity. Norman Platnick's spider catalog first came online in 2000 as a series of linked HTML documents. The catalog lists all spider species known to science and all taxonomic treatments of each species (including synonyms), along with references to figures, distributions, and unique identifiers, among other information and editorial judgments. Two updates are released each year. These updates are made by editing text documents rather than through a database. A database version has been created, but it is based on text parsing and is not kept up to date. Now, the catalog is moving from its birthplace at the American Museum of Natural History in New York to the Naturhistorisches Museum in Bern. Once there, the text will be converted into a new database, where it will be maintained for the foreseeable future.
The World Spider Catalog references more than 12,000 publications. Marking up this entire body of legacy literature for linking to the new catalog presents a significant challenge. However, there are strategies for efficiently approaching significant portions of this literature. One way is to focus on marking up taxonomic articles from journals that have published a large proportion of the spider literature. Eleven journals have published more than 25% of the articles referenced in the spider catalog. To address the long tail of taxonomic papers scattered across nearly 1,600 journals, a more distributed effort is needed, in which taxonomic researchers use user-friendly software to mark up the treatments relevant to their particular research specialty.
This pilot project will take a two-track approach to demonstrate scalable workflows for extracting primary data from legacy taxonomic literature and linking it with the spider catalog. Track 1 will involve XML markup of all open access spider papers published in the taxonomic megajournal Zootaxa (http://www.mapress.com/zootaxa/index.html). This represents efforts to mark up cohesive blocks of work, such as entire journals where contributions to spider taxonomy are concentrated. However, Zootaxa is a conditional open access journal, and only 34 papers (about 7% of spider taxonomy papers published in Zootaxa) qualified for track 1. Track 2 will mark up all taxonomic treatments concerning the spider family Penestomidae. These sets of articles overlap (one monograph on Penestomidae was published as an open access article in Zootaxa), so this track represents filling in gaps by marking up articles from multiple sources.
Plazi and the new spider catalog database have agreed to exchange identifiers for the purpose of linking from a reference in the catalog directly to the corresponding taxonomic treatment in Plazi. Once markup is complete, treatments can be linked directly to their references on the World Spider Catalog. For treatments produced for this pilot, this includes the primary specimen data recovered from the source document using fine-grained XML markup.
The Untapped Value of Data in Taxonomic Literature
Fine-grained markup permits us to experiment with new approaches to visualizing and synthesizing the primary specimen data that are the foundation of taxonomic research. During pro-iBiosphere’s Data Enrichment Hackathon (http://wiki.pro-ibiosphere.eu/wiki/Data_enrichment_hackathon,_March_17-21_2014), the capabilities of Plazi’s data search and retrieval system were expanded. As a result, we can now demonstrate a series of charts that represent and summarize the specimen data associated with any treatment or group of treatments. These dashboards reveal, at a glance, key information of use to taxonomic researchers, collections managers, and other stakeholders. They include profiles of when specimens were collected (both by month of the year and by decade), specimens by elevation, proportions of male and female specimens, specimens partitioned by institutional collection and collector, and map plots of collection localities.
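The summaries behind such charts amount to simple counts over specimen records. The following is a minimal sketch of that aggregation, assuming specimen data have already been extracted into records; the field names (`date`, `sex`, `elevation_m`, `collection`), the 500 m elevation bin and the sample records are invented for illustration and are not Plazi's actual data model or dashboard code.

```python
from collections import Counter

# Invented specimen records; field names are assumptions, not Plazi's schema.
records = [
    {"date": "1987-03-14", "sex": "male", "elevation_m": 450, "collection": "Coll. A"},
    {"date": "1992-11-02", "sex": "female", "elevation_m": 1200, "collection": "Coll. B"},
    {"date": "1995-03-21", "sex": "female", "elevation_m": 980, "collection": "Coll. A"},
    {"date": "2004-07-09", "sex": "male", "elevation_m": None, "collection": "Coll. A"},
]


def summarize(records, elevation_bin=500):
    """Count specimens by month, decade, sex, collection and elevation band."""
    summary = {
        "by_month": Counter(),
        "by_decade": Counter(),
        "by_sex": Counter(),
        "by_collection": Counter(),
        "by_elevation": Counter(),
    }
    for rec in records:
        year, month, _day = (int(part) for part in rec["date"].split("-"))
        summary["by_month"][month] += 1
        summary["by_decade"][year // 10 * 10] += 1
        summary["by_sex"][rec["sex"]] += 1
        summary["by_collection"][rec["collection"]] += 1
        # Records without an elevation are skipped for the elevation profile.
        if rec["elevation_m"] is not None:
            lower_bound = rec["elevation_m"] // elevation_bin * elevation_bin
            summary["by_elevation"][lower_bound] += 1
    return summary


summary = summarize(records)
for name, counts in summary.items():
    print(name, dict(counts))
```

Each counter corresponds to one chart type listed above; map plots would additionally need coordinates for each record, which are omitted here.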
Rates of publication in taxonomy continue to rise. So while addressing the challenge of legacy literature XML markup, we should not continue adding to the backlog. There have been great advances in taxonomic publishing. Pensoft (http://www.pensoft.net/journals/) journals have led the way with an XML-based approach that facilitates the reuse, aggregation, and dissemination of content to an increasing variety of cybertaxonomic databases and resources. This mode of taxonomic publishing represents the highest standard in transparency and repeatability in biodiversity science.
Mark-up of taxa
A high level of granularity is needed because we want stakeholders to be able to use taxonomic data for various purposes, e.g. identification through keys, traits, etc. Treatments from legacy literature will be marked up down to the level of morphological characters, localities and bibliographic citations. In order to mark up characters, WP3 and WP4 will need to explore whether the existing Floras and Faunas are of sufficient value and quality to provide enough data that can be marked up. One result might be a best practice on how much explicit data ought to be included to allow content to be extracted.