Traits Task Group

From pro-iBiosphere Wiki

The Team

Name | Twitter | Organization | Role | Repository
George Gosline | | RBGK | domain expertise |
Quentin Groom | @cabbageleek | BGM | provider of use case |
Thomas Hamann | | Naturalis | developer |
Claus Weiland | | Biodiversity and Climate Research Centre / Senckenberg | programmer |
Robert Hoehndorf | | Aberystwyth University | programmer, provider of use case |

The Project

The Original Idea(s)

"Using grammatical and positional clues we could explore ways to extract trait data from taxon treatments. Using Kew's and Meise's African floras it should be possible to extract traits for a large number of African species. It will also be interesting to compare the results from different floras." - Don Kirkup

"RDF knowledge base of plant phenotypes" - Robert Hoehndorf

Or, in short:

"The idea of our team is to extract plant trait data from digitized floras" - Quentin Groom

Thank you, Quentin, for being clear and concise.

The Chosen Approach

We extracted traits from several floras and represented them using standard vocabularies. We used a simple text mining approach to mark up terms from the Plant Ontology (PO) and the Phenotypic Quality Ontology (PATO), then constructed Entity-Quality (EQ) statements from the text and linked them to the taxon. From the EQ statements, we constructed a rough draft of a Flora Phenotype Ontology, which contains all the traits found in the floras (Flore du Gabon and Flora Malesiana) and the associated taxa.
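To make the approach above concrete, here is a minimal Python sketch of dictionary-based term mark-up followed by EQ statement construction. It is an illustration only, not the code the team wrote: the lexicons, the placeholder IDs (`PO:EX_...`, `PATO:EX_...`), the taxon name and the function name `extract_eq` are all hypothetical, and a real run would load labels and synonyms from the PO and PATO files.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicons mapping labels (and plural forms) to placeholder
# IDs. These are NOT real PO/PATO identifiers.
ENTITY_TERMS = {
    "leaf": "PO:EX_leaf", "leaves": "PO:EX_leaf",
    "petal": "PO:EX_petal", "petals": "PO:EX_petal",
}
QUALITY_TERMS = {
    "ovate": "PATO:EX_ovate",
    "glabrous": "PATO:EX_glabrous",
    "red": "PATO:EX_red",
}


@dataclass(frozen=True)
class EQStatement:
    taxon: str
    entity: str
    quality: str


def extract_eq(taxon, description):
    """Build EQ statements from one taxon description.

    Assumption: each clause (split on ';' or '.') describes the first
    entity term it mentions, and every quality term in that clause
    applies to that entity.
    """
    statements = []
    for clause in re.split(r"[.;]", description):
        tokens = re.findall(r"[a-z]+", clause.lower())
        entities = [ENTITY_TERMS[t] for t in tokens if t in ENTITY_TERMS]
        if not entities:
            continue  # no known entity in this clause, nothing to link
        for token in tokens:
            if token in QUALITY_TERMS:
                statements.append(
                    EQStatement(taxon, entities[0], QUALITY_TERMS[token])
                )
    return statements


# Hypothetical taxon and description, for illustration only.
for eq in extract_eq("Examplea fictiva", "Leaves ovate, glabrous; petals red."):
    print(eq)
```

The clause-level heuristic is deliberately crude; it mirrors the "simple text mining" idea and will misassign qualities when a clause mentions several entities.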

Source data files

The data sources we used for our hacking are listed below.

Example flora treatments

These are the data we started out with.

Data sources in use

The following flora treatments were used as our base data sources. They come in two different formats with varying levels of mark-up applied to them.

Other examples

We considered using the Kew African Floras linked to below, but they are not currently used because we would first need to develop a parser for their plant descriptions.


These ontologies were used as a basis for creating our new ontology.

This ontology was not used, although it shares a similar data structure with our ontology:

Glossaries, lexicons, and other sources of additional terminology

Robert matched the terms in the glossaries and lexicons given below with those found in the various ontologies used. The terms were then added as synonyms, allowing a better matching between the ontologies and the source data.

MOBOT French/English lexicon


Glossaries from CharaParser

Glossaries extracted from Flore du Gabon XML files
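The synonym matching described at the top of this section can be pictured with a small Python sketch. It assumes the ontology is available as a label-to-ID mapping and each lexicon as a mapping from a glossary or foreign-language term to an English label. The function name `add_synonyms`, the placeholder IDs and the sample entries are hypothetical and are not taken from the MOBOT lexicon or the other glossaries.

```python
def add_synonyms(label_to_id, lexicon):
    """Return a lookup that also resolves lexicon terms to ontology IDs.

    label_to_id: ontology label -> term ID
    lexicon:     glossary or foreign-language term -> English label
    """
    lookup = {label.lower(): term_id for label, term_id in label_to_id.items()}
    for term, english in lexicon.items():
        term_id = lookup.get(english.lower())
        if term_id is not None:
            # Register the lexicon term as a synonym of the matched label,
            # without overwriting an existing ontology label.
            lookup.setdefault(term.lower(), term_id)
    return lookup


# Illustrative data only; IDs are placeholders, not real PO identifiers.
labels = {"leaf": "PO:EX_leaf", "fruit": "PO:EX_fruit"}
lexicon = {"feuille": "leaf", "écorce": "bark"}
lookup = add_synonyms(labels, lexicon)
print(lookup.get("feuille"), lookup.get("écorce"))
```

Lexicon terms whose English label has no ontology match are simply left out of the lookup, so they can be collected separately as candidates for new terms.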

Character lists

File:Characters.xls Characters from FWTA family key.

File:GlossaryAttributes.xls Botanical glossary characters -- attributes

File:GlossaryAttributes.txt In text format

File:Plant Botanical glossary characters -- basic attributes & terms (leaves, fruits ...)

FlorML list of taxonomic ranks

File:Ranks in FlorML.txt

Extensions to EnvO

Below are some additions to the EnvO ontology, made to make it more useful for tropical countries and for French-language literature. The EcoLexicon is also of interest to us, as it offers a rich lexicon of environmental terms in several languages.

Use of characters -- multi-entry keys

An example of how traits can be used for the construction of a multi-entry key: Lucid key for Malvaceae genera

Intermediate results

Ontology term matching

We generated two files of missing terms, that is, terms that could not be matched to any term in the ontologies or in the other terminology files we used:

The second file was updated with PATO IDs whenever possible, by manually matching terms to definitions, or in some cases, parent definitions (e.g. "gold" to "color"):
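The workflow described in this section (collect the unmatched terms, then map them by hand to a PATO term or to a parent term) could be sketched in Python as follows. This is an assumption-laden illustration: the function names and the `"exact"`/`"parent"` labels are hypothetical, and the curated map works on labels rather than real PATO IDs. Only the "gold" to "color" pair comes from the text above.

```python
def find_missing(candidate_terms, known_terms):
    """Return the sorted, de-duplicated terms that match no known term."""
    known = {term.lower() for term in known_terms}
    return sorted({term.lower() for term in candidate_terms} - known)


def annotate(missing, manual_map):
    """Attach a manually curated target to each missing term, if any.

    manual_map: term -> (target label, match kind), where match kind is
    "exact" or "parent". Terms without an entry stay unresolved (None).
    """
    return [(term, *manual_map.get(term, (None, None))) for term in missing]


# "gold" -> "color" is the parent-definition example given in the text;
# the other terms are illustrative.
manual_map = {"gold": ("color", "parent")}
missing = find_missing(["Gold", "ovate", "zygomorphic", "gold"], ["ovate"])
for row in annotate(missing, manual_map):
    print("\t".join(str(field) for field in row))
```

Keeping the match kind alongside the target makes it easy to review later which mappings are exact and which fall back to a broader parent term.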

Taxon names extracted from the various Floras

These will be used to classify traits by taxon in the ontology.

Parser for scalar values

To link numerical data such as measurements to the respective traits, a parser for scalar values and, when present, their units (e.g. mm) is required. The following Perl program was created to extract many of the scalar values from the flora files, using regular expressions that match the formats the floras use for this kind of data:
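The Perl program itself is not reproduced on this page. Purely as an illustration of the general idea, here is a minimal sketch written in Python rather than Perl. The pattern, the list of units and the function name `parse_scalars` are assumptions and do not reflect the formats the actual program handles; in particular, ranges with bracketed outliers such as "(1-)2-4(-6) cm" are not handled correctly.

```python
import re

# A number with an optional decimal part.
NUM = r"\d+(?:\.\d+)?"

# A single value or a range (hyphen or en dash), optionally followed by a
# length unit. The unit list is an assumption for this sketch.
SCALAR_RE = re.compile(
    rf"(?P<low>{NUM})"
    rf"(?:\s*[-–]\s*(?P<high>{NUM}))?"
    rf"(?:\s*(?P<unit>mm|cm|dm|m)\b)?"
)


def parse_scalars(text):
    """Extract scalar values and ranges, with units when present."""
    results = []
    for match in SCALAR_RE.finditer(text):
        low = float(match.group("low"))
        # A single value is stored as a range whose bounds coincide.
        high = float(match.group("high")) if match.group("high") else low
        results.append({"low": low, "high": high, "unit": match.group("unit")})
    return results


# Hypothetical description fragment, for illustration only.
for value in parse_scalars("Leaves 3-5 cm long, 1.5 cm wide; stamens 10."):
    print(value)
```

Storing single values as degenerate ranges keeps the output uniform, which simplifies linking the numbers to traits later on.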

First run of generating the Plant Phenotype Ontology

The first ontology we generated is available here and can be opened in Protégé, after which the ELK reasoner can be used to classify it:



FLOPO first draft

Taxon annotations

Next Steps

Besides improving, cleaning up and documenting the code, these are the tentative next steps:

Ontology Work

  • Work with Plant Ontology (PO) and Phenotypic Quality Ontology (PATO) developers to incorporate missing terms, including synonyms.
  • Curate and improve the Flora Phenotype Ontology (FLOPO).

Integration with Trait Ontology

  • Use EQ definitions and reuse Trait Ontology; or migrate all traits to TO and only keep the phenotypes (trait values).

Fix NLP parser

  • Evaluate CharaParser to see whether it works better.
  • Integrate terms from glossaries and vocabularies.
  • Recognize sub-statements.

Flora data integration

  • Use IPNI identifiers to identify taxon concepts used in the floras.
  • Add the taxon concepts to an RDF store.
  • Add the phenotypes for each taxon concept to the RDF store.
  • Extract values and value ranges from the floras, and add the results to RDF store.
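As a rough picture of what the steps above might produce, the following Python sketch serializes a taxon concept, its phenotypes and one value range as N-Triples using only the standard library. Everything specific here is hypothetical: the `http://example.org/flora/` namespace, the predicate names, the taxon identifier (a placeholder, not an IPNI identifier) and the encoding of a range as a plain string literal. An actual RDF store would more likely use established vocabularies and structured values.

```python
BASE = "http://example.org/flora/"  # hypothetical namespace


def uri(local):
    """Wrap a local name in the hypothetical namespace as an N-Triples IRI."""
    return f"<{BASE}{local}>"


def literal(value):
    """Format a plain string literal, escaping \\, double quotes and newlines."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def taxon_triples(taxon_id, name, phenotype_ids, measurements):
    """Build triples for one taxon concept.

    measurements: iterable of (trait, low, high, unit) tuples.
    """
    subject = uri(f"taxon/{taxon_id}")
    triples = [(subject, uri("vocab/scientificName"), literal(name))]
    for phenotype_id in phenotype_ids:
        triples.append(
            (subject, uri("vocab/hasPhenotype"), uri(f"phenotype/{phenotype_id}"))
        )
    for trait, low, high, unit in measurements:
        # Simplification: the range is stored as one string literal.
        triples.append(
            (subject, uri(f"vocab/{trait}"), literal(f"{low}-{high} {unit}"))
        )
    return triples


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Placeholder identifier and name, for illustration only.
print(to_ntriples(taxon_triples(
    "EXAMPLE-1", "Examplea fictiva", ["leaf-ovate"], [("leafLength", 3.0, 5.0, "cm")]
)))
```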

General Extensions

  • Attempt to extract environmental information from the floras, represent environmental traits and features using ENVO, then add to RDF store.
  • Add geolocations for taxa to RDF store.
  • Visualize using an approach similar to

Other XML file formats

  • Add support for different XML file formats.
    • Need to find a cleverer way to recognize entities, e.g. by using the sentence detector from the OpenNLP toolkit.