Traits Task Group
- 1 The Team
- 2 The Project
- 3 Source data files
- 3.1 Example flora treatments
- 3.2 Ontologies
- 3.3 Glossaries, lexicons, and other sources of additional terminology
- 3.4 Use of characters -- multi-entry keys
- 4 Intermediate results
- 5 Results
- 6 Next Steps
The Team
|Name||Handle||Affiliation||Role||Link|
|George Gosline|| ||RBGK||domain expertise|| |
|Quentin Groom||@cabbageleek||BGM||provider of use case||https://github.com/qgroom|
|Claus Weiland|| ||Biodiversity and Climate Research Centre / Senckenberg||programmer||https://github.com/cp-weiland|
|Robert Hoehndorf|| ||Aberystwyth University||programmer, provider of use case||https://github.com/leechuck/plantphenotypes|
The Original Idea(s)
"Using grammatical and positional clues we could explore ways to extract trait data from taxon treatments. Using Kew's and Meise's African floras it should be possible to extract traits for a large number of African species. It will also be interesting to compare the results from different floras." - Don Kirkup
"RDF knowledge base of plant phenotypes" - Robert Hoehndorf
Or, in short:
"The idea of our team is to extract plant trait data from digitized floras" - Quentin Groom
Thank you, Quentin, for being clear and concise.
The Chosen Approach
We extracted traits from several floras and represented them using standard vocabularies. We used a simple text-mining approach to mark up Plant Ontology (PO) and Phenotypic Quality Ontology (PATO) terms in the text, constructed Entity-Quality (EQ) statements from the marked-up terms, and linked each statement to its taxon. From the EQ statements, we built a rough draft of a Flora Phenotype Ontology that contains all the traits found in the two floras (Flore du Gabon and Flora Malesiana) together with the associated taxa.
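The following is a minimal sketch of the kind of dictionary-based markup described above, not the code used at the hackathon. It assumes two tiny hand-made lexicons (`ENTITY_TERMS`, `QUALITY_TERMS`) whose identifiers are illustrative placeholders rather than real PO or PATO IDs, and it pairs each quality term with the most recent entity term in the same clause.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical mini-lexicons. The identifiers are placeholders for
# illustration only; they are NOT real PO or PATO identifiers.
ENTITY_TERMS = {"leaf": "PO:leaf", "leaves": "PO:leaf", "petal": "PO:petal"}
QUALITY_TERMS = {"ovate": "PATO:ovate", "glabrous": "PATO:glabrous", "red": "PATO:red"}


class EQ(NamedTuple):
    """One Entity-Quality statement linked to a taxon."""
    taxon: str
    entity: str
    quality: str


def lookup(token: str, lexicon: dict) -> Optional[str]:
    """Return the ID for a token, trying a naive singular form as fallback."""
    if token in lexicon:
        return lexicon[token]
    if token.endswith("s") and token[:-1] in lexicon:
        return lexicon[token[:-1]]
    return None


def extract_eq(taxon: str, description: str) -> List[EQ]:
    """Pair every quality term with the last entity seen in the same clause."""
    statements = []
    for clause in re.split(r"[.;]", description):
        entity = None
        for token in re.findall(r"[a-z]+", clause.lower()):
            entity_id = lookup(token, ENTITY_TERMS)
            if entity_id is not None:
                entity = entity_id
                continue
            quality_id = lookup(token, QUALITY_TERMS)
            if quality_id is not None and entity is not None:
                statements.append(EQ(taxon, entity, quality_id))
    return statements


if __name__ == "__main__":
    for eq in extract_eq("Taxon A", "Leaves ovate, glabrous; petals red."):
        print(eq)
```

Splitting on `.` and `;` is a crude stand-in for real clause detection; a description that mentions a quality before its entity would be missed by this sketch.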
Source data files
Below are listed the data sources we used for our hacking.
Example flora treatments
These are the data we started out with.
Data sources in use
The following flora treatments were used as our base data sources. They come in two different formats with varying levels of mark-up applied to them.
- Flore du Gabon (File:FloreduGabon volumes Hackathon.zip; French-language, Gabon (Africa))
- Flora Malesiana (File:FloraMalesiana volumes Hackathon.zip; English-language, Malesia (South-east Asia))
We also considered the floras listed below, but have not used them so far because they would first require a parser for the plant descriptions.
- https://github.com/ggosline/hackathon (Kew African Floras)
- Flore d'Afrique Centrale (French-language, DR Congo, Rwanda & Burundi (Africa))
Ontologies
These ontologies were used as a basis for creating our new ontology.
- Phenotypic Quality Ontology (PATO)
- The Gene Ontology (GO)
- The Plant Ontology (PO)
- Unit Ontology (UO)
This ontology was not used, although it shares a similar data structure with our ontology:
Glossaries, lexicons, and other sources of additional terminology
Robert matched the terms in the glossaries and lexicons given below with those found in the various ontologies used. The terms were then added as synonyms, allowing a better matching between the ontologies and the source data.
- MOBOT French/English lexicon
- Glossaries from CharaParser
- Glossaries extracted from Flore du Gabon XML files
- File:Characters.xls (characters from the FWTA family key)
- File:GlossaryAttributes.xls (botanical glossary characters: attributes)
- File:GlossaryAttributes.txt (in text format)
- File:Plant vocab.zip (botanical glossary characters: basic attributes and terms such as leaves, fruits ...)
- FlorML list of taxonomic ranks
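As a rough illustration of the synonym step, the sketch below attaches foreign-language glossary terms to ontology classes whose English label matches the translation. The function name `add_synonyms`, the class identifiers, and the sample entries are all assumptions made for the example; the actual matching was done against the glossary files listed above.

```python
from typing import Dict, List, Set, Tuple


def add_synonyms(
    labels: Dict[str, str], lexicon: Dict[str, str]
) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Attach lexicon terms as synonyms to classes with a matching label.

    labels:  class ID -> English label (IDs here are hypothetical).
    lexicon: foreign-language term -> English translation.
    Returns the synonyms per class ID and the sorted unmatched terms.
    """
    label_to_id = {label.lower(): class_id for class_id, label in labels.items()}
    synonyms: Dict[str, Set[str]] = {}
    unmatched: List[str] = []
    for foreign, english in lexicon.items():
        class_id = label_to_id.get(english.lower())
        if class_id is None:
            # No class carries this label; keep the term for manual curation.
            unmatched.append(foreign)
        else:
            synonyms.setdefault(class_id, set()).add(foreign.lower())
    return synonyms, sorted(unmatched)


if __name__ == "__main__":
    labels = {"X:1": "leaf", "X:2": "red"}
    lexicon = {"feuille": "leaf", "rouge": "red", "doré": "gold"}
    synonyms, unmatched = add_synonyms(labels, lexicon)
    print(synonyms)
    print(unmatched)
```

Terms that end up in the unmatched list are the ones that would need manual matching or a new class.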
Extensions to EnvO
Below are some additions to the EnvO ontology. They were made to make EnvO more useful for tropical countries and French-language literature. Also of interest to us is EcoLexicon, which offers a rich lexicon of environmental terms in several languages.
Use of characters -- multi-entry keys
An example of how traits can be used for the construction of a multi-entry key:
http://www.kew.org/science/tropamerica/neotropikey/families/keys/malvaceae/index.htm Lucid key for Malvaceae genera
Ontology term matching
We generated two files of missing terms, that is, terms that could not be matched to any term in the ontologies or in the other terminology files we used:
The second file was updated with PATO IDs whenever possible, by manually matching terms to definitions, or in some cases, parent definitions (e.g. "gold" to "color"):
Taxon names extracted from the various Floras
These names will be used to classify traits by taxon in the ontology.
- Taxon names from Flore du Gabon and Flora Malesiana: http://jagannath.pdn.cam.ac.uk/plant/taxon2ppo.txt
- Taxon Names from Kew Floras: File:KewAfricanFlorasSpeciesNames.txt
Parser for scalar values
To link numerical data such as measurements to their respective traits, we need a parser for scalar values and, when present, their units (e.g. mm). The following Perl program was written to extract many of the scalar values from the flora files, using regular expressions that match the formats the floras use for this kind of data:
- File:Scafind.zip (supports number + unit, number without unit, fractions).
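The sketch below shows, in Python rather than Perl, how regular expressions can cover the three cases named above (number with unit, number without unit, fractions). It is an independent illustration and not a port of Scafind; the unit list and the choice to accept a decimal comma (common in French-language text) are assumptions.

```python
import re
from fractions import Fraction
from typing import List, Optional, Tuple

# Assumed unit list; extend as needed for a given flora.
SCALAR = re.compile(
    r"(?P<value>\d+\s*/\s*\d+"        # fraction such as 1/2
    r"|\d+(?:[.,]\d+)?)"              # integer or decimal (point or comma)
    r"(?:\s*(?P<unit>mm|cm|dm|m)\b)?"  # optional unit directly after the number
)


def parse_scalars(text: str) -> List[Tuple[float, Optional[str]]]:
    """Return (value, unit) pairs; unit is None when no unit follows."""
    results = []
    for match in SCALAR.finditer(text):
        raw = match.group("value")
        if "/" in raw:
            numerator, denominator = (int(part) for part in raw.split("/"))
            if denominator == 0:
                continue  # skip malformed fractions
            value = float(Fraction(numerator, denominator))
        else:
            value = float(raw.replace(",", "."))
        results.append((value, match.group("unit")))
    return results


if __name__ == "__main__":
    print(parse_scalars("Leaves 2,5 cm long, petals 5, stamens 1/2 mm"))
```

A range such as "2-3 cm" comes out as two separate values, with the unit attached only to the second; grouping ranges would need an additional pattern.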
First run of generating the Plant Phenotype Ontology
- http://leechuck.de/plantphenotype.owl [new 19 March 14:53] (classified version: http://jagannath.pdn.cam.ac.uk/plant/flora.owl)
- First draft of the Flora Phenotype Ontology (FLOPO): http://bioportal.bioontology.org/ontologies/FLOPO
- Ontology of unclassified terms (that will need to be defined): http://jagannath.pdn.cam.ac.uk/plant/flopo-unclassified.owl
- RDF store containing taxon names and associated traits: http://jagannath.pdn.cam.ac.uk:8080/parliament/
- Taxon annotations from Flore du Gabon and Flora Malesiana to the Flora Phenotype Ontology: http://jagannath.pdn.cam.ac.uk/plant/taxon2ppo.txt
Next Steps
Besides improving, cleaning up and documenting the code, these are the tentative next steps:
- Work with Plant Ontology (PO) and Phenotypic Quality Ontology (PATO) developers to incorporate missing terms, including synonyms.
- Curate and improve the Flora Phenotype Ontology (FLOPO).
Integration with Trait Ontology
- Use EQ definitions and reuse Trait Ontology; or migrate all traits to TO and only keep the phenotypes (trait values).
Fix NLP parser
- Check out CharaParser and see whether it works better.
- Integrate terms from glossaries and vocabularies.
- Recognize sub-statements.
Flora data integration
- Use IPNI identifiers to identify taxon concepts used in the floras.
- Add the taxon concepts to an RDF store.
- Add the phenotypes for each taxon concept to the RDF store.
- Extract values and value ranges from the floras, and add the results to the RDF store.
- Attempt to extract environmental information from the floras, represent environmental traits and features using EnvO, then add them to the RDF store.
- Add geolocations for taxa to the RDF store.
- Visualize using an approach similar to http://browser.linkedgeodata.org/
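To make the planned RDF representation more concrete, here is a small sketch that serializes one taxon concept and its phenotypes as N-Triples using only the standard library. The `http://example.org/flora/` namespace, the `TaxonConcept` class, the `hasPhenotype` property, and the identifiers are all hypothetical; a real version would use IPNI-based identifiers and FLOPO class IRIs, whose exact form is not specified here.

```python
from typing import Iterable, List

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "http://example.org/flora/"  # hypothetical namespace for illustration


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets as N-Triples requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def taxon_triples(taxon_id: str, name: str, phenotype_ids: Iterable[str]) -> List[str]:
    """Build N-Triples lines for one taxon concept and its phenotypes."""
    subject = iri(EX + "taxon/" + taxon_id)
    lines = [
        f"{subject} {iri(RDF_TYPE)} {iri(EX + 'TaxonConcept')} .",
        f"{subject} {iri(RDFS_LABEL)} {literal(name)} .",
    ]
    for phenotype_id in phenotype_ids:
        phenotype = iri(EX + "phenotype/" + phenotype_id)
        lines.append(f"{subject} {iri(EX + 'hasPhenotype')} {phenotype} .")
    return lines


if __name__ == "__main__":
    print("\n".join(taxon_triples("t1", "Example species", ["p1", "p2"])))
```

Plain N-Triples output like this can be bulk-loaded into most triple stores, which keeps the sketch independent of any particular store's API.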
Other XML file formats
- Add support for different XML file formats.
- Find a better way to recognize entities, e.g. by using the sentence detector from the OpenNLP toolkit.