D3.3.1 Semantic integration of the biodiversity literature

From pro-iBiosphere
The report has been submitted to the European Commission and is also available from here.



Title: Report on state of the art and research horizons of semantic integration of biodiversity literature

Lead: MfN

Due: M16 (Dec 2013)



Report on state of the art and research horizons of semantic integration of biodiversity literature: The purpose of this document is to evaluate the state of the art in semantic integration of biodiversity literature. Building on the lessons learned from the BHL-Europe project and its global partners, and on the experiences with mark-up methodologies from Plazi and Pensoft, it will report on available tools, processes and web services and assess their suitability for semantically enhancing legacy literature and integrating these types of information into the knowledge management workflow. This evaluation will take the three identified paths for semantic enrichment into account and assess the suitability of tools and processes for each approach separately.


Biodiversity literature is being made available digitally in many ways. This has spurred the development of multiple systems, tools and associated workflows, with little initial thought given to overarching standards that would be applicable across the whole domain of formally published biodiversity research literature. This report aims to summarize the options available for turning the digital biodiversity literature as a whole into a coherent basis for semantic technologies, notably semantic search, that would be of use to the biodiversity research community and beyond.

It would be nice to have a graphical abstract too, e.g. following the scheme in this blog post or this one, or some graphical variant of [Content] ==> [Generic representation] ==> [Search], where the two arrows represent opportunities for semantification of the kind we are interested in here.


Semi-organized notes


Broader context

Purpose of this document


This part is mainly meant to be auxiliary to the writing process and likely to be largely dropped as the draft advances, using pointers instead to existing documents covering this ground. It is expected to be largely skipped by experts but to be vital for non-experts who attempt to understand the present document.


Current practice

Data sources

OCR of digitized documents
"Born digital" documents not marked up yet
"Born digital" documents marked up directly

Controlled vocabularies

A decadal view of biodiversity informatics: challenges and priorities states:

"16. Common vocabularies are the foundation for both human and machine communication (e.g. in data sharing, in automated workflows, data integration and analysis). By agreeing on a set of concepts and their definitions within a domain, a community of practice can share data and information unambiguously. Data integration and analysis critically requires semantic consistency as well as syntactic standardisation, the former being more challenging to achieve than the latter. Initially communities will accept a small controlled vocabulary - terms supported by human-readable text definitions. As terms are rarely independent of one another, the vocabulary list evolves into a thesaurus and, as formal relationships between terms are agreed, an ontology [66]. There are lessons to be learnt by looking elsewhere, for example, Google’s "Knowledge Graph" [67], the Unified Medical Language System (UMLS), medical informatics) [68], AGROVOC (agriculture) [69] and OBO (plant and animal phenotypes) stable of ontologies [70]. AGROVOC covers many of the terms relevant to biodiversity and is modular enough to be extended. There are other ontologies useful for capturing biodiversity data, such as the environment ontology, EnvO [71], and the more general DAML [72]. There is a pressing need for ontologies that span multiple communities, implying domains, and at present, such over-arching technologies do not seem to exist. Individual community ontologies tend to isolate communities rather than enable more open sharing, but community ontologies are with us now and need to be integrated. Some systems, such as UMLS are not structured to support reasoning or subsumption, so are not necessarily a good model for further development. Nevertheless, establishment of community standard terminologies and ontologies presents problems that are familiar to other communities, such as human genetics and model organism functional genomics, and some of these lessons have already been learned:

• Terminologies / ontologies need to be owned by the community but their maintenance is an ongoing requirement which requires stable funding and a degree of community coordination and interaction;

• tools that biologists find intuitive need to be developed for both data coding and analysis, making the process efficient and effectively invisible;

• ongoing terminology and syntax development need expert construction and are not just problems of computer science;

• a significant problem exists in the communication of changes in those lists to sites that consume the data and a central catalogue / source is required, such as currently provided by OBO or the NCBO (National Centre for Biomedical Ontology);

• mapping of data coded by legacy terminologies and integration of data coded by different species-specific ontologies are problems already addressed by some communities.

There is potential in semantic interoperability for biodiversity data, but this requires quite basic research and IT development to enter new paradigms supporting open semantic approaches. The provision of a strategy for transferring “legacy” data models into semantic-aware technologies is clearly desirable because existing data models are often accurate, comprehensive and represent a great deal of effort from the scientific community. We need a pragmatic strategy for mobilising this knowledge. Such mobilisation may also assist in achieving broad user acceptance, a greater problem than are the associated technical issues. Developing and applying vocabularies is clearly hard and requires the existence of persistent identifiers (paragraph 7 above) to be effective. It will require organisation and cooperation, or to put it simply, it takes goodwill but also cash."
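The progression described in the quoted passage (controlled vocabulary, then thesaurus, then ontology) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the terms, their definitions, the relation names `is_a` and `part_of`, and the helper functions `ancestors` and `related` are hypothetical and are not taken from any of the vocabularies or ontologies named above.

```python
# Illustrative only: terms, definitions and relations are made up for this sketch.

# 1. Controlled vocabulary: terms with human-readable definitions.
vocabulary = {
    "leaf": "Lateral outgrowth of a plant stem.",
    "petiole": "Stalk attaching a leaf blade to the stem.",
    "plant organ": "Structural unit of a plant.",
}

# 2. Thesaurus: the same terms plus informal broader-term links.
broader = {
    "leaf": "plant organ",
    "petiole": "leaf",
}

# 3. Ontology-like structure: typed relations between terms.
relations = {
    ("leaf", "is_a", "plant organ"),
    ("petiole", "part_of", "leaf"),
}


def ancestors(term, broader_map):
    """Follow broader-term links upwards and return them in order."""
    chain = []
    seen = {term}
    while term in broader_map:
        term = broader_map[term]
        if term in seen:  # guard against accidental cycles
            break
        seen.add(term)
        chain.append(term)
    return chain


def related(term, relation, triples):
    """Return the sorted objects linked to `term` by `relation`."""
    return sorted(o for s, r, o in triples if s == term and r == relation)
```

The thesaurus can only say that one term is broader than another, whereas the typed triples distinguish "is a kind of" from "is a part of". That distinction is what makes reasoning over the terms possible.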

Metadata standards
Data models


  • ONIX
  • ePub
  • MediaWiki

Applications Programming Interfaces


Improper documentation of standards is a common barrier for wider adoption. Proper documentation thus has to be integrated with the workflows for defining standards for the semantic integration of biodiversity literature.

  • search for examples of standards whose documentation actually follows the standard
    • LaTeX
    • something closer to biodiversity

Canonical content formats


Text Encoding Initiative


  • Metadata Encoding and Transmission Standard (METS)
  • Encoded Archival Description (EAD)
  • Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF)


Semantic integration

A decadal view of biodiversity informatics: challenges and priorities states:

"The key component needed to develop biodiversity informatics further is effective integration of the available resources, to ensure that the practice of publishing biodiversity information becomes widely adopted in the scientific community and leads to scientific synthesis."


" Existing data must either be transformed in a semantically-aware manner to conform to such standards, or software that is aware of the semantic heterogeneity must work with multiple standards. "

Before & during ingestion

In storage

Upon search

  • CoL
  • PESI
  • VIAF

Linked Open Data

Specificities of the biodiversity literature

Legacy literature

Defined as "all literature that does not comply with contemporary standards of linked open data" (Shotton, 2009).





Contemporary literature

Current state of the art

Lessons learned from BHL

Markup as used at Plazi

Markup as used at Pensoft

Pilot 1

Other examples

  • FlorML

Assessment of other existing approaches

  • Could be framed differently - e.g. infrastructure requirements; open standards, open API

Available tools


Taxonomic Literature II (TL-2) - http://www.sil.si.edu/DigialCollections/TL-2/

  • a selective guide to botanical publications and collections with dates, commentaries and types (Stafleu et al.).


Three paths for semantic enrichment

From DOW: Three viable paths for future improvement of semantic mark-up are presently recognized: (1) fully automated natural language processing (NLP), (2) base mark up complemented by automated processing and specialist correction, and (3) social crowd-sourcing models (citizen involvement).
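As a rough illustration of path (2), the following Python sketch tags names found in a reference list automatically and queues all other name-like strings for specialist review. It is a toy under stated assumptions, not the workflow used by Plazi, Pensoft or BHL: the `<taxon>` tag, the `KNOWN_NAMES` list, the naive binomial pattern and both function names are hypothetical.

```python
import re

# Hypothetical reference list; a real workflow would query a name service.
KNOWN_NAMES = {"Quercus robur", "Apis mellifera"}

# Naive pattern for binomial-like strings: capitalised word + lowercase word.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")


def mark_up(text):
    """Tag known names automatically; collect other candidates for review."""
    review_queue = []

    def tag(match):
        candidate = match.group(0)
        if candidate in KNOWN_NAMES:
            return f"<taxon>{candidate}</taxon>"
        review_queue.append(candidate)  # left untagged until a specialist decides
        return candidate

    return BINOMIAL.sub(tag, text), review_queue


def apply_corrections(text, accepted):
    """Tag the candidates that a specialist has confirmed as taxon names."""
    for name in accepted:
        text = text.replace(name, f"<taxon>{name}</taxon>")
    return text
```

A usage example: `mark_up("Galls of Andricus kollari occur on Quercus robur.")` tags the known name and returns the other candidate in the review queue. Passing the confirmed candidates to `apply_corrections` then completes the mark-up. The pattern will also flag ordinary phrases such as a capitalised sentence opener followed by a lowercase word, which is why the specialist step matters in this path.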

Fully automated natural language processing

Automated processing with specialist curation

Crowd-sourcing models

Research horizons

Linked Open Data

Rich inference




Anything relevant for a future iBiosphere that does not fit in anywhere above.


  • MS12 in February 2014

Follow-up report

See also

Other versions

List of relevant initiatives/ projects

Representatives who could be invited to the MS12 workshop. According to D2.1.2 Draft strategy of increased cooperation, project participants agree to register all biodiversity web services that are provided to other biodiversity institutions in the Biodiversity Catalogue.

  • identify and describe natural history illustrations from the digitized books and journals in BHL

see output: http://www.flickr.com/photos/biodivlibrary/sets/

  • data quality assurance and "crowd-sourcing" of content

Richard Pyle (email to Taxacom, 27 April 2013): "Whenever possible, on a ZooBank page for a new species name, we now include a link to the BHL page." Example: http://zoobank.org/NomenclaturalActs/A437F5A3-935C-4FB5-BA7D-B56E240C6D06

OCR Projects

IMPACT - http://www.impact-project.eu/

  • aims to significantly improve access to historical text and to take away the barriers that stand in the way of the mass digitisation of the European cultural heritage.

BHL/ IMPACT outcomes:

or BHL survey on OCR accuracy

Text mining in general

Useful information, ideas, articles


Paths for future improvement of semantic mark-up

Fully automated natural language processing (NLP)

Base mark up complemented by automated processing and specialist correction

  • Plazi

Social crowd-sourcing models (citizen involvement)

Special Issue on Semantics for Biodiversity, Semantic Web Journal

Open issues

  • What are the relative merits and pitfalls?
  • Which illustrations would add value to the report?
  • e.g. a biodiversity version of https://en.wikipedia.org/wiki/File:RecipeBook_XML_Example.png or of Fig. 1 of Coombs, James H.; Renear, Allen H.; DeRose, Steven J. (November 1987). "Markup systems and the future of scholarly text processing". Communications of the ACM (ACM) 30 (11): 933–947. doi:10.1145/32206.32209.