D3.3.1 Semantic integration of the biodiversity literature
- 1 About
- 2 Draft
- 3 Purpose
- 4 Abstract
- 5 Timeline
- 6 Introduction
- 7 Markup
- 7.1 Purpose
- 7.2 Current practice
- 7.2.1 Data sources
- 7.2.2 Controlled vocabularies
- 7.2.3 Serializations
- 7.2.4 Applications Programming Interfaces
- 7.2.5 Documentation
- 7.3 Canonical content formats
- 8 Semantic integration
- 9 Specificities of the biodiversity literature
- 10 Current state of the art
- 11 Assessment of other existing approaches
- 12 Three paths for semantic enrichment
- 13 Research horizons
- 14 Recommendations
- 15 Conclusions
- 16 Outlook
- 17 Workshop
- 18 Follow-up report
- 19 See also
- 19.1 Other versions
- 19.2 List of relevant initiatives/ projects
- 19.3 Useful information, ideas, articles
- 19.4 Miscellaneous
- 19.5 Paths for future improvement of semantic mark-up
- 19.6 Special Issue on Semantics for Biodiversity, Semantic Web Journal
- 20 Open issues
Title: Report on the state of the art and research horizons of semantic integration of biodiversity literature
Due: M16 (Dec 2013)
- The drafting has been moved to a Google Doc.
Report on the state of the art and research horizons of semantic integration of biodiversity literature: The purpose of this document is to evaluate the state of the art in semantic integration of biodiversity literature. Building on the lessons learned from the BHL-Europe project and its global partners, and on experience with mark-up methodologies from Plazi and Pensoft, it will report on available tools, processes and web services and assess their suitability for semantically enhancing legacy literature and integrating these types of information into the knowledge management workflow. The evaluation will take the three identified paths for semantic enrichment into account and assess the suitability of tools and processes for each approach separately.
Biodiversity literature is being made available digitally in many ways. This has spurred the development of multiple systems, tools and associated workflows, with little initial thought given to overarching standards applicable across the whole domain of formally published biodiversity research literature. This report aims to summarize the options available for turning the digital biodiversity literature as a whole into a coherent basis for semantic technologies - notably semantic search - that would be of use to the biodiversity research community and beyond.
It would be nice to have a graphical abstract too, e.g. following the scheme in this blog post or this one, or some graphical variant of [Content] ==> [Generic representation] ==> [Search], where the two arrows represent opportunities for semantification of the kind we are interested in here.
- Invitations for workshop about the recommendations of this report: September 12, 2013
- Initial outline: October 8, 2013
- Initial draft: December 9, 2013
- Comments from partners: December 11, 2013
- Final formatting: December 18, 2013
- Submission: December 19, 2013
- Workshop about the recommendations of this report: February 2014
Purpose of this document
This part is mainly meant to be auxiliary to the writing process and likely to be largely dropped as the draft advances, using pointers instead to existing documents covering this ground. It is expected to be largely skipped by experts but to be vital for non-experts who attempt to understand the present document.
OCR of digitized documents
"Born digital" documents not marked up yet
"Born digital" documents marked up directly
"16. Common vocabularies are the foundation for both human and machine communication (e.g. in data sharing, in automated workflows, data integration and analysis). By agreeing on a set of concepts and their definitions within a domain, a community of practice can share data and information unambiguously. Data integration and analysis critically requires semantic consistency as well as syntactic standardisation, the former being more challenging to achieve than the latter. Initially communities will accept a small controlled vocabulary - terms supported by human-readable text definitions. As terms are rarely independent of one another, the vocabulary list evolves into a thesaurus and, as formal relationships between terms are agreed, an ontology. There are lessons to be learnt by looking elsewhere, for example, Google’s "Knowledge Graph", the Unified Medical Language System (UMLS, medical informatics), AGROVOC (agriculture) and OBO (plant and animal phenotypes) stable of ontologies. AGROVOC covers many of the terms relevant to biodiversity and is modular enough to be extended. There are other ontologies useful for capturing biodiversity data, such as the environment ontology, EnvO, and the more general DAML. There is a pressing need for ontologies that span multiple communities, implying domains, and at present, such over-arching technologies do not seem to exist. Individual community ontologies tend to isolate communities rather than enable more open sharing, but community ontologies are with us now and need to be integrated. Some systems, such as UMLS are not structured to support reasoning or subsumption, so are not necessarily a good model for further development. Nevertheless, establishment of community standard terminologies and ontologies presents problems that are familiar to other communities, such as human genetics and model organism functional genomics, and some of these lessons have already been learned:
• Terminologies / ontologies need to be owned by the community but their maintenance is an ongoing requirement which requires stable funding and a degree of community coordination and interaction;
• tools that biologists find intuitive need to be developed for both data coding and analysis, making the process efficient and effectively invisible;
• ongoing terminology and syntax development need expert construction and are not just problems of computer science;
• a significant problem exists in the communication of changes in those lists to sites that consume the data and a central catalogue / source is required, such as currently provided by OBO or the NCBO (National Centre for Biomedical Ontology);
• mapping of data coded by legacy terminologies and integration of data coded by different species-specific ontologies are problems already addressed by some communities.
There is potential in semantic interoperability for biodiversity data, but this requires quite basic research and IT development to enter new paradigms supporting open semantic approaches. The provision of a strategy for transferring “legacy” data models into semantic-aware technologies is clearly desirable because existing data models are often accurate, comprehensive and represent a great deal of effort from the scientific community. We need a pragmatic strategy for mobilising this knowledge. Such mobilisation may also assist in achieving broad user acceptance, a greater problem than are the associated technical issues. Developing and applying vocabularies is clearly hard and requires the existence of persistent identifiers (paragraph 7 above) to be effective. It will require organisation and cooperation, or to put it simply, it takes goodwill but also cash."
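The progression described in the quotation, from a controlled vocabulary to a thesaurus and then towards an ontology, can be illustrated with a minimal Python sketch. The terms, definitions and `broader` links are invented for illustration and are not taken from AGROVOC, EnvO or any other real vocabulary. The hypothetical `is_a` function shows only the simplest form of subsumption, namely following broader-term links transitively.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One vocabulary entry: a label plus a human-readable definition."""
    label: str
    definition: str
    # Thesaurus step: labels of more general terms (hypothetical data).
    broader: set = field(default_factory=set)


# Step 1: a small controlled vocabulary (illustrative terms only).
vocabulary = {
    "habitat": Term("habitat", "Place where an organism lives."),
    "forest": Term("forest", "Habitat dominated by trees."),
    "rainforest": Term("rainforest", "Forest with high annual rainfall."),
}

# Step 2: agreeing on relationships turns the flat list into a thesaurus.
vocabulary["forest"].broader.add("habitat")
vocabulary["rainforest"].broader.add("forest")


def is_a(term: str, ancestor: str) -> bool:
    """Step 3: simple subsumption by following 'broader' links transitively."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(vocabulary[current].broader)
    return False


print(is_a("rainforest", "habitat"))  # True
print(is_a("habitat", "forest"))      # False
```

A search for "habitat" could use such a check to also retrieve passages annotated with "rainforest". A real ontology would add typed relations and formal axioms that this sketch leaves out.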
Applications Programming Interfaces
Poor documentation of standards is a common barrier to wider adoption. Proper documentation therefore has to be integrated into the workflows for defining standards for the semantic integration of biodiversity literature.
Canonical content formats
Text Encoding Initiative
METS/ EAD/ EAC-CPF
"The key component needed to develop biodiversity informatics further is effective integration of the available resources, to ensure that the practice of publishing biodiversity information becomes widely adopted in the scientific community and leads to scientific synthesis."
"Existing data must either be transformed in a semantically-aware manner to conform to such standards, or software that is aware of the semantic heterogeneity must work with multiple standards."
Before & during ingestion
Linked Open Data
Specificities of the biodiversity literature
Defined as "all literature that does not comply with contemporary standards of linked open data" (Shotton, 2009).
Current state of the art
Lessons learned from BHL
Markup as used at Plazi
Markup as used at Pensoft
Assessment of other existing approaches
Taxonomic Literature II (TL-2) - http://www.sil.si.edu/DigialCollections/TL-2/
Three paths for semantic enrichment
From DOW: Three viable paths for future improvement of semantic mark-up are presently recognized: (1) fully automated natural language processing (NLP), (2) base mark up complemented by automated processing and specialist correction, and (3) social crowd-sourcing models (citizen involvement).
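How the second path combines automation with specialist correction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by Plazi, Pensoft or BHL: the regular expression is a deliberately crude heuristic for Latin binomials, the `known` reference list is invented, and the function names `find_candidates` and `triage` are hypothetical.

```python
import re

# Crude heuristic for Latin binomials: capitalised genus + lower-case epithet.
# Illustrative assumption only, not a production name-recognition method.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_candidates(text: str) -> list:
    """Automated step: return every 'Genus epithet' lookalike in the text."""
    return [" ".join(match.groups()) for match in BINOMIAL.finditer(text)]


def triage(candidates: list, known_names: set) -> tuple:
    """Split candidates into accepted names and a queue for human review."""
    accepted = [name for name in candidates if name in known_names]
    needs_review = [name for name in candidates if name not in known_names]
    return accepted, needs_review


# Hypothetical reference list; a real workflow would query a name service.
known = {"Apis mellifera", "Bombus terrestris"}
sample = "Workers visited the hive of Apis mellifera. Later Bombus terrestris arrived."

accepted, needs_review = triage(find_candidates(sample), known)
print(accepted)      # ['Apis mellifera', 'Bombus terrestris']
print(needs_review)  # ['Workers visited']
```

The false positive "Workers visited" shows why fully automated processing alone is risky. In this sketch the review queue is where a specialist would step in under path (2), or where volunteers could be asked to confirm or reject candidates under path (3).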
Fully automated natural language processing
Automated processing with specialist curation
Linked Open Data
Anything relevant for a future iBiosphere that does not fit in anywhere above.
- MS12 in February 2014
- D3.3.2, April 2014
- the current version
- 1st draft by H. Scholz Dec 2012
List of relevant initiatives/ projects
Representatives to be potentially invited to the MS12 workshop. According to D2.1.2 Draft strategy of increased cooperation, project participants agree to register all biodiversity web services that are provided to other Biodiversity institutions in the Biodiversity Catalogue.
- Biodiversity Heritage Library:
- CiteBank - http://citebank.org/
- BHLUS - http://www.biodiversitylibrary.org/
- BHL-Europe - http://www.bhl-europe.eu/
- National Nodes
- Encyclopedia of Life (EOL) - http://eol.org/
- BHL “Art of Life” - http://biodivlib.wikispaces.com/Art+of+Life
- identify and describe natural history illustrations from the digitized books and journals in BHL
see output: http://www.flickr.com/photos/biodivlibrary/sets/
- ZooBank - http://zoobank.org/
- data quality assurance and "crowd-sourcing" of content
Richard Pyle (Email to Taxacom 27 April 2013) "Whenever possible, on a ZooBank page for a new species name, we now include a link to the BHL page" Example: http://zoobank.org/NomenclaturalActs/A437F5A3-935C-4FB5-BA7D-B56E240C6D06
IMPACT - http://www.impact-project.eu/
- aims to significantly improve access to historical text and to take away the barriers that stand in the way of the mass digitisation of the European cultural heritage.
BHL/ IMPACT outcomes:
or BHL survey on OCR accuracy
Text mining in general
- Fraunhofer Information Management and Production Control (ILT): http://www.iosb.fraunhofer.de/servlet/is/18352/
- Text Mining Solutions - http://www.textminingsolutions.co.uk
Useful information, ideas, articles
- Rod Page http://iphylo.blogspot.de/2012/06/bhl-and-text-mining-some-ideas.html
- BHL Gaming http://biodivlib.wikispaces.com/BHL+and+Gaming
- Literature-driven Curation for Taxonomic Name Databases
- Similar list on Wikipedia
- Ryan Schenk - http://bionames.org/
- BioNames tackles the link between a name and its publication
- Rod Page Blog post: http://iphylo.blogspot.co.uk/2013/05/bionames-now-live-report-on-project.html
- Taxonomy as impediment: synonymy and its impact on the Global Biodiversity Information Facility's database
- Wei et al. (2010), Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL)
- GBIF blog post on validating scientific names: http://gbif.blogspot.de/2013/07/validating-scientific-names-with.html
- Semantic MEDLINE
- Semantic techniques in libraries
- The BioCASe Monitor Service - A tool for monitoring progress and quality of data provision through distributed data networks
- Open Access to Biodiversity Scientific Data: A Comparative Study
Paths for future improvement of semantic mark-up
Fully automated natural language processing (NLP)
- Example: Danish health records (PLOS CB/ Nat Rev Gen)
- Issue: English vs. non-English
- machine translation
- example: Moses
Base mark up complemented by automated processing and specialist correction
Social crowd-sourcing models (citizen involvement)
- Old Weather
Special Issue on Semantics for Biodiversity, Semantic Web Journal
- Call for papers - deadline: January 24, 2014.
- What are the relative merits and pitfalls
- Which illustrations would add value to the report?
- e.g. a biodiversity version of https://en.wikipedia.org/wiki/File:RecipeBook_XML_Example.png or of Fig. 1 of Coombs, James H.; Renear, Allen H.; DeRose, Steven J. (November 1987). "Markup systems and the future of scholarly text processing". Communications of the ACM (ACM) 30 (11): 933–947. doi:10.1145/32206.32209.
- Should we dwell on OCR of digitized documents, or is this covered in other piB documents?
- Should we only consider the classical literature or also things like notebooks or specimen tags?
- Need good examples for the paths for future improvement of semantic mark-up, ideally from the biodiversity literature