Data enrichment hackathon, March 17-21 2014/Use cases

From pro-iBiosphere Wiki

This page collects use cases for the Data enrichment hackathon, March 17-21 2014.


This page gives examples of data sources to be used at the hackathon. Feel free to add your own.


OCR correction and markup of literature

--Rod Page (talk) 15:26, 5 March 2014 (CET) BHL is a treasure trove of information, but extracting information from it can be problematic due to OCR errors in the text. Can we build a simple tool to edit the OCR and store the edits so that text-mining tools can do a better job of indexing the text? I've done some work on converting BioStor articles to other formats, including HTML that is editable (see Towards BioStor articles marked up using Journal Archiving Tag Set).

There is a live example of editable HTML here: (more at ). This example has no means to store edits, but it would not be too difficult to bolt on something like PouchDB and have an in-browser database that stores edits and can be synced to an external CouchDB database.
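As a sketch of what the edit store might look like: each correction could be a small JSON document pushed to CouchDB over its HTTP API. The document schema, database name and example values below are hypothetical, and `push_edit` assumes a CouchDB instance is already running with the database created (it is not called here).

```python
import json
import time
import urllib.request

def make_edit_doc(page_id, original, corrected, editor):
    """Build a JSON document recording one OCR correction.
    The schema (type/page/original/corrected/editor/timestamp) is a
    made-up example, not an existing BioStor or BHL format."""
    return {
        "type": "ocr_edit",
        "page": page_id,
        "original": original,
        "corrected": corrected,
        "editor": editor,
        "timestamp": int(time.time()),
    }

def push_edit(doc, db_url="http://localhost:5984/ocr_edits"):
    """POST the edit document to a CouchDB database over its HTTP API.
    Requires a running CouchDB; not invoked in this sketch."""
    req = urllib.request.Request(
        db_url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(req))

doc = make_edit_doc("biostor-12345-p3", "Pinns sylvestris",
                    "Pinus sylvestris", "rdmpage")
print(doc["page"], doc["corrected"])
```

Because each edit is just a document, PouchDB in the browser could accumulate them offline and replicate to the same CouchDB database later.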


--David Shorthouse BHL has been awarded funds, with its partners, to come up with solutions to clean its OCR, so it would be worth understanding what progress they've already made and what their challenges are, rather than running the risk of reinventing solutions they may already have.

--Rod Page (talk) 05:26, 11 March 2014 (CET) A reasonable point, but I've never found waiting for others to solve a problem to be a satisfying solution ;)

--David King (talk) Hi Rod, a couple of thoughts.


See the ComTax Scratchpad page for an introduction, then follow the link to the OU website for the project, or go straight to the sample validation pages and try it for yourself. The front-end is straightforward HTML, JavaScript, etc., with Java behind the scenes talking - unfortunately - to an IBM Notes database for persistent storage of corrections. I'm not sure what would be involved in moving it off a proprietary back-end. Something to consider, maybe?

--Rod Page (talk) 05:26, 11 March 2014 (CET) Interesting. IBM Notes!? I'm a fan of CouchDB, which makes storing structured data trivial (the original developer of CouchDB, Damian Katz, worked on Lotus Notes).
--David King Had I been the architect for ComTax, it would never have used Notes, which is not the optimal storage solution for its needs. CouchDB is a good document database, providing an easy-to-code back-end storage solution for relatively static applications with known queries. Horses for courses, as ever. Let's see what our requirements are.


VARD (the VARiant Detector) is from the University of Lancaster. Originally written to work with spelling variants in Early English Books Online documents, it has since been adapted and used with other languages and in other contexts, for example tweets. VARD is free for academic use, and given a new dictionary and some training could work with biodiversity literature.

--Rod Page (talk) 05:26, 11 March 2014 (CET) It's gotta be web-based to be useful IMHO.

Simple data export for quality control and visualization

--Jeremy Miller 10:00, 12 March 2014 (CET)

The idea is simply to extract structured data from treatments in a simple, spreadsheet-friendly format for the purposes of quality control and visualization. I have made a list of the desired fields and selected several example documents that have been marked up using GoldenGATE. I've organized the fields into two tables, one for authors and institutions, and the other for data (there can be many authors/institutions per publication). I can provide a small amount of real data to illustrate the desired output.

With data structured in this way, several forms of visualization/dashboarding are possible. The first step in this process would be to produce a series of charts, maps and lists that deliver, at a glance, a wealth of information about the data underlying any individual treatment. These would include:

  • proportions of male and female specimens;
  • the proportion of specimens that come from major collections institutions;
  • frequency of specimens by collecting decade;
  • frequency of specimens by collecting month, subdivided into categories (males, females, workers, queens, etc.);
  • a list of countries;
  • a map of georeferenced data points.

Visualizations based on sets of treatments could follow proportions of specimens from major institutions as published in a taxonomic journal over time, or follow the activity of a particular collector over space and time (note that this will require a disambiguation step to resolve multiple spellings of collector names on labels, and to deal with occurrence records credited to multiple collectors). Ultimately, the goal is a kind of reverse Biodiversity Data Journal: resurrect the primary data from legacy literature for aggregation and reanalysis.
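A minimal sketch of the export and summary step, using toy occurrence records (the field names are illustrative, not the actual GoldenGATE schema). Counts like these could feed the charts described above, and the CSV writer gives the spreadsheet-friendly output:

```python
import csv
import io
from collections import Counter

# Toy occurrence records as they might come out of marked-up treatments;
# field names and values are invented for illustration.
records = [
    {"sex": "male", "institution": "RMNH", "year": "1913", "country": "Indonesia"},
    {"sex": "female", "institution": "RMNH", "year": "1921", "country": "Indonesia"},
    {"sex": "female", "institution": "BMNH", "year": "1958", "country": "Malaysia"},
]

# At-a-glance summaries: sex ratio, collecting decade, country list.
sexes = Counter(r["sex"] for r in records)
decades = Counter(r["year"][:3] + "0s" for r in records)
countries = sorted({r["country"] for r in records})

# Spreadsheet-friendly export: one CSV per table (here written to a buffer).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sex", "institution", "year", "country"])
writer.writeheader()
writer.writerows(records)

print(sexes, dict(decades), countries)
```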


Perhaps consider using R for this? Thomas D. Hamann (talk) 12:51, 12 March 2014 (CET)

Automated extraction of trait data from descriptive text

Provided by Don Kirkup, presented by Quentin Groom

Traits Task Group

Using grammatical and positional clues we could explore ways to extract trait data from taxon treatments. Using Kew's and Meise's African floras it should be possible to extract traits for a large number of African species. It will also be interesting to compare the results from different floras.
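As a toy illustration of "grammatical and positional clues", a single hand-written rule for one trait might look like this. The pattern and the sample sentence are invented; a real extractor for the Kew and Meise floras would need many such rules plus positional context:

```python
import re

# One crude rule: leaf length ranges such as "leaves 3-7 cm long".
LEAF_LENGTH = re.compile(
    r"leaves\s+(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*(mm|cm)\s+long",
    re.IGNORECASE,
)

def extract_leaf_length(description):
    """Return a structured trait record, or None if the rule doesn't fire."""
    m = LEAF_LENGTH.search(description)
    if not m:
        return None
    low, high, unit = m.groups()
    return {"trait": "leaf length", "min": float(low), "max": float(high), "unit": unit}

trait = extract_leaf_length("Shrub to 2 m; leaves 3-7 cm long, elliptic.")
print(trait)
```

Running the same rule set over treatments from different floras would make the comparison mentioned above straightforward: each flora yields a table of (taxon, trait, min, max, unit) rows.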


We could also try to do this cross-language using the Flore du Gabon files. Thomas D. Hamann (talk) 12:39, 12 March 2014 (CET)

Concept expansion from Taxonomic Names mentioned in literature

--Kevin Richards A concept expansion tool that can scan literature text and find names using all known synonyms of that name would be useful. This could expand into things like finding relationships between organisms using the set of synonyms for the names (e.g. hosts of a fungus), or relationships between names and localities using synonyms, e.g. to see if an organism is present in a certain country. Another useful tool would be one that could take a web page and highlight the names on the page and "display" other names relevant to that name.
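A minimal sketch of the expansion idea: search text under all known synonyms of an accepted name. The synonym table here is a hard-coded stand-in for what would really come from a names service or nomenclator, and the example synonym pair is illustrative:

```python
# Hypothetical synonym table; in practice this would be populated from a
# names service (e.g. Global Names) or a nomenclator.
SYNONYMS = {
    "Puccinia graminis": ["Puccinia graminis", "Dicaeoma anthistiriae"],
}

def find_concept_mentions(text, accepted_name):
    """Return (matched_string, position) pairs for any synonym of the name."""
    hits = []
    for name in SYNONYMS.get(accepted_name, [accepted_name]):
        pos = text.find(name)
        if pos != -1:
            hits.append((name, pos))
    return hits

text = "Dicaeoma anthistiriae was recorded on wheat in 1921."
print(find_concept_mentions(text, "Puccinia graminis"))
```

The same expansion step could then drive the relationship queries Kevin describes (hosts, localities), since a hit under any synonym counts as a mention of the same concept.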


--David Shorthouse (talk) 13:30, 6 March 2014 (CET) This might be a start to what you're looking for Kevin, Source is here:

Related use case

Provided by Jerry Cooper, Landcare Research, New Zealand

In NZ we are testing the idea of augmenting our national 8 km grid based biodiversity monitoring program by doing next-generation sequencing on environmental samples from the same plots. The data for each run comes back as a humungous file with perhaps 100,000 sequences up to 1,000 base pairs in length. There are a whole bunch of analysis pipelines out there to deal with the data. I use a mixture of QIIME and UPARSE, which are mixtures of Unix apps and Python scripts. The scripts do things like:

  1. Take out rubbish sequences.
  2. Find clusters of similar sequences – Operational Taxonomic Units (OTUs)
  3. Assign a representative sequence to each OTU.
  4. Assign names and higher taxonomy to each OTU based on matches to sequences in GenBank.
  5. Use name/taxonomy to say something about biodiversity ...

It’s step 4 that needs most attention.

Step 4, ‘assign taxonomy’, is done in one of two ways, and neither works well. The first is to take each sequence and use the NCBI web services to BLAST it against GenBank, looking for the highest match. The associated name, the degree of match, and the NCBI higher taxonomy are returned and integrated into the data.

The second approach does the same job but locally, using a pre-compiled, selected set of GenBank sequences, e.g. the RDP classifier or the UNITE classifier for fungi.

Either way, the quality of the result is influenced by the names and taxonomy in GenBank. It is curated, and so are the selected subsets, but it is still quite often wrong or out of date, especially for the higher taxonomy of microbial groups and synonyms of fungi.

What is actually needed is this:

  • Take each sequence and BLAST it against GenBank.
  • Get the species name associated with the highest match (and the next 10 different matches that are real sequences, not environmental sequences).
  • Use CoL services to find and validate the species name, resolve synonyms and find the CoL higher taxonomy.
  • Look at the rest of the first ten matches to ensure the higher taxonomy is consistent, and assign a ‘quality’ score.
  • Build up a table of original OTU, BLAST match, CoL-resolved match, CoL higher taxonomy, and higher-taxonomy quality score.
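The resolution and quality-scoring steps could be sketched as below. The `RESOLVER` dict is a local stand-in for the CoL name services, the hit list stands in for parsed BLAST output, and all names, families and identity values are illustrative:

```python
from collections import Counter

# Stand-in for CoL services: maps a hit name to (accepted name, family).
RESOLVER = {
    "Amanita muscaria": ("Amanita muscaria", "Amanitaceae"),
    "Agaricus muscarius": ("Amanita muscaria", "Amanitaceae"),
    "Russula emetica": ("Russula emetica", "Russulaceae"),
}

def assign_taxonomy(otu_id, hits):
    """Given BLAST hits for one OTU as (name, percent identity) pairs,
    best first, assign an accepted name and a consistency score."""
    resolved = [RESOLVER.get(name, (name, "unknown"))[0] for name, _ in hits]
    families = [RESOLVER.get(name, (name, "unknown"))[1] for name, _ in hits]
    # Quality: fraction of hits whose resolved family agrees with the best hit.
    agree = Counter(families)[families[0]] / len(families)
    return {"otu": otu_id, "name": resolved[0],
            "family": families[0], "quality": agree}

hits = [("Agaricus muscarius", 99.1), ("Amanita muscaria", 98.7),
        ("Russula emetica", 91.0)]
row = assign_taxonomy("OTU_0001", hits)
print(row)
```

Note the synonym resolution in action: the best hit is under the synonym "Agaricus muscarius", but the OTU is reported under the accepted name, and the one discordant family lowers the quality score.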

The NCBI blast services are here:

--Rod Page (talk) 03:47, 12 March 2014 (CET) What happens if the NCBI and CoL taxonomies don't agree? Why would I accept CoL over NCBI (or vice versa, for that matter)? Would someone working on eukaryote phylogeny regard CoL as a usable classification? Can you use NCBI to help resolve synonyms (e.g., similar sequences with different names could be unrecognised synonyms)? Isn't the more general issue identifying and resolving conflicts between sequence data and taxonomies?

Related use case

Provided by David Patterson, Plazi

Develop requirements for production-level Global Names Reconciliation and Resolution services. These services address the challenge of integrating/normalizing records with different names and different spellings of names. Prototype tools are in place, but Global Names now needs input as to the next developments. Jerry's suggestions could take us a long way down the path.

Let us divide services into four levels: proof-of-concept (developed locally, often by an individual, to show that 'it can be done'); prototype (still usually local, available as a service, but not robust under pressure or with edge cases); production services (work under most situations, roughly 99% performance); and flawless services. The relevant Global Names services are at the proof-of-concept and prototype levels. So, what investment is needed to turn them into production services that meet all reasonable expectations?

Transcribing specimen labels with the help of published literature

The basic idea (taken from Dimitris Koureas, NHM) is to use catalogue publications like

to try to identify ways in which text of specimen labels could be transcribed in an automated or semi-automated fashion.

Such catalogue publications contain a lot of label texts that have already been transcribed, so the specimen labels covered therein do not have to be digitized the classical way (i.e. page scans plus OCR). We have no idea yet how feasible this is, how many labels have been covered this way, and whether a label text corpus extracted from the literature could be useful for transcribing labels not covered in the literature.

It would be nice if we could address these issues in some basic fashion during the hackathon. For that, I would need support from someone who can write some relevant scripts quickly (I am rather slow at such things, as I am not really fluent in any programming language any more). I have a preference for these scripts to be in Python, but any language would do. --Daniel Mietchen (talk) 10:54, 3 March 2014 (CET)

BRIT (Botanical Research Institute of Texas): Apiary project:


--Rod Page (talk) 15:02, 5 March 2014 (CET) Just to check, do you mean the ability to take some text and extract specimen codes? For example, like this tool I put together for a phyloinformatics course in Glasgow (based on code I use to extract museum codes from BHL literature as part of BioStor).
Hi Rod, thanks - I just tried it with the full text of , and that's a good start.
What we are after, though, is the text of the labels that belong to these specimen codes, and the question is whether mining the literature mentioning those labels might be useful in any way for getting that text, i.e. without scanning and OCR.
If you could parse lists of "material examined" and get a specimen code and associated metadata (such as collection date, locality, etc.), that would be useful along the way (and in other contexts), but the core of what we're after here is the text written on the specimen labels, so as to circumvent the need for classical digitization. However, we have no idea what kinds or percentages of specimen label texts are recoverable this way, and the parameters you listed may be useful to provide an estimate.
The IMPACT project looked at digitising and recognising old documents - see
--Daniel Mietchen (talk) 15:44, 5 March 2014 (CET)
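One piece of this that could be prototyped at the hackathon is a parser for "material examined" lists. A toy version for one rigid pattern follows; the regex and the sample record are invented for illustration, and real entries are far messier:

```python
import re

# One common shape: "COUNTRY: locality, date, Collector number (CODE)".
RECORD = re.compile(
    r"(?P<country>[A-Z]+):\s*(?P<locality>[^,]+),\s*"
    r"(?P<date>\d{1,2}\s+\w+\.?\s+\d{4}),\s*"
    r"(?P<collector>[A-Za-z. ]+?)\s+(?P<number>\d+)\s*\((?P<code>[A-Z]+)\)"
)

def parse_material_examined(text):
    """Return one dict per record matched in a 'material examined' block."""
    return [m.groupdict() for m in RECORD.finditer(text)]

sample = "INDONESIA: Mt. Gede, 12 Mar. 1921, Jacobson 412 (RMNH)."
rows = parse_material_examined(sample)
print(rows[0])
```

Each row pairs a specimen code with the transcribed label fields, which is exactly the corpus that could then be compared against (or substituted for) classical label digitization.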
--David Shorthouse iDigBio ran a hackathon in Feb 2013 on OCR of specimen labels


Quentin Groom (talk) 09:24, 4 March 2014 (CET) The digitised Flore d'Afrique Centrale contains place name, collector and collection number (72,769 specimens) of specimens mainly kept in the BR herbarium. These can be matched with the already digitised specimens in the BR herbarium catalogue (276,000 specimens).

  • We should be able to enrich the herbarium catalogue with data from the Flora where the specimen is already in the catalogue.
  • We should be able to identify herbarium specimens that have not yet been catalogued, and add information from the Flora to the herbarium catalogue as dummy records to speed up future transcription of labels.
  • We could link specimens in other herbaria to these data to verify and enrich each other's datasets.

Flora Zambesiaca includes distribution statements for taxa, referenced to a specimen. We could link these to our own specimens and to those held elsewhere: a basic match configuration made c. 1200 links. Benefits: navigation from specimen to (very rich) descriptive information.
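A sketch of the Flora-to-catalogue matching step, keying both datasets on (collector, collection number). The field names and sample records are invented, and real matching would also need normalisation of collector name spellings:

```python
# Toy stand-ins for Flore d'Afrique Centrale records and the BR catalogue.
flora = [
    {"collector": "Lebrun", "number": "9284", "place": "Kivu", "taxon": "Ficus sp."},
    {"collector": "Callens", "number": "9999", "place": "Kasai", "taxon": "Cyperus sp."},
]
catalogue = [
    {"collector": "Lebrun", "number": "9284", "barcode": "BR0000123"},
    {"collector": "Callens", "number": "3311", "barcode": "BR0000456"},
]

# Index Flora records by (collector, number) and merge into matching
# catalogue records.
index = {(r["collector"], r["number"]): r for r in flora}
enriched = []
for rec in catalogue:
    match = index.pop((rec["collector"], rec["number"]), None)
    if match:
        enriched.append({**rec, "place": match["place"], "taxon": match["taxon"]})

# Flora records with no catalogue entry become dummy records to seed
# future label transcription.
dummies = [{"barcode": None, **r} for r in index.values()]
print(len(enriched), len(dummies))
```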

Automated parsing and linking of abbreviated taxonomic literature references

Provided by Rod Page

Legacy taxonomic literature uses highly abbreviated citations that cannot be resolved by parsers built for modern citations. Nevertheless, these references are important evidence for taxonomic concepts, and their usefulness would be considerably enhanced if they were linked to digital libraries such as BHL. We could examine methods to parse and link these citations. This will require examining their structure and format in different publications, and deciphering them. There are various resources to help, including IPNI, BHL and institutional databases.
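As a toy illustration, a parser for one common micro-citation shape (abbreviated title, volume: page, year) might look like this. The pattern is a deliberate simplification of the many formats found in practice:

```python
import re

# Matches e.g. "Sp. Pl. 1: 219. 1753" - abbreviated work, volume: page. year.
MICRO = re.compile(
    r"(?P<journal>[A-Z][\w. ]+?)\s+(?P<volume>\d+):\s*(?P<page>\d+)\.\s*(?P<year>\d{4})"
)

def parse_micro_citation(s):
    """Return the parts of the first micro citation found, or None."""
    m = MICRO.search(s)
    return m.groupdict() if m else None

cite = parse_micro_citation("Chenopodium album L., Sp. Pl. 1: 219. 1753")
print(cite)
```

The parsed (work, volume, page, year) tuple is what would then be matched against BHL's title/item/page metadata to produce a link to the cited page.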


--Rod Page (talk) 15:16, 5 March 2014 (CET) I've built a series of tools to handle various citations, as well as linking them to external resources such as BHL, BioStor, DOIs, JSTOR, CiNii, etc. Often I've ended up having to develop ad hoc solutions, partly because every bibliographic data source has its own API, and partly because citation formats are variable and may bear little relation to the complete bibliographic citation. Mapping complete article citations is, perhaps, easier, and this is what I've been doing for animal names in BioNames. Plant citations tend to be "micro citations", where a page within a larger publication is cited (see Nomenclators + digitised literature = fail). This poses additional challenges, especially if trying to link to articles in databases such as CrossRef. It would be nice to have a single interface that wrapped citation parsing (both full and micro) and offered a way to check the results (e.g., by displaying possible matches from BHL). If the search is for a taxonomic citation (e.g., an IPNI name) we could use the name as an additional check. I've used this approach for Nomenclator Zoologicus; see Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature


Strings to things, things to strings, things to things

--Rod Page (talk) 03:41, 12 March 2014 (CET) A lot of data mining and linking relies on extracting things from strings (such as taxonomic names, or identifiers such as DOIs), or from substrings of strings (names, specimen codes, accession numbers, etc. within a body of text). Many things may have multiple string representations (e.g., all the different ways of writing a bibliographic citation, or a taxonomic name), and things may have relationships to other things (links, part-whole relationships, etc.). Imagine a service that can take a string and return the corresponding thing(s), or can take a thing and generate the corresponding string representations. We have some services for parts of this (e.g., taxonomic name finding, citation matching, georeferencing tools), but imagine a generic tool that did this. Imagine that it was backed up by a data store that enabled users to annotate the strings and things (e.g., point out errors in the metadata for a DNA sequence with a given accession number, or alternative spellings of a name, etc.). Imagine that this data store also recorded occurrences of strings (e.g., if a user submits a paper for text mining, the results of that mining are stored, linked to that paper). Essentially we would be building a universal biodiversity indexer and database. Something like Freebase or Wikipedia, where each thing had a "page" and each string representation pointed to that page. Instead of simply a bunch of ad hoc text-mining tools, we could build a tool of lasting value to underpin all the data enrichment tasks our community undertakes.
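A minimal sketch of the "strings to things" interface: pluggable extractors that each turn substrings into typed things. The two extractors here (DOI-like strings and binomial-looking names) are deliberately crude illustrations, and in the imagined service each would be backed by a real resolver:

```python
import re

# Each extractor maps a "thing" type to a pattern that finds candidate strings.
EXTRACTORS = {
    "doi": re.compile(r"10\.\d{4,9}/\S+"),
    "taxon_name": re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b"),
}

def strings_to_things(text):
    """Return typed things found in text, with the substring and its span,
    ready to be annotated or linked in a backing data store."""
    things = []
    for kind, pattern in EXTRACTORS.items():
        for m in pattern.finditer(text):
            things.append({"type": kind, "string": m.group(), "span": m.span()})
    return things

text = "Pinus sylvestris was treated in 10.1234/example.5678 last year."
found = strings_to_things(text)
print(found)
```

The reverse direction (things to strings) would hang off the same registry: each thing type carries formatters that emit its known string representations.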

--David Shorthouse Here's a collection of NLP APIs that might afford something toward what you have in mind:

Annotation tools to link specimens with relations expressing identity

--Peter Hovenkamp (talk) 15:45, 10 March 2014 (CET)

Collections (specimens) can be linked by relations expressing various sorts of identity, but potentially also by relations expressing conspecificity. To the best of my knowledge, this relation is not yet defined. What properties should it have to allow a feature-rich system where taxonomists collaborate online on the same group using partly overlapping sets of specimens? Is such a system at all feasible?

This use case has links with use cases 7, 8, 9 and 15, that all refer to links between specimens, and perhaps also with use case 12.

The ideas behind it are more fully explained in my presentation:

Annotation tools to annotate records

Provided by Jeremy Miller - annotate records with geographic coordinates not in the original; distinguish records with coordinates quoted in the original from those annotated secondarily (provide totals for either or both, perhaps plotted in different colors).

Connecting EDIT Platform instances to BioVeL Workflows

Provided by Patricia Kelbert

Equip the EDIT platform with additional access points for collection and observational data so that researchers using the platform can link up their data directly to BioVeL workflows. This will open up a range of taxonomic research data repositories to BioVeL workflows. During the hackathon a demonstrator will be implemented using the example of Quentin Groom's pro-iBiosphere Chenopodium Pilot.


Alan Williams: This can build upon similar work done on linking Scratchpads to a Taverna Player to run BioVeL workflows.

Running workflows from within an IPython Notebook

Provided by Alan Williams and Aleks Pawlik - create Python code to run workflows from within an IPython notebook. Notebooks are a common way for scientists to develop their research. This task will allow workflows to be called from within notebooks, using data defined in the notebook and returning data to the notebook.

Discussion at Data_enrichment_hackathon,_March_17-21_2014/IPython_Taverna

Improvement of Wikimedia pages with information about the pro-iBiosphere pilot taxa – Daniel Mietchen, MfN (Note: the presentation will consist of a brief announcement. The topic will be further explained/covered in the Wikimedia workshop on Tuesday)

Contingent on Media Library API deployment: development of image harvesting client

Image collections are being compiled by Naturalis' FES Collection Digitisation (FCD) activity, which generates an enormous number of digital images of collection specimens. While the high-resolution images produced by FCD are stored elsewhere, lower-resolution versions of these images will become available through the application programming interface (API) of the Naturalis Media Library. The development of an API through which images can be queried and harvested opens up a wealth of raw data to which digital phenotyping, and subsequent training of neural networks, may be applied. Although this API is presently under development, we note that its design will no doubt be enhanced by the development of client-side code that actively tests the usability of the interfaces. We therefore propose that reference implementations of clients to this API are developed and submitted to the source code repository.

Serrano Pereira (talk) 12:49, 13 March 2014 (CET)

Implementing a web service to add metadata to, and extract metadata from, a NeXML file

See also: NeXML_Services

Provided by Bachir Balech - a web service able to bundle and/or read a NeXML file, offering the possibility to extract information from it, namely the multiple sequence alignment or the phylogenetic tree and their corresponding annotations.

1- The service in "bundle" mode allows clients to upload data in a number of formats (e.g. PHYLIP, FASTA, NEXUS, Newick, PhyloXML), which are then merged into NeXML. The data can be a multiple sequence alignment and/or a phylogenetic tree, plus metadata relating to either of these two items. Metadata can be of different kinds: geographical origin, sample type, columns of the alignment (such as "charset" in NEXUS lingo, indicating the partitions of the multiple sequence alignment), etc. Metadata can be provided, for example, as a simple JSON data structure using CURIEs and values, to be folded into the NeXML. Example of charset metadata showing the data partitions of a multiple sequence alignment: suppose you have a multiple alignment of 600 sites. Its partitions in NEXUS format look like: utr5 = 1-30; cds_pos1 = 30-200\3 400-500\3; cds_pos2 = 31-200\3 399-500\3; cds_pos3 = 32-200\3 398-500\3; intron = 200-397; utr3 = 501-600;

2- The service in "read" mode will take a NeXML file and give the list of objects that can be extracted from it. These objects can be: the multiple sequence alignment, the phylogenetic tree and the taxa. All metadata associated with the above-mentioned objects can be exported together with the object itself.

3- The service in "extract" mode: given a NeXML file as input, the user can extract the multiple alignment, the phylogenetic tree and/or the taxon names, with their corresponding metadata (preferably in .csv format). The aim of this last function is to visualize the output using iTOL tools (for the phylogenetic tree) and to display an annotated multiple alignment.
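For the charset metadata described in point 1, one possible JSON shape is sketched below. The "nex:" CURIE prefix and the key names are invented for illustration, not an existing standard; each partition is a list of [start, stop, step] triples mirroring the NEXUS `\3` codon-position notation:

```python
import json

charsets = {
    "nex:charset": [
        {"label": "utr5",     "ranges": [[1, 30, 1]]},
        {"label": "cds_pos1", "ranges": [[30, 200, 3], [400, 500, 3]]},
        {"label": "cds_pos2", "ranges": [[31, 200, 3], [399, 500, 3]]},
        {"label": "cds_pos3", "ranges": [[32, 200, 3], [398, 500, 3]]},
        {"label": "intron",   "ranges": [[200, 397, 1]]},
        {"label": "utr3",     "ranges": [[501, 600, 1]]},
    ]
}

def expand(ranges):
    """Expand [start, stop, step] triples into explicit 1-based site lists,
    as a consumer of the NeXML annotation might do."""
    sites = []
    for start, stop, step in ranges:
        sites.extend(range(start, stop + 1, step))
    return sites

payload = json.dumps(charsets)
print(len(expand([[1, 30, 1]])), len(payload))
```

A client would POST this JSON alongside the alignment; the service folds it into the NeXML as annotations, and "extract" mode reverses the process.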


RDF knowledge base of plant phenotypes

Provided by Robert Hoehndorf

Moved to Data_enrichment_hackathon,_March_17-21_2014/RDFKB.

Link names to collections via type citation

Provided by Nicky Nicolson - RBGK

A basic MatchConf config made c. 600 links to Kew specimens and c. 250 to Paris specimens. E.g. this name cites the Kew herbarium specimens: And this name cites a Paris specimen: We should demonstrate this to other users at the workshop who hold significant herbarium collections (Paris / Edinburgh / Leiden / Missouri / Meise).

Link collection events between different systems

Provided by Nicky Nicolson

Benefits: start to show the set of specimens that arose from a single collection event.

Link collections to duplicates held in other herbaria

Provided by Nicky Nicolson

These two specimens (held in Kew and Paris) are duplicates, and would be recognised as such by examining the collection-event details (collector, collector number, locality and year). Benefits: potentially we could share georeferencing effort with the holders of duplicates, and our users may be able to navigate from collection records with no images to an image held elsewhere. E.g. a Herbcat search on Arecaceae returns 15189 records, of which only 1975 have images. If the remainder have duplicates in digitised herbaria, there may be images elsewhere that our users could view.

Linking specimens from digitized museum collections

Provided by Jordan Biserkov


--Rod Page (talk) 03:50, 12 March 2014 (CET) Not sure of the details here. Would one approach be a tool that linked specimen codes to corresponding GBIF occurrences? I've been doing this for articles in BioStor. One issue is that GBIF occurrence ids are not stable, so you gain richer links at the cost of having them potentially decay over time.
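One way to start on the specimen-code-to-GBIF idea: turn a code as cited in a paper into a GBIF occurrence-search query. The code-splitting heuristic below is naive, mapping the prefix straight to institutionCode is an assumption, and the endpoint/parameter names should be checked against the current GBIF API documentation:

```python
from urllib.parse import urlencode

def gbif_search_url(specimen_code):
    """Build an occurrence-search URL for a code like 'RMNH 12345'.
    Assumes 'prefix number' codes; many real codes are messier."""
    institution, _, catalog = specimen_code.partition(" ")
    params = {"institutionCode": institution, "catalogNumber": catalog}
    return "http://api.gbif.org/v1/occurrence/search?" + urlencode(params)

url = gbif_search_url("RMNH 12345")
print(url)
```

Fetching this URL (not done here) would return JSON whose occurrence records could then be linked from the article text, bearing in mind the id-stability caveat above.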

Enriching references using the BHL API

Provided by Jordan Biserkov

  1. Parse page URL -> page id
  2. Get page metadata -> Item id
  3. Get item metadata -> primary title id
  4. Get title metadata -> the full reference
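The four-step chain above could be sketched as URL construction against the BHL API. The op names (GetPageMetadata, GetItemMetadata, GetTitleMetadata) follow my recollection of the BHL API v2 documentation and should be verified; the ids and the API key are placeholders, and no requests are actually made here:

```python
from urllib.parse import urlencode

BASE = "http://www.biodiversitylibrary.org/api2/httpquery.ashx"

def bhl_call_url(op, apikey="YOUR_KEY", **params):
    """Build a BHL API call URL; a registered API key is required."""
    query = {"op": op, "apikey": apikey, "format": "json", **params}
    return BASE + "?" + urlencode(query)

# Steps 1-2: page id -> item id (page id is a placeholder example).
step2 = bhl_call_url("GetPageMetadata", pageid="16002475")
# Step 3: item id -> title id (item id is a placeholder example).
step3 = bhl_call_url("GetItemMetadata", itemid="44733")
# Step 4: title id -> the full reference (title id is a placeholder example).
step4 = bhl_call_url("GetTitleMetadata", titleid="27360")
print(step2)
```

In a real client each step fetches the JSON, reads the next id from the `Result` payload, and feeds it into the following call.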


--Rod Page (talk) 05:51, 11 March 2014 (CET) Jordan, I'm not sure what you are after here, are you simply wanting to get bibliographic metadata for a given page in BHL? Does the BHL API not give you the tools for this?
--William Ulate Watch out for this one... if what you intend is to get the Page ID from the URL, BHL never guaranteed that URL to be stable at all! In fact, it might change if there's ever an insertion of a new page. Instead, as Rod recommends, use the stable URI with the PageID that the API tools give you, the one with the syntax:[PageID]. For example, in a URL like this: you can ignore pretty much everything coming after the # if you are looking for stable IDs. The actual stable URI of that page is