Markup pitch

From pro-iBiosphere Wiki

The pitch

A simple API to extract entities and localities from plain text.


Map of localities extracted from text
Museum specimen codes extracted from text
List of articles containing the name Zonosaurus

For example, given the reference:

Achille P Raselimanana, Christopher J Raxworthy, and Ronald A Nussbaum (2000) A revision of the dwarf Zonosaurus Boulenger (Reptilia: Squamata: Cordylidae) from Madagascar, including descriptions of three new species. Scientific Papers Natural History Museum the University of Kansas 18, 1–16.

BioStor has extracted geographic and specimen data, and so displays a map of localities mentioned in the text, and a list of museum specimen codes. BioStor also uses the taxonomic names provided by BHL to support searches based on names.

Map of localities extracted from BioStor articles

Extracting these entities enables some useful queries. For example, we can build a map of all localities mentioned in the articles, which means we can search the database spatially (e.g., "find all articles about Madagascar"). Extracting museum specimen codes gives us citation data for specimens (i.e., the number of papers that cite a specimen). We can use this to compute the impact factor of a collection, or to track the changing taxonomic identity of a specimen.
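As a rough illustration of the specimen citation idea, here is a minimal Python sketch. The article identifiers, the specimen lists, and the function names are all hypothetical; it assumes the extraction step has already produced a list of specimen codes per article and that the collection acronym is the first token of each code.

```python
from collections import Counter

# Hypothetical extraction results: article id -> specimen codes found in it.
ARTICLE_SPECIMENS = {
    "article-1": ["MNHN 7634", "UADBA 395"],
    "article-2": ["UADBA 395", "UADBA 398"],
    "article-3": ["UADBA 395", "UADBA 395"],  # cited twice in one article
}


def specimen_citations(article_specimens):
    """Count, for each specimen code, how many articles cite it."""
    counts = Counter()
    for codes in article_specimens.values():
        # A set ensures an article counts once per specimen it mentions.
        counts.update(set(codes))
    return counts


def collection_citations(specimen_counts):
    """Sum specimen citations per collection acronym (first token of the code)."""
    totals = Counter()
    for code, n in specimen_counts.items():
        totals[code.split()[0]] += n
    return totals


counts = specimen_citations(ARTICLE_SPECIMENS)
print(counts.most_common())
print(collection_citations(counts).most_common())
```

The per-collection totals are only a crude stand-in for an "impact factor"; how to normalise them (by collection size, by year, and so on) is left open here.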

Another example of what can be done is the Elsevier Challenge.


The input is plain text extracted from a publication (i.e., lacking markup such as bold or italic, and structure such as paragraphs). For example:

Specimens examined. — All from Madagascar: MNHN 
7634, UADBA 395-97, Ambohimitombo, Ambositra 
Fivondronana, Fianarantsoa Province, 30 September 1994, 
J. B. Ramanamanjato and A. P. Raselimanana; UADBA 398- 
406, Ankeniheny forest, Moramanga Fivondronana, 
Toamasina Province, 9-22 December 1993, N. Rabibisoa, J. 
B Ramanamanjato, and O. Ramilison; UADBA 407-09, 
Fiherenana region, Amboasary Fivondronana, Toamasina 
Province, 19-23 September 1994, N. Rabibisoa, J. B 
Ramanamanjato, and O. Ramilison; UADBA 4124-35, 
4137-41, Andohahela National Park, 24°38' S, 46°46' E, 440- 
1450 m elevation, Tolagnaro Fivondronana, Toliara Prov- 
ince, 18 October-14 November 1995, J. B. Ramanamanjato 
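To give a feel for what extraction from text like this involves, here is a minimal Python sketch that pulls specimen codes and a latitude/longitude pair out of the passage above. The regular expressions and function names are assumptions made for this example, not the patterns used by BioStor or any existing service. It assumes a specimen code is an upper-case acronym followed by one or more catalogue numbers or ranges, and that coordinates are given in degrees and minutes.

```python
import re

# The passage above, starting at "All from Madagascar" (line breaks as in the source).
SAMPLE = """All from Madagascar: MNHN
7634, UADBA 395-97, Ambohimitombo, Ambositra
Fivondronana, Fianarantsoa Province, 30 September 1994,
J. B. Ramanamanjato and A. P. Raselimanana; UADBA 398-
406, Ankeniheny forest, Moramanga Fivondronana,
Toamasina Province, 9-22 December 1993, N. Rabibisoa, J.
B Ramanamanjato, and O. Ramilison; UADBA 407-09,
Fiherenana region, Amboasary Fivondronana, Toamasina
Province, 19-23 September 1994, N. Rabibisoa, J. B
Ramanamanjato, and O. Ramilison; UADBA 4124-35,
4137-41, Andohahela National Park, 24°38' S, 46°46' E, 440-
1450 m elevation, Tolagnaro Fivondronana, Toliara Prov-
ince, 18 October-14 November 1995, J. B. Ramanamanjato
"""

# Assumed pattern: acronym of 2+ capitals, then comma-separated numbers or ranges.
SPECIMEN_RE = re.compile(r"\b([A-Z]{2,})\s+(\d+(?:-\d+)?(?:,\s*\d+(?:-\d+)?)*)")
# Assumed pattern: degrees and minutes only, latitude first.
COORD_RE = re.compile(r"(\d{1,3})°(\d{1,2})'\s*([NS]),\s*(\d{1,3})°(\d{1,2})'\s*([EW])")


def normalize(text):
    """Undo line wrapping so entities split across lines can be matched."""
    # Rejoin words hyphenated at a line break ("Prov-\nince" -> "Province").
    text = re.sub(r"(?<=[a-z])-\s*\n\s*(?=[a-z])", "", text)
    # Rejoin numeric ranges split at a line break ("398-\n406" -> "398-406").
    text = re.sub(r"(?<=\d)-\s*\n\s*(?=\d)", "-", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_specimens(text):
    """Return codes such as 'UADBA 395-97', one per number or range."""
    codes = []
    for acronym, numbers in SPECIMEN_RE.findall(text):
        for number in re.split(r",\s*", numbers):
            codes.append(f"{acronym} {number}")
    return codes


def extract_coordinates(text):
    """Return (latitude, longitude) pairs in decimal degrees."""
    points = []
    for lat_d, lat_m, ns, lon_d, lon_m, ew in COORD_RE.findall(text):
        lat = int(lat_d) + int(lat_m) / 60
        lon = int(lon_d) + int(lon_m) / 60
        points.append((-lat if ns == "S" else lat, -lon if ew == "W" else lon))
    return points


clean = normalize(SAMPLE)
print(extract_specimens(clean))
print(extract_coordinates(clean))
```

Note that ranges such as "UADBA 395-97" are kept as written; expanding them into individual catalogue numbers (395, 396, 397) would be a further step. Place names such as "Ankeniheny forest" are not handled here and would need a gazetteer lookup.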

Sources of text

  "id": "8de84eedaf1fb9b3ed6b4376d0d568bc",
  "status": 200,
  "pages": ["text...","text",..., "text"]

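A response of the shape shown above could be turned back into a single block of text for the extractor along the following lines. This is only a sketch: the meaning of the fields is inferred from their names (it assumes "status" is an HTTP-style code and "pages" holds one string per page), and the function name and the sample response body are made up for the example.

```python
import json


def pages_to_text(response_body):
    """Join the per-page strings of a response into one text.

    Assumes the fields "status" and "pages" behave as their names suggest.
    """
    doc = json.loads(response_body)
    if doc.get("status") != 200:
        raise ValueError(f"unexpected status: {doc.get('status')}")
    return "\n".join(doc.get("pages", []))


# Hypothetical response body with two short pages.
body = '{"id": "example", "status": 200, "pages": ["first page", "second page"]}'
print(pages_to_text(body))
```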

There are some services already available.

Can we package these in a single service? What other things can we extract? Obvious candidates include:

  • Citations, such as literature cited, or micro citations
  • GenBank accession numbers
  • Other identifiers in the text, such as DOIs, LSIDs, etc.
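Several of these candidates have regular enough formats that a first pass can be done with pattern matching. The Python sketch below uses deliberately simplified patterns: classic GenBank nucleotide accessions (one letter plus five digits, or two letters plus six digits), DOIs beginning with "10.", and LSIDs of the form urn:lsid:authority:namespace:object. The function name is hypothetical and the identifiers in the sample sentence are made up to show the formats. Real text would need stricter rules, since a specimen code or grant number can look like an accession number.

```python
import re

# Simplified patterns; each is an assumption that covers common cases only.
PATTERNS = {
    "genbank": re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b"),
    "doi": re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+"),
    "lsid": re.compile(
        r"urn:lsid:[A-Za-z0-9.\-]+:[A-Za-z0-9.\-]+:[A-Za-z0-9.\-]+(?::[A-Za-z0-9.\-]+)?",
        re.IGNORECASE,
    ),
}


def extract_identifiers(text):
    """Return a dict mapping identifier type to the matches found, in order."""
    found = {}
    for kind, pattern in PATTERNS.items():
        # Strip sentence punctuation that the DOI and LSID patterns can swallow.
        found[kind] = [m.rstrip(".,;") for m in pattern.findall(text)]
    return found


# Made-up identifiers, for illustration only.
TEXT = (
    "Sequences are in GenBank under U12345 and AB123456; see doi:10.1000/xyz123. "
    "The name is registered as urn:lsid:example.org:names:12345."
)
print(extract_identifiers(TEXT))
```

Citations and micro citations are much less regular than these identifiers and are unlikely to yield to simple patterns of this kind.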

What would be the outcome?

Imagine that the service returns some simple JSON listing the entities found. Can we index these and create a simple search engine that can return documents based on geographic extent, specimen code, identifier, etc.? Can we use this to index a large corpus of text, such as BHL or BHL-derived archives?
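A toy version of such a search engine might look like the Python sketch below. Everything in it is an assumption: the shape of the per-document entity listing, the document identifiers, and the function names. The bounding box used for Madagascar is approximate. A real index over a corpus the size of BHL would need a proper search backend with spatial support, but the idea is the same.

```python
from collections import defaultdict

# Hypothetical service output: document id -> entities found in that document.
DOCS = {
    "article-1": {
        "specimens": ["UADBA 395-97"],
        "identifiers": ["10.1000/xyz123"],
        "localities": [[-24.63, 46.77]],  # [latitude, longitude]
    },
    "article-2": {
        "specimens": ["MNHN 7634", "UADBA 395-97"],
        "identifiers": [],
        "localities": [[48.85, 2.35]],
    },
}


def build_index(docs):
    """Build an inverted index from (entity type, value) to document ids."""
    index = defaultdict(set)
    for doc_id, entities in docs.items():
        for code in entities.get("specimens", []):
            index[("specimen", code)].add(doc_id)
        for ident in entities.get("identifiers", []):
            index[("identifier", ident)].add(doc_id)
    return index


def search_entity(index, kind, value):
    """Return the sorted ids of documents that mention the given entity."""
    return sorted(index.get((kind, value), set()))


def search_bbox(docs, south, west, north, east):
    """Return the sorted ids of documents with a locality inside the box."""
    return sorted(
        doc_id
        for doc_id, entities in docs.items()
        if any(
            south <= lat <= north and west <= lon <= east
            for lat, lon in entities.get("localities", [])
        )
    )


index = build_index(DOCS)
print(search_entity(index, "specimen", "UADBA 395-97"))
# Approximate bounding box for Madagascar.
print(search_bbox(DOCS, -26.0, 43.0, -11.5, 51.0))
```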

Open issues

There are two obvious issues:

  • Errors in the text will prevent some entities being extracted (this is the OCR error issue).
  • What do we link the discovered entities to?