Strings to things pitch
The "biodiversity knowledge graph" is one way to model the relationships between the core entities in biodiversity informatics. A key task, therefore, is to build this graph. To do this we need a way to convert "strings" to "things".
This pitch proposes building a database that can take any string (a taxon name, a museum code, a place name, a citation, a collection acronym, an identifier) and return one or more identifiers for the corresponding object. The goal is to be able to match any string that has been extracted from some text to the relevant entity.
|Repert. Spec. Nov. Regni Veg. 18: 236 1922||10.1002/fedr.19220181010|
|Moorea; Haumi, flanc gauche de la moyenne vallée.||-17.54814, 149.81609|
|Moorea; Hauni, flanc gauche de la moyenne vallée.||-17.54814, 149.81609|
There will be cases where multiple strings correspond to the same thing. These can be aggregated in various ways. For example, different strings for the same locality could be aggregated based on identical (or geographically close) coordinates (such as "Moorea; Haumi, flanc gauche de la moyenne vallée." and "Moorea; Hauni, flanc gauche de la moyenne vallée." above). In this way we could generate a set of strings that correspond to the same entity.
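As a rough illustration of aggregating locality strings by coordinates, here is a minimal Python sketch. The function names (`haversine_km`, `aggregate_localities`), the greedy grouping strategy, and the 1 km threshold are all assumptions made for the example, not part of any existing tool; the coordinates are copied as given in the table above.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in decimal degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def aggregate_localities(records, max_km=1.0):
    """Greedy grouping: each string joins the first group whose
    anchor point lies within max_km, otherwise it starts a new group.
    The threshold is an arbitrary choice for illustration."""
    groups = []  # list of (anchor_coords, [strings])
    for text, coords in records:
        for anchor, strings in groups:
            if haversine_km(anchor, coords) <= max_km:
                strings.append(text)
                break
        else:
            groups.append((coords, [text]))
    return groups

records = [
    ("Moorea; Haumi, flanc gauche de la moyenne vallée.", (-17.54814, 149.81609)),
    ("Moorea; Hauni, flanc gauche de la moyenne vallée.", (-17.54814, 149.81609)),
]
groups = aggregate_localities(records)
```

Here both spellings end up in a single group because their coordinates are identical. A real implementation would probably need a spatial index and a more careful clustering rule than this first-match approach.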
Generating strings for things
We could also help the mapping process by generating possible strings for things. As discussed in Rethinking citation matching, we could generate strings for citations by taking a database of references and generating a set of formatted citations using, say, the Citation Style Language. Or we could generate "micro citations" of the form "journal volume: page year" to facilitate matching literature cited in databases such as IPNI. For example, the DOI doi:10.1002/fedr.19220181010 corresponds to this paper:
Harms, H. (1922). Leguminosae americanae novae. III. Feddes Repert, 18(10-18), 232–237. doi:10.1002/fedr.19220181010
Given that the page range is 232–237, and the journal abbreviation "Repert. Spec. Nov. Regni Veg.", the possible micro citations are:
- Repert. Spec. Nov. Regni Veg. 18: 232 1922
- Repert. Spec. Nov. Regni Veg. 18: 233 1922
- Repert. Spec. Nov. Regni Veg. 18: 234 1922
- Repert. Spec. Nov. Regni Veg. 18: 235 1922
- Repert. Spec. Nov. Regni Veg. 18: 236 1922
- Repert. Spec. Nov. Regni Veg. 18: 237 1922
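The list above could be generated mechanically. The following Python sketch shows one way to do it; the function name `micro_citations` and the simple dictionary used as a string-to-DOI index are assumptions for illustration only.

```python
def micro_citations(journal_abbrev, volume, first_page, last_page, year):
    """Return one 'journal volume: page year' string per page in the range."""
    return [
        f"{journal_abbrev} {volume}: {page} {year}"
        for page in range(first_page, last_page + 1)
    ]

strings = micro_citations("Repert. Spec. Nov. Regni Veg.", 18, 232, 237, 1922)

# Map every generated string back to the same DOI (the "thing").
index = {s: "10.1002/fedr.19220181010" for s in strings}
```

Looking up "Repert. Spec. Nov. Regni Veg. 18: 236 1922" in `index` would then return the DOI, regardless of which page within the article a database happens to cite.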
If we are assembling a database of things (and their strings) then this leads naturally to having a place to add annotations. If there are spelling variations we can point those out, and if two things are related the links between them can be created. Records such as GenBank sequences could be linked to the papers that cite them, and to the voucher specimens from which the DNA was obtained.
There are several ways a strings to things database could be created. One obvious candidate is to use something like Semantic Mediawiki (see an example of this at iPhylo Linkout, which manages links between NCBI taxonomy and Wikipedia). Semantic Mediawiki would mean an immediate start could be made, and it supports community editing, but at the same time it is rather clunky.
What would be the outcome?
A database that could be used to link names of entities that are extracted from text to the actual entities, and to provide a common resource for data mining projects.