DOI or LOD or DOI and LOD

From pro-iBiosphere
by G. Hagedorn, 2013

The pro-iBiosphere project has investigated the status of the use of stable identifier methods within the Biodiversity community. In the course of several workshops organised by the project (involving experts from Europe, the US, and Australia) participants almost unanimously agreed that:

  • Life Science Identifiers (LSIDs), being a technology driven solely by the biodiversity community for the past 8 years, should be abandoned.
  • It was also widely agreed that the preferred form of any identifier, be it Linked Open Data URIs (LOD-URIs), DOIs, or ARKs, should be the Semantic Web compatible HTTP form (i.e., in the case of DOI or ARK, including an HTTP-based resolver).
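
As a minimal sketch of what "HTTP form" means in practice, the following function prefixes a bare DOI or ARK with an HTTP-based resolver and leaves LOD-URIs untouched. The function name `to_http_form` and the choice of resolver base URLs are assumptions made for illustration; the workshops did not prescribe particular resolvers.

```python
# Resolver base URLs are illustrative assumptions, not a recommendation
# from the workshops described above.
DOI_RESOLVER = "https://doi.org/"
ARK_RESOLVER = "https://n2t.net/"


def to_http_form(identifier: str) -> str:
    """Return the HTTP form of a DOI, an ARK, or an identifier that is already a URI."""
    ident = identifier.strip()
    lowered = ident.lower()
    if lowered.startswith(("http://", "https://")):
        # LOD-URIs (and already resolvable DOIs or ARKs) are kept as they are.
        return ident
    if lowered.startswith("doi:"):
        # Drop the "doi:" scheme label and prepend the resolver.
        return DOI_RESOLVER + ident[4:]
    if lowered.startswith("ark:/"):
        # ARK resolvers keep the "ark:/" label as part of the path.
        return ARK_RESOLVER + ident
    raise ValueError("unrecognised identifier form: " + identifier)
```

For example, `to_http_form("doi:10.1234/abc")` yields a URI that both a browser and a Semantic Web client can dereference directly.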

However, the question of whether the management of stable URIs should be decentralized (at multiple institutions, each using the standard URI-stability technology provided by most web servers) or whether a special, centralized technology such as DOI should be mandatory continues to be discussed.

Comparison of DOI and Semantic Web / Linked Open Data technologies

DOI vs LOD

The top scenario shows DOI resolution. A separate server, the DOI resolution provider, accepts the request, consults its internal Stability Mapping Definitions (where the DOI is mapped to the final URI), differentiates between RDF and HTML requests, and redirects the request to the ultimate destination. The client (machine or human) no longer sees the stable DOI, but instead the redirected (and potentially unstable) URI.
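
The resolution step described above can be sketched as a lookup followed by a redirect. This is a simplified model under stated assumptions, not the implementation of any real DOI resolver: the DOI, the target URIs, the names `STABILITY_MAP` and `resolve_doi`, and the use of status 302 are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redirect:
    status: int    # HTTP status code of the redirect
    location: str  # URI the client is sent to and subsequently sees


# Hypothetical "Stability Mapping Definitions": DOI -> current target per format.
STABILITY_MAP = {
    "10.9999/specimen.42": {
        "rdf": "http://collection.example.org/rdf/2013/specimen42.rdf",
        "html": "http://collection.example.org/pages/2013/specimen42.html",
    },
}


def wants_rdf(accept_header: str) -> bool:
    """Very coarse check of the Accept header; real negotiation also weighs q-values."""
    accept = accept_header.lower()
    return "application/rdf+xml" in accept or "text/turtle" in accept


def resolve_doi(doi: str, accept_header: str = "text/html") -> Redirect:
    """Look up the DOI centrally and redirect to the provider's current URI."""
    targets = STABILITY_MAP[doi]  # raises KeyError for an unregistered DOI
    fmt = "rdf" if wants_rdf(accept_header) else "html"
    return Redirect(status=302, location=targets[fmt])
```

Note that the `location` returned to the client is the provider's URI, not the DOI, which is the loss of the visible stable identifier mentioned above.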

The bottom scenario shows the same situation for a linked data setup. A web server located within the data-providing institution differentiates between RDF requests from machines (red dot on the left side) and HTML requests from humans using a web browser. Using content negotiation, requests to the same URI are directed to RDF data or HTML web pages, respectively. The web server also consults its internal mapping definitions to keep the URIs stable. One advantage of this setup is that the client (machine or human) continues to see a stable URI.
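
A comparable sketch for the linked data setup keeps both the content negotiation and the mapping inside the provider's own web server, so the URI visible to the client never changes. The host name, paths, and the names `INTERNAL_MAP` and `serve` are hypothetical; a production server would normally express this through its rewrite and negotiation configuration rather than application code.

```python
# Hypothetical base URI of the data-providing institution.
BASE_URI = "http://collection.example.org"

# Hypothetical internal mapping: stable public path -> current internal files.
INTERNAL_MAP = {
    "/specimen/42": {
        "rdf": "/srv/export/2013/specimen42.rdf",
        "html": "/srv/www/2013/specimen42.html",
    },
}

MEDIA_TYPES = {"rdf": "application/rdf+xml", "html": "text/html"}


def negotiate(accept_header: str) -> str:
    """Pick the representation: RDF for machines that ask for it, HTML otherwise."""
    return "rdf" if "application/rdf+xml" in accept_header.lower() else "html"


def serve(stable_path: str, accept_header: str = "text/html") -> dict:
    """Answer a request for a stable URI without exposing the internal location."""
    fmt = negotiate(accept_header)
    internal_file = INTERNAL_MAP[stable_path][fmt]  # KeyError for unknown paths
    return {
        "status": 200,
        "content_type": MEDIA_TYPES[fmt],
        "visible_uri": BASE_URI + stable_path,  # what the client keeps seeing
        "served_from": internal_file,           # known only to the server
    }
```

Both `serve("/specimen/42", "application/rdf+xml")` and `serve("/specimen/42")` report the same `visible_uri`; only the content type and the internal file differ.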

Technically, both scenarios work very similarly. The DOI example has minor advantages with respect to stability (largely limited to scenarios where the domain is lost by accident or because, perhaps after a merger, the domain transfer is neglected). By introducing the additional redirection layer, only a single domain name is needed. This is a single point of failure, but the central domain can reasonably be expected to be managed to the highest standards. The DOI has the disadvantage that the URI as seen from the client side changes, because the redirect goes to a different server and is not handled opaquely within a provider.

The main distinction between the two scenarios is therefore between centralized and decentralized stability management.

Advantages and disadvantages of centralization

Biodiversity community DOI system

The graphic shows the scenario where millions of requests from millions of clients have to be forwarded by a central resolver infrastructure to a large number of data providers. The service requirements of a biodiversity DOI service, providing the canonical identifiers to all living things in the Semantic Web (including humans, their parasites, crops, pets, etc.), may in fact be several orders of magnitude higher than those of CrossRef or DataCite DOI redirection. For data relations involving organisms, including those from medicine, agriculture, etc., these DOIs would have to be resolved with every query or reasoning step.

Some additional comments on the slide above:

  1. The central redirection table can grow very large. Technically this is manageable, but requires resources.
  2. The large number of involved data providers may require substantial human resources.
  3. Updating the redirection table by a provider for, e.g., 30 objects, can only be done through scripting. It requires the provider (e.g. a natural history collection) to learn the API of the central redirection service.
  4. Because major current DOI systems such as CrossRef or DataCite provide identifiers for digitally published objects, and define some metadata expectations to this end, it is rather doubtful whether they are suitable for physical specimens or abstract taxon concepts. The slide therefore assumes a community infrastructure owned and maintained by the biodiversity community and run, e.g., by GBIF. Who exactly runs the infrastructure is, however, secondary. The primary argument is that the load can be high and that management and resources need to be adequate and sustainably financed.
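
To illustrate comment 3, the following sketch shows the kind of script a provider would need in order to prepare a batch of redirection updates. The payload layout, the field names, and the function `build_update_batch` are entirely hypothetical, since no central biodiversity redirection API exists to be quoted here; the point is only that each provider has to learn and script against some such interface.

```python
import json


def build_update_batch(provider_id: str, changes: dict) -> str:
    """Serialise identifier -> new target URI changes for a hypothetical central API."""
    records = [
        {"identifier": identifier, "target": target}
        for identifier, target in sorted(changes.items())
    ]
    return json.dumps({"provider": provider_id, "updates": records}, indent=2)


# Hypothetical example: two specimen pages moved to a new directory.
batch = build_update_batch(
    "collection.example.org",
    {
        "10.9999/specimen.42": "http://collection.example.org/pages/2014/specimen42.html",
        "10.9999/specimen.43": "http://collection.example.org/pages/2014/specimen43.html",
    },
)
```

The resulting JSON string would then have to be sent to the central service using whatever authentication and transport that service requires.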

A central system has some advantages, especially with respect to additional services such as quality control and centralized, reliable global statistics. However, these advantages may or may not be decisive, depending on one's needs. Unfortunately, it may be time-consuming to reach a consensus on this.


Don't wait

However, there is some good news: the substantial concerns raised above about the management resources required to maintain the mapping between DOIs and URIs, both at the central redirection provider and at each data provider, can be reduced by first implementing well-managed, locally stable URIs. Doing so is straightforward, does not require additional technology, and drastically reduces the frequency, or even the likelihood, of changes being necessary at an additional central redirection service.
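
The effect can be made concrete with a small model, offered as a sketch under stated assumptions rather than a description of any existing system. The classes `LocalProvider` and `CentralResolver`, the DOI, and all paths are hypothetical. The central table points at the provider's locally stable URI, so internal reorganisation at the provider touches only the local mapping.

```python
class LocalProvider:
    """Data provider that manages its own stable URIs."""

    def __init__(self, base_uri: str):
        self.base_uri = base_uri
        self.local_map = {}  # stable path -> current internal location

    def publish(self, stable_path: str, internal_location: str) -> str:
        self.local_map[stable_path] = internal_location
        return self.base_uri + stable_path  # the locally stable URI

    def relocate(self, stable_path: str, new_internal_location: str) -> None:
        # Internal reorganisation: only the local mapping changes.
        self.local_map[stable_path] = new_internal_location


class CentralResolver:
    """Central redirection table that counts how often it has to be updated."""

    def __init__(self):
        self.table = {}
        self.update_count = 0

    def register(self, doi: str, target_uri: str) -> None:
        self.table[doi] = target_uri
        self.update_count += 1


provider = LocalProvider("http://collection.example.org")
central = CentralResolver()

stable_uri = provider.publish("/specimen/42", "/srv/www/2013/specimen42.html")
central.register("10.9999/specimen.42", stable_uri)

# Two later reorganisations at the provider; the central table is not touched.
provider.relocate("/specimen/42", "/srv/www/2014/specimen42.html")
provider.relocate("/specimen/42", "/srv/cms/objects/42")
```

After both relocations, `central.update_count` is still 1 and the central entry still points at the same stable URI, which is why the local investment pays off whether or not a central service is added later.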

Thus, whether or not a central DOI system is adopted over time, investing today in establishing good management practices for stable, Semantic Web compatible identifiers at each institution will not be wasted effort. It may be that the solution is LOD-URIs and DOIs.

DOI and LOD

(See also Best practices for stable URIs)