TaxonX CDM Quality Guidelines

From pro-iBiosphere Wiki

The PLAZI-2-CDM pipeline imports literature data marked up in TaxonX format into an EDIT Platform CDM database. This kind of data aggregation differs from the standard use cases for TaxonX marked-up documents at PLAZI or elsewhere.

The main focus at PLAZI is to make data from literature retrievable and to enrich it semantically. PLAZI therefore largely keeps the original document structure and only adds semantic information. Information that is not marked up is neither lost nor disconnected, at least not for a human reader who understands the meaning of the text order. The EDIT Platform, by contrast, stores data in a (table-based) database by completely transforming all parts of the text into database objects. The original structure and order of the text may therefore suffer, depending on the time, effort and expertise available for the mark-up and on whether the import routines recognize these structure changes.

Because the CDM import completely transforms the document, all information should ideally be covered by at least some minimum markup. More importantly, associated data must not be split into unrelated units.

The following table outlines the consequences and functionality levels resulting from the markup quality for a seamless PLAZI-2-CDM data integration.

For the PLAZI use case it is often good enough to mark up only the most important parts of a treatment (e.g. taxonomic names and specimen information), which makes it possible to search for these parts and to combine them with other important information. For example, a specimen will be related to a taxon and can therefore also be found under that taxon's name.
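A quick pre-import check can show which treatments carry only this minimal markup. The following is a minimal sketch, not part of the PLAZI-2-CDM pipeline: the namespace URI, the `tax:taxonxBody` wrapper, the sample document and the function name `check_treatments` are assumptions made for illustration. Adjust the namespace to whatever your TaxonX files actually declare.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "tax" prefix; replace with the one in your files.
NS = {"tax": "http://www.taxonx.org/schema/v1"}

# Hypothetical two-treatment document; "Aus bus" is a placeholder name.
SAMPLE = """<tax:taxonx xmlns:tax="http://www.taxonx.org/schema/v1">
  <tax:taxonxBody>
    <tax:treatment>
      <tax:nomenclature><tax:name>Aus bus</tax:name></tax:nomenclature>
      <tax:div type="description"><tax:p>Body small.</tax:p></tax:div>
    </tax:treatment>
    <tax:treatment>
      <tax:div type="description"><tax:p>No name tagged here.</tax:p></tax:div>
    </tax:treatment>
  </tax:taxonxBody>
</tax:taxonx>"""


def check_treatments(xml_text):
    """Report, per treatment, which of the basic tags are present."""
    root = ET.fromstring(xml_text)
    report = []
    treatment_tag = "{%s}treatment" % NS["tax"]
    for index, treatment in enumerate(root.iter(treatment_tag), start=1):
        report.append({
            "treatment": index,
            # Direct child: the nomenclature block of this treatment.
            "has_nomenclature": treatment.find("tax:nomenclature", NS) is not None,
            # Any tagged name anywhere inside the treatment.
            "has_name": treatment.find(".//tax:name", NS) is not None,
            # Direct child: at least one typed factual-data division.
            "has_div": treatment.find("tax:div", NS) is not None,
        })
    return report


if __name__ == "__main__":
    for row in check_treatments(SAMPLE):
        print(row)
```

Treatments that report no nomenclature block and no tagged name are the ones most likely to need manual work after import.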

TaxonX-2-CDM Markup Guidelines

Best practice Tag <markedup text> Added functionality Examples from GoldenGate & EDIT Portals Remarks
Splitting text into parts <tax:treatment> Splitting text into treatments will create separate taxon pages. Otherwise all information goes into one single page.

Fig.: Treatment markup in GoldenGate
Text not added to a taxon might get lost or, where multiple taxa are combined in one treatment, be moved to the wrong taxon. Some texts cannot easily be split into single-taxon treatments. Such information needs to be imported manually.
Adding literature metadata <tax:taxonxHeader> Primary source information will be available.

Fig.: Literature metadata visible on the General taxon page.
The link to the PLAZI resource will always be available; without this markup, only the link to the primary data will be missing.
Separating nomenclatural information <tax:nomenclature> Separate nomenclatural information, including synonymy, original descriptions and typification.

Figs.: Tagging nomenclatural information in GoldenGate and separated synonymy in the portal.
Information outside <tax:nomenclature> will go into the factual data part and needs to be moved manually later. If the accepted name is not in <tax:nomenclature>, it will be guessed from the first appearance of a tagged name (<tax:name>). If there is no <tax:name> at all, the accepted taxon name cannot be created correctly.
Mark titles and headers <tax:head> Marking titles and headers will ensure correct handling of the accepted taxon. Recognize duplicated names as possible headers and merge information. Taxa will not be created twice, once as an accepted taxon and again as its first synonym. Seldom in journal articles, often in books such as floras.
Mark factual information <tax:treatment><tax:div> Ensures that info is attached to the correct taxon and will be typified correctly on a cross-level.
Markup of info items within nomenclature All tags within <tax:nomenclature> Mark up as many info items as possible within nomenclature to improve atomisation and the search function. Avoids continuous text in the taxon name's title.

Fig.: Atomised synonymy in the portal
Add all information that follows <tax:name> to the name's full title cache. Data that is not marked up will either go to the factual data part as “not marked-up” or be cached in the name's “full title cache” to allow later correction and atomisation.
Use synonymy tag for synonyms <tax:synonymy> Improved atomisation. If not, new <tax:name> tags will be taken as the start of a synonym. May not work in some cases (attached to the wrong name or to the accepted taxon in general).
Mark up authors <tax:author>, <dwc:scientificNameAuthorship> Mark up authors to improve exact search. Author-specific information might be extracted from data.
Keep basionym/original combination author in brackets Within <tax:author> and <dwc:scientificNameAuthorship> Full atomisation can be used. Try to guess from non-<tax:xmldata> data whether an author is an original combination author. Author would be handled as current combination author. Real current combination author is lost. Author recognition needs to be improved in Plazi.
Mark up typification information <tax:type>, <tax:type_loc>, <tax:collection_event> Mark up typification information to ensure it is interpreted correctly, as type information and not as a synonym. Try to guess typification information by recognizing stop words and the position of <tax:collection_event> after </tax:name>. XXX Name types may be handled as synonyms instead of children of the accepted taxon, resulting in a messy name full title cache or in losing type designation information. Specimen types may be taken as simply related specimens (Individuals association), or the name's full title cache may be messed up.
Use status tag <tax:status> Use the status tag to ensure correct typification. Status information may not go into titleCache or into the “not marked-up” section. If marked up, nom. nov. information may be easily removed and even replaced by the original publication.
Tag year <tax:year> Tag the year to allow correct sorting. Allows statistical computation on historical data. XXX Guess years (often together with authorship). No loss or wrong placement; mostly zoological names.
Do not place references belonging to nomenclature outside <tax:nomenclature> <tax:ref_group>, <tax:bibref> This way reference information is added to the exact place of the information (name, type, …) it belongs to and is not added incorrectly or handled as unplaceable factual data.
Use correct type or otherType. Use generic types only rarely <tax:div type="…">, <tax:div type="multiple">, <tax:div type="other">, <tax:div otherType> Using the correct type will give you correct headers on the general page.
Use “otherType” for types that do not yet exist. Always try to use the same label for a given “otherType”.

Fig.: Typified text data in the portal
Use figure tag for figure information Figure legends are often marked up as <tax:div type="description"> or similar. It is better to use the figure tag, which allows better handling of figure information (e.g. you may not want to show it publicly until it is cleaned up or the corresponding figure is available online).
Mark up keys <tax:div type="key"> Correctly separate each alternative with <tax:p> to obtain clean text.

Fig.: Keys plain text unparsed in the portal
Fully atomised keys are not supported by TaxonX; keys are generally imported only as text blocks and show up as factual data.
Mark up inline text references <tax:ref_group>, <tax:bibref> This way you make them linkable and extractable as reference objects in the CDM.
Mark up collection events <tax:collection_event> Specimens should be handled as individuals, not as purely text-based data. Otherwise they will appear in the factual data as plain text, e.g. in the materials examined section if marked up as such.
Atomise collection events <tax:collection_event> subtags Markup of the exact locality will allow visualization on maps. Years will allow filtering on time periods (e.g. in workflows on historical data); collector and field number will allow better retrieval, as they are often used as informal identifiers, …

Fig.: Specimen shown correctly in the portal.
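
The accepted-name fallback described in the <tax:nomenclature> row can be illustrated in a few lines. This is only a sketch of the rule as stated in the table, not the actual import routine: the namespace URI, the sample fragments and the function name `guess_accepted_name` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "tax" prefix; replace with the one in your files.
TAX = "http://www.taxonx.org/schema/v1"
NS = {"tax": TAX}


def guess_accepted_name(treatment):
    """Return (name_text, source) for one <tax:treatment> element.

    Rule from the table: prefer a name inside <tax:nomenclature>;
    otherwise fall back to the first tagged <tax:name> in the treatment;
    with no <tax:name> at all, no accepted name can be derived.
    """
    name = treatment.find("tax:nomenclature//tax:name", NS)
    source = "nomenclature"
    if name is None:
        # Fallback: first tagged name in document order.
        name = treatment.find(".//tax:name", NS)
        source = "first tagged name"
    if name is None:
        return None, "no tagged name"
    return "".join(name.itertext()).strip(), source


def parse(fragment):
    """Wrap a hypothetical fragment so that the tax prefix is declared."""
    return ET.fromstring('<tax:treatment xmlns:tax="%s">%s</tax:treatment>' % (TAX, fragment))


# Hypothetical fragments; "Aus bus" and "Cus dus" are placeholder names.
WELL_MARKED = parse(
    '<tax:div type="description"><tax:p>Near <tax:name>Cus dus</tax:name>.</tax:p></tax:div>'
    "<tax:nomenclature><tax:name>Aus bus</tax:name></tax:nomenclature>"
)
NAME_ONLY = parse(
    '<tax:div type="description"><tax:p><tax:name>Cus dus</tax:name> is small.</tax:p></tax:div>'
)
UNTAGGED = parse('<tax:div type="description"><tax:p>Plain text.</tax:p></tax:div>')

if __name__ == "__main__":
    for treatment in (WELL_MARKED, NAME_ONLY, UNTAGGED):
        print(guess_accepted_name(treatment))
```

In the first fragment the name inside the nomenclature block wins even though another tagged name appears earlier in the text. The second fragment shows why the fallback can pick the wrong taxon when names are tagged only in running text.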