Task Groups

From pro-iBiosphere Wiki

Task Group 1: Data Visualization

See the task group wiki page for detailed information.

Screenshots of the final dashboard: input selection options and topmost charts; the principal four charts, with Jeremy's contribution selected; and the specimen map presentation.

Name of Task Leader:

Participants:

  1. Serrano Pereira (talk)
  2. Guido Sautter
  3. Jeremy Miller - in absentia

Aim:

  1. An excellent question, given that our product owner (i.e. Jeremy) has left the country! Fortunately, he prepared an excellent set of slides and sample data for us to work from.
  2. Create the front page for a treatment bank. Content includes the treatment text plus a series of data visualisations: specimens broken down by sex [pie chart], by institution [pie chart], by decade [bar chart] and by month [bar chart, subdivided by sex], plus a map with georeferenced records plotted and a text list of countries.

Activities to be conducted:

  1. Extract data from GoldenGATE
    1. Prepare exporter for required statistics
    2. In both screen and API versions
    3. API to provide several formats
      1. csv for standalone analysis
      2. json for use in our data visualisation component and other tools
      3. xml for standalone analysis
  2. Reformat data for visualisation tools
    1. Use PIWIK as basis for this work
    2. PIWIK is an open-source analytics tool, so it fits in with pro-iBiosphere methodology and sustainability needs
    3. PIWIK uses jQuery and associated plug-ins, so we will do the same. We're using jqPlot for charts and jVectorMap for maps, which means we don't have to reinvent the wheel, just exploit existing, supported, tested code.
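The export formats listed above can be sketched with a small serializer. This is an illustration only: the field names (sex, institution, decade) follow the chart breakdowns under the aims, but the actual GoldenGATE exporter's schema may differ.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def export_stats(rows, fmt):
    """Serialize treatment statistics to csv, json, or xml.

    `rows` is a list of flat dicts, one per specimen record.
    """
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    if fmt == "xml":
        root = ET.Element("statistics")
        for row in rows:
            rec = ET.SubElement(root, "record")
            for key, value in row.items():
                ET.SubElement(rec, key).text = str(value)
        return ET.tostring(root, encoding="unicode")
    raise ValueError("unsupported format: " + fmt)

# Hypothetical sample rows, not real GoldenGATE output.
rows = [{"sex": "female", "institution": "RMNH", "decade": "1930"},
        {"sex": "male", "institution": "K", "decade": "1940"}]
print(export_stats(rows, "json"))
```

The JSON form feeds the visualisation component directly, while the CSV and XML forms serve standalone analysis, matching the three API formats listed above.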

List of outcomes:

  1. Statistical data extraction routines added to GoldenGATE
  2. Data visualisation component

Live demonstration:

Task Group 2: Traits

Name of Task Leader:


Participants:

  1. Thomas Hamann
  2. David King
  3. George Gosline
  4. Ayco Hollemann
  5. Rutger Vos
  6. Claus Weiland
  7. Robert Hoehndorf

Aim:

  1. To extract useful plant trait data from African floras.
  2. Extend morphological and environmental ontologies.

Activities:

  1. Extending and translating the Environment Ontology (EnvO)

Outcomes:

  1. Extracted trait data.
  2. Extended relevant ontologies.
  3. Data published in the open platform.

Comments

John Deck, remarks on the Biological Collections Ontology (BCO) Bootcamp (by Ramona Walls, John Deck, and John Wieczorek, The iPlant Collaborative)

- One question that came up at the end was regarding the Naturalis hackathon topic of mapping trait data from the literature and how BCO could help. I started to answer this and ended up with dead air... Just to finish up my response for whoever asked the question: "Encourage folks to think about delineating the processes by which traits are determined, such as:

traitDeterminedByVisualObservationOfStudySpecimen
traitDeterminedByVisualObservationOfPhotograph
traitDeterminedByAnalysisFromLiterature

..."

Here are some of the links that we passed around:


If the participants of this task group would like to connect some dots here in the short term, please inform Rutger / Soraya (we can then organise a second online meeting with them in the course of this week).

Task Group 3: Links to/from specimens and names

Name of Task Leader:


Participants:

  1. Jordan Biserkov
  2. Matthew Blissett
  3. Kevin Richards
  4. Quentin Groom
  5. George Gosline
  6. Marko Tahtinen
  7. Thomas Hamann
  8. Peter Hovenkamp
  9. Soraya Sierra
  10. Daniel Mietchen (talk)


Aim:

  1. Link together name and specimen data, especially from Floras.
  2. Link specimen citations in "Literature" to specimens from Kew, Brussels and Edinburgh. "Literature" is either a document in Pensoft Writing Tool or an XML file from PhytoKeys. The approach can be extended to other (at least somewhat) structured data.
  3. Document the "Taxonomic Mind Mapper" proposal by Peter Hovenkamp and determine what tasks are necessary to achieve it.


Activities:

  1. Analyse data and how best to configure matches, with an aim to output simple link statements such as:
  • [specimen identifier] referenced_in [doi]
  • [name identifier] cites_type [specimen identifier]
  • [specimen identifier] duplicate_of [specimen identifier]
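The link statements above can be modelled as simple triples. A minimal sketch follows; the identifiers used below are hypothetical placeholders, not real specimen or name records.

```python
from collections import namedtuple

# A link statement is a plain triple: subject, predicate, object.
Link = namedtuple("Link", ["subject", "predicate", "object"])

# Hypothetical examples mirroring the three statement forms above.
links = [
    Link("K000123456", "referenced_in", "10.3897/phytokeys.1.1"),
    Link("ipni:12345-1", "cites_type", "K000123456"),
    Link("K000123456", "duplicate_of", "BR0000987654"),
]

def to_statements(links):
    """Render link triples as plain-text statements, one per line."""
    return "\n".join(f"[{l.subject}] {l.predicate} [{l.object}]" for l in links)

print(to_statements(links))
```

Keeping the output this simple means the statements can later be loaded into any triple store or graph database without committing to one now.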


--Peter Hovenkamp (talk) 23:05, 20 March 2014 (CET)I have expanded my use case with an outline of the specification of a system in terms of nodes and links: https://docs.google.com/document/d/1vni44RBwGNZ7iRCFf243NtcD-7xwjHrDeAPehato7KY/edit?usp=sharing


Outcomes:

  1. A specification of an API that collections can implement to allow publishers like Pensoft to send notifications when a collection object is cited in literature.

Update (Tuesday):

  • Peter has been documenting use cases for the taxonomic mind mapper.
  • Looking at Phytokeys specimen references to submit to service for conversion from text into links.
  • Flora Central Africa data mapped against IPNI names.
  • African flora specimen reference data atomised / parsed for submission to linking service.

Update (Wednesday):

  • Nicky turned Peter's use cases into a graph gist: http://gist.neo4j.org/?9645618
  • Ayco and Matt matched the Naturalis collection against K/E/BR
  • Jordan and Matt working on matching specimen references pulled out from PhytoKeys
  • Nicky made a graph to show collector name variants:
    collector graph

Update (Thursday):

  • Matt set up a (test!) herbarium specimen collection event matching service on a free RedHat OpenShift platform. This takes a query in the form http://kewmatcher-mattblissett.rhcloud.com/match/basicCollEventMatch?recordedBy=Garrett&fieldNumber=1189&locality=Doi+Chiengdao&eventDate=1940 which returns a JSON response (a list of matches). Alternatively, a CSV file with the same field names can be uploaded; the response from a file upload is in HTML format only. We've since put the Reconciliation Service into production and removed the RHCloud instance. See http://data1.kew.org/reconciliation/.
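Since the test instance has been retired, here is a sketch of how such a query would be assembled. The parameters are taken from the example URL above; the JSON response shape shown is an assumption, not the service's documented output.

```python
import json
from urllib.parse import urlencode

# Parameters from the example query above. The RHCloud test instance
# has been removed, so this only assembles the request URL.
params = {
    "recordedBy": "Garrett",
    "fieldNumber": "1189",
    "locality": "Doi Chiengdao",
    "eventDate": "1940",
}
base = "http://kewmatcher-mattblissett.rhcloud.com/match/basicCollEventMatch"
url = base + "?" + urlencode(params)
print(url)

# Hypothetical response shape: a JSON list of candidate matches
# with an identifier and a match score.
sample_response = '[{"id": "K000123456", "score": 0.92}]'
matches = json.loads(sample_response)
print(matches[0]["id"])
```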

Task Group 4: CDM API

Name of Task Leader:


Participants:

  1. Quentin Groom


Aim:

  • Implementing a web service to export occurrence data for a specific taxon from the Common Data Model. This will allow occurrence and specimen data to be curated in the Taxonomic Editor, but fed seamlessly into niche modelling workflows.


Activities:

  1. Day 2: Define list of terms to be extracted out of the CDM (Quentin + BioVel)
  2. To do: documentation! The documentation will be based on the new schema from task 5.

Outcomes:

  1. Update (day 3): CDM occurrence export to JSON

Task Group 5: XML schema to document web-services (not CDM related)

Name of Task Leader:


Participants:

  1. Patricia Kelbert
  2. Bachir Balech

Depends on the development of task 4 (time!)

Aim:

  1. Common standard schema for documentation of webservices

Activities:

  1. Look for existing standards; look for ontologies; define mandatory fields

Outcomes:

  1. The webservice developed in the task 4 will be used as a use case to check the completeness of the SWeDe.
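As an illustration only, a SWeDe description of the task 4 webservice might look like the fragment below. The element names here are invented for the sketch and are not the schema's actual vocabulary; see the GitHub page under Links for the real schema.

```xml
<webservice>
  <name>cdm-occurrence-export</name>
  <description>Exports occurrence data for a taxon from the CDM.</description>
  <endpoint>http://example.org/cdm/occurrences</endpoint>
  <input name="taxonUuid" type="string" mandatory="true"/>
  <output format="json"/>
</webservice>
```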

Links

Progress Log https://docs.google.com/document/d/1hQuOrOdt0jmj0ptSo4vRc2Xjg0hZ8dXejRIvp_ut8O0/edit

Github Page https://github.com/njall/XS-SWeDe

Documentation/Wiki The SWeDe Project

This is the XML Schema for Scientific WEbservice DEscriptions produced at the Naturalis Data Enrichment Hackathon in Leiden (March 2014).

Task Group 6: NeXML services

Wiki page: NeXML Services

Name of Task Leader:

Participants:

  1. Rutger Vos
  2. Christian Brenninkmeijer
  3. Hannes Hetting
  4. David King (when available from leading task group 1)

Aims:

  1. To develop command-line tools that merge data in a number of commonly-used phylogenetic file formats and export them as NeXML.
  2. To develop command-line tools that extract objects from NeXML data: Taxa, Trees, Character matrices, all with metadata embedded.
  3. To wrap these tools inside Taverna-compatible RESTful services.
  4. To publish these services on BiodiversityCatalogue.
  5. To annotate these services according to BioVeL guidelines.

Activities:

  1. Development
  2. Web service testing and publishing
  3. Documentation

Outcomes: A webservice having two main functions:

  • Create a NeXML file from three objects: (a) a multiple sequence alignment, (b) a phylogenetic tree, and (c) a taxa list, with their associated metadata.
  • Read a NeXML file and extract the above listed objects with their corresponding metadata.
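To give a feel for the target format, the sketch below builds a minimal NeXML-style skeleton holding a taxa block, using only the standard library. The element names follow the NeXML namespace, but a conforming file needs more (ids, version metadata, character and tree blocks), so this is illustrative, not the service's actual output.

```python
import xml.etree.ElementTree as ET

NS = "http://www.nexml.org/2009"  # NeXML namespace
ET.register_namespace("", NS)     # serialize without a prefix

def skeleton(taxa):
    """Build a minimal NeXML-style skeleton holding a taxa (otus) block."""
    root = ET.Element(f"{{{NS}}}nexml", {"version": "0.9"})
    otus = ET.SubElement(root, f"{{{NS}}}otus", {"id": "taxa1"})
    for i, label in enumerate(taxa, 1):
        ET.SubElement(otus, f"{{{NS}}}otu", {"id": f"t{i}", "label": label})
    return ET.tostring(root, encoding="unicode")

print(skeleton(["Homo sapiens", "Pan troglodytes"]))
```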

Task Group 7: Web interface for correcting OCR text from BHL

Name of Task Leader:


See https://github.com/rdmpage/ocr-correction

Participants:

  1. Kevin Richards
  2. Marko Tahtinen
  3. David Shorthouse

Aim:

  1. Provide a simple interface for interactive editing of text, as well as tools to make inferences from the edits (e.g., frequency of certain kinds of OCR errors)

Activities & Outcomes:

  1. Web page for displaying OCR via DjVu XML files
  2. Unit and Integration testing framework incorporated
  3. Editing popup that shows exact line in original scanned image while editing one line at a time
  4. Batch processes to seed all lines from all DjVu XML files into CouchDB
  5. Integrated authentication using https://oauth.io/ such that all edits are tied to users & timestamps
  6. Summarize edits and make suggestions for possible corrections
  7. Integrate GlobalNames scientific name-finding in real-time at completion of a line edit to give user feedback
  8. Batch processes to collapse all edits and create text files that can be once again passed to the GlobalNames scientific name-finding tools

Future needs:

  1. Improve the suggestion mechanism to occur in real-time
  2. Find a mechanism to ascertain what automated suggestions make most sense (eg across a journal run of a certain date range)
  3. Determine if user edits can help inform developers of OCR software to improve their products
  4. Communicate with BHL to assess how this code might be incorporated into their web presence

Suggestion example: an edit of the incorrect OCR text "montli" to "month" has highlighted the possible correction of "like" to use "h" instead of "li", which is obviously not valid in this case but demonstrates the suggestion mechanism.
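The suggestion mechanism just described (deriving candidate substitutions such as "li" → "h" from accumulated edits) can be sketched with a diff over each edited line. This is a minimal sketch of the idea, not the project's actual implementation.

```python
from collections import Counter
from difflib import SequenceMatcher

def substitutions(original, corrected):
    """Yield (wrong, right) substring pairs from one line edit."""
    sm = SequenceMatcher(None, original, corrected)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            yield original[i1:i2], corrected[j1:j2]

# Tally substitutions across many edits; frequent pairs become
# candidate corrections, e.g. "li" -> "h" from "montli" -> "month".
edits = [("montli", "month"), ("tlie", "the"), ("montli", "month")]
counts = Counter(p for e in edits for p in substitutions(*e))
print(counts.most_common())
```

Ranking pairs by frequency is what lets a common confusion like "li"/"h" surface as a suggestion, including in places (like "like") where it does not apply.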

Suggested correction example

New name found example: "Crossarchus somaltcus" edited to "Crossarchus somalicus" now finds the name where it previously did not.

New name found example

Task Group 8: IPython Notebook and Taverna integration

Name of Task Leader:


Participants:

  1. Youri Lammers
  2. Aleksandra Pawlik
  3. Ross Mounce


Aim:

  1. Integrate IPython Notebook and Taverna


Activities:

  1. Connect to a Taverna Player
  2. Authorize access to the Taverna Player
  3. Determine the inputs for a Taverna workflow known to the player
  4. Use data already defined within an IPython notebook as the input values for a workflow run
  5. Run the Taverna workflow, possibly involving user interaction
  6. Retrieve the results of the workflow run
  7. Allow those results to be used in subsequent cells of the notebook
  8. Save the IPython notebook
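Steps 2-5 above might be sketched as below. The "/runs" endpoint path, the payload field names and the bearer-token header are assumptions made for illustration; they are not taken from the Taverna Player's documented API.

```python
import json
from urllib.request import Request

def build_run_request(base_url, workflow_id, inputs, token):
    """Assemble a (hypothetical) request to start a workflow run.

    Endpoint path and payload shape are illustrative assumptions.
    """
    payload = {
        "run": {
            "workflow_id": workflow_id,
            "inputs": [{"name": k, "value": v} for k, v in inputs.items()],
        }
    }
    return Request(
        base_url + "/runs",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # token from step 2
        },
        method="POST",
    )

# Hypothetical server, workflow id, input and token.
req = build_run_request("https://player.example.org", 42,
                        {"sequence": "ACGT"}, "secret-token")
print(req.full_url, req.get_method())
```

From a notebook cell, the same request would simply be sent with urllib or any HTTP client, and the run's outputs fed back into later cells (steps 6-7).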


Outcomes:

  1. Code on the github repo here: myGrid DataHackLeiden
  2. Notebook Viewer version
  3. Pypi package being created

Task Group 9: Liberating pretty figures to Flickr

Name of Task Leader:

Participants:

  1. Youri Lammers (thank you Youri!)


Aims:

  1. Liberate and showcase openly-licensed (e.g. CC-BY) images from academic journal article PDFs, cf. http://openfigs.tumblr.com/
  2. Re-publish (with full attribution to source) on popular image sharing social media sites like Flickr/Wikimedia/Twitter/Tumblr/Pinterest etc...
  3. Find images of phylogenetic trees to pass to TreeRipper for data re-extraction from the image. See http://www.biomedcentral.com/1471-2105/12/178


Activities to be conducted:

  1. Responsibly scrape the Phytotaxa website for article PDF URLs (done)
  2. Download all the open access PDFs & create an easily searchable plain text copy with pdftotext (done)
  3. Use pdfimages to strip all the images from the PDFs (done)
  4. Filtering: delete all the small non-figure images spat-out by pdfimages (done)
  5. Detect and colour-invert the inverted black-and-white images (viable method found, not yet implemented)
  6. Re-upload images with attribution to Flickr via the API (done, needs to be run regularly as cronjob in future)
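Step 4 (dropping the small non-figure images) might be sketched as below. The 10 kB threshold is an illustrative guess, not the value the hackathon scripts actually used.

```python
import os

# pdfimages emits many tiny decorative images (logos, rules, page
# furniture) alongside the real figures; a size cutoff removes most.
MIN_BYTES = 10_000  # illustrative threshold, not the project's value

def looks_like_figure(size_bytes, min_bytes=MIN_BYTES):
    """Crude heuristic: real figures are usually not tiny files."""
    return size_bytes >= min_bytes

def filter_small_images(directory):
    """Delete files under the threshold; return the surviving paths."""
    kept = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if looks_like_figure(os.path.getsize(path)):
            kept.append(path)
        else:
            os.remove(path)
    return kept
```

A size-only heuristic will occasionally keep a large decorative image or drop a small real one, which is why the filtering step is worth a manual spot check before re-upload.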


List of outcomes:

  1. Open media content, made easily shareable & re-usable
  2. Potentially, some phylogenetic tree data
  3. Sample output from 1 paper (10 figures): http://www.flickr.com/photos/79472036@N07/sets/72157642597074643/with/13268597965/
  4. Articles with broken DOI links have been found and reported to CrossRef e.g. http://biotaxa.org/Phytotaxa/article/view/phytotaxa.112.2.1
  5. MEDIA: on Flickr: http://www.flickr.com/photos/79472036@N07/
  6. CODE: on GitHub: https://github.com/rossmounce/LeidenPDFhack
  7. Twitter: https://twitter.com/PhytoFigs