OKFest 2014 Data mining workshop

From pro-iBiosphere Wiki


This workshop on July 18, 2014, is a fringe event to OKFest 2014, which takes place in Berlin on July 15-17, 2014.



To register, please fill in the form at the bottom of this page.

Programme for the day

To follow, use the Etherpad.

09:30 - 12:30 Content mining workshop from the team at the ContentMine

12:30 - 13:30 Lunch break (self-organized)

13:30 - 17:00 Open Space and Hack Session (list ideas below)

  • Join the open development group at Wikimedia Deutschland from 13:00 to 15:00 for an open science for development session [Kate Michi Ettinger would like to join this visit]
  • Host activities proposed for OKFest that did not get selected. If you'd like to include yours, add it to this list!

17:00 Social fun/continued hacking - suggestions welcome!

  • Possibility of a tour through the museum

18:00 END


  • We recommend that you install
  • You are welcome to bring your own tools.


Extracting taxonomic treatments

The three journals run by the museum often publish descriptions or revisions of biological species and other taxa. These taxon treatments are the core unit of taxonomic knowledge. Since 2014, two of the journals have been marking up taxon treatments, so the idea for the workshop is to extract taxon treatments from articles in past issues and upload them to the Plazi repository. One way to do that would be through GoldenGATE.

The journals are:

Crowdsourcing data from the literature

Besides taxonomic treatments, there are many other kinds of data available in the literature, only part of which can be read by machines. This is why the Open Dinosaur Project turned to crowdsourcing when trying to comb the literature for dinosaur bone measurements. We will use ContentMine to data mine the back issues of MfN journals and to classify the data found therein, so as to inform decisions as to which kinds of data would benefit most from a crowd-based approach.

There is an open call by Zooniverse for proposals for new (kinds of) citizen science projects, especially if they have a historic dimension. Deadline: July 25.

Mining open research proposals

Some researchers have made their research proposals publicly available (e.g. here and here). What kind of information can be mined from them?

Storing the mined data

Once some resource has been mined, where should the resulting data go? How can data from multiple sources be best combined?

Linking from publications to extracted data

Once we extract data, it is also important to keep the link back to the publication from which it was extracted. As data will be combined and reused, it is important to be able to trace each data point back to its source. In addition, links not just to data but also to other publications can help provide insight into how research is done. At PeerLibrary we are building an open platform to facilitate such linking and the collaborative building of a layer of knowledge on top of papers.
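
As a starting point for discussion, here is a minimal Python sketch of one way to keep that link: each extracted value carries an identifier for its source publication, so provenance survives when data from several papers is combined. This is not how PeerLibrary, ContentMine or Plazi store data. The record fields, the function names and the example measurements and identifiers are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedValue:
    """One data point mined from a paper, with a link back to its source."""
    name: str       # what was measured (illustrative label)
    value: float
    unit: str
    source_id: str  # identifier of the publication, e.g. a DOI (placeholders below)
    locator: str    # where in the publication the value was found


def combine(*batches):
    """Merge batches from several sources, grouped by measurement name.

    Records are kept whole, so every value retains its own source_id.
    """
    merged = defaultdict(list)
    for batch in batches:
        for record in batch:
            merged[record.name].append(record)
    return dict(merged)


def sources_for(merged, name):
    """Return the sorted, de-duplicated source identifiers behind one measurement."""
    return sorted({record.source_id for record in merged.get(name, [])})


# Hypothetical example data, not taken from any real publication.
paper_a = [
    ExtractedValue("femur length", 412.0, "mm", "doi:10.0000/example-a", "Table 2"),
]
paper_b = [
    ExtractedValue("femur length", 398.5, "mm", "doi:10.0000/example-b", "p. 14"),
    ExtractedValue("tibia length", 377.0, "mm", "doi:10.0000/example-b", "p. 14"),
]

merged = combine(paper_a, paper_b)
print(sources_for(merged, "femur length"))
```

Because the source identifier travels with every record rather than with the dataset as a whole, combined data can still be traced back value by value.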