ViBRANT: Virtual Biodiversity
SEVENTH FRAMEWORK PROGRAMME e-infrastructure

Community web sites: small pieces loosely joined

Dave Roberts, David King, Simon Rycroft, David Morse, Lyubomir Penev, Donat Agosti & Vince Smith

A presentation given by Dave Roberts and coauthored by David King, Simon Rycroft, David Morse, Lyubomir Penev, Donat Agosti & Vince Smith. It was given at the Fourth Metadata and Semantics Research Conference (MTSR 2010) in Alcalá de Henares, Madrid, on the premises of the Faculty of Law.

Page 1: Community web sites: small pieces loosely joined

ViBRANT: Virtual Biodiversity

SEVENTH FRAMEWORK PROGRAMME e-infrastructure

Community web sites: small pieces loosely joined

Dave Roberts, David King, Simon Rycroft, David Morse, Lyubomir Penev, Donat Agosti & Vince Smith

Page 3

Small pieces loosely joined

Has many potential meanings:

Joining contributors together to form communities

Joining the data together that go towards forming a Scratchpad

Joining Scratchpad content with the landscape of biodiversity informatics data on the web

Page 4

Addressing the challenges of taxonomy

Goal: Inventory the Earth’s species; document their relationships; “publish” & apply these data

Data set: 1.8M described spp. (10M names); 300M pages (over last 250 years); 1.5-3B specimens

People: 4-6,000 taxonomists; 30-40,000 “pro-amateurs”; many more citizen scientists?

Page 5

I. The technology must largely embody the cause–effect relationship connecting problem to solution.

II. The effects of the technological fix must be assessable using relatively unambiguous or uncontroversial criteria.

III. Research and development is most likely to contribute decisively to solving a social problem when it focuses on improving a standardized technical core that already exists.

Sarewitz and Nelson (2008) Three rules for technological fixes. Nature, 456: 871-872

Page 6

15 October 2010

Biodiversity - a kind of washing powder?

When 2010 was named as the "year of biodiversity" by the UN, it began with a plea to save the world's ecosystems.

UN Secretary-General Ban Ki-moon said: "Biological diversity underpins ecosystem functioning... its continued loss, therefore, has major implications for current and future human well-being."

Recently, members of the public were asked what biodiversity is. The most common answer was "some kind of washing powder".

http://www.bbc.co.uk/news/science-environment-11546289

Page 7

Addressing the challenges of biodiversity informatics

“…the field [of biodiversity informatics] appears to be growing in a void of overarching, motivating questions, effectively making it a set of technologies in search of questions to address.”

Peterson et al, Syst. & Biodiv. 2010

Page 8

Scratchpads

http://scratchpads.eu

Hosted websites for taxonomists

Research & publication platform

Modular (Drupal) & flexible

Supports the taxonomic workflow

Bottom-up design, agile dev.

Ecosystem of communities (185)

2,350+ users (unpaid) from 2007

ViBRANT follow on, €4.75M

Page 9

Taxonomy & Literature

eBooks

Image Galleries

Societies & Organizations

eJournals

DNA, Phylogeny & Specimens

2.3k users, 58 countries, 268k pages

185 "Virtual Research Communities"

EDIT, GBIF, NHM, & EOL

Platform for biodiversity research & data publication

Changing the nature of collaboration

Expanding opportunities to participate in science

Page 10

Your data => Magic => Your web site

A website for you & your community

Page 11

Taxonomy import, management and navigation

Page 12

Reference manager / EndNote support for bibliographies

Page 13

Image galleries, image upload & annotation

Page 14

Nexus / Newick import for visualizing phylogenies

Page 15

Molecular & morphological character matrices (discrete, morphometric and text characters)

Page 16

Presence / absence country maps

Page 17

Specimen & location records (DwC)

Page 18

Web fora with e-mail integration

User blogs

Static web pages

Newsletters with e-mail integration

Page 19

Import from CSV text file to any content type

Page 20

ViBRANT Products

A Virtual Research Environment (Scratchpads) where users can safely store, share and manage their research information.

Analytical services for users to build identification keys and phylogenetic trees.

A publication platform for users to automatically compile taxonomic manuscripts from their research database.

A portal for users to centrally access publicly accessible biodiversity research information and literature.

Training, support & sociological study, helping research communities to use these tools and services.

A standards-compliant technical architecture that can be sustained by the biodiversity research community.

Page 21

The “chromosome” (diagram of ViBRANT work packages and their components)

Activity types: Networking, Service, Research

Work packages: WP2. Architecture; WP3. Training; WP4. Standards; WP5. Data; WP6. Publishing; WP7. Literature; WP8. Mobilisation

Components: Scratchpads Virtual Research Environment; Phylogenetic analysis; Bioclimatic modelling & metrics; Identification tools; Matrix data editor; Biodiversity data publishing; Scholarly manuscript publishing; Distributed Scratchpad hosting; Software module integration; Sustainability plan; Communal biodiversity literature; Biodiversity literature markup; Biodiversity data mining; Citizen science programme; Field recording support; User sociology study; User feedback systems; Training & outreach programme; Biodiversity data standards; Data aggregation portal; GBIF integration activities; Biodiversity visualisation layers; Controlled vocabulary platform

Page 22

Biodiversity literature looks like this

Cues: Indented text; UPPER CASE TEXT; Bold text; Italic text; Latin; Keywords; Symbols

Page 23

M BRITISH MUSEUM (NATURAL HiSi 26JU PRESENTED GENERAL UC.-lARYBulletin ofthe BritishMuseum (Natural History) The ichneumon-fly genus Banchus in the OldWorld(Hymenoptera) M. G. Fitton seriesEntomology Vol51 Nol 25 July 1985

Adobe Reader has this

Page 24

MBRITISH MUSEUM(NATURAL HiSi26 JUPRESENTEDGENERAL UC.-lARYBulletin of theBritish Museum (Natural History)The ichneumon-fly genus Banchus(Hymenoptera) in the Old WorldM. G. FittonEntomology seriesVol51 Nol 25 July 1985

Lura (BHL) has this

Page 25

But choice of XML schema is important

ABBYY XML is very detailed

This line of text has 202 bytes:

The Bulletin of the British Museum (Natural History), instituted in 1949, is issued in fourscientific series, Botany, Entomology, Geology (incorporating Mineralogy) and Zoology,and an Historical series.

To encode in ABBYY XML format this line requires 45,533 bytes.

There are 84,263 lines in the document from which this example was taken.
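
The overhead implied by these figures can be worked through with a little arithmetic. The sketch below uses only the three numbers quoted on the slide; the whole-document estimate rests on the assumption that every line expands as much as the sample line, which the slide does not claim.

```python
# Figures quoted on the slide.
PLAIN_BYTES = 202          # the sample line as plain text
ABBYY_BYTES = 45_533       # the same line encoded as ABBYY XML
LINES_IN_DOCUMENT = 84_263

# Bytes of XML spent per byte of plain text.
expansion = ABBYY_BYTES / PLAIN_BYTES

# Rough whole-document size, ASSUMING the sample line is typical.
estimated_total_bytes = LINES_IN_DOCUMENT * ABBYY_BYTES

print(f"expansion factor: about {expansion:.0f}x")
print(f"estimated ABBYY XML size: about {estimated_total_bytes / 1e9:.1f} GB")
```

Even if the sample line is unusually verbose, an expansion factor in the hundreds suggests why a leaner schema matters for a workflow that has to process whole volumes.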

Page 26

Look for taxon names

Used uBio FindIT web service

Overall excellent

Especially as it adds a NameBank ID

But still some oddities

Genus = ‘The’

The scutellum

The primitive

Species or Author = ‘and’

Exetastes and

B[anchus] falcatorius and
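
Oddities of this kind could in principle be caught by a small post-filter. The sketch below is an illustration only, not part of uBio FindIT and not necessarily what the authors did: `FUNCTION_WORDS` and `clean_candidate` are hypothetical names, and the word list is an assumption.

```python
from typing import Optional

# Hypothetical list of ordinary words that should never be part of a name.
FUNCTION_WORDS = {"the", "and", "of", "in", "a", "an", "with"}

def clean_candidate(candidate: str) -> Optional[str]:
    """Drop a candidate whose 'genus' is a function word; trim trailing ones."""
    words = candidate.split()
    # Genus = 'The': the whole candidate is a false positive.
    if not words or words[0].lower() in FUNCTION_WORDS:
        return None
    # Species or Author = 'and': keep the name, drop the trailing word.
    while words[-1].lower() in FUNCTION_WORDS:
        words.pop()
    return " ".join(words)

for found in ["The scutellum", "The primitive", "Exetastes and",
              "B[anchus] falcatorius and"]:
    print(repr(found), "->", repr(clean_candidate(found)))
```

A filter like this only handles the two patterns shown on the slide; other false positives would need their own rules.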

Page 27

Look for paragraph types

Simple keyword matching

Surprisingly effective!

Issue – can identify start, but not end…

Follow up work

Punctuation

Concepts
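
Simple keyword matching of this sort might look like the sketch below. The cue words and labels are assumptions chosen for illustration (the slides do not list the keywords actually used), and the sample paragraphs are made up.

```python
import re

# Hypothetical cue words; each pattern is anchored at the paragraph start.
PARAGRAPH_CUES = {
    "description": re.compile(r"\s*(description|diagnosis)\b", re.IGNORECASE),
    "distribution": re.compile(r"\s*distribution\b", re.IGNORECASE),
    "material": re.compile(r"\s*material examined\b", re.IGNORECASE),
    "remarks": re.compile(r"\s*(remarks|discussion)\b", re.IGNORECASE),
}

def paragraph_type(paragraph):
    """Return the label of the first cue that opens the paragraph, else None."""
    for label, pattern in PARAGRAPH_CUES.items():
        if pattern.match(paragraph):  # match() only looks at the start
            return label
    return None

# Invented sample paragraphs, for illustration only.
print(paragraph_type("DESCRIPTION. Female: fore wing length 9 mm."))
print(paragraph_type("Material examined. 3 females, Spain."))
print(paragraph_type("The face is punctate."))
```

As the slide notes, a match marks where a section starts; nothing in this approach says where it ends.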

Page 28

Look for other proper names

Biologia Centrali-Americana has a gazetteer

Most journals do not

Generic solution = OpenCalais

Good accuracy

Old countries

D.D.R.

West Germany

Continents

America

Page 29

Ambiguities and Mis-identifications

New York: City, State

Washington: City, State

Lake George: City

Lake Victoria: City

Other Oddities

Persons: Surname only; Two part names (Van Veen, van Veen)

Regions and Continents: East Africa, Africa

Page 30

Negative spell checking

Go beyond stop words

Remove every word that is in a spell dictionary, keeping only the words it does not recognise

Check:

Minor

Vulgar

Bulletin 27 from the Zoology Series reduced from 139,034 to 5,219 words
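
A minimal sketch of negative spell checking follows, reading the slide as "discard whatever a spell dictionary recognises and keep the rest". The tiny `DICTIONARY` set stands in for a real spell-checker word list, and the sample sentence is invented; both are assumptions.

```python
import re

# Tiny stand-in for a real spell-checker word list (an assumption).
DICTIONARY = {
    "the", "is", "and", "a", "of", "with", "by", "its", "from",
    "genus", "species", "black", "distinguished", "minor", "vulgar",
}

def negative_spell_check(text, dictionary):
    """Return the words of `text` that the dictionary does NOT recognise."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, Unicode-aware
    return [w for w in words if w.lower() not in dictionary]

# Invented sample sentence, for illustration only.
sample = "The genus Banchus is distinguished from Exetastes by its black scutellum"
print(negative_spell_check(sample, DICTIONARY))
```

One reading of the "Check" items is that words such as "Minor" and "Vulgar" sit in an ordinary dictionary and would therefore be discarded even where they matter, so they may need handling as exceptions.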

Page 31

Ligatures

INTRODUCTION.

Volume, one of five required for the enumeration of the Rhynchophora, was

THIS

commenced by Dr. Sharp in 1889 and is now concluded by myself. The study of the " Otiorhynchinœ Alatse " has unfortunately been delayed for many years, during the publication of Vol. IV. parts 4, 5, and 7, all of which are devoted to the Family Curculionidœ. The present Volume, IV. part 3, includes the Subfamilies Attelabinae, Pterocolinœ, Allocoryninee, Apioninœ, Thecesterninae, and Otiorhynchinre. The Attelabinae are represented by 104 (88 new), the Pterocolinse by three (all new), the Allocoryninse (a new subfamily) and Thecesterninse each by one, the Apioninae by 88 (84 new), and the Otiorhynchinae by 419 (340 new) species respectively; the total number for the six subfamilies being 616 species, with 516 new, and forty new genera. Amongst the 419 Otiorhynchinae, the apterous and winged forms are almost equal in number, there being a preponderance of apterous terrestrial species (Eupagoderes, Epicœrus, Epayriopsis, &c.) in the arid portions of Mexico and the winged forms ÇExophthalmuS) &c.) becoming relatively more numerous in the forest regions southward. Taking the Curculionidœ as a whole—the subfamilies Curculioninae and Calandrinse, in addition to those worked out in the present Volume,—the number of species enumerated altogether from Central America is as follows :— Vol. IV. part 3, 616; IV. part 4, 1365; IV. part 5, 908; IV. part 7, 344 : total 3233. The three other families of Rhynchophora—the Brenthidae, Scolytidae, and

Page 32

Ligatures

For the 24 æ there are: 11 ae; 5 œ; 5 se; 1 ee; 1 re; 1 a?;

So not a single correct rendering of the ligature, æ.

By contrast, the only example of œ in the page, Epicœrus, was correctly rendered.

Printed name => OCR rendering

Otiorhynchinæ => Otiorhynchinœ
Alatæ => Alatse
Curculionidæ => Curculionidœ
Attelabinæ => Attelabinae
Pterocolinæ => Pterocolinœ
Allocoryninæ => Allocoryninee
Apioninæ => Apioninœ
Thecesterninæ => Thecesterninae
Otiorhynchinæ => Otiorhynchinre
Attelabinæ => Attelabinae
Pterocolinæ => Pterocolinse
Allocoryninæ => Allocoryninse
Thecesterninæ => Thecesterninse
Apioninæ => Apioninae
Otiorhynchinæ => Otiorhynchinae
Otiorhynchinæ => Otiorhynchinae
Curculionidæ => Curculionidœ
Curculioninæ => Curculioninae
Calandrinæ => Calandrinse
Brenthidæ => Brenthidae
Scolytidæ => Scolytidae
Anthribidæ => Anthribidae
Hispidæ => Hispida
Cassididæ => Cassididae
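
One way to fold these renderings back together is to normalise the endings of family and subfamily names. The sketch below is a hypothetical heuristic, not the authors' method: it rewrites only capitalised words ending in -id or -in followed by one of the mis-renderings tallied above, so names such as Alatse are left alone and an ordinary word with the same ending could still be rewritten by mistake.

```python
import re

# Capitalised stem ending in "in" or "id", then one OCR rendering of the ligature.
_ENDING = re.compile(r"^([A-Z][a-z]+(?:in|id))(?:ae|œ|se|ee|re)$")

def normalise_ligature(word):
    """Rewrite an OCR-mangled -idæ / -inæ ending; return other words unchanged."""
    return _ENDING.sub(lambda m: m.group(1) + "æ", word)

for ocr in ["Otiorhynchinre", "Pterocolinse", "Curculionidœ",
            "Allocoryninee", "Brenthidae", "elytra"]:
    print(ocr, "=>", normalise_ligature(ocr))
```

Normalising in this direction treats every such ending as æ, which matches this page but would be wrong for a name that genuinely ends in œ.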

Page 33

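Soundex

Word, count, Soundex code:

elytra, 831, E436
prothorax, 639, P636
Hab, 637, H100
punctate, 616, P523
millim, 578, M450

Word, count, Soundex code:

elytra, 831, E436
Elytra, 509, E436
elytris, 294, E436
elytral, 125, E436
elytron, 36, E436
elytrisque, 12, E436
elytrorumque, 9, E436
Elytral, 8, E436
elytrorum, 7, E436
elytro, 2, E436
Elytrorum, 1, E436
Elytris, 1, E436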
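
The codes on this slide are consistent with standard American Soundex, although the slides do not say which implementation was used. The sketch below is one common formulation, written from the published coding rules, and shows how the variants of "elytra" fall into a single cluster.

```python
from collections import defaultdict

# Standard Soundex digit groups; vowels, h, w and y carry no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the four-character American Soundex code of `word`."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate repeated codes
            previous = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]

variants = ["elytra", "Elytra", "elytris", "elytral", "elytron", "elytrisque",
            "elytrorumque", "Elytral", "elytrorum", "elytro", "Elytrorum", "Elytris"]
clusters = defaultdict(list)
for variant in variants:
    clusters[soundex(variant)].append(variant)
print(dict(clusters))
```

Because the code keeps only the first letter and three digits, long Latin inflections collapse onto the same key, which is what makes it useful for grouping word forms here.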

Page 34

Similar words?

denticulate => denticulata

Levenshtein distance of 1: 0,0,1

denticulate => reticulate

Levenshtein distance of 2: 3,2,0

denticulate => geniculate

Levenshtein distance of 2: 2,2,0
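
The total distances quoted here can be reproduced with the textbook dynamic-programming algorithm, sketched below with unit costs for insertion, deletion and substitution. The sketch computes only the total distance; it does not attempt to reproduce the three comma-separated numbers, whose meaning the slide does not state.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

for other in ["denticulata", "reticulate", "geniculate"]:
    print("denticulate =>", other, levenshtein("denticulate", other))
```

The example shows the difficulty the slide is pointing at: a spelling variant and two unrelated words all sit within two edits of "denticulate".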

Page 35

What did we achieve?

Marked up 11 volumes, i.e. 4,504 pages

We have a robust workflow and can mark up a Bulletin in about 10-15 minutes. The choke point is the call to the OpenCalais web service.

No manual intervention or review required: workflow is scalable

Recognising taxon names:

uBio gives us a good start, and we have techniques to cluster ALL mis-spellings and variants with a valid taxon; but it is not perfect, e.g. Banchus Fabricius ends up in more than one cluster

Page 36

“making the Scratchpads better”

More reliable (e.g., distribute the servers)

More functional (e.g., phylogenetic & publication services)

Easier to use (better workflows)

Prettier (better graphical design - more intuitive)

More integrated (for data stored inside & outside the Scratchpad framework)

More sustainable (simple administration, distribute developers, development sandbox)

“making natural history better”

Easier to compile, manage and reuse your data

Easier to find and reuse other people's data

Promoting your data inside & outside the taxonomic community

Getting people to work for you (crowdsourcing)

Page 37

Manuscript preparation on a Scratchpad (publication workflow diagram)

Actors: Author, Publisher, Public

Steps and outputs: Submit as XML; Send to reviewers; Produce PDF; Enhanced XML; Register with ZooBank, GBIF, EoL etc.; Printed paper; PDF; Enhanced HTML

Page 38

Thank you for your attention.

Any questions?