ENA – 1st Dec 2014 – EBI, UK
Evangelos Pafilis
Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC)
Hellenic Centre for Marine Research (HCMR), Heraklio Crete, Greece
http://epafilis.info
Text Mining and Environmental Metadata Suggestion
Species – Environments
Comparative Analysis • Location • Environment • Time Period
Image from http://theresilientearth.com/
Coral Reefs
?
Not Trivial
Slide by Dr. P. Yilmaz, http://www.arb-silva.de/projects/contextual-data/
Metadata
Meta- = Μετά (“after”)
=> data “after” data
=> data describing data
Essential Context Information
a clear definition, that can be interpreted
in many, sometimes conflicting, ways
Community Standards
• Standards (such as MIxS, MIMARKS); see http://gensc.org/gc_wiki/index.php/GSC_Publications for a comprehensive list of publications
• Capture genomic, metagenomic, and other types of sequence contextual information
• Include detailed guidelines on how to annotate a sample (e.g. Yilmaz P et al. (2011) The ISME Journal 5: 1565–1567)
http://gensc.org/
P. Yilmaz et al., Nat Biotech 29, 415–420 (2011)
source: http://wiki.gensc.org/index.php?title=MIMARKS
http://www.tomorrowstarted.com/2013/01/how-a-key-works/.html
• Project descriptions
• Scientific-content web pages
• Full text scientific articles
• Literature abstracts
• In-house documents
Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific.
Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)
Looking up terms:
Intensive, learning curve
Literature Mining
processing text
to extract facts of interest
ENVIRONMENTS
terrestrial, aquatic, marine, lagoon, coral reef, sediment, freshwater, soil
ENVIRONMENTS: ENVO term identification in text
ID: ENVO:00000150 Name: coral reef
Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific.
Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)
ENVIRONMENTS
http://environments.hcmr.gr
http://environments-eol.blogspot.gr/
● Dictionary based
● Open source
● Environment Ontology
● Fast performance: 4000 PubMed abstracts / second *
● Based on the SPECIES name recognition tagger (Pafilis et al., PLoS ONE)
● E600 gold standard: ENVO-based corpus of EOL Species pages
● Recognition accuracy, mention level:
  - F1: 82.0%
  - 87.1% of the TPs: exact ID among the predicted ones
● Submitted preprint: http://biorxiv.org/content/early/2014/11/13/011403
Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390
*: based on a single-thread run on a 2.27 GHz Intel machine with 24 GB RAM, processing a set of 536,052 abstracts
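To make "dictionary based" concrete, here is a minimal sketch of case-insensitive dictionary matching. It is not the ENVIRONMENTS tagger itself: the tiny dictionary and the function name `tag` are hypothetical, and only the two ENVO identifiers (coral reef, lagoon) are taken from these slides. End offsets are reported as inclusive.

```python
import re

# Hypothetical mini-dictionary: surface form -> ENVO identifier.
# The two identifiers are the ones quoted in the slides; the rest is illustrative.
DICTIONARY = {
    "coral reef": "ENVO:00000150",
    "coral reefs": "ENVO:00000150",
    "lagoon": "ENVO:00000038",
}


def tag(text, dictionary=DICTIONARY):
    """Return (start, end_inclusive, matched_text, envo_id) for each dictionary hit."""
    # Try longer names first so "coral reefs" is preferred over "coral reef".
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        envo_id = dictionary[match.group(0).lower()]
        hits.append((match.start(), match.end() - 1, match.group(0), envo_id))
    return hits


text = "Microbes are key players in both healthy and degraded coral reefs."
for hit in tag(text):
    print(hit)
```

A single compiled alternation keeps the sketch short; a production tagger would more likely use a hash- or trie-based lookup to reach the throughput quoted above.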
biome
environmental feature
environmental material
environmental condition
habitat … … … … …
Based on slides by Dr. Pier Luigi Buttigieg, AWI, Bremerhaven, Germany
http://environmentontology.org ~1600 terms, June 2013
ENVO: source of environment descriptor names and synonyms
ENVIRONMENTS – Improving Accuracy
● Increasing matches in text
  ● Orthographic variation supported, e.g. freshwater, fresh water, and fresh-water
  ● Case-insensitive matching
  ● Synonym generation to reflect the way environment descriptive terms are mentioned in text (both generic and ENVO specific)
● Preventing overmatching (i.e. avoiding increased FP)
  ● "Stopword list" (e.g. spring, well, range)
Synonym generation (action: example)
  ● Add a variant in which non-informative words have been removed: epipelagic zone → epipelagic, estuarine biome → estuarine
  ● Plural form addition: sediment → sediments
  ● Adjective form addition: lagoon → lagoonal
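The variant-generation steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tagger's actual rules: the `NON_INFORMATIVE` word set is inferred from the two slide examples, the plural rule is a naive "add s", adjective forms (lagoon → lagoonal) are left out, and the function name `variants` is hypothetical.

```python
import re

# Stopwords quoted on the slide: too ambiguous to be matched on their own.
STOPWORDS = {"spring", "well", "range"}
# Assumed non-informative trailing words, inferred from the slide's two examples.
NON_INFORMATIVE = {"zone", "biome"}


def variants(name):
    """Generate lower-cased surface forms for one ENVO name or synonym."""
    base = name.lower()
    forms = {base}

    # Drop a non-informative trailing word: "estuarine biome" -> "estuarine".
    words = base.split()
    if len(words) > 1 and words[-1] in NON_INFORMATIVE:
        forms.add(" ".join(words[:-1]))

    # Orthographic variation: "fresh water" / "fresh-water" / "freshwater".
    # Applied to every multi-word form here, which over-generates.
    for form in list(forms):
        if " " in form or "-" in form:
            parts = re.split(r"[ -]", form)
            forms.update({" ".join(parts), "-".join(parts), "".join(parts)})

    # Naive plural: "sediment" -> "sediments".
    forms.update({form + "s" for form in list(forms) if not form.endswith("s")})

    # Exact stopwords are never emitted as dictionary entries.
    return {form for form in forms if form not in STOPWORDS}


print(sorted(variants("fresh water")))
print(sorted(variants("estuarine biome")))
```

Lower-casing every form at generation time is what makes the later lookup case-insensitive, provided the input text is lower-cased in the same way before matching.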
Limitations – Known Issues
Scope: ENVO parts not included: species, tissues, foods
● Negation not supported
● Conflicts with anatomy terms (e.g. mouth, blowhole)
ENVIRONMENTS – Sample Output
Update to EOLTAGS 346289845
File Name                        Start coord  End coord  Match text  ENVO ID
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000192
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00002297
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000043
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000000
eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000012
eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:01000001
eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:00010483
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000180
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000191
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00002297
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000176
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000000
eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000477
Tags corresponding to “Habitat” text data object: http://eol.org/data_objects/31415353 of EOL Taxon Phoenicopterus ruber (Greater Flamingo): http://eol.org/pages/913221
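Output of this shape is easy to post-process. The sketch below groups the ENVO IDs reported for each mention. It assumes the five columns are tab-separated, which the slide does not state, and the helper name `group_mentions` is hypothetical. The three rows are copied from the sample output.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the sample output; the tab delimiter is an assumption.
SAMPLE = (
    "eol_documents_ascii_nonHTML.txt\t346289845\t346289853\tmud flats\tENVO:00000192\n"
    "eol_documents_ascii_nonHTML.txt\t346289845\t346289853\tmud flats\tENVO:00002297\n"
    "eol_documents_ascii_nonHTML.txt\t346289871\t346289873\tmud\tENVO:01000001\n"
)


def group_mentions(handle):
    """Map (file, start, end, text) -> list of ENVO IDs reported for that mention."""
    mentions = defaultdict(list)
    for file_name, start, end, text, envo_id in csv.reader(handle, delimiter="\t"):
        mentions[(file_name, int(start), int(end), text)].append(envo_id)
    return dict(mentions)


grouped = group_mentions(io.StringIO(SAMPLE))
for (file_name, start, end, text), ids in grouped.items():
    # End coordinates are inclusive: end - start + 1 equals the match length.
    print(text, end - start + 1, ids)
```

For a real run, replace the `io.StringIO(SAMPLE)` handle with an opened output file.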
Traversing all IS_A and PART_OF relationships in ENVO
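The sample output lists several ENVO IDs for a single mention, which is consistent with each matched term being reported together with its ancestors. A minimal sketch of such a traversal follows. The graph is a toy: the `TOY:` identifiers and their edges are placeholders rather than real ENVO content, and IS_A and PART_OF edges are merged into one parent list for brevity.

```python
from collections import deque

# Toy graph: child -> parents reached via is_a or part_of.
# Identifiers and edges are placeholders, not real ENVO terms.
PARENTS = {
    "TOY:mud_flat": ["TOY:flat", "TOY:intertidal_zone"],
    "TOY:flat": ["TOY:landform"],
    "TOY:intertidal_zone": ["TOY:landform"],
    "TOY:landform": ["TOY:root"],
    "TOY:root": [],
}


def with_ancestors(term, parents=PARENTS):
    """Return the term plus every ancestor, breadth-first, without duplicates."""
    seen = [term]
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


print(with_ancestors("TOY:mud_flat"))
```

Tracking visited terms matters because a term reachable through two parents (here `TOY:landform`) should be emitted only once.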
Download
ENVIRONMENTS
• Home Page: http://environments.hcmr.gr/
• Tagger Software: http://download.jensenlab.org/environments_tagger.tar.gz
other forms of access
http://eol.org/info/discover_what
ENA – 1st Dec 2014 – EBI, UK ACTION ES1103
ENVIRONMENTS
ID: ENVO:00000150 Name: coral reef
Interactive Curation
http://www.ncbi.nlm.nih.gov/pubmed/18301735
Not only ENVO terms
What else is being identified?
Ready for you to discover!
Summary
• Importance of standardized metadata and annotations
• ENVO: standardized, hierarchically organized descriptions of environment types
• Literature, project, and other scientific-content web pages may describe the environmental context of a metagenomics sample
• ENVIRONMENTS:
  • Dictionary-based identification of environment descriptive terms
  • Ontologies and community standards (e.g. ENVO) as the name source
  • Command line application
• Browser extensions, a user-friendly interface
  • Highly interactive
  • Can be used while browsing the web
  • Extract ENVO terms from a selected part of a web page
  • Extended for organism, disease, and tissue mention identification
Digging-out Information
http://hartpurylrc.files.wordpress.com Photo by Dr Chatzinikolaou E
Critical Assessment of Information Extraction in Biology
BioCreative: Metagenomics Track
• Preparing a Metagenomics Track as part of the BioCreative 2015 challenge
• Aim: improve the environmental-context annotation of sequences in major metagenomics repositories
• Track coordinator: Dr. L. Hirschman, MITRE
• BioCreative (www.biocreative.org)
ACTION ES1103
ENVIRONMENTS-EOL
http://environments-eol.blogspot.com/
Encyclopedia of Life (EOL), http://www.eol.org
• process EOL taxon pages
• extract environmental context (ENVO terms)
• EOL Taxon Page: Quick Facts, Data tab
• integrated in TraitBank
• large scale biological questions
Rubenstein Fellowship 2013
In collaboration with: Jennifer Hammock, Patrick Leary, Katja Schulz, Cyndy Parr
SEQenv
http://environments.hcmr.gr/seqenv.html
• annotate microbial sequences with ENVO terms
• sequence analysis, literature mining, visualization
• GenBank isolation source, PubMed abstracts
• sample comparison, temporal/spatial pattern analysis
• extension: proteins, protein families, 3D visualization
Reused: Analysis of American bird habitats, http://blog.eol.org/ (NoPlaceLikeHome, in collaboration with: Rob Stevenson, Carl Nordman)
Hexanchus griseus EOL page, http://eol.org/pages/212027
Biodiversity – Genomics
http://jensenlab.org/
Santos A et al. (under review), preprint: http://biorxiv.org/content/early/2014/11/10/010975
Frankild S et al. (under review), preprint: http://biorxiv.org/content/early/2014/08/25/008425
Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390
Acknowledgements
HCMR-IMBG: Christos Arvanitidis, Christina Pavloudi, Katerina Vasileiadou, Lucia Fanini, Sarah Faulwetter, Anastasis Oulas
NNF CPR: Lars Juhl Jensen, Sune Frankild
U Mass: Rob Stevenson
Uni Glasgow: Christopher Quince, Umer Ijaz
EOL: Cynthia Parr, Jennifer Hammock, Patrick Leary, Katja Schulz
MM-MPI: J. Schnetzer
AWI: Dr P. Buttigieg
HITS: Dr. S. Berger
and more
Funding: EOL Rubenstein Fellowship, LifeWatch Greece, MARBIGEN, NNF-CPR, EOL-BHL NESCent Research Sprint 2014, "SEQenv" Hackathons (COST ES1103)
Thank You!
Amvrakikos Lagoons, May 2011
ACTION ES1103
id: ENVO:00000038 name: lagoon
Tutorial
• Start Firefox
• Install the "megx-seqenv-bar.xpi"
  • Drag and Drop
  • "Install Now" and "Restart"
• Visit a couple of PubMed abstracts or article web pages of your preference
  • Annotate the complete abstract
  • Annotate selected sentences only