Computational Challenges in Biological Data Science: an Optimistically Cautionary Tale

Community challenges in biological data science: an optimistically cautionary tale

Genomics

Assemblathon, CAMI, Sequence Squeeze

Structural Biology

CASP, CAPRI, Fold.it

Systems biology

sbv-IMPROVER

Function

CAFA

Text mining

BioCreative

CACAO

Data providers

Programmers

Steering committee

Assessors

Assemble the Teams

DREAM

Prepare Challenge

Computational scientists enjoy developing new methods, and the community encourages them to do so. However, it is often hard to know which method to choose: which method is best? Moreover, what does “best” mean?

To help choose an appropriate method for a particular task, scientists often form community-based challenges for the unbiased evaluation of methods in a given field. These challenges help evaluate existing and novel methods, while helping to coalesce a community and leading to new ideas and collaborations.

• Use more than one metric
• Avoid redundant analyses

Identify social media, mailing lists, & other communication venues of your community

Publish a flagship paper & specialized “satellite” papers to maximize impact & credit

Agree on co-authorship & credits before challenge

Have a data sharing plan

Data & software sharing policy should be accepted by all

Lots of work for everyone: understand commitment

Advertise challenge

Run challenge

Analyze & score

Maintain database, website & code

Hold a conference

Code & website

Set challenge rules

Identifying a community

Choosing a clear question

Start by...

The Challenge

Publish

Seek funding

Nurture the Challenge and your Community

Methods may overfit to win at challenge metrics rather than real-life problems

Risk-taking may be discouraged: “surefire” incremental additions to existing methods can be favored over novel development

Methods improve due to challenges

Communities form, expand, and become more cohesive

Classroom educational value (CACAO, CAFA)

Citizen science value (Fold.it)

Data

Create incentives

Challenge | Goal | Notes
CASP | Predicting protein structure from sequence | Long running. Improved and quantified structure prediction methods
CAPRI | Protein-protein interaction |
Fold.it | Protein structure energy minima prediction game | Improved protein structure prediction via gamification
Assemblathon | Genomic DNA sequence assembly | A better understanding of assembly metrics and species-specific considerations
CAMI | Metagenomic DNA assembly and analysis |
sbv-IMPROVER | Systems biology and precision medicine |
CAMDA | Large-scale systems biology problems |
Sequence Squeeze | Sequence data compression | Finding ways to efficiently store large volumes of sequence data
DREAM | Framework for many systems biology challenges | Provides a framework and easy setup for many challenges
CACAO | Improve function annotations in UniProt | Competition between student teams teaches biocuration to college students
CAGI | Predicting phenotypes from genetic variants | Strong wetlab engagement
CAFA | Predicting protein function | Improve function prediction methods
BioCreative | Evaluating text mining and information extraction systems |

Funders may be tempted to judge methods primarily by challenge performance

Iddo Friedberg, Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA USA

Some Community Challenges in Life Science

Further Reading

Saez-Rodriguez J, Costello JC, Friend SH et al. Crowdsourcing biomedical research: leveraging communities as innovation engines (2016) Nature Reviews Genetics 17, 470–486

Friedberg I, Wass MN, Mooney SD and Radivojac P. Ten Simple Rules for a Community Computational Challenge (2015) PLoS Computational Biology 11(4), e1004150
