Scott Edmunds slides from #IDCC13 Data Science session

www.gigasciencejournal.com

Scott Edmunds, GigaScience/BGI Hong Kong

IDCC 2013, Amsterdam, 15th January 2013

Perspectives in (Big) Data Publication

If “data is the new oil”, we should use it as such

William Gibson: “Information is the currency of the future world”

Sir Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves”

Move compute to the data: think EC2 rather than S3
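A rough back-of-envelope sketch of why compute should move to the data: at the petabyte scale quoted for the SRA in these slides, simply copying the data takes a long time. The link speeds below are illustrative assumptions, not figures from the talk, and the calculation ignores protocol overhead.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealised time (in days) to move size_bytes over a link of the given speed."""
    return (size_bytes * 8) / link_bits_per_second / SECONDS_PER_DAY


ONE_PETABYTE = 1e15  # bytes (decimal petabyte)

# Assumed sustained link speeds, for illustration only.
days_at_1_gbit = transfer_days(ONE_PETABYTE, 1e9)
days_at_10_gbit = transfer_days(ONE_PETABYTE, 1e10)

print(f"1 PB at 1 Gbit/s:  {days_at_1_gbit:.1f} days")
print(f"1 PB at 10 Gbit/s: {days_at_10_gbit:.1f} days")
```

Even under these optimistic assumptions the copy takes roughly three months at 1 Gbit/s, which is the practical argument for running analyses where the data already sits.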

Sources:

Bio-IT World: http://www.bio-itworld.com/2013/1/11/whats-fueling-our-growing-loss-faith-big-science.html

DNAnexus/SRA: http://techcrunch.com/2011/10/12/dnanexus-raises-15-million-teams-with-google-to-host-massive-dna-database/

DNAnexus: 1 PB SRA data = $15 million from Google

Atul Butte: "Not only will a genome be free, people will soon be paying you to get your genome sequenced."

New business models:

What (big data) publishing needs to take from data science #1: Data publishing should not be just about storage

Structured v unstructured data approaches

Bill Frezza: “Rich data publishing must become the norm if there is any hope of exposing both the fraudulent and the incompetent.”


Ease of submission important, but should not forget metadata:

What (big data) publishing needs to take from data science #2: You can do better than “amorphous blobs” of data

Figure: Effort vs. Usability; mm = minimal metadata threshold (>0)

• Good to lower hurdles, but data submission should not be 100% effort free

• Need for data to be harvestable & searchable (incl. PDFs)

• Need for minimal metadata standards?

• Means and additional credit for capturing richer data?

• Try to avoid “undercutting” established, well-curated community resources; doing so is potentially detrimental to the data commons

• In genomics Bermuda/Fort Lauderdale rules still apply (INSDC)
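The minimal metadata threshold (mm > 0) in the points above can be sketched as a simple check at submission time. The required field names here are hypothetical placeholders, not an actual GigaScience, INSDC, or community standard.

```python
# Hypothetical required fields; a real repository would define its own list.
REQUIRED_FIELDS = ("title", "creator", "description", "license")


def missing_metadata(record: dict) -> list:
    """Return the required fields that are absent or empty in a submission record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


def meets_minimal_threshold(record: dict) -> bool:
    """A submission passes only when every required field is filled in."""
    return not missing_metadata(record)


# An "amorphous blob": a file name and little else.
blob = {"title": "reads.tar.gz"}

# A minimally described submission (values are made up for illustration).
described = {
    "title": "Example genome assembly",
    "creator": "Example Lab",
    "description": "Assembled contigs with sequencing platform and sample details",
    "license": "CC0",
}

print(missing_metadata(blob))
print(meets_minimal_threshold(described))
```

Keeping the required list short lowers the hurdle for submitters while still making each record harvestable and searchable.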

Analysis Data

Tools/Workflows

Compute

Citable DOIs

What (big data) publishing needs to take from data science #3: Credit Reproducibility: Executable Research Objects

Publish methods/workflows/analyses

Data (doi:10.5524/100038) + Methods (doi:10.5524/100044) = Analysis (doi:10.1186/2047-217X-1-18)

Data: Wang J et al. (2012): Updated genome assembly of YH: the first diploid genome sequence of a Han Chinese individual (version 2, 07/2012). GigaScience Database. http://dx.doi.org/10.5524/100038

Methods: Luo R et al. (2012): Software and supporting material for “SOAPdenovo2: An empirically improved memory-efficient short read de novo assembly”. GigaScience Database. http://dx.doi.org/10.5524/100044

Analysis: Luo R et al. (2012): SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1:18 (28th December 2012). http://dx.doi.org/10.1186/2047-217X-1-18
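One way to picture a research object that gives separate credit to data, methods, and analysis is a small record linking the three DOIs cited above. This is an illustrative sketch only: the `ResearchObject` class is hypothetical and does not represent how GigaScience or GigaDB model these links, and the assignment of each DOI to a role is a reading of the citations on this slide.

```python
from dataclasses import dataclass

# Resolver prefix as used in the citations on this slide.
DOI_RESOLVER = "http://dx.doi.org/"


@dataclass(frozen=True)
class ResearchObject:
    """Hypothetical container tying together the citable parts of one study."""

    data_doi: str
    methods_doi: str
    analysis_doi: str

    def links(self) -> dict:
        """Map each component to its resolvable DOI URL."""
        return {
            "data": DOI_RESOLVER + self.data_doi,
            "methods": DOI_RESOLVER + self.methods_doi,
            "analysis": DOI_RESOLVER + self.analysis_doi,
        }


# DOIs taken from the citations above.
soapdenovo2 = ResearchObject(
    data_doi="10.5524/100038",
    methods_doi="10.5524/100044",
    analysis_doi="10.1186/2047-217X-1-18",
)

for role, url in soapdenovo2.links().items():
    print(f"{role}: {url}")
```

Because each component has its own DOI, the dataset and the software can be cited and credited independently of the paper.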


Technology
