22
Scientific Discovery in the Era of Big Data: More than the Scientific Method A RENCI WHITE PAPER Vol. 3, No. 6, November 2015

A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

Scientific Discovery in the Era of Big Data:More than the Scientific Method

A RENCI WHITE PAPERVol. 3, No. 6, November 2015

Page 2: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 1

ScientificDiscoveryintheEraofBigData:MorethantheScientificMethod

AuthorsCharlesP.Schmitt,DirectorofInformaticsandChiefTechnicalOfficer

StevenCox,CyberinfrastructureEngagementLead

KaramarieFecho,MedicalandScientificWriter

RayIdaszak,DirectorofCollaborativeEnvironments

HowardLander,SeniorResearchSoftwareDeveloper

ArcotRajasekar,ChiefDomainScientistforDataGridTechnologies

SidharthThakur,SeniorResearchDataSoftwareDeveloper

RenaissanceComputingInstituteUniversityofNorthCarolinaatChapelHillChapelHill,NC,USA919-445-9640

Page 3: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 2

ATAGLANCE• Scientificdiscoveryhaslongbeenguidedbythescientificmethod,whichisconsidered

tobethe“goldstandard”inscience.• Theeraof“bigdata”isincreasinglydrivingtheadoptionofapproachestoscientific

discoverythateitherdonotconformtoorradicallydifferfromthescientificmethod.Examplesincludetheexploratoryanalysisofunstructureddatasets,datamining,computermodeling,interactivesimulationandvirtualreality,scientificworkflows,andwidespreaddigitaldisseminationandadjudicationoffindingsthroughmeansthatarenotrestrictedtotraditionalscientificpublicationandpresentation.

• Whilethescientificmethodremainsanimportantapproachtoknowledgediscoveryinscience,aholisticapproachthatencompassesnewdata-drivenapproachesisneeded,andthiswillnecessitategreaterattentiontothedevelopmentofmethodsandinfrastructuretointegrateapproaches.

• Newapproachestoknowledgediscoverywillbringnewchallenges,however,includingtheriskofdatadeluge,lossofhistoricalinformation,propagationof“false”knowledge,relianceonautomationandanalysisoverinquiryandinference,andoutdatedscientifictrainingmodels.

• Nonetheless,thetimeisrightforincreasedfocusontheconstructionofCollaborativeKnowledgeNetworksforScientificDiscoverydesignedtoleverageexistingdatasourcesandintegratetraditionalandemergingscientificmethodsandtherebydrivescientificdiscoveryandapplication.

Introduction

Knowledgediscoveryinsciencereferstothesystematicprocesswherebyscientistsdrawlogicalconclusionsregardingtheworldaroundus,generatenewtheoriesbasedonthoseconclusions,andsharefindingswithotherscientistsandthelaypublic,thusenablingcriticalreviewandconsensusbeforenewfindingsareaddedtothecollectivebodyofknowledge.Historically,scientificdiscoveryhasbeenguidedbythescientificmethod,whichdatestoancienttimesandinvolvesbothaphilosophicalandpracticalapproachtoscience.Indeed,therenownedphilosopherAristotle(384-322BC)wasoneofthefirsttoapproachknowledgediscoverythroughrigorous,systematicobservation,althoughitwasn’tuntilmillennialaterthatthescientificmethodwasactuallyformalizedandimplemented,largelythroughtheworkofCopernicus(1473-1543),TychoBrahe(1546-1601),JohannesKepler(1571-1630),GalileoGalilei(1564-1642),ReneDescartes(1596-1765),andIsaacNewton(1643-1727)(Gower1997;Betz2011).Atfirstglance,thescientificmethodappearstoberelativelysimpleandstraightforward.Insum,themethodinvolvesarepeatingcycleofstandardizedsteps:theprocessbeginswithcarefulobservationofthenaturalworld,theframingofaquestiononthebasisofone’sobservations,andareviewoftheexistingbodyofknowledgetodetermineifareasonable

Page 4: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 3

explanationalreadyexists.Assumingthatthequestionremainsopenended,ascientistwillthenformulateahypothesis,designandimplementanexperimenttotestthehypothesis,andanalyzetheexperimentalresults.Theanalyticalresultsareusedtoformallyacceptorrejectthehypothesis.Thehypothesismaythenbemodified,withtheexperimentrepeatedoranewexperimentdesigned.Importantly,theresultsarethendisseminatedviapresentationtopeersandpublication,whichallowforadjudication1orpeerconsensusregardingthevalidityofthescientificfindings.This,inshort,isscientificdiscoveryviathescientificmethod.RevisitingCharlesDarwinAsanyschoolchildwillattest,theworkofCharlesDarwinprovidesanexemplarofthescientificmethodandthemanychallengesinvolvedinscientificdiscovery(Burkhardt1996;McKie2008;Montgomery2009).Darwin’sgeniusperhapsliesinhiskeenabilitytoobservethenaturalworld.Oneofhisearliestobservationswasthatsimilarspeciesarefoundacrosstheglobeandthatindividualswithinaspeciesarenotidenticalbuthavelocalvariation.Hequestionedwhymultiplespeciesexistinsteadofjustone.Darwincontinuouslyresearchedandreviewedtheexistingscientificliterature(i.e.,theestablishedscientificbodyofknowledge).Darwin’sworkwasinfluencedbytheresearchandwritingsofThomasMalthus,whofoundthathumansproducemoreoffspringthanareneededtoreplacethemselvesandspeculatedthatpopulationsizewouldsoonexceedtheavailableresourcesrequiredforsurvival.Darwinalsoobservedthatpopulationsofplantsandanimalsstayaboutthesamesizebecauseoflimitedresourcesandcompetitionforthoseresources.Darwin’sthinkingalsowasinfluencedbytheresearchandwritingsofCharlesLyell,whofoundthatsmall,gradualgeologicalprocessescanproducelargechangesovertime.DarwinmadeseveralbrilliantinferencesonthebasisofhisscientificobservationsandtheworkofMalthus,Lyell,andothersthatledhimtohypothesizethatspecieschangeslowlyinaprocessofevolutionfromacommonancestor.Hethenspentdecadesinobservationalexperimentation,ongoinganalysisofhisscientificfindings,andrefinementofhishypothesisuntilheeventuallyreachedthenowfamousconclusionthattheoriginofthespeciesliesinnaturalselection,ortheprocesswherebyindividualvariation,coupledwithcompetitionamongindividualsfornaturalresourcesandsocialcooperationamongkintoincreaseindividualfitness,determinesdifferencesamongrelatedspecies.Coincidentally,whileDarwinwasrefininghishypothesisanddrawingaconclusion,AlfredR.Wallace,amuchjuniorresearcherwhowasfamiliarwithDarwin’swork,reachedasimilarconclusion,whichheplannedtopublishanddisseminatetoscientificpeers.Awarethathisworkmightgounrecognized,2DarwinreachedanagreementwithWallace,brokeredbyLyellandJosephD.Hooker,andreluctantlypublishedajointscientificmanuscriptin1858.OntheOriginofSpeciesbyNaturalSelectionwasn’tpublisheduntilNovember1859,butonlyasabookchapter,notthefullbookthatDarwinhadintended.Interestingly,thescientificcommunityreactednegativelytoDarwin’swork(e.g.,Gray1860);manyofDarwin’speersfeltthathedidnothavesufficientevidencetoputforthhishypothesis,whileothersweredisturbedbythetheological,political,andsocialimplications.ThescientificdebatecontinuedfornearlyacenturyuntilthescientificcommunityreachedadjudicationonDarwin’shypothesisandscientificfindingsandestablishedthetheoryofevolutionbynaturalselection.Ofnote,thelaydebatecontinuestoday,largelyontheologicalandpoliticalgrounds.1“Adjudication”isalegaltermthatreferstodecisionmakinginthepresenceofaneutralthirdpartywhohastheauthoritytodetermineabindingresolutionthroughsomeformofjudgmentoraward.Inscience,adjudicationgenerallyreferstodecisionmakingandconsensusbuildingamongagroupofwidelyregardedscientificexpertsonthetopicunderdiscussion(Spangler2003).2“Ineversawamorestrikingcoincidence,ifWallacehadmyM.S.sketchwrittenoutin1842hecouldnothavemadeabettershortabstract!EvenhistermsnowstandasHeadsofmyChapters…Soallmyoriginality,whateveritmayamountto,willbesmashed.ThoughmyBook,ifitwilleverhaveanyvalue,willnotbedeteriorated;asallthelabourconsistsintheapplicationofthetheory”—CharlesDarwintoCharlesLyell,June18,1858(InCharlesDarwin’sLetters.ASelection,F.Burkhardt(Ed.),1996).

Page 5: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 4

Thescientificmethodhascertaindesirablecharacteristicsthathaveenabledittowithstandthepassageoftime,evenasnewscientifictoolsandtechniqueshavebeenintroducedtothescientificprocess.Forexample,themethodisobjectiveandremovedfromallpersonalandculturalbiases.Allhypothesesaredevelopedtobeconsistentwithacceptedscientifictruthsatthetimethatthehypothesisisgenerated.Allmeasurementsmustbeobservableandpertinenttothehypothesis.ThehypothesisandallconclusionsmustfollowtheprincipleofOccam’sRazorintermsofparsimony,orsimplicitywithfewassumptions.Thehypothesisalsomustbefalsifiable(andcapableofbeingdisproven).Finally,thescientificfindingsmustbereproduciblebyotherscientists.Withoutdispute,thescientificmethodremainsthemostcommonandonlyvalidatedapproachtoscientificdiscoveryandknowledgeextraction.Infact,drugdevelopmentintheU.S.reliesexclusivelyontherandomizedplacebo-controlledclinicaltrial—theexemplarofthescientificmethod.However,today’sadvancementsindigitalcomputingandstoragecapabilities,coupledwithnewmethodsforscientificcommunication,includingsocialmedia,areintroducingnewapproachestoscientificdiscovery,eachofwhichbringschallengesandopportunities.Inthiswhitepaper,weconsiderhowtheadventofthedigitalageandtoday’sworldof“bigdata”arechangingscientificdiscoveryprocesses.WeclosewithaframeworkforCollaborativeKnowledgeNetworksforScientificDiscoverydesignedtoleverageexistingdatasourcesandintegratetraditionalandnewdata-drivenscientificmethods,allowingforunprecedentedadvancesinscientificdiscoveryandapplication.ExploratorydatasetsandexploratoryanalysisToday’sscientisthasaccesstonumerouslargedatasetsofrelevancetomultiplescientificdomains.Forexample,theNationalOceanicandAtmosphericAdministrationmaintainsseveraldatabasescontainingdataonclimatepatterns,earthquakes,ozonelevels,andoceantemperatures;thesedataareusefultoscientistsinmanyfields,includingenvironmentalscience,energy,publichealth,andmedicine.Scientistsareincreasinglyaccessingthesedatasetsforexploratoryanalysis,whichisanapproachusedtodetermineifageneralhypothesisbearsanymeritand/orifanexperimentaldesignisfeasible.Exploratoryanalysistypicallybeginswithageneralhypothesisorexperimentaldesignthatisn’twellfleshedoutandtheapplicationofavarietyofstatisticalapproachesandvisualizationtechniquestoidentifyandvalidatedataelementsand/ordetermineifahypothesisistestableusingthedata(Behrens1997;Gelman2004;Diaconis2011).Forexample,consideraneconomicsresearcherwhoisinterestedinlearningabouttheloaningpracticesofalargecreditunion,intermsofthebreakdownofmortgageloansacrosssocioeconomicsectors.S/herequestsaccesstothecreditunion’sloandatabaseinordertodeterminehowthedataarestructuredandhowfine-grainedtheavailablesocioeconomicdataelementsare.Theresearcherdeterminesthatthedatabasecontainsdataelementsontheage,sex,andgrossincomeofloanholders,aswellastheloanamountandthelocationoftheproperty,butitdoesnotcontaininformationontheethnicityoftheloanholders.The

Page 6: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 5

researcherusesthisinformationtomodifyhis/herhypothesisforsubsequenttestingusingthesamedata.Exploratoryanalysishasalwaysbeenpartofscientificdiscoveryand,historically,hasbeenusedtogeneratethedrivingquestionunderlyingascientifichypothesis.However,inthepast,theprocesswasinformal,slow,andobservation-based(e.g.,detailednotesonthetypesoffoliageidentifiedatdifferentaltitudeswithinagivenregion);whereastoday,itisfast,large-scale,data-drivenandofteninvolvesextensiveuseofadvancedstatisticalmethodsandvisualizationtechniques.Furthermore,largedatasetsarebeinggeneratedstrictlyforexploratoryanalysis,andoftensuchdatasetsincuraconsiderableexpensewithquestionablecost-benefit.3ArecentMcKinseyreport(Manyika,etal.2015),forexample,estimatesthatlessthan1%ofdatacapturedandstoredfromanoffshoreoilrigequippedwith30,000sensorsisactuallyusedtoguideoperations.Thevalueoftherestofthedataremainstobedetermined.

Today,exploratoryanalysisisfast,large-scale,data-drivenandofteninvolvesextensiveuseofadvancedstatisticalmethodsandvisualizationtechniques.

Intermsofbenefits,exploratoryanalysisenablesascientisttoquicklydevelopamorerefined,testablehypothesisandrigorousexperimentaldesignforsubsequenthypothesistesting(Behrens1997;Gelman2004;Diaconis2011).Challengesincludethecostsinvolvedwiththegenerationandlong-termstorageofexploratorydatasets.Inaddition,drawingconclusionsfromananalysisofexploratorydatasetscanintroduceerrorifthedataelementsarenotdescribedwithsufficientmetadatatofullyunderstanddatastructureandmeaningorifthedataelementshaveinherentbiasesduetohowtheywerecollected.Additionalchallengesincludeissuesofprivacyandtheinadvertentorintentionalleakageof“sensitive”data,particularlyifascientistdoesnotobtaintherequisiteauthorizationtoaccessadatasetthatcontainssensitivedataorifsensitivedataareaccidentallyprovidedtoanunauthorizedorthird-partyuser(Behrens1997).DataminingDataminingoftenbeginswithoutahypothesisandinvolvestheapplicationoftoolsandtechniquesfromstatistics,mathematics,andvisualizationtoidentifypreviouslyunknownpatternsandtrendsinadatasetderivedfromoneormoreexistingdatabases(Fayyad,etal.1996;Holzinger,etal.2014).Whiledataminingissometimesconsideredaformofexploratorydataanalysis(Behrens1997;Diaconis2011),weargueforadistinctionbetweenexploratorydataanalysis,whichenablesascientisttoexploreadatasetintermsofstructureandtypesofdataelements,includingtherelevancyofexternaldatasources(e.g.,literaturecitations)to

3Thecostsandbenefitsoftheseexploratorydatasetsareanopentopicofdiscussion;forexample,seeWilhelmsen,etal.2013andEvans,etal.2015fordiscussionsontheprosandconsofcapturingandstoringwholegenomeversustargetedgenomesequencingdata.

Page 7: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 6

dataelements,anddatamining,whichenablesascientisttoidentifypreviouslyunknownpatternsinadatasetintermsofrelationshipsbetweendataelements.Wenotethatnewstatisticalapproachesarebeingdevelopedtoovercomesomeofthelimitationsofbothtypesofscientificdiscovery.Forexample,themaximalinformationcoefficient,orMIC,overcomesthelimitationsofPearson’scorrelationcoefficient,r,byallowingfortheidentificationofcomplex,nonlinearrelationships(e.g.,exponential,periodic)betweendataelementsindatasetsofanysize(Reshef,etal.2011).4Supposeamicrobiologistgeneratesageneexpressiondatasettoidentifygenesinvolvedintheregulationofthecellcycle.Usingtraditionalstatisticalapproaches,thescientistisabletoidentifygeneswithstrongdeterministicorlinearassociationswithdifferentaspectsofthecellcycle(e.g.,agenethatisrequiredforchromatinassembly).UsingapproachessuchastheMIC,thescientistisabletoidentifygeneswithperiodicrelationshipswiththecellcycle(e.g.,ageneassociatedwithanestablishedcyclicaleventoraneventthatoccursatapreviouslyunknownfrequencyduringthecellcycle).Dataminingbearslittleresemblancetothescientificmethod.Inparticular,dataminingisnot:(1)constrainedtobeobjective;(2)consistentwithacceptedscientifictruths;(3)parsimonious;or(4)falsifiable.Inaddition,allmeasurementsareobservableonlyafterthefact.Nonetheless,dataminingoffersbenefits,includingtheabilitytorelativelyquicklydiscoverpreviouslyunknownrelationshipsandtherebygenerateanewhypothesisthatcanthenbetestedusingthescientificmethod(Fayyad,etal.1996;Holzinger,etal.2014).Althoughoftencitedasacriticismofdatamining,akeybenefitistheabilitytogeneratemultipleassociationswithinasingledataset,whichmayyieldpowerfulnewinformation,especiallywhencombinedwithassociationsidentifiedinotherdatasets.Inthisregard,dataminingcanbeviewedasakintometa-analysisofclinicaltrialdata.Akeybenefitofdataminingistheabilitytogeneratemultipleassociationswithinasingledataset,whichmayyieldpowerfulnewinformation,especiallywhen

combinedwithassociationsidentifiedinotherdatasets.Challengesincludethefactthatdataminingrequiresextensivetrainingbeyondtheskillsofatypicaldomainscientist;withoutsuchtraining,ascientistmayidentifypatternsorassociationsthatarenotvalidorreproducible(Fayyad,etal.1996;Behrens1997;Diaconis2011;Holzinger,etal.2014).Further,thereisnocommonlyacceptedapproachormethodfordatamining,whichmakesthefieldsomewhatmoreofanartthanascience(Fayyad,etal.1996;Diaconis2011).Thereisalsoatendencytotreatallfindingsasconclusivewhentheymaybechance

4Asanaside,wenotethatthetheoreticalfoundationoftheMICliesintheconceptofmutualinformation(MI)inpairsofrandomvariables,whichwasdevelopedbyClaudeShannon,thefounderofinformationtheory,morethan50yearsago(Speed2011).TheimplementationofMIintopracticeasMICcannotbeachievedthroughmanualorsemi-manualcomputationandinsteadrequiressignificantdigitalcomputationalpower—somethingnotavailableuntilrecently.

Page 8: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 7

findings(Fayyad,etal.1996;Diaconis2011).Inaddition,whilethepowerofdataminingincreaseswhenmultipleheterogeneousdatasetsareintegratedbeforethedataaremined,sotoodoesthecomplexityoftheprocess(Fayyad,etal.1996;Holzinger,etal.2014).Withhigh-dimensionaldata,multipleapproachesmustbeusedtoanalyzethedata;forexample,sophisticatedvisualizationapproachesmayneedtobecombinedwithtraditionalstatisticalapproachestodatamining(Fayyad,etal.1996;Diaconis2011;Holzinger,etal.2014).ComputermodelingComputermodelinginvolvesconceptual,mathematical,computer-generated,orphysicalrepresentationsofreal-worldobjectsorphenomenon(Bowers2012;Buytaert,etal.2012;Perraetal.,2012;Berman,etal.2015).Itisusedtotestahypothesisorobserveandmanipulateanobjectorphenomenonthatisotherwisedifficult(orunethical)toobserveandmanipulate.Asanexample,considerascientistwhowishestodeterminehowthebeta-amyloidproteininfluencesthedevelopmentofAlzheimersDisease.S/hedevelopsasoftwareprogramusingestablishedprinciplesofproteinfoldingtovisuallyexplorein3-Ddifferentscenarioswherebybeta-amyloidmayfoldinthebraintoinfluencecognitivefunctionandleadtothedevelopmentofAlzheimersDisease.Modelingrepresentsafundamentalaspectofscientificdiscoveryandhasbeenusedthroughoutthehistoryofmodernscience(e.g.,anatomicalmodels).Theuseofcomputermodelingisrelativelynew,however,andassuch,computer-drivenmodelingtendstobeahighlyspecializedtoolasopposedtoageneraltool.Benefitsofcomputermodelingincludetheadditionofapotentiallypowerfultooltotheformerlymanualprocess,particularlywhenmodelsincorporatethevaststreamsofdatathatareavailablefromtechnologiessuchascrystallographyandothersophisticatedsourcesofdata(Perra,etal.2012;Berman,etal.2015).Computermodelsbecomeevenmorepowerfulwhentheyaregeneratedusingdataderivedfrommultidisciplinaryscience(Bowers,etal.2012;Buytaert,etal.2012).Challenges,however,includethefactthatcomputer-generatedmodelsareonlyasgoodastheunderlyingdataandsoftwareprogramsusedtocreatethem(Joppa,etal.2013;Berman,etal.2015).Inaddition,theriskofintroducingerrorincreasessignificantlywhenmodelsarecreatedforpoorlyunderstoodobjectsorphenomenon,whenthedataarequalitativeorotherwisedescribedinanon-standardizedformat,orwhenmodelsdevelopedinonefieldareapplied(withoutvalidation)toanotherfieldorevensharedbetweendifferentresearchgroupswithinthesamefield(Bowers,etal.2012;Buytaert,etal.2012;Perra,etal.2012;Joppa,etal.2013;Bergman,etal.2015).Onceintroduced,errorstendtopropagateandmayevenamplify(Buytaert,etal.2012).Moreoever,manymodelsarenotdesignedtoworkinadynamic,flexible,user-driven,webenvironment(Buytaert,etal.2012).Finally,modelingcostscanbequitehigh,sometimeshigherthanthecostsoftheinstrumentsusedtogeneratethedatathatareusedinthemodels(Berman,etal.2015).

Page 9: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 8

InteractivesimulationandvirtualrealityInteractivesimulationandvirtualrealityaresimilartomodelinginthatacomputerisusedtocreaterealisticscenariosfortestingahypothesisormanipulatinganobjectorphenomenonthatisotherwisedifficult(orunethical)totestormanipulate.However,thedifference,asdefinedherein,isthatwithinteractivesimulationandvirtualreality,humanbehavior,nottheunderlyingsoftwareprogram,guidestheoutcomeofthesimulationorvirtualexperience(Zyda2005;Kunkler2006;Diemer,etal.2015).Supposeamedicaldevicecompanyhasbeendevelopinganewanesthesiamachine.InordertoobtainapprovalfromtheU.S.FoodandDrugAdministration,thedevicehastobetestedforsafetyinPhaseIandIIclinicaltrials.Toachievethis,thecompanycreatesa“dummy”patientthatisprogrammedtomimicphysiologicalresponsesduringparticulartypesofsurgeriesinasimulatedoperatingroom.APhaseIclinicaltrialisthenconductedinthesimulatedenvironment,usingateamofactualsurgeons,anesthesiologists,andnursesasstudysubjects.Interactivesimulationandvirtualrealitycanbeusedforhypothesistestingaspartofthescientificmethod,buttheyalsocanbeusedforexploratoryanalysis.Thebenefitsincludetheabilitytoinvestigatehumanbehaviorinthecontextoftechnologyorcomputer-assistedscenariosthatresembletherealworld(Zyda2005;Kunkler2006;Diemer,etal.2015).Thus,theapproachfacilitatesbothteam-basedscienceandscienceonhumanteams.Severalchallengesexist,however.First,ateamwithexpertiseinbothsoftwaredevelopmentandhumanbehaviormustcreatethesoftwareprogramsthatdrivethesimulationsandvirtualscenarios,withcarefulattentiontoeverydetailofthescenario(Diemer,etal.2015).Also,humanbehaviorinasimulatedenvironmentmightnotbeidenticaltohumanbehaviorintherealworld;atpresent,softwareprogramscannotfullycapturethetruehumanexperience,includingbehaviorssuchasemotionalreactions(Kunkler2006).Finally,technologicalchallengesgrowexponentiallywiththecomplexityofthescenario,andupfrontcostsarehigh(Zyda2005).ScientificworkflowsScientificworkflowsrefertotheabstractstepsrequiredtocompleteaspecificscientifictask.Today,thesearetypicallydesignedasspecializedworkflowmanagementsystemsthatarecapableoforchestratingandautomatingmanyofthesteps,includingdataflowand,insomecases,personnel(e.g.,scheduling,accessrights,alertsorreminders),requiredtocompleteaspecifictaskand/ortestahypothesis(es)(Curcin&Ghanem2008;Perraud,etal.2010;Achilleos,etal.2012;Bowers2012;Guo2013).TheconceptoftheautomatedscientificworkflowappearstohaveoriginatedwithanundatedpublicationbySinghandVouk(seereferences).Scientificworkflowsareusedtoautomatetedious,time-consuming,orhighlycomplexstepsinexperimentaltestingand/oranalysis.Forinstance,imaginethatabiomedicalresearchcompanyreliesonflowcytometry,atechniquethatenablesthevisualizationandquantificationoftheproteincompositionoflivecells,asacriticalpartofitsdrugdiscoveryeffortsindiabetes.Thecompanydecidesto

Page 10: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 9

automatemuchoftheprocessandhiresasoftwareengineeringteamtodesignascientificworkflowsystemtoautomateandcoordinatethetechnologiesandresearchteamsinvolvedineachstepoftheprocess,includingisolationofbloodcellsfrompatientswithdiabetes,incubationofcellswithconjugatedantibody(ies),multiplewashes,immunofluorescencevisualization,andanalysistodeterminecellularsubpopulations.Thedesignofmostscientificworkflowsystemsmodelstheactualscientificmethod,exceptthatthemeasurementsareautomatedandnotalwaysobservablebythescientist.Moreover,scientificworkflowsareincreasinglybeingusedforautomateddeductionandinference,whicharestepsthathistoricallyreliedonthescientist.Intheexampleabove,forinstance,theuseofflowcytometrytoidentifycellularsubpopulationsonthebasisoftheirfluorescentprofilecanbemoreofanartthanascience,especiallywhenmultiplefluorescentconjugatesareused;assuch,automatingtheprocesscanyielderroneousorinconsistentresults.Thatsaid,scientificworkflowsenableascientisttoconductexperimentsmorequickly,efficiently,andwithgreaterstatisticalpowerandreproducibilityduetothelargedatavolumesthatawell-designedscientificworkflowcanhandleandtheabilitytointegrateheterogeneousdatasources,ofteninrealtime(Perraud,etal.2010;Achilleos,etal.2012;Bowers2012;Guo2013).Scientificworkflowsareincreasinglybeingusedforautomateddeductionand

inference,whicharestepsthathistoricallyreliedonthescientist.Whilepotentiallyquitepowerful,thescientistintherealmoftheworkflowisoftendependentonthetechnologyandremovedfromcriticaldecision-makingsteps(e.g.,calculations,incubationtimes,etc.),which,intheabsenceofcarefulsystemdesignandimplementation,threatensthevalidityandreproducibilityofworkflowfindings.Furthermore,unlesstheworkflowstepsareclearlyannotatedandopenlyaccessible,apeerreviewerwillbeunabletofullyevaluateandreproducethescientificfindings(Bowers2012).Moreover,workflowdesigncanbequitecomplex,involvinghundredsofindividualanalysissteps,largesetsofheterogeneousdata,andmultipleexistingworkflowsthatoftenwerenotdesignedtoworktogether;thecomplexitypresentsdifficultiesinuser-friendliness,design,andcaptureandrecordofprovenance5(Perraud,etal.2010;Achilleos,etal.2012;Bowers2012;Gup2013).Thereisalsoapracticallimittothedegreeofgranularityinworkflowtasks;highlygranularactivitiesaretypicallynotfeasibleduetoanegativeimpactonoverallsystemperformance(Perraud,etal.2010).Anadditionalconcernisthatscientificworkflowscanintroducesystematicbiasinthaterrors(e.g.,anincorrectcalculation)maybeintroducedandpropagatedindefinitelybecauseautomationmakesitmoredifficulttodetectandcorrecterrors.Finally,workflowstendtobeveryspecializedandoftencannotberealisticallyadaptedfordifferentdomains(Curcin&Ghanem2008).

5“Provenance”referstothecaptureofmetadatathatdescribestheoriginsofthedata,eachstepofdatatransformationandanalysis,andarecordofversioningtoidentifyhowup-to-datethedataare(Guo2013).

Page 11: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 10

DisseminationandadjudicationDisseminationandadjudicationofscientificdiscoveriesarecriticalcomponentsofscientificdiscovery,asthesearetheprocessesthroughwhichnewscientific“truths”areaddedtothecollectivescientificbodyofknowledge(Smith2006;ACSpublications2013;Almeida2013;ColdSpringHarborLaboratory2014).Disseminationhashistoricallybeenachievedthroughpresentationtoscientificpeersandpublicationinscientificjournalsinordertoobtaincriticalpeerreviewandvalidation(orrejection)ofscientificfindingsandinformthescientificandlaycommunities.Adjudicationisthemethodbywhichaconsensusisreachedamongscientificexpertsregardingthevalidityofthescientificfindings(alsoseefootnote1).Asanexampleofthedisseminationandadjudicationprocesses,assumeascientistdeliversaPowerPointpresentationonnewtechnologiesforhydrologyatascientificconferenceonwatersafety.S/hereceivesimmediatefeedbackonthepresentationfromcolleaguesinattendanceatthelecture.Onthebasisofthatfeedback,thescientistdecidesthatanadditionalexperimentisrequiredbeforehis/herworkisreadyforpublicationinapeer-reviewedjournal.Toprovideanotherexample,imagineascientistintendstopublishhis/hernewceramicsmodelinascientificjournalfocusedonmaterialsscience.Thejournalrequirescriticalpeerreview.6Thescientistsubmitsthemanuscripttothejournalforpotentialpublication,anditisreviewedbyanonymouspeers.Onthebasisofthepeerreview,theeditordecidesthatbeforethemanuscriptcanbepublished,thetextneedstoberevisedtobetterframetheneedforanewceramicsmodelinthecontextofexisting,similarmodels.Historically,disseminationandadjudicationhavebeenkeycomponentsofthescientificmethodandthefinalscreenbeforenewscientifictruthsareaddedtothescientificbodyofknowledge.Thebenefitsaretremendousbecausetheprocessenablesscientificfindingstoberigorouslyevaluatedbyexpertscientists,thusensuringtherobustnessandintegrityofscientificfindings(Smith2006;ACSpublications2013;Almeida2013;ColdSpringHarborLaboratory2014).Thechallenges,however,aregrowingintheeraofbigdata.Specifically,thedigitalworldischangingthewaythatscientificfindingsaregenerated,presented,published,andcatalogued.Forexample,traditional,paper-based,peer-reviewedscientificjournalsarecompetingwithnew,electronic,openaccessscientificjournals7thatmaynotrequireaslow,rigorouspeerreview,butonlyaquick,minimaleditorialreviewbeforepublication(Almeida2013;ColdSpring

6“Peerreview”referstotheprocesswherebyscientificjournalsorscientificfundingorganizationsenlist(forfree)thehelpofcolleagueswithinthesamefieldasthemanuscriptorgrantproposal’sinvestigativeteamtoevaluatethescientificmeritandintegrityofadocument(Smith2006).7“Openaccess”scientificjournalsareonlinejournalsthatpermitaccesstoalljournalcontentwithoutasubscriptionfee.(ColdSpringHarborLaboratory2014).Thepremiseistoallowordinarycitizenstoaccessallscientificfindings,particularlythosethataregeneratedwithfederalmoney,asa“publicgood”(e.g.,http://publicaccess.nih.gov/),muchthesamewaythatfederalsalariesarereleasedtothepublicforevaluation.“Publicgoods”,asdefinedbyeconomists,arethoseitems(commoditiesorservices)thatarebothnon-excludableandnon-rivalrous;examplesincludepublicparks,sewersystems,highways,policeservices,etc.(Samuelson1954).Ironically,manyjournalsrequiretheauthortopayafeeforopenaccesspublication.

Page 12: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 11

HarborLaboratory2014).Inaddition,manyscientistsarenowpublishingnon–peer-reviewedscientificfindingsinelectronicform,viawebsites,digitalwhitepapersandtechnicalreports,blogs,webcasts,YouTubevideos,etc.(ACSpublications2013;Almeida2013).Thisapproachallowsforrapiddisseminationofscientificfindings,butitbypassesboththepeerreviewprocessforpublicationandthepeerselectionprocessforspeakerpresentationatascientificmeeting.Thatsaid,thecurrentpeerreviewprocessisnotwithoutflaw,intermsofbiasanderror(Smith2006),soperhapstheshifttodigitalself-publicationwillreformthepeerreviewprocessorkindleanentirelynewformofpeerreview.NewhybridmodelssuchaseGEMscombinepeerreviewwithfreepublicationandopenaccesspermissions;whetherthesemodelsarefiscallysustainablehasyettobeseen.Anotherchallengeisthewealthofscientificinformationavailabletoday.Thisdisseminationdelugemakesitchallengingtoadjudicateexistingfindingsandidentifyimportantnewfindingsthatmaybecomelostinthewealthofavailabledata(i.e.,the“longtailofscience”)(Trader2012).KnowledgebasesTraditionally,thedisseminationandadjudicationofscientificknowledgeinvolvesocialprocesses;howeverthestorage,integration,search,andinferenceofknowledgeisrapidlychangingtoahybridmodelthatreliesequallyuponhumansandlarge,complex,collectionsofdigitalinformationthatincludedata,relationshipsbetweendataelements,humanassertionsonthedata,and,insomeinstances,capabilitiesforautomatedcognitiveprocessing:knowledgebases.Theterm“knowledgebase”wasintroducedinthe1970s(Jarke,etal.1978)todistinguishitfromtheexistingdatabaseofthetime,whichessentiallystoreddataintabularformforuseraccessviaquery.Thetermremainslooselydefined,butaknowledgebasecanbeconsideredtobeacomputingsysteminwhichassertionsarerepresentedandpersistedwithinanoverarchingontologicalframeworkthatprovidesthesemanticcontextandthatallowsforqueryingandcomputingacrossassertions.Thehumanuserisintegraltotheknowledgederivedfromandcontainedwithintheknowledgebase;and,insomecases,thesystemcanbestructuredtoderivenew,automatedassertionsbasedonmachinelearningornewdata.Whiletraditionaldatabaseshaveevolvedtobecometransaction-basedandrelationalandcanbepartofaknowledgebasesystem,knowledgebasesremaindifferentiatedinthatthehumanuser’sabilitytodynamicallyaccess,addto,andusetheknowledgeisanessentialpartofthedesignandfunctionofthesystem(e.g.,Wikipedia)(Pugh&Prusak2013).Completelyautomatedknowledgebasesareunderdevelopment,butautomatedcurationofhumanassertionsisnotyetpossibleandmayneverbe.Nonetheless,emergingknowledgebasesareincorporatingsophisticatedalgorithmsforautomatedandsemi-automatedstructuring,parsing,summarization,retrieval,andvisualizationofinformation.Leading-edgeproductionknowledgebasescurrentlycontain109to1011assertionsminedfromavarietyofunstructuredandsemi-structureddatasources,includingWikipediaandPubMed(Bordes,etal.2015;Southern2015).ProminentcommercialexamplesofknowledgebasesincludetheIBMWatsonsystemandElsevierPathwayStudio;proprietaryexamplesincludeWalMart’sSocialGenomeandGoogle’sKnowledgeGraph;opensourcesystemsincludeOpenBEL,OpenPHACTS,andDARPA’sDeepDive.OtherexamplesincludetheSemanticWebandtheLinkedDatamovement.

Page 13: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 12

Theuseofknowledgebasesinscientificresearchisgrowing.Forinstance,theGenBankknowledgebaseallowsonetosearchforspecificgenesandthenidentifyawealthofcuratedinformationaboutthatgeneandrelatedgenesandexternaldatasources.TheGeneOntology(GO)knowledgebaseprovidesannotationongenes,biologicalprocesses,cellularcomponents,andmolecularfunctionsandhasbeenprominentintheapplicationofknowledgebasesfornewstatisticalapproachessuchasenrichmentanalysis(Mi,etal.2013).Asscientistsbecomemoreawareofthebenefitsofknowledgebases,andasthequalityandquantityoftheirassertionsincrease,theirapplicationinscientificdiscoveryisexpectedtogrow.

Emergingknowledgebasesareincorporatingsophisticated

algorithmsforautomatedandsemi-automatedstructuring,parsing,summarization,retrieval,andvisualizationofinformation.

Majorchallengesinthedevelopmentofknowledgebasesforscientificdiscoveryincludethecostsofhumancuration,intermsoftheamountoftimeandexperiencerequiredtocontributetoaknowledgebaseinameaningfulway,andthelackofincentivesforscientiststocontributetoknowledgebases,especiallythoseperceivedtobeoutsideofascientist’sareaofexpertise.Asexemplifiedintheknowledgebasesystemslistedabove,recentresearchhasadvancedourunderstandingofhowtoconstructknowledgebases,includingtheincorporationofprobabilisticmodels,structuredrepresentationofunstructureddata,deep-learningsystems,question-answerschemas,andnaturallanguageprocessingalgorithms.Recentresearchalsohasadvancedourunderstandingofhowtointegrateknowledgefromdifferentsourceswithdifferingqualityanddifferingsemantics(Southern,2015;Nickel2015).Forexample,DeepDiveisbuiltuponaprobabilisticgraphmodelthatallowsfortheencodingofassertionsandrelationshipsbetweendataelementsthathaveinaccuracies,suchasthosethatarederivedfrompredictivemodelsanddata-drivenapproaches,alongwithassertionsthathavehighervalidity.Thesystemusesinferenceenginestominetherelationships,aswellasthestrengthoftheassertions,inordertoconstructdifferent“worldmodels”uponwhichnewinferencescanbemade.Theseadvancesshowpromise,butthehumanuser,thescientist,remainstheroadblockinthefurtherdevelopmentandwidespreadapplicationofknowledgebasesforscientificdiscovery.TheBigPicture:TheFutureofKnowledgeDiscoveryinScienceWerecognizethattherehavealwaysbeenscientificapproachesthatdonotrelyonthescientificmethod,asothershavearguedpreviously(e.g.,Cleland2001).Yet,thescientificmethodremainsthe“goldstandard”intermsofscientificdiscovery.Thewealthofdataavailabletodayofferstheunprecedentedopportunitytoconductscienceinwaysneverbeforeenvisioned:real-timescience,real-worlddata,andnewmethodsofscientificdiscovery.Inembracingnewapproachestoscientificdiscoverythatarenotnecessarilyalignedwiththe

Page 14: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 13

scientificmethod,ormaybeevencontradictit,wemustconsiderhowtobestemployandunifythenewapproacheswitheachotherandwiththetraditionalscientificmethod.JohnTukey,consideredbymanytobethemostinfluentialstatisticianinmoderntimes,oncestated:“Thebestpartofbeingastatisticianisthatyougettoplayineveryone’sbackyard.”AmongTukey’smanyaccomplishmentsaretheintroductionoftheterms“bit”(in1946)and“software”(in1958)andtheconceptsandmethodsofexploratorydataanalysis,datamining,theTukeyFastFourierTransformation,andavarietyofotherstatisticalandmathematicalapproachesforscientificdiscovery,manyofthemalsonamedafterhim(Bittrich2000;Leonhardt2000;Brillinger2002).Tukeyalsointroducedtheterm“uncomfortablescience”,whichhasbeendescribedaslyingsomewherebetweenclassicalmathematicalstatisticsand“magicalthinking”(Diaconis2011).Theconceptisthatinmanyinstances,scientificinferencecanandmustbemadefromintuitionandexploration,ratherthancontrolledexperimentation,usingafiniteamountofpotentiallyrichdatathatareflawedandoftennonreplicable.8Tukeydiedin2000,beforetheintroductionoftheterms“bigdata”and“InternetofThings”,butonecanbetthathe’dbethefirsttotakeadvantageofthemanyopportunitiesthatthe“datafication”ofsocietyhasprovided(Bertolucci2013).

Thewealthofdataavailabletodayofferstheunprecedentedopportunitytoconductscienceinwaysneverbeforeenvisioned:real-timescience,real-world

data,andnewmethodsofscientificdiscovery.

Anotherforward-thinkingscholaristhelateJames(Jim)Gray,whoselastpositionwasasaTechnicalFellowandManageroftheBayAreaResearchCenteratMicrosoft.Inadditiontohispioneeringworkonlargedatabasesandtransactionprocessingsystems,GrayintroducedtheconceptoftheFourthParadigmforscience(Hey,etal.2009).9WhileGray’sinitialfocuswasonexploratoryanalysisanddata-intensivescientificcomputation,theFourthParadigmessentiallyrepresentstheradicaltransformationofthescientificmethodfromhypothesis-driventohypothesis-generatingscience,asdescribedherein.10Gray’sparadigmwasdevelopedinthelate1990sandearly2000s,buthisgeniusliesinhisvisionfortoday’sworld,justadecadeor8“Farbetteranapproximateanswertotherightquestion,whichisoftenvague,thananexactanswertothewrongquestion,whichcanalwaysbemadeprecise.”—JohnTukey,1962,Thefutureofdataanalysis.AnnalsofMathematicalStatistics,33,1–67(quotedonpp.13-14).9AfterGray’sdeath,AlexSzalayandcolleaguesformallyintroducedtheconceptofthe“FourthParadigm”,withhisco-authoredpublicationinthejournalScience(Bell,etal.2009).10DuringGray’slastlectureonJanuary11,2007[http://research.microsoft.com/en-us/um/people/gray/jimgraytalks.htm],heisquotedasstatingthefollowing:“Theworldofsciencehaschanged,andthereisnoquestionaboutthis.Thenewmodelisforthedatatobecapturedbyinstrumentsorgeneratedbysimulationsbeforebeingprocessedbysoftwareandfortheresultinginformationorknowledgetobestoredincomputers.Scientistsonlygettolookattheirdatafairlylateinthispipeline.Thetechniquesandtechnologiesforsuchdata-intensivesciencearesodifferentthatitisworthdistinguishingdata-intensivesciencefromcomputationalscienceasanew,fourthparadigmforscientificexploration.”(Hey,etal.,2009).

Page 15: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 14

twolater,wherenearlyeveryscientificdomainismovingtowarddata-intensiveanddata-enabledscientificdiscovery.Thisshiftinvolvestraditionalfieldssuchasphysicsandastronomy,whichhavealwayshadaccesstorichdatasourcesbutarenowfacingunprecedentedcomputationalandanalyticalchallenges,andemergentfieldssuchasenvironmentalscience,whichhasonlyrecentlyhadaccesstothewealthofdistributeddataavailablefromenvironmentalsensorsandsatellites.Eventhe“soft”sciencessuchassocialscienceandpoliticalsciencearerecognizingthepowerofdataderivedfromsocialmediadata,mobiledevices,and“smartcities”.Moreover,appliedfieldssuchasmedicineareembracingtheFourthParadigmandthesemyriadnewdatasources.Forexample,in2011,theNationalResearchCouncilorganizedanadhoccommitteetodevelopaframeworkforaunifiedtaxonomyofhumandiseasethat,whenimplemented,willaccelerateprogresstowardprecisionmedicine,whichisanewareaofmedicinethataimstopersonalizemedicinethroughtheintegrationofdataonindividualvariabilityingenes,environment,andlifestyleinordertoidentifythemostappropriatemedicalmonitoringandtreatmentplanforanygivenpatient(NationalResearchCouncil2011).Thecommittee’sworklargelymotivatedPresidentObama’sJanuary2015announcementduringtheStateoftheUnionAddressonanewNationalInitiativeinPrecisionMedicine(Collins&Varmus2015).AFrameworkforScientificDiscoveryintheEraofBigData:TheCollaborativeKnowledgeNetworkThetimeisrighttowhole-heartedlyembracetheflexibilityandpowerthatnewapproachestoscientificdiscoveryoffer,whilemaintainingthepowerthatthetraditionalscientificmethodholds.Thescientificapproachespresentedinthiswhitepaperareincreasinglybeingadoptedbyscientists;however,thereremainshesitancyabouttheirrolesandlegitimacyinthescientificprocess,alackoftraininginandincentivesfortheiruse,andtheneedforadditionalresearchtotailorandimprovetheseapproachesforscientificdiscovery.Wearguethataprimarychallengeinmovingtowardsacombinationofhypothesis-anddata-drivenscienceistheintegrationofapproachesatacommunitylevel.TheNationalResearchCouncil’sframeworktocreateapathtowardpersonalizedmedicineincludestheconceptofaKnowledgeNetworkofDiseaseasafederateddiscoverynetworkorganizedaroundanInformationCommonsofstructuredpatient-centricdatabasedlargelyongenomicsandincludingpatientmedicalhistory,currentsignsandsymptoms,etc.(NationalResearchCouncil2011).Wesuggestasimilar,butbroaderframeworkforaCollaborativeKnowledgeNetworkforScientificDiscovery(Figure1).Theenvisionednetworkwouldbuildupontheknowledgebasesthatarealreadybeingdeveloped,butthroughanopen,community-basedeffortdesignedtolinktheseknowledgebasesandtherebycreateaknowledgenetworkforscientificdiscovery.Theeffortwouldinvolvenotonlyscientificexperts,butallstakeholders,includingpolicymakers,industryrepresentatives,andevenordinarycitizens,therebyenablingnontraditionaldatasourcesanddatatypestodrivescientificdiscovery.Indeed,scientiststodayhaveaccesstoanabundanceofnewdatasourcesanddatatypes,whichwe’veclassifiedas:(1)

Page 16: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 15

archetypalorvisiblebigdatasuchasthelarge,well-curated,publicdatasetsavailablefromNASAandotherlarge-scaleresearchinitiatives;(2)crowd-sourcedorsupernovabigdatasuchasthemoment-to-momentdatasetsavailablefromTwitterandtheInternetofThings;and(3)

Figure1.ConceptualframeworkfortheproposedCollaborativeKnowledgeNetworkforScientificDiscovery.long-tailordarkbigdatasuchasthosehard-to-finddatasetsheldbysmallscientificteamsandamateurorcitizenscientists.EffortssuchasWikipediaprovideamodelforopen,collaborative,community-governedknowledgeaccumulation(Cohen2014).Acommunity-drivenandcommunity-governedknowledgenetworkcouldrevolutionizescientificdiscoveryandmovescienceindirectionsnotimaginabletoday.ClosingConsiderationsTheCollaborativeKnowledgeNetworkforScientificDiscovery,asproposedandenvisionedherein,willrequirenewscientificmethods,approaches,andpoliciesinordertofullyrealizeitspotential.Forexample,issuesrelatedtodataprovenancewillneedtobeaddressed,aswillissuesrelatedtofundingforongoingdevelopmentandlong-termsustainability.Theintegrationofknowledgeassertionsgeneratedthroughdifferentscientificapproachesthathavedifferinglevelsofcertaintyandexpressivenessremainsafundamentalchallenge,butonewhereprogressisbeingmade.Thesearenottrivialissues,asthecreationofnewknowledgefromcollectiveknowledge,particularlywhenautomated,yieldscomplicatedissuesregardingintellectualpropertyandownership.Withmultiplestakeholdersinvolvedintheenvisioned

Page 17: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 16

knowledgenetwork(e.g.,academics,industry,government,thepublic),ownershipissueswillneedtoberesolved.Ongoingdevelopmentandlong-termsustainabilitybringrelatedchallengesthatlikelywillrequiresignificantup-frontcommunitybuy-in.Complexmodelsforprovenanceandsustainabilitymayneedtobedevelopedtoaccountforconflictinginterests.Furthermore,today’sstudentsandworkersaren’tbeingtrainedfortheFourthParadigmofscienceandoftenlackthecriticalthinkingskillsrequiredtoobjectivelynavigatethroughtheabundanceofdataavailabletodayandmakemeaningfulinferencesabouttheworldaroundthem.Thisgapintrainingandworkforcedevelopmentwillneedtobeaddressedinordertoensurethataknowledgenetworkisnotmisused.Perhapsthemostimportantconsideration,however,isthedevelopmentofnewapproachestoenablethecriticalevaluationandadjudicationofscientificfindingsinthemosttraditionalsense.For,intheabsenceofvalidautomatedmethodsfordrawingconclusionsregardingscientific“truths,”scientistswillcontinuetofaceadatadelugewithoutarationalpathtowardpeerconsensus,erroneousknowledgemaybeintroducedandpropagated,andthelaypublicwilllosetrustinscience.Indeed,thepublicalreadyhasatendencytomistrustscience,especiallywhenscientificfindingsgoagainstintuitionorpersonalbelief(Achenbach2015).Withoutafacilemethodtodeterminethelegitimacyofscientificfindings,anypublicmistrustinsciencewillonlyincrease.Intheabsenceofpublictrust,scienceitselfwillsuffer.Howtoreferencethispaper:

Schmitt,C.,Cox,S.,Fecho,K.,Idaszak,R.,Lander,H.,Rajasekar,A.andThakur,S.(2015):ScientificDiscoveryintheEraofBigData:MorethantheScientificMethod.RENCI,UniversityofNorthCarolinaatChapelHill.Text.http://dx.doi.org/10.7921/G0C82763

Page 18: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 17

References*Achenbach,J.(2015).Whydomanyreasonablepeopledoubtscience?NationalGeographic.http://ngm.nationalgeographic.com/2015/03/science-doubters/achenbach-text.Achilleos,K.G.,Kannas,C.C.,Nicolaou,C.A.,Pattichis,C.S.,&Promponas,V.J.(2012).Opensourceworkflowsystemsinlifesciencesinformatics.Proceedingsofthe2012IEEE12thInternationalConferenceonBioinformatics&Bioengineering(BIBE).Larnaca,Cyprus.Alemida,P.(2013).Theoriginsandpurposeofscientificpublications.JEPSBulletin.http://blog.efpsa.org/2013/04/30/the-origins-of-scientific-publishing/.AmericanChemicalSocietyPublications.(2013).Befoundorperish:writingscientificmanuscriptsforthedigitalage.BIOJOURNALS.AmericanChemicalSociety.http://pubs.acs.org/bio/ACS-Guide-Writing-Manuscripts-for-the-Digital-Age.pdf.Behrens,J.T.(1997).Principlesandproceduresofexploratorydataanalysis.PsycholMethods.,2(2),131–160.Bell,G.,Hey,T,&Szalay,A.(2009).BeyondtheDataDeluge,Science,323(5919),1297-1298.Berman,H.M.,Gabanyi,M.J.,Groom,C.R.,Johnson,J.E.,Murshudov,G.N.,Nicholls,R.A.,Reddy,V.,Schwede,T.,Zimmerman,M.D.,Westbrook,J.,Minor,W.(2015).Datatoknowledge:howtogetmeaningfromyourresult.IUCrJ.,2(Part1),45–58.Bertolucci,J.(2013).Bigdata’snewbuzzword:datafication.InformationWeek.http://www.informationweek.com/big-data/big-data-analytics/big-datas-new-buzzword-datafication/d/d-id/1108797?.Betz,F.(2011),OriginofScientificMethod.InManagingScience:MethodologyandOrganizationofResearch(Innovation,Technology,andKnowledgeManagementSeries,No.9),byF.Betz.SpringerScience+BusinessMedia.Bittrich,M.E.(2000).BiographyofJohnWilderTukey.AT&TBellLaboratoriesCommunication.http://cm.belllabs.com/cm/stat/tukey/bio.html.Bordes,A.,Weston,J.,Collobert,R.,Bengio,Y.(2009).Learningstructuredembeddingsofknowledgebases.AssociationfortheAdvancementofArtificialIntelligence,301-306.http://www.thespermwhale.com/jaseweston/papers/AAAI11.pdf.Bowers,S.(2012).Scientificworkflow,provenance,anddatamodelingchallengesandapproaches.J.DataSemant.,1(1),19–30.

Page 19: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 18

Brillinger,D.R.(2002).JohnW.Tukey:hislifeandprofessionalcontributions.AnnalsStatist.,30(6),1535–1575.Burkhardt,F.(1996).CharlesDarwin’sLetters.ASelection.CambridgeUniversityPress.Buytaert,W.,Baez,S.,Bustamante,M.,Dewulf,A.(2012).Web-basedenvironmentalsimulation:bridgingthegapbetweenscientificmodelinganddecision-making.EnvironScienceTechnol.,46,1971–1976.CohenN.Wikipediavs.theSmallScreen.TheNewYorkTimes.February9,2014.http://www.nytimes.com/2014/02/10/technology/wikipedia-vs-the-small-screen.html?_r=2m.ColdSpringHarborLaboratory.(2014).Guidetoopenaccess.http://cshl.libguides.com/content.php?pid=222607&sid=1847688.Collins,F.S.,Varmus,H.(2015).Anewinitiativeonprecisionmedicine.NEJM,372(9),793–795.http://www.nejm.org/doi/pdf/10.1056/NEJMp1500523.Curcin,V.,Ghanem,M.(2008)Scientificworkflowsystems–canonesizefitall?Proceedingsofthe2008IEEE,CIBEC.IEEE.Diaconis,P.(2011).TheoriesofDataAnalysis:frommagicalthinkingthroughclassicalstatistics.In:ExploringDataTables,Trends,andShapes,byD.C.Hoaglin,F.Mosteller,andJ.W.Tukey.JohnWiley&Sons,2011.Diemer,J.,Alpers,G.W.,Peperkorn,H.M.,Shiban,Y.,Mühlberger,A.(2015).Perceptionandpresenceonemotionalreactions:areviewofresearchinvirtualreality.FrontPsychol.,6,Article26.Evans,J.,Wilhelmsen,K.,Berg,J.,Schmitt,C.,Krishnamurthy,A.,Fecho,K.,&Ahalt,S.(2015).Clinicalgenomics:howmuchdataisenough?.RENCI,UniversityofNorthCarolinaatChapelHill.Text.doi:10.7921/G0F769G9.http://renci.org/wp-content/uploads/2015/02/0215White-Paper-CLinicalGenomics-highres.pdf.Fayyad,U.,Piatetsky-Shapiro,G.,Smyth,P.(1996).Fromdataminingtoknowledgediscoveryindatabases.AIMagazine,Fall1996,37–54.Gelman,A.(2004).Exploratorydataanalysisforcomplexmodels.J.ComputationalGraphicStatist.,13(4),755–779.Gower,B.(1997).Scientificmethod:anhistoricalandphilosophicalintroduction.London,England:Routledge.Guo,P.(2013).Datascienceworkflow:overviewandchallenges.CommunicationsoftheACMblog.http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext.

Page 20: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 19

HeyT,TansleyS,TolleK(Eds.)TheFourthParadigm.Data-IntensiveScientificDiscovery.Seattle,Washington:MicrosoftCorporation;2009.Holzinger,A.,Dehmer,M.,Jurisica,I.(2014).Knowledgediscoveryandinteractivedatamininginbioinformatics–state-of-the-art,futurechallengesandresearchdirections.BMCBioinformatics,15(Suppl6),1.Jarke,M.,Neumann,B.,Vassiliou,Y.,Wahlster,W.(1978).KBMSrequirementsofknowledge-basedsystems.In:Schmidt,J.W.,&Thanos,C.(eds.)FoundationsofKnowledgeBaseManagement.ContributionsfromLogic,Databases,andArtificialIntelligence.Berlin,Germany:Springer,pp.391-395.http://www.dfki.de/wwdata/Publications/KBMS_Requirements_of_Knowledge-Based_Systems.pdf.Joppa,L.N.,McInerny,G.,Harper,R.,Salido,L.,Takeda,K.,O’Hara,K.,Gavaghan,D.,Emmo,S.(2013).Troublingtrendsinscientificsoftwareuse.Science,340(6134),814–815.Kunkler,K.(2006).Theroleofmedicalsimulation:anoverview.TheInternationalJournalofMedicalRoboticsandComputerAssistedSurgery,2,203-210.http://onlinelibrary.wiley.com/doi/10.1002/rcs.101/pdf.Leonhardt,D.(2000).JohnTukey,85,statistician;coinedtheword‘software’.TheNewYorkTimes.http://www.nytimes.com/2000/07/28/us/john-tukey-85-statistician-coined-the-word-software.html.Manyika,J.,Chu,M.,Bisson,P.,Woetzel,J.,Dobbs,R.,Bughin,J.,Aharon,D.(2015).TheInternetofThings:mappingthevaluebeyondthehype.McKinsey&Company.http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights/Business%20Technology/Unlocking%20the%20potential%20of%20the%20Internet%20of%20Things/Unlocking_the_potential_of_the_Internet_of_Things_Executive_summary.ashx.McKie,T.HowDarwinwontheevolutionrace.TheGuardian.June21,2008.http://www.theguardian.com/science/2008/jun/22/darwinbicentenary.evolution.MiH,MuruganujanA,CasagrandeJT,ThomasPD.(2013).Large-scalegenefunctionanalysiswiththePANTHERclassificationsystem.NatProtoc.,8(8),1551-1566Montgomery,S.(2009).CharlesDarwin&evolution,1809~2009.Naturalselection.Cambridge,UK:Christ’sCollege,UniversityofCambridge.http://darwin200.christs.cam.ac.uk/pages/index.php?page_id=d3.NationalResearchCouncil.(2011).TowardPrecisionMedicine.BuildingaKnowledgeNetworkforBiomedicalResearchandaNewTaxonomyofDisease.Washington,D.C.:TheNationalAademiesPress.ContractNo.N01-0D-4-2139.http://www.nap.edu/catalog/13284/toward-precision-medicine-building-a-knowledge-network-for-biomedical-research.

Page 21: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 20

Nickel,M.,Murphy,K.,Tresp,V.,Gabrilovich,E.(2015).Areviewofrelationalmachinelearningforknowledgegraphs.Frommulti-relationallinkpredictiontoautomatedknowledgegraphconstruction.CornellUniversityLibrary:arXiv:1503.00759.http://arxiv.org/abs/1503.00759.Perra,N.,Goçalves,B.,Pastor-Satorras,R.,Vespignani,A.(2012).Activitydrivenmodelingoftimevaryingnetworks.ScientificReports,2,469,1.Perraud,J.M.,Bai,Q.,Hebir,D.(2010).Ontheappropriategranularityofactivitiesinascientificworkflowappliedtoanoptimizationproblem.InternationalEnvironmentalModellingandSoftwareSociety(iEMSs)2010InternationalCongressonEnvironmentalModellingandSoftwareModellingforEnvironment’sSake,FifthBiennialMeeting.Ottawa,Canada.Pugh,K.,Prusak,L.Designingeffectiveknowledgenetworks.MITSloanManagementReview.September12.2013.http://sloanreview.mit.edu/article/designing-effective-knowledge-networks/.Reshef,D.N,Reshef,Y.A.,Finucane,H.K.,Grossman,S.R.,McVean,G.,Turnbaugh,P.J.,Lander,E.S.,Mitzenmacher,M.,Sabeti,P.C.(2011).Detectingnovelassociationsinlargedatasets.Science,334(6062),1518-1524.Samuelson,P.A.(1954).Thepuretheoryofpublicexpenditure.TheReviewofEconomicsandStatistics,36(4),387–389.http://www.ses.unam.mx/docencia/2007II/Lecturas/Mod3_Samuelson.pdf.Singh,M.P.,Vouk,M.A.(undated).Scientificworkflows:scientificcomputingmeetstransactionalworkflows.http://www.csc.ncsu.edu/faculty/mpsingh/papers/databases/workflows/sciworkflows.html.Smith,D.F.(2014).Abriefhistoryofgamification.EDTECH,FocusonHigherEducation.http://www.edtechmagazine.com/higher/article/2014/07/brief-history-gamification-infographic.Smith,R.(2006).Peerreview:aflawedprocessattheheartofscienceandjournals.JRoyalSocietyMed.,99(4),178–182.Southern,M.Googleislookingintowaystoranksitesbasedonaccuracyofinformation.SearchEngineJournal.March4,2015.http://www.searchenginejournal.com/google-is-looking-into-ways-to-rank-sites-based-on-accuracy-of-information/127394/#.Spangler,B.(2003).Adjudication.BeyondIntractability.http://www.beyondintractability.org/essay/adjudication.Speed,T.(2011).Mathematics.Acorrelationforthe21stcentury.Science,334(6062),1502-1503.

Page 22: A RENCI WHITE PAPER · RENCI White Paper Series, Vol. 3, No. 6 1 Scientific Discovery in the Era of Big Data: More than the Scientific Method Authors Charles P. Schmitt, Director

RENCIWhitePaperSeries,Vol.3,No.6 21

Trader,T.(2012).Tamingthelongtailofscience.HPCWire.http://www.hpcwire.com/2012/10/15/taming_the_long_tail_of_science/.Wilhelmsen,K.,Schmitt,C.&Fecho,K.(2013).Factorsinfluencingdataarchivaloflarge-scalegenomicdatasets:amathematicalformalismtocomprehensivelyevaluatethecosts-benefitsofarchivinglargedatasets.RENCITechnicalReportSeries,TR-13-03.UniversityofNorthCarolinaatChapelHill,ChapelHill,NC,USA.doi:10.7921/G0MW2F25.http://renci.org/technical-reports/factors-influencing-data-archival-of-large-scale-genomic-data-sets/.Zyda,M.(2005).Fromvisualsimulationtovirtualrealitytogames.Computer,38(9),25–32.*AllhyperlinkswerelastaccessedonSeptember4,2015.