Unleash your inner (data) scientist : The ability and audacity to scale your science with extensible cyberinfrastructure Nirav Merchant The University of Arizona & iPlant Collaborative [email protected]

2015 CUAHSI Conference on Hydroinformatics

  • Unleash your inner (data) scientist: The ability and audacity to scale your science with extensible cyberinfrastructure

    Nirav Merchant, The University of Arizona & iPlant Collaborative [email protected]

  • Topic Coverage

    - The Big Data and Data Scientist wave
    - What is cyberinfrastructure (CI)
    - Delivering pragmatic CI ecosystem
    - What has the community built with our CI
    - Lifecycle of research and innovation
    - Continuing education and learning with CI
    - Future thoughts and challenges

  • Science Paradigms

    1. Thousand years ago: science was empirical, describing natural phenomena, observations
    2. Last few hundred years: theoretical branch, using models, generalizations
    3. Last few decades: a computational branch, simulating complex phenomena
    4. Today: data exploration (eScience), unify theory, experiment, and simulation

    Based on the transcript of a talk given by the late Jim Gray to the National Research Council Computer Science and Telecommunications Board in Mountain View, CA, on January 11, 2007

  • The Fourth Paradigm: Data-Intensive Scientific Discovery

    Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.

    The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.

    http://research.microsoft.com/en-us/collaboration/fourthparadigm/

  • The Discovery Lifecycle

    The Fourth Paradigm: Data-Intensive Scientific Discovery

  • Evolution of X-Info

    - The evolution of X-Info and Comp-X for each discipline X, e.g. (Bio-Informatics, Computational-Biology)
    - How to codify and represent our knowledge

    The Generic Problems:

    - Data ingest
    - Managing a petabyte
    - Common schema
    - How to organize it
    - How to reorganize it
    - How to share it with others
    - Query and Vis tools
    - Building and executing models
    - Integrating data and literature
    - Documenting experiments
    - Curation and long-term preservation

    The Fourth Paradigm: Data-Intensive Scientific Discovery

  • Paradigm Shift

    Classic paradigm: You produce data, analyze, interpret (end to end)

    Conventional paradigm: Consortium/centers produce data and you consume it

    New paradigm: Consortium/centers have produced data and are creating cyberinfrastructure to tackle the grand challenge

  • Big Data

    - Extracting meaningful results from vast amounts of data (linked data)
    - Big data information assets demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
    - Big Data Is Only the Beginning of Extreme Information Management
    - Big Data Technology, all Is Not New

    Attributed to Gartner Consulting

  • A few words about Big Data and Data Science

    The 2014 Gartner Technology Hype Cycle

    http://www.gartner.com/newsroom/id/2819918

  • Simple Formula for Success

    + =

  • The Reality

    + +

    Excel, R, Perl, Python, ArcGIS, Java, Ruby, Fortran, C, C#, C++, Matlab, etc.

    Amazon, Azure, Rackspace, Campus HPC, XSEDE, etc.

    and lots of glue...

  • Simple Formula

    + =

  • http://cloudtweaks.com/2011/05/the-lighter-side-of-the-cloud-data-transfer/

  • Rise of the data janitors

  • The relevance

    Bioinformatics has become too central to biology to be left to specialist bioinformaticians. Biologists are all bioinformaticians now

    - Lincoln Stein, Dec. 2008

    http://genomebiology.com/2008/9/12/114

  • iPlant Collaborative: Vision

    www.iPlantCollaborative.org

    Enable life science researchers and educators to use and extend cyberinfrastructure

  • The iPlant Collaborative

    We are a Cyberinfrastructure

    - Platforms, tools, datasets
    - Storage and compute
    - Training and support

    From data to discovery

  • The iPlant Collaborative

    And a virtual organization

    - Developer Expertise
    - Computational Capacity
    - Science Domain Expertise
    - Training
    - Administrative and Organization

  • iPlant Collaborative: CI for Scalable Science

    Facilitating the 4As of Computational Thinking approaches for Life Sciences: Abstraction, Automation, Ability and Audacity

    Allowing researchers and educators to establish and manage data driven collaborations: Supporting distributed teams and virtual organizations (VO) at global scale

    Making efficient and coordinated use of CI resources from national, regional, institutional and commercial providers: NSF XSEDE, iPlant, campus HPC and high bandwidth connections to commercial cloud providers

    Adopting best practices from science domains where key CI challenges have been solved: Astronomy, Particle Physics etc.

    Community driven, self-provisioning, extensible and open source: Development and prioritization driven through community engagement, active engagement with CISE communities

  • iPlant Collaborative: Platform Philosophy

    - Strive to provide the CI Lego blocks
      - Danish 'leg godt' - 'play well'
      - Also translates as 'I put together' in Latin
    - If desired functionality is not available, the community can craft their own by using and extending iPlant CI components (like Lego blocks)
    - Through these extensible and customized platforms, create an ecosystem of interoperable tools that benefit the broad community (and not just a few lab groups)
    - Provide the tools to allow the community to manage their digital assets (cloud, HPC etc.)
    - Improve Computational Productivity

  • Who did we build it for ?

  • iPlant: Platform for Big Data Collaborations

  • iPlant Collaborative: Products

    Ready to use Platforms

    Foundational Capabilities

    Established CI Components

    Extensible Services

    Ease of use

  • iPlant: Cohesive Platform for Big Data lifecycle

  • Researchers like to share !

    User Statistics

    - ~27000 user accounts
    - 4900 users with data
    - 2600 users (53% of users with data) made at least 1 share
    - 2100 shares per user
    - 42 million files (58% shared)
    - 59 million (1.1 million/month) shares

    Community Data Statistics

    - 5 million files
    - 55 million (1.0 million/month) shares

    ~1.1PB of User Managed data

    Our users consume 5M+ SU annually and more (we graduate them to compete for their own allocations from XSEDE)

  • How is it being used ?

    - Users build their own systems (powered by iPlant components) but managed by them
    - Consume specific components (a la carte, data store, Atmosphere)
    - Directly use applications (DE)
    - Custom design appliances (Atmosphere)
    - Publish their findings (PNAS, Nature)
    - Advocate use
    - Create learning material and courses

  • iPlant CI: What is the community building ?

    Many 1000s omes projects manage their data & analysis

    Execute large scale workflows (25-50TB data, Million+ CPU hours)

    Data infrastructure to coordinate digitization efforts for multiple sites

    Sharing, Visualizing (3D) & Analyzing high resolution microscopy images (40K x 40K) via web browser

    Learning material, new coursework, custom applications

  • And it goes way beyond plants and life science

  • iPlant Collaborative: Training data scientists

    Partnership with Software Carpentry and Data Carpentry to provide best practices necessary to make efficient use of CI

    Allowing individual researchers and educators to utilize data and computational infrastructure at scale (and encounter real challenges)

    Community contributed material (built on iPlant CI)

  • Applied Cyberinfrastructure Concepts (ACIC)

    - Semester long, project based learning course: introduces fundamental concepts, tools and resources for effectively managing common tasks associated with analyzing large datasets.
    - Graduate + Undergraduate course working on REAL research workflows where scalability is a bottleneck
    - Provide familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, iPlant Collaborative, NSF XSEDE centers, Cloud (Future Grid and commercial providers such as Amazon).
    - Learning to apply relevant CI skills (for final project) and developing wiki based documentation of these best practices.
    - Learning how to effectively collaborate in interdisciplinary team settings.
    - Deliver a functional solution to the stakeholder

  • From research question to reality

  • Why is it valuable ?

    - Users are able to overcome data and computational bottlenecks
    - Share data of ANY size with ANYONE
    - Connect data and compute on a single platform
    - Manage their data and computations regardless of scale
    - Build their own apps and solutions (create their own community: iAnimal, iVirome)
    - Create custom appliances

  • iPlant: What worked

    - All major CI components have seen steady adoption (with few exceptions)
    - Think tank to do tank transition was rapid
    - Evolved to a technology proving ground
    - Take research products (NSF funded) to production use for our community
    - Running infrastructure is not fun, building is. Allowing people to focus on science (while we streamline CI)

  • iPlant: What worked

    - Evolution of training (Software Carpentry)
    - Sharing/collaboration
    - Give people an exit strategy (options) and they are happy to adopt the solution
    - Provide feedback to CI component creators to improve (usability)
    - Expectation management: Do not expect the same experience (cable cord cutting vs. Netflix/Hulu)

  • What did not work

    - Managing distributed teams is harder in a VO (load balancing, enthusiasm etc.)
    - Technology lifecycle is not synchronized across all products
    - Relying on multiple providers for a solution is challenging (downtimes)
    - Changing/evolving needs of the community are hard to predict
    - Growth of users outpaces our cloud capabilities (see tweets)

  • Even the tech geeks notice

  • Connect with iPlant!

    - Get an account: http://user.iplantcollaborative.org
    - Email us: [email protected]
    - http://ask.iplantcollaborative.org
    - Twitter: @iPlantCollab #iPlant
    - Facebook: facebook.com/iPlantCollab
    - LinkedIn: iplant.co/iPlantCollabLinkedIn
    - Google+: iplant.com/iPlantGooglePlus

  • Luck favors the brave

    Analysis favors the organized
