The iPlant Collaborative Cyberinfrastructure Matt Vaughn Cold Spring Harbor Laboratory April 2010

The iPlant Collaborative Cyberinfrastructure

Matt Vaughn

Cold Spring Harbor Laboratory

April 2010

http://www.iplantcollaborative.org/

What is iPlant?

• Simply put, the mission of the iPlant Collaborative is to build Cyberinfrastructure to support the solution of the grand challenges of plant biology.

• A “unique” aspect is that the grand challenges were not defined in advance, but are identified through ongoing engagement with the community.

• Not a center, but a virtual organization forming grand challenge teams and relying on the national CI.

• Long term focus on sustainable food supply, climate change, biofuels, pharmaceuticals, etc.

• Hundreds of participants from around the world; working group members at more than 50 US academic institutions, USDA, DOE, etc.


What is Cyberinfrastructure? (Originally about TeraGrid)

• It’s a Grid!
• It’s Storage!
• It’s a Common Software Environ!
• It’s a Network!
• They are HPC Centers!
• It’s Apps and Support!
• And More!:
– Viz
– Facilities
– Data collections
– …

It was six men of Indostan,
To learning much inclined,
Who went to see the elephant,
(Though all of them were blind),
That each by observation
Might satisfy his mind.

WWW.TERAGRID.ORG



The iPlant CI

• Engagement with the CI Community to leverage best practice and new research

• Unprecedented engagement with the user community to drive requirements

• An exemplar virtual organization for modern computational science

• A Foundation of Computational and Storage Capability

• A single CI for all plant scientists, with customized discovery environments to meet grand challenges

• Open source principles, commercial quality development process


A Foundation of Computational and Storage Capability

• iPlant is positioned to take advantage of *tremendous* amounts of NSF and institutional compute and storage resources:

– Compute: Ranger, Lonestar, Stampede (UT/TeraGrid); Saguaro, Sonora (ASU); Marin, Ice (UA)
  • ~700 Teraflops, more computing power than existed in all the Top 500 computers in the world 4 years ago

– Storage: Corral, Ranch (UT), Ocotillo (ASU)
  • Well over 10 Petabytes of storage can be made available for the project, on scalable systems capable of growing much more.

– Visualization: Spur, Stallion (UT), Matinee (ASU), UA-Cave
  • Among the world’s largest visualization systems

– Virtualized/Cloud Services: iPlant (UA) and ASU virtual environments, vendor clouds
  • iPlant is positioned to use cloud technologies to deliver persistent gateways and services to users.

In short, the physical aspects of cyberinfrastructure employed via iPlant, utilizing large scale NSF investments, have capabilities second to none anywhere on the planet.


A Single CyberInfrastructure, many Discovery Environments

• iPlant is constructing one constantly evolving software environment
– A single architecture and “core”
– An ever-growing collection of many integrated tools and datasets (many will be externally-sourced).
– Transparently leveraging an evolving national physical infrastructure

• Customized for particular problems/use cases through the creation of individual “Discovery Environments” (DE):
– Have an interface customized to the particular problem domain.
– Integrate a specific collection of tools
– Utilize the common core
– Several DEs may exist to address a single grand challenge
– Think of these like ‘applications’


Open Source Philosophy, Commercial Quality Process

• iPlant is open in every sense of the word:
– Open access to source
– Open API to build a community of contributors
– Open standards adopted wherever possible
– Open access to data (where users so choose).

• iPlant code design, implementation, and quality control will be based on best industrial practice


Commercial Quality Process

• Agile development methodology has been adopted

• Complete product lifecycle in place:
– Product Definition, Requirements Elicitation, Solution Design, Software Development, Acceptance Testing

• Code is only built after a rigorous requirements process

– Needs Analysis

– User Persona

– Problem Statement

– User Stories

• The Grand Challenge Engagement Team plays the role of “Product Champion” and “Customer Advocate” in this scheme


Scope: What iPlant won’t do

• iPlant is not a funding agency
– A large grant shouldn’t become a bunch of small grants

• iPlant does not fund data collection

• iPlant will (generally) not continue funding for <favorite tool x> whose funding is ending.

• iPlant will not seek to replace all online data repositories

• iPlant will not *impose* standards on the community.



Scope: What iPlant *will* do

• Provide storage, computation, hosting, and lots of programmer effort to support grand challenge efforts.

• Work with the community to support and develop standards

• Provide forums to discuss the role and design of CI in plant science

• Help organize the community to collect data

• Provide appropriate funding for time spent helping us design and test the CI


What is the iPlant CI?

• Two grand challenges defined to date:

– iPlant Tree of Life (IPTOL): Build a single tree showing the evolutionary relationships of all green plant species on Earth

– iPlant Genotype-to-Phenotype (IPG2P): Construct a methodology whereby an investigator, given the genomic and environmental information about a given individual plant, can predict its characteristics.

Taken together, these challenges are the key to unlocking many “holy grails” of plant biology, such as the creation of drought resistant or pest resistant crops, or breaking reliance on fossil fuel based fertilizer


What is the iPlant CI?

• IPTOL CI:
– Five areas: Data assembly and integration, visualization, scalable algorithms for large trees, trait evolution, tree reconciliation

• IPG2P CI:
– Five areas: Data Integration, Visualization, Modeling, Statistical Inference, Next Gen Sequencing Tools

In both, a combination of applying compute resources, developing new tools or enhancing existing ones, and creating web-based “discovery environments” to integrate tools and facilitate collaboration.


Genotype-to-Phenotype (G2P)

Problem Statement

• Given: A particular
– species of plant (e.g. corn, rice)
– genetic description of an individual (genotype)
– growth environment
– trait of interest (flowering time, yield, or any of hundreds of others)

• Predict: the quantitative result (phenotype)

Top priority problem in plant biology (NRC)

• Reverse problem: What genotype will yield the desired result in a given environment?


[Figure: G2P workflow diagram. Labels: Seq data, Expression data, Metabolic data, Whole plant data, Environment data (each with a DI component); Visualization; Modeling and Statistical Inference; Hypothesis; Experiment; User inferred; Super-user; Developer.]


iPG2P Working Groups

• Ultra High Throughput Sequencing
– Establishing an informatics pipeline that will allow the plant community to process NextGen sequence data

• Statistical Inference
– Developing a platform using advanced computational approaches to statistically link genotype to phenotype

• Modeling Tools
– Developing a framework to support tools for the construction, simulation and analysis of computational models of plant function at various scales of resolution and fidelity

• Visual Analytics
– Generating, adapting, and integrating visualization tools capable of displaying diverse types of data from laboratory, field, in silico analyses and simulations

• Data Integration
– Investigating and applying methods for describing and unifying data sets into virtual systems that support iPG2P activities


UHTS Discovery Environment

Metadata Manager

Scalable services

Data
• NCBI SRA
• User local
• iPlant store

Metadata
• MIAME
• MINSEQE
• SRA

Data Wrangling
• Quality Control
• Preprocessing
• Transformation

Alignments
• BWA
• TopHat + Bowtie

Cufflinks

SAMTools

SAM Alignments

Expression Levels (RPKM)

Variants (VCF 3.3)

User story: Arthur, an ecological genomics postdoc, is looking for gene regulators by eQTL mapping expression data in a panel of recombinant inbred lines he has constructed and genotyped.

Coming Q2 2010
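The expression-level output in this pipeline is labeled RPKM (reads per kilobase of transcript per million mapped reads). A minimal sketch of that normalization follows, assuming per-transcript read counts and transcript lengths are already available. The function name and the example numbers are illustrative only and are not taken from the iPlant pipeline.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = transcript_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions_of_reads


# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2_000, 10_000_000)
print(example_rpkm)
```

Dividing by both length and library size is what makes values comparable across transcripts and across samples.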


• Network Inference
• QTL Mapping
– Regression (fixed, random effects)
– Maximum likelihood
– Bayesian methods
– Decision trees
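As an illustration of the regression approach to QTL mapping listed above, here is a minimal single-marker sketch: an ordinary least squares fit of a phenotype on a numerically coded genotype. The 0/1 genotype coding, the function name, and the data are assumptions made for this example; it is not the iPlant implementation and it omits random effects and significance thresholds.

```python
import math


def single_marker_regression(genotypes, phenotypes):
    """Fit phenotype = intercept + slope * genotype by least squares.

    Returns (slope, intercept, t_statistic) for the marker effect.
    """
    n = len(genotypes)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (p - mean_p)
              for g, p in zip(genotypes, phenotypes))
    if sxx == 0:
        raise ValueError("marker is monomorphic in this sample")
    slope = sxy / sxx
    intercept = mean_p - slope * mean_g
    # Residual sum of squares and the standard error of the slope
    rss = sum((p - (intercept + slope * g)) ** 2
              for g, p in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / se if se > 0 else math.copysign(math.inf, slope)
    return slope, intercept, t_stat


# Hypothetical recombinant inbred lines: genotype coded 0 or 1 by parental allele
genotypes = [0, 0, 0, 1, 1, 1]
phenotypes = [10.0, 11.0, 9.0, 14.0, 15.0, 13.0]
slope, intercept, t_stat = single_marker_regression(genotypes, phenotypes)
print(slope, intercept, round(t_stat, 3))
```

A genome scan repeats this fit once per marker and per phenotype, which is where the computational cost comes from.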


Computational Challenges

[Figure: two data matrices with individuals (Indiv 1, 2, 3, …) as rows. One has columns 1 … 6.5e6 (markers); the other has columns 1 … 3.9e4 (expression phenotypes).]

6.5 million markers: Two Arabidopsis-sized genomes @5% diversity

X

38,963 expression phenotypes: # transcripts in Arabidopsis measured by UHTS

* Single-SNP test: a few min
* 100-replicate bootstrap: a few hours
* Only gets larger for epistasis tests, forward model selection, fms+bootstrapping
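A back-of-the-envelope count shows why these analyses need scalable compute. The marker, phenotype, and bootstrap numbers below come from the slide. The sketch assumes every marker is tested against every phenotype, that each bootstrap replicate repeats the full scan, and that an epistasis scan considers every pair of markers; the slide does not state those details.

```python
n_markers = 6_500_000     # from the slide: 6.5 million markers
n_phenotypes = 38_963     # from the slide: expression phenotypes
n_bootstrap = 100         # from the slide: 100-replicate bootstrap

# One test per (marker, phenotype) pair
single_marker_tests = n_markers * n_phenotypes

# Assumption: each bootstrap replicate repeats the whole scan
bootstrap_tests = single_marker_tests * n_bootstrap

# Assumption: a pairwise epistasis scan visits every unordered marker pair
marker_pairs = n_markers * (n_markers - 1) // 2

print(f"marker x phenotype tests: {single_marker_tests:.3e}")
print(f"with bootstrap:           {bootstrap_tests:.3e}")
print(f"marker pairs (epistasis): {marker_pairs:.3e}")
```

Under these assumptions the number of marker pairs for a single phenotype exceeds the number of single-marker tests across all phenotypes.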


Statistical Genetics DE

Scalable service

Data
• User local
• iPlant store

Data Wrangling
• Projection
• Imputation
• Conversion
• Transformation

Configuration
• User-specified
• Driver code

GLM Computation Kernel

Reconfigurable GLM Kernel
• C/MPI/ScaLAPACK
• GPU
• Hybrid CPU

Significant results

Command-line environment and API expected Q3 2010


Modeling Tools

• Integrated suite of tools for:
– model construction & simulation
– parameter estimation, sensitivity analysis
– verification

• Draw on existing SBML tools
• Protocol converters for network models
• Facilitate MIRIAM usage for code/model verification



Data Integration Principles

G2P Biology is data-driven science. Integration is key: information curators already exist and do extremely good work.

• No monolithic iPlant database(s)

• Provide virtual databases via services

• Provenance preservation

• Foster and actively support standards adoption

• Match orphan data sets with interested researchers & educators


Genotype
• Existing genetic and genomic data
• Generation of new genomic data
– Re-sequencing
– De novo sequencing

Powerful Statistical Inference

Data Integration Layer

Phenotype
• Existing expression, metabolomic, network, physical phenotypes
• Generation of new phenotype data
– RNAseq
– High-throughput phenotyping
– Image Analysis


High-throughput Image Analysis

Physical Infrastructure: Cameras, Scanners, etc

HTIP Service Layer

Scalable services

Algorithm Plugins
• MATLAB
• Python
• C/C++

Web GUI

RESTful API

Workflow Control

Consumer Processes

Data Intake Processes

httpd

RDBMS

Inputs
• Serial images
• Multichannel images
• Volumetric data
• Movies

Database Schema
• Semantic storage and retrieval of images and metadata
• Storage of derived results from analysis procedures

Requirements elicitation ongoing


Plant Biology CI Empowerment Strategy

• Plant Ecology
• Evolutionary Biology
• Plant Genomics
• Phenotyping
• GC Solutions

• Big Trees
• Tree Reconciliation
• Trait Evolution
• Taxonomic Intelligence
• Tree Decoration
• Next Gen Sequencing
• Image Analysis
• Data Integration
• Visualization
• Modeling
• Green Plant ToL
• Flowering Phenology
• Stress & Adaptation
• C3/C4 Evolution

Technology and the iPC CI

Physical Infrastructure
• Compute, Storage, Persistent Virtual Machines
• TeraGrid, Open Science Grid, UA/ASU/TACC

iPlant Middleware
• Job Submission, Workflow Management, Service/Data APIs
• iRODS, Grid Technologies, Condor, RESTful Services

iPlant Discovery Environments
• Grand Challenge Workflows, iPlant Interfaces
• Third Party Tools, iPlant-built Tools, Community Contributed Tools and Data!

User

Build a CI that’s robust, leverages national infrastructure, and can grow through community contribution!

Technical Questions? Contact Nirav Merchant – [email protected]


iPlant: Connecting Users, Ideas & Resources

Core CI Foundation:

• Data layer
• Registry and Integration layer
• Compute and Analysis layer
• Interaction & Collaboration layer



iPlant: Using proven technologies

• Data layer: providing access to raw and ingested data sets including high throughput data transfers
– iRODS
– GridFTP, Aspera
– DSpace (DuraSpace), Open Archives Initiative
– Content Distribution Networks (CDN)
– High performance storage @ TACC (Lustre)
– MySQL and Postgres database clusters
– Connection to other DataONE, DataNet initiatives
– Cloud style storage (similar to Amazon S3 and Walrus)



iPlant: Using proven technologies

• Registry and Integration Layer: Connecting services, data and metadata elements using semantic understanding



• Compute and Analysis Layer: Connecting tasks with scalable platforms and algorithms
– Virtualization (Xen clusters)
– High Performance Computing at TACC and TeraGrid
– Grid (Condor, BOINC, Gearman)
– Cloud (Eucalyptus, Nimbus, Hadoop)
– Reconfigurable Hardware (GPU, FPGA)
– Checkpoint & Restart (DMTCP)
– Scaling and parallelizing code (MPI)



• Interaction and Collaboration layer: Providing end user access to unified services and data, from API to large scale visualization
– Google Web Toolkit (GWT driven front end)
– Messaging bus (Java Mule, XMPP/Jabber)
– RESTful web services (web API access)
– Single sign-on/identity management (Shibboleth, OAuth)
– Integration with desktop applications (via web services)
– Sharing data (DOI, persistent URL, CDN, social networks)
– Large scale visualization (Large Tree, ParaView, ENVISION)




An Example Discovery Environment


First DE

• Support for one use case: independent contrasts. But also…
– Seamless remote execution of compute tasks on TeraGrid resources
– Incorporation of existing informatics tools behind iPlant interface
– Parsing of multiple data formats into Common Semantic Model
– Seamless integration of online data resources
– Role based access and basic provenance support

• Next version will support:
– Ultra High Throughput Sequencing pipeline, Variant Detection, Transcript Quantification

– Public RESTful API
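For readers unfamiliar with the first use case, the sketch below assumes "independent contrasts" refers to Felsenstein's phylogenetically independent contrasts for a continuous trait on a bifurcating tree with branch lengths. The `Node` class, the function name, and the toy tree are hypothetical and are not the Discovery Environment's code. Branch lengths are assumed to be strictly positive.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    branch_length: float              # length of the branch above this node
    trait: Optional[float] = None     # observed trait value (tips only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def independent_contrasts(node: Node, contrasts: List[float]) -> Tuple[float, float]:
    """Post-order pass; returns (trait estimate, adjusted branch length)."""
    if node.left is None or node.right is None:
        return node.trait, node.branch_length
    x1, v1 = independent_contrasts(node.left, contrasts)
    x2, v2 = independent_contrasts(node.right, contrasts)
    # Standardized contrast between the two daughter lineages
    contrasts.append((x1 - x2) / math.sqrt(v1 + v2))
    # Ancestral estimate: daughters weighted by inverse branch length
    estimate = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the branch above to reflect uncertainty in the estimate
    adjusted = node.branch_length + v1 * v2 / (v1 + v2)
    return estimate, adjusted


# Toy tree ((A:1, B:1):1, C:2) with trait values A=4, B=2, C=0
tip_a = Node(branch_length=1.0, trait=4.0)
tip_b = Node(branch_length=1.0, trait=2.0)
tip_c = Node(branch_length=2.0, trait=0.0)
inner = Node(branch_length=1.0, left=tip_a, right=tip_b)
root = Node(branch_length=0.0, left=inner, right=tip_c)

contrasts: List[float] = []
root_estimate, _ = independent_contrasts(root, contrasts)
print(contrasts, root_estimate)
```

A tree with n tips yields n - 1 contrasts, which can then be used in ordinary correlation or regression analyses.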



Example Service API


Acknowledgments

University of Arizona

Rich Jorgensen

Greg Andrews

Kobus Barnard

Rick Blevins

Sue Brown

Vicki Bryan

Vicki Chandler

John Hartman

Travis Huxman

Tina Lee

Nirav Merchant

Martha Narro

Sudha Ram

Steve Rounsley

Suzanne Westbrook

Ramin Yadegari

Cold Spring Harbor Laboratory, NY

Lincoln Stein

Matt Vaughn

Doreen Ware

Dave Micklos

Sheldon McKay

Jerry Lu

Liya Wang

Texas Advanced Computing Center

Dan Stanzione

Michael Gonzales

Chris Jordan

Greg Abram

Weijia Xu

University of North Carolina-Wilmington

Ann Stapleton

Funded by NSF



Collaborating Institutions

• CSHL iPlant CI

• EMEC External Evaluator

• TACC iPlant CI

• UNCW iPlant CI

• Field Museum Natural History

• MoBot APWeb2

• BIEN Taxonomic Intelligence

• UCSB Image Platform

• UWISC Image Platform

• Boyce Thompson Inst. iPG2P

• KSU iPG2P

• UCD iPG2P

• VA Tech iPG2P

• Brown iPToL

• UFL iPToL

• UGA iPToL

• UPenn iPToL

• UTK iPToL

• Yale iPToL


Soft Collaborators

• 1kP Consortium

• ARS at USDA

• BRIT: Botanical Research Institute of Texas

• CGIAR and Generation Challenge Program

• Cyberinfrastructure for Phylogenetic Research (CIPRES)

• The Croquet Consortium

• NIMBioS: National Institute for Mathematical and Biological Synthesis

• Pittsburgh Supercomputing Center

• pPOD: processing PhyloData

• Syngenta Foundation

• NanoHub & HubZero

• ELIXIR

• Fluxnet

• Howard Hughes Medical Institute

• Knowledgebase

• NPN: National Phenology Network

• PEaCE Lab: Pacific Ecoinformatics and Computational Ecology Lab

• MORPH: Research Coordination Network (RCN)

• NCEAS: National Center for Ecological Analysis and Synthesis

• NEON: National Ecological Observatory Network

• NESCent: National Evolutionary Synthesis Center


Unprecedented Engagement with the Plant Science User Community

• A unique engagement process
– The Grand Challenge process has resulted in the most intensive user input of any large scale CI project to date.

• iPlant will construct a single CI for plant science, driven by grand challenges and specific user needs

• Grand Challenge Engagement Teams will continue this very close cooperation with the community
– Work closely with the GC proposal team and the broader community
– Build use cases to drive development


An Exemplar Virtual Organization for Modern Computational Science

• iPlant aims to be the Gold Standard against which other science-focused CI projects will be measured.

• One Cyberinfrastructure Team, many skills and roles
– iPC CI Creation is done by a diverse group:
  • Faculty, postdocs, staff, and students
  • Bioinformatics, Biology, Computing and Information Researchers, Software Engineers, Database Specialists, etc.
  • Arizona, Cold Spring Harbor, Texas, etc.
– Many different tasks:
  • Engagement/Requirements, Tech Eval, Prototyping, Software Design (DE and Core), Data Integration, Systems, many more.

• A single Cyberinfrastructure Team, where roles may change rapidly to match skill sets



Timelines/Milestones

• Growth in staffing & capability; from a few in March 2009, now 47 involved in CI across all sites.

• Architecture definition in August-Sept 2009; enough to get started, still evolving.

• Software environment, tools, practices laid down about the same time.

• Real SW development commenced in September 2009.

• Serious prototyping and tool support in response to ET needs began ramping up in November.


Technology Eval Activities

• Largest investment in semantic web activities
– Key for addressing the massive data integration challenges

• Exploring alternate implementations of QTL mapping algorithms

• Experimental Reproducibility

• Policy and Technology for Provenance Management

• Evaluation of HubZero, Workflow engines, numerous other tools


IPTOL CI – A High Level Overview

• Goal: Build very large trees, perhaps all green plant species

• Needs:
– Most of the data isn’t collected. A lot of what is collected isn’t organized.
– Lots of analysis tools exist (probably plenty of them) – but they don’t work together, and use many different data formats.
– The tree builder tools take too long to run.
– The visualization tools don’t scale to the tree sizes needed.


IPTOL CI – High Level

• Addressing these needs through CI

– MyPlant – the social networking site for phylogenetic data collection (organized by clade)

– Provide a common repository for data without an NCBI home (e.g. 1kP)

– Discovery Environment: Build a common interface, data format, and API to unite tools.

– Enhance tree builder tools (RAxML, NINJA, SATé) with parallelization and checkpointing

– Build a remote visualization tool capable of running where we can guarantee RAM resources
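To make the checkpointing item concrete, here is a minimal application-level sketch: a long iterative job saves its state periodically so that an interrupted run can resume from the last checkpoint. It is a generic illustration under stated assumptions, not how RAxML, NINJA, or SATé implement checkpointing; the function name, the state dictionary, and the toy "work" are all hypothetical.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, every=10, stop_after=None):
    """Run a toy iterative search, saving state so an interrupted run can resume."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as handle:
            state = pickle.load(handle)
    else:
        state = {"step": 0, "score": 0.0}

    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulate an interruption, e.g. a queue time limit
        state["score"] += 1.0 / (state["step"] + 1)  # stand-in for real work
        state["step"] += 1
        if state["step"] % every == 0:
            # Write to a temporary file first, then rename, so a crash
            # cannot leave a partially written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "wb") as handle:
                pickle.dump(state, handle)
            os.replace(tmp_path, checkpoint_path)
    return state


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "search.ckpt")
    interrupted = run_with_checkpoints(100, path, every=10, stop_after=35)
    resumed = run_with_checkpoints(100, path, every=10)  # restarts from step 30
    uninterrupted = run_with_checkpoints(100, os.path.join(workdir, "fresh.ckpt"))

print(interrupted["step"], resumed["step"], resumed["score"] == uninterrupted["score"])
```

The resumed run repeats only the work done since the last checkpoint, which matters when a tree search runs longer than a single batch-queue allocation.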

