The iPlant Collaborative Cyberinfrastructure
Matt Vaughn
Cold Spring Harbor Laboratory
April 2010
What is iPlant?
• Simply put, the mission of the iPlant Collaborative is to build Cyberinfrastructure to support the solution of the grand challenges of plant biology.
• A “unique” aspect is that the grand challenges were not defined in advance, but are identified through ongoing engagement with the community.
• Not a center, but a virtual organization forming grand challenge teams and relying on the national CI.
• Long term focus on sustainable food supply, climate change, biofuels, pharmaceuticals, etc.
• Hundreds of participants from around the world; Working group members at > 50 US academic institutions, USDA, DOE, etc.
What is Cyberinfrastructure? (Originally about TeraGrid)
And More!:
- Viz
- Facilities
- Data collections
…
It’s a Grid!
It’s Storage!
It’s a Common Software Environ!
It’s a Network!
They are HPC Centers!
It’s Apps and Support!
It was six men of Indostan,
To learning much inclined,
Who went to see the elephant,
(Though all of them were blind),
That each by observation
Might satisfy his mind.
WWW.TERAGRID.ORG
The iPlant CI
• Engagement with the CI Community to leverage best practice and new research
• Unprecedented engagement with the user community to drive requirements
• An exemplar virtual organization for modern computational science
• A Foundation of Computational and Storage Capability
• A single CI for all plant scientists, with customized discovery environments to meet grand challenges
• Open source principles, commercial quality development process
A Foundation of Computational and Storage Capability
• iPlant is positioned to take advantage of *tremendous* amounts of NSF and institutional compute and storage resources:
  – Compute: Ranger, Lonestar, Stampede (UT/TeraGrid); Saguaro, Sonora (ASU); Marin, Ice (UA)
    • ~700 Teraflops, more computing power than existed in all the Top 500 computers in the world 4 years ago
  – Storage: Corral, Ranch (UT), Ocotillo (ASU)
    • Well over 10 Petabytes of storage can be made available for the project, on scalable systems capable of growing much more.
  – Visualization: Spur, Stallion (UT), Matinee (ASU), UA-Cave
    • Among the world’s largest visualization systems
  – Virtualized/Cloud Services: iPlant (UA) and ASU virtual environments, vendor clouds
    • iPlant is positioned to use cloud technologies to deliver persistent gateways and services to users.
In short, the physical aspects of cyberinfrastructure employed via iPlant, utilizing large-scale NSF investments, have capabilities second to none anywhere on the planet.
A Single CyberInfrastructure, many Discovery Environments
• iPlant is constructing one constantly evolving software environment
  – A single architecture and “core”
  – An ever-growing collection of many integrated tools and datasets (many will be externally sourced)
  – Transparently leveraging an evolving national physical infrastructure
• Customized for particular problems/use cases through the creation of individual “Discovery Environments” (DE):
  – Have an interface customized to the particular problem domain
  – Integrate a specific collection of tools
  – Utilize the common core
  – Several DEs may exist to address a single grand challenge
  – Think of these like ‘applications’
Open Source Philosophy, Commercial Quality Process
• iPlant is open in every sense of the word:
  – Open access to source
  – Open API to build a community of contributors
  – Open standards adopted wherever possible
  – Open access to data (where users so choose)
• iPlant code design, implementation, and quality control will be based on best industrial practice
Commercial Quality Process
• Agile development methodology has been adopted
• Complete product lifecycle in place:
  – Product Definition, Requirements Elicitation, Solution Design, Software Development, Acceptance Testing
• Code is only built after a rigorous requirements process
  – Needs Analysis
  – User Persona
  – Problem Statement
  – User Stories
• The Grand Challenge Engagement Team plays the role of “Product Champion” and “Customer Advocate” in this scheme
Scope: What iPlant won’t do
• iPlant is not a funding agency
  – A large grant shouldn’t become a bunch of small grants
• iPlant does not fund data collection
• iPlant will (generally) not continue funding for <favorite tool x> whose funding is ending.
• iPlant will not seek to replace all online data repositories
• iPlant will not *impose* standards on the community.
Scope: What iPlant *will* do
• Provide storage, computation, hosting, and lots of programmer effort to support grand challenge efforts.
• Work with the community to support and develop standards
• Provide forums to discuss the role and design of CI in plant science
• Help organize the community to collect data
• Provide appropriate funding for time spent helping us design and test the CI
What is the iPlant CI?
• Two grand challenges defined to date:
  – iPlant Tree of Life (IPTOL): Build a single tree showing the evolutionary relationships of all green plant species on Earth
  – iPlant Genotype-to-Phenotype (IPG2P): Construct a methodology whereby an investigator, given the genomic and environmental information about a given individual plant, can predict its characteristics.
Taken together, these challenges are the key to unlocking many “holy grails” of plant biology, such as the creation of drought-resistant or pest-resistant crops, or breaking reliance on fossil-fuel-based fertilizer.
What is the iPlant CI?
• IPTOL CI:
  – Five areas: Data assembly and integration, visualization, scalable algorithms for large trees, trait evolution, tree reconciliation
• IPG2P CI:
  – Five areas: Data Integration, Visualization, Modeling, Statistical Inference, Next Gen Sequencing Tools
In both, a combination of applying compute resources, developing new tools or enhancing existing ones, and creating web-based “discovery environments” to integrate tools and facilitate collaboration.
Genotype-to-Phenotype (G2P)
Problem Statement
• Given: A particular
  – species of plant (e.g. corn, rice)
  – genetic description of an individual (genotype)
  – growth environment
  – trait of interest (flowering time, yield, or any of hundreds of others)
• Predict: the quantitative result (phenotype)
Top priority problem in plant biology (NRC)
• Reverse problem: What genotype will yield the desired result in a given environment?
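The forward and reverse problems stated above can be illustrated with a toy sketch. Everything in it is a hypothetical assumption made for illustration: the purely additive model, the marker effect sizes, the environment baselines, and the function names. It is not iPlant's methodology, and real G2P models are far richer.

```python
from itertools import product

# Hypothetical additive effects (days of flowering time) of three biallelic markers.
MARKER_EFFECTS = [2.0, -1.5, 0.5]
# Hypothetical baseline flowering time (days) for each growth environment.
ENVIRONMENT_BASELINE = {"greenhouse": 30.0, "field": 35.0}


def predict_phenotype(genotype, environment):
    """Forward problem: allele counts (0, 1 or 2 per marker) plus environment -> trait value."""
    baseline = ENVIRONMENT_BASELINE[environment]
    return baseline + sum(g * e for g, e in zip(genotype, MARKER_EFFECTS))


def best_genotype(target, environment):
    """Reverse problem: exhaustively search all genotypes for the one closest to a target value."""
    candidates = product((0, 1, 2), repeat=len(MARKER_EFFECTS))
    return min(candidates, key=lambda g: abs(predict_phenotype(g, environment) - target))


forward = predict_phenotype((2, 0, 1), "field")
reverse = best_genotype(28.0, "greenhouse")
print(forward, reverse)
```

Exhaustive search only works here because three markers give 27 genotypes; the search space grows as 3 to the power of the marker count, which is one reason the reverse problem is hard.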
[Figure: iPG2P workflow diagram. Labels: Seq data, Expression data, Metabolic data, Whole plant data, Environment data (each marked DI); Visualization; Modeling and Statistical Inference; Hypothesis; Experiment; User inferred; Super-user; Developer]
iPG2P Working Groups
• Ultra High Throughput Sequencing
  – Establishing an informatics pipeline that will allow the plant community to process NextGen sequence data
• Statistical Inference
  – Developing a platform using advanced computational approaches to statistically link genotype to phenotype
• Modeling Tools
  – Developing a framework to support tools for the construction, simulation and analysis of computational models of plant function at various scales of resolution and fidelity
• Visual Analytics
  – Generating, adapting, and integrating visualization tools capable of displaying diverse types of data from laboratory, field, in silico analyses and simulations
• Data Integration
  – Investigating and applying methods for describing and unifying data sets into virtual systems that support iPG2P activities
UHTS Discovery Environment
• Metadata Manager
• Scalable services
• Data: NCBI SRA, user local, iPlant store
• Metadata: MIAME, MINSEQE, SRA
• Data Wrangling: Quality Control, Preprocessing, Transformation
• Alignments: BWA, TopHat + Bowtie
• Cufflinks
• SAMTools
• SAM Alignments
• Expression Levels (RPKM)
• Variants (VCF 3.3)
User story: Arthur, an ecological genomics postdoc, is looking for gene regulators by eQTL mapping expression data in a panel of recombinant inbred lines he has constructed and genotyped.
Coming Q2 2010
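The expression levels in the pipeline above are reported in RPKM (reads per kilobase of transcript per million mapped reads). The sketch below shows only the standard RPKM arithmetic. The gene names, read counts, and gene lengths are made-up illustration values, and the function is not part of Cufflinks or any iPlant tool.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """RPKM = reads * 1e9 / (gene length in bp * total mapped reads)."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical per-gene data: (reads mapped to the gene, gene length in bp).
counts = {"gene_a": (500, 2000), "gene_b": (50, 1000)}
total_mapped = 10_000_000  # hypothetical library size

levels = {name: rpkm(reads, length, total_mapped)
          for name, (reads, length) in counts.items()}
print(levels)
```

Normalizing by both gene length and library size is what makes values comparable across genes and across sequencing runs of different depth.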
Statistical Inference
• Network Inference
• QTL Mapping
  – Regression (fixed, random effects)
  – Maximum likelihood
  – Bayesian methods
  – Decision trees
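As an illustration of the simplest approach listed above, regression, here is a minimal single-marker scan: each marker is tested on its own by ordinary least squares against the trait. The marker names and data are invented for the example, and this is a generic textbook sketch rather than the platform the working group is building.

```python
from statistics import mean


def marker_regression(genotypes, phenotypes):
    """Fit phenotype = a + b * genotype by least squares; return (slope b, R^2)."""
    g_mean, p_mean = mean(genotypes), mean(phenotypes)
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    syy = sum((p - p_mean) ** 2 for p in phenotypes)
    sxy = sum((g - g_mean) * (p - p_mean) for g, p in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # monomorphic marker or constant trait: nothing to explain
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def scan(markers, phenotypes):
    """Test every marker independently against one trait."""
    return {name: marker_regression(calls, phenotypes) for name, calls in markers.items()}


# Hypothetical data: allele counts (0/1/2) for six individuals at two markers.
phenotypes = [10.0, 12.0, 14.0, 10.0, 12.0, 14.0]
markers = {
    "m1": [0, 1, 2, 0, 1, 2],  # tracks the trait exactly
    "m2": [0, 0, 1, 1, 2, 2],  # only weakly associated
}
results = scan(markers, phenotypes)
best = max(results, key=lambda name: results[name][1])
print(best, results[best])
```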
Computational Challenges
[Figure: two matrices joined by an X. One has individuals 1, 2, 3, … as rows and markers 1 … 6.5e6 as columns; the other has individuals 1, 2, 3, … as rows and expression phenotypes 1 … 3.9e4 as columns]
• 38,963 expression phenotypes: # transcripts in Arabidopsis measured by UHTS
• 6.5 million markers: two Arabidopsis-sized genomes @ 5% diversity
• Single-SNP test: a few min
• 100-replicate bootstrap: a few hours
• Only gets larger for epistasis tests, forward model selection, fms+bootstrapping
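The scale of the problem follows directly from the two figures on this slide. The sketch below is back-of-the-envelope arithmetic only: the marker and phenotype counts come from the slide, while the number of individuals and the one-byte-per-genotype-call storage are assumptions introduced here for illustration.

```python
from math import comb

N_MARKERS = 6_500_000   # from the slide
N_PHENOTYPES = 38_963   # from the slide

# One test per (marker, phenotype) pair: about 2.5e11 tests.
single_snp_tests = N_MARKERS * N_PHENOTYPES

# Pairwise epistasis considers every unordered pair of markers: about 2.1e13 pairs,
# before multiplying by the number of phenotypes.
marker_pairs = comb(N_MARKERS, 2)
epistasis_tests = marker_pairs * N_PHENOTYPES


def genotype_matrix_bytes(n_individuals, bytes_per_call=1):
    """Memory for a dense genotype matrix (assumed layout: one small integer per call)."""
    return n_individuals * N_MARKERS * bytes_per_call


print(f"single-SNP tests: {single_snp_tests:.2e}")
print(f"marker pairs:     {marker_pairs:.2e}")
print(f"epistasis tests:  {epistasis_tests:.2e}")
print(f"matrix for 500 individuals (hypothetical): {genotype_matrix_bytes(500) / 1e9:.2f} GB")
```

The jump from a linear to a quadratic number of marker combinations is why epistasis tests dominate the cost.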
Statistical Genetics DE
• Scalable service
• Data: user local, iPlant store
• Data Wrangling: Projection, Imputation, Conversion, Transformation
• Configuration: user-specified, driver code
• GLM Computation Kernel
• Reconfigurable GLM Kernel: C/MPI/ScaLAPACK, GPU, Hybrid CPU
• Significant results
Command-line environment and API expected Q3 2010
Modeling Tools
• Integrated suite of tools for:
  – model construction & simulation
  – parameter estimation, sensitivity analysis
  – verification
• Draw on existing SBML tools
• Protocol converters for network models
• Facilitate MIRIAM usage for code/model verification
Data Integration Principles
G2P Biology is data-driven science. Integration is key: information curators already exist and do extremely good work.
• No monolithic iPlant database(s)
• Provide virtual databases via services
• Provenance preservation
• Foster and actively support standards adoption
• Match orphan data sets with interested researchers & educators
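One way to picture "virtual databases via services" with provenance preservation is a thin federation layer that forwards a query to each registered source and tags every returned record with where it came from. The class, source names, and records below are all hypothetical. This is a sketch of the general pattern, not iPlant's design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    gene_id: str
    value: str
    source: str  # provenance: which repository supplied this record


class VirtualDatabase:
    """Federates lookups across external sources instead of copying their data."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[str], Iterable[str]]] = {}

    def register(self, name: str, lookup: Callable[[str], Iterable[str]]) -> None:
        self._sources[name] = lookup

    def query(self, gene_id: str) -> List[Record]:
        results = []
        for name, lookup in self._sources.items():
            for value in lookup(gene_id):
                results.append(Record(gene_id, value, source=name))
        return results


# Hypothetical stand-ins for remote, independently curated repositories.
expression_store = {"gene_001": ["expression: placeholder value"]}
annotation_store = {"gene_001": ["annotation: placeholder value"]}

vdb = VirtualDatabase()
vdb.register("expression_source", lambda g: expression_store.get(g, []))
vdb.register("annotation_source", lambda g: annotation_store.get(g, []))

hits = vdb.query("gene_001")
for hit in hits:
    print(hit.source, "->", hit.value)
```

Because the data stays with its curators, there is no monolithic copy to keep in sync, and every result can be traced back to its origin.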
Genotype Phenotype
[Figure: data flow from genotype to phenotype]
• Existing genetic and genomic data
• Generation of new genomic data: re-sequencing, de novo sequencing
• Powerful Statistical Inference
• Existing expression, metabolomic, network, physical phenotypes
• Data Integration Layer
• Generation of new phenotype data: RNAseq, high-throughput phenotyping, image analysis
High-throughput Image Analysis
[Figure: HTIP architecture diagram]
• Physical Infrastructure: cameras, scanners, etc.
• RDBMS
• HTIP Service Layer
• Scalable services
• Algorithm Plugins: MATLAB, Python, C/C++
• Web GUI
• RESTful API
• Workflow Control
• Consumer Processes
• Data Intake Processes
• httpd
• Inputs: serial images, multichannel images, volumetric data, movies
• Database Schema
  – Semantic storage and retrieval of images and metadata
  – Storage of derived results from analysis procedures
Requirements elicitation ongoing
Plant Biology CI Empowerment Strategy
[Figure labels, in source order]
• Plant Ecology, Evolutionary Biology, Plant Genomics, Phenotyping
• GC Solutions
• Big Trees, Tree Reconciliation, Trait Evolution, Taxonomic Intelligence, Tree Decoration
• Next Gen Sequencing, Statistical Inference, Image Analysis, Data Integration, Visualization
• Green Plant ToL, Flowering Phenology, Stress & Adaptation, Modeling, C3/C4 Evolution
Technology and the iPC CI
• Physical Infrastructure
  – Compute, Storage, Persistent Virtual Machines
  – TeraGrid, Open Science Grid, UA/ASU/TACC
• iPlant Middleware
  – Job Submission, Workflow Management, Service/Data APIs
  – iRODS, Grid Technologies, Condor, RESTful Services
• iPlant Discovery Environments
  – Grand Challenge Workflows, iPlant Interfaces
  – Third Party Tools, iPlant-built Tools, Community Contributed Tools and Data!
• User
Build a CI that’s robust, leverages national infrastructure, and can grow through community contribution!
Technical Questions? Contact Nirav Merchant – [email protected]
iPlant : Connecting Users, Ideas & Resources
Core CI Foundation:
• Data layer
• Registry and Integration layer
• Compute and Analysis layer
• Interaction & Collaboration layer
iPlant: Using proven technologies
• Data layer: providing access to raw and ingested data sets, including high-throughput data transfers
  – iRODS
  – GridFTP, Aspera
  – DSpace (DuraSpace), Open Archives Initiative
  – Content Distribution Networks (CDN)
  – High performance storage @ TACC (Lustre)
  – MySQL and Postgres database clusters
  – Connection to other DataONE, DataNet initiatives
  – Cloud-style storage (similar to Amazon S3 and Walrus)
iPlant: Using proven technologies
• Registry and Integration layer: connecting services, data, and metadata elements using semantic understanding
iPlant: Using proven technologies
• Compute and Analysis layer: connecting tasks with scalable platforms and algorithms
  – Virtualization (Xen clusters)
  – High Performance Computing at TACC and TeraGrid
  – Grid (Condor, BOINC, Gearman)
  – Cloud (Eucalyptus, Nimbus, Hadoop)
  – Reconfigurable Hardware (GPU, FPGA)
  – Checkpoint & Restart (DMTCP)
  – Scaling and parallelizing code (MPI)
iPlant: Using proven technologies
• Interaction and Collaboration layer: providing end-user access to unified services and data, from API to large-scale visualization
  – Google Web Toolkit (GWT-driven front end)
  – Messaging bus (Java Mule, XMPP/Jabber)
  – RESTful web services (web API access)
  – Single sign-on/identity management (Shibboleth, OAuth)
  – Integration with desktop applications (via web services)
  – Sharing data (DOI, persistent URL, CDN, social networks)
  – Large-scale visualization (Large Tree, ParaView, ENVISION)
First DE
• Support for one use case: independent contrasts. But also…
  – Seamless remote execution of compute tasks on TeraGrid resources
  – Incorporation of existing informatics tools behind iPlant interface
  – Parsing of multiple data formats into Common Semantic Model
  – Seamless integration of online data resources
  – Role-based access and basic provenance support
• Next version will support:
  – Ultra High Throughput Sequencing pipeline, Variant Detection, Transcript Quantification
  – Public RESTful API
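For readers unfamiliar with the first DE's use case, the sketch below implements Felsenstein's phylogenetically independent contrasts on a tiny tree. It assumes a strictly bifurcating tree with positive branch lengths below the root. The `Node` class, the three-tip tree, and the trait values are hypothetical, and this is not the code behind the iPlant interface.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional, Tuple


@dataclass
class Node:
    branch_length: float           # length of the branch leading to this node
    trait: Optional[float] = None  # observed trait value (tips only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def independent_contrasts(node: Node, contrasts: List[float]) -> Tuple[float, float]:
    """Post-order traversal. Appends one standardized contrast per internal node and
    returns (estimated trait value, adjusted branch length) for this node."""
    if node.left is None or node.right is None:
        if node.trait is None:
            raise ValueError("every tip needs a trait value")
        return node.trait, node.branch_length
    x1, v1 = independent_contrasts(node.left, contrasts)
    x2, v2 = independent_contrasts(node.right, contrasts)
    # Difference between sister lineages, scaled by its expected standard deviation.
    contrasts.append((x1 - x2) / sqrt(v1 + v2))
    # Ancestral estimate: average weighted by inverse branch length.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the branch to account for uncertainty in the ancestral estimate.
    adjusted = node.branch_length + (v1 * v2) / (v1 + v2)
    return value, adjusted


# Hypothetical tree ((A:1, B:1):1, C:2) with trait values A=4, B=2, C=7.
tree = Node(0.0,
            left=Node(1.0, left=Node(1.0, trait=4.0), right=Node(1.0, trait=2.0)),
            right=Node(2.0, trait=7.0))
contrasts: List[float] = []
root_value, _ = independent_contrasts(tree, contrasts)
print(contrasts, root_value)
```

A tree with n tips yields n - 1 contrasts, which can then be analyzed as independent data points even though the tip values themselves are correlated by shared ancestry.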
Acknowledgments
University of Arizona
Rich Jorgensen
Greg Andrews
Kobus Barnard
Rick Blevins
Sue Brown
Vicki Bryan
Vicki Chandler
John Hartman
Travis Huxman
Tina Lee
Nirav Merchant
Martha Narro
Sudha Ram
Steve Rounsley
Suzanne Westbrook
Ramin Yadegari
Cold Spring Harbor Laboratory, NY
Lincoln Stein
Matt Vaughn
Doreen Ware
Dave Micklos
Sheldon McKay
Jerry Lu
Liya Wang
Texas Advanced Computing Center
Dan Stanzione
Michael Gonzales
Chris Jordan
Greg Abram
Weijia Xu
University of North Carolina-Wilmington
Ann Stapleton
Funded by NSF
Collaborating Institutions
• CSHL iPlant CI
• EMEC External Evaluator
• TACC iPlant CI
• UNCW iPlant CI
• Field Museum Natural History
• MoBot APWeb2
• BIEN Taxonomic Intelligence
• UCSB Image Platform
• UWISC Image Platform
• Boyce Thompson Inst. iPG2P
• KSU iPG2P
• UCD iPG2P
• VA Tech iPG2P
• Brown iPToL
• UFL iPToL
• UGA iPToL
• UPenn iPToL
• UTK iPToL
• Yale iPToL
Soft Collaborators
• 1kP Consortium
• ARS at USDA
• BRIT: Botanical Research Institute of Texas
• CGIAR and Generation Challenge Program
• Cyberinfrastructure for Phylogenetic Research (CIPRES)
• The Croquet Consortium
• NIMBioS: National Institute for Mathematical and Biological Synthesis
• Pittsburgh Supercomputing Center
• pPOD: processing PhyloData
• Syngenta Foundation
• NanoHub & HubZero
• ELIXIR
• Fluxnet
• Howard Hughes Medical Institute
• Knowledgebase
• NPN: National Phenology Network
• PEaCE Lab: Pacific Ecoinformatics and Computational Ecology Lab
• MORPH: Research Coordination Network (RCN)
• NCEAS: National Center for Ecological Analysis and Synthesis
• NEON: National Ecological Observatory Network
• NESCent: National Evolutionary Synthesis Center
Unprecedented Engagement with the Plant Science User Community
• A unique engagement process
  – The Grand Challenge process has resulted in the most intensive user input of any large-scale CI project to date.
• iPlant will construct a single CI for plant science, driven by grand challenges and specific user needs
• Grand Challenge Engagement Teams will continue this very close cooperation with the community
  – Work closely with the GC proposal team and the broader community
  – Build use cases to drive development
An Exemplar Virtual Organization for Modern Computational Science
• iPlant aims to be the Gold Standard against which other science-focused CI projects will be measured.
• One Cyberinfrastructure Team, many skills and roles
  – iPC CI creation is done by a diverse group:
    • Faculty, postdocs, staff, and students
    • Bioinformatics, Biology, Computing and Information Researchers, Software Engineers, Database Specialists, etc.
    • Arizona, Cold Spring Harbor, Texas, etc.
  – Many different tasks:
    • Engagement/Requirements, Tech Eval, Prototyping, Software Design (DE and Core), Data Integration, Systems, many more.
• A single Cyberinfrastructure Team, where roles may change rapidly to match skill sets
Timelines/Milestones
• Growth in staffing & capability; from a few in March 2009, now 47 involved in CI across all sites.
• Architecture definition in August-Sept 2009; enough to get started, still evolving.
• Software environment, tools, practices laid down about the same time.
• Real SW development commenced in September 2009.
• Serious prototyping and tool support in response to ET needs began ramping up in November.
Technology Eval Activities
• Largest investment in semantic web activities
  – Key for addressing the massive data integration challenges
• Exploring alternate implementations of QTL mapping algorithms
• Experimental Reproducibility
• Policy and Technology for Provenance Management
• Evaluation of HubZero, Workflow engines, numerous other tools
IPTOL CI – A High Level Overview
• Goal: Build very large trees, perhaps all green plant species
• Needs:
  – Most of the data isn’t collected. A lot of what is collected isn’t organized.
  – Lots of analysis tools exist (probably plenty of them), but they don’t work together, and use many different data formats.
  – The tree builder tools take too long to run.
  – The visualization tools don’t scale to the tree sizes needed.
IPTOL CI – High Level
• Addressing these needs through CI
  – MyPlant: the social networking site for phylogenetic data collection (organized by clade)
  – Provide a common repository for data without an NCBI home (e.g. 1kP)
  – Discovery Environment: Build a common interface, data format, and API to unite tools.
  – Enhance tree builder tools (RAxML, NINJA, SATé) with parallelization and checkpointing
  – Build a remote visualization tool capable of running where we can guarantee RAM resources
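To make the checkpointing idea concrete, here is a minimal application-level sketch: a long-running iterative search periodically saves its state and, after an interruption, resumes from the last saved point. The scoring function, file name, and state layout are hypothetical stand-ins. This does not show how RAxML, NINJA, or SATé work, and it is a different technique from transparent tools such as DMTCP.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash mid-write keeps the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"iteration": 0, "best_score": None}
    with open(path) as handle:
        return json.load(handle)


def run_search(path, total_iterations, checkpoint_every=10, stop_after=None):
    state = load_checkpoint(path)
    for i in range(state["iteration"], total_iterations):
        if stop_after is not None and i >= stop_after:
            break  # simulate the job being killed partway through
        score = -(i - 42) ** 2  # hypothetical stand-in for scoring a candidate tree
        if state["best_score"] is None or score > state["best_score"]:
            state["best_score"] = score
        state["iteration"] = i + 1
        if state["iteration"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "search.json")
    run_search(checkpoint, 100, stop_after=25)             # interrupted run
    resumed_from = load_checkpoint(checkpoint)["iteration"]
    final = run_search(checkpoint, 100)                    # restart picks up where it left off
print(resumed_from, final)
```

Only the work done since the last checkpoint is repeated after a failure, which matters when a single tree search can outlast a batch queue's wall-clock limit.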