

2010 Best Practices Competition IT & Informatics: HPC 

 

Nominating Company / User Organization / Project Title

JPR Communications / Amylin / Virtual Data Center
Bristol-Myers Squibb / Bristol-Myers Squibb, Research & Development / High Content Screening - Road
Cycle Computing / Purdue University / DiaGrid
DataDirect Networks, Inc. / Cornell University Center for Advanced Computing / Scalable Research Storage Archive
FalconStor Software / Human Neuroimaging Lab (HNL), Baylor College of Medicine / Ensuring a more reliable data storage infrastructure at Baylor College of Medicine's HNL
Isilon Systems / Oklahoma Medical Research Foundation / Transition to Nextgen Sequencing and Virtual Data Center
(not listed) / National Institute of Allergy and Infectious Diseases (NIAID) / A Centralized and Scalable Infrastructure Approach to Support Next Generation Sequencing at the National Institute of Allergy and Infectious Diseases
Panasas / Uppsala University / UPPNEX
(not listed) / TGen, The Translational Genomics Research Institute / NextGen Data Processing Pipeline

 


2010 Bio-IT Award

1. Nominating Organization Organization name: JPR Communications Address: 20750 Ventura Blvd., Ste. 350 City: Woodland Hills State: CA

2. Nominating Contact Person

Name: Judy Smith Title: President Phone: 818-884-8282 Email: [email protected]

3. User Organization

Organization name: Amylin Pharmaceuticals Address: 9360 Towne Centre Drive City: San Diego State: CA Zip: 92121

4. Contact Person

Name: Steve Phillpott Title: CIO Phone: 858-309-7585 Email: [email protected]

5. Project

Project Title: Amylin Virtual Data Center Category: IT and Informatics

6. Description of project (4 FIGURES MAXIMUM):

See slide presentation

A. ABSTRACT/SUMMARY of the project and results (800 characters max.) Amylin Pharmaceuticals is a San Diego-based biopharmaceutical company focused on providing first-in-class therapies for diabetes and obesity. Accomplishing Amylin's mission of "Challenging Science and Changing Lives" requires tremendous IT capabilities, and the company has a history of being an early adopter of technology. In 2008, the company's need for additional technology investment ran headlong into the economic realities of the time. Additionally, Amylin began to pursue a more flexible business model, emphasizing partnerships and virtualization over doing everything itself. In short, a core philosophy became "access tremendous capabilities, without owning those capabilities". Amylin's CIO, Steve Phillpott, and his IT leadership team applied this new strategy, developing an operating model they called the "Amylin Virtual Data Center", which uses detailed service costing and cloud and SaaS capabilities to dramatically lower the cost of IT.


B. INTRODUCTION/background/objectives Amylin IT set out to move to a flexible technology model that would allow access to world-class IT capabilities without having to operate each of those capabilities. First, the team spent several months preparing a detailed cost analysis for every service it provides. This "cost by service" model included the labor, licensing, maintenance, hardware, data center and even power costs for each service and application, allowing the team to make more accurate comparisons between delivering services internally and externally. The result was a list of IT services and applications provided by Amylin IT, each of which would be assessed to determine whether the same service could be provided at lower cost through utility services. Besides cost, other factors were also considered, including security, performance, architectural appropriateness for the cloud, and vendor capability. Importantly, the team actively looked for opportunities where SaaS and the cloud would work, rather than enumerating all the reasons why the cloud doesn't work.

Amylin built out a "toolkit" of cloud and SaaS offerings that its IT staff could use to enable flexible IT. For Infrastructure as a Service (IaaS), they chose Amazon Web Services (AWS) and vertical offerings built on AWS. For Platform as a Service (PaaS), they use Force.com. For cloud storage, they started a relationship with Nirvanix. Finally, they began a deep investigation of Software as a Service (SaaS) capabilities to meet their application needs. In each case, internal IT teams would begin pilot projects, have personal "sandboxes", and get to understand these capabilities on a technical level. In the case of Amazon, Force.com, and Nirvanix, initial skepticism turned into a positive response as the capabilities of these tools were understood. Getting tools into the hands of technical people was key to gaining their understanding and buy-in.

C. RESULTS (highlight major R&D/IT tools deployed; innovative uses of technology). As Amylin rolled out its cloud initiatives, it first focused on Amazon EC2 to host a limited number of application use cases. Amazon EC2 will continue to grow as a hosting platform for Amylin, and additional migrations are planned for this year and 2011. Amylin has a number of internal legacy applications, often without the internal resources to manage or upgrade them. As initial pilot applications were successful, the team is now planning to move legacy applications to Force.com. The component reusability and rich platform led Amylin developers to conclude they could be more productive in such an environment. The third focus was storage and disaster recovery capabilities. Rather than building out an in-house system, Amylin called upon Nirvanix, a cloud storage partner. Amylin server images and data are now stored in the Nirvanix cloud, meeting compliance requirements and providing disaster recovery and backup capability for Amylin's data. Finally, Amylin invested significant time understanding the wide range of SaaS offerings available. Frequently, they discovered that SaaS offerings were more feature-rich and easier to use than internally hosted applications. Currently, Amylin utilizes more than half a dozen SaaS applications, and migrations to several more are in progress. These include Workday, Microsoft Hosted Exchange, LiveOffice, and Saba.
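As a concrete illustration of the "cost by service" model described in the introduction above, the sketch below totals the fully loaded internal cost of a service and compares it with an external quote. All figures, names and the comparison rule are hypothetical; the nomination does not disclose Amylin's actual cost data or tooling.

```python
# Illustrative "cost by service" comparison (hypothetical figures, not Amylin's
# actual data): total up the fully loaded internal cost of a service and
# compare it with an external/cloud quote.

from dataclasses import dataclass

@dataclass
class ServiceCost:
    name: str
    labor: float          # annual labor cost, USD
    licensing: float      # annual licensing cost, USD
    maintenance: float    # annual maintenance cost, USD
    hardware: float       # annualized hardware cost, USD
    datacenter: float     # allocated data center cost, USD
    power: float          # annual power cost, USD

    def internal_total(self) -> float:
        return (self.labor + self.licensing + self.maintenance
                + self.hardware + self.datacenter + self.power)

def compare(service: ServiceCost, external_quote: float) -> str:
    internal = service.internal_total()
    verdict = ("candidate for external delivery"
               if internal > external_quote else "keep internal")
    return (f"{service.name}: internal ${internal:,.0f}/yr vs "
            f"external ${external_quote:,.0f}/yr -> {verdict}")

if __name__ == "__main__":
    email_archive = ServiceCost("Email archiving", labor=60_000, licensing=25_000,
                                maintenance=10_000, hardware=30_000,
                                datacenter=12_000, power=3_000)
    print(compare(email_archive, external_quote=70_000))
```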
Amylin used the following tools (cloud services) to meet their business needs:

Nirvanix: Nirvanix Storage Delivery Network (SDN) for enterprise cloud storage. The project involved moving critical validated server images that are used for all business and manufacturing applications and for drug simulation processes such as BLAST, C-Path and other genomics simulations. Since these are critical images and are frequently used, they are stored on a Tier 1 storage platform to ensure high availability and safety. Nirvanix provided better capabilities and an additional level of protection, as the images are now stored in the cloud and are protected against any datacenter or localized infrastructure failure within Amylin. Nirvanix's "plug and play" architecture enabled them to seamlessly integrate CloudNAS into their environment without any overhaul of their existing setup, and the new release of the product ties into their existing NetBackup and CommVault setup, further simplifying the backup, recovery and e-discovery processes.

Amazon: Amylin leveraged Amazon for their compute infrastructure services (EC2). Several applications have been piloted in EC2, and some are now in full production. Additionally, Amylin expects to leverage EC2 and Cycle Computing's CycleCloud for high-performance research computing in the coming years.

LiveOffice: Amylin implemented LiveOffice Mail Archive to store all Amylin email archives for compliance purposes. This saved a significant investment in an in-house email eDiscovery capability and was available to the business much sooner than building a software solution.

Symplified: Amylin deployed Symplified's SaaS identity management package. Amylin found that deploying SaaS and cloud applications increased the problem of user account management and logins, and Symplified provided a fast-to-deploy and affordable solution.

D. ROI achieved or expected (1000 characters max.): The cloud storage strategy resulted in a cost reduction of more than 50 percent compared to the Tier 1 solution. In many cases, ROI was achieved within a couple of months of production use. Further, cloud storage enabled Amylin to achieve a significant business milestone of having a basic data disaster recovery (DR) solution by storing and protecting data in the cloud.

E. CONCLUSIONS/implications for the field (800 characters max.) Amylin implemented cloud solutions early, did extensive research, and selected some of the leaders in the cloud computing market. Starting a new infrastructure is a learning experience, and Amylin continues to educate itself on recent cloud advancements and to test its current plans. Amylin is looking ahead to possibly launching internal virtualization and a private cloud with VMware, further complementing its current cloud deployments. With their four layers of the cloud in place, Amylin is in a solid position and can make sound selections based upon cost, control, performance and best fit.

6. REFERENCES/testimonials/supporting internal documents: See PowerPoint presentation.
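For the compute side, the following minimal sketch shows how a single pilot instance might be launched on EC2. It uses the present-day boto3 SDK purely for illustration; the AMI ID, instance type and tag are placeholders, and the nomination does not describe the tooling Amylin actually used.

```python
# Minimal sketch of launching one EC2 compute instance for a pilot application,
# in the spirit of the pilot-then-production approach described above.
# Assumes AWS credentials are already configured; AMI ID and instance type are
# placeholders.

import boto3

def launch_pilot_instance(ami_id: str, instance_type: str = "t3.medium") -> str:
    """Launch a single tagged instance and return its instance ID."""
    ec2 = boto3.resource("ec2")
    (instance,) = ec2.create_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Project", "Value": "cloud-pilot"}],
        }],
    )
    instance.wait_until_running()
    return instance.id

if __name__ == "__main__":
    print(launch_pilot_instance("ami-12345678"))  # placeholder AMI ID
```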


BMS is submitting a nomination for the Best Practices Awards at Bio-IT World 2010. The Best Practices category is: IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies.

Following is the nomination form that will be filled in online at the conference website.

_____________________________________________________________ 

Bio-IT World 2010 Best Practices Awards  

Celebrating Excellence in Innovation  

 INSTRUCTIONS and ENTRY FORM 

www.bio‐itworld.com/bestpractices  

DEADLINE FOR ENTRY: January 18, 2010 (Updated deadline: February 19, 2010)  

Bio-IT World is seeking submissions to its 2010 Best Practices Awards. This prestigious awards program is designed to recognize outstanding examples of technology and strategic innovation: initiatives and collaborations that manifestly improve some facet of the R&D/drug development/clinical trial process.

The awards attract an elite group of life science professionals: executives, entrepreneurs, innovators, researchers and clinicians responsible for developing and implementing innovative solutions for streamlining the drug development and clinical trial process. All entries will be reviewed and assessed by a distinguished peer-review panel of judges.

The winners will receive a unique crystal award to be presented at the Best Practices Awards dinner, on Wednesday, April 21, 2010, in conjunction with the Bio-IT World Conference & Expo in Boston. Winners and entrants will also be featured in Bio-IT World.

INSTRUCTIONS

1. Review criteria for entry and authorization statement (below). 


 A. Nominating Organization Organization name:   Bristol‐Myers Squibb Address:   

 B.  Nominating Contact Person Name:   Mohammad Shaikh Title:      Associate Director Tel:        (609) 818 3480 Email:   [email protected] 

 2.  User Organization (Organization at which the solution was deployed/applied) 

A. User Organization Organization name: Bristol-Myers Squibb, Research & Development Address: 311 Pennington-Rocky Hill Road, Pennington, NJ 08534

 B. User Organization Contact Person Name:   Donald Jackson Title:  Sr. Research Investigator II Tel: 609‐818‐5139 Email: [email protected] 

 3. Project   

Project Title:   High Content Screening ‐ Road Team Leader:  Name:  James Gill Title: Director Tel:    203.677.5708 Email: [email protected] 

Team members – Michael Lenard, James Scharpf, Russell Towell, Richard Shaginaw, Normand Cloutier   4. Category in which entry is being submitted (1 category per entry, highlight your choice)  

Basic Research & Biological Research: Disease pathway research, applied and basic research
Drug Discovery & Development: Compound-focused research, drug safety
Clinical Trials & Research: Trial design, eCTD
Translational Medicine: Feedback loops, predictive technologies
Personalized Medicine: Responders/non-responders, biomarkers
IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies
Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, resource optimization
Health-IT: ePrescribing, RHIOs, EMR/PHR
Manufacturing & Bioprocessing: Mass production, continuous manufacturing

 (Bio‐IT World reserves the right to re‐categorize submissions based on submission or in the event that a category is refined.)  

6. Description of project (4 FIGURES MAXIMUM):   

 A. ABSTRACT/SUMMARY of the project and results (150 words max.)  

High-content screening (HCS) data has unique requirements that are not supported by traditional high-throughput screening databases. Effective analysis and interpretation of the screen data requires the ability to designate separate positive and negative controls for different measurements in multiplexed assays.

The fundamental requirements are the ability to capture information on the cell lines, fluorescent reagents and treatments in each assay; the ability to store and utilize individual-cell and image data; and the ability to support HCS readers and software from multiple vendors along with third-party image analysis tools. The system supports target identification, lead discovery, lead evaluation and lead profiling activities.

The solution was designed using a combination of complementary technologies that later became part of best practices at Bristol-Myers Squibb's Research Informatics. The image data generated by HCS processes exceeds 50 TB over five years and has shown exponential growth. Database and data logistics were built using Oracle 11g partitioning techniques; Isilon storage was used to handle unstructured data and EMC storage for relational data. The application was built using techniques such as external tables, caching, materialized views and parallel queries, and used the .NET Framework for business rules and visualizations. Statistical functions in Oracle API libraries were leveraged for analysis.

 INTRODUCTION/background/objectives   

High content screening (HCS) has demonstrated utility at multiple points in the drug discovery process including target identification, target validation, lead identification, lead evaluation and profiling [1], mechanism of action determination [2] and toxicology assessment [3]. Within a single organization, HCS may be used for multiple purposes with distinct groups and even instruments supporting different stages of drug discovery. The scope of HCS projects can range from large-scale compound and RNAi collections tested in high-throughput screens to the detailed characterization of small numbers of compounds in multiple assays and cell lines. Despite their different roles, each group has common needs for data analysis including: deriving numeric measurements from images; connecting results with treatments, cell lines and assay readouts; identifying positive and negative controls to normalize data; rejecting failed data points; and selecting hits or fitting concentration-response curves. Establishing a common framework for HCS data allows users from different groups to analyze their results and share best practices and algorithms between users and instruments.

HCS data can be divided into three types: image data, derived data (e.g. single cell measurements and well-level summary statistics), and metadata [4]. This last data type includes both procedural information (e.g., how the images were acquired and analyzed) and experimental annotation (what cell lines, fluorescent probes and treatments were used). Procedural metadata is captured by most HCS platforms and by open-source projects such as the Open Microscopy Environment (OME) [5]. Experimental annotation metadata is less well supported even though it is essential for the interpretation and analysis of HCS results. The Minimum Information About a Cellular Assay (MIACA) standard established guidelines for what experimental annotation should be included in scientific publications [6] but is not intended for laboratory data management.

HCS data shares many requirements with other types of high-throughput screening data, especially from cell-based assays. In particular, the need to capture assay design information in a structured and consistent manner is essential for the analysis and reporting of experimental results [7]. Other essential components include a reagent registry (for compounds, RNAi reagents, and other reagent types), a reagent inventory database (with information on plate maps), and tools for hit selection and concentration-response analysis [8].

Despite the parallels to HTS data, managing and analyzing HCS data presents distinct challenges not encountered with other assay platforms, including single-endpoint cell based assays. First, HCS is image-based. Access to the underlying images is essential to troubleshoot problems, confirm and understand results, and communicate results to colleagues. Second, HCS produces large amounts of data. For example, a single 384-well plate can produce over 2 GB of images and millions of records of derived data [4]; this scale of data requires support from information technology experts along with mechanisms to systematically identify and delete unneeded data. Third, HCS assays often multiplex several distinct biological readouts in the same well. This requires the ability to designate separate positive and negative controls for different channels or even measurements so that assay performance and result normalization can generate meaningful values. Fourth, multiple vendors produce HCS readers and image analysis packages, along with third-party analysis packages such as CellProfiler [9]. Results and images must be converted to a common format so data and analysis tools can be shared between groups. Finally, HCS assays are inherently cell-based. Consistent identification of the cell lines, fluorescent dyes or antibody conjugates, and fluorescent proteins used in each assay is essential for the proper documentation and long-term mining of HCS results.

To address these requirements we developed HCS Road, a data management system specifically designed for HCS. As the name indicates, HCS Road provides a smooth, well-defined route from image quantification to data analysis and reporting. The system combines an experiment definition tool, a relational database for results storage, assay performance reports, data normalization, and analysis capabilities. HCS Road currently supports multiple imaging platforms and provides a common repository for HCS data across instruments and user groups. In this work, we describe the approaches we took for data storage, experimental annotation, and data analysis and the scientific and business reasons for those decisions. We also present an XML schema for HCS data that supports multiple HCS platforms.

RESULTS (highlight major R&D/IT tools deployed; innovative uses of technology). 

  

System Architecture

Figure 1 shows an overview of the architecture of HCS Road. HCS Road currently supports three platforms: the Cellomics ArrayScan, the InCell 1000 (GE Healthcare, Parsippany, NJ), and the Evotec Opera. Images are analyzed with the appropriate software and the results are collected in platform-specific files or a platform database such as Cellomics Store. An HCS Road service converts data to a common XML format for import into the HCS Road database. Once the data is loaded into HCS Road it is merged with experimental annotation and treatment plate maps. Data import and merging can be performed manually or automatically based on previously registered plate barcodes. QC metrics and normalized results are calculated automatically and can be reviewed and analyzed using the HCS Road client or exported to third-party applications such as TIBCO Spotfire (TIBCO Software, Cambridge, MA).
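The following is a hypothetical sketch of the kind of transformation the import service performs when it converts platform-specific well results into a common XML format. The element and attribute names are invented for illustration; the actual HCS Road schema is not reproduced in this nomination.

```python
# Hypothetical conversion of platform-specific well results into a common XML
# format, analogous to the HCS Road import service. Element and attribute
# names are invented for illustration only.

import xml.etree.ElementTree as ET

def wells_to_xml(plate_barcode: str, platform: str, wells: list) -> bytes:
    """wells: [{'well': 'A01', 'measurement': 'SelectedObjectCount', 'value': 512.0}, ...]"""
    root = ET.Element("HcsPlate", barcode=plate_barcode, platform=platform)
    for w in wells:
        well_el = ET.SubElement(root, "Well", position=w["well"])
        ET.SubElement(well_el, "Result",
                      measurement=w["measurement"]).text = str(w["value"])
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    sample = [{"well": "A01", "measurement": "SelectedObjectCount", "value": 512.0}]
    print(wells_to_xml("PLATE0001", "ArrayScan", sample).decode())
```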

Users interact with HCS Road through two client applications. The Data Import application enables users to select plates for import from the platform-specific data repository (Cellomics database, Opera or InCell file share). Multiple plates can be transferred in parallel for faster import, and well summary results are imported separately from cell-level measurements so users can review well-level results more quickly. A web-based administration tool controls the number of threaded processes and other data import settings. Experimental annotation, data mining and visualization are supported by the dedicated Data Explorer client application. Data-intensive operations, including data extraction and updates, QC and data analysis, are implemented on the servers and the database to reduce the amount of data transferred from server to client. The Data Explorer also allows users to view images for selected wells or as a 'poster' of images for an entire plate. Images can also be viewed in third-party applications such as TIBCO Spotfire using a web page (Fig. 1). In either case, the image conversion server retrieves images from the appropriate platform repository and converts them from proprietary formats to standard TIFF or JPEG formats as needed.
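A minimal sketch of the on-demand image conversion step is shown below, assuming the source image has already been exported as a TIFF. Pillow is used purely for illustration; readers for the vendors' proprietary formats are outside the scope of this sketch.

```python
# Minimal on-demand conversion of a TIFF image to JPEG bytes suitable for a
# web viewer. Assumes the proprietary-format retrieval has already produced a
# TIFF file; Pillow is used here only for illustration.

from io import BytesIO
from PIL import Image

def tiff_to_jpeg_bytes(tiff_path: str, quality: int = 90) -> bytes:
    """Convert a single TIFF image to JPEG bytes."""
    with Image.open(tiff_path) as img:
        rgb = img.convert("RGB")   # JPEG has no 16-bit or alpha support
        buf = BytesIO()
        rgb.save(buf, format="JPEG", quality=quality)
        return buf.getvalue()
```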

IT Tools & Techniques

The large volumes of data generated by HCS require particular attention to image and data storage and management.

Storage: The HCS storage system provides scalable and extensible capacity that is well suited to managing large numbers of images. The distributed nature of the system means that input and output bandwidth grow in parallel with capacity, avoiding a potential bottleneck. Images are stored at or near the site where they were acquired (and where they are likely to be analyzed or viewed) to reduce network latency issues. This approach reduced storage costs while increasing the bandwidth for image transfer.

After extensive product evaluation, we decided on Isilon Systems clustered network-attached storage appliances. We deployed these as a file service, exposing several Windows networking file shares to the HCS readers as well as to researcher workstations. Key differentiators influencing our decision for the Isilon NAS cluster were a true unified namespace, robust data protection algorithms, straightforward scalability using building-block nodes, ease of administration via the FreeBSD CLI, and lower-cost SATA disks.

Data Management: The large number of data records generated by HCS also presents an informatics challenge. We store HCS results in Oracle relational databases, as do other HCS users [10]. These databases can become very large, primarily because of cell-level data. We observed that as the size of our databases grew, performance deteriorated. To address this, we used Oracle's database partitioning capabilities. We focused our efforts on the two largest tables in the database, which both contain cell-level data. Our partitioning scheme exploits the fact that, once written, cell-level data is unlikely to change. Partitioning the tables in a coordinated fashion provided 10-fold reductions in data load times and 20-fold reductions in query times. Historical partitions are accessed in read-only mode, which helps to protect data integrity and speeds up database backup and recovery.
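The snippet below shows illustrative Oracle-style DDL in the spirit of this partitioning scheme: a cell-level results table range-partitioned by load date, with an older tablespace switched to read-only. The table, column and tablespace names are hypothetical; the actual HCS Road schema is not shown in the nomination.

```python
# Illustrative Oracle-style DDL for a range-partitioned cell-level results
# table. Names are hypothetical; the statements would be executed by a DBA or
# via a driver such as cx_Oracle.

CELL_RESULTS_DDL = """
CREATE TABLE cell_results (
    plate_id      NUMBER        NOT NULL,
    well_position VARCHAR2(4)   NOT NULL,
    cell_index    NUMBER        NOT NULL,
    measurement   VARCHAR2(128) NOT NULL,
    value         BINARY_DOUBLE,
    load_date     DATE          NOT NULL
)
PARTITION BY RANGE (load_date) (
    PARTITION p_2009_q4 VALUES LESS THAN (DATE '2010-01-01'),
    PARTITION p_2010_q1 VALUES LESS THAN (DATE '2010-04-01'),
    PARTITION p_current VALUES LESS THAN (MAXVALUE)
)
"""

# Once a historical partition is no longer written to, its tablespace can be
# made read-only, which shortens backups and protects the data:
MAKE_READONLY = "ALTER TABLESPACE cell_results_2009_q4 READ ONLY"

if __name__ == "__main__":
    print(CELL_RESULTS_DDL)
    print(MAKE_READONLY)
```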

Experimental annotation

HCS Road captures information on experimental treatments and conditions in a way that enables long-term mining of results across assays and users and enforces consistent nomenclature for cell lines, detection reagents, and control or experimental treatments. Figure 2 shows the workflow for assay definition, treatment selection, and data import and analysis. Much of this information is referenced or imported from other databases. Thus, HCS Road imports or references treatment information such as compound structures, RNAi targets and sequences, and library plates from existing enterprise databases (green box in Fig. 2). Similarly, cell line information is linked to an enterprise registry that tracks information on source, tissue type, transgenic constructs, passages and other relevant information. This reduces the data entry burden on users, reduces errors, and ensures consistency within HCS Road and with data from other platforms. Annotation that cannot be imported or referenced is stored in the Road database. For example, information on fluorescent probes including probe name, vendor and catalog number, fluorescent characteristics and molecular or cellular targets is stored within HCS Road in a way that supports re-use across multiple assays.

The creation of a new assay begins with the selection of the cell line(s) and fluorescent probes used in an experiment (yellow box in Fig. 2). Control and reference compounds can be selected from the reagent registry or entered manually (as for commercially purchased reagents). Business metadata is also collected to enable reports of results across multiple assays and to support data retention decisions. Next, one or more ‘master’ plates are created with information on cell seeding along with locations and concentrations of control and reference treatments and fluorescent probes. HCS Road supports multiple plate layouts including 96, 384 and 1536-well; additional custom layouts can be quickly defined as needed. Finally, multiple copies of this master plate are created to correspond to the physical plates in the assay. Reagents tested in the assay can be entered manually (as during assay development) or automatically from existing reagent databases (green box in Fig. 2). Assays and plates can also be copied to streamline small changes to experimental designs or plate layouts.

The last step in experimental annotation is the assignment of positive and negative control treatments (blue box in Fig. 2). Different treatments can be designated as positive and negative controls for different measurements. This provides the flexibility needed to support multiplexed, multi-parameter HCS assays and provide meaningful performance metrics and normalized results. Control status is assigned to treatments (or treatment combinations) rather than to well locations. Any wells that receive the control treatment(s) become controls for the specified measurement(s). This reduces the amount of data users must enter, allows a single analysis protocol to support multiple plate layouts (for example, in screening multiple existing reagent collections with different layouts), and facilitates the re-use of assay definitions.
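A hypothetical sketch of this treatment-based control assignment is shown below: any well that received the designated control treatment becomes a control for the specified measurement, regardless of its location on the plate. The class and field names are invented for illustration.

```python
# Hypothetical data structures for assigning control status to treatments
# rather than to well locations, as described above.

from dataclasses import dataclass, field

@dataclass
class Well:
    position: str       # e.g. "A01"
    treatment: str      # treatment (or treatment combination) applied

@dataclass
class ControlMap:
    # measurement name -> (negative-control treatment, positive-control treatment)
    controls: dict = field(default_factory=dict)

    def role(self, measurement: str, well: Well) -> str:
        neg, pos = self.controls.get(measurement, (None, None))
        if well.treatment == neg:
            return "NEG"
        if well.treatment == pos:
            return "POS"
        return "SAMPLE"

if __name__ == "__main__":
    wells = [Well("A01", "DMSO"), Well("A02", "siTOX"), Well("B01", "siGENE42")]
    cmap = ControlMap({"SelectedObjectCount": ("DMSO", "siTOX")})
    for w in wells:
        print(w.position, cmap.role("SelectedObjectCount", w))
```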

Data loading and analysis

Once images have been collected and analyzed, the results are loaded into HCS Road for analysis (pink box in Fig. 2). Images and numeric results are imported from platform repositories using a dedicated, internally developed application. Data can be loaded automatically using pre-defined criteria or selected manually after image acquisition and analysis are complete. Multiple sets of images and results can be loaded for a single assay plate to support kinetic imaging and re-imaging or re-analysis of plates using different objectives, filters or analysis algorithms. Results are associated with assay plates manually or using barcodes on the assay plates.

HCS Road calculates multiple quality control metrics and provides tools for rejecting failed wells or plates. In addition to the Z' metric of Zhang et al. [11], the plate mean, median, standard deviation, minimum and maximum are reported for negative control, positive control and sample wells for each plate in a run. Users can review individual plates and may choose to reject all measurements from a well or only reject selected measurements. The ability to selectively reject measurements is necessary because of the multi-parameter nature of HCS assays. For example, a treatment may reduce cell count in a multiplexed assay; this is a legitimate result but measurements in other channels may not be reliable.
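For reference, the Z' metric of Zhang et al. [11] can be computed from a plate's control wells as sketched below; the well values are made-up numbers for illustration.

```python
# Z' quality metric (Zhang et al., reference 11) from positive- and
# negative-control wells of a single plate; values are illustrative only.

from statistics import mean, stdev

def z_prime(pos_controls: list, neg_controls: list) -> float:
    """Z' = 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|"""
    return 1.0 - 3.0 * (stdev(pos_controls) + stdev(neg_controls)) / abs(
        mean(pos_controls) - mean(neg_controls))

if __name__ == "__main__":
    pos = [98.2, 101.5, 99.7, 100.4]
    neg = [1.3, -0.8, 0.5, 0.1]
    print(f"Z' = {z_prime(pos, neg):.2f}")   # values above ~0.5 indicate a robust assay
```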

Data analysis

Commonly used analyses are implemented as fixed workflows within the HCS Road Data Explorer application. HCS Road automatically performs multiple normalizations when data is loaded. The calculations include percent control, percent inhibition, signal to background and z-score [12]. The first analysis we implemented was concentration-response curve fitting. Curves are fit using a 4-parameter logistic regression with XLfit equation 205 (ID Business Solutions, Guildford, UK). A graphic view shows the fit line and data points for an individual compound. Data points are linked to the corresponding images so users can review the images for a well and choose to reject it and recalculate the fit. The resulting IC50 values were consistent with those produced by our existing HTS analysis tools (not shown).
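The sketch below illustrates a percent-inhibition normalization and a generic 4-parameter logistic fit equivalent in spirit to the calculations described above. The production system uses XLfit equation 205; SciPy and the example data here are used only for illustration.

```python
# Generic control-based normalization and 4-parameter logistic (4PL) fit.
# Illustrative only; concentrations and responses are invented example data.

import numpy as np
from scipy.optimize import curve_fit

def percent_inhibition(value: float, neg_mean: float, pos_mean: float) -> float:
    """Normalize a raw well value against plate controls (0% = neg mean, 100% = pos mean)."""
    return 100.0 * (value - neg_mean) / (pos_mean - neg_mean)

def four_pl(x, bottom, top, ic50, hill):
    """4PL: y = bottom + (top - bottom) / (1 + (x / ic50)**hill)"""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)

def fit_ic50(conc: np.ndarray, response: np.ndarray) -> float:
    """Fit a concentration-response curve and return the fitted IC50."""
    p0 = [response.min(), response.max(), float(np.median(conc)), 1.0]
    params, _ = curve_fit(four_pl, conc, response, p0=p0, maxfev=10_000)
    return params[2]

if __name__ == "__main__":
    conc = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0])          # uM
    activity = np.array([98.0, 95.0, 85.0, 60.0, 30.0, 10.0, 2.0])   # % of control
    print(f"IC50 ~ {fit_ic50(conc, activity):.2f} uM")
    print(f"{percent_inhibition(55.0, neg_mean=2.0, pos_mean=100.0):.1f}% inhibition")
```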


We also identified a need to export results and annotation from HCS Road to third-party applications so researchers can perform calculations and generate visualizations that are not part of a common workflow. We use TIBCO Spotfire for many of our external visualizations because it can retrieve data directly from the HCS Road database; it supports multiple user-configurable visualizations; it provides tools for filtering and annotating data; and it can perform additional analyses using internal calculations or by communicating with Accelrys Pipeline Pilot (San Diego, CA). Figure 3 shows a Spotfire visualization for analyzing RNAi screening results. This workflow retrieves results and treatment information from the HCS Road database. The user is presented with information on the distribution of normalized values for each endpoint and can select wells that pass the desired activity threshold. Additional panels identify RNAi reagents where multiple replicate wells pass the threshold and genes where multiple different RNAi reagents scored as hits, an analysis that is unique to RNAi screening. Within Spotfire, HCS assay results can be cross-referenced with other information such as mRNA expression profiling to identify RNAi reagents whose phenotype correlates with levels of target expression in the assay cell line (not shown).

Cell-level data

Managing and analyzing cell-level data was a high priority in the development of HCS Road. Cell-level data enables the analysis of correlations between measurements at the cellular level, the use of alternative data reduction algorithms such as the Kolmogorov-Smirnov distance [13, 14], classification of subpopulations by cell cycle phase [15], and other approaches beyond basic well-level statistics [16]. However, the volume of cell data in an HCS experiment can be very large. Storing cell data as one row per measurement per cell creates a table with large numbers of records and slows down data loading and retrieval. Because cell data is typically used on a per-plate/feature basis for automated analyses and for manual inspection, we chose to store it in files on the HCS Road file share (Fig. 1) rather than in the database. When cell data is needed, it is automatically imported into a database table using Oracle's bulk data loading tools. When the cell measurements are no longer needed the records are deleted from the Road database (but are still retained in files). This controls database growth and improves performance compared to retaining large numbers of records in the database.
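As one example of what cell-level data enables, the sketch below computes the Kolmogorov-Smirnov distance [13, 14] between the single-cell measurement distribution of a treated well and that of control cells, using randomly generated stand-in data.

```python
# Kolmogorov-Smirnov distance between treated and control single-cell
# measurement distributions; random data stands in for real per-cell values.

import numpy as np
from scipy.stats import ks_2samp

def ks_distance(treated_cells: np.ndarray, control_cells: np.ndarray) -> float:
    """KS statistic: maximum distance between the two empirical CDFs."""
    return ks_2samp(treated_cells, control_cells).statistic

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    control = rng.normal(loc=100.0, scale=15.0, size=2_000)   # e.g. per-cell intensity
    treated = rng.normal(loc=130.0, scale=20.0, size=1_800)
    print(f"KS distance = {ks_distance(treated, control):.3f}")
```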

ROI achieved:

HCS Road currently supports target identification, lead identification and lead profiling efforts across multiple groups within BMS Applied Biotechnology. Scientists can analyze their experiments more rapidly and the time needed to load, annotate and review experiments has been reduced from days to hours. Integration with existing databases reduces the amount of data users must enter, reduces errors and facilitates integration with results from other assay platforms. HCS Road enables new types of experiments that were not supported by our previous data management tools, including 1536-well HCS assays and cell cycle analysis based on DNA content measures for individual cells. HCS Road provides a single source for data from Cellomics ArrayScan, GE InCell and Evotec Opera instruments. Finally, HCS Road facilitates the sharing of assays and analysis tools between groups. Users can review assay data from other groups, determine whether a cell line or fluorescent probe has been used before, and see how a hit from their assay performed in previous experiments.

The data management solutions we implemented allow us to handle the large volumes of data that HCS generates. Database partitioning reduces backup times and improves query performance; network attached storage systems enable the storage and management of large numbers of images; and the use of file-based storage with transient database loading for cell level data allows us to analyze this unique result type while minimizing database size.

CONCLUSIONS. Successfully developing an enterprise-wide data management system for HCS results presents challenges. The diversity of instruments, users and projects raises the question of whether it is better to develop isolated systems tailored to the requirements of a single group or instrument type. We concluded that the benefits of an integrated system were worth the effort required. HCS Road currently supports multiple imaging platforms and research groups and provides a single point of access for results and experimental annotation. It facilitates the sharing of assays and data analysis methods between groups and provides a rich and structured model for annotating cell-based assays.

We chose to develop our own system for HCS data management so that we could accommodate our needs and workflows and could integrate it with other enterprise databases. A consequence of this integration is that no two organizations' solutions will look exactly the same. Large organizations will wish to accommodate their existing workflows and databases, whereas smaller organizations may need to implement some of those functions within their HCS data management system. We believe that the requirements and solutions we identified will be informative to other HCS users looking to develop or purchase their own data management solution.

The system was built using technologies from multiple vendors who made several updates to their architectures to optimize the performance and reliability of the solution. The partitioning techniques first deployed at BMS for this application were later adopted and standardized by Cellomics.

BMS was one of the first in the pharmaceutical industry to use Isilon storage for managing structured as well as unstructured lab data. Isilon Systems accommodated several suggestions from the BMS design team to its firmware and architecture, which benefited many other use cases. At BMS, use of Isilon storage was later extended to manage neuroscience video files, mass spectrometry raw and result files, NMR data, bright field images, HPLC LIMS contents, non-chrome LIMS contents and Oracle recovery files generated by RMAN and Flash recovery systems.


[Figure 1 image not reproduced in this extraction. Components shown: HCS instruments (ArrayScan, Opera, InCell 1000) with their platform repositories (Cellomics Store database, image shares, image + data shares); the HCS Road database, services and image conversion; the HCS Road file share; the HCS Road Data Explorer; third-party tools (TIBCO Spotfire); the enterprise results repository; and the reagent and cell line registries.]

FIG. 1. Overview of HCS Road components showing data flow from HCS instruments through the HCS Road database and file share to data analysis and visualization tools. Blue icons designate instrument-specific databases and file shares. Green arrows and green box indicate HCS Road components. Gray arrows indicate data import or export to existing enterprise databases or third-party analysis tools.

  


[Figure 2 image not reproduced in this extraction. Workflow steps shown: Assay Definition (select cell line; define or select fluorescent probes; enter business metadata such as client group and program; define master plate layout covering cell lines, seeding density, probes and control/reference treatments; define or select additional compounds; create assay plates), Library Definition (external: register reagents, define plate maps, register barcodes), Analysis Definition (select measurements for analysis; designate control treatments for each measurement at the well and cell level), and Data Loading & Analysis (create imaged plates; import images and data from the HCS reader/software; calculate QC metrics such as Z', mean and CV; review results and reject outliers; analyze data; publish results to the enterprise results database).]

FIG. 2. Workflow for experiment definition, data import and analysis. White boxes show workflow steps and colored boxes indicate functional subsets of the process. Black arrows indicate workflow progression and dependencies between steps.

  


[Figure 3 screenshot not reproduced in this extraction. Panels shown: normalized data distribution, wells per treatment, available measurements, treatments per gene, normalized data statistics, normalized results for hits, and number of hits.]

Figure 3: TIBCO Spotfire workflow for hit selection from RNAi screens from HCS Road showing: (top left) table of available measurements; (top center) histograms of cell count percent inhibition for control and library wells across multiple runs; (top right) table of summary statistics for normalized cell count for control and library reagents; (middle left) bar chart of numbers of wells per RNAi reagent with normalized values above a user-defined threshold (blue shading indicates hit reagents where at least 4 of 6 replicate wells passed the threshold); (bottom left) bar chart of numbers of individual RNAi reagents per gene where 4 or more replicate wells passed the normalized value threshold (red shading indicates hit genes where 3 or more independent RNAi reagents for the same gene were selected as hits); (middle right) table of median cell count percent inhibition values for all hit genes; (bottom left) numbers of wells, RNAi reagents and genes selected as hits.

  


7. REFERENCES/testimonials/supporting internal documents (If necessary; 5 pages max.) 

1. Agler M, Prack M, Zhu Y, Kolb J, Nowak K, Ryseck R, Shen D, Cvijic ME, Somerville J, Nadler S, Chen T: A high-content glucocorticoid receptor translocation assay for compound mechanism-of-action evaluation. J Biomol Screen 2007; 12:1029-1041.

2. Ross-Macdonald P, de Silva H, Guo Q, Xiao H, Hung CY, Penhallow B, Markwalder J, He L, Attar RM, Lin TA, Seitz S, Tilford C, Wardwell-Swanson J, Jackson D: Identification of a nonkinase target mediating cytotoxicity of novel kinase inhibitors. Molecular cancer therapeutics 2008; 7:3490-3498.

3. Zock JM: Applications of high content screening in life science research. Combinatorial chemistry & high throughput screening 2009; 12:870-876.

4. Dunlay RT, Czekalski WJ, Collins MA: Overview of informatics for high content screening. Methods in Molecular Biology (Clifton, NJ) 2007; 356:269-280.

5. Goldberg IG, Allan C, Burel JM, Creager D, Falconi A, Hochheiser H, Johnston J, Mellen J, Sorger PK, Swedlow JR: The Open Microscopy Environment (OME) Data Model and XML file: open tools for informatics and quantitative analysis in biological imaging. Genome biology 2005; 6:R47.

6. MIACA Draft Specification. Retrieved from http://cdnetworks-us-2.dl.sourceforge.net/project/miaca/Documentation/MIACA_080404/MIACA_080404.pdf.

7. Palmer M, Kremer A, Terstappen GC: A primer on screening data management. J Biomol Screen 2009; 14:999-1007.

8. Ling XB: High throughput screening informatics. Combinatorial chemistry & high throughput screening 2008; 11:249-257.

9. Carpenter AE, Jones TR, Lamprecht MR, Clarke C, Kang IH, Friman O, Guertin DA, Chang JH, Lindquist RA, Moffat J, Golland P, Sabatini DM: CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome biology 2006; 7:R100.

10. Garfinkel LS: Large-scale data management for high content screening. Methods in Molecular Biology (Clifton, NJ) 2007; 356:281-291.

11. Zhang JH, Chung TD, Oldenburg KR: A Simple Statistical Parameter for Use in Evaluation and Validation of High Throughput Screening Assays. J Biomol Screen 1999; 4:67-73.

12. Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R: Statistical practice in high-throughput screening data analysis. Nature biotechnology 2006; 24:167-175.

13. Giuliano KA, Chen YT, Taylor DL: High-content screening with siRNA optimizes a cell biological approach to drug discovery: defining the role of P53 activation in the cellular response to anticancer drugs. J Biomol Screen 2004; 9:557-568.

14. Perlman ZE, Slack MD, Feng Y, Mitchison TJ, Wu LF, Altschuler SJ: Multidimensional drug profiling by automated microscopy. Science (New York, NY) 2004; 306:1194-1198.

15. Low J, Huang S, Blosser W, Dowless M, Burch J, Neubauer B, Stancato L: High-content imaging characterization of cell cycle therapeutics through in vitro and in vivo subpopulation analysis. Molecular cancer therapeutics 2008; 7:2455-2463.

16. Collins MA: Generating 'omic knowledge': the role of informatics in high content screening. Combinatorial chemistry & high throughput screening 2009; 12:917-925.


Bio-IT World 2010 Best Practices Awards

Nominating Organization name: Cycle Computing
Nominating Organization address: 456 Main Street
Nominating Organization city: Wethersfield
Nominating Organization state: CT
Nominating Organization zip: 06109
Nominating Contact Person: Ashleigh Egan
Nominating Contact Person Title: Account Executive, Articulate Communications
Nominating Contact Person Phone: 212-255-0080 x12
Nominating Contact Person Email: [email protected]

User Organization name: Purdue University
User Organization address: 504 Northwestern Ave.
User Organization city: West Lafayette
User Organization state: IN
User Organization zip: 47907
User Organization Contact Person: John Campbell
User Organization Contact Person Title: Associate Vice President of Information Technology
User Organization Contact Person Phone: 212-255-0080 x12
User Organization Contact Person Email: [email protected]

Project Title: DiaGrid

Team Leaders name:
Team Leaders title:
Team Leaders Company:
Team Leaders Contact Info:
Team Members name:
Team Members title:
Team Members Company:

Entry Category: IT & Informatics

Abstract Summary:

Introduction: The demand for computational power at Purdue for scientific, quantitative and engineering research was rapidly outpacing the budget for new space, power and servers. At the same time, most machines across campuses, enterprises or government agencies are used less than half of the time. The challenge was to harness these unused computational cycles for multiple colleges and departments while building a framework that maintains scalability, manageability and ease of use. Purdue wanted to build a grid of idle campus computers and servers and provide the computational capacity to researchers throughout the nation. By collaborating with several other campuses, including Indiana University, the University of Notre Dame (Ind.), Indiana State University, Purdue's Calumet and North Central campuses, and Indiana University-Purdue University Fort Wayne, Purdue was able to increase the total capacity to more than 177 teraflops, the equivalent of a $3 million supercomputer requiring several thousand square feet of datacenter space.


Results: Purdue selected the free, open-source Condor distributed computing system developed by the University of Wisconsin and the CycleServer compute management tool from Cycle Computing. Computers in the pool run client software that efficiently and securely connects them to front-end servers; jobs are submitted to these servers and parceled out to pool machines when they are idle (a minimal submission sketch follows this entry). In this way, tens of thousands of processors can be brought to bear on problems from various researchers. The work is automatically reshuffled when the owner of a machine needs it. Using Condor's flexible policy features, technical staff can control when and how their machines are used (on idle, evenings only, etc.). Today, with more than 28,000 processors, DiaGrid offers more than two million compute hours per month. The research clusters within the DiaGrid pool average about 1-2 percent idle, providing one of the highest utilization levels.

Purdue was able to:
• Squeeze every bit of performance out of each hardware dollar already spent. Desktop machines continually provide computational cycles during off hours and the research clusters average only 1-2 percent idle.
• Avoid purchasing additional computational capacity by harvesting more than 177 teraflops, for two million compute hours a month, using hardware it already owns. Purchasing equivalent cycles would cost more than $3 million.
• Build installation packages that easily pull information from the CycleServer centralized management tool.
• Achieve something no one has tried before: pooling the variety of hardware represented in DiaGrid, including computers in campus computer labs, offices, server rooms and high-performance research computing clusters running a variety of operating systems.
• Easily manage policy configuration information with CycleServer, using repeated templates for machines across various pools of resources with more than 28,000 processors, and a goal of eventually reaching 120,000 processors across many universities.
• Put owners' policies in place for when machines can run calculations.
• Get status, reporting and management capabilities across pools of resources on many campuses.
• Enable creative uses of computation. For example, DiaGrid is used in creating a virtual pharmacy clean room for training student pharmacists; rendering fly-through animation of a proposed satellite city to serve as a refuge for Istanbul, Turkey, in the event of a catastrophic earthquake; and animating scenes for "Nano Factor," a game designed for junior-high-aged kids interested in science and engineering.

ROI achieved:
Conclusions:
References:
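The sketch below illustrates, in minimal form, how work is parceled out to a Condor pool such as DiaGrid: a submit description file is written and handed to condor_submit. The executable, arguments and file names are placeholders, and site-specific requirements and CycleServer-managed policy are omitted.

```python
# Minimal HTCondor submission sketch: write a submit description file for a
# batch of jobs and submit it with condor_submit. Requires a working HTCondor
# installation and a run_analysis.sh script in the working directory.

import subprocess
from pathlib import Path

SUBMIT_TEMPLATE = """\
universe   = vanilla
executable = run_analysis.sh
arguments  = $(Process)
output     = out/job_$(Process).out
error      = out/job_$(Process).err
log        = out/job.log
queue {njobs}
"""

def submit_jobs(njobs: int, workdir: str = ".") -> None:
    """Write a submit description file and queue njobs jobs on the local pool."""
    work = Path(workdir)
    (work / "out").mkdir(parents=True, exist_ok=True)
    (work / "analysis.sub").write_text(SUBMIT_TEMPLATE.format(njobs=njobs))
    # Relative paths in the submit file resolve against workdir.
    subprocess.run(["condor_submit", "analysis.sub"], check=True, cwd=workdir)

if __name__ == "__main__":
    submit_jobs(njobs=100)
```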


Bio‐IT World 2010 Best Practices Awards

Nominating Organization name: DataDirect Networks, Inc.
Nominating Organization address: 9351 Deering Avenue
Nominating Organization city: Chatsworth
Nominating Organization state: CA
Nominating Organization zip: 91311
Nominating Contact Person: Jeffrey Denworth
Nominating Contact Person Title: VP, Marketing
Nominating Contact Person Phone: 1‐856‐383‐8849
Nominating Contact Person Email: [email protected]

User Organization name: Cornell University Center for Advanced Computing
User Organization address: 512 Frank H. T. Rhodes Hall
User Organization city: Ithaca
User Organization state: NY
User Organization zip: 14853
User Organization Contact Person: David A. Lifka, PhD
User Organization Contact Person Title: Director, Cornell University Center for Advanced Computing
User Organization Contact Person Phone: 607‐254‐8621
User Organization Contact Person Email: [email protected]

Project Title: Scalable Research Storage Archive

Team Leaders name:
Team Leaders title:
Team Leaders Company:
Team Leaders Contact Info:
Team Members name: Dr. Jaroslaw Pillardy
Team Members title: Sr. Researcher at Cornell's Computational Biology Service Unit
Team Members Company: Cornell University

Entry Category: IT & Informatics

Abstract Summary:
Introduction: The Cornell Center for Advanced Computing (CAC) is a leader in high‐performance computing system, application, and data solutions that enable research success. As an early technology adopter and rapid prototyper, CAC helps researchers accelerate scientific discovery. Located on the Ithaca, New York campus of Cornell University, CAC serves faculty and industry researchers from dozens of disciplines, including biology, behavioral and social sciences, computer science, engineering, geosciences, mathematics, physical sciences, and business. The center operates Linux, Windows, and Mac‐based HPC clusters, and the staff provides expertise in HPC systems and storage; application porting, tuning, and optimization; computer programming; database systems; data analysis and workflow management; Web portal design; and visualization.


CAC network connectivity includes the national NSF TeraGrid and New York State Grid. The DataDirect Networks S2A9700 storage system is used as the central storage platform for a number of departments and applications. Initially deployed for backup and archival storage, the S2A9700 is increasingly used by CAC as front‐line storage for applications such as genome sequencing. Since CAC provides services to a wide range of Cornell departments and applications, implementing centralized storage platforms is critical to ensuring an efficient, reliable and cost‐effective infrastructure.

Cornell researchers had been considering buying commodity, off‐the‐shelf storage to store their research data locally. While the cost of such technology appeared low at first, the lack of coordination, data protection and system reliability detracted from the long‐term value of this approach. Because research productivity and access to data are directly correlated, the primary focus of the storage solution had to be high reliability and scalability. It was clear that an affordable, centrally managed, highly available research storage system was needed in order to control costs and to ensure that researchers remained productive. Accommodating a variety of applications and departments would prove a challenge for ordinary storage systems, but the DDN S2A9700 proved capable even beyond the initial scope of the project.

Results: The center selected an S2A9700 storage system from DDN with 40TB of unformatted capacity in RAID‐6 configurations. DDN partnered with Ocarina Networks to provide transparent, content‐aware storage optimization at CAC, reducing the overall capacity need by more than 50 percent. For some Microsoft SQL database applications, a compression rate of up to 82 percent was achieved. DDN storage technology enables massive scalability and capacity optimization through storage consolidation. Compared with other storage technologies in its class, the S2A9700 features industry‐leading throughput (over 2.5GB/s per system), capacity (scalable to hold up to 2.4 petabytes in a single system) and data center efficiency (DDN systems are the densest in the industry, housing up to 600 hard drives in a single data center rack, and also feature Dynamic MAID power management technology). The combination of the S2A9700's scale and its data‐center‐optimized configuration proved to Cornell that installing and adding capacity could be done very cost‐effectively and that the system could scale to meet the Center's evolving storage volume requirements without a forklift upgrade.

"We have been very impressed with the performance DDN's S2A9700 delivers," said David A. Lifka, CAC director. "For genomics research, Cornell uses Solexa Sequencers and the DDN storage system is directly connected to the compute cluster, while at the same time continuing to provide backup and archive storage for our other projects and departments."

Ocarina's ECOsystem platform uses an innovative approach to data reduction. The ECOsystem first extracts files into raw binary data and applies object boundaries to the data. It then applies object dedupe and content‐aware compression to the natural semantic objects found within. The object dedupe approach finds object duplicates in compressed, encoded data that would never be found using standard block dedupe. After processing object duplicates, the ECOsystem then applies content‐specific compression to the remaining unique objects. This dual approach provides better space savings than either block dedupe or generic compression alone would. Ocarina's ECOsystem includes multiple data compressors for the types of files commonly found in research computing environments and includes over 100 algorithms that support 600 file types.

ROI achieved: Compared with the alternative of disparate storage "islands" managed by various independent departments, Cornell experienced a substantial ROI through the consolidation and optimization of a globally accessible storage pool. By deploying scalable, high‐speed DDN S2A storage with intelligent Ocarina data optimization software, Cornell projected a nearly full return on investment within as little as one year. Aggregate capacity requirements were reduced, administration was consolidated and economies of scale were gained. It is expected that the savings associated with a cost‐effective (capacity‐optimized), petabyte‐scalable storage pool, in addition to the FTE savings the University realized, will have fully paid for the new system within 12 months' time.

Conclusions: As multi‐departmental and multi‐application organizations adopt higher‐fidelity research tools and engage in high‐throughput research, storage requirements will balloon across the enterprise. As evidenced at Cornell, a well‐planned storage consolidation, optimization and deployment strategy can not only allow researchers to focus on research, but also aid organizations through substantial cross‐departmental budgetary relief. Scalable storage systems from DataDirect Networks, coupled with intelligent, file‐format‐aware Ocarina Networks storage optimization software, have proven to enable consolidation, savings and simplification with tools optimized for the life sciences researcher.

References:
DDN Case Study: http://www.datadirectnet.com/index.php?id=246
Drug Discovery News Article: http://www.drugdiscoverynews.com/index.php?newsarticle=2787
GenomeWeb Article: http://www.genomeweb.com/informatics/ocarina‐pitches‐content‐aware‐compression‐approach‐storing‐life‐science‐data?page=1
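The object-level dedupe plus content-aware compression idea described above can be illustrated with a short sketch. This is not Ocarina's ECOsystem algorithm or API; it is a deliberately simplified illustration using content hashing and generic zlib compression, with hypothetical file names.

```python
import hashlib
import zlib
from pathlib import Path

# Generic illustration of object-level dedupe + compression, not Ocarina's ECOsystem.
# "Objects" here are simply fixed-size slices of each file; a real content-aware
# optimizer would derive semantic object boundaries from the file format instead.
OBJECT_SIZE = 1 << 20  # 1 MiB slices for this sketch

def optimize(files, store):
    """Store each unique object once, compressed; return (logical, stored) byte counts."""
    logical = stored = 0
    for path in files:
        data = Path(path).read_bytes()
        logical += len(data)
        for i in range(0, len(data), OBJECT_SIZE):
            obj = data[i:i + OBJECT_SIZE]
            digest = hashlib.sha256(obj).hexdigest()
            if digest not in store:              # dedupe: only previously unseen objects are kept
                packed = zlib.compress(obj, 6)   # stand-in for content-specific compression
                store[digest] = packed
                stored += len(packed)
    return logical, stored

if __name__ == "__main__":
    store = {}
    logical, stored = optimize(["run1.fastq", "run1_copy.fastq"], store)  # hypothetical inputs
    print(f"logical {logical} bytes -> stored {stored} bytes")
```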


Bio‐IT World 2010 Best Practices Awards

1. Nominating Organization (Fill this out only if you are nominating a group other than your own.)

A. Nominating Organization
Organization name: FalconStor Software
Address:

B. Nominating Contact Person
Name: Kathryn Ghita
Title: PR
Tel: 617‐236‐0500
Email: [email protected]

2. User Organization (Organization at which the solution was deployed/applied)

A. User Organization
Organization name: Human Neuroimaging Lab (HNL) – Baylor College of Medicine
Address: 1 Baylor Place, Houston, TX 77030

B. User Organization Contact Person
Name: Justin King
Title: Systems Administrator
Tel: 713‐798‐4035
Email: [email protected]

3. Project

Project Title:
Team Leader Name: Justin King
Title: Systems Administrator
Tel: 713‐798‐4035
Email: [email protected]
Team members – name(s), title(s) and company (optional):

4. Category in which entry is being submitted (1 category per entry, highlight your choice)

Basic Research & Biological Research: Disease pathway research, applied and basic research
Drug Discovery & Development: Compound‐focused research, drug safety
Clinical Trials & Research: Trial design, eCTD
Translational Medicine: Feedback loops, predictive technologies
Personalized Medicine: Responders/non‐responders, biomarkers
IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies
Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, resource optimization
Health‐IT: ePrescribing, RHIOs, EMR/PHR
Manufacturing & Bioprocessing: Mass production, continuous manufacturing

(Bio‐IT World reserves the right to re‐categorize submissions based on submission or in the event that a category is refined.)

5. Description of project (4 FIGURES MAXIMUM):

A. ABSTRACT/SUMMARY of the project and results (150 words max.)
The Human Neuroimaging Laboratory (HNL) is part of the Department of Neuroscience at Baylor College of Medicine and concentrates on research projects covering neuroscience, psychology, political science and economics. This groundbreaking research requires a reliable infrastructure to match the speed of discovery. Previously relying on standard tape and disk‐to‐disk backups, the HNL was handcuffed by cumbersome management and disk space constraints. With a small IT staff, the HNL set out to enhance its storage management processes, without disruption, to accomplish the goals of improving reliability, increasing retention and becoming less dependent on tape. Through the use of technologies such as virtual tape libraries (VTL) and data deduplication, the HNL was able to protect its invaluable data and more efficiently keep up with the daily demands of cutting‐edge neuroscience research.

B. INTRODUCTION/background/objectives
As part of one of the top 10 medical and research institutions, the HNL focuses on researching social interaction through hyperscanning, a method by which multiple subjects, each in a separate MRI scanner, can interact with one another while their brains are simultaneously scanned. Scientists use the Internet to control multiple scanners, even if they are located thousands of miles apart in different centers, to scan and monitor brain activity simultaneously while the subjects interact with each other.

Researchers at the HNL run hyperscans at the same time, and a solution was needed to take each of these scans as it was completed and consistently back it up. Experiments are extremely difficult, time‐consuming and expensive to reproduce, so the data storage solution needed to save the data quickly and reliably. Once the scans were completed, three copies of each file would be made for three different types of analysis, creating a glut of similar data on the same system.

The HNL needed a more reliable data storage infrastructure to store these multiple scans during analysis, as well as to ensure that none of the information was lost. Previously, the HNL was using a physical tape backup solution that required swapping out tapes during a backup and put a limit on how long any data could be retained.

In addition, Systems Administrator Justin King was often called upon to fix tape backup issues, as well as to constantly switch out the various tapes. As a result, King lost valuable research time that could have gone to updating and perfecting the hyperscanning software. King was determined to find a simpler solution that could run without his constant attention, grow with the HNL's demands for storage, and provide much more reliable, quick data protection.

C. RESULTS (highlight major R&D/IT tools deployed; innovative uses of technology).
King's goal in finding a new solution was to end the reliance on tape as a data protection solution. Tape was proving to be too faulty and unreliable. Although he could have bought more disks and tapes to continue with the same approach to data protection, King felt that a different solution would scale better with the HNL in the future, as well as increase reliability.

After researching various data backup solutions, King chose a virtual tape library (VTL) solution with deduplication that would easily integrate into the existing VMware environment. The FalconStor VTL with data deduplication allowed King to complete faster, more reliable backups, while the data deduplication feature reduced the amount of data that needed to be stored on disk. In fact, the implementation of the VTL solution was done with little to no change needed to the backup environment. The fact that no extensive architecting or hardware changes were needed to implement the new VTL solution made it an even better fit, as King was able to get it running quickly.

Prior to the VTL solution, all the information was backed up regardless of similar data and files. The deduplication feature greatly increased the number of files that could be retained, with a 15:1 ratio – out of 15 similar files, 1 is processed and stored for backup. The HNL's storage footprint was greatly reduced, so that more data could be stored for longer lengths of time. The additional data storage time allows for quicker and deeper research into the discovery process of the brain.

The FalconStor VTL solution with deduplication greatly reduced the backup issues, freeing King's time to focus on improving the hyperscanning software and other research topics. At any given time there may be multiple people running MRIs or analyzing the scans, so each hyperscan is extremely important to achieving a greater understanding of the brain and how individuals react to one another. The VTL with data deduplication ensures that no information is lost, regardless of the number of people using the data or new scans being added to the system.

D. ROI achieved or expected (200 words max.):
The greatest value of the VTL with data deduplication solution has been the simplification of the HNL's data protection. King has since achieved a six‐fold increase in data retention for the hyperscans, from one month to six months, with the ability to extend this to a full 12 months if needed. The improved retention time allows for more in‐depth analysis, social interaction research and a greater overall understanding of brain functions and processes.

The improved reliability of the virtual solution over physical tape allowed King to fully focus on the research needed for major disorders such as personality disorders and others. His time is no longer spent switching out tapes or fixing problems that resulted in a faulty backup.

The data deduplication ROI is seen in the ratio of data files to those actually processed and saved for backup. The 15:1 ratio means that 150 TB of logical data could be stored on a 10 TB disk. With more information on a smaller disk footprint, data retention rates increased exponentially, allowing the researchers longer access to the data with the aim of learning more about various brain functions, brain disorders and other issues.

E. CONCLUSIONS/implications for the field.
The most compelling aspect of the HNL's story is that there are solutions on the market with which one person, such as King, can run a lab while also being able to conduct important research into the brain. As a successful implementation within a data‐intensive lab, it is a proof point for other labs or research firms looking for a scalable, reliable data protection solution that may be quickly installed with minimal environment change. As the FalconStor data protection solution is an out‐of‐the‐box solution for most environments, King was able to install it and forget about it within a short period of time.

The HNL's research is vital to understanding the brain and how it processes information in a variety of environments. This research may help lead to breakthroughs in a number of areas, including conditions such as Parkinson's, schizophrenia and autism, as well as other disorders. With a secure data protection solution in place, the HNL could focus on what it does best – conducting groundbreaking research into analyzing the brain and creating better measurement and research solutions.

6. REFERENCES/testimonials/supporting internal documents (If necessary; 5 pages max.)
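As a back-of-the-envelope check on the ROI figures reported above (a 15:1 deduplication ratio, roughly 150 TB of logical backup data held on about 10 TB of disk, and retention extended from one month toward twelve), the arithmetic can be sketched as follows. The numbers come from the entry; everything else is illustrative.

```python
# Back-of-the-envelope arithmetic for the ROI figures quoted above; illustrative only.
DEDUPE_RATIO = 15          # 15:1 reported deduplication ratio
PHYSICAL_TB = 10           # usable disk behind the VTL (TB), per the entry
BASE_RETENTION_MONTHS = 1  # retention before the VTL/dedupe deployment

logical_tb = DEDUPE_RATIO * PHYSICAL_TB      # ~150 TB of logical backup data
retention_headroom = DEDUPE_RATIO            # same disk can hold ~15x more backup history
max_retention_months = BASE_RETENTION_MONTHS * retention_headroom

print(f"Logical data protected: ~{logical_tb} TB on {PHYSICAL_TB} TB of disk")
print(f"Retention headroom: up to ~{max_retention_months} months vs {BASE_RETENTION_MONTHS} before")
# The entry reports retention actually extended from 1 to 6 months, with room to reach 12;
# real-world gains are lower than the raw ratio because change rates and schedules vary.
```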


Bio‐IT World 2010 Best Practices Awards

Nominating Organization name: Isilon Systems
Nominating Organization address: 3101 Western Ave
Nominating Organization city: Seattle
Nominating Organization state: WA
Nominating Organization zip: 98121
Nominating Contact Person: Lucas Welch
Nominating Contact Person Title: PR Manager
Nominating Contact Person Phone: 206‐315‐7621
Nominating Contact Person Email: [email protected]

User Organization name: Oklahoma Medical Research Foundation
User Organization address: 825 NE 13th Street
User Organization city: Oklahoma City
User Organization state: OK
User Organization zip: 73104
User Organization Contact Person: Stuart Glenn
User Organization Contact Person Title: Software Engineer
User Organization Contact Person Phone: 405‐271‐7933 x35287
User Organization Contact Person Email: stuart‐[email protected]

Project Title: Transition to Nextgen Sequencing and Virtual Data Center

Team Leaders name: Stuart Glenn
Team Leaders title: Software Engineer
Team Leaders Company: OMRF
Team Leaders Contact Info: 405‐271‐7933 x35287, stuart‐[email protected]
Team Members name:
Team Members title:
Team Members Company:

Entry Category: IT & Informatics

Abstract Summary:
Introduction: Oklahoma Medical Research Foundation (OMRF), a leading nonprofit biomedical research institute, experienced an unprecedented influx of mission‐critical genetic information with the introduction of a high‐powered, next‐generation Illumina Genome Analyzer and server virtualization. To maximize both its infrastructure investment and the value of its genetic data, OMRF needed a storage solution capable of keeping pace with its tremendous data growth while still powering its virtual data center, without the burden of costly upgrades and tedious data migrations.

In its efforts to identify more effective treatments for human disease, OMRF generates tremendous amounts of mission‐critical genomic information. This data is then processed and analyzed using Linux servers running the VMware ESX virtualization software. With its previous NAS system, OMRF would have been forced to migrate genetic information back and forth between disparate data silos, slowing sequencing runs and depriving its virtual servers of the data access and high throughput necessary to realize the full potential of virtualized computing.


Results: Using scale‐out NAS from Isilon Systems, OMRF has unified both its DNA sequencing pipeline and its virtualized computing infrastructure into a single, high‐performance, highly scalable, shared pool of storage, simplifying its IT environment and significantly speeding time‐to‐results. OMRF can now scale its storage system on demand to meet the rapid data growth and unique performance demands of its mission‐critical workflow, increasing operational efficiency and decreasing costs in an effort to identify genetic precursors to diseases such as Alzheimer's, Lupus and Sjögren's Syndrome.

With its scale‐out NAS solution, OMRF has created a single, highly reliable central storage resource for both its entire next‐generation sequencing workflow and its virtual computing infrastructure, dramatically simplifying storage management and streamlining data access across the organization. Today, OMRF can cost‐effectively manage rapid data growth from a single file system, eliminating the data fragmentation caused by traditional NAS in virtual environments and maximizing the performance of both its virtual servers and its DNA sequencing workflow. By deploying a second Isilon system off‐site and using Isilon's SyncIQ® asynchronous data replication software to replicate data between its primary and off‐site clusters, OMRF also has a highly reliable solution in place to ensure its data is immediately available even in the case of IT failure or natural disaster.

ROI achieved:
Conclusions:
References:
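One practical consequence of the single shared file system described above is that the sequencing pipeline and the virtual servers can reference the same storage instead of copying data between silos. The sketch below is a generic illustration of checking that two working directories actually reside on the same mounted filesystem; the paths are hypothetical and this is not Isilon- or OMRF-specific code.

```python
import os

# Hypothetical mount points for illustration; not OMRF's actual layout.
SEQUENCER_OUTPUT = "/cluster_fs/sequencing/runs"
VM_DATASTORE = "/cluster_fs/vmware/datastore1"

def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if both paths live on the same mounted filesystem (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

if __name__ == "__main__":
    if same_filesystem(SEQUENCER_OUTPUT, VM_DATASTORE):
        print("Sequencing output and VM datastore share one storage pool; no copy step needed.")
    else:
        print("Paths are on different filesystems; data would have to be migrated between silos.")
```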


Bio‐IT World 2010 Best Practices Awards  

1. Nominating Organization (Fill this out only if you are nominating a group other than your own.)  

A. Nominating Organization Organization name: Address: 

 B.  Nominating Contact Person Name: Title: Tel: Email: 

 2.  User Organization (Organization at which the solution was deployed/applied) 

 A. User Organization Organization name: National Institute of Allergy and Infectious Diseases (NIAID) Address: 10401 Fernwood Rd., Bethesda, MD 20892 

 B. User Organization Contact Person Name: Nick Weber Title: Scientific Informatics & Infrastructure Analyst Tel: 301.594.0718 Email: [email protected] 

 3. Project   

Project Title: A Centralized and Scalable Infrastructure Approach to Support Next Generation Sequencing at the National Institute of Allergy and Infectious Diseases
Team Leader Name: Nick Weber (Lockheed Martin Contractor)
Title: Scientific Informatics & Infrastructure Analyst
Tel: 301.594.0718
Email: [email protected]
Team members – name(s), title(s) and company (optional):
• Vivek Gopalan – Scientific Infrastructure Lead (Lockheed Martin Contractor)
• Mariam Quiñones – Computational Molecular Biology Specialist (Lockheed Martin Contractor)
• Hugo Hernandez – Senior Systems Administrator (Dell Perot Systems Contractor)
• Robert Reed – Systems Administrator (Dell Perot Systems Contractor)
• Kim Kassing – Branch Chief, Operations and Engineering Branch (NIAID Employee)


• Yentram Huyen – Branch Chief, Bioinformatics and Computational Biosciences Branch (NIAID Employee) 

• Michael Tartakovsky – NIAID Chief Information Officer and Director of Office of Cyber Infrastructure and Computational Biology (NIAID Employee) 

  4. Category in which entry is being submitted (1 category per entry, highlight your choice)  

Basic Research & Biological Research: Disease pathway research, applied and basic research  Drug Discovery & Development: Compound‐focused research, drug safety   Clinical Trials & Research: Trial design, eCTD    Translational Medicine: Feedback loops, predictive technologies  Personalized Medicine: Responders/non‐responders, biomarkers  IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies  Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, resource optimization 

Health‐IT: ePrescribing, RHIOs, EMR/PHR  Manufacturing & Bioprocessing: Mass production, continuous manufacturing 

 (Bio‐IT World reserves the right to re‐categorize submissions based on submission or in the event that a category is refined.)  5. Description of project (4 FIGURES MAXIMUM):   

A. ABSTRACT/SUMMARY of the project and results (150 words max.)  Recent advances in the “next generation” of sequencing technologies have enabled high‐throughput sequencing to expand beyond large specialized facilities and into individual research labs. Improved chemistries, more powerful software, and parallel sequencing capabilities have led to the creation of many terabytes of data per instrument per year that will serve as the basis for diverse genomic research. In order to manage the massive amounts of data, many researchers will require assistance from IT experts and bioinformaticians to store, transfer, process, and analyze all the data generated in their labs. The Office of Cyber Infrastructure and Computational Biology (OCICB) at the National Institute of Allergy and Infectious Diseases (NIAID) has developed a centralized and scalable infrastructure to support Next Generation Sequencing efforts across the Institute. Primary goals of this approach are to standardize practices for data management and storage and to capitalize on the efficiencies and cost savings of a shared high‐performance computing infrastructure.  

B. INTRODUCTION/background/objectives 

The Office of Cyber Infrastructure and Computational Biology (OCICB) manages technologies supporting NIAID biomedical research programs. The Office provides a spectrum of management, technologies development, applications/software engineering, bioinformatics support, and professional development. Additionally, OCICB works closely with NIAID intramural, extramural, and administrative staff to provide technical support, liaison, coordination, and consultation on a wide variety of ventures. These projects and initiatives are aimed at ensuring ever‐increasing interchange and dissemination of scientific information within the Federal 


Government and among the worldwide scientific network of biomedical researchers. Both the Operations and Engineering Branch (OEB) and the Bioinformatics and Computational Biosciences Branch (BCBB) are branches of the OCICB.   The OEB provides technical and tactical cyber technologies management and support for NIAID extramural biomedical research programs. OEB delivers essential and assured services to facilitate communication using electronic systems and a collegial, authorized, and accessible framework for automated information sharing and collaboration. The BCBB provides three suites of scientific services and resources for the NIAID research community and its collaborators: Biocomputing Research Consulting, Bioinformatics Software Development, and Scientific Computing Infrastructure.  

The primary objectives of the ‘Centralized and Scalable NIAID Infrastructure’ project include the following:

• To assist NIAID laboratories in assessing their infrastructure needs for data storage and analysis of massively‐parallel sequencing.
• To procure, operate, and maintain computing hardware that supports the data storage and processing needs for Next Generation Sequencing across the Institute.
• To procure, build, and assist in the use of third‐party applications to be hosted on the NIAID Linux High Performance Computing Cluster.
• To provide a robust, reliable, cost‐effective, and scalable cyber infrastructure that will serve as the foundation to support Next Generation Sequencing at the NIAID.

A secondary objective of this project is to develop a standardized process for handling infrastructure requests for similar high‐performance computing endeavors that will require access to large amounts of data storage and processing.

Project responsibilities of the OCICB Operations and Engineering Branch include:

• Designing and provisioning appropriate resources to meet the scientific and business goals of the Institute
• Consulting regularly with clients to assess performance and modify the core facility to maintain appropriate performance
• Selecting and managing the operating system, grid engine, and parallelizing software for computing resources
• Selecting, developing, maintaining, and managing computing resources pursuant to effective processing of associated data
• Selecting, developing, maintaining, and managing the enterprise storage components
• Selecting, developing, maintaining, and managing effective networking components
• Managing the security of the data, operating systems, appliances, and applications
• Provisioning user accounts necessary for user applications
• Collaborating with the Bioinformatics and Computational Biosciences Branch to ensure appropriate resources are provisioned that enable effective use of the facility

The Bioinformatics and Computational Biosciences Branch’s responsibilities include:

• Facilitating coordination and communications among OCICB groups and NIAID laboratories


• Maintaining a shared intranet portal for collaboration and document sharing between the OCICB and NIAID laboratories 

• Documenting minimum requirements for software applications that will be hosted on the NIAID Linux High Performance Computing Cluster (in order to aid OEB in the determination of hardware specifications for the cluster) 

• Working with the NIAID laboratories to analyze and document workflows/pipelines for downstream data analysis 

• Installing, maintaining, upgrading, and supporting software applications on the NIAID Linux High Performance Computing Cluster 

• Providing user‐friendly, web‐based interfaces to software applications hosted on the NIAID Linux High Performance Computing Cluster 

• Evaluating and selecting a Laboratory Information Management System (LIMS) to assist with end‐to‐end processing and analysis of Next Generation Sequencing data 

 C. RESULTS (highlight major R&D/IT tools deployed; innovative uses of technology). 

The OCICB’s Operations and Engineering Branch (OEB) has made several significant investments to support Next Generation Sequencing research, including improvements in the NIAID network, in data storage and processing hardware, and in the personnel required to build and maintain this infrastructure. Specific upgrades include the following:

• Expansion of network bandwidth from 1 to 10 gigabits per second to support increased network traffic between NIAID research labs and the NIAID Data Center
• Construction of a high‐speed and highly‐dense enterprise storage system, originally built at 300‐terabyte capacity but rapidly scalable to up to 1.2 petabytes
• Creation of a high‐performance Linux computing cluster hosting many third‐party applications that enables efficient data processing on a scalable and high‐memory pool of resources
• Deployment of a localized mirror of the UCSC Genome Browser for rapid data visualization and sharing

In addition to these upgrades, the OCICB’s Bioinformatics and Computational Biosciences Branch (BCBB) will provide bioinformatics collaboration and support to researchers. Specific resources that will be provided include the following:

• End‐to‐end laboratory information management system (LIMS) to support sample preparation and tracking; task assignment; interaction with the instrument; downstream analysis and custom pipelines between applications; data sharing; and data publication/visualization
• Training on the use of bioinformatics applications and development of custom workflows and application pipelines to streamline data analysis
• Collaboration on the data integration, analysis, and annotation/publication processes

Some policy decisions for using the centralized infrastructure have yet to be made, including formalizing procedures for long‐term data retention as well as balancing data privacy/security requirements while concurrently facilitating data sharing and publication. Nevertheless, NIAID’s centralized approach highlights the need for a cooperative partnership between bench researchers, computational scientists, and IT professionals in order to advance modern scientific exploration and discovery.
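To make the shared-cluster model concrete, here is a minimal sketch of how a lab's alignment job might be wrapped and submitted to a shared Linux cluster scheduler. It assumes a Grid Engine-style scheduler (the OEB responsibilities above mention selecting a grid engine) and uses a hypothetical aligner command and file paths; it is not NIAID's actual submission tooling.

```python
import subprocess
import textwrap
from pathlib import Path

def submit_alignment(sample: str, reads: str, reference: str, cores: int = 8) -> None:
    """Write a Grid Engine-style batch script for one sample and submit it with qsub.

    Generic illustration only: the aligner command, parallel environment name,
    and paths are hypothetical, not NIAID's actual pipeline.
    """
    Path("logs").mkdir(exist_ok=True)
    script = Path(f"{sample}.align.sh")
    script.write_text(textwrap.dedent(f"""\
        #!/bin/bash
        #$ -N align_{sample}
        #$ -cwd
        #$ -pe smp {cores}
        #$ -o logs/{sample}.out
        #$ -e logs/{sample}.err
        # Hypothetical aligner invocation; real pipelines differ per instrument and project.
        aligner --threads {cores} --ref {reference} --reads {reads} --out {sample}.bam
        """))
    subprocess.run(["qsub", str(script)], check=True)

if __name__ == "__main__":
    submit_alignment("sample01", "/data/ngs/sample01.fastq", "/data/refs/reference.fa")
```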

 


D. ROI achieved or expected (200 words max.):  Expected returns on this investment are many and include the tangible and intangible benefits and cost avoidance measures listed below:  Tangible Benefits: 

• Cost savings through reduction of people‐hours for IT development, application deployment, system maintenance, and customer support for centralized implementation (versus distributed implementations to support labs separately) 

 Intangible Benefits: 

• Improved security/reduced risk by managing a single, centralized pool of infrastructure resources (includes enterprise‐level security, storage, and back‐up; dedicated virtual LAN; failover/load‐sharing file services cluster and scheduler; and a single, formal disaster recovery and continuity of operations plan) 

• Increased awareness of bioinformatics resources available to labs at NIAID and other NIH Institutes
• Elevated access to a single, integrated team of subject matter experts including system administrators, infrastructure analysts, bioinformatics developers, and sequence analysis experts
• Enhanced collaboration with research organizations external to NIAID that will take advantage of the high‐performance computing environment
• Improved research productivity to work toward combating/eradicating critical diseases

Cost Avoidance:

• Efficient use of centralized storage and computing resources used at higher capacity
• Leveraged energy efficiency of data center power and cooling systems
• Estimated 5‐fold savings in software licensing fees for shared deployment on the cluster
• Limited consolidation and migration costs for systems/data in a centralized implementation

  

E. CONCLUSIONS/implications for the field.  

Genomic research is a rapidly growing field with broad implications at the NIAID and in the global research community in general. Rather than having laboratory staff attempt to develop the requisite storage, network, and computing capacity themselves, NIAID’s Chief Information Officer has made a significant investment to centralize infrastructure resources in order to maximize efficiency and minimize cost and risk. Major network and storage upgrades, in addition to the construction of a powerful and scalable Linux computing cluster, are the most visible parts of this investment. However, additional personnel – including an experienced Linux Systems Administrator and bioinformatics support staff – have also been acquired. By utilizing the centralized infrastructure and resources, researchers doing important and influential work in immunology, vaccinology, and many other research areas that are immensely beneficial to the public will be better able to conduct their research.  Large datasets and powerful multi‐core computers are not unique to Next Generation Sequencing. Other research areas of interest at the NIAID will also benefit from the new high‐performance computing resources. The NIAID has been able to reuse many of its successful development, procurement, and 


communications processes of this project to continue to foster cooperation between bench researchers, bioinformaticians, and IT professionals. Sharing this experience as a best practice – including highlighting the hurdles and setbacks in addition to the progress – can provide a strong starting point for other organizations that plan to increase their Next Generation Sequencing and high‐performance computing capabilities.  

   

1. REFERENCES/testimonials/supporting internal documents (If necessary; 5 pages max.)  


Bio‐IT World 2010 Best Practices Awards  

1. Nominating Organization (Fill this out only if you are nominating a group other than your own.)  

A. Nominating Organization Organization name: Panasas Address: 6520 Kaiser Drive, Fremont, CA 94555  

 B.  Nominating Contact Person Name: Angela Griffo, Trainer Communications (agency contact) Title:  Director Tel: 949‐240‐1749 Email: [email protected] 

 2.  User Organization (Organization at which the solution was deployed/applied) 

 A. User Organization Organization name: Uppsala University Address: P.O. Box 256 SE‐751 05 Uppsala, Sweden 

 B. User Organization Contact Person Name: Ingela Nystrom PhD Title: Director of Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) Tel: +46 70 1679045 Email: [email protected] 

 3. Project   

Project Title: UPPNEX Team Leader  Name: Ingela Nystrom Title: Director Tel: +46 70 1679045 Email: [email protected] Team members – name(s), title(s) and company (optional): Professor Kerstin Lindblad‐Toh, Broad Institute/Uppsala University PhD Jukka Komminaho, Systems expert manager of UPPMAX, Uppsala University Jonas Hagberg, Systems expert of UPPMAX, Uppsala University  

  4. Category in which entry is being submitted (1 category per entry, highlight your choice)  

Basic Research & Biological Research: Disease pathway research, applied and basic research 


Drug Discovery & Development: Compound‐focused research, drug safety   Clinical Trials & Research: Trial design, eCTD    Translational Medicine: Feedback loops, predictive technologies  Personalized Medicine: Responders/non‐responders, biomarkers  IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies  Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, resource optimization 

Health‐IT: ePrescribing, RHIOs, EMR/PHR  Manufacturing & Bioprocessing: Mass production, continuous manufacturing 

 (Bio‐IT World reserves the right to re‐categorize submissions based on submission or in the event that a category is refined.)  5. Description of project (4 FIGURES MAXIMUM):   

A. ABSTRACT/SUMMARY of the project and results (150 words max.)
Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) is Uppsala University's resource for high‐performance computing. In recent years, Swedish researchers have become overwhelmed with data from next‐generation sequencing machines. UPPMAX's challenge was to provide the researchers with a centralized compute and storage facility capable of handling multiple terabytes of new bioinformatics data per week.

In 2008 the Knut and Alice Wallenberg Foundation granted research funding for a national IT facility dedicated to the compute and storage of genomic data. UPPMAX therefore had a new project, ‘UPPmax NEXt generation sequence Cluster & Storage’ (UPPNEX).

Today, a centralized resource for the compute and storage of next‐generation sequencing data is in place, resulting in faster conclusions for scientific research. Since the introduction of UPPNEX, project times have decreased by several months. Groundbreaking research using UPPNEX resources has already yielded improvements in agricultural processes and in the understanding of human growth and obesity.

  

B. INTRODUCTION/background/objectives The UPPMAX facility was founded in 2003 at Uppsala University. UPPMAX is part of the Swedish National Infrastructure for Computing (SNIC). Since its establishment, UPPMAX has provided researchers (both locally and nationally) with access to a number of high‐performance computing (HPC) systems. 


 UPPMAX’s users traditionally come from research areas such as physics, chemistry, and computer science. Lately, however, the number of Life Sciences users has increased dramatically. This is mainly due to the technical advances, affordability and increased deployment of next‐generation sequencing (NGS) machines. 

In 2008 it had become apparent to Swedish researchers that the tsunami of data from NGS systems created a problem that individual research grants could not solve. In many cases, Life Sciences research teams were trying to manage the problem themselves. However, due to the sheer volume of data, they wasted a lot of time copying data between systems, waiting for others to complete their computing before they could start their own, and often writing custom code to manage jobs that would typically max out system resources. In short, the teams often spent as much time solving computing challenges as they did on scientific research.

It was for these reasons that, in 2008, a national consortium of life sciences researchers was formed to address the challenges presented by this massive increase in bioinformatics data. These researchers would normally compete for resources and research funding. However, it had become apparent that a centralized facility was required. The computation and data storage requirements of NGS data created a workload that, at peak processing times and for long‐term data archiving, had to be handled by a larger facility.

The consortium therefore submitted an application to SNIC and the Knut and Alice Wallenberg Foundation to fund a centralized life sciences compute and storage facility to be hosted at UPPMAX. The consortium's united conviction was that a sufficient compute and storage facility would ultimately strengthen its efforts to combat disease. The application was successful, with the Knut and Alice Wallenberg Foundation noting that the consortium's collaborative effort was a major advantage. And so the “UPPmax NEXt generation sequence Cluster & Storage (UPPNEX)” project was formed.

Today, a 150‐node (1,200‐core) compute cluster from HP with InfiniBand as the interconnect is in production with half a petabyte (500 TB) of Panasas parallel storage. The solution passed a one‐month acceptance period at the first attempt and entered production in October 2009.

The objectives of the UPPNEX solution were to provide Life Sciences researchers throughout Sweden with:

1. Sufficient high‐performance computing resources to cover their regular and peak project requirements 

 


The key challenge was to provide a compute system with enough performance and resources to handle the massively parallel software algorithms required to process the genomic data. Furthermore, to provide a sufficient high‐performance storage solution that could handle the large number of clients with concurrent I/O requests. 

  2. Longer‐term data storage facilities to provide a centralized, national data repository 

 With multiple terabytes of new data being received by UPPNEX on a weekly basis, the storage solution had to scale capacity, without incremental complexity and management costs.  To protect the data, the storage solution had to be highly‐available (with failover and redundancy features built in). Additionally, the storage had to be compatible with UPPMAX’s existing back‐up infrastructure. 

  

C. RESULTS (highlight major R&D/IT tools deployed; innovative uses of technology).

In order to address the challenges of the massive ingest of bioinformatics data, UPPNEX leverages a parallel storage solution from Panasas. Panasas was born out of a 1990s US DOE research project into petascale computing and the file‐system technologies required to process and manage massive amounts of data. Since Panasas was formed in 1999, the company has developed its modular storage hardware platform in unison with its parallel file system, PanFS. With strong initial success in traditional HPC markets, Panasas has complemented its performance with enterprise‐class features and easy management. The past few years have seen Panasas at an inflection point, with the company's solutions gaining swift traction in data‐intensive workflows such as seismic processing, computational fluid dynamics and life sciences (in particular around next‐generation sequencing and medical imaging).

UPPNEX chose Panasas parallel storage because it provided the performance required by the HPC system when processing massively parallel life sciences applications; additionally, Panasas provided a lower‐cost (yet highly reliable) storage pool for the longer‐term storage requirement. The unique aspect of the Panasas solution is that both of these storage pools sit under the same management layer. It is therefore easy to manage both pools, which significantly reduces UPPNEX's administration overhead compared with a traditional NFS‐based solution.

It is anticipated that the long‐term storage pool for UPPNEX will grow by 250 terabytes in 2010. However, unlike alternative NAS solutions, the management complexity of the Panasas solution will not grow as the storage capacity grows. The Panasas solution scales to tens of petabytes in a single management layer, and additional capacity is added with zero loss in productivity.
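A simple way to reason about the storage-scaling claim above is to project capacity from the figures given in this entry (roughly 500 TB in production in late 2009 and about 250 TB of expected growth in 2010). The sketch below is illustrative arithmetic only; the flat 250 TB/year growth assumed beyond 2010 is an assumption, not a figure from the entry.

```python
# Illustrative capacity projection from the figures quoted in this entry.
INITIAL_TB = 500          # ~half a petabyte in production (October 2009)
GROWTH_TB_PER_YEAR = 250  # anticipated growth for 2010; assumed flat thereafter (assumption)

def project_capacity(years):
    """Return (year, cumulative TB) pairs starting from 2009."""
    return [(2009 + y, INITIAL_TB + GROWTH_TB_PER_YEAR * y) for y in range(years + 1)]

if __name__ == "__main__":
    for year, terabytes in project_capacity(4):
        print(f"{year}: ~{terabytes} TB ({terabytes / 1000:.2f} PB)")
```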

D. ROI achieved or expected (200 words max.): 


Technology ROI: Individual research groups no longer have to over‐specify IT solutions to meet peak requirements. By moving toward centralized solutions, there are substantial gains thanks to the coordination of staff, computer halls, and so on.

Research ROI: An example research project that leveraged UPPNEX reduced its time‐to‐completion by several months. The project focused on gaining a deeper understanding of the relationship between genetic variation and phenotypic variation. Through whole‐genome resequencing, the researchers distinguished key genes causing the differences between wild and domestic chickens. They identified candidate mutations that cause particular effects on the phenotype. This is an efficient strategy to increase our understanding of how different genes control different traits.

One gene, associated with the fast growth of broiler chickens, is also associated with obesity in humans. The study established a new animal model that can be used to explore the mechanics of how this gene influences human growth and obesity.

Lastly, the domestic chicken is the most important global source of animal protein. The research has established the possibility of developing domestic chickens that are extremely efficient producers of animal protein, namely eggs and meat.

E. CONCLUSIONS/implications for the field.

The recent technological advancements, affordability and wide deployment of NGS machines are feeding a tsunami of digital data. The information technology infrastructure required to compute and store such vast amounts of data is beyond the funding of individual research groups. Centralized HPC and data‐storage facilities are being deployed at regional, national and global levels to provide researchers with access to the IT infrastructure they require.

The challenge for the centralized facilities is to provide sufficient compute and data‐storage resources to fuel multiple research projects simultaneously. With ever‐increasing amounts of digital data being ingested, they must process, manage and store the data both reliably and efficiently.

Traditional storage technologies cannot keep pace. Their limitations on capacity encourage data silos, multiple copies of data, system administration headaches and an escalating management overhead. Clustered storage technologies struggle to address diverse performance requirements within the life sciences workflow, again encouraging data silos and disparate storage management layers.


Panasas parallel storage caters for the diverse performance, reliability and cost requirements across the life sciences workflow.  Scaling to tens of petabytes under a single management layer, Panasas users can scale storage with zero loss in productivity.    The industry is at an inflection point that goes beyond the capabilities of traditional storage technologies. Centralized facilities such as UPPNEX are blazing a trail and deploying innovative technologies to enhance national scientific discovery that ultimately benefits the global community. 

  

1. REFERENCES/testimonials/supporting internal documents (If necessary; 5 pages max.)  

UPPMAX: www.uppmax.uu.se
UPPNEX: www.uppnex.uu.se (available soon)
Uppsala University's press release on the grant approval: http://www.uu.se/news/news_item.php?id=534&typ=pm
SNIC announcement of the grant (in Swedish): http://www.snic.vr.se/news-events/news/kaw-och-snic-30-miljoner-kronor-till-storskaliga

A link to the chicken genome paper will be added once the manuscript is published; Nature does not permit links to manuscripts to be circulated prior to release. For access before then, please contact the project PI, Professor Leif Andersson ([email protected]).


1. User Organization (Organization at which the solution was deployed/applied)  

A. User Organization Organization name: Translational Genomics Research Institute 

Address: 445 N 5th Street  Phoenix AZ 85004 

 

B. User Organization Contact Person 

Name: James Lowey 

Title: Director HPBC 

Tel: 480‐343‐8455 

Email: [email protected] 

 

3. Project   

Project Title: NextGen Data Processing Pipeline 

Team Leader James Lowey 

Name: James Lowey 

Title: Director HPBC 

Tel: 602‐343‐8455 

Email: [email protected] 

Team members – name(s), title(s) and company (optional): Carl Westphal – IT Director, Dr. Waibhav Tembe – Sr. Scientific Programmer, Dr. David Craig – Associate Director of the Neurogenomics Division, Dr. Ed Suh – CIO 

  

4. Category in which entry is being submitted (1 category per entry, highlight your choice)  

Basic Research & Biological Research: Disease pathway research, applied and basic research  Drug Discovery & Development: Compound‐focused research, drug safety   Clinical Trials & Research: Trial design, eCTD    Translational Medicine: Feedback loops, predictive technologies  Personalized Medicine: Responders/non‐responders, biomarkers 

x     IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies 


Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, resource optimization 

Health-IT: ePrescribing, RHIOs, EMR/PHR
Manufacturing & Bioprocessing: Mass production, continuous manufacturing

 

Abstract

Evolving NextGen sequencing requires high-throughput, scalable Bio-IT infrastructure. Organizations committed to using this technology must remain nimble and design workflows and IT infrastructures capable of adapting to the dramatic increase in demands driven by changes in NextGen sequencing technology. TGen, as an early adopter of multiple NextGen sequencing platforms, has experienced this evolution first hand and has implemented infrastructure and best practices that enable our scientists to leverage the technology effectively. This paper provides an overview of the challenges presented by NextGen sequencing and the associated impact on informatics workflow and IT infrastructure, and discusses what TGen has done to address these challenges.

[Introduction]

Beginning of the data deluge in 2009

In late 2008, NextGen sequencing at TGen was just beginning. One Illumina SOLEXA and a single ABI/LifeTech SOLiD sequencer were the initial NextGen platforms brought into TGen. At that point, just one whole-genome alignment with SOLiD had been successfully completed using ABI/LifeTech's Corona software pipeline on TGen's large parallel cluster supercomputer, and the SOLEXA Genome Analyzer pipeline was running on a smaller internal cluster. A team of bioinformaticians was still climbing the steep learning curve of becoming familiar with the technology, file types, analytical challenges, and data mining opportunities. In January 2009, TGen investigators began work on a SOLiD NextGen sequencing processing and analysis project. Before March 2009, this project needed to demonstrate the capability to align 4x SOLiD pilot data from about 110 samples against the whole genome, carry out the required annotation, and disseminate the results to collaborating centers. The sheer volume and computational resource requirements for processing this data within 90 days presented a formidable challenge. Turning this challenge into an opportunity, the TGen IT team, working in conjunction with bioinformaticians, designed and implemented a customized version of the Corona pipeline configured to make maximal use of the computational horsepower of TGen's High Performance Computing (HPC) cluster [1]. The NextGen data processing pipeline depicted in Figure 1 distributed the computational task of data alignment over multiple cores, while carrying out both single-fragment and mate-pair analysis and annotation using several custom scripts. This set-up proved sufficient to carry out the project successfully. However, it quickly became clear that a radically different IT infrastructure would be required to make NextGen data analysis a standardized service for scientists.
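To illustrate the "split and scatter" idea behind such a customized pipeline, the following minimal Python sketch divides a large read file into chunks and emits one batch script per chunk so that alignments can run concurrently on many cores. The file names, chunk size and the placeholder align_reads command are assumptions for illustration only; they are not TGen's actual Corona configuration.

import os

CHUNK_READS = 1_000_000  # reads per chunk; a tuning assumption, not TGen's setting


def split_reads(path, out_dir, chunk_reads=CHUNK_READS):
    """Split a FASTA-style reads file on record boundaries; return the chunk paths."""
    os.makedirs(out_dir, exist_ok=True)
    chunks, buf, count = [], [], 0

    def flush():
        chunk = os.path.join(out_dir, f"chunk_{len(chunks):04d}.fa")
        with open(chunk, "w") as out:
            out.writelines(buf)
        chunks.append(chunk)

    with open(path) as fh:
        for line in fh:
            # start a new chunk when the next record would exceed the limit
            if line.startswith(">") and count == chunk_reads:
                flush()
                buf, count = [], 0
            buf.append(line)
            if line.startswith(">"):
                count += 1
    if buf:
        flush()
    return chunks


def write_job_script(chunk, job_dir="jobs"):
    """Emit a batch script for one chunk; the aligner command is a placeholder."""
    os.makedirs(job_dir, exist_ok=True)
    script = os.path.join(job_dir, os.path.basename(chunk) + ".sh")
    with open(script, "w") as out:
        out.write("#!/bin/sh\n")
        out.write(f"align_reads --in {chunk} --ref hg18.fa --out {chunk}.matches\n")
    return script


if __name__ == "__main__":
    for c in split_reads("sample_reads.fa", "chunks"):
        print("job script:", write_job_script(c))

Each generated script would then be submitted to the cluster scheduler, which is where the prioritization and throttling issues discussed below come into play.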


Figure 1 Data processing Pipeline (March 2009)

Challenges Faced

TGen's NextGen sequencing demand is growing at an unparalleled pace, requiring large-scale storage infrastructure, high-performance computing and high-throughput network connectivity. This demand places considerable strain upon conventional analysis tools and scientific data processing infrastructure.

The volume of data generated by NextGen sequencing depends on the specific technology, instrument version, sample preparation, experimental design, and sequencing chemistry. Each experimental run typically generates between 25 and 250 GB of data consisting of sequenced bases and quality scores. Each such dataset must be moved from the sequencer to longer-term storage, and also be made available to computational resources for alignment and other tasks, such as variant detection and annotation. Results need to be written back to long-term storage and optionally made available to external collaborators over the Internet.
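To put these volumes in perspective, a rough back-of-the-envelope calculation shows why moving a single run's output is itself a scheduling concern; the figures assume an ideal link with no protocol overhead, so real transfers (especially over a WAN, as described later) are slower.

def transfer_hours(gigabytes, link_gbps):
    """Idealized transfer time in hours: no protocol overhead, decimal units."""
    return gigabytes * 8 / link_gbps / 3600.0

for size_gb in (25, 250):
    for gbps in (1, 10):
        print(f"{size_gb:>4} GB over {gbps:>2} Gb/s link: "
              f"{transfer_hours(size_gb, gbps):.2f} hours")

Even in the ideal case, a 250 GB run occupies a 1 Gb/s link for more than half an hour, and several sequencers sharing that link compound the problem.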

Some of the specific challenges TGen had to overcome in the early days of NextGen sequencing are as follows:

1. Fair allocation of resources: The analysis of one sample from the NextGen sequencers takes 3-4 days on the HPC cluster. Processing and analyzing 110 samples in 90 days required running multiple jobs across hundreds of processing cores. However, the HPC cluster was a shared system used by hundreds of users, so it was necessary to ensure that jobs were properly prioritized in order to meet the requirements of the project.

2. System optimization: Multiple instances of the software used in the sequence data analysis pipeline pushed the limits of the I/O capabilities of the underlying Lustre [2] file system on the HPC cluster. Manual intervention by the system administrators was required to build custom job-processing queues so that the system could reallocate its resources and the HPC cluster could continue to function optimally.

3. Evolving tools: The software tools for converting the output sequence files into the deliverable format were still evolving. It was necessary to maintain sequence data in a variety of formats to test and validate these conversion tools, which required TGen to keep intermediate data files and resulted in a considerable demand for storage resources.

4. Data deluge and transfer: The amount of data generated by the sequencers and post-processing created the challenge of managing tens of terabytes of data. The volume of computational processing pushed the limits of the existing 80 TB Lustre file system on the HPC cluster. In addition, transferring terabytes of data for processing and sharing over a 1 Gb link was a bottleneck in the sequence data processing pipeline.

5. User education and support: The bioinformatics team dedicated to the analysis was relatively new to SOLiD data processing and to using the full functionality of the available HPC cluster resources. Therefore, end-user education and 24x7 help with data analysis tasks were necessary.

These factors hindered the implementation of a fully automated data processing pipeline, and manual supervision of every analysis was necessary. Next-generation sequencing was being adopted by an increasing number of TGen investigators, and more sequencing projects were in the pipeline for 2009. This required TGen to build and provide scalable sequence data processing infrastructure within the given financial and time constraints. In response to these challenges, IT worked closely with the scientific community to design a new internal workflow and deploy an advanced IT infrastructure for NextGen sequencing data processing. The new software and hardware infrastructure accelerates data processing and analysis, and enables scientists to better leverage the NextGen sequencing platform. The following section presents the lessons learned from the challenges we faced.

Lessons Learned:

The challenges above provided the opportunity to learn many valuable lessons about how to construct and operate a NextGen sequencing data processing pipeline. The following is a summary of the key lessons learned to date.

The Impact of I/O: We quickly learned that a 4000-core cluster can be rendered nearly useless by fewer than one third of its nodes saturating the file system with I/O operations. Many small I/O requests can quickly overwhelm the cache on the disk controllers, causing a large queue of requests to accumulate and degrading performance. The Lustre-based file system remains intact, but the latency of each I/O operation on the shared file system increases until essentially all operations on the cluster grind to a halt. The key is to actively manage and schedule computational jobs so that individual jobs cannot overwhelm the system and impact other jobs.
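One generic way to express that policy is to cap how many I/O-heavy jobs are released at once. The sketch below uses a fixed-size worker pool as a stand-in for the queue limits a real scheduler would enforce; the commands and the concurrency limit are illustrative assumptions, not TGen's actual configuration.

import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_IO_HEAVY = 8  # assumption: how many heavy readers the shared file system tolerates


def run_job(cmd):
    """Launch one external job; in practice this would be a qsub/sbatch submission."""
    return subprocess.run(cmd, shell=True).returncode


# Placeholder commands standing in for alignment jobs that hammer the shared file system.
io_heavy_jobs = [f"echo aligning sample {i}" for i in range(110)]

with ThreadPoolExecutor(max_workers=MAX_IO_HEAVY) as pool:
    results = list(pool.map(run_job, io_heavy_jobs))

print(f"{results.count(0)} of {len(results)} jobs completed")

The same effect can be achieved directly in the scheduler with per-user or per-queue run limits; the essential point is that the limit protects the shared file system, not the CPUs.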

WAN Transport & System Tuning: TGen's initial sequence data processing pipeline included transferring the raw sequenced data over a 100 Mb WAN Ethernet link from the sequencer to the HPC cluster environment located at our off-site data center. Despite upgrading the link to 1 Gb, data transfer using NFS over TCP across the 12-mile distance was still slow, due to the combined effect of latency and TCP checksums: the round-trip time meant that every checksum verification took upwards of 4.5 ms to complete, resulting in a substantial delay between frames. To mitigate this, we fine-tuned Linux kernel network parameters such as TCP_Window_Size and used open-source tools such as iperf [3] to test the effect of the tuning, which showed dramatic increases in throughput. However, the performance of data transfer over NFS was still unsatisfactory, and because of the variety and number of hosts connected across the link, tuning the kernel on each host individually was impractical. The solution was to perform the NFS mounts over UDP. This introduced a new risk of silent data corruption, because UDP does not perform checksums, so MD5 checksums must be generated for transferred data files to ensure data integrity. The key lesson learned was that careful attention should be paid to performance tuning: there is much to be gained from taking the time to understand and optimize system parameters, and doing so may avoid the cost of bandwidth upgrades that would not deliver the expected performance improvement.
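Two small sketches relate to this tuning. The first computes the bandwidth-delay product that motivates a larger TCP window on a high-latency link, using the 1 Gb/s and ~4.5 ms figures from the text; the exact sysctl values used at TGen are not reproduced here. The second shows streaming MD5 checksums of the kind used to catch the silent-corruption risk introduced by NFS over UDP.

import hashlib


def bdp_bytes(link_gbps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return link_gbps * 1e9 / 8 * (rtt_ms / 1000.0)


# ~549 KiB for a 1 Gb/s link with a 4.5 ms round trip -- far larger than old kernel
# defaults, which is why the window-related parameters had to be raised.
print(f"Suggested TCP window: {bdp_bytes(1, 4.5) / 1024:.0f} KiB")


def md5sum(path, block_size=1 << 20):
    """Stream a file through MD5 so multi-gigabyte files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(block_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Typical use: record md5sum(path) before the transfer and compare it after completion.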

LAN Data Transport Capacity: Moving data off the sequencers to storage and computational resources became a very time-consuming task. Having multiple sequencers producing and transporting data simultaneously quickly overwhelmed 1 Gb LAN segments. Fortunately, TGen had previously invested in 10 Gb core network components, enabling us to extend 10 Gb networking to key systems and resources in the data processing pipeline and thus eliminate bottlenecks on the LAN. As a result, we validated the importance of fully exploiting the capabilities of the available infrastructure and of having a flexible network architecture.

Internet Data Transport & Collaboration: As TGen began to exchange sequenced data with external collaborators, it became immediately apparent that traditional file transfer methods such as FTP would not be practical: the data sets were simply too large and the transfer times were unacceptable. This problem could not be addressed simply by increasing bandwidth, as TGen has no control over the bandwidth available at collaboration sites, and Internet latency issues became magnified when attempting to transfer large data sets. The project required TGen to receive sequenced data from other organizations, perform analysis, and make the results available to those organizations. After researching various approaches and exchanging ideas with others at the Networld Interop conference, TGen chose to implement the Aspera FASP file transfer product. Aspera enabled scientists to send and receive data at an acceptable rate and enhanced TGen's ability to participate in collaborative research projects involving NextGen sequencing. The lesson: actively seek out best practices and leverage the experiences of others in your industry. Participating in user groups and other industry forums can reduce the time it takes to identify and implement significant improvements to your infrastructure or workflow.

Data Management: The sheer volume of NextGen sequencing data had an immediate and significant impact on our file management and backup infrastructure and methods. Scientists were initially hesitant to delete even raw image data until they were comfortable with the process of regenerating the information. This resulted in scientists keeping multiple versions of large data sets, which quickly consumed backup and storage capacity. TGen's IT department worked collaboratively with the scientific community to optimize data management methods. This involved achieving consensus on what constitutes "essential data", defining standard naming conventions, and establishing mutually agreed rules regarding the location and retention of key data. Specifically, IT took the following steps to improve the data management process and accelerate the scientific workflow:

• Dedicated NFS storage for raw reads, attached to a back-up tape library
• Dedicated NFS storage for all results, attached to a back-up tape library
• Automated backup process for "key" files
• User education on how to mount/unmount the storage space
• Configured the Aspera server to read directly from designated NFS mount points, eliminating unnecessary data moves
• Weekly cron jobs for monitoring and informing users about storage resource capacity
• Automated monitoring of user jobs utilizing the HPC cluster
• Established a SharePoint-based web portal to share NextGen project-related information

These changes had to be synchronized and communicated across multiple scientific divisions as well as within the IT department. The end result was a more streamlined scientific workflow, an improved data management environment and a reduced impact on the storage, backup and network infrastructure. The lesson: be flexible with regard to data management procedures and the supporting infrastructure. Rapidly advancing technologies such as NextGen sequencing can render your current methods obsolete, and you must be willing to make dramatic changes in response to the needs of the scientific community and the demands of the technology.
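As one concrete example of the weekly capacity monitoring listed above, a minimal sketch of such a check follows. The mount points and warning threshold are assumptions, and a real deployment would mail the report from cron rather than print it.

import shutil

MOUNTS = ["/ngs/raw", "/ngs/results"]  # hypothetical NFS mount points
WARN_FRACTION = 0.80                   # assumed threshold for warning users


def capacity_report(mounts=MOUNTS):
    """Summarize usage of each mount and flag volumes approaching capacity."""
    lines = []
    for mount in mounts:
        total, used, free = shutil.disk_usage(mount)
        frac = used / total
        flag = "WARNING" if frac >= WARN_FRACTION else "ok"
        lines.append(f"{mount}: {frac:.1%} used, {free / 1e12:.2f} TB free [{flag}]")
    return "\n".join(lines)


if __name__ == "__main__":
    print(capacity_report())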

Benchmarking:

Alignment of billions of reads to reference genomes is computationally expensive, so an effort was initiated to benchmark sequence alignment tools. TGen's IT team was actively involved in this process, providing several performance measurement and tuning tools and creating automated scripts for collecting data about the computing resource utilization associated with six popular sequence alignment programs. IT used performance measurement tools for cluster computing environments to benchmark the speed, CPU utilization and input/output bandwidth needed by each program. This information is now being used to select the best tool for various projects and to plan the resource requirements for future NextGen sequencing projects. The lesson: time spent benchmarking can significantly reduce the cost and effort of the trial-and-error approach to selecting and using complex technology such as sequence alignment tools.
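A simplified illustration of such a benchmarking wrapper is given below: it records wall-clock time and peak child-process memory for a command and appends the result to a CSV. The aligner name and arguments are placeholders rather than one of the six tools actually benchmarked, and the real scripts collected considerably more detail (CPU utilization, I/O bandwidth).

import csv
import resource
import subprocess
import time


def benchmark(name, cmd, log="benchmarks.csv"):
    """Run a command, measure wall time and peak child-process memory (Unix only)."""
    start = time.time()
    subprocess.run(cmd, shell=True)
    wall_seconds = time.time() - start
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss  # KiB on Linux
    with open(log, "a", newline="") as fh:
        csv.writer(fh).writerow([name, f"{wall_seconds:.1f}", peak_kb])
    return wall_seconds, peak_kb


if __name__ == "__main__":
    # Placeholder command; substitute an actual aligner invocation.
    benchmark("aligner_a", "echo aligning chunk_0000.fa against hg18")

[Results]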

Key Technologies & Supporting Methodologies

The TGen High Performance Bio-Computing Center (HPBC) manages a diverse collection of HPC systems, storage and networking resources, including two large supercomputers. The first supercomputer is called Saguaro2, and is a Dell Linux cluster. This system consists of ~4000 Intel x86-64 processor cores, with 2 GB RAM per core. This system has a shared parallel 250 TB (Lustre) file system that allows massive amounts of concurrent input/output operations spread across many compute nodes. This system is very effective at running thousands of concurrent discrete processing jobs, or at running very large parallel processing workloads. This large HPC cluster system is installed at the Arizona State University campus in the Fulton High Performance Computing Initiative (HPCI) center and was funded via NIH grant S10 RR25056-01.


Figure 2 Saguaro2 supercomputer

In addition to the Saguaro2 cluster, TGen also has a large-memory Symmetric Multi-Processor (SMP) system available. This system is an SGI Altix 4700 consisting of 48 Intel IA-64 cores and 576 GB of globally shared memory. The SGI system is well suited to memory-intensive problems and to algorithms that are not easily parallelized. With these resources it can run several concurrent memory-intensive jobs without incurring a performance penalty, owing to the architecture of both the processors and the I/O backplanes. This system was funded via NIH Grant S10 RR023390-01.

Updated NextGen sequencing workflow:

Learning from this experience and systematically identifying the resource requirements at the various stages of NextGen data analysis and transfer, TGen developed and installed a significantly improved NextGen sequencing data processing pipeline (Figures 3 & 4). The updated pipeline uses several customized scripts, tailored to the software implementation underlying the various data analysis tools, that improve the effectiveness of using HPC for analyses. By identifying the critical files at each stage, storage redundancy has been minimized, and policies have been established to delete intermediate files automatically after a fixed time. Several compute systems have been dedicated to local data processing tasks such as annotation and parsing. Involving PIs in the infrastructure design process and educating their research staff has helped significantly in creating a team of proficient and more mindful users of the data processing pipeline.
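A retention policy for intermediate files can be expressed very compactly; the sketch below removes files older than a fixed number of days from a scratch area. The path and retention window are illustrative assumptions, not TGen's actual policy values.

import os
import time

SCRATCH_DIR = "/scratch/ngs/intermediate"  # hypothetical intermediate-file area
MAX_AGE_DAYS = 14                          # assumed retention window


def purge_old_files(root=SCRATCH_DIR, max_age_days=MAX_AGE_DAYS, dry_run=True):
    """Delete (or, in dry-run mode, list) files not modified within the retention window."""
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                print(("would remove " if dry_run else "removing ") + path)
                if not dry_run:
                    os.remove(path)


if __name__ == "__main__":
    purge_old_files()  # switch to dry_run=False only after reviewing the output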


Figure 3 Scientific data workflow (Feb. 2010)

Scalable storage

The dedicated storage capacity for NextGen projects has increased from ~80 TB to over 200 TB with a new scalable Isilon storage system offering a single-namespace file system. This system provides robust performance, redundancy and scalability. Being able to manage the very large amounts of storage required to support biomedical research with minimal IT support allows researchers to concentrate on their research, and IT to concentrate on building better infrastructure in support of scientific programs. The Isilon system uses a modular architecture and a symmetric clustered file system, so adding capacity to the storage cluster is as simple as plugging in additional storage arrays. This helps to minimize costs while providing a solution that can grow as data storage requirements continue to increase.

Backup optimization

In addition to the Isilon storage system, TGen used Ocarina storage optimization appliances to compress data before backup, saving considerable overhead on the backup systems. This makes it feasible to back up more of the sequencing data.

File sharing

File sharing with external collaborators and other partners is accomplished using the Aspera FASP file transfer technology. This technology allows optimal use of the network bandwidth to achieve high throughput file transfer across the Internet.


Figure 4 March 2010 Nextgen sequencing infrastructure pipeline

[ROI Expected or Achieved]

Highly scalable IT infrastructure supporting high-throughput NextGen sequencing data processing and analysis

1. High-speed, shared file transfer infrastructure that enables TGen scientists to participate in large-scale collaborations involving NextGen sequencing-based research

2. Improved data management procedures resulting in a more cost effective use of storage and other infrastructure resources

3. Efficient scientific data processing workflow including computational tools that can be leveraged to expedite research

4. Robust HPC infrastructure that is capable of supporting large-scale NextGen sequencing projects

As a result of these benefits, TGen is better positioned to compete for large-scale grants and contracts involving NextGen sequencing technology.

[Conclusions]

In spite of resource limitations, infrastructure constraints and a relatively short time to carry out the large-scale sequencing data analysis, TGen successfully aligned approximately 270 gigabases of the 550 gigabases processed against the human genome. Throughout 2009, several new research groups at TGen incorporated NextGen sequencing technologies into their research; consequently, the number of bioinformatics personnel carrying out NextGen data analysis is increasing. Concurrently, the number of sequencers at TGen has grown from two to seven (two SOLEXA and five SOLiD), and TGen expects to add six more SOLiD sequencers in early 2010. The throughput of each sequencer at TGen has more than doubled relative to early 2009, and this trend is expected to continue or even accelerate. Large volumes of data generated by external collaborators and industrial partners are also being processed at TGen. The increase in throughput and data volume necessitates scalable storage, HPC, and high-bandwidth network connectivity to store, manage and process sequencing data. These challenges will continue to provide opportunities for IT to play an increasingly important role in scientific research.

[REFERENCES]

[1] Saguaro supercomputer (http://www.top500.org/system/9789)

[2] Lustre Parallel File System (http://www.oracle.com/us/products/servers-storage/storage/storage-software/031855.htm)

[3] iperf (http://sourceforge.net)