
White Paper

Abstract

This EMC Isilon Sizing and Performance Guideline white paper reviews the Key Performance Indicators (KPIs) that most strongly impact the production processes for Next-Generation Sequencing (NGS) workflows.

August 2012

NEXT-GENERATION GENOME SEQUENCING USING EMC ISILON SCALE-OUT NAS: SIZING AND PERFORMANCE GUIDELINES


Copyright © 2012 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks used herein are the property of their respective owners.

Part Number H19061


Table of Contents

Executive summary
Introduction
NGS workflow – sequencing instruments and file types
NGS workflow – HPC
NGS workflow – Isilon scale-out NAS
EMC Isilon scale-out NAS overview
    Simple
    Scalable
    Predictable
    Efficient
    Available
    Enterprise-ready
NGS: key performance indicators
    HPC server parameters
    Network infrastructure parameters
    Isilon storage configuration parameters
    Summary
Conclusion


Executive summary

Next-generation sequencing (NGS) workflows comprise genome sequencer instrumentation, high-performance computing (HPC) infrastructure, a network-attached storage (NAS) platform, and the network infrastructure connecting these components together.

Raw NGS data is the largest component of an NGS process, making data storage

capacity and scalability important factors in NGS performance. The raw TIFF image

from the sequencer can be up to 70 percent of the total dataset. These files may be

compressed and stored for later use. Most organizations do not save the TIFF images,

but retain either the BCL or FASTQ files as the raw files. Each sequencing run can also

generate analysis data in the range of 50-200 GB. With faster sequencers and larger

read lengths, this can add up to between approximately 1 PB and 2 PB per year for a facility with three NGS sequencers.

Beyond capacity scalability, I/O performance is also a critical file storage attribute for

overall NGS performance and efficiency. NGS is I/O bound rather than processor bound,

and therefore storage I/O performance has a high impact on overall NGS performance

in relation to other NGS workflow parameters.

Internal EMC testing has determined that the Key Performance Indicators (KPIs) that

most affect the performance of NGS applications are:

Total random access memory (RAM) size on HPC cluster nodes (recommended at 3 GB/core)

RAM and SSD allocation on the EMC® Isilon® storage cluster – place maximum allowable RAM on the performance layer and minimum recommended on the

archival layer with about 1 percent to 2 percent of the raw storage capacity as SSD

Storage configuration parameters: NFS version V4, NFS async enabled, TCP

MTU (jumbo frames), LACP (2x 1 Gb/s or 4x 1 Gb/s), and tuning the Grid Engine package

Introduction

Over the past five years, the precision and effectiveness of sequencing technology

has considerably increased the pace of biological research and discovery. The resources

focused on molecular biology, cellular biology, and bioinformatics continue to accelerate

at a significant pace. Projections indicate that before the end of the 21st century, we

could gain a full understanding of the workings of our DNA. Such knowledge could

allow us to improve our collective quality of life through a better understanding of how a specific genetic variation impacts a drug's efficacy or toxicity, or by providing the knowledge to eradicate a range of genetically based disorders.

DNA exome sequencing is an approach to selectively sequence the coding regions of

the genome as an easier yet still effective alternative to whole genome sequencing.

The exome of the human genome is formed by exons. Exons are short, functionally important coding sequences of DNA within the gene's mature messenger RNA that constitute about 1.5 percent of the human genome.1

Many large-scale exome sequencing projects are underway to analyze human diseases. This technology is often chosen because it is more affordable than whole genome sequencing (WGS) and therefore allows more patients to be analyzed. In addition, it has the advantage that the resulting data volumes are much smaller and therefore easier to handle. However, recent studies2 focused on this question found that the two technologies complement each other. As neither whole genome nor large-scale exome sequencing covers all sequence variants, it is optimal to conduct both experiments in parallel.

A single human genome—composed of a total of about 3.2 billion base pairs—requires

about 1.2 GB of unassembled storage. Industry analysts predict that the estimated

number of human whole genomes sequenced will explode from 25,000 genomes in 2012, to between 50,000 and 100,000 in 2013, and up to about 1 million by 2015.
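As a rough check on these figures (the 2-bit encoding assumption here is ours, not the paper's), a DNA base can be stored in about 2 bits, so an unassembled genome of 3.2 billion base pairs occupies approximately

\[ 3.2 \times 10^{9}\ \text{bases} \times \frac{2\ \text{bits}}{\text{base}} \times \frac{1\ \text{byte}}{8\ \text{bits}} = 0.8\ \text{GB} \]

which is consistent with the quoted 1.2 GB once per-base metadata and indexing overhead are included.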

The key enabling technology for NGS is the array of commercial sequencers available from various companies, including Illumina, Life Technologies, Roche/454 Life Sciences, and others. These sequencers interface to a computer network, which correlates and concatenates the billions of overlapping short reads of DNA sequence that have been streamed to or stored on a NAS system.

Accommodating the output rate of the sequencers requires a precisely designed and

balanced system. The peak rate of data (base pairs) produced by an Illumina sequencer,

for example, is already approaching 600 GigaBases per week, equivalent to about 100

whole human genomes. The range of data per year for an Illumina sequencer is from 350 TB to 1 PB.

The NGS workflow is composed of:

Genome Sequencer instruments,

HPC infrastructure,

NAS platform, and the

Network infrastructure stitching these components together.

These four components make up the hierarchy of the NGS gene-sequencing architecture.

Each component depends on the other and must have the ability to adapt and scale

to meet current and future sequencing needs. If one component creates a bottleneck,

then the performance of the entire NGS system suffers. This document focuses on optimum performance and sizing guidelines for the core components of NGS: the HPC infrastructure and the network-attached storage.

NGS workflow – sequencing instruments and file types

The applications at the heart of NGS data creation come from important established

and emerging organizations involved in bringing NGS to market. The list includes

software from Illumina®, Life Technologies (Applied Biosystems), Roche/454, Ion Torrent, Pacific Biosciences, and a myriad of open source offerings such as Galaxy. Running these applications in a research and analysis environment places complex and special requirements on the IT systems and, in particular, the storage infrastructure. This document focuses on the Illumina technologies, specifically CASAVA™, for the sequence analysis software. Other genome assembly and analysis platforms, such as Galaxy, will be summarized in subsequent documents.

1 See Gilbert W (February 1978). "Why genes in pieces?" Nature 271 (5645): 501.
2 See "Performance comparison of exome DNA sequencing technologies." Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M, Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.

An NGS environment typically consists of Scientific, Lab, and Analysis users:

The Scientific User initiates the method of genome sequencing and instrumentation. This may also be the Analysis User.

The Lab User runs the experiment (chemistry workflow) using a multiplexed sampling scheme (or lanes) supported by the NGS instrument.

The Analysis User works on the results from the genome sequencing study with bioinformatics tools and algorithms.

Most commercial NGS data centers also have a trained storage administrator on their

staff. With the growing use of NGS technologies, a new user has emerged for these

storage systems. The scientist or researcher running the experiments frequently

handles the data directly. Data management has to be intuitive to allow this new user

to run experiments and administer the data with minimal difficulty. In addition, the

storage administrator needs access to the more advanced management features to

set sophisticated management policies. These help with optimizing performance and use of the storage system. It is important that the storage system deployed provide management capabilities tuned to both types of users.

A graphical representation of the typical NGS data flow is shown below in Figure 1:

Figure 1. NGS architecture, data flow, and file types

The results stage of the NGS workflow shown in Figure 1 consists of a number of successive steps, each involving a file conversion and each producing files roughly 5x smaller. These steps include conversion of the raw image file into base-call data, then of the base-call data into the text-based FASTQ file format, which stores both the biological sequence and its corresponding quality scores (for example, using LQUAL or QUAL formats). This is followed by conversion into BAM (Binary Alignment Map) file data, then into variant call (VCF) file data, which is next converted into results data in SRA format. This tertiary file data is typically kept forever, needs to be kept safe as well as available, and accumulates over time.

Today’s instruments produce higher level information and may avoid some of the

intermediate steps, thus reducing output data compared to previous NGS systems.

Therefore, data flows generated by the latest NGS instruments have typically decreased

in size per run. This decrease has been offset by a larger number of experiments,

secondary data, and increased consumption by users working downstream on many

different efforts and workflows. The size and characteristics of data produced from

these efforts place unpredictable demands on capacity as well as on throughput of the

storage systems. NGS storage environments need to be able to adapt to demands for

more capacity from post-processing work done by researchers downstream from the first data capture.

NGS workflow – HPC

NGS applications have both common and unique analysis tools. All applications generate

large files that must be managed through multiple rounds of processing. Although many

tools were written specifically for easy implementation on a high-end desktop computer

(e.g., 64-bit dual- or quad-core, 16 GB RAM), routine analysis is typically conducted on high-performance compute clusters.

Using a high-performance compute cluster, secondary analysis processing can generally

be done at a rate equal to or faster than primary data generation. Due to the open-ended nature of tertiary analysis, a similar rate estimate cannot be precisely stated.

It is important that the parallelization of the NGS analysis platform be well understood before planning optimum server CPU core sizing. Most NGS tools are at least multi-processor aware or are highly parallelized, by simply dividing the sequence data, the assembly algorithm, the variant calling, or all of these, and starting separate analyses on these data subsets. For NGS applications, the current parallelization per process is typically between 75 percent and 90 percent.

As genomics has very large, semi-structured, file-based data and is modeled on post-

process streaming data access and I/O patterns that can be parallelized, it is ideally

suited for the Hadoop software framework3, which consists of two main components: a file system and a compute system—the Hadoop Distributed File System (HDFS) and the MapReduce framework, respectively.

3 See Hadoop in the life sciences: an Isilon Systems white paper. Joshi S.


Figure 2. Amdahl’s Law and parallelization

One of the basic tenets in HPC, Amdahl's Law4, postulates that adding more microprocessor cores to a process does not speed it up linearly. A 64-core HPC platform is estimated to be the performance threshold for 75 percent parallelization per NGS process, which delivers a speedup of 4x (see Figure 2). Even more than 100 cores per active NGS process do not speed up the process substantially when the algorithms are between 75 percent and 90 percent parallelized. During actual testing of NGS processes in the range of 75-90 percent parallelization, the speedup from 12 cores to 72 cores was found to be only about 1.25x.
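Amdahl's Law can be stated compactly: if a fraction p of a process is parallelizable, the speedup on n cores is

\[ S(n) = \frac{1}{(1 - p) + p/n} \]

For p = 0.75, S(64) = 1/(0.25 + 0.75/64) ≈ 3.8, the roughly 4x threshold cited above, and S(72)/S(12) = 3.84/3.20 ≈ 1.2, consistent with the observed speedup of about 1.25x from 12 to 72 cores.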

Horizontal platforms like Hadoop that combine compute and data in a parallel context would benefit genome assembly considerably.

4 See "Validity of the single processor approach to achieving large-scale computing capabilities", Amdahl G, AFIPS Conference Proceedings (30): 483-485, 1967.


Figure 3. Performance curves for NGS using Illumina CASAVA

As shown in Figure 3 above, the NGS process is storage I/O and memory bound. The performance curves show a direct relationship between NGS performance and the saturation of read/write I/O and memory functionality. In contrast, there is an inverse relationship between CPU core utilization and storage I/O and memory functionality. This behavior may be due to mutual dependencies or portions of the process that can only be performed sequentially; NGS algorithms requiring movement of large amounts of data in and out of the CPU; startup overhead, including base calling and other large numbers of small file writes; and the degree of serialization involved in communication.

In view of the above discussion, it is recommended that the HPC server hardware platform be configured with:

Best I/O chipset, for example, using the latest generation Intel I/O controllers

Highest DRAM speed (with a minimum of 3 GB per core of RAM)

Multi-core CPU set with > 2 GHz processors

Simplified BIOS and driver upgrades with a single management console for all driver upgrades

Linux driver compatibility (over 90 percent of all HPC systems are Linux-based)

Disk drives between 200 GB and 600 GB with RAID 10

Cluster management tools such as Ganglia

Increasing the network bandwidth up to 4 Gbps would alleviate the read I/O and memory saturation.


NGS workflow – Isilon scale-out NAS

Figure 4. Data flow using Illumina NGS process

NGS production processes generate potentially millions of files with terabytes of aggregate storage, straining the capacity and manageability limits of existing file server structures.

Figure 4 shows the data flow, including a file count and capacity summary, of an actual NGS process using an Illumina sequencer and Isilon scale-out NAS storage. As can be seen, the process generates over 500,000 files with an aggregate size greater than 5 TB over the course of the 48-hour run.

Raw NGS data is the largest component of an NGS process. The raw TIFF image can be

up to 70 percent of the total dataset. These files may be compressed and stored for

later use. Most organizations do not save the TIFF images, but retain either the BCL

or FASTQ files. If sequencing as a service is used, the input to the process is a BAM

file. Each sequencing run can also generate intermediate and final analysis data in the range of 50-200 GB. With faster sequencers and larger read lengths, this can add up to between 1 PB and 2 PB per year for a facility with three NGS sequencers.

Genomics is a data reduction process from the raw instrument information (images or voltages) to the variants. This reduction process follows the "Rule of One-Fifth," as shown in the sizing table below:

File format | Size, GB (Illumina) | Size, GB (Ion Torrent) | Comments
TIFF, WELLS | 2500 | 750 | TIFF range: 2.5 to 4 TB; Ion Torrent™ uses the WELLS voltage format
BCL / SFF | 500 | 500 | Ion Torrent uses SFF
BAM | 100 | 100 | 2x compression (~200 GB uncompressed)
VCF | 20 | 20 | Variant calls
SRA, EMR | 4 | 4 | EMR (Electronic Medical Record) includes radiology and pathology images

Table 1. Data reduction for the NGS process; human whole genome, all file sizes are approximate
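The rows of Table 1 can be summarized compactly: each conversion stage k reduces the data volume by roughly a factor of five, so

\[ S_k \approx S_0 \left(\tfrac{1}{5}\right)^{k} \]

Starting from S_0 = 2500 GB of Illumina TIFF data, four stages give 2500 → 500 → 100 → 20 → 4 GB, exactly the progression in the table.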

Raw instrument data typically consists of large image files (2-5 TB per run are the norm),

usually in TIFF format or an electropherogram file format native to a sequencer (for

example, the SEQ format native to the Illumina sequencer). These files are only kept

long enough (7-10 days) to verify that the experiment worked. The image file for the

experiment is usually the largest file size in NGS.

Intermediate or secondary data consists of raw data processed into information of increasing value. It is stored for the medium to long term (1 year or more), requires high-bandwidth access for fast analysis, and is expensive to re-create, so its storage needs to be highly available. This data includes files in BCL format for base calling and conversion, with an aggregate size of approximately one-fifth that of the raw instrument data.

Beyond capacity scalability, I/O performance is also a critical file storage attribute for

overall NGS performance and efficiency. As discussed earlier, NGS is I/O bound, rather

than processor bound, and thus storage I/O performance has a high impact on overall

NGS performance in relation to other NGS workflow parameters. As a result, NGS

environments require a file storage infrastructure that is purpose-built to address the

capacity and performance scalability, efficiency, availability, and manageability challenges

of next-generation NGS environments.


EMC Isilon scale-out NAS overview

NGS is an unstructured file-based process, not a block-based storage process. EMC Isilon scale-out NAS manages unstructured file data in a single namespace, using storage appliance nodes arranged in clusters that support massive scalability.

A short description of the EMC Isilon storage solution and the EMC Isilon OneFS® file operating system, with each of its features summarized below, confirms its suitability for next-generation genomic sequencing:

Simple

OneFS combines the three layers of traditional storage architectures—the file system,

volume manager, and RAID/data protection—into one unified software layer, creating a single intelligent distributed file system that runs on an Isilon storage cluster.

Figure 5: OneFS eliminates the need for complex file management

This scale-out hardware provides the appliance on which the OneFS distributed file system resides. A single EMC Isilon cluster consists of multiple storage nodes: rack-mountable enterprise appliances containing memory, CPU, networking, NVRAM, storage media, and the InfiniBand back-end network that connects the nodes together. Hardware components are best-of-breed and benefit from ever-improving cost and efficiency curves. OneFS allows nodes to be added to or removed from the cluster at will and at any time, abstracting the data and applications away from the hardware. Adding nodes—instead of adding volumes and LUNs via physical disks—becomes an extremely simple task at the petabyte (PB) scale, which is common in NGS.


Scalable

Figure 6: Linear scalability with OneFS

EMC Isilon provides a high-performance, fully symmetric cluster-based distributed

storage platform. It has linear scalability with increasing capacity—from 18 TB to

15.5 PB in a single filesystem—as compared to traditional storage. The concept of

node-based capacity growth with linear scaling is critical to NGS, where scale needs to

be painless, since the process can generate upwards of 8 TB per week per instrument.

The researchers and clinicians need to focus on managing scientific data and patients, not managing storage.

Predictable

Along with raw scaling of capacity, the balancing of content across new nodes needs to be predictable for an NGS workflow because of its sustained throughput requirement. Since instrument technology changes faster than the HPC or storage tiers, this balancing and scaling become invaluable. Dynamic content balancing is performed as nodes are added or data capacity changes, with no added management time for the administrator and no increased complexity within the storage system. The storage reporting application, InsightIQ, can be used to plan system growth from storage statistics, both for infrastructure planning and for budgeting.

Efficient

Operational Expenditure (Opex) hinges upon efficiency, specifically in NGS, since the

total storage can run into PBs. A recent survey conducted by Scripps Institute concluded

that more than 35 percent of institutions today are at petabyte scale in NGS with a

10 percent year-over-year growth.

Isilon scale-out NAS offers an 80 percent efficiency ratio and "smart pooling" of the data across multiple tiers, making dynamic, rule-based data transfer between storage pools an integral piece of the NGS process. This efficiency is at the application level and is tiered by performance type:

S-Series node for high performance (I/O per second)

X-Series node for high throughput

NL-Series node for archive


Figure 7: Storage tiering based on node type

The tiers in the storage cluster shown in Figure 7 above are identified as "pools" and managed by the EMC Isilon SmartPools™ application. A pool is a user-defined group of similar nodes based on functionality or workflow. A pool is governed by policies that can be changed based on needs; default policies are built in. Policies can be defined by any standard file metadata: file type, size, name, location, owner, age, last accessed, and so on. Data can be migrated from pool to pool, and the timing of this data movement is configurable (the default is once per day at 10 PM).

Available

Data availability and redundancy are the core requirements of the scientific and clinical

staff in NGS. As NGS moves into the clinical realm, availability becomes even more

important. Flexible data protection occurs during power loss, node or disk failures,

loss of quorum, and storage rebuild. OneFS avoids the use of hot spare drives, and

simply borrows from the available free space in the system in order to recover from failures; this technique is called virtual hot spare.

Since all data, metadata, and parity information is distributed across the nodes of the

cluster, the Isilon cluster does not require a dedicated parity node or drive, or a dedicated

device or set of devices to manage metadata. This helps to ensure that no one node can become a single point of failure and makes the cluster "self-healing."

Enterprise-ready

The NGS data system does not exist as an island; it usually coexists with other storage and IT systems. The standard protocols that OneFS supports provide standards-based bridges from NGS to other information systems. Specifically, connectivity to the Isilon scale-out NAS cluster is via standard file protocols: CIFS, SMB, NFS, FTP/HTTP, iSCSI, and HDFS. The complete data lifecycle is accessible to the centralized IT group.

Snapshots, Replication, and Quotas are supported via a simple web-based UI.

Data is given infinite longevity, future-proofing the enterprise against evolving hardware generations and eliminating the cost and pain of data migrations and hardware refreshes.


Active Directory (AD), LDAP, NIS, and local users provide standardized authentication and access control at scale. Simultaneous or rolling upgrades to OneFS are possible, with little or no impact to the production environment.

Figure 8: Standard protocols are critical to enterprises

The software to manage OneFS is automated to eliminate complexity, as shown in Figure 9 below:

Figure 9: OneFS software management suite

All of the applications shown above are available as software licenses and are web-based through the main administrative user interface. A comprehensive command-line-based administration interface is also available.


OneFS Software Management Suite: making data management easier for NGS. OneFS infrastructure software solutions meet critical data protection, access, management, and availability needs.

Application | Category | What it does
SmartPools | Resource management | Implements a highly efficient, automated tiered storage strategy to optimize storage performance and costs
SmartConnect | Data access | Enables client connection load balancing and dynamic NFS failover and failback of client connections across storage nodes to optimize use of cluster resources
SnapshotIQ | Data protection | Protects data efficiently and reliably with secure, near-instantaneous snapshots while incurring little to no performance overhead
InsightIQ | Performance management | Maximizes performance of your Isilon scale-out storage system with innovative performance-monitoring and reporting tools
SmartQuotas | Data management | Assigns and manages quotas that partition storage into easily managed segments at the cluster, directory, sub-directory, user, and group levels
SyncIQ | Data replication | Replicates and distributes large, mission-critical data sets to multiple shared storage systems in multiple sites for reliable disaster recovery capability

Table 2. Functional overview of the OneFS software suite


NGS: key performance indicators

As discussed in the HPC section, performance of the NGS process is highly dependent on the I/O performance of the file storage system and on the memory resources available in the NGS architecture. In addition, there is a range of second-order factors to consider when optimizing performance for a specific NGS process, including5:

How much faster can a given problem be solved with multiple workers (or server cores) instead of one?

How much more work can be done with multiple workers (or server cores) instead of one?

What impact do the communication requirements of the parallel NGS application have on overall performance and scalability?

What fraction of the resources in an NGS configuration is actually used productively for solving the NGS problem?

The KPIs for NGS consist of factors that can be used to predict and optimize the performance of an NGS configuration, and can be broken down into four categories:

HPC server attributes: RAID, number of processor cores per HPC node, total RAM size per HPC node

NGS network infrastructure: TCP MTU, Channel Bonding, DNS

Sun Grid Engine parameters: number of nodes, PAR_EXECD_INST_COUNT

Isilon file storage attributes:

SSD size and RAM

NFS protocol parameters: NFS server OS, async, number of threads, locks

Software RAID, maximum number of directories at a level, maximum number of files in a directory, number of files less than 8 KB

HPC server parameters

RAID: With modern multi-core CPUs, the performance of software RAID is very

close to that of hardware RAID. RAID 10 (first mirroring, then striping the mirrors)

is recommended for the HPC nodes with a minimum of two identical drives per node

where both drives are bootable. The benefit of such a configuration is that the server continues to boot seamlessly even in the face of a failure of a single drive.

Total processor cores: The empirical rule-of-thumb for total number of threads and processes running in parallel is determined by the equation:

∑ (threads + processes) = (2 x total cores) + 1

Please note that the total number of threads includes functions such as NFS and HPC queuing as well as the processes that run NGS algorithms; it is important to document all the processes that are multi-threaded. Amdahl's law vis-à-vis parallelization is also an important consideration. A quick way to compute this budget on a node is sketched below.
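As an illustration only (not from the paper), the following shell sketch computes the rule-of-thumb thread-plus-process budget for the node it runs on:

    #!/bin/sh
    # Rule of thumb: sum of (threads + processes) = (2 x total cores) + 1
    cores=$(nproc)                # number of online CPU cores
    budget=$(( 2 * cores + 1 ))   # recommended concurrent thread/process budget
    echo "cores=${cores} recommended_thread_process_budget=${budget}"

On a 12-core node, for example, this prints a budget of 25.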

5 See Introduction to High Performance Computing for Scientists and Engineers, Hager G, Wellein G, © 2010 Taylor & Francis Group, LLC.


Total RAM size: NGS analysis requires large file processing, including functions

related to string processing, clustering of large files, and statistical quality measures,

and thus easily becomes memory-bound. As a result, a large DDR3-based RAM pool is optimal.

Network infrastructure parameters

TCP MTU: The default Maximum Transmission Unit (MTU), or frame size, of current Ethernet systems is 1500 bytes. However, higher bandwidth network infrastructures can handle a much larger MTU of 9000 bytes (called "jumbo frames") for efficient data transfer. Please note that the jumbo frame setting needs to be applied both on the HPC server node(s) and on the switch(es).
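On a Linux HPC node, the jumbo frame setting might be applied and verified as in the sketch below; the interface name (eth0) and the target address are placeholders, and the same MTU must also be configured on the switch ports and the storage interfaces:

    # Set a 9000-byte MTU on the interface facing the storage network
    ip link set dev eth0 mtu 9000

    # Confirm the setting
    ip link show eth0

    # End-to-end check: 8972 bytes of ICMP payload + 28 bytes of headers = 9000,
    # with fragmentation disallowed (-M do); 10.0.0.10 stands in for the NAS
    ping -M do -s 8972 -c 3 10.0.0.10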

Ethernet Bonding (LACP): Ethernet bonding using the Link Aggregation Control Protocol (LACP) is a method used to alleviate bandwidth limitations and port-cable-port failure issues. By combining several Ethernet interfaces into a virtual "bond" interface, network bandwidth can be increased, since LACP splits communications and sends frames across all the Ethernet links. Bonding 2x 1 GbE interfaces provides the required bandwidth between HPC server nodes and NAS file storage.
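A minimal sketch of creating an LACP bond on a Linux client with iproute2 follows; the interface names and address are placeholders, and the attached switch ports must be configured for 802.3ad aggregation as well:

    # Create a bond interface running LACP (802.3ad)
    ip link add bond0 type bond mode 802.3ad

    # Enslave two 1 GbE interfaces (links must be down before enslaving)
    ip link set eth0 down && ip link set eth0 master bond0
    ip link set eth1 down && ip link set eth1 master bond0

    # Bring the bond up and assign the storage-network address
    ip link set bond0 up
    ip addr add 10.0.0.21/24 dev bond0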

Isilon storage configuration parameters

NFS Master OS: By default, the EMC Isilon OneFS operating system is the NFS server. It is recommended that this default be maintained, since SmartConnect and other OneFS features may be affected if the HPC master node OS is chosen as the NFS server.

NFS V4: NFS V4 provides improved performance, security, and robustness vis-à-vis

NFS V3. These include support of multiple operations per RPC operation (vs. a single

operation per RPC in NFS V3), use of Kerberos and access control lists (ACLs) for

security (vs. UNIX file permissions in NFS V3), use of TCP transport (vs. UDP in NFS

V3), and integrated file locking (vs. use of the adjunct Network Lock Manager protocol in NFS V3). As a result, it is recommended that sites use NFS V4 for NGS environments. Please note that the initial setup of NFSv4 can be cumbersome.
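An NFSv4 mount from an HPC node might look like the following sketch; the hostname, export path, and mount point are hypothetical, and the large rsize/wsize values suit the streaming I/O pattern of NGS:

    # Mount an Isilon export over NFSv4 with TCP and large transfer sizes
    mount -t nfs4 -o proto=tcp,hard,rsize=131072,wsize=131072 \
        isilon.example.org:/ifs/data/ngs /mnt/ngs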

NFS async: The NFS async (asynchronous) mode allows the server to reply to client requests as soon as it has processed the request and handed it off to the local file system, without waiting for the data to be written to stable storage. Write performance is better in asynchronous mode than in synchronous mode (also called "noasync"), especially for smaller file sizes. Async is the recommended mode, especially since NFSv4 uses TCP connectivity.
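For illustration only: on a generic Linux NFS server, async is an export option, as shown below. On an Isilon cluster the equivalent setting is made through the OneFS administrative interfaces rather than through /etc/exports.

    # /etc/exports on a generic Linux NFS server (not OneFS): async acknowledges
    # writes before data reaches stable storage, trading safety for throughput
    /data/ngs    10.0.0.0/24(rw,async,no_subtree_check)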

NFS number of threads: This is the number of NFS server daemon threads that are

started when the system boots. The OneFS NFS server usually has 16 threads as its

default setting; this value can be changed via the Command Line Interface (CLI):

isi_sysctl_cluster sysctl vfs.nfsrv.rpc.[minthreads,maxthreads]

Increasing the number of NFS daemon threads improves response only minimally; the maximum number of NFS threads should be limited to 64.
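Based on the command form given above, raising the thread ceiling to the 64-thread limit might look like the following; this usage is our reading of the documented form, so verify the exact syntax against the OneFS release in use:

    # Raise the NFS daemon thread ceiling to 64 cluster-wide (illustrative)
    isi_sysctl_cluster sysctl vfs.nfsrv.rpc.maxthreads=64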

NFS ACL: The NFS ACL (Access Control List) for NFSv4 is a list of permissions associated

with a set of files or directories which contain one or more Access Control Entries (ACEs).

There are four types of ACEs: Allow, Deny, Audit, and Alarm; with three kinds of

flags: group, inheritance, and administrative. There are 13 file permissions and 14 directory permissions. OneFS manages the NFS ACLs, which need to be mapped to the NFS client using the idmapd configuration.

NFS locks: The mounting and locking processes have been enhanced in NFSv4 which

supports mandatory as well as advisory locking. Caching and open delegation provide

performance improvements in most situations. More information about state is stored on the servers in the HPC tier, enabling recovery of the files when they are in use.6,7

Maximum number of directories at a level and files within a directory: While Isilon OneFS supports an upper bound of 100,000 files in a directory, and the same bound on the number of directories at a level, to ensure the highest performance while traversing a directory tree, both the maximum number of directories at a level and the maximum number of files within a directory should be kept below 10,000.

Number of small (<8 KB) files: Random-write operations on small files have poor response times and can degrade overall application performance. To optimize performance, it is recommended that base call files, which are typically <8 KB, be aggregated into ZIP archive files of 128 KB or larger, as sketched below.
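One hedged sketch of such an aggregation, with hypothetical paths: collect the base call files under 8 KB from a run directory and roll them into a single ZIP archive before long-term storage.

    # Gather small (<8 KB) base call files into one archive;
    # zip -@ reads the list of files to add from standard input
    find /mnt/ngs/run_2012_08/BaseCalls -type f -size -8k -print \
        | zip /mnt/ngs/run_2012_08/basecalls_small.zip -@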

SGE number of nodes: The Sun Grid Engine (SGE) package is a popular distributed resource manager (DRM) and scheduler for controlling access to cluster resources. It is recommended that a minimum of three SGE nodes be used for NGS, for performance and backup reasons. While a commercial version of SGE is available from Oracle, SGE is also available as open source. Other popular open source DRM packages are Torque/Maui and Lava.

Execution daemons: The SGE PAR_EXECD_INST_COUNT variable contained within

the SGE configuration file defines the number of parallel execd (execution daemons) for the NGS HPC cluster.
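In an SGE auto-installation configuration file, the variable appears as a simple assignment; the value below is only an example:

    # Excerpt from an SGE auto-install configuration file (value is illustrative):
    # number of execution daemons to install in parallel across the HPC cluster
    PAR_EXECD_INST_COUNT="16"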

DNS location: If the HPC NGS system is run within a private network, it is

recommended that Linux BIND be installed on the HPC master node with DNS

forwarding to the organization’s DNS server.
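A minimal BIND forwarding configuration on the HPC master node might look like the sketch below; the forwarder address is a placeholder for the organization's DNS server:

    // Excerpt from /etc/named.conf: resolve private-network names locally and
    // forward everything else to the site DNS server
    options {
        forwarders { 192.0.2.53; };
        forward only;
    };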

Summary

Internal EMC testing determined that the KPIs that most affect performance are:

RAM on HPC cluster server nodes (recommended at 3 GB/core)

RAM and SSD on the Isilon storage cluster—maximum allowable RAM on the performance layer and minimum recommended on the archival layer with about 1 percent to 2 percent of the raw storage capacity as SSD

Storage configuration parameters: NFS version V4, NFS async enabled, TCP MTU (jumbo frames), LACP and the Grid Engine package

6 See info on Isilon SmartLock: http://www.emc.com/collateral/software/white-papers/h8325-wp-isd-smartlock.pdf 7 See info on Isilon high-performance computing: http://www.isilon.com/high-performance-computing


Conclusion

NGS production processes generate potentially millions of files with terabytes of aggregate storage, straining the capacity and manageability limits of existing file

server structures. Raw instrument data typically consists of large image files (2-5 TB

per run are the norm), usually in TIFF format. The image file for the experiment is usually the largest file size in NGS.

Genomics is a data reduction process from the raw instrument information (images or voltages) to the variants, and this reduction follows the "Rule of One-Fifth." Intermediate or secondary data consists of processed raw data, including files in BCL format for base calling and conversion, with an aggregate size of approximately one-fifth that of the raw instrument data.

Internal EMC testing has determined that the KPIs that affect the performance of NGS applications the most are: total RAM size on HPC cluster nodes (recommended at 3 GB/core); RAM and SSD allocation on the Isilon storage cluster (typically 1 to 2 percent of the raw storage capacity as SSD); and storage configuration parameters, with NFS version V4, NFS async enabled, TCP MTU (jumbo frames), LACP (2x 1 Gb/s or 4x 1 Gb/s), and a Grid Engine package.

NGS environments require a file storage infrastructure that is purpose-built to address

the capacity and performance scalability, efficiency, availability, and manageability

challenges of next-generation NGS applications. Cumulative network bandwidth between HPC and NAS increases with the total number of Isilon nodes on the storage cluster.

Isilon scale-out NAS presents a range of benefits optimal for NGS. The Isilon approach

of enabling storage I/O and capacity growth through addition of cluster nodes is optimal

since NGS requires storage performance and capacity scalability to be implemented

as seamlessly as possible. In addition, the dynamic content balancing performed within Isilon scale-out NAS as nodes are added or data capacity changes is ideal for an NGS workflow due to its sustained throughput requirement.

Isilon scale-out NAS also offers an 80 percent efficiency ratio and "smart pooling" of the data across multiple performance tiers, making dynamic, rule-based data transfer between storage pools an integral piece of the NGS process. Flexible, multi-dimensional data protection, which occurs within Isilon scale-out NAS during power loss, node or disk failures, loss of quorum, and storage rebuild, enables non-stop data availability for NGS.