
Technical Case Study

By Pethuraj Perumal, IT Storage Manager, NVIDIA Corporation

At NVIDIA, our success is driven and determined by relentless innovation and the ability to bring new processor designs to market quickly. Already recognized as the leader in visual computing, we are diversifying and expanding rapidly into new markets.

Our graphics processing unit (GPU) technology is enabling breakthroughs in healthcare, science, transportation, entertainment, and more, opening up a wealth of new opportunities for NVIDIA. The performance and reliability of our engineering compute farm are absolutely critical to NVIDIA being first to market with new chip designs and ultimately growing revenue and providing value for our partners and customers. To achieve our business goals today and in the future, having a high-performance storage platform is essential.

Transforming a Compute Farm into an Innovation Factory

How NVIDIA doubled its capacity for engineering computing with NetApp and accelerated innovation into new markets


Supporting World-Class Research and Development

NVIDIA engineers design a range of processors, from tiny chips that power smartphones and tablets to huge supercomputing processors packed with 7 billion transistors. Designing and simulating these chips is an increasingly complex and technically challenging task. We are generating larger files, and more of them. In the past nine months, our engineers have created 2.4 billion files—that’s around 10 million files every day. We’ve amassed more than 15PB of engineering data, and the data volume doubles approximately every two years. Planning for this level of data growth is challenging: budgets remain flat, but demand continues to increase.

We don’t want our product engineering teams to even have to think about storage while they are testing their designs—and we certainly don’t want storage to be a bottleneck to their research and development (R&D) workflow. Our electronic design automation (EDA) workflow cannot be delayed or interrupted for any reason. If a compute job stops, it must be rerun from the beginning, potentially affecting the entire test cycle and delaying time to market. Thankfully, our compute factory built on NetApp® storage is able to keep up with the pace of innovation set by our thousands of engineers, allowing them to complete chip designs, simulations, and logic verification quickly and reliably.

To enable our engineers to innovate without disruption or delay, IT must provide them with the highest performance storage platform available, specifically tuned to provide “scratch space” and data volumes for file-driven, I/O-intensive engineering workloads. As data grows, one of my team’s key objectives is maximizing the “CPU time–to–wall time” ratio, where wall clock time represents the total amount of time necessary to process the compute job, and CPU time measures the amount of time that the CPU is actively working on processing the task. The higher the ratio, the more efficient our compute factory; however, improving this ratio requires a storage platform with fast I/O. If the CPU is waiting on storage to respond, that is idle time that erodes our overall efficiency.
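To make that ratio concrete, here is a minimal sketch of how CPU-time-to-wall-time efficiency can be computed from job accounting data; the job records and numbers below are illustrative assumptions, not output from our scheduler.

```python
# Minimal sketch: compute a CPU-time-to-wall-time efficiency ratio for a
# batch of compute jobs. The job records below are illustrative only; a
# real farm would pull cpu/wall times from the scheduler's accounting logs.

def efficiency_ratio(cpu_seconds: float, wall_seconds: float) -> float:
    """Fraction of wall-clock time the CPU spent doing useful work.
    Values near 1.0 mean the job was rarely stalled waiting on storage I/O."""
    if wall_seconds <= 0:
        raise ValueError("wall time must be positive")
    return cpu_seconds / wall_seconds

# Hypothetical job records: (job_id, cpu_seconds, wall_seconds).
jobs = [
    ("sim-001", 3400.0, 3600.0),  # mostly compute bound
    ("sim-002", 1800.0, 3600.0),  # half the wall time spent waiting on I/O
]

for job_id, cpu_s, wall_s in jobs:
    print(f"{job_id}: efficiency = {efficiency_ratio(cpu_s, wall_s):.2f}")
```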

Technical Requirements for the Storage Layer

Several years ago, we tried using another vendor’s storage technology that stripes all the disks across a much larger array, and we ran into three problems:

• We were not getting the linear performance we required from the system.

• Small-file random I/O became a bottleneck.

• Stability and reliability were insufficient. A storage controller failure could have delayed time to market, because all active jobs in the workflow would have needed to be restarted from the beginning.

After evaluating ways to address these issues, we concluded that replacing the system was the right choice. Our team selected NetApp to support our R&D compute operations for the following reasons:

• Performance. Our R&D compute operations involve a high level of concurrency, with more than 5,000 compute nodes accessing the storage, so performance is largely determined by the storage controller. We always want the fastest processors available in a storage controller, with the most parallel network threads to process I/O requests. We also need the ability to handle small file random I/O operations efficiently, because this is also a major determinant of performance for our workloads.

Results of NVIDIA Compute Factory Transformation

By deploying NetApp FAS6290 and FAS6280 storage systems with controller-based intelligent caching and optimizing the storage environment for processor design workflows, NVIDIA was able to:

• More than double the overall processing efficiency of its compute factory, from 2 million compute jobs per day to 4.5 million

• Improve performance for compiles by up to 19% and simulation run times by up to 25%

• Accommodate 60,000 compute jobs running concurrently at any given time, accelerating workflows

• Expand operations and support with no additional budget and with reduced IT headcount


• Scalability. NetApp allows us to take a modular approach and add more controllers to maintain optimal performance as our data grows. We can scale storage horizontally, which is a very effective model for us. This also reduces risk, because we are not bound to a single point of failure.

• Reliability. We needed a stable, proven data management platform such as NetApp Data ONTAP®. Using NetApp storage controllers clustered in HA pairs allows us to provide seamless failover in case any hardware failure occurs, as well as perform updates without creating any disruption to jobs running in our compute factory. If one domain goes down, it does not bring the entire cluster down.

• Efficiency. We are constantly striving to keep our overall power consumption and hardware footprint in check and increase density wherever possible. NetApp offers a number of technologies that maximize efficiency, including the ability to maintain data consistency with point-in-time Snapshot™ copies that consume minimal storage space. The NetApp volumes are thin provisioned by default, reducing the initial storage space consumption.

• Simplicity. The flexibility to provision storage quickly and to offer shared access to engineering files using Network File System (NFS) and Common Internet File System (CIFS) was essential. Multiprotocol support in the NetApp Unified Storage Architecture enables us to use both protocols (see Figure 1).

How We Doubled Capacity with NetApp

By 2012, our engineering compute infrastructure was nearly at capacity with the NetApp storage we had in place. To support continued innovation, we needed to accommodate more simultaneous workflows and improve performance for compute jobs.

To address this challenge, we deployed NetApp FAS6280 and FAS6290 storage systems with intelligent caching for greater throughput and consolidated standalone systems into HA pairs. We also moved to an updated version of Data ONTAP, which gave us more parallel network threads to process the I/O requests and more balanced CPU utilization across all the cores. We also worked closely with NetApp engineering to do benchmark testing and optimize the storage for our specific EDA tools, without changing or affecting the underlying workflows of our engineering teams (see sidebar: “Building a Customized, Optimized Engineering Factory”).

The end result of the additional NetApp storage systems, caching, and other optimizations is that the overall processing efficiency of our compute factory more than doubled, from 2 million compute jobs per day to 4.5 million. We can accommodate 60,000 compute jobs running concurrently at any given time. Our overall CPU time–to–wall time ratio improved—we’ve seen up to 19% improvement in wall clock performance for compiles and up to 25% improvement in simulation run times.

Building a Customized, Optimized Engineering Factory

At NVIDIA, we appreciate the fact that NetApp is actively engaged with vendors in the semiconductor market to provide acceleration for processor design workflows and to create a storage platform capable of handling the entire chip design lifecycle. We frequently meet with technical experts on the NetApp electronic design automation team to optimize performance in our environment, and these conversations have yielded several insights:

• NetApp partners closely with Red Hat and helped educate us about the use of read-ahead algorithms in the Linux® kernel and how to optimize I/O requests between the client and the storage (see the illustrative sketch after this list).

• Like many semiconductor design companies, we use IBM Platform Computing Load Sharing Facility (LSF) job-scheduling software. NVIDIA is looking forward to exploring a storage-aware plug-in that NetApp has developed to monitor and report on available storage resources for jobs submitted in our compute factory. This allows the LSF scheduler to make informed decisions while submitting jobs, reducing the likelihood of job failures.

• Different electronic design tools have different storage requirements, and NetApp provides detailed recommendations and guidelines for each. Best practices, storage architecture, configuration, and sizing are provided, for example, for Synopsys VCS verification workloads and for Perforce software configuration management deployed on NetApp Data ONTAP storage solutions.
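As an illustration of the kind of client-side knob involved, the sketch below shows one common way to inspect the Linux read-ahead window for a mounted filesystem through its backing device info (BDI) entry in sysfs. It is a generic example under stated assumptions (the /scratch mount point is hypothetical), not our actual tuning.

```python
# Illustrative sketch (not NVIDIA's actual tuning): inspect the client-side
# read-ahead window for an NFS mount on a Linux compute node. For a mounted
# filesystem, read-ahead is exposed via its backing device info (BDI) entry
# in sysfs as /sys/class/bdi/<major:minor>/read_ahead_kb.
import os

def bdi_for_mount(mount_point: str) -> str:
    """Return the "major:minor" BDI identifier for the filesystem that
    backs mount_point, derived from the device number of the path."""
    st = os.stat(mount_point)
    return f"{os.major(st.st_dev)}:{os.minor(st.st_dev)}"

def read_ahead_kb(mount_point: str) -> int:
    """Read the current read-ahead window (in KB) for the mount's BDI."""
    with open(f"/sys/class/bdi/{bdi_for_mount(mount_point)}/read_ahead_kb") as f:
        return int(f.read().strip())

if __name__ == "__main__":
    mount = "/scratch"  # hypothetical scratch-space mount used by EDA jobs
    print(f"{mount}: read_ahead_kb = {read_ahead_kb(mount)}")
```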

Figure 1) NVIDIA engineering compute factory with NetApp Data ONTAP 8.2. Data integrity is preserved completely by Data ONTAP when the same file systems are accessed over NFS and CIFS.

[Figure 1 depicts the stack: users submit jobs to the cluster via the scheduler; a scheduler layer (IBM LSF master node); a compute layer of Linux execution hosts; and NetApp FAS6290 and FAS6280 storage systems running Data ONTAP 8.2.]


Storage Efficiencies Behind Faster Time to Market

NetApp technologies play a major role in the performance, efficiency, and reliability of our compute factory, allowing us to optimize performance for both sequential and random workloads using the same storage platform. This results in faster time to market.

Improving Performance for Small File Random I/O

NetApp does a great job of handling I/O requests as they come in over NFS; part of this is due to WAFL® (Write Anywhere File Layout), which is one of the features that impressed me most about NetApp. Instead of storing data and metadata in predetermined locations on disk, WAFL writes metadata alongside user data using a temporal data layout to minimize the number of disk operations required to commit data to storage. Very small files (less than 64 bytes) are not stored in disk blocks, but rather in inode data structures within the file system; therefore, no disk access (lookup time) is required, improving performance.
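To picture the workload shape that makes this matter, the generic micro-benchmark sketch below generates and randomly reads many tiny files. It is an illustration of the access pattern only, not an NVIDIA benchmark, and the path, file count, and sizes are arbitrary assumptions.

```python
# Generic micro-benchmark sketch of a small-file random-read workload, the
# access pattern that makes metadata and small-file handling so important.
# The path, file count, and file size are arbitrary assumptions.
import os
import random
import time

def make_small_files(root: str, count: int = 5_000, size: int = 64) -> list[str]:
    """Create `count` files of `size` bytes under `root`; return their paths."""
    os.makedirs(root, exist_ok=True)
    paths = []
    for i in range(count):
        path = os.path.join(root, f"f{i:06d}")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths

def random_reads(paths: list[str], reads: int) -> float:
    """Read randomly chosen files `reads` times; return elapsed seconds."""
    start = time.monotonic()
    for _ in range(reads):
        with open(random.choice(paths), "rb") as f:
            f.read()
    return time.monotonic() - start

if __name__ == "__main__":
    files = make_small_files("/tmp/smallfile_bench")
    reads = 20_000
    elapsed = random_reads(files, reads)
    print(f"{reads / elapsed:,.0f} random small-file reads per second")
```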

Optimizing Reads While Saving Space and Power with Intelligent Caching

Much of our workload is dependent on reads, which we accelerate by using NetApp Flash Cache. By caching recently read data and metadata on controller-attached PCIe cards, Flash Cache works as an extended WAFL buffer in the PCI bus, helping us accommodate our very large datasets. We worked closely with NetApp to determine the amount of Flash Cache that makes sense for our workloads and decided to use 512GB and 1TB PCIe cards. As a result, cache usage is always above 90%.

Flash Cache allows us to use a hybrid storage model that mixes high-performance serial-attached SCSI (SAS) drives with higher density, lower cost serial ATA (SATA) drives, helping us minimize our storage footprint and keep our costs down. To achieve our current level of performance without Flash Cache would have required threefold more disk shelves and the corresponding power and cooling resources. Without Flash Cache, using high-capacity SATA disk wouldn’t work in our environment, and scaling our compute factory would be difficult. We would have already outgrown our data centers. In fact, the energy efficiency of the new NetApp storage systems helped us earn a $200,000 rebate from the power company—after we expanded our compute factory’s capacity.

Maintaining Data Consistency While Reducing Risk

Another compelling feature of NetApp storage is Snapshot copies, which are pointer-based, read-only clones of the active file system. WAFL uses a copy-on-write technique to minimize the disk space that Snapshot copies consume, so we can keep point-in-time copies of datasets without sacrificing storage space, and with no performance impact. Snapshot copies help us maintain data consistency, which is critical in an engineering environment, and help mitigate the risk of data loss.

Snapshot copies are a very convenient way for us to temporarily protect data that we don’t need to keep when a compute job is complete, without incurring the cost of duplicate storage. Snapshot copies provide quick recovery in a high–file count environment such as ours by simply flipping the file system pointers—if something goes wrong during an experiment, we can quickly restore to a known state using the copy of the data from Snapshot copies. NVIDIA is using NetApp SnapVault® to back up and NetApp SnapMirror® to replicate the data to a disaster recovery site in Sacramento.
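The underlying idea can be pictured with a deliberately simplified sketch of pointer-based, copy-on-write snapshots in general (not WAFL's actual on-disk implementation): a snapshot copies only block pointers, new writes go to the live map, and a restore is essentially a pointer flip.

```python
# Conceptual sketch of pointer-based, copy-on-write snapshots (the general
# technique, not WAFL's actual on-disk implementation). A "volume" maps
# block numbers to immutable data blocks; a snapshot copies only the map of
# pointers, so unchanged blocks are shared rather than duplicated.

class Volume:
    def __init__(self) -> None:
        self.block_map: dict[int, bytes] = {}             # block number -> data
        self.snapshots: dict[str, dict[int, bytes]] = {}  # name -> frozen map

    def write(self, block: int, data: bytes) -> None:
        # New data goes into the live map; snapshots keep their old pointers.
        self.block_map[block] = data

    def snapshot(self, name: str) -> None:
        # Copy only the pointers (the dict), not the data blocks themselves.
        self.snapshots[name] = dict(self.block_map)

    def restore(self, name: str) -> None:
        # Recovery is effectively a pointer flip back to the saved map.
        self.block_map = dict(self.snapshots[name])

vol = Volume()
vol.write(0, b"design-state-v1")
vol.snapshot("before-experiment")
vol.write(0, b"design-state-v2-bad-run")      # experiment goes wrong
vol.restore("before-experiment")              # quick recovery to known state
assert vol.block_map[0] == b"design-state-v1"
```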

Five Benefits of NetApp FAS6200 Series for Processor Design Workloads

• Processing power. A single FAS6290 controller has 12 processing cores, all of which are used to accelerate data processing so we can handle more concurrent jobs.

• Controller memory (DRAM). With 96GB of memory per controller, metadata can be cached in base memory, which gives us less than 1ms response time for metadata. This is critical to accommodating larger active working set sizes.

• Networking. Two IOH chips in the FAS6290 give us 72 PCIe gen 2 lanes, which are broken out further using switches to create 152 PCIe lanes of I/O connectivity within the FAS6290, with total internal bandwidth in excess of 72GB per second (see the back-of-the-envelope arithmetic after this list).

• NetApp Flash Cache™. Controller-attached PCIe-based intelligent caching reduces the number of spindles we need to achieve the same level of performance and significantly reduces the latency of read operations.

• RAID group optimization. NetApp gives us the flexibility to size our RAID groups appropriately for our scratch-space write workloads, minimizing latency.
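One way to arrive at that internal bandwidth figure, assuming PCIe gen 2 delivers roughly 500MB per second of usable throughput per lane in each direction and counting both directions, is the quick arithmetic below.

```python
# Back-of-the-envelope check of the internal bandwidth figure, assuming
# PCIe gen 2 delivers roughly 500 MB/s of usable throughput per lane per
# direction (5 GT/s with 8b/10b encoding) and counting both directions.
LANES = 72                      # PCIe gen 2 lanes from the two IOH chips
MB_PER_LANE_PER_DIRECTION = 500

per_direction_gbs = LANES * MB_PER_LANE_PER_DIRECTION / 1000   # 36 GB/s
full_duplex_gbs = 2 * per_direction_gbs                        # 72 GB/s
print(f"{per_direction_gbs:.0f} GB/s per direction, {full_duplex_gbs:.0f} GB/s full duplex")
```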


We are also benefitting from NetApp deduplication, which we use to eliminate redundant blocks of data within certain volumes. Deduplication locates identical blocks of data and replaces them with references to a single shared block. This works particularly well for our Perforce software configuration management system, which maintains multiple copies that contain a lot of duplicate data. In these volumes, we’ve reduced capacity requirements by 30%.
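The mechanism can be pictured with a small, generic sketch of block-level deduplication (not NetApp's implementation); the block size and sample data are assumptions chosen for illustration.

```python
# Toy sketch of block-level deduplication (the general technique, not
# NetApp's implementation): identical blocks are detected by fingerprint
# and stored once, with references pointing at the shared physical copy.
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the illustration

def dedupe(data: bytes) -> tuple[dict[str, bytes], list[str]]:
    """Split data into fixed-size blocks; return the store of unique blocks
    keyed by fingerprint plus the ordered list of block references."""
    store: dict[str, bytes] = {}
    refs: list[str] = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # keep one physical copy per fingerprint
        refs.append(fp)               # the logical view is a list of pointers
    return store, refs

# Two "copies" with mostly duplicate content, like branched source trees.
copy_a = b"A" * (8 * BLOCK_SIZE)
copy_b = b"A" * (7 * BLOCK_SIZE) + b"B" * BLOCK_SIZE
store, refs = dedupe(copy_a + copy_b)
logical = len(refs) * BLOCK_SIZE
physical = len(store) * BLOCK_SIZE
print(f"logical {logical} bytes stored in {physical} physical bytes")
```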

Managing More Storage with Less Headcount

Even though the storage capacity in our compute factory has increased significantly, we have not hired more infrastructure people, and our budget has remained flat year over year. In fact, we are able to operate with one fewer full-time employee. This is only possible because NetApp makes it so simple for us to administer and manage our 15PB data footprint.

We use NetApp OnCommand® Unified Manager management software, which gives us performance metrics and utilization statistics at a glance. To identify storage infrastructure issues before they affect compute jobs, we rely on NetApp AutoSupport™, which provides fast response and sends us alerts about disk failures or other potential problems.

Business Impact: Faster Time to Market for NVIDIA and Our Customers

For NVIDIA, a 25% improvement in compute factory efficiency means that chip designs can be tested, validated, and brought to market more quickly. NetApp helped us achieve the higher CPU time–to–wall time ratio that is so essential to our time to market. With the increased performance and capacity, we are able to support more than twice as many jobs per day, which in turn allows us to support more designs. We are also unhampered by downtime, measuring greater than 99.99% availability on our NetApp systems. We have stopped measuring storage uptime, because our NetApp storage is always available when engineers need it.

Accelerating our release cycles also delivers significant business value to our customers, making us a strategic business partner and enabling them to release groundbreaking products based on NVIDIA technology.

What’s Next

With NetApp Flash Cache and other storage efficiencies, we achieved our goal of transforming R&D computing and creating a compute factory that allows innovation to flourish. We are continuing to rely on our partnership with NetApp as we expand and improve our compute factory, and we look forward to more performance gains and additional power and cooling advantages with the next generation of NetApp FAS6000 storage systems. Meanwhile, we are expanding our NetApp footprint in other areas of the business, including corporate IT and our VMware vSphere® virtual server environment.

In the near future, we plan to migrate our compute factory to the NetApp clustered Data ONTAP operating system, which we are currently testing. By combining our existing NetApp storage systems into a single, global namespace under clustered Data ONTAP, we will benefit from seamless scale-out, easy load balancing, and the ability to keep chip design data online continuously throughout its lifecycle.

Pethuraj Perumal
IT Storage Manager, NVIDIA Corporation

Pethuraj Perumal joined NVIDIA as IT Storage Manager in June 2011 and is responsible for a global storage environment of more than 20 petabytes. With more than 15 years of experience managing and architecting complex information technology systems, Pethuraj previously worked as Data Protection Services Manager at Synopsys, a leading provider of semiconductor design software.

www.netapp.com


© 2014 NetApp, Inc. All rights reserved. No portions of this document may be reproduced without prior written consent of NetApp, Inc. Specifications are subject to change without notice. NetApp, the NetApp logo, Go further, faster, AutoSupport, Data ONTAP, Flash Cache, OnCommand, SnapMirror, SnapRestore, Snapshot, SnapVault, and WAFL are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. Linux is a registered trademark of Linus Torvalds. VMware vSphere is a registered trademark of VMware, Inc. Cisco is a registered trademark of Cisco Systems, Inc. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. NA-187-0214

About NVIDIA

Since 1993, NVIDIA (NASDAQ: NVDA) has pioneered the art and science of visual computing. The company’s technologies are transforming a world of displays into a world of interactive discovery—for everyone from gamers to scientists, and consumers to enterprise customers. More information at http://nvidianews.nvidia.com and http://blogs.nvidia.com.

About NetApp

NetApp creates innovative storage and data management solutions that deliver outstanding cost efficiency and accelerate business breakthroughs. Discover our passion for helping companies around the world go further, faster at www.netapp.com.

Go further, faster®

NetApp Products

• NetApp FAS6290 and FAS6280 storage systems

• NetApp Data ONTAP Operating System 8.2

• NetApp OnCommand Unified Manager 5.1

• NetApp Flash Cache

• NetApp deduplication

• NetApp SnapMirror replication technology

• NetApp SnapVault

• NetApp Snapshot and SnapRestore® technologies

• NetApp AutoSupport

Third-Party Products

• IBM Platform LSF job scheduling software

• Perforce software configuration management system

• Synopsys Verilog Compile Simulation (VCS) logic simulation tool

• Red Hat and CentOS Linux

• Cisco® and Arista Networks switches