Storage Developer Conference 2008 © 2008 Insert Copyright Information Here. All Rights Reserved.
Storage on the Lunatic Fringe
Thomas M. Ruwart, Chief [Mad] Scientist
SNIA Developers Conference, San Jose, CA, September 25, 2008
Why you are here
- To learn that there are organizations with bigger data storage requirements than yours
- What some of their issues and problems are
- How they are addressing those issues and problems
- A glimpse into the future of data storage hardware and software technologies, and possible solutions
Orientation
- A bit of history
- Who are the lunatics in 2008-2009?
- What are their requirements?
- Why is this interesting to the Storage Industry?
- What is anyone doing about this?
- Conclusions
A bit of History

1988:
- Supercomputer Centers operating with HUGE disk farms of 50-100 GB!
- 1 GB disk drives cost $20,000 each
- 8-inch form factor, 60 lbs, average seek time of 15 ms, 3600 RPM

1995:
- 3.5-inch half-height disk drives are the standard form factor at 4 GB/disk
- Built a 1+ TB array using 296 4 GB 3.5-inch disks
- 3600 RPM, average seek time 12 ms, 2 lbs, $2,000 per disk drive ($500/GB)
- 37 RAID5 7+1 disk arrays mounted in 5 racks
- More than $1M in disk arrays
- Created a single SGI xFS file system across all the drives
- Created a single 1 TB file

[Photo: Technician working on IBM 3380 Disk Drive, ©1986]
A bit more history…

- 2002 ASCI Q – 700 TB – online, high performance, pushing the limits of traditional [legacy] block-based file systems
- 2004 ASCI Red Storm – 240 TB – online, high bandwidth, massively parallel
- 2006 – ASC Purple at LLNL
  - 269 racks for the entire machine
  - 12,208 processors in 131 racks
  - 48 racks just for switches (17,000 cables)
  - 2 PB of storage: >11,000 disks in 90 racks
- 2008 – Roadrunner at LANL
  - First PetaFLOP machine (10^15 FLOPS)
  - 6,912 AMD dual-core Opterons plus 12,960 IBM Cell eDP processors
  - 80 TB main memory (aggregate)
  - 216 GB/sec sustained I/O to storage (432 x 10 GigE)

See the Top500 List for complete details: www.top500.org
[Chart: Number of Processors by Rank – processor count per system (log scale, 1 to 1,000,000) versus Top500 rank (1-500)]
Looking Ahead…

- 2009 – ZIA – joint development between Sandia National Lab and LANL
  - 2 PFLOPS
  - 256K processor cores
  - 2 PB disk storage
  - 1-2 TB/sec sustained bandwidth
- 2011-2012 – 10 PFLOPS and beyond
  - Blue Waters at NCSA – University of Illinois and State of Illinois joint project for open peta-scale computing
  - >> 200,000 processors
  - >> 800 TB main memory
  - >> 10 PB disk storage
- Looking way ahead, 2020 – Tom retires ☺
Who said that?
“I think there is a world market for maybe five computers.”
– Thomas Watson (1874-1956), Chairman of IBM, 1943

“There is no reason anyone would want a computer in their home.”
– Ken Olson, president, chairman and founder of Digital Equipment Corp., 1977

“640K ought to be enough for anybody.”
– Bill Gates (1955-), in 1981

“Who the hell wants to hear actors talk?”
– H. M. Warner (1881-1958), founder of Warner Brothers, in 1927

“Everything that can be invented has been invented.”
– Charles H. Duell, Commissioner, U.S. Office of Patents, 1899
Who would say this?
- Who on earth needs an ExaByte (10^18 bytes) of storage space?
- Who needs a TeraByte-per-second data transfer rate from storage to the application?
- Who needs millions, billions, trillions of data transactions per second?
- Who would ever need to manage a trillion files?

You did not hear these questions from me…
Who are the Lunatics?
- High-End Computing (HEC) Community
  - BIG data or LOTS of data, locally and widely distributed, high-bandwidth access or high transaction rates, relatively few users, secure, short-term and long-term retention
- High Energy Physics (HEP) – Fermilab, CERN, DESY
  - BIG data, locally distributed, widely available, moderate number of users, sparse access, long-term retention
- DARPA HPCS sets the requirements
HEP – LHC at CERN
The LHC ( http://lhc.web.cern.ch/lhc/ )
- $750M experiment built at CERN in Switzerland
- Activating this year (2008) – Holy black holes, Batman…

The Easy Part – collecting the data:
- Data rate from the detectors is ~1 PB/sec
- Data rate after filtering is a few GB/sec

The Hard Part – storing and accessing the data:
- Dataset for a single experiment is ~1 PB
- Several experiments are run per year
- Must be made available to 5,000 scientists all over the planet (Earth, primarily) for the next 10-25 years
- Dense dataset, sparse data access by any one scientist
- Access patterns are not deterministic
[Diagram: LHC Data Grid Hierarchy (CMS as example; Atlas is similar). The online system feeds event reconstruction at CERN (Tier 0+1) at ~PB/sec from the detector, with a physics data cache at ~100 MB/sec. Data flows at ~2.5 Gbps to Tier 1 regional centers (French, German, Italian, FermiLab USA), then at ~0.6-2.5 Gbps to Tier 2 centers, and on to institutes (Tier 3, ~0.25 TIPS each) and workstations (Tier 4) at 100-1000 Mbits/sec for analysis and event simulation. Courtesy Harvey Newman, Caltech and CERN.]

- CERN/CMS data goes to 6-8 Tier 1 regional centers, and from each of these to 6-10 Tier 2 centers.
- Physicists work on analysis “channels” at 135 institutes. Each institute has ~10 physicists working on one or more channels.
- 2,000 physicists in 31 countries are involved in this 20-year experiment, in which DOE is a major player.
- CMS detector: 15m x 15m x 22m, 12,500 tons, $700M (a human, at 2m, shown for scale).
What are the DARPA requirements?
HEC Community – the High Productivity Computing Systems (HPCS) program from DARPA:

- 10^15 computations per second – peta-scale computing
- 1-10 trillion files in a single file system
- 100s of thousands of processors
- Millions of process threads, all needing and generating data
- 1-100 TB/sec aggregate bandwidth to disk
- 30,000+ file creations per second
- Focus on ease of use, efficiency, and RAS
Why is the Number of Processors Important?
- An indicator of the number of independent program threads that need access to storage
- When the number of processors is greater than the number of disks, I/O will be “random”
- We are past the age of purely sequential bandwidth
- We are currently in the age of purely random data access patterns
- This is strictly a result of the computer architecture
What are we getting ourselves into?
What is 1 TB/sec bandwidth to disk? About 20,000 disk drives:
- @ 50 MB/sec/disk average (assumes no seeks)
- @ 10 ms average access time ≈ 2 million IOPS
- @ 1 TB/disk ≈ 20 PB raw capacity
- @ 25 watts/disk (including cooling power) ≈ 500 kW

A real design would need 24,000-40,000 disk drives to include redundancy:
- Space and power/cooling increase up to 2x, ≈ 1 MW

And that is just the beginning… 10 TB/sec would be up to 400,000 disk drives…
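The arithmetic behind these figures can be checked directly. A minimal sketch, using only the per-disk assumptions stated on this slide (50 MB/sec sustained, 10 ms access, 1 TB capacity, 25 W including cooling):

```python
# Back-of-the-envelope sizing for 1 TB/sec of aggregate disk bandwidth,
# using the per-disk assumptions from the slide above.
TARGET_BW = 1e12          # 1 TB/sec aggregate bandwidth target
DISK_BW = 50e6            # 50 MB/sec per disk, no seeks
ACCESS_TIME = 0.010       # 10 ms average access time
DISK_CAPACITY_TB = 1      # 1 TB per disk
DISK_POWER_W = 25         # watts per disk, including cooling

drives = int(TARGET_BW / DISK_BW)                 # 20,000 drives
iops = drives * (1 / ACCESS_TIME)                 # ~2 million IOPS
capacity_pb = drives * DISK_CAPACITY_TB / 1000    # 20 PB raw
power_kw = drives * DISK_POWER_W / 1000           # 500 kW

print(drives, int(iops), capacity_pb, power_kw)   # 20000 2000000 20.0 500.0
```

Scaling linearly, the 10 TB/sec case is simply ten times each of these numbers.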
The Storage Event Horizon
1 GByte/sec ≈ 20 Disk Drives 10 GBytes/sec ≈ 200 Disk Drives 100 GBytes/sec ≈ 2,000 Disk Drives 1 TByte/sec ≈ 20,000 Disk Drives
~~~~~~~~~~Storage Event Horizon ~~~~~~~~~~~
10 TBytes/sec ≈ 200,000 Disk Drives 100 TBytes/sec ≈ 2,000,000 Disk Drives 1 PByte/sec ≈ 20,000,000 Disk Drives
What does 1TB/sec really mean?
1 TB/sec to what?
- 1,000 processes @ 1 GB/sec each?
- 100,000 processes @ 10 MB/sec each?

This assumes a process/processor can absorb/generate data at that rate:
- The current ratio of instruction execution rate to I/O transfer rate is about 1000:1, based on ZIA requirements – all machines are different
- Therefore, 1 PFLOP implies a 1 TB/sec I/O transfer rate
- 1 EFLOP implies an I/O transfer rate of 1 PB/sec, or 20 million disk drives – oops!
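That 1000:1 ratio turns FLOP targets into drive counts mechanically. A sketch, assuming the 50 MB/sec-per-disk figure used earlier in this talk:

```python
def drives_for(flops, flops_per_io_byte=1000, disk_bw=50e6):
    """Disks needed to feed a machine, given FLOPs per byte/sec of I/O."""
    io_bytes_per_sec = flops / flops_per_io_byte
    return int(io_bytes_per_sec / disk_bw)

print(drives_for(1e15))   # 1 PFLOP  -> 20,000 drives
print(drives_for(1e18))   # 1 EFLOP  -> 20,000,000 drives
```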
Digging ourselves in deeper?
1 Trillion Files:
- 30,000 file creations per second for 1 year ≈ 1 trillion files
- 1 PB of metadata to describe 1 trillion files
- Finding any one file among 1 trillion files
- Finding anything inside the 1 trillion files

This is a major transactional problem, not a bandwidth problem:
- Traditional file systems and their associated [POSIX] semantics break down at these scales – new/relaxed semantics are needed
- Is the concept of a “file” still valid in this context?
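The trillion-file figure follows from the creation rate. A quick check; the ~1 KB of metadata per file is my assumption, implied by the slide's 1 PB over a trillion files:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
creations_per_sec = 30_000

files_per_year = creations_per_sec * SECONDS_PER_YEAR
print(f"{files_per_year:.2e}")   # ~9.46e+11, i.e. roughly 1 trillion files

# Assumed metadata footprint: ~1 KB per file (hypothetical, but consistent
# with "1 PB of metadata for 1 trillion files" on the slide above).
metadata_bytes = 10**12 * 1_000
print(metadata_bytes / 1e15)     # 1.0 PB
```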
The Growing Disk Drive Bottleneck
Subsystem                          1993[1]   2007E[1]   Increase
Network I/O[2]                     0.001     2          2000x
Intel CPU                          0.48      100        200x
Storage Channel I/O[3]             0.05      4          80x
PCI[7]                             0.13      16         123x
Intel Front Side Processor Bus     0.53      13         24x
Random Disk IOPS[5]                90        150        1.7x
Random Disk IOPS per GByte[5,6]    43        4.2        -10x
Sequential Disk I/O[4]             0.005     0.1        20x
Sequential Disk BW/GByte           0.005     0.0001     -50x

Notes:
[1] Speed of subsystem in GBps
[2] Ethernet
[3] SCSI and Fibre Channel
[4] IBM 3.5-inch drives, internal data rate
[5] IBM 3.5-inch drives, seek + rotational latency
[6] Horison/Fred Moore
[7] PCI versus 16x PCIe
Source: www.ArchiveBuilders.com, "Evolution of Intel Microprocessors: 1971 to 2001"
Need more disks, not higher capacity ones
Disk drive capacity improves faster than:
- Data transfer rate
- Seek time
- Rotational latency
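One way to see the imbalance is access density: random IOPS per gigabyte falls as drives grow, because the head count stays fixed while the platters get denser. An illustrative sketch; the ~150 IOPS/drive figure comes from the bottleneck table above, and the capacity points are hypothetical examples:

```python
IOPS_PER_DRIVE = 150   # random IOPS for a single drive (from the table above)

# Hypothetical capacity points, chosen only to show the trend:
for capacity_gb in (36, 146, 500, 1000):
    density = IOPS_PER_DRIVE / capacity_gb
    print(f"{capacity_gb:5d} GB -> {density:.2f} IOPS/GB")
```

At fixed per-drive IOPS, a 1 TB drive delivers under 4% of the access density of a 36 GB drive, which is why bandwidth and IOPS targets force drive counts up even when capacity targets do not.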
Access Density
Serious Questions
- How do you package it?
- How do you maintain it?
- How do you connect it all together?
- How do you access/use a storage system with 250,000 disk drives?
How do you package this?
- Conservatively, 200 x 3½-inch disks per rack with controllers
- 200 racks of disk drives and controllers
- About 4,000 square feet
- 10 TB/sec is 10 times this, or about the size of one football field (~40,000 sq ft)
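The floor-space figure can be reproduced from the packing assumptions. A sketch; the ~20 sq ft per rack (footprint plus aisle/service space) is my assumption, implied by 200 racks fitting in 4,000 sq ft:

```python
drives = 40_000          # 1 TB/sec design, including redundancy
drives_per_rack = 200
sqft_per_rack = 20       # assumed: rack footprint plus aisle/service space

racks = drives // drives_per_rack       # 200 racks
floor_sqft = racks * sqft_per_rack      # 4,000 sq ft
print(racks, floor_sqft, 10 * floor_sqft)   # 10 TB/sec -> ~40,000 sq ft
```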
How do you maintain it?
Assume:
- a 40,000-disk configuration
- 2,000,000 hours MTBF per enterprise-class disk
- 500,000 hours MTBF per consumer-class disk

Then:
- ~4 disk failures per week for enterprise-class disks
- ~15 failures per week for consumer-class disks
- Continual rebuilds in progress
- 10 TB/sec is 10 times this
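The failure-rate arithmetic is straightforward: at steady state, the expected failure rate of a population is roughly population size divided by per-unit MTBF. A sketch using the MTBF figures above:

```python
def failures_per_week(drive_count, mtbf_hours):
    """Expected steady-state failures per week: population / MTBF."""
    hours_per_week = 7 * 24
    return drive_count * hours_per_week / mtbf_hours

print(failures_per_week(40_000, 2_000_000))  # enterprise: ~3.4/week
print(failures_per_week(40_000, 500_000))    # consumer:  ~13.4/week
```

With RAID rebuild times measured in hours to days on large drives, a failure every half day or so means the array is effectively always rebuilding somewhere.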
How do you connect it all together?
- 10 Gbit/sec/channel → 1,000 channels @ 100% efficiency
- Implies a 2,000-channel non-blocking switch fabric
- What about transceiver failure rates?
- When it breaks, how do you find the broken transceiver?
- 10 TB/sec – who on earth would want to do that? (don’t ask)
How do you use this?
- Current file system technology is based on 30+ year-old designs and does not scale
- The disk I/O software stack is 30+ years old and does not scale
- Lots of innovation is needed in many areas:
  - Common shared file system interfaces
  - Data Life Cycle Management and seamless integration into existing HEC environments
  - Changes to standards that offer greater scalability without sacrificing data integrity
  - Streaming I/O from zillions of single nodes
  - Data alignment, small-block, large-block, and RAID issues
  - File system metadata

[Diagram: two side-by-side stacks of Application / Operating System / Storage and Transport layers]
Commodity Reliability And Practices
- Processors, networks, and graphics engines have for the most part gone “commodity”
- Disk drives are still largely “enterprise-class”
- There is significant pressure to move toward more use of commodity disk drives
- This requires a fundamental change in how we think about RAS for storage – i.e., Fail-In-Place
  - Assumes something is always in the process of breaking
  - Engineering must re-orient to think about how to build reliable systems using unreliable components
  - AKA – how to build reliable systems using CRAP (Commodity Reliability And Practices)
History has shown…
- The problems that the Lunatic Fringe is working on today are the problems that will become mainstream in 3-5 years
- Legacy data access hardware and software mechanisms are breaking down at these scales
- We need to continue to innovate:
  - Individually at all levels
  - Globally across levels
  - Re-orienting our thinking on many levels
What’s happening now?
- Areal density is at about 250 Gigabits per square inch
- The 3.5-inch form factor is currently the standard
- The 2.5-inch form factor is emerging in the enterprise
- SAS and SATA are getting significant traction
- OSD has been demonstrated and is in active development
- Consumer-grade storage is cheap, cheap, cheap
- Commodity interface speeds are up to 20-40 Gigabits/sec
- Storage and network processing engines are available
- New applications for storage are rapidly evolving
- Relaxed POSIX standards
- NFS v4 and Parallel NFS
Common thread
Their data storage capacity, access, and retention requirements are continually increasing. Some of the technologies and concepts the Lunatic Fringe is looking at include:

- Object-based Storage Devices
- Intelligent storage systems
- Data grids
- High-density disk drive packaging
- Commodity Reliability And Practices – building reliable systems with inherently unreliable components (or: building reliable systems using CRAP)
- New and/or improved software standards
- Error detection techniques and methods
Conclusions
- Lunatic Fringe users will continue to push the limits of existing hardware and software technologies
- The Lunatic Fringe is a moving target – there will always be a Lunatic Fringe well beyond where you are
- The Storage Industry at large should pay attention to:
  - What they are doing
  - Why they are doing it
  - What they learn
Some Interesting Sites…
- www.llnl.gov – Lawrence Livermore National Laboratory
- www.lanl.gov – Los Alamos National Laboratory
- www.sandia.gov – Sandia National Laboratories
- www.top500.org – The Top500 List
- www.ncsa.uiuc.edu – NCSA
- www.psc.edu – Pittsburgh Supercomputing Center
- www.tacc.utexas.edu – Texas Advanced Computing Center
- www.ornl.gov – Oak Ridge National Laboratory
- http://lhc.web.cern.ch/lhc – CERN and the LHC
Government Research
- DoE ASCI Tri-Labs – LANL, LLNL, Sandia
  - Lustre ( www.lustre.org )
  - Parallel NFS ( www.ietf.org/proceedings/04mar/slides/nfsv4-1.pdf )
  - NFS Version 4 ( nfsv4.org )
- DICE – Data Intensive Computing Environments ( http://www.avetec.org/dice/ )
- NASA and the IEEE – Mass Storage Technical Committee – annual symposium on Mass Storage Systems and Technologies (MSSTC) ( www.storageconference.org )
Academic Storage Research
- University of Minnesota Digital Technology Center Intelligent Storage Consortium (DISC) – www.dtc.umn.edu/programs/DISC.html
- University of California Santa Cruz Storage Systems Research Center (SSRC) – http://ssrc.soe.ucsc.edu
- CMU Parallel Data Lab (PDL) – www.pdl.cmu.edu
Thank you
Thomas M. Ruwart, Chief [Mad] Scientist