Storage Developer Conference 2008 © 2008 Insert Copyright Information Here. All Rights Reserved.
Storage on the Lunatic Fringe
Thomas M. Ruwart, Chief [Mad] Scientist
SNIA Developers Conference, San Jose, CA, September 25, 2008
Why you are here
- To learn that there are organizations with bigger data storage requirements than yours
- What some of their issues and problems are
- How they are addressing those issues and problems
- A glimpse into the future of data storage hardware and software technologies, and possible solutions
Orientation
- A bit of history
- Who are the lunatics in 2008-2009?
- What are their requirements?
- Why is this interesting to the Storage Industry?
- What is anyone doing about this?
- Conclusions
A bit of History

1988:
- Supercomputer Centers operating with HUGE disk farms of 50-100 GB!
- 1 GB disk drives cost $20,000 each
- 8-inch form factor, 60 lbs, average seek time of 15 ms, 3600 RPM

1995:
- 3.5-inch half-height disk drives are the standard form factor at 4 GB/disk
- Built a 1+ TB array using 296 4 GB 3.5-inch disks
- 3600 RPM, average seek time 12 ms, 2 lbs, $2,000 per disk drive ($500/GB)
- 37 RAID5 7+1 disk arrays mounted in 5 racks
- More than $1M in disk arrays
- Created a single SGI xFS file system across all the drives
- Created a single 1 TB file

[Photo: Technician working on IBM 3380 Disk Drive, ©1986]
A bit more history…

- 2002 ASCI Q – 700 TB – online, high performance, pushing the limits of traditional [legacy] block-based file systems
- 2004 ASCI Red Storm – 240 TB – online, high bandwidth, massively parallel
- 2006 – ASC Purple at LLNL
  - 269 racks for the entire machine
  - 12,208 processors in 131 racks
  - 48 racks just for switches (17,000 cables)
  - 2 PB of storage: >11,000 disks in 90 racks
- 2008 – Roadrunner at LANL
  - First PetaFLOP machine (10^15 FLOPS)
  - 6,912 AMD dual-core Opterons plus 12,960 IBM Cell eDP processors
  - 80 TB main memory (aggregate)
  - 216 GB/sec sustained I/O to storage (432 x 10 GigE)

See the Top500 List for complete details: www.top500.org
[Chart: Number of Processors by Rank – processor count per system (log scale, 1 to 1,000,000) versus Top500 rank (1-500)]
Looking Ahead…

- 2009 – ZIA – joint development between Sandia National Lab and LANL
  - 2 PFLOPS
  - 256K processor cores
  - 2 PB disk storage
  - 1-2 TB/sec sustained bandwidth
- 2011-2012 – 10 PFLOPS and beyond
  - Blue Waters at NCSA – University of Illinois and State of Illinois joint project for open peta-scale computing
  - >> 200,000 processors
  - >> 800 TB main memory
  - >> 10 PB disk storage
- Looking way ahead, 2020 – Tom retires ☺
Who said that?
“I think there is a world market for maybe five computers.”
– Thomas Watson (1874-1956), Chairman of IBM, 1943

“There is no reason anyone would want a computer in their home.”
– Ken Olson, president, chairman and founder of Digital Equipment Corp., 1977

“640K ought to be enough for anybody.”
– Bill Gates (1955-), in 1981

“Who the hell wants to hear actors talk?”
– H. M. Warner (1881-1958), founder of Warner Brothers, in 1927

“Everything that can be invented has been invented.”
– Charles H. Duell, Commissioner, U.S. Office of Patents, 1899
Who would say this?
- Who on earth needs an ExaByte (10^18 bytes) of storage space?
- Who needs a TeraByte-per-second data transfer rate from storage to the application?
- Who needs millions, billions, trillions of data transactions per second?
- Who would ever need to manage a trillion files?

You did not hear these questions from me…
Who are the Lunatics?
- High-End Computing (HEC) Community
  - BIG data or LOTS of data, locally and widely distributed, high-bandwidth access or high transaction rates, relatively few users, secure, short-term and long-term retention
- High Energy Physics (HEP) – Fermilab, CERN, DESY
  - BIG data, locally distributed, widely available, moderate number of users, sparse access, long-term retention
- DARPA HPCS sets the requirements
HEP – LHC at CERN
The LHC ( http://lhc.web.cern.ch/lhc/ )
- $750M experiment built at CERN in Switzerland
- Activating this year (2008) – Holy black holes, Batman…

The Easy Part – collecting the data:
- Data rate from the detectors is ~1 PB/sec
- Data rate after filtering is a few GB/sec

The Hard Part – storing and accessing the data:
- Dataset for a single experiment is ~1 PB
- Several experiments are run per year
- Must be made available to 5,000 scientists all over the planet (Earth, primarily) for the next 10-25 years
- Dense dataset, sparse data access by any one scientist
- Access patterns are not deterministic
[Diagram: LHC Data Grid Hierarchy (CMS as example; Atlas is similar). The online system feeds event reconstruction at CERN (Tier 0+1) at ~PB/sec from the detector, with a physics data cache at ~100 MB/sec. Data flows at ~2.5 Gbps to Tier 1 regional centers (French, German, Italian, FermiLab USA), then at ~0.6-2.5 Gbps to Tier 2 centers, and on to institutes (Tier 3, ~0.25 TIPS each) and workstations (Tier 4) at 100-1000 Mbits/sec for analysis and event simulation. Courtesy Harvey Newman, Caltech and CERN.]

- CERN/CMS data goes to 6-8 Tier 1 regional centers, and from each of these to 6-10 Tier 2 centers.
- Physicists work on analysis “channels” at 135 institutes. Each institute has ~10 physicists working on one or more channels.
- 2,000 physicists in 31 countries are involved in this 20-year experiment, in which DOE is a major player.
- CMS detector: 15m x 15m x 22m, 12,500 tons, $700M (a human, at 2m, shown for scale).
What are the DARPA requirements?
HEC Community – the High Productivity Computing Systems (HPCS) program from DARPA:

- 10^15 computations per second – peta-scale computing
- 1-10 trillion files in a single file system
- 100s of thousands of processors
- Millions of process threads, all needing and generating data
- 1-100 TB/sec aggregate bandwidth to disk
- 30,000+ file creations per second
- Focus on ease of use, efficiency, and RAS
Why is the Number of Processors Important?
- An indicator of the number of independent program threads that need access to storage
- When the number of processors is greater than the number of disks, I/O will be “random”
- We are past the age of purely sequential bandwidth
- We are currently in the age of purely random data access patterns
- This is strictly a result of the computer architecture
What are we getting ourselves into?
What is 1 TB/sec bandwidth to disk? About 20,000 disk drives:
- @ 50 MB/sec/disk average (assumes no seeks)
- @ 10 ms average access time ≈ 2 million IOPS
- @ 1 TB/disk ≈ 20 PB raw capacity
- @ 25 watts/disk (including cooling power) ≈ 500 kW

A real design would need 24,000-40,000 disk drives to include redundancy:
- Space and power/cooling increase up to 2x, ≈ 1 MW

And that is just the beginning… 10 TB/sec would be up to 400,000 disk drives…
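The arithmetic behind these figures can be checked directly. A minimal sketch, using only the per-disk assumptions stated on this slide (50 MB/sec sustained, 10 ms access, 1 TB capacity, 25 W including cooling):

```python
# Back-of-the-envelope sizing for 1 TB/sec of aggregate disk bandwidth,
# using the per-disk assumptions from the slide above.
TARGET_BW = 1e12          # 1 TB/sec aggregate bandwidth target
DISK_BW = 50e6            # 50 MB/sec per disk, no seeks
ACCESS_TIME = 0.010       # 10 ms average access time
DISK_CAPACITY_TB = 1      # 1 TB per disk
DISK_POWER_W = 25         # watts per disk, including cooling

drives = int(TARGET_BW / DISK_BW)                 # 20,000 drives
iops = drives * (1 / ACCESS_TIME)                 # ~2 million IOPS
capacity_pb = drives * DISK_CAPACITY_TB / 1000    # 20 PB raw
power_kw = drives * DISK_POWER_W / 1000           # 500 kW

print(drives, int(iops), capacity_pb, power_kw)   # 20000 2000000 20.0 500.0
```

Scaling linearly, the 10 TB/sec case is simply ten times each of these numbers.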
The Storage Event Horizon
1 GByte/sec ≈ 20 Disk Drives 10 GBytes/sec ≈ 200 Disk Drives 100 GBytes/sec ≈ 2,000 Disk Drives 1 TByte/sec ≈ 20,000 Disk Drives
~~~~~~~~~~Storage Event Horizon ~~~~~~~~~~~
10 TBytes/sec ≈ 200,000 Disk Drives 100 TBytes/sec ≈ 2,000,000 Disk Drives 1 PByte/sec ≈ 20,000,000 Disk Drives
What does 1TB/sec really mean?
1 TB/sec to what?
- 1,000 processes @ 1 GB/sec each?
- 100,000 processes @ 10 MB/sec each?

This assumes a process/processor can absorb/generate data at that rate:
- The current ratio of instruction execution rate to I/O transfer rate is about 1000:1, based on ZIA requirements – all machines are different
- Therefore, 1 PFLOP implies a 1 TB/sec I/O transfer rate
- 1 EFLOP implies an I/O transfer rate of 1 PB/sec, or 20 million disk drives – oops!
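That 1000:1 ratio turns FLOP targets into drive counts mechanically. A sketch, assuming the 50 MB/sec-per-disk figure used earlier in this talk:

```python
def drives_for(flops, flops_per_io_byte=1000, disk_bw=50e6):
    """Disks needed to feed a machine, given FLOPs per byte/sec of I/O."""
    io_bytes_per_sec = flops / flops_per_io_byte
    return int(io_bytes_per_sec / disk_bw)

print(drives_for(1e15))   # 1 PFLOP  -> 20,000 drives
print(drives_for(1e18))   # 1 EFLOP  -> 20,000,000 drives
```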
Digging ourselves in deeper?
1 Trillion Files:
- 30,000 file creations per second for 1 year ≈ 1 trillion files
- 1 PB of metadata to describe 1 trillion files
- Finding any one file among 1 trillion files
- Finding anything inside the 1 trillion files

This is a major transactional problem, not a bandwidth problem:
- Traditional file systems and their associated [POSIX] semantics break down at these scales – new/relaxed semantics are needed
- Is the concept of a “file” still valid in this context?
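The trillion-file figure follows from the creation rate. A quick check; the ~1 KB of metadata per file is my assumption, implied by the slide's 1 PB over a trillion files:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
creations_per_sec = 30_000

files_per_year = creations_per_sec * SECONDS_PER_YEAR
print(f"{files_per_year:.2e}")   # ~9.46e+11, i.e. roughly 1 trillion files

# Assumed metadata footprint: ~1 KB per file (hypothetical, but consistent
# with "1 PB of metadata for 1 trillion files" on the slide above).
metadata_bytes = 10**12 * 1_000
print(metadata_bytes / 1e15)     # 1.0 PB
```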
The Growing Disk Drive Bottleneck
Subsystem                          1993[1]   2007E[1]   Increase
Network I/O[2]                     0.001     2          2000x
Intel CPU                          0.48      100        200x
Storage Channel I/O[3]             0.05      4          80x
PCI[7]                             0.13      16         123x
Intel Front Side Processor Bus     0.53      13         24x
Random Disk IOPS[5]                90        150        1.7x
Random Disk IOPS per GByte[5,6]    43        4.2        -10x
Sequential Disk I/O[4]             0.005     0.1        20x
Sequential Disk BW/GByte           0.005     0.0001     -50x

Notes:
[1] Speed of subsystem in GBps
[2] Ethernet
[3] SCSI and Fibre Channel
[4] IBM 3.5-inch drives, internal data rate
[5] IBM 3.5-inch drives, seek + rotational latency
[6] Horison/Fred Moore
[7] PCI versus 16x PCIe
Source: www.ArchiveBuilders.com, "Evolution of Intel Microprocessors: 1971 to 2001"
Need more disks, not higher capacity ones
Disk drive capacity improves faster than:
- Data transfer rate
- Seek time
- Rotational latency
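One way to see the imbalance is access density: random IOPS per gigabyte falls as drives grow, because the head count stays fixed while the platters get denser. An illustrative sketch; the ~150 IOPS/drive figure comes from the bottleneck table above, and the capacity points are hypothetical examples:

```python
IOPS_PER_DRIVE = 150   # random IOPS for a single drive (from the table above)

# Hypothetical capacity points, chosen only to show the trend:
for capacity_gb in (36, 146, 500, 1000):
    density = IOPS_PER_DRIVE / capacity_gb
    print(f"{capacity_gb:5d} GB -> {density:.2f} IOPS/GB")
```

At fixed per-drive IOPS, a 1 TB drive delivers under 4% of the access density of a 36 GB drive, which is why bandwidth and IOPS targets force drive counts up even when capacity targets do not.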
Access Density
Serious Questions
- How do you package it?
- How do you maintain it?
- How do you connect it all together?
- How do you access/use a storage system with 250,000 disk drives?
How do you package this?
- Conservatively, 200 x 3½-inch disks per rack with controllers
- 200 racks of disk drives and controllers
- About 4,000 square feet
- 10 TB/sec is 10 times this, or about the size of one football field (~40,000 sq ft)
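The floor-space figure can be reproduced from the packing assumptions. A sketch; the ~20 sq ft per rack (footprint plus aisle/service space) is my assumption, implied by 200 racks fitting in 4,000 sq ft:

```python
drives = 40_000          # 1 TB/sec design, including redundancy
drives_per_rack = 200
sqft_per_rack = 20       # assumed: rack footprint plus aisle/service space

racks = drives // drives_per_rack       # 200 racks
floor_sqft = racks * sqft_per_rack      # 4,000 sq ft
print(racks, floor_sqft, 10 * floor_sqft)   # 10 TB/sec -> ~40,000 sq ft
```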
How do you maintain it?
Assume:
- a 40,000-disk configuration
- 2,000,000 hours MTBF per enterprise-class disk
- 500,000 hours MTBF per consumer-class disk

Then:
- ~4 disk failures per week for enterprise-class disks
- ~15 failures per week for consumer-class disks
- Continual rebuilds in progress
- 10 TB/sec is 10 times this
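The failure-rate arithmetic is straightforward: at steady state, the expected failure rate of a population is roughly population size divided by per-unit MTBF. A sketch using the MTBF figures above:

```python
def failures_per_week(drive_count, mtbf_hours):
    """Expected steady-state failures per week: population / MTBF."""
    hours_per_week = 7 * 24
    return drive_count * hours_per_week / mtbf_hours

print(failures_per_week(40_000, 2_000_000))  # enterprise: ~3.4/week
print(failures_per_week(40_000, 500_000))    # consumer:  ~13.4/week
```

With RAID rebuild times measured in hours to days on large drives, a failure every half day or so means the array is effectively always rebuilding somewhere.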
How do you connect it all together?
- 10 Gbit/sec/channel → 1,000 channels @ 100% efficiency
- Implies a 2,000-channel non-blocking switch fabric
- What about transceiver failure rates?
- When it breaks, how do you find the broken transceiver?
- 10 TB/sec – who on earth would want to do that? (don’t ask)
How do you use this?
- Current file system technology is based on 30+ year-old designs and does not scale
- The disk I/O software stack is 30+ years old and does not scale
- Lots of innovation is needed in many areas:
  - Common shared file system interfaces
  - Data Life Cycle Management and seamless integration into existing HEC environments
  - Changes to standards that offer greater scalability without sacrificing data integrity
  - Streaming I/O from zillions of single nodes
  - Data alignment, small-block, large-block, and RAID issues
  - File system metadata

[Diagram: two side-by-side stacks of Application / Operating System / Storage and Transport layers]
Commodity Reliability And Practices
- Processors, networks, and graphics engines have for the most part gone “commodity”
- Disk drives are still largely “enterprise-class”
- There is significant pressure to move toward more use of commodity disk drives
- This requires a fundamental change in how we think about RAS for storage – i.e., Fail-In-Place
  - Assumes something is always in the process of breaking
  - Engineering must re-orient to think about how to build reliable systems using unreliable components
  - AKA – how to build reliable systems using CRAP (Commodity Reliability And Practices)
History has shown…
- The problems that the Lunatic Fringe is working on today are the problems that will become mainstream in 3-5 years
- Legacy data access hardware and software mechanisms are breaking down at these scales
- We need to continue to innovate:
  - Individually at all levels
  - Globally across levels
  - Re-orienting our thinking on many levels
What’s happening now?
- Areal density is at about 250 Gigabits per square inch
- The 3.5-inch form factor is currently the standard
- The 2.5-inch form factor is emerging in the enterprise
- SAS and SATA are getting significant traction
- OSD has been demonstrated and is in active development
- Consumer-grade storage is cheap, cheap, cheap
- Commodity interface speeds are up to 20-40 Gigabits/sec
- Storage and network processing engines are available
- New applications for storage are rapidly evolving
- Relaxed POSIX standards
- NFS v4 and Parallel NFS
Common thread
Their data storage capacity, access, and retention requirements are continually increasing. Some of the technologies and concepts the Lunatic Fringe is looking at include:

- Object-based Storage Devices
- Intelligent storage systems
- Data grids
- High-density disk drive packaging
- Commodity Reliability And Practices – building reliable systems with inherently unreliable components (or: building reliable systems using CRAP)
- New and/or improved software standards
- Error detection techniques and methods
Conclusions
- Lunatic Fringe users will continue to push the limits of existing hardware and software technologies
- The Lunatic Fringe is a moving target – there will always be a Lunatic Fringe well beyond where you are
- The Storage Industry at large should pay attention to:
  - What they are doing
  - Why they are doing it
  - What they learn
Some Interesting Sites…
- www.llnl.gov – Lawrence Livermore National Laboratory
- www.lanl.gov – Los Alamos National Laboratory
- www.sandia.gov – Sandia National Laboratories
- www.top500.org – The Top500 List
- www.ncsa.uiuc.edu – NCSA
- www.psc.edu – Pittsburgh Supercomputing Center
- www.tacc.utexas.edu – Texas Advanced Computing Center
- www.ornl.gov – Oak Ridge National Laboratory
- http://lhc.web.cern.ch/lhc – CERN and the LHC
Government Research
- DoE ASCI Tri-Labs – LANL, LLNL, Sandia
  - Lustre ( www.lustre.org )
  - Parallel NFS ( www.ietf.org/proceedings/04mar/slides/nfsv4-1.pdf )
  - NFS Version 4 ( nfsv4.org )
- DICE – Data Intensive Computing Environments ( http://www.avetec.org/dice/ )
- NASA and the IEEE – Mass Storage Technical Committee – annual symposium on Mass Storage Systems and Technologies (MSSTC) ( www.storageconference.org )
Academic Storage Research
- University of Minnesota Digital Technology Center Intelligent Storage Consortium (DISC) – www.dtc.umn.edu/programs/DISC.html
- University of California Santa Cruz Storage Systems Research Center (SSRC) – http://ssrc.soe.ucsc.edu
- CMU Parallel Data Lab (PDL) – www.pdl.cmu.edu
Thank you
Thomas M. Ruwart, Chief [Mad] Scientist