Update on the Spider II File System


DESCRIPTION

In this presentation from the DDN User Meeting at SC13, Sarp Oral provides an update on the Spider II file system at Oak Ridge National Laboratory. Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/


16 Sarp Oral | SC’13, DDN User Meeting

Spider II Specs

Test and Development System

•  1 SFA12K-40
•  IB FDR
•  10 60-disk enclosures
•  560 2 TB NL-SAS drives

Scalable Storage System

•  36 SFA12K-40
•  IB FDR
•  10 60-disk trays/couplet
•  560 2 TB NL-SAS drives/couplet
•  20,160 drives
•  40 PB capacity (raw)
•  > 1 TB/s performance

Facts

•  32 PB capacity (after RAID)
•  > 1 TB/s performance
•  288 Lustre OSS total
•  8 OSS per couplet
•  4 MDS and 2 MGS
•  Configured in 4 rows
•  2x 108-port FDR IB switches
•  36x 36-port FDR IB switches
•  440 Lustre Titan LNET routers (432 for OSS, 8 for MDS)
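The totals quoted on this slide follow from the per-couplet figures, and a short cross-check makes the relationships explicit. This is only a sketch: the variable names are illustrative, capacities use decimal units (1 PB = 1000 TB), and the usable fraction is simply the ratio of the two quoted capacities, not a statement about the RAID layout.

```python
# Per-couplet figures as quoted on the slide.
COUPLETS = 36
DRIVES_PER_COUPLET = 560
DRIVE_TB = 2
OSS_PER_COUPLET = 8
ROUTERS_FOR_OSS = 432
ROUTERS_FOR_MDS = 8
USABLE_PB = 32  # "after RAID" capacity quoted on the slide

total_drives = COUPLETS * DRIVES_PER_COUPLET        # matches the quoted 20,160 drives
raw_pb = total_drives * DRIVE_TB / 1000             # decimal PB; slide rounds to 40 PB
total_oss = COUPLETS * OSS_PER_COUPLET              # matches the quoted 288 OSS
total_routers = ROUTERS_FOR_OSS + ROUTERS_FOR_MDS   # matches the quoted 440 routers
usable_fraction = USABLE_PB / raw_pb                # ratio of quoted usable to raw capacity

print(f"drives: {total_drives}")
print(f"raw capacity: {raw_pb:.2f} PB")
print(f"OSS: {total_oss}, LNET routers: {total_routers}")
print(f"usable / raw: {usable_fraction:.2f}")
```

The roughly 0.8 usable-to-raw ratio is derived here from the two capacities on the slide; the slide itself does not state the RAID geometry.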

Spider II - Architecture

Enterprise Storage controllers and large racks of disks are connected via InfiniBand.
36 DataDirect SFA12K-40 controller pairs with 2 Tbyte NL-SAS drives and 8 InfiniBand FDR connections per pair

Storage Nodes run parallel file system software and manage incoming FS traffic.
288 Dell servers with 64 GB of RAM each

SION II Network provides connectivity between OLCF resources and primarily carries storage traffic.
1600 ports, 56 Gbit/sec InfiniBand switch complex

Lustre Router Nodes run parallel file system client software and forward I/O operations from HPC clients.
432 XK7 XIO nodes configured as Lustre routers on Titan

Titan XK7

Other OLCF resources

Link speeds:
XK7 Gemini 3D Torus: 9.6 Gbytes/sec per direction
InfiniBand: 56 Gbit/sec
Serial ATA: 6 Gbit/sec
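To put the link counts in this diagram in context, the aggregate nominal InfiniBand bandwidth on the storage side can be estimated from the quoted figures. The sketch below uses the 56 Gbit/sec signaling rate as given; usable data rates are lower after encoding and protocol overhead, so this is an upper bound rather than a measured number. Variable names are illustrative.

```python
# Figures quoted in the architecture diagram.
CONTROLLER_PAIRS = 36
FDR_LINKS_PER_PAIR = 8
FDR_GBIT_PER_S = 56      # nominal signaling rate per FDR link
STORAGE_NODES = 288

storage_links = CONTROLLER_PAIRS * FDR_LINKS_PER_PAIR   # 288 FDR links in total
links_per_node = storage_links / STORAGE_NODES          # one controller link per storage node
aggregate_gbit = storage_links * FDR_GBIT_PER_S         # nominal Gbit/sec across all links
aggregate_tbyte = aggregate_gbit / 8 / 1000             # nominal Tbyte/sec (decimal units)

print(f"FDR links to controllers: {storage_links}")
print(f"links per storage node: {links_per_node:.0f}")
print(f"aggregate nominal bandwidth: {aggregate_tbyte:.3f} Tbyte/sec")
```

Under these assumptions the controller-side links total about 2 Tbyte/sec nominal, which leaves headroom above the > 1 TB/s performance target.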

Spider II - Facilities

•  Sits on a 36" raised floor and is forced-air cooled
•  4 identical rows in hot-aisle/cold-aisle configuration
    –  9 racks for DDN SFA12K-40 equipment
    –  1 infrastructure rack
    –  Cold aisle is fully contained with overhead panels and sliding doors at each end of the rows
        •  Prevents hot-air/cold-air mixing and increases cooling efficiency
•  25% perforated tiles used to provide cold air to cold aisles
•  Fully compliant with the requisite National Fire Protection Association (NFPA) codes
•  Total space required is 672 square feet

Spider II - Facilities

•  Ran a series of tests on a DDN SFA12K-40 testbed unit under various I/O mode and load scenarios
    –  9 kW per DDN rack nominal load
•  Total file system load including infrastructure racks is 400 kW and total cooling load is 114 tons
•  Each rack is fed with a pair of 208 VAC 3-phase electrical feeds, protected by a 50 A 10%-rated breaker
    –  Fed from two different transformer sources
    –  DDN SFA12K power distribution system is both load balanced and supports fail-over; OLCF can conduct both scheduled and unscheduled maintenance on one transformer without disrupting the file system operation
    –  Neither electrical connection is protected by UPS
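The power and cooling figures on the Facilities slides can be related with some simple arithmetic. The sketch below is a rough check, not a facilities calculation: it assumes 1 ton of refrigeration is about 3.517 kW (12,000 BTU/h), a balanced 3-phase load at unity power factor, and 4 rows of 9 DDN racks plus 1 infrastructure rack each. All variable names are illustrative.

```python
import math

# Figures quoted on the Facilities slides.
ROWS = 4
DDN_RACKS_PER_ROW = 9
DDN_RACK_KW = 9            # nominal load per DDN rack
TOTAL_LOAD_KW = 400        # includes infrastructure racks
FEED_VOLTS = 208           # 3-phase line-to-line voltage
BREAKER_AMPS = 50

# Assumption: standard conversion, 1 ton of refrigeration ~ 3.517 kW.
KW_PER_TON = 3.517

ddn_racks = ROWS * DDN_RACKS_PER_ROW                 # 36 DDN racks
ddn_load_kw = ddn_racks * DDN_RACK_KW                # 324 kW for DDN racks
infra_load_kw = TOTAL_LOAD_KW - ddn_load_kw          # remainder for infrastructure racks
cooling_tons = TOTAL_LOAD_KW / KW_PER_TON            # rounds to the quoted 114 tons

# Assumption: balanced load, unity power factor.
feed_kva = math.sqrt(3) * FEED_VOLTS * BREAKER_AMPS / 1000
rack_amps_single_feed = DDN_RACK_KW * 1000 / (math.sqrt(3) * FEED_VOLTS)

print(f"DDN rack load: {ddn_load_kw} kW, infrastructure: {infra_load_kw} kW")
print(f"cooling load: {cooling_tons:.1f} tons")
print(f"capacity of one feed at breaker rating: {feed_kva:.1f} kVA")
print(f"current if one feed carries a 9 kW rack: {rack_amps_single_feed:.1f} A")
```

Under these assumptions a single feed carrying a whole 9 kW rack draws about half of the 50 A breaker rating, which is consistent with the slide's claim that maintenance on one transformer does not disrupt the file system.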

Integration efforts

•  Lustre 2.4 testing
    –  Small-scale
        •  Round-the-clock testing for stability, regression, and performance on a single-cabinet Cray XK7 (Arthur)
        •  Home-built Cray Lustre 2.4 client as well as servers
        •  Early detection and correction of problems and bugs
    –  Large-scale
        •  Weekly testing on Titan
        •  Identified some number of problems at scale
•  IB FDR testing on Cray
    –  Cray and Mellanox

Schedule

•  System infrastructure delivery
    –  Completed
•  Block storage delivery
    –  Completed
•  Block acceptance
    –  Completed
    –  Achieved 1.3 TB/s for reads and 1.2 TB/s for writes at the block level
    –  Need to revisit for a few items Q1'14
•  Lustre support with Intel
    –  Completed. Level 1, 2, and 3 support with Intel
•  File system integration
    –  Completed
•  Rolling into production
    –  Completed
•  Performance tuning
    –  Ongoing. To be completed by Q1'14.
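The block-level acceptance numbers can be broken down per couplet to give a feel for what each SFA12K-40 pair delivered. This is a back-of-the-envelope sketch that assumes the aggregate was spread evenly across the 36 couplets quoted on the specs slide; the presentation does not report per-couplet results. Names are illustrative and units are decimal.

```python
# Aggregate block-level results quoted on the Schedule slide.
READ_TB_PER_S = 1.3
WRITE_TB_PER_S = 1.2
COUPLETS = 36  # from the specs slide

# Assumption: load spread evenly across all couplets.
read_gb_per_couplet = READ_TB_PER_S * 1000 / COUPLETS
write_gb_per_couplet = WRITE_TB_PER_S * 1000 / COUPLETS

print(f"read:  {read_gb_per_couplet:.1f} GB/s per couplet")
print(f"write: {write_gb_per_couplet:.1f} GB/s per couplet")
```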


Questions? oralhs@ornl.gov

The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
