Six New Tier-2 Centres
• EPSRC provided ~£20m to fund a new wave of Tier-2 HPC centres
• Tier-2 sits between Tier-1 (ARCHER) and Tier-3 (University)
• EPSRC provided capital for the hardware
• Regional centres provide funding for OpEx, e.g.
  • Power and hosting
  • System administrators
  • Research Software Engineers
Lightning talks
• Cirrus – Andy Turner
• JADE – Paul Richmond
• Peta-5 – Filippo Spiga
• Thomas – Heather Kelly
• HPC Midlands Plus – Steven Kenny
• Isambard – Christopher Woods
Cirrus System
• 280-node SGI ICE XA
• 10,080 cores (2× 18-core Broadwell per node)
• 128 GiB memory per node
• DDN Lustre file system
• Single-rail FDR InfiniBand
• Tier-2 RDF
  • Based on DDN Web Object Scalar Appliances
  • Total 1.9 PiB usable
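As a quick cross-check, the headline core count follows directly from the per-node figures above. A minimal sketch (the aggregate memory total is derived here for illustration and is not quoted on the slide):

```python
# Cirrus figures from the slide: 280 nodes, each with
# 2 sockets of 18-core Broadwell and 128 GiB of RAM.
nodes = 280
cores_per_node = 2 * 18

total_cores = nodes * cores_per_node
total_memory_gib = nodes * 128

print(total_cores)       # 10080, matching the headline figure
print(total_memory_gib)  # 35840 GiB across the system (derived)
```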
• Coordination across EPSRC Tier-2 sites
• Technical working group
• SAFE integration and development (if wanted by sites)
Access
• 30% resources for EPCC
  • Industrial projects
  • Research projects
• 70% EPSRC
  • Apply via regular lightweight calls (similar to ARCHER RAP)
  • Independent panel awards resources
  • Access free for EPSRC researchers
  • Other RC researchers charged by resources requested
• Coordination between ARCHER/Tier-2 sites:
  • Multiple systems for research workflows
  • Exchange projects between Tier-2 and ARCHER to ensure best fit
  • Avoid applicants applying for the same resources twice
Support and Timescales
• In-depth technical support from EPCC experts provided as part of the service
  • Technical advice
  • Porting, performance analysis and optimisation
  • Software installation and support
• Online training planned (initially similar to the ARCHER Driving Test)
• Cirrus Champion: Andy Turner ([email protected])
• Full user access 1 April 2017
• Documentation and information already available: http://cirrus.readthedocs.io
The JADE System
• 22× NVIDIA DGX-1
• 3.740 PetaFLOPs (FP16)
• 2.816 Terabytes HBM GPU memory
• 1 PB filestore
• P100 GPUs – optimised for Deep Learning
  • NVLink between devices
  • PCIe to host (dense nodes)
• Use cases
  • 50% ML (Deep Learning)
  • 30% MD
  • 20% other
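The headline JADE figures can be reproduced from the DGX-1 specification. Note the 8 GPUs per DGX-1 and the ~170 TFLOPs FP16 per DGX-1 are taken from NVIDIA's published datasheet, not from the slide:

```python
# JADE: 22 NVIDIA DGX-1 systems (from the slide).
dgx1_systems = 22
gpus_per_dgx1 = 8           # P100s per DGX-1 (vendor spec, assumed)
hbm_per_gpu_gb = 16         # GB of HBM2 per P100 (from the slide)
fp16_per_dgx1_tflops = 170  # NVIDIA's quoted FP16 figure (assumed)

total_gpus = dgx1_systems * gpus_per_dgx1                       # 176 GPUs
total_hbm_tb = total_gpus * hbm_per_gpu_gb / 1000               # 2.816 TB
total_fp16_pflops = dgx1_systems * fp16_per_dgx1_tflops / 1000  # 3.74 PFLOPs

print(total_gpus, total_hbm_tb, total_fp16_pflops)
```

Both derived totals match the quoted 2.816 TB of HBM and ~3.740 PetaFLOPs FP16.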
Procurement, Hosting and Access
• ATOS has been selected as the preferred bidder
  • Following the procurement committee's review of the tenders
• Running costs to be recouped through selling time to industrial users
• To be hosted by STFC Daresbury
• Will run the SLURM scheduler, scheduling at the node level
• Resource allocation
  • Open to all without charge
  • Some priority to supporting institutions
  • Light-touch review process (similar to DiRAC)
Co-Investigators, Governance and RSE Support
• All Co-Is have committed RSE support time for their local institutions
  • To support local users of the JADE system
• Role of the Tier-2 RSE network to be agreed
• Training
  • Some commitment to training offered by some Co-Is (Alan Gray at EPCC, Paul Richmond EPSRC RSE Fellow)
• Governance via steering committee
  • Prof. Anne Trefethen (chair), Oxford University CIO and PVC
  • an EPSRC representative
  • Alison Kennedy, Director of STFC Hartree Centre
  • Prof. Simon McIntosh-Smith, Bristol
  • a representative of the machine learning community
  • a representative of the molecular dynamics community
Cambridge Tier-2 “Peta-5” – Current Status
• Tier-2 theme: Data Intensive Computation and Analytics
• EPSRC PI: Prof. Paul Alexander
• Local Director: Dr. Paul Calleja
• Co-PI: G. Pullan, L. Drumright, MC. Payne, S. McIntosh-Smith, O. Kenway, M. Giles, C. Schönlieb, SJ. Cox, PD. Haynes, F. Fraternali, M. Wilkinson
• Institutions involved: Cambridge, Oxford, Bristol, KCL, UCL, ICL, Southampton, Leicester
• RAC: under definition
• Total expenditure: ~£10m (£5m EPSRC funded)
• Procurement: concluded before Christmas, contract signed.
• Deployment: KNL & GPU - April ’17, x86 SkyLake & “co-design I/O” - July ‘17
• General Service: July / August ‘17
Filippo SPIGA <[email protected]>
Cambridge Tier-2 “Peta-5” – HW components and Delivery
• KNL: 21,888 cores, ~0.5 PFlop/s (+ 4 login nodes)
• 342 nodes, Xeon Phi 7210 (64 cores), 96 GB RAM, 120 GB SSD, OPA 2:1
• GPU: 1,080 cores, 360 GPU, ~1.0 PFlop/s (+ 4 login nodes)
• 90 nodes, 12c E5-2650 v4, 4x GPU P100 16 GB, 96 GB RAM, 120 GB SSD, EDR 2:1
• SkyLake: 24,576 cores, ~1.0 PFlop/s (+ 8 login nodes)
• (low mem) 384 nodes, 2x 16c “6142”, 192 GB RAM, 120 GB SSD, OPA 2:1
• (high mem) 384 nodes, 2x 16c “6142”, 384 GB RAM, 120 GB SSD, OPA 2:1
• Lustre: 5 PB usable
• 16 OSS, 2 MDS, 4 MDT, 8 LNET + IEEL support (over OPA)
• Tape: housing up to 10 PB tapes
• Hadoop: 50 nodes (600 TB) + 12 nodes (288 TB) (+ 2 head nodes)
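The quoted core and GPU totals for each Peta-5 partition follow from the per-node figures above. A small sanity-check sketch (the slide's 1,080-core GPU-partition total implies a single 12-core host CPU per GPU node, which is an inference, not an explicit statement):

```python
# Peta-5 partition arithmetic, using the per-node figures above.
partitions = {
    # name: (nodes, cores_per_node)
    "KNL":     (342, 64),        # Xeon Phi 7210, 64 cores per node
    "GPU":     (90, 12),         # one 12-core E5-2650 v4 host CPU (inferred)
    "SkyLake": (384 + 384, 32),  # low- and high-memory nodes, 2x 16 cores
}

for name, (nodes, cores_per_node) in partitions.items():
    print(name, nodes * cores_per_node)
# KNL 21888, GPU 1080, SkyLake 24576 -- matching the headline counts

gpus = 90 * 4  # 4x P100 per GPU node -> 360 GPUs, as quoted
print(gpus)
```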
Cambridge Tier-2 “Peta-5” – Support & RSE
• Pool of 3 FTE RSEs available to help port new applications and exploit community codes on heterogeneous systems
• Cambridge RSE time handled by RAC, coordinated by Cambridge team
• Main training activities
• Training on GPU (OpenACC, CUDA FORTRAN, CUDA C)
• Training on Intel Many-Core (in collaboration with DELL and Intel)
• Training on Parallel I/O (under development)
• Happy to open to other Tier-2 RSE groups for testing purposes
• Rules of engagement yet to be defined; limited allocations to be reviewed
• Tier-2 “Champion” / Reference RSE contact: Mr. Filippo Spiga
Centre Facilities
• 14,336 x86 cores, consisting of 512 nodes each with 2× Intel Xeon E5-2680 v4 CPUs (14 cores per CPU) and 128 GB RAM per node
• 3:1 blocking EDR InfiniBand network, giving 756-core non-blocking islands
• 1 PB GPFS filestore
• 5× 20-core POWER8 systems, each with 1 TB RAM, connected to the InfiniBand network
• Dedicated 10 TB filestore for pre-staging files
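The facility's headline figures are internally consistent, as a short check shows: the core count follows from the node specification, and the 756-core non-blocking island size corresponds to a whole number of nodes on the 3:1 blocking fabric.

```python
# Centre facility arithmetic from the figures above.
nodes = 512
cores_per_node = 2 * 14       # 2x Intel Xeon E5-2680 v4, 14 cores each

total_cores = nodes * cores_per_node
print(total_cores)            # 14336, matching the headline figure

# A 756-core non-blocking island corresponds to a whole number of nodes:
print(756 // cores_per_node)  # 27 nodes per island
print(756 % cores_per_node)   # 0, so islands align with node boundaries
```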
Centre Setup
• Centre installation complete by end of March
• Pilot service in the early part of April
• Full service by the end of April
• 15% of time on the system released as part of the Tier-2 national service
• Research Software Engineers at all of the sites across the centre
• Training courses across the centre
• Team
  • Universities of Bath, Bristol, Cardiff and Exeter
  • Met Office and Cray
• Hardware
  • Stage 1: technology comparison: x86, KNL and Pascal
  • Stage 2: Cray CS-400, 10,000+ ARMv8 cores
• Procurement
  • Going well – installing Stage 1 in mid-March
  • Stage 2 will be installed between March and December
GW4 – Isambard
• 20–25% will be made available to the community via resource allocation calls
• Plan to also make it available via a continuous integration platform
• 2+ RSEs hired to support code development and porting, plus Cray will provide engineer time
• Training materials will be made available
• Tier-2 Champion is Christopher Woods
Tier-2 Opportunities
• Large number of RSEs being hired to support lots of code porting and optimisation
  • Avoid duplication of effort
  • Let's share knowledge and work together
• Lots of training will be provided
  • How will this be advertised?
• Lots of allocation calls and time available
  • How will this be advertised?
  • How can a researcher or RSE at University X find out what is available, get training, get support and apply for access?
Tier-2 Champions
• Proposed Tier-2 Champions: they will help share information
• Provide a named person at each Tier-2 centre
• Will ensure information regarding training, support, access and documentation is shared with the wider community
• Will help RSEs at different Tier-2 centres to talk to each other, share knowledge, and avoid duplication of effort
Tier-2 Champions Website
• Main portal for sharing information
• Hosted on the UK RSE website (http://rse.ac.uk)
• WordPress, so easy to add and edit content
Tier-2 Champions Website
• Contains
  • Short description of what is provided by each centre, with links to each centre's website
  • Links to training materials provided by each centre
  • List of training courses and links to the sign-up pages
  • List of upcoming resource calls and links to the sign-up pages
• Information on how to get access to time for the purpose of running external training workshops
  • e.g. gaining access to Isambard from University X for running a training course at University X
Tier-2 Champions: Questions
• How can Tier-2 integrate with Tier-1?
• How can Tier-2 champions integrate with ARCHER champions?
• What else is needed on the Tier-2 website to support external users of each centre?
• Is it OK to start the website hosting under UK RSE?
• Who else would want to run the website?
• What forums are available to allow Tier-2 RSEs to communicate, share best practice and avoid duplication of effort?