
Tier-2

Six New Tier-2 Centres

• EPSRC provided ~£20m to fund a new wave of Tier-2 HPC centres
• Tier-2 sits between Tier-1 (ARCHER) and Tier-3 (university)
• EPSRC provided capital for the hardware
• Regional centres provide funding for OpEx, e.g.:
  • Power and hosting
  • System administrators
  • Research Software Engineers

Lightning talks

• Cirrus – Andy Turner
• JADE – Paul Richmond
• Peta-5 – Filippo Spiga
• Thomas – Heather Kelly
• HPC Midlands Plus – Steven Kenny
• Isambard – Christopher Woods

ARCHER Champions, 10 Feb 2017
Andy Turner
[email protected]

Cirrus System

• 280-node SGI ICE XA
• 10,080 cores (2x 18-core Broadwell per node)
• 128 GiB memory per node
• DDN Lustre filesystem
• Single-rail FDR Infiniband
• Tier-2 RDF
  • Based on DDN Web Object Scalar appliances
  • Total 1.9 PiB usable

• Coordination across EPSRC Tier-2 sites
  • Technical working group
  • SAFE integration and development (if wanted by sites)


Access

• 30% of resources for EPCC
  • Industrial projects
  • Research projects
• 70% EPSRC
  • Apply via regular lightweight calls (similar to ARCHER RAP)
  • Independent panel awards resources
  • Access free for EPSRC researchers
  • Other RC researchers charged by resources requested
• Coordination between ARCHER and Tier-2 sites:
  • Multiple systems for research workflows
  • Exchange projects between Tier-2 and ARCHER to ensure best fit
  • Avoid applicants applying for the same resources twice

Support and Timescales

• In-depth technical support from EPCC experts provided as part of the service
  • Technical advice
  • Porting, performance analysis and optimisation
  • Software installation and support
• Online training planned (initially similar to the ARCHER Driving Test)
• Cirrus Champion: Andy Turner ([email protected])
• Full user access from 1 April 2017
• Documentation and information already available:
  http://cirrus.readthedocs.io

The JADE System

• 22 NVIDIA DGX-1
• 3.740 PetaFLOPs (FP16)
• 2.816 Terabytes HBM GPU memory
• 1 PB filestore
• P100 GPUs, optimised for Deep Learning
  • NVLink between devices
  • PCIe to host (dense nodes)
• Use cases
  • 50% ML (Deep Learning)
  • 30% MD
  • 20% Other
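As a rough illustration of the dominant deep-learning use case, the sketch below shows a single multi-GPU training step on one node's P100s. This is a minimal, hedged example only: it assumes PyTorch is available, and the model, dummy data and the nn.DataParallel choice are placeholders, not a JADE-specific recipe.

```python
# Minimal multi-GPU training sketch for the deep-learning use case.
# Assumptions (not from the slides): PyTorch is installed; model, data and
# the DataParallel strategy are illustrative placeholders only.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("GPUs visible to this job:", torch.cuda.device_count())

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across the GPUs in the node; on
    # DGX-1-class hardware the inter-GPU traffic travels over NVLink.
    model = nn.DataParallel(model)
model.to(device)

optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 512, device=device)          # dummy batch of features
y = torch.randint(0, 10, (64,), device=device)   # dummy class labels

optimiser.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimiser.step()
print("one training step done, loss =", loss.item())
```

For multi-node runs one would more likely use DistributedDataParallel, but DataParallel keeps the sketch within a single node, matching the per-node DGX-1 layout described above.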

Procurement, Hosting and Access

• Atos have been selected as the preferred bidder
  • Following the procurement committee's review of tenders
• Running costs to be recouped through selling time to industrial users
• To be hosted by STFC Daresbury
• Will run the SLURM scheduler for scheduling at the node level
• Resource allocation
  • Open to all without charge
  • Some priority to supporting institutions
  • Light-touch review process (similar to DiRAC)

Co-Investigators, Governance and RSE Support

• All CIs have committed RSE support time for their local institutions
  • To support local users of the JADE system
  • Role of the Tier-2 RSE network to be agreed
• Training
  • Some commitment to training offered by some CIs (Alan Gray at EPCC, Paul Richmond, EPSRC RSE Fellow)
• Governance via steering committee
  • Prof. Anne Trefethen (chair), Oxford University CIO and PVC
  • an EPSRC representative
  • Alison Kennedy, Director of the STFC Hartree Centre
  • Prof. Simon McIntosh-Smith, Bristol
  • a representative of the machine learning community
  • a representative of the molecular dynamics community

Cambridge Tier-2 “Peta-5” – Current Status

• Tier-2 theme: Data Intensive Computation and Analytics

• EPSRC PI: Prof. Paul Alexander

• Local Director: Dr. Paul Calleja

• Co-PI: G. Pullan, L. Drumright, MC. Payne, S. McIntosh-Smith, O. Kenway, M. Giles, C. Schönlieb, SJ. Cox, PD. Haynes, F. Fraternali, M. Wilkinson

• Institutions involved: Cambridge, Oxford, Bristol, KCL, UCL, ICL, Southampton, Leicester

• RAC: under definition

• Total expenditure: ~£10m (£5m EPSRC funded)

• Procurement: concluded before Christmas, contract signed.

• Deployment: KNL & GPU - April ’17, x86 SkyLake & “co-design I/O” - July ‘17

• General Service: July / August ‘17

Filippo SPIGA <[email protected]>

Cambridge Tier-2 “Peta-5” – HW components and Delivery

• KNL: 21,888 cores, ~0.5 PFlop/s (+ 4 login nodes)

• 342 nodes, Xeon Phi 7210 (64 cores), 96 GB RAM, 120 GB SSD, OPA 2:1

• GPU: 1,080 cores, 360 GPU, ~1.0 PFlop/s (+ 4 login nodes)

• 90 nodes, 12c E5-2650 v4, 4x GPU P100 16 GB, 96 GB RAM, 120 GB SSD, EDR 2:1

• SkyLake: 24,576 cores, ~1.0 PFlop/s (+ 8 login nodes)

• (low mem) 384 nodes, 2x 16c “6142”, 192 GB RAM, 120 GB SSD, OPA 2:1

• (high mem) 384 nodes, 2x 16c “6142”, 384 GB RAM, 120 GB SSD, OPA 2:1

• Lustre: 5 PB usable

• 16 OSS, 2 MDS, 4 MDT, 8 LNET + IEEL support (over OPA)

• Tape: housing for up to 10 PB of tape

• Hadoop: 50 nodes (600 TB) + 12 nodes (288 TB) (+ 2 head nodes)


Cambridge Tier-2 “Peta-5” – Support & RSE

• Pool of 3 FTE RSEs available to help port new applications and exploit community codes on heterogeneous systems

• Cambridge RSE time handled by RAC, coordinated by Cambridge team

• Main training activities

• Training on GPU (OpenACC, CUDA FORTRAN, CUDA C)

• Training on Intel Many-Core (in collaboration with DELL and Intel)

• Training on Parallel I/O (under development)
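As a rough indication of the kind of pattern parallel-I/O training typically covers, below is a minimal sketch of collective writes to a single shared file. It is a hedged illustration only: it assumes mpi4py and an MPI-enabled (parallel HDF5) build of h5py, neither of which is named on the slide, and the file name and sizes are placeholders.

```python
# Minimal parallel-I/O sketch: every MPI rank writes its own slice of one
# shared HDF5 dataset. Assumptions (not from the slides): mpi4py and an
# MPI-enabled build of h5py are available; names and sizes are illustrative.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

n_local = 1024                                   # elements written per rank
data = np.full(n_local, rank, dtype="f8")

# All ranks open the same file collectively; each writes its own slice of a
# single shared dataset, so the output is one file rather than one per rank.
with h5py.File("demo_parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("values", (nranks * n_local,), dtype="f8")
    dset[rank * n_local:(rank + 1) * n_local] = data

if rank == 0:
    print("wrote", nranks * n_local, "values to demo_parallel.h5")
```

Such a script would be launched under MPI, e.g. `mpirun -n 4 python demo_parallel.py`.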

• Happy to open the systems to other Tier-2 RSE groups for testing purposes

• Rules of engagement yet to be defined; limited allocations to be reviewed

• Tier-2 “Champion” / Reference RSE contact: Mr. Filippo Spiga


HPC Midlands Plus
Prof. Steven Kenny
Department of Materials, Loughborough University

Centre Facilities

• 14,336 x86 cores, consisting of 512 nodes each with 2x Intel Xeon E5-2680 v4 CPUs (14 cores per CPU) and 128 GB RAM per node
• 3:1 blocking EDR Infiniband network, giving 756-core non-blocking islands
• 1 PB GPFS filestore
• 5x 20-core POWER8 systems, each with 1 TB RAM, connected to the Infiniband network
• Dedicated 10 TB filestore for pre-staging files

Centre Setup

• Centre installation complete by end of March
• Pilot service in the early part of April
• Full service by the end of April
• 15% of time on the system released as part of the Tier-2 national service
• Research Software Engineers at all of the sites across the centre
• Training courses across the centre

GW4 – Isambard

• Team
  • Universities of Bath, Bristol, Cardiff and Exeter
  • Met Office and Cray
• Hardware
  • Stage 1: technology comparison (x86, KNL and Pascal)
  • Stage 2: Cray CS-400, 10,000+ ARMv8 cores
• Procurement
  • Going well; installing Stage 1 in mid-March
  • Stage 2 will be installed between March and December

GW4 – Isambard

• 20-25% will be made available to the community via resource allocation calls
• Plan to also make it available via a continuous integration platform
• 2+ RSEs hired to support code development and porting, plus Cray will provide engineer time
• Training materials will be made available
• Tier-2 Champion is Christopher Woods

Tier-2 Opportunities

• Large number of RSEs being hired to support lots of code porting and optimisation
  • Avoid duplication of effort
  • Let's share knowledge and work together
• Lots of training will be provided
  • How will this be advertised?
• Lots of allocation calls and time available
  • How will this be advertised?
• How can a researcher or RSE at University X find out what is available, get training, get support and apply for access?

Tier-2 Champions

• Proposed Tier-2 Champions. They will help share information
  • Provide a named person at each Tier-2 centre
  • Will ensure information regarding training, support, access and documentation is shared with the wider community
  • Will help RSEs at different Tier-2 centres to talk to each other, share knowledge, and avoid duplication of effort

Tier-2 Champions Website

Main portal for sharing information

Hosted on the UK RSE website (http://rse.ac.uk)

WordPress, so it is easy to add and edit content

Tier-2 Champions Website

• Contains:
  • Short description of what is provided by each centre, with links to each centre's website
  • Links to training materials provided by each centre
  • List of training courses and links to the sign-up pages
  • List of upcoming resource calls and links to the sign-up pages
  • Information on how to get access to time for the purpose of running external training workshops
    • e.g. gaining access to Isambard from University X for running a training course at University X

Tier-2 Champions: Questions

• How can Tier-2 integrate with Tier-1?
• How can Tier-2 Champions integrate with ARCHER Champions?
• What else is needed on the Tier-2 website to support external users of each centre?
• Is it OK to start the website hosting under UK RSE?
  • Who else would want to run the website?
• What forums are available to allow Tier-2 RSEs to communicate, share best practice and avoid duplication of effort?