LA-UR-18-25993 Crossroads 2021 Technical Requirements Document Dated 07-19-18 RFP No. 511017 Page 1 of 56 Crossroads 2021 Technical Requirements Document LA-UR-18-25993 SAND2018-7366 O Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy under contract 89233218CNA000001. LA-UR-18-25993. Approved for public release; distribution is unlimited. Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2018-7366 O.


Crossroads 2021: Technical Requirements

1 INTRODUCTION
  1.1 SCHEDULE
2 SYSTEM DESCRIPTION
  2.1 ARCHITECTURAL DESCRIPTION
  2.2 SOFTWARE DESCRIPTION
  2.3 PRODUCT ROADMAP DESCRIPTION
  2.4 RISK MITIGATION STRATEGY
3 TARGETS FOR SYSTEM DESIGN, FEATURES, AND PERFORMANCE METRICS
  3.1 SCALABILITY
  3.2 SYSTEM SOFTWARE AND RUNTIME
  3.3 SOFTWARE TOOLS AND PROGRAMMING ENVIRONMENT
  3.4 PLATFORM STORAGE
  3.5 APPLICATION PERFORMANCE
  3.6 RESILIENCE, RELIABILITY, AND AVAILABILITY
  3.7 APPLICATION TRANSITION SUPPORT AND EARLY ACCESS TO ACES TECHNOLOGIES
  3.8 TARGET SYSTEM CONFIGURATION
  3.9 SYSTEM OPERATIONS
  3.10 POWER AND ENERGY
  3.11 FACILITIES AND SITE INTEGRATION
4 OPTIONS
  4.1 UPGRADES, EXPANSIONS AND ADDITIONS
  4.2 EARLY ACCESS DEVELOPMENT SYSTEM
  4.3 TEST SYSTEMS
  4.4 ONSITE SYSTEM AND APPLICATION SOFTWARE ANALYSTS
  4.5 DEINSTALLATION
  4.6 MAINTENANCE AND SUPPORT
5 DELIVERY AND ACCEPTANCE
  5.1 PRE-DELIVERY TESTING
  5.2 SITE INTEGRATION AND POST-DELIVERY TESTING
  5.3 ACCEPTANCE TESTING
6 RISK AND PROJECT MANAGEMENT
7 DOCUMENTATION AND TRAINING
  7.1 DOCUMENTATION
  7.2 TRAINING
8 REFERENCES
APPENDIX A: SAMPLE ACCEPTANCE PLAN
APPENDIX B: TRIAD SPECIFIC PROJECT MANAGEMENT REQUIREMENTS
DEFINITIONS AND GLOSSARY


1 Introduction

The Department of Energy (DOE) National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) Program requires a computing system be deployed in 2021 to support the Stockpile Stewardship Program. In response to this requirement, Triad National Security, LLC (TNS), in furtherance of its participation in the Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos National Laboratory and Sandia National Laboratories, is releasing a Request for Proposal (RFP) for a next generation system, Crossroads.

In the 2021 timeframe, Trinity, the first ASC Advanced Technology System (ATS-1), will be nearing the end of its useful lifetime. Crossroads, the proposed ATS-3 system, provides a replacement, tri-lab computing resource for existing simulation codes and provides a resource for ever-increasing computing requirements to support the weapons program. The Crossroads system, to be sited at Los Alamos, NM, is projected to provide a large portion of the ATS resources for the NNSA ASC tri-lab simulation community: Los Alamos National Laboratory (LANL), Sandia National Laboratories (SNL), and Lawrence Livermore National Laboratory (LLNL), during the 2021-2026 timeframe.

Crossroads is required to support stockpile stewardship certification and assessments to ensure that the nation’s nuclear stockpile is safe, reliable and secure.

The ASC Program is faced with significant challenges resulting from the ongoing technology revolution. The program must continue to meet mission needs while adapting to sometimes radical changes in technology. Codes running on NNSA Advanced Technology Systems (Trinity and Sierra) in the 2019 timeframe are expected to run efficiently on Crossroads. The goal of the Crossroads platform procurement is Efficiency. Efficiency will be evaluated in the areas of:

• Porting efficiency
• Performance efficiency
• Workflow efficiency

Throughout this document, the term efficiency will refer to efficiency in these three areas unless otherwise specified.

Trinity (ATS-1) will be used as the baseline for evaluating these goals. Porting efficiency is defined as the ease with which NNSA mission codes can be ported to execute on the proposed architecture. Minimal change to the existing code base is of high value. Performance efficiency is defined as the achieved performance of the application once ported to the proposed platform. Workflow efficiency is defined as the efficiency with which a complete NNSA workflow executes on the proposed platform. When evaluating proposals, efficiency in all three stated areas will be considered together.

For example, a poor result would be a scenario where an application requires little porting effort to execute on the proposed platform but the resulting performance of the application is poor compared to the baseline. Ideally, individual applications that comprise a workflow can be easily ported to the proposed platform and perform well when compared with the baseline system. If, however, a necessary service like IO or efficient scheduling of a required resource for the workflow is inferior and hampers overall workflow efficiency, this would still be a poor result.

To help inform the Offeror of the characteristics of NNSA workflows, an accompanying whitepaper, “Crossroads Workflows,” is provided that describes how application teams use High Performance Computing (HPC) resources today to advance scientific goals. The whitepaper is designed to provide a framework for reasoning about the optimal solution to these challenges. (The workflows document can be found on the Crossroads website http://crossroads.lanl.gov/.)

An Offeror’s Technical Proposal shall include narrative and graphics, as appropriate, providing its responses/proposed solutions to each of the numbered sections of this Technical Requirements Document. An Offeror shall incorporate its responses/proposed solutions directly into each of the numbered sections of the Technical Requirements Document. The Technical Requirements Document is provided in MS Word format to facilitate this proposal requirement.

The evaluation committee will make no presumption of technical capability when evaluating an Offeror’s responses/proposed solutions to this Technical Requirements Document and may downgrade a proposal if the Offeror’s responses/proposed solutions are not materially responsive.

Where the word “should” appears throughout this document, it is used to convey a target that an Offeror ought to meet or exceed. If an Offeror exceeds a target, its proposal will be upgraded. If an Offeror fails to meet a target, its proposal will be downgraded.

Where the word “shall” appears throughout this document, it is used to impose a requirement that an Offeror must meet or exceed. If an Offeror fails to meet a requirement, its proposal will be downgraded or deemed non-responsive.

Each response/proposed solution shall clearly describe the role of any lower-tier subcontractor(s) and the technology or technologies, both hardware and software, and value added that the lower-tier subcontractor(s) provide, where appropriate.


The scope of work and technical specifications for any subcontracts resulting from this RFP will be negotiated based on this Technical Requirements Document and the successful Offeror’s responses/proposed solutions.

Crossroads has a maximum funding limit over the system lifetime, to include all design and development, site preparation, maintenance, support and analysts. Total Cost of Ownership (TCO) will be considered in system selection. The Offeror must respond with configuration and pricing for both the primary and alternate point designs.

1.1 Schedule

The following is the tentative schedule for the Crossroads system.

Table 1 Crossroads Schedule

Milestone                           Date
RFP Released                        Q1 CY19
On-site System Delivery Begins      Q2 CY21
On-site System Delivery Complete    Q3 CY21
Acceptance Complete                 Q1 CY22

2 System Description

2.1 Architectural Description

The Offeror shall provide a detailed full system architectural description of the Crossroads systems, including diagrams and text describing the following details as they pertain to the Offeror’s system architectures (primary and alternate):

• Component architecture – details of all processor(s), memory technologies, storage technologies, network interconnect(s) and any other applicable components.
• Node architecture(s) – details of how components are combined into the node architecture(s). Details shall include bandwidth and latency specifications (or projections) between components.
• Board and/or blade architecture(s) – details of how the node architecture(s) is integrated at the board and/or blade level. Details should include all inter-node and inter-board/blade communication paths and any additional board/blade level components.
• Rack and/or cabinet architecture(s) – details of how board and/or blades are organized and integrated into racks and/or cabinets. Details should include all inter-rack/cabinet communication paths and any additional rack/cabinet level components.


• Platform storage – details of how storage is integrated with the system, including a platform storage architectural diagram.
• System architecture – details of how rack or cabinets are combined to produce system architecture, including the high-speed interconnects and network topologies (if multiple) and platform storage.
• Proposed floor plan – including details of the physical footprint of the system and all of the supporting components, including details of site and facility integration requirements (e.g. power, cooling, and network).

2.2 Software Description

The Offeror shall provide a detailed description of the proposed software eco-system, including a high-level software architectural diagram. Specify the provenance of each software component, for example open source or proprietary, and the support mechanism for each (for the lifetime of the system including updates).

2.3 Product Roadmap Description

The Offeror shall describe how the system does or does not fit into the Offeror’s long-term product roadmap and a potential follow-on system acquisition in the 2025/26 and beyond timeframe.

2.4 Risk Mitigation Strategy

The Offeror shall provide a summary of an alternate risk mitigation point design. The alternate point design shall be based on an architecture that reduces the risk to successful on-time deployment, for example, one that poses less schedule risk for delivery. It is of great importance that a viable platform (primary or alternate) is delivered in the Crossroads timeframe capable of supporting mission needs regardless of unforeseen technology disruptions. The Offeror shall not submit a full alternative point design proposal. Instead, its summary of the alternate risk mitigation point design shall clearly describe any differences from the primary design point and how each of the noted differences satisfies the technical requirements contained in this document and reduces schedule risk for delivery of Crossroads.

3 Targets for System Design, Features, and Performance Metrics

This section contains targets for detailed system design, features and performance metrics. It is desirable that the Offeror’s proposal meet or exceed the targets outlined in this section. If a target cannot be met, the Offeror shall provide a development and deployment plan, including a schedule, to satisfy the target.


The Offeror may also propose any hardware and/or software architectural features that will provide improvements for any aspect of the system.

3.1 Scalability

The scale of the system necessary to meet the needs of the application performance, porting and workflow requirements of the NNSA laboratories adds significant challenges. The Offeror should propose a system that enables efficiency at up to the full scale of the system. Additionally, the system proposed should provide functionality that assists users in enhancing efficiency at up to full scale. Scalability features, both hardware and software, that benefit both current and future programming models are essential. Memory bandwidth and latency are often limiting factors in the performance of NNSA mission applications; therefore, high value will be put on features that increase memory bandwidth or lower memory latency.

3.1.1 The system should support running jobs up to and including the full scale of the system.

3.1.2 The system should support launching an application at full system scale in less than 30 seconds. The Offeror shall describe factors (such as executable size) that could potentially affect application launch time.

3.1.3 The Offeror shall describe how application launch scales with the number of concurrent launch requests (per second) and the scale of each launch request (resources requested, such as the number of schedulable units etc.), including information such as:
• All system-level and node-level overhead in the process startup including how overhead scales with node count for parallel applications, or how overhead scales with the application count for large numbers of serial applications.
• Any limitations for processes on compute nodes from interfacing with an external work-flow manager, external database or message queue system.

3.1.4 The system should support at least 1000 concurrent users and more than 20,000 concurrent batch jobs. The system should allow a single user to execute multiple independent applications on a subset or all of the pool of nodes allocated to them. The Offeror shall describe details, including limitations of their proposed support for this requirement.

3.1.5 The Offeror shall describe all areas of the system in which node-level resource usage (hardware and software) increases as a job scales up (node, core or thread count).


3.1.6 The system should utilize an optimized job placement algorithm to reduce job runtime, lower variability, minimize latency, etc. The Offeror shall describe in detail how the algorithm is optimized to the system architecture.

3.1.7 The system should include an application programming interface to allow applications access to the physical-to-logical mapping information of the job’s node allocation – including a mapping between MPI ranks and network topology coordinates, and core, node and rack identifiers.
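
As an illustration of the kind of mapping information 3.1.7 asks for, the following Python sketch models one record per MPI rank and groups ranks by rack. The `RankPlacement` type, its field names, and the sample allocation are hypothetical and are not part of any vendor API; the coordinate layout in particular would depend on the proposed network topology.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RankPlacement:
    """Hypothetical record describing where one MPI rank physically runs."""
    rank: int                     # MPI rank within the job
    topo_coords: Tuple[int, ...]  # network topology coordinates (layout is system-specific)
    rack_id: str
    node_id: str
    core_id: int


def ranks_by_rack(placements: List[RankPlacement]) -> Dict[str, List[int]]:
    """Group MPI ranks by rack identifier, with ranks in ascending order."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for placement in sorted(placements, key=lambda p: p.rank):
        groups[placement.rack_id].append(placement.rank)
    return dict(groups)


# Made-up allocation: two racks, one node per rack, two cores per node.
example_allocation = [
    RankPlacement(0, (0, 0, 0), "rack0", "node0", 0),
    RankPlacement(1, (0, 0, 0), "rack0", "node0", 1),
    RankPlacement(2, (1, 0, 0), "rack1", "node1", 0),
    RankPlacement(3, (1, 0, 0), "rack1", "node1", 1),
]
example_groups = ranks_by_rack(example_allocation)
```

An application could use such a grouping to place heavily communicating ranks within the same rack, which is one reason the mapping is requested.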

3.1.8 The system software solution should provide a low jitter environment for applications and should provide an estimate of a compute node operating system’s noise profile, both while idle and while running a non-trivial MPI application. If core specialization is used, the Offeror shall describe the system software activity that remains on the application cores.

3.1.9 The system should provide correct numerical results and consistent runtimes (i.e. wall clock time) that do not vary more than 3% from run to run in dedicated mode and 5% in production mode. The Offeror shall describe strategies for minimizing runtime variability.
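
The variability target in 3.1.9 can be checked from repeated wall clock measurements. The sketch below assumes that "vary" means the largest relative deviation of any run from the mean of all runs; the document does not define the metric, so this definition, the function names, and the sample timings are assumptions.

```python
from statistics import mean
from typing import Sequence


def max_runtime_variation(wallclock_seconds: Sequence[float]) -> float:
    """Largest relative deviation of any run from the mean wall clock time."""
    if len(wallclock_seconds) < 2:
        raise ValueError("need at least two runs to measure variation")
    average = mean(wallclock_seconds)
    return max(abs(t - average) for t in wallclock_seconds) / average


def meets_variability_target(wallclock_seconds: Sequence[float], dedicated: bool) -> bool:
    """Apply the 3% (dedicated mode) or 5% (production mode) limit from 3.1.9."""
    limit = 0.03 if dedicated else 0.05
    return max_runtime_variation(wallclock_seconds) <= limit


# Made-up timings in seconds for three runs of the same job.
steady_runs = [100.0, 102.0, 98.0]
noisy_runs = [100.0, 104.0, 96.0]
```

Other definitions, such as the spread between the fastest and slowest run, would give stricter or looser results for the same data.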

3.1.10 The system’s high speed interconnect should support a high messaging bandwidth, high injection rate, low latency, high throughput, and independent progress. The Offeror shall describe:
• The system interconnect in detail, including any mechanisms for adapting to heavy loads or inoperable links, as well as a description of how different types of failures will be addressed.
• How the interface will allow all cores in the system to simultaneously communicate synchronously or asynchronously with the high speed interconnect.
• How the interconnect will enable low-latency communication for one- and two-sided paradigms.

3.1.11 The Offeror shall describe how both hardware and software components of the interconnect support effective computation and communication overlap for both point-to-point operations and collective operations (i.e., the ability of the interconnect subsystem to progress outstanding communication requests in the background of the main computation thread).

3.1.12 The Offeror shall report or project the proposed system’s node injection/ejection bandwidth.

3.1.13 The Offeror shall report or project the proposed system’s bit error rate of the interconnect in terms of time period between errors that interrupt a job running at the full scale of the system.
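
One way to express a bit error rate as a time between errors, as 3.1.13 requests, is to take the reciprocal of the error rate multiplied by the aggregate bit rate of the links a job uses. The sketch below assumes a uniform rate of job-interrupting errors per transmitted bit (that is, errors not corrected by the link or transport layers), identical and fully utilized links, and made-up input values; none of these figures come from this document.

```python
def seconds_between_job_interrupts(
    interrupting_ber: float, link_gbps: float, active_links: int
) -> float:
    """Mean seconds between job-interrupting bit errors under the stated assumptions."""
    if interrupting_ber <= 0 or link_gbps <= 0 or active_links <= 0:
        raise ValueError("all inputs must be positive")
    bits_per_second = link_gbps * 1e9 * active_links
    errors_per_second = interrupting_ber * bits_per_second
    return 1.0 / errors_per_second


# Hypothetical inputs: 1e-18 interrupting errors per bit, 10,000 links at 100 Gb/s.
example_seconds = seconds_between_job_interrupts(1e-18, 100.0, 10_000)
```

A real projection would also need to account for link utilization and for which error classes actually terminate a job.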


3.1.14 The Offeror shall describe how the interconnect of the system will provide Quality of Service (QoS) capabilities (e.g., in the form of virtual channels or other sub-system QoS capabilities), including but not limited to:
• An explanation of how these capabilities can be used to prevent core communication traffic from interfering with other classes of communication, such as debugging and performance tools or with I/O traffic.
• An explanation of how these capabilities allow efficient adaptive routing as well as a capability to prevent traffic from different applications interfering with each other (either through QoS capabilities or appropriate job partitioning).
• An explanation of any sub-system QoS capabilities (e.g. platform storage QoS features).

3.1.15 The Offeror shall describe specialized hardware or software features of the system that enhance workflows or components of workflow efficiency, and describe any limits to their scalability on the system. The hardware should be on the same high speed network as the main compute resources and should have equal access to other compute resources (e.g. file systems and platform storage). It is desirable that the hardware have the same node level architecture as the main compute resources, but could, for example, have more memory per node.

3.2 System Software and Runtime

The system should include a well-integrated and supported system software environment. The overall imperative is to provide users with a productive, high-performing, reliable, and scalable system software environment that enables efficient use of the full capability of the system.

3.2.1 The system should include a full-featured Linux operating system environment on all user visible service partitions (e.g., front-end nodes, service nodes, I/O nodes). The Offeror shall describe the proposed full-featured Linux operating system environment.

3.2.2 The system should include an optimized compute partition operating system that provides an efficient execution environment for applications running up to full-system scale. The Offeror shall describe any HPC relevant optimizations made to the compute partition operating system.

3.2.3 The Offeror shall describe the security capabilities of all operating systems proposed, e.g. compute, service.


3.2.4 The system should support a cohesive and integrated solution for launching user applications both at scale and high request frequency that require data at runtime such as: shared objects, containerized objects, data files, and dependent software.

3.2.5 The system should include resource management functionality, including job migration, backfill, targeting of specified resources (e.g., platform storage, CPU, memory), advance and persistent reservations, job preemption, job accounting, architecture-aware job placement, power management, job dependencies (e.g., workload management), resilience management, and high throughput workload execution (e.g., 100,000 job submissions per night). The Offeror may propose multiple solutions for a vendor-supported resource manager and should describe the benefits of each.

3.2.6 The system should support jobs consisting of multiple individual applications running simultaneously (inter-node or intra-node) and cooperating as part of an overall multi-component application (e.g., a job that couples a simulation application to an analysis application). The Offeror shall describe in detail how this will be supported by the system software infrastructure (e.g., user interfaces, security model, and inter-application communication).

3.2.7 The system should include a mechanism that will allow users to provide containerized software images without requiring privileged access to the system or allowing a user to escalate privilege. The startup time for launching a parallel application in a containerized software image at full system scale should not greatly exceed the startup time for launching a parallel application in the vendor-provided image.

3.2.8 The system should include a mechanism for dynamically configuring external IPv4/IPv6 connectivity to and from compute nodes, enabling special connectivity paths for subsets of nodes on a per-batch-job basis, and allowing fully routable interactions with external services.

3.2.9 The Successful Offeror should provide access to source code, and necessary build environment, for all software except for firmware, compilers, and third party products. The Successful Offeror should provide updates of source code, and any necessary build environment, for all software over the life of the subcontract.

3.2.10 The scheduler should support job workflows with data stage-in and stage-out from local file systems and storage systems accessible only from a remote data transfer system.


3.3 Software Tools and Programming Environment

The primary programming models used in production applications in this timeframe are the Message Passing Interface (MPI), for inter-node communication, and OpenMP, for fine-grained on-node parallelism. While MPI+OpenMP will be the majority of the workload, the ACES laboratories expect some new applications to exercise emerging asynchronous programming models. System support that would accelerate these programming models/runtimes and benefit MPI+OpenMP is desirable.

3.3.1 The system should include an implementation of the most current version of the MPI standard specification. The Offeror shall provide a detailed description of the MPI implementation (including specification version) and support for features such as hardware accelerated collective operations. The Offeror shall describe any limitations relative to the MPI standard.

3.3.2 The Offeror shall describe at what parallel granularity the system can be utilized by MPI-only applications.

3.3.3 The system should include optimized implementations of collective operations utilizing both inter-node and intra-node features where appropriate, including MPI_Barrier, MPI_Allreduce, MPI_Reduce, MPI_Allgather, and MPI_Gather.

3.3.4 The Offeror shall describe the network transport layer of the system including any support for OpenUCX, Portals, libfabric, libverbs, and any other transport layer, including any optimizations of their implementation that will benefit application performance or workflow efficiency.

3.3.5 The system should include a complete implementation of the most current version of the OpenMP standard including, if applicable, accelerator directives, as well as a supporting programming environment. The Offeror shall provide a detailed feature description of the OpenMP implementation(s) and describe any expected deviations from the OpenMP standard.

3.3.6 The Offeror shall provide a description of how applications written to utilize OpenMP will be compiled and executed on the system.

3.3.7 The Offeror shall provide a description of any proposed hardware or software features that enable OpenMP performance optimizations.

3.3.8 The Offeror shall list any PGAS languages and/or libraries that are supported (e.g. UPC, SHMEM/OpenSHMEM, CAF, Global Arrays) and describe any hardware and/or programming environment software that optimizes any of the listed PGAS languages supported on the system. The Offeror shall describe interoperability with MPI+OpenMP.


3.3.9 The Offeror shall describe and list support for any emerging programming models such as asynchronous task/data models (e.g., Legion, STAPL, HPX, or OCR) and describe any system hardware and/or programming environment software it will provide that optimizes any of the supported models. The Offeror shall describe interoperability with MPI+OpenMP.

3.3.10 The Offeror shall describe the proposed hardware and software environment support for:
• Fast thread synchronization of subsets of execution threads;
• Atomic add, fetch-and-add, multiply, bitwise operations, and compare-and-swap operations over 32-bit and 64-bit integers, single-precision, and double-precision operands;
• Atomic compare-and-swap operations over 16-byte wide operands that comprise two double precision values or two 64-bit memory pointer operands;
• Fast context switching or task-switching;
• Fast task spawning for unique and identical task with data dependencies;
• Support for active messages.
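
The semantics of two of the atomic operations listed in 3.3.10 can be illustrated in software. The sketch below emulates compare-and-swap with a lock and builds fetch-and-add on top of it with a retry loop; on real hardware these would typically be single instructions rather than locked sections. The `AtomicInt` class and `run_demo` function are hypothetical teaching aids, not a description of any proposed system.

```python
import threading


class AtomicInt:
    """Lock-based emulation of an integer with atomic operations."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def load(self) -> int:
        with self._lock:
            return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Store `new` only if the current value equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def fetch_and_add(self, delta: int) -> int:
        """Add `delta` and return the previous value, retrying on contention."""
        while True:
            old = self.load()
            if self.compare_and_swap(old, old + delta):
                return old


def run_demo(threads: int = 4, increments: int = 1000) -> int:
    """Increment a shared counter from several threads and return the total."""
    counter = AtomicInt(0)

    def worker() -> None:
        for _ in range(increments):
            counter.fetch_and_add(1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.load()
```

The retry loop in `fetch_and_add` shows why a fast hardware compare-and-swap matters: under contention, each failed attempt costs another round trip.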

3.3.11 The Offeror shall describe in detail all programming APIs, languages, compilers and compiler extensions, etc. other than MPI and OpenMP (e.g. OpenACC, CUDA, OpenCL, etc.) that will be supported by the system. It is desirable that instances of all programming models provided be interoperable and efficient when used within a single process or single job running on the same compute node.

3.3.12 The system should include support for the C, C++ (including complete support for C++11/14/17), Fortran 77, Fortran 90, and Fortran 2008 programming languages. Providing multiple compilation environments is highly desirable. The Offeror shall describe any limitations that can be expected in meeting full C++17 support based on current expectations. Key ASC applications push the limits of current Fortran compilers. The Offeror shall describe their support for Fortran, including standards levels and/or coverage of Fortran test suites, such as the FLANG Fortran Test Suite.

3.3.13 The system should include a Python implementation that will run on the compute partition with optimized MPI4Py, NumPy, and SciPy libraries.

3.3.14 The system should include a programming toolchain(s) that enables runtime coexistence of threading in C, C++, and Fortran, from within applications and any supporting libraries using the same compiler toolchain. The Offeror shall describe the interaction between OpenMP and native parallelism expressed in language standards.


3.3.15 The system should include C++ compiler(s) that can successfully build the latest Boost C++ library. The Offeror shall support the most recent stable version of Boost.

3.3.16 The system should include optimized versions of libm, libgsl, BLAS levels 1, 2 and 3, LAPACK, ScaLAPACK, HDF5, NetCDF, and FFTW. It is desirable for these to efficiently interoperate with applications that utilize OpenMP. The Offeror shall describe all other optimized libraries that will be supported, including a description of the interoperability of these libraries with the programming environments proposed.

3.3.17 The system should include a mechanism that enables control of task and memory placement within a node for efficient performance. The Offeror shall provide a detailed description of controls provided and any limitations that may exist.

3.3.18 The system should include a comprehensive software development environment with configuration and source code management tools. On heterogeneous systems, a mechanism (e.g., an upgraded autoconf) should be provided to create configure scripts to build cross-compiled applications on login nodes.

3.3.19 The system should include an interactive parallel debugger with an X11-based graphical user interface. The debugger should provide a single point of control that can debug applications in all supported languages using all granularities of parallelism (e.g. MPI+X) and programming environments provided and scale up to 25% of the system.

3.3.20 The system should include a suite of tools for detailed performance analysis and profiling of user applications. At least one tool should support all granularities of parallelism in mixed MPI+OpenMP programs and any additional programming models supported on the system. The tool suite must provide the ability to support multi-node integrated profiling of on-node parallelism and communication performance analysis. The Offeror shall describe all proposed tools and the scalability limitations of each. The Offeror shall describe tools for measuring I/O behavior of user applications.

3.3.21 The system should include event-tracing tools. Event tracing of interest includes: message-passing event tracing, I/O event tracing, floating point exception tracing, and message-passing profiling. The event-tracing tool API should provide functions to activate and deactivate event monitoring multiple times during execution from within a process.

3.3.22 The system should include single- and multi-node stack-tracing tools. The tool set should include a source-level stack traceback, including an API that allows a running process or thread to query its current stack trace.


3.3.23 The system should include tools to assist the programmer in introducing limited levels of parallelism and data structure refactoring to codes using any proposed programming models and languages. Tool(s) should additionally be provided to assist application developers in the design and placement of the data structures with the goal of optimizing data movement/placement for the classes of memory proposed in the system.

3.3.24 The system shall support licensing for the programming environment (compilers, debuggers, optimization tools, optimized math libraries, etc.) for up to twenty (20) concurrent users at job sizes that span from 100’s of small-scale jobs for a single user all the way up to a single job occupying the full scale (100% of the compute partition) of the platform.

3.4 PlatformStoragePlatformstorageiscertaintobeoneoftheadvancedtechnologyareasincludedinanysystemdeliveredinthistimeframe.TheACESlaboratoriesanticipatetheseemergingtechnologieswillenablenewusagemodels.Withthisinmind,anaccompanyingwhitepaper,“APEXWorkflows,”isprovidedthatdescribeshowapplicationteamsuseHPCresourcestodaytoadvancescientificgoals.Thewhitepaperisdesignedtoprovideaframeworkforreasoningabouttheoptimalsolutiontothesechallenges.ThewhitepaperisintendedtohelpanOfferordevelopaplatformstoragearchitectureresponsethatacceleratesthescienceworkflowswhileminimizingthetotalnumberofplatformstoragetiers.TheworkflowsdocumentcanbefoundontheCrossroadswebsite.

3.4.1 Thesystemshouldincludeplatformstoragecapableofretainingallapplicationinput,output,andworkingdatafor12weeks(84days),estimatedataminimumof12%ofbaselinesystemmemoryperday.

3.4.2 The system should include platform storage with a warranted durability or a maintenance plan such that the platform storage is capable of absorbing approximately two times the system's baseline memory per day for a nominal 5 years.
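To get a feel for the endurance implied by 3.4.2, the sketch below totals the writes over the nominal lifetime and, as an assumption, relates them to a storage capacity of ten times baseline memory (the Table 2 target). A 365-day year is assumed and the function names are illustrative.

```python
def lifetime_writes_multiple(writes_per_day: float = 2.0,
                             years: float = 5.0,
                             days_per_year: int = 365) -> float:
    """Total data absorbed over the lifetime, in multiples of baseline memory."""
    return writes_per_day * years * days_per_year


def full_capacity_writes_per_day(writes_per_day: float = 2.0,
                                 capacity_multiple: float = 10.0) -> float:
    """Fraction of the whole storage capacity rewritten each day.

    capacity_multiple is an assumption taken from the Table 2 target of
    platform storage greater than 10X baseline memory.
    """
    return writes_per_day / capacity_multiple
```

Under these assumptions the storage absorbs 3650 times baseline memory over five years, or about one fifth of its own capacity per day.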

3.4.3 The Offeror shall describe how the system provides sufficient bandwidth to support a JMTTI/Delta-Ckpt ratio of greater than 200. See Table 2 Target System Configuration.

3.4.4 The Offeror shall describe how the storage system provides sufficient performance to asynchronously migrate 80% of memory (i.e., a checkpoint from 3.4.3) from the fastest tier to the capacity tier in 75% of JMTTI.
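Requirements 3.4.3 and 3.4.4 together bound two rates: a burst rate into the fastest tier and a slower drain rate to the capacity tier. The sketch below works them out under two assumptions that should be checked against the glossary: that Delta-Ckpt is the time to write a checkpoint of 80% of baseline memory, and that JMTTI sits at its 24-hour target from 3.6.3. All function names are illustrative.

```python
PIB_IN_GIB = 1024 ** 2  # 1 PiB = 1024 TiB = 1,048,576 GiB


def max_delta_ckpt_seconds(jmtti_hours: float, ratio: float = 200.0) -> float:
    """Longest checkpoint time that still gives JMTTI/Delta-Ckpt >= ratio."""
    return jmtti_hours * 3600.0 / ratio


def migration_window_seconds(jmtti_hours: float, fraction: float = 0.75) -> float:
    """Time allowed to drain one checkpoint to the capacity tier (3.4.4)."""
    return jmtti_hours * 3600.0 * fraction


def checkpoint_gib(baseline_memory_pib: float, fraction: float = 0.8) -> float:
    """Checkpoint size, assumed to be 80% of baseline memory."""
    return baseline_memory_pib * fraction * PIB_IN_GIB


def burst_rate_gib_s(baseline_memory_pib: float, jmtti_hours: float) -> float:
    """Write rate into the fastest tier needed to meet the Delta-Ckpt bound."""
    return checkpoint_gib(baseline_memory_pib) / max_delta_ckpt_seconds(jmtti_hours)


def migration_rate_gib_s(baseline_memory_pib: float, jmtti_hours: float) -> float:
    """Sustained drain rate from the fastest tier to the capacity tier."""
    return checkpoint_gib(baseline_memory_pib) / migration_window_seconds(jmtti_hours)
```

With a 24-hour JMTTI the checkpoint must complete within 432 seconds and the migration within 18 hours, so the drain rate is 150 times lower than the burst rate.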

3.4.5 The Offeror shall describe how the system satisfies a minimum storage bandwidth requirement capable of writing 25% of baseline system memory in less than 300 seconds.
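The bandwidth floor in 3.4.5 follows directly from the memory size. A small sketch, assuming binary units (PiB and GiB) and an illustrative function name:

```python
def min_write_bandwidth_gib_s(baseline_memory_pib: float,
                              fraction: float = 0.25,
                              seconds: float = 300.0) -> float:
    """Minimum aggregate write bandwidth, in GiB/s, implied by 3.4.5."""
    gib = baseline_memory_pib * 1024 ** 2 * fraction  # PiB -> GiB, then 25%
    return gib / seconds
```

For the 0.5 PiB baseline memory threshold in Table 2 this comes to roughly 437 GiB/s; a larger proposed memory raises the floor proportionally.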

3.4.6 The Offeror shall describe how a job running across the entire system with an MPI rank per core can create a file from every MPI rank in fewer than 10 seconds between the first and last create. If the response requires more than a single pre-existing directory, the Offeror shall also describe the required directory layout and the time required to create those directories.
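The metric in 3.4.6 is the span between the first and the last successful create. The sketch below is a single-process stand-in that measures that span on a local directory; a real measurement would issue one create per MPI rank across the whole system. The file naming scheme and function names are hypothetical.

```python
import os
import tempfile
import time


def create_span_seconds(directory: str, n_files: int) -> float:
    """Create n_files empty files; return seconds between first and last create."""
    stamps = []
    for rank in range(n_files):
        path = os.path.join(directory, f"rank_{rank:08d}.dat")
        # O_EXCL makes the call fail if the file already exists, so each
        # iteration is a true create rather than an open of an existing file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        stamps.append(time.perf_counter())
        os.close(fd)
    return stamps[-1] - stamps[0] if stamps else 0.0


def required_create_rate(total_ranks: int, window_seconds: float = 10.0) -> float:
    """Aggregate creates per second needed to fit every rank in the window."""
    return total_ranks / window_seconds
```

A call such as `create_span_seconds(tempfile.mkdtemp(), 1000)` gives a local baseline, while `required_create_rate` shows how the aggregate rate scales with the total rank count.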

3.4.7 The Offeror shall describe all available interfaces to platform storage for the system, including but not limited to:

§ POSIX

§ APIs

§ Exceptions to POSIX compliance.

§ Time to consistency and any potential delays for reliable data consumption.

§ Any special requirements for users to achieve performance and/or consistent data.

3.4.8 The Offeror shall describe the reliability characteristics of platform storage, including but not limited to:

§ Any single point of failure for all proposed platform storage tiers (note any component failure that will lead to temporary or permanent loss of data availability).

§ Enumerate platform storage tiers that are designed to be less reliable or do not use data protection techniques (e.g., replication, erasure coding).

§ Describe the impacts to a running compute job due to storage-related failures and during the recovery from said failure for each reliable platform tier. Specifically describe the job impact during failure, and separately describe the job impact during recovery.

§ Vendor supplied mechanisms to ensure data integrity for each platform storage tier (e.g., data scrubbing processes, background checksum verification, etc.).

§ Login or interactive nodes access to platform storage when the compute nodes are unavailable.

3.4.9 The Offeror shall describe system features for platform storage tier management designed to accelerate workflows, including but not limited to:

§ Mechanisms for migrating data between platform storage tiers, including manual, scheduled, and/or automatic data migration to include rebalancing, draining, or rewriting data across devices within a tier.

§ How platform storage will be instantiated with each job if it needs to be, and how platform storage may be persisted across jobs.

§ The capabilities provided to define per-user policies and data movement between different tiers of platform storage or external storage resources (e.g., archives).

§ Describe any data-related consistency, whether optional or inherent, between storage tiers (e.g., write-back caching).

§ The ability to integrate with a scheduling resource.

§ Mechanism to incrementally add capacity and bandwidth to a particular tier of platform storage. Please also describe functional and performance impacts to running jobs while the system integrates new resources.

§ Capabilities to manage or interface platform storage with external storage resources or archives (e.g., fast storage layers or HPSS).

3.4.10 The Offeror shall describe software features that allow users to optimize I/O for the workflows of the system, including but not limited to:

§ Batch data movement capabilities, especially when data resides on multiple tiers of platform storage.

§ Methods for users to create and manage platform storage allocations.

§ Any ability to directly target a tier for writing or reading data.

§ Locality-aware job/data scheduling.

§ Methods for users to exploit any enhanced performance of relaxed consistency.

§ Methods for enabling user-defined metadata with the platform storage solution.

3.4.11 The Offeror shall describe the method and rate for enumerating the entire platform storage metadata. Describe any special capabilities that would mitigate user performance issues and/or allow the enumeration to complete in fewer than 4 hours; expect at least 1 billion objects.
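The 4-hour window and the 1 billion object floor in 3.4.11 translate into a minimum sustained enumeration rate. A short sketch of that conversion, with illustrative function names:

```python
def min_enumeration_rate(objects: int = 1_000_000_000,
                         hours: float = 4.0) -> float:
    """Objects per second needed to enumerate all metadata within the window."""
    return objects / (hours * 3600.0)


def enumeration_hours(objects: int, rate_per_second: float) -> float:
    """Time to enumerate a namespace at a given sustained rate."""
    return objects / rate_per_second / 3600.0
```

At the stated minimums this is a little under 70,000 objects per second; a namespace larger than 1 billion objects needs a proportionally higher rate.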

3.4.12 The Offeror shall describe capabilities to comprehensively collect platform storage usage data and note those that can be collected out-of-band. Storage metrics for the system may include, but are not limited to:

§ Per client metrics and frequency of collection, including but not limited to: the number of bytes read or written, number of read or write invocations, client cache statistics, and metadata statistics such as number of opens, closes, creates, and other system calls of relevance to the performance of platform storage.

§ Job level metrics, such as the number of sessions each job initiates with each platform storage tier, session duration, total data transmitted (separated as reads and writes) during the session, and the number of total platform storage invocations made during the session.

§ Platform storage tier metrics and frequency of collection, such as the number of bytes read, number of bytes written, number of read invocations, number of write invocations, bytes deleted/purged, number of I/O sessions established, and periods of outage/unavailability.

§ Job level metrics describing usage of a tiered platform storage hierarchy, such as how long files are resident in each tier, hit rate of file pages in each tier (i.e., whether pages are actually read and how many times data is re-read), fraction of data moved between tiers because of a) explicit programmer control and b) transparent caching, and time interval between accesses to the same file (e.g., how long until an analysis program reads a simulation generated output file).

3.4.13 The Offeror shall propose a method for providing access to platform storage from other systems at the facility. In the case of tiered platform storage, at least one tier must satisfy this requirement.

3.4.14 The Offeror shall describe the capability for platform storage tiers to be repaired, serviced, and incrementally patched/upgraded while running different versions of software or firmware without requiring a storage tier-wide outage. The Offeror shall describe the level of performance degradation, if any, anticipated during the repair or service interval.

3.4.15 The Offeror shall specify the time required and the optimal number of compute nodes required to achieve peak read and write performance to the fastest platform storage tier using the following datasets:

§ A 1 TB dataset of 20 GB files.

§ A 5 TB dataset of any chosen file size. Offeror shall report the file size chosen.

§ Usable capacity of the fastest tier using 32 MB files.

3.5 Application Performance

Assuring that real applications perform efficiently on Crossroads is key for their success. Because the full applications are large, often with millions of lines of code, and in some cases are export controlled, a suite of benchmarks has been developed for RFP response evaluation and system acceptance. The benchmark codes are representative of the workloads of the NNSA laboratories but often smaller than the full applications. The performance of the benchmarks will be evaluated as part of both the RFP response and system acceptance. Final benchmark acceptance performance targets will be negotiated after a final system configuration is defined. All performance tests must continue to meet negotiated acceptance criteria throughout the lifetime of the system. System acceptance for Crossroads will include export controlled ASC codes (to include code at 0D999 and ITAR control level) from each of the three NNSA laboratories. The benchmark information and licensing requirements regarding the Crossroads acceptance codes, and supplemental materials can be found on the Crossroads website.

3.5.1 The Offeror shall provide responses for the benchmarks (SNAP, HPCG, PENNANT, MiniPIC, UMT, VPIC, Branson) provided on the Crossroads benchmarks link on the Crossroads website. All modifications or new variants of the benchmarks (including makefiles, build scripts, and environment variables) are to be supplied in the Offeror’s response.

§ The results of all problem sizes (baseline and optimized) should be provided in the Offeror’s Scalable System Improvement (SSI) spreadsheets. SSI is the calculation used for measuring improvement and is documented on the Crossroads website, along with the SSI spreadsheets. If predicted or extrapolated results are provided, the methodology used to derive them should be clearly documented.

§ The Offeror shall provide licenses for the system for all compilers, libraries, and runtimes used to achieve benchmark performance.

3.5.2 The Offeror shall provide performance results for the system that may be benchmarked, predicted, and/or extrapolated for the baseline MPI+OpenMP variants of the benchmarks. The Offeror may modify the benchmarks to include extra OpenMP pragmas as required, but the benchmark must remain a standard-compliant program that maintains existing output subject to the validation criteria described in the benchmark run rules.

3.5.3 The Offeror shall optionally provide performance results from an Offeror optimized variant of the benchmarks. The Offeror may modify the benchmarks, including the algorithm and/or programming model used to demonstrate high system performance. If algorithmic changes are made, the Offeror shall provide an explanation of why the results may deviate from validation criteria described in the benchmark run rules.

3.5.4 In addition to the Crossroads benchmarks, an ASC Simulation Code Suite representing the three NNSA laboratories will be used to judge performance at time of acceptance (Mercury from Lawrence Livermore, PartiSN from Los Alamos, and SPARC from Sandia). NNSA mission requirements forecast the need for a 6X or greater improvement over the ASC Trinity system (Haswell partition) for the code suite, measured using SSI. Final acceptance performance targets will be established during negotiations after a final system configuration is defined. Information regarding the ASC Simulation Code Suite can be found on the Crossroads website. Source code will be provided to the Offeror, but it will require compliance with export control laws and no cost licensing agreements.

3.5.5 The Offeror shall report or project the number of cores necessary to saturate the available node baseline memory bandwidth as measured by the Crossroads memory bandwidth benchmark found on the Crossroads website.

§ If the node contains heterogeneous cores, the Offeror shall report the number of cores of each architecture necessary to saturate the available baseline memory bandwidth.

§ If multiple tiers of memory are available, the Offeror shall report the above for every functional combination of core architecture and baseline or extended memory tier.

3.5.6 The Offeror shall report or project the sustained dense matrix multiplication performance on each type of processor core (individually and/or in parallel) of the system node architecture(s) as measured by the Crossroads multithreaded DGEMM benchmark found on the Crossroads website.

§ The Offeror shall describe the percentage of theoretical double-precision (64-bit) computational peak, which the benchmark GFLOP/s rate achieves for each type of compute core/unit in the response, and describe how this is calculated.
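One common way to express the percentage of peak requested in 3.5.6 is to divide the measured DGEMM rate by a theoretical peak built from core count, clock rate, and double-precision FLOPs per cycle. The sketch below follows that convention; it is an assumption about how a response might calculate the figure, and the node parameters in the usage note are made up.

```python
def theoretical_peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical double-precision peak of one node, in GFLOP/s."""
    return cores * ghz * flops_per_cycle


def percent_of_peak(measured_gflops: float, peak_gflops: float) -> float:
    """Measured DGEMM rate as a percentage of theoretical peak."""
    if peak_gflops <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * measured_gflops / peak_gflops
```

For a hypothetical 32-core node at 2.0 GHz with 16 double-precision FLOPs per cycle per core, the peak is 1024 GFLOP/s, and a measured 819.2 GFLOP/s corresponds to 80% of peak. The clock rate chosen (base, all-core turbo, or vector frequency) changes the result, which is presumably why the requirement asks how the figure is calculated.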

3.5.7 The Offeror shall report, or project, the MPI two-sided message rate of the nodes in the system under the following conditions measured by the communication benchmark specified on the Crossroads website:

§ Using one, two, four, eight, and half the number of cores of MPI ranks per node with MPI_THREAD_SINGLE.

§ Using one, two, four, eight, and half the number of cores of MPI ranks per node and multiple threads per rank with MPI_THREAD_MULTIPLE.

§ The Offeror may additionally choose to report on other configurations, including MPI_THREAD_SERIALIZED and MPI_THREAD_FUNNELED.

3.5.8 The Offeror shall report, or project, the MPI one-sided message rate of the nodes in the system for all passive synchronization RMA methods with both pre-allocated and dynamic memory windows under the following conditions measured by the communication benchmark specified on the Crossroads website using:

§ One, two, four, eight, and half the number of cores of MPI ranks per node with MPI_THREAD_SINGLE.

§ One, two, four, eight, and half the number of cores of MPI ranks per node and multiple threads per rank with MPI_THREAD_MULTIPLE.

§ The Offeror may additionally choose to report on other configurations, including MPI_THREAD_SERIALIZED and MPI_THREAD_FUNNELED.

3.5.9 The Offeror shall report, or project, the time to perform the following collective operations for 25%, 50%, and 100% of the compute partition nodes in the system and report on core occupancy during the operations measured by the communication benchmark specified on the Crossroads website for:

§ An 8 byte MPI_Allreduce operation.

§ An 8 byte per rank MPI_Allgather operation.

3.5.10 The Offeror shall report, or project, the minimum and maximum off-node latency of the system for MPI two-sided messages using the following threading modes measured by the communication benchmark specified on the Crossroads website:

§ MPI_THREAD_SINGLE with a single thread per rank.

§ MPI_THREAD_MULTIPLE with two or more threads per rank.

3.5.11 The Offeror shall report, or project, the minimum and maximum off-node latency for MPI one-sided messages of the system for all passive synchronization RMA methods with both pre-allocated and dynamic memory windows using the following threading modes measured by the communication benchmark specified on the Crossroads website:

§ MPI_THREAD_SINGLE with a single thread per rank.

§ MPI_THREAD_MULTIPLE with two or more threads per rank.

3.5.12 The Offeror shall provide an efficient implementation of MPI_THREAD_MULTIPLE. Bandwidth, latency, and message throughput measurements using the MPI_THREAD_MULTIPLE thread support level should have no more than a 10% performance degradation when compared to using the MPI_THREAD_SINGLE support level as measured by the communication benchmark specified on the Crossroads website.
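The 10% limit in 3.5.12 applies to metrics that improve in opposite directions: higher is better for bandwidth and message throughput, lower is better for latency. The sketch below shows one way to normalize the comparison; the metric keys and function names are illustrative, and treating the MPI_THREAD_SINGLE result as the reference is an assumption.

```python
# True where a larger value means better performance.
HIGHER_IS_BETTER = {"bandwidth": True, "message_rate": True, "latency": False}


def degradation_percent(metric: str, single: float, multiple: float) -> float:
    """Degradation of the MPI_THREAD_MULTIPLE result relative to MPI_THREAD_SINGLE.

    Positive values mean MPI_THREAD_MULTIPLE performed worse.
    """
    if HIGHER_IS_BETTER[metric]:
        return 100.0 * (single - multiple) / single
    return 100.0 * (multiple - single) / single


def meets_target(metric: str, single: float, multiple: float,
                 limit_percent: float = 10.0) -> bool:
    """True when the degradation is no more than the 10% limit."""
    return degradation_percent(metric, single, multiple) <= limit_percent
```

For example, a latency rising from 1.0 to 1.05 microseconds passes, while a bandwidth falling from 10.0 to 8.5 GB/s is a 15% degradation and does not.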

3.5.13 The Offeror shall report, or project, the maximum I/O bandwidths of the system as measured by the IOR benchmark specified on the Crossroads website.

3.5.14 The Offeror shall report, or project, the metadata rates of the system as measured by the MDTEST benchmark specified on the Crossroads website.

3.5.15 The Successful Offeror shall be required at time of acceptance to meet specified targets for acceptance benchmarks, and mission codes for Crossroads, listed on the Crossroads website.

3.5.16 The Offeror shall describe how the system may be configured to support a high rate and bandwidth of TCP/IP connections to external services both from compute nodes and directly to and from the platform storage, including:

§ Compute node external access should allow all nodes to each initiate 1 connection concurrently within a 1 second window.

§ Transfer of data over the external network to and from the compute nodes and platform storage at 100 GB/s per direction of a 1 TB dataset comprised of 20 GB files in 10 seconds.
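The figures in 3.5.16 are mutually consistent, which a few lines of arithmetic can confirm. The sketch assumes decimal units (1 TB = 1000 GB); the function names and the node count used in the test are illustrative.

```python
def transfer_seconds(dataset_tb: float, rate_gb_s: float) -> float:
    """Time to move a dataset at a sustained rate, using 1 TB = 1000 GB."""
    return dataset_tb * 1000.0 / rate_gb_s


def file_count(dataset_tb: float, file_gb: float) -> int:
    """Number of equally sized files that make up the dataset."""
    return round(dataset_tb * 1000.0 / file_gb)


def connections_per_second(nodes: int, window_seconds: float = 1.0) -> float:
    """Connection setup rate if every node opens one connection in the window."""
    return nodes / window_seconds
```

A 1 TB dataset at 100 GB/s takes 10 seconds and consists of 50 files of 20 GB each, so the target leaves no slack for per-file setup overhead.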

3.6 Resilience, Reliability, and Availability

The ability to achieve the NNSA mission goals hinges on the productivity of system users. System availability is therefore essential and requires system-wide focus to achieve a resilient, reliable, and available system. For each metric specified below, the Offeror must describe how they arrived at their estimates (e.g., failure rates of individual components including hardware and software that make up major aspects of Offeror’s estimate).

3.6.1 Failure of the system management and/or RAS system(s) should not cause a system or job interrupt. This requirement does not apply to a RAS system feature, which automatically shuts down the system for safety reasons, such as an overheating condition.

3.6.2 The minimum System Mean Time Between Interrupt (SMTBI) should be greater than 720 hours.

3.6.3 The minimum Job Mean Time To Interrupt (JMTTI) should be greater than 24 hours. Automatic restarts do not mitigate a job interrupt for this metric.

3.6.4 The ratio of JMTTI/Delta-Ckpt should be greater than 200. This metric is a measure of the system’s ability to make progress over a long period of time and corresponds to an efficiency of approximately 90%. If, for example, the JMTTI requirement is not met, the target JMTTI/Delta-Ckpt ratio ensures this minimum level of efficiency.
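The document does not say how the 90% figure in 3.6.4 is derived. One model that reproduces it is the first-order Young/Daly checkpoint approximation, used here as an assumption: with checkpoint time δ and mean time to interrupt M, the best checkpoint interval is about sqrt(2δM) and the fraction of time lost is about sqrt(2δ/M). Restart time and higher-order terms are ignored.

```python
import math


def optimal_interval_seconds(delta_ckpt_s: float, jmtti_s: float) -> float:
    """First-order optimal compute time between checkpoints (Young/Daly)."""
    return math.sqrt(2.0 * delta_ckpt_s * jmtti_s)


def efficiency(ratio: float) -> float:
    """Approximate useful-work fraction for a given JMTTI/Delta-Ckpt ratio.

    Assumes checkpoints are taken at the first-order optimal interval, so the
    lost fraction (checkpoint overhead plus expected rework) is sqrt(2/ratio).
    """
    return 1.0 - math.sqrt(2.0 / ratio)
```

With a ratio of 200 the model gives an efficiency of 0.9. For a 24-hour JMTTI and a 432-second checkpoint it suggests checkpointing about every 8640 seconds (2.4 hours). Because the efficiency depends only on the ratio, it holds even if JMTTI itself falls short, which is consistent with the last sentence of 3.6.4.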

3.6.5 An immediate re-launch of an interrupted job should not require a complete resource reallocation. If a job is interrupted, there should be a mechanism that allows re-launch of the application using the same allocation of resource (e.g., compute nodes) that it had before the interrupt or an augmented allocation when part of the original allocation experiences a hard failure.

3.6.6 A complete system initialization should take no more than 30 minutes. The Offeror shall describe the full system initialization sequence and timings.

3.6.7 The system should achieve 99% scheduled system availability. System availability is defined in the glossary.
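To put the 99% target in 3.6.7 into hours, the sketch below assumes availability is simply the fraction of scheduled time during which the system is up. The glossary definition governs and may differ; the function name is illustrative.

```python
def allowed_downtime_hours(scheduled_hours: float,
                           availability: float = 0.99) -> float:
    """Unavailable hours permitted within a period of scheduled operation."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return scheduled_hours * (1.0 - availability)
```

Under this assumption a fully scheduled 8760-hour year permits about 87.6 hours of downtime, and a 720-hour month about 7.2 hours.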

3.6.8 The Offeror shall describe the resilience, reliability, and availability mechanisms and capabilities of the system including, but not limited to:

§ Any condition or event that can potentially cause a job interrupt.

§ Resiliency features to achieve the availability targets.

§ Single points of failure (hardware or software), and the potential effect on running applications and system availability.

§ How a job maintains its resource allocation and is able to relaunch an application after an interrupt.

§ A system-level mechanism to collect failure data for each kind of component.

3.7 Application Transition Support and Early Access to ACES Technologies

The Crossroads system may include numerous advanced technologies. The Offeror shall include in their proposal a plan to effectively utilize these technologies and assist in transitioning the mission workflows to the system. The Successful Offeror shall support efforts to transition the Advanced Technology Development Mitigation (ATDM) codes to the systems. ATDM codes are currently being developed by the three NNSA weapons laboratories, Lawrence Livermore, Los Alamos, and Sandia. These codes may require compliance with export control laws and no cost licensing agreements. Information about the ATDM program can be found on the NNSA website.

3.7.1 The Successful Offeror should provide a vehicle for supporting the successful demonstration of the application performance requirements and the transition of key applications to the Crossroads system (e.g., a Center of Excellence). Support should be provided by the Offeror and all of its key advanced technology providers (e.g., processor vendors, integrators, etc.). The Successful Offeror should provide experts in the areas of application porting and performance optimization in the form of staff training, general user training, and deep-dive interactions with a set of application code teams. Support should be provided from the date of subcontract execution through two (2) years after final acceptance of the systems.

3.7.2 The Successful Offeror shall describe their support structure for the proposed programming environment. This includes mechanisms for reporting issues and requesting new functionality, in addition to escalation paths/priorities available to Crossroads’ applications. Support should be provided up to two (2) years after final acceptance of the systems.

3.7.3 The Offeror shall describe which of the proposed hardware and software technologies (physical hardware, emulators, and/or simulators), will be available for access before system delivery and in what timeframe. The proposed technologies should provide value in advanced preparation for the delivery of the final Crossroads system for pre-system-delivery application porting and performance assessment activities.

3.8 Target System Configuration

ACES determined the following targets for Crossroads System Configurations. Offerors shall state projections for their proposed system configurations relative to these targets.

Table 2 Target System Configuration

                                                              Crossroads
Baseline Memory Capacity                                      > 0.5 PiB
  (excludes all levels of on-die CPU cache)
Benchmark SSI increase over Trinity system                    > 6X
  (Haswell partition)
Platform Storage                                              > 10X Baseline Memory
Nameplate Power                                               < 20 MW
Peak Power                                                    < 18 MW
Nominal Power                                                 < 15 MW
Idle Power                                                    < 10% Nameplate Power
Job Mean Time to Interrupt (JMTTI)                            > 24 Hours
  (calculated for a single job running on the entire system)
System Mean Time to Interrupt (SMTTI)                         > 720 Hours
JMTTI/Delta-Ckpt                                              > 200
System Availability                                           > 99%

3.9 System Operations

System management should be an integral feature of the overall system and should provide the ability to effectively manage system resources with high utilization and throughput under a workload with a wide range of concurrencies. The Successful Offeror should provide system administrators, security officers, and user-support personnel with productive and efficient system configuration management capabilities and an enhanced diagnostic environment.

3.9.1 The system should include scalable integrated system management capabilities that provide human interfaces and APIs for system configuration and its ability to be automated through configuration management software, software management, change management through a version control system, local site integration, and system configuration backup and recovery.

3.9.2 The system should include a means for tracking and analyzing all software updates, software and hardware failures, and hardware replacements over the lifetime of the system. All patches and releases should include change logs with detailed descriptions of bug fixes and features and also what services are affected by these changes.

3.9.3 The system should include the ability to perform rolling upgrades and rollbacks on a subset of the system while the balance of the system remains in production operation. The Offeror shall describe the mechanisms, capabilities, workload management support, and limitations of rolling upgrades and rollbacks. No more than half the system partition should be required to be down for rolling upgrades and rollbacks.

3.9.4 The system should include an efficient mechanism for reconfiguring and rebooting compute nodes. The Offeror shall describe in detail the compute node reboot mechanism, differentiating types of boots (warm boot vs. cold boot) required for different node features, as well as how the time required to reboot scales with the number of nodes being rebooted. Warm boot timings for both file system and cluster boot shall be independently provided and also a combined warm boot file system and cluster boot timing if it differs.

3.9.5 The system should include a mechanism whereby all monitoring data and logs captured are available to the system owner, and will support an open monitoring API to facilitate lossless, scalable sampling and data collection for monitored data. Any filtering that may need to occur will be at the option of the system manager. The system will include a sampling and connection framework that allows the system manager to configure independent alternative parallel data streams to be directed off the system to site-configurable consumers.

3.9.6 The system should include a mechanism to collect and provide metrics and logs which monitor the status, health, utilization, and performance of the system, subsystems, and all major components, including, but not limited to:

§ Environmental measurement capabilities for all systems and peripherals and their sub-systems and supporting infrastructure, including power and energy consumption and control.

§ Internal high speed network performance counters, including measures of network congestion and network resource consumption.

§ Information enabling traffic and congestion attribution, with explanation of the attribution logic.

§ Hardware performance counters enabling application performance assessment with the ability to integrate these with system metric (e.g., network performance counters) data.

§ All levels of integrated and attached platform storage.

§ The system as a whole, including hardware performance counters for metrics for all levels of integrated and attached platform storage.

3.9.7 The Offeror shall describe what tools and APIs it will provide for the collection, analysis, integration, and visualization of metrics and logs produced by the system (e.g., peripherals, integrated and attached platform storage, and environmental data, including power and energy consumption). The description should include any capabilities to configure collection rates for the available metrics.

3.9.8 The Offeror shall describe all mechanisms for obtaining application performance and progress information such as enabling application specific logs and metrics to be collected and transported off the system or enabling hardware and system performance counters to be collected and transported off the system and associated with particular jobs/applications.

3.9.9 The Offeror shall describe the system configuration management and diagnostic capabilities of the system that address the following topics:

§ Detailed description of the system management support.

§ Any effect or overhead of software management tool components on the CPU or memory available on compute nodes.

§ Release plan, with regression testing and validation for all system related software and security updates.

§ Support for multiple simultaneous or alternative system software configurations, including estimated time and effort required to install both a major and a minor system software update.

§ User activity tracking, such as audit logging and process accounting.

§ Unrestricted privileged access to all software and hardware components and all related performance metrics delivered with the system.

3.9.10 The system should provide a mechanism for reporting all basic component and essential services state (e.g., up/down/running) and changes of state. The system should also provide documented APIs for querying the state.

3.9.11 The Offeror should provide a description of all fundamental data and associated metrics and computations used to assess status, health, utilization, and performance in 3.9.6.

3.9.12 The Offeror shall describe all measurement capabilities (system, rack/cabinet, board, node, component, and sub-component level) for the system, including control and response times, sampling frequency, accuracy of the data, and timestamps of the data for individual points of measurement and control.

3.10 Power and Energy

Power, energy, and temperature will be critical factors in how the ACES laboratories manage systems in this timeframe and must be an integral part of overall Systems Operations. The solution must be well integrated into other intersecting areas (e.g., facilities, resource management, runtime systems, and applications). The ACES laboratories expect a growing number of use cases in this area that will require a vertically integrated solution.

3.10.1 The Offeror shall describe all power, energy, and temperature operational measurement capabilities (system, rack/cabinet, board, node, component, and sub-component level) for the system, including control and response times, sampling frequency, accuracy of the data, and timestamps of the data for individual points of measurement and control.

3.10.2 The Offeror shall describe all operational control capabilities it will provide to affect power or energy use (system, rack/cabinet, board, node, component, and sub-component level).

3.10.3 The system should include system-level interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system, including but not limited to:

§ AC measurement capabilities at the system or rack level.

§ System-level minimum and maximum power settings (e.g., power caps).

§ System-level power ramp up and down rate.

§ Scalable collection and retention of all measurement data such as:

    § point-in-time power data.

    § energy usage information.

    § minimum and maximum power data.

3.10.4 The system should include resource manager interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system, including but not limited to:

§ Job and node level minimum and maximum power settings.

§ Job and node level power ramp up and down rate.

§ Job and node level processor and/or core frequency control.

§ System and job level profiling and forecasting.

    § e.g., prediction of hourly power averages > 24 hours in advance with a 1 MW tolerance.
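The forecasting example in 3.10.4 can be checked after the fact by comparing predicted and measured hourly averages. The sketch below assumes the 1 MW tolerance is an absolute bound on each hourly error; the function names and the sample values in the test are hypothetical.

```python
def worst_error_mw(predicted_mw, measured_mw) -> float:
    """Largest absolute difference between predicted and measured hourly averages."""
    if len(predicted_mw) != len(measured_mw):
        raise ValueError("series must cover the same hours")
    return max((abs(p - m) for p, m in zip(predicted_mw, measured_mw)), default=0.0)


def forecast_within_tolerance(predicted_mw, measured_mw,
                              tolerance_mw: float = 1.0) -> bool:
    """True when every hourly prediction is within the tolerance."""
    return worst_error_mw(predicted_mw, measured_mw) <= tolerance_mw
```

Each list holds one value per hour in megawatts, with predictions assumed to have been issued more than 24 hours before the hour they describe.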

3.10.5 The system should include application and runtime system interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system including but not limited to:

§ Node level minimum and maximum power settings.

§ Node level processor and/or core frequency control.

§ Node level application hints, such as:

    § application entering serial, parallel, computationally intense, I/O intense or communication intense phase.

3.10.6 The system should include an integrated API for all levels of measurement and control of power relevant characteristics of the system. It is preferable that the provided API complies with the High Performance Computing Power Application Programming Interface Specification (http://powerapi.sandia.gov).

3.10.7 The Offeror shall project (and report) the Nameplate, Peak, Nominal, and Idle Power of the system.

3.10.8 The Offeror shall describe any controls available to enforce or limit power usage below Nameplate power and the reaction time of this mechanism (e.g., what duration and magnitude can power usage exceed the imposed limits).

3.10.9 The Offeror shall describe the status of the system when in an Idle State (describe all Idle States if multiple are available) and the time to transition from the Idle State (or each Idle State if there are multiple) to the start of job execution.

3.11 Facilities and Site Integration

3.11.1 The system should use 3-phase Delta 480 VAC (four-wire system, three phases and one ground). Other system infrastructure components (e.g., disks, switches, login nodes, and mechanical subsystems such as CDUs) must use either 3-phase 480 VAC (strongly preferred), 3-phase 208 VAC (second choice), or single-phase 120/208 VAC (third choice). The total number of individual branch circuits and phase load imbalance should be minimized.

3.11.2 All equipment and power control hardware of the system should be Nationally Recognized Testing Laboratories (NRTL) certified and bear appropriate NRTL labels.

3.11.3 Every rack, network switch, interconnect switch, node, and disk enclosure should be clearly labeled with a unique identifier visible from the front of the rack and the rear of the rack, as appropriate, when the rack door is open. These labels will be high quality so that they do not fall off, fade, disintegrate, or otherwise become unusable or unreadable during the lifetime of the system. Nodes will be labeled from the rear with a unique serial number for inventory tracking. It is desirable that motherboards also have a unique serial number for inventory tracking. Serial numbers shall be visible without having to disassemble the node, or they must be able to be queried from the system management console.

3.11.4 All components in a rack intended to be serviced while the rack has power shall be fully serviceable without danger of touching an exposed conducting surface. Consider power switches for individual components that may need to be powered off and on individually. Consider minimizing the number of connecting cables that need to be removed to power a component off and on. Consider the placement of connectors, handles, etc., that must be used to remove and replace a component, with respect to conducting surfaces.

3.11.5 Table 3 below shows target facility requirements identified by ACES for the Crossroads system. The Offeror shall describe the features of its proposed systems relative to site integration at the respective facilities, including:

• Description of the physical packaging of the system, including dimensioned drawings of individual cabinet types and the floor layout of the entire system.
• Remote environmental monitoring capabilities of the system and how it would integrate into facility monitoring.
• Emergency shutdown capabilities.
• Detailed descriptions of power and cooling distributions throughout the system, including power consumption for all subsystems.
• Description of parasitic power losses within Offeror's equipment, such as fans, power supply conversion losses, power-factor effects, etc. For the computational and platform storage subsystems separately, give an estimate of the total power and parasitic power losses (whose difference should be power used by computational or platform storage components) at the minimum and maximum ITUE, which is defined as the ratio of total equipment power over power used by computational or platform storage components. Describe the conditions (e.g., "idle") at which the extrema occur.
• OS distributions or other client requirements to support off-system access to the platform storage (e.g., LANL File Transfer Agents).
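The ITUE estimate requested in the list above reduces to a single ratio. The following Python sketch illustrates the arithmetic only; the function name and the power figures are hypothetical placeholders and are not values taken from this document.

```python
def itue(total_power_kw: float, parasitic_power_kw: float) -> float:
    """Return ITUE: total equipment power divided by the power used by
    computational or platform storage components (total minus parasitic)."""
    useful_power_kw = total_power_kw - parasitic_power_kw
    if useful_power_kw <= 0:
        raise ValueError("parasitic losses must be smaller than total power")
    return total_power_kw / useful_power_kw


# Hypothetical operating points for one subsystem (illustration only).
operating_points = {
    "idle": (2000.0, 400.0),     # (total kW, parasitic kW)
    "peak": (10000.0, 1000.0),
}

for condition, (total_kw, parasitic_kw) in operating_points.items():
    print(f"{condition}: ITUE = {itue(total_kw, parasitic_kw):.3f}")
```

Evaluating the ratio at several operating conditions, separately for the computational and platform storage subsystems, is one way to locate the minimum and maximum ITUE and the conditions at which they occur.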

Table 3 Crossroads Facility Requirements

Location    Los Alamos National Laboratory, Los Alamos, NM. The system will be housed in the Strategic Computing Complex (SCC), Building 2327

Altitude    7,500 feet

Seismic    N/A

Water Cooling    The system should operate in conformance with ASHRAE Class W2 guidelines (dated 2011). The facility will provide operating water temperature of 75°F, at up to 35 PSI differential pressure at the system cabinets. However, Offeror should note if the system is capable of operating at higher temperatures. Note: LANL facility will provide inlet water at a nominal 75°F, per system design. Total flow requirements may not exceed 9600 GPM.

Water Chemistry    The system must operate with facility water meeting basic ASHRAE water chemistry. Special chemistry water is not available in the main building loop and would require a separate tertiary loop provided with the system. If tertiary loops are included in the system, the Offeror shall describe their operation and maintenance, including coolant chemistry, pressures, and flow controls. All coolant loops within the system should have reliable leak detection, temperature, and flow alarms, with automatic protection and notification mechanisms.

Air Cooling    The system must operate with supply air at 60°F-75°F, with a relative humidity from 30%-70%. The rate of airflow is between 800-1500 CFM/floor tile. No more than 3 MW of heat should be removed by air cooling.

Maximum Power Rate of Change    The hourly average in system power should not exceed the 2 MW wide power band negotiated at least 2 hours in advance.

Power Quality    The system should be resilient to incoming power fluctuations at least to the level guaranteed by the ITIC power quality curve.

Floor    42” raised floor

Ceiling    16-foot ceiling and 16-foot ceiling plenum

Maximum Footprint    8000 square feet; 80 feet long and 100 feet deep.

Shipment Dimensions and Weight    No restrictions.

Floor Loading    The average floor loading over the effective area should be no more than 300 pounds per square foot. The effective area is the actual loading area plus at most a foot of surrounding fully unloaded area. A maximum limit of 300 pounds per square foot also applies to all loads during installation. The Offeror shall describe how the weight will be distributed over the footprint of the rack (point loads, line loads, or evenly distributed over the entire footprint). A point load applied on a one square inch area should not exceed 1500 pounds. A dynamic load using a CISCA Wheel 1 size should not exceed 1250 pounds (CISCA Wheel 2 – 1000 pounds).

Cabling    All power cabling and water connections should be below the access floor. It is preferable that all other cabling (e.g., system interconnect) is above floor and integrated into the system cabinetry. Under floor cables (if unavoidable) should be plenum rated and comply with NEC 300.22 and NEC 645.5. All communications cables, wherever installed, should be source/destination labeled at both ends. All communications cables and fibers over 10 meters in length and installed under the floor should also have a unique serial number and dB loss data document (or equivalent) delivered at time of installation for each cable, if a method of measurement exists for cable type.

External network interfaces supported by the site for connectivity requirements specified below
1Gb, 10Gb, 40Gb, 100Gb, IB EDR, IB HDR. The network infrastructure is continuously upgraded moving to the latest Ethernet and IB capabilities.

External bandwidth on/off the system for general TCP/IP connectivity
Minimum of 100 GB/s per direction with a preference for 300 GB/s per direction. Describe how 100 GB/s per direction could be expanded to 300 GB/s per direction.

External bandwidth on/off the system for accessing the system's PFS
Minimum of 100 GB/s with a preference for 300 GB/s. Describe how 100 GB/s could be expanded to 300 GB/s.

External bandwidth on/off the system for accessing external, site supplied file systems (e.g., GPFS, NFS)
Minimum of 100 GB/s with a preference for 300 GB/s. Describe how 100 GB/s could be expanded to 300 GB/s.
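As an illustration of the Floor Loading row in Table 3, the Python sketch below computes the average loading over the effective area. It assumes a rectangular rack footprint and applies the allowed margin of fully unloaded floor on every side; the rack weight and dimensions are hypothetical and do not come from this document.

```python
MAX_AVERAGE_LOAD_PSF = 300.0  # limit stated in Table 3, pounds per square foot


def average_floor_loading_psf(weight_lb: float, width_ft: float,
                              depth_ft: float, margin_ft: float = 1.0) -> float:
    """Average load over the effective area: the loaded rectangle plus a
    margin (at most one foot) of fully unloaded floor on each side."""
    if not 0.0 <= margin_ft <= 1.0:
        raise ValueError("margin must be between 0 and 1 foot")
    effective_area_sqft = (width_ft + 2 * margin_ft) * (depth_ft + 2 * margin_ft)
    return weight_lb / effective_area_sqft


# Hypothetical rack: 3000 lb on a 2 ft x 4 ft footprint.
isolated = average_floor_loading_psf(3000.0, 2.0, 4.0)                 # full margin available
packed = average_floor_loading_psf(3000.0, 2.0, 4.0, margin_ft=0.0)    # no unloaded margin

print(isolated, isolated <= MAX_AVERAGE_LOAD_PSF)
print(packed, packed <= MAX_AVERAGE_LOAD_PSF)
```

The comparison between the two cases shows why the amount of surrounding unloaded floor matters: the same rack can fall inside or outside the limit depending on how closely loaded areas are placed.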

4 Options

The ACES team expects to have future requirements for system upgrades and/or additional quantities of components based on the configurations proposed in response to this solicitation. The Offeror should address any technical challenges foreseen with respect to scaling and any other production issues. Proposals should be as detailed as possible.

4.1 Upgrades, Expansions and Additions

The Offeror shall propose the following separately priced options using whatever is the natural unit for the proposed architecture design as determined by the Offeror. For example, for system size, the unit may be number of racks, number of blades, number of nodes or some other unit appropriate for the system architecture. If the proposed design has no option to scale one or more of these features, the Offeror should simply state this in the proposal response.

4.1.1 The Offeror shall describe and separately price options for scaling up the overall Crossroads system. These options may be larger than the smallest compute partition. Any of these options may be exercised multiple times.

4.1.1.1 The Offeror shall propose a configuration or configurations which increase the baseline memory capacity in steps, e.g., by adding 25%, 50%, 100%, and 200%.

4.1.1.2 The Offeror shall propose and separately price upgrades or expansions for scaling the capacity and performance of the Crossroads I/O subsystem such that it can retain all application input, output, and working data for 24 and 36 weeks. Refer to section where it indicates that the minimum amount of such data is 12% of baseline system memory per day. If the Offeror's I/O subsystem consists of multiple storage tiers, the Offeror shall describe and separately price options for scaling each storage tier separately.
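The retention requirement above implies a minimum storage capacity that scales with baseline memory. The Python sketch below works through that arithmetic using the 12% per day figure quoted in the requirement; the baseline memory size of 1000 TB is a hypothetical placeholder, and the function name is illustrative only.

```python
DAILY_DATA_PERCENT = 12  # minimum daily data volume, as a percent of baseline memory


def retention_capacity_tb(baseline_memory_tb: float, weeks: int) -> float:
    """Minimum capacity needed to retain all data generated over the given
    number of weeks at DAILY_DATA_PERCENT of baseline memory per day."""
    days = weeks * 7
    return baseline_memory_tb * DAILY_DATA_PERCENT * days / 100


# Hypothetical baseline memory of 1000 TB (not a value from this document).
for weeks in (24, 36):
    capacity = retention_capacity_tb(1000, weeks)
    print(f"{weeks} weeks: {capacity:.0f} TB "
          f"({capacity / 1000:.2f}x baseline memory)")
```

Under these assumptions, 24 weeks of retention corresponds to roughly 20 times baseline memory and 36 weeks to roughly 30 times, before any allowance for file system overhead or redundancy.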

4.1.1.3 The Offeror shall propose and separately price any options for upgrading the Offeror's proposed technology of the Crossroads system over its five-year lifetime.

4.1.2 The Offeror shall also provide separately priced options for systems which the CONTRACTOR may procure in addition to the Crossroads system that provide approximately 10%, 25%, 50%, and 200% of the Offeror's proposed capability of the Crossroads system. The options proposed for the expansion of the Crossroads system under 4.1.1 above shall also apply to any additional system purchased, at a cost proportional to the system capability compared to Crossroads.

4.2 Early Access Development System

To allow for early and/or accelerated development of applications or development of functionality required as a part of the statement of work, the Offeror shall propose options for early access development systems. These systems can be in support of the baseline requirements or any proposed options.

4.2.1 The Offeror shall propose an Early Access Development System. The primary purpose is to expose the application to the same programming environment as will be found on the final system. It is acceptable for the early access system not to use the final processor, node, or high-speed interconnect architectures. However, the programming and runtime environment must be sufficiently similar that a port to the final system is trivial. The early access system shall contain functionality similar to that of the final system, including file systems, but scaled down to the appropriate configuration. The Offeror shall propose an option for the following configurations based on the size of the final Crossroads system.

• 2% of the compute partition.
• 5% of the compute partition.
• 10% of the compute partition.

4.2.2 The Offeror shall propose development test bed systems that will reduce risk and aid the development of any advanced functionality that is exercised as a part of the statement of work.

4.3 Test Systems

The Offeror shall propose the following test systems. The systems shall contain all the functionality of the main system, including file systems, but scaled down to the appropriate configuration. Multiple test systems may be awarded.

4.3.1 The Offeror shall propose an Application Regression test system, which should contain at least 200 compute nodes.

4.3.2 The Offeror shall propose a System Development test system, which should contain at least 50 compute nodes.

4.4 On Site System and Application Software Analysts

4.4.1 The Offeror shall propose and separately price two (2) System Software Analysts and two (2) Applications Software Analysts for each site. Offerors shall presume each analyst will be utilized for four (4) years. For Crossroads, these positions require a DOE Q-clearance for access.

4.5 Deinstallation

The Offeror shall propose to deinstall, remove and/or recycle the system and supporting infrastructure at end of life. Storage media shall be destroyed to the satisfaction of ACES, and/or returned to ACES at its request.

4.6 Maintenance and Support

The Offeror shall propose and separately price maintenance and support with the following features:

4.6.1 Maintenance and Support Period

The Offeror shall propose all maintenance and support for a period of four (4) years from the date of acceptance of the system. Warranty shall be included in the 4 years. For example, if the system is accepted on April 1, 2021 and the Warranty is for one year, then the Warranty ends on March 31, 2022, and the maintenance period begins April 1, 2022 and ends on March 31, 2025. Offeror shall also propose additional maintenance and support extension for years 5-7.

4.6.2 Maintenance and Support Solutions

The Offeror shall propose the following maintenance and support solutions and propose pricing separately for each solution. ACES may purchase either one of the solutions or neither of the solutions, at its discretion. Different maintenance solutions may be selected for the various test systems and final system.

4.6.2.1 Solution 1 – 7x24

The Offeror shall price Solution 1 as full hardware and software support for all Offeror provided hardware components and software. The principal period of maintenance (PPM) shall be 24 hours by 7 days a week with a four-hour response to any request for service. Hardware service requests require the Offeror to be on-site within four (4) hours of the request.

4.6.2.2 Solution 2 – 5x9

The Offeror shall price Solution 2 as full hardware and software support for all Offeror provided hardware components and software. The principal period of maintenance (PPM) shall be 9 hours by 5 days a week (exclusive of holidays observed by ACES). The Successful Offeror shall provide hardware maintenance training for ACES staff so that staff are able to provide hardware support for all other times the Offeror is unable to provide hardware repair in a timely manner outside of the PPM. The Successful Offeror shall supply hardware maintenance procedural documentation, training, and manuals necessary to support this effort.

All proposed maintenance and support solutions shall include the following features and meet all requirements of this section.

4.6.3 General Service Provisions

The Successful Offeror shall be responsible, at its own expense, for the repair or replacement of any failing hardware component that it supplies and correction of defects in software that it provides as part of the system. At its sole discretion, ACES may request advance replacement of components which show a pattern of failures which reasonably indicates that future failures may occur in excess of reliability targets, or for which there is a systemic problem that prevents effective use of the system.

Hardware failures due to environmental changes in facility power and cooling systems which can be reasonably anticipated (such as brown-outs, voltage-spikes or cooling system failures) are the responsibility of the Offeror.

4.6.4 Software and Firmware Update Service

The Successful Offeror shall provide an update service for all software and firmware provided for the duration of the Warranty plus Maintenance period. This shall include new releases of software/firmware and software/firmware patches as required for normal use. The Successful Offeror shall integrate software fixes, revisions or upgraded versions in supplied software, including community software (e.g., Linux or Lustre), and make them available to ACES within twelve (12) months of their general availability. The Successful Offeror shall provide prompt availability of patches for cybersecurity defects.

4.6.5 Call Service

The Successful Offeror shall provide contact information for technical personnel with knowledge of the proposed equipment and software. These personnel shall be available for consultation by telephone and electronic mail with ACES personnel. In the case of degraded performance, the Successful Offeror's services shall be made readily available to develop strategies for improving performance, e.g., patches and workarounds.

4.6.6 On-site Parts Cache

The Successful Offeror shall maintain a parts cache on-site at the ACES facilities. The parts cache shall be sized and provisioned sufficiently to support all normal repair actions for two weeks without the need for parts refresh. The initial sizing and provisioning of the cache shall be based on Offeror's Mean Time Between Failure (MTBF) estimates for each FRU and each rack, scaled based on the number of FRUs and racks delivered. The parts cache configuration will be periodically reviewed for quantities needed to satisfy this requirement, and adjusted if necessary, based on observed FRU or node failure rates. The parts cache will be resized, at the Offeror's expense, should the on-site parts cache prove to be insufficient to sustain the actually observed FRU or node failure rates.
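One possible way to turn MTBF estimates into an initial cache size for a two-week window is sketched below in Python. It assumes a constant failure rate per FRU, so that failures in the window follow a Poisson distribution; that model, the 95% confidence level, the function names, and the example FRU count and MTBF are all assumptions made for illustration and are not specified by this document.

```python
import math

TWO_WEEKS_HOURS = 14 * 24  # repair window the cache must cover


def expected_failures(fru_count: int, mtbf_hours: float,
                      window_hours: float = TWO_WEEKS_HOURS) -> float:
    """Expected number of failures across all delivered FRUs in the window,
    assuming each FRU fails at a constant rate of 1 / MTBF."""
    return fru_count * window_hours / mtbf_hours


def spares_for_confidence(mean_failures: float, confidence: float = 0.95) -> int:
    """Smallest spare count k such that a Poisson-distributed number of
    failures with the given mean is <= k with at least the given probability."""
    k = 0
    term = math.exp(-mean_failures)  # P(exactly 0 failures)
    cdf = term
    while cdf < confidence:
        k += 1
        term *= mean_failures / k    # P(exactly k failures)
        cdf += term
    return k


# Hypothetical FRU type: 10,000 units delivered, MTBF of 1,000,000 hours.
mean = expected_failures(10_000, 1_000_000)
print(f"expected failures in two weeks: {mean:.2f}")
print(f"spares for 95% coverage: {spares_for_confidence(mean)}")
```

Sizing to a coverage probability rather than to the expected value alone leaves headroom for clusters of failures; the periodic review against observed failure rates described above would then replace the MTBF estimate with measured data.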

4.6.7 On-Site Node Cache

The Successful Offeror shall also maintain an on-site spare node inventory of at least 1% of the total nodes in all of the system. These nodes shall be maintained and tested for hardware integrity and functionality utilizing the Hardware Support Cluster defined below if provided.

4.6.8 Hardware Support Cluster

The Successful Offeror shall provide a Hardware Support Cluster (HSC). The HSC shall support the hot spare nodes and provide functions such as hardware burn-in, problem diagnosis, etc. The Successful Offeror shall supply sufficient racks, interconnect, networking, storage equipment and any associated hardware/software necessary to make the HSC a stand-alone system capable of running diagnostics on individual or clusters of HSC nodes. ACES will store and inventory the HSC and other on-site parts cache components.

4.6.9 DOE Q-Cleared Technical Service Personnel

The Crossroads system will be installed in security areas that require a DOE Q-clearance for access. It will be possible to install the system with the assistance of uncleared US citizens or L-cleared personnel, but the Successful Offeror shall arrange and pay for appropriate third-party security escorts. The Successful Offeror shall obtain necessary clearances for on-site support staff to perform their duties.

5 Delivery and Acceptance

Testing of the system shall proceed in three steps: pre-delivery, post-delivery, and acceptance. Each step is intended to validate the system and feeds into subsequent activities. Sample Acceptance Test plans (Appendix A) are provided as part of the Request for Proposal.

5.1 Pre-delivery Testing

The ACES team and the Successful Offeror shall perform pre-delivery testing at the factory on the hardware to be delivered. Any limitations for performing the pre-delivery testing should be identified in the Offeror's proposal, including scale and licensing limitations (if any). During pre-delivery testing, the Successful Offeror shall:

• Demonstrate RAS capabilities and robustness using simple fault injection techniques, such as disconnecting cables, powering down subsystems, or installing known bad parts.
• Demonstrate functional capabilities on each segment of the system built, including the capacity to build applications, schedule jobs, and run them using a customer-provided testing framework. The root cause of application failure must be identified prior to system shipping.
• Provide a file system sufficiently provisioned to support the suite of tests.
• Provide onsite and remote access to the ACES team to monitor testing and analyze results.
• Instill confidence in the ability to conform to the statement of work.

5.2 Site Integration and Post-delivery Testing

The ACES team and the Successful Offeror staff shall perform site integration and post-delivery testing on the fully delivered system. Limitations and/or special requirements may exist for access to the onsite system by the Offeror.

• During post-delivery testing, the pre-delivery tests shall be run on the full system installation.
• Where applicable, tests shall be run at full scale.

5.3 Acceptance Testing

The ACES team and the Successful Offeror staff shall perform onsite acceptance testing on the fully installed system. Limitations and/or special requirements may exist for access to the onsite system by the Offeror.

5.3.1 The Successful Offeror shall demonstrate that the delivered system conforms to the subcontract's Statement of Work.

6 Risk and Project Management

The Offeror shall propose a risk management strategy and project management plan for the Crossroads system that is closely coordinated between the subcontracts for Triad National Security, LLC.

6.1.1 The Offeror shall propose a risk management strategy for the system in the event of technology problems or scheduling delays that affect delivery of the system or achievement of performance targets in the proposed timeframe. Offeror shall describe the impact of substitute technologies (if any) on the overall architecture and performance of the system, in particular addressing the four technology areas listed below:

• Processor
• Memory
• High-speed interconnect
• Platform storage

6.1.2 The Offeror shall identify any other high-risk areas and accompanying mitigation strategies for the system.

6.1.3 The Offeror shall provide a clear plan for effectively responding to software and hardware defects and system outages at each severity level and document how problems or defects will be escalated.

6.1.4 The Offeror shall propose a roadmap showing how their response to this Request for Proposal aligns with their plans for Exascale computing.

6.1.5 The Offeror shall identify additional capabilities, including:

• Its ability to produce and maintain the system for the life of the system
• Its ability to achieve specific quality assurance, reliability, availability and serviceability goals
• Its in-house testing and problem diagnosis capability, including hardware resources at appropriate scale

6.1.6 The Offeror shall provide project management specifics for the ACES team detailed as part of the Request for Proposal document. Please see Appendix B for further information.

7 Documentation and Training

The Successful Offeror shall provide documentation and training to effectively operate, configure, maintain, and use the systems to the ACES team and users of the Crossroads system. The ACES team may, at their option, make audio and video recordings of presentations from the Successful Offeror's speakers at public events targeted at the NNSA user communities (e.g., user training events, collaborative application events, best practices discussions, etc.). The Successful Offeror shall grant the ACES team user and distribution rights of documentation provided by the Offeror, session materials, and recorded media to be shared with other DOE Labs' staff and all authorized users and support staff for Crossroads.

7.1 Documentation

7.1.1 The Successful Offeror shall provide documentation for each delivered system describing the configuration, interconnect topology, labeling schema, hardware layout, etc. of the system as deployed before the commencement of system acceptance testing.

7.1.2 The Successful Offeror shall supply and support system and user-level documentation for all components before the delivery of the system. Upon request by the laboratories, the Successful Offeror shall supply additional documentation necessary for operation and maintenance of the system. All user-level documentation shall be publicly available.

7.1.3 The Successful Offeror shall distribute and update all documentation electronically and in a timely manner. For example, changes to the system shall be accompanied by relevant documentation. Documentation of changes and fixes may be distributed electronically in the form of release notes. Reference manuals may be updated later, but effort should be made to keep all documentation current.

7.2 Training

7.2.1 The Successful Offeror shall provide the following types of training at facilities specified by ACES:

Class Type    Number of Classes

System Operations and Advanced Administration    2

User Programming    3

7.2.2 The Offeror shall describe all proposed training and documentation relevant to the proposed solutions utilizing the following methods:

• Classroom training
• Onsite training
• Online documentation
• Online training

8 References

ACES schedule and high-level information can be found at the primary Crossroads website (http://crossroads.lanl.gov/).

Crossroads benchmarks and workflows white paper can be found at the Crossroads Benchmark and Workflows website.

High Performance Computing Power Application Programming Interface Specification website (http://powerapi.sandia.gov/).

Appendix A: Sample Acceptance Plan

Appendix A-1: ACES Sample Acceptance Plan

Testing of the system shall proceed in three steps: pre-delivery, post-delivery and acceptance. Each step is intended to validate the system and feeds into subsequent activities.

Pre-delivery (Factory) Test

The Subcontractor shall demonstrate all hardware is fully functional prior to shipping. If the system is to be delivered in separate shipments, each shipment shall undergo pre-delivery testing. If the Subcontractor proposes a development system subcomponent, Triad recognizes that the development system is not part of the pre-delivery acceptance criteria.

ACES and Subcontractor staff shall perform pre-delivery testing at the factory on the hardware to be delivered. Any limitations for performing the pre-delivery testing need to be identified, including scale and licensing limitations.

• Demonstrate RAS capabilities and robustness, using simple fault injection techniques such as disconnecting cables, powering down subsystems, or installing known bad parts.

• Demonstrate functional capabilities on each segment of the system built, including the capability to build applications, schedule jobs, and run them using the customer-provided testing framework. The root cause of any application failure must be identified.

• The Offeror shall provide a file system sufficiently provisioned to support the suite of tests.

• Provide onsite and remote access for ACES staff to monitor testing and analyze results.

• Instill confidence in the ability to conform to the statement of work.

Pre-Delivery Assembly

• The Subcontractor shall perform the pre-delivery test of Crossroads or agreed-upon sub-configurations of Crossroads at the Subcontractor's location prior to shipment. At its option, ACES may send a representative(s) to observe testing at the Subcontractor's facility. Work to be performed by the Subcontractor includes:

  o All hardware installation and assembly

  o Burn in of all components

  o Installation of software

  o Implementation of the ACES-specific production system-configuration and programming environment

  o Perform tests and benchmarks to validate functionality, performance, reliability, and quality

• Run benchmarks and demonstrate that benchmarks meet performance commitments.

Pre-Delivery Configuration

• TBD

Pre-Delivery Test

Subcontractor shall provide ACES on-site access to the system in order to verify that the system demonstrates the ability to pass acceptance criteria.

The pre-delivery test shall consist of (but is not limited to) the following tests:

Name of Test    Pass Criteria

System power up    All nodes boot successfully

System power down    All nodes shut down

Unix commands    All UNIX/Linux and vendor specific commands function correctly

Monitoring    Monitoring software shows status for all nodes

Reset    “Reset” functions on all nodes

Power On/Off    Power cycle all components of the entire system from the console

Fail Over/Resilience    Demonstrate proper operation of all fail-over or resilience mechanisms

Full Configuration Test    Pre-delivery system can efficiently run applications that use the entire compute resource of the pre-delivery system. The applications to be run will be drawn from the 72-hour test runs, scaled to the pre-delivery configuration

Benchmarks    Benchmarks shall achieve performance within the limits of pre-delivery configuration

72 Hour test    100% availability of the pre-delivery system for a 72-hour test period while running an agreed-upon workload that exercises at least 99% of the compute resources
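The pass criterion for the 72 Hour test combines two conditions, which the Python sketch below makes explicit. Measuring "compute resources" by node count is an assumption made here for illustration, as are the function name and the example figures; the agreed-upon workload and measurement method would be defined elsewhere.

```python
TEST_PERIOD_HOURS = 72.0
MIN_RESOURCE_FRACTION = 0.99  # workload must exercise at least 99% of compute resources


def passes_72_hour_test(downtime_hours: float, nodes_exercised: int,
                        nodes_total: int) -> bool:
    """True only if the system was available for the whole test period and
    the workload exercised the required share of compute nodes."""
    availability = (TEST_PERIOD_HOURS - downtime_hours) / TEST_PERIOD_HOURS
    resource_fraction = nodes_exercised / nodes_total
    return availability == 1.0 and resource_fraction >= MIN_RESOURCE_FRACTION


# Hypothetical pre-delivery system of 1000 compute nodes.
print(passes_72_hour_test(0.0, 995, 1000))   # no downtime, 99.5% of nodes used
print(passes_72_hour_test(0.5, 1000, 1000))  # any downtime fails the 100% criterion
print(passes_72_hour_test(0.0, 980, 1000))   # workload covers too few nodes
```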

Post-delivery Integration and Test

Post-delivery Integration

During Post-Delivery Integration, the Subcontractor's system(s) shall be delivered, installed, fully integrated, and shall undergo Subcontractor stabilization processes. Post-delivery testing shall include replication of all of the pre-delivery testing steps, along with appropriate tests at scale, on the fully integrated platform. Where applicable, tests shall be run at full scale.

Site Integration

When the Subcontractor has declared the system to be stable, the Subcontractor shall make the system available to ACES personnel for site-specific integration and customization. Once the Subcontractor's system has undergone site-specific integration and customization, the acceptance test shall commence.

Acceptance Test

The Acceptance Test Period shall commence when the system has been delivered and physically installed, and has completed stabilization and site-specific integration and customization. The duration of the Acceptance Test period is defined in the Statement of Work. All tests shall be performed on the initial production configuration of the Crossroads system, as defined by ACES and as it will be deployed to the ACES user community. The Subcontractor shall supply source code used, compile scripts, output, and verification files for all tests run by the Subcontractor. All such provided materials become the property of Triad. ACES may run all or any portion of these tests at any time on the system to ensure the Subcontractor's compliance with the requirements set forth in this document. The acceptance test shall consist of a Functionality Demonstration, a System Boot Test, a System Resilience Test, a Performance Test, and an Availability Test, performed in that order.

Functionality Demonstration

Subcontractor and ACES will perform the Functionality Demonstration on a dedicated system. The Functionality Demonstration shall show that the system is configured and functions in accordance with the statement of work. Demonstrations shall include, but are not limited to, the following:

• Remote monitoring, power control and boot capability

• Network connectivity

• File system functionality

• Batch system

• System management software

• Program building and debugging (e.g., compilers, linkers, libraries, etc.)

• Unix functions

System Boot Test

Subcontractor and ACES will perform the System Boot Test on a dedicated system. The System Boot Test shall show that the system is configured and functions in accordance with the statement of work. Demonstrations shall include, but are not limited to, the following:

• Two successful system cold boots to production state, with no intervention to bring the system up. Production state is defined as running all system services required for production use and being able to compile and run parallel jobs on the full system. In a cold boot, all elements of the system (compute, login, I/O) are completely powered off before the boot sequence is initiated. All components are then powered on.

• Single node power-fail/reset test: Failure or reset of a single compute node shall not cause system-wide failure.

System Resilience Test

Subcontractor and ACES will perform the System Resilience Test on a dedicated system. The System Resilience Test shall show that the system is configured and functions in accordance with the statement of work. All system resilience features of Crossroads shall be demonstrated via fault-injection tests when running test applications at scale. Fault injection operations should include both graceful and hard shutdowns of components. The metrics for resilience operations include correct operation, any loss of access or data, and time to complete the initial recovery plus any time required to restore (fail-back) a normal operating mode for the failed components.

PerformanceTestCrossroadssystemperformanceandbenchmarktestsarefullydocumentedintheStatementofWorkalongwithguidanceandtestinformationfoundattheCrossroadswebsite.TheSubcontractorshallruntheCrossroadstestsandapplicationbenchmarks,fullconfigurationtest,externalnetworktestandfilesystemmetadatatestasdescribedintheApplicationandBenchmarkRunRulesdocument.Benchmarkanswersmustbecorrect,andeachbenchmarkresultmustmeetorexceedperformancecommitmentsintheperformancerequirementssection.Benchmarksmustberunusingthesuppliedresourcemanagementandschedulingsoftware.Exceptasrequiredbytherunrules,benchmarksneednotberunconcurrently.IfrequestedbyTriad,Subcontractorshallreconfiguretheresourcemanagementsoftwaretoutilizeonlyasubsetofcomputenodes,specifiedbyTriad.

JMTTI and System Availability Testing

The JMTTI and System Availability Test will commence after successful completion of the Functionality Demonstration, System Test and Performance Test. ACES will perform the JMTTI and Availability Test. The Crossroads system must demonstrate the JMTTI and availability metrics defined in the Statement of Work, within an agreed-upon period of time. An automated job launch and outcome analysis tool, such as the Pavilion HPC Testing Framework, shall be used to manage an agreed-upon workload that will be used to measure the reliability of individual jobs. These jobs shall be a mixture of benchmarks from the Performance Test and other applications. Every test in the JMTTI and System Availability Test workload shall obtain a correct result in both dedicated and non-dedicated modes:

• In dedicated mode, each benchmark in the Performance Test shall meet the performance commitment specified in the Statement of Work. In non-dedicated mode, the mean performance of each performance test shall meet or exceed the performance commitment specified in the Statement of Work.

• During the JMTTI and System Availability Test, ACES shall have full access to the system and shall monitor the system. Triad and users designated by Triad shall submit jobs through the Crossroads resource management system.

• During the JMTTI and System Availability Test, the Subcontractor shall adhere to the following requirements:

o All hardware and software shall be fully functional at the end of the JMTTI and Availability Test. Any downtime required to repair failed hardware or software shall be considered an outage unless it can be repaired without impacting system availability.

o Hardware and software upgrades shall not be permitted during the last 7 days of the JMTTI and Availability Test. The system shall be considered down for the time required to perform any upgrades, including rolling upgrades.

o No significant (i.e. levels 1, 2 or 3) problems shall be open during the last 7 days.

• During the JMTTI and Availability Testing period, if any system software upgrade or significant hardware repairs are applied, the Subcontractor shall be required to run the Performance Tests and demonstrate that the changes incur no loss of performance. At its option, Triad may also run any test deemed necessary. Time taken to run the Performance and other tests shall not count as downtime, provided that all tests perform to specifications.
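As a rough illustration of the dedicated and non-dedicated performance criteria stated in this section, the following Python sketch checks a set of benchmark runs against a commitment. It assumes a higher-is-better figure of merit and treats "each benchmark shall meet the commitment" in dedicated mode as applying to every individual run. The function name, argument names and sample numbers are hypothetical and do not come from the Statement of Work.

```python
from statistics import mean


def meets_commitment(results, commitment, dedicated):
    """Return True if benchmark results satisfy the performance commitment.

    Assumption: results are higher-is-better figures of merit.
    Dedicated mode: every run must meet the commitment.
    Non-dedicated mode: the mean of the runs must meet or exceed it.
    """
    if not results:
        raise ValueError("at least one benchmark result is required")
    if dedicated:
        return all(r >= commitment for r in results)
    return mean(results) >= commitment


# Hypothetical figures of merit for one benchmark (illustrative only).
runs = [98.0, 103.0, 101.0]
print(meets_commitment(runs, 100.0, dedicated=True))   # one run is below 100
print(meets_commitment(runs, 100.0, dedicated=False))  # the mean is above 100
```

For a lower-is-better metric such as wall-clock time, the comparisons would need to be reversed.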

Definitions for Node and System Failures

The baseline of interrupts, as used in the JMTTI and SMTBI calculations, shall include, but may not be limited to, the following circumstances:

• A node shall be defined as down if a hardware problem causes Subcontractor supplied software to crash or the node is unavailable. Failures that are transparent to Subcontractor-supplied software because of redundant hardware shall not be classified as a node being down as long as the failure does not impact node or system performance. Low severity software bugs and suggestions (e.g. wrong error message) associated with Subcontractor supplied software will not be classified as a node being down.

• A node shall be classified as down if a defect in the Subcontractor supplied software causes a node to be unavailable. Communication network failures external to the system, and user application program bugs that do not impact other users shall not constitute a node being down.

• Repeat failures within eight hours of the previous failure shall be counted as one continuous failure.

• The Subcontractor's system shall be classified as down (and all nodes shall be considered down) if any of the following requirements cannot be met (“system-wide failures”):

o Complete a POSIX `stat' operation on any file within all Subcontractor-provided file systems and access all data blocks associated with these files.

o Complete a successful interactive login to the Subcontractor's system. Failures in the ACES network do not constitute a system-wide failure.

o Successfully run any part of the performance test. The Performance Test consists of the Crossroads Benchmarks, the Full Configuration Test and the External Network Test.

o Full switch bandwidth is available. Failure of a switch adapter in a node does not constitute a system-wide failure. However, failure of a switch would constitute failure, even if alternate switch paths were available, because full bandwidth would not be available for multiple nodes.

o User applications can be launched and/or completed via the scheduler.

• Other failures in Subcontractor supplied products and services that disrupt work on a significant portion of the nodes shall constitute a system-wide outage.

• If there is a system-wide outage, Triad shall turn over the system to the Subcontractor for service when the Subcontractor indicates they are ready to begin work on the system. All nodes are considered down during a system-wide outage.

• Downtime for any outage shall begin when Triad notifies the Subcontractor of a problem (e.g. an official problem report is opened) and, for system outages, when the system is made available to the Subcontractor. Downtime shall end when:

o For problems that can be addressed by bringing up a spare node or by rebooting the down node, the downtime shall end when a spare node or the down node is available for production use.

o For problems requiring the Subcontractor to repair a failed hardware component, the downtime shall end when the failed component is returned to Triad and available for production use.

o For software downtime, the downtime shall end when the Subcontractor supplies a fix that rectifies the problem or when Triad reverts to a prior copy of the failing software that does not exhibit the same problem. A failure due to ACES or to other causes out of the Subcontractor's control shall not be counted against the Subcontractor unless the failure demonstrates a defect in the system. If there are any disagreements as to whether a failure is the fault of the Subcontractor or ACES, they shall be resolved prior to the end of the acceptance period.
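The rule that repeat failures within eight hours count as one continuous failure can be sketched in Python as follows. This is a minimal illustration, not a prescribed accounting method. It assumes the eight-hour window is measured from the end of the previous failure to the start of the next, and that a "continuous failure" spans from the first start to the last end. The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

REPEAT_WINDOW = timedelta(hours=8)  # from the repeat-failure rule above


def coalesce_failures(failures, window=REPEAT_WINDOW):
    """Merge repeat failures of one node into continuous failures.

    failures: iterable of (start, end) datetime pairs.
    Assumption: a failure is a repeat when it starts within `window`
    of the end of the previous (possibly already merged) failure.
    """
    merged = []
    for start, end in sorted(failures):
        if merged and start - merged[-1][1] <= window:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def total_downtime(failures):
    """Sum the durations of the coalesced failures."""
    spans = (end - start for start, end in coalesce_failures(failures))
    return sum(spans, timedelta())


# Hypothetical failure log for a single node (illustrative only).
events = [
    (datetime(2021, 1, 1, 0, 0), datetime(2021, 1, 1, 1, 0)),
    (datetime(2021, 1, 1, 5, 0), datetime(2021, 1, 1, 6, 0)),    # repeat, 4 h later
    (datetime(2021, 1, 2, 12, 0), datetime(2021, 1, 2, 13, 0)),  # separate failure
]
print(len(coalesce_failures(events)), total_downtime(events))
```

Under these assumptions the first two events collapse into one six-hour failure, so the node is charged more downtime than the sum of the individual events. Whether the gap between repeats counts as downtime is an interpretation that the parties would need to confirm.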

Appendix B: Triad Specific Project Management Requirements

Appendix B-1: Triad Project Management Requirements

NOTE: The following requirements apply to the project management of the delivery of the system proposed by the subcontractor.

Project Management

The development, pre-shipment testing, installation and acceptance testing of the Crossroads system is a complex endeavor and will require close cooperation between the Subcontractor, Triad National Security, LLC (Triad, TNS), and ACES. There shall be quarterly executive reviews by corporate officers of the Subcontractor, ACES, and representatives of DOE NNSA, to assess the progress of the project.

Project Planning Workshop

• Triad and Subcontractor shall schedule and complete a workshop to mutually understand and agree upon project management goals, techniques, and processes.

• The workshop shall take place no later than award + 45 days.

Project Plan

• Delivery Milestone: no later than award + 60 days

Subcontractor shall provide Triad with a detailed Project Plan – which includes a detailed Work Breakdown Structure (WBS). The Project Plan shall contain all aspects of the proposed Subcontractor’s solution and associated engineering (hardware and software) and support activities.

The Project Plan shall address or include:

• Program Management

• High Assurance Delivery Process

WBS:

o Facilities Planning (e.g., floor, power & cooling, cabling);

o Computer Hardware Planning;

o Installation & Test Planning;

o Deployment and Integration Milestones

o System Stability Planning;

o System Scalability Planning;

o Software Plan

o Testing

o Development

o Testing

o Deployment

o Risk Assessment & Risk Mitigation

o Staffing;

o On-site Warranty and Maintenance and Support Planning;

o Training & Education;

Project Plan – Program Management

At a minimum, the Project Plan – Program Management Section shall:

o Identify, by name, the Program Management Team members;

o Identify, by name, the lead Crossroads System Architect

o Identify, by name, the Crossroads System RAS Point of Contact

o Describe the roles and responsibilities of the Team members;

o List Subcontractor’s Management Contacts;

o Define and institutionalize the Periodic Progress Review process with regard to frequency (daily, weekly, monthly, quarterly, and annually), level (support, technical, and executive), and escalation procedures.

• Additionally, the Project Plan – Program Management Section shall detail the joint activities of the Subcontractor and Triad to monitor and assess the overall Program Performance.

• Triad will furnish the Subcontractor with a top-10 list of problems and issues. The Subcontractor is responsible for appointing a point of contact for each of the items on the list. This list shall be reviewed weekly.

• All Subcontractor Program Management shall interface with the designated Triad Crossroads project manager.

• The WBS will be updated by the Subcontractor monthly and reviewed for approval by Triad.

• The Subcontractor Project Plan shall be updated by the Subcontractor quarterly and reviewed for approval by Triad.

Project Plan - High Assurance Hardware Delivery Process

Subcontractor shall provide Triad with a high assurance delivery process and certification program for hardware deliverables of all stages of the deployment and operational use by the ASC Applications Community of the systems. All assets delivered shall be, at a minimum, factory-tested and field-certified. A “pre-delivery test” shall take place at the factory prior to each shipment. Functional diagnostics and agreed upon ACES applications shall be executed to verify the proper functioning of each system prior to shipment. Problems identified as a result of these tests shall be corrected prior to shipment. Assets that have successfully completed this pre-delivery test are “pre-verified.”

Project Plan - High Assurance Software Delivery Process

Subcontractor shall provide Triad with a high assurance delivery process and certification program for software deliverables of all stages of the deployment and operational use by the NNSA ASC tri-lab simulation community of the Crossroads systems. In addition, Subcontractor shall provide Triad with documentation of Subcontractor’s anticipated software release schedules during the lifetime of the subcontract. This includes major and minor releases, updates, and fixes as well as expected beta-level availability.

• Beta software and/or pre-GA software is anticipated to be installed and run on these systems; however, all such installations are subject to Triad approval;

• Subcontractor shall provide Triad with a list of interdependencies between hardware and software as they pertain to the delivered systems;

Project Plan – WBS, Milestones

Subcontractor shall define appropriate high-level Milestones for the execution of the delivery and acceptance of the Crossroads system.

Project Plan – WBS, Facilities Planning

Facilities planning shall be compliant with the requirements of the Facilities described in the Technical Requirements.

Project Plan – WBS, System Stability Planning

Scalable systems of the size being delivered can at times prove difficult to predict in terms of stability. The number of components can have a significant effect on the stability and may provide some scalability problems in terms of stability of the system. Triad requires a plan to progressively qualify a series of configurations of increasing complexity, in terms of both processor counts and interconnect topology.

Subcontractor shall be responsible for delivering a Stabilization Plan that includes the following:

• Plan objectives

• Target Goals for Stability, as agreed to jointly with Triad and ACES

• Technical Strategy

• Roles and responsibilities

• Testing Plan

• Progress Evaluation Checkpoints

• Contingencies

Project Plan – Staffing:

• Staff Support shall be for the life of the subcontract.

• Subcontractor shall identify its members of the Project Team.

Project Plan – On-site Warranty and Maintenance and Support Planning

• On-site Warranty and Maintenance and Support shall be for the life of the subcontract.

• On-site Warranty and Maintenance and Support shall include Subcontractor’s preventive maintenance schedule.

• On-site Warranty and Maintenance and Support shall include logging and weekly reporting of all interruptions to service. At a minimum, the Subcontractor shall enter all interrupt logging into Triad tracking system.

Project Plan – Training and Education

• In addition to Subcontractor’s usual and customary customer Training and Education program, Subcontractor shall allow ACES staff access to Subcontractor’s internal Training & Education program;

• Training and Education Support shall be for the life of the subcontract.

Project Plan – Risk Assessment and Risk Mitigation

• Subcontractor shall provide Triad with a Risk Management Plan that identifies and addresses all identified risks.

• Subcontractor shall provide a risk management strategy for the proposed system in case of technology problems or scheduling delays that affect availability or achievement of performance targets in the proposed timeframe. Subcontractor shall describe the impact of substitute technologies on the overall architecture and performance of the system. In particular, the subcontractor shall address the technology areas listed below:

o Processor

o Memory

o High-Speed Interconnect

o Platform Storage and all other I/O subsystems

• Subcontractor shall continuously monitor and assess the risks involved for those major technology components that Subcontractor identifies to be on the Critical Path (i.e., Risk Assessment);

• Subcontractor shall provide Triad with timely and regular updates regarding Subcontractor’s Risk Assessment;

• Subcontractor shall provide Triad with a Risk Mitigation Plan. Each risk mitigation strategy shall be subject to Triad approval. Such Risk Mitigation Plan shall include:

o Risks Categorization – Risks shall be categorized according to

§ Probability of occurrence (low, medium, or high)

§ Impact to the program if they occur (low, medium, or high)

o Dates for Risk Mitigation Decision Points Identified

o Execution of mitigation plans is subject to Triad approval and may include:

§ Technology Substitution – subject to the condition that substituted technologies shall not have aggregate performance, capability, or capacity less than originally proposed;

§ 3rd Party Assistance – especially in areas of critical software development;

§ Source Code Availability – especially in the areas of Operating Systems, Communication Libraries;

§ Performance Compensation – possibility of compensating for performance shortfalls via additional deliveries.

o Subcontractor’s Risk Mitigation Plan will be reviewed quarterly by Triad.

Definitions and Glossary

Baseline Memory: High performance memory technologies such as DDR-DRAM, HBM, and HMC, for example, that may be included in the system's memory capacity requirement. It does not include memory associated with caches.

Coefficient of Variation: The ratio of the standard deviation to the mean.

Cold boot: Full power-on of a system from a non-energized state, such as a post power outage situation. It can be assumed that facilities power, water, network, and site infrastructure services have been returned to service and timings for all offered file systems and cluster can begin from this point.

Delta-Ckpt: The time to checkpoint 80% of aggregate memory of the system to persistent storage. For example, if the aggregate memory of the compute partition is 3 PiB, Delta-Ckpt is the time to checkpoint 2.4 PiB. Rationale: This will provide a checkpoint efficiency of about 90% for full system jobs.

Ejection Bandwidth: Bandwidth leaving the node (i.e., NIC to router).
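Two of the definitions above, Coefficient of Variation and Delta-Ckpt, can be made concrete with a short Python sketch. The glossary does not say whether the population or sample standard deviation is intended, so the population form is assumed here. The sustained checkpoint bandwidth is a hypothetical input that the glossary does not provide, and the function names are illustrative only.

```python
from statistics import mean, pstdev

CHECKPOINT_FRACTION = 0.80  # 80% of aggregate memory, per the Delta-Ckpt definition
TIB_PER_PIB = 1024


def coefficient_of_variation(samples):
    """Ratio of the standard deviation to the mean (population form assumed)."""
    return pstdev(samples) / mean(samples)


def delta_ckpt_seconds(aggregate_memory_pib, bandwidth_tib_per_s):
    """Time to write 80% of aggregate memory at a sustained bandwidth.

    bandwidth_tib_per_s is a hypothetical sustained write rate to
    persistent storage, not a figure taken from the document.
    """
    volume_tib = CHECKPOINT_FRACTION * aggregate_memory_pib * TIB_PER_PIB
    return volume_tib / bandwidth_tib_per_s


# The glossary's example: 3 PiB of aggregate memory means 2.4 PiB to checkpoint.
# At an assumed 2 TiB/s this takes roughly 1229 seconds.
print(round(delta_ckpt_seconds(3, 2.0)))

# Run-to-run variation of hypothetical benchmark timings (illustrative only).
print(round(coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9]), 3))
```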

Full Scale: All of the compute nodes in the system. This may or may not include all available compute resources on a node, depending on the use case.

Idle Power: The projected power consumed on the system when the system is in an Idle State.

Idle State: A state when the system is prepared to but not currently executing jobs. There may be multiple idle states.

Injection Bandwidth: Bandwidth entering the node (i.e., router to NIC).

Job Interrupt: Any system event that causes a job to unintentionally terminate.

Job Mean Time to Interrupt (JMTTI): Average time between job interrupts over a given time interval on the full scale of the system. Automatic restarts do not mitigate a job interrupt for this metric.

JMTTI/Delta-Ckpt: Ratio of the JMTTI to Delta-Ckpt, which provides a measure of how much useful work can be achieved on the system.
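As a minimal sketch of the two metrics just defined, the Python below estimates JMTTI as the observation interval divided by the number of job interrupts and then forms the JMTTI/Delta-Ckpt ratio. This simple estimator is an assumption, since the glossary does not prescribe how the average is computed. The function names and sample values are hypothetical.

```python
def jmtti_hours(interval_hours, job_interrupts):
    """Average time between job interrupts over an observation interval.

    Assumption: estimated as interval length divided by interrupt count.
    Returns None when no interrupt was observed, since no finite
    estimate can be formed from the interval alone.
    """
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    if job_interrupts < 0:
        raise ValueError("job_interrupts must not be negative")
    if job_interrupts == 0:
        return None
    return interval_hours / job_interrupts


def jmtti_to_delta_ckpt(jmtti_h, delta_ckpt_s):
    """Ratio of JMTTI to Delta-Ckpt, with both expressed in seconds."""
    if delta_ckpt_s <= 0:
        raise ValueError("delta_ckpt_s must be positive")
    return (jmtti_h * 3600.0) / delta_ckpt_s


# Hypothetical 30-day window with 30 job interrupts and a 600 s checkpoint.
jmtti = jmtti_hours(720.0, 30)
print(jmtti, jmtti_to_delta_ckpt(jmtti, 600.0))
```

A larger ratio means more time is available for useful work between interrupts relative to the cost of each checkpoint.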

Nameplate Power: The maximum theoretical power the system could consume. This is a design limit, likely not achievable in operation, commonly specified on electrical equipment labels and used for power provisioning design per National Electrical Code (NEC, NFPA 70).

Nominal Power: The projected power consumed on the system by the ACES workflows (e.g., a combination of the ACES benchmark codes running large problems on the entire system).

Operational Capability: Real, usable capabilities in production operation, not theoretical capabilities.

Peak Power: The projected power consumed by an application that utilizes the maximum achievable power consumption such as DGEMM.

Platform Storage: Any nonvolatile storage that is directly usable by the system, its system software, and applications. Examples would include disk drives, RAID devices, and solid-state drives, no matter the method of attachment.

Rolling Upgrades/Rolling Rollbacks: A rolling upgrade or a rollback is defined as changing the operating software or firmware of a system component in such a way that the change does not require synchronization across the entire system. Rolling upgrades and rollbacks are designed to be performed with those parts of the system that are not being worked on remaining in full operational capacity.

System Interrupt: Any system event, or accumulation of system events over time, resulting in more than 1% of the compute resource being unavailable at any given time. Loss of access to any dependent subsystem (e.g., platform storage or service partition resource) will also incur a system interrupt.

System Mean Time Between Interrupt (SMTBI): Average time between system interrupts over a given time interval.

System Availability: ((time in period - time unavailable due to outages in period) / (time in period - time unavailable due to scheduled outages in period)) * 100

System Initialization: The time to bring 99% of the compute resource and 100% of any service resource to the point where a job can be successfully launched.

Warm boot: The cluster/file system management servers being booted and configured.
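The System Availability formula and the System Interrupt threshold above can be illustrated with the following Python sketch. It assumes that "outages" in the numerator covers every outage in the period, scheduled and unscheduled, and that the compute resource is counted in nodes. Both are interpretations, and the function names and sample figures are hypothetical.

```python
def system_availability(period_h, all_outage_h, scheduled_outage_h):
    """System Availability in percent, following the glossary formula.

    Assumption: all_outage_h includes scheduled outages, so it can
    never be smaller than scheduled_outage_h.
    """
    if scheduled_outage_h > all_outage_h:
        raise ValueError("scheduled outages cannot exceed total outages")
    denominator = period_h - scheduled_outage_h
    if denominator <= 0:
        raise ValueError("period must exceed scheduled outage time")
    return (period_h - all_outage_h) / denominator * 100.0


def is_system_interrupt(unavailable_nodes, total_nodes, dependent_subsystem_lost=False):
    """True if more than 1% of compute nodes are unavailable, or if access
    to a dependent subsystem (e.g., platform storage) has been lost."""
    # Integer arithmetic avoids floating-point edge cases at exactly 1%.
    return dependent_subsystem_lost or unavailable_nodes * 100 > total_nodes


# Hypothetical 30-day period: 24 h scheduled maintenance, 30 h of outages in total.
print(round(system_availability(720.0, 30.0, 24.0), 2))
print(is_system_interrupt(101, 10000), is_system_interrupt(100, 10000))
```

Because scheduled outages are removed from both numerator and denominator, only the 6 hours of unscheduled outage in this example reduce the availability figure.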