Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
1
ARCHERSPServiceQuarterlyReport
Quarter42015
2
DocumentInformationandVersionHistoryVersion: 0.5Status Draft
Author(s): AlanSimpson,AnneWhiting,AndyTurner,MikeBrown,StephenBooth,JoBeech-Brandt
Reviewer(s) AlanSimpson
Version Date Comments,Changes,Status Authors,contributors,
reviewers0.1 2015-12-14 InitialDraft AnneWhiting0.2 2016-01-07 Furtherinformationadded AnneWhiting0.3 2016-01-12 Graphsadded AnneWhiting0.4 2016-01-13 Furtherinformationadded
includingappendixAnneWhiting
0.5 2016-01-15 ChangesmadefollowingreviewbyAndyTurnerandAlanSimpson
AlanSimpson,AndyTurner,AnneWhiting,MikeBrown
1.0 2016-01-19 Approved AlanSimpson1.1 2016-03-01 CaptiononPage14corrected JoBeech-Brandt,AlanSimpson
3
1. TheService1.1ServiceHighlightsThisisthereportfortheARCHERSPServicefortheReportingPeriods:October2015,November2015,December2015• Utilisationonthesystemduring15Q4was88%,thesameasin15Q3.• TheremainingtasksrequiredtoreturntonormalaftertheSonexionLustrefilesystemissues
werecompletedsuccessfullythisquarter:
Ø Thediskreplacementprogrammecompletedon22ndOctober.Thediskfailureratehasreturnedtothelevelsthatwouldbeexpectedafterthereplacements.
Ø RaidcheckwasreturnedtorunningonamonthlyschedulewiththesessionadvertisedontheARCHERpubliccalendar,minimisingthedisruptiontouserscomparedtotheweeklyschedulethatRaidcheckwaspreviouslyrunningon.
Ø EPSRC’scontractorcarriedoutalessonslearnedreview,withstafffromEPCCcontributingfullyandconstructivelytothediscussionandtothereport.Recommendedactionsarebeingreviewed,andwhereappropriate,implemented,withtheeffectsofanychangesbeingmonitored.
Ø Thenormalcycleofpatchingandupgradeshadhadtobepostponeduntiltheresolutionofthefilesystemissues;thishasnowrecommenced.
• Themuch-postponedCLEupgradewassuccessfullycarriedouton11November,bringing
ARCHERuptoversionCLE5.2.ThisisfacilitatingthecompletionofactionsdependentontheCLEupgrade,inadditiontoprovidingthenewfeaturesandfixesincluded.Adviceandassistancewasprovidedtousersbeforeandaftertheupgradetoassistthemwithactionsrequiredfollowingtheupgrade(suchasrecompilingapplicationsandverifyingoutput).
• Crayreplacedcablesinthehighspeednetworkinfrastructureon3and4November2015to
optimizethenetworkspeedandtransferperformance;• Crayisinvestigatinganissuewheremultipleesloginnodesarerandomlyhanging.Diagnosishas
beenhamperedbythemalfunctioningofkdump,thetooltoenableamemorydumptobetaken,forwhichafixhasbeeninstalledbytheServiceProvideron9thDecember.AfurtherrequiredchangetoincreasethekdumppartitionisscheduledtobeimplementedbytheServiceProviderduringthemaintenancesessionon27thJanuary.
• TherehavebeenanumberofPBS-relatedissuesthattheServiceProviderhasraisedwithCray
andwhichCrayareworkingoninconjunctionwithAltair.Theseissuesareprincipallywithregardtoschedulingparametersanditishopedthatimprovementsinjobturnaroundandbetterjobplacementwillresult.Worktodatelookspromising.
• InconjunctionwithEPSRCandtheCSEteam,benefitsrealisationmeasurementfortheARCHER
servicecontinues.Themetricsusedtomeasurethebenefitshavebeenreviewedandadjustedtoensurethechosenmetricsarebothmeasurableandmeaningful.Theaimistoensurethattheoptimalmetricsetpresentanaccurateandconsistentviewonthebenefitsprovided.
1.2ForwardLook
• RSIPnodeimplementationwillallowtheuseofapplicationsrequiringlicenseserversonthecomputenodes,forexample,compilersandISVapplications.RSIPcanbeusedtoprovide
4
licensestothecomputenodesforthecommercialCFDsoftwareSTAR-CCM+fromCD-Adapco.FollowingthesuccessfulCLEupgrade,theimplementationisanticipatedtobe24thFebruary;
• Aquarterlyround-tablediscussionbetweenallpartiesinvolvedindeliveringtheARCHER
serviceswillbeaddedtothecurrentquarterlyreview,withEPSRC,CrayandEPCCtodiscusssharedissues,exchangeviewsandsuggestactionstocontributetoaprogrammeofcontinuousimprovement.
• WorkiscommencingtopreparethenationalserviceforISO9000certification,tofurther
formalisetheimprovementmechanismsforservicedeliveryandprocessimprovement,withakeymeasurementbeingusersatisfaction.
5
2.ContractualPerformanceReportThisisthecontractualperformancereportfortheARCHERSPService.
2.1ServicePointsandServiceCreditsTheServiceLevelsandServicePointsfortheSPservicearedefinedasbelowinSchedule2.2.• 2.6.2-PhoneResponse(PR):90%ofincomingtelephonecallsansweredpersonallywithin2
minutesforanyServicePeriod.ServiceThreshold:85.0%;OperatingServiceLevel:90.0%.• 2.6.3-QueryClosure(QC):97%ofalladministrativequeries,problemreportsandnonin-depth
queriesshallbesuccessfullyresolvedwithin2workingdays.ServiceThreshold:94.0%;OperatingServiceLevel:97.0%.
• 2.6.4-NewUserRegistration(UR):ProcessNewUserRegistrationswithin1workingday.Definitions:OperatingServiceLevel:TheminimumlevelofperformanceforaServiceLevelwhichisrequiredbytheAuthorityiftheContractoristoavoidtheneedtoaccounttotheAuthorityforServiceCredits.ServiceThreshold:Thistermisnotdefinedinthecontract.Ourinterpretationisthatitreferstotheminimumallowedservicelevel.Belowthisthreshold,theContractorisinbreachofcontract.NonIn-Depth:Thistermisnotdefinedinthecontract.OurinterpretationisthatitreferstoBasicquerieswhicharehandledbytheSPService.ThisincludesallAdminqueries(e.g.requestsforDiskQuota,AdjustmentstoAllocations,CreationofProjects)andTechnicalQueries(Batchscriptquestions,highleveltechnical‘HowdoI?’requests).Queriesrequiringdetailedtechnicaland/orscientificanalysis(debugging,softwarepackageinstallations,codeporting)arereferredtotheCSETeamasIn-Depthqueries.ChangeRequest:Thistermisnotdefinedinthecontract.TherearetimeswhenSPreceivesrequeststhatmayrequirechangestobedeployedonARCHER.Theserequestsmaycomefromtheusers,theCSEteamorCray.ExamplesmayincludethedeploymentofnewOSpatches,thedeploymentCraybugfixes,ortheadditionofnewsystemssoftware.SuchchangesaresubjecttoChangeControlandmayhavetowaitforaMaintenanceSession.Thenatureofsuchrequestsmeansthattheycannotbecompletedin2workingdays.
2.1.1ServicePointsInthepreviousServiceQuartertheServicePointscanbesummarisedasfollows:
Period Oct15 Nov15 Dec15 15Q4Metric Service
LevelServicePoints
ServiceLevel
ServicePoints
ServiceLevel
ServicePoints
ServicePoints
2.6.2–PR 100% -5 100% -5 100% -5 -152.6.3–QC 99.2% -2 96.6% 1 99.8% -2 -32.6.4–UR 1WD 0 1WD 0 1WD 0 0Total -7 -4 -7 -18ThedetailsoftheabovecanbefoundinSection2.2ofthisreport.
6
2.1.2ServiceFailuresTherewerenoservicefailuresintheperiodasdefinedinthemetric.DetailsofplannedmaintenancesessioncanbefoundinSection2.3.2.
2.1.3ServiceCreditsAstheTotalServicePointsarenegative(-18),noServiceCreditsapplyin15Q4.
2.2DetailedServiceLevelBreakdown
2.2.1PhoneResponse(PR)
Oct15 Nov15 Dec15 15Q4PhoneCallsReceived 28(7) 27(3) 37(4) 92(14)Answered2Minutes 28 27 37 92ServiceLevel 100.0% 100.0% 100.0% 100.0%
Thevolumeoftelephonecallsremainedlowin15Q4.Ofthetotalof92callsreceivedabove,only14weregenuineARCHERusercallsthateitherresultedinqueriesoranswereduserquestionsdirectly.
2.2.2QueryClosure(QC)
Oct15 Nov15 Dec15 15Q4Self-ServiceAdmin 637 503 430 1564Admin 197 263 141 601Technical 20 56 21 87TotalQueries 848 812 592 2252TotalClosedin2Days 841 784 591 2216ServiceLevel 99.2% 96.6% 99.8% 98.4%
TheabovetableshowsthequeriesclosedbySPduringtheperiod.ThefiguresforNovemberinclude10ticketswhichcouldnotbeprocessedwhilsttheRDFwasunavailablebetween9thand12thofNovemberduetopowerissues.Hadthoseticketsbeenabletobeprocessed,theservicelevelpercentageforNovemberwouldhavebeen97.8%.InadditiontotheAdminandTechnicalqueries,thefollowingChangeRequestswereresolvedin15Q4.
Oct15 Nov15 Dec15 15Q4ChangeRequests 0 5 0 5
7
2.2.3UserRegistration(UR)
Oct15 Nov15 Dec15 15Q4NoofRequests 120 89 98 302ClosedinOneWorkingDay 120 89 98 302AverageClosureTime(Hrs) 0.6 1.0 0.6 0.7AverageClosureTime(WorkingDays)
0.1 0.1 0.1 0.1
ServiceLevel 1WD 1WD 1WD 1WDToavoiddoublecounting,theserequestsarenotincludedintheabovemetricsfor“AdminandTechnical”QueryClosure.
2.3AdditionalMetrics
2.3.1TargetResponseTimesThefollowingmetricsarealsodefinedinSchedule2.2,buthavenoServicePointsassociated.
TargetResponseTimes1 Duringcoretime,aninitialresponsetotheuseracknowledgingreceiptofthequery2 ATrackingIdentifierwithin5minutesofreceivingthequery3 DuringCoreTime,90%ofincomingtelephonecallsshouldbeansweredpersonally(notby
computer)within2minutes4 DuringUKofficehours,allnontelephonecommunicationsshallbeacknowledgedwithin1
Hour
1–InitialResponseThisissentautomaticallywhentheuserraisesaquerytotheaddresshelpdesk@archer.ac.uk.Usersmaychoosenottoreceivesuchemailsbymailingsupport@archer.ac.uk.
2–TrackingIdentifierThisissentautomaticallywhentheuserraisesaquerytotheaddresshelpdesk@archer.ac.uk.Usersmaychoosenottoreceivesuchemailsbymailingsupport@archer.ac.uk.ThetrackingidentifierissetintheSAFEregardlesswhichoptiontheuserselects.
3–IncomingCallsThesearecoveredintheprevioussectionofthereport.ServicePointsapply.
4-QueryAcknowledgementAcknowledgmentofthequeryisdefinedaswhentheHelpdeskassignsthenewincomingquerytotherelevantServiceProvider.Thisshouldhappenwithin1workinghourofthequeryarrivingattheHelpdesk.TheHelpdeskprocessedthefollowingnumberofincomingqueriesduringtheServiceQuarter:
Oct15 Nov15 Dec15 15Q4CRAY 8 9 5 22ARCHER_CSE 133 151 81 365ARCHER_SP 1371 1229 960 3560TotalQueriesAssigned 1512 1389 1046 3947TotalAssignedin1Hour 1512 1389 1046 3947ServiceLevel 100% 100% 100% 100%
8
TheServiceDeskassignsqueriestoallgroupssupportingtheservicei.e.SP,CSEandCray.Theabovetableincludesquerieshandledbytheothergroupssupportingtheserviceaswellasinternallygeneratedqueriesusedtomanagetheoperationoftheservice.
2.3.2MaintenanceSPisallowedtobookamaximumoftwomaintenanceoccasionsinany28-dayperiod,andtheseshalllastnolongerthanfourhours;thesearedefinedasPermittedMaintenance.SuchMaintenancePeriodsarerecordedintheMaintenanceSchedule.A6-monthforwardplanofmaintenancehasbeenagreedwiththeAuthority.IthasbeenagreedwiththeAuthoritythatSPmaycombinethehoursnormallyallocatedfortwoconsecutivemaintenancesessionsintoasinglesessionwithamaximumofeighthoursandthishasbeenthenormalmodeofoperationasrecordedinthetablebelow.Thisreducesthenumberofsessionstaken,whichreducesuserimpactsincethejobsrunningontheservicehavetobedraineddownonceandnottwice.Ifgreaterthan4hoursdowntimeisrequiredformaintenance,20dayspriorapprovalisrequiredfromtheAuthority.Wherepossible,SPwillperformmaintenanceonan‘At-risk’basis,thusmaximisingtheAvailabilityoftheService.ThefollowingplannedmaintenancetookplaceintheServiceQuarter.
Date Start End Duration Type Notes Reason14/1015 09:00 17:00 08:00 Pre-
ApprovedEPSRCApproved0900-1700
SMWupgradeinpreparationforCLEupgrade
11/11/15 09:00 17:20 08:20 Pre-Approved
EPSRCApproved0900-1900
CLE5.2upgrade
Theplannedmaintenanceon14/10/15over-ranintounplannedservicedowntime(until1756)owingtotechnicalissueswheretheresponsibilitylaywithCray.EPCChandedresponsibilitybacktothemwhenitwasplaintheservicecouldnotbebroughtbackby1700.
9
3.ServiceStatisticsThissectioncontainsstatisticsontheARCHERserviceasrequestedbyEPSRC,SACandSMB.
3.1UtilisationUtilisationoverthequarterwas88%.Atrendlinehasbeenaddedtothegraph.
TheutilisationbytheResearchCouncils,relativetotheirrespectiveallocations,ispresentedbelow.ThisbarchartshowstheusageofARCHERbythetwoResearchCouncilspresentedasapercentageofthetotalResearchCouncilallocationonARCHER.
10
ThecumulativeallocationutilisationforthequarterbytheResearchCouncilsisshownbelow:
ThecumulativeallocationutilisationforthequarterbyEPSRCbrokendownbydifferentprojecttypes(seebelow)showsthatthemajorityofusagecomesfromthescientificconsortia(asexpected)withsignificantusagefromresearchgrants,ARCHERLeadershipprojectsandARCHERRAPprojects.ThetimesusedbyInstantAccessprojects,trainingprojectsandgeneralserviceusageareverysmall.
11
3.2SchedulingCoefficientMatrix
ThecolourinthematrixindicatesthevalueoftheSchedulingCoefficient.Thisisdefinedastheratioofruntimetoruntimepluswaittime.Hence,avalueof1(green)indicatesthatajobranwithnotimewaitinginthequeue,avalueof0.5(paleyellow)indicatesajobqueuedforthesameamountoftimethatitran,andanythingbelow0.5(orangetored)indicatesthatajobqueuedforlongerthanitran.Thematrixshowsthatgenerallyqueuingtimesareshort.Theonlycaseswherelongerwaittimesthanruntimesareencounteredareeitherforveryshortjobs(asthereisalwaysaschedulingoverhead)orforverylargejobs(wherethesystemhastodraincomputenodestomakespaceforthejobs).
12
3.3AdditionalUsageGraphsThefollowingchartsprovidedifferentviewsofthedistributionofjobsizesonARCHER.
NumberofCores
Thefirstgraphshowsthat,intermsofnumbers,thereisasignificantnumberofjobsusingnomorethan512cores.However,thesecondgraphrevealsthatmostofthekAUswerespentonjobsbetween257coresand8192cores.ThenumberofkAUsusediscloselyrelatedtomoneyandshowsbetterhowtheinvestmentinthesystemisutilised.
13
Wallclock
Fromthefirstgraph,itwouldappearthatthesystemisdominatedbyshortjobs.However,thesecondgraphshowsthatactualusageofthesystemismorespreadanddominatedbyjobsoflessthanorequalto27hours.
14
CoreHours
Theabovegraphsshowthat,whiletherearequiteafewjobsthatuseonlyasmallnumberofcorehoursperjob,mostoftheresourceisconsumedbyjobsthatusetensofthousandsofcorehoursperjob.