14
1 ARCHER SP Service Quarterly Report Quarter 4 2015

ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

1

ARCHERSPServiceQuarterlyReport

Quarter42015

Page 2: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

2

DocumentInformationandVersionHistoryVersion: 0.5Status Draft

Author(s): AlanSimpson,AnneWhiting,AndyTurner,MikeBrown,StephenBooth,JoBeech-Brandt

Reviewer(s) AlanSimpson

Version Date Comments,Changes,Status Authors,contributors,

reviewers0.1 2015-12-14 InitialDraft AnneWhiting0.2 2016-01-07 Furtherinformationadded AnneWhiting0.3 2016-01-12 Graphsadded AnneWhiting0.4 2016-01-13 Furtherinformationadded

includingappendixAnneWhiting

0.5 2016-01-15 ChangesmadefollowingreviewbyAndyTurnerandAlanSimpson

AlanSimpson,AndyTurner,AnneWhiting,MikeBrown

1.0 2016-01-19 Approved AlanSimpson1.1 2016-03-01 CaptiononPage14corrected JoBeech-Brandt,AlanSimpson

Page 3: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

3

1. TheService1.1ServiceHighlightsThisisthereportfortheARCHERSPServicefortheReportingPeriods:October2015,November2015,December2015• Utilisationonthesystemduring15Q4was88%,thesameasin15Q3.• TheremainingtasksrequiredtoreturntonormalaftertheSonexionLustrefilesystemissues

werecompletedsuccessfullythisquarter:

Ø Thediskreplacementprogrammecompletedon22ndOctober.Thediskfailureratehasreturnedtothelevelsthatwouldbeexpectedafterthereplacements.

Ø RaidcheckwasreturnedtorunningonamonthlyschedulewiththesessionadvertisedontheARCHERpubliccalendar,minimisingthedisruptiontouserscomparedtotheweeklyschedulethatRaidcheckwaspreviouslyrunningon.

Ø EPSRC’scontractorcarriedoutalessonslearnedreview,withstafffromEPCCcontributingfullyandconstructivelytothediscussionandtothereport.Recommendedactionsarebeingreviewed,andwhereappropriate,implemented,withtheeffectsofanychangesbeingmonitored.

Ø Thenormalcycleofpatchingandupgradeshadhadtobepostponeduntiltheresolutionofthefilesystemissues;thishasnowrecommenced.

• Themuch-postponedCLEupgradewassuccessfullycarriedouton11November,bringing

ARCHERuptoversionCLE5.2.ThisisfacilitatingthecompletionofactionsdependentontheCLEupgrade,inadditiontoprovidingthenewfeaturesandfixesincluded.Adviceandassistancewasprovidedtousersbeforeandaftertheupgradetoassistthemwithactionsrequiredfollowingtheupgrade(suchasrecompilingapplicationsandverifyingoutput).

• Crayreplacedcablesinthehighspeednetworkinfrastructureon3and4November2015to

optimizethenetworkspeedandtransferperformance;• Crayisinvestigatinganissuewheremultipleesloginnodesarerandomlyhanging.Diagnosishas

beenhamperedbythemalfunctioningofkdump,thetooltoenableamemorydumptobetaken,forwhichafixhasbeeninstalledbytheServiceProvideron9thDecember.AfurtherrequiredchangetoincreasethekdumppartitionisscheduledtobeimplementedbytheServiceProviderduringthemaintenancesessionon27thJanuary.

• TherehavebeenanumberofPBS-relatedissuesthattheServiceProviderhasraisedwithCray

andwhichCrayareworkingoninconjunctionwithAltair.Theseissuesareprincipallywithregardtoschedulingparametersanditishopedthatimprovementsinjobturnaroundandbetterjobplacementwillresult.Worktodatelookspromising.

• InconjunctionwithEPSRCandtheCSEteam,benefitsrealisationmeasurementfortheARCHER

servicecontinues.Themetricsusedtomeasurethebenefitshavebeenreviewedandadjustedtoensurethechosenmetricsarebothmeasurableandmeaningful.Theaimistoensurethattheoptimalmetricsetpresentanaccurateandconsistentviewonthebenefitsprovided.

1.2ForwardLook

• RSIPnodeimplementationwillallowtheuseofapplicationsrequiringlicenseserversonthecomputenodes,forexample,compilersandISVapplications.RSIPcanbeusedtoprovide

Page 4: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

4

licensestothecomputenodesforthecommercialCFDsoftwareSTAR-CCM+fromCD-Adapco.FollowingthesuccessfulCLEupgrade,theimplementationisanticipatedtobe24thFebruary;

• Aquarterlyround-tablediscussionbetweenallpartiesinvolvedindeliveringtheARCHER

serviceswillbeaddedtothecurrentquarterlyreview,withEPSRC,CrayandEPCCtodiscusssharedissues,exchangeviewsandsuggestactionstocontributetoaprogrammeofcontinuousimprovement.

• WorkiscommencingtopreparethenationalserviceforISO9000certification,tofurther

formalisetheimprovementmechanismsforservicedeliveryandprocessimprovement,withakeymeasurementbeingusersatisfaction.

Page 5: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

5

2.ContractualPerformanceReportThisisthecontractualperformancereportfortheARCHERSPService.

2.1ServicePointsandServiceCreditsTheServiceLevelsandServicePointsfortheSPservicearedefinedasbelowinSchedule2.2.• 2.6.2-PhoneResponse(PR):90%ofincomingtelephonecallsansweredpersonallywithin2

minutesforanyServicePeriod.ServiceThreshold:85.0%;OperatingServiceLevel:90.0%.• 2.6.3-QueryClosure(QC):97%ofalladministrativequeries,problemreportsandnonin-depth

queriesshallbesuccessfullyresolvedwithin2workingdays.ServiceThreshold:94.0%;OperatingServiceLevel:97.0%.

• 2.6.4-NewUserRegistration(UR):ProcessNewUserRegistrationswithin1workingday.Definitions:OperatingServiceLevel:TheminimumlevelofperformanceforaServiceLevelwhichisrequiredbytheAuthorityiftheContractoristoavoidtheneedtoaccounttotheAuthorityforServiceCredits.ServiceThreshold:Thistermisnotdefinedinthecontract.Ourinterpretationisthatitreferstotheminimumallowedservicelevel.Belowthisthreshold,theContractorisinbreachofcontract.NonIn-Depth:Thistermisnotdefinedinthecontract.OurinterpretationisthatitreferstoBasicquerieswhicharehandledbytheSPService.ThisincludesallAdminqueries(e.g.requestsforDiskQuota,AdjustmentstoAllocations,CreationofProjects)andTechnicalQueries(Batchscriptquestions,highleveltechnical‘HowdoI?’requests).Queriesrequiringdetailedtechnicaland/orscientificanalysis(debugging,softwarepackageinstallations,codeporting)arereferredtotheCSETeamasIn-Depthqueries.ChangeRequest:Thistermisnotdefinedinthecontract.TherearetimeswhenSPreceivesrequeststhatmayrequirechangestobedeployedonARCHER.Theserequestsmaycomefromtheusers,theCSEteamorCray.ExamplesmayincludethedeploymentofnewOSpatches,thedeploymentCraybugfixes,ortheadditionofnewsystemssoftware.SuchchangesaresubjecttoChangeControlandmayhavetowaitforaMaintenanceSession.Thenatureofsuchrequestsmeansthattheycannotbecompletedin2workingdays.

2.1.1ServicePointsInthepreviousServiceQuartertheServicePointscanbesummarisedasfollows:

Period Oct15 Nov15 Dec15 15Q4Metric Service

LevelServicePoints

ServiceLevel

ServicePoints

ServiceLevel

ServicePoints

ServicePoints

2.6.2–PR 100% -5 100% -5 100% -5 -152.6.3–QC 99.2% -2 96.6% 1 99.8% -2 -32.6.4–UR 1WD 0 1WD 0 1WD 0 0Total -7 -4 -7 -18ThedetailsoftheabovecanbefoundinSection2.2ofthisreport.

Page 6: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

6

2.1.2ServiceFailuresTherewerenoservicefailuresintheperiodasdefinedinthemetric.DetailsofplannedmaintenancesessioncanbefoundinSection2.3.2.

2.1.3ServiceCreditsAstheTotalServicePointsarenegative(-18),noServiceCreditsapplyin15Q4.

2.2DetailedServiceLevelBreakdown

2.2.1PhoneResponse(PR)

Oct15 Nov15 Dec15 15Q4PhoneCallsReceived 28(7) 27(3) 37(4) 92(14)Answered2Minutes 28 27 37 92ServiceLevel 100.0% 100.0% 100.0% 100.0%

Thevolumeoftelephonecallsremainedlowin15Q4.Ofthetotalof92callsreceivedabove,only14weregenuineARCHERusercallsthateitherresultedinqueriesoranswereduserquestionsdirectly.

2.2.2QueryClosure(QC)

Oct15 Nov15 Dec15 15Q4Self-ServiceAdmin 637 503 430 1564Admin 197 263 141 601Technical 20 56 21 87TotalQueries 848 812 592 2252TotalClosedin2Days 841 784 591 2216ServiceLevel 99.2% 96.6% 99.8% 98.4%

TheabovetableshowsthequeriesclosedbySPduringtheperiod.ThefiguresforNovemberinclude10ticketswhichcouldnotbeprocessedwhilsttheRDFwasunavailablebetween9thand12thofNovemberduetopowerissues.Hadthoseticketsbeenabletobeprocessed,theservicelevelpercentageforNovemberwouldhavebeen97.8%.InadditiontotheAdminandTechnicalqueries,thefollowingChangeRequestswereresolvedin15Q4.

Oct15 Nov15 Dec15 15Q4ChangeRequests 0 5 0 5

Page 7: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

7

2.2.3UserRegistration(UR)

Oct15 Nov15 Dec15 15Q4NoofRequests 120 89 98 302ClosedinOneWorkingDay 120 89 98 302AverageClosureTime(Hrs) 0.6 1.0 0.6 0.7AverageClosureTime(WorkingDays)

0.1 0.1 0.1 0.1

ServiceLevel 1WD 1WD 1WD 1WDToavoiddoublecounting,theserequestsarenotincludedintheabovemetricsfor“AdminandTechnical”QueryClosure.

2.3AdditionalMetrics

2.3.1TargetResponseTimesThefollowingmetricsarealsodefinedinSchedule2.2,buthavenoServicePointsassociated.

TargetResponseTimes1 Duringcoretime,aninitialresponsetotheuseracknowledgingreceiptofthequery2 ATrackingIdentifierwithin5minutesofreceivingthequery3 DuringCoreTime,90%ofincomingtelephonecallsshouldbeansweredpersonally(notby

computer)within2minutes4 DuringUKofficehours,allnontelephonecommunicationsshallbeacknowledgedwithin1

Hour

1–InitialResponseThisissentautomaticallywhentheuserraisesaquerytotheaddresshelpdesk@archer.ac.uk.Usersmaychoosenottoreceivesuchemailsbymailingsupport@archer.ac.uk.

2–TrackingIdentifierThisissentautomaticallywhentheuserraisesaquerytotheaddresshelpdesk@archer.ac.uk.Usersmaychoosenottoreceivesuchemailsbymailingsupport@archer.ac.uk.ThetrackingidentifierissetintheSAFEregardlesswhichoptiontheuserselects.

3–IncomingCallsThesearecoveredintheprevioussectionofthereport.ServicePointsapply.

4-QueryAcknowledgementAcknowledgmentofthequeryisdefinedaswhentheHelpdeskassignsthenewincomingquerytotherelevantServiceProvider.Thisshouldhappenwithin1workinghourofthequeryarrivingattheHelpdesk.TheHelpdeskprocessedthefollowingnumberofincomingqueriesduringtheServiceQuarter:

Oct15 Nov15 Dec15 15Q4CRAY 8 9 5 22ARCHER_CSE 133 151 81 365ARCHER_SP 1371 1229 960 3560TotalQueriesAssigned 1512 1389 1046 3947TotalAssignedin1Hour 1512 1389 1046 3947ServiceLevel 100% 100% 100% 100%

Page 8: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

8

TheServiceDeskassignsqueriestoallgroupssupportingtheservicei.e.SP,CSEandCray.Theabovetableincludesquerieshandledbytheothergroupssupportingtheserviceaswellasinternallygeneratedqueriesusedtomanagetheoperationoftheservice.

2.3.2MaintenanceSPisallowedtobookamaximumoftwomaintenanceoccasionsinany28-dayperiod,andtheseshalllastnolongerthanfourhours;thesearedefinedasPermittedMaintenance.SuchMaintenancePeriodsarerecordedintheMaintenanceSchedule.A6-monthforwardplanofmaintenancehasbeenagreedwiththeAuthority.IthasbeenagreedwiththeAuthoritythatSPmaycombinethehoursnormallyallocatedfortwoconsecutivemaintenancesessionsintoasinglesessionwithamaximumofeighthoursandthishasbeenthenormalmodeofoperationasrecordedinthetablebelow.Thisreducesthenumberofsessionstaken,whichreducesuserimpactsincethejobsrunningontheservicehavetobedraineddownonceandnottwice.Ifgreaterthan4hoursdowntimeisrequiredformaintenance,20dayspriorapprovalisrequiredfromtheAuthority.Wherepossible,SPwillperformmaintenanceonan‘At-risk’basis,thusmaximisingtheAvailabilityoftheService.ThefollowingplannedmaintenancetookplaceintheServiceQuarter.

Date Start End Duration Type Notes Reason14/1015 09:00 17:00 08:00 Pre-

ApprovedEPSRCApproved0900-1700

SMWupgradeinpreparationforCLEupgrade

11/11/15 09:00 17:20 08:20 Pre-Approved

EPSRCApproved0900-1900

CLE5.2upgrade

Theplannedmaintenanceon14/10/15over-ranintounplannedservicedowntime(until1756)owingtotechnicalissueswheretheresponsibilitylaywithCray.EPCChandedresponsibilitybacktothemwhenitwasplaintheservicecouldnotbebroughtbackby1700.

Page 9: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

9

3.ServiceStatisticsThissectioncontainsstatisticsontheARCHERserviceasrequestedbyEPSRC,SACandSMB.

3.1UtilisationUtilisationoverthequarterwas88%.Atrendlinehasbeenaddedtothegraph.

TheutilisationbytheResearchCouncils,relativetotheirrespectiveallocations,ispresentedbelow.ThisbarchartshowstheusageofARCHERbythetwoResearchCouncilspresentedasapercentageofthetotalResearchCouncilallocationonARCHER.

Page 10: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

10

ThecumulativeallocationutilisationforthequarterbytheResearchCouncilsisshownbelow:

ThecumulativeallocationutilisationforthequarterbyEPSRCbrokendownbydifferentprojecttypes(seebelow)showsthatthemajorityofusagecomesfromthescientificconsortia(asexpected)withsignificantusagefromresearchgrants,ARCHERLeadershipprojectsandARCHERRAPprojects.ThetimesusedbyInstantAccessprojects,trainingprojectsandgeneralserviceusageareverysmall.

Page 11: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

11

3.2SchedulingCoefficientMatrix

ThecolourinthematrixindicatesthevalueoftheSchedulingCoefficient.Thisisdefinedastheratioofruntimetoruntimepluswaittime.Hence,avalueof1(green)indicatesthatajobranwithnotimewaitinginthequeue,avalueof0.5(paleyellow)indicatesajobqueuedforthesameamountoftimethatitran,andanythingbelow0.5(orangetored)indicatesthatajobqueuedforlongerthanitran.Thematrixshowsthatgenerallyqueuingtimesareshort.Theonlycaseswherelongerwaittimesthanruntimesareencounteredareeitherforveryshortjobs(asthereisalwaysaschedulingoverhead)orforverylargejobs(wherethesystemhastodraincomputenodestomakespaceforthejobs).

Page 12: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

12

3.3AdditionalUsageGraphsThefollowingchartsprovidedifferentviewsofthedistributionofjobsizesonARCHER.

NumberofCores

Thefirstgraphshowsthat,intermsofnumbers,thereisasignificantnumberofjobsusingnomorethan512cores.However,thesecondgraphrevealsthatmostofthekAUswerespentonjobsbetween257coresand8192cores.ThenumberofkAUsusediscloselyrelatedtomoneyandshowsbetterhowtheinvestmentinthesystemisutilised.

Page 13: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

13

Wallclock

Fromthefirstgraph,itwouldappearthatthesystemisdominatedbyshortjobs.However,thesecondgraphshowsthatactualusageofthesystemismorespreadanddominatedbyjobsoflessthanorequalto27hours.

Page 14: ARCHER SP Service Quarterly Report · • Utilisation on the system during 15Q4 was 88%, the same as in 15Q3. • The remaining tasks required to return to normal after the Sonexion

14

CoreHours

Theabovegraphsshowthat,whiletherearequiteafewjobsthatuseonlyasmallnumberofcorehoursperjob,mostoftheresourceisconsumedbyjobsthatusetensofthousandsofcorehoursperjob.