Deployment metrics and planning
(aka “Potentially the most boring talk this week”)
GridPP16
Jeremy [email protected]
27th June 2006
Overview
1 An update on some of the high-level metrics
2 Even more metrics….zzZ
3 zzzzzzzzzZZZZZZZZZZZZZ
4 What came out of the recent deployment workshops
5 What is happening with SC4
6 Summary
Available job slots have steadily increased
Contribution to EGEE varies between 15% and 20%. From this plot, stability looks like a problem!
[Chart: UK total published job slots vs date, June 2004 – June 2006. Thanks to Fraser for the data update.]
Our contribution to EGEE “work” done remains significant… but be aware that not all sites have published all data to APEL. Only one GridPP site is not currently publishing.
CPU usage has been above 60% since May
[Chart: % of job slots used (EGEE vs UK) by date, June 2004 – June 2006. Update for GridPP15.]
This is because most VOs have doubled job rates – note LHCb!
IC-HEP are developing a tool to show job histories (per CE or Tier-2)
View for GridPP CEs covering last week
…but it looks a little rough sometimes!
Over 5000 jobs running
The largest GridPP users by VO for the last 3 months: LHCb, ATLAS, BABAR, CMS, BIOMED, DZERO, ZEUS
VOs = a big success
[Chart: number of enabled VOs, Jan-06 vs Jun-06.]
• But we do now need to make sure that schedulers are giving the correct priority to LHC VO jobs!
• The ops VO will be used for monitoring from the start of July
Ranked CEs for Apr-Jun 2006
Thanks to Gidon and Olivier for this plot.
Ranked CEs for Apr-Jun 2006
Successful time / total time
Thanks to Gidon and Olivier for this plot.
An interesting view by Tier
[Chart: total hours (success vs failed) for London Tier-2, NorthGrid, ScotGrid, SouthGrid and the RAL Tier-1.]
A little out-of-date Q1 view of contribution and occupancy
[Chart: average occupancy and contribution to UK Tier-2 processing, per site.]
Some sites appear more successful at staying full even when overall job throughput is not saturating the resources. For Q2 most sites should show decent utilisation (of course this plot involves estimates and assumes 100% availability).
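As an aside, the occupancy estimate can be sketched in a few lines – the slot count and delivered hours below are invented for illustration, and the 100% availability assumption matches the caveat above:

```python
# Sketch of the occupancy estimate (illustrative numbers, not real site data).
# Occupancy = delivered CPU hours / available slot-hours, assuming the site
# was available 100% of the time, as the plot does.

def occupancy(delivered_hours, job_slots, days=90):
    """Fraction of available slot-hours actually used over a quarter."""
    available_hours = job_slots * 24 * days
    return delivered_hours / available_hours

# Hypothetical site: 200 job slots, 300,000 delivered CPU hours in the quarter.
print(f"{occupancy(300_000, 200):.1%}")  # ~69% occupancy
```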
Storage has seen a healthy increase – but usage ~40%
SRM V2.2 is delayed. There have been several workshops/meetings taking forward the details of storage types (custodial vs permanent etc.)
Scheduled downtime is better than EGEE average
…still not really good enough to meet MoU targets. Sites need to update without draining the site… there are still open questions about what “available” means. GOCDB needs finer granularity for different services.
So are there any recent trends!?
[Chart: stacked % of scheduled downtime each month (January–May) by site.]
This is the percentage of time that a site was down in a given period – if a site were down for a whole month, that month's segment (each colour) would be 100%
% SFTs failed for UKI
Seems better than the EGEE average for April and May, but slightly worse in June so far.
These figures really need translating into hours unavailable and the impact on the 95% annual availability target.
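As a rough illustration of that translation (a sketch only – it ignores scheduled-downtime allowances and the details of how availability is actually measured):

```python
# Downtime budget implied by a 95% annual availability target.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def downtime_budget_hours(target=0.95, hours=HOURS_PER_YEAR):
    """Hours of unavailability permitted per year at the given availability target."""
    return (1.0 - target) * hours

budget = downtime_budget_hours()
print(f"95% availability allows {budget:.0f} hours (~{budget / 24:.0f} days) down per year")
```

So a single unscheduled outage of a few days already consumes a large fraction of the annual budget, which is why the hours-unavailable view matters more than the raw SFT failure counts.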
SFTs per site - time
[Chart: stacked % of SFTs failed each month (January–May) by site.]
[Chart: stacked % of time unavailable each month (SFT-based, January–May) by site.]
Generally April and May seem to be improvements on January to March
Number of trouble tickets
[Chart: number of new tickets per site, Q1-2006 vs Q2-2006.]
More tickets in Q2 2006 so far! This seems correlated with the increased job loads. The profile is quite similar between Q1 and Q2 2006.
Average time to close tickets
[Chart: average time to close tickets (hrs) per site, Q1-2006 vs Q2-2006.]
Tickets usually come from the grid operator on duty. We need to look at the factors behind these times. Note that just a few tickets staying open for a long time can distort the conclusions. We need better-defined targets: the MoU talks about a time to respond of 12 hrs (prime time) and 72 hrs (outside prime time).
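The outlier effect mentioned above is easy to demonstrate – the closure times below are invented for illustration, not taken from the helpdesk data:

```python
# Why a few long-open tickets distort "average time to close".
from statistics import mean, median

# Invented closure times (hours) for one hypothetical site: five routine
# tickets plus one that sat open for roughly a month.
closure_hours = [4, 6, 8, 10, 12, 700]

print(f"mean   = {mean(closure_hours):.1f} hrs")    # dragged up by the single outlier
print(f"median = {median(closure_hours):.1f} hrs")  # barely affected by it
```

Reporting the median (or mean with outliers listed separately) would give a fairer picture of routine response times per site.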
Middleware upgrade profiles remain similar
[Chart: number of sites at each release (LCG-2_4_0, LCG-2_6_0, LCG-2_7_0, GLITE-3_0_0) vs date, April 2005 – June 2006.]
• gLite 3.0.0 was deployed late but released on time, raising questions about project-wide communications. Our target remains 1 month from the agreed start date.
• EGEE wants to move to “rolling updates”, but there are still issues around tracking (publishing) the component versions installed.
Disk to disk transfer rates
[Chart: inbound and outbound transfer rates in Mb/s per site.]
• The testing went well (thanks to Graeme) but we have a lot to do to improve rates.
• Suspected/actual problems and possible solutions are listed in the SC wiki: http://www.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Test_Summary
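When comparing site results against targets, units deserve care: the plot is in Mb/s (megabits), while transfer tools often report MB/s (megabytes). A minimal conversion sketch (the 1 TB in 24 hours figure is an invented example, not a measured result):

```python
# Average transfer rate in Mb/s (megabits per second) from data moved and
# elapsed time. The factor of 8 converts bytes to bits.

def avg_rate_mbps(bytes_moved, seconds):
    """Average rate in megabits per second (1 Mb = 1e6 bits)."""
    return bytes_moved * 8 / seconds / 1e6

# Example (invented figures): 1 TB moved in 24 hours.
print(f"{avg_rate_mbps(1e12, 24 * 3600):.0f} Mb/s")  # ~93 Mb/s average
```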
Some key work areas for Q3 and Q4 2006
• Improving site availability/monitoring (e.g. Nagios scripts with alarms)
• Getting the transfer rates higher
• Understanding external connectivity data transfer needs
• Understanding performance differences across the sites
• Adapting to the rolling-update middleware model
• Implementing storage accounting
• Improving cross-site support
• Understanding the WLCG MoU mapping to the UK Tier-2 structure (and how we meet it)
• Taking part in LCG experiment challenges (SC4 and beyond)
• Streamlining the support structure (helpdesk)
• SRM upgrades (SRM v2.2)
• Integrating new resources (start to address the CPU:disk imbalance vs requirements)
• Security: incident response
• Exploiting SuperJanet upgrades
• Improved alignment with the UK National Grid Service
• The usual: documentation and communication
Workshop outputs
Tier-2 workshop/tutorials already covered – next planned for January 2007
OSG/EGEE operations workshop
RELEASE AND DEPLOYMENT PROCESS
– Why do sites need to schedule downtime for upgrades?
– Release: is local certification needed? Sites required for testing against batch systems
– Links to deployment timetable and progress area
USER SUPPORT
– How to improve communications (the role of the TCG was even debated!)
– Experiment/VO experience. Improving error messaging!
SITE VALIDATION
– Site Availability Monitoring (SFTs for critical services – will remove some of the general SFT problems that end up logged against sites)
VULNERABILITY & RISK ANALYSIS
– New in EGEE-II = SA3
– Move to a new policy for going public with vulnerabilities – RATS (risk analysis teams)
Service Challenge technical workshop
– Review of individual Tier-1 rates and problems
– Experiments' plans are getting clearer and were reviewed
– Commitment to use GGUS for problem tickets
Identified experiment interactions (please give feedback!)
ScotGrid (signed up to ATLAS SC4)
– Durham
– Edinburgh
– Glasgow – PPS site involved with work for ATLAS
NorthGrid (signed up to ATLAS SC4)
– Lancaster – involved with ATLAS SC4
– Liverpool
– Manchester – already working with ATLAS but not SC4-specific
– Sheffield
SouthGrid
– Birmingham
– Bristol
– Cambridge
– Oxford – ATLAS?
– RAL-PPD – will get involved with CMS
London Tier-2
– Brunel – offer to contribute to ATLAS MC production
– Imperial – working with CMS
– QMUL – ATLAS? (manpower concerns)
– RHUL – bandwidth concern. ATLAS MC?
– UCL
Summary
1 There is a lot of data but not in a consistent format
2 Within EGEE and WLCG our contribution remains strong
3 Some issues with SFTs and scheduled downtime
4 Workshops over last 2 weeks have been useful
5 Some clear tasks for next 6 months
6 We need more sites to be involved with experiment challenges