TESTING FAX USING SSS AND FDR DATASETS, 2ND APRIL 2013

Page 1: TESTING FAX USING SSS and FDR datasets

TESTING FAX USING SSS AND FDR DATASETS

2ND APRIL 2013

Page 2: TESTING FAX USING SSS and FDR datasets

Ilija Vukotic [email protected] 2

DETAILS

Dataset: user.flegger.*.data12_8TeV.00212172.physics_Muons.merge.NTUP_SMWZ.f479_m1228_p1067_p1141_tid01007411_00

Dataset size: 500 GB

WNs: UC3 and UCT3

Discovery: Global redirector

Running against: fax.mwt2.org

Ramp-up: 4 jobs a minute

Full data copy, split into 138 jobs for each site

Average input size: 3.62 GB

Duration does not include time for job to start

Duration does not include dq2-put time.
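The figures on this slide can be cross-checked against each other. The short sketch below uses only the numbers quoted above; the variable names are illustrative, and the ramp-up estimate assumes jobs are started at a steady 4 per minute.

```python
# Numbers quoted on the DETAILS slide.
N_JOBS = 138            # jobs per site
AVG_INPUT_GB = 3.62     # average input size per job, in GB
RAMP_UP_PER_MIN = 4     # jobs started per minute (assumed steady)

# Total input read per site if every job reads its full input.
total_gb = N_JOBS * AVG_INPUT_GB

# Time needed to start all jobs at the quoted ramp-up rate.
ramp_up_minutes = N_JOBS / RAMP_UP_PER_MIN

print(f"total input per site: {total_gb:.1f} GB")
print(f"time to start all jobs: {ramp_up_minutes:.1f} min")
```

The product comes out at roughly the 500 GB quoted for the dataset, and all 138 jobs are started after about half an hour.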

Page 3: TESTING FAX USING SSS and FDR datasets

JOBS

Page 4: TESTING FAX USING SSS and FDR datasets

MWT2

- 2 jobs hanging: finished with no error, but only the next day

- UCT3 shows the same efficiency as UC3

- Avg. CPU eff.: 76.5%

- Avg. dur.: 5 h 59 min

- Avg. rate: 290 kB/s

- Total rate: 39 MB/s

[Figure: job duration (h:mm:ss) vs. input size (GB), with linear fit f(x) = 0.0417583087209933 x + 0.0981891437026868, R² = 0.0196108364468329]

Page 5: TESTING FAX USING SSS and FDR datasets

AGLT2

- 4 jobs hanging: finished with no error, but only the next day

- Avg. CPU efficiency: 70.5%

- Avg. dur.: 6 h 14 min

- Avg. rate: 165 kB/s

- Total rate: 22 MB/s

[Figure: job duration (seconds) vs. input size (GB), with linear fit f(x) = 0.0733543217178845 x − 0.00457386591299641, R² = 0.545737785138636]

Page 6: TESTING FAX USING SSS and FDR datasets

BU

[Figure: job duration (h:mm:ss) vs. input size (GB), with linear fit f(x) = 0.112782777608483 x + 0.052593461910591, R² = 0.26105314335286]

- 18 jobs hanging

- Avg. CPU efficiency: 35%

- Avg. dur. 11 h 2 min

- Avg. rate: 108 kB/s

- Total rate: 14 MB/s
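As a sanity check, the linear fits printed on the three duration plots can be evaluated at the average input size of 3.62 GB. The sketch below assumes the fit coefficients are expressed in days (the usual spreadsheet convention for time values); this unit is an assumption, not something the slides state. The function name is illustrative.

```python
AVG_INPUT_GB = 3.62      # average input size per job, from the DETAILS slide
HOURS_PER_DAY = 24.0

# (slope, intercept) exactly as printed on the plots.
# Assumed unit: days per GB and days, respectively.
FITS = {
    "MWT2": (0.0417583087209933, 0.0981891437026868),
    "AGLT2": (0.0733543217178845, -0.00457386591299641),
    "BU": (0.112782777608483, 0.052593461910591),
}

def predicted_hours(site, size_gb=AVG_INPUT_GB):
    """Evaluate the site's linear fit at size_gb and convert days to hours."""
    slope, intercept = FITS[site]
    return (slope * size_gb + intercept) * HOURS_PER_DAY

for site in FITS:
    print(f"{site}: {predicted_hours(site):.2f} h")
```

Under that assumption the three fits land within a few minutes of the quoted average durations (5 h 59 min, 6 h 14 min and 11 h 2 min). The low R² values still mean that input size explains little of the spread between individual jobs.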

Page 7: TESTING FAX USING SSS and FDR datasets

MWT2 – 300 BRANCHES

[Figure: job duration (h:mm:ss) vs. input size (GB)]

- 48 jobs in parallel

- Avg. CPU efficiency: 17%

- Avg. dur. 3 h 20 min

- Avg. rate: 926 kB/s

- Total rate: 44 MB/s
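The total rate on this slide is consistent with the per-job rate multiplied by the number of parallel jobs. A minimal sketch of that arithmetic, assuming 1 MB = 1000 kB and that all 48 jobs read concurrently at the average rate:

```python
def aggregate_rate_mb_s(avg_job_rate_kb_s, parallel_jobs):
    """Aggregate rate in MB/s if every parallel job reads at the average rate."""
    return avg_job_rate_kb_s * parallel_jobs / 1000.0

# Numbers quoted on the "MWT2 - 300 branches" slide.
estimate = aggregate_rate_mb_s(avg_job_rate_kb_s=926, parallel_jobs=48)
print(f"estimated total rate: {estimate:.1f} MB/s")
```

The estimate agrees with the quoted 44 MB/s to within rounding.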

Page 8: TESTING FAX USING SSS and FDR datasets

CONCLUSION 1

Rechecked that dq2-put times were not included.

Times seem to be properly measured.

Need to solve the mystery of the huge CPU times.

• May have to move to the C++ version.

Page 9: TESTING FAX USING SSS and FDR datasets

SSS DOING XRDCP

The same dataset, but doing a simple xrdcp to /dev/null.

Up to 290 jobs in parallel (UC3 and UCT3)
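A copy of this kind could be assembled as in the sketch below. It only builds the command line and does not run it. The host is the endpoint named on the DETAILS slide; the file path is a hypothetical placeholder, the helper name is illustrative, and port 1094 is the XRootD default rather than a value taken from the slides.

```python
import shlex

def build_xrdcp_command(host, path, port=1094):
    """Return an argv list that copies one remote file to /dev/null.

    XRootD URLs have the form root://host:port//absolute/path,
    so the path is joined with a double slash.
    """
    url = f"root://{host}:{port}//{path.lstrip('/')}"
    # -f lets xrdcp overwrite the existing target (/dev/null).
    return ["xrdcp", "-f", url, "/dev/null"]

# Hypothetical file path, for illustration only.
cmd = build_xrdcp_command("fax.mwt2.org", "/hypothetical/path/file.root")
print(shlex.join(cmd))
```

Writing to /dev/null removes local disk speed from the measurement, so the timing reflects the network and the remote storage only.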

Page 10: TESTING FAX USING SSS and FDR datasets

SSS DOING XRDCP

Wanted to test all sites that are in FAX and have the FDR dataset.

Most did not work:

• When asked through glrd.usatlas.org.

• Some of them even when asked directly.

• Some work for 5-10 files but then give up.

• Some work on repeated queries.
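Since some sites only answer on repeated queries, a manual check can record how many attempts a read needs before it succeeds. The sketch below is generic: `read_file` stands for any hypothetical callable that returns True on success, and the flaky source is simulated rather than a real site.

```python
def probe_with_retries(read_file, max_attempts=3):
    """Call read_file() until it succeeds.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if read_file():
                return attempt
        except OSError:
            pass  # an I/O error counts as a failed attempt
    return None

def make_flaky(succeed_on):
    """Simulated source that fails until the given call number."""
    state = {"calls": 0}
    def read_file():
        state["calls"] += 1
        return state["calls"] >= succeed_on
    return read_file

print(probe_with_retries(make_flaky(succeed_on=2)))
```

Logging the attempt count per site separates "works on repeated queries" from "does not work at all", which a single pass/fail test cannot do.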

ML monitor not adequate anymore.

• CERN, some UK sites sending all traffic

• Something strange with AGLT2 numbers

• Something wrong with ML

Page 11: TESTING FAX USING SSS and FDR datasets

SSS DOING XRDCP

Errors mostly:

Last server error 10000 ('') Error accessing path/file for … (BNL)

Very strange error in setting up environment. Not FAX related.

Created //.asetup. Please look and (optional) edit it.
AtlasSetup(WARNING): Unable to write ${HOME} save file
mkdir: cannot create directory `//workarea': Permission denied
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/utilities/createUserASetup.sh: line 40: //.asetup: Permission denied

Page 12: TESTING FAX USING SSS and FDR datasets

RESULTS

Page 13: TESTING FAX USING SSS and FDR datasets

CONCLUSION 2

Automatic tests for SSB are not enough.

In the absence of users who would report problems, additional manual checks will be needed from time to time.

Monitoring needs to be validated end to end.

Huge difference in rates – need cost matrix ASAP

Rates observed sound reasonable.

Our understanding would benefit hugely from perfSONAR tests over the same links.

Page 14: TESTING FAX USING SSS and FDR datasets

TESTING FAX USING HC AND FDR DATASETS

2ND APRIL 2013

Page 15: TESTING FAX USING SSS and FDR datasets

20019750

- RC pilot

- Data from SLAC only

Page 16: TESTING FAX USING SSS and FDR datasets

20019750

7 worked

3 did not start

4 failed

Page 17: TESTING FAX USING SSS and FDR datasets

20019750

SWT2_CPB: Log put error: Error copying the file: 256, cp: cannot create regular file /xrd/atlasuserdisk/user.gangarbt.hc20019750.ANALY_SWT2_CPB.25/user.gangarbt.32893735._

SLAC: Put error: Error copying the file: 256, cp: accessing `/xrootd/atlas/atlasuserdisk/user.gangarbt.hc20019750.ANALY_SLAC.43/user.gangarbt.32887595.EXT0._00418.HWWSkimmedNTUP.root?oss.cgroup=ATLASUSERDISK': Transport endpoint is not connected

QMUL: Get error: Staging input file failed

MWT2: Download: 2444 seconds

ROMA1: Finished: 44, Timed out: 12

FZK: Finished: 4, Timed out: 46. Get error: Staging input file failed

ECDF: Finished: 36, Failed: 11. pilotErrorDiag: Too little space left on local disk to run job

CERN: Get error: Staging input file failed

BU: Finished: 23, Failed: 12. Not enough local space for staging input files and run the job

AGLT: Finished: 17

BNL: Finished: 231, Failed: 8 (lost heartbeat or unspecified)

OU_OCHEP_SWT2, JINR, FZU: did not start
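For the sites where both finished and unfinished counts are listed, a success fraction makes the comparison easier to read. The sketch below uses only the counts quoted above and treats timed-out and failed jobs alike, which is a simplifying assumption. Sites with no unfinished count listed (such as AGLT) are left out.

```python
# (finished, not finished) for test 20019750, as listed on the slide.
# Timed-out and failed jobs are both counted as "not finished".
COUNTS = {
    "ROMA1": (44, 12),
    "FZK": (4, 46),
    "ECDF": (36, 11),
    "BU": (23, 12),
    "BNL": (231, 8),
}

def success_fraction(finished, not_finished):
    """Fraction of jobs that finished, out of all jobs with a known outcome."""
    return finished / (finished + not_finished)

# Sites ordered from most to least successful.
ranking = sorted(COUNTS, key=lambda s: success_fraction(*COUNTS[s]), reverse=True)

for site in ranking:
    print(f"{site}: {success_fraction(*COUNTS[site]):.1%}")
```

On these numbers BNL comes out on top and FZK at the bottom, with most of the FZK jobs timing out.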

Page 18: TESTING FAX USING SSS and FDR datasets

20019750

Page 19: TESTING FAX USING SSS and FDR datasets

20019749

- RC pilot

- Data from anywhere

Page 20: TESTING FAX USING SSS and FDR datasets

20019749

The same idea as 20019750, but with many more sites and random files: user.flegger.*…

Did not work as I expected: each site always ran against a randomly chosen but fixed dataset.

Page 21: TESTING FAX USING SSS and FDR datasets

20019749

Page 22: TESTING FAX USING SSS and FDR datasets

CONCLUSION 3

While there are many failures, some seem easy to fix (not enough space on disk, etc.)

Some are the same ones observed in SSS based tests.

We need to look at performance. Often it is better to fail than have very low performance. How low is unacceptably low?

Need to start looking at sites that are not part of FAX.

Page 23: TESTING FAX USING SSS and FDR datasets

DIRECT FDR HC JOBS

Page 24: TESTING FAX USING SSS and FDR datasets

CONCLUSION

Testing:

• Need faster turn around.

• Would it help:

  • One HC-submitted job every 6 hours at each ANALY queue

  • Against a very stable door

• With the tools we have now there is no way to precisely stress test sites.

• Fill in the table on slide 21 and make it green.

Monitoring:

• ML almost useless now.

• Need full validation, especially of the CERN FAX dashboard

Page 25: TESTING FAX USING SSS and FDR datasets

SYSTEMATIC FDR LOAD TESTS IN PROGRESS

US cloud results. 10 jobs × 10 SMWZ files ≈ 50 GB

[Figure: XRDCP rate (MB/s, axis 0 to 80) by source (MWT2, BNL-ATLAS, AGLT2, BU_ATLAS_Tier2, WT2); series: BNL-ATLAS, AGLT2, OU_OCHEP_SWT2]

[Figure: Read 10% ev., 30MB TTC: rate (MB/s, axis 0 to 25) by source (MWT2, BNL-ATLAS, AGLT2, BU_ATLAS_Tier2, WT2); series: BNL-ATLAS, AGLT2, OU_OCHEP_SWT2]

CPU limited

Factors affecting spreads: pair-wise network latency, throughput, storage “business”

Page 26: TESTING FAX USING SSS and FDR datasets

SYSTEMATIC FDR LOAD TESTS IN PROGRESS

US cloud results

Page 27: TESTING FAX USING SSS and FDR datasets

SYSTEMATIC FDR LOAD TESTS IN PROGRESS

EU cloud results

[Figure: XRDCP rate (MB/s, axis 0 to 120) by source (BNL-ATLAS, CERN-PROD, ECDF, ROMA1, QMUL); series: BNL-ATLAS, CERN-PROD, ECDF, DESY-HH, ROMA1, QMUL]

Page 28: TESTING FAX USING SSS and FDR datasets

SYSTEMATIC FDR LOAD TESTS IN PROGRESS

EU cloud results

events/s (rows: source, columns: destination)

source      BNL-ATLAS  CERN-PROD  ECDF    ROMA1   QMUL
BNL-ATLAS   126.76     29.4       25.1    26.05   57.26
CERN-PROD   82.68      232.52     108.46  123.52  145.96
ECDF        80.68      56.06      252.39  62.83   145.18
ROMA1       32         73.66      23.95   197.01  49.72
QMUL        41.34      24.14      52.2    99.43   105.46

MB/s (rows: source, columns: destination)

source      BNL-ATLAS  CERN-PROD  ECDF    ROMA1   QMUL
BNL-ATLAS   13.07      3.03       2.61    2.65    5.84
CERN-PROD   8.36       23.26      11.02   12.71   14.68
ECDF        8.23       5.64       25.14   6.52    14.42
ROMA1       3.15       7.49       2.47    20.77   4.79
QMUL        4.26       2.6        5.33    9.65    10.38

[Figure: Read 10% events, 30MB TTC: rate (MB/s, axis 0 to 30) by source (BNL-ATLAS, CERN-PROD, ECDF, ROMA1, QMUL); series: BNL-ATLAS, CERN-PROD, ECDF, ROMA1, QMUL]