Ramping up FAX and WAN direct access

Rob Gardner, on behalf of the atlas-adc-federated-xrootd working group

Computation and Enrico Fermi Institutes, University of Chicago

ADC Development Meeting, February 3, 2014

efi.uchicago.edu ci.uchicago.edu



Examine the Layers – as in prior reports

• New results at increasing scale and complexity. Tests are limited to Rucio-renamed sites

Capability layers:

– Panda re-broker (future)
– HammerCloud Functional
– HammerCloud Stress
– WAN testing
– Network cost matrix (continuous)
– SSB Functional (continuous)
– Failover to Federation (production)


The New Global Logical Filename

• With Rucio we are no longer dependent on LFC
– Brings gains in stability and scalability
– Simplifies joining the Federation
– Speeds up file lookups
– Makes for much nicer looking gLFNs

• New gLFN format: /atlas/rucio/scope:filename

• N2N recalculates the Rucio LFN: /rucio/scope/xx/yy/filename

• Checks each space token at the site for such a path
– Reducing the number of space token paths will make this even more efficient
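The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the N2N plug-in itself: it assumes that xx and yy are the first two character pairs of the MD5 hex digest of "scope:filename" (the slide does not say how they are derived), and the function names and the list of space token roots are hypothetical.

```python
import hashlib
import os


def glfn_to_rucio_path(glfn):
    """Translate /atlas/rucio/scope:filename into /rucio/scope/xx/yy/filename.

    Assumption: xx and yy come from the MD5 hex digest of "scope:filename".
    """
    prefix = "/atlas/rucio/"
    if not glfn.startswith(prefix):
        raise ValueError("not a Rucio-style gLFN: %r" % glfn)
    scope, sep, filename = glfn[len(prefix):].partition(":")
    if not sep or not scope or not filename:
        raise ValueError("expected scope:filename in %r" % glfn)
    digest = hashlib.md5(("%s:%s" % (scope, filename)).encode("utf-8")).hexdigest()
    return "/rucio/%s/%s/%s/%s" % (scope, digest[0:2], digest[2:4], filename)


def find_pfn(space_token_roots, rucio_path):
    """Check each space token root in turn; return the first existing path or None."""
    for root in space_token_roots:
        candidate = os.path.join(root, rucio_path.lstrip("/"))
        if os.path.exists(candidate):
            return candidate
    return None
```

Because the lookup walks the space token roots one by one, its cost grows with the number of roots, which is why fewer space token paths make it cheaper.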


Summary of FAX Site Deployments

• Standardize the deployment procedures
– Goals are largely achieved: Twiki doc, FAX rpms in WLCG repo, etc.

• Software Components
– Xrootd release requirement and X509 are mostly achieved
– Rucio N2N deployment in progress (dCache, DPM, Xrootd, Posix (GPFS, Lustre))
o ~60% of sites have deployed the N2N: sites are either cautious about this or are delayed by a libcurl bug on the SL5 platform
– Fix is ready, but we would still like to hear from the DPM team about their validation result.
o EOS has its own, functioning N2N plug-in

• Redirection network has been stable since the switch to Rucio

• Recommending a scalable FAX site configuration for Tier1s
– Use a small xrootd cluster instead of a single machine
– Similar to multiple GridFTP doors
– BNL and SLAC use this configuration


Infrastructure: 10 redirectors


Infrastructure: 44 SEs with XROOTD


Active FAX sites


Basic redirection functionality

• Direct access from clients to sites

• Redirection to non-local data (“upstream”)

• Redirection from central redirectors to the site (“downstream”)

Uses a host at CERN that runs a set of probes against sites

Waiting:

• Rucio-based gLFN-to-PFN mapper plugin

• Storage software upgrade

• Rucio renaming


Regular Status Testing from the SSB

• Functional tests run once per hour

• Checks whether direct Xrootd access is working

• Sends an email with info to cloud support and fax-ops

Problem notification

Problem resolved


FAX Throughput


Status of Cost Matrix

• Submits jobs into the 20 largest ATLAS compute sites (continuously)

• Measures average IO to each endpoint (an xrdcp of a 100 MB file)

• Stores results in SSB, along with FTS and perfSONAR bandwidth data

• Data sent to Panda for use in WAN brokering decisions
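The measurement loop above can be illustrated with a short Python sketch. It is only an assumption about how such a probe could be organized: the `CostMatrix` class, its method names, and the site and endpoint labels are hypothetical, and the copy itself is passed in as a callable (in a real probe it would run xrdcp against the endpoint) so that the sketch stays self-contained.

```python
import time
from collections import defaultdict

FILE_SIZE_MB = 100.0  # probe file size stated on the slide


def timed_rate_mb_s(copy_fn, size_mb=FILE_SIZE_MB, clock=time.monotonic):
    """Run one copy and return its average rate in MB/s."""
    start = clock()
    copy_fn()  # placeholder for the actual transfer
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return size_mb / elapsed


class CostMatrix:
    """Keeps rate samples per (compute site, storage endpoint) pair."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, compute_site, endpoint, rate_mb_s):
        self._samples[(compute_site, endpoint)].append(rate_mb_s)

    def average(self, compute_site, endpoint):
        samples = self._samples.get((compute_site, endpoint))
        if not samples:
            return None
        return sum(samples) / len(samples)

    def best_endpoint(self, compute_site):
        """Return the endpoint with the highest average rate for this site."""
        rates = {
            ep: sum(s) / len(s)
            for (site, ep), s in self._samples.items()
            if site == compute_site
        }
        return max(rates, key=rates.get) if rates else None
```

A brokering decision could then prefer `best_endpoint(site)` for a given compute site; how Panda actually weighs this data alongside the FTS and perfSONAR numbers is not described on the slide.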


Comparison of the data used for cost matrix collection for a representative compute-site/storage-site pair.


WAN performance map

Performance map for the selection of WAN links

Can be used as a rough control factor for WAN load

Track as we see network upgrades in the next year


In Production: Failover-to-FAX

• Two-month window

• Mix of PROD and ANALY

• Failover rates are relatively modest

• About 60k jobs, 60% recovered


Failover-to-FAX rate comparisons

Low rate of usage is a measure of the existing reliability of ATLAS storage sites

[Figure: axis label "# jobs"; annotation "Storage issues"]


Failover-to-FAX rate comparisons

WAN failover IO is reasonable

Thus no penalty for the queue from using WAN failover


Failover-to-FAX enabled queue

Any Panda queue resource can be easily enabled to use FAX for the fallback case.


WAN Direct Access Testing

• Directly access remote FAX endpoints

• Reveals an interesting WAN landscape

Relative WAN event rates and CPU efficiency are very good in DE (at the scale of tens of jobs)

The question is: at what job scale does one reach diminishing returns?

(HammerCloud results from Friedrich Hoenig)


WAN Load Test (200 job scale)

• Using HC framework in DE cloud; SMWZ HWW

• Some uncertainty in the number of concurrently running jobs (not directly controllable)

• Indicates reasonable opportunity for re-brokering


Load Testing with Direct WAN IO

• Reading an FDR dataset of 744 files (~3.7 GB each) over WAN, TTC=30MB

• Limited to 250 jobs in test queue

• “Deep read”: 10% of events, all 8k branches

• Used most of the 10 Gb/s connection
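To put these numbers in proportion, the short calculation below gives an upper bound on the bandwidth available to each job. It assumes the link is fully used and shared evenly among the jobs, which is stronger than the "most of" reported on the slide, so the real per-job figure is somewhat lower. The function name is hypothetical.

```python
LINK_GBPS = 10.0        # link capacity from the slide
CONCURRENT_JOBS = 250   # job limit in the test queue


def per_job_mb_s(link_gbps, jobs):
    """Even share of the link per job, in MB/s (decimal units, 8 bits per byte)."""
    link_mb_s = link_gbps * 1000.0 / 8.0
    return link_mb_s / jobs


print(per_job_mb_s(LINK_GBPS, CONCURRENT_JOBS))  # prints 5.0
```

Under the same assumption, doubling the job count to 500 would halve the share to 2.5 MB/s per job.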


FAX user tools

• Useful for Tier 3 users or access to ATLAS data from non-grid clusters (e.g. cloud, campus cluster, etc.)

• AtlasLocalRootBase package: localSetupFAX
– Sets up dq2-client
– Sets up grid middleware
– Sets up xrootd client
– Sets up an optimal FAX access point
o Uses geographical distance from client IP to FAX endpoints
– FAX tools
o isDSinFAX.py
o FAX-setRedirector.sh
o FAX-get-gLFNs.sh

• Removes need for redirector knowledge

• Eases support
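Choosing an access point by geographical distance can be sketched as follows. This is an illustration of the idea only, not the localSetupFAX implementation: the endpoint URLs and coordinates are made up, the client coordinates are assumed to have already been obtained from an IP geolocation lookup, and great-circle (haversine) distance is one plausible choice of metric.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical endpoints with (latitude, longitude) in degrees.
ENDPOINTS = {
    "root://redirector-a.example.org": (41.8, -87.6),
    "root://redirector-b.example.org": (46.2, 6.1),
    "root://redirector-c.example.org": (52.5, 13.4),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_endpoint(client_lat, client_lon, endpoints=ENDPOINTS):
    """Return the endpoint URL closest to the client location."""
    return min(
        endpoints,
        key=lambda url: haversine_km(client_lat, client_lon, *endpoints[url]),
    )
```

Geographical distance is only a proxy for network distance, so a measured metric such as the cost matrix could give a different answer for some clients.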


Conclusions, Lessons, To-do

• Significant stability improvements for sites using the Rucio namespace mapper
– Also, with removal of LFC callouts, no redirector stability issues observed

• Tier 1 Xrootd proxy stability issues
– Have been observed for very large loads during stress tests (O(1000) clients), but with no impact on the backend SE
– Adjustments made and success on re-test
– Suggests configuration for protecting Tier 1 storage

• The WAN landscape is obviously diverse
– Cost matrix captures capacities

• Probes of 10g link scale
– Indicate appropriate WAN job level < 500 jobs
– (typically 10% CPU capacity)

• Controlled load testing on-going