BNL Grid Projects

Page 1

BNL Grid Projects

Page 2

2

Outline

Network/dCache

USATLAS Tier 1 Network Design

TeraPaths

Service Challenge 3

Service Challenge 4 Planning

USATLAS OSG Configuration

LCG 2 Status

3D (Distributed Deployment of Databases) Project

PHENIX Data Transfer (Non-USATLAS)

Page 3

Network/dCache

Page 4

4

Current Network Configuration

How is (or was) our network configured for SC3? What performance did we observe? What adjustments did we make? How significant is (or has been) the firewall? How many servers, and of what kind, did we use for dCache?

Page 5

5

Network in the Past

The SC3 throughput performance: the peak rate sustained for several hours was 150 MB/s, and the average transfer rate was 120 MB/s. During the SC3 service phase we re-installed the dCache system and tuned it; we experienced some data transfer problems, but we could still maintain a transfer rate of around 100 MB/s for several hours.

Several adjustments were made after the SC3 throughput phase:

September: the dCache write pool disks were changed from RAID 0 to RAID 5 to add redundancy for the precious data. The file system was switched to EXT3 because an XFS bug crashed the RAID 5-based disks. Performance was degraded for several weeks afterwards.

December: we upgraded the dCache OS to RHEL 4.0 and redeployed XFS on the dCache write pool nodes.

We constantly hit a performance bottleneck of 1 Gbps. We found excessive traffic between the door nodes (SW9) and the pool nodes (SW7). That traffic already ran over aggregated Ethernet channels (3 x 1 Gbps) between the two ATLAS switches, but the hashing algorithm always sent it down one physical fiber, which led to an unbalanced load distribution (see the sketch after this list).

We finally relocated all dCache servers onto one network switch to avoid inter-switch traffic. We did not find any performance issues associated with the firewall, but the firewall does drop some packets between the two ATLAS subnets (130.199.48.0 and 130.199.185.0), which prevents job submission from the ATLAS grid gatekeeper to the Condor pool. This problem does not affect SC3 data transfer to the BNL dCache system.
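The load imbalance described above comes from how port-channel hashing pins each flow to a single member link. Below is a minimal illustrative sketch of that behavior; the hash function and addresses are hypothetical, not the actual vendor algorithm used on the ATLAS switches.

```python
# Sketch: why link-aggregation hashing can pin traffic to one member link.
# The hash function and IP addresses below are illustrative only.

def member_link(src_ip: str, dst_ip: str, n_links: int = 3) -> int:
    """Pick one of n_links member links from a hash of the src/dst pair."""
    def last_octet(ip: str) -> int:
        return int(ip.rsplit(".", 1)[1])
    # Many switches hash only on address fields, so every packet of a given
    # src/dst pair always takes the same physical link.
    return (last_octet(src_ip) ^ last_octet(dst_ip)) % n_links

# A few heavy door-to-pool flows (hypothetical addresses):
flows = [("130.199.185.10", "130.199.48.20"),
         ("130.199.185.11", "130.199.48.21"),
         ("130.199.185.12", "130.199.48.22")]

for src, dst in flows:
    print(f"{src} -> {dst} uses member link {member_link(src, dst)}")
# Here two of the three flows hash onto the same 1 Gbps member of the
# 3 x 1 Gbps channel, so that one link carries most of the bulk traffic.
```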

Page 6

6

Current dCache Configuration

dCache consists of write pool nodes, read pool nodes, and core services (courtesy of Zhenping):

PNFS core server: 1 node (dedicated), RHEL 4.0, Dell 3.0 GHz
SRM server (door): 1 node (dedicated), RHEL 4.0, Dell 3.0 GHz
GridFTP and DCAP core servers (doors): 4 nodes (dedicated), RHEL 4.0, Dell 3.0 GHz
Internal/external read pools: 322 nodes (shared), 145 TB, SL3, mix of Penguin 3.0 GHz and Dell 3.4 GHz
Internal/external write pools: 8 nodes (dedicated), 1 TB, RHEL 4.0, Dell 3.0 GHz
Total: 336 nodes, 146 TB

Page 7

7

One BNL dCache Instance

[Diagram: the single BNL dCache instance. SRM, GridFTP, and DCap doors front the PnfsManager and PoolManager; SRM clients, GridFTP clients, DCap clients, and the Oak Ridge batch system talk to the doors over the control channel, while data flow over the data channel between the read pools, write pools, the clients, and HPSS.]

Page 8

8

Future Network/dCache Plan

The design of the USATLAS network is shown in the following slides.

The network bandwidth to the ACF will be 20 Gbps of redundant external bandwidth; the BNL-to-CERN connection is 10 Gbps.

dCache should be expanded to accommodate LHC data. We want to avoid mixing LHC data traffic with the remaining ATLAS production traffic: either we create a dedicated dCache instance, or we dedicate a fraction of dCache resources (a separate dCache write pool group) to LHC data transfer. Zhenping and I prefer a dedicated dCache instance, since the number of nodes in the BNL dCache managed by the current dCache technology is running into its limit. In any case, in the next several months the LHC fraction of dCache should be able to handle 200 MB/s and hold one day's worth of data on disk (16.5 TB). We need 20 TB of local disk space (about 20% will be used for redundancy in RAID 5): 10 nodes, each with 2 TB of local disk.
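As a quick check on the numbers above, one day of data at 200 MB/s and the usable space after RAID 5 overhead work out roughly as follows (a back-of-the-envelope sketch; the 20% overhead figure is the one quoted above):

```python
# Back-of-the-envelope check of the LHC dCache buffer sizing above.
rate_mb_s = 200                                   # target transfer rate, MB/s
seconds_per_day = 24 * 3600

one_day_tb = rate_mb_s * seconds_per_day / 1e6    # MB -> TB (decimal units)
print(f"one day at {rate_mb_s} MB/s ~ {one_day_tb:.1f} TB")    # ~17.3 TB

raw_tb = 10 * 2                                   # 10 nodes x 2 TB local disk
usable_tb = raw_tb * (1 - 0.20)                   # ~20% of RAID 5 space used for redundancy
print(f"raw {raw_tb} TB -> usable ~ {usable_tb:.0f} TB")       # ~16 TB usable
# Both figures are consistent with the ~16.5 TB "one day's worth" quoted above.
```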

Page 9

USATLAS Tier 1 Network Design

Page 10

10

Current Unsolved/Unsettled Issues

LHCOPN did not address Tier 2 site issues. What is the policy on trusting non-US Tier 2 sites? We simplify the issue and treat these non-US Tier 2 sites as regular Internet end points.

LHCOPN includes T0, all T1 sites, and their existing connections: all T0-BNL and other ATLAS T1-BNL traffic will be treated as LHCOPN traffic and can share the network resources provided by US LHCNet. If one Tier 1 goes down, its LHC traffic will be routed via another Tier 1 and use a fraction of the network resources owned by that Tier 1. This type of traffic does not affect the BNL internal network design. The AUP should be negotiated between Tier 1 sites; that has not been done yet.

Page 11

11

User Scenarios

1. LHC data is transferred via LHCOPN from CERN to BNL. Data is transferred into dCache, then migrated into HPSS. A small fraction of the data will be read immediately by users at Tier 2s. (Volume_{LHC})

2. All Tier 2s upload their simulation/analysis data to the Tier 1 dCache site. The data will be immediately replicated to the dCache cluster and migrated into HPSS. (Volume_{Tier 2})

3. Physicists at Tier 3s read data (input data) from the Tier 1 dCache read pool, run analysis/transformation at their home institution, and upload the resulting data to the Tier 1 dCache write pool. The results are then immediately replicated into the dCache read pool and archived into HPSS. (Volume_{Physicists} = Volume_{Inputs} + Volume_{Results})

4. BNL owns a fraction of the ATLAS reconstruction data (ESD, AOD/TAG). This data will be read from dCache and sent to other Tier 1 sites. Similarly, BNL needs to read the same type of data from other Tier 1s. (Volume_{T1} = Volume_{in} + Volume_{out})

5. European Tier 2/2+ sites need to read data from BNL; this traffic will be treated as regular Internet traffic.

The total data volume that we put on the network links and backplane is:
Volume_{Total} = 2*Volume_{LHC} + 3*Volume_{Tier 2} + Volume_{Inputs} + 3*Volume_{Results} + Volume_{T1} + Volume_{Others}
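To make the bookkeeping explicit, the sketch below evaluates the total-volume formula for a purely hypothetical set of per-scenario volumes; the numbers are placeholders, only the multipliers come from the scenarios above.

```python
# Hypothetical per-scenario volumes in TB/day (placeholders, not estimates from this talk).
V_LHC, V_Tier2, V_Inputs, V_Results, V_T1, V_Others = 17.0, 5.0, 2.0, 1.0, 8.0, 1.0

# The multipliers follow the scenarios: LHC data is written to dCache and then
# migrated to HPSS (2 passes); Tier 2 uploads are received, replicated, and
# archived (3 passes); Tier 3 results likewise make 3 passes, while their
# inputs are read once.
V_total = 2*V_LHC + 3*V_Tier2 + V_Inputs + 3*V_Results + V_T1 + V_Others
print(f"Volume_Total = {V_total:.1f} TB/day on the network links and backplane")
```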

Page 12

12

Requirements

dCache brings the subsystems in the ACF (grid and computing cluster) even closer together, with the computing cluster serving as the data storage system. Data will be constantly replicated among them. Any connection restriction (firewall conduits) among them will potentially impact functionality and performance.

We should isolate internal ATLAS traffic within the ATLAS network domain.

We need to optimize the network traffic volume between the BNL campus and the ACF. What fraction of the data are we going to filter through the firewall (items 1, 2, 3, ...)? Any traffic that we plan to firewall may double or triple the tax on the link between the BNL campus and the ACF.

Any operational issue in the BNL campus network should not impact ACF internal network traffic between different USATLAS subnets.

We should not overload the BNL firewall with large volumes of physics data.

Page 13

13

USATLAS/BNL LAN

Page 14

14

Option 1

[Diagram: dCache, HPSS, and the ACF farm sit behind the ATLAS router, connected through DL2 with ACL/policy routing to CERN, the Internet/analysis traffic, US ATLAS Tier 2s, and other Tier 1s. CERN LHCOPN traffic and Internet traffic are shown separately. All traffic between any two hosts in the ACF is routed or switched.]

Page 15

15

Option 2

[Diagram: the ATLAS router with ACL/policy routing connects DL2, CERN, the Internet, US ATLAS Tier 2s, and other Tier 1s to the LHC/SC4 dCache, grid systems, HPSS, and the ACF farm. CERN LHCOPN traffic and Internet traffic are shown separately. LHC data to HPSS is internal to ATLAS; it never leaves the ATLAS router.]

Page 16

16

Option 3

[Diagram: the ATLAS router with ACL/policy routing connects DL2, CERN, the Internet, US ATLAS Tier 2s, and other Tier 1s over a single network cable to the LHC dCache, grid systems, HPSS, and the ACF farm. CERN LHCOPN traffic and Internet traffic are shown separately. LHC data to HPSS is external to ATLAS; it leaves the ATLAS router.]

Page 17

17

Option 4: put all traffic routed via DL2.

[Diagram: all traffic is routed via DL2; the ATLAS router connects CERN, the Internet, USATLAS Tier 2 traffic, and other Tier 1s to the LHC dCache, grid system, HPSS, and the ACF farm. CERN LHCOPN traffic and Internet traffic are shown separately. LHC data to HPSS is routed via DL2, so the traffic needs to leave the ATLAS router.]

Disadvantages: all ATLAS traffic may double or triple the tax on the BNL/USATLAS link; network management is not easy; the firewall becomes the bottleneck; this option does not utilize the ATLAS routing capability.

Page 18

TeraPaths

Page 19

19

QoS/MPLS

QoS/MPLS technology can be manually deployed in the BNL campus/USATLAS network now. The behavior is well understood, and LAN QoS expertise is readily available.

The TeraPaths software system is under intensive re-development to approach production quality. It will be ready by the end of February. We will need one month (March) to verify it and deploy it into our production network infrastructure. When SC4 starts, we can quantitatively manage the BNL LAN to send and receive data. The following month will focus on deploying the software package to the Tier 2 sites participating in SC4.

Page 20

20

What Is TeraPaths?

This project investigates the integration and use of LAN QoS and MPLS-based differentiated network services in the ATLAS data-intensive distributed computing environment, as a way to manage the network as a critical resource.

The collaboration includes BNL and the University of Michigan, with other collaborators from OSCARS (ESnet), Lambda Station (FNAL), and the monitoring project at SLAC.

Page 21

21

TeraPaths System Architecture

[Diagram: TeraPaths system architecture. Site A (initiator) and Site B (remote) each run a set of web services: route planner, scheduler, user manager, site monitor, router manager, and hardware drivers. QoS requests enter via a web page, APIs, or the command line. The sites communicate across the WAN through WAN web services and WAN monitoring.]

Page 22

22

TeraPaths SC2005

Two bbcp transfers periodically copied data from BNL disk to UMICH disk. One used class 2 traffic (200 Mbps) and the other used class EF (expedited forwarding, 400 Mbps). Iperf sent out background traffic. The allocated network resource was 800 Mbps. We could quantitatively control shared network resources for mission-critical tasks.

This verified the effectiveness of MPLS/LAN QoS and its impact on prioritized traffic, background best-effort traffic, and overall network performance.
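For illustration, this is roughly how a host-level transfer tool could mark its own traffic with the EF (expedited forwarding) class so that a QoS-enabled LAN prioritizes it. TeraPaths itself manages QoS in the network devices, so this sketch only shows the end-host side; the DSCP-to-TOS arithmetic is standard.

```python
# Sketch: marking a socket's traffic with DSCP EF (expedited forwarding) on Linux.
import socket

DSCP_EF = 46                  # expedited forwarding code point
TOS_EF = DSCP_EF << 2         # DSCP occupies the upper 6 bits of the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_EF)
print("TOS byte set to", sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))
# Any data sent on this socket now carries the EF marking; switches and routers
# along a QoS-enabled path can schedule it into the expedited-forwarding class.
sock.close()
```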

Page 23

Service Challenge 3

Page 24

24

What SC3 configuration (hardware, software, middleware) did we use?

Page 25

25

Services at BNL

FTS client + server (FTS 1.3) with its back-end Oracle and MyProxy servers. FTS handles reliable file transfer from CERN to BNL. Most functionality was implemented, and it became reliable at controlling data transfer after several rounds of redeployment for bug fixes: a short timeout value causing excessive failures, and incompatibility with dCache/SRM.

FTS does not support direct data transfer from CERN into the BNL dCache data pool servers (dCache SRM third-party transfer). The transfers actually go through a few dCache GridFTP door nodes at BNL, which presents a scalability issue. We had to move these door nodes to non-blocking network ports to distribute the traffic.

Both BNL and RAL discovered that the number of streams per file could not be more than 10 (a bug?).

Networking to CERN: the network for dCache was upgraded to 2 x 1 Gbps around June. It is a shared link with a long round-trip time (> 140 ms, while the RTT from European sites to CERN is about 20 ms). Occasional packet losses were discovered along the BNL-CERN path. An aggregated bandwidth of 1.5 Gbps was observed with iperf using 160 TCP streams.
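The long round-trip time is why so many parallel streams were needed: the bandwidth-delay product of the BNL-CERN path is far larger than a typical per-stream TCP window. A rough calculation (the 256 KB window is an assumption for illustration):

```python
# Bandwidth-delay product of the BNL-CERN path and the number of TCP streams
# needed to fill it. The per-stream window size is an illustrative assumption.
link_bps = 2 * 1e9            # 2 x 1 Gbps
rtt_s = 0.140                 # >140 ms round-trip time quoted above

bdp_bytes = link_bps * rtt_s / 8
print(f"bandwidth-delay product ~ {bdp_bytes / 1e6:.0f} MB in flight")        # ~35 MB

window_bytes = 256 * 1024     # assumed per-stream TCP window
print(f"streams needed at 256 KB windows ~ {bdp_bytes / window_bytes:.0f}")   # ~134
# Broadly consistent with the 160 streams iperf used to reach ~1.5 Gbps.
```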

Page 26

26

Services Used at BNL SC3

dCache/SRM (v1.6.6.3, with the SRM 1.3 interface). The detailed configuration can be found on slide 6. All read pool nodes run Scientific Linux 3 with the XFS module compiled.

We experienced high load on the write pool servers during large data transfers; this was fixed by replacing the EXT file systems with XFS.

The core server crashed once. The reason was identified and fixed.

The buffer space for data written into the dCache system is small (1.0 TB).

dCache can now deliver up to 200 MB/s of input/output (limited by network speed).

The LFC (1.3.4) client and server were installed on the BNL replica catalog server. The basic functionality (lfc-ls, lfc-mkdir, etc.) was tested. We will populate LFC with the entries from our production Globus RLS server.

An ATLAS VO box (DDM + LCG VO box) was deployed at BNL.

Two instances of the Distributed Data Management (DDM) software (DQ2) were deployed at BNL, one for Panda production and one for the SC3 service phase.

Page 27

27

How did the SC3 infrastructure evolve?

FTS was upgraded from 1.2 to 1.3.

dCache was upgraded from 1.6.5 to 1.6.6.3 (Dec 7, 2005). The write pool file system was migrated from EXT3 to XFS before the SC3 throughput phase. After the throughput phase, we migrated the underlying disks from RAID 0 to RAID 5 for better reliability, but this triggered an XFS file system bug when using RAID 5 and crashed the server. We had to switch back to EXT3, which avoided the bug but significantly reduced performance. The recent OS upgrade on the dCache write pool and core servers alleviated the XFS bug (it did not fix it), so we migrated back to XFS for better performance.

The dCache software on the read pools was upgraded as well. The OS on the read pool nodes has not changed since the May/June upgrade.

Page 28

28

BNL SC3 data transfer

All data are actually routed through the GridFTP doors.

The SC3 transfer rates monitored at BNL and at CERN are consistent.

Page 29

29

Data Transfer Status

BNL stabilized FTS data transfer with a high successful-completion rate, as shown in the left image, during the throughput phase.

We attained a 150 MB/s rate for about one hour with a large number (> 50) of parallel file transfers during the SC3 throughput phase.

Page 30

30

Final SC3 Throughput Data Transfer Results

Page 31

31

Lessons Learned From SC2

Four file transfer servers with a 1 Gigabit WAN connection to CERN.

We met the performance/throughput challenge (70-80 MB/s disk to disk): we enabled data transfer between dCache/SRM and the CERN SRM at openlab, designed our own scripts to control SRM data transfer, and enabled data transfer between BNL GridFTP servers and CERN openlab GridFTP servers controlled by the Radiant software.

Many components needed to be tuned. With the long round-trip time and high packet drop rate, we had to use multiple TCP streams and multiple file transfers to fill the network pipe (see the sketch after this list). Parallel file I/O was sluggish with EXT2/EXT3: many processes sat in I/O-wait state, and the more file streams, the worse the file system performance. XFS gave a slight improvement, but file system parameters still need tuning.
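The combined effect of long RTT and packet loss on a single TCP stream can be made concrete with the standard Mathis et al. estimate, throughput ~ (MSS/RTT) * C / sqrt(p). The loss rate below is an assumed example value, not a measurement from SC2.

```python
# Mathis et al. estimate of steady-state throughput for one TCP stream:
#   throughput ~ (MSS / RTT) * C / sqrt(p), with C ~ 1.22.
from math import sqrt

mss_bytes = 1460          # typical Ethernet MSS
rtt_s = 0.140             # BNL-CERN round-trip time (>140 ms)
loss = 1e-4               # assumed packet loss rate (0.01%), illustration only

throughput_bps = (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss)
print(f"single stream ~ {throughput_bps / 1e6:.1f} Mbit/s")       # ~10 Mbit/s
print(f"streams to fill 1 Gbit/s ~ {1e9 / throughput_bps:.0f}")   # ~100 streams
# Hence multiple TCP streams and multiple parallel file transfers are needed
# to fill the pipe on a lossy, high-RTT path.
```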

Page 32

32

Some Issues During SC3 Throughput Phase

The Service Challenge also challenges resources: we tuned network pipes and optimized the configuration and performance of the BNL production dCache system and its associated OS and file systems. It took more than one staff member to stabilize the newly deployed FTS, dCache, and network infrastructure; the staffing level decreased as the services became stable.

Limited resources are shared by experiments and users. At CERN, the SC3 infrastructure is shared by multiple Tier 1 sites. Due to the heterogeneous nature of the Tier 1 sites, data transfer for each site should be optimized non-uniformly based on the site's characteristics: network RTT, packet loss rate, experiment requirements, etc. At BNL, the network and dCache are also used by production users; we need to closely monitor SRM and the network to avoid impacting production activities.

At CERN, James Casey alone handles answering email, setting up the system, reporting problems, and running data transfers, providing 16x7 support by himself. How do we scale to a 24x7 production support center? How do we handle the time difference between the US and CERN? (We tried the CERN support phone once, but the operator did not speak English.)

Page 33

33

Some Issues During SC3 Service Phase

FTS was changed from version 1.3 to 1.4 at CERN. FTS 1.4 was supposed to support direct third-party transfer, but when direct transfer into the pools (bypassing the doors) was used, it could not handle the long waits, which led to channel lockup. We therefore had to switch to glite-url-copy, which handles transfers into dCache in an ad hoc way.

dCache was continuously improved for better performance and reliability over the past several months, and has recently reached a stable state.

The SC3 service phase exposed several problems when it started. We took the opportunity to find and fix them. Performance and stability improved continuously over the course of SC3, and we were able to achieve high performance by the end of SC3. A good learning experience indeed.

SC operations need to be improved so that problems are reported in a timely manner.

Page 34

34

What has been done

The SC3 throughput phase showed good data transfer bandwidth.

SC3 Tier 2 data transfer: data were transferred to three selected Tier 2 sites.

SC3 tape transfer: tape data transfer was stabilized at 60 MB/s using loaned tape resources. This met the goal defined at the beginning of the Service Challenge, and the full chain of data transfer was exercised.

SC3 service phase: we showed very good peak performance.

Page 35

35

General view of SC3

When everything runs smoothly, BNL gets very good results (100 MB/s).

The middleware (FTS) is stable, but there were still many compatibility issues: FTS does not work effectively with the new version of dCache/SRM (version 1.3). We had to turn off FTS-controlled direct data transfer into the dCache pools, since a large number of timeout errors completely blocked the data transfer channel.

We need to improve SC operations, including performance monitoring and timely problem reporting, to prevent deterioration and allow quick fixes.

We fixed many dCache issues after its upgrade. We also tuned the dCache system to work under the FTS/ATLAS DDM system (DQ2).

We achieved the best performance among the dCache sites that participated in the ATLAS SC3 service phase; 15 TB of data were transferred to BNL. Sites using the CASTOR SRM showed better performance.

Page 36

SC3 re-run and SC4 Planning

Page 37

37

SC3 re-run

We upgraded the BNL dCache core server OS to RHEL 4 and dCache to 1.6.6, starting Dec 7, 2005.

We will add a few more dCache pool nodes if the software upgrades do not meet our expectations.

FTS should be upgraded if the fix needed to prevent channel blocking is ready before the new year.

LCG BDII needs to report the status of dCache and FTS (before Christmas).

We would like to schedule a test period at the beginning of January for stability and scalability.

Everything should be ready by January 9. The re-run will start on January 16.

Page 38

38

What will our SC4 configuration look like (network, servers, software, etc.)?

The physical network location for SC4 is shown on slide 15.

We subscribed two subnets to LHCOPN (130.199.185.0/24 and 130.199.48.0/23). The current dCache instance will be on these two subnets; the new dCache instance for LHC/SC4 will be in 130.199.185.0/24 exclusively:

10 dCache write/read pool servers.
4 door servers (RAL has already merged door nodes with pool nodes; we will evaluate whether that is doable at BNL).
2 core servers (dCache PNFS manager and SRM server).
The newest dCache production release: dCache 1.6.6.3+.
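Since the LHC/SC4 instance is confined to 130.199.185.0/24, a quick way to sanity-check where a dCache host sits is to test its address against the two LHCOPN-subscribed subnets. A small sketch using the Python standard library; the host addresses are hypothetical.

```python
# Check which LHCOPN-subscribed subnet (if any) a dCache host falls into.
import ipaddress

lhcopn_subnets = [ipaddress.ip_network("130.199.185.0/24"),
                  ipaddress.ip_network("130.199.48.0/23")]
sc4_subnet = lhcopn_subnets[0]     # the LHC/SC4 dCache instance lives here exclusively

hosts = ["130.199.185.42", "130.199.49.10", "130.199.60.5"]   # hypothetical hosts
for h in hosts:
    addr = ipaddress.ip_address(h)
    on_lhcopn = any(addr in net for net in lhcopn_subnets)
    print(f"{h}: on LHCOPN subnets = {on_lhcopn}, "
          f"eligible for LHC/SC4 instance = {addr in sc4_subnet}")
```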

Page 39

39

BNL Service Challenge 4 Plan

Several steps are needed to set up hardware or services (e.g., choose, procure, start install, end install, make operational), starting in January and ending before the beginning of March: LAN and tape system; FTS, LFC, DDM, LCG VO boxes and other baseline services maintained with an agreed SLA and supported by the USATLAS VO; and a dedicated LHC dCache/SRM write pool providing up to 17 TB of storage (24 hours' worth of data), to be done in sync with the LAN and WAN work.

Deploy and strengthen the necessary monitoring infrastructure based on Ganglia, Nagios, MonALISA, and LCG R-GMA (February).

Drill for service integration (March): simulate network failures and server crashes and how the support center will respond to the issues; make the Tier 0/Tier 1 end-to-end high-performance network operational in terms of bandwidth, stability, and performance.

Page 40

40

BNL Service Challenge 4 Plan

April 2006: establish stable data transfer at 200 MB/s to disk and 200 MB/s to tape.

May 2006: disk and computing farm upgrades.

July 1, 2006: stable data transfer driven by the ATLAS production system and ATLAS data management infrastructure between T0 and T1 (200 MB/s), with services provided to satisfy the SLA (service level agreement).

Details of involving Tier 2s are being planned as well (February and March): the UC dCache needs to be stabilized and operational in February; UTA and BU need to have dCache in March; baseline client tools should be deployed at the Tier 2 centers; and baseline services should support Tier 1-Tier 2 data transfer before SC4 starts.

Page 41

3D project

Page 42

42

Oracle part

Tier 0 - Tier 1 Oracle Streams replication.

BNL joined the 3D replication testbed; Streams replication was set up successfully between CERN and BNL in October 2005.

Several experiments foresee Oracle clusters for online systems.

The focus is on Oracle database clusters as the main building block for Tier 0 and Tier 1.

We propose to set up pre-production services for March and a full service after 6 months of deployment experience.

Page 43

43

BNL 3D Oracle Production Schedule

Dec 2005: hardware setup (done). Two nodes with 500 GB of Fibre Channel storage.

Jan 2006: hardware acceptance tests, RAC (Real Application Clusters) setup.

March 2006: service starts.

May 2006: service review ---> hardware defined for full production.

September 2006: full database service in place.

Page 44

44

MySQL Database replication at BNL

Oracle - MySQL replication:

Database: ATLAS TAG DB.
DB server at BNL: dbdevel2 (MySQL 4.0.25).
Use case: Oracle at CERN to MySQL at BNL (push).
Tool: Octopus replicator (Java-based extraction, transformation, and loading).
Thanks to Julius Hrivnac (LAL, Orsay) and Kristo Karr (ANL) for the successful collaboration.

More details in the Twiki:
https://uimon.cern.ch/twiki/bin/view/Atlas/DatabaseReplication

Page 45

45

MySQL Database replication at BNL

MySQL - MySQL replication:

Databases: geometry DB ATLASDD; MySQL conditions DBs LArNBDC2 and LArIOVDC2.
MySQL DB servers at BNL: dbdevel1.usatlas.bnl.gov (MySQL 4.0.25) and db1.usatlas.bnl.gov (MySQL 4.0.25).
We collected the first experience with the CERN-BNL ATLAS DB replication procedure, using both mysqldump and on-line replication.
The current versions correspond to the most recent ATLAS production release, 11.0.3.
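For the mysqldump-based path mentioned above, one replication step amounts to dumping the database at the source and loading it into the BNL server. A minimal sketch: the CERN source host and the account names are placeholders; only the BNL server and database name come from this slide.

```python
# Sketch of a mysqldump-based CERN -> BNL replication step for one database.
# Source host and accounts are placeholders; the target server (dbdevel1) and
# the ATLASDD database name are the ones listed above.
import subprocess

SRC_HOST = "mysql-source.cern.ch"          # hypothetical CERN source server
DST_HOST = "dbdevel1.usatlas.bnl.gov"      # BNL MySQL 4.0.25 server
DATABASE = "ATLASDD"                       # ATLAS geometry DB

dump = subprocess.Popen(
    ["mysqldump", "--host", SRC_HOST, "--user", "reader", DATABASE],
    stdout=subprocess.PIPE)
load = subprocess.run(
    ["mysql", "--host", DST_HOST, "--user", "writer", DATABASE],
    stdin=dump.stdout)
dump.stdout.close()
print("replication step finished with exit code", load.returncode)
```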

Page 46

LCG 2 at BNL

Page 47

47

Summary

The LCG setup at BNL is partially functional. The LCG VO box was used in SC3. There are no technical difficulties/hurdles preventing the CE and SE from becoming fully functional.

It is deployed on a mix of hardware (Dell 3.0 GHz and some VA Linux nodes): we deployed a CE, RB, SE, proxy server, monitoring node (R-GMA), and a collection of worker nodes. Some services are combined onto a single server.

Page 48

48

Progress and To Do

OS and LCG system installation and configuration are automated; the setup can be reinstalled on new hardware within 2 hours.

The installation is managed via RPM and updated via local YUM repositories, which are automatically rebuilt from CERN and other upstream sources.

GUMS controls the LCG grid-mapfile.

Site information is being published correctly, and some SFTs (site functional tests) run from CERN operations complete successfully.

We still need to configure LCG to run Condor on the ATLAS pool.

Page 49

BNL USATLAS Grid Testbed

Page 50

50

BNL USATLAS OSG Configuration

[Diagram: grid users submit grid job requests over the Internet to the OSG gatekeepers, which feed the RHIC/USATLAS job scheduler (Condor) and dCache. Storage includes the SRM/GridFTP servers, NFS, Panasas, local disks, and HPSS (via the HPSS movers), with GridFTP providing wide-area data transfer.]

Page 51

PHENIX Data Transfer Activities

Page 52

52

Courtesy of Y. Watanabe

Page 53

53

Page 54

54

Data Transfer to CCJ

The 2005 RHIC run ended on June 24; the plot above shows the last day of the RHIC run.

The total data transferred to CCJ (the Computer Center in Japan) is 260 TB (polarized p+p raw data).

100% of the data was transferred via WAN; the tool used was GridFTP. No 747 involved.

Average data rate: 60-90 MB/s, with a peak performance of 100 MB/s recorded in the Ganglia plot. That is about 5 TB/day!

Courtesy of Y. Watanabe
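As a rough consistency check on those numbers (simple arithmetic on the figures quoted above):

```python
# How long moving 260 TB takes at the quoted sustained rates.
total_tb = 260
for rate_mb_s in (60, 90):
    days = total_tb * 1e6 / (rate_mb_s * 86400)
    print(f"at {rate_mb_s} MB/s: ~{days:.0f} days")   # ~50 days at 60 MB/s, ~33 at 90
print(f"at 5 TB/day: ~{total_tb / 5:.0f} days")        # ~52 days
```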

Page 55

55

Network Monitoring on NAT Box

Page 56

56

[Plot: NAT box network traffic by month and year.]

Page 57

57

Network Monitoring at Perimeter Router

Page 58

58

Network Monitoring at CCJ, JAPAN

Page 59

59

Our Role

Provide effective and efficient network/grid solutions for data transfer.

Install grid tools on the PHENIX buffer boxes.

Tune the performance of the network path along the PHENIX counting house / RCF / BNL LAN.

Install Ganglia monitoring tools for the data transfer.

Diagnose problems and provide fixes.

For future PHENIX data transfers we will continue to play these roles. We will integrate dCache/SRM into future data transfers and automate them.

Ofer maintains the PHENIX dCache/SRM pools. He is working on a pilot transfer of data from the PHENIX dCache/SRM to CCJ.

Page 60

60

Lessons Learned

Four monitoring systems caught errors at an early stage: BNL NAT Ganglia, router MRTG (Multi-Router Traffic Grapher), CCJ Ganglia, and data transfer monitoring.

The EXT3 file system is not designed for high-performance data transfer. XFS has much better disk I/O performance at high bandwidth; this experience was used in LHC Service Challenge 3 for the ATLAS experiment.

The Broadcom BCM95703 copper gigabit network card has far fewer packet errors than the Intel Pro/1000.

There were several ESnet/SINET network outages, and traffic was rerouted to alternative paths. Problems were promptly discovered and resolved by on-call personnel and network engineers. Because of the large disk caches at both ends, no data were lost due to the network outages.