BNL Grid Projects
2
Outline
Network/dCache
USATLAS Tier 1 Network Design
TeraPaths
Service Challenge 3
Service Challenge 4 Planning
USATLAS OSG Configuration
LCG 2 Status
3D (Distributed Deployment of Databases) Project
PHENIX Data Transfer (non-USATLAS)
Network/dCache
4
Current Network Configuration
How is (or was) our network configured for SC3? What performance did we observe? What adjustments did we make? How significant is (or has been) the firewall? How many servers, and of what kind, did we use for dCache?
5
Network in the Past
SC3 throughput performance: the peak rate sustained over several hours was 150 MB/s, and the average transfer rate was 120 MB/s. During the SC3 service phase we re-installed and tuned the dCache system and experienced some data transfer problems, but we could still maintain a transfer rate of around 100 MB/s for several hours.
Several adjustments made after the SC3 throughput phase:
September: the dCache write pool disks were changed from RAID 0 to RAID 5 to add redundancy for precious data. The file system was switched to EXT3 because an XFS bug crashed the RAID 5-based disks; performance was degraded for several weeks.
December: we upgraded the dCache OS to RHEL 4.0 and redeployed XFS on the dCache write pool nodes.
We constantly hit a 1 Gbps performance bottleneck. We found excessive traffic between the door nodes (SW9) and pool nodes (SW7). That traffic already ran over aggregated Ethernet channels (3x1 Gbps) between the two ATLAS switches, but the hashing algorithm always sent it down one physical fiber, leading to an imbalanced load distribution.
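The imbalance is easy to reproduce in a sketch: with per-flow hashing, every frame between one fixed door/pool pair selects the same member link of the aggregate, so a single heavy flow can never exceed 1 Gbps. This is an illustrative model only, not the switch's actual hash; the hash function and MAC addresses below are made up.

```python
# Illustrative per-flow link-aggregation hashing: all traffic between one
# fixed pair of hosts hashes to the same member link, so one heavy
# door->pool flow cannot use more than 1 Gbps of the 3x1 Gbps aggregate.
import hashlib

def lag_member(src_mac: str, dst_mac: str, n_links: int = 3) -> int:
    """Pick an aggregate member link by hashing the MAC pair (the flow key)."""
    digest = hashlib.md5(f"{src_mac}-{dst_mac}".encode()).digest()
    return digest[0] % n_links

# Every frame of a single door->pool flow lands on the same link...
links = {lag_member("00:11:22:33:44:55", "66:77:88:99:aa:bb") for _ in range(1000)}
assert len(links) == 1
# ...only many distinct host pairs would spread load across the members.
```

Relocating all dCache servers onto one switch (next bullet) sidesteps the hash entirely by removing the inter-switch aggregate from the path.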
We finally relocated all dCache servers onto one network switch to avoid inter-switch traffic. We did not find any performance issues associated with the firewall, but the firewall does drop some packets between the two ATLAS subnets (130.199.48.0 and 130.199.185.0), which prevented job submission from the ATLAS grid gatekeeper to the Condor pool. This problem does not affect SC3 data transfer to the BNL dCache system.
6
Current dCache Configuration
dCache consists of write pool nodes, read pool nodes, and core services (courtesy of Zhenping):
PNFS core server: 1 node (dedicated), RHEL 4.0, Dell 3.0 GHz
SRM server (door): 1 node (dedicated), RHEL 4.0, Dell 3.0 GHz
GridFTP and DCAP core servers (doors): 4 nodes (dedicated), RHEL 4.0, Dell 3.0 GHz
Internal/external read pool: 322 nodes (shared), 145 TB, SL3, mix of Penguin 3.0 GHz and Dell 3.4 GHz
Internal/external write pool: 8 nodes (dedicated), 1 TB, RHEL 4.0, Dell 3.0 GHz
Total: 336 nodes, 146 TB
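The totals in the inventory above can be cross-checked with a trivial calculation (node counts and capacities copied from the table):

```python
# Cross-check of the node and capacity totals quoted in the table above.
nodes = {
    "pnfs_core": 1,
    "srm_door": 1,
    "gridftp_dcap_doors": 4,
    "read_pool": 322,
    "write_pool": 8,
}
capacity_tb = {"read_pool": 145, "write_pool": 1}

assert sum(nodes.values()) == 336       # matches the "Total 336" row
assert sum(capacity_tb.values()) == 146  # matches the "146 TB" row
```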
7
[Diagram: DCap, SRM, and GridFTP clients (including the Oak Ridge batch system) connect via control channels to DCap doors, an SRM door, and GridFTP doors; data channels run directly between clients and the read/write pools; the PnfsManager and PoolManager core services coordinate the pools; write pools migrate data to HPSS.]
One BNL dCache Instance
8
Future Network/dCache Plan
The design of the USATLAS network is in the following slides.
The network bandwidth to the ACF will be 20 Gbps of redundant external bandwidth. The BNL-to-CERN connection is 10 Gbps.
dCache should be expanded to accommodate LHC data. We want to avoid mixing LHC data traffic with the remaining ATLAS production traffic: either we create a dedicated dCache instance, or we dedicate a fraction of dCache resources (a separate dCache write pool group) to LHC data transfer. Zhenping and I prefer a dedicated dCache instance, since the number of nodes the current dCache technology can manage in the BNL instance is running into its limit. In any case, within the next several months the LHC fraction of dCache should be able to handle 200 MB/s, with one day's worth of disk space (16.5 TB). We need 20 TB of local disk space (20% will be used for RAID 5 redundancy): 10 nodes, each with 2 TB of local disk.
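The sizing quoted above can be verified with a quick calculation; it assumes, as the slide's 16.5 figure suggests, that terabytes are counted as 1024^2 MB:

```python
# Quick check of the LHC write-pool sizing quoted above.
RATE_MB_S = 200           # target LHC transfer rate, MB/s
SECONDS_PER_DAY = 86_400
RAID5_OVERHEAD = 0.20     # fraction of raw disk used for parity (slide's figure)

day_mb = RATE_MB_S * SECONDS_PER_DAY      # one day's worth of data, in MB
day_tb = day_mb / 1024 / 1024             # ~16.5 TB (binary terabytes)
raw_tb = day_tb / (1 - RAID5_OVERHEAD)    # raw disk needed, ~20.6 TB

print(f"one day: {day_tb:.1f} TB, raw with RAID 5: {raw_tb:.1f} TB")
```

This is consistent with the 20 TB / 10-node plan: 10 nodes at 2 TB each give 20 TB raw, roughly 16.5 TB usable after RAID 5 parity.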
USATLAS Tier 1 Network Design
10
Current Unsolved/Unsettled Issues
LHCOPN does not address Tier 2 site issues. What is the policy on trusting non-US Tier 2 sites? We simplify the issue and treat these non-US Tier 2 sites as regular internet end points.
LHCOPN includes T0, all T1 sites, and their existing connections: all T0-BNL traffic and other ATLAS T1-BNL traffic will be treated as LHCOPN traffic, and it can share the network resources provided by US LHCnet. If one Tier 1 goes down, its LHC traffic will be routed via another Tier 1, using a fraction of the network resources owned by that Tier 1. This type of traffic does not affect the BNL internal network design. The AUP should be negotiated between the Tier 1 sites; that has not been done yet.
11
User Scenarios
1. LHC data is transferred via LHCOPN from CERN to BNL. Data is transferred into dCache, then migrated into HPSS. A small fraction of the data will be read immediately by users at Tier 2s. (Volume_{LHC})
2. All Tier 2s upload their simulation/analysis data to the Tier 1 dCache site. The data will be immediately replicated within the dCache cluster and migrated into HPSS. (Volume_{Tier 2})
3. Physicists at Tier 3s read input data from the Tier 1 dCache read pool, run analysis/transformations at their home institutions, and upload the result data to the Tier 1 dCache write pool. The results are then immediately replicated into the dCache read pool and archived into the HPSS system. (Volume_{Physicists} = Volume_{Inputs} + Volume_{Results})
4. BNL owns a fraction of the ATLAS reconstruction data (ESD, AOD/Tag). This data will be read from dCache and sent to other Tier 1 sites. Similarly, BNL needs to read the same type of data from other Tier 1s. (Volume_{T1} = Volume_{in} + Volume_{out})
5. European Tier 2 and 2+ sites need to read data from BNL; this traffic will be treated as regular internet traffic.
The total data volume that we put on network links and backplanes:
Volume_{Total} = 2*Volume_{LHC} + 3*Volume_{Tier 2} + Volume_{Inputs} + 3*Volume_{Results} + Volume_{T1} + Volume_{Others}
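As a sketch, the accounting formula above can be expressed directly. The sample volumes passed in below are made-up placeholders (in TB/day), not measured values; the multipliers come from the replication and archival steps in scenarios 1-3.

```python
# Traffic-volume accounting from the scenarios above. The multipliers
# reflect how many times each flow crosses links/backplanes (replication
# to read pools, migration to HPSS, etc.).
def total_volume(lhc, tier2, inputs, results, t1, others):
    """Volume_Total per the slide's formula."""
    return 2 * lhc + 3 * tier2 + inputs + 3 * results + t1 + others

# Placeholder inputs (TB/day), chosen only to illustrate the arithmetic:
print(total_volume(lhc=16.5, tier2=4, inputs=2, results=1, t1=5, others=1))  # -> 56.0
```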
12
Requirements
dCache brings the subsystems (grid and computing cluster) in the ACF even closer, with the computing cluster serving as the data storage system. Data will be constantly replicated among them. Any connection restriction (firewall conduits) between them will potentially impact functionality and performance.
We should isolate internal ATLAS traffic within the ATLAS network domain.
We need to optimize the network traffic volume between the BNL campus and the ACF. What fraction of the data are we going to filter through the firewall (items 1, 2, 3 above)? Any traffic we plan to firewall may double or triple the tax on the link between the BNL campus and the ACF.
Any operational issue in the BNL campus network should not impact ACF-internal network traffic between different USATLAS subnets.
We should not overload the BNL firewall with large volumes of physics data.
13
USATLAS/BNL LAN
14
Option 1
[Diagram: the ATLAS router connects dCache, HPSS, and the ACF farm; ACL/policy routing at DL2 separates CERN LHCOPN traffic, Tier 1 traffic, US ATLAS Tier 2 traffic, and internet/analysis traffic. All traffic between any two hosts in the ACF is routed or switched locally.]
15
Option 2
[Diagram: ACL/policy routing at DL2 splits CERN LHCOPN traffic from internet traffic; a dedicated LHC/SC4 dCache sits alongside the grid dCache, HPSS, and the ACF farm behind the ATLAS router, with Tier 1 and USATLAS Tier 2 flows shown. LHC data to HPSS is internal to ATLAS; it never leaves the ATLAS router.]
16
Option 3
[Diagram: ACL/policy routing at DL2; a single network cable carries the LHC dCache traffic; the grid dCache, HPSS, and ACF farm sit behind the ATLAS router, with Tier 1 and USATLAS Tier 2 flows shown. LHC data to HPSS is external to ATLAS; it leaves the ATLAS router.]
17
Option 4
[Diagram: all traffic, including USATLAS Tier 2 and Tier 1 flows, is routed via DL2; LHC data to HPSS is routed via DL2, so the traffic must leave the ATLAS router.]
Disadvantages: all ATLAS traffic may double or triple the tax on the BNL/USATLAS link; network management is not easy; the firewall becomes the bottleneck; it does not utilize the ATLAS router's routing capability.
TeraPaths
19
QoS/MPLS
QoS/MPLS technology can be manually deployed into the BNL campus/USATLAS network now. Its behavior is well understood, and LAN QoS expertise is now on hand.
The TeraPaths software system is under intensive re-development to reach production quality. It will be ready by the end of February. We will need one month (March) to verify it and deploy it into our production network infrastructure. When SC4 starts, we can quantitatively manage the BNL LAN to send and receive data. The following month will focus on deploying the software package to the Tier 2 sites participating in SC4.
20
This project investigates the integration and use of LAN QoS and MPLS-based differentiated network services in the ATLAS data-intensive distributed computing environment, as a way to manage the network as a critical resource.
The collaboration includes BNL and the University of Michigan, with other collaborators from OSCARS (ESnet), Lambda Station (FNAL), and the TeraPaths monitoring project (SLAC).
What Is TeraPaths?
21
TeraPaths System Architecture
[Diagram: Site A (initiator) and Site B (remote) each run a web-services stack (user manager, scheduler, route planner, site monitor, router manager, hardware drivers) and exchange QoS requests across the WAN, which provides its own monitoring and web services. Users submit requests through a web page, APIs, or the command line.]
22
TeraPaths SC2005
Two bbcp jobs periodically copied data from BNL disk to UMich disk: one used class 2 traffic (200 Mbps) and the other used class EF (expedited forwarding, 400 Mbps). Iperf sent background traffic. The allocated network resource was 800 Mbps. We could quantitatively control shared network resources for mission-critical tasks.
Verified the effectiveness of MPLS/LAN QoS and its impact on prioritized traffic, background best-effort traffic, and overall network performance.
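For illustration, this is roughly how an application-side tool could mark its packets with DSCP code points like those used in the demo. EF is DSCP 46; mapping the slide's "class 2" to AF21 below is an assumption, not taken from the slides, and in the actual demo the marking/policing was presumably done by the routers rather than the applications.

```python
# Hedged sketch: marking a socket's traffic with a DSCP code point so the
# network's QoS classes can recognize it. EF = DSCP 46; AF21 = 18 is an
# assumed stand-in for the demo's "class 2" traffic.
import socket

EF_DSCP = 46    # expedited forwarding
AF21_DSCP = 18  # assumption for "class 2"

def marked_socket(dscp: int) -> socket.socket:
    """Create a TCP socket whose packets carry the given DSCP marking."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The IP TOS byte carries the 6-bit DSCP in its upper bits.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    return sock

ef_sock = marked_socket(EF_DSCP)
assert ef_sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS) == 0xB8
```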
Service Challenge 3
24
What configuration (hardware, software, middleware) did we use for SC3?
25
Services at BNL
FTS client + server (FTS 1.3) with its backend Oracle and MyProxy servers. FTS handles reliable file transfer from CERN to BNL. Most functionality was implemented. It became reliable at controlling data transfer after several rounds of redeployment for bug fixes: a short timeout value causing excessive failures, and incompatibility with dCache/SRM.
It does not support direct data transfer between CERN and the BNL dCache data pool servers (dCache SRM third-party transfer). Transfers actually go through a few dCache GridFTP door nodes at BNL, which presents a scalability issue. We had to move these door nodes to non-blocking network ports to distribute the traffic.
Both BNL and RAL discovered that the number of streams per file could not be more than 10 (a bug?).
Networking to CERN: the network for dCache was upgraded to 2x1 Gbps around June. It is a shared link with a long round-trip time (>140 ms, while the RTT from European sites to CERN is about 20 ms). Occasional packet losses were discovered along the BNL-CERN path. 1.5 Gbps of aggregate bandwidth was observed by iperf with 160 TCP streams.
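The need for many parallel streams follows from the bandwidth-delay product of the BNL-CERN path. A rough calculation, assuming a typical 2 MB per-stream TCP window (an assumed value, not from the slides):

```python
# Why one TCP stream cannot fill the BNL-CERN pipe at a 140 ms RTT.
LINK_GBPS = 2.0                  # 2x1 Gbps aggregate to CERN
RTT_S = 0.140                    # measured round-trip time
WINDOW_BYTES = 2 * 1024 * 1024   # assumed per-stream TCP window (2 MB)

# Bandwidth-delay product: bytes that must be in flight to fill the pipe.
bdp_bytes = LINK_GBPS * 1e9 / 8 * RTT_S             # 35 MB
per_stream_mbps = WINDOW_BYTES / RTT_S * 8 / 1e6    # ~120 Mbps per stream
streams_needed = bdp_bytes / WINDOW_BYTES           # ~17 streams, loss-free case

print(f"BDP: {bdp_bytes/1e6:.0f} MB, per stream: {per_stream_mbps:.0f} Mbps")
```

In practice, packet loss keeps each stream well below this loss-free ideal, which is consistent with far more streams (160) being needed to reach 1.5 Gbps.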
26
Services Used at BNL SC3
dCache/SRM (v1.6.6.3, with the SRM 1.3 interface). The detailed configuration can be found in Slide 6. All read pool nodes run Scientific Linux 3 with the XFS module compiled.
We experienced high load on the write pool servers during large data transfers; this was fixed by replacing the EXT file systems with XFS.
A core server crashed once; the reason was identified and fixed.
Small buffer space (1.0 TB) for data written into the dCache system.
dCache can now deliver up to 200 MB/s for input/output (limited by network speed).
LFC (1.3.4) client and server were installed on the BNL replica catalog server. The server was installed and the basic functionality tested: lfc-ls, lfc-mkdir, etc. We will populate LFC with the entries from our production Globus RLS server.
An ATLAS VO box (DDM + LCG VO box) was deployed at BNL.
Two instances of the Distributed Data Management (DDM) software (DQ2) were deployed at BNL: one for Panda production and one for the SC3 service phase.
27
How did the SC3 infrastructure evolve?
FTS was upgraded from 1.2 to 1.3.
dCache was upgraded from 1.6.5 to 1.6.6.3 (Dec 7, 2005). The write pool file system was migrated from EXT3 to XFS before the SC3 throughput phase. After the throughput phase, we migrated the underlying disks from RAID 0 to RAID 5 for better reliability, but this triggered an XFS file system bug under RAID 5 and crashed a server. We had to switch back to EXT3, which avoided the bug but significantly reduced performance. The recent OS upgrade on the dCache write pool and core servers alleviated the XFS bug (it did not fix it), so we migrated back to XFS for better performance.
The dCache software on the read pools was upgraded as well. The OS on the read pool nodes has not changed since the May/June upgrade.
28
BNL SC3 data transfer
All data is actually routed through the GridFTP doors.
SC3 monitoring at BNL and at CERN is consistent.
29
Data Transfer Status
BNL stabilized FTS data transfer with a high successful-completion rate during the throughput phase, as shown in the left image.
We attained a 150 MB/s rate for about one hour with a large number (>50) of parallel file transfers during the SC3 throughput phase.
30
Final SC3 Throughput Data Transfer Results
31
Lessons Learned From SC2
Four file transfer servers with a 1 Gigabit WAN connection to CERN.
Met the performance/throughput challenge (70-80 MB/s disk to disk): enabled data transfer between dCache/SRM and the CERN SRM at openlab; designed our own scripts to control SRM data transfer; enabled data transfer between BNL GridFTP servers and CERN openlab GridFTP servers controlled by the Radiant software.
Many components needed tuning. With the long round-trip time and high packet drop rate, we had to use multiple TCP streams and multiple file transfers to fill the network pipe.
Parallel file I/O was sluggish with EXT2/EXT3: many processes sat in the I/O-wait state, and the more file streams, the worse the file system performance. XFS gave a slight improvement, but the file system parameters still need tuning.
32
Some Issues During SC3 Throughput Phase
Service Challenge also challenges resources: we tuned network pipes and optimized the configuration and performance of the BNL production dCache system and its associated OS and file systems. Stabilizing the newly deployed FTS, dCache, and network infrastructure required more than one staff member's involvement; the staffing level decreased as the services became stable.
Limited resources are shared by experiments and users. At CERN, the SC3 infrastructure is shared by multiple Tier 1 sites. Due to the heterogeneous nature of the Tier 1 sites, data transfer for each site should be optimized non-uniformly based on the site's characteristics: network RTT, packet loss rate, experiment requirements, etc. At BNL, the network and dCache are also used by production users; we need to monitor SRM and the network closely to avoid impacting production activities.
At CERN, James Casey alone handles answering email, setting up the system, reporting problems, and running data transfers; he provides 16x7 support by himself. How do we scale to a 24x7 production support center? How do we handle the time difference between the US and CERN? (We tried the CERN support phone once, but the operator did not speak English.)
33
Some Issues During SC3 Service Phase
FTS was changed from version 1.3 to 1.4 at CERN. FTS 1.4 was supposed to support direct third-party transfer, but when direct transfer into the pool (bypassing the doors) was used, it could not handle the long waits, which led to channel lockup. We therefore had to switch to glite-url-copy, which handles transfers into dCache in an ad-hoc way.
dCache was constantly improved for better performance and reliability over the past several months, recently reaching a stable state.
The SC3 service phase exposed several problems when it started. We took the opportunity to find and fix them. Performance and stability improved continuously over the course of SC3, and we achieved high performance by its end. A good learning experience indeed.
SC operations need to be improved for timely problem reporting.
34
What has been done
The SC3 throughput phase showed good data transfer bandwidth.
SC3 Tier 2 data transfer: data was transferred to three selected Tier 2 sites.
SC3 tape transfer: tape data transfer was stabilized at 60 MB/s with loaned tape resources, meeting the goal defined at the beginning of the Service Challenge. The full chain of data transfer was exercised.
SC3 service phase: we showed very good peak performance.
35
General view of SC3
When everything runs smoothly, BNL gets very good results: 100 MB/s.
The middleware (FTS) is stable, but there were still many compatibility issues: FTS does not work effectively with the new version of dCache/SRM (SRM 1.3). We had to turn off FTS-controlled direct data transfer into the dCache pools, since numerous timeout errors completely blocked the data transfer channel.
We need to improve SC operations, including performance monitoring and timely problem reporting, to prevent deterioration and allow quick fixes.
We fixed many dCache issues after its upgrade. We also tuned the dCache system to work under the FTS/ATLAS DDM (DQ2) system.
We achieved the best performance among the dCache sites that participated in the ATLAS SC3 service phase; 15 TB of data was transferred to BNL. Sites using the CASTOR SRM showed better performance.
SC3 re-run and SC4 Planning
37
SC3 re-run
We upgraded the BNL dCache core server OS to RHEL 4 and dCache to 1.6.6, starting Dec 7, 2005.
We will add a few more dCache pool nodes if the software upgrades do not meet our expectations.
FTS should be upgraded if the fix needed to prevent channel blocking is ready before the new year.
LCG BDII needs to report the status of dCache and FTS (before Christmas).
We would like to schedule a test period at the beginning of January for stability and scalability.
Everything should be ready by January 9.
The re-run will start on January 16.
38
What will our SC4 configuration look like, network, servers, software, etc?
The physical network location for SC4 is shown in Slide 15.
We subscribed two subnets to LHCOPN (130.199.185.0/24 and 130.199.48.0/23). The current dCache instance will be on these two subnets; the new dCache instance for LHC/SC4 will be in 130.199.185.0/24 exclusively:
10 dCache write/read pool servers.
4 door servers (RAL has already merged door nodes with pool nodes; we will evaluate whether that is doable at BNL).
2 core servers (dCache PNFS manager and SRM server).
The newest dCache production release: dCache 1.6.6.3+.
39
BNL Service Challenge 4 Plan
Several steps are needed to set up hardware and services (choose, procure, start install, end install, make operational), starting in January and ending before the beginning of March: LAN, tape system. FTS, LFC, DDM, LCG VO boxes, and other baseline services will be maintained under the agreed SLA and supported by the USATLAS VO. A dedicated LHC dCache/SRM write pool will provide up to 17 TB of storage (24 hours' worth of data), to be done in sync with the LAN/WAN work.
Deploy and strengthen the necessary monitoring infrastructure based on Ganglia, Nagios, MonALISA, and LCG R-GMA (February).
Drill for service integration (March): simulate network failures and server crashes, and exercise how the support center will respond. Make the Tier 0/Tier 1 end-to-end high-performance network operational: bandwidth, stability, and performance.
40
BNL Service Challenge 4 Plan
April 2006: establish stable data transfer at 200 MB/s to disk and 200 MB/s to tape.
May 2006: disk and computing farm upgrades.
July 1, 2006: stable data transfer driven by the ATLAS production system and ATLAS data management infrastructure between T0 and T1 (200 MB/s), and provide services satisfying the SLA (service level agreement).
Details of involving the Tier 2s are in planning too (February and March): the UC dCache needs to be stabilized and operational in February; UTA and BU need to have dCache in March. Baseline client tools should be deployed at the Tier 2 centers. Baseline services should support Tier 1-Tier 2 data transfer before SC4 starts.
3D project
42
Oracle part
Tier 0 - Tier 1 Oracle Streams replication.
BNL joined the 3D replication testbed; Streams replication was set up between CERN and BNL successfully in October 2005.
Several experiments foresee Oracle clusters for online systems.
Focus on Oracle database clusters as the main building block for Tier 0 and Tier 1.
Propose to set up pre-production services for March and full service after 6 months of deployment experience.
43
BNL 3D Oracle Production Schedule
Dec 2005: h/w setup (done); two nodes with 500 GB Fibre Channel storage
Jan 2006: h/w acceptance tests, RAC (Real Application Cluster) setup
March 2006: service starts
May 2006: service review ---> h/w defined for full production
September 2006: full database service in place
44
MySQL Database replication at BNL
Oracle – MySQL replication: Database: ATLAS TAG DB
DB server at BNL: dbdevel2 (MySQL-4.0.25)
use case : Oracle CERN to MySQL BNL (push)
tool: Octopus replicator (Java-based extraction, transformation and
loading)
thanks to Julius Hrivnac (LAL, Orsay) and Kristo Karr (ANL) for a
successful collaboration
More details in Twiki:
https://uimon.cern.ch/twiki/bin/view/Atlas/DatabaseReplication
45
MySQL Database replication at BNL
MySQL – MySQL replication: Databases:
Geometry DB ATLASDD; MySQL conditions DBs LArNBDC2 and LArIOVDC2
MySQL DB servers at BNL: dbdevel1.usatlas.bnl.gov (MySQL-4.0.25) and db1.usatlas.bnl.gov (MySQL-4.0.25)
Collected the first experience with the CERN-BNL ATLAS DB replication
procedure, using both mysqldump and on-line replication
Current versions correspond to the most recent ATLAS production
release, 11.0.3
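The mysqldump path amounts to dumping the database on one server and loading it on the other. A hedged sketch of how such a snapshot step could be scripted is below; the source host name is a placeholder, and this is one plausible scripting of the procedure, not the actual production script:

```python
# Sketch of a mysqldump-based snapshot replication step (CERN -> BNL).
# "src.example.org" is an illustrative placeholder for the source server;
# the BNL server name is taken from the slide above.
import shlex

def dump_and_load_command(src_host: str, db: str, dst_host: str) -> str:
    """Build a shell pipeline that dumps `db` from the source MySQL
    server and streams it into the same database on the destination."""
    dump = f"mysqldump -h {shlex.quote(src_host)} {shlex.quote(db)}"
    load = f"mysql -h {shlex.quote(dst_host)} {shlex.quote(db)}"
    return f"{dump} | {load}"

cmd = dump_and_load_command("src.example.org", "ATLASDD",
                            "db1.usatlas.bnl.gov")
print(cmd)
```

On-line (binlog-based) replication avoids re-shipping full snapshots, which is why both approaches were tried.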
LCG 2 at BNL
47
Summary
The LCG setup at BNL is partially functional. The LCG VO box was
used in SC3. There are no technical difficulties or hurdles
preventing the CE and SE from becoming fully functional.
Deployed on a mix of hardware (Dell 3.0 GHz and some VA
Linux nodes): we deployed a CE, RB, SE, proxy server,
monitoring node (R-GMA), and a collection of worker
nodes. Some services are combined onto a single server.
48
Progress and To Do
OS and LCG system installation and configuration is
automated; the setup can be reinstalled on new hardware within 2 hours.
Managed via RPM and updatable via local YUM
repositories, which are automatically rebuilt from CERN and
other upstream sources.
GUMS controls the LCG grid mapfile.
Site information is being published correctly, and some SFTs
(site functional tests) run from CERN Operations can
complete successfully.
Still need to configure LCG to run Condor on the ATLAS pool.
BNL USATLAS Grid Testbed
50
BNL USATLAS OSG Configuration
[Diagram: grid users submit grid job requests over the Internet to the OSG gatekeepers; the RHIC/USATLAS job scheduler (Condor) dispatches the jobs; storage comprises HPSS (with HPSS movers), SRM/GridFtp servers, NFS, Panasas, and dCache disks.]
PHENIX Data Transfer Activities
52
Courtesy of Y. Watanabe
53
54
Data Transfer to CCJ
The 2005 RHIC run ended on June 24; the plot above shows the last day of the run.
Total data transferred to CCJ (Computer Center in Japan) is 260 TB (polarized p+p raw data).
100% of the data was transferred via WAN. Tool used here: GridFtp. No 747 involved.
Average data rate: 60~90 MB/second; peak performance of 100 MBytes/second recorded in the Ganglia plot. About 5 TB/day!
Courtesy of Y. Watanabe
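A quick consistency check (my arithmetic, not a figure from the slide) confirms these numbers hang together:

```python
# Consistency check of the quoted CCJ transfer figures.
# Assumes 1 TB = 1e6 MB.
TOTAL_TB = 260
SECONDS_PER_DAY = 86_400

def days_to_transfer(total_tb: float, rate_mb_s: float) -> float:
    """Days needed to move `total_tb` terabytes at a sustained MB/s rate."""
    return total_tb * 1e6 / rate_mb_s / SECONDS_PER_DAY

# At the quoted 60~90 MB/s average, 260 TB takes roughly 33-50 days,
# i.e. about 5-8 TB/day, consistent with the "about 5 TB/day" figure.
print(round(days_to_transfer(TOTAL_TB, 60), 1))  # 50.2
print(round(days_to_transfer(TOTAL_TB, 90), 1))  # 33.4
```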
55
Network Monitoring on NAT Box
56
57
Network Monitoring at Perimeter Router
58
Network Monitoring at CCJ, JAPAN
59
Our Role
Provide effective and efficient network/Grid solutions for data transfer.
Install Grid tools on the PHENIX buffer boxes.
Tune the performance of the network path along the PHENIX Counting
House/RCF/BNL LAN.
Install Ganglia monitoring tools for data transfer.
Diagnose problems and provide fixes.
For future PHENIX data transfers we will continue to play these roles. We will
integrate dCache/SRM into the future data transfers and automate them.
Ofer maintains the PHENIX dCache/SRM pools. He works on pilot
transfers of data from PHENIX dCache/SRM to CCJ.
60
Lessons Learned
Four monitoring systems: BNL NAT Ganglia, router MRTG (Multi-Router Traffic Grapher), CCJ Ganglia, and data transfer monitoring. Together they caught errors at an early stage.
The EXT3 file system is not designed for high-performance data transfer.
XFS has much better disk I/O performance at high bandwidth; this experience was applied in LHC Service Challenge 3 for the ATLAS experiment.
The Broadcom BCM95703 copper gigabit network card has far fewer packet errors than the Intel Pro/1000.
Several ESnet/SINET network outages occurred; traffic was rerouted to alternative paths. Problems were promptly discovered and resolved by on-call personnel and network engineers. Because of large disk caches at both ends, no data were lost due to network outages.
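The early-warning value of the four monitors comes down to noticing sustained rate drops quickly. A minimal sketch of that kind of check (not the actual monitoring code; threshold and sample values are illustrative) is:

```python
# Minimal sketch of a transfer-rate alarm of the kind the
# Ganglia/MRTG monitoring enabled: flag sustained drops so
# problems are caught early rather than after hours of lost rate.
def find_rate_drops(samples_mb_s, threshold=50.0, min_run=3):
    """Return (start, end) index ranges where the rate stays below
    `threshold` MB/s for at least `min_run` consecutive samples."""
    drops, start = [], None
    for i, rate in enumerate(samples_mb_s):
        if rate < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                drops.append((start, i - 1))
            start = None
    if start is not None and len(samples_mb_s) - start >= min_run:
        drops.append((start, len(samples_mb_s) - 1))
    return drops

# One healthy run with a mid-transfer outage:
samples = [85, 80, 10, 5, 0, 0, 78, 90]
print(find_rate_drops(samples))  # [(2, 5)]
```

Having the same check at both ends (BNL and CCJ) plus the router MRTG view made it possible to tell a sender-side stall from a network outage.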