25-29 May 2009, HEPiX Spring ASGC Site Report Jason Shih ASGC/OPS HEPiX Fall 2009 Umea, Sweden

ASGC Site Report


Page 1: ASGC Site Report

25-29 May 2009, HEPiX Spring

ASGC Site Report

Jason Shih, ASGC/OPS

HEPiX Fall 2009, Umea, Sweden

Page 2: ASGC Site Report

Overview

• Fire incident
• Hardware
• Network
• Storage
• Future remarks

Page 3: ASGC Site Report

Fire incident – event summary

• Damage analysis: fire was limited to the power room
  • Severe damage to the UPS
  • Wiring of power system, AHR
  • Smoke dust pervaded and smudged almost everywhere, including computing & storage systems
• History and planning
  • 16:53 Feb. 25: UPS battery burning
  • 19:50 Feb. 25: fire extinguished by the fire department
  • 10:00 Feb. 26: fire scene investigation by the fire department
  • 15:00 Feb. 26 ~ Mar. 23: DC cleaning, re-partitioning, re-wiring, deodorization, and re-installation
    • From ceiling to ground under the raised floor, from power room to machine room, from power system, air conditioning, and fire prevention system to computing system
    • All facilities moved outside for cleaning
  • Mar. 23: computing system installation
  • Mar. 23 ~ Apr. 9: recovery of monitoring, environment control, and access control systems

Page 4: ASGC Site Report

Fire incident – recovery plan

• DC consultant will review the re-design on Mar. 11; the schedule will be revised based on the inspection

• Tier1/Tier2 services will be collocated at IDC for 3 months from Mar. 20

Page 5: ASGC Site Report

Fire incident – review/lessons (I)

• DC infrastructure standards to comply with
  • ANSI TIA/EIA
  • ASHRAE thermal guideline for data processing env.
  • Guidelines for green data centers are available, e.g., LEED
  • NFPA: fire suppression system
• Capacity and type of UPS (min. scale)
  • Vary by the response time of generators
• Adjust rating of all breakers (NFB and ACB)
• Location of UPS (open space & outside PR)
• Regular maintenance of batteries
  • Inner resistance measurement

Page 6: ASGC Site Report

Fire incident – review/lessons (II)

• Smoke damage: fire stopping
• Improvement of monitoring system
  • Re-design the monitoring sys.
  • Earlier pre-action: consider VESDA
• Emergency response and procedures
  • Routine fire drill is indispensable
  • Disaster recovery plan is necessary
• Other improvements:
  • PP and H/C aisle splitting
  • Fiber panels: MDF and FOR
  • OH cable tray (existing: PWR tray in subfloor) + fiber guide
  • Raised floor grommets

Page 7: ASGC Site Report

Move out all facilities for cleaning

Container as storage and humidification

Protect Racks from Dust

Ceiling Removal

Page 8: ASGC Site Report

Fire incident - Tape system

• Snapshots of decommissioned tape drives after the incident

Page 9: ASGC Site Report

DC recovered – mid-May

• FOR in area #1
• MDF moved to center of DC area
• H/C aisle fully split
• Plan to replace racks to provide 1100mm depth

Page 10: ASGC Site Report

IDC Collocation (I)

• Site selection and paperwork processing – one week
• Preparation at IDC – one week
  • 15R + reservation for tape system (6R)
  • Power (14kW per rack)
  • Cooling (perforated raised floor)
  • 10G protected SDH STM-64 networking between IDC and ASGC

Page 11: ASGC Site Report

IDC collocation (II)

• Relocation of 50+% computing/storage – one week
  • 2k job slots (3.2MSI2K), 26 chassis of blade servers
  • 2.3PB storage (1PB allocated dynamically)
• Cabling + setup + reconfiguration – one week

Page 12: ASGC Site Report

IDC collocation (III)

• Facility installation completed on Mar 27th
• Tape system delayed until after Apr 9th
  • Realignment
  • RMA for faulty parts

Page 13: ASGC Site Report

T1 performance

• 7G peak reached to Amsterdam
• 9G peak observed between IDC/ASGC

Page 14: ASGC Site Report

Network – before May

[Figure: network topology diagram]

Sites: Sinica, Taipei; HK, Mega-iAdvantage; JP, KDDI Otemachi; SG, KIM CHUNG
Router labels: M320, M120, M20
Links: 2.5G WL non-protect; NCIC 2.5G (STM-16) SDH; 622M (STM-4) SDH on APCN2; 100M; GE
Networks and exchanges: KREONET2, CSTNet, HARNet, HKIX, Pacnet IP Transit, APAN-JP, KEK, JPIX, SINet, WIDE, NUS, AARNet, CERNet, TWGate IP Transit

Page 15: ASGC Site Report

Network - 2009

[Figure: network topology diagram]

Sites: Sinica, Taipei; HK, Mega-iAdvantage; JP, KDDI Otemachi; Singapore, Global Switch
Router labels: M320, M120, M20
Links: STM-16 SDH; 2.5G (STM-16) SDH; 622M (STM-4) SDH on EAC; 100M; GE
Networks and exchanges: KREONET2, CSTNet, HARNet, HKIX, Pacnet IP Transit, APAN-JP, KEK, JPIX, SINet, WIDE, SingAREN, AARNet, NUS, CERNet, TWGate IP Transit

Page 16: ASGC Site Report

ASGC Resource Level Targets

Date        CPU (MSI2k)   Disk (PB)   Tape (PB)
Current     2.4           1.2         0.8
Year End    5.6           2.4         1.3
MoU 2009    7.55          3.15        2.1

• 2008
  • 0.5PB expansion of tape system in Q2
  • Meet MoU target mid-Nov.
  • 1.3MSI2k per rack based on recent E5450 processor
• 2009
  • 150 QC blade servers
  • 2TB per drive for RAID subsystem
  • 42TB net capacity per chassis and 0.75PB in total
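
The gap between the resource levels and the MoU 2009 target can be worked out directly from the targets table. The following is a minimal sketch, not part of the original slides; the dictionary layout and the function names `shortfall` and `fraction_of_target` are illustrative assumptions, while the numbers are copied from the table.

```python
# Resource levels from the targets table (CPU in MSI2k, disk and tape in PB).
LEVELS = {
    "Current":  {"cpu": 2.4,  "disk": 1.2,  "tape": 0.8},
    "Year End": {"cpu": 5.6,  "disk": 2.4,  "tape": 1.3},
    "MoU 2009": {"cpu": 7.55, "disk": 3.15, "tape": 2.1},
}


def shortfall(level, target="MoU 2009"):
    """Return target minus level for each resource, rounded to 2 decimals."""
    return {
        key: round(LEVELS[target][key] - LEVELS[level][key], 2)
        for key in LEVELS[target]
    }


def fraction_of_target(level, target="MoU 2009"):
    """Return level divided by target for each resource, rounded to 2 decimals."""
    return {
        key: round(LEVELS[level][key] / LEVELS[target][key], 2)
        for key in LEVELS[target]
    }


if __name__ == "__main__":
    for name in ("Current", "Year End"):
        print(name, shortfall(name), fraction_of_target(name))
```

Under these figures the year-end disk level is 0.75 PB short of the MoU 2009 target, the same size as the 0.75 PB total quoted for the 2009 disk expansion.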

Page 17: ASGC Site Report

Hardware Profile and Selection (I)

• CPU:
  • 2K8 expansion: 330 blade servers provide 3.6MSI2k
    • 7U height chassis
    • SMP Xeon E5430 processors, 16GB FB-DIMM
    • Each blade provides 11KSI2k
    • 2 blades/U density, Web/SOL management
  • Current capacity: 2.4MSI2k
  • Year-end total computing power: ~5.6MSI2k
    • 22KSI2k/U (24 chassis in 168U)
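
The CPU figures on this slide can be cross-checked with simple arithmetic: 330 blades at 11 KSI2k each give 3,630 KSI2k, i.e. about 3.6 MSI2k, and 2 blades per U at 11 KSI2k give the quoted 22 KSI2k/U. The sketch below is only an illustration; the constant and function names are assumptions, and the inputs are the values stated on the slide.

```python
# Figures taken from the slide; names are illustrative only.
BLADES = 330             # blade servers in the 2008 expansion
KSI2K_PER_BLADE = 11     # rating of one blade
BLADES_PER_U = 2         # blade density
CHASSIS = 24             # number of chassis
CHASSIS_HEIGHT_U = 7     # height of one chassis in rack units


def expansion_msi2k(blades=BLADES, ksi2k_per_blade=KSI2K_PER_BLADE):
    """Aggregate rating of the expansion in MSI2k (1 MSI2k = 1000 KSI2k)."""
    return blades * ksi2k_per_blade / 1000


def density_ksi2k_per_u(blades_per_u=BLADES_PER_U, ksi2k_per_blade=KSI2K_PER_BLADE):
    """Compute density in KSI2k per rack unit."""
    return blades_per_u * ksi2k_per_blade


def occupied_rack_units(chassis=CHASSIS, height_u=CHASSIS_HEIGHT_U):
    """Rack units occupied by all chassis."""
    return chassis * height_u


if __name__ == "__main__":
    print(expansion_msi2k(), density_ksi2k_per_u(), occupied_rack_units())
```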

Page 18: ASGC Site Report

Tape system

• Before incident:
  • LTO3 * 8 + LTO4 * 4
  • 720TB with LTO3
  • 530TB with LTO4
• May 2009:
  • Two loan LTO3 drives
  • MES: 6 LTO4 drives end of May
  • Capacity: 1.3PB (old) + 0.8PB (LTO4)
• New S54 model introduced
  • 2K slots with tier model
  • Upgrade ALMS
  • Enhanced gripper

Page 19: ASGC Site Report

Roadmap – Host I/F 2009

Q1 Q2 Q3 Q4

4G FC ( ≈ 400 MB/sec)

8G FC ( ≈ 800 MB/sec)

SAS 3G (4-lane ≈ 1200 MB/sec)

iSCSI – 1Gb

U320 - SCSI ( ≈ 320 MB/sec)

iSCSI – 10 Gb

SAS 6G (4-lane ≈ 2400 MB/sec)

3U16bay FC-SAS in May, 2U/12 and 4U/24 bay in June
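
The approximate throughputs quoted for the FC and SAS host interfaces match a common rule of thumb for 8b/10b-encoded serial links: ten line bits carry one payload byte, so each nominal Gb/s yields roughly 100 MB/s per lane. The sketch below reproduces those figures under that assumption. The function name `payload_mb_per_s` is hypothetical, and treating the nominal "4G" and "8G" labels as line rates is a simplification. U320 SCSI is a parallel bus and iSCSI runs over Ethernet, so neither is modeled here.

```python
def payload_mb_per_s(nominal_gbps, lanes=1):
    """Approximate payload bandwidth in MB/s of an 8b/10b-encoded link.

    8b/10b puts 10 bits on the wire for every payload byte, so a lane
    running at `nominal_gbps` carries about nominal_gbps * 100 MB/s.
    """
    line_bits_per_payload_byte = 10
    return nominal_gbps * 1000 / line_bits_per_payload_byte * lanes


if __name__ == "__main__":
    print("4G FC        ", payload_mb_per_s(4))
    print("8G FC        ", payload_mb_per_s(8))
    print("SAS 3G 4-lane", payload_mb_per_s(3, lanes=4))
    print("SAS 6G 4-lane", payload_mb_per_s(6, lanes=4))
```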

Page 20: ASGC Site Report

Roadmap – Drive I/F 2009

Q1 Q2 Q3 Q4

4G FC

SAS 3G

SAS 6G

U320 - SCSI

SATA-II

2.5” SSD (B12F series)

Page 21: ASGC Site Report

Est. Density

• 2009 H1: 1TB, 1 rack (42U) = 240TB
• 2009 H2: 2TB, 1 rack (42U) = 480TB
• 2010 H1: 2TB, 1 rack (42U) = 480TB
• 2010 H2: 3TB, 1 rack (42U) = 720TB
• 2012: 5TB…..
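
The density estimates above are consistent with a fixed number of drives per 42U rack: 240 TB with 1 TB drives implies 240 drives. The sketch below makes that reading explicit. The constant `DRIVES_PER_RACK` is inferred from the slide's arithmetic rather than stated on it, and the names are illustrative; only the entries with a rack capacity given on the slide are included.

```python
# Inferred assumption: 240 TB per rack with 1 TB drives means 240 drives per rack.
DRIVES_PER_RACK = 240

# Drive sizes (TB) per period, as listed on the slide.
DRIVE_SIZE_TB = {
    "2009 H1": 1,
    "2009 H2": 2,
    "2010 H1": 2,
    "2010 H2": 3,
}


def rack_capacity_tb(drive_tb, drives_per_rack=DRIVES_PER_RACK):
    """Raw capacity in TB of one 42U rack filled with drives of size `drive_tb`."""
    return drive_tb * drives_per_rack


if __name__ == "__main__":
    for period, size in DRIVE_SIZE_TB.items():
        print(period, rack_capacity_tb(size), "TB")
```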

Page 22: ASGC Site Report

Future remarks

• DC full restore end of May
  • Restart round-the-clock operation
• Relocated resources fully involved in STEP09
• Facility relocation from IDC end of Jun
• New resource expansion end of Jul
• Improve DC monitoring

Page 23: ASGC Site Report

Water mist

• Fire suppression system
  • Review the implementation of gas suppression system
  • Consider water mist in power room
• Wall cabinet outside data center area

Page 24: ASGC Site Report

Water mist – design plan