GGF12 – 20 Sept 2004 - 1 LCG Incident Response Ian Neilson LCG Security Officer Grid Deployment Group CERN

LCG Incident Response

Ian Neilson
LCG Security Officer

Grid Deployment Group

CERN

Background

• LCG – Large Hadron Collider (LHC) Computing Grid
  • Computing environment for the 4 LHC experiments
    • ALICE, ATLAS, CMS, LHCb
  • LHC operation in 2007
  • Required: 12-14 PetaBytes/year, compute equivalent to 70,000 PCs
• LCG1 (2003), LCG2 (2003-4), EGEE
  • 70+ sites in Europe, USA, Asia, S. America …
  • 7000+ CPUs
  • 6000GB+ Storage
  • Software certification, testing, deployment group
  • Distributed GOCs
    • UK: http://goc.grid-support.ac.uk/gridsite/gocmain/monitoring/
    • Taiwan: http://goc.grid.sinica.edu.tw/goc/

www.cern.ch/lcg

Grid monitoring

EGEE - Enabling Grids for E-science in Europe

• 12 federations with 70 partner institutions
• 2 year + 2 project
• Operate a service grid facility for e-science
  • Initially built on LCG2 infrastructure
• Re-engineer a robust middleware layer
  • gLite
• Attract new users
  • Research and Industry
• Broader focus than HEP: Biomedical, Earth Science …

www.cern.ch/egee

Policy – the Joint Security Group

Security & Availability Policy

• Usage Rules
• Certification Authorities
• Audit Requirements
• GOC Guides
• Incident Response
• User Registration
• Application Development & Network Admin Guide

http://cern.ch/proj-lcg-security/documents.html

Incident Response Policy

• Agreement on Incident Response
  • June 2003 for LCG1
• What is an incident?
  • Security investigation causing service interruption
  • Suspected misuse of resources beyond site
  • “Reasonable possibility” of stolen credentials
    • Not to expire or be revoked within 3 days
• Classifications
  • Identity theft
    • Suspected / Probable / Confirmed
• Actions
  • Misuse / Enforcement / Restoration / Escalation
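
The incident criteria and classification levels on this slide can be sketched as a small data model. This is only an illustration: the names `Report`, `Confidence`, `CREDENTIAL_WINDOW` and `is_incident` are hypothetical, and the handling of the 3-day rule is an assumed reading of the slide (possibly stolen credentials count when they will not expire or be revoked within 3 days). None of it is taken from the LCG policy text itself.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Confidence(Enum):
    """Classification levels named on the slide."""
    SUSPECTED = "suspected"
    PROBABLE = "probable"
    CONFIRMED = "confirmed"


# Assumed reading of the slide: credentials matter if they stay valid longer than this.
CREDENTIAL_WINDOW = timedelta(days=3)


@dataclass
class Report:
    """Hypothetical record of a security event observed at one site."""
    service_interrupted: bool = False        # investigation interrupts a service
    misuse_beyond_site: bool = False         # suspected misuse reaching other sites
    credentials_possibly_stolen: bool = False
    # Time until the credential expires or is revoked; None if unknown.
    credential_lifetime_left: Optional[timedelta] = None


def is_incident(report: Report) -> bool:
    """Return True if any of the three criteria listed on the slide is met."""
    if report.service_interrupted or report.misuse_beyond_site:
        return True
    if report.credentials_possibly_stolen:
        left = report.credential_lifetime_left
        # An unknown lifetime is treated conservatively as "still valid".
        return left is None or left > CREDENTIAL_WINDOW
    return False
```

Treating an unknown credential lifetime as an incident is a design choice made for this sketch; a site could equally decide to ask for more information first.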

Incident Response - Communications

• Site enrolment collects 2 entries per site
  • Registration questionnaire
• Site Contacts mail list
  • Closed list of named individuals
    • email, telephone
• CSIRT mail list
  • List-of-lists (Open)
    • 1 entry per site
• Updated list circulated to contacts list as sites enrol
  • Pointers to policy documents for responsibilities
• Channels
  • Users - local site contacts (& GOC)
  • Contacts - discussion and information exchange
  • CSIRT - incident notification, update
  • Roll-out - system administrators

Incident Response – management issues

• LCG “community” known at CERN, EGEE community is broader
• User enrolment is well controlled, site enrolment is not
  • Incomplete questionnaires
  • Personal instead of list
  • List instead of personal
  • Undeliverable addresses
  • Delayed delivery
  • Moderated delivery
  • Enrolment information not circulated
  • SPAM, SPAM, SPAM, SPAM
• Lists need active management!
• Can we “see” all the sites?
  • CERN/GOC view
  • VO “private” information systems
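
Some of the enrolment problems listed on this slide, such as incomplete questionnaires and missing entries, can be caught mechanically before a list is circulated. The following is a minimal sketch, assuming a hypothetical `SiteEnrolment` record that holds the two entries collected at enrolment: named site contacts with email and telephone, and a single CSIRT list address. The field names and the very loose address check are illustrative assumptions. Delivery problems (bounces, delays, moderation) cannot be detected this way and would need test mail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contact:
    """A named individual on the closed Site Contacts list (hypothetical shape)."""
    name: str
    email: str
    telephone: str


@dataclass
class SiteEnrolment:
    """Hypothetical enrolment record: named contacts plus one CSIRT list address."""
    site: str
    contacts: List[Contact] = field(default_factory=list)
    csirt_list: str = ""


def looks_like_address(value: str) -> bool:
    """Very loose syntactic check; it does not prove the address is deliverable."""
    local, sep, domain = value.strip().partition("@")
    return bool(local) and sep == "@" and "." in domain


def audit(record: SiteEnrolment) -> List[str]:
    """Return human-readable problems found in one enrolment record."""
    problems = []
    if not record.contacts:
        problems.append("no named site contacts")
    for c in record.contacts:
        if not c.name.strip():
            problems.append("contact without a name")
        if not looks_like_address(c.email):
            problems.append(f"bad contact email: {c.email!r}")
        if not c.telephone.strip():
            problems.append(f"no telephone for {c.name or c.email!r}")
    if not looks_like_address(record.csirt_list):
        problems.append("missing or malformed CSIRT list address")
    return problems
```

Running `audit` over every record each time a site enrols, and again on a regular schedule, is one way to make the "active management" of the lists routine.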

Incident response – operational issues

• Recognising and reporting
  • What is a local CSIRT?
    • Scale of coverage
      • 24x7 site/campus network operations team
      • Department Security Officer
      • LCG system administrator
  • Who is a security contact?
    • as above
• Intersection with local CSIRT procedures
  • Local quarantine and analysis
• Keeping emergency channels clear
  • Discussions, cross-postings

Incident response – near-term

• JSG, EGEE MWSG/JRA3, OSG, …
  • Site and VO registration policy and process
    • Control gathering, distribution and management of data
  • Sites need to understand requirements and responsibilities
    • Coverage, access, audit
    • Needs to be actively managed (self-managed?)
• Operational Security Co-ordination Team (OSCT)
  • Ownership of security incidents
    • From notification to resolution
    • Liaise with national/institute CERTs
  • Ownership of known problems
    • Liaise with development & deployment groups
  • Co-ordination of monitoring
  • Post-mortem analysis
  • Team of experts

Security Co-ordination

• How does OSCT map onto EGEE operations structures?
  • Resource Centres (lots)
  • Regional Operations Centres - ROC (~9)
  • Core Infrastructure Centres - CIC (~5)
  • Operations Management Centre - OMC (1)
• Co-ordination with Open Science Grid …
  • Adopt same co-ordinating model

2004 Security Service Challenges

• Objectives
  a) Evaluate the effectiveness of current procedures by simulating a small and well defined set of security incidents.
  b) Use the experiences of a) in an iterative fashion (during the challenges) to update procedures.
  c) Formalise the understanding gained in a) & b) in updated incident response procedures.
  d) Provide feedback to middleware development and testing activities to inform the process of building security test components.
• Exercise response procedures in a controlled manner
  • Non-intrusive
    • Compute resource usage trace to owner
      – Run a job to send an email
    • Storage resource trace to owner
      – Run a job to store a file
  • Disruptive
    • Disrupt a service and map the effects on the service and grid

LCG/EGEE Incident Response

Thank You

Thank you to UK PPARC