Online Testing of BGP. Marco Canini, EPFL, Switzerland, Networked Systems Laboratory. RIPE 62, 4/5/2011. Work supported by the European Research Council. Joint work with: Vojin Jovanović, Daniele Venzano, Gautam Kumar, Dejan Novaković, Boris Spasojević, Olivier Crameri, and Dejan Kostić.


Page 1: Online Testing of BGP

Online Testing of BGP

Marco Canini
EPFL, Switzerland
Networked Systems Laboratory

Work supported by the European Research Council

Joint work with: Vojin Jovanović, Daniele Venzano, Gautam Kumar, Dejan Novaković, Boris Spasojević, Olivier Crameri, and Dejan Kostić

RIPE 62, 4/5/2011

Page 2: Online Testing of BGP

Is it hard to crash the Internet?

• Software bugs in inter-domain routers

[Figure: Router type A and Router type B. A protocol-compliant but confusing message (0-length AS4_PATH attribute!) leads to "Reset session!"]

At 17:07:26 UTC on August 19, 2009 CNCI (AS9354), a small network service provider in Nagoya, Japan, advertised a handful of BGP updates containing an empty AS4_PATH attribute. [Renesys blog]
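The empty AS4_PATH case can be made concrete with a minimal sketch. It assumes the standard path attribute layout (flags, type code, length, value) and AS4_PATH type code 17. The two handlers, `handle_strict` and `handle_tolerant`, are hypothetical illustrations of the two router reactions; they are not taken from any real implementation.

```python
import struct

AS4_PATH = 17            # path attribute type code for AS4_PATH
EXTENDED_LENGTH = 0x10   # flag bit: length field is 2 bytes instead of 1


class SessionReset(Exception):
    """Hypothetical signal that the router tears down the BGP session."""


def parse_attribute(buf):
    """Split one path attribute into (flags, type_code, value)."""
    flags, type_code = buf[0], buf[1]
    if flags & EXTENDED_LENGTH:
        (length,) = struct.unpack_from("!H", buf, 2)
        offset = 4
    else:
        length = buf[2]
        offset = 3
    return flags, type_code, buf[offset:offset + length]


def handle_strict(value):
    """Hypothetical router that treats an empty AS4_PATH as fatal."""
    if len(value) == 0:
        raise SessionReset("empty AS4_PATH")
    return "accepted"


def handle_tolerant(value):
    """Hypothetical router that ignores an empty AS4_PATH."""
    if len(value) == 0:
        return "attribute ignored"
    return "accepted"


# Optional transitive attribute (flags 0xC0), type 17, length 0, no value.
empty_as4_path = bytes([0xC0, AS4_PATH, 0])
flags, type_code, value = parse_attribute(empty_as4_path)
print(type_code, len(value), handle_tolerant(value))
```

The same three bytes are harmless to one handler and fatal to the other, which is the kind of difference between interoperable implementations that the slide points at.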

Page 3: Online Testing of BGP

Is it hard to crash the Internet?

• What went wrong

[Figure: unaffected routers and affected routers; some destinations become "Unreachable!"]

• Repeated service disruptions: routing instabilities!

Page 4: Online Testing of BGP

BGP not always reliable

• Distributed system behavior
  – Aggregate result of interleaved actions of multiple routers
  – Federated, heterogeneous and failure-prone environment
• Difficult to reason about all corner cases or combinations of configurations
  – Unanticipated interactions, subtle differences in interoperable implementations, system-wide conflicts, seemingly valid local fault handling

Page 5: Online Testing of BGP

Agenda

• Our system for online testing
  – Disclaimer: still research work!
  – Not going to be an immediate solution
  – Hope it will be a tool for this community
• Solicit feedback
  – Which faults would you look for?
  – What would convince you to deploy our system?
• … discussion

Page 6: Online Testing of BGP

DiCE comes to the rescue

• Key idea: automatically explore system behavior to detect potential faults
  1. Create an isolated snapshot of a BGP neighborhood
  2. Subject a router’s BGP process to many inputs that systematically exercise router actions
  3. For each input, check if the snapshot misbehaves

[Figure: DiCE attached to a BGP process and its BGP neighbors]

• An error in the snapshot is evidence of possible future behavior of the production system
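The three steps above can be sketched as a small loop. This is only an illustration under loud assumptions: `ToyRouter`, its deliberately buggy `handle_update`, and the use of `copy.deepcopy` as the snapshot mechanism are all hypothetical stand-ins, not how DiCE or BIRD work internally.

```python
import copy


class ToyRouter:
    """Hypothetical stand-in for a BGP process with a deliberate bug."""

    def __init__(self):
        self.routes = {}
        self.session_up = True

    def handle_update(self, prefix, as4_path):
        if as4_path == []:
            # Deliberate bug: an empty AS4_PATH tears down the session.
            self.session_up = False
            return
        self.routes[prefix] = as4_path


def explore(production, inputs, property_holds):
    """Run every input against a fresh clone and collect violations."""
    faults = []
    for inp in inputs:
        clone = copy.deepcopy(production)   # step 1: isolated snapshot
        clone.handle_update(*inp)           # step 2: exercise the clone
        if not property_holds(clone):       # step 3: check the snapshot
            faults.append(inp)
    return faults


production = ToyRouter()
inputs = [
    ("192.0.2.0/24", [64500]),
    ("198.51.100.0/24", []),   # the confusing input
]
faults = explore(production, inputs, lambda r: r.session_up)
print(faults, production.session_up)
```

The production object is never touched: only the clones see the exploratory inputs, so a fault found in a clone is a warning about what the production router could do later.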

Page 7: Online Testing of BGP

BGP snapshot

• Isolate testing from production environment
  – Special IP prefix
  – Custom attribute
• Local checkpoint of current state and configuration

[Figure: the BGP process (FIB, sockets, BGP peers) next to the cloned BGP process (sockets, BGP checkpoints)]

• BGP’s federated environment
  – Each router keeps its local checkpoint
  – Private state & config stays in the AS
  – ASes collaborate to detect potential faults

Page 8: Online Testing of BGP

Exploration of behavior

• Use a path exploration engine
  – Concolic (CONCrete + symbOLIC) execution systematically exercises code paths
• Is there an error?

[Figure: DiCE drives a clone of the BGP process through steps 1, 2, 3 and flags "Error!"]
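A toy sketch can show the flavor of path exploration. It records the branch outcomes of one concrete run, flips one branch at a time, and looks for a new input that follows the flipped path. Note the swap: a real concolic engine hands the flipped constraints to a constraint solver, while this sketch substitutes a brute-force search over a tiny input domain. The handler and its thresholds are hypothetical.

```python
from itertools import product


def handler(attr_len, local_pref, trace):
    """Hypothetical update handler, instrumented to record its branches."""
    def branch(name, cond):
        trace.append((name, cond))
        return cond

    if branch("attr_len == 0", attr_len == 0):
        return "reset"
    if branch("local_pref > 100", local_pref > 100):
        return "prefer"
    return "keep"


def run(inp):
    """Concrete execution: returns the path taken and the result."""
    trace = []
    result = handler(*inp, trace)
    return tuple(trace), result


def solve(target_prefix, domain):
    """Brute-force stand-in for a constraint solver."""
    for inp in domain:
        path, _ = run(inp)
        if path[:len(target_prefix)] == target_prefix:
            return inp
    return None


def explore(seed, domain):
    """Flip one recorded branch at a time until no new path appears."""
    seen = {}
    worklist = [seed]
    while worklist:
        inp = worklist.pop()
        path, result = run(inp)
        if path in seen:
            continue
        seen[path] = (inp, result)
        for i, (name, outcome) in enumerate(path):
            flipped = path[:i] + ((name, not outcome),)
            candidate = solve(flipped, domain)
            if candidate is not None:
                worklist.append(candidate)
    return seen


domain = list(product(range(3), (50, 100, 150)))
paths = explore((1, 100), domain)
print(sorted(result for _, result in paths.values()))
```

Starting from one ordinary input, the exploration reaches all three outcomes of the handler, including the "reset" branch that the seed input never exercises.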

Page 9: Online Testing of BGP

Driving behavior by inputs

• Path exploration engine
  – Works on code & current config
  – Input generation turns path constraints into inputs
• Kinds of inputs: messages, failures, configuration changes, random choices, timeouts

[Figure: UPDATE message layout with symbolic inputs marked]

• UPDATE
  – Header
  – Withdrawn Routes
  – Path Attributes: Attribute Type | Length | Value
  – Network Layer Reachability Information: NLRI Length | Prefix
• Route selection
  – Route ranking: is most preferred route?
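To make the UPDATE fields concrete, here is a minimal sketch that packs one message with the standard BGP-4 wire layout (16-byte marker, 2-byte length, 1-byte type, then the UPDATE body). It is a hedged illustration: the message carries only an ORIGIN attribute, so it is not a complete valid announcement, and the helper names are made up for this example.

```python
import ipaddress
import struct

BGP_MARKER = b"\xff" * 16
UPDATE = 2          # BGP message type code for UPDATE
HEADER_LEN = 19     # marker (16) + length (2) + type (1)


def encode_nlri(prefix):
    """NLRI entry: prefix length in bits, then only the needed bytes."""
    net = ipaddress.ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def encode_attribute(flags, type_code, value):
    """Path attribute with a 1-byte length (value shorter than 256 bytes)."""
    return struct.pack("!BBB", flags, type_code, len(value)) + value


def build_update(withdrawn, attributes, nlri):
    """Assemble header, withdrawn routes, path attributes and NLRI."""
    body = struct.pack("!H", len(withdrawn)) + withdrawn
    body += struct.pack("!H", len(attributes)) + attributes
    body += nlri
    header = BGP_MARKER + struct.pack("!HB", HEADER_LEN + len(body), UPDATE)
    return header + body


# ORIGIN (type 1, well-known transitive, value 0 = IGP) as the only attribute.
origin = encode_attribute(0x40, 1, b"\x00")
msg = build_update(b"", origin, encode_nlri("192.0.2.0/24"))
print(len(msg), msg[18])
```

Each length, type and value field packed here is a candidate for being treated as symbolic, which is how one ordinary message can fan out into many exploratory inputs.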

Page 10: Online Testing of BGP

Detecting faults

• Check properties that capture desired behavior
• Example: Harmful Global Events (session resets)

[Figure: valid but ambiguous messages reach nine routers. Each router evaluates a local check f() and reports either "1 BGP error" (affected router) or "0" (unaffected router). The DiCE controller sums the reports (∑): 5 BGP errors.]

• Error count > threshold?
  – Log inputs that have harmful global behavior
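The controller-side check can be sketched in a few lines. The report values mirror the figure (five routers report one BGP error, four report none). The function names, the input labels and the threshold of 3 are hypothetical; the slides do not state a threshold.

```python
def harmful_global_event(reports, threshold):
    """Sum per-router error reports and compare against a threshold."""
    total = sum(reports.values())
    return total, total > threshold


def harmful_inputs(results, threshold):
    """Keep only the explored inputs whose error count exceeds the threshold."""
    return [
        inp for inp, reports in results.items()
        if harmful_global_event(reports, threshold)[1]
    ]


# Five affected routers (1 error each) and four unaffected ones (0).
reports = {f"r{i}": 1 for i in range(1, 6)}
reports.update({f"r{i}": 0 for i in range(6, 10)})

total, harmful = harmful_global_event(reports, threshold=3)
print(total, harmful)

results = {
    "empty AS4_PATH": reports,
    "ordinary update": {f"r{i}": 0 for i in range(1, 10)},
}
print(harmful_inputs(results, threshold=3))
```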

Page 11: Online Testing of BGP

Other properties

• Policy-induced divergence
• Origin misconfiguration
  – Check: routing tables polluted in external ASes?
• Route leaks (hijacks) by customer or provider

[Figure: provider P holds a route for prefix d with AS_PATH X Y Z. Customer C sends an UPDATE for prefix d with AS_PATH C. Output: list of prefixes that can leak.]
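One way to read the route leak figure is as a check over the provider's table: for each prefix, would an announcement from customer C with AS_PATH "C" get through and win? The sketch below is a simplification under stated assumptions: route selection is reduced to "shorter AS_PATH wins", the import filter is an optional allow-list, and the AS numbers and prefixes come from documentation ranges. None of this is the actual DiCE check.

```python
def leakable_prefixes(provider_table, customer_as, allowed_from_customer=None):
    """List prefixes a customer announcement could take over.

    provider_table maps prefix -> current AS_PATH (list of AS numbers).
    allowed_from_customer is an optional set of prefixes the provider
    accepts from this customer; None means no import filter at all.
    """
    leaks = []
    for prefix, current_path in provider_table.items():
        announced_path = [customer_as]
        passes_filter = (
            allowed_from_customer is None
            or prefix in allowed_from_customer
        )
        # Simplified selection: the shorter AS_PATH is preferred.
        if passes_filter and len(announced_path) < len(current_path):
            leaks.append(prefix)
    return leaks


table = {
    "203.0.113.0/24": [64501, 64502, 64503],   # prefix d via X Y Z
    "198.51.100.0/24": [64504],
}
print(leakable_prefixes(table, customer_as=64500))
print(leakable_prefixes(table, 64500, allowed_from_customer={"192.0.2.0/24"}))
```

With no import filter the three-hop route can be displaced. With an allow-list that covers only the customer's own prefix, nothing can leak.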

Page 12: Online Testing of BGP

Keeping confidential information

• Potential router behavior
  – Common code paths already exposed
  – Reverse engineering any easier than today?
• Private state or configuration
  – Information hiding through randomization
  – Avoid inputs driven by confidential data, so that it cannot leak
• Rate limit, refuse certain explorers
• Anonymous property checks
  – Secure multi-party computation: no need for trusted 3rd party
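As an illustration of an anonymous property check, the sketch below sums per-router error counts with additive secret sharing, one of the simplest secure multi-party computation techniques. The slides do not say which protocol DiCE would use, so the choice of secure sum, the modulus and the function names are all assumptions. All parties are simulated in a single process here.

```python
import secrets

MOD = 2**61 - 1   # assumed modulus, far larger than any realistic sum


def share(value, n):
    """Split value into n random shares that add up to value mod MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares


def secure_sum(values):
    """Each party shares its value; only per-party partial sums are revealed."""
    n = len(values)
    all_shares = [share(v, n) for v in values]
    # Party j adds up the j-th share it received from every party.
    partials = [sum(all_shares[i][j] for i in range(n)) % MOD for j in range(n)]
    return sum(partials) % MOD


error_reports = [1, 1, 1, 1, 1, 0, 0, 0, 0]
print(secure_sum(error_reports))
```

Each individual share is uniformly random, so a party that sees one share per peer learns nothing about which routers reported an error. Only the total is reconstructed, and no trusted third party is involved.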

Page 13: Online Testing of BGP

Implementation details

• Integrated DiCE in BIRD 1.1.7
  – Open source router, coded in C
• Concolic execution instruments code to track symbolic inputs
  – Instrumentation needed only for testing
  – Negligible impact on the production environment

Page 14: Online Testing of BGP

Evaluation

• Multiple BIRD instances on a 48-core machine
• Properties checked
  – Harmful global events
  – Origin misconfiguration
  – Policy conflict

Page 15: Online Testing of BGP

Evaluation topology [Haeberlen et al., NSDI ’09] + Annotations

• Loaded ~300k BGP prefixes
• Replayed 15-min trace
• Policy and filtering
• Installed in ModelNet network emulator [OSDI ’02]
  – 30 ms intra-AS
  – 5 ms inter-AS
  – 620 Mbps

[Figure: topology with nodes AS 1, AS 2, AS 3, AS 4, AS 5, AS 6, AS 165053, AS 8, AS 9, AS 10 and the rest of the Internet. Legend: customer-provider link, peering link, backup link, router that resets session due to 0-length AS4_PATH.]

Page 16: Online Testing of BGP

Micro benchmarks

• CPU overhead
  – Metric: BGP updates per s
  – Stress test during RIB load
    • Baseline: 15.1
    • W/ exploration: 13.9
    • Impact 8%
  – Realistic test during trace replay
    • Negligible impact
• Memory overhead
  – Cloned process has 37% overhead on avg
• Bandwidth
  – 8 Kbps avg for exploratory messaging
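The 8% figure follows from the two throughput numbers on the slide. A short sketch of the arithmetic, where the function name is made up for illustration:

```python
def relative_slowdown(baseline, measured):
    """Throughput drop as a percentage of the baseline."""
    return (baseline - measured) / baseline * 100


# Stress test during RIB load, in BGP updates per s as reported on the slide.
impact_pct = relative_slowdown(15.1, 13.9)
print(f"{impact_pct:.1f}%")
```

This gives roughly 7.9%, which the slide rounds to 8%.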

Page 17: Online Testing of BGP

Results

• Avg: 243 s, 756 explorations
  – Max 670 s, 2002 explorations
  – Without ModelNet: avg 155 s
  – Detected session reset and origin misconfiguration
• Explored all paths in the UPDATE handlers + across the Internet-like testbed in ~4 min avg (11 min max)

Page 18: Online Testing of BGP

Deployment option 1

• Convince Cisco, Juniper, Huawei, etc. to integrate DiCE

Page 19: Online Testing of BGP

Deployment option 2

• Deploy DiCE+BIRD in a server
  – Potentially run multiple router instances
  – Configure with the AS policy & BGP feed
  – Connect with DiCE servers in neighboring ASes

Page 20: Online Testing of BGP

Incentives

• Common infrastructure
• ISP benefits as an exploration target
  – Knowing about its faults
• Upstream ISPs can incentivize customer ISPs to serve as an “explorer”
  – Fewer faults, lower operational costs

Page 21: Online Testing of BGP

Conclusion

• We have an online testing system for BGP
• Are you interested in trying out our prototype?
• Do you have suggestions for properties to check?
  – Get in touch: [email protected]
• Thank you! Questions?
• More info in our papers
  – [LADIS ’10, USENIX ATC ’11]

Page 22: Online Testing of BGP

Backup slides

Page 23: Online Testing of BGP

My Research

• Improving the reliability of distributed systems
• Why?
  – Foundation of our society’s infrastructure
  – ... but it is difficult to make them reliable
    • Produce robust design and implementation
    • Deploy and operate reliably
• A prime example: BGP (inter-domain routing)
  – Fundamental service for Internet’s operation
  – Additional challenges: federation & heterogeneity

Page 24: Online Testing of BGP

DiCE/BGP Prototype in Action

[Figure: message sequence between Node 1 (explorer) and Node 2, involving the path exploration engine. Labels, in extraction order: 1’: fork(); 2’: fork()/run; 1’: annotated message; 3: message; 1: create snapshot; 2: input constraints; 2’’: connect; 4: property check; 4: check ctrl; 2’’’: fork()/run; 1’’: fork(); 1’’: ack; constraints/inputs; 3’: ack.]

Page 25: Online Testing of BGP

Inputs produced by DiCE

[Figure: flowchart spanning the input generation code and the router update handling code. An original input x.y.z.w/l passes through a series of "Fuzz?" decisions, each leading to "Fuzz attr". Generated inputs include x.y.z.w/l (fuzz), x.y.z.w/l (0-length AS4_PATH) and a.b.c.d/l (leaked prefix). Each input is then run through "Import filter1?" and "Import filter2?" decisions that end in "Apply update" and "Send update", or in "Drop update".]

Page 26: Online Testing of BGP

Property 3: BGP Policy Conflicts

• Checking convergence is hard [Varadhan et al., ’96, Griffin et al., ’00]
  – Check: Dispute wheel?
    • Absence of: sufficient condition for robust convergence

[Figure: BAD GADGET II [Timothy G. Griffin, Leiden Global Internet talk ’00]. Nodes 1, 2, 3, 4 around destination 0, each with its ranked paths:
  node 1: 1 3 0, then 1 0
  node 2: 2 1 0, then 2 0
  node 3: 3 4 2 0, then 3 0
  node 4: 4 2 0, then 4 3 0
Nodes locally prefer not routing directly to 0. Cycle!]
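The BAD GADGET II rankings on this slide are small enough to check by brute force. The sketch below enumerates every path assignment and keeps those that are stable, meaning every node uses its most preferred path among the ones its next hop currently offers. This is a plain exhaustive search over the stable paths problem, not dispute wheel detection and not the DiCE check. The `VARIANT` policy, in which node 3 prefers its direct path, is a hypothetical change added only for contrast.

```python
from itertools import product


def available(path, choice):
    """A path is usable if its next hop currently uses the remaining suffix."""
    if len(path) == 2:            # direct link to destination 0
        return True
    return choice[path[1]] == path[1:]


def is_stable(choice, ranked):
    """Every node must hold its most preferred available path."""
    for node, prefs in ranked.items():
        best = next((p for p in prefs if available(p, choice)), None)
        if choice[node] != best:
            return False
    return True


def stable_solutions(ranked):
    """Exhaustively test every combination of per-node path choices."""
    nodes = sorted(ranked)
    options = [ranked[n] + [None] for n in nodes]
    solutions = []
    for combo in product(*options):
        choice = dict(zip(nodes, combo))
        if is_stable(choice, ranked):
            solutions.append(choice)
    return solutions


# Rankings as listed on the slide, most preferred first.
BAD_GADGET = {
    1: [(1, 3, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
    3: [(3, 4, 2, 0), (3, 0)],
    4: [(4, 2, 0), (4, 3, 0)],
}

# Hypothetical variant: node 3 prefers its direct path instead.
VARIANT = dict(BAD_GADGET)
VARIANT[3] = [(3, 0), (3, 4, 2, 0)]

print(len(stable_solutions(BAD_GADGET)), stable_solutions(VARIANT))
```

With the slide's rankings no assignment is stable, which matches the "Cycle!" annotation. Reversing a single node's preference is enough to produce exactly one stable assignment.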

Page 27: Online Testing of BGP

Dispute Wheel Detection with DiCE

• Use symbolic input to change policy
  – Can cause a dispute wheel in a single step
• Use global precedence metric to detect and resolve conflict [Ee et al., SIGCOMM ’07]
  – Metric invoked: dispute wheel (DW) in the cloned snapshot, hence a fault

[Figure: GOOD GADGET and BAD GADGET II. Nodes 1, 2, 3, 4 around destination 0, with path lists 1 3 0, 1 0; 2 1 0, 2 0; 4 2 0, 4 3 0; 3 4 2 0, 3 0.]

• Report: list of policy changes that cause oscillations