10/30/22 1 CompSci 514: Computer Networks Lecture 5: BGP problems Xiaowei Yang


CompSci 514: Computer Networks

Lecture 5: BGP problems
Xiaowei Yang


Today

• Known problems of BGP
  – Multi-homing
  – Instability
  – Delayed convergence

• Slow failover

• Discussing fixes
  – Root cause, ghost flushing, etc.


Failover

• BGP is designed for scaling more than for fast failover
  – Many mechanisms favor this balance
  – Route flap damping, for example
    • If a route changes excessively (“flapping”), ignore it for some time

• This has unexpected effects on convergence times
  – Route advertisement/withdrawal timers are in the 30-second range
  – Effect: tens of seconds to many minutes to recover from “simple” failures
  – 15-30 minute outages are not uncommon
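A minimal sketch of how route flap damping can turn a few quick flaps into a long outage. It assumes a penalty that grows per flap and decays exponentially, in the style of RFC 2439. The class name and all four parameter values (penalty per flap, suppress and reuse thresholds, half-life) are illustrative assumptions, not values from the lecture.

```python
class FlapDamper:
    """Per-route damping state. All default values are illustrative assumptions."""

    def __init__(self, penalty_per_flap=1000.0, suppress=2000.0,
                 reuse=750.0, half_life=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress = suppress      # start ignoring the route at this penalty
        self.reuse = reuse            # accept the route again below this penalty
        self.half_life = half_life    # seconds for the penalty to halve
        self.penalty = 0.0
        self.last = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay of the accumulated penalty since the last event.
        self.penalty *= 0.5 ** ((now - self.last) / self.half_life)
        self.last = now

    def flap(self, now):
        """Record one routing change at time `now` (seconds)."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress:
            self.suppressed = True

    def usable(self, now):
        """Return True if the route may be used at time `now`."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False
        return not self.suppressed


damper = FlapDamper()
for t in (0.0, 10.0, 20.0):   # three flaps within 20 seconds
    damper.flap(t)
print(damper.usable(30.0))    # still suppressed shortly after the flaps
print(damper.usable(2100.0))  # released only after the penalty has decayed
```

With these assumed values, three flaps in 20 seconds keep the route suppressed for roughly half an hour, which is the same order of magnitude as the outages mentioned above.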


Multi-homing

• Connect to multiple providers
  – Goal: higher availability, more capacity

• Problems:
  – Provider-based addressing breaks
  – Everyone needs their own address space


Multi-homing increases routing table size

[Figure: Multi-home.com (204.1.0.0/16) connects to both ISP1 and ISP2. ISP1 announces to ISP3: "You can reach 128.0.0.0/8 and 204.1.0.0/16 via ISP1." Routing entries shown in the figure: 128.0.0.0/8 via ISP1, 204.1.0.0/16 via ISP1, 204.0.0.0/8 via ISP2, 204.1.0.0/16 via ISP2.]
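A small sketch of why the more-specific /16 has to be carried as its own entry: routers forward on the longest matching prefix, so the /16 cannot be folded into the covering /8. The prefixes come from the slide. The function name, the table layout, and the choice of ISP1 as next hop for the /16 are assumptions made for illustration.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the next hop of the most specific prefix containing `address`."""
    addr = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical view of ISP3's table, using the prefixes on the slide.
ISP3_TABLE = [
    (ipaddress.ip_network("128.0.0.0/8"), "ISP1"),
    (ipaddress.ip_network("204.0.0.0/8"), "ISP2"),
    (ipaddress.ip_network("204.1.0.0/16"), "ISP1"),  # extra entry caused by multi-homing
]

print(longest_prefix_match(ISP3_TABLE, "204.1.2.3"))  # inside the /16
print(longest_prefix_match(ISP3_TABLE, "204.9.9.9"))  # only the /8 matches
```

Every multi-homed site adds at least one such entry that cannot be aggregated away, which is how multi-homing inflates the global table.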


Global routing tables continue to grow

Source: http://bgp.potaroo.net/as6447/


Other BGP problems

• Convergence: BGP may explore many routes before finding the right new one
  – Labovitz et al., SIGCOMM 2000

• Correctness: routes may not be valid, visible, or loop-free

• Security: there is none!
  – Some providers filter the announcements their customers can make; not all do
  – See the paper discussion site for pointers


Measurement studies

• Two papers (measurement)
  – End-to-end traffic
  – Routing messages

• Experimental techniques

• Results


Internet Routing Instability

• Goal: measure how often BGP sends updates to change routes

• Methodology:
  – Analyze BGP logs collected over a long period


Terms

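• WADiff: a withdrawal followed by an announcement of a different route

• AADiff: an announcement followed by an announcement of a different route

• WADup: a withdrawal followed by an announcement of the same route

• AADup: an announcement followed by a duplicate announcement of the same route

• WWDup: a withdrawal followed by a duplicate withdrawal of the same route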
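The taxonomy can be read as a small state machine over the update stream of each (peer, prefix) pair. The sketch below is one way to label a log under that reading. The function name, the tuple layout of an update, the use of the AS path as a stand-in for the full route, and the extra labels "Announce" and "Withdraw" for ordinary updates are all assumptions for illustration.

```python
def classify_updates(updates):
    """Label each update in a BGP log.

    `updates` is a list of (peer, prefix, route) tuples, where route is a
    tuple standing in for the route attributes, or None for a withdrawal.
    """
    last_route = {}   # (peer, prefix) -> most recently announced route
    reachable = {}    # (peer, prefix) -> True if currently announced
    labels = []
    for peer, prefix, route in updates:
        key = (peer, prefix)
        previous = last_route.get(key)
        is_up = reachable.get(key, False)
        if route is None:
            # Withdrawing something that is not announced is a duplicate.
            label = "Withdraw" if is_up else "WWDup"
            reachable[key] = False
        else:
            if is_up:
                label = "AADup" if route == previous else "AADiff"
            elif previous is None:
                label = "Announce"          # first announcement ever seen
            else:
                label = "WADup" if route == previous else "WADiff"
            last_route[key] = route
            reachable[key] = True
        labels.append(label)
    return labels


log = [
    ("peerA", "10.0.0.0/8", ("1", "2")),
    ("peerA", "10.0.0.0/8", ("1", "2")),
    ("peerA", "10.0.0.0/8", None),
    ("peerA", "10.0.0.0/8", None),
    ("peerA", "10.0.0.0/8", ("1", "2")),
    ("peerA", "10.0.0.0/8", ("1", "3")),
]
print(classify_updates(log))
```

Note that a withdrawal for a prefix the peer never announced is also labeled WWDup here, since the prefix is already unreachable from that peer's point of view.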


Observed pathologies

• Repeated WWDup, WADup, AADup

• Why are they pathologies?


• The majority of BGP updates are WWDup

• WWDups come from ASes that never announced the withdrawn routes

• Why?
  – Stateless BGP: the router does not remember what it has sent to its peers
  – So it sends withdrawals to all peers


Possible origins of instability

• Stateless BGP

• Physical link errors

• Unjittered timers

• IGP, BGP interactions

• Conflicting routing policies


Data analysis techniques

• Time series analysis

• Frequency analysis
  – Fast Fourier transform
  – Maximum entropy spectral estimation

• The two estimation methods differ, but both find significant periodicities at seven days and at 24 hours
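A minimal sketch of the frequency-analysis idea on made-up data. It uses a naive discrete Fourier transform (standing in for an FFT to stay dependency-free) on a synthetic hourly series with daily and weekly cycles. The series, its amplitudes, and the function name are assumptions; this is not the paper's data or code.

```python
import cmath
import math


def dominant_periods(series, top=2):
    """Return the `top` strongest periods (in samples) of a time series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]          # drop the constant component
    magnitudes = []
    for k in range(1, n // 2 + 1):                 # frequency bin k has period n / k
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        magnitudes.append((abs(coeff), n / k))
    magnitudes.sort(reverse=True)
    return [period for _, period in magnitudes[:top]]


# Synthetic hourly update counts over four weeks: a daily and a weekly cycle.
HOURS = 24 * 7 * 4
synthetic = [
    100 + 30 * math.sin(2 * math.pi * t / 24) + 20 * math.sin(2 * math.pi * t / (24 * 7))
    for t in range(HOURS)
]
print(dominant_periods(synthetic))  # periods in hours, strongest first
```

Periodicities at 24 hours and seven days in real update logs suggest that instability tracks human-driven load rather than purely random failures.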


Main results

• Many more updates than expected
  – 99% are pathological. Impressive!
  – A taxonomy to analyze pathologies

• Speculation about causes
  – Configuration errors, router bugs
  – Updates correlate with traffic load, perhaps due to router architectures
  – Open research question: the root cause of the updates

• Motivated much follow-up work


End-to-end routing behavior

• Goals: study routing pathologies, route stability, and routing asymmetry

• Methodology:
  – End-to-end measurements
  – Traceroutes among N sites, covering N^2 paths
  – Exponentially spaced-out sampling
    • Nice properties!
    • Unbiased
    • PASTA (Poisson Arrivals See Time Averages): the fraction of measurements that observe a given state equals the fraction of time the system spends in that state
  – Two datasets, D1 and D2, with different sampling intervals
    • 1~2 days, 2 hrs, 2.75 days
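A toy illustration of why exponentially spaced sampling is unbiased while fixed-interval sampling can be badly wrong. The two-state system (3 time units in state A, then 1 in state B, repeating), the function names, and all parameters are assumptions chosen to make the effect obvious.

```python
import random


def in_state_b(t):
    """Toy system: 3 time units in state A, then 1 unit in state B, repeating."""
    return (t % 4.0) >= 3.0


def poisson_sampling(n, mean_gap, seed=0):
    """Fraction of samples that see state B with exponentially spaced probes."""
    rng = random.Random(seed)
    t, hits = 0.0, 0
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_gap)   # exponential inter-sample gap
        hits += in_state_b(t)
    return hits / n


def periodic_sampling(n, gap, start=0.5):
    """Fraction of samples that see state B with fixed-interval probes."""
    return sum(in_state_b(start + k * gap) for k in range(n)) / n


# The system spends 25% of its time in state B.
print(poisson_sampling(50_000, mean_gap=10.0))  # close to 0.25
print(periodic_sampling(1_000, gap=4.0))        # locked to the cycle, never sees B
```

The periodic prober is synchronized with the system's own cycle and never observes state B; the exponential gaps break that synchronization.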


Measurement infrastructure

• 37 hosts

• 8% of Internet at that time

• Pair-wise traceroute


Pathologies

• Persistent loops: 0.13-0.16%, some lasted hours
  – Same router appearing in a traceroute more than three times

• Erroneous routing: packets to UCL sent to Israel

• Mid-stream change: 0.16% (D1) and 0.44% (D2)
  – Suggests a route change

• Infrastructure failure: availability 99.8%, 99.5%
  – Destination host unreachable
  – Telephone networks: 2 hours of downtime in 40 years, five nines

• Outages: more than six packet losses in a traceroute
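The availability comparison above is simple arithmetic, worked out in the sketch below. The helper names and the 365.25-day year are assumptions.

```python
HOURS_PER_YEAR = 365.25 * 24


def availability(downtime_hours, period_years):
    """Fraction of time a system is up, given total downtime over a period."""
    return 1.0 - downtime_hours / (period_years * HOURS_PER_YEAR)


def downtime_hours_per_year(avail):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


print(availability(2, 40))              # telephone network: above 0.99999
print(downtime_hours_per_year(0.998))   # about 17.5 hours per year
print(downtime_hours_per_year(0.995))   # about 43.8 hours per year
```

So the measured Internet paths were down for tens of hours per year, against minutes per year for the telephone network.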


Stabilities

• Prevalence: the probability of observing a particular route

• Persistence: how long a route lasts

• Examples:
  – R1, R2, R1, R2
  – R1, R1, R2, R2
  – Same prevalence, different persistence
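The two example sequences can be checked directly. In this sketch, prevalence is the fraction of observations showing each route, and persistence is approximated by the length of each unbroken run; both helper names are assumptions.

```python
from collections import Counter
from itertools import groupby


def prevalence(observations):
    """Fraction of observations in which each route was seen."""
    counts = Counter(observations)
    total = len(observations)
    return {route: count / total for route, count in counts.items()}


def run_lengths(observations):
    """Consecutive runs of the same route, as (route, length) pairs."""
    return [(route, len(list(group))) for route, group in groupby(observations)]


alternating = ["R1", "R2", "R1", "R2"]
stable = ["R1", "R1", "R2", "R2"]

print(prevalence(alternating), prevalence(stable))  # identical
print(run_lengths(alternating))                     # four runs of length 1
print(run_lengths(stable))                          # two runs of length 2
```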


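Prevalence of dominant route

• Pdom(p) = k_p / n_p, where n_p is the number of traceroutes measured on path p and k_p is the number of those that observed the dominant route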

• Internet paths are strongly dominated by one path


2/3 of routes persist for days


Path asymmetry

• Common
  – Don’t assume path symmetry in your design
  – 49% of measurements saw asymmetric paths that differed by at least one city
  – 30% observed different ASes
  – 20% differed by more than one city/AS
  – Q: what might cause it?


Comments on this paper

• Seminal work on Internet measurement

• Solid data

• Rigorous analysis


Delayed Internet Convergence

• Methodologies
  – Measurement
  – Problem discovery
  – Modeling & analysis
  – Improvement


Experiments setup

• Actively inject BGP faults
  – How is a fault injected?

• Passively listen at peering sessions, and use NTP synchronized machines to calculate the convergence time

• Actively send probe packets to observe end-to-end packet loss and latency

• Much BGP work later uses similar measurement techniques.


Results show delayed convergence

• Bad news travels slow.


Slow routing convergence results in poor end-to-end performance


What causes the delayed routing convergence?

• A simple BGP convergence model reveals that, in the worst case, all possible paths are explored before a prefix is withdrawn.

• Model assumptions: no minimum advertisement timer, a synchronized network, and a global message queue

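[Figure: ASes 0, 1, and 2 and destination R. After R is withdrawn, the ASes exchange the paths 01R, 10R, and 20R. Routing-table states shown include (*0R, 1R, ∞), (01R, *1R, ∞), (*01R, 10R, ∞), (*0R, ∞, 2R), (∞, *1R, 2R), (∞, ∞, *2R), (∞, ∞, *20R), and (∞, ∞, ∞).]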
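A small simulation in the spirit of the model above: no advertisement timer and one global FIFO message queue. It is a sketch under stated assumptions, not the paper's model. The ASes form a full mesh, each initially reaches R directly, paths are written as strings such as "01R", the shortest loop-free path wins with ties broken lexicographically, and node ids are single digits. All names are hypothetical.

```python
from collections import deque


def simulate_withdrawal(n=3):
    """Withdraw R from a full mesh of n ASes; return final routes, explored paths, message count."""
    nodes = [str(i) for i in range(n)]            # single-digit ids, so n <= 10
    # rib_in[i][j]: path to R that neighbor j last advertised to i (None = withdrawn)
    rib_in = {i: {j: j + "R" for j in nodes if j != i} for i in nodes}
    best = {i: "R" for i in nodes}                # steady state: direct route to R
    explored = {i: [] for i in nodes}             # what each AS advertised, in order
    queue = deque()
    sent = 0

    def select(i):
        # Shortest path that does not already contain i (receiver-side loop check).
        usable = [p for p in rib_in[i].values() if p is not None and i not in p]
        return min(usable, key=lambda p: (len(p), p), default=None)

    def advertise(i):
        nonlocal sent
        msg = None if best[i] is None else i + best[i]
        explored[i].append(msg)
        for j in rib_in[i]:
            queue.append((i, j, msg))
            sent += 1

    # Failure: R disappears, so every AS loses its direct route and falls back
    # to whatever its neighbors advertised earlier (now stale information).
    for i in nodes:
        best[i] = select(i)
        advertise(i)

    while queue:
        src, dst, path = queue.popleft()
        rib_in[dst][src] = path
        new_best = select(dst)
        if new_best != best[dst]:
            best[dst] = new_best
            advertise(dst)
    return best, explored, sent


final, explored, sent = simulate_withdrawal(3)
print(final)           # every AS ends with no route
print(explored["0"])   # the stale paths AS 0 tried before giving up
print(sent)            # far more messages than one withdrawal per session
```

Each AS walks through progressively longer stale paths before concluding that R is unreachable, which is the path exploration the model describes.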


Minimum Route Advertisement Interval (MRAI) timer reduces message count

• Why?
  – MRAI introduces synchronization. Multiple announcements are combined into one announcement, reducing the total message count.

• However, the convergence time becomes proportional to timer_interval * (n-3)
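To get a feel for the scale of timer_interval * (n-3), the sketch below evaluates it for a few values of n (the number of ASes in the model). The constant of proportionality is assumed to be 1, and the 30-second default is taken from the "30 second range" mentioned earlier in the lecture; both are assumptions for illustration.

```python
def mrai_convergence_estimate(n_ases, timer_interval=30.0):
    """Rough convergence time in seconds: timer_interval * (n - 3), never negative."""
    return timer_interval * max(n_ases - 3, 0)


for n in (4, 10, 20):
    print(n, mrai_convergence_estimate(n))
```

With a 30-second timer, even a modest n of 10 gives a figure of several minutes.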


Let’s brainstorm…

• How can we fix the slow convergence problem?
  – What is the solution proposed by the authors?
    • Sender-side loop detection. When a sender detects a loop, it sends a withdrawal to the neighbor immediately. Since a withdrawal is not subject to the MRAI delay, this improvement reduces both message count and convergence time.
  – What exactly is the root cause of BGP’s slow convergence problem?
  – Can you come up with any solution?


Sender-side loop detection

• Without sender-side loop detection:
  – AS3 -> AS1: 301R
  – This announcement is sent out when the MRAI timer expires

• With sender-side loop detection:
  – AS3 -> AS1: withdrawal
  – The withdrawal is sent out immediately. AS1 knows it has no path.
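The AS3/AS1 case above can be written as a one-line check on the sender. This is a sketch: the function name, the string encoding of paths ("301R"), and the returned tuples are assumptions made for illustration.

```python
def update_for_neighbor(path, neighbor, sender_side_loop_detection):
    """Decide what a router sends to `neighbor` when its best path becomes `path`.

    `path` is a string of single-character AS ids ending in the destination,
    e.g. "301R" for AS3 -> AS0 -> AS1 -> R.
    """
    if sender_side_loop_detection and neighbor in path:
        # The neighbor would reject this path as a loop anyway, so tell it
        # right away that no path is available through us.
        return ("withdraw", None, "immediately")
    return ("announce", path, "when MRAI expires")


print(update_for_neighbor("301R", "1", sender_side_loop_detection=False))
print(update_for_neighbor("301R", "1", sender_side_loop_detection=True))
```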


BGP assertion

• Detect path inconsistency between different neighbors

• If an inconsistency is found, give paths learned from direct neighbors higher priority

• Sensitive to topology

• Does not eliminate all invalid paths

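[Figure: topology with nodes R, N1, N2, X, Y and destination D; a failure (X) is marked next to N1. Messages and state shown: "N1 -> R: N1 X Y D" and "R: N1 D, N2 N1 D".]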


Ghost flushing

• If the new path is worse than the last announced path, and the route advertisement (MRAI) timer has not expired yet, send a withdrawal immediately.

• The withdrawal flushes “ghost” information.

• Reduces the convergence time because withdrawals are not delayed by MRAI, but does not help much with “Tlong.”

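[Figure: same topology with nodes R, N1, N2, X, Y and destination D; a failure (X) is marked next to N1. Messages and state shown: "N1 -> R: withdrawal" and "R: N1 D, N2 N1 D".]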
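The ghost flushing rule can be sketched as a decision made whenever a router's best path changes. This is an illustration under assumptions: "worse" is taken to mean a longer AS path, paths are tuples of node names, and the function name and action tuples are hypothetical.

```python
def on_best_path_change(new_path, last_announced, mrai_running):
    """Return the list of actions a router takes when its best path changes.

    Paths are tuples of node names; None means no path. "Worse" is assumed
    to mean "longer".
    """
    if new_path is None:
        return [("withdraw", None, "now")]              # withdrawals are never delayed
    if not mrai_running:
        return [("announce", new_path, "now")]
    actions = []
    worse = last_announced is not None and len(new_path) > len(last_announced)
    if worse:
        # Flush the stale ("ghost") path right away instead of letting
        # neighbors keep using it until the timer expires.
        actions.append(("withdraw", None, "now"))
    actions.append(("announce", new_path, "when MRAI expires"))
    return actions


old = ("N1", "D")
new = ("N1", "X", "Y", "D")
print(on_best_path_change(new, old, mrai_running=True))
```

The new, longer path is still held back by MRAI; only the removal of the stale path is sped up.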


BGP root cause notification

• Neither BGP-assertion nor ghost flushing works well in this topology.
  – Why?
  – BGP-assertion: 3 and 6 are both direct neighbors of 5, but their announcements may be inconsistent
  – BGP ghost flushing: the newer path is subject to the MRAI delay

• Explicitly send out link up/down information

• Essentially adds link-state information to BGP

• A sequence number is used to order the notifications.

• Open research problem: can you get rid of the sequence number?

[Figure: topology with nodes 0 through 6; a failure (X) is marked at node 0.]
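A rough sketch of the root cause notification idea: a link up/down notice carries a sequence number, stale notices are ignored, and a "down" notice lets a router discard every path that crosses the failed link at once. The class, the method names, and the example paths are hypothetical; the slide does not give the actual message format.

```python
def uses_link(path, link):
    """True if consecutive hops of `path` (a tuple of node ids) traverse `link`."""
    return any(frozenset(pair) == link for pair in zip(path, path[1:]))


class RcnRouter:
    """Keeps candidate paths and applies link up/down notifications in order."""

    def __init__(self, paths):
        self.paths = set(paths)
        self.latest = {}   # link -> (sequence number, status) of newest notice seen

    def on_notification(self, link, status, seq):
        """Apply a notification; return False if it is older than one already seen."""
        link = frozenset(link)
        if link in self.latest and seq <= self.latest[link][0]:
            return False   # stale or duplicate: the sequence number orders notices
        self.latest[link] = (seq, status)
        if status == "down":
            # Drop every path through the failed link in one step, instead of
            # exploring them one by one.
            self.paths = {p for p in self.paths if not uses_link(p, link)}
        return True


# Hypothetical candidate paths held by node 5 toward node 0.
router = RcnRouter([(5, 3, 2, 1, 0), (5, 6, 4, 1, 0)])
print(router.on_notification((1, 0), "down", seq=2))  # applied
print(router.paths)                                   # both paths crossed link 1-0
print(router.on_notification((1, 0), "up", seq=1))    # older notice, ignored
```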


Summary

• BGP has a slow convergence problem, among other problems

• Fixing slow convergence is a tradeoff among message overhead, processing overhead, and latency. We do not yet know the best solution to this problem.


Comments

• Measurement paper
  – Data
  – Data collection techniques
  – Data analysis

• When to write a measurement paper
