One Network Engineer - ausnog.net · 2015. 8. 25. · 844 million mobile daily active users on...


ONE

James Paussa, network infrastructure engineer

One Network Engineer

What Do We Do?

• A single engineer runs the network at a time

• Network monitoring and alarming

• Responsible for the entire production network

Facebook Scale

968 million daily active users on average

844 million mobile daily active users on average

1.31 billion mobile monthly active users

1.49 billion monthly active users

Machine to machine

Machine to user

Facebook Scale

Automation

What We Don’t Do

• On site work (cleaning fibres, LCs, etc)

• Working with remote hands

• Device deployment

Myths

• Automation fixes everything

• We’ve fixed everything

• Doesn’t apply

Link Imbalance

[Charts: Interface Utilization, axis 0 to 100]

Link Imbalance

srsly wtf.

Link Imbalance

• Intercluster links only

• >80% utilisation

• Migratory

What Now?

• Detection

• Mitigation

Imbalance Detection

Aggregated Interfaces

Member Statistics

Compare Utilisation

Check Fails Notify Oncall
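The detection flow above (take each aggregated interface, read its member statistics, compare utilisation, notify the on-call if the check fails) could be sketched roughly as follows. This is only an illustration: the function name, the input data shape, the bundle and member names, and the 20-point spread threshold are all assumptions, not details from the talk.

```python
from typing import Dict, List

# Hypothetical threshold: largest allowed gap, in percentage points,
# between the busiest and quietest member of one aggregated interface.
MAX_SPREAD_PCT = 20.0


def find_imbalanced(bundles: Dict[str, Dict[str, float]],
                    max_spread: float = MAX_SPREAD_PCT) -> List[str]:
    """Return names of aggregated interfaces whose members differ too much.

    `bundles` maps an aggregate name to {member name: utilisation in %}.
    """
    failed = []
    for bundle, members in bundles.items():
        if len(members) < 2:
            continue  # a single member has nothing to compare against
        spread = max(members.values()) - min(members.values())
        if spread > max_spread:
            failed.append(bundle)
    return failed


if __name__ == "__main__":
    # Made-up sample data for two aggregated interfaces.
    stats = {
        "bundle1": {"member-a": 48.0, "member-b": 52.0},  # balanced
        "bundle2": {"member-a": 91.0, "member-b": 35.0},  # one hot member
    }
    for bundle in find_imbalanced(stats):
        # A real check would page the on-call engineer here.
        print(f"notify oncall: {bundle} is imbalanced")
```

Comparing the spread between members, rather than an absolute utilisation level, keeps the check independent of how busy the bundle is overall.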

Imbalance Mitigation

SIP DIP SPORT DPORT Hash Key

Imbalance Mitigation

Roll Hash
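To illustrate why rolling the hash moves traffic: the member link for a flow is chosen from a hash over SIP, DIP, SPORT and DPORT, so changing the hash seed remaps flows onto different members. The sketch below is a software stand-in only. Real devices hash in hardware with vendor-specific functions; SHA-256, the seed handling and the sample flows here are assumptions made for illustration.

```python
import hashlib
from collections import Counter
from typing import Iterable, Tuple

FlowKey = Tuple[str, str, int, int]  # (SIP, DIP, SPORT, DPORT)


def pick_member(flow: FlowKey, seed: int, n_members: int) -> int:
    """Map a flow to a member link index from its 4-tuple and a seed."""
    sip, dip, sport, dport = flow
    key = f"{seed}|{sip}|{dip}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % n_members


def distribution(flows: Iterable[FlowKey], seed: int, n_members: int) -> Counter:
    """Count how many flows land on each member link for a given seed."""
    return Counter(pick_member(f, seed, n_members) for f in flows)


if __name__ == "__main__":
    # Made-up flows between two hosts, differing only in source port.
    flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 443) for i in range(1000)]
    print("seed 1:", sorted(distribution(flows, 1, 4).items()))
    # "Rolling" the hash is modelled here as picking a new seed.
    print("seed 2:", sorted(distribution(flows, 2, 4).items()))
```

A given flow always maps to the same member under one seed, which preserves packet ordering; only a seed change reshuffles which flows share a link.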

Imbalance Mitigation

Link Imbalance

Detect → FBAR → Roll Hash → Resolved

Link Imbalance

ZOMG!! WTF!! NO, NO, NO! &(#$&*(

Link Imbalance

[Chart: RX Window, during]

Link Imbalance

[Chart: Open TCP connections, during and after]

Cache DB

It Isn’t (always) The Network

[Diagrams: Cache, DB]

Link Imbalance - Lessons Learned

• Resolved issues

• Root Cause

• Software helps

• Service owner identified

• Resolution time

• Small loss, significant impact

3 Months In A Leaky Boat

!Capacity

Health

Errors

Memory Issues

DIP        Loss   Latency
1.1.1.1    0.1    10
2.2.2.2    0      10

[Diagram: Servers]

Detection

Loss Effects on Throughput

[Chart: Throughput (mbps), axis 0 to 1000, against Packet Loss %, axis 0.00000% to 1.00000%]

[Chart: Throughput (mbps), axis 0 to 350, against Packet Loss %, axis 0.00000% to 1.00000%, with an RTT legend]
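For a rough feel of why small loss rates matter so much, one widely used back-of-the-envelope model is the Mathis approximation for Reno-style TCP: throughput ≈ (MSS / RTT) × (C / √p), with C ≈ 1.22 and p the loss probability. The sketch below evaluates it. It is not the measurement method behind the charts, and the 1460-byte MSS and 25 ms RTT are assumed values chosen only for illustration.

```python
import math


def mathis_throughput_mbps(loss: float, rtt_s: float,
                           mss_bytes: int = 1460) -> float:
    """Approximate steady-state TCP throughput ceiling in Mbit/s.

    `loss` is a probability (0.01 means 1%), `rtt_s` is in seconds.
    """
    if loss <= 0:
        return math.inf  # the model gives no bound when there is no loss
    c = math.sqrt(3 / 2)  # roughly 1.22
    bits_per_s = (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss))
    return bits_per_s / 1e6


if __name__ == "__main__":
    # Assumed RTT of 25 ms; loss values follow the chart's axis range.
    for loss_pct in (0.0001, 0.001, 0.01, 0.1, 1.0):
        mbps = mathis_throughput_mbps(loss_pct / 100, rtt_s=0.025)
        print(f"loss {loss_pct:.4f}% -> about {mbps:.1f} Mbit/s")
```

Because throughput scales with 1/√p in this model, a hundredfold increase in loss cuts the ceiling by a factor of ten.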

Different algos?

Reno, Cubic, Vegas, Illinois

Recovery time

[Chart: Throughput (mbps), axis 0 to 350, against Time (sec), axis 0 to 120, at 1% packet loss, for Reno, Cubic, Vegas, Illinois]

So, wait, how does this apply to me?

Alarms

[Gauge: 0 to 70k]

70k!

!Capacity

Health

Errors

Interface Issues

!!!!!!!!!!!…………………………………………………..

Alarms Now

[Gauge: 0 to 70]

<100 (-99.99%)

Automation - One Month

3.37b

0.99%

750k

99.6%

Why?

Sleep

Why?

750,000 * 2 = 1,500,000

1,500,000 / 60 = 25,000

25,000 / 160 = 150+
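The arithmetic above reads as a staffing estimate, and can be reproduced in a few lines. The units are not stated on the slide, so the labels here are assumptions: 2 minutes of manual effort per event, 60 minutes per hour, and 160 working hours per engineer per month.

```python
def engineers_needed(events: int, minutes_per_event: float,
                     hours_per_engineer: float) -> float:
    """Engineers required to handle `events` by hand in one month."""
    total_minutes = events * minutes_per_event
    total_hours = total_minutes / 60
    return total_hours / hours_per_engineer


if __name__ == "__main__":
    # 750,000 events come from the slide; the per-event time and the
    # monthly hours are assumed interpretations of the 2 and the 160.
    print(engineers_needed(750_000, 2, 160))  # 156.25
```

The result, 156.25, matches the slide's "150+".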

Lessons Learned & Takeaways

Use what you’ve got

Prototype early, iterate often

Duct tape keeps things running

Spend the time to root cause issues

The sooner the robots take over the better

Why?

?
