Modern & Scalable Network Designs for Hadoop
(slides: files.meetup.com/12607092/nyc-hadoop-meetup-sm.pdf)


Benoît “tsuna” Sigoure
Member of the Yak Shaving Staff
tsuna@arista.com · @tsunanet

A Brief History of Network Software

[Timeline: 1970 → 1980 → 1990 → 2000 → 2010]
• 1970s–1980s: custom monolithic embedded OS
• 1990s–2000s: modified BSD, QNX, or Linux kernel base
• 2010s: “The switch is just a Linux box”; EOS = Linux, driving a custom ASIC

Inside a Modern Switch

[Photos: Arista 7050Q-16 internals]
• 2 PSUs, 4 fans
• x86 CPU with DDR3 RAM
• USB flash, SATA SSD
• Network chip(s)

It’s All About The Software

Command API · Zero Touch Replacement · VmTracer · CloudVision · Email · Wireshark · Fast Failover · DANZ · Fabric Monitoring · OpenTSDB · #!/bin/bash · C++ · $ sudo yum install <pkg>

Command API: or, a CLI over JSON-RPC

Traditional approach: “screen scraping”

    import pexpect, re

    connector = pexpect.spawn("ssh admin@1.2.3.4")
    connector.expect(".ssword:*")
    connector.sendline(password)
    index = connector.expect([">", "#"])
    if index == 0:
        connector.sendline("enable")
        index = connector.expect(["assword", "#"])
        if index == 0:
            connector.sendline(enable)
            connector.expect("#")
    connector.sendline("show version")
    connector.expect("#")
    output = connector.before
    switches = re.findall(r"^(\*?)\s+(\d)\s+\d+\s+([A-Z\d-]+)",
                          output, re.MULTILINE)
    for master, num, model in switches:
        ...

Compare to: a standard HTTP API

    from jsonrpclib import Server

    switch = Server("http://admin:" + passwrd + "@1.2.3.4/command-api")

    response = switch.runCmds(1, ["show privilege"])
    if response[0]["privilegeLevel"] != 15:
        switch.runCmds(1, [{"cmd": "enable", "input": enable}])

    response = switch.runCmds(1, ["show version"])
    for num, switch in enumerate(response[0]["stack"]):
        # use switch["model"] and switch["master"]
        ...
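Under the hood, each `runCmds(...)` call is just a JSON-RPC request POSTed to `/command-api`. A minimal sketch of what that request body looks like (the field layout follows the JSON-RPC 2.0 convention that jsonrpclib uses; the exact on-the-wire eAPI encoding is my assumption, not shown on the slides):

```python
import json

def eapi_request(cmds, req_id=1):
    """Approximate shape of the JSON-RPC body behind switch.runCmds(1, cmds).
    Illustration only: jsonrpclib builds and sends this for you."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": [1, cmds],  # [API version, list of CLI commands]
        "id": req_id,
    })

req = json.loads(eapi_request(["show version"]))
assert req["method"] == "runCmds"
assert req["params"] == [1, ["show version"]]
```

Because the transport is plain HTTP + JSON, any language with an HTTP client can drive the switch, with none of the prompt-matching fragility of the pexpect version above.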

Integration: OpenStack Neutron

[Diagram: Neutron → Arista driver → EAPI → datacenter leaf layer; Red and Blue VLANs span TOR1–TOR3; VM1–VM4 spread across Hosts 1–6.]

Three leaf/spine designs, all at a 1:3 oversubscription ratio (each rack holds Servers 1 through N):

• 5 to 7 racks, 7 to 9 switches; 2x40Gbps uplinks per rack; scale: 240 to 336 nodes
• 56 racks, 8 core switches; 8x10Gbps uplinks per rack; scale: 3136 nodes
• 1152 racks, 16 core modular switches; 16x10Gbps uplinks per rack; scale: 55296 nodes (oversubscription ratio still 1:3)
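As a sanity check, the node counts quoted for the three designs are just racks times servers per rack. The per-rack server counts below are my inference from the totals, not stated on the slides:

```python
# Inferred servers-per-rack behind each quoted node count:
assert 5 * 48 == 240 and 7 * 48 == 336    # small design: 48 servers/rack
assert 56 * 56 == 3136                     # mid design: 56 servers/rack
assert 1152 * 48 == 55296                  # large design: 48 servers/rack
```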

Build Layer 3 Networks

[Photo: “3 layers of pink cake” by Catherine Scott]

Layer 2 vs Layer 3

• Layer 2: “data link layer”; an Ethernet “bridge”
• Layer 3: “network layer”; an IP “router”

[Diagram: hosts A–H under both designs. Layer 2, even “with vendor magic”: uplinks blocked by STP, so traffic between A–D and E–F (and G–H) funnels through a single path. Layer 3: an ECMP default route (0.0.0.0/0 via core1 and via core2) uses both uplinks.]
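The “0.0.0.0/0 via core1 via core2” ECMP route works because the router hashes each flow’s 5-tuple to pick a next hop, so packets of one flow always take the same path while different flows spread across the cores. A hedged sketch (next-hop names are illustrative, and real switches use a hardware hash, not MD5):

```python
import hashlib

NEXT_HOPS = ["core1", "core2"]  # equal-cost next hops for 0.0.0.0/0

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Pick a next hop by hashing the flow 5-tuple (per-flow load balancing)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

# Every packet of a given flow hashes to the same core switch:
assert ecmp_next_hop("10.0.0.1", "10.0.1.1", 40000, 80) == \
       ecmp_next_hop("10.0.0.1", "10.0.1.1", 40000, 80)
```

Per-flow (rather than per-packet) balancing is what keeps TCP segments in order across the two uplinks.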

1G vs 10G

• 1Gbps is today what 100Mbps was yesterday
• New-generation servers come with 10G LAN on board
• At 1Gbps, the network is one of the first bottlenecks
• A single HBase client can very easily push over 4Gbps to HBase

[Chart: TeraGen wall-clock time in seconds, 1Gbps vs 10Gbps: 3.5x faster on 10Gbps.]

Jumbo Frames

[Charts, MTU=1500 vs MTU=9212, one TeraGen run: context switches (millions) down 11%; soft-IRQ CPU time (thousand jiffies) down 15%; packets per TeraGen (millions) 3x less.]
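The mechanism behind those charts: each TCP segment carries at most MTU minus 40 bytes of payload (20B IPv4 + 20B TCP header, no options), so a larger MTU means fewer packets, hence fewer interrupts and context switches per byte moved. A quick back-of-the-envelope:

```python
def segments_for(payload_bytes, mtu):
    """Minimum number of full TCP segments to carry a payload at a given MTU."""
    mss = mtu - 40                    # strip IPv4 + TCP headers
    return -(-payload_bytes // mss)   # ceiling division

gb = 10**9
std, jumbo = segments_for(gb, 1500), segments_for(gb, 9212)
assert std > 6 * jumbo  # theoretical reduction is roughly 6x per GB of payload
```

The theoretical ~6x reduction is larger than the 3x the slide measured, presumably because not every packet on the wire is a full-sized data segment (ACKs, control traffic, and partial segments dilute the effect).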

Deep Buffers

[Photo: «Hardware» buffers by Roger Ferrer Ibáñez]

[Chart: packet buffer per 10G port (MB): Arista 7500: 48; Arista 7500E: 128; industry average: 1.5.]

[Chart: packets dropped per TeraGen. Setup: two switches with 16 hosts x 10G each; 4x10G uplinks ⇒ 4:1 oversubscription, 3x10G ⇒ 5.33:1. With a 1MB buffer, at both 4:1 and 5.33:1, drops reach hundreds of thousands per run (about 1k TCP slow starts/sec). With a 48MB buffer at 5.33:1: zero drops.]
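The oversubscription ratios quoted in that test fall straight out of the downlink-to-uplink bandwidth ratio:

```python
def oversub(down_gbps, up_gbps):
    """Oversubscription ratio: total downlink bandwidth over total uplink bandwidth."""
    return down_gbps / up_gbps

# 16 hosts x 10G of downlink per switch, with 4 or 3 x 10G uplinks:
assert oversub(16 * 10, 4 * 10) == 4.0             # "4x10G => 4:1"
assert round(oversub(16 * 10, 3 * 10), 2) == 5.33  # "3x10G => 5.33:1"
```

The deeper the oversubscription, the more the uplink buffers have to absorb incast bursts, which is why buffer depth dominates the drop counts above.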

RAIL: Making the Network Work Better With Hadoop & HBase

Host Failures in Hadoop Clusters

[Diagram: Rack-1 through Rack-N, each with Servers 1 through N.]

• Hardware failures
• Kernel panics
• Operator errors
• NIC driver bugs

Host Failures for HBase & Hadoop

• RegionServer: wait for the ZooKeeper lease to time out. Typical: ~30s
• DataNode: wait for enough heartbeats to time out for the NameNode to declare it dead. Typical: ~10min (or ~30s with dfs.namenode.check.stale.datanode; see HDFS-3703)

How Does This Work?

• Redirect the dead host’s traffic to the switch
• Have the switch send TCP resets (EOS = Linux)
• Immediately kills all active and new flows

[Diagram: hosts 10.4.5.1–10.4.5.3 behind gateway 10.4.5.254, and 10.4.6.1–10.4.6.3 behind 10.4.6.254; when 10.4.5.2 dies, the switch answers for 10.4.5.2.]
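At the packet level, “send TCP resets” means the switch forges TCP segments with the RST flag set on behalf of the dead host, so peers abort their connections instantly instead of waiting out application-level timeouts. A minimal sketch of such a header in pure Python (illustration only; a real switch builds the full packet with IP headers and checksums, and the field layout here is just the standard 20-byte TCP header):

```python
import struct

def tcp_rst_header(src_port, dst_port, seq):
    """Build a minimal 20-byte TCP header with only the RST flag set."""
    offset_flags = (5 << 12) | 0x04  # data offset = 5 words; RST bit
    return struct.pack("!HHIIHHHH",
                       src_port, dst_port,
                       seq, 0,            # sequence, acknowledgment
                       offset_flags,
                       0, 0, 0)           # window, checksum, urgent ptr

hdr = tcp_rst_header(80, 54321, 12345)
flags = struct.unpack("!H", hdr[12:14])[0] & 0x3F
assert flags == 0x04  # only RST is set
```

Receiving this RST is what lets the RegionServer’s and DataNode’s peers fail over immediately rather than after ~30s or ~10min.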

MapReduce Tracer

• Keep track of nodes in the Hadoop cluster
• Poll statistics about Hadoop activity:
  ‣ What jobs are running, when, where
  ‣ How “big” each job is, network-wise
  ‣ Historical data of MapReduce activity
  ‣ MapReduce vs HDFS traffic accounting
• CLI integration for real-time visibility
• Multi-cluster support

Demo setup:

[Diagram: Rack-1 (Servers 1–16, including the NameNode/JobTracker, each running DataNode/TaskTracker) and Rack-2 (Servers 17–32, each running DataNode/TaskTracker); 40G uplinks; MapReduce Tracer running on both TORs.]

• 2 racks, 16 hosts each
• L3 network (/27 per rack, ECMP)
• Both TORs use the following additional config:

    monitor hadoop
      cluster batch
        jobtracker host 172.19.33.1
        jobtracker user tsuna
        tasktracker http-port 10060
        no shutdown

• No change required to Hadoop: no plugin, config change, or additional daemon to run

Hadoop Versions Supported

• Apache Hadoop: 0.20.203 to 1.2.1
• Cloudera’s distribution: CDH3 and CDH4
• HortonWorks’ distribution: HDP 1.1, 1.2, 1.3
• Microsoft HDInsight 0.10
• Incompatible: Apache 0.20.x and earlier; 0.21.x, 0.22.x, 0.23.x; 2.0.x

Thank You
