100G WAN OpenFlow Deployment experiences
Edward Balas
Wednesday, January 9, 13
Agenda
Review how and why Internet2 came to have a 100G OpenFlow network
Discuss how we manage the risks with testing
Talk about what we have learned
How/Why 100G OpenFlow
Membership charged Internet2 to build a national Ethernet exchange fabric
interest in SDN and specifically OpenFlow
Cost effective
customer self-service (web + API)
BTOP funded backbone upgrade
NDDI mission: create an open, global scale SDN platform to enable network innovation and global scientific research.
How/Why 100G OpenFlow
NDDI
deploy a 10G x 6 NEC PF8520 nationwide network
baby step
AL2S
100G multi-vendor BTOP funded national network
14 Brocade MLXe-16 switches, with Junipers next
Partnership between Internet2, Indiana University, and Stanford
OE-SS software
sub-second provisioning
auto failover to backup paths
OpenFlow hardware independence
auto discovery of new equipment and circuits
IDCP support for inter-domain
integrated measurement
high availability baked in
OE-SS architecture
controller is designed as a cluster
Corosync
multi-master MySQL
DRBD
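The active/standby pattern behind this cluster can be sketched as follows. This is an illustrative model only, not OE-SS code: Corosync provides membership and heartbeat, while multi-master MySQL and DRBD keep state replicated so any node can take over. All names here (ClusterNode, elect_active, the node names) are hypothetical.

```python
# Hypothetical sketch of an active/standby controller cluster:
# any healthy standby can be promoted when the active node fails,
# because replicated state (MySQL/DRBD) makes every node equivalent.

class ClusterNode:
    def __init__(self, name, is_active=False):
        self.name = name
        self.is_active = is_active
        self.alive = True           # heartbeat status (Corosync's job)

def elect_active(nodes):
    """Promote the first healthy standby if the active node has failed."""
    healthy_active = [n for n in nodes if n.is_active and n.alive]
    if healthy_active:
        return healthy_active[0]    # current master is still fine
    for n in nodes:
        if n.alive:
            n.is_active = True      # promote standby to master
            return n
    return None                     # total cluster failure

nodes = [ClusterNode("chicago-a", is_active=True), ClusterNode("chicago-b")]
nodes[0].alive = False              # simulate loss of the active node
new_master = elect_active(nodes)    # standby takes over
```

In the real deployment the promotion decision is made by the cluster stack rather than application code; the sketch only shows why replicated state makes the election itself trivial.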
Challenges
Timeframe
9 months for production code and deployed NDDI network
10 additional months to deploy AL2S network
Ecosystem
OpenFlow largely untested
brand new software stack
Multiple Vendors
Testing
Automated (mostly) test suite using Jenkins
Programmable topology with Glimmerglass (thanks to HOPI)
Several test points
2 devices per vendor
test every new vendor or tool chain software release
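The per-release regression loop above can be sketched as a small matrix runner of the kind a Jenkins job might drive. The vendor, device, and suite names are invented for illustration; `run_suite` stands in for the real test harness.

```python
# Illustrative regression matrix: every new vendor firmware or tool-chain
# release is run through the same suites on two devices per vendor.

import itertools

DEVICES = {                          # hypothetical inventory
    "vendor_a": ["switch-a1", "switch-a2"],
    "vendor_b": ["switch-b1", "switch-b2"],
}
SUITES = ["oftest_protocol", "oess_base", "oess_performance"]

def run_suite(device, suite):
    """Placeholder for the real Jenkins-driven test job."""
    return True                      # pretend everything passes

def regression_matrix():
    results = {}
    for vendor, devices in DEVICES.items():
        for device, suite in itertools.product(devices, SUITES):
            results[(device, suite)] = run_suite(device, suite)
    return results

results = regression_matrix()        # 2 vendors x 2 devices x 3 suites
failed = [key for key, ok in results.items() if not ok]
```

Keeping the matrix explicit like this is what makes the "concurrent tests for multiple vendors" lesson below practical: independent (device, suite) cells can be dispatched in parallel.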
Types of testing
General
OFTest for protocol adherence
Implementation behaviors not mandated in spec
OE-SS
base functionality
performance
burn in / stability
Things we have seen
flow_mod processing speed issues
incomplete openflow spec support
Layer 2 or Layer 3 matching, but not both
no viable QoS mechanisms
key actions not always supported
inconsistencies in behavior
session establishment
Testing Lessons
Expect to be the system integrator; interop is on you
a dedicated test infrastructure is essential
automation is a *very* good investment
make sure you can run concurrent tests if you have multiple vendors
Don’t use exclusively black box tests
Protocol Lessons
OpenFlow is a youthful protocol, vague in places, creating complexity as vendors get creative.
vendors have incomplete protocol support
multi-vendor == lowest common denominator feature set
need to test components together as a system
looking forward to 1.3
Controller Placement
Central Controller cluster in Chicago
hot standby with synched state
failover within the cluster causes the controller-to-switch TCP session to reset
Backup cluster in Bloomington for DR
controller to switch latency
28ms RTT avg
64ms RTT worst case
Controller Placement
OE-SS mostly proactive, reacting to topology changes
backup paths are preconfigured; failover involves one flow_mod at each ingress
controller takes ~70 ms to receive a port-down message and send a flow_mod in response (some low-hanging fruit here)
~100 ms to respond to a failover assuming average latency (ignoring time in the switch)
IGP or Rapid Spanning Tree: ~n secs
MPLS path protect: ~100 ms
MPLS Fast Reroute: ~50 ms
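The proactive failover described above can be sketched as a tiny event handler: because backup paths are pre-installed, recovering from a failure only requires pushing one pre-computed flow_mod at each ingress switch. The circuit and switch names, and the flow_mod structure, are simplified stand-ins rather than OE-SS data.

```python
# Sketch of proactive failover: one flow_mod per ingress swaps a circuit
# onto its pre-installed backup path after a port-down event.

# circuit -> {ingress switch: pre-computed backup flow_mod} (hypothetical)
BACKUP_FLOW_MODS = {
    "circuit-42": {
        "ingress-chi": {"match": "vlan 100", "action": "output:7"},
        "ingress-nyc": {"match": "vlan 100", "action": "output:3"},
    },
}

def on_port_down(failed_circuits, send_flow_mod):
    """React to a port-down event with one flow_mod per ingress."""
    touched = []
    for circuit in failed_circuits:
        for switch, fm in BACKUP_FLOW_MODS.get(circuit, {}).items():
            send_flow_mod(switch, fm)   # single mod moves traffic to backup
            touched.append(switch)
    return touched

sent_mods = []
touched = on_port_down(["circuit-42"],
                       lambda sw, fm: sent_mods.append((sw, fm)))
# Budget check: ~28 ms avg RTT + ~70 ms controller processing ~= 100 ms,
# matching the slide's failover estimate.
```

The key design point is that the expensive work (path computation, rule installation on the backup path) happens at provisioning time, leaving only a constant, small amount of work on the failure path.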
Placement Considerations
The architectural simplicity of a central controller is very attractive
A central controller means the management network is *critical*: if a fiber cut disrupts both the management and OpenFlow networks, then OE-SS failover blocks on management network failover.
If the management network is resilient, a central location is a reasonable choice
A distributed controller will still depend heavily on the management network to respond to non-local events.
Limits of OpenFlow 1.0
all actions require a round trip to the controller
Fast Failover port groups in > 1.0 to avoid controller round trip
controller failure
In 1.0, after loss of the controller connection, normal rules are removed when the switch enters emergency mode. Some vendors ignore this and support non-stop forwarding until the session is re-established; others do not.
>1.0 has fail secure mode, which allows for non-stop forwarding
>1.0 also supports multiple active controllers
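The contrast between the 1.0 and post-1.0 loss-of-controller behaviors can be modeled in a few lines. This is purely an illustrative model of the two behaviors described above, not switch code; the rule strings and mode names are invented.

```python
# Toy model of switch behavior on loss of the controller session:
# a strict 1.0 switch flushes its normal flow table when it enters
# emergency mode; a >1.0 switch in fail-secure mode keeps forwarding
# with its existing rules.

def on_controller_loss(flow_table, mode):
    if mode == "1.0-emergency":
        return []                    # normal rules flushed; traffic disrupted
    elif mode == "fail-secure":
        return list(flow_table)      # non-stop forwarding with existing rules
    raise ValueError(f"unknown mode: {mode}")

rules = ["vlan100->port7", "vlan200->port3"]
flushed = on_controller_loss(rules, "1.0-emergency")
kept = on_controller_loss(rules, "fail-secure")
```

The operational consequences of these two behaviors are exactly what the next slide walks through.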
Impact of these limits
for switches that adhere closely to the 1.0 spec
a loss of the controller session causes all rules to be flushed and re-added; if the switch processes flow_mods slowly, this can be a multi-minute disruption
for switches that support non-stop forwarding
VLAN traffic will be black-holed if the controller-session disruption also coincides with a circuit outage in the forwarding path
Workarounds
modify controller to not delete switch rules on session start
hope your switches support non-stop forwarding
create backup DR cluster
future effort
modify controller cluster so that slave controller maintains tcp session
use geographically distributed cluster + multiple active controllers
migrate to 1.3
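The first workaround, not deleting switch rules on session start, amounts to reconciling desired state against what the switch already has instead of flushing everything. A minimal sketch of that diff, with invented rule strings and no real OpenFlow types:

```python
# Sketch of diff-based reconciliation on controller reconnect: instead
# of delete-all + re-add (a multi-minute disruption on slow switches),
# compute only the missing and stale rules and send those flow_mods.

def reconcile(desired, on_switch):
    """Return (to_add, to_delete) instead of a full table flush."""
    desired, on_switch = set(desired), set(on_switch)
    to_add = desired - on_switch      # rules the switch is missing
    to_delete = on_switch - desired   # stale rules to remove
    return to_add, to_delete

desired = {"vlan100->port7", "vlan200->port3"}
on_switch = {"vlan100->port7", "vlan300->port9"}   # one stale rule
to_add, to_delete = reconcile(desired, on_switch)
```

In the common case where the switch state is intact after a session blip, both sets are empty and reconnection causes zero data-plane disruption, which is the whole point of the workaround.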