100G WAN OpenFlow Deployment experiences
Edward Balas
Wednesday, January 9, 13
Agenda
Review how and why Internet2 came to have a 100G OpenFlow network
Discuss how we manage the risks with testing
Talk about what we have learned
How/Why 100G OpenFlow
Membership charged Internet2 to build a national Ethernet exchange fabric
interest in SDN and specifically OpenFlow
Cost effective
customer self-service (web + API)
BTOP funded backbone upgrade
NDDI mission: create an open, global scale SDN platform to enable network innovation and global scientific research.
How/Why 100G OpenFlow
NDDI
deploy a 10G x 6 NEC PF8520 nationwide network
baby step
AL2S
100G multi-vendor BTOP funded national network
14 Brocade MLXe-16 switches, with Junipers next
Partnership between Internet2, Indiana University, and Stanford
OE-SS software
sub-second provisioning
auto failover to backup paths
OpenFlow hardware independence
auto discovery of new equipment and circuits
IDCP support for inter-domain
integrated measurement
high availability baked in
OE-SS architecture
controller is designed as a cluster
Corosync
multi-master MySQL
DRBD
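The active/standby pattern behind this cluster can be sketched as follows. This is an illustrative model only, not OE-SS code: Corosync provides membership and heartbeat, while multi-master MySQL and DRBD keep state replicated so any node can take over. All names here (ClusterNode, elect_active, the node names) are hypothetical.

```python
# Hypothetical sketch of an active/standby controller cluster:
# any healthy standby can be promoted when the active node fails,
# because replicated state (MySQL/DRBD) makes every node equivalent.

class ClusterNode:
    def __init__(self, name, is_active=False):
        self.name = name
        self.is_active = is_active
        self.alive = True           # heartbeat status (Corosync's job)

def elect_active(nodes):
    """Promote the first healthy standby if the active node has failed."""
    healthy_active = [n for n in nodes if n.is_active and n.alive]
    if healthy_active:
        return healthy_active[0]    # current master is still fine
    for n in nodes:
        if n.alive:
            n.is_active = True      # promote standby to master
            return n
    return None                     # total cluster failure

nodes = [ClusterNode("chicago-a", is_active=True), ClusterNode("chicago-b")]
nodes[0].alive = False              # simulate loss of the active node
new_master = elect_active(nodes)    # standby takes over
```

In the real deployment the promotion decision is made by the cluster stack rather than application code; the sketch only shows why replicated state makes the election itself trivial.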
Challenges
Timeframe
9 months for production code and deployed NDDI network
10 additional months to deploy AL2S network
Ecosystem
OpenFlow largely untested
brand new software stack
Multiple Vendors
Testing
Automated (mostly) test suite using Jenkins
Programmable topology with Glimmerglass (thanks to HOPI)
Several test points
2 devices per vendor
test every new vendor or tool chain software release
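The per-release regression loop above can be sketched as a small matrix runner of the kind a Jenkins job might drive. The vendor, device, and suite names are invented for illustration; `run_suite` stands in for the real test harness.

```python
# Illustrative regression matrix: every new vendor firmware or tool-chain
# release is run through the same suites on two devices per vendor.

import itertools

DEVICES = {                          # hypothetical inventory
    "vendor_a": ["switch-a1", "switch-a2"],
    "vendor_b": ["switch-b1", "switch-b2"],
}
SUITES = ["oftest_protocol", "oess_base", "oess_performance"]

def run_suite(device, suite):
    """Placeholder for the real Jenkins-driven test job."""
    return True                      # pretend everything passes

def regression_matrix():
    results = {}
    for vendor, devices in DEVICES.items():
        for device, suite in itertools.product(devices, SUITES):
            results[(device, suite)] = run_suite(device, suite)
    return results

results = regression_matrix()        # 2 vendors x 2 devices x 3 suites
failed = [key for key, ok in results.items() if not ok]
```

Keeping the matrix explicit like this is what makes the "concurrent tests for multiple vendors" lesson below practical: independent (device, suite) cells can be dispatched in parallel.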
Types of testing
General
OFTest for protocol adherence
Implementation behaviors not mandated in spec
OE-SS
base functionality
performance
burn in / stability
Things we have seen
flow_mod processing speed issues
incomplete openflow spec support
Layer 2 or Layer 3 matching, but not both
no viable QoS mechanisms
key actions not always supported
inconsistencies in behavior
session establishment
Testing Lessons
Expect to be the system integrator; interop is on you
a dedicated test infrastructure is essential
automation is a *very* good investment
make sure you can run concurrent tests if you have multiple vendors
Don’t use exclusively black box tests
Protocol Lessons
OpenFlow is a youthful protocol, vague in places, creating complexity as vendors get creative.
vendors have incomplete protocol support
multi-vendor == lowest common denominator feature set
need to test components together as a system
looking forward to 1.3
Controller Placement
Central Controller cluster in Chicago
hot standby with synched state
failover within the cluster causes the controller-to-switch TCP session to reset
Backup cluster in Bloomington for DR
controller to switch latency
28ms RTT avg
64ms RTT worst case
Controller Placement
OE-SS mostly proactive, reacting to topology changes
backup paths are preconfigured; failover involves one flow_mod at each ingress
controller takes ~70 ms to receive a port-down message and send a flow_mod in response (some low-hanging fruit here)
~100 ms to respond to a failover assuming average latency (ignoring time in the switch)
IGP or Rapid Spanning Tree: ~n secs
MPLS path protect: ~100 ms
MPLS Fast Reroute: ~50 ms
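The proactive failover described above can be sketched as a tiny event handler: because backup paths are pre-installed, recovering from a failure only requires pushing one pre-computed flow_mod at each ingress switch. The circuit and switch names, and the flow_mod structure, are simplified stand-ins rather than OE-SS data.

```python
# Sketch of proactive failover: one flow_mod per ingress swaps a circuit
# onto its pre-installed backup path after a port-down event.

# circuit -> {ingress switch: pre-computed backup flow_mod} (hypothetical)
BACKUP_FLOW_MODS = {
    "circuit-42": {
        "ingress-chi": {"match": "vlan 100", "action": "output:7"},
        "ingress-nyc": {"match": "vlan 100", "action": "output:3"},
    },
}

def on_port_down(failed_circuits, send_flow_mod):
    """React to a port-down event with one flow_mod per ingress."""
    touched = []
    for circuit in failed_circuits:
        for switch, fm in BACKUP_FLOW_MODS.get(circuit, {}).items():
            send_flow_mod(switch, fm)   # single mod moves traffic to backup
            touched.append(switch)
    return touched

sent_mods = []
touched = on_port_down(["circuit-42"],
                       lambda sw, fm: sent_mods.append((sw, fm)))
# Budget check: ~28 ms avg RTT + ~70 ms controller processing ~= 100 ms,
# matching the slide's failover estimate.
```

The key design point is that the expensive work (path computation, rule installation on the backup path) happens at provisioning time, leaving only a constant, small amount of work on the failure path.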
Placement Considerations
The architectural simplicity of a central controller is very attractive
A central controller means the management network is *critical*: if a fiber cut disrupts both the management and OpenFlow networks, then OE-SS failover blocks on management network failover.
If the management network is resilient, a central location is a reasonable choice
A distributed controller will still depend heavily on the management network to respond to non-local events.
Limits of OpenFlow 1.0
all actions require a round trip to the controller
Fast Failover port groups in > 1.0 to avoid controller round trip
controller failure
In 1.0, after loss of the controller connection, normal rules are removed when the switch enters emergency mode. Some vendors ignore this and support non-stop forwarding until the session is re-established; others do not.
>1.0 has fail secure mode, which allows for non-stop forwarding
>1.0 also supports multiple active controllers
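The contrast between the 1.0 and post-1.0 loss-of-controller behaviors can be modeled in a few lines. This is purely an illustrative model of the two behaviors described above, not switch code; the rule strings and mode names are invented.

```python
# Toy model of switch behavior on loss of the controller session:
# a strict 1.0 switch flushes its normal flow table when it enters
# emergency mode; a >1.0 switch in fail-secure mode keeps forwarding
# with its existing rules.

def on_controller_loss(flow_table, mode):
    if mode == "1.0-emergency":
        return []                    # normal rules flushed; traffic disrupted
    elif mode == "fail-secure":
        return list(flow_table)      # non-stop forwarding with existing rules
    raise ValueError(f"unknown mode: {mode}")

rules = ["vlan100->port7", "vlan200->port3"]
flushed = on_controller_loss(rules, "1.0-emergency")
kept = on_controller_loss(rules, "fail-secure")
```

The operational consequences of these two behaviors are exactly what the next slide walks through.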
Impact of these limits
for switches that adhere closely to the 1.0 spec
a loss of the controller session causes all rules to be flushed and re-added; if the switch processes flow_mods slowly, this can be a multi-minute disruption
for switches that support non-stop forwarding
VLAN traffic will be black-holed if the controller-session disruption also coincides with a circuit outage in the forwarding path
Workarounds
modify controller to not delete switch rules on session start
hope your switches support non-stop forwarding
create backup DR cluster
future effort
modify controller cluster so that slave controller maintains tcp session
use geographically distributed cluster + multiple active controllers
migrate to 1.3
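The first workaround, not deleting switch rules on session start, amounts to reconciling desired state against what the switch already has instead of flushing everything. A minimal sketch of that diff, with invented rule strings and no real OpenFlow types:

```python
# Sketch of diff-based reconciliation on controller reconnect: instead
# of delete-all + re-add (a multi-minute disruption on slow switches),
# compute only the missing and stale rules and send those flow_mods.

def reconcile(desired, on_switch):
    """Return (to_add, to_delete) instead of a full table flush."""
    desired, on_switch = set(desired), set(on_switch)
    to_add = desired - on_switch      # rules the switch is missing
    to_delete = on_switch - desired   # stale rules to remove
    return to_add, to_delete

desired = {"vlan100->port7", "vlan200->port3"}
on_switch = {"vlan100->port7", "vlan300->port9"}   # one stale rule
to_add, to_delete = reconcile(desired, on_switch)
```

In the common case where the switch state is intact after a session blip, both sets are empty and reconnection causes zero data-plane disruption, which is the whole point of the workaround.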