WCE-Reliability and Availability
COPYRIGHT © 2014 ALCATEL-LUCENT. ALL RIGHTS RESERVED.
ALCATEL-LUCENT — CONFIDENTIAL — SOLELY FOR AUTHORIZED PERSONS HAVING A NEED TO KNOW — PROPRIETARY — USE PURSUANT TO COMPANY INSTRUCTION
vRAN, 2014/09/17
9771 WCE Reliability & Availability
[Figure labels: Wireless packet core; Standard router; ALU-provided or telco-provided computing cloud; IP; 3G – 4G]
Bringing NFV & vRAN to market Now
[Figure: Conventional deployment compared with Wireless Cloud Element. Labels: BBU; IP mobile backhaul; Centralized baseband; CPRI over fiber; RF-only sites; Macrocell; Metrocell; IP]
AGENDA
Wireless Cloud Element Product Strategy
Wireless Cloud Element
Wireless Cloud Element Demo
Wireless Cloud Element High Availability – Reliability
No Single Point of Failure
• 99.999% availability in a network function (e.g. RNC) requires that failures are rapidly detected and that a spare component takes over
• Failures lasting 15 s or longer are classified as outages
• To ensure that the active and spare components do not fail together, they must never both depend on a single component or point of failure
• WCE uses “anti-affinity” rules to prevent allocation of active and spare VMs on the same blade
• WCE’s use of redundant links, switches and storage ensures that no single failure disables the application redundancy schemes
• Common sparing algorithms are:
- 1+1
- N+1
- N+M
• All three techniques are used in the virtual RNC
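To put the 99.999% target and the 15 s outage threshold in perspective, the arithmetic below converts an availability figure into a yearly downtime budget. It is a minimal sketch: the function names are illustrative, and a 365-day year is assumed.

```python
# Downtime budget implied by an availability target (illustrative arithmetic).
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_budget_seconds(availability: float) -> float:
    """Seconds per year a service may be unavailable at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def is_outage(duration_s: float, threshold_s: float = 15.0) -> bool:
    """A failure counts as an outage when it lasts 15 s or longer (per the slide)."""
    return duration_s >= threshold_s


budget = downtime_budget_seconds(0.99999)
print(f"Budget: {budget:.1f} s/year ({budget / 60:.2f} min)")
print(f"15 s outages that fit in the budget: {int(budget // 15)}")
```

At five nines the budget is about 315 seconds (roughly 5.26 minutes) per year, so only about 21 outages of the minimum 15 s length fit into it. This is why failover has to complete in seconds rather than minutes.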
WCE Platform – Hardware Configurations
• Two Cabinets with up to 4 Blade System enclosures (2nd Cabinet has no SAN)
• Minimum Configuration: 6 server blades
• Max Configuration: 64 Server Blades (16 blades per enclosure)
• Server blades can be added in increments of 1 (the reference configuration concept used in the 9370 does not exist)
• Coverage requirements (# of cells) determine how many CMUs are required
• Traffic requirements (Erlangs, Mbps, signaling) determine how many UMUs and PCs are required
• The numbers of CMUs, UMUs and PCs determine how many blades are required
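The dimensioning chain described above (cells to CMUs, traffic to UMUs and PCs, VM counts to blades) can be sketched as follows. Every per-VM and per-blade capacity here is a hypothetical placeholder, not a product figure; only the 6-blade minimum and 64-blade maximum come from the slide. The sketch also ignores spare VMs and the 3gOAM and DA nodes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    # HYPOTHETICAL planning values, chosen only to make the example run.
    cells_per_cmu: int = 300
    erlangs_per_umu: int = 500
    mbps_per_pc: int = 1000
    vms_per_blade: int = 4


def blades_required(cells, erlangs, mbps, cap=Capacity(),
                    min_blades=6, max_blades=64):
    """Derive VM counts from coverage and traffic, then blades from VM counts."""
    cmus = math.ceil(cells / cap.cells_per_cmu)      # coverage drives CMUs
    umus = math.ceil(erlangs / cap.erlangs_per_umu)  # traffic drives UMUs
    pcs = math.ceil(mbps / cap.mbps_per_pc)          # throughput drives PCs
    blades = max(min_blades, math.ceil((cmus + umus + pcs) / cap.vms_per_blade))
    if blades > max_blades:
        raise ValueError(f"{blades} blades exceeds the {max_blades}-blade maximum")
    return {"cmus": cmus, "umus": umus, "pcs": pcs, "blades": blades}


print(blades_required(cells=3000, erlangs=50000, mbps=20000))
```

Because blades can be added one at a time, the result does not need rounding up to a reference configuration.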
WCE Hardware – HP C7000 Front View
• The HP C7000 is a fully redundant, carrier-grade computing system
• The front of the chassis contains:
- 16 hot-swappable, dual-socket Xeon processor blades
- 6 (5+1) -48 V DC power supplies
- HP Insight Display for local maintenance
WCE Hardware – HP C7000 Back View
• The back of the C7000 contains:
- All cables
- 10 shared fan modules
- DC power input
- Redundant Onboard Administrators with OAM access to all of the boards
- Dual 6125XLG Ethernet switches
- (Optional) 6120 Ethernet OAM switch(es)
SAN Storage
NetApp E5424
• Fully redundant path from host ports to drives
• Each controller can access all drive ports
• Each drive chip can access every drive
• Top-down, bottom-up cabling ensures continuous access
• Expandable via an expansion unit from 24 to 48 drives
• DC powered, NEBS certified
• Data is spread across the drives via distributed RAID-6
[Figure: dual-controller SAN block diagram. Labels: Controller A; Controller B; 6 Gb SAS chip; 32-port 6 Gb SAS; ESM; 6 Gb SAS expander; 24 drives]
WCE Link and Switch Redundancy
• WCE’s hardware configuration ensures there is no single point of failure
• Redundant, active-active, 10 Gigabit Ethernet data-path links
• Redundant switches
• Link or switch failure is rapidly detected and traffic is moved to alternate links
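One way to check a "no single point of failure" claim is to model the topology as a graph, remove each intermediate component in turn, and confirm that the endpoints stay connected. The sketch below does this for a hypothetical blade-to-uplink topology. The node names and links are illustrative assumptions, not the actual WCE cabling.

```python
from collections import deque


def reachable(links, src, dst, removed=frozenset()):
    """Breadth-first search over undirected links, skipping removed nodes."""
    if src in removed or dst in removed:
        return False
    adjacency = {}
    for a, b in links:
        if a in removed or b in removed:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, src, dst):
    """Intermediate nodes whose loss alone disconnects src from dst."""
    nodes = {n for link in links for n in link} - {src, dst}
    return sorted(n for n in nodes if not reachable(links, src, dst, {n}))


# Hypothetical topology: one blade dual-homed to two switches, both uplinked.
redundant = [("blade", "switch-a"), ("blade", "switch-b"),
             ("switch-a", "uplink"), ("switch-b", "uplink")]
print(single_points_of_failure(redundant, "blade", "uplink"))
```

With both switches present, the list is empty. Dropping the two `switch-b` links makes `switch-a` a single point of failure.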
WCE 4 Interconnect Topology
[Figure: interconnect topology. Labels: e5400 SAN with Ctlr A and Ctlr B (A1, A2, B1, B2); 10GbE iSCSI; C7000 #1 to #4, each with two 6125XLG switches; 40GbE (IRF); 4x 10GbE (Backplane IRF); 40GbE MAD; Multi-Chassis LAG Link; LAG Group; NHR Active*; NHR Standby*. Cable legend: 10GbE DAC cable; 10GbE MM fiber; 40GbE QSFP+ DAC cable]
*NOTE: If Core Network and RAN NHR are required, the uplink connections are replicated to the second pair of NHRs
Actual link configuration depends on bandwidth requirements
Dynamic VM Allocation – Anti-Affinity
• WCE uses fully dynamic VM allocation: there is no static association between VMs and blades
• VMs can be created on any blade assigned to the cluster, even if it is in another physical frame
• Multiple tenants can share a single data center
• To ensure services are highly available, “anti-affinity” rules separate redundant VMs onto different physical blades
[Figure: Multiple RNCs within a single data center]
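The anti-affinity idea can be illustrated with a small placement routine: a VM may go on any blade, except one that already hosts a member of the same redundancy group. This is a sketch only, assuming a simple least-loaded choice among the allowed blades. The deck does not describe the actual scheduler, and the VM and blade names are hypothetical.

```python
from collections import defaultdict


def place_vms(vms, blades):
    """Place (vm_name, redundancy_group) pairs so no blade hosts two VMs
    of the same group. Returns {vm_name: blade}."""
    placement = {}
    groups_on_blade = defaultdict(set)  # blade -> groups already hosted there
    load = {blade: 0 for blade in blades}
    for vm, group in vms:
        # Anti-affinity rule: exclude blades that already host this group.
        candidates = [b for b in blades if group not in groups_on_blade[b]]
        if not candidates:
            raise RuntimeError(f"no blade satisfies anti-affinity for {vm}")
        chosen = min(candidates, key=lambda b: load[b])  # least loaded wins
        placement[vm] = chosen
        groups_on_blade[chosen].add(group)
        load[chosen] += 1
    return placement


vms = [("3goam-a", "3goam"), ("3goam-b", "3goam"),
       ("da-a", "da"), ("da-b", "da")]
print(place_vms(vms, ["blade-1", "blade-2"]))
```

Because nothing is statically bound to a blade, the same routine works unchanged when blades from another frame are added to the list.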
Carrier Grade Overview
Software implementation of Carrier Grade Sparing
99.999% availability requires that failures are rapidly detected and that a spare component takes over
Two CG domains:
• Spared nodes (3gOAM, CMU, PC and DA) are required to maintain cells and allow new calls to be originated. They are characterized by shared data and a rapid (on the order of seconds) switch of activity.
• Un-spared nodes (UMU) are UE specific; sharing data is not required and return to service is slow (on the order of minutes).
[Figure: vRNC node types managed by vCenter Server and LRC Mgr. Labels: 3gOAM, 1+1 (2); Disk Access, 1+1 (2), with NAS front end to SAN; CMU; PC; UMU; RNC Operations; Cell Management; User Management; Transport Termination; N+M (2-16); N+1 (2-30); Un-spared (1-65); No Single Point of Failure]
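The sparing labels used here (1+1, N+1, N+M, un-spared) all describe a pool of N active units backed by M spares. The sketch below models this generically. The class and the example pool sizes are illustrative assumptions, not the vRNC implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparingGroup:
    name: str
    active: int  # N: units carrying traffic
    spares: int  # M: idle units ready to take over

    def scheme(self) -> str:
        """Name the sparing scheme implied by the N and M values."""
        if self.spares == 0:
            return "un-spared"
        if self.active == 1 and self.spares == 1:
            return "1+1"
        if self.spares == 1:
            return "N+1"
        return "N+M"

    def survives(self, simultaneous_failures: int) -> bool:
        """True if every failed active unit can be replaced by a spare."""
        return simultaneous_failures <= self.spares


# Example pools; the sizes are hypothetical.
for group in (SparingGroup("pair", 1, 1), SparingGroup("pool-a", 5, 1),
              SparingGroup("pool-b", 6, 2), SparingGroup("pool-c", 8, 0)):
    print(group.name, group.scheme(), group.survives(1))
```

An un-spared pool never "survives" a failure in this sense. As the slide notes for UMUs, it recovers by returning the failed unit to service, which takes minutes rather than seconds.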
Message Transfer Framework & Multi-Ring TOTEM Protocol (Patent Pending)
• When building carrier-grade applications on an infrastructure that cannot provide lossless message transfer, network functions must provide their own messaging systems. A new Message Transfer Framework (MTF) was created for the WCE virtual RNC that offers the following services:
- Point-to-point best effort (over UDP),
- Point-to-point reliable (over SCTP),
- Point-to-point reliable and ordered (over multi-ring TOTEM),
- Multicast best effort (over UDP), and
- Multicast reliable and ordered (over multi-ring TOTEM).
[Figure: Multi-Ring TOTEM ring messaging. Labels: UMU ring; CMU ring; PC ring; 3gOAM (gateway); Corosync nodes within a ring]
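The five MTF services form a small lookup from delivery requirements to transport. The sketch below encodes exactly the combinations listed on the slide. The enum and function names are illustrative and are not the MTF API.

```python
from enum import Enum


class Scope(Enum):
    POINT_TO_POINT = "point-to-point"
    MULTICAST = "multicast"


class Delivery(Enum):
    BEST_EFFORT = "best effort"
    RELIABLE = "reliable"
    RELIABLE_ORDERED = "reliable and ordered"


# The five services listed on the slide, keyed by (scope, delivery).
MTF_TRANSPORTS = {
    (Scope.POINT_TO_POINT, Delivery.BEST_EFFORT): "UDP",
    (Scope.POINT_TO_POINT, Delivery.RELIABLE): "SCTP",
    (Scope.POINT_TO_POINT, Delivery.RELIABLE_ORDERED): "multi-ring TOTEM",
    (Scope.MULTICAST, Delivery.BEST_EFFORT): "UDP",
    (Scope.MULTICAST, Delivery.RELIABLE_ORDERED): "multi-ring TOTEM",
}


def transport_for(scope: Scope, delivery: Delivery) -> str:
    """Return the transport the slide associates with a service."""
    try:
        return MTF_TRANSPORTS[(scope, delivery)]
    except KeyError:
        raise ValueError(
            f"no {scope.value} {delivery.value} service is listed") from None


print(transport_for(Scope.POINT_TO_POINT, Delivery.RELIABLE))
```

Note that the list has no multicast service that is reliable but unordered, so the lookup rejects that combination instead of guessing a transport.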
CMU Sparing
• Each Cell Management Unit (CMU) hosts 3 Service Groups (collections of NodeBs)
• Each Service Group is independently spared
- a single failed Service Group can be restarted, or
- relocated to another CMU
• An entire CMU failure results in all of its Service Groups being relocated
- the failed CMU is restarted as a spare
• Failover typically takes 10-15 seconds
[Figure: Before failure: CMU-1 hosts SG1, SG2, SG3; CMU-2 hosts SG4, SG5; CMU-3 hosts SG6. After failure: CMU-1 is empty; CMU-2 hosts SG4, SG3, SG5; CMU-3 hosts SG1, SG6, SG2]
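The before/after picture can be reproduced with a short relocation routine. This is a sketch under an assumed policy (move each orphaned Service Group to the least-loaded surviving CMU that still has room). The deck does not state the real selection rule, so the exact group-to-CMU assignment may differ from the figure even though the resulting load per CMU is the same.

```python
MAX_SG_PER_CMU = 3  # from the slide: each CMU hosts 3 Service Groups


def relocate_on_cmu_failure(cmus, failed):
    """Return a new {cmu: [service groups]} mapping after `failed` goes down."""
    survivors = {name: list(sgs) for name, sgs in cmus.items() if name != failed}
    for sg in cmus[failed]:
        candidates = [name for name, sgs in survivors.items()
                      if len(sgs) < MAX_SG_PER_CMU]
        if not candidates:
            raise RuntimeError(f"no spare capacity for {sg}")
        # Assumed policy: pick the surviving CMU hosting the fewest groups.
        target = min(candidates, key=lambda name: len(survivors[name]))
        survivors[target].append(sg)
    survivors[failed] = []  # the failed CMU restarts empty, as a spare
    return survivors


before = {"CMU-1": ["SG1", "SG2", "SG3"],
          "CMU-2": ["SG4", "SG5"],
          "CMU-3": ["SG6"]}
after = relocate_on_cmu_failure(before, "CMU-1")
print(after)
```

After the call, CMU-2 and CMU-3 each host three Service Groups and CMU-1 is empty, matching the shape of the "After Failure" diagram.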
VMware HA Failover Fast Return to Redundancy
• VMware’s HA feature isn’t fast enough for telecom application failover, but it is useful for return to redundancy
• The vRNC application HA mechanisms detect failures and start spare VMs in seconds
• VMware’s HA feature detects failed hosts and restarts the (now spare) VMs on a designated failover host in minutes
• Mean Time To Redundancy goes from hours to minutes
[Figure: Before Failure and After Failure views, showing the failover host]
WCE Geo-Redundancy / Reflection
[Figure: two sites, WCE Uptown and WCE Midtown. Labels: VMs of RNC-1; VMs of RNC-2; RNC-1 Reflection; RNC-2 Reflection]
AGENDA
Wireless Cloud Element Product Strategy
Wireless Cloud Element
Wireless Cloud Element Demo
Wireless Cloud Element High Availability – Availability
Shadow (Patent Pending)
• WCE provides the capability to create a “shadow” of a network element within the same physical hardware as the service-providing network element
• This shadow can be used for:
- Fast and graceful reset
- Software upgrade
- Software configuration change
- Virtual or physical machine change
- Geo-redundancy
• Roll-back to the previous version is possible after an upgrade
[Figure: Core Network, RNC instances and iBTSs]
Shadow Network Functions
Shadow technology is useful for any type of critical configuration change. Here is the process:
1. While the active tenant provides service, create the shadow tenant
2. Once the shadow tenant is ready, the operator initiates a switch
2a. The active tenant is disconnected from the network, becoming a shadow
2b. The shadow tenant is connected to the network, becoming active
2c. The tenant re-establishes links to other network equipment
3. Once the operator is satisfied with the configuration change, the prior version of the tenant is removed
[Figure: Core Network, RNC tenants and iBTSs]
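The shadow procedure is essentially a small state machine with two tenants that swap roles. The sketch below models the numbered steps. The class and method names are hypothetical, link re-establishment (2c) is only logged, and treating roll-back as a second switch before the old tenant is removed is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tenant:
    version: str
    connected: bool = False  # True while attached to the network


@dataclass
class ShadowUpgrade:
    active: Tenant
    shadow: Optional[Tenant] = None
    log: List[str] = field(default_factory=list)

    def create_shadow(self, version: str) -> None:
        """Step 1: build the shadow while the active tenant keeps serving."""
        self.shadow = Tenant(version)
        self.log.append(f"1: shadow {version} created")

    def switch(self) -> None:
        """Step 2: operator-initiated swap of the active and shadow roles."""
        if self.shadow is None:
            raise RuntimeError("no shadow tenant to switch to")
        self.active.connected = False  # 2a: old active leaves the network
        self.shadow.connected = True   # 2b: shadow joins the network
        self.active, self.shadow = self.shadow, self.active
        self.log.append(f"2c: {self.active.version} re-establishing links")

    def rollback(self) -> None:
        """Assumed roll-back: switch again while the prior version still exists."""
        self.switch()

    def commit(self) -> None:
        """Step 3: remove the prior version once the operator is satisfied."""
        self.shadow = None
        self.log.append("3: prior version removed")


upgrade = ShadowUpgrade(active=Tenant("v1", connected=True))
upgrade.create_shadow("v2")
upgrade.switch()
print(upgrade.active.version, upgrade.shadow.version)
```

Keeping the previous tenant around until `commit` is what makes roll-back cheap: it is the same swap run in the opposite direction.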
WCE Zero Downtime Maintenance
• Higher-level virtualization features are built around the capability to move active VMs between servers (called vMotion or Live Migration)
• Using live migration we enable zero-downtime maintenance of hardware
[Figure: a virtual machine (guest O/S, x86, disk) shown on the hypervisors of Blade 1 and Blade 2]
Key Performance Indicators during Maintenance
Measured results from a live vRNC
Call success stayed above 95% for the entire time even though all of the hardware was upgraded, a new S/W load was applied and a critical configuration change (requiring an RNC reset) was made.
A combination of Shadow upgrades (a total of four) and live VM migration was used.
[Figure: KPIs of a live vRNC during Shadow upgrades, 5/18/14 6:00 PM to 5/22/14 12:00 AM, vertical axis 0% to 100%. Annotations: Start of Upgrade; 32 blades upgraded (HeartBleed); New S/W Load; Critical Configuration Change. Series: CallSetupSuccessRateCS(%); HSDPA_Accessibility_SuccessRate(%); 3G2GHHOExecutionSuccessRate(%); SHO_SuccessRate_Cel_SQM(%); HSUPA(%); Calldrop_HSDPA_cell(%); HSUPA_DropRate(%); RABDropRate_PS(%); RABDropRateCS_Voice(%)]
Key Performance Indicators during Live Migration
Measured results from a vRNC under heavy traffic
Movement of live VMs from host to host results in small impacts to KPIs.
The worst case was the movement of cmu-7, which resulted in an HA system timeout.
[Figure: KPIs from 5/13/2014 18:02 to 18:53. Axis ticks: 40000 to 45000 and 99.9 to 100. Annotated events: add/delete; vMotion umu-29; vMotion pc-4; vMotion cmu-7; vMotion 3goam; vMotion da; HA system timeout and switch of activity. Legend fragments: establishments; session; mobility; transfer; calls]
Note the vertical axis ranges from 99.9% to 100%