Change at Service Scale
Gayathri Venkataraman, Principal PM Lead, Change Velocity Team, Microsoft
Rudra Mitra, Group Program Manager, Service Fundamentals, Microsoft
DMI312
Agenda
• Background
• Types of changes we make to the service
• How we apply changes
• Safety during change
• What happens when we mess up
Disclaimer: cloud-focused session. The procedures and principles also apply to on-premises.
Rapidly growing service….
• Presence on major continents
• Dozens of datacenters
• Dedicated Chinese DCs
Service Story
• 7+ years of operating the service
• Since 2007, directly run by the engineering team
• Continuous change/validation/monitoring
• Standard configuration for all tenants, no configuration drift
• Enhanced validation pipeline, many signals
• Virtuous cycle of learning
  • Engineers manage changes
  • They are on-call for service-related issues
  • Nimble in reacting to issues
  • Incorporate learnings right away
Change → Issues → Fix → Learn → Change …
Service accrues to server
• Single branch for cloud and on-premises
• Cumulative Updates (CUs) validated at cloud scale
• CUs fork from cloud-tested code (CU3)
• CUs run in Microsoft service/on-premises topologies
• Validation for on-premises features and topologies

[Diagram: the 15.0.x.y main branch, with Cloud/CU3, Cloud/CU4, and Cloud/CU5 forked from cloud-validated points]
Types of changes in the datacenter
• Feature changes
• Security patches
• Bug fixes
• Hardware upgrades and failures
• New capacity additions
Change:
• is continuous
• requires validation
• results in mistakes. Learn quickly.
Strategy
• Staged rollout to build confidence
• Common platform for validating any change
• Agile response through code
Staging
1s of servers → 10s of servers → 100s of servers → 1000s of servers
All changes are staged…
Each stage has a purpose. Go fast, and listen.
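The staging idea above can be sketched in a few lines. This is a hypothetical illustration, not Microsoft's automation: each stage covers roughly 10x more servers than the last, and the change train only advances while the health signals stay green. Stage sizes and the health-check interface are assumptions for the example.

```python
# Minimal sketch of a staged rollout: advance 1 -> 10 -> 100 -> 1000 servers,
# stopping at the first stage whose signals go red. Names are illustrative.

STAGES = [1, 10, 100, 1000]  # "1s of servers" through "1000s of servers"

def run_train(health_check):
    """Roll a change through the stages; stop at the first unhealthy stage."""
    completed = []
    for size in STAGES:
        if not health_check(size):
            return completed, f"stopped: signals red at {size}-server stage"
        completed.append(size)
    return completed, "deployment complete"

# Usage: a hypothetical change whose signals go red at the 100-server stage.
done, status = run_train(lambda size: size < 100)
print(done, status)  # [1, 10] stopped: signals red at 100-server stage
```

Going fast while listening is the point of the shape: small early stages limit blast radius, and each successful stage is what buys confidence for the next, larger one.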
BE Change unit
• Backend (BE) change unit is a Database Availability Group (DAG)
• DAG has multiple copies of mailboxes
FE and BE Change units
Frontend Client Access Servers (FE CAS) and Backend mailboxes (BE MBX) are not dependent on each other for changes in E15
Change units rollout
Starts with one unit. Within the unit:
1) FE is rolled out based on capacity constraints
2) BE is rolled out one copy group at a time
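A sketch of the BE side of that rollout: within a DAG, database copies are upgraded one copy group at a time so the remaining copies keep serving mail and remain available as failover targets. The DAG layout, names, and grouping parameter here are invented for illustration.

```python
# Illustrative sketch of rolling a change through one BE change unit (a DAG),
# one database-copy group at a time. Not the service's actual automation.

def rollout_dag(dag, copies_per_group=1):
    """Return the order in which groups of database copies are upgraded."""
    order = []
    copies = list(dag)
    for i in range(0, len(copies), copies_per_group):
        # In the real service, the copies NOT in this group must stay healthy
        # before the next group is touched, so users always have a live copy.
        order.append(copies[i:i + copies_per_group])
    return order

dag = ["copy-1", "copy-2", "copy-3", "copy-4"]
print(rollout_dag(dag))  # [['copy-1'], ['copy-2'], ['copy-3'], ['copy-4']]
```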
Multiple change units
Move on to multiple change units in parallel…
Cookie-cutter rollouts. No room for configuration drift.
Multiple change units in a forest (Forest-1)
Multiple forests: Forest-1, Forest-2, Forest-3, …
Continue until the worldwide (WW) change is complete
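The fan-out above can be sketched as applying one identical "cookie cutter" recipe to every DAG in every forest until the worldwide rollout is done. The data structures and names below are assumptions for the example; in practice units within a forest proceed in parallel under the automation's capacity constraints.

```python
# Hedged sketch of cookie-cutter fan-out: the same change recipe is applied
# to every change unit (DAG) in each forest, forest by forest, until WW done.

def rollout_worldwide(forests, apply_change):
    """Apply one identical change recipe to every DAG in every forest."""
    results = {}
    for forest, dags in forests.items():
        # Shown serially for clarity; the real system parallelizes within a forest.
        results[forest] = [apply_change(dag) for dag in dags]
    return results

forests = {"Forest-1": ["DAG-1", "DAG-2"], "Forest-2": ["DAG-3"]}
print(rollout_worldwide(forests, lambda dag: f"{dag}: done"))
```

The same recipe everywhere is what rules out configuration drift: no unit gets a hand-crafted variant of the change.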
Demo
Change Pipeline - Stages
Many trains
Upcoming train
Train in progress
Deployment complete
Demo
Change Pipeline – Detailed View
Change - Completion Insights
Demo
Insights to Change
Rollout Automation System
• All changes rolled out using one automation platform
• Automation knows about capacity constraints for FE and BE
• Safety nets in place to prevent operator errors
• Access approval needed to run the automation
• Full tracing for debugging and auditing
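The safety nets listed above can be illustrated with a small sketch: the automation refuses to run without an access approval, honors a capacity floor before touching a unit, and traces every decision for auditing. The function, threshold, and trace format are all invented for the example.

```python
# Sketch of the safety-net idea: approval gate + capacity constraint + tracing.
# All names and thresholds here are illustrative assumptions.

def run_change(unit_capacity, approved, min_capacity=0.8, trace=None):
    """Attempt a change on one unit; return (applied, audit_trace)."""
    trace = trace if trace is not None else []
    if not approved:
        trace.append("denied: no access approval")  # operator-error safety net
        return False, trace
    if unit_capacity < min_capacity:
        trace.append(f"blocked: capacity {unit_capacity:.0%} below floor")
        return False, trace
    trace.append("change applied")  # full tracing for debugging and auditing
    return True, trace

ok, log = run_change(0.95, approved=True)
print(ok, log)  # True ['change applied']
```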
Validation
1s of servers → 10s of servers → 100s of servers → 1000s of servers
Validation
Confidence in the change increases as we progress…
Signals
• Listen to many signal types, depending on the change
• Triangulate more than one signal
• Active/passive monitoring
• Examples: availability, latency, errors, performance, user experience, etc.
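Triangulating more than one signal, as described above, means no single metric gets to declare a change healthy: every watched signal has to agree. A minimal sketch, with thresholds and signal names invented for illustration:

```python
# Minimal triangulation sketch: a change is healthy only if ALL signals pass.
# Signal names and thresholds are illustrative, not the service's real policy.

SIGNALS = {
    "availability": lambda v: v >= 0.999,  # fraction of successful requests
    "latency_ms":   lambda v: v <= 250,    # e.g. a p95 latency budget
    "error_rate":   lambda v: v <= 0.001,
}

def triangulate(readings):
    """Return (healthy, failing_signals) for a dict of signal readings."""
    failing = [name for name, ok in SIGNALS.items() if not ok(readings[name])]
    return len(failing) == 0, failing

print(triangulate({"availability": 0.9995, "latency_ms": 180, "error_rate": 0.0004}))
```

Requiring agreement across signal types is what catches regressions a single dashboard would miss, e.g. a change that keeps availability high but quietly doubles latency.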
Customer Signal - O365 TAP
Online tenants also have the capability to get a sneak preview of new features (e.g., Groups). Changes are rolled out in a controlled way to the users that sign up. Listen to the signal and incorporate the feedback into future changes.
When signals fail
Stop the change train → Page engineer → Isolate problem → Fix problem → Resume the change train → Learn → Incorporate change
• Engineers are vested in this model
• Goal is to never make the same mistake twice
24*7*365
• No short-cuts in rolling out the fix
• Stage, listen for signals
• Inform fabric that future rollouts take these fixes
Business hours
• Engage the engineers
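The response sequence above is a fixed ordering, which makes it easy to sketch as data; in reality each step is a mix of humans and automation. Step names follow the slide; the resume-from-checkpoint behavior is an illustrative assumption.

```python
# Sketch of the red-signal response as an ordered checklist. The step names
# come from the slides; everything else is invented for illustration.

RED_SIGNAL_STEPS = [
    "stop the change train",
    "page the on-call engineer",
    "isolate the problem",
    "fix the problem",
    "resume the change train",
    "learn and incorporate the change",
]

def handle_red_signal(completed=()):
    """Return the steps that remain, resuming after any already completed."""
    return RED_SIGNAL_STEPS[len(completed):]

print(handle_red_signal()[0])  # stop the change train
```

Encoding the order matters: stopping the train always comes before debugging, so a bad change never keeps spreading while the problem is being isolated.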
Demo
Signals
Signal is green
Signal is red
Performance Signal
Multiple builds if there is failure
Hardware Changes
Hardware: Web SKU, Storage SKU, Network
• Fault tolerance is built through software to maintain high availability
• Reduced COGS, zero hardware redundancy
• Hardware failures have a high impact on change: hard disk, disk controller, fan, motherboard, TOR failures, etc.
Hardware issues are a fact of life…
Factors
• Hardware failure shouldn’t block the change from rolling
• Hardware failures take time to get fixed. Plan to roll out with reduced capacity
How
• Hardware failures are monitored by the same change signals
• Excess capacity within the scale unit so the change can continue
• Repairs are handled “lazily”
• Machines upgraded to DC baseline when repairs complete
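"Plan to roll out with reduced capacity" can be sketched as a headroom check: a unit with failed machines still takes the change as long as enough excess capacity remains to serve users. The 20% spare fraction below is an invented example threshold.

```python
# Sketch of the reduced-capacity rule: keep rolling despite hardware failures
# as long as the scale unit retains enough headroom. Threshold is illustrative.

def can_continue(total_servers, failed_servers, spare_fraction=0.2):
    """True if the unit keeps enough capacity to take the change anyway."""
    usable = total_servers - failed_servers
    # Require at least (1 - spare_fraction) of design capacity to stay serving.
    return usable >= total_servers * (1 - spare_fraction)

print(can_continue(100, 15))  # True: 85 usable servers >= 80 required
print(can_continue(100, 25))  # False: 75 usable servers < 80 required
```

This is why the excess capacity is built into the scale unit in the first place: lazy repairs only work if a few dead machines never put the unit below its serving floor.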
New Capacity
Factors
1. New capacity needed all the time
2. Capacity lands in large increments
3. Manage change to the service without disruption
How
1. Pipeline for addition of hardware
2. Bring hardware to the same baseline as live servers
3. Brand new servers also have hardware issues
Learning Cycle
Case Study 1
Issue summary: CAFE connectivity issues degrade user experience. Internal signals raised an alarm.
Root cause: The issue happens during the database failover window. Connectivity takes more time in E15 than in E14.
Issue fixed by retrying the request against the new BE, rather than depending on the client to do so.
This fix went to on-prem customers as part of SP1. Any Functional/Performance/Scale improvements accrue to server.
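The fix described above, retrying against the new BE on the server side rather than pushing the retry onto the client, can be sketched as follows. The locator and send functions are hypothetical stand-ins, not the CAFE codebase.

```python
# Hypothetical sketch of Case Study 1's fix: on a connection failure, the
# frontend re-locates the mailbox's active backend (which may have failed
# over) and retries, so the client never has to. Names are invented.

def proxy_request(request, locate_backend, send, max_retries=1):
    """Send a request; on failure, re-locate the BE and retry server-side."""
    for attempt in range(max_retries + 1):
        backend = locate_backend(request)  # may return a new BE after failover
        try:
            return send(backend, request)
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries; only now does the client see an error

# Usage: the first send fails mid-failover, the server-side retry succeeds.
calls = {"n": 0}
def flaky_send(backend, request):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("old BE during failover")
    return f"{backend}: ok"

print(proxy_request("msg", lambda r: "be-2", flaky_send))  # be-2: ok
```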
Case Study 2
Issue summary: On Jan 10th 2014, Microsoft identified an issue in which some customers served from the European region were unable to access their Exchange mailboxes by using the Outlook client.
Root cause: A firmware issue caused network connectivity problems on a portion of backend servers, which disconnected some users.
The system attempted to reconnect by design, which exposed a coding error in the Client Access Front End (CAFE) servers. The large number of reconnect requests caused CAFE servers to become non-responsive.
Incorporated the learning by improving signals and engineering.
Case Study 3
Issue summary: Mac Outlook users were unable to access their mailbox with OL2011 under certain conditions. Other related symptoms were “not able to load the content on sent”, “delayed time to delete messages”, and clicking on messages and seeing “message has no content”.
Root cause: Enhanced logging and throttling had impacted the users. Improved/fine-tuned the throttling logic to improve user experience.
Missing validation. Added monitoring signals.
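Case Study 3 mentions fine-tuning throttling logic. As one common way such per-user limits are implemented (an assumption, not the service's actual mechanism), a token bucket allows short bursts while capping sustained request rates; the capacity and refill rate are illustrative parameters.

```python
# Generic token-bucket throttle sketch. Parameters are illustrative; this is
# not Exchange's actual throttling policy.

class TokenBucket:
    """Allow bursts up to `capacity`; refill `rate` tokens per second."""

    def __init__(self, capacity, rate):
        self.capacity, self.rate = capacity, rate
        self.tokens, self.last = float(capacity), 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request throttled

bucket = TokenBucket(capacity=2, rate=1)  # burst of 2, then ~1 request/sec
print([bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.2)])  # [True, True, False, True]
```

The tuning knobs are exactly where the case study's lesson bites: limits tight enough to protect the service can be tight enough to hurt legitimate clients, which is why the fix paired tuning with new monitoring signals.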
Case Study 4
Issue summary: Saved encrypted drafts cannot be opened in OWA SDF. The current logic used the signing cert for encryption at the time of saving a draft to the Drafts folder, which works when only one cert is created per user in our test environment. In SDF, each user has a distinct signing cert and encryption cert, so with this bug the user cannot open a saved encrypted draft.
Root cause: A different algorithm was adopted for OLK and OWA. Fixed by using the same algorithm as OLK for picking the signing and encryption certs, and using the encryption cert to encrypt at the time of saving a draft. Issue found on 12/18. Fixed on 12/27. Delivered to on-prem 2/25.
Agile. Another example of service accrues to server.
In closing…
Summary
• One way of managing change, get good at it
• One way of validating change, get good at it
• Triangulate multiple signals for validation
• Agile response to mistakes
• Continue to land cloud changes so the service accrues to the server
Thank You!
Relevant sessions
Optimizing Exchange Online for efficiencies and snappy experiences — Wednesday, 1:00 PM to 2:15 PM
Behind the curtain: How we run Exchange Online — Tuesday, 3:00 PM to 4:15 PM
What’s that alert — Exchange Managed Availability — Tuesday, 4:45 PM to 6:00 PM
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Appendix
CUs and their dates

Exchange 2013 version                                   | General availability date | Build number
Release to Manufacturing (RTM) version of Exchange 2013 | December 3, 2012          | 15.00.0516.032
Exchange 2013 Cumulative Update 1 (CU1)                 | April 2, 2013             | 15.00.0620.029
Exchange 2013 CU2                                       | July 9, 2013              | 15.00.0712.024
Exchange 2013 CU3                                       | November 25, 2013         | 15.00.0775.038
Exchange 2013 Service Pack 1 (SP1)                      | February 25, 2014         | 15.00.0847.032