
DMI312 Rapidly growing service…. Presence in major continents Dozens of datacenters Dedicated Chinese DCs



Change at Service Scale

Name: Gayathri Venkataraman
Title: Principal PM Lead
Change Velocity Team
Microsoft

Name: Rudra Mitra
Title: Group Program Manager
Service Fundamentals
Microsoft

DMI312


Agenda
• Background
• Types of changes we make to the service
• How we apply changes
• Safety during change
• What happens when we mess up

Disclaimer: cloud focused session. Procedures and principles will apply to on-premises.


Rapidly growing service….

Presence in major continents Dozens of datacenters

Dedicated Chinese DCs


Service Story

• 7+ years of operating the service
• Since 2007 directly run by the engineering team
• Continuous change/validation/monitoring
• Standard configuration for all tenants, no configuration drifts
• Enhanced validation pipeline, many signals
• Virtuous cycle of learning
• Engineers manage changes
• They are on-call for service related issues
• Nimble in reacting to issues
• Incorporate learnings right away

[Diagram: learning cycle labeled Issues, Fix, Learn, Change]


Service accrues to server
• Single branch for cloud and on-premises
• Cumulative Updates (CUs) validated at cloud scale
• CUs fork from cloud tested code (CU3)
• CUs run in Microsoft service/on-premises topologies
• Validation for on-premises features and topologies

[Diagram: Main Branch (15.0.x.y) with forks labeled Cloud/CU3, Cloud/CU4, Cloud/CU5]


Types of changes in the datacenter
• Feature changes
• Security patches
• Bug fixes
• Hardware upgrades and failures
• New Capacity additions


Change
• is continuous
• requires validation
• results in mistakes. Learn quickly.


Strategy

• Staged rollout to build confidence
• Common platform for validating any changes
• Agile response through code


Staging


1s of servers
10s of servers
100s of servers
1000s of servers

All changes are staged…

Each stage has a purpose
Go fast, and listen
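
The staged progression on this slide can be sketched as a loop that widens the rollout only while signals stay healthy. This is a minimal illustration under stated assumptions, not the service's actual automation: `deploy`, `is_healthy`, and the exact stage sizes are hypothetical stand-ins for the "1s, 10s, 100s, 1000s" stages.

```python
from typing import Callable, Iterable, List

# Hypothetical stage sizes mirroring "1s, 10s, 100s, 1000s of servers".
STAGE_SIZES = [1, 10, 100, 1000]


def staged_rollout(
    servers: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[List[str]], bool],
    stage_sizes: Iterable[int] = STAGE_SIZES,
) -> List[str]:
    """Deploy to progressively larger stages; stop at the first unhealthy stage.

    Returns the servers that received the change. Servers beyond the sum of
    the stage sizes are left untouched in this sketch.
    """
    done: List[str] = []
    remaining = list(servers)
    for size in stage_sizes:
        if not remaining:
            break
        batch, remaining = remaining[:size], remaining[size:]
        for server in batch:
            deploy(server)
        done.extend(batch)
        # "Go fast, and listen": check signals before widening the rollout.
        if not is_healthy(done):
            break
    return done
```

A caller would pass its own deployment and health-check functions; the early stages exist so that a bad change is caught while only a handful of servers carry it.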


BE Change unit

• Backend (BE) change unit is a Database Availability Group (DAG)

• DAG has multiple copies of mailboxes

[Diagram: BE servers grouped into a DAG]


FE and BE Change units

Frontend Client Access Servers (FE CAS) and Backend mailboxes (BE MBX) are not dependent on each other for changes in E15

[Diagram: FE servers and BE servers, with BE grouped into a DAG]


Change units rollout
Starts with one unit. Within the unit:
1) FE is rolled out based on capacity constraints
2) BE is rolled out one copy group at a time

[Diagram: FE and BE servers in a single DAG change unit]
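
The two steps within a change unit can be sketched as follows. This is a hedged illustration: the function names, the `max_offline_fraction` parameter, and the representation of copy groups as a dictionary are assumptions, and running FE before BE simply follows the numbering on the slide (FE and BE do not depend on each other for changes).

```python
from typing import Callable, Dict, List


def fe_batches(fe_servers: List[str], max_offline_fraction: float) -> List[List[str]]:
    """Split FE servers so at most the given fraction is offline at once."""
    batch_size = max(1, int(len(fe_servers) * max_offline_fraction))
    return [fe_servers[i:i + batch_size] for i in range(0, len(fe_servers), batch_size)]


def rollout_change_unit(
    fe_servers: List[str],
    be_copy_groups: Dict[str, List[str]],
    upgrade: Callable[[str], None],
    max_offline_fraction: float = 0.25,
) -> List[List[str]]:
    """Upgrade one change unit and return the batches in the order they ran."""
    order: List[List[str]] = []
    # 1) FE: batch size is bounded by the (assumed) capacity constraint.
    for batch in fe_batches(fe_servers, max_offline_fraction):
        for server in batch:
            upgrade(server)
        order.append(batch)
    # 2) BE: one copy group at a time, so other copies keep serving mailboxes.
    for group_name in sorted(be_copy_groups):
        group = be_copy_groups[group_name]
        for server in group:
            upgrade(server)
        order.append(group)
    return order
```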


Multiple change units
Move onto multiple change units in parallel….
Cookie cutter rollouts. No room for configuration drifts

[Diagram: FE and BE servers across DAG 1, DAG 2, DAG 3 … DAG N]
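
"No room for configuration drifts" implies every change unit should match one baseline. A minimal sketch of such a check, assuming configurations can be represented as flat key/value dictionaries (the function name and setting names are hypothetical):

```python
from typing import Dict, List


def find_drift(
    baseline: Dict[str, str],
    units: Dict[str, Dict[str, str]],
) -> Dict[str, List[str]]:
    """Return, per change unit, the sorted setting names that differ from the baseline."""
    drift: Dict[str, List[str]] = {}
    for unit, config in units.items():
        keys = set(baseline) | set(config)
        changed = sorted(k for k in keys if baseline.get(k) != config.get(k))
        if changed:
            drift[unit] = changed
    return drift
```

An empty result means every unit is a cookie-cutter copy of the baseline; missing or extra settings count as drift as well as changed values.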


Multiple change units in a forest

Forest-1


Multiple forests

Forest-1

Forest-2

Forest-3


Continue until WW change complete


Demo

Change Pipeline - Stages


Many trains


Upcoming train


Train in progress


Deployment complete


Demo

Change Pipeline – Detailed View


Change - Completion Insights


Demo

Insights to Change


Insights to Change


Rollout Automation System
• All changes are rolled out using one automation platform
• Automation knows about capacity constraints for FE, BE
• Safety nets in place to prevent operator errors
• Access approval needed to run the automation
• Full tracing for debugging and auditing
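
Three of the properties listed above (capacity awareness, access approval, and tracing) can be sketched together. This is an illustrative toy, not the real platform: the class, its fields, and the form of the capacity constraint (a minimum number of in-service servers) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set


class RolloutRefused(Exception):
    """Raised when a safety net blocks the requested action."""


@dataclass
class RolloutAutomation:
    total_servers: int
    min_in_service: int              # assumed form of the capacity constraint
    approved_operators: Set[str]
    offline: Set[str] = field(default_factory=set)
    trace: List[str] = field(default_factory=list)

    def take_offline(self, operator: str, server: str) -> None:
        """Take a server out of rotation if approval and capacity checks pass."""
        if operator not in self.approved_operators:
            self.trace.append(f"DENY {operator} {server}: no access approval")
            raise RolloutRefused("access approval required")
        in_service_after = self.total_servers - len(self.offline | {server})
        if in_service_after < self.min_in_service:
            self.trace.append(f"DENY {operator} {server}: capacity constraint")
            raise RolloutRefused("capacity constraint would be violated")
        self.offline.add(server)
        self.trace.append(f"OK {operator} {server}: taken offline")
```

Every decision, including refusals, lands in `trace`, which is the part that supports debugging and auditing after the fact.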


Validation


1s of servers
10s of servers
100s of servers
1000s of servers

Validation
Confidence in change increases as we progress…


Signals
• Listen to many signal types, depending on change
• Triangulate more than one signal
• Active/Passive monitoring
• Examples: Availability, Latency, Errors, Performance, User experience, etc.
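
One way to read "triangulate more than one signal" is to act only when several independent signals agree. The sketch below assumes that interpretation; the function name, the signal names, and the threshold of two red signals are hypothetical.

```python
from typing import Dict, List


def red_signals(signals: Dict[str, bool]) -> List[str]:
    """Return the names of signals that are currently unhealthy (False)."""
    return sorted(name for name, healthy in signals.items() if not healthy)


def should_stop_change(signals: Dict[str, bool], min_red: int = 2) -> bool:
    """Stop the change when at least `min_red` signals are red at the same time."""
    return len(red_signals(signals)) >= min_red
```

Requiring agreement between, say, latency and errors reduces the chance that one noisy probe halts a rollout, at the cost of reacting more slowly to a problem only one signal can see.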


Customer Signal - O365 TAP

• Online tenants also have the capability to get a sneak preview of new features (e.g., Groups)
• Changes are rolled out in a controlled way to the users that sign up
• Listen to the signal and incorporate the feedback for future changes


When signals fail
1) Stop the change train
2) Page engineer
3) Isolate problem
4) Fix problem
5) Resume the change train
6) Learn
7) Incorporate change

• Engineers are vested in this model
• Goal is to never make the same mistake twice

24*7*365
• No shortcuts in rolling out the fix
• Stage, listen for signals
• Inform fabric that future rollouts take these fixes

Business hours
• Engage the engineers
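
The stop, page, fix, and resume flow on this slide can be modeled as a small state holder. This is a sketch under assumptions: the class and method names are hypothetical, and recording the fix on the train stands in for "inform fabric that future rollouts take these fixes".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChangeTrain:
    name: str
    open_issue: Optional[str] = None
    fixes: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    @property
    def stopped(self) -> bool:
        """The train is stopped while an issue is open."""
        return self.open_issue is not None

    def on_red_signal(self, signal: str) -> None:
        """Stop the train and page an engineer."""
        self.open_issue = signal
        self.log.append(f"stop train: {signal} is red")
        self.log.append("page engineer")

    def apply_fix(self, fix_id: str) -> None:
        """Record the fix for this and future rollouts, then resume."""
        if self.open_issue is None:
            raise ValueError("no open issue to fix")
        self.fixes.append(fix_id)
        self.log.append(f"fix {fix_id} recorded for this and future rollouts")
        self.open_issue = None
        self.log.append("resume train")
```

In this model the train cannot resume without a recorded fix, which mirrors the "no shortcuts" point: the fix itself would go through the same staged rollout.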


Demo

Signals


Signal is green


Signal is red


Performance Signal


Multiple builds if there is failure


Hardware Changes


Hardware
• Web SKU
• Storage SKU
• Network

Reduced COGs
Zero Hardware Redundancy

Fault tolerance is built through software to maintain high availability. Hardware failures have high impact on change.

Hard disk, disk controller, fan issues, motherboard issues, TOR failures etc.


Hardware issues are a fact of life…
Factors
• Hardware failure shouldn’t block change from rolling
• Hardware failures take time to get fixed. Plan to roll out with reduced capacity
How
• Hardware failures are monitored by the same change signals
• Excess capacity within the scale unit so change can continue
• Repairs are handled “lazily”
• Machines upgraded to DC baseline when repairs complete
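
Planning a rollout with reduced capacity comes down to simple arithmetic: failed machines shrink the pool, and whatever exceeds the capacity needed for load is what the change may take offline. A minimal sketch, with hypothetical function and parameter names:

```python
def max_batch_size(total: int, failed: int, needed_for_load: int) -> int:
    """Machines that can be offline for the change while still carrying the load."""
    healthy = total - failed
    return max(0, healthy - needed_for_load)


def can_continue_rollout(total: int, failed: int, needed_for_load: int, batch: int) -> bool:
    """True if taking `batch` healthy machines offline leaves enough capacity."""
    return 0 < batch <= max_batch_size(total, failed, needed_for_load)
```

For example, with 20 machines in a scale unit, 2 awaiting repair, and 15 needed for load, the change can still proceed 3 machines at a time; the rollout slows down but does not block on the repairs.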


New Capacity
Factors
1. New capacity needed all the time
2. Capacity lands in large increments
3. Manage change to service without disruption
How
1. Pipeline for addition of hardware
2. Bring hardware to same baseline as live servers
3. Brand new servers also have hardware issues
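
The "How" steps can be sketched as a small admission pipeline: new servers with hardware problems are set aside, and the rest are brought to the live baseline before they join. The `Server` type, its fields, and the `upgrade` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Server:
    name: str
    build: str
    hardware_ok: bool


def admit_new_capacity(
    new_servers: List[Server],
    baseline_build: str,
    upgrade: Callable[[Server, str], None],
) -> Tuple[List[Server], List[Server]]:
    """Return (admitted, needs_repair) after bringing servers to the baseline."""
    admitted: List[Server] = []
    needs_repair: List[Server] = []
    for server in new_servers:
        # Brand new servers also have hardware issues, so check before admitting.
        if not server.hardware_ok:
            needs_repair.append(server)
            continue
        if server.build != baseline_build:
            upgrade(server, baseline_build)
        # Only servers that actually reached the baseline may join.
        if server.build == baseline_build:
            admitted.append(server)
        else:
            needs_repair.append(server)
    return admitted, needs_repair
```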


Learning Cycle


Case Study 1
Issue summary: CAFE connectivity issues degrade user experience. Internal signals raised an alarm.

Root cause: Issue happens during database failover window. Connectivity takes more time in E15 than E14.

Issue fixed by retrying the request against the new BE, rather than depending on the client to do so.

This fix went to on-prem customers as part of SP1. Any Functional/Performance/Scale improvements accrue to server.
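
The fix in Case Study 1, retrying against the new BE on the server side instead of leaving it to the client, can be sketched like this. It is an illustration only: `locate_backend`, `send`, `BackendUnavailable`, and the attempt limit are hypothetical, and a real proxy would likely also wait between attempts.

```python
from typing import Callable, Optional


class BackendUnavailable(Exception):
    """Raised when the chosen backend cannot serve the request."""


def proxy_request(
    request: str,
    locate_backend: Callable[[], str],
    send: Callable[[str, str], str],
    max_attempts: int = 3,
) -> str:
    """Send via the currently active backend; on failure, re-resolve and retry."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        # Re-resolved on every attempt, so a database failover is picked up.
        backend = locate_backend()
        try:
            return send(backend, request)
        except BackendUnavailable as err:
            last_error = err
    raise BackendUnavailable(f"gave up after {max_attempts} attempts") from last_error
```

The client sees one slightly slower response rather than an error it must handle itself during the failover window.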


Case Study 2
Issue summary: On Jan 10th 2014, Microsoft identified an issue in which some customers served from the European region were unable to access their Exchange mailboxes by using the Outlook client.

Root cause: A firmware issue caused network connectivity problems on a portion of backend servers, which disconnected some users.

The system attempted to reconnect by design, which exposed a coding error with the Client Access Front End (CAFE) servers. The large number of reconnect requests caused CAFE servers to become non-responsive.

Incorporated the learning by improving Signals and Engineering.


Case Study 3
Issue summary: Mac Outlook users were unable to access their mailbox with OL2011 under certain conditions. Other related symptoms were “not able to load the content on sent”, “delayed time to delete messages”, and clicking on messages and seeing “message has no content”.

Root cause: Enhanced logging and throttling had impacted the users. Improved/fine-tuned throttling logic to improve user experience.

Missing validation. Added monitoring signals.


Case Study 4
Issue summary: Saved encrypted drafts cannot be opened in OWA SDF. The current logic used the signing cert for encryption when saving a draft to the Drafts folder. That works when only one cert is created per user, as in our test environment. In SDF, each user has a specific signing cert and an encryption cert, so the user cannot open a saved encrypted draft with this bug.

Root cause: OLK and OWA used different algorithms. Fixed by using the same algorithm used by OLK in picking up signing and encryption certs, and by using the encryption cert to encrypt at time of saving a draft. Issue found on 12/18. Fixed on 12/27. Delivered to on-prem 2/25.

Agile. Another example of Service accrues to Server.
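
The bug in Case Study 4 is easy to reproduce in miniature: code that encrypts with the signing cert works when one cert serves both purposes and fails when the two are separate. The sketch below uses a hypothetical `Cert` type with capability flags; it does not model real certificate stores or the actual OLK algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cert:
    subject: str
    can_sign: bool
    can_encrypt: bool


def pick_signing_cert(certs: List[Cert]) -> Cert:
    """Return the first certificate usable for signing."""
    for cert in certs:
        if cert.can_sign:
            return cert
    raise LookupError("no signing-capable certificate")


def pick_encryption_cert(certs: List[Cert]) -> Cert:
    """Return the first certificate usable for encryption."""
    for cert in certs:
        if cert.can_encrypt:
            return cert
    raise LookupError("no encryption-capable certificate")
```

With a single dual-purpose cert both functions return the same object, which is why the test environment hid the problem; with separate certs they differ, and only `pick_encryption_cert` is correct for encrypting a saved draft.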


In closing…


Summary
• One way of managing change, get good at it
• One way of validating change, get good at it
• Triangulate multiple signals for validation
• Agile response to mistakes
• Continue to land cloud service accrues to server


Thank You!


Relevant sessions
• Optimizing Exchange Online for efficiencies and snappy experiences
  Wednesday 1.00PM to 2.15PM
• Behind the curtain: How we run Exchange Online
  Tuesday 3.00PM to 4.15PM
• What’s that alert – Exchange Managed Availability
  Tuesday 4.45PM to 6.00PM


© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


Appendix


CUs and their dates

Exchange 2013 version                                General availability date   Build number
Release to Manufacturing version of Exchange 2013    December 3, 2012            15.00.0516.032
Exchange 2013 Cumulative Update 1 (CU1)              April 2, 2013               15.00.0620.029
Exchange 2013 CU2                                    July 9, 2013                15.00.0712.024
Exchange 2013 CU3                                    November 25, 2013           15.00.0775.038
Exchange 2013 Service Pack 1 (SP1)                   February 25, 2014           15.00.0847.032
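
Build numbers in this table order naturally when each dotted part is compared as an integer. A minimal sketch that uses only the rows above; the function names and short release labels are illustrative choices, not an official API.

```python
from typing import Dict, Tuple

# Build numbers copied from the table above.
RELEASES: Dict[str, str] = {
    "15.00.0516.032": "RTM",
    "15.00.0620.029": "CU1",
    "15.00.0712.024": "CU2",
    "15.00.0775.038": "CU3",
    "15.00.0847.032": "SP1",
}


def parse_build(build: str) -> Tuple[int, ...]:
    """Turn '15.00.0775.038' into (15, 0, 775, 38) so builds compare numerically."""
    return tuple(int(part) for part in build.split("."))


def latest_release_at_or_before(build: str) -> str:
    """Return the newest listed release whose build number is <= the given build."""
    target = parse_build(build)
    candidates = [
        (parse_build(number), name)
        for number, name in RELEASES.items()
        if parse_build(number) <= target
    ]
    if not candidates:
        raise ValueError(f"{build} predates every listed release")
    return max(candidates)[1]
```

Comparing integer tuples avoids the pitfalls of string comparison, where leading zeros and differing digit counts can give the wrong order.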