devday-dresden
4,307
467
Spread out across 35 feature teams
Production / Development
Backlog
Requirements
Visual Studio & TFS Update 1
Visual Studio & TFS Update 2
Visual Studio & TFS Update n
VS Team Services
Code Test & Stabilize Code Test & Stabilize
Beta RTM
2 years
Planning
Customer feedback: we should change the way a feature works. We didn’t get it quite right…
… but we’re booked solid already.
2 years
S1 S2 S3 S4 S5 Stabilization S6
A
B
S7 S8
2 years
3 weeks
https://flic.kr/p/arXUyP
Alignment
Autonomy
“Let’s try to give our teams three things…. Autonomy, Mastery, Purpose”
Scenarios
Features
Stories
Tasks
Sprint: 3 weeks
Plan: 3 sprints
Season: 6 months
Scenario: 18 months
Spring, Fall, Spring, Fall
Aspirational
60%
Hopeful
80%
What Epics are we lighting up?
Thoughtful
90%
What features are planned?
Confident
95%
What stories have we completed? What features are shipping?
Sprint 97, Sprint 98, Sprint 99 (each: Week 1, Week 2, Week 3)
The sprint plan
What we accomplished
• Updates were large
• Months apart
• Lots of problems!
Timeline, 4/1/2010 to 4/23/2012:
5/3/2010: TFS 2010 RTM
4/23/2011: Service Deployment
8/5/2011: Service Update
9/26/2011: //BUILD 2011
12/7/2011: Service Update
1/30/2012: Service Update
2/20/2012: Service Update
3/12/2012: Service Update
4/2/2012: Service Update
Program Management Development Testing
Operations
Program Management Engineering
Operations
Engineering
Program Management Engineering
Sprint 97, Sprint 98, Sprint 99 (each: Week 1, Week 2, Week 3)
Sprint Planning
Deployment
Done
ONE
Code Test & Stabilize Code Test & Stabilize
Beta RTM
Planning
Code Complete
ON / OFF (six toggles shown)
VSO SU1: Chicago
VSO SU0: San Antonio
VSO SU4: Amsterdam
Shared Platform Services: San Antonio
Existing experience
Baseline: 36% conversion to project
50% to 100% customers: conversion to project (+18%)
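One reading of this slide is that a new experience was exposed to 50% of customers and then ramped to 100% while conversion to project was compared against the 36% baseline. A minimal sketch of that kind of percentage ramp follows, assuming deterministic hash bucketing; the function names, the experiment name and the bucketing scheme are hypothetical and not taken from the deck.

```python
import hashlib


def in_rollout(customer_id: str, experiment: str, percent: float) -> bool:
    """Return True if the customer falls inside the first `percent` of buckets.

    Hypothetical scheme: hash (experiment, customer) into 10,000 buckets so a
    customer who is in at 50% stays in when the ramp moves to 100%.
    """
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def conversion_rate(converted: int, exposed: int) -> float:
    """Fraction of exposed customers who went on to create a project."""
    return converted / exposed if exposed else 0.0


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    exposed = [c for c in customers if in_rollout(c, "new-experience", 50)]
    print(len(exposed), "of", len(customers), "customers see the new experience")
    print("baseline conversion:", conversion_rate(36, 100))
```

Because the bucket depends only on the experiment name and the customer identifier, raising `percent` only ever adds customers to the exposed group, which keeps the comparison against the baseline stable during the ramp.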
There’s no place like production!
Telemetry everywhere
Customer Intelligence, Business Intelligence, Operational Intelligence
Dashboard DevOps Debug Experiments
Getting the availability model right
[Chart: Sept 25th 2013 LSI, 9.25.13 2:24 PM to 10:48 PM. Series: FailedExecutionCount, SlowExecutionCount, Start, End, Availability (ID4 - Activity Only), Availability (Current). Axis ranges: 0.8 to 1 and -200 to 1600.]
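The chart legend pairs FailedExecutionCount and SlowExecutionCount with two availability series. The deck does not give the formula, so the following is only a sketch of one plausible model, assuming that both failed and slow executions count against availability in a time window. The `Window` type and its `total` field are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Execution counts for one time window (hypothetical shape)."""
    total: int   # all executions observed in the window
    failed: int  # executions that failed
    slow: int    # executions that succeeded but exceeded a latency threshold


def availability(w: Window) -> float:
    """Share of executions that were neither failed nor slow.

    Assumption: a slow execution is treated as unavailable, the same as a
    failed one. An empty window is reported as fully available.
    """
    if w.total == 0:
        return 1.0
    bad = min(w.failed + w.slow, w.total)  # guard against inconsistent counts
    return 1.0 - bad / w.total


if __name__ == "__main__":
    print(availability(Window(total=1000, failed=50, slow=50)))
```

Counting slow executions as unavailable makes the metric track what users experience during an incident rather than only hard failures.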
Alerting is key to fast detection
Every alert must be actionable and represent a real issue with the system.
Alerts should create a sense of urgency; false alerts dilute that
Redundant alerts for the same issue
Needed to set right thresholds and tune often
Stateless alerts contributed to further noise
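The points about redundant and stateless alerts can be illustrated with a small stateful tracker that fires once when a condition becomes unhealthy and stays quiet until it recovers. This is a sketch under assumptions, not the team's alerting system: the `AlertTracker` class, the alert keys and the "fire"/"resolve" actions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class AlertTracker:
    """Remembers which alert keys are currently open (hypothetical design)."""
    open_keys: Set[str] = field(default_factory=set)

    def observe(self, key: str, healthy: bool) -> Optional[str]:
        """Return "fire", "resolve", or None for one health observation."""
        if not healthy and key not in self.open_keys:
            self.open_keys.add(key)
            return "fire"      # first unhealthy observation opens the alert
        if healthy and key in self.open_keys:
            self.open_keys.remove(key)
            return "resolve"   # recovery closes the alert
        return None            # repeat observation in the same state: suppressed


if __name__ == "__main__":
    tracker = AlertTracker()
    for healthy in (False, False, False, True, True):
        print(tracker.observe("memory:web-tier", healthy))
```

A stateless check would page on every unhealthy observation; keeping the open set means a sustained problem produces one actionable alert.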
Health model in action
• 3 errors for memory and performance
• All 3 related to same code defect
• APM component mapped to feature team
• Auto-dialer engaged Global DRI
Eliminated alert noise
From ~928 alerts per week to ~22, and reduced DRI escalations by ~56%
Live Site Issues (LSIs)
Time to Mitigate, Time to Detect (axis: % of Incidents)
Service Availability & Health Metrics
[Charts: Error By Source; Incidents by Severity; User Impact Minutes During Incidents [TFS Only]. Axis labels: Incident Count, % of Incidents, User Minutes. Callouts 1 to 4 refer to the notes below.]
1. TFS Availability is on an improving trend. No Sev0/Sev1 LSIs for July.
2. App Insights switched from synthetic availability to real-user experience in Ibiza portal. A high volume of SEV-2 LSIs (72) contributed to customer impact in addition to intermittent UX errors. (UX fixes that improve availability were applied on 8/11.)
3. App Insights was impacted by 3 long-running LSIs related to ES maintenance, Ibiza updates and an Azure Storage outage.
4. TFS Service attainment (SLO) improved significantly MoM with a focus on minimizing failed/slow commands and reviewing them in weekly LiveSite reviews.
Service status
© 2015 Microsoft Corporation. All rights reserved.