KEVINA FINN-BRAUN INTUIT J. PAUL REED RELEASE ENGINEERING APPROACHES DEVOPS ENTERPRISE SUMMIT, 2016 BEYOND THE RETROSPECTIVE: EMBRACING COMPLEXITY ON THE ROAD TOWARDS SERVICE OWNERSHIP

Beyond the Retrospective: Embracing Complexity on the Road to Service Ownership


  • KEVINA FINN-BRAUN, INTUIT

    J. PAUL REED, RELEASE ENGINEERING APPROACHES

    DEVOPS ENTERPRISE SUMMIT, 2016

    BEYOND THE RETROSPECTIVE: EMBRACING COMPLEXITY ON THE ROAD TOWARDS SERVICE OWNERSHIP

  • KEVINA FINN-BRAUN

    Director of Product Infrastructure Service Management at Intuit

    Director of Site Reliability Service Management at Salesforce; Business Continuity at Yahoo

    Geeks out on group dynamics and behavior

    @kfinnbraun on Twitter

    #DOES2016

    https://twitter.com/kfinnbraun

  • J. PAUL REED

    @jpaulreed on Twitter

    @shipshowpodcast alum

    Managing Partner, Release Engineering Approaches

    A DevOps Consultant

    Master's Candidate in Human Factors & Systems Safety

    https://www.twitter.com/jpaulreed

  • A QUICK RECAP FROM LAST DOES

    The Blameless Cloud: Bringing Actionable Retrospectives to SFDC (DOES 2015)

    https://www.youtube.com/watch?v=vqRASkJs2yE

  • NEW MARCHING ORDERS

  • SERVICE OWNERSHIP?

  • IT'S JUST WHAT SFDC CALLED DEVOPS

    (SSHHH, DON'T TELL ANYONE)

  • WHICH FLAVOR OF DEVOPS WOULD YOU LIKE?

  • BUT HOW DO WE DO THE DEVOPS?

    Learned helplessness?

    Uncontrollable bad event

    Perceived lack of control

    Generalized helpless behavior

    Actually: Structural blindness

  • MAKING SENSE OF SERVICE OWNERSHIP

  • WORKSHOP SURPRISES!

    Understanding teams' local rationality is key

    Words have meaning; meanings are important; but they aren't necessarily shared

    Teams must be given space to deliver on transformations

    Teams can be retrospective blind

  • DEVOPS & NUCLEAR MELTDOWNS?

  • A NEW ADVENTURE

    QuickBooks

    TurboTax

    Mint

    FY 2016: $4.7b revenue

    8,000 employees worldwide

    Founded: 1983

    IPO: 1993

    Improving the financial lives of over 45 million customers

  • SOME DIFFERENT CHALLENGES

    Intuit not born in the cloud

    Incidents meant something different

    No Bermuda Blob

    (No blob at all!)

    Different business lifecycle

  • BUT SIMILAR CHALLENGES, TOO

    Inconsistencies in operational responses

    Postmortems centered around The Old View of human error

    Some incidents & remediations got lost in the shuffle

    Surprising amount of (aggregated) service impact due to P3s/P4s

    What, exactly, is an incident?
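
The point about aggregated P3/P4 impact can be made concrete with a small tally. The following is a minimal Python sketch, not the speakers' method: the incident records, the choice of minutes of impact as the unit, and the function name `impact_by_priority` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical incident records: (priority label, minutes of service impact).
# These numbers are invented for illustration; they are not from the talk.
incidents = [
    ("P1", 90),
    ("P3", 20),
    ("P4", 15),
    ("P3", 45),
    ("P4", 30),
    ("P2", 60),
    ("P4", 25),
]


def impact_by_priority(records):
    """Sum the minutes of impact for each priority label."""
    totals = defaultdict(int)
    for priority, minutes in records:
        totals[priority] += minutes
    return dict(totals)


totals = impact_by_priority(incidents)

# Combined impact of the low-priority incidents, to compare against P1/P2.
low_priority_total = totals.get("P3", 0) + totals.get("P4", 0)
```

With sample data like this, the many small P3/P4 incidents add up to more total impact than the single P1, which is the kind of pattern the slide describes as surprising.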

  • BLAMELESS POSTMORTEMS?

    Brené Brown, research sociologist, on vulnerability

    Blame is a way to discharge pain and discomfort

    "Postmortem" has a heavy connotation

    "Awesome postmortems"? Really?!

    More at: http://jpaulreed.com/blame-aware-postmortems

  • THE INCIDENT LIFECYCLE

    Incident Detection

    Incident Response

    Incident Analysis

    Incident Remediation

    Incident Prevention*

    @kfinnbraun / #DOES2016 / @jpaulreed

  • INCIDENT DETECTION

    Levels: Novice, Beginner, Competent, Proficient, Expert

    Language

    Problems with our service are obvious; outages are obvious.

    Other teams will notify us of any problems.

    Most of the time, we're the first to know when a service is impacted.

    We use historical data to guess at service level changes.

    We've detected service level transitions via monitoring and reduced MTTD.

    I know which specific code/infra change caused this service level change; here's how I know

    We prioritize feature requests and bug reports to monitoring hooks; monitoring is a 1st class citizen.

    We've decoupled code/infra deployment, because we can roll back/forward.

    We're not paged anymore for changes automation can react to.

    Behaviors

    Manual and/or external outage notifications.

    No baseline metrics/service levels are broadly bucketed.

    External monitoring is in place to detect real time service transitions.

    Notifications are automated.

    External infra/API endpoints/outward-facing interfaces monitored/recorded.

    Historical data exists and has been used to establish graduated service baselines.

    Application internals report data to the monitoring system.

    Monitoring systems employ deep statistical methods to (dis)prove service anomalies.

    Monitoring output is reincorporated into operational behavior in an automated fashion.

    Anomalies no longer result in defined incidents.
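
Two of the Detection behaviors, establishing baselines from historical data and using statistical methods to (dis)prove anomalies, together with the MTTD mentioned under Language, can be illustrated with a minimal Python sketch. The slides do not name a method or a tool: the z-score test, the three-standard-deviation threshold, the sample latencies, and all function names here are assumptions for illustration only.

```python
from statistics import mean, stdev


def baseline(history):
    """Return (mean, sample standard deviation) of historical samples."""
    return mean(history), stdev(history)


def is_anomalous(sample, history, threshold=3.0):
    """Flag a sample whose z-score against the baseline exceeds threshold."""
    mu, sigma = baseline(history)
    if sigma == 0:
        # A flat history has no spread; any deviation counts as anomalous.
        return sample != mu
    return abs(sample - mu) / sigma > threshold


def mean_time_to_detect(events):
    """MTTD over (started_at, detected_at) pairs, in the pairs' time unit."""
    return mean(detected - started for started, detected in events)


# Hypothetical response-time history in milliseconds.
history_ms = [100, 102, 98, 101, 99, 100, 103, 97]

# Hypothetical (start, detection) times in minutes for three incidents.
detections = [(0, 12), (100, 104), (200, 208)]
mttd_minutes = mean_time_to_detect(detections)
```

A real monitoring system would likely account for seasonality and trend rather than a single global mean; the sketch only shows the basic idea of comparing a new observation against a baseline derived from history.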

  • INCIDENT RESPONSE

    Levels: Novice, Beginner, Competent, Proficient, Expert

    Language

    Have you tried turning it off and turning it on again?

    Something is wrong with the X

    I think X is familiar with Y; let's find them.

    I think there's a problem with the database, network, etc.

    Standard Incident Management System language used.

    The deployment caused the database to hang

    The infrastructure on-calls: perform a system status & report back to the IC.

    Entire team is familiar with standardized IMS language.

    Standardized IMS language is used/valued by the entire team.

    What parts of the service did not self-heal and need attention?

    Behaviors

    Team is event-focused; the team is alarmed by incidents.

    Inconsistent response once incident has commenced.

    Response based on tribal knowledge.

    Team is area-focused.

    Team is action-focused.

    Team has identified incident responders, and those people know their duties.

    Team is technology-focused.

    Incident response is an aspect of org and team culture.

    Incidents are embraced, but outside-business hours or repeated incidents are considered inhumane.

    Team is systems-focused.

  • INCIDENT ANALYSIS

    Levels: Novice, Beginner, Competent, Proficient, Expert

    Novice

    Language: "Incidents are bad; my job is on the line." "I'm getting sent to the principal's office because of this outage."

    Behaviors: Completes the post-incident paperwork. No formal retrospective/hallway retrospectives.

    Beginner

    Language: "Let's fix this as fast as possible." "What's the correct fix to avoid this specific issue in the future?"

    Behaviors: Some information (inconsistently) recorded. Jumps to a focus on why.

    Competent

    Language: "Let's review the timeline/incident report to answer that." "We need to find the root cause of this incident."

    Behaviors: Follows the prescribed format for retrospectives. Possesses and incorporates complete dataset for the incident into the retrospective.

    Proficient

    Language: "Now that we've established what happened, how did it happen?" "How did these multiple factors influence our complex system?"

    Behaviors: Identifies inherent bias in self and others. Perspectives solicited from all involved team members/functional groups.

    Expert

    Language: "How does our team/system contribute to our successes?" "What can we incorporate from this incident to better respond next time?"

    Behaviors: Able to facilitate retrospectives by healthily helping others address tendency to blame/personal & systemic bias. Retrospective outcomes are fed back into the system and prioritized.
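
One way a team might work with this matrix is to hold it as plain data and tally which listed behaviors they recognize in themselves. The Python sketch below is an illustration under assumptions, not the format used in the speakers' repository: the per-level grouping follows the order in which the slides introduce the Behaviors entries, and the names `LEVELS`, `ANALYSIS_BEHAVIORS`, and `tally` are hypothetical.

```python
# Skill levels in the order used by the matrix.
LEVELS = ["Novice", "Beginner", "Competent", "Proficient", "Expert"]

# Behaviors row of the Incident Analysis matrix, grouped by level.
# The grouping is an assumption based on the order the slides reveal them.
ANALYSIS_BEHAVIORS = {
    "Novice": [
        "Completes the post-incident paperwork.",
        "No formal retrospective/hallway retrospectives.",
    ],
    "Beginner": [
        "Some information (inconsistently) recorded.",
        "Jumps to a focus on why.",
    ],
    "Competent": [
        "Follows the prescribed format for retrospectives.",
        "Possesses and incorporates complete dataset for the incident "
        "into the retrospective.",
    ],
    "Proficient": [
        "Identifies inherent bias in self and others.",
        "Perspectives solicited from all involved team members/"
        "functional groups.",
    ],
    "Expert": [
        "Able to facilitate retrospectives by healthily helping others "
        "address tendency to blame/personal & systemic bias.",
        "Retrospective outcomes are fed back into the system and prioritized.",
    ],
}


def tally(observed):
    """Count, per level, how many listed behaviors appear in `observed`."""
    seen = set(observed)
    return {
        level: sum(b in seen for b in ANALYSIS_BEHAVIORS[level])
        for level in LEVELS
    }


# Hypothetical self-assessment: behaviors a team recognizes in itself.
observed = [
    "Jumps to a focus on why.",
    "Follows the prescribed format for retrospectives.",
]
result = tally(observed)
```

The tally deliberately returns counts for every level instead of a single score, since a team can show behaviors from several levels at once.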

  • INCIDENT REMEDIATION

    Levels: Novice, Beginner, Competent, Proficient, Expert

    Language

    Let's just file a ticket to track the issue.

    I am sure this is the issue; the fix will correct 100% of the occurrences.

    I'm pretty sure we already fixed this?

    We need an action plan to address the process gaps.

    This needs to be fixed in the next release and documented in our incident response docs.

    We need to look deeper than this specific incident to really address the problem.

    What can we learn from this incident?

    What other system aspects have we learned from this incident? How can we use that?

    While operating our system today, how did we actively create & sustain success?

    Behaviors

    Remediation activities (or lack thereof) contribute to a break-fix cycle.

    Discussions of the incident are aggressive/blameful.

    Low hanging fruit may be fixed, but not documented or incorporated into team behavior.

    More processes, more procedures, more rules.

    Issues of all sizes are actively managed.

    Issues have a priority and teams have bandwidth to address them.

    Completed issue remediation is valued by the org.

    Bandwidth exists to discuss, design and implement resiliency improvements.

    Remediation is not regarded as a separate activity & is culturally integrated into work.

    Resilience is considered in the design phase for new infra/software.

  • INCIDENT PREVENTION*

    Levels: Novice, Beginner, Competent, Proficient, Expert

    Language

    Preventing future incidents is difficult because of lacking data.

    We can use predictive metrics to completely avoid future incidents.

    Our system has reasonable coverage of its metrics.

    We use metrics to inform attack/risk surface.

    We use trend analysis to raise soft problems to operators.

    Old documentation is problematic and dealt with accordingly.

    When we started game days, it was a real mess.

    We now care less about specific incidents & more about crew formation.

    The team is excited about game days.

    Our crews care about their formation and dissolution.

    Behaviors

    Prevention efforts include documentation, process design, metrics collection.

    Retrospective focus is on static causes/effects.

    Retrospectives include discussions of active operator behaviors.

    Docs, process, metrics established, but < 100%.

    Preventative focus is on reviewing docs+process+metrics collection, but in a day-to-day context.

    Retrospectives focus on the response of the team to an incident.

    We actively inject failure into our systems on a known schedule, to drill.

    We review our response to induced failures.

    The crew formation/dissolution process is considered our primary role+responsibility in addressing and preventing operational failure

    We actively inject failure at random intervals.
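
The Behaviors list distinguishes injecting failure on a known schedule from injecting it at random intervals. A minimal Python sketch of the difference follows. It only produces drill times; it does not inject anything. The slides do not say how intervals are chosen, so the exponential spacing, the day-based time unit, and the function names are assumptions for illustration.

```python
import random


def fixed_schedule(start, interval, count):
    """Drill times on a known schedule: evenly spaced from `start`."""
    return [start + i * interval for i in range(count)]


def random_schedule(start, mean_interval, count, seed=None):
    """Drill times at random intervals.

    Gaps are drawn from an exponential distribution with the given mean,
    an assumed choice that makes the next drill hard to anticipate.
    """
    rng = random.Random(seed)  # seeded so a plan can be reproduced
    times = []
    t = start
    for _ in range(count):
        t += rng.expovariate(1.0 / mean_interval)
        times.append(t)
    return times


# Hypothetical plans, with time measured in days from today.
known = fixed_schedule(0, 7, 3)
surprise = random_schedule(0.0, 7.0, 5, seed=42)
```

With a known schedule the team can prepare for the drill itself; with random spacing the drill exercises whatever response the team has at that moment, which matches the progression the matrix describes.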

  • HELP US MAKE IT BETTER!

    https://github.com/preed/incident-lifecycle-model

  • FACILITATE TEAMS EXPLORING THEIR DISCRETIONARY SPACE

  • INCIDENT RESPONSE != INCIDENT MANAGEMENT

    (YOUR INCIDENT VALUE STREAM MATTERS)

  • YOU ARE NEVER DONE.

  • YOU. ARE. NEVER. DONE.

  • AVENUES FOR COLLABORATION

    Take a look at the extended incident lifecycle model and your organization: see where it fits and doesn't!

    (And then send us GitHub pull requests!)

    Compare your own (documented?) incident life cycle against your actual incident value stream; share what you find!

  • Kevina Finn-Braun [email protected] http://lnkdin.me/kevinafinnbraun

    J. Paul Reed [email protected]

    http://jpaulreed.com
