Instrumentation as a Living Documentation: Teaching Humans About Complex Systems



I do things to/with computers.

I build real-time systems.

I build distributed systems.

I build critical systems.

AdRoll

LESS THIS

MORE THIS

WE’RE AN AD TECH COMPANY.

REAL-TIME BIDDING

The nature of the problem domain:

• Low latency ( < 100ms per transaction )

• Firm real-time system

• Highly concurrent ( > 55 billion transactions per day )

• Global, 24/7 operation
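
A rough sense of scale follows from the two figures above. This is a back-of-the-envelope sketch, not a measurement: it assumes traffic is spread evenly across the day (real peaks would be higher) and, for the concurrency estimate, that every transaction uses its full 100 ms budget. The function names are illustrative only.

```python
# Back-of-the-envelope load implied by the figures on the slide.
TRANSACTIONS_PER_DAY = 55_000_000_000  # "> 55 billion transactions per day"
LATENCY_BUDGET_S = 0.100               # "< 100ms per transaction"
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(per_day: int) -> float:
    """Average arrival rate, assuming evenly spread traffic (an assumption)."""
    return per_day / SECONDS_PER_DAY


def in_flight_at_full_budget(rate_per_second: float, latency_s: float) -> float:
    """Little's law (L = lambda * W): average number of transactions in flight
    if each one takes the whole latency budget."""
    return rate_per_second * latency_s


rate = average_rate_per_second(TRANSACTIONS_PER_DAY)
in_flight = in_flight_at_full_budget(rate, LATENCY_BUDGET_S)
print(f"{rate:,.0f} transactions/s on average")
print(f"{in_flight:,.0f} transactions in flight at the full budget")
```

Under those assumptions the average works out to roughly 640 thousand transactions per second, with tens of thousands in flight at any instant.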

I build Complex Systems

Complex Systems

• Non-linear feedback

• Tightly coupled to external systems

• Difficult to model, understand

• Usually a solution to some “wicked problem”

C. WEST CHURCHMAN, “GUEST EDITORIAL: WICKED PROBLEMS”, MANAGEMENT SCIENCE VOL. 4, 1967

[WICKED PROBLEMS ARE] SOCIAL PROBLEMS WHICH ARE ILL FORMULATED, WHERE THE INFORMATION IS CONFUSING, WHERE THERE ARE MANY CLIENTS AND DECISION-MAKERS WITH CONFLICTING VALUES, AND WHERE THE RAMIFICATIONS IN THE WHOLE SYSTEM ARE THOROUGHLY CONFUSING. […] THE ADJECTIVE ‘WICKED’ IS SUPPOSED TO DESCRIBE THE MISCHIEVOUS AND EVEN EVIL QUALITY OF THESE PROBLEMS, WHERE PROPOSED ‘SOLUTIONS’ OFTEN TURN OUT TO BE WORSE THAN THE SYMPTOMS.

Bad things happen when Complex Systems fail.

Complex Systems often create worse problems than those they solve.

HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS (…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM.

CARLOS BUENO, “MATURE OPTIMIZATION HANDBOOK”

The key challenge to sustaining a complex system is maintaining our understanding of it.

We write documentation.

Complex systems are fiendishly difficult to communicate about.

Miscommunications are accidents in the making.

Documentation reduces accidents.

IF YOU DON’T KNOW HOW THE SYSTEM SHOULD BEHAVE YOU CAN’T SAY HOW IT SHOULDN’T OR ISN’T.

Trouble is, documentation goes out of date.

Complex Systems evolve and written words “rot” as the system moves on.

Engineers fail to update documentation as the system changes.

DAVID E. HOFFMAN, “THE DEAD HAND: THE UNTOLD STORY OF THE COLD WAR ARMS RACE AND ITS DANGEROUS LEGACY”

ONE OPERATOR (…) WAS CONFUSED BY THE LOGBOOK. HE CALLED SOMEONE ELSE TO INQUIRE.

“WHAT SHALL I DO?” HE ASKED. “IN THE PROGRAM THERE ARE INSTRUCTIONS OF WHAT TO DO, AND THEN A LOT OF THINGS CROSSED OUT.”

THE OTHER PERSON THOUGHT FOR A MINUTE, THEN REPLIED, “FOLLOW THE CROSSED OUT INSTRUCTIONS.”

Engineers can be unaware of the system as it is actually used.

ERIC SCHLOSSER, COMMAND AND CONTROL: NUCLEAR WEAPONS, THE DAMASCUS ACCIDENT, AND THE ILLUSION OF SAFETY

CLEARLY THE TEXTBOOKS (…) DIDN’T TELL YOU WHAT REALLY HAPPENED IN THE FIELD. (…) (T)HERE WAS A WAY YOU WERE SUPPOSED TO DO THINGS – AND THE WAY THINGS GOT DONE. RFHCO SUITS WERE HOT AND CUMBERSOME (…) AND IF A MAINTENANCE TASK COULD BE ACCOMPLISHED QUICKLY WITHOUT AN OFFICER NOTICING, SOMETIMES THE SUITS WEREN’T WORN.

(Normal) Accidents happen.

HENRY S. F. COOPER, JR., XIII: THE APOLLO FLIGHT THAT FAILED

THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD.

Documentation doesn’t necessarily reflect the reality of the system.

What can we do?

INSTRUMENTATION

Instrumentation reflects the reality of the system as it exists.

Instrumentation allows users and engineers to explore the system as it exists.

Exploration, done honestly, guides us to a new, better understanding of the system.

THIS “COLLECTIVE ENTITY” WAS ORGANIZED AROUND THE PILOT TO MAKE IT “SAFER AND MORE EFFICIENT IF THERE WAS A FOCAL POINT. AND I WAS THE FOCAL POINT. JIM FED THINGS INTO MY EARS. THE MOON FED THINGS INTO MY EYES AND I COULD FEEL THE MACHINE OPERATING.”

COMMANDER DAVID SCOTT, AS QUOTED IN DAVID A. MINDELL'S DIGITAL APOLLO: HUMAN AND MACHINE IN SPACEFLIGHT

Instrumentation democratizes the organization around a complex system.

Case Studies

Case Study: Exchange Throttling

Healthy pattern of bid requests

The trough of throttling

BAD

GOOD

Problem confirmed with Exchange

• All other metrics (run-queue, CPU, network IO) were fine.

• Confirmed that no changes had been made to the running systems via deployment.

• Amazon data showed no network issues to our machines.

What happened?

We hit an implicit exchange limit. (Arguably, a bug.)

Case Study: Timeout Jumps

Healthy Pattern of Background Timeouts

Unhealthy timeouts.

Healthy Bid Requests

Unhealthy Bid Requests

Cliff of Throttling

• The timeout jump occurred only in US East; US West was fine.

• All other metrics (as above) checked out.

• System deployment strongly correlated with timeout jump.

• Rollback to the previous release reduced timeouts to acceptable levels.

What happened?

Who can say? ¯\_(シ)_/¯

Lessons Learned

It is possible to have too little information.

(THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF CHERNOBYL REACTOR 4). THEY KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS.

SVETLANA ALEXIEVICH, VOICES FROM CHERNOBYL: THE ORAL HISTORY OF A NUCLEAR DISASTER

It is possible to collect too much information, or present it badly.

SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. (…) ONE OF THE LESSONS OF COMPLEX SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS.

CHARLES PERROW, NORMAL ACCIDENTS: LIVING WITH HIGH-RISK TECHNOLOGIES

Instrumentation is not a panacea.

Instruments may be misleading.

Must know some Mathematics.

Too much information hampers interpretation.

Instruments may be inaccurate.

Instruments may be ignored.

Instrumentation may be used for undesirable purposes.

What can we do?

Write documentation!

Context reduces misinterpretations.

Misleading Instruments

Procedure manuals and visualizations reduce the need for math background.

Must Know Math

The more contextual layers you add, the more you reduce “big boards of blinky lights”.

Too Much Information

INSTRUMENTATION IS LIKE A SUIT. IT NEEDS TO FIT YOUR OWN MIND.

VALENTINO VOLONGHI

Cross-checks and documented error margins mitigate instrument inaccuracy.

Inaccuracy

IF YOU DON'T TRUST A COMPUTER BECAUSE SOMETIMES IT DOESN'T TELL YOU THE TRUTH, TELLING IT TO TELL YOU TO TRUST IT IS ASKING IT TO LIE TO YOU SOMETIMES.

MIKE SASSAK, CURBSIDE
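
The cross-check idea for inaccurate instruments can be sketched in a few lines. This is a minimal illustration, not anything shown in the talk: it assumes two independent readings of the same quantity (for example, a count from the application and a count from a load balancer) and a relative error margin that has been written down somewhere. The function name and the example numbers are hypothetical.

```python
def cross_check(primary: float, secondary: float, error_margin: float) -> bool:
    """Return True when two independent readings of the same quantity agree
    within a documented relative error margin (0.01 means 1%)."""
    if error_margin < 0:
        raise ValueError("error_margin must be non-negative")
    baseline = max(abs(primary), abs(secondary))
    if baseline == 0:
        # Both instruments read zero, so they agree trivially.
        return True
    return abs(primary - secondary) / baseline <= error_margin


# Hypothetical readings: requests counted by two different instruments.
agrees = cross_check(primary=1000, secondary=1004, error_margin=0.01)
disagrees = cross_check(primary=1000, secondary=900, error_margin=0.01)
print(agrees, disagrees)
```

A disagreement does not say which instrument is wrong, only that at least one of them needs a closer look.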

Checklists with references to instrumentation at decision points.

May be Ignored

Collaborative Workplaces, Cooperatives, Unions, Laws etc.

Undesirable Purposes

I PROPOSE THAT MEN AND WOMEN BE RETURNED TO WORK AS CONTROLLERS OF MACHINES, AND THAT THE CONTROL OF PEOPLE BY MACHINES BE CURTAILED. I PROPOSE, FURTHER, THAT THE EFFECTS OF CHANGES IN TECHNOLOGY AND ORGANIZATION ON LIFE PATTERNS BE TAKEN INTO CAREFUL CONSIDERATION, AND THAT THE CHANGES BE WITHHELD OR INTRODUCED ON THE BASIS OF THIS CONSIDERATION.

KURT VONNEGUT, PLAYER PIANO

Instrumentation addresses the problems of documentation, documentation the problems of instrumentation.

TL;DR

Complex Systems need them both.

How do I get started?

Exometer

Dropwizard’s Metrics

Scales

DataDog, NewRelic, Librato
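
Before picking one of the libraries or services above, it can help to see how little an instrument needs to be. The sketch below uses only the Python standard library and is not the API of Exometer, Dropwizard's Metrics, Scales, or any hosted service; the `Instruments` class, its method names, and the metric names are all hypothetical.

```python
import threading
import time
from contextlib import contextmanager


class Instruments:
    """A tiny in-process registry of counters and timing samples."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}
        self._timings = {}

    def increment(self, name, amount=1):
        """Add `amount` to the named counter, creating it at zero if needed."""
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    @contextmanager
    def timed(self, name):
        """Record how long the enclosed block took, in seconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:
                self._timings.setdefault(name, []).append(elapsed)

    def snapshot(self):
        """Return a copy of everything recorded so far."""
        with self._lock:
            return {
                "counters": dict(self._counters),
                "timings": {k: list(v) for k, v in self._timings.items()},
            }


# Usage: count requests and time the work done for each one.
instruments = Instruments()
for _ in range(2):
    instruments.increment("bid_requests")
    with instruments.timed("bid_latency_s"):
        sum(range(1000))  # stand-in for real work

snap = instruments.snapshot()
print(snap["counters"])
```

A real library adds what this sketch leaves out: bounded memory for samples, percentiles, and a way to ship the data somewhere people can look at it.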

Questions?

Thanks! <3

@bltroutwine
