Lost art of troubleshooting

Leon Fayer, @papa_fire, http://fayerplay.com

Page 1: Lost art of troubleshooting

http://fayerplay.com

lost art of troubleshooting

Leon Fayer

@papa_fire

Page 2: Lost art of troubleshooting

{me}

20+ years breaking & fixing

dev, architect, [DevOps]

vp @ OmniTI

fix other people’s

Page 3: Lost art of troubleshooting

why troubleshooting?

Page 4: Lost art of troubleshooting

cloud ruined everything

it really did

Page 5: Lost art of troubleshooting

1997

Most reliable way to fix Windows problems

when in doubt - reboot

2017

DevOps mantra for managing cloud-based systems

destroy and rebuild

Page 6: Lost art of troubleshooting

old McDonald had a farm

Page 7: Lost art of troubleshooting

old McDonald lost a farm

due to mad cow disease

Page 8: Lost art of troubleshooting

troubleshooting - a form of problem solving

Page 9: Lost art of troubleshooting

problem solving - ability to fix things that you know nothing about

Page 10: Lost art of troubleshooting

why is problem solving important?

Page 11: Lost art of troubleshooting

… because systems are complex

Page 12: Lost art of troubleshooting

… because of Murphy’s law

Page 13: Lost art of troubleshooting

… because someone is always watching

Page 14: Lost art of troubleshooting

{disclaimer}

Page 15: Lost art of troubleshooting

Page 16: Lost art of troubleshooting

wishful thinking

Page 17: Lost art of troubleshooting

reality

Page 18: Lost art of troubleshooting

where to begin?

Page 19: Lost art of troubleshooting

replicate

Page 20: Lost art of troubleshooting

OUR TEAM

isolate

Page 21: Lost art of troubleshooting

fix?

Page 22: Lost art of troubleshooting

what’s the problem?

it’s broken!

Page 23: Lost art of troubleshooting

understanding

Page 24: Lost art of troubleshooting

understand problem

Page 25: Lost art of troubleshooting

“we can’t support 100s req/min, we need to scale better!”

Page 26: Lost art of troubleshooting

“we can’t support 100s req/min, we need to scale better!”

improve performance

Page 27: Lost art of troubleshooting

performance problem

Page 28: Lost art of troubleshooting

perceived problem

Page 29: Lost art of troubleshooting

actual problem

Page 30: Lost art of troubleshooting

understand business

Page 31: Lost art of troubleshooting

“I don’t give a **** if the datacenter is on fire as long as I am still making money”

Page 32: Lost art of troubleshooting

what does it mean to you?

Page 33: Lost art of troubleshooting

Page 34: Lost art of troubleshooting

sales

Page 35: Lost art of troubleshooting

Page 36: Lost art of troubleshooting

content

Page 37: Lost art of troubleshooting

content

ad revenue

Page 38: Lost art of troubleshooting

every technical decision powers a business need

Page 39: Lost art of troubleshooting

understand impact

Page 40: Lost art of troubleshooting

Page 41: Lost art of troubleshooting

is there a lesser of two evils?

Page 42: Lost art of troubleshooting

sometimes breaking = fixing

Page 43: Lost art of troubleshooting

80% now > 100% tomorrow

Page 44: Lost art of troubleshooting

incremental improvements

Page 45: Lost art of troubleshooting

anatomy of a problem

Page 46: Lost art of troubleshooting

anatomy of a problem

problem

norm norm

Page 47: Lost art of troubleshooting

anatomy of a problem

problem

norm

acceptable

norm

Page 48: Lost art of troubleshooting

anatomy of a problem

problem

norm

acceptable

norm

fix fix fix fix

Page 49: Lost art of troubleshooting

what have we learned?

understanding

of what’s important

cause and effect

largest impact

acceptable risk

Page 50: Lost art of troubleshooting

what not to do

Page 51: Lost art of troubleshooting

don’t assume

Page 52: Lost art of troubleshooting

Page 53: Lost art of troubleshooting

I didn’t build it

it’s not documented

it passed the tests

works in dev

everything looks right

Page 54: Lost art of troubleshooting

Page 55: Lost art of troubleshooting

don’t feed your ego

solve the problem

Page 56: Lost art of troubleshooting

ask for help

Page 57: Lost art of troubleshooting

tools

Page 58: Lost art of troubleshooting

logging monitoring

profiling

Page 59: Lost art of troubleshooting

logging

actionable

concise

parsable
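
One way to read "parsable" is: one event per line, with a fixed timestamp prefix followed by key=value pairs. The sketch below is only an illustration under that assumption. The helper names `format_event` and `parse_event` and the field names (`level`, `uuid`, `msg`, `curl_code`) are hypothetical, not something the talk prescribes.

```python
from __future__ import annotations

import shlex
from datetime import datetime

STAMP = "%Y-%m-%d %H:%M:%S"


def format_event(ts: datetime, level: str, uuid: str, **fields: object) -> str:
    """Render one event as a single line: [timestamp] key=value key=value ..."""
    pairs = {"level": level, "uuid": uuid, **fields}
    # shlex.quote wraps values containing spaces in quotes so they survive parsing.
    body = " ".join(f"{key}={shlex.quote(str(value))}" for key, value in pairs.items())
    return f"[{ts.strftime(STAMP)}] {body}"


def parse_event(line: str) -> tuple[datetime, dict[str, str]]:
    """Inverse of format_event: recover the timestamp and the key=value fields."""
    stamp, _, body = line.partition("] ")
    ts = datetime.strptime(stamp.lstrip("["), STAMP)
    fields = dict(token.split("=", 1) for token in shlex.split(body))
    return ts, fields


line = format_event(
    datetime(2017, 2, 1, 19, 4, 3),
    "ERROR",
    "FC0470D4-E19D-11E6-8FB4-CB1814EF18C0",
    msg="Timeout was reached",
    curl_code=28,
)
when, event = parse_event(line)
```

Because every line carries the level and the UUID, a plain text search for either one is enough to pull out the lines that matter.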

Page 60: Lost art of troubleshooting

[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 16:46:31] AbandonedReservation successfully enqueued.
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Parsed args
[2017-02-01 18:57:02] Posting to API
[2017-02-01 18:57:02] Initializing args
[2017-02-01 18:57:02] Loading reservation_form_data
[2017-02-01 18:57:03] Reservation Form Data loaded successfully
[2017-02-01 18:57:03] Appending campaign info
[2017-02-01 18:57:03] Reservation name: [Some very very very long name]
[2017-02-01 18:57:03] Using code = SUPERHERO:20171007 for this instance.
[2017-02-01 18:57:03] Setting currency to US Dollar
[2017-02-01 18:57:03] Appending marketing info
[2017-02-01 18:57:03] Have a non-sku source_code
[2017-02-01 18:57:03] Marketing: setting campaignid = GOOGLE
[2017-02-01 18:57:03] Appending match rule = Match Rule
[2017-02-01 18:57:03] Appending user info
[2017-02-01 18:57:03] Appending order info
[2017-02-01 18:57:03] Fetching cost range for item_id = 975, sku_id = 4871
[2017-02-01 18:57:03] Determining actual cost table
[2017-02-01 18:57:03] Appending comment notes
[2017-02-01 18:57:03] Appending abandoned flag
[2017-02-01 18:57:03] API GET data: {
    token = gEcre26reWrAdEnufE3HesVupRepahuDumapHuyap2evufreWrufraBebre7u4a6
    contact.address1_city = New York
    contact.address1_country = USA
    contact.address1_line1 = 123 test lane
    contact.address1_line2 =
    contact.address1_postalcode = 12345
    contact.address1_stateorprovince =
    contact.emailaddress1 = [email protected]
    contact.firstname = joe
    contact.lastname = smith
    contact.mobilephone = 1234567890
    ...
}
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

Page 61: Lost art of troubleshooting

useful information

[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

[2017-02-01 18:57:03] API GET data:

[2017-02-01 19:04:03] Post complete, took 420 seconds

[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

Page 62: Lost art of troubleshooting

information I need

[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
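
The narrowing shown on these slides can be approximated with a few lines of code. This is a rough sketch, assuming the log keeps the `[YYYY-MM-DD HH:MM:SS] message` shape seen above. `SAMPLE` is a subset of the lines from the slides, and the helper names `parse` and `first` are made up for illustration.

```python
import re
from datetime import datetime

# A few lines copied from the job log on the slides.
SAMPLE = """\
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Parsed args
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
"""

LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.*)$")


def parse(text):
    """Return (timestamp, message) pairs for every well-formed line."""
    events = []
    for raw in text.splitlines():
        match = LINE.match(raw)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def first(events, prefix):
    """First event whose message starts with prefix (StopIteration if absent)."""
    return next(event for event in events if event[1].startswith(prefix))


events = parse(SAMPLE)
queued, _ = first(events, "Queuing UUID")
started, _ = first(events, "Processing UUID")
failed, error = first(events, "ERROR")

queue_wait = (started - queued).total_seconds()  # time spent waiting in the queue
run_time = (failed - started).total_seconds()    # time from pickup to the error
```

For the lines above, `queue_wait` is 7831 seconds (more than two hours in the queue) and `run_time` is 421 seconds, which is the kind of detail that gets buried when every step logs a line.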

Page 63: Lost art of troubleshooting

monitoring

all inclusive

business-first

correlatable

Page 64: Lost art of troubleshooting

what’s the problem?

it’s broken!

Page 65: Lost art of troubleshooting

revenue

Page 66: Lost art of troubleshooting

revenue

Page 67: Lost art of troubleshooting

revenue

user performance

Page 68: Lost art of troubleshooting

revenue

database load

user performance

Page 69: Lost art of troubleshooting

revenue

database load

decline rate

user performance
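
"Correlatable" could mean something like the following: start from the business metric (revenue) and ask which technical metric moves with it. This is a minimal sketch using a hand-written Pearson correlation. All numbers are made up for illustration and do not come from the talk, and the metric names are hypothetical labels for the series on the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / (spread_x * spread_y)


# Illustrative samples taken at the same eight points in time (invented values).
revenue = [100, 98, 101, 60, 55, 58, 97, 102]
metrics = {
    "user_performance_ms": [200, 210, 205, 900, 950, 920, 215, 198],
    "database_load": [0.40, 0.42, 0.41, 0.43, 0.40, 0.42, 0.41, 0.40],
    "decline_rate": [0.02, 0.03, 0.02, 0.03, 0.02, 0.02, 0.03, 0.02],
}

scores = {name: pearson(revenue, series) for name, series in metrics.items()}
# Strongest relationship first, regardless of direction.
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
```

In this invented data the revenue dip lines up with slow user-facing responses, so that metric ranks first. Correlation only suggests where to look next; it does not prove cause.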

Page 70: Lost art of troubleshooting

profiling

Page 71: Lost art of troubleshooting

when you have the “what” but still have no idea “why”

Page 72: Lost art of troubleshooting

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* request starts: remember the URI and reset the per-thread timers */
::ap_process_request:process-request-entry
/zonename == "www4"/
{
    self->uri = copyinstr(arg1);
    self->runtime = 0;
    self->waittime = 0;
    self->oncpu = timestamp;
}

/* thread leaves the CPU: bank the time it just spent running */
sched:::off-cpu
/self->uri != 0/
{
    self->runtime += timestamp - self->oncpu;
    self->offcpu = timestamp;
}

/* thread is back on the CPU: bank the time it just spent waiting */
sched:::on-cpu
/self->uri != 0/
{
    self->oncpu = timestamp;
    self->waittime += timestamp - self->offcpu;
}

/* request ends: aggregate the per-thread totals by URI */
::ap_process_request:process-request-return
/self->uri != 0/
{
    @duration[self->uri] = sum(self->runtime);
    @waiting[self->uri] = sum(self->waittime);
    @count[self->uri] = count();
}

/* every five minutes: print the top 10 of each aggregation, then reset it */
:::tick-5min
{
    printf("\n%Y\n", walltimestamp);
    printf("\nTOTAL TIME SPENT ON CPU BY ALL HITS ON THIS URL\n");
    trunc(@duration,10); printa(@duration); trunc(@duration);

    printf("\n\nNUMBER OF HITS\n");
    trunc(@count,10); printa(@count); trunc(@count);

    printf("\n\nTOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL\n");
    trunc(@waiting,10); printa(@waiting); trunc(@waiting);
}

TOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL

/directory                       6850049
/api/map_search                  7341249
/api/mobile/get_all_items        7980925
/m/                              9124747
/m/directory                     9175345
/api/mobile/get_profile          11729556
/api/holiday_feed                12603853
/api/mobile/get_all_widgets      15043481
/api/get_item/60693              19773404
/m/events/all                    26165132
/api/all_items                   27362330
/api/mobile/get_all_events       368584344

Page 73: Lost art of troubleshooting

/api/mobile/get_all_events 368584344

Page 74: Lost art of troubleshooting

/api/get_item/60693 19773404
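
Once the aggregation is printed, picking the outlier can be automated. The sketch below parses the off-CPU table from the slides and reports which URL dominates. The helper name `parse_aggregation` is hypothetical. The values are the sums from the script's `@waiting` aggregation, and DTrace's `timestamp` variable counts nanoseconds.

```python
from __future__ import annotations

# Off-CPU totals copied from the slide, one "url value" pair per line.
OUTPUT = """\
/directory 6850049
/api/map_search 7341249
/api/mobile/get_all_items 7980925
/m/ 9124747
/m/directory 9175345
/api/mobile/get_profile 11729556
/api/holiday_feed 12603853
/api/mobile/get_all_widgets 15043481
/api/get_item/60693 19773404
/m/events/all 26165132
/api/all_items 27362330
/api/mobile/get_all_events 368584344
"""


def parse_aggregation(text: str) -> dict[str, int]:
    """Map each URL to its aggregated value, skipping lines that do not fit."""
    totals = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            totals[parts[0]] = int(parts[1])
    return totals


waiting = parse_aggregation(OUTPUT)
total_ns = sum(waiting.values())
worst_url, worst_ns = max(waiting.items(), key=lambda item: item[1])
share = worst_ns / total_ns  # fraction of all listed off-CPU time
```

Here `/api/mobile/get_all_events` accounts for roughly 70% of the listed off-CPU time. These are totals across all hits, so a per-request view would also need the hit counts from the `@count` aggregation, which the slides do not show.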

Page 75: Lost art of troubleshooting

down the rabbit hole

Page 76: Lost art of troubleshooting

troubleshooting is …

required skill

educational

iterative

frustrating

rewarding

Page 77: Lost art of troubleshooting

Page 78: Lost art of troubleshooting

credits

https://www.track5media.com/wp-content/uploads/2016/06/workers-gathered-around-comuputer-screen.jpg
http://more-sky.com/data/out/10/IMG_379964.jpg
https://ruwix.com/pics/trolls/9-rubix-cube-neversolved.jpg
http://blog.cartif.com/wp-content/uploads/2016/02/evolucion.png
https://cdn-images-1.medium.com/max/2000/1*t-yZUIXuaXo97yiqYtpC5A.jpeg
http://www.6speedonline.com/forums/attachment.php?attachmentid=286232&stc=1&d=1380726388
http://www.wallpapers.faketrix.com/content/animal/feathered/page-2/1024/Ostrich-non-flying-winged-animals.jpg
http://oldmanyellsat.cloud/oldman.jpg
http://cdn.wccftech.com/wp-content/uploads/2016/05/4195797-windows-7-alternate-blue.jpg
https://www.poweradmin.com/blog/wp-content/uploads/2015/10/amazon-aws.png
https://supportingcmu.org/image/Herd.png
http://www.publicdomainpictures.net/pictures/30000/velka/green-fields-1351063140pg3.jpg
https://hurtigruten.global.ssl.fastly.net/assets/48dee2/globalassets/photos/voyages/explorer-voyages/2017-18/ms-fram-antarctica/the-frozen-land-of-the-penguins/2500x1250_r739816dominicbarrington.jpg?width=1600&height=800&transform=DownFill
https://www.thegeneralistit.com/wp-content/uploads/2015/11/dreamstime_xxl_38819851-Business-woman-eliminate-problem-and-find-solution.jpg
http://paperzip.co.uk/wp-content/uploads/2016/01/word-of-the-day-newspaper.jpg
http://vignette3.wikia.nocookie.net/starwars/images/7/72/DeathStar1-SWE.png/revision/latest?cb=20150121020639
https://lcarsgfx.files.wordpress.com/2014/10/prometheus1.png
https://cdn.meme.am/cache/instances/folder699/400x/65194699.jpg
http://blog.weespring.com/wp-content/uploads/2014/06/baby-safety-manual-5.jpg
https://4.bp.blogspot.com/-2fGfDw-sohs/V9_CAwCcnaI/AAAAAAAACos/zrARBywD2qAZOphkQMC7WZGdV3vMY5nTACLcB/s1600/Stop%2Bwhining.jpg
https://ih0.redbubble.net/image.14163956.5143/raf,750x1000,075,t,black_white.u4.jpg
http://www.inspireddad.org/wp-content/uploads/uploads/2013/02/ducttape_0930a8_3926013.jpg
https://katieleigh.files.wordpress.com/2014/10/img_0683.jpg
http://pre02.deviantart.net/020c/th/pre/i/2016/094/8/0/down_the_rabbit_hole_by_irenhorrors-d7hgsr3.jpg
http://i1-linux.softpedia-static.com/screenshots/Valgrind_1.png
http://i.imgur.com/m6Rkbdx.gif

questions?