lost art of troubleshooting
http://fayerplay.com
@papa_fire
Leon Fayer
@papa_fire
{me}
20+ years breaking & fixing
dev, architect, [DevOps]
vp @ OmniTI
fix other people’s
@papa_fire
why troubleshooting?
@papa_fire
cloud ruined everything
it really did
1997: Most reliable way to fix Windows problems
when in doubt - reboot
2017: DevOps mantra for managing cloud-based systems
destroy and rebuild
old McDonald had a farm
old McDonald lost a farm
due to mad cow disease
@papa_fire
troubleshooting - a form of problem solving
@papa_fire
problem solving - ability to fix things that you
know nothing about
@papa_fire
why is problem solving important?
@papa_fire
… because systems are complex
@papa_fire
… because of Murphy’s law
@papa_fire
… because someone is always watching
@papa_fire
{disclaimer}
@papa_fire
wishful thinking
@papa_fire
reality
@papa_fire
where to begin?
@papa_fire
replicate
@papa_fire
OUR TEAM
isolate
@papa_fire
fix?
@papa_fire
what’s the problem?
it’s broken!
understanding
@papa_fire
understand problem
@papa_fire
“we can’t support 100s req/min, we need to scale better!”
improve performance
@papa_fire
performance problem
@papa_fire
perceived problem
@papa_fire
actual problem
@papa_fire
understand business
@papa_fire
“I don’t give a **** if the datacenter is on fire as long as I am still making money”
@papa_fire
what does it mean to you?
@papa_fire
sales
@papa_fire
content
@papa_fire
content
ad revenue
@papa_fire
every technical decision
powers a business need
@papa_fire
understand impact
@papa_fire
is there a lesser of
two evils?
@papa_fire
sometimes breaking = fixing
@papa_fire
80% now > 100% tomorrow
@papa_fire
incremental improvements
@papa_fire
anatomy of a problem
[diagram labels, added over successive slides: problem; norm, norm; acceptable; fix, fix, fix, fix]
@papa_fire
what have we learned?
understanding
of what’s important
cause and effect
largest impact
acceptable risk
@papa_fire
what not to do
@papa_fire
don’t assume
@papa_fire
I didn’t build it
it’s not documented
it passed the tests
works in dev
everything looks right
@papa_fire
don’t feed your ego
solve the problem
@papa_fire
ask for help
@papa_fire
tools
@papa_fire
logging
monitoring
profiling
@papa_fire
logging
actionable
concise
parsable
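One way to get log lines that are concise and parsable is to emit a single structured record per event. The sketch below is only an illustration in Python using the standard `logging` module: the logger name `reservation_worker`, the `ctx` attribute, and the field names `curl_code` and `elapsed_s` are hypothetical and do not come from the system shown in this deck.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object on one line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # "ctx" is a hypothetical attribute carrying structured fields;
        # callers pass it through the standard `extra` argument.
        payload.update(getattr(record, "ctx", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("reservation_worker")  # hypothetical name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)

# One actionable line: what failed, for which job, and how long it took.
log.error(
    "api post failed",
    extra={"ctx": {"uuid": "FC0470D4-E19D-11E6-8FB4-CB1814EF18C0",
                   "curl_code": 28, "elapsed_s": 420}},
)

record = json.loads(buffer.getvalue())
print(record["level"], record["curl_code"], record["elapsed_s"])
```

Because every line is a complete JSON object, any tool that reads the log can filter on `uuid` or `level` without custom regular expressions.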
@papa_fire
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 16:46:31] AbandonedReservation successfully enqueued.
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Parsed args
[2017-02-01 18:57:02] Posting to API
[2017-02-01 18:57:02] Initializing args
[2017-02-01 18:57:02] Loading reservation_form_data
[2017-02-01 18:57:03] Reservation Form Data loaded successfully
[2017-02-01 18:57:03] Appending campaign info
[2017-02-01 18:57:03] Reservation name: [Some very very very long name]
[2017-02-01 18:57:03] Using code = SUPERHERO:20171007 for this instance.
[2017-02-01 18:57:03] Setting currency to US Dollar
[2017-02-01 18:57:03] Appending marketing info
[2017-02-01 18:57:03] Have a non-sku source_code
[2017-02-01 18:57:03] Marketing: setting campaignid = GOOGLE
[2017-02-01 18:57:03] Appending match rule = Match Rule
[2017-02-01 18:57:03] Appending user info
[2017-02-01 18:57:03] Appending order info
[2017-02-01 18:57:03] Fetching cost range for item_id = 975, sku_id = 4871
[2017-02-01 18:57:03] Determining actual cost table
[2017-02-01 18:57:03] Appending comment notes
[2017-02-01 18:57:03] Appending abandoned flag
[2017-02-01 18:57:03] API GET data: {
    token = gEcre26reWrAdEnufE3HesVupRepahuDumapHuyap2evufreWrufraBebre7u4a6
    contact.address1_city = New York
    contact.address1_country = USA
    contact.address1_line1 = 123 test lane
    contact.address1_line2 =
    contact.address1_postalcode = 12345
    contact.address1_stateorprovince =
    contact.emailaddress1 = [email protected]
    contact.firstname = joe
    contact.lastname = smith
    contact.mobilephone = 1234567890
    ...
}
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
@papa_fire
useful information
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:03] API GET data:
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
@papa_fire
information I need
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
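The reduction from the full log to the lines that matter can be sketched as a small filter. This is a minimal Python illustration that assumes the `[YYYY-MM-DD HH:MM:SS] message` layout of the excerpt above; the function names are hypothetical, and `SAMPLE` is a shortened copy of that excerpt.

```python
import re
from datetime import datetime

LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.*)$")


def parse_line(line):
    """Split one log line into (timestamp, message), or None if malformed."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return stamp, match.group(2)


def needed_lines(lines, uuid):
    """Keep only lines that mention the job UUID or report an error."""
    kept = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if uuid in message or message.startswith("ERROR"):
            kept.append((stamp, message))
    return kept


def queue_delay_seconds(entries):
    """Seconds between the job being queued and being picked up."""
    queued = next(s for s, m in entries if m.startswith("Queuing UUID"))
    started = next(s for s, m in entries if m.startswith("Processing UUID"))
    return (started - queued).total_seconds()


UUID = "FC0470D4-E19D-11E6-8FB4-CB1814EF18C0"
SAMPLE = [
    "[2017-02-01 16:46:31] Queuing UUID: " + UUID,
    "[2017-02-01 16:46:31] AbandonedReservation successfully enqueued.",
    "[2017-02-01 18:57:02] Processing UUID: " + UUID,
    "[2017-02-01 18:57:02] Parsed args",
    "[2017-02-01 19:04:03] Post complete, took 420 seconds",
    "[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully "
    "with return code 28 (Timeout was reached)",
    "[2017-02-01 19:04:04] Finishing UUID: " + UUID,
]

entries = needed_lines(SAMPLE, UUID)
for stamp, message in entries:
    print(stamp, message)
print("queue delay (s):", queue_delay_seconds(entries))
```

With timestamps parsed, the gap between the "Queuing" and "Processing" lines becomes a number that can be checked, rather than something to spot by eye.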
@papa_fire
monitoring
all inclusive
business-first
correlatable
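"Correlatable" can be made concrete by putting a business metric and a technical metric on the same time axis and measuring how they move together. The following is a rough Python sketch using a hand-written Pearson correlation; the two series and their names are invented sample values for illustration, not data from the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same six points in time.
revenue_per_min = [100, 98, 95, 80, 60, 40]      # business metric
database_load = [1.0, 1.1, 1.3, 2.5, 4.0, 6.0]   # technical metric

r = pearson(revenue_per_min, database_load)
print(f"revenue vs database load: r = {r:.3f}")
```

A coefficient near -1 says revenue falls as database load rises. It does not prove cause, but it tells you where to look first.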
@papa_fire
what’s the problem?
it’s broken!
@papa_fire
[graph labels, added over successive slides: revenue; user performance; database load; decline rate]
@papa_fire
profiling
@papa_fire
when you have the “what”
but still have no idea “why”
@papa_fire
#!/usr/sbin/dtrace -s
#pragma D option quiet

/* request starts: remember the URI and reset the per-request timers */
::ap_process_request:process-request-entry
/zonename == "www4"/
{
    self->uri = copyinstr(arg1);
    self->runtime = 0;
    self->waittime = 0;
    self->oncpu = timestamp;
}

/* thread leaves the CPU: bank the time it just spent running */
sched:::off-cpu
/self->uri != 0/
{
    self->runtime += timestamp - self->oncpu;
    self->offcpu = timestamp;
}

/* thread is back on the CPU: bank the time it spent waiting */
sched:::on-cpu
/self->uri != 0/
{
    self->oncpu = timestamp;
    self->waittime += timestamp - self->offcpu;
}

/* request ends: aggregate the timers per URI */
::ap_process_request:process-request-return
/self->uri != 0/
{
    @duration[self->uri] = sum(self->runtime);
    @waiting[self->uri] = sum(self->waittime);
    @count[self->uri] = count();
}

/* every 5 minutes: print the top 10 of each aggregation, then reset it */
:::tick-5min
{
    printf("\n%Y\n", walltimestamp);

    printf("\nTOTAL TIME SPENT ON CPU BY ALL HITS ON THIS URL\n");
    trunc(@duration, 10);
    printa(@duration);
    trunc(@duration);

    printf("\n\nNUMBER OF HITS\n");
    trunc(@count, 10);
    printa(@count);
    trunc(@count);

    printf("\n\nTOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL\n");
    trunc(@waiting, 10);
    printa(@waiting);
    trunc(@waiting);
}

TOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL
/directory                        6850049
/api/map_search                   7341249
/api/mobile/get_all_items         7980925
/m/                               9124747
/m/directory                      9175345
/api/mobile/get_profile          11729556
/api/holiday_feed                12603853
/api/mobile/get_all_widgets      15043481
/api/get_item/60693              19773404
/m/events/all                    26165132
/api/all_items                   27362330
/api/mobile/get_all_events      368584344
@papa_fire
/api/mobile/get_all_events 368584344
@papa_fire
/api/get_item/60693 19773404
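The off-CPU totals above are easier to reason about once ranked and expressed as a share of the whole. The Python sketch below parses the printed URL/value pairs and sorts them; the function names are hypothetical, and the values are treated as nanoseconds because DTrace's `timestamp` variable counts nanoseconds.

```python
import re

# Off-CPU totals copied from the DTrace output shown above.
RAW = (
    "/directory 6850049 /api/map_search 7341249 "
    "/api/mobile/get_all_items 7980925 /m/ 9124747 "
    "/m/directory 9175345 /api/mobile/get_profile 11729556 "
    "/api/holiday_feed 12603853 /api/mobile/get_all_widgets 15043481 "
    "/api/get_item/60693 19773404 /m/events/all 26165132 "
    "/api/all_items 27362330 /api/mobile/get_all_events 368584344"
)


def parse_aggregation(text):
    """Turn 'url value url value ...' text into (url, nanoseconds) pairs."""
    pairs = re.findall(r"(/\S*)\s+(\d+)", text)
    return [(url, int(ns)) for url, ns in pairs]


def rank(entries):
    """Sort by time, largest first, and attach each URL's share of the total."""
    total = sum(ns for _, ns in entries)
    ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [(url, ns, ns / total) for url, ns in ordered]


ranked = rank(parse_aggregation(RAW))
for url, ns, share in ranked[:3]:
    print(f"{url:32s} {ns / 1e6:10.1f} ms {share:6.1%}")
```

Ranking makes the outlier obvious: one endpoint accounts for most of the waiting time, which is where the next round of digging should start.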
@papa_fire
down the rabbit hole
@papa_fire
troubleshooting is …
required skill
educational
iterative
frustrating
rewarding
@papa_fire
questions?