57
Building Large Scale Services PRESENTED BY Jennifer Davis November 8, 2013

Building Large Scale Services - LISA 2013

Embed Size (px)

DESCRIPTION

Yahoo! Service Engineers (SE) specialize in bridging the gap between system administration and development. SEs are tasked with delivering a reliable, consistent quality service through the use of best practices. They must understand network, OS, hardware, and customer use cases; and dive deep into the application internals. In this talk, Jennifer will describe her journey with the Sherpa service at Yahoo! and lessons learned about building a reliable, consistent, and high-quality service from scratch. The key takeaway from this talk will be to educate practitioners on successful strategies and pitfalls when building out a service.

Citation preview

Page 1: Building Large Scale Services - LISA 2013

Bu i l d i ng   L a r ge   S ca l e   Se r v i c e s  

PRESENTED  BY  Jennifer  Davis⎪  November  8,  2013  

Page 2: Building Large Scale Services - LISA 2013

Twitter: @sigje Email: [email protected]

Page 3: Building Large Scale Services - LISA 2013

SysAdmin Controls all the things

11/11/13  3  

Page 4: Building Large Scale Services - LISA 2013

Shared Dependencies

11/11/13  4  

Page 5: Building Large Scale Services - LISA 2013

The Reality…  

11/11/13  5  

Page 6: Building Large Scale Services - LISA 2013

The Dream…

11/11/13  6  

Page 7: Building Large Scale Services - LISA 2013

How?

Page 8: Building Large Scale Services - LISA 2013

Define Core Principles

11/11/13  8  

§ Common    ›  CollaboraGon  across  teams,  companies,  industry,  define  standards  

›  Incident,  Problem,  Change,  Config,  Release  management  

§ DisGnct  ›  Specifics  to  an  applicaGon  or  service  ›  Availability,  Service,  Business  ConGnuity,  Capacity    

Page 9: Building Large Scale Services - LISA 2013

Kill the Myths

11/11/13  9  

§ Stupid  User    

Page 10: Building Large Scale Services - LISA 2013

Kill the Myths  

11/11/13  10  

§ Stupid  User  § System  Admin  ==  Operator  

 

Page 11: Building Large Scale Services - LISA 2013

11/11/13  11  

Failing Gracefully

puppet

ruby

SKILLS

perl

nosql

operability security

mysql

unix

TCP/IP

bash

CHEF

Page 12: Building Large Scale Services - LISA 2013

11/11/13  12  

Page 13: Building Large Scale Services - LISA 2013

Kill the Myths  

11/11/13  13  

§ Stupid  User  § System  Admin  ==  Operator  § Words  have  a  common  universal  implicit  meaning    

 

Page 14: Building Large Scale Services - LISA 2013

11/11/13  14  

Page 15: Building Large Scale Services - LISA 2013

Learn to Modulate your Message  

11/11/13  15  

 

 

Page 16: Building Large Scale Services - LISA 2013

11/11/13  16  

Team

Manager Customer

Page 17: Building Large Scale Services - LISA 2013

Team

11/11/13  17  

§ People  working  towards  common  goal.  § Different  roles.    § Different  views.  § Same  objecGves.  

Page 18: Building Large Scale Services - LISA 2013

11/11/13  18  

 Image  Credit:  Kyle  LaGno  

Page 19: Building Large Scale Services - LISA 2013

Team

11/11/13  19  

Sugges/on:  Don’t  talk  about  the  “devs”  request,  talk  about  Elaine’s  request.    

Page 20: Building Large Scale Services - LISA 2013

Team

11/11/13  20  

Sugges/on:  Don’t  talk  about  the  “devs”  request,  talk  about  Elaine’s  request.    Sugges/on:  Verify  that  your  team  has  the  same  vision.  

Page 21: Building Large Scale Services - LISA 2013

Understand the vision.

11/11/13  21  

§  Are  there  other  opGons,  open  source  or  not  within  the  company?  §  Are  there  other  opGons  outside  the  company?  §  Is  EVERYONE  on  the  same  page  about  what  the  service  is?  

Page 22: Building Large Scale Services - LISA 2013

Vision Statement

11/11/13  22  

§  Clear  statement  about  the  problem  that  the  service  is  solving.  ›  DirecGon  ›  IdenGty  management  ›  Team  cohesion  

New  product?  Be  part  of  creaGng  that  vision!  

Page 23: Building Large Scale Services - LISA 2013

Sherpa’s Vision

11/11/13  23  

..  Distributed  replicated  eventually  consistent  key  value  store  that  had  a  focus  on  scalability  ..    

Page 24: Building Large Scale Services - LISA 2013

My Job

11/11/13  24  

§  Examine  soaware  §  Define  risk  §  Communicate  cost  of  risks    §  MiGgate  risks  §  IdenGfy  events  §  Manage  events  

Page 25: Building Large Scale Services - LISA 2013

Fragile Platforms are Bad.

11/11/13  25  

Page 26: Building Large Scale Services - LISA 2013

Change is inevitable

11/11/13  26  

§  Products  pivot  based  on  needs.  §  Requirements  change  and  evolve.  §  Know  core  issues.  

Page 27: Building Large Scale Services - LISA 2013

Know Core Issues

11/11/13  27  

§  Limit  the  scope  of  focus.    

Page 28: Building Large Scale Services - LISA 2013

Know Core Issues

11/11/13  28  

§  Limit  the  scope  of  focus.  §  Focus  on  the  biggest  prioriGes.    

Page 29: Building Large Scale Services - LISA 2013

Know Core Issues

11/11/13  29  

§  Limit  the  scope  of  focus.  §  Focus  on  the  biggest  prioriGes.  

›  Understand  Development  Methodology:  Waterfall,  Scrum,  ?  

 

Page 30: Building Large Scale Services - LISA 2013

Know Core Issues

11/11/13  30  

§  Limit  the  scope  of  focus.  §  Focus  on  the  biggest  prioriGes.  

›  Understand  Development  Methodology:  Waterfall,  Scrum,  ?  ›  IdenGfy  the  key  “Gme”  elements.  

 

Page 31: Building Large Scale Services - LISA 2013

Know Core Issues

11/11/13  31  

§  Limit  the  scope  of  focus.  §  Focus  on  the  biggest  prioriGes.  

›  Understand  Development  Methodology:  Waterfall,  Scrum,  ?  ›  IdenGfy  the  key  “Gme”  elements.  ›  Talk  to  them.  IdenGfy  their  key  terms.  “Enhancements”,  “Defects”  

 

Page 32: Building Large Scale Services - LISA 2013

Know Core Issues

11/11/13  32  

§  Limit  the  scope  of  focus.  §  Focus  on  the  biggest  prioriGes.  

›  Understand  Development  Methodology:  Waterfall,  Scrum,  ?  ›  IdenGfy  the  key  “Gme”  elements.  ›  Talk  to  them.  IdenGfy  their  key  terms.  “Enhancements”,  “Defects”  ›  Establish  the  “Top”  list.    

 

Page 33: Building Large Scale Services - LISA 2013

Create checklists

11/11/13  33  

§  Not  because  people  are  dumb.  §  Not  only  because  of  automaGon.  §  When  things  break,  knowing  what  needs  focus.  §  During  normal  maintenance,  can  idenGfy  “not  OK”.  

›  Audit  checklists  for  deployment  through  staging  environment.  

Page 34: Building Large Scale Services - LISA 2013

Know Outputs

11/11/13  34  

§  IdenGfy  components.  §  Well  defined  protocols  between  components.  §  Expected  Inputs.  §  Expected  Outputs.  

Page 35: Building Large Scale Services - LISA 2013

11/11/13  35  

Page 36: Building Large Scale Services - LISA 2013

11/11/13  36  

Page 37: Building Large Scale Services - LISA 2013

11/11/13  37  

Page 38: Building Large Scale Services - LISA 2013

11/11/13  38  

Page 39: Building Large Scale Services - LISA 2013

11/11/13  39  

Page 40: Building Large Scale Services - LISA 2013

Know State Transitions Explicitly.

11/11/13  40  

§  When  component  is  installed  but  not  ready  

Page 41: Building Large Scale Services - LISA 2013

Know State Transitions Explicitly.

11/11/13  41  

§  When  component  is  installed  but  not  ready  §  When  the  colo  is  going  away  §  Go  through  What  If  Scenarios.  

›  Document  them.  

Page 42: Building Large Scale Services - LISA 2013

Know choke points explicitly.

11/11/13  42  

§  Memory  §  Disk  §  Bandwidth  

Now  and  in  6  months.  JIT?  

Page 43: Building Large Scale Services - LISA 2013

Failure will happen.

11/11/13  43  

§  There  are  no  0  failure  systems.  §   “Give  me  the  brain”  documentaGon  so  that  anyone  can  be  the  brain.  §  Repeatable/Reliable  failure  handling.  §  Run  fire  drills.  Really.    

Page 44: Building Large Scale Services - LISA 2013

11/11/13  44  

Page 45: Building Large Scale Services - LISA 2013

System Administration is Gardening.

11/11/13  45  

§  No  guarantee  of  resources.  §  Only  guarantee  is  change.  

Page 46: Building Large Scale Services - LISA 2013

System Administration is Gardening.

11/11/13  46  

§  Nurture  relaGonships.  ›  Be  authenGc.  ›  Be  trusGng  and  trustworthy.  ›  Have  integrity.  

Page 47: Building Large Scale Services - LISA 2013

Success At Scale is Collaboration & Cooperation across Teams.

Page 48: Building Large Scale Services - LISA 2013

Decreasing Value

11/11/13  48  

Page 49: Building Large Scale Services - LISA 2013

11/11/13  49  

0

2

4

6

8

Jan Apr Jul Oct

# of Support Engineers

# of Support Engineers

Page 50: Building Large Scale Services - LISA 2013

11/11/13  50  

0

1

2

3

4

5

6

Jan Apr Jul Oct

# of Support Engineers

# of Support Engineers

Page 51: Building Large Scale Services - LISA 2013

11/11/13  51  

Page 52: Building Large Scale Services - LISA 2013

Documentation is not the cure.

11/11/13  52  

§ DocumentaGon  doesn’t  guarantee  understanding.  ›  OperaGons  Sandbox  Environment  

§ Don’t  spend  Gme  at  the  end  documenGng.  

Page 53: Building Large Scale Services - LISA 2013

53   11/11/13  

Page 54: Building Large Scale Services - LISA 2013

Summary

Page 55: Building Large Scale Services - LISA 2013

Be Expendable. Feed your brain.

11/11/13  55  

Page 56: Building Large Scale Services - LISA 2013

Acknowledgements

11/11/13  56  

•  hkp://www.flickr.com/photos/levork    •  hkp://www.flickr.com/photos/puggles  •  hkp://www.flickr.com/photos/byteorder  •  hkp://www.flickr.com/photos/egoant  •  hkp://www.flickr.com/photos/happymonkey  •  Kyle  LaGno    •  Greg  Connor    

Page 57: Building Large Scale Services - LISA 2013

Thanks!

11/11/13  57  

[email protected] http://www.slideshare.net/sigje/

presentation-lisa