53
December 4–9, 2016 | Boston, MA www.usenix.org/lisa16 #lisa16 SRE at a Startup: Lessons From LinkedIn Craig Sebenik Lead Infrastructure Engineer Matterport 1

SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

December 4–9, 2016 | Boston, MAwww.usenix.org/lisa16 #lisa16

SRE at a Startup:Lessons From LinkedIn

Craig SebenikLead Infrastructure Engineer

Matterport

1

Page 2: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Who Am I?• Scientist (computational chemistry)

• Sysadmin

• Developer (Perl, Java, C)

• Systems Engineer/SRE/Infrastructure Engineer/…

• Big Companies: Kodak, Kraft General Foods, NetApp, LinkedIn

• Small: 4info, Skyfire, Matterport

• Chef (real chef, not the software)

2

Page 3: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Does SRE work at a startup?

3

Page 4: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Outline• What is SRE?

• LinkedIn

• Early Times

• SRE

• Matterport

• Early Times

• Changes

• Lessons

4

Page 5: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

What is “SRE”?• Site Reliability Engineer (Engineering)

• Started at Google ca. 2003.

• “SRE is what happens when you ask a software engineer to design an operations team.” [sre]

• Member of team focused on build, deployment, monitoring, etc.

• But, entire team is still responsible.

• Each dev team is self-reliant.

• Still need someone to support the “core infrastructure”.

5

Page 6: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

DevOpsCaution: These are my opinions!

• NO DEVOPS TEAM!

• DevOps is a paradigm not a job title.

• “Everyone is a devops engineer.”

• “If you build it, you run it.” [in production]

• Werner Vogels, Amazon CTO [vogels]

• SRE is an implementation of the DevOps paradigm.

6

Page 7: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

LinkedIn

7

Page 8: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Outline• What is SRE?

• LinkedIn

• Early Times

• SRE

• Matterport

• Early Times

• Changes

• Lessons

8

Page 9: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

LinkedIn: The Early Times• Joined in (late) 2010

• Single AppOps team for everything

• Central NOC, only 24/5

• “Pager” passed around AppOps (SRE)

• Other ops teams: sysadmin, network, DBAs

9

Page 10: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

LinkedIn: Early Releases• Large releases, every other week

• Complex dependencies

• Done after work hours and took many hours to complete

• Complicated “release” branches

• “Release team” for changes

• Centralized (REST) service with configuration details

10

Page 11: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Outline• What is SRE?

• LinkedIn

• Early Times

• SRE

• Matterport

• Early Times

• Changes

• Lessons

11

Page 12: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

LinkedIn: SRE Changes• Changed name; aka “rebranded” (AppOps -> SRE)

• SRE broken up into dev specific teams

• Eventually, moved to sit with dev team

• Worked much closer with subset of devs

• Implemented Salt

• More coding

• Eventually, SRE teams for internal products

12

Page 13: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

LinkedIn: DevOps• Simplified “trunk” model*

• No more feature branches

• Devs had access to configuration data

• Richer testing platform

• Devs were able to deploy to production

• First to a canary, then entire cluster

• “A/B tests” for new features (aka “feature flags”)

• Devs involved in oncall (varied by team)

13

Page 14: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

LinkedIn: Dev Self Service• Automated code metrics

• Dev would annotate code to produce metrics

• No limits on number of metrics sent*

• inGraphs

• Dashboards in YAML

• Self service alerts

14

Page 15: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Why is this important?• Implementing a new paradigm is hard.

• Need management support.

• LinkedIn changed a LOT of things allowing it grow.

• Self service is important.

• If LinkedIn can do it, so can your startup.

15

Page 16: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Outline• What is SRE?

• LinkedIn

• Early Times

• SRE

• Matterport

• Early Times

• Changes

• Lessons

16

Page 17: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

17

aka “The Startup”

Page 18: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Who/What Is Matterport?• 3D visualization of spaces.

• Current focus: residential real estate

• Over 150 employees

• Based in Silicon Valley. (Offices in Chicago and UK.)

18

Page 19: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Matterport Technology• Camera w/firmware

• Client (javascript, Unity)

• C++

• Python (DJango)

• Salt

• AWS

• Tons of third parties (“startup micro-economy”)

19

Page 20: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

20

Page 21: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Outline• What is SRE?

• LinkedIn

• Early Times

• SRE

• Matterport

• Early Times

• Changes

• Lessons

21

Page 22: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

“Ops” Work• Single dev doing “operations” work

• He was the only one that knew entire stack

• Wrote tooling as needed

• Not all tools checked-in

• Several snowflake servers

• No Metrics

• Minimal monitoring

22

Page 23: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Releases• All deploys by one person

• “Blue - Green” deployments (2 environments: active and dormant)

• Lagging writes manually copied to new DB

• Little or no communication

• Hour+ downtime

• Scheduled for late at night

• Hand-edits made to code in prod for hotfix

23

Page 24: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

24

Page 25: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

25

Page 26: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Does this sound familiar?

26

Page 27: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

SNAFU - Situation Normal …• Startups start with “dev”

• Engineers want to code, not deploy

• Start getting “real customers”…

• “Ops” work happens organically

• Just the challenge I was looking for!

27

Page 28: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Outline• What is SRE?

• LinkedIn

• Early Times

• SRE

• Matterport

• Early Times

• Changes

• Lessons

28

Page 29: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

First Things First• Prioritize fixing things

• Simplify

• File lots of Jira tickets

• Communication is important

• Get management buy-in

• All hands meeting: “We all own the site.”

29

Page 30: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

30

Page 31: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Dickerson’s Hierarchy of Reliability

31

Page 32: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

My Hierarchy of Needs• Metrics and Monitoring

• Reproducible Builds (required SCM commit)

• Stable/Predictable Release Schedule

• More Communication (and documentation)

• Dev Ownership (includes testing)

• Build a Team

32

Page 33: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Metrics and Monitoring

33

Page 34: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Metrics and Monitoring• Datadog (statsd under the hood)

• Grassroots effort.

• Show lead dev statsd and its docs

• Create a few sample dashboards

• *Everyone* has access (login)

• Expand access to monitoring system

• Simplify by removing unused systems (third parties)

34

Page 35: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

SCM and Builds• Simplify: everyone uses git

• Github Enterprise

• Buildbot

• Reproducible builds of C++ code

• Python code deployed directly from git

• Not ideal… bigger fish to fry

• Automated tests with CircleCI

• Eventually; automated builds its CircleCI

• Self-support!

35

Page 36: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

SCM For Infra• All salt changes are committed to git

• Simple unit tests run on salt changes

• Testing hosts for every member of infra

• Simplify salt code

• Remove conditionals where possible

• Implement data structures

36

Page 37: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Release Improvements• All non-production environments are free-game

• Devs do “trip over each other” once in a while

• They figured it out and adapted

• Prod releases are during business hours

• If something goes wrong, dev needs to be available

37

Page 38: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

More Release• Backwards compatible

• Devs updated the process for schema changes

• Release tickets in Jira

• Release plans in wiki

• Release planning meetings

38

Page 39: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Release Got Better• Moved to every week

• Has become routine

• “Smooth” release is the norm

• Senior management has complete confidence in the process

39

Page 40: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Then Even More Betterer• No more planning meeting

• Everything in Jira and the wiki

• Slack channel and bots

• Feature flags

• Dev deploy directly to production

• On their own schedule

40

Page 41: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Communication• Slack everywhere

• Release channel

• “Outage” channel

• Bots integrated with automation

• Even the recruiters and marketing are using slack

• Docs on wiki

• Lots of Jira tickets

41

Page 42: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Dev Ownership• Self service: Github and Circle

• Root access to all dev hosts

• Most have root to staging hosts

• Many have root on production

• Access to datadog, loggly, sentry, etc

42

Page 43: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Team• Hired 2 more people

• Daily Standups

• They also attend (some) product standups

• Weekly meetings

• Ticket triage: all new, all “blockers”

• Kanban

43

Page 44: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Lots Left To Do• Only some products have devs that can deploy to prod.

• Some developers submitting PRs to salt code

• But, not enough.

• More automation, but only 1 autonomous system

• The rest would fill a dozen slides…

44

Page 45: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Outline• What is SRE?

• LinkedIn

• Early Times

• SRE

• Matterport

• Early Times

• Changes

• Lessons

45

Page 46: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Lessons• Patience: culture shifts take a long time

• Devs *have* to be involved

• Don't be afraid of failure

• Respect your ancestors

• Start small and iterate

46

Page 47: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

More Lessons• Shared experience with failure is better than “preaching”

• Hiring is hard at all sizes

• Make decisions with data

• Demonstrate effectiveness to management

• Support of senior dev(s) necessary

• SRE is an implementation of DevOps

• Constant teaching/learning

47

Page 48: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Questions?One more thing…

48

Page 49: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

We’re hiring!https://matterport.com/careers/

49

Page 50: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Related TalkClosing Plenary: “SRE in the Small and in the Large”

Niall Murphy and Todd Underwood, Google Constitution Ballroom

50

Page 51: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Q & A (For real this time…)

• craig -AT- matterport -DOT- com

• Twitter: @craigs55

• Github: craig5

• https://www.linkedin.com/in/craigsebenik

• https://matterport.com/careers/

51

Page 52: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Notes• [vogels] “A Conversation with Werner Vogels

http://queue.acm.org/detail.cfm?id=1142065

• [sre] “Site Reliability Engineering”Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphyhttp://shop.oreilly.com/product/0636920041528.do

52

Page 53: SRE at a Startup: Lessons From LinkedIn - USENIX · Outline • What is SRE? • LinkedIn • Early Times • SRE • Matterport • Early Times • Changes • Lessons 4

Additional Resources• Infrastructure as Code (Kief Morris)

• http://shop.oreilly.com/product/0636920039297.do

• The Phoenix Project (Gene Kim, Kevin Behr, George Spafford)

• https://www.amazon.com/Phoenix-Project-DevOps-Helping-Business/dp/0988262509

• The DevOps Handbook (Gene Kim, Patrick Debois, John Willis, Jez Humble)

• https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002

• Continuous Delivery (Jez Humble, David Farley)

• https://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912

53