Going all in: From single use-case to many

Michael Kehoe, Staff Site Reliability Engineer, LinkedIn

Couchbase Connect 2016



Overview

• The LinkedIn Story
• Couchbase Use-Cases
• Development & Operations
• Conclusions
• Questions


Michael Kehoe: $ whoami

• Staff Site Reliability Engineer (SRE)
• Production-SRE team
• Funny accent = Australian
• Contact
  • linkedin.com/in/michaelkkehoe
  • @matrixtek


Michael Kehoe: $ whatis SRE

• Site Reliability Engineering
• Operations for the production application environment
• Responsibilities include
  • Architecture design
  • Capacity planning
  • Operations
  • Tooling


Michael Kehoe: $ whatis CBVT

• Couchbase Virtual Team
  • ~10 SREs
  • 2 Software Engineers
  • Sponsored by SRE Director
  • Members spend 5-90% of their time supporting Couchbase
  • We encourage as many people as possible to contribute
• What do we do?
  • Operational work on Couchbase clusters
  • Evangelize the use of Couchbase within LinkedIn
  • Develop tools for the Couchbase ecosystem


The LinkedIn Story

• Founded in 2002, LinkedIn has grown into the world’s largest professional social media network

• 30 offices in 24 countries, available in 24 languages
• More than 450 million members worldwide


The LinkedIn Story

• Growth in Products
  • Profiles
  • Groups
  • Recruiter
  • Sales Navigator
• Growth in Internet Traffic
  • Billions of page-hits per day
  • 100k+ QPS to production services


The LinkedIn Story: In-Memory Storage Needs

• LinkedIn started as an Oracle shop
• Hyper-growth = Scaling challenges
  • Read-scaling becomes important
• Applicable use-cases
  • Simple cache store
    • Pre-warmed
    • Read through
  • Potential for Source of Truth (SoT) store


The LinkedIn Story: Enter Couchbase

• Until 2012, we used only Memcached as a non-SoT in-memory store
• Drawbacks
  • Difficult to pre-warm
  • No partitioning/sharding (had to write our own)
  • Cold-cache restarts
  • Difficult to move data across hosts, clusters, and data centers


The LinkedIn Story: Enter Couchbase

• Evaluated replacement systems for Memcached: MongoDB, Redis, and others
• Couchbase had distinct advantages:
  • Simple replacement for Memcached
  • Built-in replication and cluster expansion
  • Automatic partitioning
  • Low latency
  • Async writes to disk
  • Building tooling is simple
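To make the "automatic partitioning" advantage concrete, here is a minimal Python sketch of hash-based partitioning, the kind of logic that otherwise has to be hand-written on top of Memcached. It is an illustration under stated assumptions, not Couchbase's actual implementation: the partition count, the host names, and the helpers `partition_for`, `build_partition_map`, and `node_for` are all hypothetical.

```python
import zlib

NUM_PARTITIONS = 1024  # assumed partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a CRC32 hash (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_partition_map(nodes: list[str], num_partitions: int = NUM_PARTITIONS) -> list[str]:
    """Assign partitions to nodes round-robin; a real system also places replicas."""
    return [nodes[p % len(nodes)] for p in range(num_partitions)]


def node_for(key: str, partition_map: list[str]) -> str:
    """Look up which node owns the partition that a key hashes to."""
    return partition_map[partition_for(key, len(partition_map))]


nodes = ["cb-node-1", "cb-node-2", "cb-node-3"]  # hypothetical host names
partition_map = build_partition_map(nodes)
owner = node_for("member:12345", partition_map)
```

Because keys hash to a fixed set of partitions rather than directly to hosts, growing the cluster only means reassigning some partitions to the new node; the key-to-partition mapping itself does not change.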


The LinkedIn Story: Enter Couchbase

• Today we run Couchbase in our Corporate, Staging and Production environments
• Production/Staging statistics:
  • 148 buckets
  • 2821 hosts
  • 10M+ QPS
• Largest clusters:
  • By hosts: 72 hosts
  • By documents: 1.4B documents
  • By QPS: 2.5M QPS


Use-Cases: Summary

Today’s use-cases:
• Simple read-through cache
• Ephemeral counter store
• Temporary de-duping store
• SoT data-store for internal tooling


Use-Cases: Simple read-through cache

• Drop-in replacement for Memcached
• Read-scaling
• Protecting the backend database from large amounts of traffic
• E.g. 3rd party ingestion credential cache
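The read-through pattern can be sketched in a few lines of Python. This is a minimal sketch, assuming a plain dictionary as a stand-in for a Couchbase bucket; the `ReadThroughCache` class, the `loader` callable, the TTL value, and the sample key are hypothetical and do not describe LinkedIn's actual client code.

```python
import time
from typing import Callable, Optional


class ReadThroughCache:
    """Serve reads from a cache and fall back to a loader on a miss."""

    def __init__(self, loader: Callable[[str], Optional[str]], ttl_seconds: float = 300.0):
        self._loader = loader  # hypothetical backend lookup, e.g. a database query
        self._ttl = ttl_seconds
        self._store: dict[str, tuple[str, float]] = {}  # stand-in for a bucket
        self.backend_reads = 0

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]  # cache hit: the backend is not touched
        self.backend_reads += 1
        value = self._loader(key)  # cache miss: read through to the backend
        if value is not None:
            self._store[key] = (value, time.monotonic() + self._ttl)
        return value


backend = {"partner:42": "credential-abc"}  # hypothetical backing database
cache = ReadThroughCache(loader=backend.get)
first = cache.get("partner:42")   # miss, loaded from the backend
second = cache.get("partner:42")  # hit, served from the cache
```

Only the first read reaches the backend, which is how the cache shields the database from repeated traffic for the same key.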


Use-Cases: Counter Store

• In certain places, we simply need to increment counters from multiple systems and store them

• E.g. Anti-abuse/Anti-scraping systems (Fuse)
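A shared counter store depends on increments being atomic, so that concurrent writers never lose updates. The Python sketch below illustrates the idea with an in-memory stand-in; the `CounterStore` class, the `over_limit` helper, the key format, and the limits are hypothetical, and nothing here describes how Fuse itself works.

```python
import threading
from collections import defaultdict


class CounterStore:
    """In-memory stand-in for a shared store with atomic increments."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: dict[str, int] = defaultdict(int)

    def incr(self, key: str, delta: int = 1) -> int:
        """Add delta to the counter and return the new value atomically."""
        with self._lock:
            self._counts[key] += delta
            return self._counts[key]


def over_limit(store: CounterStore, client_id: str, limit: int) -> bool:
    """Count one request and report whether the client exceeded the limit."""
    return store.incr(f"requests:{client_id}") > limit


def hammer(store: CounterStore, key: str, times: int) -> None:
    """Simulate one system incrementing the shared counter repeatedly."""
    for _ in range(times):
        store.incr(key)


counters = CounterStore()
workers = [
    threading.Thread(target=hammer, args=(counters, "requests:10.0.0.1", 100))
    for _ in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
total = counters.incr("requests:10.0.0.1", 0)  # a zero delta reads the current value
```

In a real deployment the atomicity would come from the store's server-side increment operation rather than a local lock.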


Use-Cases: Temporary De-duping store

• Need to de-dup data over a large application cluster
• E.g. Email systems – Ensure we don’t send the same email twice
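De-duplication across many application hosts usually comes down to an "add if absent" operation with an expiry: the first writer of a key wins and everyone else sees a duplicate. A minimal Python sketch, assuming an in-memory stand-in for the shared store; `DedupStore`, `send_once`, the key format, and the one-hour TTL are hypothetical.

```python
import time
from typing import Optional


class DedupStore:
    """In-memory stand-in for a shared store with add-if-absent and expiry."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._expires_at: dict[str, float] = {}

    def add_if_absent(self, key: str, now: Optional[float] = None) -> bool:
        """Return True if the key was newly recorded, False if it is a duplicate."""
        now = time.monotonic() if now is None else now
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + self._ttl
        return True


def send_once(store: DedupStore, member_id: int, campaign: str, outbox: list) -> bool:
    """Send an email only if this (member, campaign) pair has not been seen."""
    if not store.add_if_absent(f"email:{campaign}:{member_id}"):
        return False
    outbox.append((member_id, campaign))  # placeholder for the real send
    return True


dedup = DedupStore(ttl_seconds=3600)
outbox: list = []
results = [send_once(dedup, 7, "weekly-digest", outbox) for _ in range(3)]
```

This stand-in is not thread-safe; with a shared store, the add-if-absent check must be a single atomic operation on the server so that two hosts cannot both win.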


Use-Cases: SoT Store for Internal Tools

• For non-member-facing tools, we use Couchbase as a SoT store.
• Benefits:
  • Schema-less
  • Short setup time
  • Couchbase Python Client works easily in our environment
  • Use views for simple map-reduce
• Example uses:
  • Nurse – Auto-remediation system
  • TrafficshiftIn – Global traffic automation system
  • Availability – Storing and tracking LinkedIn availability data


The LinkedIn Story: Couchbase Ecosystem


Developing around Couchbase

• Java – li-couchbase-client
  • Wrapper around standard Java Couchbase Client
  • Custom metrics emission
  • Using Spring interface
  • Storing data as Java serialized objects
• Python – couchbase-python-client


Operational Tooling

To use Couchbase efficiently as SREs, we need the following:
• Provisioning
• Installation
• Monitoring & Alerting
• Infrastructure Visibility


Operational Tooling: Provisioning

• Provisioning Flow
  • Seek estimated usage statistics for cluster
    • Size of data to be stored
    • QPS
    • Redundancy needs
  • Calculate cluster sizing
    • Currently done with a template
    • Couchbase has a simple calculator available online: http://docs.couchbase.com/prebuilt/calculators/sizing-calc.html
  • Request hardware for cluster(s)
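The sizing step combines the three inputs gathered above (data size, QPS, redundancy). The Python sketch below shows the shape of that arithmetic only. It is a deliberately simplified assumption, not the formula used by LinkedIn's template or by the Couchbase calculator; `estimate_nodes`, the headroom fraction, the per-node throughput, and all sample numbers are hypothetical.

```python
import math


def estimate_nodes(
    data_gb: float,
    replicas: int,
    ram_per_node_gb: float,
    peak_qps: float,
    qps_per_node: float,
    headroom: float = 0.3,
) -> int:
    """Estimate node count from memory and throughput needs (simplified)."""
    total_data_gb = data_gb * (1 + replicas)          # active copy plus replicas
    usable_ram_gb = ram_per_node_gb * (1 - headroom)  # leave room for growth and overhead
    nodes_for_memory = math.ceil(total_data_gb / usable_ram_gb)
    nodes_for_qps = math.ceil(peak_qps / qps_per_node)
    # Each copy of the data should live on a different node.
    return max(nodes_for_memory, nodes_for_qps, replicas + 1)


nodes = estimate_nodes(
    data_gb=500, replicas=1, ram_per_node_gb=64, peak_qps=200_000, qps_per_node=50_000
)
```

With these sample inputs, memory is the binding constraint: 1000 GB of data including the replica, spread over 44.8 GB of usable RAM per node, rounds up to 23 nodes, well above the 4 nodes that throughput alone would need.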


Operational Tooling: Installation

• Process
  • Enter cluster metadata into our management system (Range)
  • Use Salt States to install and configure cluster
  • See Issa Fattah’s post for more information:
    • https://engineering.linkedin.com/blog/2016/04/leveraging-saltstack-to-scale-couchbase
• Benefits
  • Ability to perform ‘state enforcement’
  • Using Salt Pillars to encrypt cluster/bucket passwords end-to-end


Operational Tooling: Monitoring & Alerting

• We run a daemon on each Couchbase Server that collects metrics every minute via Couchbase APIs
• Use cluster metadata from Range to build dashboards with our own system, InGraphs
• See: ‘Monitoring production deployments’: 4pm - Great America 1
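A per-host collector of this kind is essentially a polling loop: fetch bucket statistics, keep the latest sample of each, and hand the result to the metrics pipeline. The Python sketch below shows that loop under stated assumptions. The payload shape (stat name mapped to a list of recent samples), the metric naming scheme, and the `fetch_stats` and `emit` callables are hypothetical; `fetch_stats` stands in for the HTTP call to the Couchbase API, and `emit` for the hand-off to a system such as InGraphs.

```python
import time
from typing import Callable, Dict, List

Stats = Dict[str, List[float]]  # assumed shape: stat name -> recent samples


def latest_metrics(cluster: str, bucket: str, stats: Stats) -> Dict[str, float]:
    """Keep the newest sample of each stat under a dotted metric name."""
    return {
        f"couchbase.{cluster}.{bucket}.{name}": samples[-1]
        for name, samples in stats.items()
        if samples  # skip stats that have no samples yet
    }


def run_collector(
    cluster: str,
    bucket: str,
    fetch_stats: Callable[[str], Stats],
    emit: Callable[[Dict[str, float]], None],
    interval_seconds: float = 60.0,
    iterations: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll stats once per interval and hand them to the metrics pipeline."""
    for i in range(iterations):
        emit(latest_metrics(cluster, bucket, fetch_stats(bucket)))
        if i < iterations - 1:
            sleep(interval_seconds)


def fake_fetch(bucket: str) -> Stats:
    # Hypothetical canned response standing in for an HTTP call to the server.
    return {"ops": [1200.0, 1350.0], "mem_used": [5.1e9, 5.2e9], "empty": []}


emitted: List[Dict[str, float]] = []
run_collector("prod-a", "sessions", fake_fetch, emitted.append,
              iterations=2, sleep=lambda _: None)
```

Injecting `fetch_stats`, `emit`, and `sleep` keeps the loop testable without a running cluster; a production daemon would loop indefinitely and handle fetch errors.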



Operational Tooling: Management

• We want to see a world-view of all the clusters we run
• Having bucket-, cluster-, and server-level statistics is useful
• Having a global view of who owns and operates each cluster/bucket is useful



Conclusions

• Couchbase was a natural fit into our existing infrastructure

• Building an ecosystem around Couchbase was important to us and has helped Couchbase be successful at LinkedIn

• Expanding use of Couchbase
  • In the past year we’ve grown the number of buckets by over 50%
  • Starting to use Views in production
  • Moving Couchbase into LinkedIn standard deployment infrastructure


Thank You

Questions?


©2014 LinkedIn Corporation. All Rights Reserved.