Couchbase at LinkedIn: Couchbase Connect 2015

  1. Couchbase at LinkedIn, 2015. Michael Kehoe and Brian Cory Sherwin, LinkedIn.
  2. Overview: The LinkedIn Story; Development & Operations; Operational Tooling; LinkedIn's Couchbase as a Database; Questions.
  3. Michael Kehoe: Site Reliability Engineer (SRE) at LinkedIn; SRE for Profile & Higher-Education; member of CBVT; B.E. (Electrical Engineering) from the University of Queensland, Australia.
  4. The LinkedIn Story: Founded in 2002, LinkedIn has grown into the world's largest professional social media network. Offices in 24 countries; available in 23 languages. Over 360M members. Revenue of $638M in Q1 2015.
  5. The LinkedIn Story / In-memory storage needs: At our scale, it becomes challenging to scale data systems, and read scaling becomes important. Applicable use cases: simple cache store (pre-warmed or read-through); temporary data storage for de-duping; potential Source of Truth (SoT) store.
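The pre-warmed and read-through use cases listed on this slide can be illustrated with a minimal sketch. It is an in-process stand-in only: `ReadThroughCache`, its `loader` argument, and the hit/miss counters are hypothetical names chosen for illustration, not part of Couchbase or of LinkedIn's code.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Tiny in-process stand-in for a cache tier in front of a source of truth."""

    def __init__(self, loader: Callable[[str], Optional[str]]) -> None:
        self._loader = loader            # hypothetical reader for the backing store
        self._data: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def warm(self, items: Dict[str, str]) -> None:
        """Pre-warm the cache with data that was built offline."""
        self._data.update(items)

    def get(self, key: str) -> Optional[str]:
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = self._loader(key)        # read through to the source of truth
        if value is not None:
            self._data[key] = value      # keep it for the next reader
        return value
```

A pre-warmed cache starts with `warm(...)` already called, so early reads are hits; a purely read-through cache fills itself on first access and pays one miss per key.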
  6. The LinkedIn Story / Enter Couchbase: Until 2012, we used only Memcached as a non-SoT in-memory store. However, it had some drawbacks: long cache warm-up times; no partitioning/sharding (we had to write our own); cold-cache restarts; difficulty moving data across hosts, clusters and datacentres.
  7. The LinkedIn Story / Enter Couchbase: We evaluated systems to replace Memcached: Mongo, Redis, and others. Couchbase had advantages: drop-in replacement for Memcached; built-in replication and cluster expansion; memory latency for operations; asynchronous writes to disk; ability to utilize some of the development infrastructure we've built.
  8. Development & Operations / Coding: Memcached is configured with Spring and implements a caching Java interface. The Couchbase version is implemented with the Couchbase native client. The developer just replaces the Spring
  9. Development & Operations / Operations: Hadoop jobs build warm cache data. Tools partition the data and load it into Couchbase offline. Deltas are applied when the cache is brought online. Clean, warm caches are ready when needed.
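The offline partition-and-load step can be pictured with a small sketch. The slides do not say how the tooling assigns keys to partitions, so this assumes a plain CRC32-modulo scheme; it is not Couchbase's actual vBucket mapping, and `partition_for` and `split_keys` are hypothetical helper names.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (assumed scheme, CRC32 modulo N)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_keys(keys: Iterable[str], num_partitions: int) -> Dict[int, List[str]]:
    """Group keys by partition so each group can be loaded independently."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        groups[partition_for(key, num_partitions)].append(key)
    return dict(groups)
```

Because the hash is deterministic, a delta produced later for the same key lands in the same partition as the original bulk load.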
  10. Operational Tooling: To use Couchbase efficiently as SREs, we need the following: provisioning; installation; monitoring & alerting; infrastructure visibility.
  11. Operational Tooling / Provisioning: Provisioning flow: (1) seek estimated usage statistics for the cluster (size of data to be stored, QPS, redundancy needs); (2) calculate cluster sizing (currently done via a spreadsheet with a template; moving into an in-house application); (3) request hardware for the cluster(s).
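The sizing calculation itself is not shown in the slides, so the following is only a rough sketch of what a spreadsheet template of this kind might compute from data size and redundancy needs. The formula, the `usable_fraction` headroom parameter, and the function name `nodes_needed` are all assumptions; QPS is deliberately left out because the slides give no way to model it.

```python
import math


def nodes_needed(data_gb: float, replicas: int, ram_per_node_gb: float,
                 usable_fraction: float = 0.7) -> int:
    """Estimate node count so that active plus replica data fits in usable RAM.

    Assumed model: every replica stores a full extra copy of the data, and only
    `usable_fraction` of each node's RAM is available for it.
    """
    if data_gb < 0 or replicas < 0:
        raise ValueError("data_gb and replicas must be non-negative")
    if ram_per_node_gb <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("ram_per_node_gb must be positive and usable_fraction in (0, 1]")
    total_gb = data_gb * (1 + replicas)
    usable_per_node_gb = ram_per_node_gb * usable_fraction
    by_memory = math.ceil(total_gb / usable_per_node_gb)
    # Each replica needs its own node, so never go below replicas + 1.
    return max(replicas + 1, by_memory)
```

For example, 100 GB of data with one replica on 64 GB nodes at 50% usable RAM needs 200 / 32 = 6.25, rounded up to 7 nodes.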
  12. Operational Tooling / Installation: Current system: enter cluster metadata into our management system (Yahoo range); use a SALT module to install and configure the cluster. Future system: use the same metadata system; use SALT States to install and configure the cluster. Benefits of the new system: it's possible to have state enforcement; use SALT Pillars to encrypt cluster/bucket passwords.
  13. Operational Tooling / Installation (cluster metadata): CLUSTER: - ela4.couchbase.30 - prod-lva1.couchbase.30 - prod-ltx1.couchbase.30 NAME: follow-blue PORT: 11211 INSTANCE: 30 ALERT_ADDRESSES: - q([email protected]) SRE_GROUPS: - sre-team-name CLIENT_CONTAINERS: - following-services EMAIL_ALERTS: - HIGHWATER_PERCENT_FULL - MEMORY_PERCENT_FULL - NOT_MY_VBUCKET - PERCENT_IN_MEMORY - KEY_USAGE - AUTOFAILOVER
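The cluster entries in this metadata appear to follow a `<fabric>.couchbase.<instance>` naming pattern, and the same fabric names show up later in the alert configuration. As a sketch under that assumption, the fabrics could be derived from the metadata like this; the dictionary mirrors a subset of the slide, and `fabrics_of` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Subset of the metadata shown on the slide, as a plain Python dictionary.
metadata: Dict[str, Any] = {
    "CLUSTER": ["ela4.couchbase.30", "prod-lva1.couchbase.30", "prod-ltx1.couchbase.30"],
    "NAME": "follow-blue",
    "PORT": 11211,
    "INSTANCE": 30,
}


def fabrics_of(meta: Dict[str, Any]) -> List[str]:
    """Strip the assumed '.couchbase.<instance>' suffix from each cluster entry."""
    suffix = f".couchbase.{meta['INSTANCE']}"
    result: List[str] = []
    for entry in meta["CLUSTER"]:
        if not entry.endswith(suffix):
            raise ValueError(f"unexpected cluster name: {entry}")
        result.append(entry[: -len(suffix)])
    return result
```

Rejecting names that do not match the pattern keeps a typo in the metadata from silently producing a dashboard for the wrong fabric.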
  14. Operational Tooling / Monitoring & Alerting: We run a daemon on each Couchbase server that collects metrics every minute via a Couchbase library API. We use cluster metadata from range to build a dashboard definition file via a Jinja template and Python.
  15. Operational Tooling / Monitoring & Alerting: $ ./couchbase.py I 30 [INFO] Generating dashboard file: common-templates/couchbase.follow-blue
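The source of `couchbase.py` is not included in the slides. As a rough sketch of the idea (metadata in, dashboard definition out), the example below uses the standard library's `string.Template` as a stand-in for Jinja so that it runs without extra packages. The template text, its indentation, and the names `render_dashboard` and `dashboard_path` are assumptions; only the title pattern, the range pattern, and the output path are taken from the slides.

```python
from string import Template
from typing import Any, Dict

# Assumed layout: a fragment of a dashboard definition with two placeholders.
DASHBOARD_TEMPLATE = Template(
    "- title: couchbase.$name AutoFailover Enabled\n"
    "  defs:\n"
    '    - range: "%{FABRIC}.couchbase.$instance"\n'
    '      label: "autofailover_enabled"\n'
    "      rrd: couchbase.$name/autofailover_enabled.rrd\n"
)


def render_dashboard(meta: Dict[str, Any]) -> str:
    """Fill the template from the cluster metadata (NAME and INSTANCE keys)."""
    return DASHBOARD_TEMPLATE.substitute(name=meta["NAME"], instance=meta["INSTANCE"])


def dashboard_path(name: str) -> str:
    """Output location matching the log line on the slide."""
    return f"common-templates/couchbase.{name}"
```

Calling `render_dashboard({"NAME": "follow-blue", "INSTANCE": 30})` yields text that would be written to `dashboard_path("follow-blue")`.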
  16. Operational Tooling / Monitoring & Alerting (dashboard definition): - title: couchbase.follow-blue AutoFailover Enabled defs: - range: "%{FABRIC}.couchbase.30" label: "autofailover_enabled" rrd: couchbase.follow-blue/autofailover_enabled.rrd params: vlabel: 'enabled_boolean' autoalerts: zones: ['COUCHBASE-SLA2'] enabled-fabrics: ['ela4', 'prod-lva1', 'prod-ltx1'] processor: 'ingraphs' filter-type: 'ingraphs_filter' contacts: [[email protected]'] state-check: threshold state-check-args: min: 1.0 consecutive-events: 10 alert-plugin: emailer alert-plugin-args: recipients: [[email protected]] interval: 3600 include-definition: True
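The `state-check: threshold` settings above (`min: 1.0`, `consecutive-events: 10`) suggest an alert that fires only after the metric stays below the minimum for ten samples in a row. The exact semantics of the alerting system are not described in the slides, so the sketch below is one plausible reading; `should_alert` is a hypothetical function.

```python
from typing import Iterable


def should_alert(samples: Iterable[float], minimum: float = 1.0,
                 consecutive_events: int = 10) -> bool:
    """Return True once `consecutive_events` samples in a row fall below `minimum`."""
    breaches = 0
    for value in samples:
        if value < minimum:
            breaches += 1
            if breaches >= consecutive_events:
                return True
        else:
            breaches = 0  # a healthy sample resets the streak
    return False
```

With metrics collected every minute, this reading would mean auto-failover has to be reported as disabled for ten minutes before an email is sent, which filters out a single bad sample.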
  17. Operational Tooling / Monitoring & Alerting
  18. Operational Tooling / Management: We want a world view of all the clusters that we run. Bucket-, cluster- and server-level statistics are useful. A view of who owns each cluster/bucket is useful.
  19. Operational Tooling / Management
  20. Operational Tooling / Management
  21. Operational Tooling / Management
  22. Operational Tooling / Management
  23. Operational Tooling / Management
  24. Operational Tooling / Management
  25. Conclusions: Couchbase fits into our existing infrastructure. We have good management and monitoring of the clusters. We extended a rich set of tooling for our environment. We are starting to expand our use from a cache to a store for internal tooling.
  26. LinkedIn's Couchbase as a Database. Brian Cory Sherwin, Site Reliability Engineer, LinkedIn.
  27. The Agenda: our use case and requirements; why we chose Couchbase vs MySQL; pitfalls encountered.
  28. Couchbase @ LinkedIn: Memcached replacement; data resiliency; maintenance friendly.
  29. Using Couchbase as a Workflow Backend: AutoRemediation! A job execution platform to remediate operations issues. Database backend for state tracking of a workflow engine.
  30. Our Requirements: easy JSON documents; rapid iteration; horizontally scalable.
  31. Why Couchbase? Couchbase as a database; document store; views for indexing; data resiliency; replication; simplicity.
  32. Why not MySQL? Upfront cost in creating the schema; rapidly changing documents; number of columns; consistent incremental updates.
  33. Pitfalls using Couchbase: ACID implications; durability and consistency; concurrency; different and new tech.
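The concurrency pitfall can be made concrete with compare-and-swap (CAS), the optimistic-locking approach that document stores such as Couchbase offer in place of multi-statement transactions. The sketch below simulates the pattern with an in-memory store rather than calling the Couchbase SDK; `InMemoryStore`, `CasMismatch`, and `update_state` are hypothetical names, and the workflow document shape is invented for illustration.

```python
from typing import Any, Dict, Tuple


class CasMismatch(Exception):
    """Raised when a document changed since it was read."""


class InMemoryStore:
    """Simulated document store that versions each document with a CAS counter."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[int, Dict[str, Any]]] = {}

    def insert(self, key: str, doc: Dict[str, Any]) -> int:
        if key in self._docs:
            raise KeyError(f"{key} already exists")
        self._docs[key] = (1, dict(doc))
        return 1

    def get(self, key: str) -> Tuple[int, Dict[str, Any]]:
        cas, doc = self._docs[key]
        return cas, dict(doc)            # return a copy so callers cannot mutate the store

    def replace(self, key: str, doc: Dict[str, Any], cas: int) -> int:
        current_cas, _ = self._docs[key]
        if cas != current_cas:
            raise CasMismatch(key)       # someone else wrote first
        self._docs[key] = (current_cas + 1, dict(doc))
        return current_cas + 1


def update_state(store: InMemoryStore, key: str, new_state: str,
                 retries: int = 3) -> Dict[str, Any]:
    """Read-modify-write a workflow document, retrying on CAS conflicts."""
    for _ in range(retries):
        cas, doc = store.get(key)
        doc["state"] = new_state
        try:
            store.replace(key, doc, cas)
            return doc
        except CasMismatch:
            continue                     # re-read the latest version and try again
    raise RuntimeError(f"could not update {key} after {retries} attempts")
```

Without the CAS check, two workers updating the same workflow document would silently overwrite each other, which is the kind of behaviour a relational transaction would otherwise prevent.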
  34. Questions? [email protected] If you want to learn more about AutoRemediation: http://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/