Having a Pulse On Your Platform - Endpointcon Amsterdam 2015 Keynote

Having a Pulse On Your Platform

Kamyar Mohager (@kamyarsayshello)
Engineering Lead, Partner Engineering

LinkedIn


WHAT WE’LL COVER

• WHY BOTHER MONITORING?
• THE TECHNOLOGY
• HOW WE OPERATIONALIZE

WHY BOTHER MONITORING?

INTERNALLY

• Operations: Need to know the health of your platform just like any other app or frontend client. Know your API is down before your developers do
• Business: Make data-driven decisions based on the data

EXTERNALLY

• API availability impacts external apps and their business
• Provide some level of monitoring (and possibly alerting) for developers externally so they’re not left in the dark
• Developer empathy is important

Technology

APACHE KAFKA

● Pub-Sub Messaging and Queuing System

● Data backbone for LinkedIn

INGRAPHS

● Visualization Frontend for metrics

● Standard tool for all LinkedIn Eng & Ops

API-ANALYZER

● Visualization Frontend specific to LinkedIn Platform

● Used by Platform and SRE teams for Operational needs

APACHE HADOOP

● Distributed Data Storage and Processing

● Used by Platform for Business / Product Analytics

KAFKA AT A GLANCE

[Diagram: Producer (API Gateway) → Broker (Kafka Topic: ExternalApiAccessEvent; diagram labels AP0, AP1) → Consumer (InGraphs, API-Analyzer, Hadoop)]
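The producer, broker, and consumer roles in the diagram can be illustrated with a toy in-memory model. This is only a sketch of the publish/subscribe idea, not the Kafka client API: the `Topic` class and its `publish` and `poll` methods are hypothetical names invented for this example. The point it shows is that each consumer keeps its own read position, so several consumers can read the same events independently.

```python
class Topic:
    """In-memory stand-in for a topic: an append-only log plus per-consumer offsets.

    Hypothetical illustration only; this is not how a real Kafka client is used.
    """

    def __init__(self, name):
        self.name = name
        self._log = []       # messages in publish order
        self._offsets = {}   # consumer name -> index of next unread message

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, consumer, max_messages=100):
        """Return unread messages for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


# The API gateway publishes once; every consumer sees the full stream.
topic = Topic("ExternalApiAccessEvent")
topic.publish({"app_id": "app-1", "status": 200})
topic.publish({"app_id": "app-2", "status": 500})
for consumer in ("ingraphs", "api-analyzer", "hadoop"):
    print(consumer, len(topic.poll(consumer)))
```

Because offsets are tracked per consumer, a slow or offline consumer in this model does not affect what the others read.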

EXAMPLE KAFKA TOPIC

ExternalApiAccessEvent
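The slide's schema is not preserved in this text, so the following is only a guess at what such an event could look like, built from the facets the talk mentions elsewhere (application ID, IP address, call type, latency, HTTP status, path). Every field name here is an assumption, and JSON is used purely for illustration; the talk does not say how events are encoded.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExternalApiAccessEvent:
    """Hypothetical shape of one API access event; all field names are assumptions."""
    timestamp_ms: int      # when the gateway handled the call
    app_id: str            # calling application
    partner_program: str   # program the application belongs to
    ip_address: str
    http_method: str
    path: str
    status_code: int
    latency_ms: float


def serialize(event: ExternalApiAccessEvent) -> bytes:
    """Encode the event as UTF-8 JSON, one possible payload for a topic message."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> ExternalApiAccessEvent:
    """Decode a payload produced by serialize()."""
    return ExternalApiAccessEvent(**json.loads(payload.decode("utf-8")))
```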

INGRAPHS

• Standard visualization framework for operational metrics used @ LinkedIn

• Configuration driven with pre-selected applications to create monitoring dashboards

• Hooks into auto-alerting system

DATA FLOWING TO INGRAPH

DEVIL IN THE (MONITORING) DETAILS

WHO

● Entire Platform (aggregate)
● Per Partner Program
● Per Application

WHAT

● QPS
● Latency
● HTTP Response codes (4xx, 5xx)
● APIs / Endpoints (granular to specific HTTP methods)
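The who/what split above amounts to grouping events by some key (platform, partner program, or application) and computing the same counters per group. A minimal sketch, assuming events are plain dicts with hypothetical keys `app_id`, `program`, `status`, and `latency_ms`:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Metrics:
    calls: int = 0
    errors_4xx: int = 0
    errors_5xx: int = 0
    total_latency_ms: float = 0.0

    def qps(self, window_seconds):
        """Average queries per second over the given window."""
        return self.calls / window_seconds

    def avg_latency_ms(self):
        """Mean latency, or 0.0 when no calls were seen."""
        return self.total_latency_ms / self.calls if self.calls else 0.0


def aggregate(events, key):
    """Group events by key(event) and accumulate Metrics for each group."""
    buckets = defaultdict(Metrics)
    for event in events:
        m = buckets[key(event)]
        m.calls += 1
        m.total_latency_ms += event["latency_ms"]
        if 400 <= event["status"] < 500:
            m.errors_4xx += 1
        elif 500 <= event["status"] < 600:
            m.errors_5xx += 1
    return dict(buckets)
```

The same function covers all three levels of "who" by changing the key: `lambda e: "platform"` for the aggregate, `lambda e: e["program"]` per partner program, and `lambda e: e["app_id"]` per application.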

INGRAPHS FOR PLATFORM

PROS

● Efficient: filters latency/QPS/error rates/call types based on configurations
● Stable: used by all of Engineering and Ops

CONS

● Doesn’t support ad hoc queries
● Dependency on SRE team to add any configuration changes

API-ANALYZER

• Visualization frontend specifically for ExternalApiAccessEvent metrics
• Used by Platform and SRE Teams supporting API
• Ad hoc queries to help with troubleshooting

API-ANALYZER PROCESS FLOW

API-ANALYZER

PROS

● Supports fast ad hoc queries against a number of facets: appid, IP address, call types
● Free of dependencies on SRE team to maintain configurations for predefined applications

CONS

● Limited historical data available
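To make "ad hoc queries against a number of facets" concrete, here is a small sketch of filtering and ranking events by arbitrary fields. It says nothing about how API-Analyzer is actually built; the function names and the dict keys (`app_id`, `ip`, `call_type`) are hypothetical.

```python
from collections import Counter


def query(events, **facets):
    """Return the events whose fields equal every given facet value."""
    return [
        e for e in events
        if all(e.get(name) == value for name, value in facets.items())
    ]


def top(events, facet, n=3):
    """Return the n most frequent values of one facet as (value, count) pairs."""
    return Counter(e[facet] for e in events).most_common(n)


events = [
    {"app_id": "a", "ip": "203.0.113.7", "call_type": "GET /people"},
    {"app_id": "a", "ip": "203.0.113.7", "call_type": "POST /shares"},
    {"app_id": "a", "ip": "203.0.113.9", "call_type": "GET /people"},
    {"app_id": "b", "ip": "198.51.100.4", "call_type": "GET /people"},
]

# Which IPs is application "a" calling from, and how often?
print(top(query(events, app_id="a"), "ip"))
```

Nothing is configured ahead of time here: any field can be used as a filter or a ranking key at query time, which is the contrast with the configuration-driven dashboards described earlier.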

APACHE HADOOP

• The hub of all offline tracking data @ LinkedIn
• All ExternalApiAccessEvent data gets ETL’d into Hadoop in near real-time
• Platform team relies on Hadoop for product and business analytics
• In-depth analytics beyond just QPS, Latency, Call Types, etc.
• Historical Data

How Do We Operationalize?

PARTNER ENGINEERING AT LINKEDIN

TEAM GOAL

Provide a world-class developer platform where our partners and developers can build fantastic 3rd party applications for LinkedIn members

ROLE OF A PARTNER ENGINEER

Guide and support partners and developers using our RESTful APIs and mobile SDKs

TREAT PLATFORM AS A PRODUCT

Incorporate feedback from our external developers to influence roadmap

SUPPORT MODEL

• Organized by Partner Programs
• Open Program: Stack Overflow + Developer Portal
• Partner Programs: Dedicated Partner Engineers provide white-glove support
• SLAs vary by Partner Programs (and in certain cases, by strategic partner)

THE TECHNOLOGY IN ACTION

InGraphs

● Dashboards created for a given Partner Program or a specific application
● Charts any metrics we care about (e.g. QPS)
● Set up alerts for support teams based on a given threshold
● Depending on SLA, team gets emailed and/or called (via on-call rotation)

API-Analyzer

● Used for ad hoc queries
● Fast when needing to troubleshoot and triage a production issue for a partner

Hadoop

● Long term look backs
● Provides all ExternalApiAccessEvent tracking data not available in visualization frontends (e.g. member IDs, paths, query params, etc.)
● Ability to create complex, in-depth reports
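The InGraphs alerting points above (a threshold per metric, with email and/or an on-call page depending on SLA) can be sketched as a simple rule check. This is an illustration under stated assumptions, not the actual alerting system: `AlertRule`, `evaluate`, and the channel names are hypothetical, and the choice to always email and page only for stricter SLAs is one possible reading of "emailed and/or called".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str          # e.g. "error_rate_5xx" (hypothetical metric name)
    threshold: float     # alert when the observed value exceeds this
    page_on_call: bool   # assumption: stricter SLAs page on-call as well


def evaluate(rule, observed):
    """Return the notification channels to trigger for one rule.

    `observed` maps metric names to their latest values. A missing metric
    or a value at or below the threshold triggers nothing.
    """
    value = observed.get(rule.metric)
    if value is None or value <= rule.threshold:
        return []
    channels = ["email"]
    if rule.page_on_call:
        channels.append("page")
    return channels


strategic = AlertRule("error_rate_5xx", threshold=0.01, page_on_call=True)
open_program = AlertRule("error_rate_5xx", threshold=0.05, page_on_call=False)
latest = {"error_rate_5xx": 0.08}
print(evaluate(strategic, latest), evaluate(open_program, latest))
```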

[In]SUMMARY

• Your external apps expect 99.99% API “site up”
• Monitoring and Alerting essential for knowing health of your platform
• Use data to make business and product decisions
• It all goes back to tracking: necessary to solve operational and business needs
• Many different types of solutions: up to you to decide whether to build or buy
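For a sense of scale on the 99.99% figure in the summary, the downtime such a target allows is simple arithmetic. The helper name below is made up for this example; the availability target is the only number taken from the talk.

```python
def allowed_downtime_minutes(availability_percent, period_days):
    """Minutes of downtime permitted over the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# At 99.99%, a 30-day month allows roughly 4.3 minutes of downtime
# and a 365-day year allows roughly 52.6 minutes.
print(round(allowed_downtime_minutes(99.99, 30), 2))
print(round(allowed_downtime_minutes(99.99, 365), 2))
```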

THANKS!

Kamyar Mohager (@kamyarsayshello)
Engineering Lead, Platform

LinkedIn
