Voxxed Days Thessaloniki 2016 - 5 must have patterns for your web-scale microservice

>>> 5 must-have Patterns for your web-scale Microservices

@aliostad

Ali Kheyrollahi, ASOS

> stackoverflow

> £1.5 bln global fashion destination

> 35% every year

/// ASOS in numbers

2016 Turnover → £1.5 bln

Active Customers → 12 M

New Products / wk → 4 k

Unique Visits / mo → 123 M

Page Views / day → 95 M

Platform Teams → 40

/// Microservices Architecture

/// why microservices

> Scaling people, not the solution

> Decentralising decision centres => Agility

> Frequent deployment => Agility

> Reduced complexity of each microservice (Divide/Conquer) => Agility

> Overall solution complex but ...

/// anecdote

Often you can measure your success in implementing Microservice Architecture not by the number of services you build, but by the number you decommission.

/// microservices vs soa

                                 SOA                       Microservices
Main Goal                        Architectural Decoupling  Agility
Audience                         Mainly Architecture       Business (Everyone)
Set out to solve                 Architectural Coupling    Scaling People, Frequent Deployment
Organisational Structure Impact  Minimal                   Huge
Service Cardinality              Usually up to a dozen     >40 (Commonly >100)
When to do                       Always                    teams > ~5*

* Debatable. There are articles and discussions on this very topic.

/// microservice challenges

> Very difficult to build a complete mental picture of solution

> When things go wrong, need to know where before why

> Potentially increased latency

> Performance outliers intractable to solve

> A complete mind-shift requiring a new operating model

/// probability distribution

[Figure: probability distribution of response time. x-axis: Response Time; y-axis: Probability]

/// performance outliers

Microservice A: 99th Percentile = 500ms

Microservice B: 99th Percentile = 500ms

          A     B     Total
<1s       99%   99%   98.01%
>500ms    1%    99%   0.99%
>500ms    99%   1%    0.99%
>1s       1%    1%    0.01%
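The totals in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the latencies of A and B are independent and that the two calls are sequential; the function names are illustrative and not from the talk.

```python
from itertools import product


def outlier_table(p_slow_a: float, p_slow_b: float):
    """Return (a_slow, b_slow, probability) rows for two independent calls."""
    rows = []
    for a_slow, b_slow in product([False, True], repeat=2):
        p_a = p_slow_a if a_slow else 1 - p_slow_a
        p_b = p_slow_b if b_slow else 1 - p_slow_b
        rows.append((a_slow, b_slow, p_a * p_b))
    return rows


def p_any_slow(p_slow: float, n_services: int) -> float:
    """Chance that at least one of n independent calls is an outlier."""
    return 1 - (1 - p_slow) ** n_services


rows = outlier_table(0.01, 0.01)
for a_slow, b_slow, p in rows:
    print(f"A slow={a_slow!s:5} B slow={b_slow!s:5} probability={p:.4%}")
```

Under the same independence assumption, `p_any_slow(0.01, n)` shows how quickly the share of requests touched by at least one outlier grows as more services join the call chain.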

/// ActivityId Propagator

/// ActivityId

> Every customer request matters

> Every request is unique

> Every request creates a chain (or tree) of calls/events

> Activities are correlated

> You need an ActivityId (or CorrelationId) to link calls/events

/// ActivityId

[Diagram: a Microservice receives a request carrying an Id, holds the Id in Thread Local Storage, and attaches it to calls to other APIs and to events]

/// ActivityId - HTTP

Request

GET /api/v2/foo HTTP/1.1
host: foo.com
activity-id: 96c5a1f106ce468ebcca8303ed7464bd

Response

200 OK
activity-id: 96c5a1f106ce468ebcca8303ed7464bd
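A propagator for the header above might look like the following. This is only a sketch: it uses Python's `contextvars` in the role of thread-local storage, and the function names (`begin_request`, `outgoing_headers`) are hypothetical, not part of any library mentioned in the talk.

```python
import contextvars
import uuid
from typing import Optional

ACTIVITY_HEADER = "activity-id"  # header name as shown on the slide

# Context-local slot playing the role of thread-local storage.
_activity_id = contextvars.ContextVar("activity_id", default=None)


def begin_request(incoming_headers: dict) -> str:
    """Reuse the caller's activity id, or mint one at the edge of the system."""
    headers = {k.lower(): v for k, v in incoming_headers.items()}
    activity_id = headers.get(ACTIVITY_HEADER) or uuid.uuid4().hex
    _activity_id.set(activity_id)
    return activity_id


def outgoing_headers(extra: Optional[dict] = None) -> dict:
    """Headers for a downstream call, event or response, carrying the current id."""
    headers = dict(extra or {})
    current = _activity_id.get()
    if current is not None:
        headers[ACTIVITY_HEADER] = current
    return headers
```

A request handler would call `begin_request` once on entry and then use `outgoing_headers()` for every outbound API call, published event and the response itself, so that all of them share one id.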

/// Retry and Timeout Policy

/// Failure

[Diagram: Microservice A (1% chance of failure) calling Microservice B (1% chance of failure): failed call (X), wait (back-off), failed call (X), wait (back-off longer)]
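The retry sequence in the diagram (fail, wait, fail, wait longer) can be sketched as below. The doubling back-off, the default of three attempts and the helper name `call_with_retry` are assumptions made for illustration; the talk does not prescribe specific values.

```python
import time


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); after a failure wait, then wait longer, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Back-off doubles on every failed attempt: 0.1s, 0.2s, 0.4s, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

If failures were independent with a 1% chance each, three attempts would all fail roughly once in a million calls; real outages are correlated, which is why the waits grow between attempts.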

/// Preemptive Timeout

[Diagram: Microservice A calling Microservice B with a short timeout: timed-out call (X), retry, timed-out call (X), retry]

/// Timeout

[Diagram: services A, B and C]

A > B > C

A > B + C

/// Choosing a timeout?

Static => Based on Server SLO

Dynamic => 95th percentile
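One way to combine the two options is to start from the static value and switch to the observed 95th percentile once enough latencies have been collected. The class below is a sketch under that assumption; the window size, minimum sample count and the name `DynamicTimeout` are hypothetical.

```python
import math
from collections import deque


class DynamicTimeout:
    """Tracks recent latencies and suggests a timeout at a chosen percentile."""

    def __init__(self, static_timeout, percentile=95.0, window=1000, min_samples=20):
        self.static_timeout = static_timeout  # e.g. derived from the server SLO
        self.percentile = percentile
        self.min_samples = min_samples
        self._samples = deque(maxlen=window)  # sliding window of recent latencies

    def record(self, latency_seconds):
        self._samples.append(latency_seconds)

    def current(self):
        # Use the static value until enough data has been observed.
        if len(self._samples) < self.min_samples:
            return self.static_timeout
        ordered = sorted(self._samples)
        # Nearest-rank percentile over the window.
        rank = max(1, math.ceil(self.percentile * len(ordered) / 100))
        return ordered[rank - 1]
```

A caller would `record()` the latency of every completed call and read `current()` before issuing the next one.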

/// IO Monitor

/// Blame Game

“If there is a single place where you can play blame game, instead of collective responsibility, it is in Microservices troubleshooting”

/// Did you say IO??

[Diagram: Microservice with outbound calls to DB, API and Cache]

Measure... every time your code goes out of your process

/// Recording Methods

> Explicitly by calling record()

> Asking the library to record a closure

> Aspect-oriented

Java (spf4j)

private static final MeasurementRecorder recorder =
    RecorderFactory.createScalableCountingRecorder(forWhat, unitOfMeasurement, sampleTimeMillis);

… recorder.record(measurement);

.NET (PerfIt)

var ins = new SimpleInstrumentor(new InstrumentationInfo()
{
    Counters = CounterTypes.StandardCounters,
    Description = "test",
    InstanceName = "Test instance",
    CategoryName = TestCategory
});

ins.Instrument(() => Thread.Sleep(100), "test...");

Java and .NET

@PerformanceMonitor(warnThresholdMillis=1, errorThresholdMillis=100, recorderSource = RecorderSourceInstance.Rs5m.class)

[PerfItFilter("PerfItTests", InstanceName = "Test")]
public string Get()
{
    return Guid.NewGuid().ToString();
}
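The three recording styles are not tied to Java or .NET. The sketch below shows the same idea in Python with an in-memory store; `record`, `instrument` and `monitored` are hypothetical helpers written for this example and are not the spf4j or PerfIt APIs.

```python
import time
from collections import defaultdict
from functools import wraps

measurements = defaultdict(list)  # name -> list of durations in seconds


def record(name, seconds):
    """1. Explicit: the caller measures and records."""
    measurements[name].append(seconds)


def instrument(name, operation):
    """2. Closure: the library times the callable it is given."""
    start = time.perf_counter()
    try:
        return operation()
    finally:
        record(name, time.perf_counter() - start)


def monitored(name):
    """3. Aspect-style: a decorator times every call to the function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            return instrument(name, lambda: func(*args, **kwargs))
        return wrapper
    return decorator


@monitored("db.get_user")
def get_user(user_id):
    time.sleep(0.01)  # stands in for a database call
    return {"id": user_id}
```

In a real service the list in `measurements` would be replaced by a publisher that ships each duration out of the process.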

/// Publishing Methods

> Local file (various to logstash)

> TCP and HTTP (many, to zipkin, influxdb)

> UDP (statsd, collectd to graphite, logstash)

> Raising Kernel-level event (Windows ETW)

> Local communication (statsd)

/// Circuit-Breaker

/// tri-state

> Closed: traffic can flow normally

> Open: traffic does not flow

> Half-open: circuit breaker tests the waters again

[State diagram: states Closed, Open and Half-open; transitions labelled Test, Failure and Wait timeout]
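The three states can be captured in a small class. This is a deliberately simplified sketch that trips on a count of consecutive failures; the threshold, the wait timeout and the class names are assumptions for illustration, and production libraries typically use rolling error-rate windows instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, wait_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_timeout = wait_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, operation):
        if self.state == "open":
            if self._clock() - self._opened_at < self.wait_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # wait timeout elapsed: allow one test call
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed test call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Success: close the circuit and forget earlier failures.
        self.state = "closed"
        self._failures = 0
        return result
```

While the breaker is open, callers get `CircuitOpenError` immediately instead of waiting on a dependency that is known to be failing.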

/// Netflix Hystrix

RequestVolumeThreshold

ErrorThresholdPercentage

SleepWindowInMilliseconds

TimeInMilliseconds

NumBuckets

/// Fallback

> Custom: e.g. serve content from a local cache (status 206)

> Silent: return null/no-data/empty (status 200/204)

> Fail-fast: Customer experience is important (status 5xx)
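The three fallback options can be expressed as one small wrapper. This is a sketch only: the strategy names, the `with_fallback` helper and the choice of 503 for the fail-fast case are assumptions (the slide only says 5xx).

```python
def with_fallback(operation, strategy, cached=None):
    """Return (status, body) for a call, applying the chosen fallback on failure."""
    try:
        return 200, operation()
    except Exception:
        if strategy == "custom" and cached is not None:
            return 206, cached  # possibly stale content from a local cache
        if strategy == "silent":
            return 204, None    # no data, but not an error for the caller
        return 503, None        # fail-fast: surface the failure immediately
```

Which strategy fits depends on the endpoint: recommendations can go silent, while a checkout call usually should fail fast.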

/// Canary and Health Endpoint

/// Health Endpoints

Ping: returns a success code when invoked

Canary: returns a connectivity status and latency on the service and dependencies

“… none of them invoke any application code”

/// Ping

Request

GET /api/health HTTP/1.1
host: foo.com

Response

200 OK

Response

500 Server Error

/// Canary

Request

GET /api/canary HTTP/1.1
host: foo.com

Response

200 OK

{ [Nested Structure] }

/// ChirpResult

{ "serviceName": "foo", "latency": "00:00:00.0542172", "statusCode": 200, "isCritical": true }

/// ChirpResult - critical failure

[Diagram: API with two non-critical (NC) dependencies and one critical (C) dependency; status codes shown: 200, 200, 500, 500]

/// ChirpResult - non-critical failure

[Diagram: API with two non-critical (NC) dependencies and one critical (C) dependency; status codes shown: 500, 200, 200, 200]
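Read together, the two failure slides suggest a simple aggregation rule: the canary reports 500 only when a critical dependency fails. The sketch below assumes that reading; the field names follow the ChirpResult JSON, while `chirp`, `canary` and the dependency names in the usage note are hypothetical.

```python
import time
from dataclasses import asdict, dataclass


@dataclass
class ChirpResult:
    serviceName: str
    latency: float  # seconds; the slide shows an "hh:mm:ss.fffffff" string instead
    statusCode: int
    isCritical: bool


def chirp(name, check, is_critical):
    """Run one connectivity check and time it; any exception counts as a 500."""
    start = time.perf_counter()
    try:
        check()
        status = 200
    except Exception:
        status = 500
    return ChirpResult(name, time.perf_counter() - start, status, is_critical)


def canary(dependencies):
    """dependencies: iterable of (name, check, is_critical). Returns (status, body)."""
    results = [chirp(name, check, critical) for name, check, critical in dependencies]
    critical_failed = any(r.isCritical and r.statusCode >= 500 for r in results)
    return (500 if critical_failed else 200), [asdict(r) for r in results]
```

For example, `canary([("cache", ping_cache, False), ("sql", ping_sql, True)])` would still report 200 when only the cache check fails, with the failing entry visible in the body.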

/// AOP / Declarative (C#)

[AzureStorageCanary("Foo-AzureStorage-BarDatabaseServer", "config-key-for-cn")]
[SqlCanary("SQL-BazActiveDatabase", null, typeof(SqlConnectionFactory))]
[CanaryEndpointCanary("Dependency-Api", "config-key-for-endpoint")]
public class CanaryController : CanaryBaseController
{
    … // some boilerplate code
}

/// Deep vs Shallow

[Diagram: “Deep” vs “Shallow” canary calls between two APIs; the shallow call is /api/canary?deep=false]

/// Wrap-up

> If you have more than ~5 teams, consider Microservices

> Logging/Monitoring/Alerting: single most important asset

> Use ActivityId Propagator to correlate (consider zipkin)

> Cloud is a jungle™. Without retry/timeout you won’t survive

> Monitor and measure all calls to external services (blame game)

> Protect your systems with circuit-breakers (and isolation)

> Canary helps you detect connectivity from customer view

Thomas Wood: Daisy Picture

Thomas Au: Thermometer Picture

Torbakhopper: Cables Picture

Dam Picture - Japan

Hsiung: Lights Picture

Health Endpoint in API Design