Serverless in production (O'Reilly Software Architecture)

Serverless in production: an experience report

Yan Cui

What’s in this talk?

• how to responsibly run a serverless architecture (aka. how to do ops in serverless)
• testing, CI/CD
• logging, distributed tracing, monitoring
• config management, securing secrets
• coldstarts
• gotchas/limitations + workarounds/hacks

hi, I’m Yan Cui

AWS user since 2009

apr, 2016

Before

• hidden complexities and dependencies
• low utilisation to leave headroom for large spikes
• EC2 scaling is slow, so scale earlier
• paying for lots of unused resources
• up to 30 mins to deploy
• deployments required downtime

“lead time to someone saying thank you is the only reputation metric that matters.”

- Dan North

“what would good look like for us?”

Deployments should…

• be small
• be fast
• have zero downtime
• require no lock-step

Features should…

• be independently deployable
• be loosely-coupled

We want to…

• minimise cost of unused resources
• minimise ops effort
• reduce technical mess
• deliver visible improvements to users faster

nov, 2016

170 Lambda functions in prod

1.2 GB deployment packages in prod

95% cost saving vs EC2

15x no. of prod releases per month

[Figure: timeline (axis: time) marked “1st function in prod!”, followed by the question “is a good fit?”]

Principles: what is good?

Practices: how to make it good?

Tools: with what?

Principles outlast Tools

ALERTING

CI / CD

TESTING

LOGGING

MONITORING

170 functions

WOOF!


CONFIG MANAGEMENT

SECURITY

DISTRIBUTED TRACING

evolving the platform

building a better search experience

[Diagram: Legacy Monolith, Amazon Kinesis, AWS Lambda, Amazon CloudSearch]

[Diagram: Amazon API Gateway, AWS Lambda, Amazon CloudSearch]

building an analytics pipeline


Google BigQuery

1 developer, 2 days (design to production)

(his 1st serverless project)

“thank you, nothing ever got done this fast at Skype!”

“lead time to someone saying thank you is the only reputation metric that matters.”

- Dan North

rebuilding the timeline feature

building better user recommendations

BigQuery

grapheneDB

getting PRODUCTION READY

CHOOSE A FRAMEWORK

DEPLOYMENT

http://serverless.com

https://github.com/awslabs/serverless-application-model

http://apex.run

https://apex.github.io/up

https://github.com/claudiajs/claudia

https://github.com/Miserlou/Zappa

http://gosparta.io/

TESTING

http://amzn.to/29Lxuzu

Level of Testing

1. Unit: do our objects do the right thing? are they easy to work with?
2. Integration: does our code work against code we can’t change? (test by invoking the handler)
3. Acceptance: does the whole system work?

[Figure: unit, integration and acceptance tests, with axes labelled feedback and confidence]

“…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise.

The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…”

Don’t Mock Types You Can’t Change

“…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do…

Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…”

Don’t Mock Services You Can’t Change

The serverless approach to testing is different and may actually be easier.

- Paul Johnston

http://bit.ly/2t5viwK

[Diagram: API Gateway, Lambda, DynamoDB]

Unit Tests (Mock/Stub)

what unit tests will not tell you…

• is our request correct?
• is the request mapping set up?
• are the API resources configured correctly?
• are we assuming the correct schema?
• is Lambda proxy configured correctly?
• is IAM policy set up correctly?
• is the table created?

most Lambda functions are simple and have a single purpose, so the risk of shipping broken software has largely shifted to how they integrate with external services

observation

But it slows down my feedback loop…

IT’S NOT ABOUT YOU!

me

test your system, not (just) your code

API Gateway

IOT

Kinesis

SNS

ElastiCache

CloudWatch

DynamoDB

IAM

S3

Auth0

GrapheneDB

SES

Twilio

Google BigQuery

MongoLab

CloudSearch

APN

GCM

Lambda

EC2

…if a service can’t provide you with a relatively easy way to test the interface in reality, then you should consider using another one.

Paul Johnston

“…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code.

An end-to-end test interacts with the system only from the outside: through its interface…”

Testing End-to-End





[Diagram labels: Test Input, Validate]

integration tests exercise the system’s Integration with its external dependencies

acceptance tests exercise the system End-to-End from the outside

integration tests differ from acceptance tests only in HOW the Lambda functions are invoked

observation
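One way to act on that observation is to write each test case once and switch only the invocation step. The sketch below is a minimal illustration, not the talk's actual test harness: the handler, the `TEST_MODE` switch and the `TEST_ROOT` base URL are all assumed names, and `viaHttp` relies on the global `fetch` available in Node.js 18 or later.

```javascript
// Hypothetical handler under test; a real project would require() its own.
const handler = async (event) => ({
  statusCode: 200,
  body: JSON.stringify({ path: event.path }),
});

// Integration test: invoke the handler function directly, in-process.
async function viaHandler(path) {
  const event = { httpMethod: 'GET', path, headers: {} };
  const res = await handler(event, {});
  return { statusCode: res.statusCode, body: JSON.parse(res.body) };
}

// Acceptance test: call the deployed endpoint from the outside.
// TEST_ROOT is an assumed environment variable with the API's base URL.
async function viaHttp(path) {
  const res = await fetch(`${process.env.TEST_ROOT}${path}`);
  return { statusCode: res.status, body: await res.json() };
}

// The test case calls this step and stays the same in both modes.
// TEST_MODE is an assumed switch, e.g. set by the build script.
function whenWeInvokeGetIndex() {
  return process.env.TEST_MODE === 'http' ? viaHttp('/') : viaHandler('/');
}
```

With this shape the same assertions run before deployment (handler mode) and after deployment (http mode).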

CI/CD PIPELINE

“…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed…

This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software’s lifetime…”

Testing End-to-End

me

Deployment scripts that only live on the CI box are a disaster waiting to happen.

Jenkins build config deploys and tests

unit + integration tests

deploy

acceptance tests

if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then
  STAGE=$2
  REGION=$3
  PROFILE=$4

  npm install
  AWS_PROFILE=$PROFILE 'node_modules/.bin/sls' deploy -s $STAGE -r $REGION
  …

install serverless framework as dev dependency

can be run locally & on the CI box

[Diagram labels: auto, auto, manual]

LOGGING

2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?

UTC Timestamp | API Gateway Request Id | your log message

[Screenshot labels: function name, date, function version]

me

Logs are not easily searchable in CloudWatch Logs.

LOG OVERLOAD

CENTRALISE LOGS

MAKE THEM EASILY SEARCHABLE

the ELK stack

[Diagram: CloudWatch Logs → AWS Lambda → ELK stack]

[Diagram: CloudWatch Events, CloudWatch Logs]

http://bit.ly/2f3zxQG
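A rough sketch of the shipping function in that pipeline, assuming a CloudWatch Logs subscription delivers the events to Lambda. The `awslogs.data` payload (base64-encoded, gzipped JSON) is the format CloudWatch Logs uses; the tab-separated message layout is an assumption about the Node.js runtime's default log lines, and `ship` is a hypothetical function that forwards documents to Logstash or Elasticsearch.

```javascript
const zlib = require('zlib');

// Subscription data arrives as base64-encoded, gzipped JSON.
function decodeLogsEvent(event) {
  const compressed = Buffer.from(event.awslogs.data, 'base64');
  return JSON.parse(zlib.gunzipSync(compressed).toString('utf8'));
}

// Assumed layout: timestamp, request id and message separated by tabs.
function parseLogMessage(message) {
  const [timestamp, requestId, ...rest] = message.split('\t');
  return { timestamp, requestId, message: rest.join('\t').trim() };
}

// ship is hypothetical: it receives the parsed documents and sends them on.
const makeHandler = (ship) => async (event) => {
  const { logGroup, logStream, logEvents } = decodeLogsEvent(event);
  const docs = logEvents.map((e) => ({
    logGroup,
    logStream,
    ...parseLogMessage(e.message),
  }));
  await ship(docs);
  return docs.length;
};
```

Keeping the log group and stream on every document makes it possible to search by function once the logs are centralised.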

DISTRIBUTED TRACING

“my followers didn’t receive my new post!”

- a user

where could the problem be?

correlation IDs*

* eg. request-id, user-id, yubl-id, etc.

ROLL YOUR OWN CLIENTS

kinesis client

http client

sns client

http://bit.ly/2k93hAj

[Diagram: api-a, api-b and api-c behind API Gateway, linked through Kinesis and SNS; labels: global.CONTEXT in each function, log.info(…)]

global.CONTEXT
x-correlation-id = …
x-correlation-xxx = …

capture (from the function's invocation event) and forward (to whatever the function calls next):

• HTTP: headers["User-Agent"], headers["Debug-Log-Enabled"], headers["x-correlation-id"]
• SNS: MessageAttributes: [ "x-correlation-id": … "User-Agent": … "Debug-Log-Enabled": … ]
• Kinesis: data.__context
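The capture and forward steps could be sketched roughly as below. This is an illustration only, not the client code from the linked post: the function names are made up, a module-level variable stands in for `global.CONTEXT`, and generating a fresh `x-correlation-id` when none arrives is an assumption.

```javascript
const crypto = require('crypto');

// Stands in for global.CONTEXT on the slides.
let CONTEXT = {};

// Capture: keep x-correlation-* headers plus User-Agent and Debug-Log-Enabled.
function captureHttp(headers = {}) {
  const context = {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (key.startsWith('x-correlation-')) context[key] = value;
    if (key === 'user-agent') context['User-Agent'] = value;
    if (key === 'debug-log-enabled') context['Debug-Log-Enabled'] = value;
  }
  // Assumption: the first function in a call chain starts a new id.
  if (!context['x-correlation-id']) {
    context['x-correlation-id'] = crypto.randomUUID();
  }
  CONTEXT = context;
  return context;
}

// Forward over HTTP: add the captured context to the outgoing headers.
function forwardHttp(headers = {}) {
  return { ...CONTEXT, ...headers };
}

// Forward over Kinesis: embed the context in the record's data.
function forwardKinesis(data) {
  return { ...data, __context: { ...CONTEXT } };
}

// Capture from Kinesis: read the context back out of the record's data.
function captureKinesis(data) {
  CONTEXT = { ...(data.__context || {}) };
  return CONTEXT;
}
```

Wrapping these calls inside the HTTP, SNS and Kinesis clients means individual functions never have to pass the IDs along by hand.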


X-RAY

Amazon X-Ray

traces do not span over API Gateway

MONITORING + ALERTING

“where do I install monitoring agents?”

you can’t

• invocation Count
• error Count
• latency
• throttling
• granular to the minute
• support custom metrics

Why not IOPipe?

• pervasive access to your entire application
• adds latency for tracking

me

The only “background” processing you get are the capabilities the platform provides out of the box.

“how do I batch up and send logs/metrics in the background?”

you can’t (kinda)

console.log("hydrating yubls from db…");

console.log("fetching user info from user-api");

console.log("MONITORING|1489795335|27.4|latency|user-api-latency");

console.log("MONITORING|1489795335|8|count|yubls-served");

metrics format: MONITORING|timestamp|metric value|metric type|metric name

everything else is treated as logs
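On the receiving side, the function that processes the log stream has to tell metric lines from ordinary log lines. A minimal sketch of that split, assuming the five pipe-separated fields shown above and assuming the timestamp is in epoch seconds (the function names are illustrative):

```javascript
// Parse 'MONITORING|timestamp|value|type|name'; return null for other lines.
function parseMetric(line) {
  const parts = line.trim().split('|');
  if (parts.length !== 5 || parts[0] !== 'MONITORING') return null;
  const [, timestamp, value, type, name] = parts;
  return {
    timestamp: new Date(Number(timestamp) * 1000), // assumes epoch seconds
    value: Number(value),
    type,
    name,
  };
}

// Split a batch of log messages into metrics and plain logs.
function partition(messages) {
  const metrics = [];
  const logs = [];
  for (const message of messages) {
    const metric = parseMetric(message);
    if (metric) metrics.push(metric);
    else logs.push(message);
  }
  return { metrics, logs };
}
```

Because the metrics travel as log lines, the function being monitored pays no extra latency for publishing them.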

[Diagram: CloudWatch Logs → AWS Lambda → logs to the ELK stack, metrics to CloudWatch]

[Screenshot labels: memory used, memory size, billed duration]

http://bit.ly/2gGredx

http://bit.ly/2goFZ8F

DASHBOARDS

SET ALARMS

TRACK APP-LEVEL METRICS

Not Only CloudWatch

don’t put all your eggs in one basket

aka. you don’t want your monitoring system to fail at the same time as the systems it monitors

CONFIG MANAGEMENT

Lambda

me

Environment variables make it hard to share configurations across functions.

me

Environment variables make it hard to implement fine-grained access to sensitive info.

http://bit.ly/2uQKABA

couples the ability to deploy with access to sensitive data, which often don’t overlap in a large engineering team or in a regulated environment

CENTRALISED CONFIG SERVICE

config service goes here

Why not consul or etcd?

• multiple EC2 instances in multi-AZ for HA
• have to manage servers, patch OS, patch software, etc.
• learning curve for configuring the service
• learning curve for using the CLI tools

SSM Parameter Store

• HTTPS (encrypted in-flight)
• encrypted at-rest (encrypt, decrypt)
• role-based access

CENTRALISED CONFIG SERVICE + CLIENT LIBRARY

Requirements for client library

• standardise and encapsulate how you manage configs
• supports client-side caching (fetch & cache at coldstart)
• invalidate cache at interval
• invalidate cache explicitly when staleness is detected

http://bit.ly/2yLUjwd
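The caching requirements can be sketched in a few lines. This is an illustration under assumptions, not the library from the link: `createConfigClient` is a made-up name, `fetchConfig` is a hypothetical async function that would read the values (for example from SSM Parameter Store), and the 3 minute default interval is arbitrary.

```javascript
// fetchConfig: hypothetical async loader, e.g. a call to SSM Parameter Store.
// now: injectable clock, so the expiry logic can be tested.
function createConfigClient(fetchConfig, { ttlMs = 3 * 60 * 1000, now = Date.now } = {}) {
  let cached = null;
  let expiresAt = 0;

  return {
    // Fetch on first use (the cold start), then serve from the cache
    // until the interval has elapsed.
    async get() {
      if (cached === null || now() >= expiresAt) {
        cached = await fetchConfig();
        expiresAt = now() + ttlMs;
      }
      return cached;
    },
    // Explicit invalidation, e.g. after a credential is rejected as stale.
    invalidate() {
      cached = null;
    },
  };
}
```

Creating the client at module scope lets warm invocations reuse the cached values instead of calling the config service every time.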


PRO TIPS

max 75 GB total deployment package size*

* limit is per AWS region

Janitor Monkey

Janitor Lambda

http://bit.ly/2xzVu4a

disable versionFunctions in

install Serverless framework as dev dependency at project level

dev dependencies are excluded since 1.16.0

http://bit.ly/2vzBqhC

http://amzn.to/2vtUkDU

UNDERSTAND COLDSTARTS

[Amazon X-Ray traces: 1st invocation (cold start) vs 2nd invocation]

source: http://bit.ly/2oBEbw2

http://bit.ly/2rtCCBz

[Chart: cold start time by language: C#, Java, NodeJs, Python]


me

C# and Java experience ~100 times the cold start time of Python and also suffer from much higher standard deviation

me

memory size improves cold start time linearly

AVOID COLDSTARTS

[Diagram: CloudWatch Event → AWS Lambda, repeated pings]

HEALTH CHECKS?
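For the pings to be cheap, the function needs to recognise them and return early. A minimal sketch, assuming the scheduled CloudWatch Events rule sends its default payload (whose `source` field is `aws.events`); the response shapes are illustrative:

```javascript
// Assumption: pings come from a scheduled rule using the default event
// payload, which carries source 'aws.events'.
const isPing = (event) => Boolean(event) && event.source === 'aws.events';

const handler = async (event) => {
  if (isPing(event)) {
    // Return early so a ping does no real work. A health check could be
    // placed here instead, at the cost of a longer (billed) invocation.
    return { warmed: true };
  }

  // Normal request handling goes here.
  return { statusCode: 200, body: 'hello' };
};
```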

AWS Lambda docs

Take advantage of container re-use to improve the performance of your function. Make sure any externalized configuration or dependencies that your code retrieves are stored and referenced locally after initial execution. Limit the re-initialization of variables/objects on every invocation. Instead use static initialization/constructor, global/static variables and singletons. Keep alive and reuse connections (HTTP, database, etc.) that were established during a previous invocation.

http://amzn.to/2jzLmkb
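In Node.js terms, that advice amounts to keeping expensive objects at module scope rather than creating them inside the handler. A small sketch with a hypothetical `createClient` standing in for any costly setup (database connection, SDK client, config fetch):

```javascript
// Counts how often the expensive setup runs (for illustration only).
let initCount = 0;

// Hypothetical expensive setup, e.g. opening a database connection.
function createClient() {
  initCount += 1;
  return { query: async (id) => ({ id }) };
}

// Module scope survives between invocations on the same container,
// so the client is created once and then reused.
let client;
function getClient() {
  if (!client) client = createClient();
  return client;
}

const handler = async (event) => getClient().query(event.id);
```

Only a cold start pays the setup cost; every warm invocation on that container reuses the same client.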


max 5 mins execution time

http://bit.ly/2w6ItdI

CONSIDER PARTIAL FAILURES

AWS Lambda docs

AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires.

http://amzn.to/2vs2lIg

processing halts until failed events are retried successfully/expired from stream

vs

prioritize realtime-ness, retry failed events with best effort, then skip

[Diagram labels: SNS, Kinesis, SQS; after 3 attempts; share processing logic]

events are processed in chronological order

failed events are retried out of sequence
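The "retry with best effort, then skip" option might look roughly like this inside the function. It is a sketch under assumptions: `processBatch`, `processRecord` and `onGiveUp` are hypothetical names, and `onGiveUp` stands for whatever parks the failed event for later (for example, sending it to a queue).

```javascript
// Process records in order; retry each up to maxAttempts times, then hand
// it to onGiveUp and move on so the rest of the batch is not blocked.
async function processBatch(records, processRecord, onGiveUp, maxAttempts = 3) {
  const skipped = [];
  for (const record of records) {
    let succeeded = false;
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt++) {
      try {
        await processRecord(record);
        succeeded = true;
      } catch (err) {
        if (attempt === maxAttempts) {
          // Hypothetical hook: park the event somewhere for a later retry.
          await onGiveUp(record, err);
          skipped.push(record);
        }
      }
    }
  }
  return skipped;
}
```

Since the function no longer throws for a bad record, the batch is not replayed by Lambda, and anything retried later via `onGiveUp` is processed out of sequence.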

PROCESS SQS WITH RECURSIVE FUNCTIONS

http://bit.ly/2npomX6

AVOID HOT KINESIS STREAMS

AWS Lambda docs

Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.

http://amzn.to/2ubyaot


AWS Lambda docs

If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each Lambda function processes events on a shard in the order that they arrive.



when no. of processors goes up…

ReadProvisionedThroughputExceeded

can have too many Kinesis read operations…

ReadRecords.IteratorAge

unpredictable spikes in read ‘latency’…

can kinda workaround…

http://bit.ly/2uv5LsH

clever, but costly

new tool, new problems
but they’re easier to deal with

@theburningmonk
theburningmonk.com
github.com/theburningmonk

http://bit.ly/2yQZj1H