Pushing the hassle from production to developers. Easily

DEV

OPS

Push the hassle from production to developers. Easily

DevOpsDays Ghent 2016

October 27th

@MartinGoodwell

Dynatrace

About me

Passionate about life, technology, and the people behind both of them.

• Started with Commodore 8-bit (VC-20 and C-64)

• Built Null-modem connections for playing Doom and WarCraft

• Built IPX/SPX networks between MS-DOS 5.0 and Windows 3.1

• Did DevOps before it was called that (mainly Java and Web), for about 10 years

• Now at Dynatrace Innovation Lab

• Tech Lead for Microsoft Technologies and Software Architecture

• Find me on Twitter: @MartinGoodwell

Agenda

• The Rules

• Warm-up

• The Ops dilemma (I call it that)

• The second Ops dilemma

• On Monitoring ...

• ... and Logging

• ... and Call Tracing

• ... and Databases

• Commercial offerings

The Rules

• Please ask or interrupt at any time
• But save longer ideas for the open space discussions
• Or track me down at any time during the event

Warm up

• What's your occupation?
  • Dev, Ops, BinExec?
• What's your technology stack?
  • Node.js
  • Go
  • Java
  • .NET
• Which of you does
  • Monitoring
  • Logging
  • Call tracing
  • Application performance management/monitoring (APM)

The Ops dilemma (1)

Dev

• Single transaction
• Deals with a specific problem
• No impact on real users and business
• Concentrates on a single component
• Deadlines refer to sprints
  • Weeks, usually

Ops

• 100s or 1000s of transactions
• No idea what the cause is
• Real user impact
• Lots of moving parts
• Deadlines usually mean SLAs
  • Hours, maybe just minutes

The Ops dilemma (2)

Automation

• Continuous {Integration/Deployment/Delivery} pipeline
  • triggering unit tests for fast feedback
• Build servers
• Repositories
• Automatic deployments
• Helps devs get stuff into production
• Does nothing for the opposite direction

DevOps is about collaboration.

Collaboration requires documentation.

Automation is implicit documentation.

But there is no automation for

supporting Ops with troubleshooting.

Monitoring

Host metrics

• CPU usage

• Memory usage

• Disk I/O

• Network performance

• No insight into the app's problems and performance

In your code

Use statsd

statsd real quick

http://www.slideshare.net/DatadogSlides/dev-opsdays-tokyo2013effectivestatsdmonitoring
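
To make "statsd real quick" concrete: statsd accepts plain-text datagrams of the form `<name>:<value>|<type>` over UDP, conventionally on port 8125, with `c` for counters, `ms` for timers and `g` for gauges. The following is a minimal sketch of a client using only the Python standard library. The class name, the host and the metric names are assumptions made for illustration; in practice an existing statsd client library would be the better choice.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> bytes:
    """Build one statsd datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


class StatsdClient:
    """Fire-and-forget UDP sender (hypothetical helper, not a real library)."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name: str, count: int = 1) -> None:
        self._send(format_metric(name, count, "c"))  # counter

    def timing(self, name: str, millis: int) -> None:
        self._send(format_metric(name, millis, "ms"))  # timer

    def gauge(self, name: str, value: float) -> None:
        self._send(format_metric(name, value, "g"))  # gauge

    def _send(self, payload: bytes) -> None:
        try:
            self._sock.sendto(payload, self._addr)
        except OSError:
            pass  # monitoring must never break the business code


if __name__ == "__main__":
    stats = StatsdClient()
    stats.incr("checkout.orders")          # hypothetical metric names
    stats.timing("checkout.duration", 320)
```

Because the transport is UDP, a missing or overloaded statsd daemon costs the application nothing, at the price of silently dropped metrics.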

Downsides?

• "Polluting" business logic with monitoring code
• Code introspection (e.g. AOP) requires advanced skills
• Not using something like statsd leads to cluttered metrics
• Great for single-component insight
  • what about called 3rd parties?
  • what about microservices (i.e. distributed transactions)?
  • what about calls to databases, queues, etc.?
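
One way to limit the "pollution" mentioned above, without going all the way to AOP, is to move the timing code into a decorator so the business function itself stays untouched. This is only a sketch: `timed`, `record` and the metric name are hypothetical, and `record` simply collects values in a list where a real setup would forward them to something like statsd.

```python
import functools
import time

# Hypothetical sink: collects (metric name, milliseconds) pairs.
recorded = []


def record(name, millis):
    recorded.append((name, millis))


def timed(metric_name, emit=record):
    """Report the wall-clock duration of each call under metric_name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions alike.
                emit(metric_name, (time.perf_counter() - start) * 1000.0)
        return wrapper
    return decorator


@timed("orders.total_price")
def total_price(prices):
    # Pure business logic, no monitoring code inside.
    return sum(prices)
```

This still only covers a single component; calls into third parties, other services or databases remain invisible unless they are instrumented as well.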

Logging

http://theburningmonk.com/2015/05/a-consistent-approach-to-track-correlation-ids-through-microservices/
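
The idea behind correlation IDs can be sketched in a few lines: reuse the ID that arrives with a request, create one if none is present, stamp it on every log line, and pass it on with every downstream call. The sketch below is not taken from the linked article; the header name `X-Correlation-ID` and all function names are assumptions for illustration.

```python
import contextvars
import logging
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; lookup is case-sensitive here
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


def begin_request(headers: dict) -> str:
    """Reuse the caller's ID if present, otherwise start a new one."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict:
    """Headers to attach to every downstream call."""
    return {HEADER: _correlation_id.get()}


class CorrelationFilter(logging.Filter):
    """Adds the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True


logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(
    logging.Formatter("%(asctime)s %(correlation_id)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

With this in place, searching the central log store for one ID returns the log lines of every service that took part in the request.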

Logging learnings

• Use a logging server (e.g. ELK stack)
  • directly log as JSON
  • at least store as JSON
• Using logging for monitoring is expensive
  • log analysis is a real resource hog
  • works great for troubleshooting
  • works great with limited problem scope
• For Java, use Logback via SLF4J
  • to local logfiles
  • to logstash
  • to syslog
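
"Directly log as JSON" could look like the following sketch, shown in Python for brevity rather than with Logback. The formatter class and the field names are assumptions; a real setup would match whatever the logging server expects.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the stack trace inside the same JSON document.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("shop")
    log.addHandler(handler)
    log.warning("order %s failed", "42")
```

Structured entries like these can be indexed without fragile parsing rules on the log server.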

Call Tracing

Google Dapper paper

• The Dapper paper (2010)
  • http://research.google.com/pubs/archive/36356.pdf
• OpenTracing
  • http://opentracing.io/documentation/
• OpenZipkin (by Twitter)
  • http://zipkin.io/
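
The data model shared by these tools goes back to the Dapper paper: a trace is a tree of spans, every span carries its own ID plus the ID of its parent, and all spans of one request share the same trace ID. The following is a minimal sketch of that model only. The class and field names are assumptions and do not reflect the Zipkin or OpenTracing APIs.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


def _new_id() -> str:
    return secrets.token_hex(8)  # random 64-bit ID as 16 hex characters


@dataclass
class Span:
    """One timed operation inside a trace (illustrative model only)."""
    name: str
    trace_id: str = field(default_factory=_new_id)
    parent_id: Optional[str] = None  # None marks the root span
    span_id: str = field(default_factory=_new_id)
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def child(self, name: str) -> "Span":
        # A child keeps the trace ID and points back at this span.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self) -> None:
        self.end = time.time()


if __name__ == "__main__":
    root = Span("GET /checkout")        # hypothetical operation names
    db = root.child("SELECT orders")
    db.finish()
    root.finish()
    print(root.trace_id, root.span_id, db.parent_id)
```

To trace across process boundaries, the trace ID and the current span ID have to travel with each outgoing call, much like a correlation ID.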

Zipkin architecture

http://zipkin.io/pages/architecture.html

https://github.com/openzipkin/zipkin

https://github.com/ordina-jworks/microservices-dashboard

https://github.com/spring-cloud/spring-cloud-sleuth

Spring Cloud Sleuth is a distributed tracing solution on top of Spring Cloud

http://trace.risingstack.com

Databases

Getting database insight

• Database automation
  • e.g. DB Maintain
  • https://dbmaintain.github.io/
• Database performance logging
  • log4jdbc
  • https://github.com/arthurblake/log4jdbc
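
The principle behind database performance logging is to wrap the database connection so that every statement is timed and logged. The sketch below illustrates that principle with Python's built-in `sqlite3` module; it is not log4jdbc's API, and the wrapper class and the 100 ms threshold are assumptions.

```python
import logging
import sqlite3
import time

log = logging.getLogger("sql.timing")


class TimedConnection:
    """Wraps a DB connection and logs how long each statement takes."""

    def __init__(self, conn, slow_ms: float = 100.0) -> None:
        self._conn = conn
        self._slow_ms = slow_ms  # assumed threshold for "slow" statements
        self.timings = []        # (sql, milliseconds) per executed statement

    def execute(self, sql: str, params=()):
        start = time.perf_counter()
        try:
            # Measures execute() only, not fetching the rows afterwards.
            return self._conn.execute(sql, params)
        finally:
            millis = (time.perf_counter() - start) * 1000.0
            self.timings.append((sql, millis))
            level = logging.WARNING if millis >= self._slow_ms else logging.DEBUG
            log.log(level, "%.2f ms: %s", millis, sql)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    db = TimedConnection(sqlite3.connect(":memory:"))
    db.execute("CREATE TABLE orders (id INTEGER)")
    db.execute("INSERT INTO orders VALUES (?)", (1,))
    print(db.execute("SELECT id FROM orders").fetchall())
```

Logging the statement text with placeholders instead of bound values keeps customer data out of the logs.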

The commercial hood

Broad technology support

Zero-conf and ready to run dashboards

Method level insight for code and database

Host, process and network metrics

Call-tracing across technologies

Including log analytics

Full Docker insight (zero-conf)

Dedicated support for most important technologies

Automated baselining, root-cause analysis, and problem correlation

You can't fight in here, Gentlemen.

This is the war room!

Let's talk

• From monolith to microservice

• cloud migration

• performance optimization

• team culture

[email protected]

Thank you!
