Mitchell Hashimoto: Building Robust Systems w/ Service Discovery & Configuration


Description

There is no scenario in the future where we have fewer servers. Whether you consider a server a physical machine, a virtual machine, or even a container, the number of each is growing at an extremely fast rate. In this view of the world, it is becoming increasingly important to build robust systems that can ideally run anywhere, recover from crashes, distribute load, and so on. In this talk, I discuss these problems and show how a powerful system for service discovery and configuration can get you a fairly robust system without additional modifications. Equipped with this knowledge, it becomes much easier to imagine migrating legacy and new infrastructures over to this modern world of many commodity machines.

https://twitter.com/mitchellh
http://mitchellh.com


Building Robust Systems With Consul

I'm Mitchell Hashimoto
Also known as @mitchellh

HashiCorp
Towards a Software Managed Datacenter

Vagrant
http://www.vagrantup.com

Packer
http://www.packer.io

Serf
http://www.serfdom.io

Consul
http://www.consul.io

Consul

Take a Step Back
Taking a look at the big picture.

[Diagrams: first a single node running a few services; then a hypervisor hosting several nodes, each packed with services; then containers on those nodes adding yet another layer of services.]

Modern Ops
More everything, more problems.

• Where is service foo?
• Is service foo healthy/available?
• What is service foo's configuration?
• Where is the service foo leader?

Meta:

What happens when the thing that answers these questions is unavailable?

Robust Systems
Stem from the ability to answer these questions.

Practical Goals

• Start services in any order
• Destroy services with confidence
• Restart servers safely
• Reconfigure services easily

• Where is service foo?
• Is service foo healthy/available?
• What is service foo's configuration?
• Where is the service foo leader?

Where is service foo?

Maybe here: 127.0.0.1
Maybe close: 10.0.1.35
Maybe there: foo.foohost.com

Is service foo healthy/available?

Yes: Great!
No: Avoid or handle gracefully.

What is service foo’s configuration?

Access information, supported features, enabled/disabled.

What is my configuration?

Expect it to be modifiable.

Where is the service foo leader or best choice?

Locality, master/slave, versions.

Meta: Is the thing answering these questions stable/available?

Critical infrastructure component, you want “yes” as often as possible.

Robust! Can find services, can avoid and handle unhealthy services, can be configured externally, and can trust that it can retrieve all of this information.

Practical Goals

• Start services in any order
• Destroy services with confidence
• Restart servers safely
• Reconfigure services easily

Consul

Solution Attempts
In a world… before Consul…

Manual/Hardcoded
• Doesn't scale with services/nodes
• Not resilient to failures
• Localized visibility/auditability
• Manual locality of services

Config Mgmt Problem
• Slow to react to changes
• Not resilient to failures
• Not really configurable by developers
• Locality, monitoring, etc. manual

LB Fronted Services
• Introduces different SPOF
• How does LB find service addresses/configure?
• Solves some problems, though.

ZooKeeper
• Complicated
• Heavy clients
• Building block, very manual

Consul

Service Discovery

Where is service foo?

Service Discovery

$ dig web-frontend.service.consul. +short
10.0.3.89
10.0.1.46

$ curl http://localhost:8500/v1/catalog/service/web-frontend
[{
  "Node": "node-e818f1",
  "Address": "10.0.3.89",
  "ServiceID": "web-frontend",
  …
}]

Service Discovery

• DNS is legacy-friendly. No application changes required.

• HTTP returns rich metadata.
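
As a sketch of the richer side (assuming a stock agent, whose DNS interface listens on 127.0.0.1:8600), an SRV query returns ports and target nodes, not just addresses:

$ dig @127.0.0.1 -p 8600 web-frontend.service.consul. SRV +short
# each answer: priority, weight, port, and target node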

Failure Detection

Is service foo healthy/available?


• DNS won't return unhealthy services or nodes.

• HTTP has endpoints to list health state of catalog.
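
For example (a minimal sketch against a local agent, reusing the service name from earlier), the health endpoints expose check state directly:

$ curl http://localhost:8500/v1/health/service/web-frontend
# nodes providing the service, each with its health checks

$ curl http://localhost:8500/v1/health/state/critical
# every check currently in the critical state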

Key/Value Storage

What is the config of service foo?

Key/Value Storage

$ curl -X PUT -d 'bar' http://localhost:8500/v1/kv/foo
true

$ curl http://localhost:8500/v1/kv/foo?raw
bar

Key/Value Storage

• Highly available storage of configuration.

• Turn knobs without big configuration management process.
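
One way to react to a turned knob (a sketch; 42 stands in for whatever X-Consul-Index the previous read returned) is a blocking query, which holds the request open until the key changes or the wait time elapses:

$ curl -v http://localhost:8500/v1/kv/foo
# note the X-Consul-Index response header, e.g. 42

$ curl "http://localhost:8500/v1/kv/foo?index=42&wait=30s"
# returns as soon as foo changes, or after ~30s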

Multi-Datacenter

Multi-Datacenter

$ dig web-frontend.service.singapore.consul. +short
10.3.3.33
10.3.1.18

$ dig web-frontend.service.germany.consul. +short
10.7.3.41
10.7.1.76

Multi-Datacenter

$ curl "http://localhost:8500/v1/kv/foo?raw&dc=asia"
true

$ curl "http://localhost:8500/v1/kv/foo?raw&dc=eu"
false

Multi-Datacenter

• Local by default
• Can query other datacenters however you may need to.
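
The dc parameter works beyond the KV store too; as a sketch (datacenter name borrowed from the dig examples above), a catalog query can target a remote datacenter and the local servers forward the request:

$ curl "http://localhost:8500/v1/catalog/service/web-frontend?dc=germany"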

Web UI

• Node, service, health check, and K/V management and visibility for every datacenter in a single UI.
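
A rough sketch of enabling it (in Consul releases of this era the UI shipped as a separate asset bundle, and -ui-dir points at wherever you unpacked it; the path is illustrative):

$ consul agent -server -data-dir /tmp/consul -ui-dir /opt/consul-ui
# then browse to http://localhost:8500/ui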

Operations
Consul Availability / Scalability

The Meta Question

Architecture

Server Cluster
• 3, 5, 7 servers
• (n/2) + 1 for availability
• Replicated writes
• Automatic leader election, leader forwarding
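
A minimal sketch of standing up such a cluster (the address is made up; -bootstrap-expect tells each server to wait for three peers before electing a leader):

$ consul agent -server -bootstrap-expect 3 \
    -data-dir /var/consul -join 10.0.1.10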

Lightweight Clients
• Ephemeral state
• Health checks
• Optional (but recommended). Legacy machines don't need them.
• Automatic request forwarding to servers.
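
A client agent, by contrast, is just this (a sketch; 10.0.1.10 stands in for any existing cluster member):

$ consul agent -data-dir /var/consul -join 10.0.1.10
# no -server flag: ephemeral state, local health checks, request forwarding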

Cheap Gossip
• Health check and membership info
• Very cheap
• No guaranteed reliability, but only used for data that can be lost
• (See Serf)
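
The gossip pool is what backs membership queries on any agent:

$ consul members
# name, address, and status (alive/failed/left) of every known agent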

Multi-DC
• Independent server clusters
• Request forwarding
• WAN gossip for membership
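
Joining datacenters is a one-time gossip operation between server clusters (a sketch; the remote server address is hypothetical):

$ consul join -wan 10.7.0.5
$ consul members -wan
# servers from every joined datacenter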

General Points: Servers
• (n+1)/2 servers for write availability
• More servers means higher write latency because of replication. Throughput marginally affected.
• Can leave/add at will, keeping in mind min. node requirement.
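
Removing a server gracefully is a matter of telling its local agent to leave, so peers see a departure rather than a failure:

$ consul leave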

General Points: Clients
• Clients can be removed/added at will without issue.
• Clients don't currently affect read/write throughput in a meaningful way.
• Although technically optional, they're highly recommended for delegated health checks.
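
Because every client fronts the full HTTP API, a service can register itself through its local agent (a sketch; the name and port are made up):

$ curl -X PUT -d '{"Name": "web", "Port": 80}' \
    http://localhost:8500/v1/agent/service/register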

Throughput

• On virtualized cloud systems with spinning disks: thousands of reads and writes per second

• In practice, you won't hit the read/write limit.

Scalable and available. Consul’s architecture makes it incredibly scalable and highly unlikely to become unavailable.

Robust Systems
Consul configured, monitored, discovered

• Consul KV for configuration.
• Consul DNS for service coupling/discovery.
• Consul Health Checks for monitoring.

Consul KV: Configuration

Consul KV: Configuration

$ envconsul -reload myapp/config bin/myapp
…

Consul KV: Configuration

• envconsul turns K/V pairs into environment variables and restarts the process on change.

• No application changes!
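
End to end, that looks roughly like this (the key name under myapp/config is hypothetical):

$ curl -X PUT -d '5' http://localhost:8500/v1/kv/myapp/config/MAX_WORKERS
true

$ envconsul myapp/config env
MAX_WORKERS=5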

Consul DNS: Service Discovery

$ envconsul myapp/config env
ELASTICSEARCH_HOST=elasticsearch.service.consul.
POSTGRESQL_HOST=master.postgresql.service.consul.
REDIS_HOST=redis.service.consul.

Consul DNS: Service Discovery

• Configuration to point to other services uses DNS.

• No application changes!

Consul Health Checks: Monitoring

$ cat /etc/consul.d/web.json
{
  "check": {
    "name": "http",
    "script": "curl localhost:80",
    "interval": "5s"
  }
}


• Simple shell scripts (UNIXy)
• Logged output
• Won't show as a result in service discovery queries if failing.
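
A check can also hang directly off a service definition in the same config directory (a sketch in the script-check style shown above; file name and values are illustrative):

$ cat /etc/consul.d/web-service.json
{
  "service": {
    "name": "web",
    "port": 80,
    "check": {
      "script": "curl localhost:80",
      "interval": "10s"
    }
  }
}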

Robust! Add/remove services, reconfigure services, see global state of services without complicated logic. And without modifying application code.

Thank You

http://www.consul.io
