Saltconf 2016: Salt stack transport and concurrency

Page 1: Saltconf 2016: Salt stack transport and concurrency
Page 2: Saltconf 2016: Salt stack transport and concurrency

Salt Transport Modularity and Concurrency for Performance and Scale

Thomas Jackson
Staff Site Reliability Engineer
LinkedIn

Page 3: Saltconf 2016: Salt stack transport and concurrency

Agenda

• for item in ('transport', 'concurrency'):
  • History
  • Problems
  • Options
  • Solution

Page 4: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: a history
Transport in Salt

• In the beginning Salt was primarily a remote execution engine
  • Send jobs from Master to N minions (defined by some target)

• In the beginning there was

Page 5: Saltconf 2016: Salt stack transport and concurrency

"ZeroMQ (also spelled ØMQ, 0MQ or ZMQ) is a high-performance asynchronous messaging library, aimed at use in distributed or concurrent applications."

- Wikipedia (https://en.wikipedia.org/wiki/ZeroMQ)

Page 6: Saltconf 2016: Salt stack transport and concurrency

"We took a normal TCP socket, injected it with a mix of radioactive isotopes stolen from a secret Soviet atomic research project, bombarded it with 1950-era cosmic rays, and put it into the hands of a drug-addled comic book author with a badly-disguised fetish for bulging muscles clad in spandex. Yes, ZeroMQ sockets are the world-saving superheroes of the networking world."

- http://zguide.zeromq.org/page:all#How-It-Began

Page 7: Saltconf 2016: Salt stack transport and concurrency

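Salt Transport: a history
How ZMQ PUB/SUB looks

Server

    import zmq

    context = zmq.Context()
    socket = context.socket(zmq.PUB)
    socket.bind("tcp://*:12345")
    socket.send(b"Message")

Client

    import zmq

    context = zmq.Context()
    socket = context.socket(zmq.SUB)
    # A SUB socket receives nothing until it subscribes; b"" matches everything.
    socket.setsockopt(zmq.SUBSCRIBE, b"")
    socket.connect("tcp://localhost:12345")
    print(socket.recv())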

Page 8: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: a history
How ZMQ REQ/REP looks

Server

    import zmq

    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind("tcp://*:12345")
    message = socket.recv()
    socket.send(b"got message")

Client

    import zmq

    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://localhost:12345")
    socket.send(b"Hello")
    message = socket.recv()

Page 9: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: a history
Request lifecycle

Master Minion

1. Job publish
2. Sign-in (optional, potentially reused or cached)
3. Pillar fetch
4. SLS/file fetch (optional)
5. Return

Page 10: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: a history
Initial ZeroMQ implementation

• Master-initiated messages
  • Using the pub/sub socket pair in zmq
  • All broadcast messages from the master to the minion
• Minion-initiated messages
  • Using the req/rep socket pair in zmq
  • All messages initiated by the minion, such as:
    • Sign-in
    • Job return
    • Module sync
    • Pillar
    • Etc.

Page 11: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: a history
Initial problems

• Message loss
• Broadcasts were filtered client side
  • Added zmq filtering: https://github.com/saltstack/salt/pull/13285
• Etc.

Page 12: Saltconf 2016: Salt stack transport and concurrency

Page 13: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: a history
Larger problems

• Huge ZMQ publisher memory leak (https://github.com/zeromq/libzmq/issues/954)
  • Workaround: Process manager in salt
• No concept of client state
  • When messages arrive, there is no way to see if the client is still connected, which leads to auth storms
  • Workaround: Exponential backoff on the minion side
• No sync "connect" (https://github.com/saltstack/salt/pull/21570)
  • Workaround: fire event and wait for it to return (or timeout to expire)
• Some users have issues with the LGPL license
  • Workaround: n/a
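The "exponential backoff on the minion side" workaround can be illustrated with a minimal sketch. This is not Salt's code: the function names, the default values, and the choice of "full jitter" randomization are all assumptions made for illustration. The idea is that each failed reconnect doubles the maximum wait, and a random delay below that ceiling keeps minions from re-authenticating in lockstep.

```python
import random


def backoff_ceilings(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Upper bound (seconds) on the wait before each reconnect attempt."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, rng=None):
    """Yield one randomized delay per attempt, never above its ceiling."""
    rng = rng or random.Random()
    for ceiling in backoff_ceilings(base, factor, cap, attempts):
        # "Full jitter" (an assumption, not taken from the slides): pick
        # uniformly in [0, ceiling] so minions that lost the master at the
        # same moment spread their retries out instead of storming together.
        yield rng.uniform(0, ceiling)


ceilings = backoff_ceilings(attempts=8)
delays = list(backoff_delays(attempts=8, rng=random.Random(0)))
```

A minion would sleep for each delay in turn before retrying its sign-in, stopping as soon as one attempt succeeds.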

Page 14: Saltconf 2016: Salt stack transport and concurrency
Page 15: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: previous attempt

"The Reliable Asynchronous Event Transport, or RAET, is an alternative transport medium developed specifically with Salt in mind. It has been developed to allow queuing to happen up on the application layer and comes with socket layer encryption. It also abstracts a great deal of control over the socket layer and makes it easy to bubble up errors and exceptions."

- docs.saltstack.com

Page 16: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: previous attempt
RAET

• The good
  • No ZMQ!
• The bad
  • Effectively a re-implementation of the daemons (separate files, etc.)
  • Initially unable to run zmq and RAET simultaneously (hydra, which just runs both daemons at once, was added later)
• The different
  • Changed the model from "minions always connect" to "minions are listening", meaning minions have a socket to attack

Page 17: Saltconf 2016: Salt stack transport and concurrency

Page 18: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: back to basics
What do we really need?

• Salt is a platform, not a specific transport, so we need transports to be modular
• Some requirements:
  • Simple interface to implement (such that other modules can be written)
  • Test coverage (including pre-canned tests for new modules)
  • Support N transports simultaneously (for ramps, and complex infra)
  • Clear contract of security/privacy requirements of various methods

Page 19: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
ReqChannel: minion to master messages

• Master
  • pre_fork(self, process_manager)
  • post_fork(self, payload_handler, io_loop)
• Minion
  • send(self, load, tries=3, timeout=60)
  • crypted_transfer_decode_dictentry(self, load, dictkey=None, tries=3, timeout=60)

Page 20: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
PubChannel: broadcasts to the appropriate minions

• Master
  • pre_fork(self, process_manager)
  • publish(self, load)
• Minion
  • on_recv(self, callback)

Page 21: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
Responsibilities

• Serialization
• Encryption
• Targeting (pub channel only)
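To make the channel idea concrete, here is a minimal sketch of what a pluggable request channel and a "pre-canned" contract test could look like. Only the `send(self, load, tries=3, timeout=60)` signature comes from the slides; the class names, the in-process loopback transport, the JSON serialization, and the test helper are hypothetical and are not Salt's actual classes. Encryption is left out entirely.

```python
import abc
import json


class ReqChannelSketch(abc.ABC):
    """Minion-side request channel contract (hypothetical, for illustration)."""

    @abc.abstractmethod
    def send(self, load, tries=3, timeout=60):
        """Deliver `load` to the master and return the master's reply."""


class LoopbackReqChannel(ReqChannelSketch):
    """Toy transport that calls the master's payload handler in-process."""

    def __init__(self, payload_handler):
        self.payload_handler = payload_handler

    def send(self, load, tries=3, timeout=60):
        # Serialization is the channel's responsibility, so callers only
        # ever see plain Python objects on both ends.
        wire = json.dumps(load)
        reply = self.payload_handler(json.loads(wire))
        return json.loads(json.dumps(reply))


def check_req_channel(channel):
    """A reusable test that any new transport module could be run against."""
    load = {"cmd": "ping", "id": "minion-1"}
    return channel.send(load, tries=1, timeout=5) == {"echo": load}


def echo_handler(payload):
    """Stand-in for the master-side payload handler."""
    return {"echo": payload}


channel_ok = check_req_channel(LoopbackReqChannel(echo_handler))
```

Because the caller depends only on `send`, swapping the loopback class for a ZeroMQ- or TCP-backed one would not change the calling code, which is the point of the modular interface.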

Page 22: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
TCP channel

• Wire protocol: msgpack({'head': SOMEHEADER, 'body': SOMEBODY})
• Main advantage over ZMQ? Better failure modes
  • Faster failure detection (if minion isn't connected to the master, you don't have to wait for the timeouts)
  • True link-status (no more auth storms!)
  • Basically, we have sockets again!
• https://docs.saltstack.com/en/develop/topics/transports/tcp.html
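The head/body envelope can be sketched as follows. Two things here are assumptions rather than Salt's behavior: JSON stands in for msgpack so the example runs with only the standard library, and a 4-byte length prefix is used to find message boundaries in the byte stream. The function names are hypothetical.

```python
import json
import struct


def encode_frame(head, body):
    """Wrap head and body in one envelope and prefix it with its length."""
    payload = json.dumps({"head": head, "body": body}).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def decode_frames(buffer):
    """Return (complete_messages, leftover_bytes) from a received buffer."""
    messages = []
    while len(buffer) >= 4:
        (size,) = struct.unpack(">I", buffer[:4])
        if len(buffer) < 4 + size:
            break  # the rest of this message has not arrived yet
        messages.append(json.loads(buffer[4:4 + size].decode("utf-8")))
        buffer = buffer[4 + size:]
    return messages, buffer


stream = encode_frame({"mid": 1}, "ping") + encode_frame({"mid": 2}, "pong")
# Simulate a TCP read that stops three bytes short of the second message.
first_batch, leftover = decode_frames(stream[:-3])
second_batch, remainder = decode_frames(leftover + stream[-3:])
```

Keeping a header separate from the body lets a transport attach routing data, such as a message id for matching replies to requests, without touching the payload.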

Page 23: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
TCP: How does it look?

    async_channel = salt.transport.client.AsyncReqChannel.factory(minion_opts)
    ret = yield async_channel.send(msg)

Page 24: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
TCP: How accurate?

• ZeroMQ
  • Total jobs: 1000
  • Completed jobs: 171
  • Hit rate: 17.1%
• TCP
  • Total jobs: 1000
  • Completed jobs: 1000
  • Hit rate: 100%

Page 25: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
TCP: How does it perform?

• 15 byte message
  • ZeroMQ*
    • Average time: 0.00295809405715
    • QPS: 2246.952241147
  • TCP
    • Average time: 0.0023341544863
    • QPS: 2580.04452801

Page 26: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
TCP: How does it perform?

• 1053 byte message
  • ZeroMQ*
    • Average time: 0.00278297542184
    • QPS: 2489.300394919
  • TCP
    • Average time: 0.00251070397869
    • QPS: 2602.4855051

Page 27: Saltconf 2016: Salt stack transport and concurrency

Salt Transport: Channels!
Awesome!

• Definitely awesome!
• But async? What was that about?
• Before we get into specifics, let's talk about concurrency

Page 28: Saltconf 2016: Salt stack transport and concurrency

Concurrency
The General Problem

We have lots of things to do, some of which are blocking calls to remote things which are “slow”. It is more efficient (and overall “faster”) to work on something else while we wait for that “slow” call.
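A small sketch of that claim, using Python 3's asyncio as a stand-in for any coroutine framework. The function names and delays are illustrative, and `asyncio.sleep` plays the role of the "slow" remote call. Three waits done one after another cost the sum of the delays; the same three waits overlapped cost roughly the longest one.

```python
import asyncio
import time


async def slow_call(name, delay):
    """Stand-in for a remote call that spends its time waiting."""
    await asyncio.sleep(delay)
    return name


async def run_sequential(delays):
    # Each call must finish before the next one starts.
    return [await slow_call(i, d) for i, d in enumerate(delays)]


async def run_concurrent(delays):
    # All calls are started first; the waits overlap.
    calls = (slow_call(i, d) for i, d in enumerate(delays))
    return list(await asyncio.gather(*calls))


def timed(coro_fn, delays):
    """Run a coroutine function to completion and measure wall time."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(delays))
    return result, time.perf_counter() - start


DELAYS = [0.05, 0.05, 0.05]
seq_result, seq_elapsed = timed(run_sequential, DELAYS)
con_result, con_elapsed = timed(run_concurrent, DELAYS)
```

Both runs return the same results in the same order; only the elapsed time differs.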

Page 29: Saltconf 2016: Salt stack transport and concurrency

Page 30: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Current state of concurrency in Salt

• Master-side:
  • The master creates N Mworkers to process N requests in parallel
  • Interfaces with non-blocking code as well, using `while True:` loops to do timeouts etc.
• Minion-side:
  • Threads used in MultiMaster for managing the multiple master connections

Page 31: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Problems

• No unified approach (multiprocessing, threading, nonblocking "loops": all in use)
• Slow and/or blocking operations hold process/thread while waiting
• No consistent use of non-blocking libraries, so the code is a mix of loops and blocking calls
• Limited scalability (each approach scales differently)

Page 32: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Common solutions in Python

• Threading
• Multiprocessing
• User-space "threads": Coroutines / stackless threads

Page 33: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Threading

• Some isolation between threads
• Pre-emptive scheduling

    import threading

    import requests

    NUM_REQUESTS = 10  # illustrative value; not given on the slide

    def handle_request():
        ret = requests.get('http://slowthing/')
        # do something else

    threads = []
    for x in range(NUM_REQUESTS):
        t = threading.Thread(target=handle_request)
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

Page 34: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Multiprocessing

• Complete isolation
• Pre-emptive scheduling

    import multiprocessing

    import requests

    NUM_REQUESTS = 10  # illustrative value; not given on the slide

    def handle():
        ret = requests.get('http://slowthing/')
        # do something else

    if __name__ == '__main__':  # required where child processes are spawned
        processes = []
        for x in range(NUM_REQUESTS):
            p = multiprocessing.Process(target=handle)
            p.start()
            processes.append(p)

        for p in processes:
            p.join()

Page 35: Saltconf 2016: Salt stack transport and concurrency

Concurrency
User-space "threads": Coroutines / stackless threads

• Some libraries you may have heard of
  • gevent
  • Stackless python
  • Greenlet
  • Twisted
  • Tornado
• How are these implemented?
  • Green threads
  • Callbacks
  • Coroutines

Page 36: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Why Coroutines?

• Coroutines have been in use in python for a while (tornado)
• The new asyncio in python3 (tulip) is built on coroutines (https://docs.python.org/3/library/asyncio.html)

Page 37: Saltconf 2016: Salt stack transport and concurrency

Concurrency

"Coroutines are computer program components that generalize subroutines for nonpreemptive multitasking, by allowing multiple entry points for suspending and resuming execution at certain locations."

- https://en.wikipedia.org/wiki/Coroutine

Page 38: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Coroutines: what is this magic?

    def item_of_work():
        while True:
            input = yield
            yield do_something(input)

Page 39: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Coroutines: what is this magic?

    def some_complex_handle():
        while True:
            input = yield
            out1 = do_something(input)
            yield None
            out2 = do_something2(out1)
            yield None
            return do_something3(out2)
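The generator functions above only describe where a coroutine may pause; something still has to resume it. The sketch below shows the two basic mechanisms with plain Python generators: `send()` passes a value in at a `yield`, and a tiny round-robin loop (a hypothetical stand-in for an IO loop, not Tornado's) interleaves several generators cooperatively. All names are illustrative.

```python
from collections import deque


def doubler():
    """Receives a value at each yield and hands back twice that value."""
    result = None
    while True:
        value = yield result
        result = value * 2


def worker(name, steps, log):
    """Does `steps` units of work, pausing after each one."""
    for step in range(steps):
        log.append((name, step))
        yield  # give control back to the scheduler


def run_round_robin(tasks):
    """Resume each task in turn until all of them have finished."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # finished: do not reschedule
        queue.append(task)


gen = doubler()
next(gen)              # advance to the first yield so send() has a target
doubled = gen.send(3)

log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
```

Nothing here is pre-emptive: a task runs until it reaches a `yield`, which is why switching points are explicit in coroutine code.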

Page 40: Saltconf 2016: Salt stack transport and concurrency

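Concurrency
Tornado coroutines

• Some isolation between coroutines
• Explicit yield
• Light "threads"

    import tornado.gen
    import tornado.httpclient
    import tornado.ioloop

    @tornado.gen.coroutine
    def handle_request():
        # requests.get() blocks and returns no future, so use Tornado's
        # non-blocking HTTP client, whose fetch() can be yielded
        ret = yield tornado.httpclient.AsyncHTTPClient().fetch('http://slow/')
        # do something else

    loop = tornado.ioloop.IOLoop.current()
    loop.spawn_callback(handle_request)
    loop.start()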

Page 41: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Coroutines: futures

• Futures are just objects that represent a thing that will complete in the future
• This allows methods to return immediately, but finish the task in the future
• This allows the callers to yield execution until the futures they depend on complete
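A minimal sketch of those three points, using the standard library's `concurrent.futures.Future` rather than Tornado's own future class. The `start_fetch` helper and the pillar-style payload are hypothetical. The function returns at once with an unfinished future, a callback registers interest in the result, and whoever completes the work fills the result in later.

```python
from concurrent.futures import Future


def start_fetch(pending):
    """Return immediately; the result is supplied later by someone else."""
    future = Future()
    pending.append(future)
    return future


pending = []   # work that has been started but not yet completed
results = []   # filled in by the callback once the future resolves

future = start_fetch(pending)
future.add_done_callback(lambda f: results.append(f.result()))
done_before = future.done()

# Later, the "transport" finishes the work and resolves the future,
# which runs the registered callback.
pending.pop().set_result({"role": "web"})
done_after = future.done()
```

A coroutine framework automates the callback step: yielding a future suspends the coroutine, and resolving the future resumes it with the result.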

Page 42: Saltconf 2016: Salt stack transport and concurrency

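Concurrency
Coroutines: with futures

• Yield execution, and get returns
• Method looks fairly normal
• Stack traces in here have context
• Easy chaining of futures

    import tornado.gen

    @tornado.gen.coroutine
    def some_complex_handle(request):
        a = yield is_authd(request)
        if not a:
            # `return value` needs Python 3.3+; on Python 2 use
            # `raise tornado.gen.Return(value)` instead
            return False
        ret = yield do_request(request)
        # yield a list of futures to wait for both saves in parallel
        yield [save1(ret), save2(ret)]
        return ret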

Page 43: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Tornado in Salt

• What is tornado?
  • Python web framework and asynchronous networking library
• Why Tornado and not asyncio?
  • Free python 2.x compatibility!
  • A fairly comprehensive set of libraries for it (http, locks, queues, etc.)

Page 44: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Back to the transport interfaces

• AsyncReqChannel
  • send: return a future
  • crypted_transfer_decode_dictentry: return a future

    ret = yield channel.send(load, timeout=timeout)
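To show what an awaitable `send` with `tries` and `timeout` might do, here is a sketch in Python 3's asyncio, used as an analog for the Tornado version. The parameter names come from the slides; the class name, the retry loop, and the deliberately flaky transport are assumptions, not Salt's implementation.

```python
import asyncio


class SketchAsyncReqChannel:
    """Hypothetical async request channel with per-attempt timeouts."""

    def __init__(self, transport_call):
        self._call = transport_call  # coroutine function: load -> reply
        self.attempts = 0

    async def send(self, load, tries=3, timeout=60):
        last_error = None
        for _ in range(tries):
            self.attempts += 1
            try:
                # wait_for cancels the attempt if it exceeds the timeout
                return await asyncio.wait_for(self._call(load), timeout)
            except asyncio.TimeoutError as exc:
                last_error = exc
        raise last_error


def make_flaky_transport():
    """Build a transport whose first call hangs and whose later calls reply."""
    calls = {"n": 0}

    async def call(load):
        calls["n"] += 1
        if calls["n"] == 1:
            await asyncio.sleep(1)  # far longer than the timeout used below
        return {"ret": load["cmd"]}

    return call


channel = SketchAsyncReqChannel(make_flaky_transport())
reply = asyncio.run(channel.send({"cmd": "test.ping"}, tries=3, timeout=0.05))
```

While one `send` is waiting on its timeout, the event loop remains free to run other coroutines, which is what separates this from a blocking retry loop.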

Page 45: Saltconf 2016: Salt stack transport and concurrency

Concurrency
Now what?

• Now that we have a real concurrency model, what have we done with it?
  • MultiMinion in a single process (coroutine per connection)
  • Easily implement concurrent networking within Salt
    • TCP transport
    • IPC

Page 46: Saltconf 2016: Salt stack transport and concurrency

Page 47: Saltconf 2016: Salt stack transport and concurrency

Concurrency problems
Really? Problems?

• Most common pitfalls of concurrent programming
  • race conditions and memory collisions
  • deadlocks

Page 48: Saltconf 2016: Salt stack transport and concurrency

Concurrency problems
Race conditions

• Weird data problems in the reactor: https://github.com/saltstack/salt/issues/23373
• The underlying problem: the objects injected into modules (__salt__ etc.) were just dicts, which aren't threadsafe (or coroutinesafe!)
• The solution? `ContextDict`

Page 49: Saltconf 2016: Salt stack transport and concurrency

ContextDict
Copy-on-write thread/coroutine specific dict

• Works just like a dict
• Exposes a clone() method, which creates a `ChildContextDict`, a thread/coroutine local copy
• With tornado's StackContext, we switch the backing dict of the parent with your child using a context manager

    cd = ContextDict(foo='bar')
    print(cd['foo'])  # will be bar
    with tornado.stack_context.StackContext(cd.clone):
        print(cd['foo'])  # will be bar
        cd['foo'] = 'baz'
        print(cd['foo'])  # will be baz
    print(cd['foo'])  # will be bar

More examples: https://github.com/saltstack/salt/blob/develop/tests/unit/context_test.py
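The same behavior can be approximated without Tornado. The sketch below is not Salt's `ContextDict`: it is a hypothetical class that uses Python 3.7+ `contextvars` in place of `StackContext`, written only to show the mechanism. Reads and writes go to whichever dict is active, and `clone()` swaps in a private copy for the duration of a `with` block.

```python
import contextvars
from contextlib import contextmanager


class SketchContextDict:
    """Dict-like object whose writes inside clone() stay local to it."""

    def __init__(self, **data):
        self._shared = dict(data)
        # Holds the private copy while a clone is active, otherwise None.
        self._override = contextvars.ContextVar("override", default=None)

    def _active(self):
        override = self._override.get()
        return self._shared if override is None else override

    def __getitem__(self, key):
        return self._active()[key]

    def __setitem__(self, key, value):
        self._active()[key] = value

    @contextmanager
    def clone(self):
        """Switch the backing dict to a copy until the block exits."""
        token = self._override.set(dict(self._active()))
        try:
            yield self
        finally:
            self._override.reset(token)


cd = SketchContextDict(foo="bar")
seen = [cd["foo"]]
with cd.clone():
    seen.append(cd["foo"])   # the copy starts with the parent's data
    cd["foo"] = "baz"
    seen.append(cd["foo"])   # the write is visible inside the block
seen.append(cd["foo"])       # the parent never saw the write
```

Because context variables are tracked per thread and per asyncio task, two coroutines that each enter `clone()` would not see each other's writes.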

Page 50: Saltconf 2016: Salt stack transport and concurrency

Concurrency problems
Deadlocks

• Haven't seen any yet *knock on wood*. In general we avoid these since each coroutine is more-or-less independent of the others

Page 51: Saltconf 2016: Salt stack transport and concurrency

Concurrency problems
Layers!

• Don't forget, concurrency matters at all layers, including your DC-wide state execution
• For example: automated highstate enforcement of your whole DC
  • Does it matter if all DB hosts update at once?
  • Does it matter if all web servers update at once?
  • Does it matter if all edge boxes update at once?

Page 52: Saltconf 2016: Salt stack transport and concurrency

zk_concurrency
Concurrency controls for state execution

    acquire_lock:
      zk_concurrency.lock:
        - name: /trafficserver
        - zk_hosts: 'zookeeper:2181'
        - max_concurrency: 4
        - prereq:
          - service: trafficserver

    trafficserver:
      service.running: []

    release_lock:
      zk_concurrency.unlock:
        - name: /trafficserver
        - require:
          - service: trafficserver
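The effect of `max_concurrency: 4` can be pictured with a single-process analog. This sketch does not talk to ZooKeeper and is not how `zk_concurrency` is implemented; it uses a local `threading.BoundedSemaphore` with hypothetical names to show the semantics: twelve hosts want to restart a service, but no more than four hold the lock at any moment.

```python
import threading
import time

MAX_CONCURRENCY = 4
slots = threading.BoundedSemaphore(MAX_CONCURRENCY)
stats_lock = threading.Lock()
stats = {"running": 0, "peak": 0, "done": 0}


def restart_service(host):
    """Acquire a slot, 'restart' the service, then release the slot."""
    with slots:  # acquire_lock ... release_lock
        with stats_lock:
            stats["running"] += 1
            stats["peak"] = max(stats["peak"], stats["running"])
        time.sleep(0.02)  # stands in for the service restart
        with stats_lock:
            stats["running"] -= 1
            stats["done"] += 1


threads = [
    threading.Thread(target=restart_service, args=("host-%d" % n,))
    for n in range(12)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the state file the prereq/require pair plays the role of the `with` block, so a host takes the lock only when the service is about to change and gives it back once the change has been applied.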

Page 53: Saltconf 2016: Salt stack transport and concurrency

Future Awesomeness
Things on my "list"

• Transport
  • failover groups
  • even better HA (https://github.com/saltstack/salt/issues/25700, get involved in the conversation)
• Concurrency
  • async ext_pillar
  • Partially concurrent state execution (prefetch, etc.)?
  • Coroutine-based:
    • Reactor
    • Engines
    • Beacons
    • Thorium

Page 54: Saltconf 2016: Salt stack transport and concurrency

©2014 LinkedIn Corporation. All Rights Reserved.