Salt Transport Modularity and Concurrency for Performance and Scale
Thomas Jackson
Staff Site Reliability Engineer
LinkedIn
Agenda
• for item in ('transport', 'concurrency'):
• History
• Problems
• Options
• Solution
Transport in Salt
Salt Transport: a history
• In the beginning Salt was primarily a remote execution engine
• Send jobs from Master to N minions (defined by some target)
• In the beginning there was
"ZeroMQ (also spelled ØMQ, 0MQ or ZMQ)
is a high-performance asynchronous
messaging library, aimed at use in
distributed or concurrent applications."
- Wikipedia (https://en.wikipedia.org/wiki/ZeroMQ)
We took a normal TCP socket, injected it with a mix of radioactive
isotopes stolen from a secret Soviet atomic research project,
bombarded it with 1950-era cosmic rays, and put it into the hands
of a drug-addled comic book author with a badly-disguised fetish
for bulging muscles clad in spandex. Yes, ZeroMQ sockets are the
world-saving superheroes of the networking world.
- http://zguide.zeromq.org/page:all#How-It-Began
Salt Transport: a history
How ZMQ PUB/SUB looks
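Server
import zmq
context = zmq.Context()
socket = context.socket(zmq.PUB)
socket.bind("tcp://*:12345")
socket.send(b"Message")
Client
import zmq
context = zmq.Context()
socket = context.socket(zmq.SUB)
socket.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything; a SUB socket receives nothing otherwise
socket.connect("tcp://localhost:12345")
print(socket.recv())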
Salt Transport: a history
How ZMQ REQ/REP looks
Server
import zmq
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:12345")
message = socket.recv()  # block until a request arrives
socket.send(b"got message")
Client
import zmq
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:12345")
socket.send(b"Hello")
message = socket.recv()  # block until the reply arrives
Request lifecycle
Salt Transport: a history
Master Minion
1. Job publish
2. Sign-in (optional – potentially reused or cached)
3. Pillar Fetch
4. SLS/file fetch (optional)
5. Return
Initial ZeroMQ implementation
Salt Transport: a history
• Master-initiated messages
• Using the pub/sub socket pair in zmq
• All broadcast messages from the master to the minion
• Minion-initiated messages
• Using the req/rep socket pair in zmq
• All messages initiated by the minion, such as:
• Sign-in
• Job return
• Module sync
• Pillar
• Etc.
Initial problems
Salt Transport: a history
• Message loss
• Broadcasts were filtered client side
• Added zmq filtering: https://github.com/saltstack/salt/pull/13285
• Etc.
Larger problems
Salt Transport: a history
• Huge ZMQ publisher memory leak (https://github.com/zeromq/libzmq/issues/954)
• Workaround: Process manager in salt
• No concept of client state
• When messages arrive, there is no way to see if the client is still connected, which leads to auth storms
• Workaround: Exponential backoff on the minion side
• No sync "connect" (https://github.com/saltstack/salt/pull/21570)
• Workaround: fire event and wait for it to return (or timeout to expire)
• Some users have issues with the LGPL license
• Workaround: n/a 
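The exponential-backoff workaround for auth storms can be sketched roughly as below. This is a minimal illustration and not Salt's actual code: the function name `backoff_delays`, the parameters `base` and `cap`, and the use of random jitter are all assumptions made for the example.

```python
import random


def backoff_delays(base=1.0, cap=60.0, tries=6, rng=random.random):
    """Yield one delay (in seconds) per retry attempt.

    The upper bound doubles on each attempt (base, 2*base, 4*base, ...)
    until it reaches `cap`. The actual delay is a random fraction of that
    bound, so that many minions do not all retry at the same moment.
    """
    for attempt in range(tries):
        bound = min(cap, base * (2 ** attempt))
        yield rng() * bound


# Hypothetical usage: a minion would sleep for each delay before re-authenticating.
delays = list(backoff_delays(base=1.0, cap=10.0, tries=5))
```

The randomisation matters as much as the doubling: without it, minions that lost their connection together would also come back together.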
The Reliable Asynchronous Event Transport, or
RAET, is an alternative transport medium developed
specifically with Salt in mind. It has been developed to
allow queuing to happen up on the application layer
and comes with socket layer encryption. It also
abstracts a great deal of control over the socket layer
and makes it easy to bubble up errors and exceptions.
- docs.saltstack.com
Salt Transport: previous attempt
RAET
Salt Transport: previous attempt
• The good
• No ZMQ!
• The bad
• Effectively a re-implementation of the daemons (separate files, etc.)
• Unable to run zmq and RAET simultaneously (initially; hydra, which simply runs both daemons at once, was added later)
• The different
• Changed the model from “minions always connect” to “minions are listening”, meaning minions have a socket to attack
What do we really need
Salt Transport: back to basics
• Salt is a platform, not a specific transport; we need transports to be modular
• Some requirements:
• Simple interface to implement (such that other modules can be written)
• Test coverage (including pre-canned tests for new modules)
• Support N transports simultaneously (for ramps, and complex infra)
• Clear contract of security/privacy requirements of various methods
• ReqChannel: minion to master messages
Salt Transport: Channels!
• Master
• pre_fork(self, process_manager)
• post_fork(self, payload_handler, io_loop)
• Minion
• send(self, load, tries=3, timeout=60)
• crypted_transfer_decode_dictentry(self, load, dictkey=None, tries=3, timeout=60)
• PubChannel: broadcasts to the appropriate minions
Salt Transport: Channels!
• Master
• pre_fork(self, process_manager)
• publish(self, load)
• Minion:
• on_recv(self, callback)
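To make the shape of these interfaces concrete, here is a small sketch of the minion-side halves with a toy in-memory "transport" behind them. Only the method names and signatures (`send`, `on_recv`, `publish`) come from the slides; every class name here is hypothetical, and the master-side `pre_fork`/`post_fork` hooks are left out for brevity.

```python
import abc


class ReqChannelSketch(abc.ABC):
    """Minion-side request channel: minion-initiated messages to the master."""

    @abc.abstractmethod
    def send(self, load, tries=3, timeout=60):
        """Send `load` to the master and return its reply."""


class PubChannelSketch(abc.ABC):
    """Minion-side publish channel: receives broadcasts from the master."""

    @abc.abstractmethod
    def on_recv(self, callback):
        """Register `callback` to be called with each published message."""


class InMemoryMaster:
    """Toy stand-in for a master: answers requests and publishes jobs."""

    def __init__(self, payload_handler):
        self.payload_handler = payload_handler
        self.subscribers = []

    def publish(self, load):
        for callback in self.subscribers:
            callback(load)


class InMemoryReqChannel(ReqChannelSketch):
    def __init__(self, master):
        self.master = master

    def send(self, load, tries=3, timeout=60):
        # No network here, so `tries` and `timeout` are accepted but unused.
        return self.master.payload_handler(load)


class InMemoryPubChannel(PubChannelSketch):
    def __init__(self, master):
        self.master = master

    def on_recv(self, callback):
        self.master.subscribers.append(callback)


master = InMemoryMaster(payload_handler=lambda load: {"echo": load})
received = []
InMemoryPubChannel(master).on_recv(received.append)
master.publish({"fun": "test.ping"})
reply = InMemoryReqChannel(master).send({"cmd": "_return"})
```

Because callers only see `send` and `on_recv`, swapping the in-memory classes for a ZeroMQ or TCP implementation would not change the calling code, which is the point of the channel abstraction.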
Responsibilities
Salt Transport: Channels!
• Serialization
• Encryption
• Targeting (pub channel only)
TCP channel
Salt Transport: Channels!
• Wire protocol: msgpack({'head': SOMEHEADER, 'body': SOMEBODY})
• Main advantage over ZMQ? Better failure modes
• Faster failure detection (if minion isn’t connected to the master, you don’t have to wait for the timeouts)
• True link-status (no more auth storms!)
• Basically, we have sockets again! 
• https://docs.saltstack.com/en/develop/topics/transports/tcp.html
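The head/body envelope from the wire-protocol bullet can be illustrated with a short sketch. Two things here are assumptions and not taken from the slides: `json` is used as a standard-library stand-in for msgpack so the example runs without extra packages, and the 4-byte length prefix is just one simple way to find message boundaries on a stream, not necessarily how Salt frames messages. The `mid` header field is hypothetical.

```python
import json
import struct


def frame(head, body):
    """Wrap head/body in one envelope and prefix it with its byte length."""
    payload = json.dumps({"head": head, "body": body}).encode("utf-8")
    return struct.pack("!I", len(payload)) + payload


def unframe(buffer):
    """Return (message, remaining_bytes), or (None, buffer) if incomplete."""
    if len(buffer) < 4:
        return None, buffer
    (length,) = struct.unpack("!I", buffer[:4])
    if len(buffer) < 4 + length:
        return None, buffer
    message = json.loads(buffer[4:4 + length].decode("utf-8"))
    return message, buffer[4 + length:]


# Two messages written back to back on the same stream.
stream = frame({"mid": 1}, "ping") + frame({"mid": 2}, "pong")
first, rest = unframe(stream)
second, rest = unframe(rest)
```

With a plain TCP stream the application owns the connection, so a closed socket is visible immediately; that is where the faster failure detection in the bullets above comes from.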
TCP: How does it look?
Salt Transport: Channels!
async_channel = salt.transport.client.AsyncReqChannel.factory(minion_opts)
ret = yield async_channel.send(msg)
TCP: How accurate?
Salt Transport: Channels!
• ZeroMQ
• Total jobs: 1000
• Completed jobs: 171
• Hit rate: 17.1%
• TCP
• Total jobs: 1000
• Completed jobs: 1000
• Hit rate: 100%
TCP: How does it perform
Salt Transport: Channels!
• 15 byte message
• ZeroMQ*
• Average time: 0.00295809405715
• QPS: 2246.952241147
• TCP
• Average time: 0.0023341544863
• QPS: 2580.04452801
TCP: How does it perform
Salt Transport: Channels!
• 1053 byte message
• ZeroMQ*
• Average time: 0.00278297542184
• QPS: 2489.300394919
• TCP
• Average time: 0.00251070397869
• QPS: 2602.4855051
Awesome!
Salt Transport: Channels!
• Definitely awesome!
• But async? What was that about?
• Before we get into specifics, let's talk about concurrency
The General Problem
Concurrency
We have lots of things to do, some of which are blocking calls to remote things which
are “slow”. It is more efficient (and overall “faster”) to work on something else while we
wait for that “slow” call.
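A small sketch of that idea, using the standard library's asyncio (not Tornado, which the talk settles on later) and a sleep as a stand-in for the "slow" remote call. The function names and the delay values are made up for illustration.

```python
import asyncio
import time


async def slow_call(name, delay):
    """Stand-in for a slow remote call; sleeps without blocking the loop."""
    await asyncio.sleep(delay)
    return name


async def run_concurrently(delay=0.1, count=5):
    """Start `count` slow calls at once and wait for all of them."""
    started = time.monotonic()
    results = await asyncio.gather(
        *(slow_call("job-%d" % i, delay) for i in range(count))
    )
    return results, time.monotonic() - started


results, elapsed = asyncio.run(run_concurrently())
```

Run one after another, the five calls would take about five times `delay`; overlapped, the total should be close to a single `delay`, because the process works on the other calls while each one waits.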
Current state of concurrency in Salt
Concurrency
• Master-side:
• The master creates N MWorkers to process N requests in parallel
• Interfaces with non-blocking code as well, using `while True:` loops to do timeouts etc.
• Minion-side:
• Threads used in MultiMaster for managing the multiple master connections
Problems
Concurrency
• No unified approach (multiprocessing, threading, nonblocking “loops” -- all in use)
• Slow and/or blocking operations hold process/thread while waiting
• No consistent use of non-blocking libraries, so the code is a mix of loops and
blocking calls
• Limited scalability (each approach scales differently)
Common solutions in Python
Concurrency
• Threading
• Multiprocessing
• User-space “threads”: Coroutines / stackless threads
Concurrency
Threading
• Some isolation between threads
• Pre-emptive scheduling
import threading
import requests

NUM_REQUESTS = 10  # number of concurrent requests (example value)

def handle_request():
    ret = requests.get('http://slowthing/')
    # do something else

threads = []
for x in range(NUM_REQUESTS):
    t = threading.Thread(target=handle_request)
    t.start()
    threads.append(t)
for t in threads:
    t.join()  # wait for every thread to finish
Concurrency
Multiprocessing
• Complete isolation
• Pre-emptive scheduling
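import multiprocessing
import requests

NUM_REQUESTS = 10  # number of worker processes (example value)

def handle():
    ret = requests.get('http://slowthing/')
    # do something else

if __name__ == '__main__':  # required where processes are spawned, not forked
    processes = []
    for x in range(NUM_REQUESTS):
        p = multiprocessing.Process(target=handle)
        p.start()
        processes.append(p)
    for p in processes:
        p.join()  # wait for every process to exit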
• User-space “threads”: Coroutines / stackless threads
Concurrency
• Some libraries you may have heard of
• gevent
• Stackless python
• Greenlet
• Twisted
• Tornado
• How are these implemented
• Green threads
• callbacks
• coroutines
Why Coroutines?
Concurrency
• Coroutines have been in use in python for a while (tornado)
• The new asyncio in python3 (tulip) is coroutines
(https://docs.python.org/3/library/asyncio.html)
Coroutines are computer program components
that generalize subroutines for nonpreemptive
multitasking, by allowing multiple entry points
for suspending and resuming execution at
certain locations.
- https://en.wikipedia.org/wiki/Coroutine
Concurrency
Concurrency
Coroutines: what is this magic?
def item_of_work():
    while True:
        input = yield
        yield do_something(input)
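The generator above only becomes useful once something drives it. A minimal sketch of a caller follows; `do_something` is a made-up placeholder that doubles its input, and the local variable is renamed to `value` to avoid shadowing the built-in `input`.

```python
def do_something(value):
    # Hypothetical unit of work; here it just doubles its input.
    return value * 2


def item_of_work():
    while True:
        value = yield              # pause until the caller sends a value
        yield do_something(value)  # hand the result back, then pause again


worker = item_of_work()
next(worker)             # run to the first `yield` (priming the generator)
first = worker.send(21)  # resume, compute, pause at the second `yield`
next(worker)             # loop around and wait for the next input
second = worker.send(5)
```

Each `yield` is a point where the function gives up control voluntarily; a scheduler such as an IO loop is, in essence, code that decides which paused generator to resume next.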
Concurrency
Coroutines: what is this magic?
def some_complex_handle():
    while True:
        input = yield
        out1 = do_something(input)
        yield None
        out2 = do_something2(out1)
        yield None
        return do_something3(out2)
Concurrency
Tornado coroutines
• Some isolation between coroutines
• Explicit yield
• Light “threads”
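import tornado.gen
import tornado.httpclient
import tornado.ioloop

@tornado.gen.coroutine
def handle_request():
    # requests.get() blocks and returns no future, so use tornado's async client
    client = tornado.httpclient.AsyncHTTPClient()
    ret = yield client.fetch('http://slow/')
    # do something else

loop = tornado.ioloop.IOLoop.current()
loop.spawn_callback(handle_request)
loop.start()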
Coroutines: futures
Concurrency
• Futures are just objects that represent a thing that will complete in the future
• This allows methods to return immediately, but finish the task in the future
• This allows the callers to yield execution until the futures they depend on complete
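As a rough illustration of these three points, the sketch below uses `concurrent.futures` from the standard library instead of Tornado's future class; the idea is the same, but the API differs. `slow_lookup` is a hypothetical stand-in for a slow remote call.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_lookup(key):
    """Stand-in for a slow remote call."""
    time.sleep(0.05)
    return key.upper()


completed = []
with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns a Future immediately; the work happens in the background.
    future = pool.submit(slow_lookup, "pillar")
    # Ask to be told when the result is ready.
    future.add_done_callback(lambda f: completed.append(f.result()))
    # Or wait for the result at the point where it is actually needed.
    value = future.result()
```

In a coroutine, `yield future` (Tornado) or `await future` (asyncio) plays the role of `future.result()`, except that it suspends only the current coroutine and not the whole thread.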
Concurrency
Coroutines: with futures
• Yield execution, and get returns
• Method looks fairly normal
• Stack traces in here have context
• Easy chaining of futures
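import tornado.gen

@tornado.gen.coroutine
def some_complex_handle(request):
    a = yield is_authd(request)
    if not a:
        return False  # on Python 2: raise tornado.gen.Return(False)
    ret = yield do_request(request)
    # yield a list of futures to wait for both saves in parallel
    yield [save1(ret), save2(ret)]
    return ret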
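For comparison, the same handler can be written with the standard library's asyncio, where `await` takes the place of `yield`. This is a sketch under assumptions: `is_authd`, `do_request`, `save1`, and `save2` are hypothetical placeholders with made-up behaviour, since the slides do not define them.

```python
import asyncio

saved = []


async def is_authd(request):
    return request.get("token") == "secret"  # hypothetical auth check


async def do_request(request):
    return {"result": request["cmd"]}        # hypothetical request handler


async def save1(ret):
    saved.append(("store1", ret))


async def save2(ret):
    saved.append(("store2", ret))


async def some_complex_handle(request):
    if not await is_authd(request):
        return False
    ret = await do_request(request)
    # Run both saves concurrently and wait for both to finish.
    await asyncio.gather(save1(ret), save2(ret))
    return ret


ok = asyncio.run(some_complex_handle({"token": "secret", "cmd": "ping"}))
denied = asyncio.run(some_complex_handle({"token": "nope", "cmd": "ping"}))
```

The body reads top to bottom like ordinary blocking code, which is the "method looks fairly normal" point from the bullets above.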
Tornado in Salt
Concurrency
• What is tornado?
• Python web framework and asynchronous networking library
• Why Tornado and not asyncio?
• Free python 2.x compatibility!
• A fairly comprehensive set of libraries for it (http, locks, queues, etc.)
Back to the transport interfaces
Concurrency
• AsyncReqChannel
• send: return a future
• crypted_transfer_decode_dictentry: return a future
ret = yield channel.send(load, timeout=timeout)
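One way a future-returning `send` with `tries` and `timeout` could behave is sketched below with asyncio. This is an assumption about the semantics, not Salt's implementation: `send_with_retries` and `flaky_send` are hypothetical names, and "retry only on timeout" is a choice made for the example.

```python
import asyncio


async def send_with_retries(send_once, load, tries=3, timeout=60):
    """Await `send_once(load)`, retrying when an attempt times out."""
    for attempt in range(1, tries + 1):
        try:
            return await asyncio.wait_for(send_once(load), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == tries:
                raise  # out of attempts: let the caller see the timeout


attempts = []


async def flaky_send(load):
    """Hypothetical transport: the first attempt hangs, later ones answer."""
    attempts.append(load)
    if len(attempts) == 1:
        await asyncio.sleep(1)  # longer than the timeout used below
    return {"ack": load}


reply = asyncio.run(send_with_retries(flaky_send, "ping", tries=3, timeout=0.05))
```

While one request is waiting out its timeout, the event loop stays free to service other channels, which a blocking `send` in a worker process cannot offer.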
Now what?
Concurrency
• Now that we have a real concurrency model, what have we done with it?
• MultiMinion in a single process (coroutine per connection)
• Easily implement concurrent networking within Salt
• TCP transport
• IPC
Really? Problems?
Concurrency problems
• Most common pitfalls to concurrent programming
• race conditions and memory collisions
• deadlocks
Race conditions
Concurrency problems
• Weird data problems in the reactor: https://github.com/saltstack/salt/issues/23373
• The underlying problem: the objects injected into modules (__salt__ etc.) were just dicts,
which aren’t thread-safe (or coroutine-safe!)
• The solution? `ContextDict`
Copy-on-write thread/coroutine specific dict
ContextDict
• Works just like a dict
• Exposes a clone() method, which creates a `ChildContextDict` which is a
thread/coroutine local copy
• With tornado’s StackContext, we switch the backing dict of the parent with your
child using a context manager
cd = ContextDict(foo='bar')
print(cd['foo'])  # will be bar
with tornado.stack_context.StackContext(cd.clone):
    print(cd['foo'])  # will be bar
    cd['foo'] = 'baz'
    print(cd['foo'])  # will be baz
print(cd['foo'])  # will be bar
More examples: https://github.com/saltstack/salt/blob/develop/tests/unit/context_test.py
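The general idea can be sketched with the standard library's `contextvars` in place of Tornado's StackContext. This is not Salt's `ContextDict`: the class name `ContextLocalDict` is hypothetical, and for simplicity it copies the dict eagerly on `clone()` where a true copy-on-write version would defer the copy until the first write.

```python
import contextlib
import contextvars
from collections.abc import MutableMapping


class ContextLocalDict(MutableMapping):
    """Dict-like object whose backing dict can be swapped per context."""

    def __init__(self, **data):
        self._global = dict(data)
        self._override = contextvars.ContextVar("override", default=None)

    def _backing(self):
        override = self._override.get()
        return self._global if override is None else override

    @contextlib.contextmanager
    def clone(self):
        """Work on a private copy inside the `with` block."""
        token = self._override.set(dict(self._backing()))
        try:
            yield self
        finally:
            self._override.reset(token)

    def __getitem__(self, key):
        return self._backing()[key]

    def __setitem__(self, key, value):
        self._backing()[key] = value

    def __delitem__(self, key):
        del self._backing()[key]

    def __iter__(self):
        return iter(self._backing())

    def __len__(self):
        return len(self._backing())


cd = ContextLocalDict(foo="bar")
with cd.clone():
    cd["foo"] = "baz"
    inside = cd["foo"]
outside = cd["foo"]
```

Callers keep using the same `cd` object throughout; only the dict behind it changes, so code that was handed `__salt__`-style globals does not need to know it is running inside a clone.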
Deadlocks
Concurrency problems
• haven't seen any yet *knock on wood* -- in general we avoid these since each
coroutine is more-or-less independent of the others
Layers!
Concurrency problems
• Don’t forget concurrency at all layers, including your DC-wide state execution
• For example: automated highstate enforcement of your whole DC
• Does it matter if all DB hosts update at once?
• Does it matter if all web servers update at once?
• Does it matter if all edge boxes update at once?
concurrency controls for state execution
zk_concurrency
acquire_lock:
  zk_concurrency.lock:
    - name: /trafficserver
    - zk_hosts: 'zookeeper:2181'
    - max_concurrency: 4
    - prereq:
      - service: trafficserver
trafficserver:
  service.running: []
release_lock:
  zk_concurrency.unlock:
    - name: /trafficserver
    - require:
      - service: trafficserver
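The effect of `max_concurrency: 4` can be pictured with a local analogy: a counting semaphore that lets at most four "hosts" restart at once. This sketch uses threads in one process as a stand-in for a ZooKeeper-backed lock shared across machines; `restart_host` and the host names are hypothetical.

```python
import threading
import time

MAX_CONCURRENCY = 4
slots = threading.BoundedSemaphore(MAX_CONCURRENCY)
stats_lock = threading.Lock()
stats = {"running": 0, "peak": 0}


def restart_host(host):
    """Hypothetical host update that holds one of the limited slots."""
    with slots:  # acquire a slot on entry, release it on exit
        with stats_lock:
            stats["running"] += 1
            stats["peak"] = max(stats["peak"], stats["running"])
        time.sleep(0.01)  # pretend to restart the service
        with stats_lock:
            stats["running"] -= 1


threads = [threading.Thread(target=restart_host, args=("host%d" % i,))
           for i in range(12)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many hosts start their highstate at the same moment, `stats["peak"]` never exceeds four, so the rest of the fleet keeps serving while a few hosts restart.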
Things on my “list”
Future Awesomeness
• Transport
• failover groups
• even better HA (https://github.com/saltstack/salt/issues/25700 -- get involved in the conversation)
• Concurrency
• async ext_pillar
• Partially concurrent state execution (prefetch, etc.)?
• Coroutine-based:
• Reactor
• Engines
• Beacons
• Thorium
©2014 LinkedIn Corporation. All Rights Reserved.

A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Saltconf 2016: Salt stack transport and concurrency

  • 1.
  • 2. Salt Transport Modularity and Concurrency for Performance and Scale Thomas Jackson Staff Site Reliability Engineer LinkedIn
  • 3. 3 Agenda • for item in (‘transport’, ‘concurrency’): • History • Problems • Options • Solution
  • 4. Transport in Salt 4 Salt Transport: a history • In the beginning Salt was primarily a remote execution engine • Send jobs from Master to N minions (defined by some target) • In the beginning there was
  • 5. 5 "ZeroMQ (also spelled ØMQ, 0MQ or ZMQ) is a high-performance asynchronous messaging library, aimed at use in distributed or concurrent applications.” - Wikipedia (https://en.wikipedia.org/wiki/ZeroMQ)
  • 6. 6 We took a normal TCP socket, injected it with a mix of radioactive isotopes stolen from a secret Soviet atomic research project, bombarded it with 1950-era cosmic rays, and put it into the hands of a drug-addled comic book author with a badly-disguised fetish for bulging muscles clad in spandex. Yes, ZeroMQ sockets are the world-saving superheroes of the networking world. - http://zguide.zeromq.org/page:all#How-It-Began
  • 7. How ZMQ PUB/SUB looks (Salt Transport: a history)
      Server:
        import zmq
        context = zmq.Context()
        socket = context.socket(zmq.PUB)
        socket.bind("tcp://*:12345")
        socket.send(b"Message")
      Client:
        import zmq
        context = zmq.Context()
        socket = context.socket(zmq.SUB)
        socket.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything; a SUB socket with no subscription receives nothing
        socket.connect("tcp://localhost:12345")
        print(socket.recv())
  • 8. How ZMQ REQ/REP looks (Salt Transport: a history)
      Server:
        import zmq
        context = zmq.Context()
        socket = context.socket(zmq.REP)
        socket.bind("tcp://*:12345")
        message = socket.recv()
        socket.send(b"got message")
      Client:
        import zmq
        context = zmq.Context()
        socket = context.socket(zmq.REQ)
        socket.connect("tcp://localhost:12345")
        socket.send(b"Hello")
        message = socket.recv()
  • 9. Request lifecycle 9 Salt Transport: a history Master Minion 1. Job publish 2. Sign-in (optional – potentially reused or cached) 3. Pillar Fetch 4. SLS/file fetch (optional) 5. Return
  • 10. Initial ZeroMQ implementation 10 Salt Transport: a history • Master-initiated messages • Using the pub/sub socket pair in zmq • All broadcast messages from the master to the minion • Minion-initiated messages • Using the req/rep socket pair in zmq • All messages initiated by the minion, such as: • Sign-in • Job return • Module sync • Pillar • Etc.
  • 11. Initial problems (Salt Transport: a history) • Message loss • Broadcasts were filtered client side • Added zmq filtering: https://github.com/saltstack/salt/pull/13285 • Etc.
  • 12. 12
  • 13. Larger problems 13 Salt Transport: a history • Huge ZMQ publisher memory leak (https://github.com/zeromq/libzmq/issues/954) • Workaround: Process manager in salt • No concept of client state • When messages arrive, there is no way to see if the client is still connected– which leads to auth storms • Workaround: Exponential backoff on the minion side • No sync "connect" (https://github.com/saltstack/salt/pull/21570) • Workaround: fire event and wait for it to return (or timeout to expire) • Some users have issues with the LGPL license • Workaround: n/a 
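
The "exponential backoff on the minion side" workaround from the slide above can be sketched roughly as follows. This is not Salt's code: `backoff_delays`, `base`, `cap`, `attempts` and the use of random jitter are hypothetical choices made for illustration.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=None):
    """Yield one randomized wait (in seconds) per reconnect attempt.

    Hypothetical helper: the upper bound doubles on every attempt and
    is capped at `cap`.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        upper = min(cap, base * (2 ** attempt))
        # Pick a random point below the bound so that many minions that
        # lost the master at the same moment do not all retry in lockstep.
        yield rng.uniform(0, upper)


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays()):
        print("attempt %d: wait %.2fs before re-authenticating" % (attempt, delay))
```

The randomization matters as much as the doubling: without it, every minion would wake up on the same schedule and the auth storm would simply repeat at longer intervals.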
  • 14.
  • 15. 15 The Reliable Asynchronous Event Transport, or RAET, is an alternative transport medium developed specifically with Salt in mind. It has been developed to allow queuing to happen up on the application layer and comes with socket layer encryption. It also abstracts a great deal of control over the socket layer and makes it easy to bubble up errors and exceptions. - docs.saltstack.com Salt Transport: previous attempt
  • 16. RAET 16 Salt Transport: previous attempt • The good • No ZMQ! • The bad • Effectively a re-implementation of the daemons (separate files, etc.) • Unable to run zmq and RAET simultaneously (initially, hydra was added later – which just runs both daemons at once) • The different • Changed the model from “minions always connect” to “minions are listening”, meaning minions have a socket to attack
  • 17. 17
  • 18. What do we really need 18 Salt Transport: back to basics • Salt is a platform, not a specific transport– we need transports to be modular • Some requirements: • Simple interface to implement (such that other modules can be written) • Test coverage (including pre-canned tests for new modules) • Support N transports simultaneously (for ramps, and complex infra) • Clear contract of security/privacy requirements of various methods
  • 19. • ReqChannel: minion to master messages 19 Salt Transport: Channels! • Master • pre_fork(self, process_manager) • post_fork(self, payload_handler, io_loop) • Minion • send(self, load, tries=3, timeout=60) • crypted_transfer_decode_dictentry(self, load, dictkey=None, tries=3, timeout=60)
  • 20. • PubChannel: broadcasts to the appropriate minions 20 Salt Transport: Channels! • Master • pre_fork(self, process_manager) • publish(self, load) • Minion: • on_recv(self, callback)
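
To make the two channel contracts concrete, here is a minimal sketch of what a pluggable transport could look like. The method names and signatures (`pre_fork`, `post_fork`, `send`, `publish`, `on_recv`) come from the slides; the class names, the in-process "loopback" transport and the handler are hypothetical, and `crypted_transfer_decode_dictentry`, serialization and encryption are left out for brevity.

```python
import abc


class ReqServerChannel(abc.ABC):
    """Master side of the req channel (hypothetical class name)."""

    @abc.abstractmethod
    def pre_fork(self, process_manager): ...

    @abc.abstractmethod
    def post_fork(self, payload_handler, io_loop): ...


class ReqChannel(abc.ABC):
    """Minion side of the req channel."""

    @abc.abstractmethod
    def send(self, load, tries=3, timeout=60): ...


class PubServerChannel(abc.ABC):
    """Master side of the pub channel (hypothetical class name)."""

    @abc.abstractmethod
    def pre_fork(self, process_manager): ...

    @abc.abstractmethod
    def publish(self, load): ...


class PubChannel(abc.ABC):
    """Minion side of the pub channel."""

    @abc.abstractmethod
    def on_recv(self, callback): ...


class LoopbackReqServer(ReqServerChannel):
    def __init__(self):
        self.handler = None

    def pre_fork(self, process_manager):
        pass  # a real transport would bind its listening socket here

    def post_fork(self, payload_handler, io_loop):
        self.handler = payload_handler  # called once per worker process


class LoopbackReqChannel(ReqChannel):
    def __init__(self, server):
        self.server = server

    def send(self, load, tries=3, timeout=60):
        return self.server.handler(load)  # no network: call the handler directly


class LoopbackPubServer(PubServerChannel):
    def __init__(self):
        self.subscribers = []

    def pre_fork(self, process_manager):
        pass

    def publish(self, load):
        for callback in self.subscribers:
            callback(load)


class LoopbackPubChannel(PubChannel):
    def __init__(self, server):
        self.server = server

    def on_recv(self, callback):
        self.server.subscribers.append(callback)


# Minion-initiated request and its reply.
req_server = LoopbackReqServer()
req_server.pre_fork(process_manager=None)
req_server.post_fork(lambda load: {"echo": load}, io_loop=None)
reply = LoopbackReqChannel(req_server).send({"cmd": "ping"})

# Master-initiated broadcast.
received = []
pub_server = LoopbackPubServer()
LoopbackPubChannel(pub_server).on_recv(received.append)
pub_server.publish({"fun": "test.ping"})
print(reply, received)
```

The point of the exercise is that the calling code only ever sees `send`, `publish` and `on_recv`; swapping ZeroMQ for TCP (or for this toy loopback) does not change it.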
  • 21. Responsibilities 21 Salt Transport: Channels! • Serialization • Encryption • Targeting (pub channel only)
  • 22. TCP channel 22 Salt Transport: Channels! • Wire protocol: msgpack({'head': SOMEHEADER, 'body': SOMEBODY}) • Main advantages over ZMQ? better failure modes • Faster failure detection (if minion isn’t connected to the master, you don’t have to wait for the timeouts) • True link-status (no more auth storms!) • Basically, we have sockets again!  • https://docs.saltstack.com/en/develop/topics/transports/tcp.html
  • 23. TCP: How does it look? (Salt Transport: Channels!)
        async_channel = salt.transport.client.AsyncReqChannel.factory(minion_opts)
        ret = yield async_channel.send(msg)
  • 24. TCP: How accurate? 24 Salt Transport: Channels! • ZeroMQ • Total jobs: 1000 • Completed jobs: 171 • Hit rate: 17.1% • TCP • Total jobs: 1000 • Completed jobs: 1000 • Hit rate: 100%
  • 25. TCP: How does it perform 25 Salt Transport: Channels! • 15 byte message • ZeroMQ* • Average time: 0.00295809405715 • QPS: 2246.952241147 • TCP • Average time: 0.0023341544863 • QPS: 2580.04452801
  • 26. TCP: How does it perform 26 Salt Transport: Channels! • 1053 byte message • ZeroMQ* • Average time: 0.00278297542184 • QPS: 2489.300394919 • TCP • Average time: 0.00251070397869 • QPS: 2602.4855051
  • 27. Awesome! (Salt Transport: Channels!) • Definitely awesome! • But async? What was that about? • Before we get into specifics, let's talk about concurrency
  • 28. The General Problem 28 Concurrency We have lots of things to do, some of which are blocking calls to remote things which are “slow”. It is more efficient (and overall “faster”) to work on something else while we wait for that “slow” call.
  • 29. 29
  • 30. Current state of concurrency in Salt (Concurrency) • Master-side: • N Mworkers to process N requests in parallel • Also interfaces with non-blocking calls, using `while True:` loops to do timeouts etc. • Minion-side: • Threads used in MultiMaster for managing the multiple master connections
  • 31. Problems 31 Concurrency • No unified approach (multiprocessing, threading, nonblocking “loops” -- all in use) • Slow and/or blocking operations hold process/thread while waiting • No consistent use of non-blocking libraries, so the code is a mix of loops and blocking calls • Limited scalability (each approach scales differently)
  • 32. Common solutions in Python 32 Concurrency • Threading • Multiprocessing • User-space “threads”: Coroutines / stackless threads
  • 33. Threading (Concurrency) • Some isolation between threads • Pre-emptive scheduling
        import threading
        import requests

        NUM_REQUESTS = 10  # example value; the slide leaves it undefined

        def handle_request():
            ret = requests.get('http://slowthing/')
            # do something else

        threads = []
        for x in range(NUM_REQUESTS):
            t = threading.Thread(target=handle_request)
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
  • 34. Multiprocessing (Concurrency) • Complete isolation • Pre-emptive scheduling
        import multiprocessing
        import requests

        NUM_REQUESTS = 10  # example value; the slide leaves it undefined

        def handle():
            ret = requests.get('http://slowthing/')
            # do something else

        processes = []
        for x in range(NUM_REQUESTS):
            p = multiprocessing.Process(target=handle)
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
  • 35. • User-space “threads”: Coroutines / stackless threads 35 Concurrency • Some libraries you may have heard of • gevent • Stackless python • Greenlet • Twisted • Tornado • How are these implemented • Green threads • callbacks • coroutines
  • 36. Why Coroutines? 36 Concurrency • Coroutines have been in use in python for a while (tornado) • The new asyncio in python3 (tulip) is coroutines (https://docs.python.org/3/library/asyncio.html)
  • 37. 37 Coroutines are computer program components that generalize subroutines for nonpreemptive multitasking, by allowing multiple entry points for suspending and resuming execution at certain locations. - https://en.wikipedia.org/wiki/Coroutine Concurrency
  • 38. Coroutines: what is this magic? (Concurrency)
        def item_of_work():
            while True:
                input = yield
                yield do_something(input)
  • 39. Coroutines: what is this magic? (Concurrency)
        def some_complex_handle():
            while True:
                input = yield
                out1 = do_something(input)
                yield None
                out2 = do_something2(out1)
                yield None
                return do_something3(out2)
  • 40. Tornado coroutines (Concurrency) • Some isolation between coroutines • Explicit yield • Light “threads”
        import tornado.gen
        import tornado.httpclient
        import tornado.ioloop

        @tornado.gen.coroutine
        def handle_request():
            # requests.get() blocks and does not return a future,
            # so use Tornado's asynchronous HTTP client instead
            ret = yield tornado.httpclient.AsyncHTTPClient().fetch('http://slow/')
            # do something else

        loop = tornado.ioloop.IOLoop.current()
        loop.spawn_callback(handle_request)
        loop.start()
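
As a rough illustration of why explicit `yield` points are useful, the sketch below runs two generator-based tasks under a tiny round-robin scheduler. The names `task` and `run_round_robin` are made up for this example; real event loops such as Tornado's resume a coroutine when the thing it waits on is ready rather than simply taking turns.

```python
from collections import deque


def task(name, steps, log):
    """A toy unit of work that hands control back after every step."""
    for step in range(steps):
        log.append("%s step %d" % (name, step))
        yield  # explicit yield point: let another task run


def run_round_robin(tasks):
    """Run generator-based tasks one step at a time until all have finished."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)      # resume the task until its next yield
        except StopIteration:
            continue           # the task finished; drop it
        queue.append(current)  # not finished yet; schedule it again


log = []
run_round_robin([task("a", 2, log), task("b", 3, log)])
print(log)
# ['a step 0', 'b step 0', 'a step 1', 'b step 1', 'b step 2']
```

Neither task ever blocks the other for longer than one step, and no threads or locks are involved.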
  • 41. Coroutines– futures 41 Concurrency • Futures are just objects that represent a thing that will complete in the future • This allows methods to return immediately, but finish the task in the future • This allows the callers to yield execution until the futures they depend on complete
  • 42. Coroutines with futures (Concurrency) • Yield execution, and get returns • Method looks fairly normal • Stack traces in here have context • Easy chaining of futures
        @tornado.gen.coroutine
        def some_complex_handle(request):
            a = yield is_authd(request)
            if not a:
                return False  # Python 3; on Python 2 raise tornado.gen.Return(False)
            ret = yield do_request(request)
            yield [save1(ret), save2(ret)]  # a list of futures is waited on in parallel
            return ret
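
For readers on Python 3 only, the same shape can be written with the standard library's `asyncio` instead of Tornado. This is a stand-in, not what Salt uses: `is_authd`, `do_request` and `save` are hypothetical helpers, and `asyncio.sleep(0)` stands in for a slow remote call.

```python
import asyncio


async def is_authd(request):
    await asyncio.sleep(0)  # stand-in for a slow auth lookup
    return request.get("token") == "secret"


async def do_request(request):
    await asyncio.sleep(0)  # stand-in for the real work
    return {"result": request["cmd"].upper()}


async def save(store, ret):
    await asyncio.sleep(0)  # stand-in for a slow write
    store.append(ret)


async def some_complex_handle(request, store1, store2):
    if not await is_authd(request):
        return False
    ret = await do_request(request)
    # Wait on both saves concurrently, like yielding several futures at once.
    await asyncio.gather(save(store1, ret), save(store2, ret))
    return ret


store1, store2 = [], []
ok = asyncio.run(some_complex_handle({"token": "secret", "cmd": "ping"}, store1, store2))
denied = asyncio.run(some_complex_handle({"token": "nope", "cmd": "ping"}, store1, store2))
print(ok, denied)
```

The handler reads top to bottom like blocking code, yet every `await` is a point where the event loop may run something else.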
  • 43. Tornado in Salt 43 Concurrency • What is tornado? • Python web framework and asynchronous networking library • Why Tornado and not asyncio? • Free python 2.x compatibility! • A fairly comprehensive set of libraries for it (http, locks, queues, etc.)
  • 44. Back to the transport interfaces 44 Concurrency • AsyncReqChannel • send: return a future • crypted_transfer_decode_dictentry: return a future ret = yield channel.send(load, timeout=timeout)
  • 45. Now what? 45 Concurrency • Now that we have a real concurrency model, what have we done with it? • MultiMinion in a single process (coroutine per connection) • Easily implement concurrent networking within Salt • TCP transport • IPC
  • 46. 46
  • 47. Really? Problems? 47 Concurrency problems • Most common pitfalls to concurrent programming • race conditions and memory collisions • deadlocks
  • 48. Race conditions 48 Concurrency problems • Weird data problems in the reactor: https://github.com/saltstack/salt/issues/23373 • The underlying problem: injected stuff in modules (__salt__ etc.) were just dicts— which aren’t threadsafe (or coroutinesafe!) • The solution? `ContextDict`
  • 49. Copy-on-write thread/coroutine specific dict (ContextDict) • Works just like a dict • Exposes a clone() method, which creates a `ChildContextDict` which is a thread/coroutine local copy • With tornado’s StackContext, we switch the backing dict of the parent with your child using a context manager
        cd = ContextDict(foo='bar')
        print(cd['foo'])  # will be bar
        with tornado.stack_context.StackContext(cd.clone):
            print(cd['foo'])  # will be bar
            cd['foo'] = 'baz'
            print(cd['foo'])  # will be baz
        print(cd['foo'])  # will be bar
    More examples: https://github.com/saltstack/salt/blob/develop/tests/unit/context_test.py
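
The behaviour shown on the slide can be imitated with the standard library's `contextvars` module. The sketch below is not Salt's `ContextDict`: it copies the dict eagerly rather than on write, has no `ChildContextDict` class, and does not use Tornado's `StackContext`; it only shows the idea of switching the backing dict for the current thread or coroutine.

```python
import contextlib
import contextvars


class ContextDict(object):
    """Dict-like object whose backing dict can be overridden per context (sketch)."""

    def __init__(self, **data):
        self._global = dict(data)
        # Holds the private copy for the current thread/coroutine, if any.
        self._override = contextvars.ContextVar("override", default=None)

    def _active(self):
        override = self._override.get()
        return self._global if override is None else override

    def __getitem__(self, key):
        return self._active()[key]

    def __setitem__(self, key, value):
        self._active()[key] = value

    @contextlib.contextmanager
    def clone(self):
        """Use a private copy of the data for the duration of the block."""
        token = self._override.set(dict(self._active()))
        try:
            yield self
        finally:
            self._override.reset(token)


cd = ContextDict(foo="bar")
seen = [cd["foo"]]
with cd.clone():
    seen.append(cd["foo"])
    cd["foo"] = "baz"      # only the private copy changes
    seen.append(cd["foo"])
seen.append(cd["foo"])     # the shared dict is untouched
print(seen)
# ['bar', 'bar', 'baz', 'bar']
```

Because each `asyncio` task and each thread gets its own view of a `ContextVar`, a write made inside one request's `clone()` block is invisible to requests running concurrently.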
  • 50. Deadlocks 50 Concurrency problems • haven't seen any yet *knock on wood* -- in general we avoid these since each coroutine is more-or-less independent of the others
  • 51. Layers! 51 Concurrency problems • Don’t forget, concurrency at all layers– including your DC-wide state execution • For example: automated highstate enforcement of your whole DC • Does it matter if all DB hosts update at once? • Does it matter if all web servers update at once? • Does it matter if all edge boxes update at once?
  • 52. concurrency controls for state execution (zk_concurrency)
        acquire_lock:
          zk_concurrency.lock:
            - name: /trafficserver
            - zk_hosts: 'zookeeper:2181'
            - max_concurrency: 4
            - prereq:
              - service: trafficserver

        trafficserver:
          service.running: []

        release_lock:
          zk_concurrency.unlock:
            - name: /trafficserver
            - require:
              - service: trafficserver
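
One way to think about `max_concurrency: 4` is as a counting semaphore shared by every host in the data center. The sketch below is only a single-process analogy built on threads; it says nothing about how `zk_concurrency` talks to ZooKeeper, and `restart_service` and the host names are invented for the example.

```python
import threading
import time

MAX_CONCURRENCY = 4  # mirrors max_concurrency in the state above
slots = threading.BoundedSemaphore(MAX_CONCURRENCY)
counter_lock = threading.Lock()
running = 0
peak = 0


def restart_service(host):
    """Hypothetical stand-in for the guarded state (the service restart)."""
    global running, peak
    with slots:  # acquire_lock on entry, release_lock on exit
        with counter_lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # pretend the restart takes a while
        with counter_lock:
            running -= 1


threads = [threading.Thread(target=restart_service, args=("host%d" % i,))
           for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrent restarts:", peak)
```

However many hosts run the state at once, at most four are inside the guarded section at any moment, while the rest of the highstate is unaffected.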
  • 53. Things on my “list” 53 Future Awesomeness • Transport • failover groups • even better HA (https://github.com/saltstack/salt/issues/25700 -- get involved in the conversation) • Concurrency • async ext_pillar • Partially concurrent state execution (prefetch, etc.)? • Coroutine-based: • Reactor • Engines • Beacons • Thorium
  • 54. ©2014 LinkedIn Corporation. All Rights Reserved.

Editor's Notes

  1. Transport != concurrency, although transport uses concurrency
  2. 10K foot view: Contexts have sockets. Sockets are message-passing things that are “like” sockets, but are not sockets (they are really a socket and a bunch of contexts). ZeroMQ attempts to dramatically simplify message passing (and succeeds). Go zmq!
  3. Notice, to switch message types we only had to change the socket type– simple!
  4. Basically, this means we can break down communications in salt into two categories
  5. Effectively two separate transport issues to solve, so two socket pairs– great done
  6. Initially zmq was awesome, but as with anything we ran into a variety of weird little issues. Message loss: retries, and various new versions of zmq to fix cases that dropped messages. Broadcasts: ran out of bandwidth for medium-sized job publishes, fixed by implementing zmq’s filtering (zmq saved the day!). But at this point, these are just bugs, nothing that’s a deal breaker.
  7. At this point, the problems it has had aren’t really zmq’s fault… so we are okay right?
  8. Memory leaks: connecting and disconnecting on TCP causes ~600 bytes to be leaked on the master! Still unfixed to this day! Client state: publishes have to wait for the timeout (even if the minion isn’t connected) AND auth storms! So at this point, we are running into a variety of issues which we are attempting to hack around, and which are either getting little response or are contrary to the design. Basically, at our scale the abstraction layer is costing us too much.
  9. At our scale (and with our availability/perf requirements) we need another transport option
  10. SaltStack had been working on a replacement, which I’m sure you have all heard of-- RAET
  11. NOTE: RAET in salt was being used for both transport (RAET) and concurrency (ioflo). All that being said, RAET (the transport) isn’t bad; it's just too specific (and not modular). Salt isn’t this specific about anything else, so why transport?
  12. New systems might require new transports (QUIC, serial ports, USB, over text message??, who knows!). There had already been some work to consolidate the transport into “channel” classes before, so might as well finish that, then make them pluggable. So, 2 types of channels: req and pub.
  13. Master: pre_fork, so that you can bind before forking, to split the FD across multiple processes (to work around python’s GIL limitations). process_manager, in case you need to make additional processes of your own (instead of just coroutines on the ioloop). post_fork is called in each process after fork; this sets up the handlers etc. Minion: send: send load. crypted_transfer_decode_dictentry: send `load` encrypted only to the master (i.e. not with the shared symmetric key).
  14. This means that as far as “Salt” is concerned, there is a thing (channel) I can pass something to which will get it to wherever I asked. And of course, the system couldn’t be considered modular unless there were at least two modules
  15. Since msgpack is an iteratively parsed serialization, we can simply write the serialized payload to the wire. Crypto: still using the AES that the zeromq stuff uses.
  16. People asked about performance, which TBH I didn’t really think about putting in this presentation– because I was more worried about …accuracy
  17. This is a simple benchmark of sending 1k {‘cmd’: ‘get_token’} as quickly as possible to a master Note: ZMQ drops a LARGE number of messages– this is due to internal queues in ZMQ filling up– so, even if tcp was slower (which it isn’t) we’d still want it 
  18. I am of course obligated to show some metrics. This is a simple benchmark of sending {‘cmd’: ‘get_token’} repeatedly from a master. Note: quick benchmark, mostly to show that it is roughly equivalent. In practice: * ZeroMQ ReqClient is apparently VERY CPU heavy (probably a bug). It uses ~5 client processes to get this number, whereas TCP uses just one.
  19. Same as previous benchmark, we just added ~1k additional bytes to the payload
  20. Especially important for large fast modern CPUs that have to talk to things that are slow/far-away. Concurrency in python is more fun— because of the GIL, but still helpful because stuff is slooooow
  21. Sorry, not a funny picture :/ but you’ve probably seen it before at some conference  Since stuff is so far, there is no reason to leave the CPU just waiting, we can do something while we wait. Salt attempts to accommodate this…
  22. Basically, doing only one thing at a time severely limits your performance and scalability. So let's go back to what our options are.
  23. Let's do some examples.
  24. Fairly clunky code, but it works. Linux pthreads: requires a decent amount of memory, and has some hard limits based on your OS. Walk through how this runs: creates a thread per request, waits for requests to finish, thread closes. Note: still subject to the GIL.
  25. Fairly clunky code, but it works. (Note: serialization (pickle)!!) Linux processes: requires a decent amount of memory, and has some hard limits based on your OS (pids). Walk through how this runs: creates a process per request, waits for requests to finish, process closes. Note: no GIL!
  26. Green threads: all the pre-emptive yields require some amount of monkey-patching, making it… difficult for a plugin-based system (like Salt). Callbacks: mess! Coroutines: yes. But we don’t even have to make this decision, python already did!
  27. Quick aside re ioflo: basically a naive implementation of coroutines to achieve the required concurrency for the flo-based model it has; serious scaling problems, limited usage, etc. Details can be messy, talk after :) What exactly is a coroutine?
  28. Let's break that down. Preemptive: implicit vs explicit yield. Basically, coroutines are explicitly yielded tasks. Let's talk a little about what coroutines are with some examples. Great examples on www.dabeaz.com/coroutines/ (I’ll try to explain it in a shorter way, but I highly recommend reading dabeaz’s page).
  29. To make it clearer, let's copy/paste an example of a naïve implementation in python using generators.
  30. So, something like this lets us “schedule” tasks, meaning we can interleave execution of these things, even if they aren't all blocking operations. What would be even better is if we could resume execution when whatever we are waiting on is completed. Note: the return (with a value) within a generator is new in python 3.x, so tornado (and trollius) use an exception of a specific type.
  31. Cleaner code, easy isolation, lighter concurrency (effectively just a stack)
  32. https://docs.python.org/3/library/asyncio-task.html#future
  33. So, something like this lets us “schedule” tasks, meaning we can interleave execution of these things, even if they aren't all blocking operations. What would be even better is if we could resume execution when whatever we are waiting on is completed.
  34. As of 2015.8– you get tornado!
  35. So, from the client– you say “send load with timeout” and we return a future that will fulfill that contract (either send or timeout). So from the client this is SUPER clean
  36. But, of course– we haven’t got this far without breaking anything ;)
  37. Like anything else concurrency isn’t free 
  38. The implementation here is a RequestContext (based on tornado's stack_context). This RequestContext will do all of the bookkeeping of which coroutine/thread is currently executing-- and will switch between the values for each one. With this I made the loader threadsafe (yay!) and it is easy to re-use if you need a concurrent copy-on-write structure
  39. What happens if you have automated highstate enforcement across your proxies??
  40. Limit concurrency of this particular part of your states– but not the rest
  41. Questions