Presented at Erlang Factory 2016, San Francisco, CA.
Erlang is widely used for building concurrent applications. However, when we push an Erlang-based application to handle millions of concurrent clients, Erlang scalability issues begin to show and some of Erlang's conventional programming paradigms no longer hold. We would like to share some of these issues and how we addressed them. In addition, we share some of our experience in profiling an Erlang application to identify bottlenecks.
We will take a deep look at some of the basic mechanisms of Erlang and show how they behave under high load and parallelism, which includes message delivery, process management and shared data structures such as maps and ETS tables. We will demonstrate their limitations and propose techniques to alleviate the issues.
We will also share profiling techniques for finding bottlenecks in Erlang applications at different levels, along with techniques for writing highly performant Erlang applications.
14. Machine Zone, Inc
Message Passing
• Limit on #messages delivered per second
• < 10M msgs/sec on MacBook Pro
• < 22M msgs/sec on a powerful 56-core server
• Independent of #processes
15. What if we want to achieve more?
16. Solution 1: Minimize Message Exchange
• Do everything in the same process and shard
[Diagram: three independent shards, each running its own receiver → processor → sender pipeline]
17. Solution 1: Speed
• 100x the #ETS updates
• 20x–30x the #message exchanges
• No decrease with increase in #processes
[Chart: iterations per second (0–100M) vs. number of processes (0–100)]
18. Solution 2: Message Batching
• Batch messages: 5x–30x more msgs/sec
[Chart: messages per second vs. batch size (1–1000)]
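The batching idea can be sketched in plain Erlang: a proxy process accumulates incoming messages and flushes them to the receiver as a single list, so the per-message delivery cost is paid once per batch. This is a minimal sketch; the module, function names, and the 100 ms flush timeout are illustrative, not the talk's actual code.

```erlang
-module(batcher).
-export([start/2, send/2]).

%% Start a batching proxy in front of Receiver; flush every BatchSize messages.
start(Receiver, BatchSize) ->
    spawn(fun() -> loop(Receiver, BatchSize, 0, []) end).

send(Batcher, Msg) ->
    Batcher ! {msg, Msg}.

loop(Receiver, Max, N, Acc) when N >= Max ->
    %% One Erlang message carries the whole batch, in arrival order.
    Receiver ! {batch, lists:reverse(Acc)},
    loop(Receiver, Max, 0, []);
loop(Receiver, Max, N, Acc) ->
    receive
        {msg, Msg} ->
            loop(Receiver, Max, N + 1, [Msg | Acc])
    after 100 ->
        %% Flush a partial batch so messages are not delayed indefinitely.
        case Acc of
            [] -> loop(Receiver, Max, N, Acc);
            _  -> Receiver ! {batch, lists:reverse(Acc)},
                  loop(Receiver, Max, 0, [])
        end
    end.
```

The batch size trades latency for throughput: larger batches amortize the send cost better, which matches the 5x–30x gains reported above.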
19. Solution 3: Domain-Specific Optimizations
• Deliver one message to multiple recipients:
1. Receive, process, store to shared memory using NIF
2. Read from the shared memory, send using push notification
• Total throughput of 600M msgs/sec to many recipients
[Diagram: sender → receiver/processor → shared memory → push notifications to recipients]
21. Lesson 1: Use Sharding
• Avoid having a single process
• Use a pool of non-communicating shards
• erlang:phash2(Key, NumberOfShards)
[Diagram: Shard 1 and Shard 2 each receive and send independently, with no communication between them]
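The sharding pattern can be written directly around erlang:phash2/2: messages for the same key always land on the same shard process, and the shards never talk to each other. A minimal sketch (module and message shapes are assumptions, not the talk's code):

```erlang
-module(shard_pool).
-export([start/1, dispatch/3]).

%% Spawn NumShards independent workers; a tuple gives O(1) shard lookup.
start(NumShards) ->
    list_to_tuple([spawn(fun worker/0) || _ <- lists:seq(1, NumShards)]).

%% Route Msg for Key to its shard; the same key maps to the same shard
%% every time, so no cross-shard coordination is needed.
dispatch(Shards, Key, Msg) ->
    Idx = erlang:phash2(Key, tuple_size(Shards)),  %% 0..NumShards-1
    element(Idx + 1, Shards) ! {Key, Msg}.

worker() ->
    receive
        {_Key, _Msg} ->
            %% Receive, process and send in the same process (Solution 1).
            worker()
    end.
```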
22. Lesson 2: Use the Process Dictionary
• Quicker insertions in the process dictionary compared to maps
[Chart: time taken (µs) vs. number of insertions (0–5M) for ets set, maps, and the process dictionary]
23. Lesson 2: Use the Process Dictionary (cont.)
• Quicker update operations in the process dictionary compared to maps (Erlang 18)
[Chart: time for 100k updates (ms) vs. number of elements (1–100000) for R18 maps and R18 process dictionary]
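The difference comes from the update model: a map update builds and returns a new map that must be threaded through the code, while the process dictionary is per-process mutable storage updated in place. A rough shape of the two counter-keeping styles being compared (illustrative only):

```erlang
%% Map-based: every update returns a new map the caller must carry along.
bump_map(Key, Counts) ->
    maps:put(Key, maps:get(Key, Counts, 0) + 1, Counts).

%% Process-dictionary-based: the update mutates this process's own state,
%% so nothing needs to be returned or re-bound.
bump_pd(Key) ->
    N = case get(Key) of undefined -> 0; V -> V end,
    put(Key, N + 1).
```

The usual caveat applies: the process dictionary makes state implicit and harder to test, so it is a deliberate trade of purity for speed on hot paths.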
24. Lesson 3: Use a Limited Number of Processes
• Don't create too many processes
• pid2proc lock for process resolution
• Scheduler has to manage the processes
• Fixed number of schedulers
• Don't spawn/stop processes frequently
• Process resolution again
34. Benchmark Environment
• Platform
• 2 CPU sockets
• CPU: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 14 hardware threads per socket
• 2 hyper-threaded threads per hardware thread
• Erlang Runtime
Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:56:56] [async-threads:10] [kernel-poll:false]
35. Benchmark Results
• As the number of processes increases, the number of counter updates per process decreases
[Chart: counter updates per process (0–7M) vs. number of processes (10–60) for ets counter, exometer counter, exometer fast counter, and folsom counter]
36. Limitations
• Exometer:
• counters – scale well but are slow
• fast counters – don't scale
• Lock contention in ets:update_counter()
38. Refresher on Cache Lines
Source: http://www.slideshare.net/yzt/modern-cpus-and-caches-a-starting-point-for-programmers/52
39. False Sharing (SMP)
• Occurs when two CPUs access different data structures on the same cache line
[Diagram: CPU0 accesses struct foo and CPU1 accesses struct bar, both sitting on the same cache line]
43. Benefits of Per-CPU Data
Source: http://lwn.net/Articles/256433/
[Chart: time taken for 500 million counter increment operations on an Intel Core 2 QX6700]
44. Benefits of Per-CPU Data: Reasons
• Less overhead from the cache coherency protocol
• Significantly higher benefits with QPI (QuickPath Interconnect)
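The per-CPU idea can be approximated in pure Erlang before reaching for a NIF: keep one ETS row per scheduler, so concurrent writers usually hit different rows (and different cache lines), and only the rare read pays the cost of summing. This sketch is an approximation of the technique, not mzmetrics' actual API:

```erlang
-module(sched_counter).
-export([init/0, incr/1, read/1]).

init() ->
    ets:new(counters, [named_table, public, set,
                       {write_concurrency, true}]).

%% Update the row belonging to the scheduler this process is running on,
%% spreading write contention across schedulers (per-CPU-style).
incr(Name) ->
    Sched = erlang:system_info(scheduler_id),
    ets:update_counter(counters, {Name, Sched}, 1, {{Name, Sched}, 0}).

%% Reads are rare: sum the per-scheduler rows into one value.
read(Name) ->
    N = erlang:system_info(schedulers),
    lists:sum([case ets:lookup(counters, {Name, S}) of
                   [{_, V}] -> V;
                   []       -> 0
               end || S <- lists:seq(1, N)]).
```

A process can migrate between schedulers mid-call, so the scheduler id is only a contention-spreading hint, not an exclusivity guarantee; a NIF with真 per-CPU storage, as described next, removes the remaining ETS locking.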
45. Transactional Memory
• Intel TSX (Transactional Synchronization Extensions)
• Helps with locks in small critical sections
• Not currently usable on x86 (disabled due to hardware bugs)
46. MZMETRICS
• Design based on per-CPU counters
• Counter operations:
• Create a new counter – requires a lock
• Read/update a counter – no lock required
• Delete a counter – requires a lock
• Implemented as a NIF
47. MZMETRICS Evaluation
• 10x faster than Exometer counters
[Chart: counter updates per process (0–20M) vs. number of processes (10–50) for exometer and mzmetrics]
49. Takeaways
• Pitfalls
• Message passing becomes expensive at scale
• A large number of processes needing pid resolution from a global table causes contention
• Existing libraries are not amenable to fast counting
• Proposed solutions
• Use sharding & limit message passing
• Limit the number of active processes
• Use MZMETRICS for fast counters