High Performance Erlang:
Pitfalls and Solutions
MZSPEED Team
Machine Zone, Inc.

Agenda
•  Erlang at Machine Zone
•  Issues and Roadblocks
•  Lessons Learned
•  Profiling Techniques
•  Case Study: Counters

Erlang at Machine Zone
•  High performance data processing system
•  How to make it fast?

Typical Approach: Pipelined tasks
Execute different logic in different processes
receiver -> processor -> sender
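
A minimal sketch of the pipelined layout, assuming each stage is a plain Erlang process that forwards messages to the next one. The module name `pipeline_sketch`, the message shapes, and the doubling step in the processor are hypothetical placeholders, not code from the talk.

```erlang
-module(pipeline_sketch).
-export([run/1]).

%% Pushes Items through receiver -> processor -> sender and returns
%% the results, in order, as collected by the calling process.
run(Items) ->
    Parent = self(),
    Ref = make_ref(),
    Sender = spawn_link(fun() -> sender(Parent, Ref) end),
    Processor = spawn_link(fun() -> processor(Sender) end),
    Receiver = spawn_link(fun() -> receiver(Processor) end),
    [Receiver ! {data, I} || I <- Items],
    Receiver ! stop,
    collect(Ref, []).

%% Stage 1: accepts raw input and forwards it unchanged.
receiver(Next) ->
    receive
        {data, X} -> Next ! {data, X}, receiver(Next);
        stop -> Next ! stop
    end.

%% Stage 2: hypothetical business logic (doubles each value).
processor(Next) ->
    receive
        {data, X} -> Next ! {data, X * 2}, processor(Next);
        stop -> Next ! stop
    end.

%% Stage 3: delivers results to the consumer (here, the caller).
sender(Parent, Ref) ->
    receive
        {data, X} -> Parent ! {Ref, {out, X}}, sender(Parent, Ref);
        stop -> Parent ! {Ref, done}
    end.

collect(Ref, Acc) ->
    receive
        {Ref, {out, X}} -> collect(Ref, [X | Acc]);
        {Ref, done} -> lists:reverse(Acc)
    end.
```

Every item crosses three process boundaries here, which is the message-exchange cost the following slides examine.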
How do we exchange data between processes?
How do we make it fast?

ETS

ETS: update_counter, 10 Keys: Expected vs Observed Performance
[Figure: total update operations (0 to 15 M) and number of processes updating (0 to 16) over a timeline; series: expected, processes updating, update operations]
•  Observed Performance is far worse than Expected

ETS: update_counter, Multiple Keys: Observed Performance
[Figure: total update operations (0 to 15 M) and number of processes updating (0 to 16) over a timeline; series: 1 key, 10 keys, 100 keys, 1000 keys]
•  Performance is degraded even with multiple keys

ETS has Locks
•  Fine-grained locks
•  64 locks per table by default
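
A minimal sketch of the kind of workload behind the ETS charts: several processes incrementing counters in one shared table with `ets:update_counter/3`. The slides do not show the benchmark code, so the module name `ets_counter_sketch`, the table options (a public `set` with `write_concurrency` enabled), and the key layout are assumptions.

```erlang
-module(ets_counter_sketch).
-export([run/3]).

%% Spawns NumProcs processes that each perform Updates increments spread
%% over NumKeys counters in one public ETS table; returns the grand total.
run(NumProcs, NumKeys, Updates) ->
    %% Assumption: write_concurrency is on, so updates to different keys
    %% may take different locks instead of a single table-wide lock.
    Tab = ets:new(counters, [set, public, {write_concurrency, true}]),
    ets:insert(Tab, [{K, 0} || K <- lists:seq(1, NumKeys)]),
    Parent = self(),
    Pids = [spawn_link(fun() ->
                           loop(Tab, NumKeys, Updates),
                           Parent ! {done, self()}
                       end) || _ <- lists:seq(1, NumProcs)],
    %% Wait for every worker before reading the counters back.
    [receive {done, P} -> ok end || P <- Pids],
    Total = lists:sum([V || {_, V} <- ets:tab2list(Tab)]),
    ets:delete(Tab),
    Total.

loop(_Tab, _NumKeys, 0) -> ok;
loop(Tab, NumKeys, N) ->
    Key = (N rem NumKeys) + 1,
    ets:update_counter(Tab, Key, 1),
    loop(Tab, NumKeys, N - 1).
```

Wrapping a call such as `ets_counter_sketch:run(16, 10, 1000000)` in `timer:tc/1` gives a rough feel for how contention grows with the number of writers; the absolute numbers will differ from the slides.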

Message Passing

Message Passing: M Processes to N Processes
[Figure: M processes (1, 10, 100) against N processes (1, 10, 100, 1000); the 10:100 combination is labeled]

Message Passing: Scheduling Locks
•  Add to runq with locks

Message Passing: Scheduler Communication Locks
•  Task stealing with locks

Message Passing
•  Limit on #messages delivered per second
•  < 10M msgs/sec on MacBook Pro
•  < 22M msgs/sec on a powerful 56-core server
•  Independent of #processes
What if we want to achieve more?

Solution 1: Minimize Message Exchange
Do everything in the same process and shard
[Diagram: three independent shards, each running receiver, processor, and sender together]

Solution 1 Speed
•  100x the rate of ETS updates
•  20x–30x the rate of message exchanges
•  No decrease as the number of processes increases
[Figure: iterations per second (0 to 100 M) and number of processes (0 to 100); series: iterations, processes]

Solution 2: Message Batching
Batch messages: 5x–30x more msgs/sec
[Figure: M processes (1, 10, 100) against N processes (1, 10, 100, 1000); the 10:100 combination is labeled]
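
A minimal sketch of message batching, assuming the producer already holds a list of items and the consumer can handle a whole list at once. The module name `batch_sketch`, the `{batch, List}` message shape, and the summing consumer are hypothetical; the slides do not say how batches were formed.

```erlang
-module(batch_sketch).
-export([send_batched/3, consumer/1]).

%% Sends Items to Pid in chunks of at most BatchSize, so the receiver
%% handles one message per chunk instead of one per item.
%% Returns the number of messages actually sent.
send_batched(Pid, Items, BatchSize) when BatchSize > 0 ->
    send_chunks(Pid, Items, BatchSize, 0).

send_chunks(_Pid, [], _BatchSize, Sent) -> Sent;
send_chunks(Pid, Items, BatchSize, Sent) ->
    {Batch, Rest} = take(BatchSize, Items, []),
    Pid ! {batch, Batch},
    send_chunks(Pid, Rest, BatchSize, Sent + 1).

%% Splits off up to N leading elements, preserving their order.
take(0, Rest, Acc) -> {lists:reverse(Acc), Rest};
take(_N, [], Acc) -> {lists:reverse(Acc), []};
take(N, [H | T], Acc) -> take(N - 1, T, [H | Acc]).

%% Hypothetical consumer: sums every item it receives and reports the
%% total to ReplyTo when told to stop.
consumer(ReplyTo) -> consumer(ReplyTo, 0).

consumer(ReplyTo, Sum) ->
    receive
        {batch, Batch} -> consumer(ReplyTo, Sum + lists:sum(Batch));
        stop -> ReplyTo ! {total, Sum}
    end.
```

The trade-off is latency: an item may wait until its batch fills, so a real system would probably also flush on a timer.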

Solution 3: Domain-Specific Optimizations
•  Deliver one message to multiple recipients:
    1.  Receive, process, and store to shared memory using a NIF
    2.  Read from the shared memory and send using push notification
[Diagram: sender, receiver, and processor connected through shared memory]
Total throughput of 600M msgs/sec to many recipients

High Performance Erlang: Lessons Learned

Lesson 1: Use Sharding
•  Avoid having a single process
•  Use a pool of non-communicating shards
•  erlang:phash2(Key, NumberOfShards)
[Diagram: Shard 1 and Shard 2 each receive and send on their own, with no communication between shards]
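
A minimal sketch of key-based sharding with `erlang:phash2/2`, assuming each shard is a process that owns a private map of counters. The module name `shard_sketch`, its functions, and the counter workload are hypothetical; only the `phash2` routing comes from the slide.

```erlang
-module(shard_sketch).
-export([start/1, shard_for/2, incr/2, read/2, stop/1]).

%% Starts N independent shard processes and returns them as a tuple
%% so a shard can be picked in constant time.
start(N) when N > 0 ->
    list_to_tuple([spawn_link(fun() -> loop(#{}) end)
                   || _ <- lists:seq(1, N)]).

%% erlang:phash2/2 returns a value in 0..Range-1; tuples are 1-indexed.
shard_for(Key, Shards) ->
    element(erlang:phash2(Key, tuple_size(Shards)) + 1, Shards).

incr(Key, Shards) ->
    shard_for(Key, Shards) ! {incr, Key},
    ok.

read(Key, Shards) ->
    Ref = make_ref(),
    shard_for(Key, Shards) ! {read, Key, self(), Ref},
    receive {Ref, Value} -> Value end.

stop(Shards) ->
    [Pid ! stop || Pid <- tuple_to_list(Shards)],
    ok.

%% Each shard owns its slice of the key space and never talks to
%% the other shards.
loop(State) ->
    receive
        {incr, Key} ->
            loop(maps:put(Key, maps:get(Key, State, 0) + 1, State));
        {read, Key, From, Ref} ->
            From ! {Ref, maps:get(Key, State, 0)},
            loop(State);
        stop ->
            ok
    end.
```

Because the same key always hashes to the same shard, all updates for one key are serialized by one process and need no locking.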

Lesson 2: Use Process Dictionary
[Figure: time taken (us, 0 to 12 M) against number of insertions (0 to 5 M); series: ets set, maps, process dictionary]
•  Quicker insertions in process dictionary compared to maps

Lesson 2: Use Process Dictionary
[Figure: time for 100k updates (ms, 0 to 25 k) against number of elements (1 to 100000); series: R18 maps, R18 process dictionary]
•  Quicker update operations in process dictionary compared to maps (Erlang 18)
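
A minimal sketch contrasting the two storage styles compared above, using the `put/2` and `get/1` BIFs for the process dictionary and `maps:put/3` for maps. The module name `pdict_sketch` and the key-counting workload are hypothetical, and the sketch shows usage only; it makes no claim about relative speed on current OTP releases.

```erlang
-module(pdict_sketch).
-export([incr_pdict/1, count_with_pdict/1, count_with_map/1]).

%% Increments a counter stored in the calling process's dictionary
%% and returns the new value.
incr_pdict(Key) ->
    New = case get(Key) of
              undefined -> 1;
              N -> N + 1
          end,
    put(Key, New),
    New.

%% Counts occurrences of each key using the process dictionary.
%% Runs in a fresh process so the caller's dictionary is untouched.
count_with_pdict(Keys) ->
    Parent = self(),
    Ref = make_ref(),
    spawn_link(fun() ->
                   lists:foreach(fun incr_pdict/1, Keys),
                   Parent ! {Ref, lists:sort(get())}
               end),
    receive {Ref, Counts} -> Counts end.

%% Same computation with an immutable map threaded through a fold.
count_with_map(Keys) ->
    Map = lists:foldl(
            fun(K, M) -> maps:put(K, maps:get(K, M, 0) + 1, M) end,
            #{}, Keys),
    lists:sort(maps:to_list(Map)).
```

The process dictionary is mutable, per-process state, so it pairs naturally with the sharded design: each shard keeps its own data and nothing is shared.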

Lesson 3: Use Limited #Processes
•  Don’t create too many processes
    •  pid2proc lock for process resolution
    •  Scheduler has to manage the processes
    •  Fixed number of schedulers
•  Don’t spawn/stop processes frequently
    •  Process resolution again

Profiling Techniques

pstack: Stack Sampling
•  pstack (gdb): sampling call stack over multiple threads
•  $ pstack 1234
•  $ gdb -p 1234
•  (gdb) thread apply all bt 25

EEP: Erlang Easy Profiling
•  https://github.com/virtan/eep
•  Based on dbg module, low overhead trace ports
•  Easy to use
•  Kcachegrind visualization
Basic Usage
•  Collect runtime data to local file
    •  > eep:start_file_tracing("abc"),
         timer:sleep(10000),
         eep:stop_tracing().
•  Convert to callgrind format
    •  > eep:convert_tracing("abc").
•  Visualize with kcachegrind
    •  $ kcachegrind callgrind.out.abc

EEP: Example Visualization
[Figure: example visualization]

oprofile: System Level Profiling
•  Initialization
    $ opcontrol --deinit
    $ echo 0 > /proc/sys/kernel/nmi_watchdog
•  Profiling
    $ operf --vmlinux /usr/lib/debug/vmlinux --callgraph ./rebar eunit apps=perf_test
•  Reporting and Visualization
    $ opreport -cgf | ./gprof2dot.py -f oprofile --show-samples | dot -Tpng -o output.png

oprofile: Example
[Figure: example oprofile output]

Case Study: Counters

Need for High Performance Counters
•  Thousands of global counters
•  Updated by several hundred processes every second

Libraries Used
•  Exometer
•  ETS
•  Folsom

Benchmark Environment
•  Platform
    •  2 CPU sockets
    •  CPU - Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
    •  14 cores per socket
    •  2 hyper-threads per core (56 logical CPUs in total)
•  Erlang Runtime
    Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:56:56] [async-threads:10] [kernel-poll:false]

Benchmark Results
[Figure: number of counter updates per process (0 to 7 M) against number of processes (10 to 60); series: ets counter, exometer counter, exometer fast counter, folsom counter]
•  As the number of processes increases, the number of counter updates per process decreases
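
A minimal sketch of how a "counter updates per process" figure could be measured, assuming each worker calls an update function in a tight loop until a shared deadline. The slides do not describe the benchmark harness, so the module name `bench_sketch`, the deadline approach, and the `UpdateFun` parameter are all assumptions.

```erlang
-module(bench_sketch).
-export([avg_updates/3]).

%% Runs NumProcs processes that each call UpdateFun repeatedly for about
%% DurationMs milliseconds; returns the average number of calls per process.
%% Note: the `millisecond` unit needs OTP 19.1 or later; OTP 18 spelled it
%% `milli_seconds`.
avg_updates(NumProcs, DurationMs, UpdateFun) when NumProcs > 0 ->
    Parent = self(),
    Deadline = erlang:monotonic_time(millisecond) + DurationMs,
    Pids = [spawn_link(fun() ->
                           Parent ! {self(), loop(UpdateFun, Deadline, 0)}
                       end) || _ <- lists:seq(1, NumProcs)],
    %% Collect one count per worker, matching on each worker's pid.
    Counts = [receive {P, N} -> N end || P <- Pids],
    lists:sum(Counts) / NumProcs.

%% Calls UpdateFun until the deadline passes and returns the call count.
loop(UpdateFun, Deadline, N) ->
    case erlang:monotonic_time(millisecond) >= Deadline of
        true -> N;
        false -> UpdateFun(), loop(UpdateFun, Deadline, N + 1)
    end.
```

For example, passing a fun that calls `ets:update_counter/3` on a shared table and repeating the run for 10, 20, 30 and more processes would produce one curve of the kind plotted above. Reading the clock on every iteration adds overhead, so results are only comparable between runs of this same harness.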

Limitations
•  Exometer:
•  counters – scales well but slow
•  fast counters – doesn’t scale
•  Lock contention in ets:update_counter()

Our Solution: MZMETRICS

Refresher on Cacheline
Source: http://www.slideshare.net/yzt/modern-cpus-and-caches-a-starting-point-for-programmers/52

False Sharing (SMP)
Occurs when two CPUs access different data structures on the same cache line
[Diagram: CPU0 and CPU1 above a single cache line that holds both struct foo and struct bar]

False Sharing (SMP)
[Diagram: CPU0 and CPU1, each with its own L2 cache, FETCH foo and bar from main memory; both L2 caches end up holding foo and bar]

False Sharing (SMP)
[Diagram: WRITE foo on one CPU triggers INVL bar toward the other CPU's L2 cache, followed by ACK]

False Sharing (SMP)
[Diagram: WRITE bar on one CPU triggers INVL foo toward the other CPU's L2 cache, followed by ACK]

Benefits of Per-CPU Data
Source: http://lwn.net/Articles/256433/
Time taken for 500 million counter increment operations on Intel Core 2 QX 6700

Benefits of Per-CPU Data: Reasons
•  Less overhead from the cache coherency protocol
•  Significantly higher benefits with QPI (QuickPath Interconnect)

Transactional Memory
•  Intel TSX (Transactional Synchronization Extensions)
•  Helps with locks in small critical sections
•  Not yet available in x86 (disabled due to hardware bugs)

MZMETRICS
•  Design based on per-CPU counters
•  Counter Operations:
    •  Create a new counter – requires lock
    •  Read/update counter – no lock required
    •  Delete counter – requires lock
•  Implemented using NIF
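
A rough pure-Erlang analogy of the per-CPU counter idea, assuming one slot per scheduler in an ETS table and a read that sums the slots. This is not the MZMETRICS implementation, which is a NIF whose internals the slides do not show; the module name `percpu_counter_sketch` and its functions are hypothetical, and ETS still takes its own locks here.

```erlang
-module(percpu_counter_sketch).
-export([new/1, incr/2, read/2, delete/1]).

%% Creates a table holding one slot per scheduler for each counter name.
new(Names) ->
    Tab = ets:new(percpu, [set, public, {write_concurrency, true}]),
    Slots = lists:seq(1, erlang:system_info(schedulers)),
    ets:insert(Tab, [{{Name, S}, 0} || Name <- Names, S <- Slots]),
    Tab.

%% Updates only the slot of the scheduler the caller is running on,
%% so writers on different schedulers touch different keys.
incr(Tab, Name) ->
    Slot = erlang:system_info(scheduler_id),
    ets:update_counter(Tab, {Name, Slot}, 1).

%% Sums all slots. The result is a snapshot, not an atomic total,
%% because writers may update slots while the sum is being taken.
read(Tab, Name) ->
    lists:sum([ets:lookup_element(Tab, {Name, S}, 2)
               || S <- lists:seq(1, erlang:system_info(schedulers))]).

delete(Tab) ->
    ets:delete(Tab).
```

The design choice mirrors the slide: writes are the hot path and stay local to one slot, while the rarer read pays the cost of visiting every slot.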

MZMETRICS Evaluation
[Figure: number of counter updates per process (0 to 20 M) against number of processes (10 to 50); series: exometer, mzmetrics]
•  10X faster than Exometer counters

MZMETRICS Evaluation: Results
•  1 million counter updates with 48 processes
    •  Exometer takes ~1 second
    •  MZMETRICS takes ~0.1 seconds

Takeaways
•  Pitfalls
    •  Message passing becomes expensive at scale
    •  A large number of processes needing pid resolution from a global table causes contention
    •  Existing libraries are not amenable to fast counting
•  Proposed Solutions
    •  Use sharding and limit message passing
    •  Limit the number of active processes
    •  Use MZMETRICS for fast counters

We are Hiring
http://www.machinezone.com/careers
