Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1zlqvAN.
Sponsored by Goldman Sachs. Java 8 has Streams, Scala has parallel collections, and GS Collections has ParallelIterables. Since we use parallelism to achieve better performance, it's interesting to ask: how well do they perform? We'll look at how these three APIs work with a critical eye toward performance. We'll also look at common performance pitfalls. Filmed at qconnewyork.com.
Craig Motlin is the technical lead for GS Collections, a full-featured open-source Collections library for Java, and is the author of the framework's parallel, lazy API. He has worked at Goldman Sachs for 9 years on several teams focusing on application development before moving to the JVM Architecture team to focus on framework development.
Parallel-lazy Performance: Java 8 vs Scala vs GS Collections
1. This presentation reflects information available to the Technology Division of Goldman Sachs only and not any other part of Goldman Sachs. It should not be relied upon
or considered investment advice. Goldman, Sachs & Co. (“GS”) does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and
recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact.
Parallel-lazy performance
Java 8 vs Scala vs GS Collections
Craig Motlin
June 2014
2. InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/java-streams-scala-parallel-collections
3. Presented at QCon New York
www.qconnewyork.com
4. Goals
• Compare Java Streams, Scala parallel
Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance
testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid
5. Goals
• Compare Java Streams, Scala parallel
Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance
testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid
Lots of claims and opinions
6. Goals
• Compare Java Streams, Scala parallel
Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance
testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid
Lots of evidence
8. Intro
• Solve the same problem in all three libraries
– Java (1.8.0_05)
– GS Collections (5.1.0)
– Scala (2.11.0)
• Count how many even numbers are in a list of numbers
• Then accomplish the same thing in parallel
– Data-level parallelism
– Batch the data
– Use all the cores
9. Performance Factors
Tests that isolate individual performance factors
• Count
• Filter, Transform, Transform, Filter, convert to List
• Aggregation
– Market value stats aggregated by product or category
10. Count: Serial
// Java 8
long evens = arrayList.stream()
    .filter(each -> each % 2 == 0).count();
// GS Collections
int evens =
    fastList.count(each -> each % 2 == 0);
// Scala
val evens = arrayBuffer.count(_ % 2 == 0)
11. Count: Serial Lazy
// Java 8
long evens = arrayList.stream()
    .filter(each -> each % 2 == 0).count();
// GS Collections
int evens = fastList.asLazy()
    .count(each -> each % 2 == 0);
// Scala
val evens = arrayBuffer.view
    .count(_ % 2 == 0)
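A minimal JDK-only sketch of what "lazy" means for the Java variant above: building the pipeline evaluates nothing, and the predicate only runs once the terminal count() executes. The class name LazyCountDemo and the invocation counter are illustrative assumptions, not part of the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class LazyCountDemo {
    // Number of times the predicate has been evaluated so far (illustrative only).
    static final AtomicInteger CALLS = new AtomicInteger();

    static List<Integer> numbers(int size) {
        List<Integer> list = new ArrayList<>(size);
        for (int i = 1; i <= size; i++) {
            list.add(i);
        }
        return list;
    }

    // Builds the lazy pipeline; no element is examined here.
    static Stream<Integer> evensPipeline(List<Integer> list) {
        return list.stream().filter(each -> {
            CALLS.incrementAndGet();
            return each % 2 == 0;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = evensPipeline(numbers(10));
        System.out.println("predicate calls before terminal op: " + CALLS.get());
        long evens = pipeline.count(); // terminal operation triggers the traversal
        System.out.println("evens: " + evens + ", predicate calls after: " + CALLS.get());
    }
}
```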
12. Count: Parallel Lazy
// Java 8
long evens = arrayList.parallelStream()
    .filter(each -> each % 2 == 0).count();
// GS Collections
int evens = fastList
    .asParallel(executorService, BATCH_SIZE)
    .count(each -> each % 2 == 0);
// Scala
val evens =
    arrayBuffer.par.count(_ % 2 == 0)
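For readers who want to run the Java 8 variant, here is a small self-contained sketch that counts evens with both stream() and parallelStream() and checks that they agree. It uses only the JDK; the class and method names are hypothetical, and it makes no performance claim.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ParallelCountDemo {
    // Builds the boxed list 1..size, similar in spirit to the ArrayList used in the talk.
    static List<Integer> numbers(int size) {
        return IntStream.rangeClosed(1, size).boxed().collect(Collectors.toList());
    }

    static long countEvensSerial(List<Integer> list) {
        return list.stream().filter(each -> each % 2 == 0).count();
    }

    static long countEvensParallel(List<Integer> list) {
        // Same pipeline, evaluated on the common fork-join pool.
        return list.parallelStream().filter(each -> each % 2 == 0).count();
    }

    public static void main(String[] args) {
        List<Integer> list = numbers(1_000_000);
        System.out.println("serial:   " + countEvensSerial(list));
        System.out.println("parallel: " + countEvensParallel(list));
    }
}
```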
15. Goals
• Compare Java Streams, Scala parallel
Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance
testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid
Time for some numbers!
18. Java Microbenchmark Harness
“JMH is a Java harness for building, running, and analysing
nano/micro/milli/macro benchmarks written in Java and other languages
targetting the JVM.”
• 5 forked JVMs per test
• 100 warmup iterations per JVM
• 50 measurement iterations per JVM
• 1 second of looping per iteration
http://openjdk.java.net/projects/code-tools/jmh/
20. Java Microbenchmark Harness
• @Setup includes megamorphic warmup
• More info on megamorphic in the appendix
• This is something that JMH does not handle for you!
21. Java Microbenchmark Harness
• Throughput: higher is better
• Enough warmup iterations so that standard deviation is low
Benchmark                       Mode   Samples  Mean     Mean error  Units
CountTest.parallel_eager_gsc    thrpt  250      629.961  8.305       ops/s
CountTest.parallel_lazy_gsc     thrpt  250      595.023  7.153       ops/s
CountTest.parallel_lazy_jdk     thrpt  250      415.382  7.766       ops/s
CountTest.parallel_lazy_scala   thrpt  250      331.938  2.141       ops/s
CountTest.serial_eager_gsc      thrpt  250      115.197  0.328       ops/s
CountTest.serial_eager_scala    thrpt  250      91.167   0.864       ops/s
CountTest.serial_lazy_gsc       thrpt  250      73.625   3.619       ops/s
CountTest.serial_lazy_jdk       thrpt  250      58.182   0.477       ops/s
CountTest.serial_lazy_scala     thrpt  250      84.200   1.033       ops/s
...
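One way to read the table is to divide each library's parallel-lazy throughput by its serial-lazy throughput. The sketch below does only that arithmetic on the mean values listed above (roughly 8x, 7x and 4x); it ignores the error column, and the class name is hypothetical.

```java
final class SpeedupFromTable {
    // Ratio of parallel to serial throughput; both arguments are in ops/s.
    static double speedup(double parallelOpsPerSec, double serialOpsPerSec) {
        return parallelOpsPerSec / serialOpsPerSec;
    }

    public static void main(String[] args) {
        // Mean throughput values copied from the JMH output above.
        System.out.printf("GS Collections: %.2fx%n", speedup(595.023, 73.625));
        System.out.printf("Java 8 Streams: %.2fx%n", speedup(415.382, 58.182));
        System.out.printf("Scala:          %.2fx%n", speedup(331.938, 84.200));
    }
}
```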
23. Java Microbenchmark Harness
• Performance tests are open sourced
• Read them and run them on your hardware
https://github.com/goldmansachs/gs-collections/
24. Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
25. Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Isolated by using array-backed lists: ArrayList, FastList, and ArrayBuffer.
Isolated because combination of intermediate results is simple addition.
Let’s look at reasons for the differences in count()
29. Count: Java 8 implementation
@GenerateMicroBenchmark
public void serial_lazy_jdk() {
    long evens = this.integersJDK
        .stream()
        .filter(each -> each % 2 == 0)
        .count();
    Assert.assertEquals(SIZE / 2, evens);
}
Is count() just incrementing a counter?
filter(Predicate).count() instead of count(Predicate)
30. Count: Java 8 implementation
// java.util.stream.ReferencePipeline
public final long count() {
    return mapToLong(e -> 1L).sum();
}
// java.util.stream.LongPipeline
public final long sum() {
    return reduce(0, Long::sum);
}
// java.lang.Long
/** @since 1.8 */
public static long sum(long a, long b) { return a + b; }
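To make the chain above concrete, this sketch spells out the same steps with public stream API calls and checks that all three forms return the same number. It mirrors the Java 8 source shown on the slide; other JDK releases may implement count() differently. Class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class CountAsReduceDemo {
    // The form used in the benchmark.
    static long viaCount(List<Integer> list) {
        return list.stream().filter(each -> each % 2 == 0).count();
    }

    // count() written out as on the slide: map every element to 1L, then sum.
    static long viaMapToLongSum(List<Integer> list) {
        return list.stream().filter(each -> each % 2 == 0).mapToLong(e -> 1L).sum();
    }

    // sum() written out as on the slide: a reduction with Long::sum.
    static long viaReduce(List<Integer> list) {
        return list.stream().filter(each -> each % 2 == 0)
                .mapToLong(e -> 1L)
                .reduce(0L, Long::sum);
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(viaCount(list) + " " + viaMapToLongSum(list) + " " + viaReduce(list));
    }
}
```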
38. Count: GS Collections
public class CountProcedure<T> implements Procedure<T>
{
    private final Predicate<? super T> predicate;
    private int count;
    ...
    public void value(T object) {
        if (this.predicate.accept(object)) {
            this.count++;
        }
    }
    public int getCount() { return this.count; }
}
Predicate from the test:
each -> each % 2 == 0
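The same idea can be tried without GS Collections on the classpath. The sketch below is a JDK-only stand-in, not the library's code: CountingConsumer is a hypothetical class that plays the role of CountProcedure, using java.util.function.Predicate and Consumer in place of the GS Collections interfaces.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

final class CountingConsumerDemo {
    // Hypothetical stand-in for CountProcedure: tests each element and bumps a plain int field.
    static final class CountingConsumer<T> implements Consumer<T> {
        private final Predicate<? super T> predicate;
        private int count;

        CountingConsumer(Predicate<? super T> predicate) {
            this.predicate = predicate;
        }

        @Override
        public void accept(T each) {
            if (this.predicate.test(each)) {
                this.count++;
            }
        }

        int getCount() {
            return this.count;
        }
    }

    // Eager count: one forEach pass that pushes every element into the consumer.
    static <T> int count(Iterable<T> iterable, Predicate<? super T> predicate) {
        CountingConsumer<T> procedure = new CountingConsumer<>(predicate);
        iterable.forEach(procedure);
        return procedure.getCount();
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(count(list, each -> each % 2 == 0));
    }
}
```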
41. Count: Scala implementation
TraversableOnce.scala
def count(p: A => Boolean): Int =
{
  var cnt = 0
  for (x <- this)
    if (p(x)) cnt += 1
  cnt
}
The for-comprehension becomes a call to foreach().
The lambda closes over cnt: it executes the predicate and increments cnt, just like CountProcedure.
44. Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Scala’s auto-boxing
Java’s pull lazy evaluation
57. Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Fork-join is general purpose but requires merge work
Specialized data structures meant for combining
59. Parallel: GSC
.asParallel(this.executorService, BATCH_SIZE)
• You must specify your own batch size
– 10,000 is fine
– size / (8 * #cores) is fine
• You must specify your own thread pool
– Can share, or not
– Can tailor for CPU-bound work:
  Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())
– Or for IO-bound work:
  Executors.newFixedThreadPool(maxDbConnections)
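The batching model described above can be sketched with plain JDK classes. This is not how GS Collections implements asParallel(); it is an assumed, simplified illustration in which the list is cut into fixed-size batches, each batch is counted on a caller-supplied ExecutorService, and the partial counts are added. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class BatchedCountDemo {
    // Counts matching elements by handing fixed-size batches to the given pool.
    static <T> int parallelCount(List<T> list, Predicate<? super T> predicate,
                                 ExecutorService executor, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Future<Integer>> futures = new ArrayList<>();
        for (int start = 0; start < list.size(); start += batchSize) {
            List<T> batch = list.subList(start, Math.min(start + batchSize, list.size()));
            Callable<Integer> task = () -> {
                int count = 0;
                for (T each : batch) {
                    if (predicate.test(each)) {
                        count++;
                    }
                }
                return count;
            };
            futures.add(executor.submit(task));
        }
        int total = 0;
        try {
            for (Future<Integer> future : futures) {
                total += future.get(); // combine step: simple addition
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    // Convenience wrapper: CPU-bound pool, batch size from the slide's size / (8 * #cores) rule.
    static int countEvens(int size, int threads) {
        List<Integer> numbers = new ArrayList<>(size);
        for (int i = 1; i <= size; i++) {
            numbers.add(i);
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            int batchSize = Math.max(1, size / (8 * threads));
            return parallelCount(numbers, each -> each % 2 == 0, executor, batchSize);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(countEvens(1_000_000, cores));
    }
}
```

Passing the pool in as a parameter reflects the point on the slide: the caller decides whether it is shared and how it is sized.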
60. Parallel: Scala
• One shared fork-join pool, configurable
• Batch sizes are dynamic and respond to work
stealing
• Minimum batch size:
1 + size / (8 * #cores)
61. Parallel: Java 8
• One shared fork-join pool, not configurable
• Batch sizes are dynamic and respond to work stealing
• Minimum batch size:
– max(1, size / (4 * (#cores - 1)))
– Default pool also has #cores - 1 threads, plus the main thread helps
– Can be changed with system property java.util.concurrent.ForkJoinPool.common.parallelism
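As a worked example of the two minimum-batch-size formulas quoted on these slides, the sketch below evaluates them for a given size and core count and prints the common pool's parallelism. The formulas are taken from the slides as stated; the guard for a single core and all names are assumptions added for illustration.

```java
import java.util.concurrent.ForkJoinPool;

final class MinBatchSizeDemo {
    // Scala parallel collections, per the slide: 1 + size / (8 * #cores)
    static int scalaMinBatch(int size, int cores) {
        return 1 + size / (8 * cores);
    }

    // Java 8 streams, per the slide: max(1, size / (4 * (#cores - 1)))
    static int javaMinBatch(int size, int cores) {
        // Assumption: treat a single core as one worker to avoid dividing by zero.
        int workers = Math.max(1, cores - 1);
        return Math.max(1, size / (4 * workers));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int size = 1_000_000;
        System.out.println("cores: " + cores);
        System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
        System.out.println("Scala minimum batch: " + scalaMinBatch(size, cores));
        System.out.println("Java minimum batch:  " + javaMinBatch(size, cores));
    }
}
```

To change the common pool size, the property named on the slide can be passed at JVM startup, for example -Djava.util.concurrent.ForkJoinPool.common.parallelism=4.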
68. Aggregate by Category GSC
MapIterable<String, MarketValueStatistics> categoryDoubleMap =
this.gscPositions.asParallel(this.executorService, BATCH_SIZE)
.aggregateInPlaceBy(
Position::getCategory,
MarketValueStatistics::new,
MarketValueStatistics::acceptThis);
What if we group by
Account instead?
69. Aggregate by Account GSC
MapIterable<Account, MarketValueStatistics> accountDoubleMap =
this.gscPositions.asParallel(this.executorService, BATCH_SIZE)
.aggregateInPlaceBy(
Position::getAccount,
MarketValueStatistics::new,
MarketValueStatistics::acceptThis);
70. Collapse factor
MapIterable<String, MarketValueStatistics> categoryDoubleMap
There are 26 categories, so the map has 26 keys
MapIterable<Account, MarketValueStatistics> accountDoubleMap
There are 100k accounts, so the map has 100k keys
72. Collapse factor
• Aggregate: Java Streams
– Uses fork/join
– Each forked task creates a map
– Each join step merges two maps
– The joined map is roughly the same size
– Merge is costly when there are many keys
• Aggregate: GS Collections
– Uses a single ConcurrentMap for the results
– Each batched task writes into the map simultaneously with the atomic operation ConcurrentHashMapUnsafe.updateValueWith()
– Contention is costly when there are few keys
See Mohammad Rezaei’s presentation from QCon 2012 called “Fine Grained Coordinated Parallelism in a Real World Application.”
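The two strategies contrasted above can be approximated with JDK collectors alone. In this sketch, Collectors.groupingBy stands in for the merge-per-join approach and Collectors.groupingByConcurrent stands in for the single shared concurrent map; it is an analogy, not the GS Collections implementation. Position, its fields, and the use of DoubleSummaryStatistics in place of MarketValueStatistics are all assumptions.

```java
import java.util.ArrayList;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

final class CollapseFactorDemo {
    // Hypothetical, minimal stand-in for the Position type used in the talk.
    static final class Position {
        private final String key;
        private final double marketValue;

        Position(String key, double marketValue) {
            this.key = key;
            this.marketValue = marketValue;
        }

        String getKey() { return this.key; }
        double getMarketValue() { return this.marketValue; }
    }

    // distinctKeys controls the collapse factor: 26 mimics categories, 100_000 mimics accounts.
    static List<Position> positions(int count, int distinctKeys) {
        List<Position> result = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            result.add(new Position("key" + (i % distinctKeys), 1.0));
        }
        return result;
    }

    // Merge strategy: each task fills its own map, and maps are merged when tasks join.
    static Map<String, DoubleSummaryStatistics> aggregateByMerging(List<Position> positions) {
        return positions.parallelStream().collect(
                Collectors.groupingBy(Position::getKey,
                        Collectors.summarizingDouble(Position::getMarketValue)));
    }

    // Shared-map strategy: every task writes into one ConcurrentMap.
    static ConcurrentMap<String, DoubleSummaryStatistics> aggregateIntoSharedMap(List<Position> positions) {
        return positions.parallelStream().collect(
                Collectors.groupingByConcurrent(Position::getKey,
                        Collectors.summarizingDouble(Position::getMarketValue)));
    }

    public static void main(String[] args) {
        List<Position> positions = positions(1_000_000, 26);
        System.out.println("merged keys: " + aggregateByMerging(positions).size());
        System.out.println("shared keys: " + aggregateIntoSharedMap(positions).size());
    }
}
```

Varying distinctKeys between a handful and the full element count is a simple way to explore both ends of the trade-off on your own hardware.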
73. Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Test groupBy, aggregateBy
74. Goals
• Compare Java Streams, Scala parallel
Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance
testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid
81. Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Isolated by using array-backed lists: ArrayList, FastList, and ArrayBuffer.
What if we use Java’s HashSet, Scala’s HashSet, and GS Collections’ UnifiedSet?
82. [Chart: Parallel Count, ops/s (higher is better). Groups: Serial Lazy, Parallel Lazy. Series: Java 8, GS Collections, Scala. Lists: FastList | ArrayList | ArrayBuffer. Measured on an 8 core Linux VM, Intel Xeon E5-2697 v2 (8x).]
89. Count: SAM method calls
• Let’s take a closer look at both implementations of count()
• Let’s assume that @FunctionalInterface method calls are costly and count them as we go
• We’ll revisit this assumption
90. Count: GS Collections
java.lang.Thread.State: RUNNABLE
at com.gs.collections.impl.block.procedure.CountProcedure.value(CountProcedure.java:47)
at com.gs.collections.impl.list.mutable.FastList.forEach(FastList.java:623)
at com.gs.collections.impl.utility.Iterate.forEach(Iterate.java:114)
at com.gs.collections.impl.lazy.LazyIterableAdapter.forEach(LazyIterableAdapter.java:49)
at com.gs.collections.impl.lazy.AbstractLazyIterable.count(AbstractLazyIterable.java:461)
at com.gs.collections.impl.jmh.CountTest.serial_lazy_gsc(CountTest.java:302)
• Execution of the lazy evaluation
• Executed once per element
• We’ll look for @FunctionalInterface method calls here
92. Count: Java 8
java.lang.Thread.State: RUNNABLE
at java.lang.Long.sum(Long.java:1587)
at java.util.stream.LongPipeline$$Lambda$3.887750041.applyAsLong(Unknown Source:-1)
at java.util.stream.ReduceOps$8ReducingSink.accept(ReduceOps.java:394)
at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1359)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
at com.gs.collections.impl.jmh.CountTest.serial_lazy_jdk(CountTest.java:278)
• Execution of the pipeline
• Executed once per element
• We’ll look for @FunctionalInterface method calls here
95. @FunctionalInterface method calls
• Why do we care about @FunctionalInterface method calls?
• The JVM’s JIT compiler inlines short method bodies like our Predicates
• The exact nature of the inlining has a dramatic impact on performance
96. @FunctionalInterface method calls
• JMH forks a new JVM for each test
• During both stages of JIT compilation, this.predicate is our test Predicate
• The JVM will perform monomorphic inlining
public void value(T object)
{
    if (this.predicate.accept(object))
    {
        this.count++;
    }
}
Predicate from the test:
each -> each % 2 == 0
97. @FunctionalInterface method calls
The dispatch algorithm in pseudo code
if (this.predicate instanceof lambda$serial_lazy_gsc$1) {
    if (object % 2 == 0) {
        this.count++;
    }
} else {
    [recompile]
    if (this.predicate.accept(object)) {
        this.count++;
    }
}
98. @FunctionalInterface method calls
• The next recompilation will result in bimorphic inlining
• The recompilation after that will result in megamorphic method dispatch
• Classic table lookup and jump
• In other words, no inlining
• Dramatic performance penalty for fast methods like count()
99. Megamorphic method dispatch
How do we trigger megamorphic deoptimization?
@Setup(Level.Trial)
public void setUp_megamorphic()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
    long odds = this.integersJDK.stream().filter(each -> each % 2 == 1).count();
    Assert.assertEquals(SIZE / 2, odds);
    long evens2 = this.integersJDK.stream().filter(each -> (each & 1) == 0).count();
    Assert.assertEquals(SIZE / 2, evens2);
}
This is something that JMH does not handle for you!
101. Megamorphic method dispatch
• Why force megamorphic deoptimization?
• Some implementations will have extra virtual method calls (@FunctionalInterface method calls)
• Microbenchmarks aren’t realistic, but which is more realistic (less unrealistic?)
• You will trigger this deoptimization in normal production code, as soon as there is more than one call to this API anywhere in the executed code
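The last point can be illustrated with a shared helper: once several callers pass different lambdas to the same method, the single predicate.test(...) call inside it sees several receiver classes. The sketch below only demonstrates that situation and checks the results; it measures nothing, and whether a given JVM actually treats the call site as megamorphic is up to its JIT compiler. Names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class SharedCallSiteDemo {
    // One shared call site: every caller's predicate flows through predicate.test(...) here.
    static <T> int count(List<T> list, Predicate<? super T> predicate) {
        int count = 0;
        for (T each : list) {
            if (predicate.test(each)) {
                count++;
            }
        }
        return count;
    }

    // Three different lambdas, mirroring the three predicates in the @Setup method above.
    static int countEvens(List<Integer> list) {
        return count(list, each -> each % 2 == 0);
    }

    static int countOdds(List<Integer> list) {
        return count(list, each -> each % 2 == 1);
    }

    static int countEvensWithMask(List<Integer> list) {
        return count(list, each -> (each & 1) == 0);
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(countEvens(list) + " " + countOdds(list) + " " + countEvensWithMask(list));
    }
}
```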