Parallel-lazy Performance: Java 8 vs Scala vs GS Collections

This presentation reflects information available to the Technology Division of Goldman Sachs only and not any other part of Goldman Sachs. It should not be relied upon
or considered investment advice. Goldman, Sachs & Co. (“GS”) does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and
recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact.
Parallel-lazy performance
Java 8 vs Scala vs GS Collections
Craig Motlin
June 2014

Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/java-streams-scala-parallel-collections

Presented at QCon New York
www.qconnewyork.com

Goals
• Compare Java Streams, Scala parallel
Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance
testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid

Goals
• Lots of claims and opinions
• Lots of evidence

Intro
• Solve the same problem in all three libraries
– Java (1.8.0_05)
– GS Collections (5.1.0)
– Scala (2.11.0)
• Count how many even numbers are in a list of numbers
• Then accomplish the same thing in parallel
– Data-level parallelism
– Batch the data
– Use all the cores

Performance Factors
Tests that isolate individual performance factors
• Count
• Filter, Transform, Transform, Filter, convert to List
• Aggregation
– Market value stats aggregated by product or category

Count: Serial
long evens = arrayList.stream()
.filter(each -> each % 2 == 0).count();
int evens =
fastList.count(each -> each % 2 == 0);
val evens = arrayBuffer.count(_ % 2 == 0)

Count: Serial Lazy
long evens = arrayList.stream()
.filter(each -> each % 2 == 0).count();
int evens = fastList.asLazy()
.count(each -> each % 2 == 0);
val evens = arrayBuffer.view
.count(_ % 2 == 0)

Count: Parallel Lazy
long evens = arrayList.parallelStream()
.filter(each -> each % 2 == 0).count();
int evens = fastList
.asParallel(executorService, BATCH_SIZE)
.count(each -> each % 2 == 0);
val evens =
arrayBuffer.par.count(_ % 2 == 0)
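
For readers who want to try the Java variant on their own machine, here is a minimal self-contained sketch of the serial and parallel stream counts. The class name CountEvensSketch and its helper methods are hypothetical, chosen for illustration; only the stream calls themselves come from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper class, not part of the talk's benchmark suite.
final class CountEvensSketch {
    // Builds the boxed list 1..size, similar in spirit to the benchmark input.
    static List<Integer> integers(int size) {
        return IntStream.rangeClosed(1, size)
                .boxed()
                .collect(Collectors.toCollection(ArrayList::new));
    }

    // Serial lazy: filter and count on one thread.
    static long countEvensSerial(List<Integer> list) {
        return list.stream().filter(each -> each % 2 == 0).count();
    }

    // Parallel lazy: same pipeline, run on the common fork-join pool.
    static long countEvensParallel(List<Integer> list) {
        return list.parallelStream().filter(each -> each % 2 == 0).count();
    }
}
```

Both methods should return the same count; only the timing differs, and that is what the benchmarks later in the deck measure.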

Parallel Lazy
Input: 1 2 3 4 5 6 7 8 … 1M
Batch: 1-10k | 10k-20k | 20k-30k | 30k-40k | … | 990k-1M
Filter and Count: 5k | 5k | 5k | 5k | … | 5k
Reduce: 500k
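
The batch, filter-and-count, reduce flow in the Parallel Lazy diagram can be sketched by hand with a plain ExecutorService. This is an illustration of the idea only, not how GS Collections or Java Streams implement it; the class BatchedCountSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical hand-rolled version of the diagram, for illustration only.
final class BatchedCountSketch {
    static <T> int parallelCount(List<T> list, Predicate<? super T> predicate,
                                 ExecutorService executor, int batchSize) {
        List<Future<Integer>> partials = new ArrayList<>();
        // Batch: split the index range into fixed-size slices.
        for (int start = 0; start < list.size(); start += batchSize) {
            final int from = start;
            final int to = Math.min(start + batchSize, list.size());
            // Filter and count: each task counts matches in its own slice,
            // without materializing an intermediate filtered list.
            Callable<Integer> task = () -> {
                int count = 0;
                for (int i = from; i < to; i++) {
                    if (predicate.test(list.get(i))) {
                        count++;
                    }
                }
                return count;
            };
            partials.add(executor.submit(task));
        }
        // Reduce: add up the per-batch counts.
        int total = 0;
        try {
            for (Future<Integer> partial : partials) {
                total += partial.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    // Counts the even numbers in 1..size using batches of 10,000.
    static int countEvens(int size) {
        List<Integer> integers = new ArrayList<>(size);
        for (int i = 1; i <= size; i++) {
            integers.add(i);
        }
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            return parallelCount(integers, each -> each % 2 == 0, executor, 10_000);
        } finally {
            executor.shutdown();
        }
    }
}
```

The eager variant on the next slide differs in that each batch first builds a filtered list and then counts it, which costs extra allocation.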

Parallel Eager
Input: 1 2 3 4 5 6 7 8 … 1M
Batch: 1-10k | 10k-20k | 20k-30k | 30k-40k | … | 990k-1M
Filter: 2, 4, 6, 8 … 10k | 10k-20k (evens) | 20k-30k (evens) | 30k-40k (evens) | … | 990k-1M (evens)
Count: 5k | 5k | 5k | 5k | … | 5k
Reduce: 500k

Time for some numbers!

[Chart: Serial Count, ops/s (higher is better). Series: Serial Lazy. Groups: Java 8, GS Collections, Scala.]

[Chart: Parallel Count, ops/s (higher is better). Series: Serial Lazy, Parallel Lazy.]
Measured on an 8 core Linux VM
Intel Xeon E5-2697 v2

Java Microbenchmark Harness
“JMH is a Java harness for building, running, and analysing
nano/micro/milli/macro benchmarks written in Java and other languages
targetting the JVM.”
• 5 forked JVMs per test
• 100 warmup iterations per JVM
• 50 measurement iterations per JVM
• 1 second of looping per iteration
http://openjdk.java.net/projects/code-tools/jmh/

@GenerateMicroBenchmark
public void parallel_lazy_jdk() {
long evens = this.integersJDK
.parallelStream()
.filter(each -> each % 2 == 0)
.count();
Assert.assertEquals(SIZE / 2, evens);
}

• @Setup includes megamorphic warmup
• More info on megamorphic in the appendix
• This is something that JMH does not handle
for you!

• Throughput: higher is better
• Enough warmup iterations so that standard deviation is low
Benchmark                        Mode   Samples  Mean     Mean error  Units
CountTest.parallel_eager_gsc     thrpt  250      629.961  8.305       ops/s
CountTest.parallel_lazy_gsc      thrpt  250      595.023  7.153       ops/s
CountTest.parallel_lazy_jdk      thrpt  250      415.382  7.766       ops/s
CountTest.parallel_lazy_scala    thrpt  250      331.938  2.141       ops/s
CountTest.serial_eager_gsc       thrpt  250      115.197  0.328       ops/s
CountTest.serial_eager_scala     thrpt  250       91.167  0.864       ops/s
CountTest.serial_lazy_gsc        thrpt  250       73.625  3.619       ops/s
CountTest.serial_lazy_jdk        thrpt  250       58.182  0.477       ops/s
CountTest.serial_lazy_scala      thrpt  250       84.200  1.033       ops/s
...

• Performance tests are open sourced
• Read them and run them on your hardware
https://github.com/goldmansachs/gs-collections/

Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns

Performance Factors
• Underlying container implementation: isolated by using array-backed lists (ArrayList, FastList, and ArrayBuffer)
• Combine strategy: isolated because combination of intermediate results is simple addition
• Collapse factor
Let’s look at reasons for the differences in count()

Count: Java 8 implementation
public void serial_lazy_jdk() {
    long evens = this.integersJDK
        .stream()
        .filter(each -> each % 2 == 0)
        .count();
    Assert.assertEquals(SIZE / 2, evens);
}

The Java 8 test calls filter(Predicate).count() instead of count(Predicate).
Is count() just incrementing a counter?

public final long count() {
return mapToLong(e -> 1L).sum();
}
public final long sum() {
return reduce(0, Long::sum);
}
/** @since 1.8 */
public static long sum(long a, long b) { return a + b; }
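
To see the expansion at work, the sketch below writes the same count twice: once with count() and once with the mapToLong/reduce form that the Java 8 source above delegates to. The class name StreamCountSketch is hypothetical; later JDK releases may implement count() differently, so treat this as a description of the 1.8.0_05 behavior quoted in the slides.

```java
import java.util.List;

// Hypothetical illustration of what count() expands to in the quoted JDK 8 source.
final class StreamCountSketch {
    // The form used in the benchmark.
    static long countViaCount(List<Integer> list) {
        return list.stream()
                .filter(each -> each % 2 == 0)
                .count();
    }

    // The same computation with count() written out by hand:
    // map every surviving element to 1L, then sum with a reduce.
    static long countViaReduce(List<Integer> list) {
        return list.stream()
                .filter(each -> each % 2 == 0)
                .mapToLong(e -> 1L)
                .reduce(0, Long::sum);
    }
}
```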

this.integersJDK
    .stream()
    .filter(each -> each % 2 == 0)
    .mapToLong(e -> 1L)
    .reduce(0, Long::sum);
this.integersGSC
    .asLazy()
    .count(each -> each % 2 == 0);
The Java 8 version seems like extra work

Count: GS Collections
public void serial_lazy_gsc() {
    int evens = this.integersGSC
        .asLazy()
        .count(each -> each % 2 == 0);
}

AbstractLazyIterable.java
public int count(Predicate<? super T> predicate)
{
CountProcedure<T> procedure =
new CountProcedure<T>(predicate);
this.forEach(procedure);
return procedure.getCount();
}

FastList.java
public void forEach(Procedure<? super T> procedure)
{
for (int i = 0; i < this.size; i++)
{
procedure.value(this.items[i]);
}
}

public class CountProcedure<T> implements Procedure<T>
{
private final Predicate<? super T> predicate;
private int count;
...
public void value(T object) {
if (this.predicate.accept(object)) {
this.count++;
}
}
public int getCount() { return this.count; }
}

CountProcedure.value() calls this.predicate.accept(object) once per element.
Predicate from the test:
each -> each % 2 == 0

Count: Scala implementation
TraversableOnce.scala
def count(p: A => Boolean): Int =
{
var cnt = 0
for (x <- this)
if (p(x)) cnt += 1
cnt
}

The for-comprehension becomes a call to foreach().
The lambda closes over cnt. It executes the predicate and increments cnt, just like CountProcedure.
public final java.lang.Object apply(java.lang.Object);
  0: aload_0
  1: aload_1
  2: invokestatic #32   // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
  5: invokevirtual #34  // Method apply:(I)Z
  8: invokestatic #38   // Method scala/runtime/BoxesRunTime.boxToBoolean:(Z)Ljava/lang/Boolean;
  11: areturn
public boolean apply$mcZI$sp(int);
  0: iload_1
  1: iconst_2
  2: irem
  3: iconst_0
  4: if_icmpne 11
  7: iconst_1
  8: goto 12
  11: iconst_0
  12: ireturn
public final boolean apply(int);
  0: aload_0
  1: iload_1
  2: invokevirtual #21  // Method apply$mcZI$sp:(I)Z
  5: ireturn

Integer → int: Integer.intValue()
Lambda: _ % 2 == 0 (bytecode: irem)
boolean → Boolean: Boolean.valueOf(boolean)

Performance Factors
• Collapse factor
• Scala’s auto-boxing
• Java’s pull lazy evaluation

Parallel / Lazy / JDK
List<Integer> list = this.integersJDK
.parallelStream()
.filter(each -> each % 10_000 != 0)
.map(String::valueOf)
.map(Integer::valueOf)
.filter(each -> (each + 1) % 10_000 != 0)
.collect(Collectors.toList());
Verify.assertSize(999_800, list);

Parallel / Lazy / GSC
MutableList<Integer> list = this.integersGSC
.asParallel(this.executorService, BATCH_SIZE)
.select(each -> each % 10_000 != 0)
.collect(String::valueOf)
.collect(Integer::valueOf)
.select(each -> (each + 1) % 10_000 != 0)
.toList();

Parallel / Lazy / Scala
val list = this.integers
.par
.filter(each => each % 10000 != 0)
.map(String.valueOf)
.map(Integer.valueOf)
.filter(each => (each + 1) % 10000 != 0)
.toBuffer
Assert.assertEquals(999800, list.size)

[Chart: Stacked computation, ops/s (higher is better)]

Parallel / Lazy / JDK
.collect(Collectors.toList());
Collectors.toList() is built from a supplier, an accumulator, and a combiner:
ArrayList::new
List::add
(left, right) -> {
    left.addAll(right);
    return left;
}
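
The three functions above can be assembled into a collector by hand with Collector.of, which makes the merge step explicit. This is a sketch assuming the stacked computation from the earlier slide; the class ToListCollectorSketch and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.IntStream;

// Hypothetical re-creation of a toList-style collector, for illustration only.
final class ToListCollectorSketch {
    static <T> Collector<T, ?, List<T>> toListSketch() {
        return Collector.<T, List<T>>of(
                ArrayList::new,              // supplier: one list per forked task
                List::add,                   // accumulator: append within a task
                (left, right) -> {           // combiner: copy right into left at each join
                    left.addAll(right);
                    return left;
                });
    }

    // Runs the stacked filter/map/map/filter pipeline over 1..size in parallel.
    static List<Integer> stacked(int size) {
        return IntStream.rangeClosed(1, size)
                .boxed()
                .parallel()
                .filter(each -> each % 10_000 != 0)
                .map(String::valueOf)
                .map(Integer::valueOf)
                .filter(each -> (each + 1) % 10_000 != 0)
                .collect(toListSketch());
    }
}
```

Every join copies the right-hand list into the left-hand one, which is the merge work the next slides quantify.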

Fork-Join Merge
• Intermediate results are merged in a tree
• Merging is O(n log n) work and garbage

Fork-Join Merge
• Amount of work done by last thread is O(n)

.select(each -> (each + 1) % 10_000 != 0)
.toList();
ParallelIterable.toList() returns a CompositeFastList, a List with an O(1) implementation of addAll()

public final class CompositeFastList<E> {
private final FastList<FastList<E>> lists =
FastList.newList();
public boolean addAll(Collection<? extends E> collection) {
FastList<E> collectionToAdd = collection instanceof FastList
? (FastList<E>) collection
: new FastList<E>(collection);
this.lists.add(collectionToAdd);
return true;
}
...
}

CompositeFastList Merge
• Merging is O(1) work per batch

Performance Factors
• Collapse factor
• Fork-join is general purpose but requires merge work
• Specialized data structures meant for combining
Parallel: GSC
• You must specify your own batch size
– 10,000 is fine
– size / (8 * #cores) is fine
• You must specify your own thread pool
– Can share, or not
– Can tailor for CPU-bound
Executors.newFixedThreadPool(
Runtime.getRuntime().availableProcessors())
– Or IO-Bound
Executors.newFixedThreadPool(maxDbConnections)

Parallel: Scala
• One shared fork-join pool, configurable
• Batch sizes are dynamic and respond to work
stealing
• Minimum batch size:
1 + size / (8 * #cores)

Parallel: Java 8
• One shared fork-join pool, not configurable
• Batch sizes are dynamic and respond to work stealing
• Minimum batch size:
– max(1, size / (4 * (#cores - 1)))
– Default pool also has #cores – 1 threads, plus main thread
helps
– Can be changed with system property
java.util.concurrent.ForkJoinPool.common.parallelism

[Chart: Aggregate by Categories. Series: Java 8, GS Collections.]

[Chart: Aggregate by Accounts. Series: Java 8, GS Collections.]

Aggregate by Category Streams
Map<String, DoubleSummaryStatistics> categoryDoubleMap =
this.jdkPositions.parallelStream().collect(
Collectors.groupingBy(
Position::getCategory,
Collectors.summarizingDouble(Position::getMarketValue)));
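
A self-contained version of the Streams aggregation might look like the following. The Position class here is a hypothetical stand-in with only the two accessors the slide uses; the real benchmark's Position type is not shown in the deck.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in types, for illustration only.
final class AggregateByCategorySketch {
    static final class Position {
        private final String category;
        private final double marketValue;

        Position(String category, double marketValue) {
            this.category = category;
            this.marketValue = marketValue;
        }

        String getCategory() { return this.category; }
        double getMarketValue() { return this.marketValue; }
    }

    // Groups positions by category and summarizes market value per group, in parallel.
    static Map<String, DoubleSummaryStatistics> aggregate(List<Position> positions) {
        return positions.parallelStream().collect(
                Collectors.groupingBy(
                        Position::getCategory,
                        Collectors.summarizingDouble(Position::getMarketValue)));
    }

    // Small fixed data set: two positions in category A, one in category B.
    static List<Position> samplePositions() {
        return Arrays.asList(
                new Position("A", 10.0),
                new Position("A", 20.0),
                new Position("B", 5.0));
    }
}
```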

Aggregate by Category GSC
MapIterable<String, MarketValueStatistics> categoryDoubleMap =
this.gscPositions.asParallel(this.executorService, BATCH_SIZE)
.aggregateInPlaceBy(
Position::getCategory,
MarketValueStatistics::new,
MarketValueStatistics::acceptThis);

What if we group by Account instead?

Aggregate by Account GSC
MapIterable<Account, MarketValueStatistics> accountDoubleMap =
this.gscPositions.asParallel(this.executorService, BATCH_SIZE)
.aggregateInPlaceBy(
Position::getAccount,
MarketValueStatistics::new,
MarketValueStatistics::acceptThis);

Collapse factor
MapIterable<String, MarketValueStatistics> categoryDoubleMap
There are 26 categories, so the map has 26 keys
MapIterable<Account, MarketValueStatistics> accountDoubleMap
There are 100k accounts, so the map has 100k keys

Collapse factor
• Aggregate: Java Streams
– Uses fork/join
– Each forked task creates a map
– Each join step merges two maps
– The joined map is roughly the same size
– Merge is costly when there are many keys
• Aggregate: GS Collections
– Uses a single ConcurrentMap for the results
– Each batched task writes into the map simultaneously with atomic operation
ConcurrentHashMapUnsafe.updateValueWith()
– Contention is costly when there are few keys
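
The two strategies can be contrasted with JDK classes alone. In the sketch below, a parallel groupingBy stands for the merge-per-join approach, and java.util.concurrent.ConcurrentHashMap.merge stands in for the single shared map. Note the swap: the slides name ConcurrentHashMapUnsafe.updateValueWith() from GS Collections, which is not used here, and the class CollapseFactorSketch is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical contrast of the two aggregation strategies using JDK classes only.
final class CollapseFactorSketch {
    // Merge style: each forked task fills its own map and joins merge maps pairwise.
    // Cost grows with the number of keys, because each merge visits every key.
    static Map<Integer, Double> mergeStyle(int size, int keyCount) {
        return IntStream.range(0, size)
                .parallel()
                .boxed()
                .collect(Collectors.groupingBy(
                        i -> i % keyCount,
                        Collectors.summingDouble(i -> 1.0)));
    }

    // Shared-map style: every task updates one concurrent map atomically.
    // Cost grows with contention, which is worst when there are few keys.
    static Map<Integer, Double> sharedMapStyle(int size, int keyCount) {
        ConcurrentHashMap<Integer, Double> result = new ConcurrentHashMap<>();
        IntStream.range(0, size)
                .parallel()
                .forEach(i -> result.merge(i % keyCount, 1.0, Double::sum));
        return result;
    }
}
```

Calling both with 26 keys and then with 100,000 keys, and timing them under JMH, is one way to reproduce the trade-off described above on your own hardware.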

See Mohammad Rezaei’s presentation from QCon 2012 called “Fine Grained Coordinated Parallelism in a Real World Application.”

Performance Factors
• Collapse factor
Test groupBy, aggregateBy

Q&A
http://github.com/goldmansachs/gs-collections
http://github.com/goldmansachs/gs-collections-kata
@GoldmanSachs
http://stackoverflow.com/questions/tagged/gs-collections
craig.motlin@gs.com
Info in appendix
• Sets
• Handcoded parallelism
• Megamorphic warmup

Performance Factors
• Collapse factor
Isolated by using array-backed lists: ArrayList, FastList, and ArrayBuffer.
What if we use Java’s HashSet, Scala’s HashSet, and GS Collections’ UnifiedSet?

[Chart: Lists: FastList | ArrayList | ArrayBuffer]
Measured on an 8 core Linux VM
Intel Xeon E5-2697 v2

[Chart: Sets: UnifiedSet | HashSet (Java’s) | HashSet (Scala’s)]

.select(each -> (each + 1) % 10_000 != 0)
.toSet();
ParallelIterable.toSet() uses a concurrent set.
No combination step.
No preserving order.

Hand coded Parallel / Lazy
.select(integer -> integer % 10_000 != 0 &&
(Integer.valueOf(String.valueOf(integer)) + 1) % 10_000 != 0)
.toList();

[Chart: Stacked computation, ops/s (higher is better). Series: Serial Lazy, Parallel Lazy, Parallel hand-coded.]

Count: SAM method calls
• Let’s take a closer look at both
implementations of count()
• Let’s assume that @FunctionalInterface
method calls are costly and count them as we
go
• We’ll revisit this assumption

java.lang.Thread.State: RUNNABLE
at com.gs.collections.impl.block.procedure.CountProcedure.value(CountProcedure.java:47)
at com.gs.collections.impl.list.mutable.FastList.forEach(FastList.java:623)
at com.gs.collections.impl.utility.Iterate.forEach(Iterate.java:114)
at com.gs.collections.impl.lazy.LazyIterableAdapter.forEach(LazyIterableAdapter.java:49)
at com.gs.collections.impl.lazy.AbstractLazyIterable.count(AbstractLazyIterable.java:461)
at com.gs.collections.impl.jmh.CountTest.serial_lazy_gsc(CountTest.java:302)
• Execution of the lazy evaluation
• Executed once per element
• We’ll look for @FunctionalInterface method calls here

Grand total of 2 @FunctionalInterface
method calls

Count: Java 8
java.lang.Thread.State: RUNNABLE
at java.lang.Long.sum(Long.java:1587)
at java.util.stream.LongPipeline$$Lambda$3.887750041.applyAsLong(Unknown Source:-1)
at java.util.stream.ReduceOps$8ReducingSink.accept(ReduceOps.java:394)
at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1359)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
at com.gs.collections.impl.jmh.CountTest.serial_lazy_jdk(CountTest.java:278)
• Execution of the pipeline
• Executed once per element
• We’ll look for @FunctionalInterface method calls here

Count: Java 8
method calls

Count: Scala
Scala implementation is similar to GS Collections
method calls

@FunctionalInterface method calls
• Why do we care about
@FunctionalInterface method calls?
• The JVM’s JIT compiler inlines short method bodies like our Predicates
• The exact nature of the inlining has a dramatic impact on performance

• JMH forks a new JVM for each test
• During both stages of JIT compilation, this.predicate is our test
Predicate
• The JVM will perform monomorphic inlining
public void value(T object)
{
if (this.predicate.accept(object))
{
this.count++;
}
}
Predicate from the test:
each -> each % 2 == 0

The dispatch algorithm in pseudo code
public void value(T object) {
    if (this.predicate instanceof lambda$serial_lazy_gsc$1) {
        if (object % 2 == 0) {
            this.count++;
        }
    } else {
        [recompile]
    }
}

• The next recompilation will result in bimorphic inlining
• The recompilation after that will result in megamorphic method dispatch
• Classic table lookup and jump
• In other words, no inlining
• Dramatic performance penalty for fast methods like count()

Megamorphic method dispatch
How do we trigger megamorphic deoptimization?
@Setup(Level.Trial)
public void setUp_megamorphic()
{
long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
Assert.assertEquals(SIZE / 2, evens);
long odds = this.integersJDK.stream().filter(each -> each % 2 == 1).count();
Assert.assertEquals(SIZE / 2, odds);
long evens2 = this.integersJDK.stream().filter(each -> (each & 1) == 0).count();
Assert.assertEquals(SIZE / 2, evens2);
}
This is something that JMH does not handle for you!
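
Outside of JMH, the same idea can be sketched as one shared call site that is fed three different lambda implementations. The class MegamorphicWarmupSketch is hypothetical, and whether the JIT actually falls back to megamorphic dispatch depends on the JVM and its settings; the code only shows how a single call site comes to see three receiver types.

```java
import java.util.function.IntPredicate;

// Hypothetical illustration of a call site that sees three predicate classes.
final class MegamorphicWarmupSketch {
    // predicate.test(...) below is the single call site shared by all callers.
    static int count(int[] values, IntPredicate predicate) {
        int count = 0;
        for (int value : values) {
            if (predicate.test(value)) {
                count++;
            }
        }
        return count;
    }

    // Calls count() with three distinct lambdas, mirroring the evens / odds / evens2 warmup above.
    static int[] warmUp(int[] values) {
        int evens = count(values, each -> each % 2 == 0);
        int odds = count(values, each -> each % 2 == 1);
        int evens2 = count(values, each -> (each & 1) == 0);
        return new int[] {evens, odds, evens2};
    }

    // Builds the array 1..size.
    static int[] oneTo(int size) {
        int[] values = new int[size];
        for (int i = 0; i < size; i++) {
            values[i] = i + 1;
        }
        return values;
    }
}
```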

[Chart: Megamorphic Count, ops/s (higher is better). Series: Serial Lazy, Megamorphic Serial Lazy.]

Megamorphic method dispatch
• Why force megamorphic deoptimization?
• Some implementations will have extra virtual method
calls (@FunctionalInterface method calls)
• Microbenchmarks aren’t realistic, but which is more
realistic (less unrealistic?)
• You will trigger this deoptimization in normal production code, as soon as there is more than one call to this API anywhere in the executed code

