This presentation reflects information available to the Technology Division of Goldman Sachs only and not any other part of Goldman Sachs. It should not be relied upon
or considered investment advice. Goldman, Sachs & Co. (“GS”) does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and
recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact.
Parallel-lazy performance
Java 8 vs Scala vs GS Collections
Craig Motlin
June 2014
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/java-streams-scala-parallel-collections
Presented at QCon New York
www.qconnewyork.com
Goals
• Compare Java Streams, Scala parallel
Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance
testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid
Lots of claims
and opinions
Lots of evidence
Intro
• Solve the same problem in all three libraries
– Java (1.8.0_05)
– GS Collections (5.1.0)
– Scala (2.11.0)
• Count how many even numbers are in a list of numbers
• Then accomplish the same thing in parallel
– Data-level parallelism
– Batch the data
– Use all the cores
Performance Factors
Tests that isolate individual performance factors
• Count
• Filter, Transform, Transform, Filter, convert to List
• Aggregation
– Market value stats aggregated by product or category
Count: Serial
// Java 8
long evens = arrayList.stream()
    .filter(each -> each % 2 == 0).count();
// GS Collections
int evens =
    fastList.count(each -> each % 2 == 0);
// Scala
val evens = arrayBuffer.count(_ % 2 == 0)
Count: Serial Lazy
// Java 8
long evens = arrayList.stream()
    .filter(each -> each % 2 == 0).count();
// GS Collections
int evens = fastList.asLazy()
    .count(each -> each % 2 == 0);
// Scala
val evens = arrayBuffer.view
    .count(_ % 2 == 0)
Count: Parallel Lazy
// Java 8
long evens = arrayList.parallelStream()
    .filter(each -> each % 2 == 0).count();
// GS Collections
int evens = fastList
    .asParallel(executorService, BATCH_SIZE)
    .count(each -> each % 2 == 0);
// Scala
val evens =
    arrayBuffer.par.count(_ % 2 == 0)
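Parallel Lazy
[Diagram: the list 1 … 1M is split into batches (1-10k, 10k-20k, 20k-30k, 30k-40k, … 990k-1M). Each batch is filtered and counted in a single step, yielding 5k per batch. A reduce step sums the batch counts to 500k.]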
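Parallel Eager
[Diagram: the list 1 … 1M is split into batches (1-10k, 10k-20k, 20k-30k, 30k-40k, … 990k-1M). A filter step produces an intermediate list of evens for each batch (2, 4, 6, 8 … 10k; 10k-20k evens; 20k-30k evens; … 990k-1M evens). A count step then yields 5k per batch, and a reduce step sums the batch counts to 500k.]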
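The two diagrams above can be sketched in plain Java as a rough illustration. This is not how any of the three libraries is implemented; `BatchedCount`, `countLazy`, `countEager` and `sumOverBatches` are hypothetical names. The lazy variant filters and counts each batch in one pass, while the eager variant first materializes the matches of each batch and then counts them. `batchSize` is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not part of any library in the talk.
final class BatchedCount {
    // Lazy style: each batch is filtered and counted in one pass, no intermediate list.
    static <T> int countLazy(List<T> list, Predicate<? super T> predicate,
                             ExecutorService executor, int batchSize) {
        return sumOverBatches(list, executor, batchSize, batch -> {
            int count = 0;
            for (T each : batch) {
                if (predicate.test(each)) {
                    count++;
                }
            }
            return count;
        });
    }

    // Eager style: each batch first builds a list of matches, then counts that list.
    static <T> int countEager(List<T> list, Predicate<? super T> predicate,
                              ExecutorService executor, int batchSize) {
        return sumOverBatches(list, executor, batchSize, batch -> {
            List<T> matches = new ArrayList<>();
            for (T each : batch) {
                if (predicate.test(each)) {
                    matches.add(each);
                }
            }
            return matches.size();
        });
    }

    // Batch step: split into sublists of batchSize and submit one task per batch.
    // Reduce step: sum the per-batch results.
    private static <T> int sumOverBatches(List<T> list, ExecutorService executor,
                                          int batchSize, Function<List<T>, Integer> perBatch) {
        List<Future<Integer>> futures = new ArrayList<>();
        for (int start = 0; start < list.size(); start += batchSize) {
            List<T> batch = list.subList(start, Math.min(start + batchSize, list.size()));
            Callable<Integer> task = () -> perBatch.apply(batch);
            futures.add(executor.submit(task));
        }
        int total = 0;
        try {
            for (Future<Integer> future : futures) {
                total += future.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }
}
```

Both variants return the same count. The difference is the garbage created by the intermediate lists in the eager version.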
Time for some
numbers!
[Chart: Serial Count, ops/s (higher is better). Category: Serial Lazy. Series: Java 8, GS Collections, Scala.]
[Chart: Parallel Count, ops/s (higher is better). Categories: Serial Lazy, Parallel Lazy. Series: Java 8, GS Collections, Scala. Annotation: 8x.]
Measured on an 8 core Linux VM
Intel Xeon E5-2697 v2
Java Microbenchmark Harness
“JMH is a Java harness for building, running, and analysing
nano/micro/milli/macro benchmarks written in Java and other languages
targetting the JVM.”
• 5 forked JVMs per test
• 100 warmup iterations per JVM
• 50 measurement iterations per JVM
• 1 second of looping per iteration
http://openjdk.java.net/projects/code-tools/jmh/
Java Microbenchmark Harness
@GenerateMicroBenchmark
public void parallel_lazy_jdk() {
    long evens = this.integersJDK
        .parallelStream()
        .filter(each -> each % 2 == 0)
        .count();
    Assert.assertEquals(SIZE / 2, evens);
}
Java Microbenchmark Harness
• @Setup includes megamorphic warmup
• More info on megamorphic in the appendix
• This is something that JMH does not handle
for you!
Java Microbenchmark Harness
• Throughput: higher is better
• Enough warmup iterations so that standard deviation is low
Benchmark                        Mode   Samples     Mean  Mean error  Units
CountTest.parallel_eager_gsc     thrpt      250  629.961       8.305  ops/s
CountTest.parallel_lazy_gsc      thrpt      250  595.023       7.153  ops/s
CountTest.parallel_lazy_jdk      thrpt      250  415.382       7.766  ops/s
CountTest.parallel_lazy_scala    thrpt      250  331.938       2.141  ops/s
CountTest.serial_eager_gsc       thrpt      250  115.197       0.328  ops/s
CountTest.serial_eager_scala     thrpt      250   91.167       0.864  ops/s
CountTest.serial_lazy_gsc        thrpt      250   73.625       3.619  ops/s
CountTest.serial_lazy_jdk        thrpt      250   58.182       0.477  ops/s
CountTest.serial_lazy_scala      thrpt      250   84.200       1.033  ops/s
...
Performance Factors
Tests that isolate individual performance factors
• Count
• Filter, Transform, Transform, Filter, convert to List
• Aggregation
– Market value stats aggregated by product or category
Java Microbenchmark Harness
• Performance tests are open sourced
• Read them and run them on your hardware
https://github.com/goldmansachs/gs-collections/
Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Isolated by using array-backed lists: ArrayList, FastList, and ArrayBuffer.
Isolated because combination of intermediate results is simple addition.
Let’s look at reasons for the differences in count()
Count: Java 8
Count: Java 8 implementation
@GenerateMicroBenchmark
public void serial_lazy_jdk() {
    long evens = this.integersJDK
        .stream()
        .filter(each -> each % 2 == 0)
        .count();
    Assert.assertEquals(SIZE / 2, evens);
}
filter(Predicate).count() instead of count(Predicate)
Is count() just
incrementing a counter?
Count: Java 8 implementation
public final long count() {
    return mapToLong(e -> 1L).sum();
}
public final long sum() {
    return reduce(0, Long::sum);
}
/** @since 1.8 */
public static long sum(long a, long b) { return a + b; }
Count: Java 8 implementation
this.integersJDK
    .stream()
    .filter(each -> each % 2 == 0)
    .mapToLong(e -> 1L)
    .reduce(0, Long::sum);
this.integersGSC
    .asLazy()
    .count(each -> each % 2 == 0);
Seems like
extra work
Count: GS Collections
Count: GS Collections
@GenerateMicroBenchmark
public void serial_lazy_gsc() {
    int evens = this.integersGSC
        .asLazy()
        .count(each -> each % 2 == 0);
    Assert.assertEquals(SIZE / 2, evens);
}
Count: GS Collections
AbstractLazyIterable.java
public int count(Predicate<? super T> predicate)
{
    CountProcedure<T> procedure =
        new CountProcedure<T>(predicate);
    this.forEach(procedure);
    return procedure.getCount();
}
Count: GS Collections
FastList.java
public void forEach(Procedure<? super T> procedure)
{
    for (int i = 0; i < this.size; i++)
    {
        procedure.value(this.items[i]);
    }
}
Count: GS Collections
public class CountProcedure<T> implements Procedure<T>
{
    private final Predicate<? super T> predicate;
    private int count;
    ...
    public void value(T object) {
        if (this.predicate.accept(object)) {
            this.count++;
        }
    }
    public int getCount() { return this.count; }
}
Predicate from the test:
each -> each % 2 == 0
Count: Scala
Count: Scala implementation
TraversableOnce.scala
def count(p: A => Boolean): Int =
{
  var cnt = 0
  for (x <- this)
    if (p(x)) cnt += 1
  cnt
}
The for-comprehension becomes a call to foreach().
The lambda closes over cnt. It executes the predicate and increments cnt, just like CountProcedure.
Count: Scala implementation
public final java.lang.Object apply(java.lang.Object);
0: aload_0
1: aload_1
// Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
2: invokestatic #32
// Method apply:(I)Z
5: invokevirtual #34
// Method scala/runtime/BoxesRunTime.boxToBoolean:(Z)Ljava/lang/Boolean;
8: invokestatic #38
11: areturn
public boolean apply$mcZI$sp(int);
0: iload_1
1: iconst_2
2: irem
3: iconst_0
4: if_icmpne 11
7: iconst_1
8: goto 12
11: iconst_0
12: ireturn
public final boolean apply(int);
0: aload_0
1: iload_1
// Method apply$mcZI$sp:(I)Z
2: invokevirtual #21
5: ireturn
Count: Scala implementation
[Diagram: Integer → Integer.intValue() → int → Lambda: _ % 2 == 0 (Bytecode: irem) → boolean → Boolean.valueOf(boolean) → Boolean]
Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Scala’s auto-boxing
Java’s pull lazy
evaluation
Performance Factors
Tests that isolate individual performance factors
• Count
• Filter, Transform, Transform, Filter, convert to List
• Aggregation
– Market value stats aggregated by product or category
Parallel / Lazy / JDK
List<Integer> list = this.integersJDK
    .parallelStream()
    .filter(each -> each % 10_000 != 0)
    .map(String::valueOf)
    .map(Integer::valueOf)
    .filter(each -> (each + 1) % 10_000 != 0)
    .collect(Collectors.toList());
Verify.assertSize(999_800, list);
Parallel / Lazy / GSC
MutableList<Integer> list = this.integersGSC
    .asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);
Parallel / Lazy / Scala
val list = this.integers
  .par
  .filter(each => each % 10000 != 0)
  .map(String.valueOf)
  .map(Integer.valueOf)
  .filter(each => (each + 1) % 10000 != 0)
  .toBuffer
Assert.assertEquals(999800, list.size)
[Chart: Stacked computation, ops/s (higher is better). Categories: Serial Lazy, Parallel Lazy. Series: Java 8, GS Collections, Scala. Annotation: 8x.]
Parallel / Lazy / JDK
List<Integer> list = this.integersJDK
    .parallelStream()
    .filter(each -> each % 10_000 != 0)
    .map(String::valueOf)
    .map(Integer::valueOf)
    .filter(each -> (each + 1) % 10_000 != 0)
    .collect(Collectors.toList());
Verify.assertSize(999_800, list);
ArrayList::new
List::add
(left, right) -> {
left.addAll(right);
return left;
}
Fork-Join Merge
• Intermediate results are merged in a tree
• Merging is O(n log n) work and garbage
Fork-Join Merge
• Amount of work done by last thread is O(n)
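To make the merge cost visible, here is a minimal sketch, assuming a hand-written collector built from the same three functions shown on the slide (ArrayList::new, List::add, and an addAll-based combiner). `MergeCost`, `COPIED` and `countingToList` are hypothetical names added for instrumentation. The number of copied elements depends on how the stream is split at run time, so no particular value is claimed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical instrumentation for illustration; not JDK code.
final class MergeCost {
    // Running total of elements copied by the combiner across all join steps.
    static final AtomicLong COPIED = new AtomicLong();

    // Same shape as the slide's supplier / accumulator / combiner, plus a counter.
    static <T> Collector<T, List<T>, List<T>> countingToList() {
        return Collector.<T, List<T>>of(
            ArrayList::new,
            List::add,
            (left, right) -> {
                COPIED.addAndGet(right.size()); // every join copies the right-hand list
                left.addAll(right);
                return left;
            });
    }

    // Collects the even numbers of 1..size in parallel using the counting collector.
    static List<Integer> run(int size) {
        List<Integer> input = IntStream.rangeClosed(1, size)
            .boxed()
            .collect(Collectors.toList());
        return input.parallelStream()
            .filter(each -> each % 2 == 0)
            .collect(countingToList());
    }
}
```

After calling `MergeCost.run(n)`, reading `MergeCost.COPIED.get()` shows how many elements were copied during merging on that run.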
Parallel / Lazy / GSC
MutableList<Integer> list = this.integersGSC
    .asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);
ParallelIterable.toList() returns a CompositeFastList, a List with an O(1) implementation of addAll()
Parallel / Lazy / GSC
public final class CompositeFastList<E> {
    private final FastList<FastList<E>> lists =
        FastList.newList();
    public boolean addAll(Collection<? extends E> collection) {
        FastList<E> collectionToAdd = collection instanceof FastList
            ? (FastList<E>) collection
            : new FastList<E>(collection);
        this.lists.add(collectionToAdd);
        return true;
    }
    ...
}
CompositeFastList Merge
• Merging is O(1) work per batch
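The idea of a composite list can be sketched with JDK types only. This is a simplified, hypothetical `CompositeList`, not the GS Collections source: addAll() keeps a reference to the incoming list instead of copying its elements, and get() walks the stored parts. It assumes the added lists are not modified afterwards.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical simplification for illustration; not the GS Collections class.
final class CompositeList<E> extends AbstractList<E> {
    private final List<List<? extends E>> parts = new ArrayList<>();
    private int size;

    // For a List argument this stores one reference, so the cost does not
    // grow with the number of elements being added.
    @Override
    public boolean addAll(Collection<? extends E> collection) {
        List<? extends E> part = collection instanceof List
            ? (List<? extends E>) collection
            : new ArrayList<E>(collection);
        if (part.isEmpty()) {
            return false;
        }
        this.parts.add(part);
        this.size += part.size(); // assumes the part is not mutated later
        this.modCount++;
        return true;
    }

    // Reads pay for the cheap merge: get() scans the parts to find the index.
    @Override
    public E get(int index) {
        if (index < 0 || index >= this.size) {
            throw new IndexOutOfBoundsException("Index: " + index + ", Size: " + this.size);
        }
        int offset = index;
        for (List<? extends E> part : this.parts) {
            if (offset < part.size()) {
                return part.get(offset);
            }
            offset -= part.size();
        }
        throw new IndexOutOfBoundsException("Index: " + index);
    }

    @Override
    public int size() {
        return this.size;
    }
}
```

The trade-off in this sketch is that combining is cheap while indexed access becomes proportional to the number of parts.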
Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Fork-join is general
purpose but requires
merge work
Specialized data
structures meant
for combining
Thread Pools
Parallel: GSC
.asParallel(this.executorService, BATCH_SIZE)
• You must specify your own batch size
– 10,000 is fine
– size / (8 * #cores) is fine
• You must specify your own thread pool
– Can share, or not
– Can tailor for CPU-bound
Executors.newFixedThreadPool(
Runtime.getRuntime().availableProcessors())
– Or IO-Bound
Executors.newFixedThreadPool(maxDbConnections)
Parallel: Scala
• One shared fork-join pool, configurable
• Batch sizes are dynamic and respond to work
stealing
• Minimum batch size:
1 + size / (8 * #cores)
Parallel: Java 8
• One shared fork-join pool, not configurable
• Batch sizes are dynamic and respond to work stealing
• Minimum batch size:
– max(1, size / (4 * (#cores - 1)))
– Default pool also has #cores – 1 threads, plus main thread
helps
– Can be changed with system property
java.util.concurrent.ForkJoinPool.common.parallelism
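The batch-size formulas quoted on the three slides above can be compared with a little arithmetic. This is a sketch only: `BatchSizes` and its method names are hypothetical, and the guards against a zero divisor or a zero result are additions that are not on the slides.

```java
// Hypothetical helper that evaluates the formulas quoted on the slides.
final class BatchSizes {
    // GS Collections: the caller chooses; the slide suggests size / (8 * #cores).
    static int gscSuggested(int size, int cores) {
        return Math.max(1, size / (8 * cores)); // max(1, ...) is an added guard
    }

    // Scala minimum batch size: 1 + size / (8 * #cores).
    static int scalaMinimum(int size, int cores) {
        return 1 + size / (8 * cores);
    }

    // Java 8 minimum batch size: max(1, size / (4 * (#cores - 1))).
    static int javaMinimum(int size, int cores) {
        int workers = Math.max(1, cores - 1); // added guard for a single-core machine
        return Math.max(1, size / (4 * workers));
    }
}
```

For one million elements on 8 cores these evaluate to 15625, 15626 and 35714 respectively, so under these formulas Java starts from roughly twice the batch size of the other two.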
Aggregation
Aggregation Domain
[Chart: Aggregate by Categories. Categories: Serial Lazy, Parallel Lazy. Series: Java 8, GS Collections. Annotation: 8x.]
[Chart: Aggregate by Accounts. Categories: Serial Lazy, Parallel Lazy. Series: Java 8, GS Collections. Annotation: 8x.]
Aggregate by Category Streams
Map<String, DoubleSummaryStatistics> categoryDoubleMap =
    this.jdkPositions.parallelStream().collect(
        Collectors.groupingBy(
            Position::getCategory,
            Collectors.summarizingDouble(Position::getMarketValue)));
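A self-contained version of the Streams aggregation might look like the sketch below. The `Position` class here is a hypothetical stand-in with only the two getters used on the slide; the real domain class from the talk is not shown in the deck.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class AggregateByCategory {
    // Hypothetical stand-in for the talk's Position domain class.
    static final class Position {
        private final String category;
        private final double marketValue;

        Position(String category, double marketValue) {
            this.category = category;
            this.marketValue = marketValue;
        }

        String getCategory() { return this.category; }
        double getMarketValue() { return this.marketValue; }
    }

    // Groups positions by category and summarizes market value per group
    // (count, sum, min, max, average).
    static Map<String, DoubleSummaryStatistics> aggregate(List<Position> positions) {
        return positions.parallelStream().collect(
            Collectors.groupingBy(
                Position::getCategory,
                Collectors.summarizingDouble(Position::getMarketValue)));
    }
}
```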
Aggregate by Category GSC
MapIterable<String, MarketValueStatistics> categoryDoubleMap =
    this.gscPositions.asParallel(this.executorService, BATCH_SIZE)
        .aggregateInPlaceBy(
            Position::getCategory,
            MarketValueStatistics::new,
            MarketValueStatistics::acceptThis);
What if we group by
Account instead?
Aggregate by Account GSC
MapIterable<Account, MarketValueStatistics> accountDoubleMap =
    this.gscPositions.asParallel(this.executorService, BATCH_SIZE)
        .aggregateInPlaceBy(
            Position::getAccount,
            MarketValueStatistics::new,
            MarketValueStatistics::acceptThis);
Collapse factor
MapIterable<String, MarketValueStatistics> categoryDoubleMap
There are 26 categories, so the map has 26 keys
MapIterable<Account, MarketValueStatistics> accountDoubleMap
There are 100k accounts, so the map has 100k keys
Collapse factor
• Aggregate: Java Streams
– Uses fork/join
– Each forked task creates a map
– Each join step merges two maps
– The joined map is roughly the same size
– Merge is costly when there are many keys
• Aggregate: GS Collections
– Uses a single ConcurrentMap for the results
– Each batched task writes into the map simultaneously with atomic operation
ConcurrentHashMapUnsafe.updateValueWith()
– Contention is costly when there are few keys
See Mohammad Rezaei’s
presentation from QCon
2012 called “Fine Grained
Coordinated Parallelism in
a Real World Application.”
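The two aggregation strategies can be contrasted with JDK classes alone. In this sketch `CollapseFactor` is a hypothetical name, and the JDK's `ConcurrentHashMap.merge()` stands in for GS Collections' `ConcurrentHashMapUnsafe.updateValueWith()`; it illustrates the shape of each approach, not either library's actual implementation. The `keys` parameter controls the collapse factor: few keys means heavy contention on the shared map, many keys means large maps to merge.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical comparison harness; counts elements per key (value % keys).
final class CollapseFactor {
    // Strategy 1 (Streams style): each forked task builds its own map and
    // every join step merges two maps.
    static Map<Integer, Long> mergeMaps(List<Integer> values, int keys) {
        return values.parallelStream()
            .collect(Collectors.groupingBy(each -> each % keys, Collectors.counting()));
    }

    // Strategy 2 (shared-map style): all tasks write into one concurrent map
    // using an atomic per-key update, so there is no merge step.
    static Map<Integer, Long> sharedMap(List<Integer> values, int keys) {
        ConcurrentHashMap<Integer, Long> result = new ConcurrentHashMap<>();
        values.parallelStream()
            .forEach(each -> result.merge(each % keys, 1L, Long::sum));
        return result;
    }
}
```

Both methods produce the same counts. Which one is faster for a given number of keys is exactly the kind of question the talk suggests measuring on your own hardware.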
Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Test groupBy,
aggregateBy
Goals
• Compare Java Streams, Scala parallel
Collections, and GS Collections
• Convince you to use GS Collections
• Convince you to do your own performance
testing
• Identify when to avoid parallel APIs
• Identify performance pitfalls to avoid
Q&A
http://github.com/goldmansachs/gs-collections
http://github.com/goldmansachs/gs-collections-kata
@GoldmanSachs
http://stackoverflow.com/questions/tagged/gs-collections
craig.motlin@gs.com
Info in appendix
• Sets
• Handcoded parallelism
• Megamorphic warmup
Appendix
Hashtable Sets
Performance Factors
Factors that may affect performance
• Underlying container implementation
• Combine strategy
• Fork-join vs batching (and batch size)
• Push vs pull lazy evaluation
• Collapse factor
• Unknown unknowns
Isolated by using
array-backed lists.
ArrayList, FastList, and
ArrayBuffer
What if we use Java’s
HashSet, Scala’s
HashSet, and GS
Collections’
UnifiedSet?
[Chart: Parallel Count, ops/s (higher is better). Lists: FastList | ArrayList | ArrayBuffer. Categories: Serial Lazy, Parallel Lazy. Series: Java 8, GS Collections, Scala.]
Measured on an 8 core Linux VM
Intel Xeon E5-2697 v2
[Chart: Parallel Count, ops/s (higher is better). Sets: UnifiedSet | HashSet (Java’s) | HashSet (Scala’s). Categories: Serial Lazy, Parallel Lazy. Series: Java 8, GS Collections, Scala. Annotation: 8x.]
Parallel / Lazy / GSC
MutableSet<Integer> set = this.integersGSC
    .asParallel(this.executorService, BATCH_SIZE)
    .select(each -> each % 10_000 != 0)
    .collect(String::valueOf)
    .collect(Integer::valueOf)
    .select(each -> (each + 1) % 10_000 != 0)
    .toSet();
Verify.assertSize(999_800, set);
ParallelIterable.toSet() uses a concurrent set.
No combination step.
Order is not preserved.
Hand coded parallelism
Hand coded Parallel / Lazy
MutableList<Integer> list = this.integersGSC
    .asParallel(this.executorService, BATCH_SIZE)
    .select(integer -> integer % 10_000 != 0 &&
        (Integer.valueOf(String.valueOf(integer)) + 1) % 10_000 != 0)
    .toList();
Verify.assertSize(999_800, list);
[Chart: Stacked computation, ops/s (higher is better). Categories: Serial Lazy, Parallel Lazy, Parallel hand-coded. Series: Java 8, GS Collections, Scala. Annotation: 8x.]
Method inlining
Count: SAM method calls
• Let’s take a closer look at both
implementations of count()
• Let’s assume that @FunctionalInterface
method calls are costly and count them as we
go
• We’ll revisit this assumption
Count: GS Collections
java.lang.Thread.State: RUNNABLE
at com.gs.collections.impl.block.procedure.CountProcedure.value(CountProcedure.java:47)
at com.gs.collections.impl.list.mutable.FastList.forEach(FastList.java:623)
at com.gs.collections.impl.utility.Iterate.forEach(Iterate.java:114)
at com.gs.collections.impl.lazy.LazyIterableAdapter.forEach(LazyIterableAdapter.java:49)
at com.gs.collections.impl.lazy.AbstractLazyIterable.count(AbstractLazyIterable.java:461)
at com.gs.collections.impl.jmh.CountTest.serial_lazy_gsc(CountTest.java:302)
• Execution of the lazy evaluation
• Executed once per element
• We’ll look for @FunctionalInterface method calls here
Count: GS Collections
Grand total of 2 @FunctionalInterface
method calls
Count: Java 8
java.lang.Thread.State: RUNNABLE
at java.lang.Long.sum(Long.java:1587)
at java.util.stream.LongPipeline$$Lambda$3.887750041.applyAsLong(Unknown Source:-1)
at java.util.stream.ReduceOps$8ReducingSink.accept(ReduceOps.java:394)
at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1359)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
at com.gs.collections.impl.jmh.CountTest.serial_lazy_jdk(CountTest.java:278)
• Execution of the pipeline
• Executed once per element
• We’ll look for @FunctionalInterface method calls here
Count: Java 8
Grand total of 6 @FunctionalInterface
method calls
Count: Scala
Scala implementation is similar to GS Collections
Grand total of 2 @FunctionalInterface
method calls
@FunctionalInterface method calls
• Why do we care about
@FunctionalInterface method calls?
• The JDK inlines short method bodies like our
Predicates
• The exact nature of the inlining has a dramatic
impact on performance
@FunctionalInterface method calls
• JMH forks a new JVM for each test
• During both stages of JIT compilation, this.predicate is our test
Predicate
• The JVM will perform monomorphic inlining
public void value(T object)
{
    if (this.predicate.accept(object))
    {
        this.count++;
    }
}
Predicate from the test:
each -> each % 2 == 0
@FunctionalInterface method calls
The dispatch algorithm in pseudo code
if (this.predicate instanceof lambda$serial_lazy_gsc$1) {
    if (object % 2 == 0) {
        this.count++;
    }
} else {
    [recompile]
    if (this.predicate.accept(object)) {
        this.count++;
    }
}
@FunctionalInterface method calls
• The next recompilation will result in bimorphic inlining
• The recompilation after that will result in megamorphic method dispatch
• Classic table lookup and jump
• In other words, no inlining
• Dramatic performance penalty for fast methods like count()
Megamorphic method dispatch
How do we trigger megamorphic deoptimization?
@Setup(Level.Trial)
public void setUp_megamorphic()
{
    long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count();
    Assert.assertEquals(SIZE / 2, evens);
    long odds = this.integersJDK.stream().filter(each -> each % 2 == 1).count();
    Assert.assertEquals(SIZE / 2, odds);
    long evens2 = this.integersJDK.stream().filter(each -> (each & 1) == 0).count();
    Assert.assertEquals(SIZE / 2, evens2);
}
This is something that JMH does not handle for you!
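The same warmup idea can be shown without JMH or any collection library. In this sketch `MegamorphicWarmup`, `count` and `warmUp` are hypothetical names. Three distinct lambdas pass through the single `predicate.test(each)` call site, which is the condition the slides describe for megamorphic dispatch. Whether the JIT compiler actually treats the site as megamorphic depends on its profiling and cannot be observed from this code.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-alone analogue of the setUp_megamorphic() method above.
final class MegamorphicWarmup {
    // The call site of interest is predicate.test(each): every caller's
    // lambda flows through this one location.
    static <T> int count(List<T> list, Predicate<? super T> predicate) {
        int count = 0;
        for (T each : list) {
            if (predicate.test(each)) {
                count++;
            }
        }
        return count;
    }

    // Sends three different lambda classes through the same call site,
    // mirroring the evens / odds / evens2 pattern from the slide.
    static int[] warmUp(List<Integer> integers) {
        int evens = count(integers, each -> each % 2 == 0);
        int odds = count(integers, each -> each % 2 == 1);
        int evens2 = count(integers, each -> (each & 1) == 0);
        return new int[] {evens, odds, evens2};
    }
}
```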
[Chart: Megamorphic Count, ops/s (higher is better). Categories: Serial Lazy, Megamorphic Serial Lazy. Series: Java 8, GS Collections, Scala. Annotation: 8x.]
Megamorphic method dispatch
• Why force megamorphic deoptimization?
• Some implementations will have extra virtual method
calls (@FunctionalInterface method calls)
• Microbenchmarks aren’t realistic, but which is more
realistic (less unrealistic?)
• You will trigger this deoptimization in normal production code as soon as there is more than one call to this API anywhere in the executed code
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

Parallel-lazy Performance: Java 8 vs Scala vs GS Collections

  • 1. This presentation reflects information available to the Technology Division of Goldman Sachs only and not any other part of Goldman Sachs. It should not be relied upon or considered investment advice. Goldman, Sachs & Co. (“GS”) does not warrant or guarantee to anyone the accuracy, completeness or efficacy of this presentation, and recipients should not rely on it except at their own risk. This presentation may not be forwarded or disclosed except with this disclaimer intact. Parallel-lazy performance Java 8 vs Scala vs GS Collections Craig Motlin June 2014
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /java-streams-scala-parallel- collections
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid
  • 5. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Lots of claims and opinions
  • 6. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Lots of evidence
  • 8. Intro • Solve the same problem in all three libraries – Java (1.8.0_05) – GS Collections (5.1.0) – Scala (2.11.0) • Count how many even numbers are in a list of numbers • Then accomplish the same thing in parallel – Data-level parallelism – Batch the data – Use all the cores
  • 9. Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation – Market value stats aggregated by product or category
  • 10. Count: Serial
        Java 8: long evens = arrayList.stream().filter(each -> each % 2 == 0).count();
        GS Collections: int evens = fastList.count(each -> each % 2 == 0);
        Scala: val evens = arrayBuffer.count(_ % 2 == 0)
  • 11. Count: Serial Lazy
        Java 8: long evens = arrayList.stream().filter(each -> each % 2 == 0).count();
        GS Collections: int evens = fastList.asLazy().count(each -> each % 2 == 0);
        Scala: val evens = arrayBuffer.view.count(_ % 2 == 0)
  • 12. Count: Parallel Lazy
        Java 8: long evens = arrayList.parallelStream().filter(each -> each % 2 == 0).count();
        GS Collections: int evens = fastList.asParallel(executorService, BATCH_SIZE).count(each -> each % 2 == 0);
        Scala: val evens = arrayBuffer.par.count(_ % 2 == 0)
  • 13. Parallel Lazy [Diagram: the list 1 … 1M is split into batches (1-10k, 10k-20k, 20k-30k, 30k-40k, …, 990k-1M); each batch is filtered and counted in one step, giving 5k per batch; the per-batch counts are reduced to 500k.]
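The batch, filter-and-count, reduce flow in the diagram can be sketched with plain JDK classes. This is a minimal illustration under stated assumptions, not the implementation used by any of the three libraries: `BatchedCount` and `countEvens` are hypothetical names, and the batches are simple `subList` views.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration; not taken from any of the three libraries.
final class BatchedCount {
    // Splits the list into fixed-size batches, counts the evens of each batch
    // on the pool, then adds up the per-batch counts.
    static long countEvens(List<Integer> numbers, ExecutorService pool, int batchSize)
            throws Exception {
        List<Future<Long>> partials = new ArrayList<>();
        for (int start = 0; start < numbers.size(); start += batchSize) {
            final List<Integer> batch =
                    numbers.subList(start, Math.min(start + batchSize, numbers.size()));
            // Filter and count are fused: no intermediate list of evens is built.
            Callable<Long> task = () -> {
                long count = 0;
                for (int each : batch) {
                    if (each % 2 == 0) {
                        count++;
                    }
                }
                return count;
            };
            partials.add(pool.submit(task));
        }
        long total = 0; // Reduce step: simple addition of the partial counts.
        for (Future<Long> partial : partials) {
            total += partial.get();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 1; i <= 1_000_000; i++) {
            numbers.add(i);
        }
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            System.out.println(countEvens(numbers, pool, 10_000)); // prints 500000
        } finally {
            pool.shutdown();
        }
    }
}
```

The eager variant on the next slide would differ only inside the task: it would first copy each batch's evens into a new list and then take that list's size.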
  • 14. Parallel Eager [Diagram: the list 1 … 1M is split into batches (1-10k, 10k-20k, 20k-30k, 30k-40k, …, 990k-1M); each batch is filtered into a list of its even numbers; each filtered list is counted, giving 5k per batch; the per-batch counts are reduced to 500k.]
  • 15. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid Time for some numbers!
  • 16. Serial Count [Chart: ops/s, higher is better. Series: Java 8, GS Collections, Scala. Category: Serial Lazy. Axis range 0 to 400.]
  • 17. Parallel Count [Chart: ops/s, higher is better. Series: Java 8, GS Collections, Scala. Categories: Serial Lazy, Parallel Lazy. Axis range 0 to 1200. Badge: 8x.] Measured on an 8 core Linux VM, Intel Xeon E5-2697 v2
  • 18. Java Microbenchmark Harness “JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targetting the JVM.” • 5 forked JVMs per test • 100 warmup iterations per JVM • 50 measurement iterations per JVM • 1 second of looping per iteration http://openjdk.java.net/projects/code-tools/jmh/
  • 19. Java Microbenchmark Harness
        @GenerateMicroBenchmark
        public void parallel_lazy_jdk() {
            long evens = this.integersJDK
                    .parallelStream()
                    .filter(each -> each % 2 == 0)
                    .count();
            Assert.assertEquals(SIZE / 2, evens);
        }
  • 20. Java Microbenchmark Harness • @Setup includes megamorphic warmup • More info on megamorphic in the appendix • This is something that JMH does not handle for you!
  • 21. Java Microbenchmark Harness • Throughput: higher is better • Enough warmup iterations so that standard deviation is low
        Benchmark                        Mode   Samples  Mean     Mean error  Units
        CountTest.parallel_eager_gsc     thrpt  250      629.961  8.305       ops/s
        CountTest.parallel_lazy_gsc      thrpt  250      595.023  7.153       ops/s
        CountTest.parallel_lazy_jdk      thrpt  250      415.382  7.766       ops/s
        CountTest.parallel_lazy_scala    thrpt  250      331.938  2.141       ops/s
        CountTest.serial_eager_gsc       thrpt  250      115.197  0.328       ops/s
        CountTest.serial_eager_scala     thrpt  250      91.167   0.864       ops/s
        CountTest.serial_lazy_gsc        thrpt  250      73.625   3.619       ops/s
        CountTest.serial_lazy_jdk        thrpt  250      58.182   0.477       ops/s
        CountTest.serial_lazy_scala      thrpt  250      84.200   1.033       ops/s
        ...
  • 22. Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation – Market value stats aggregated by product or category
  • 23. Java Microbenchmark Harness • Performance tests are open sourced • Read them and run them on your hardware https://github.com/goldmansachs/gs-collections/
  • 24. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns
  • 25. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns. Annotations: underlying container implementation is isolated by using array-backed lists (ArrayList, FastList, and ArrayBuffer); combine strategy is isolated because combination of intermediate results is simple addition. Let’s look at reasons for the differences in count()
  • 27. Count: Java 8 implementation
        @GenerateMicroBenchmark
        public void serial_lazy_jdk() {
            long evens = this.integersJDK
                    .stream()
                    .filter(each -> each % 2 == 0)
                    .count();
            Assert.assertEquals(SIZE / 2, evens);
        }
  • 28. Count: Java 8 implementation (same code as slide 27) Annotation: filter(Predicate).count() instead of count(Predicate)
  • 29. Count: Java 8 implementation (same code as slide 27) Annotations: Is count() just incrementing a counter? filter(Predicate).count() instead of count(Predicate)
  • 30. Count: Java 8 implementation
        public final long count() {
            return mapToLong(e -> 1L).sum();
        }
        public final long sum() {
            return reduce(0, Long::sum);
        }
        /** @since 1.8 */
        public static long sum(long a, long b) {
            return a + b;
        }
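To see that the expansion on this slide is equivalent to calling `count()` directly, the two forms can be run side by side. This is a small sketch with a hypothetical class name, `CountExpansion`. The slide shows the source for Java 1.8.0_05, and later JDK releases may implement `count()` differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class for illustration.
final class CountExpansion {
    // The form a caller writes.
    static long viaCount(List<Integer> numbers) {
        return numbers.stream()
                .filter(each -> each % 2 == 0)
                .count();
    }

    // The same pipeline with count() spelled out as on the slide:
    // map every surviving element to 1L, then reduce with Long::sum.
    static long viaExpandedPipeline(List<Integer> numbers) {
        return numbers.stream()
                .filter(each -> each % 2 == 0)
                .mapToLong(e -> 1L)
                .reduce(0L, Long::sum);
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(viaCount(numbers));            // prints 3
        System.out.println(viaExpandedPipeline(numbers)); // prints 3
    }
}
```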
  • 31. Count: Java 8 implementation
        this.integersJDK
                .stream()
                .filter(each -> each % 2 == 0)
                .mapToLong(e -> 1L)
                .reduce(0, Long::sum);
        this.integersGSC
                .asLazy()
                .count(each -> each % 2 == 0);
  • 32. Count: Java 8 implementation (same code as slide 31) Annotation: Seems like extra work
  • 34. Count: GS Collections
        @GenerateMicroBenchmark
        public void serial_lazy_gsc() {
            int evens = this.integersGSC
                    .asLazy()
                    .count(each -> each % 2 == 0);
            Assert.assertEquals(SIZE / 2, evens);
        }
  • 35. Count: GS Collections AbstractLazyIterable.java
        public int count(Predicate<? super T> predicate) {
            CountProcedure<T> procedure = new CountProcedure<T>(predicate);
            this.forEach(procedure);
            return procedure.getCount();
        }
  • 36. Count: GS Collections FastList.java
        public void forEach(Procedure<? super T> procedure) {
            for (int i = 0; i < this.size; i++) {
                procedure.value(this.items[i]);
            }
        }
  • 37. Count: GS Collections
        public class CountProcedure<T> implements Procedure<T> {
            private final Predicate<? super T> predicate;
            private int count;
            ...
            public void value(T object) {
                if (this.predicate.accept(object)) {
                    this.count++;
                }
            }
            public int getCount() {
                return this.count;
            }
        }
  • 38. Count: GS Collections (same code as slide 37) Predicate from the test: each -> each % 2 == 0
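The push-style shape of this implementation (the container drives the loop and pushes each element into a stateful callback) can be imitated with JDK interfaces only. This is a sketch of the idea, not the GS Collections source: `CountingConsumer` is a hypothetical stand-in for `CountProcedure`, built on `java.util.function.Consumer` and `Predicate`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical JDK-only analog of a counting procedure.
final class CountingConsumer<T> implements Consumer<T> {
    private final Predicate<? super T> predicate;
    private int count;

    CountingConsumer(Predicate<? super T> predicate) {
        this.predicate = predicate;
    }

    // Called once per element by the container's forEach.
    @Override
    public void accept(T each) {
        if (this.predicate.test(each)) {
            this.count++;
        }
    }

    int getCount() {
        return this.count;
    }

    // The container iterates and pushes elements into the consumer.
    static <T> int count(Iterable<T> items, Predicate<? super T> predicate) {
        CountingConsumer<T> consumer = new CountingConsumer<>(predicate);
        items.forEach(consumer);
        return consumer.getCount();
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(count(numbers, each -> each % 2 == 0)); // prints 3
    }
}
```

Per element there are two interface calls here: one to `accept` and one to `test`.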
  • 40. Count: Scala implementation TraversableOnce.scala
        def count(p: A => Boolean): Int = {
          var cnt = 0
          for (x <- this)
            if (p(x)) cnt += 1
          cnt
        }
  • 41. Count: Scala implementation TraversableOnce.scala (same code as slide 40) Annotations: the for-comprehension becomes a call to foreach(); the lambda closes over cnt, executes the predicate, and increments cnt, just like CountProcedure
  • 42. Count: Scala implementation
        public final java.lang.Object apply(java.lang.Object);
           0: aload_0
           1: aload_1
           2: invokestatic  #32  // Method scala/runtime/BoxesRunTime.unboxToInt:(Ljava/lang/Object;)I
           5: invokevirtual #34  // Method apply:(I)Z
           8: invokestatic  #38  // Method scala/runtime/BoxesRunTime.boxToBoolean:(Z)Ljava/lang/Boolean;
          11: areturn
        public boolean apply$mcZI$sp(int);
           0: iload_1
           1: iconst_2
           2: irem
           3: iconst_0
           4: if_icmpne 11
           7: iconst_1
           8: goto 12
          11: iconst_0
          12: ireturn
        public final boolean apply(int);
           0: aload_0
           1: iload_1
           2: invokevirtual #21  // Method apply$mcZI$sp:(I)Z
           5: ireturn
  • 43. Count: Scala implementation [Diagram: Integer is unboxed to int with Integer.intValue(); the lambda _ % 2 == 0 (bytecode: irem) maps int to boolean; the boolean is boxed to Boolean with Boolean.valueOf(boolean).]
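The unbox, compute, box round trip in the diagram can be written out in Java. This is only an analogy for the generated Scala bridge methods, using hypothetical names (`BoxingBridge`, `IS_EVEN`, `IS_EVEN_BOXED`): the primitive predicate plays the role of `apply$mcZI$sp(int)`, and the generic function plays the role of `apply(Object)`.

```java
import java.util.function.Function;
import java.util.function.IntPredicate;

// Hypothetical illustration of the boxing bridge; not generated Scala code.
final class BoxingBridge {
    // Primitive form: int in, boolean out.
    static final IntPredicate IS_EVEN = value -> value % 2 == 0;

    // Generic form: unbox the argument, call the primitive form, box the result.
    static final Function<Integer, Boolean> IS_EVEN_BOXED =
            boxed -> Boolean.valueOf(IS_EVEN.test(boxed.intValue()));

    // A generic caller only sees the boxed form, so every element pays
    // for one unboxing and one boxing step.
    static int countBoxed(Integer[] values) {
        int count = 0;
        for (Integer each : values) {
            if (IS_EVEN_BOXED.apply(each)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBoxed(new Integer[] {1, 2, 3, 4})); // prints 2
    }
}
```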
  • 44. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns. Annotations: Scala’s auto-boxing; Java’s pull lazy evaluation
  • 45. Performance Factors Tests that isolate individual performance factors • Count • Filter, Transform, Transform, Filter, convert to List • Aggregation – Market value stats aggregated by product or category
  • 46. Parallel / Lazy / JDK
        List<Integer> list = this.integersJDK
                .parallelStream()
                .filter(each -> each % 10_000 != 0)
                .map(String::valueOf)
                .map(Integer::valueOf)
                .filter(each -> (each + 1) % 10_000 != 0)
                .collect(Collectors.toList());
        Verify.assertSize(999_800, list);
  • 47. Parallel / Lazy / GSC
        MutableList<Integer> list = this.integersGSC
                .asParallel(this.executorService, BATCH_SIZE)
                .select(each -> each % 10_000 != 0)
                .collect(String::valueOf)
                .collect(Integer::valueOf)
                .select(each -> (each + 1) % 10_000 != 0)
                .toList();
        Verify.assertSize(999_800, list);
  • 48. Parallel / Lazy / Scala
        val list = this.integers
          .par
          .filter(each => each % 10000 != 0)
          .map(String.valueOf)
          .map(Integer.valueOf)
          .filter(each => (each + 1) % 10000 != 0)
          .toBuffer
        Assert.assertEquals(999800, list.size)
  • 49. Stacked computation [Chart: ops/s, higher is better. Series: Java 8, GS Collections, Scala. Categories: Serial Lazy, Parallel Lazy. Axis range 0 to 60. Badge: 8x.]
  • 50. Parallel / Lazy / JDK (same code as slide 46)
  • 51. Parallel / Lazy / JDK (same code as slide 46) Annotation on Collectors.toList(): ArrayList::new, List::add, (left, right) -> { left.addAll(right); return left; }
  • 52. Fork-Join Merge • Intermediate results are merged in a tree • Merging is O(n log n) work and garbage
  • 53. Fork-Join Merge • Amount of work done by last thread is O(n)
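A rough way to see where the O(n log n) figure comes from is to count how many elements `left.addAll(right)` copies when equal-sized batches are merged pairwise in a balanced tree. This is a simplified model under stated assumptions: `TreeMergeCost` is a hypothetical name, the number of leaves is a power of two, and copies caused by array growth inside `ArrayList` are ignored.

```java
// Hypothetical cost model for illustration; it simulates the merge tree
// instead of running a real fork-join computation.
final class TreeMergeCost {
    // Number of elements copied by left.addAll(right) over the whole tree,
    // for `size` elements spread evenly over `leaves` batches.
    static long copiedElements(long size, int leaves) {
        if (leaves <= 1) {
            return 0; // A single batch needs no merge.
        }
        long leftSize = size / 2;
        long rightSize = size - leftSize;
        return copiedElements(leftSize, leaves / 2)
                + copiedElements(rightSize, leaves - leaves / 2)
                + rightSize; // The right half is copied into the left half.
    }

    public static void main(String[] args) {
        // 64 batches give 6 merge levels, and each level copies half of the elements.
        System.out.println(copiedElements(1_000_000, 64)); // prints 3000000
        // The final merge alone copies n / 2 elements on a single thread.
        System.out.println(copiedElements(1_000_000, 2));  // prints 500000
    }
}
```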
  • 54. Parallel / Lazy / GSC (same code as slide 47) Annotation: ParallelIterable.toList() returns a CompositeFastList, a List with O(1) implementation of addAll()
  • 55. Parallel / Lazy / GSC
        public final class CompositeFastList<E> {
            private final FastList<FastList<E>> lists = FastList.newList();
            public boolean addAll(Collection<? extends E> collection) {
                FastList<E> collectionToAdd = collection instanceof FastList
                        ? (FastList<E>) collection
                        : new FastList<E>(collection);
                this.lists.add(collectionToAdd);
                return true;
            }
            ...
        }
  • 56. CompositeFastList Merge • Merging is O(1) work per batch [Diagram: batches attached to a single CompositeFastList (CFL)]
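The idea of merging by reference instead of by copy can be sketched with JDK classes only. `CompositeList` below is a hypothetical, heavily reduced analog and not the GS Collections class: it supports only `addAll`, `size`, and `get`, and it assumes the added batches are not modified afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical JDK-only analog of a composite list.
final class CompositeList<E> {
    private final List<List<E>> parts = new ArrayList<>();

    // Amortized O(1) per batch: stores a reference, copies no elements.
    void addAll(List<E> batch) {
        this.parts.add(batch);
    }

    int size() {
        int total = 0;
        for (List<E> part : this.parts) {
            total += part.size();
        }
        return total;
    }

    // Lookup walks the parts, so the cheap merge is paid for with slower reads.
    E get(int index) {
        int remaining = index;
        for (List<E> part : this.parts) {
            if (remaining < part.size()) {
                return part.get(remaining);
            }
            remaining -= part.size();
        }
        throw new IndexOutOfBoundsException("Index: " + index);
    }

    public static void main(String[] args) {
        CompositeList<Integer> composite = new CompositeList<>();
        composite.addAll(Arrays.asList(2, 4, 6));
        composite.addAll(Arrays.asList(8, 10));
        System.out.println(composite.size()); // prints 5
        System.out.println(composite.get(3)); // prints 8
    }
}
```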
  • 57. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns. Annotations: Fork-join is general purpose but requires merge work; specialized data structures meant for combining
  • 59. Parallel: GSC .asParallel(this.executorService, BATCH_SIZE) • You must specify your own batch size – 10,000 is fine – size / (8 * #cores) is fine • You must specify your own thread pool – Can share, or not – Can tailor for CPU-bound: Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()) – Or IO-bound: Executors.newFixedThreadPool(maxDbConnections)
  • 60. Parallel: Scala • One shared fork-join pool, configurable • Batch sizes are dynamic and respond to work stealing • Minimum batch size: 1 + size / (8 * #cores)
  • 61. Parallel: Java 8 • One shared fork-join pool, not configurable • Batch sizes are dynamic and respond to work stealing • Minimum batch size: – max(1, size / (4 * (#cores - 1))) – Default pool also has #cores – 1 threads, plus main thread helps – Can be changed with system property java.util.concurrent.ForkJoinPool.common.parallelism
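The three batch-size rules from slides 59 to 61 can be compared with a little arithmetic. The formulas are taken from the slides as stated. The class and method names are hypothetical, and the guard for a single core in the Java 8 formula is an added assumption to avoid dividing by zero.

```java
// Hypothetical helper that evaluates the batch-size formulas quoted on the slides.
final class MinimumBatchSizes {
    // GS Collections suggestion: size / (8 * #cores).
    static int gsCollectionsSuggestion(int size, int cores) {
        return size / (8 * cores);
    }

    // Scala minimum batch size: 1 + size / (8 * #cores).
    static int scala(int size, int cores) {
        return 1 + size / (8 * cores);
    }

    // Java 8 minimum batch size: max(1, size / (4 * (#cores - 1))).
    // Assumption: treat a single core as one worker to avoid dividing by zero.
    static int java8(int size, int cores) {
        int workers = Math.max(1, cores - 1);
        return Math.max(1, size / (4 * workers));
    }

    public static void main(String[] args) {
        int size = 1_000_000;
        int cores = 8;
        System.out.println(gsCollectionsSuggestion(size, cores)); // prints 15625
        System.out.println(scala(size, cores));                   // prints 15626
        System.out.println(java8(size, cores));                   // prints 35714
    }
}
```

For one million elements on 8 cores, the Java 8 rule yields batches a little more than twice as large as the other two.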
  • 64. Aggregate by Categories [Chart: Series: Java 8, GS Collections. Categories: Serial Lazy, Parallel Lazy. Axis range 0 to 70. Badge: 8x.]
  • 65. Aggregate by Accounts [Chart: Series: Java 8, GS Collections. Categories: Serial Lazy, Parallel Lazy. Axis range 0 to 25. Badge: 8x.]
  • 66. Aggregate by Category Streams
        Map<String, DoubleSummaryStatistics> categoryDoubleMap =
                this.jdkPositions.parallelStream().collect(
                        Collectors.groupingBy(
                                Position::getCategory,
                                Collectors.summarizingDouble(Position::getMarketValue)));
  • 67. Aggregate by Category GSC
        MapIterable<String, MarketValueStatistics> categoryDoubleMap =
                this.gscPositions.asParallel(this.executorService, BATCH_SIZE)
                        .aggregateInPlaceBy(
                                Position::getCategory,
                                MarketValueStatistics::new,
                                MarketValueStatistics::acceptThis);
  • 68. Aggregate by Category GSC (same code as slide 67) What if we group by Account instead?
  • 69. Aggregate by Account GSC
        MapIterable<Account, MarketValueStatistics> accountDoubleMap =
                this.gscPositions.asParallel(this.executorService, BATCH_SIZE)
                        .aggregateInPlaceBy(
                                Position::getAccount,
                                MarketValueStatistics::new,
                                MarketValueStatistics::acceptThis);
        What if we group by Account instead?
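For readers who want to run the Streams version, here is a self-contained sketch. The `Position` class below is a hypothetical stand-in with only the two accessors the slide uses, since the benchmark's real domain class is not shown in the transcript.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical wrapper class for illustration.
final class AggregateByCategory {
    // Minimal stand-in for the benchmark's Position class (an assumption).
    static final class Position {
        private final String category;
        private final double marketValue;

        Position(String category, double marketValue) {
            this.category = category;
            this.marketValue = marketValue;
        }

        String getCategory() {
            return this.category;
        }

        double getMarketValue() {
            return this.marketValue;
        }
    }

    // Groups positions by category and summarizes market value per group.
    static Map<String, DoubleSummaryStatistics> aggregate(List<Position> positions) {
        return positions.parallelStream().collect(
                Collectors.groupingBy(
                        Position::getCategory,
                        Collectors.summarizingDouble(Position::getMarketValue)));
    }

    public static void main(String[] args) {
        List<Position> positions = Arrays.asList(
                new Position("A", 10.0),
                new Position("A", 30.0),
                new Position("B", 5.0));
        Map<String, DoubleSummaryStatistics> stats = aggregate(positions);
        System.out.println(stats.get("A").getSum());   // prints 40.0
        System.out.println(stats.get("A").getCount()); // prints 2
        System.out.println(stats.get("B").getMax());   // prints 5.0
    }
}
```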
  • 70. Collapse factor MapIterable<String, MarketValueStatistics> categoryDoubleMap There are 26 categories, so the map has 26 keys MapIterable<Account, MarketValueStatistics> accountDoubleMap There are 100k accounts, so the map has 100k keys
  • 71. Collapse factor • Aggregate: Java Streams – Uses fork/join – Each forked task creates a map – Each join step merges two maps – The joined map is roughly the same size – Merge is costly when there are many keys • Aggregate: GS Collections – Uses a single ConcurrentMap for the results – Each batched task writes into the map simultaneously with atomic operation ConcurrentHashMapUnsafe.updateValueWith() – Contention is costly when there are few keys
  • 72. Collapse factor (same text as slide 71) See Mohammad Rezaei’s presentation from QCon 2012 called “Fine Grained Coordinated Parallelism in a Real World Application.”
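The two aggregation strategies can be contrasted using only the JDK. This is a sketch of the general idea, not of either library's internals: `viaMergedMaps` uses `Collectors.groupingBy`, which builds a map per task and merges maps on join, while `viaSharedMap` lets every task update one `ConcurrentHashMap`. `DoubleAdder` is used here as a stand-in for the atomic update mentioned on the slide, and all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical comparison of the two combine strategies.
final class CollapseFactorSketch {
    // Fork-join style: each task fills its own map, and joins merge maps pairwise.
    // The merge gets more expensive as the number of distinct keys grows.
    static Map<String, Double> viaMergedMaps(String[] keys, double[] values) {
        return IntStream.range(0, keys.length).parallel().boxed()
                .collect(Collectors.groupingBy(
                        i -> keys[i],
                        Collectors.summingDouble(i -> values[i])));
    }

    // Shared-map style: all tasks write into one concurrent map.
    // There is no merge step, but few distinct keys mean more contention per key.
    static Map<String, Double> viaSharedMap(String[] keys, double[] values) {
        ConcurrentMap<String, DoubleAdder> totals = new ConcurrentHashMap<>();
        IntStream.range(0, keys.length).parallel()
                .forEach(i -> totals.computeIfAbsent(keys[i], key -> new DoubleAdder())
                        .add(values[i]));
        Map<String, Double> result = new HashMap<>();
        totals.forEach((key, adder) -> result.put(key, adder.sum()));
        return result;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "b", "a"};
        double[] values = {1, 2, 3, 4, 5};
        System.out.println(viaMergedMaps(keys, values).get("a")); // prints 9.0
        System.out.println(viaSharedMap(keys, values).get("b"));  // prints 6.0
    }
}
```

Which strategy wins depends on the collapse factor, so it is worth timing both with the key cardinality of the real data.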
  • 73. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns Test groupBy, aggregateBy
  • 74. Goals • Compare Java Streams, Scala parallel Collections, and GS Collections • Convince you to use GS Collections • Convince you to do your own performance testing • Identify when to avoid parallel APIs • Identify performance pitfalls to avoid
  • 77. Q&A
  • 81. Performance Factors Factors that may affect performance • Underlying container implementation • Combine strategy • Fork-join vs batching (and batch size) • Push vs pull lazy evaluation • Collapse factor • Unknown unknowns. Annotations: Isolated by using array-backed lists (ArrayList, FastList, and ArrayBuffer). What if we use Java’s HashSet, Scala’s HashSet, and GS Collections’ UnifiedSet?
  • 82. Parallel Count, Lists: FastList | ArrayList | ArrayBuffer [Chart: ops/s, higher is better. Series: Java 8, GS Collections, Scala. Categories: Serial Lazy, Parallel Lazy. Axis range 0 to 1200. Badge: 8x.] Measured on an 8 core Linux VM, Intel Xeon E5-2697 v2
  • 83. Parallel Count, Sets: UnifiedSet | HashSet (Java’s) | HashSet (Scala’s) [Chart: ops/s, higher is better. Series: Java 8, GS Collections, Scala. Categories: Serial Lazy, Parallel Lazy. Axis range 0 to 1200. Badge: 8x.]
  • 84. Parallel / Lazy / GSC
        MutableSet<Integer> set = this.integersGSC
                .asParallel(this.executorService, BATCH_SIZE)
                .select(each -> each % 10_000 != 0)
                .collect(String::valueOf)
                .collect(Integer::valueOf)
                .select(each -> (each + 1) % 10_000 != 0)
                .toSet();
        Verify.assertSize(999_800, set);
        ParallelIterable.toSet() uses a concurrent set. No combination step. No preserving order.
  • 86. Hand coded Parallel / Lazy
        MutableList<Integer> list = this.integersGSC
                .asParallel(this.executorService, BATCH_SIZE)
                .select(integer -> integer % 10_000 != 0
                        && (Integer.valueOf(String.valueOf(integer)) + 1) % 10_000 != 0)
                .toList();
        Verify.assertSize(999_800, list);
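The same hand-fusing can be tried with Java Streams. In this sketch the four stacked stages are collapsed into a single `filter`, and the results of both forms are compared. The names `FusedPipeline`, `stacked`, `fused`, and `upTo` are hypothetical, and the input is assumed to be the integers from 1 to n, which is consistent with the expected size of 999,800 for one million elements.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical comparison of a stacked pipeline and its hand-fused equivalent.
final class FusedPipeline {
    // Four separate stages: filter, transform, transform, filter.
    static List<Integer> stacked(List<Integer> integers) {
        return integers.parallelStream()
                .filter(each -> each % 10_000 != 0)
                .map(String::valueOf)
                .map(Integer::valueOf)
                .filter(each -> (each + 1) % 10_000 != 0)
                .collect(Collectors.toList());
    }

    // The same logic folded into one predicate, so each element passes
    // through a single pipeline stage before collection.
    static List<Integer> fused(List<Integer> integers) {
        return integers.parallelStream()
                .filter(integer -> integer % 10_000 != 0
                        && (Integer.valueOf(String.valueOf(integer)) + 1) % 10_000 != 0)
                .collect(Collectors.toList());
    }

    // Assumed input: the integers 1..n.
    static List<Integer> upTo(int n) {
        return IntStream.rangeClosed(1, n).boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> integers = upTo(1_000_000);
        System.out.println(stacked(integers).size()); // prints 999800
        System.out.println(fused(integers).size());   // prints 999800
    }
}
```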
  • 87. Stacked computation [Chart: ops/s, higher is better. Series: Java 8, GS Collections, Scala. Categories: Serial Lazy, Parallel Lazy, Parallel hand-coded. Axis range 0 to 70. Badge: 8x.]
  • 89. Count: SAM method calls • Let’s take a closer look at both implementations of count() • Let’s assume that @FunctionalInterface method calls are costly and count them as we go • We’ll revisit this assumption
  • 90. Count: GS Collections
        java.lang.Thread.State: RUNNABLE
            at com.gs.collections.impl.block.procedure.CountProcedure.value(CountProcedure.java:47)
            at com.gs.collections.impl.list.mutable.FastList.forEach(FastList.java:623)
            at com.gs.collections.impl.utility.Iterate.forEach(Iterate.java:114)
            at com.gs.collections.impl.lazy.LazyIterableAdapter.forEach(LazyIterableAdapter.java:49)
            at com.gs.collections.impl.lazy.AbstractLazyIterable.count(AbstractLazyIterable.java:461)
            at com.gs.collections.impl.jmh.CountTest.serial_lazy_gsc(CountTest.java:302)
        • Execution of the lazy evaluation • Executed once per element • We’ll look for @FunctionalInterface method calls here
  • 91. Count: GS Collections Grand total of 2 @FunctionalInterface method calls
  • 92. Count: Java 8
        java.lang.Thread.State: RUNNABLE
            at java.lang.Long.sum(Long.java:1587)
            at java.util.stream.LongPipeline$$Lambda$3.887750041.applyAsLong(Unknown Source:-1)
            at java.util.stream.ReduceOps$8ReducingSink.accept(ReduceOps.java:394)
            at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
            at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
            at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1359)
            at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
            at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
            at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
            at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
            at java.util.stream.LongPipeline.reduce(LongPipeline.java:438)
            at java.util.stream.LongPipeline.sum(LongPipeline.java:396)
            at java.util.stream.ReferencePipeline.count(ReferencePipeline.java:526)
            at com.gs.collections.impl.jmh.CountTest.serial_lazy_jdk(CountTest.java:278)
        • Execution of the pipeline • Executed once per element • We’ll look for @FunctionalInterface method calls here
  • 93. Count: Java 8 Grand total of 6 @FunctionalInterface method calls
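One rough way to look at the frames between a test and its Predicate is to capture a stack trace from inside the lambda. The sketch below is an illustration only: the class name `StackDepthSketch` and the method `captureInsidePredicate` are hypothetical, and it records the stack at the predicate rather than at the terminal `Long.sum` frame shown on the slide, so the frames will not match the slide exactly and will differ between JDK versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class StackDepthSketch {
    /**
     * Runs filter(...).count() over the list and returns the stack as seen
     * from inside the predicate on its first invocation.
     */
    static StackTraceElement[] captureInsidePredicate(List<Integer> integers) {
        AtomicReference<StackTraceElement[]> captured = new AtomicReference<>();
        integers.stream()
                .filter(each -> {
                    // Record the stack only once; later elements take the same path.
                    if (captured.get() == null) {
                        captured.set(new Throwable().getStackTrace());
                    }
                    return each % 2 == 0;
                })
                .count();
        return captured.get();
    }

    public static void main(String[] args) {
        List<Integer> integers = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            integers.add(i);
        }
        // Each printed java.util.stream frame is a layer the JIT has to see through.
        for (StackTraceElement frame : captureInsidePredicate(integers)) {
            System.out.println("at " + frame);
        }
    }
}
```

Frames whose method is `accept`, `test`, or `applyAsLong` are the candidates for the kind of single-abstract-method calls being counted here.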
  • 94. Count: Scala Scala implementation is similar to GS Collections Grand total of 2 @FunctionalInterface method calls
  • 95. @FunctionalInterface method calls • Why do we care about @FunctionalInterface method calls? • The JVM’s JIT compiler inlines short method bodies like our Predicates • The exact nature of the inlining has a dramatic impact on performance
  • 96. @FunctionalInterface method calls • JMH forks a new JVM for each test • During both stages of JIT compilation, this.predicate is our test Predicate • The JVM will perform monomorphic inlining public void value(T object) { if (this.predicate.accept(object)) { this.count++; } } Predicate from the test: each -> each % 2 == 0
  • 97. @FunctionalInterface method calls The dispatch algorithm in pseudo code if (this.predicate instanceof lambda$serial_lazy_gsc$1) { if (object % 2 == 0) { this.count++; } } else { [recompile] if (this.predicate.accept(object)) { this.count++; } }
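The guarded fast path in the pseudo code can be imitated in plain Java to make the shape of the check concrete. This is a hand-written emulation under stated assumptions, not what the JIT actually emits: `GuardedInlineSketch`, `profiledClass`, and `slowPathHits` are hypothetical names, `java.util.function.IntPredicate` stands in for the GS Collections Predicate, and the slow path simply makes the virtual call instead of triggering a recompile.

```java
import java.util.function.IntPredicate;

class GuardedInlineSketch {
    private final IntPredicate predicate;
    // Stands in for the JIT's type profile: the one class seen so far at the call site.
    private final Class<?> profiledClass;
    int count;
    int slowPathHits;

    GuardedInlineSketch(IntPredicate predicate, Class<?> profiledClass) {
        this.predicate = predicate;
        this.profiledClass = profiledClass;
    }

    void value(int object) {
        if (this.predicate.getClass() == this.profiledClass) {
            // Fast path: the body of each -> each % 2 == 0, written inline.
            if (object % 2 == 0) {
                this.count++;
            }
        } else {
            // Slow path: a real JIT would deoptimize and recompile here;
            // this sketch only records the miss and makes the virtual call.
            this.slowPathHits++;
            if (this.predicate.test(object)) {
                this.count++;
            }
        }
    }

    public static void main(String[] args) {
        IntPredicate evens = each -> each % 2 == 0;
        IntPredicate odds = each -> each % 2 == 1;

        GuardedInlineSketch matching = new GuardedInlineSketch(evens, evens.getClass());
        GuardedInlineSketch mismatched = new GuardedInlineSketch(odds, evens.getClass());
        for (int i = 0; i < 10; i++) {
            matching.value(i);
            mismatched.value(i);
        }
        System.out.println("matching: count=" + matching.count
                + " slowPathHits=" + matching.slowPathHits);
        System.out.println("mismatched: count=" + mismatched.count
                + " slowPathHits=" + mismatched.slowPathHits);
    }
}
```

The point of the emulation is only that the type check is cheap while it keeps succeeding; once a different class shows up, every call pays for the fallback until the code is recompiled.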
  • 98. @FunctionalInterface method calls • When a second Predicate class reaches the call site, the next recompilation will result in bimorphic inlining • When a third class reaches it, the recompilation after that will result in megamorphic method dispatch • Classic table lookup and jump • In other words, no inlining • Dramatic performance penalty for fast methods like count()
  • 99. Megamorphic method dispatch How do we trigger megamorphic deoptimization? @Setup(Level.Trial) public void setUp_megamorphic() { long evens = this.integersJDK.stream().filter(each -> each % 2 == 0).count(); Assert.assertEquals(SIZE / 2, evens); long odds = this.integersJDK.stream().filter(each -> each % 2 == 1).count(); Assert.assertEquals(SIZE / 2, odds); long evens2 = this.integersJDK.stream().filter(each -> (each & 1) == 0).count(); Assert.assertEquals(SIZE / 2, evens2); } This is something that JMH does not handle for you!
  • 100. Megamorphic Count [Bar chart: ops/s (higher is better), Serial Lazy vs. Megamorphic Serial Lazy, for Java 8, GS Collections, and Scala; annotated “8x”]
  • 101. Megamorphic method dispatch • Why force megamorphic deoptimization? • Some implementations will have extra virtual method calls (@FunctionalInterface method calls) • Microbenchmarks aren’t realistic, but which is more realistic (or less unrealistic)? • You will trigger this deoptimization in normal production code, as soon as there is more than one call to this API anywhere in the executed code
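The production scenario can be approximated outside JMH with a single shared helper that several callers pass different lambdas to. The sketch below is a naive timing loop and not a benchmark: `SharedCallSiteSketch`, `count`, and `timeMillis` are hypothetical names, the helper is a hand-rolled stand-in for a library method such as `filter(...).count()`, and the measured numbers depend on the JVM version, flags, and hardware, so no particular ratio should be expected.

```java
import java.util.function.IntPredicate;

class SharedCallSiteSketch {
    // The one call site every caller shares: predicate.test(value).
    static int count(int[] values, IntPredicate predicate) {
        int count = 0;
        for (int value : values) {
            if (predicate.test(value)) {
                count++;
            }
        }
        return count;
    }

    // Naive wall-clock timing; use JMH for anything that matters.
    static long timeMillis(int[] values, IntPredicate predicate, int rounds) {
        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            total += count(values, predicate);
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (total < 0) {
            throw new IllegalStateException("unreachable; keeps total live");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int[] values = new int[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        IntPredicate evens = each -> each % 2 == 0;
        IntPredicate odds = each -> each % 2 == 1;
        IntPredicate evensByMask = each -> (each & 1) == 0;

        timeMillis(values, evens, 200); // warm-up: the call site has seen one class
        System.out.println("one class seen:   " + timeMillis(values, evens, 200) + " ms");

        // Two other "callers" elsewhere in the program use the same helper.
        timeMillis(values, odds, 200);
        timeMillis(values, evensByMask, 200);
        System.out.println("three classes seen: " + timeMillis(values, evens, 200) + " ms");
    }
}
```

The second measurement runs the same lambda as the first; the only difference is that other classes have since passed through the shared call site, which is the situation the `@Setup` method on the earlier slide creates deliberately.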