Let's Get to the Rapids
Understanding Java 8 Stream Performance

JavaDay Kyiv, November 2015
@mauricenaftalin

Maurice Naftalin

Developer, designer, architect, teacher, learner, writer

JavaOne Rock Star
Java Champion

Repeat offender: author of Java Generics and Collections (Java 5) and Mastering Lambdas (Java 8)

The Lambda FAQ
www.lambdafaq.org

@mauricenaftalin

Agenda

– Background

– Java 8 Streams

– Parallelism

– Microbenchmarking

– Case study

– Conclusions

Streams – Why?

• Bring functional style to Java

• Exploit hardware parallelism – “explicit but unobtrusive”

Streams – Why?

• Intention: replace loops for aggregate operations
• more concise, more readable, composable operations, parallelizable

instead of writing this:

    List<Person> people = …
    Set<City> shortCities = new HashSet<>();
    for (Person p : people) {
        City c = p.getCity();
        if (c.getName().length() < 4) {
            shortCities.add(c);
        }
    }

we're going to write this:

    List<Person> people = …
    Set<City> shortCities = people.stream()
        .map(Person::getCity)
        .filter(c -> c.getName().length() < 4)
        .collect(toSet());

and, to go parallel, simply change stream() to parallelStream():

    List<Person> people = …
    Set<City> shortCities = people.parallelStream()
        .map(Person::getCity)
        .filter(c -> c.getName().length() < 4)
        .collect(toSet());

Visualising Sequential Streams

[Diagram: values x0…x3 flow one at a time from the Source through Map and Filter (the intermediate operations) to the Reduction (the terminal operation) – "Values in Motion".]

Practical Benefits of Streams?

Functional style will affect (nearly) all collection processing

Automatic parallelism is useful, in certain situations
- but everyone cares about performance!

Parallelism – Why?

The Free Lunch Is Over

http://www.gotw.ca/publications/concurrency-ddj.htm

[Image: Intel Xeon E5 2600, 10-core]

Visualizing Parallel Streams

[Diagram: the source is split and values x0…x3 flow through copies of the pipeline on multiple cores simultaneously, with the partial results combined at the end.]

What to Measure?

How do code changes affect system performance?

Controlled experiment, production conditions
- difficult!

So: controlled experiment, lab conditions
- beware the substitution effect!

Microbenchmarking

Really hard to get meaningful results from a dynamic runtime:– timing methods are flawed

– System.currentTimeMillis() and System.nanoTime()– compilation can occur at any time– garbage collection interferes– runtime optimizes code after profiling it for some time– then may deoptimize it

– optimizations include dead code elimination

Microbenchmarking

Don't try to eliminate these effects yourself!

Use a benchmarking library
– Caliper
– JMH (Java Microbenchmark Harness)

Ensure your results are statistically meaningful

Get your benchmarks peer-reviewed
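For example, a minimal JMH benchmark for the shortCities pipeline from the earlier slides might look like this (a sketch: Person and City are the example types from those slides, and testData() is a hypothetical helper supplying the input):

    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @State(Scope.Benchmark)
    public class ShortCitiesBenchmark {

        List<Person> people;

        @Setup
        public void setup() {
            people = testData(100_000);   // hypothetical helper, not a real API
        }

        @Benchmark
        public Set<City> sequential() {
            return people.stream()
                         .map(Person::getCity)
                         .filter(c -> c.getName().length() < 4)
                         .collect(Collectors.toSet());
        }

        @Benchmark
        public Set<City> parallel() {
            return people.parallelStream()
                         .map(Person::getCity)
                         .filter(c -> c.getName().length() < 4)
                         .collect(Collectors.toSet());
        }
    }

JMH handles warmup, forking, and statistical reporting, which is exactly the machinery you shouldn't try to rebuild yourself.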

Case Study: grep -b

grep -b: "The offset in bytes of a matched pattern is displayed in front of the matched line."

rubai51.txt:

    The Moving Finger writes; and, having writ,
    Moves on: nor all thy Piety nor Wit
    Shall bring it back to cancel half a Line
    Nor all thy Tears wash out a Word of it.

    $ grep -b 'W.*t' rubai51.txt
    44:Moves on: nor all thy Piety nor Wit
    122:Nor all thy Tears wash out a Word of it.

Why Shouldn't We Optimize Code?

Because we don't have a problem
- No performance target!

Else there is a problem, but not in our process
- The OS is struggling!

Else there's a problem in our process, but not in the code
- GC is using all the cycles!

Else there's a problem in the code… somewhere
- now we can consider optimising!

grep -b: Collector combiner

[Diagram: each partial result holds (offset, line) pairs plus its total byte count: the left half is [(0, "The …"), (44, "Moves …")] with total 80, the right half is [(0, "Shall …"), (42, "Nor …")]. Combining adds the left-hand total (80) to every right-hand offset, giving (80, "Shall …") and (122, "Nor …").]

grep -b: Collector accumulator

[Diagram: the supplier creates an empty result; the accumulator appends each line tagged with the running byte count, then advances the count by the line's length: after "The moving … writ," the total is 44, so "Moves on: … Wit" is appended with offset 44.]

grep -b: Collector solution

[Diagram: as on the combiner slides, each partial result carries its lines, their offsets, and its byte total; when two are merged, the left-hand total (here 80) is added to every right-hand offset.]
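In code, a collector along these lines might look like this (a minimal sketch, assuming a result type that stores "offset:line" strings and a running byte total; the real solution would also filter for the pattern while still counting every line's bytes):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collector;

    class OffsetResult {
        final List<String> lines = new ArrayList<>();   // "offset:line" entries
        long bytes = 0;                                 // running byte total

        void add(String line) {
            lines.add(bytes + ":" + line);
            bytes += line.length() + 1;                 // +1 for the trailing '\n'
        }

        OffsetResult merge(OffsetResult right) {
            // O(n) in the right-hand result: every offset must be shifted
            for (String e : right.lines) {
                int colon = e.indexOf(':');
                long shifted = bytes + Long.parseLong(e.substring(0, colon));
                lines.add(shifted + e.substring(colon));
            }
            bytes += right.bytes;
            return this;
        }

        static Collector<String, OffsetResult, OffsetResult> grepB() {
            return Collector.of(OffsetResult::new, OffsetResult::add, OffsetResult::merge);
        }
    }

Used as reader.lines().collect(OffsetResult.grepB()); the loop in merge() is exactly the O(n) combiner cost discussed below.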

What's wrong?

• Possibly very little
  - overall performance comparable to Unix grep -b

• Can we improve it by going parallel?

Serial vs. Parallel

• The problem is a prefix sum – every element contains the sum of the preceding ones.
  - Combiner is O(n)

• The source is streaming IO (BufferedReader.lines())

• Amdahl's Law strikes:
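As a reminder (the standard statement of Amdahl's Law, not from the slides): if a fraction p of the work can be parallelized across N cores, the best possible speedup is 1 / ((1 − p) + p/N), which is capped at 1 / (1 − p) no matter how many cores you add. With an O(n) combiner and a serial IO source, the parallelizable fraction p here is uncomfortably small.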

A Parallel Solution for grep -b

Need to get rid of streaming IO – inherently serial
Parallel streams need splittable sources

Stream Sources

Implemented by a Spliterator

[Animation: the spliterator repeatedly splits its source in two so that separate threads can work on each part.]

LineSpliterator

[Diagram: the file ("The moving Finger … writ\n", "Moves … Wit\n", "Shall … Line\n", "Nor all thy … it\n") lives in a MappedByteBuffer. To split, the spliterator takes the midpoint of its coverage, scans forward to the next '\n', and splits there, so the new spliterator and the old one each cover only whole lines.]
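A sketch of the idea (simplified and illustrative, not the talk's actual code): coverage is a byte range of the MappedByteBuffer, and trySplit() hands off the half up to the first line boundary at or after the midpoint.

    import java.nio.MappedByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.Spliterator;
    import java.util.function.Consumer;

    class LineSpliterator implements Spliterator<String> {
        private final MappedByteBuffer buffer;
        private int lo;                 // first byte of coverage
        private final int hi;           // last byte of coverage

        LineSpliterator(MappedByteBuffer buffer, int lo, int hi) {
            this.buffer = buffer;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        public Spliterator<String> trySplit() {
            int mid = (lo + hi) >>> 1;
            while (mid < hi && buffer.get(mid) != '\n') mid++;   // next line boundary
            if (mid >= hi) return null;                          // no useful split
            Spliterator<String> prefix = new LineSpliterator(buffer, lo, mid);
            lo = mid + 1;                                        // keep the suffix
            return prefix;
        }

        @Override
        public boolean tryAdvance(Consumer<? super String> action) {
            if (lo > hi) return false;
            int end = lo;
            while (end <= hi && buffer.get(end) != '\n') end++;  // find line end
            byte[] bytes = new byte[end - lo];
            for (int i = 0; i < bytes.length; i++) bytes[i] = buffer.get(lo + i);
            action.accept(new String(bytes, StandardCharsets.UTF_8));
            lo = end + 1;                                        // step past the '\n'
            return true;
        }

        @Override
        public long estimateSize() { return hi - lo + 1; }       // bytes, a rough proxy

        @Override
        public int characteristics() { return ORDERED | NONNULL; }
    }

StreamSupport.stream(new LineSpliterator(buffer, 0, buffer.limit() - 1), true) would then produce a parallel stream of lines over the mapped file.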

Parallelizing grep -b

• Splitting action of LineSpliterator is O(log n)
• Collector no longer needs to compute the index
• Result (relatively independent of data size):
  - sequential stream ~2x as fast as the iterative solution
  - parallel stream >2.5x as fast as the sequential stream, on 4 hardware threads

When to go Parallel

The workload of the intermediate operations must be great enough to outweigh the overheads (~100µs):
– initializing the fork/join framework
– splitting
– concurrent collection

Often quoted as N x Q
– N = size of data set
– Q = processing cost per element

For example, 10,000 elements at 10µs each gives N x Q = 100ms, comfortably above the overhead; 10,000 trivial operations at 10ns each does not.

Intermediate Operations

Parallel-unfriendly intermediate operations:

stateful ones
– need to store some or all of the stream data in memory
– e.g. sorted()

those requiring ordering
– e.g. limit()

Collectors Cost Extra!

Performance depends on the accumulator and combiner functions:

• toList(), toSet(), toCollection() – performance normally dominated by the accumulator
• but allow for the overhead of managing multithreaded access to non-threadsafe containers for the combine operation
• toMap(), toConcurrentMap() – map merging is slow, and resizing maps, especially concurrent maps, is very expensive. Whenever possible, presize all data structures, maps in particular.
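Presizing can be done with the four-argument toMap overload that takes a map factory (a sketch; the expected size and the keep-first merge policy are illustrative):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.stream.Collectors;

    Map<String, Person> byName = people.stream()
        .collect(Collectors.toMap(
            Person::getName,                  // key
            p -> p,                           // value
            (first, second) -> first,         // merge policy for duplicate keys
            () -> new HashMap<>(1_000_000))); // presized up front, not grown by rehashing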

Parallel Streams and the Fork/Join Pool

Threads executing parallel streams are (all but one) drawn from the common Fork/Join pool

• Intermediate operations that block (for example on I/O) will prevent pool threads from servicing other requests
• The Fork/Join pool assumes by default that it can use all cores
  – maybe other thread pools (or other processes) are running?
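A commonly used workaround if you must isolate a parallel stream from the common pool (a sketch; it relies on an undocumented implementation detail, namely that a parallel stream started inside a Fork/Join task runs in that task's pool):

    import java.util.Set;
    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.Collectors;

    ForkJoinPool pool = new ForkJoinPool(4);   // dedicated pool, leaves cores for others
    Set<City> shortCities = pool.submit(() ->
            people.parallelStream()
                  .map(Person::getCity)
                  .filter(c -> c.getName().length() < 4)
                  .collect(Collectors.toSet()))
        .join();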

Parallel Streams in the Real World

Performance mostly doesn't matter

But if you must…
• sequential streams normally beat iterative solutions
• parallel streams can utilize all cores, providing
  - the data is efficiently splittable
  - the intermediate operations are sufficiently expensive and CPU-bound
  - there isn't contention for the processors

Conclusions

Resources

http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
http://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf
http://openjdk.java.net/projects/code-tools/jmh/

@mauricenaftalin