32
Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin @rxin Spark Conference Japan Feb 8, 2016

Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Project TungstenBringing Spark Closer to Bare Metal

Reynold Xin @rxinSpark Conference JapanFeb 8, 2016

Page 2: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Hardware Trends

Storage

Network

CPU

Page 3: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Hardware Trends

2010

Storage 50+MB/s(HDD)

Network 1Gbps

CPU ~3GHz

Page 4: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Hardware Trends

2010 2016

Storage 50+MB/s(HDD)

500+MB/s(SSD)

Network 1Gbps 10Gbps

CPU ~3GHz ~3GHz

Page 5: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Hardware Trends

2010 2016

Storage 50+MB/s(HDD)

500+MB/s(SSD) 10X

Network 1Gbps 10Gbps 10X

CPU ~3GHz ~3GHz L

Page 6: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

On the flip side

Spark IO has been optimized• Reduce IO by pruning input data that is not needed• New shuffle and network implementations (2014 sort record)

Data formats have improved• E.g. Parquet is a “dense” columnar format

���

Spark ÆIOÇV� ³Õº

��Ä��ôĖï×I�¶Ô±ÂÅÑÔ IO Æ�a

Pµ¨éċòþĐÂøòõēĖãÆ3� (2014:îĖõ��)

Parquet Ç¥4¦ÄàĎĈ÷þÞĖĆòõ

ôĖïþÞĖĆòõ¬L(³Õº

Page 7: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Goals of Project Tungsten

Substantially improve the memory and CPU efficiency of Spark backend execution and push performance closer to the limits of modern hardware.

Note the focus on “execution” not “optimizer”: very easy to pick broadcast join that is 1000X faster than Cartesian join, but hard to optimize broadcast join to be an order of magnitude faster.

Project Tungsten ÆçĖĐ

Spark Æ¥ûòãÝĔöÆ3�¦ÆĉĊďÂCPU�g×3�nÅL(¶Ô±Â¤?~×ĊðĔÄúĖöÛÜØÆ�lÅ�¿°Ô±Â

¥V� ¦ÁÇį¥3�¦Å_oµÀ¨Ô±ÂÅ_CĘbroadcast join ×Cartesian joinÑÓ 1000�S¯¶ÔÆÇÂÀÐy!»¬¤broadcast join ×åï�¨ÅS¯ÄÔÑ©V� ¶ÔÆÇ�µ¨

Page 8: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Phase 1Foundation

Memory ManagementCode Generation

Cache-aware Algorithms

Phase 2Order-of-magnitude Faster

Whole-stage CodegenVectorization

v¼�¬Ó

ĉĊďwi

æĖöjD

(úĖöÛÜØ)âċòéČ×C�µºØĐçďìĈ

ĂãõĐ

åï�¨Æ£�

�ëóĖêæĖöjD

Page 9: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Phase 1Laying The Foundation

Page 10: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Optimized data representations

Java objects have two downsides:• Space overheads• Garbage collection overheads

Tungsten sidesteps these problems by performing its own manual memory management.

V� ³ÕºôĖï�h

Java ßÿêÜãõÅÇ2¾ÆĆÚ÷롬§Ôĕu�ÆßĖûĖāòö

ĕáĂĖêæđãéčĔÆßĖûĖāòö

Tungsten DZÕÒÆ'¢×¤f�ÆĉĊďwi×�©±ÂÁ)�¶Ô

Page 11: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

The overheads of Java objects

“abcd”Native: 4 bytes with UTF-8 encodingJava: 48 bytes

java.lang.String object internals:OFFSET SIZE TYPE DESCRIPTION VALUE

0 4 (object header) ...4 4 (object header) ...8 4 (object header) ...12 4 char[] String.value []16 4 int String.hash 020 4 int String.hash32 0

Instance size: 24 bytes (reported by Instrumentation API)

12 byte object header

8 byte hashcode20 bytes of overhead + 8 bytes for chars

Java ßÿêÜãõÆßĖûĖāòö

øÚóÙÿĘUTF-8ÝĔæĖôÙĔäÁ4ûÚõJava: 48ûÚõ

12ûÚõĘßÿêÜãõāòð

8ûÚõĘúòéČæĖö

ÚĔëïĔëèÚì: 24ûÚõ (Instrumentation API ÅÑÔ)

20ûÚõÆßĖûĖāòö+ 8ûÚõÆN0

Page 12: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

sun.misc.Unsafe

JVM internal API for directly manipulating memory without safety checks (hence “unsafe”)We use this API to build data structures in both on- and off-heap memory

Data structures with

pointers

Flat data structures

Complexexamples

2�?ÆñÜòã×�Ö·ÅpJnÅĉĊď×K¶Ô JVM �� API (¶ÄÖ¼ “unsafe”)on-  off-heap ĉĊďÆ�QÁôĖïX�×ýĐö¶Ô�űÆAPI×�kµÀ¨Ô

þĎòõÄôĖïX�ąÚĔï×�HµºôĖïX�

���

Page 13: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Code generationGeneric evaluation of expression logic is very expensive on the JVM

– Virtual function calls– Branches based on expression type– Object creation due to primitive

boxing– Memory consumption by boxed

primitive objectsGenerating custom bytecode can eliminate these overheads

9.33

9.36

36.65

Hand written

Code gen

Interpreted Projection

Evaluating “a + a + a”(query time in seconds)

�hĒêòãÆ^knÄ� Ç JVM �ÁÇ 9Åæëõ¬£¨

+�hÅ,¿¯ÿĎĔñ

ĀďćóÙÿÆĄãéĔäÅ�*¶ÔßÿêÜãõÆjD

�B�M%É�µ

ĀďćóÙÿץãéĔ䵺ßÿêÜãõÅÑÔĉĊďÆ`�

àëïĈûÚõæĖöÆjDÅÑÓ±ÕÒÆßĖûĖāòö×I�Á­Ô

ÚĔïĀďï5=

æĖöjD

EU­

“a + a + a” Æ� ('¨$Ö¸T�(s))

Page 14: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Code generationTungsten uses the Janino compiler to reduce code generation time

Spark 1.5 added ~100 UDF’s with code gen:AddMonthsArrayContainsAsciiBase64BinCheckOverflowCombineSetsContainsCountSetCrc32DateAdd

DateDiffDateFormatClassDateSubDayOfMonthDayOfYearDecodeEncodeEndsWithExplodeFactorialFindInSetFormatNumberFromUTCTimestamp

FromUnixTimeGetArrayItemGetJsonObjectGetMapValueHexInSetInitCapIsNaNIsNotNullIsNullLastDayLengthLevenshtein

LikeLowerMakeDecimalMd5MonthMonthsBetweenNaNvlNextDayNotPromotePrecisionQuarterRLikeRound

SecondSha1Sha2ShiftLeftShiftRightShiftRightUnsignedSortArraySoundExStartsWithStringInstrStringRepeatStringReverseStringSpace

StringSplitStringTrimStringTrimLeftStringTrimRightTimeAddTimeSubToDateToUTCTimestampTruncDateUnBase64UnhexUnixTimestamp

Tungsten ÇæĖöÆjDT�×aÒ¶ºÏÅ Janino æĔüÚĎ×�k¶ÔSpark 1.5 Ç code gen Á ~100 Æ UDF ×��µºĘ

Page 15: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

AlphaSort Cache-friendly SortingIdea: minimize random memory accesses

• Store prefixes of sort keys inside the sort pointer array• Short-circuit by comparing prefixes first• Swap only pointers and prefixes

pointer record

key prefix pointer record

Naïve layout

Cache friendly layout

AlphaSortâċòéČþđĔöďĖÄîĖõ

ØÚôØĘ ĎĔðĈĉĊďØãíëÆV6

!{ÄđÚØÛõ

âċòéČþđĔöďĖÄđÚØÛõ

ĕîĖõÆąÚĔï��Æ�ÅîĖõâĖÆĀđþÙòãë×Wz

ĕ�ÅĀđþÙòãë×]�¶Ô±ÂÅÑÔq}� (éčĖõèĖâòõ)ĕąÚĔïÂĀđþÙòãëÆÎÆëēòĀ

Page 16: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Phase 2Order-of-magnitude Faster

Page 17: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Scan

Filter

Project

Aggregate

select count(*) from store_saleswhere ss_item_sk = 1000

Page 18: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Volcano Iterator Model

Standard for 30 years: almost all databases do it

Each operator is an “iterator” that consumes records from its input operator

class Filter {…def next(): Boolean = {var found = falsewhile (!found && child.next()) {

found = predicate(child.fetch())}return found

}

def fetch(): InternalRow = {child.fetch()

}…

}

30:�YbË̶ÊÀÆôĖïĂĖë¬�½À¨Ô

#ßăđĖïÇ��ßăđĖï«ÒÆđæĖö×`�¶Ô¥ÚóđĖï¦

Page 19: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Scan

Filter

Project

Aggregate

data flow

Scan

Filter

Project

Aggregate

call stack to process a rowôĖïþĒĖ ��×�i¶ÔºÏÆæĖĐëïòã

Page 20: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Downside of the Volcano Model

1. Too many virtual function callso at least 3 calls for each row in Aggregate

2. Extensive memory accesso “row” is a small segment in memory (or in L1/L2/L3 cache)

3. Can’t take advantage modern CPU featureso SIMD, pipelining, prefetching …

�B�M%É�µ¬.¶®Ô

AggregateÁ7įÂÐ3)Æ%É�µ

ĉĊďØãíë¬;x

¥�¦ÇĉĊď(ͺÇL1/L2/L3âċòéČ)�Æ6³ÄOe

ĊðĔÄCPUÆZ~Æ@A¬"°ÒÕĨSIMD¤üÚĀĎÚĔ¤ĀďþÜòñ¤ĕĕĕ

ĄĐåĖùĊôĐÆ[c

Page 21: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

What if we hire a college freshman toimplement this query in Java in 10 mins?

select count(*) from store_saleswhere ss_item_sk = 1000

var count = 0for (ss_item_sk in store_sales) {

if (ss_item_sk == 1000) {count += 1

}}

е10��Á Java Á±ÆãÝď×3�¶ÔºÏŤ/11:j×�½ºÒę

Page 22: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Volcano model30+ years of database research

college freshmanhand-written code in 10 minsvs

ĄĐåĖùĊôĐ

30:��ÆôĖïĂĖëÆrt/11:j

10��ÆEU­ÆæĖö

Page 23: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Volcano 13.95 millionrows/sec

collegefreshman

125 millionrows/sec

Note: End-to-end, single thread, single column, and data originated in Parquet on disk

�����

/11:j

High throughput£ëĐĖĀòõ

Page 24: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

How does a student beat 30 years of research?

Volcano1. Many virtual function calls

2. Data in memory (or cache)

3. No loop unrolling, SIMD, pipelining

hand-written code1. No virtual function calls

2. Data in CPU registers

3. Compiler loop unrolling, SIMD, pipelining

éµÀ1j¬30:ÆrtÅ�½ºÆ«ę

/�Æ�B�M%É�µ

ôĖï×ĉĊďÅ(е¯ÇâċòéČÅ)

ĐĖĀ8�¤SIMD¤üÚĀĎÚĔ¬d¨

�B�M%É�µ¬Ä¨

ôĖï× CPU ÆđêëïÅ

æĔüÚĎÅÑÔĐĖĀ8�¤SIMD¤üÚĀĎÚĔ

Page 25: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Whole-stage Codegen

Fusing operators together so the generated code looks like hand optimized code:

- Data stays in registers, rather than cache/memory- Minimizes virtual function calls- Compilers can unroll loops and use SIMD instructions

�ëóĖê æĖöjD

�MÆßăđĖï×�GÁF¨|$¶Ô±ÂÁ¤jD³ÕºæĖö¬E�ÁV� ³ÕºæĖöÆÑ©ÅΪÔ

-ôĖïÇâċòéČ / ĉĊďÑÓÐđêëïÆ�Å\Ô-�B�M%É�µ×V6 ¶Ô- æĔüÚϬĐĖĀ×8�¶Ô±Â¬Á­¤SIMD&�×�©

Page 26: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Scan

Filter

Project

Aggregate

var count = 0for (ss_item_sk in store_sales) {if (ss_item_sk == 1000) {

count += 1}

}

Whole-stage Codegen: Spark as a “Compiler”�ëóĖê æĖöjD Ę ¥æĔüÚϦµÀÆ Spark

Page 27: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

But there are things we can’t fuse

Complicated I/O• CSV, Parquet, ORC, …

External integrations• Python, R, scikit-learn, TensorFlow, etc

Don’t want to compile a CSV reader inline for every query!

µ«µ|$³¸Ô±Â¬Á­Ä¨ÐƬ§Ô

��Ä I/Oĕ CSV, Parquet, ORC, …

-�ÂÆÚĔóäđĖéčĔ

ĕPython, R, scikit-learn, TensorFlow, ÄÃ

ãÝď²ÂÅ CSV reader ×æĔüÚеº¯Ä¨ė

Page 28: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Vectorization

In-memoryRow Format

1 john 4.1

2 mike 3.5

3 sally 6.4

1 2 3

john mike sally

4.1 3.5 6.4

In-memoryColumn Format

ĂãõĐ

ĉĊď�Æ�þÞĖĆòõ ĉĊď�Æ�þÞĖĆòõ

Page 29: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Why Vectorization?

1. Modern CPUs are better at doing one thing over and over again (rather than N different things over and over again)

2. Most high-performance external systems are already columnar (numpy, TensorFlow, Parquet) so it is easier to integrate with

ĹĂãõĐ ę

ĊðĔÄ CPU ÇȾƱÂ×<Ð<Ð�©Æ¬>C (N�ÆmÄÔ±Â×<Ð<Ð�©ÑÓÐ)

úÚüþÞĖĆĔëÄ-�éëóĈÆ.¯Ç¤RÅàĎĈ÷ (numpy, TensorFlow, Parquet) Á§ÔºÏ¤ÚĔóäđĖõ¶ÔƬÑÓy!

Page 30: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Parquet 11 millionrows/sec

Parquetvectorized

90 millionrows/sec

Note: End-to-end, single thread, single column, and data originated in Parquet on disk

High throughput£ëĐĖĀòõ

Page 31: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

Phase 1Spark 1.4 - 1.6

Memory ManagementCode Generation

Cache-aware Algorithms

Phase 2Spark 2.0+

Whole-stage CodegenVectorization

ĉĊďwi

æĖöjD

âċòéČ×C�µºØĐçďìĈ

�ëóĖê æĖöjD

ĂãõĐ

Page 32: Project Tungsten Bringing Spark Closer to Bare Metalrxin.github.io/talks/2016-02-08_tungsten_ja.pdf · 2/8/2016  · Project Tungsten Bringing Spark Closer to Bare Metal Reynold Xin

§Ó¬Â©²´¨Íµº@rxin