Spark Summit EU talk by Simon Whitear


SPARK SUMMIT EUROPE 2016

Sparklint: a Tool for Identifying and Tuning Inefficient Spark Jobs Across Your Cluster

Simon Whitear, Principal Engineer @ Groupon

Why Sparklint?
• A successful Spark cluster grows rapidly
• Capacity and capability mismatches arise
• This leads to resource contention
• The tuning process is non-trivial
• The current Spark UI is operational in focus

We wanted to understand application efficiency

Sparklint provides:
• Live view of batch & streaming application stats, or
• Event-by-event analysis of historical event logs (see the sketch below)
• Stats and graphs for:
  – Idle time
  – Core usage
  – Task locality
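To make the event-log analysis concrete, here is a toy sketch of the kind of statistics involved: an event-by-event pass over a Spark event log that sums task run time (a proxy for core usage) and tallies task locality levels. This is not Sparklint's own code, and the JSON field names ("Task Info", "Launch Time", "Finish Time", "Locality") are assumed from the standard Spark event-log format; check them against a real log.

```scala
import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Toy event-by-event pass over a Spark event log: sum task run time
// (a proxy for core usage) and count task locality levels.
// Illustration of the idea only, not Sparklint's implementation.
object EventLogSketch {
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    var taskMillis = 0L
    var locality   = Map.empty[String, Long].withDefaultValue(0L)

    for (line <- Source.fromFile(args(0)).getLines()) {
      val json = parse(line)
      if ((json \ "Event").extract[String] == "SparkListenerTaskEnd") {
        val info = json \ "Task Info"
        taskMillis += (info \ "Finish Time").extract[Long] -
                      (info \ "Launch Time").extract[Long]
        val level = (info \ "Locality").extract[String]
        locality = locality.updated(level, locality(level) + 1L)
      }
    }

    println(s"Total task time (core-ms): $taskMillis")
    locality.foreach { case (l, n) => println(s"$l: $n") }
  }
}
```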

Sparklint Listener:

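The listener hooks in through Spark's standard spark.extraListeners mechanism, so only configuration is needed. A minimal sketch follows, assuming the listener class is com.groupon.sparklint.SparklintListener as described in the project README; verify the class name and the matching package coordinate for your Spark version at https://github.com/groupon/sparklint.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Attach Sparklint's listener so live stats are collected while the app runs.
// The class name below is assumed from the Sparklint README; confirm it in the repo.
val conf = new SparkConf()
  .setAppName("access-log-analysis")
  .set("spark.extraListeners", "com.groupon.sparklint.SparklintListener")

val sc = new SparkContext(conf)
```

The same value can also be supplied at submit time with --conf spark.extraListeners=... on spark-submit.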

Sparklint Server:

Demo…
• Simulated workload analyzing site access logs (a Scala sketch follows the list):
  – read text file as JSON
  – convert to Record(ip, verb, status, time)
  – countByIp, countByStatus, countByVerb
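A minimal sketch of that workload, assuming JSON field names that match the Record fields and an illustrative input path (the slides only give the Record shape), with countByValue standing in for the three counts:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// One record per access-log line, as described on the slide.
case class Record(ip: String, verb: String, status: Int, time: Long)

object AccessLogDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("site-access-log-demo"))

    // Read the text file of JSON lines and convert each line to a Record.
    // The input path and JSON field names are assumptions for illustration.
    val records = sc.textFile("hdfs:///logs/access/*.json").map { line =>
      implicit val formats: Formats = DefaultFormats
      val j = parse(line)
      Record(
        (j \ "ip").extract[String],
        (j \ "verb").extract[String],
        (j \ "status").extract[Int],
        (j \ "time").extract[Long])
    }
    records.cache()

    // The three aggregations from the demo.
    val countByIp     = records.map(_.ip).countByValue()
    val countByStatus = records.map(_.status).countByValue()
    val countByVerb   = records.map(_.verb).countByValue()

    println(s"IPs: ${countByIp.size}, statuses: ${countByStatus.size}, verbs: ${countByVerb.size}")
  }
}
```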

Job took 10m7s to finish

Already a pretty good distribution; low idle time indicates good worker usage and minimal driver-node interaction in the job

But overall utilization is low, which is reflected in the common occurrence of the IDLE state (unused cores)

Job took 15m14s to finish

Core usage increased and the job is more efficient; execution time increased, but the app is not CPU bound

Job took 9m24s to finish

Core utilization decreased proportionally, trading execution time for efficiency

Lots of IDLE state shows we are over-allocating resources

Job took 11m34s to finish

Dynamic allocation is only effective at app start due to the long executorIdleTimeout setting

Core utilization remains low; the config settings are not right for this workload.
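The behaviour described here comes down to a handful of standard Spark dynamic-allocation settings. The slides do not give the exact values used in the demo runs, so the numbers below are illustrative assumptions.

```scala
import org.apache.spark.SparkConf

// Standard Spark dynamic-allocation settings (values here are illustrative).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  // A long idle timeout keeps the executors acquired at app start for the whole
  // run, so allocation effectively only adjusts once, at startup; a very short
  // timeout releases executors between short tasks and causes the churn seen
  // in the 33m5s run below.
  .set("spark.dynamicAllocation.executorIdleTimeout", "600s")
```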

Job took 33m5s to finish

Core utilization is up, but execution time is up dramatically due to reclaiming resources before each short-running task.

IDLE state is reduced to a minimum, so it looks efficient, but execution is much slower due to dynamic allocation overhead

Executor churn!

Job took 7m34s to finish

Core utilization way up, with lower execution time

Flat tops show we are becoming CPU bound

Parallel execution is clearly visible in overlapping stages
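One way to get stages overlapping like this (an assumption about the demo app, since the slides don't show its driver code) is to submit the three independent counts as concurrent jobs from the driver; records here is the cached RDD from the workload sketch above.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Submit the three independent aggregations concurrently so their stages can
// run at the same time and fill the available cores.
// `records` is the cached RDD[Record] built in the earlier sketch.
val fIp     = Future(records.map(_.ip).countByValue())
val fStatus = Future(records.map(_.status).countByValue())
val fVerb   = Future(records.map(_.verb).countByValue())

val byIp     = Await.result(fIp, Duration.Inf)
val byStatus = Await.result(fStatus, Duration.Inf)
val byVerb   = Await.result(fVerb, Duration.Inf)
```

Setting spark.scheduler.mode=FAIR lets the concurrent jobs share executors more evenly, although the default FIFO scheduler will also overlap them whenever cores are free.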

Job took 5m6s to finish

Core utilization decreases, trading execution time for efficiency again here

Thanks to dynamic allocation, the utilization is high despite this being a bimodal application

Data loading and mapping require a large core count to get throughput

Aggregation and IO of the results are optimized for the end file size and therefore require fewer cores
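One common way to express that second, lighter phase (an illustrative assumption, not taken from the talk) is to shrink parallelism before writing, for example with coalesce, so only as many tasks run as output files are wanted:

```scala
// Aggregate at full parallelism, then coalesce so the write phase uses only a
// handful of tasks/cores, sized for the desired output file count.
// `records` is the cached RDD[Record] from the earlier sketch; the output path
// and partition count are illustrative.
val countsByIp = records.map(r => (r.ip, 1L)).reduceByKey(_ + _)
countsByIp.coalesce(4).saveAsTextFile("hdfs:///output/count-by-ip")
```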

Future Features:
• Increased job & stage detail in UI
• History Server event sources
• Inline recommendations
• Auto-tuning
• Streaming stage parameter delegation

The Credit:
• Lead developer is Robert Xue
• https://github.com/roboxue
• SDE @ Groupon

Contribute! Sparklint is OSS:

https://github.com/groupon/sparklint

SPARK SUMMIT EUROPE 2016

THANK YOU. swhitear@groupon.com
