Scala The language for Big Data Tzach Zohar @ Kenshoo, March/2016

BDX 2016 - Tzach zohar @ kenshoo

Page 1: BDX 2016 - Tzach zohar  @ kenshoo

Scala The language for Big Data

Tzach Zohar @ Kenshoo, March/2016

Page 2: BDX 2016 - Tzach zohar  @ kenshoo

Who am I

System Architect @ Kenshoo

Java backend for 10 years

Working with Scala + Spark for 2 years

https://www.linkedin.com/in/tzachzohar

Page 3: BDX 2016 - Tzach zohar  @ kenshoo

Who’s Kenshoo

10-year-old, Tel Aviv-based startup

500+ employees

Industry Leader in Digital Marketing

Heavy data shop

http://kenshoo.com/

Page 4: BDX 2016 - Tzach zohar  @ kenshoo

And who are you?

Page 5: BDX 2016 - Tzach zohar  @ kenshoo

Agenda

NOT the usual Scala pitch

Page 6: BDX 2016 - Tzach zohar  @ kenshoo

Scala - Short Intro

Page 7: BDX 2016 - Tzach zohar  @ kenshoo

Scala

Created by Martin Odersky and his research group at EPFL, 2003

Open Source

Runs on the JVM, Seamless Java Interoperability

Strongly Typed

Object Oriented

Functional

Page 8: BDX 2016 - Tzach zohar  @ kenshoo

Functional Programming

From Wikipedia:

“[…] functional programming […] treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data”

Page 9: BDX 2016 - Tzach zohar  @ kenshoo

Functional Languages

Functions are “first-class citizens”, i.e. values

Higher-Order Functions

Minimize side effects

Minimize mutability
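
The four points above can be sketched in a few lines of Scala. This is a minimal illustration, not taken from the talk; the names `double`, `twice`, and `quadruple` are made up for the example.

```scala
object FirstClassFunctions {
  // A function is a value: it can be stored in a val like any other data
  val double: Int => Int = n => n * 2

  // A higher-order function: takes a function and returns a new function
  def twice(f: Int => Int): Int => Int = n => f(f(n))

  val quadruple: Int => Int = twice(double)

  // No mutation and no side effects: map returns a new list, the original is untouched
  val original: List[Int] = List(1, 2, 3)
  val mapped: List[Int] = original.map(quadruple)
}
```

Here `mapped` is `List(4, 8, 12)` while `original` still holds `List(1, 2, 3)`.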

Page 10: BDX 2016 - Tzach zohar  @ kenshoo

Example: Imperative to Functional

Java / Imperative:

private static class Person {
    String firstName;
    String lastName;
}

private List<Person> firstNFamilies(int n, List<Person> persons) {
    final List<String> familiesSoFar = new LinkedList<>();
    final List<Person> result = new LinkedList<>();
    for (Person p : persons) {
        if (familiesSoFar.contains(p.lastName)) {
            result.add(p);
        } else if (familiesSoFar.size() < n) {
            familiesSoFar.add(p.lastName);
            result.add(p);
        }
    }
    return result;
}

Scala / Functional:

case class Person(firstName: String, lastName: String)

def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
  val firstFamilies = persons.map(p => p.lastName).distinct.take(n)
  persons.filter(p => firstFamilies.contains(p.lastName))
}
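
A quick way to try the functional version is to run it on a small list. The sketch below repeats the slide's definitions so it is self-contained; the wrapper object and the sample people are invented for illustration.

```scala
object FirstNFamiliesDemo {
  case class Person(firstName: String, lastName: String)

  // Same logic as the slide: take the first n distinct last names,
  // then keep every person whose last name is among them
  def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
    val firstFamilies = persons.map(_.lastName).distinct.take(n)
    persons.filter(p => firstFamilies.contains(p.lastName))
  }

  // Made-up sample data
  val persons: List[Person] = List(
    Person("Ada", "Smith"),
    Person("Bob", "Jones"),
    Person("Cy", "Smith"),
    Person("Di", "Brown")
  )

  // With n = 2 the first two families are Smith and Jones, so Di Brown is dropped
  val result: List[Person] = firstNFamilies(2, persons)
}
```

The input order is preserved, matching what the imperative loop produces for the same data.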

Page 11: BDX 2016 - Tzach zohar  @ kenshoo

Hey, won’t this look rather similar with Java8’s Streams + Lambdas?

Page 12: BDX 2016 - Tzach zohar  @ kenshoo

Scala + Big Data

Page 13: BDX 2016 - Tzach zohar  @ kenshoo

What can a language do for Big Data?

Page 14: BDX 2016 - Tzach zohar  @ kenshoo

Language “requirements”

Open Source

Strongly Typed

Java/JVM Friendly

Interactive

Performant

Abstracts “machinery” away

Page 15: BDX 2016 - Tzach zohar  @ kenshoo

Java Interoperability

class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext) extends ParquetOutputCommitter(outputPath, context) { … }

Java class from org.apache.parquet:parquet-hadoop

Scala class from org.apache.spark:spark-core_2.10
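
The same kind of interoperability can be tried without Spark or Parquet on the classpath. The sketch below is an illustrative stand-in that uses only JDK classes: a Scala class implements the Java interface `java.util.Comparator` and is handed to `java.util.Collections.sort`. The names `ByLength` and `sortedByLength` are made up.

```scala
import java.util.{ArrayList, Collections, Comparator}

object InteropSketch {
  // A Scala class implementing a plain Java interface
  class ByLength extends Comparator[String] {
    override def compare(a: String, b: String): Int =
      Integer.compare(a.length, b.length)
  }

  // Builds a Java list and sorts it with a Java API, passing in the Scala comparator
  def sortedByLength(words: String*): java.util.List[String] = {
    val list = new ArrayList[String]()
    words.foreach(w => list.add(w))
    Collections.sort(list, new ByLength)
    list
  }
}
```

Java code sees `ByLength` as an ordinary `Comparator<String>`; no adapter layer is involved.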

Page 16: BDX 2016 - Tzach zohar  @ kenshoo

Scala is Interactive

Scala has a built-in REPL, extensible by Scala-based tools

Page 17: BDX 2016 - Tzach zohar  @ kenshoo

Performant

Benchmarking languages is hard and suffers from bias

Most benchmarks show Scala is at least on-par with Java, e.g. Google’s benchmark:

Nonsense!

No Way!

RAGE!!11

Page 18: BDX 2016 - Tzach zohar  @ kenshoo

Does it matter?

Page 19: BDX 2016 - Tzach zohar  @ kenshoo

Performant

Ability to scale out is more significant than per-CPU performance

http://vmturbo.com/wp-content/uploads/2015/05/ScaleUpScaleOut_sm-min.jpg

Page 20: BDX 2016 - Tzach zohar  @ kenshoo

Abstracts “machinery” away - What?

Page 21: BDX 2016 - Tzach zohar  @ kenshoo

Think about MapReduce

With Hadoop’s Mapper and Reducer, you code what to do, not:

How

Where

In what order

How to handle failures

These concerns are left for the framework to figure out

Page 22: BDX 2016 - Tzach zohar  @ kenshoo

Think about MapReduce

Hadoop’s Java API imitates Functional Programming:

Mapper and Reducer are Functions

Executed by “Higher Order Functions”

No Side Effects / Mutability
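
To make the analogy concrete, here is a toy, single-machine sketch in plain Scala collections. It is not Hadoop's API: `runJob`, `wordMapper`, and `sumReducer` are hypothetical names. The "framework" is a higher-order function that owns grouping and ordering, while the user supplies only a pure mapper and a pure reducer.

```scala
object MiniMapReduce {
  // The "framework": decides how and in what order the work happens.
  // The caller only passes in two functions.
  def runJob[In, K, V](input: Seq[In])(mapper: In => Seq[(K, V)])(reducer: (V, V) => V): Map[K, V] =
    input
      .flatMap(mapper)                 // map phase: each record becomes key/value pairs
      .groupBy(_._1)                   // shuffle phase: group the pairs by key
      .map { case (key, pairs) =>      // reduce phase: fold the values of each key
        key -> pairs.map(_._2).reduce(reducer)
      }

  // User code: what to do with one line of text
  val wordMapper: String => Seq[(String, Int)] =
    line => line.split("\\s+").filter(_.nonEmpty).map(word => (word.toLowerCase, 1)).toSeq

  // User code: how to combine two values of the same key
  val sumReducer: (Int, Int) => Int = _ + _
}
```

For example, `MiniMapReduce.runJob(Seq("to be or", "not to be"))(MiniMapReduce.wordMapper)(MiniMapReduce.sumReducer)` counts "to" and "be" twice and "or" and "not" once. Because neither function touches shared state, a real framework is free to run them on different machines and in any order.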

Page 23: BDX 2016 - Tzach zohar  @ kenshoo

Abstracts “machinery” away - Why?

Page 24: BDX 2016 - Tzach zohar  @ kenshoo

Functional makes concurrency easy

val numbers = 1 to 100000
val result = numbers.map(slowF)

Page 25: BDX 2016 - Tzach zohar  @ kenshoo

Functional makes concurrency easy

val numbers = 1 to 100000
val result = numbers.par.map(slowF)

.par turns the range into a parallel collection, so the operations that follow run across the available CPU cores
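
On Scala 2.13 and later, `.par` lives in the separate scala-parallel-collections module rather than the standard library. As a stdlib-only sketch of the same idea, the example below swaps in `Future.traverse` instead of parallel collections. `slowF` here is a hypothetical stand-in for the slide's function, and `parallelMap` is a made-up helper name.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelMapSketch {
  // Hypothetical stand-in for slowF: a pure function with some artificial latency
  def slowF(n: Int): Int = {
    Thread.sleep(1)
    n * 2
  }

  // Applies slowF to every element on the global thread pool.
  // Future.traverse keeps the results in input order.
  def parallelMap(numbers: List[Int]): List[Int] = {
    val all: Future[List[Int]] = Future.traverse(numbers)(n => Future(slowF(n)))
    Await.result(all, 1.minute)
  }
}
```

Because `slowF` is pure, the parallel result equals the sequential `numbers.map(slowF)`; only the wall-clock time differs.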

Page 26: BDX 2016 - Tzach zohar  @ kenshoo

Functional makes distribution easy

val numbers = 1 to 100000
val result = sparkContext.parallelize(numbers).map(slowF)

Parallelizes the operations that follow over a scalable cluster by creating a Spark RDD (Resilient Distributed Dataset)

Page 27: BDX 2016 - Tzach zohar  @ kenshoo

“Spark RDDs are the ultimate Scala collections"

-  Martin Odersky

photo: http://www.swissict-award.ch/fileadmin/award/Pressebilder/Martin_Odersky_Scala.jpg

Page 28: BDX 2016 - Tzach zohar  @ kenshoo

Functional makes Resiliency easy

Pure functions are deterministic and free of side effects, so re-running one is harmless: a failed task can simply be retried

[Diagram: several Map tasks running side by side, one of them marked "(retry)"]
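
A minimal sketch of why purity makes retries safe, using only the standard library. Everything here is hypothetical: `retry` is a made-up helper, and `FlakyWorker` simulates a node that fails twice before it succeeds. It is not how Spark or Hadoop implement task retries.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Try}

object RetrySketch {
  // Runs `task` up to `maxAttempts` times and returns the first success,
  // or the last failure. Blind re-execution is safe only if the task is pure.
  @tailrec
  def retry[A](maxAttempts: Int)(task: () => A): Try[A] =
    Try(task()) match {
      case Failure(_) if maxAttempts > 1 => retry(maxAttempts - 1)(task)
      case result                        => result
    }

  // Simulated unreliable worker: the first two calls throw, later calls succeed.
  // The failure counter belongs to the "infrastructure"; the mapper itself (n * n) is pure.
  final class FlakyWorker {
    private var calls = 0
    def run(n: Int): Int = {
      calls += 1
      if (calls <= 2) throw new RuntimeException("simulated node failure")
      n * n
    }
  }
}
```

With a fresh worker, `RetrySketch.retry(3)(() => worker.run(7))` yields `Success(49)` on the third attempt, the same value a failure-free run would have produced. With only two attempts allowed, the result is a `Failure`.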

Page 29: BDX 2016 - Tzach zohar  @ kenshoo

What if we always coded this way?

Page 30: BDX 2016 - Tzach zohar  @ kenshoo

A functional language means just that

Page 31: BDX 2016 - Tzach zohar  @ kenshoo

Language “requirements”

Open Source

Strongly Typed

Java/JVM Friendly

Interactive

Performant

Abstracts “machinery” away

Page 32: BDX 2016 - Tzach zohar  @ kenshoo

But “Scala is hard!”

Page 33: BDX 2016 - Tzach zohar  @ kenshoo
Page 34: BDX 2016 - Tzach zohar  @ kenshoo

It’s really not that scary...

From Manuel Bernhardt's "Debunking Some Myths About Scala And Its Environment":

“I need to become a mathematician and know all about Monads before I can get started”

“I can throw all of my object-orientation knowledge out of the window”

“There is no good IDE support for Scala”

Page 35: BDX 2016 - Tzach zohar  @ kenshoo

Thank You!