Scala The language for Big Data
Tzach Zohar @ Kenshoo, March/2016
Who am I
System Architect @ Kenshoo
Java backend for 10 years
Working with Scala + Spark for 2 years
https://www.linkedin.com/in/tzachzohar
Who’s Kenshoo
10-year-old, Tel Aviv-based startup
500+ employees
Industry Leader in Digital Marketing
Heavy data shop
http://kenshoo.com/
And who are you?
Agenda
NOT the usual Scala pitch
Scala - Short Intro
Scala
Created by Martin Odersky and his research group at EPFL, 2003
Open Source
Runs on the JVM, Seamless Java Interoperability
Strongly Typed
Object Oriented
Functional
Functional Programming
From Wikipedia:
“[…] functional programming […] treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data”
Functional Languages
Functions are “first-class citizens”, i.e. values
Higher-Order Functions
Minimize side effects
Minimize mutability
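The four points above can be illustrated with a minimal Scala sketch. The names `FunctionsAsValues`, `timesTwo` and `applyTwice` are made up for illustration and are not from the talk.

```scala
object FunctionsAsValues {
  // A function stored in a value: functions are "first-class citizens".
  val timesTwo: Int => Int = n => n * 2

  // A higher-order function: it takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // `map` is a higher-order function from the standard library.
  // It returns a new list and leaves `numbers` untouched (no mutation).
  val numbers: List[Int] = List(1, 2, 3)
  val doubled: List[Int] = numbers.map(timesTwo)
}
```

Calling `FunctionsAsValues.applyTwice(FunctionsAsValues.timesTwo, 3)` applies the function twice, and `numbers` still holds its original elements after `doubled` is computed.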
Example: Imperative to Functional
Java / Imperative:
private static class Person {
    String firstName;
    String lastName;
}

private List<Person> firstNFamilies(int n, List<Person> persons) {
    final List<String> familiesSoFar = new LinkedList<>();
    final List<Person> result = new LinkedList<>();
    for (Person p : persons) {
        if (familiesSoFar.contains(p.lastName)) {
            result.add(p);
        } else if (familiesSoFar.size() < n) {
            familiesSoFar.add(p.lastName);
            result.add(p);
        }
    }
    return result;
}
Scala / Functional:
case class Person(firstName: String, lastName: String)

def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
  val firstFamilies = persons.map(p => p.lastName).distinct.take(n)
  persons.filter(p => firstFamilies.contains(p.lastName))
}
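To see what the functional version returns, here is a small self-contained run. The wrapper object `FamiliesDemo` and the sample people are invented for illustration; only `Person` and `firstNFamilies` come from the slide.

```scala
object FamiliesDemo {
  case class Person(firstName: String, lastName: String)

  // Same logic as the slide: keep everyone who belongs to one of the
  // first n distinct family names, in their original order.
  def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
    val firstFamilies = persons.map(p => p.lastName).distinct.take(n)
    persons.filter(p => firstFamilies.contains(p.lastName))
  }

  // Hypothetical sample data.
  val sample: List[Person] = List(
    Person("Ann", "Smith"),
    Person("Bob", "Jones"),
    Person("Cid", "Smith"),
    Person("Dan", "Brown"),
    Person("Eve", "Jones")
  )
}
```

With `n = 2` the first two families are Smith and Jones, so `FamiliesDemo.firstNFamilies(2, FamiliesDemo.sample)` keeps Ann, Bob, Cid and Eve and drops Dan, which matches what the imperative loop would produce.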
Hey, wouldn’t this look rather similar with Java 8’s Streams + Lambdas?
Scala + Big Data
What can a language do for Big Data?
Language “requirements”
Open Source
Strongly Typed
Java/JVM Friendly
Interactive
Performant
Abstracts “machinery” away
Java Interoperability
class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)
  extends ParquetOutputCommitter(outputPath, context) { … }
ParquetOutputCommitter: Java class from org.apache.parquet:parquet-hadoop
DirectParquetOutputCommitter: Scala class from org.apache.spark:spark-core_2.10
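The same pattern, a Scala class extending a Java class, can be tried with nothing but the JDK. The sketch below is not Spark or Parquet code; `JavaInteropDemo` and `Squares` are hypothetical names, and `java.util.AbstractList` stands in for the Java base class.

```scala
import java.util.{AbstractList, ArrayList}

object JavaInteropDemo {
  // A Scala class extending a Java abstract class from the JDK.
  // It models the read-only list 0, 1, 4, 9, ... of the first n squares.
  class Squares(n: Int) extends AbstractList[Integer] {
    override def get(index: Int): Integer = Integer.valueOf(index * index)
    override def size(): Int = n
  }

  // The Scala-defined list is passed straight into a plain Java API.
  def copyToArrayList(n: Int): ArrayList[Integer] =
    new ArrayList[Integer](new Squares(n))
}
```

From Java's point of view `Squares` is just another `java.util.List`, so any Java library that accepts a `List` can consume it without adapters.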
Scala is Interactive
Scala has a built-in REPL, extensible by Scala-based tools
Performant
Benchmarking languages is hard and suffers from bias
Most benchmarks show Scala performing at least on par with Java, e.g. Google’s benchmark
Nonsense!
No Way!
RAGE!!11
Does it matter?
Performant
Ability to scale out is more significant than per-CPU performance
image: http://vmturbo.com/wp-content/uploads/2015/05/ScaleUpScaleOut_sm-min.jpg
Abstracts “machinery” away - What?
Think about MapReduce
With Hadoop’s Mapper and Reducer you code what to do, not:
How
Where
In what order
How to handle failures
These concerns are left for the framework to figure out
Think about MapReduce
Hadoop’s Java API imitates Functional Programming:
Mapper and Reducer are Functions
Executed by “Higher Order Functions”
No Side Effects / Mutability
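The analogy can be sketched with plain Scala collections. This is not Hadoop's API: `WordCountDemo`, `mapper`, `reducer` and `wordCount` are illustrative names, and the standard library's `flatMap`, `groupBy` and `map` play the role of the framework.

```scala
object WordCountDemo {
  // "Mapper": a pure function from one input line to (word, 1) pairs.
  def mapper(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // "Reducer": a pure function combining all counts seen for one word.
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Stand-in for the framework: higher-order functions decide how and
  // in what order the mapper and reducer are applied.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(mapper)
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => reducer(word, pairs.map(_._2)) }
}
```

Because `mapper` and `reducer` touch no shared state, `wordCount` is free to apply them in any order, which is the property a distributed framework relies on.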
Abstracts “machinery” away - Why?
Functional makes concurrency easy
val numbers = 1 to 100000
val result = numbers.map(slowF)
Functional makes concurrency easy
val numbers = 1 to 100000
val result = numbers.par.map(slowF)
.par parallelizes the subsequent operations across the available CPU cores
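Since Scala 2.13, `.par` lives in the separate scala-parallel-collections module, so the one-liner above needs that extra dependency on newer versions. As a rough sketch of the same idea using only the standard library, the version below runs a pure function concurrently with Futures. `ParallelMapDemo`, `parallelMap` and this particular `slowF` are assumptions made for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelMapDemo {
  // Hypothetical stand-in for the slide's slowF: pure, but slow.
  def slowF(n: Int): Int = {
    Thread.sleep(10)
    n * n
  }

  // Runs slowF on the global thread pool and gathers the results
  // in the same order as the input.
  def parallelMap(numbers: Seq[Int]): Seq[Int] = {
    val futures = numbers.map(n => Future(slowF(n)))
    Await.result(Future.sequence(futures), 1.minute)
  }
}
```

Because `slowF` has no side effects, `ParallelMapDemo.parallelMap(1 to 20)` yields the same sequence as the sequential `(1 to 20).map(ParallelMapDemo.slowF)`; only the wall-clock time differs.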
Functional makes distribution easy
val numbers = 1 to 100000
val result = sparkContext.parallelize(numbers).map(slowF)
Distributes the subsequent operations across a scalable cluster by creating a Spark RDD (Resilient Distributed Dataset)
“Spark RDDs are the ultimate Scala collections”
- Martin Odersky
photo: http://www.swissict-award.ch/fileadmin/award/Pressebilder/Martin_Odersky_Scala.jpg
Functional makes Resiliency easy
Pure functions are idempotent (re-running one yields the same result with no extra side effects), so a failed task can safely be retried
Diagram: several Map tasks run in parallel; a failed Map task is simply run again (retry)
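A toy version of this retry behaviour can be sketched as follows. Everything here is hypothetical: `FlakyWorker` simulates a node that crashes the first time it sees each input, and `retry` stands in for the framework's rescheduling logic.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetryDemo {
  // A pure function: same input, same output, no side effects.
  def square(n: Int): Int = n * n

  // Simulated unreliable worker: fails the first time it sees each input.
  // The mutable state belongs to the "infrastructure", not to `square`.
  final class FlakyWorker {
    private val seen = scala.collection.mutable.Set.empty[Int]
    def run(n: Int): Int = {
      if (seen.add(n)) throw new RuntimeException(s"worker lost while processing $n")
      square(n)
    }
  }

  // Re-runs a task until it succeeds or the attempts are used up.
  @tailrec
  def retry[A](attempts: Int)(task: () => A): A =
    Try(task()) match {
      case Success(value)             => value
      case Failure(_) if attempts > 1 => retry(attempts - 1)(task)
      case Failure(error)             => throw error
    }

  // Maps over the input, retrying every element whose task failed.
  def resilientMap(numbers: Seq[Int]): Seq[Int] = {
    val worker = new FlakyWorker
    numbers.map(n => retry(attempts = 3)(() => worker.run(n)))
  }
}
```

Every element fails once and succeeds on the second attempt, yet the final result equals a plain `numbers.map(square)`. Retrying is only safe because re-running `square` cannot change anything observable.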
What if we always coded this way?
A functional language means just that
Language “requirements”
Open Source
Strongly Typed
Java/JVM Friendly
Interactive
Performant
Abstracts “machinery” away
But “Scala is hard!”
It’s really not that scary...
From Manuel Bernhardt's "Debunking Some Myths About Scala And Its Environment":
“I need to become a mathematician and know all about Monads before I can get started”
“I can throw all of my object-orientation knowledge out of the window”
“There is no good IDE support for Scala”
Thank You!