Scalding Presentation

MapReduce with Scalding
Antonios Chalkiopoulos
24th Big Data London Meetup

Scalding.io

$ whoami

http://scalding.io

http://github.com/scalding-io

@chalkiopoulos

My recent achievement..

What are we gonna talk about..?

A Scala API on top of Cascading

But what is Cascading?

A few years ago I started on a fresh Big Data team…

Story!!

How do we efficiently develop MapReduce jobs for our new Hadoop cluster?

MapReduce Techs

[Diagram: Java MapReduce on top of Hadoop]

Java MapReduce Word count example

MapReduce Techs

[Diagram: Java MapReduce, Pig, Hive, Cascading and others, all on top of Hadoop]

The promise of Cascading

[1] A simple, high-level Java API for MapReduce that is easy to understand and work with.

[2] Extensions to MANY platforms

Cascading

NoSQL Databases: MongoDB, Cassandra, HBase, Accumulo, …
SQL Databases
Hadoop Filesystem
Local Filesystem
In-memory systems: Redis, Memcached
Search Platforms: ElasticSearch, Solr, …

How does it work?

A pipeline architecture where tuples (Tuple1, Tuple2) flow through pipes

[Diagram: source taps (Log files, Customer Data) flow through pipes into a joined Log & Customer stream and then into the Final Results]

Cascading Example

Word count in Cascading

public class WordCount {
  public static void main(String[] args) {
    // Imports (java.util.Properties and the cascading.* classes) are omitted, as on the slide.
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, WordCount.class);

    // Source and sink taps on HDFS
    Scheme sourceScheme = new TextLine(new Fields("line"));
    Scheme sinkScheme = new TextLine(new Fields("word", "count"));
    Tap source = new Hfs(sourceScheme, args[0]);
    Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);

    // Split each line into words, group by word and count
    Pipe assembly = new Pipe("Word Count");
    String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
    Function function = new RegexGenerator(new Fields("word"), regex);
    assembly = new Each(assembly, new Fields("line"), function);
    assembly = new GroupBy(assembly, new Fields("word"));
    Aggregator count = new Count(new Fields("count"));
    assembly = new Every(assembly, count);

    // Plan and run the flow
    FlowConnector flowConnector = new FlowConnector(properties);
    Flow flow = flowConnector.connect("word-count", source, sink, assembly);
    flow.complete();
  }
}

70% less boilerplate code

But still some infrastructure code

No boilerplate code at all

Functional

Robust & Scalable

Run on JVM

Here it comes

[Diagram: Java MapReduce, Pig, Hive, Cascading and others on top of Hadoop, with Scalding on top of Cascading]

The power of Scala on top of Cascading

Scala fits naturally with data

Word count in Scalding

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine("input.txt").read
    .flatMap('line -> 'word) { line : String => line.split("\\s+") }  // Map phase
    .groupBy('word) { _.size }                                         // Reduce phase
    .write( Tsv("results.tsv") )
}

4 lines of code!
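
A rough way to see what the two phases compute is to run the same steps on plain Scala collections. The sketch below is only an illustration: WordCountSketch and its method names are made up here, nothing in it touches Scalding or Hadoop, and it drops empty tokens, which the job on the slide does not do.

```scala
// Plain Scala collections standing in for Scalding pipes (an assumption made
// for illustration; a real job reads from and writes to taps).
object WordCountSketch {
  // "Map phase": split every line into words, like flatMap('line -> 'word).
  def toWords(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

  // "Reduce phase": group identical words and count each group,
  // like groupBy('word) { _.size }.
  def countWords(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, group) => word -> group.size }

  // The whole job: map phase followed by reduce phase.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    countWords(toWords(lines))
}
```

Calling WordCountSketch.wordCount(Seq("cool Scala cool")) gives a map with "cool" counted twice and "Scala" once.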

Code that developers enjoy writing

Who is using it?

Many many others…

Scalding…

…was open sourced by Twitter in 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading

Core Concepts

Sources & Sinks

Tsv("data.tsv", ('productID, 'price, 'quantity))
  .read
  .write(UnpackedAvroSource("data.avro"))

Tsv, Csv, Osv, Avro, Parquet, …

Map Operations

pipe1.filter('age) { age: Int => age > 18 }
pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
pipe1.project('name, 'surname)

15 map operations translated into map phases
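
The three operations above can be mimicked on an in-memory sequence to show what each one does to a row. This is a sketch under stated assumptions: the Customer case class and the MapOpsSketch object are hypothetical stand-ins for a tuple with named fields, and no Scalding code is involved.

```scala
object MapOpsSketch {
  // Hypothetical record type standing in for a tuple with named fields.
  final case class Customer(name: String, surname: String, age: Int, price: Double)

  // Like filter('age) { age: Int => age > 18 }: keep only matching rows.
  def adults(rows: Seq[Customer]): Seq[Customer] =
    rows.filter(_.age > 18)

  // Like map('price -> 'withVAT) { price: Double => price * 1.2 }:
  // the row is kept and a new value is added next to it.
  def withVat(rows: Seq[Customer]): Seq[(Customer, Double)] =
    rows.map(r => (r, r.price * 1.2))

  // Like project('name, 'surname): keep only the listed fields.
  def names(rows: Seq[Customer]): Seq[(String, String)] =
    rows.map(r => (r.name, r.surname))
}
```

Each function looks at one row at a time, which is why operations of this kind can run entirely inside a map phase.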

Join operations

pipe1.joinWithSmaller('productId -> 'productId, pipe2)
pipe1.joinWithLarger('productId -> 'productId, pipe2)
pipe1.joinWithTiny('productId -> 'productId, pipe2)

Optimize by hinting the relative sizes

Supports Left, Right, Inner, Outer Joins

pipe1
  .joinWithSmaller('productId -> 'productId, pipe2, joiner = new LeftJoin)
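
To make the difference between an inner and a left join concrete, here is a minimal sketch on in-memory sequences. Sale, Product and JoinSketch are hypothetical names, and indexing the smaller side in a hash map is only an analogy for why hinting the relative sizes helps; it is not how Scalding or Cascading implement their joins.

```scala
object JoinSketch {
  // Hypothetical record types for the two sides of the join.
  final case class Sale(productId: Int, quantity: Int)
  final case class Product(productId: Int, name: String)

  // Inner join: only sales whose productId exists on the other side survive.
  def innerJoin(sales: Seq[Sale], products: Seq[Product]): Seq[(Sale, Product)] = {
    val byId = products.groupBy(_.productId) // index the smaller side once
    for {
      s <- sales
      p <- byId.getOrElse(s.productId, Seq.empty)
    } yield (s, p)
  }

  // Left join: every sale is kept; a missing product becomes None.
  def leftJoin(sales: Seq[Sale], products: Seq[Product]): Seq[(Sale, Option[Product])] = {
    val byId = products.groupBy(_.productId)
    sales.flatMap { s =>
      byId.get(s.productId) match {
        case Some(ps) => ps.map(p => (s, Option(p)))
        case None     => Seq((s, Option.empty[Product]))
      }
    }
  }
}
```

With one sale for a known product and one for an unknown product, innerJoin returns a single pair while leftJoin returns two rows, the second carrying None.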

Group operations

val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv("results.tsv"))

.groupBy: Group by particular fields

.groupAll: Group all data
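
The two grouping styles can be compared on an in-memory sequence. This is a sketch only: Sold and GroupSketch are hypothetical names, and the code uses plain Scala collections rather than Scalding pipes.

```scala
object GroupSketch {
  // Hypothetical record type matching the ('shopId, 'itemId, 'quantity) fields.
  final case class Sold(shopId: String, itemId: String, quantity: Long)

  // Like groupBy('shopId) { _.sum[Long]('quantity -> 'totalSoldItems) }:
  // one total per distinct shopId.
  def totalPerShop(rows: Seq[Sold]): Map[String, Long] =
    rows.groupBy(_.shopId).map { case (shop, group) =>
      shop -> group.map(_.quantity).sum
    }

  // Like groupAll: the whole data set is treated as a single group.
  def totalOverall(rows: Seq[Sold]): Long =
    rows.map(_.quantity).sum
}
```

Grouping by a field yields one result per key, while grouping everything yields a single result for the whole input.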

Pipe operations

val p = (pipe1 ++ pipe2)        // Concatenate 2 pipes
  .debug                        // Print sample data to screen
  .addTrap(Tsv("bogus_lines"))  // Dirty data are recorded

Simple pipe operations
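
The idea behind concatenating pipes and trapping dirty data can be sketched without a cluster: records that fail processing are set aside instead of failing the whole run. Everything below is an assumption made for illustration, including the TrapSketch name and the two-column "name, tab, age" layout; it models the behaviour with scala.util.Try rather than with a real trap.

```scala
import scala.util.{Failure, Success, Try}

object TrapSketch {
  // Parses a "name<TAB>age" line; this layout is hypothetical.
  def parse(line: String): Try[(String, Int)] = Try {
    line.split("\t") match {
      case Array(name, age) => (name, age.trim.toInt)
      case _                => throw new IllegalArgumentException("expected 2 columns")
    }
  }

  // Returns (clean records, trapped raw lines).
  def run(lines: Seq[String]): (Seq[(String, Int)], Seq[String]) = {
    val parsed = lines.map(l => l -> parse(l))
    val good = parsed.collect { case (_, Success(record)) => record }
    val bad = parsed.collect { case (line, Failure(_)) => line }
    (good, bad)
  }
}
```

For example, TrapSketch.run(fileA ++ fileB) first concatenates two inputs, then returns the parsed records together with the lines that could not be parsed.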

Connect with external systems

Scalding + Hive

class HiveExample(args: Args) extends Job(args) {
  val USER_SCHEMA = List('userId, 'username, 'photo)

  HiveSource("myHiveTable", SinkMode.KEEP)
    .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
    .write(Tsv("outputFromHive"))
}

Define the schema
Query HCatalog
Read directly from Hive

Scalding + ElasticSearch

val schema = List('number, 'product, 'description)

val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
  .read
  .write(Tsv("data/es-out.tsv"))

val writeES = Tsv("data.tsv")
  .read
  .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))

Read from ElasticSearch in one line!
Also index new data in ES

Design patterns

Dependency Injection
Late bound
External Operations

How about defining external operations?

val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
  .read
  .ETLOmnitureData
  .calculateOmnitureUserStats
  .joinWithCustomerDB('userId -> 'userId, customerPipe)
  .write(Tsv("omniture-results.tsv"))

Custom operations: re-usable modular code, single responsibility, testability

Full code: http://bit.ly/1pNSUKf
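
One common Scala way to get chainable custom steps like these is an implicit wrapper class. The sketch below shows that mechanism on a plain Seq and is not the code behind the link: Visit, VisitOps, dropAnonymous and visitsPerUser are hypothetical names, and a real Scalding version would wrap a pipe instead of a sequence.

```scala
object ExternalOpsSketch {
  // Hypothetical record type for one web visit.
  final case class Visit(userId: String, page: String)

  // Wrapper that adds domain-specific steps to any Seq[Visit].
  implicit class VisitOps(val visits: Seq[Visit]) {
    // Hypothetical cleaning step: drop rows without a user id.
    def dropAnonymous: Seq[Visit] =
      visits.filter(_.userId.nonEmpty)

    // Hypothetical statistics step: number of visits per user.
    def visitsPerUser: Map[String, Int] =
      visits.groupBy(_.userId).map { case (user, vs) => user -> vs.size }
  }
}
```

After import ExternalOpsSketch._, a call such as visits.dropAnonymous.visitsPerUser reads like a built-in pipeline, and each step can be unit tested on its own with a small in-memory list.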

Scalding Testing

Testing challenges in the context of MR

Acceptance Tests

Unit – Component Tests

System Tests

Integration Tests

Scalding enables testing in every layer

example

import com.twitter.scalding._
import org.scalatest.FlatSpec
import org.scalatest.matchers.ShouldMatchers

class TsvWordCountJobTest extends FlatSpec
    with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("cool" -> 2)
      }
      .run
      .finish
  }
}

Replaces taps with in-memory collections and asserts the expected output

Monitoring

“Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps”

http://driven.cascading.io

Collects telemetry data and exposes it through a web UI

Advanced Concepts

Scalding adds: Typed API, Matrix API, Graphs, Machine Learning Algorithms

What does the future look like?

So far…

Real Time, Batch, Hybrid

Summingbird

A unified API for everything

Storm, Tez, Spark

Enables the Lambda architecture

Questions?
