Scala and Hadoop @ eBay

DESCRIPTION

Slides from Adam Ilardi's presentation at the 5/21 NY Scala Meetup held at eBay NYC.

Page 1: Scala and Hadoop @ eBay

Scala and Hadoop @ eBay

Page 2: Scala and Hadoop @ eBay

What we will cover

• Polymorphic Function Values
• Higher Kinded/Recursive Types
• Cokleislis Star Operators
• Scala Macros

Page 3: Scala and Hadoop @ eBay

I have no clue what those things are

Page 4: Scala and Hadoop @ eBay

What we will ACTUALLY cover

• Why Scala
• Why Hadoop
• How we use Scala with Hadoop
• Lots of CODE!

Page 5: Scala and Hadoop @ eBay

Why Scala?

• JVM
• **Functional**
• Expressive
• How to convince your boss?

Page 6: Scala and Hadoop @ eBay

Someone on Hacker News said Scala sucks

• Compile Times
• You changed List again?
• Complicated
• Leads to Madness

Page 7: Scala and Hadoop @ eBay

Madness?

trait Lazy[+T, P] {
  var creationParameters: P = None.asInstanceOf[P]
  lazy val lazyThing: Either[Throwable, T] =
    try { Right(create(creationParameters)) }
    catch { case e => Left(e) }
  def get(createParams: P): Either[Throwable, T] = {
    creationParameters = createParams
    lazyThing
  }
  def create(params: P): T
}

Page 8: Scala and Hadoop @ eBay

Madness?

def getSingleInstance[T, P](params: P)(implicit lazyCreator: Lazy[T, P]): T = {
  lazyCreator.get(params) match {
    case Right(successValue) => successValue
    case Left(exception) => throw new StackException(exception)
  }
}

Page 9: Scala and Hadoop @ eBay

This is used by ONE client class

• Show some self-restraint

Page 10: Scala and Hadoop @ eBay
Page 11: Scala and Hadoop @ eBay

Hadoop

• void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)

• void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
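The two signatures above describe a three-stage flow: map each (K1, V1) record to (K2, V2) pairs, group the pairs by K2, then reduce each group to (K3, V3) pairs. A minimal sketch of that flow in plain Scala collections follows. It has no Hadoop dependency; `MapReduceSketch`, `run`, `purchasesPerUser` and the "user,item" record layout are hypothetical names and data invented for illustration, not part of the Hadoop API or of the talk.

```scala
// Plain-Scala model of the map -> shuffle -> reduce flow; not the Hadoop API.
object MapReduceSketch {
  // Runs the mapper over every input record, groups the emitted pairs by key
  // (the "shuffle"), then runs the reducer once per key.
  def run[K1, V1, K2, V2, K3, V3](
      input: Seq[(K1, V1)],
      mapper: (K1, V1) => Seq[(K2, V2)],
      reducer: (K2, Iterator[V2]) => Seq[(K3, V3)]
  ): Seq[(K3, V3)] = {
    val mapped: Seq[(K2, V2)] = input.flatMap { case (k, v) => mapper(k, v) }
    val shuffled: Map[K2, Seq[V2]] =
      mapped.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    shuffled.toSeq.flatMap { case (k, vs) => reducer(k, vs.iterator) }
  }

  // Hypothetical example: count purchases per user from
  // (lineNumber, "user,item") records.
  def purchasesPerUser(lines: Seq[(Long, String)]): Map[String, Int] =
    run[Long, String, String, Int, String, Int](
      lines,
      (_, line) => Seq(line.split(",")(0) -> 1), // emit (user, 1) per purchase
      (user, ones) => Seq(user -> ones.sum)      // sum the ones for each user
    ).toMap
}
```

On a real cluster the grouping step is a distributed sort and shuffle rather than an in-memory `groupBy`, which is where most of the cost sits.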

Page 12: Scala and Hadoop @ eBay

BIG NUMBERS

• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items

Page 13: Scala and Hadoop @ eBay

How should I use Map Reduce?

• Raw map reduce
• Pig
• Hive
• Cascading
• Scoobi
• Scalding

Page 14: Scala and Hadoop @ eBay

Decision Time

• “And every one that heareth these sayings of mine (great software engineers of the past), and doeth them not, shall be likened unto a foolish man, which built his house upon the sand.”

• “And the rain descended, and the floods came, and the winds blew, and beat upon that house; and it fell: and great was the fall of it.”

Page 15: Scala and Hadoop @ eBay

I believe!

• Scalding combines the best of Pig and Cascading

Page 16: Scala and Hadoop @ eBay

Good Pig

A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';

// do joins and group by also

Page 17: Scala and Hadoop @ eBay

Bad Pig

DEFINE NV_terms `perl nv_terms2.pl` ship('$scripts/nv_terms2.pl');

i5 = stream i4 through NV_terms as (leafcat:chararray, name:chararray, name1:chararray);

i7 = foreach i5 generate leafcat, com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as name, com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as name1;

Page 18: Scala and Hadoop @ eBay

Other Pig Issues

• Scheduling and DAG creation

Page 19: Scala and Hadoop @ eBay

Cascading Rocks!

• What is it?
• Supports large workflows and reusable components
  – DAG generation
  – Parallel Executions

Page 20: Scala and Hadoop @ eBay

Cascading code in Scala

val masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr")

masterPipe = new FilterInappropriateQueries(masterPipe, "sqr")

masterPipe = new GroupBy(masterPipe, CFields("user_id", "epoch_ts", "sqr"), sortFields)

Page 21: Scala and Hadoop @ eBay

Someone should really code review this

Page 22: Scala and Hadoop @ eBay

Cascading Issues

This page intentionally left blank

Page 23: Scala and Hadoop @ eBay

Scalding Time

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
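To see what the job computes without a cluster, the same three steps (split lines into words, group by word, count each group) can be approximated on in-memory collections. This is only a sketch: `LocalWordCount` and `wordCount` are hypothetical names, the code does not use Scalding, and the extra filter for empty tokens is an addition that the slide's version does not have.

```scala
// In-memory analogue of the word count pipeline; no Scalding dependency.
object LocalWordCount {
  // Same tokenizer as on the slide: lowercase, strip punctuation,
  // split on whitespace.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  // flatMap + groupBy + size, mirroring the three pipeline steps.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toList)
      .filter(_.nonEmpty) // assumption: drop empty tokens from leading whitespace
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

A helper like this also doubles as a cheap unit test for `tokenize` before the job is submitted.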

Page 24: Scala and Hadoop @ eBay

Scalding @ eBay

• Boilerplate reduction
• Extensibility
• New hires

Page 25: Scala and Hadoop @ eBay

Practical Scalding Use

• Pimp my pimp
• Code generated boilerplate
• Cascades
• Traps
• Testing!

Page 26: Scala and Hadoop @ eBay

class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {

  implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)

  class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with CommonFunctions

  trait CommonFunctions {
    import Dsl._
    import RichPipe.assignName
    def pipe: Pipe
    def reallyComplexFunction(field: Fields, param: Long) = {
      //mind blowing code here
    }
  }
}
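The slide relies on the "pimp my library" (enrichment) pattern: an implicit conversion wraps a pipe in a richer class, so shared functions read as if they were methods of the pipe itself. A self-contained sketch of the mechanism follows, under stated assumptions: `Pipe` here is a hypothetical stand-in holding in-memory rows rather than the Cascading class, and `keepScoresAtLeast` is an invented operation, not one of eBay's functions.

```scala
import scala.language.implicitConversions

object EnrichSketch {
  // Hypothetical stand-in for a pipe: just a list of (userId, score) rows.
  final case class Pipe(rows: List[(String, Double)])

  // Shared, reusable operations live in a trait, as CommonFunctions does
  // on the slide.
  trait CommonFunctions {
    def pipe: Pipe
    // Hypothetical operation: keep only rows at or above a score threshold.
    def keepScoresAtLeast(min: Double): Pipe =
      Pipe(pipe.rows.filter(_._2 >= min))
  }

  // Wrapper class that carries the extra methods.
  class RichPipe(val pipe: Pipe) extends CommonFunctions

  // The implicit conversion makes the extra methods callable on a plain Pipe.
  implicit def pipe2RichPipe(pipe: Pipe): RichPipe = new RichPipe(pipe)

  // Pipe has no keepScoresAtLeast method; the compiler inserts the conversion.
  def demo: Pipe = Pipe(List("u1" -> 0.9, "u2" -> 0.2)).keepScoresAtLeast(0.5)
}
```

Keeping the operations in a trait rather than in the wrapper class means several wrappers can mix in the same function library.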

Page 27: Scala and Hadoop @ eBay

CheckoutTransactionsPipe(/* default path logic */)
  .project(/* fields I need */)
  .countUserInteractions(/* params */)
  .doScoreCalculation(/* params */)
  .doConfidenceCalculation(/* params */)

Seems a bit too readable for Scala

Page 28: Scala and Hadoop @ eBay

Collaborative Filtering

• Typically hard to run on large datasets

Page 29: Scala and Hadoop @ eBay

Structured Data Importance

• Do people shop by brand?

[Chart: Handbags and Purses. Series: Supply. x-axis attributes: Bag Depth, Bag Height, Bag Length, Brand, Color, Country of Manufacture, Material, Shade, Size, Strap Drop, Style. y-axis: 0 to 1.2.]

Page 30: Scala and Hadoop @ eBay

Markov Chains

• Investigation of buying patterns in ~50 lines of code

val purchases = "firsttime" :: x.take(500).toList
val pairs = purchases zip purchases.tail
val grouped = pairs.groupBy(x => x._1.toString+"-"+x._2.toString)
val sizes = grouped map { x => { x._1 -> x._2.size }} toList
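The snippet above pairs each purchase with the next one and counts how often each pair occurs. A runnable sketch of the same idea, extended with one normalization step to turn counts into estimated transition probabilities, is given below. `MarkovSketch`, its method names and the category strings used in examples are hypothetical, and it keys transitions by tuple rather than by the "a-b" string used on the slide.

```scala
object MarkovSketch {
  // Count each observed "from -> to" transition in one ordered purchase
  // history, with a synthetic "firsttime" state before the first purchase.
  def transitionCounts(history: List[String]): Map[(String, String), Int] = {
    val purchases = "firsttime" :: history
    val pairs = purchases.zip(purchases.tail)
    pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
  }

  // Normalize counts per source state to estimate transition probabilities.
  def transitionProbabilities(history: List[String]): Map[(String, String), Double] = {
    val counts = transitionCounts(history)
    val totalsBySource: Map[String, Int] =
      counts
        .groupBy { case ((from, _), _) => from }
        .map { case (from, group) => from -> group.values.sum }
    counts.map { case ((from, to), n) =>
      (from, to) -> n.toDouble / totalsBySource(from)
    }
  }
}
```

For a history such as shoes, bags, shoes, hats, the "shoes" state is followed once by bags and once by hats, so each of those transitions gets an estimated probability of one half.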

Page 31: Scala and Hadoop @ eBay

Mining Search Queries

• 20+ billion user queries - give me the top ones per user

[Diagram: De-Dupe, Rank, Validate, Sample Data]
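As a rough illustration of "top queries per user", the sketch below collapses near-duplicate queries, drops empty ones, counts what remains and ranks per user. It is an in-memory approximation under assumptions: `QueryEvent`, `normalize` and `topPerUser` are hypothetical names, the normalization rule is invented, and the talk does not say how the real stages are implemented at 20+ billion rows.

```scala
object TopQueriesSketch {
  // Hypothetical record shape: one row per search event.
  final case class QueryEvent(userId: String, query: String)

  // Assumed normalization so trivially different spellings collapse together.
  private def normalize(q: String): String =
    q.trim.toLowerCase.replaceAll("\\s+", " ")

  // For each user: count normalized queries, rank by count (ties broken
  // alphabetically for determinism), and keep the top n.
  def topPerUser(events: Seq[QueryEvent], n: Int): Map[String, List[(String, Int)]] =
    events
      .map(e => e.userId -> normalize(e.query))
      .filter { case (_, q) => q.nonEmpty } // drop empty queries
      .groupBy { case (user, _) => user }
      .map { case (user, rows) =>
        val counts = rows.groupBy(_._2).map { case (q, hits) => q -> hits.size }
        user -> counts.toList.sortBy { case (q, c) => (-c, q) }.take(n)
      }
}
```

In a Scalding job the same shape would typically be a group by user followed by a sorted take within each group, so that no single machine has to hold all of a user's queries.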

Page 32: Scala and Hadoop @ eBay

Automation

Hadoop Proxy Batch Database Load Machines

Cassandra

Jenkins

MySql

Mongo

Page 33: Scala and Hadoop @ eBay

Questions?

www.ebaynyc.com