Scalding Big ADta, or firing pots with ads. Boris Trofimov @b0ris_1


Page 1: Scalding big ADta

Scalding Big ADta

or firing pots with ads

Boris Trofimov @b0ris_1

Page 2: Scalding big ADta

Agenda

• Two stories about how ads are served inside an ad company

• Awesome Scalding

Page 3: Scalding big ADta

The story about shoes, or

Big Brother is watching you

Page 4: Scalding big ADta

We will answer this question in a few slides (or: be careful when buying shoes)

Page 5: Scalding big ADta

How many websites with ads aboard do you open

during the day?

Page 6: Scalding big ADta

Open any site with an ad

Page 7: Scalding big ADta

What could be simpler than loading a website in a web

browser, huh?

Page 8: Scalding big ADta
Page 9: Scalding big ADta

However, that is a deceptive judgment

Page 10: Scalding big ADta

The first second… Story actors:

• User
• Publisher (foxnews.com)
• Ad Server (Google’s DoubleClick)
• SSP (Ad Exchange)
• DSP (decides which ad to show)
• Advertiser (Nike)

(we are here)

Page 11: Scalding big ADta

The first second, millisecond by millisecond:

• 20 ms: the publisher receives the request
• 100 ms: the publisher sends the response
• 150 ms: the content is delivered to the user
• 170 ms: the site sends a request to the Ad Server
• 200 ms: the Ad Server receives the ad request and redirects to the Ad Exchange
• 210 ms: the SSP (Ad Exchange) receives the ad request and opens an RTB auction; every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
• 280 ms: all bidders must send their decision (participate? and price) back, within an ~80 ms window
• 300 ms: the SSP picks the winning bid and sends the redirect url back to the Ad Server
• 350 ms: the Ad Server shows a page to the user which redirects to the bidder’s server
• 400 ms: the user’s web page asks for the ad banner from the CDN; the ad and the bidder’s 1x1 pixel (impression) are shown
• … 1 sec

The first second, in context: ~70% of users already have this cookie aboard, and many (>>1) independent companies take part in this auction.

Page 12: Scalding big ADta

Under the hood

Page 13: Scalding big ADta

[Architecture diagram, steps 0–9]

• Real time: the Bidder Farm handles auction requests from the SSP (Ad Exchange); the Pixel Tracking Farm collects impressions, clicks, and post-click activities.
• Hourly logs, 3rd-party data, and householder data all land in Hadoop’s HDFS.
• Offline: Hive, Oozie, MapReduce, and Scalding jobs run over HDFS; HBase keeps the user profiles, which the jobs update with new segments.
• Data Scientists (working from the Warehouse) return info about new user interests as special markers (segments), each indicating a new fact about the user, e.g. the user is a man who has an iPhone, lives in NYC, and has a dog. Major format: <cookie_id – segment_id>.
• Data export: a brand-new feed about user interests is delivered to partners (a tiny record sketch follows).
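To make the <cookie_id – segment_id> format concrete, here is a tiny hedged sketch in Scala. The record name, field names, and the tab-separated layout are assumptions for illustration, not the talk’s actual schema:

// A hypothetical record for one line of the segment feed.
case class SegmentFact(cookieId: String, segmentId: String)

// Parse one tab-separated "<cookie_id – segment_id>" line (layout assumed).
def parseSegmentFact(line: String): SegmentFact = {
  val Array(cookie, segment) = line.split("\t", 2)
  SegmentFact(cookie, segment)
}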

Page 14: Scalding big ADta

Why do we need all this science?

• Deep audience targeting
• Case: a customer would like to show an ad to all men who live in NYC, have an iPhone, and own a dog (see the sketch below)
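As a hedged illustration of that case, here is a minimal Scalding sketch that selects cookie ids carrying all required segments. The segment ids ("male", "nyc", "iphone", "dog"), field names, and paths are invented; the real segment taxonomy is not shown in the talk:

import com.twitter.scalding._

// A minimal sketch: keep only cookies that have every required segment.
class AudienceSelectionJob(args: Args) extends Job(args) {
  val required = Set("male", "nyc", "iphone", "dog") // hypothetical segment ids

  Tsv(args("profiles"), ('cookieId, 'segmentId)).read
    .filter('segmentId) { s: String => required(s) }          // drop irrelevant segments
    .groupBy('cookieId) { _.toList[String]('segmentId -> 'segments) }
    .filter('segments) { segs: List[String] => required.subsetOf(segs.toSet) }
    .project('cookieId)
    .write(Tsv(args("audience")))
}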

Page 15: Scalding big ADta

Facts about Data Scientists

• Data Scientists do:
  – Audience Modeling: identifying new user interests [segments] and finding ways to track them
  – Audience Bridging
  – Insights and Analytics
• They use IBM Netezza as a local warehouse
• They use the R language

Page 16: Scalding big ADta

Facts about the Realtime team

• Scala, Java
• RESTful services
• Akka
• In-memory cache: Aerospike, Redis

Page 17: Scalding big ADta

Facts about the Offline team

• The tasks we solve over Hadoop:
  – as a storage to keep all the logs we need
  – as a profile DB to keep all users and their interests [segments]
  – as a MapReduce engine to run jobs transforming between data sets
  – as a warehouse to export data via Hive
• We use Cloudera CDH 5.1.2
• Major language: Scala
• Pure MapReduce jobs & Scalding/Cascading
• All MapReduce applications are wrapped in Oozie workflow(s)
• Developing a next-gen platform version based on Spark Streaming/Kafka

Page 18: Scalding big ADta
Page 19: Scalding big ADta

Scalding in a nutshell (HDFS in, HDFS out):

• Concise DSL
• Configurable source(s) and sink(s)
• Data transform operations:
  – map/flatMap
  – pivot/unpivot (sketch below)
  – project
  – groupBy/reduce/foldLeft
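Of these operations, pivot/unpivot is the only one the later use cases do not show, so here is a minimal hedged sketch of unpivot in the fields API. The file layout and field names are invented:

import com.twitter.scalding._

// A minimal sketch, assuming a TSV of (id, clicks, impressions) rows.
// unpivot turns each wide row into one (id, metric, value) row per column.
class UnpivotJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('id, 'clicks, 'impressions)).read
    .unpivot(('clicks, 'impressions) -> ('metric, 'value))
    .write(Tsv(args("output")))
}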

Page 20: Scalding big ADta

Just one example (the Java way)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Page 21: Scalding big ADta

Just one example (the Scalding way)

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {

  TextLine(args("input"))                                        // source
    .flatMap('line -> 'word) { line: String => tokenize(line) }  // transform operations
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))                                  // sink

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase the text and remove punctuation before splitting.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
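A quick way to check such a job locally is Scalding’s JobTest harness, normally used inside a unit test. A hedged sketch (the input line and expected counts are invented):

import com.twitter.scalding._

// A minimal local-mode test sketch for WordCountJob.
JobTest(new WordCountJob(_))
  .arg("input", "inputFile")
  .arg("output", "outputFile")
  .source(TextLine("inputFile"), List((0, "hack hack hack and hack")))
  .sink[(String, Int)](Tsv("outputFile")) { buf =>
    // expected: "hack" appears 4 times, "and" once
    assert(buf.toMap == Map("hack" -> 4, "and" -> 1))
  }
  .run
  .finish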

Page 22: Scalding big ADta

Use Case 1: Split

• Motivation: reuse calculated streams

val common = Tsv("./file").read.map(...)

val branch1 = common.map(...).write(Tsv("output1"))
val branch2 = common.groupBy(...).write(Tsv("output2"))

Page 23: Scalding big ADta

Use Case 2: Exotic Sources, JDBC (out of the box)

case object YourTableSource extends JDBCSource {
  override val tableName = "tableName"
  override val columns = List(
    varchar("col1", 64),
    date("col2"),
    tinyint("col3"),
    double("col4")
  )
  override def currentConfig =
    ConnectionSpec("www.gt.com", "username", "password", "mysql")
}

YourTableSource.read.map(...) ...

Page 24: Scalding big ADta

Use Case 2: Exotic Sources, HBase

HBaseSource (https://github.com/ParallelAI/SpyGlass)
• SCAN_ALL
• GET_LIST
• SCAN_RANGE

HBaseRawSource (https://github.com/andry1/SpyGlass)
• Advanced filtering via base64Scan

val hbs3 = new HBaseSource(
    tableName,
    quorum,
    'key,
    List("data"),
    List('data),
    sourceMode = SourceMode.SCAN_ALL)
  .read

val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"),
    GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)

new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...

Page 25: Scalding big ADta

Use Case 3: Join

• Motivation: joining two streams by key
• Different join strategies:
  – joinWithLarger
  – joinWithSmaller
  – joinWithTiny
• Inner, Left, and Right join modes

val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file

val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)

Page 26: Scalding big ADta

Use Case 4: Distributed Caching and Counters

// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the next value can be passed into any Scalding job, for instance via its Args object
val fileName = fl.path
...

class MyJob(args: Args) extends Job(args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args("input")).read.map('line -> 'word) {
    line: String => ... /* using the data json object */ ...
  }
}

// counter example
Stat("jdbc.call.counter", "myapp").incBy(1)

Page 27: Scalding big ADta

Use Case 5: Bridging Profiles

Motivation: bridge information from different sources and build a complete person profile.

[Diagram] Examples of bridging rules:
• bridging two SSP cookies (ssp_cookie_Id1 and ssp_cookie_Id2) via the company’s own private cookie, known thanks to the 1x1-pixel impression (imp)
• bridging via IP address

Page 28: Scalding big ADta

Bridging Profiles

General task definition:

• Build a graph:
  – vertexes – user’s interests
  – edges – bridging rules [cookies, IP, …]
• Task – identify connected components

Page 29: Scalding big ADta

Connected components: let’s Scalding it

import com.twitter.scalding._
import cascading.pipe.joiner.LeftJoin

/**
 * The class represents just one iteration of the connected-components search algorithm.
 * Somewhere outside the Job code we have to run it iteratively, up to N [~20] times,
 * checking the number inside the "count" file after each run.
 * If that number is zero, we can stop running further iterations (a driver sketch follows this job).
 */
class ConnectedComponentsOneIterationJob(args: Args) extends Job(args) {

  val vertexes = Tsv(args("vertexes"), ('id, 'gid)).read // by default gid is equal to id
  val edges = Tsv(args("edges"), ('id_a, 'id_b)).read

  val groups = vertexes
    .joinWithSmaller('id -> 'id_b,
      vertexes.joinWithSmaller('id -> 'id_a, edges).discard('id).rename('gid -> 'gid_a))
    .discard('id)
    .rename('gid -> 'gid_b)
    .filter('gid_a, 'gid_b) { gid: (String, String) => gid._1 != gid._2 }
    .project('gid_a, 'gid_b)
    // normalize each pair so the larger gid comes first
    .mapTo(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) { gid: (String, String) =>
      if (gid._1 > gid._2) (gid._1, gid._2) else (gid._2, gid._1)
    }

  // if count = 0 then we can stop running further iterations
  groups.groupAll { _.size }.write(Tsv("count"))

  val new_groups = groups
    .groupBy('gid_a) { _.min('gid_b) }
    .rename(('gid_a, 'gid_b) -> ('source, 'target))

  val new_vertexes = vertexes
    .joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin)
    .mapTo(('id, 'gid, 'source, 'target) -> ('id, 'gid)) { param: (String, String, String, String) =>
      val (id, gid, source, target) = param
      if (target != null) (id, if (gid < target) gid else target) else (id, gid)
    }

  new_vertexes.write(Tsv(args("new_vertexes")))
}
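The iteration driver itself is not on the slide; below is a minimal hedged sketch of what it could look like. The part-file name, path layout, and argument wiring are assumptions for illustration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding.Tool
import scala.io.Source

// A sketch of the outer loop: rerun the one-iteration job until the "count"
// sink reports zero merged groups, or until ~20 iterations have run.
object ConnectedComponentsDriver {
  def main(cliArgs: Array[String]): Unit = {
    val conf = new Configuration()
    val maxIterations = 20
    var iteration = 0
    var done = false
    while (!done && iteration < maxIterations) {
      ToolRunner.run(conf, new Tool, Array(
        "ConnectedComponentsOneIterationJob", "--hdfs",
        "--vertexes", s"vertexes-$iteration", // output of the previous iteration
        "--edges", "edges",
        "--new_vertexes", s"vertexes-${iteration + 1}"))
      // the job writes a single number into the "count" sink;
      // the part-file name here is an assumption about the layout
      val fs = FileSystem.get(conf)
      val in = fs.open(new Path("count/part-00000"))
      val text = Source.fromInputStream(in).mkString.trim
      in.close()
      done = text.isEmpty || text.toLong == 0L
      iteration += 1
    }
  }
}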

Page 30: Scalding big ADta

Other nice things

• Typed pipes (a small sketch follows below)
• Elegant and fast matrix operations
• Simple migration to Spark/Kafka
• More sources: e.g. retrieving data from Hive’s HCatalog
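On typed pipes: here is the earlier word count rewritten with the typed API, as a minimal sketch in which every intermediate step has a compile-time checked element type:

import com.twitter.scalding._

// A minimal typed-API sketch: word count with compile-time checked types.
class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))  // TypedPipe[String] of lines
    .flatMap(_.toLowerCase.split("\\s+"))  // TypedPipe[String] of words
    .filter(_.nonEmpty)
    .groupBy(identity)                     // group each word by itself
    .size                                  // count per word
    .write(TypedTsv[(String, Long)](args("output")))
}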

Page 31: Scalding big ADta

Useful Resources

• http://www.adopsinsider.com/ad-serving/how-does-ad-serving-work/
• http://www.adopsinsider.com/ad-serving/diagramming-the-ssp-dsp-and-rtb-redirect-path/
• https://github.com/twitter/scalding
• https://github.com/ParallelAI/SpyGlass
• https://github.com/branky/cascading.hive

Page 32: Scalding big ADta

Thank you!