Building Audience Analytics Platform Jothi Padmanabhan Inmobi 6-Sep-2014

Page 1: Building Audience Analytics Platform

Building Audience Analytics Platform

Jothi Padmanabhan, Inmobi

6-Sep-2014

Page 2: Building Audience Analytics Platform

Motivation

➔ Audience Analytics platform is extremely critical
➔ Segmentation
    ➔ Rule Based
    ➔ Inferred based on Sciences Modeling
    ➔ Third Party
➔ Targeting
    ➔ Maximize CTR and CVR

Page 3: Building Audience Analytics Platform

Challenges

➔ Scale
    ➔ Billions of ad requests/day, peak 25K rps, 800M users
➔ Multiple Input Sources and Types
    ➔ Fact Data, Dimension Data
➔ Multiple Consumers
    ➔ Reporting, Segmentation and Targeting, Inferences

Page 4: Building Audience Analytics Platform

Challenges

➔ Data Curation
➔ Define and Measure Data Quality
    ➔ Track sources and possibly assign confidence
➔ Governance and Licensing restrictions
➔ Consistent Querying Interface

Page 5: Building Audience Analytics Platform

Challenges

● Storage capacity and retention
● Optimal usage of grid resources

Page 6: Building Audience Analytics Platform

Activity Data

➔ Records actual activity
➔ Time-series data
➔ Immutable, actual facts
➔ Comprises Dimensions and Measures
➔ Measures
    ➔ Ad requests, Impressions, Clicks, Conversions, ...

Page 7: Building Audience Analytics Platform

Dimension Data

➔ Domain specific Metadata (user, location, app etc)
➔ Each domain will have its own schema
    ➔ User (uid, age, gender, interests etc)
    ➔ Location (Lat/Long – zip/city/country, etc)
    ➔ Device (Handset model, OS, version etc)
➔ Mutable (but possibly slowly changing)
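
The split between immutable activity facts and mutable dimension data can be sketched with two record types. This is only an illustration: the class names, the field names beyond those listed on the slides, and the `ctr` helper are assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)  # frozen: an activity fact never changes once recorded
class ActivityRecord:
    timestamp: datetime   # time-series key
    uid: str              # dimension key pointing at the user domain
    ad_requests: int = 0  # measures
    impressions: int = 0
    clicks: int = 0
    conversions: int = 0


@dataclass  # not frozen: dimension rows are mutable, though slowly changing
class UserDimension:
    uid: str
    age: Optional[int] = None
    gender: Optional[str] = None
    interests: List[str] = field(default_factory=list)


def ctr(records: List[ActivityRecord]) -> float:
    """Click-through rate over a set of facts: clicks / impressions."""
    impressions = sum(r.impressions for r in records)
    clicks = sum(r.clicks for r in records)
    return clicks / impressions if impressions else 0.0
```

Updating a `UserDimension` in place is allowed, while assigning to a field of an `ActivityRecord` raises an error, which mirrors the mutable/immutable distinction on the two slides.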

Page 8: Building Audience Analytics Platform

ETL

➔ Need to ingest data from different sources
➔ Transform the data into a format for optimized storage and easy queryability
➔ Query interface for different consumers

Page 9: Building Audience Analytics Platform

ETL - Ingestion

➔ Naive -- Have custom ingestion flows
    ➔ Quick to develop
    ➔ Could be highly optimized
    ➔ Not scalable
➔ Have a generic framework
    ➔ Streamlined and scalable
    ➔ Might need more processing

Page 10: Building Audience Analytics Platform

ETL - Storage

➔ Naive -- Storage schema closely coupled with ingestion schema
    ➔ Multiple representations of same data. Age could be DOB or years
➔ Consistent representation a must
    ➔ Would require transformation from input schema to storage schema
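
As an illustration of the input-to-storage transformation, the sketch below maps two hypothetical feeds, one carrying a date of birth and one carrying age in years, onto a single `age` field. The input keys `dob` and `age_years` and the output layout are assumptions made for this example.

```python
from datetime import date
from typing import Optional


def age_in_years(dob: date, today: date) -> int:
    """Completed years between dob and today."""
    years = today.year - dob.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years


def to_storage_record(raw: dict, today: date) -> dict:
    """Map an input record onto a storage schema with one `age` field."""
    age: Optional[int]
    if "dob" in raw:                      # feed A (assumed): ISO date of birth
        age = age_in_years(date.fromisoformat(raw["dob"]), today)
    elif "age_years" in raw:              # feed B (assumed): age in years
        age = int(raw["age_years"])
    else:
        age = None                        # unknown stays unknown
    return {"uid": raw["uid"], "age": age}
```

Both `to_storage_record({"uid": "u1", "dob": "1990-09-07"}, today)` and `to_storage_record({"uid": "u2", "age_years": "30"}, today)` yield records of the same shape, so consumers never need to know which feed a value came from.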

Page 11: Building Audience Analytics Platform

ETL - Storage

➔ Location – Lat/Long, Zip, City, Country
➔ Need to store in the lowest possible granularity (Lat/Long)
➔ GPS readings come with accuracy that needs to be recorded
➔ Queries are almost always nearness queries, not exact matches

Page 12: Building Audience Analytics Platform

ETL - Storage

➔ Quadtile representation
➔ Use leading bits for tile id, remaining for storing accuracy
➔ Transform all location information to such ids
➔ Nearness with Lat/Long distance is a cross-product join
➔ With Tiles, we can translate this into equi-joins (of course with some loss of accuracy)
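
A minimal sketch of one way such an id could be built. The slides do not give the bit layout, so the zoom level (16 bits per axis), the 16 accuracy bits, the linear lat/long-to-grid mapping, and all function names here are assumptions.

```python
ZOOM = 16            # assumed: 16 bits per axis, so a 32-bit tile id
ACCURACY_BITS = 16   # assumed: trailing bits hold GPS accuracy in metres


def interleave(x: int, y: int, bits: int = ZOOM) -> int:
    """Interleave the bits of x and y (Z-order), x in the higher bit of each pair."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i + 1)
        code |= ((y >> i) & 1) << (2 * i)
    return code


def tile_id(lat: float, lon: float) -> int:
    """Map lat/long linearly onto a 2^ZOOM x 2^ZOOM grid and interleave."""
    n = 1 << ZOOM
    x = min(int((lon + 180.0) / 360.0 * n), n - 1)
    y = min(int((90.0 - lat) / 180.0 * n), n - 1)
    return interleave(x, y)


def encode(lat: float, lon: float, accuracy_m: float) -> int:
    """Leading bits: tile id. Trailing bits: accuracy, clamped to fit."""
    acc = max(0, min(int(accuracy_m), (1 << ACCURACY_BITS) - 1))
    return (tile_id(lat, lon) << ACCURACY_BITS) | acc


def decode(loc_id: int):
    """Split a location id back into (tile id, accuracy in metres)."""
    return loc_id >> ACCURACY_BITS, loc_id & ((1 << ACCURACY_BITS) - 1)


def parent(tile: int, levels: int) -> int:
    """Drop `levels` zoom levels: each level removes one x bit and one y bit."""
    return tile >> (2 * levels)
```

Two readings that fall in the same tile get the same leading bits, so "near" becomes an equality test on the tile id (or on `parent(tile, k)` for a coarser notion of near). Points that sit on opposite sides of a tile boundary are missed, which is the loss of accuracy the slide mentions.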

Page 13: Building Audience Analytics Platform

ETL - Querying

➔ Naive -- Users aware of multiple feeds and schemas, query appropriately
    ➔ Extremely difficult as schemas change, new feeds get added
    ➔ Closely coupled with internal representation, not good

Page 14: Building Audience Analytics Platform

ETL - Querying

➔ Having a consistent, published schema
    ➔ Enables exploration and discovery
    ➔ Well defined querying interfaces that abstract out internal representation
    ➔ Provide primitives (for example UDFs for nearness calculations) for easier querying
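
To make the idea of a nearness primitive concrete, here is a sketch in plain Python rather than as a Hive UDF. It assumes locations are already stored as integer tile ids in which each zoom level occupies two bits; `coarse_key` and `near_join` are hypothetical names, not part of the platform.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def coarse_key(tile: int, levels: int) -> int:
    """Key shared by all tiles under the same ancestor `levels` zoom levels up."""
    return tile >> (2 * levels)


def near_join(
    users: Iterable[Tuple[str, int]],
    places: Iterable[Tuple[str, int]],
    levels: int = 0,
) -> List[Tuple[str, str]]:
    """Equi-join (name, tile) pairs on their coarsened tile key."""
    index = defaultdict(list)
    for place, tile in places:            # build side of a hash join
        index[coarse_key(tile, levels)].append(place)
    return [
        (user, place)
        for user, tile in users           # probe side
        for place in index.get(coarse_key(tile, levels), [])
    ]
```

A consumer calls `near_join(users, places, levels=1)` without knowing how tiles are encoded. Raising `levels` widens what counts as near, at the cost of more false matches.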

Page 15: Building Audience Analytics Platform
Page 16: Building Audience Analytics Platform

Ingestion Server

● Curation to filter out dubious records
● Adapters for transformation
● REST based ingestion server
    – Support multiple compression types
    – Support multiple serialization formats
    – Handle rate-limiting/throttling
    – Bulk/Streaming inputs
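
The core of such a server, minus the REST layer and rate limiting, might look like the sketch below: pick a decompressor and a deserializer by name, drop dubious records, and adapt the rest to a common shape. The registry keys, the plausibility rule, and the record fields are all assumptions for illustration.

```python
import gzip
import json
import zlib

# Pluggable handlers keyed by (assumed) request metadata values.
DECOMPRESSORS = {
    "identity": lambda b: b,
    "gzip": gzip.decompress,
    "deflate": zlib.decompress,
}
DESERIALIZERS = {
    "json": lambda b: [json.loads(b)],
    "jsonl": lambda b: [json.loads(line) for line in b.splitlines() if line.strip()],
}


def is_plausible(rec: dict) -> bool:
    """Hypothetical curation rule: needs a uid and in-range coordinates."""
    lat, lon = rec.get("lat"), rec.get("lon")
    return (
        bool(rec.get("uid"))
        and isinstance(lat, (int, float))
        and isinstance(lon, (int, float))
        and -90 <= lat <= 90
        and -180 <= lon <= 180
    )


def adapt(rec: dict, source: str) -> dict:
    """Adapter: normalize types and tag the record with its source."""
    return {
        "uid": str(rec["uid"]),
        "lat": float(rec["lat"]),
        "lon": float(rec["lon"]),
        "source": source,
    }


def ingest(payload: bytes, encoding: str, fmt: str, source: str):
    """Return (accepted records, number of rejected records)."""
    records = DESERIALIZERS[fmt](DECOMPRESSORS[encoding](payload))
    accepted = [adapt(r, source) for r in records if is_plausible(r)]
    return accepted, len(records) - len(accepted)
```

Keeping the source on every accepted record is one simple way to support the source tracking listed under Data Curation. Supporting a new compression type or serialization format is a matter of adding one entry to a registry.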

Page 17: Building Audience Analytics Platform

Storage and Querying

● Possibly different schema than ingestion schema
● Columnar storage format (Parquet/ORC)
● Predominantly Hive friendly
● No direct access to internal storage, access only through an HQL-like query layer
● Export option for other use cases (online store)

Page 18: Building Audience Analytics Platform

Tech Stack

● Pig for most pipeline tasks
● Grill for analytics interface
● Hive as the primary execution engine
● Tez as the runtime environment
● ORC/Parquet for the storage format

Page 19: Building Audience Analytics Platform

Questions