Building Audience Analytics Platform

Jothi Padmanabhan, Inmobi

6-Sep-2014

Motivation
➔ Audience Analytics platform is extremely critical
➔ Segmentation
  ➔Rule Based
  ➔Inferred based on Sciences Modeling
  ➔Third Party
➔ Targeting
  ➔Maximize CTR and CVR

Challenges
➔ Scale
  ➔Billions of Ad requests/day, Peak 25K rps, 800M Users
➔ Multiple Input Sources and Types
  ➔Fact Data, Dimension Data
➔ Multiple Consumers
  ➔Reporting, Segmentation and Targeting, Inferences

Challenges
➔ Data Curation
➔ Define and Measure Data Quality
  ➔Track sources and possibly assign confidence
➔ Governance and Licensing restrictions
➔ Consistent Querying Interface

Challenges
● Storage capacity and retention
● Optimal usage of grid resources

Activity Data
➔ Records actual activity
➔ Time-series data
➔ Immutable, actual facts
➔ Comprises Dimensions and Measures
➔ Measures
  ➔Ad requests, Impressions, Clicks, Conversion, ...

Dimension Data
➔ Domain specific Metadata (user, location, app etc)
➔ Each domain will have its own schema
  ➔User (uid, age, gender, interests etc)
  ➔Location (Lat/Long – zip/city/country, etc)
  ➔Device (Handset model, OS, version etc)
➔ Mutable (but possibly slowly changing)
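The split between immutable activity facts and mutable dimension data can be sketched as below. This is only an illustration: the class names, field names and the `clicks_by_gender` helper are hypothetical and not taken from the platform described in these slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)  # activity facts are immutable once recorded
class ActivityEvent:
    ts: int      # event time, epoch seconds (time-series key)
    uid: str     # dimension key pointing at the user domain
    event: str   # measure type: "request", "impression", "click", "conversion"


@dataclass  # dimension rows are mutable, though they change slowly
class UserDim:
    uid: str
    age: int
    gender: str


def clicks_by_gender(events, users):
    """Count click facts grouped by a user dimension attribute."""
    by_uid = {u.uid: u for u in users}
    counts = Counter()
    for e in events:
        if e.event == "click" and e.uid in by_uid:
            counts[by_uid[e.uid].gender] += 1
    return dict(counts)
```

A consumer such as reporting would run this kind of fact-to-dimension lookup at much larger scale; the sketch only shows the shape of the two record types.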

ETL
➔ Need to ingest data from different sources
➔ Transform the data into a format for optimized storage and easy querying
➔ Query interface for different consumers

ETL - Ingestion
➔ Naive -- Have custom ingestion flows
  ➔Quick to develop
  ➔Could be highly optimized
  ➔Not scalable
➔ Have a generic framework
  ➔Streamlined and scalable
  ➔Might need more processing

ETL - Storage
➔ Naive -- Storage schema closely coupled with ingestion schema
  ➔Multiple representations of same data. Age could be DOB or years
➔ Consistent representation a must
  ➔Would require transformation from input schema to storage schema
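As an illustration of an input-to-storage transformation, the sketch below maps both age representations (DOB or years) to a single stored field. The `StoredUser` record, the choice of birth year as the canonical form, and the input keys `dob` and `age` are assumptions made for this example only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StoredUser:
    """Hypothetical storage-schema record: age is always kept as a birth year."""
    uid: str
    birth_year: Optional[int]


def to_storage(record: dict, as_of: date) -> StoredUser:
    """Map an input record carrying either 'dob' (ISO date) or 'age' (years)."""
    if record.get("dob"):
        birth_year = date.fromisoformat(record["dob"]).year
    elif record.get("age") is not None:
        # An age in years only pins the birth year down to within one year.
        birth_year = as_of.year - int(record["age"])
    else:
        birth_year = None
    return StoredUser(uid=record["uid"], birth_year=birth_year)
```

Storing a birth year rather than an age keeps the stored value from going stale as time passes, at the cost of some precision for feeds that only report age in years.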

ETL - Storage
➔ Location – Lat/Long, Zip, City, Country
➔ Need to store at the finest possible granularity (Lat/Long)
➔ GPS readings come with accuracy that needs to be recorded
➔ Queries are almost always nearness queries, not exact matches

ETL - Storage
➔ Quadtile representation
➔ Use leading bits for tile id, remaining for storing accuracy
➔ Transform all location information to such ids
➔ Nearness with Lat/Long distance is a cross-product join
➔ With Tiles, we can translate this into equi-joins (of course with some loss of accuracy)
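The quadtile idea can be sketched as follows. The bit layout here is an assumption chosen for illustration (a 16-level tile key over a simple equirectangular grid, followed by 8 trailing bits holding accuracy in 10 m buckets); the slides do not state the actual zoom level, bit widths or projection.

```python
from collections import defaultdict

ZOOM = 16      # assumed tile depth: 2**16 cells per axis
ACC_BITS = 8   # assumed number of trailing bits reserved for accuracy


def tile_key(lat: float, lon: float, zoom: int = ZOOM) -> int:
    """Interleave the x/y cell indices of a simple equirectangular grid."""
    n = 1 << zoom
    x = min(n - 1, int((lon + 180.0) / 360.0 * n))
    y = min(n - 1, int((90.0 - lat) / 180.0 * n))
    key = 0
    for bit in range(zoom - 1, -1, -1):
        key = (key << 2) | (((y >> bit) & 1) << 1) | ((x >> bit) & 1)
    return key


def encode(lat: float, lon: float, accuracy_m: float) -> int:
    """Leading bits: tile id. Trailing bits: accuracy in 10 m buckets, saturating."""
    acc = min((1 << ACC_BITS) - 1, int(accuracy_m // 10))
    return (tile_key(lat, lon) << ACC_BITS) | acc


def tile_of(location_id: int) -> int:
    return location_id >> ACC_BITS


def accuracy_bucket(location_id: int) -> int:
    return location_id & ((1 << ACC_BITS) - 1)


def equi_join(left, right):
    """Join two lists of (name, location_id) on identical tile ids."""
    index = defaultdict(list)
    for name, loc in right:
        index[tile_of(loc)].append(name)
    return [(l, r) for l, loc in left for r in index.get(tile_of(loc), [])]
```

The join is a hash lookup on the tile id instead of a distance test on every pair. The loss of accuracy shows up at tile edges: two points a few metres apart can fall into adjacent tiles and be missed unless neighbouring tiles are also probed.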

ETL - Querying
➔ Naive -- Users aware of multiple feeds and schemas, query appropriately
  ➔Extremely difficult as schemas change, new feeds get added
  ➔Closely coupled with internal representation, not good

ETL - Querying
➔ Having a consistent, published schema
  ➔Enables exploration and discovery
  ➔Well defined querying interfaces that abstract out internal representation
  ➔Provide primitives (for example UDFs for nearness calculations) for easier querying
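A nearness primitive of the kind mentioned above might look like the sketch below. It shows only the distance logic, using the standard haversine great-circle formula; the function names `haversine_m` and `is_near` are hypothetical, and how such a function would be packaged as a UDF for the query layer is not covered by the slides.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_near(lat1: float, lon1: float, lat2: float, lon2: float, radius_m: float) -> bool:
    """True when the two points lie within radius_m metres of each other."""
    return haversine_m(lat1, lon1, lat2, lon2) <= radius_m
```

Exposing a primitive like this lets consumers write nearness conditions without knowing how locations are represented internally.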

Ingestion Server
● Curation to filter out dubious records
● Adapters for transformation
● REST based ingestion server
  – Support multiple compression types
  – Support multiple serialization formats
  – Handle rate-limiting/throttling
  – Bulk/Streaming inputs
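The ingestion path (decompress, deserialize, curate) could be sketched as below. Everything here is an assumption for illustration: the registry names, the supported formats (gzip, JSON lines, CSV) and the curation rule are hypothetical, and the REST layer, throttling and streaming are left out.

```python
import csv
import gzip
import io
import json

# Hypothetical registries: one entry per supported compression type and format.
DECOMPRESSORS = {"identity": lambda b: b, "gzip": gzip.decompress}


def parse_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


PARSERS = {"jsonl": parse_json_lines, "csv": parse_csv}


def is_plausible(record) -> bool:
    """Hypothetical curation rule: require a uid and an in-range lat/lon."""
    try:
        lat, lon = float(record["lat"]), float(record["lon"])
    except (KeyError, TypeError, ValueError):
        return False
    return bool(record.get("uid")) and -90 <= lat <= 90 and -180 <= lon <= 180


def ingest(payload: bytes, compression: str, fmt: str):
    """Decompress, deserialize, then drop dubious records."""
    text = DECOMPRESSORS[compression](payload).decode("utf-8")
    records = PARSERS[fmt](text)
    return [r for r in records if is_plausible(r)]
```

Adding a new feed then means registering another parser or decompressor rather than writing a custom ingestion flow.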

Storage and Querying
● Possibly different schema than ingestion schema
● Columnar storage format (Parquet/ORC)
● Predominantly Hive friendly
● No direct access to internal storage, access only through an HQL-like query layer
● Export option for other use cases (online store)

Tech Stack
● Pig for most pipeline tasks
● Grill for analytics interface
● Hive as the primary execution engine
● Tez as the runtime environment
● ORC/Parquet for the storage format

Questions
