Large-Scale Real-Time Data Management for Engagement and Monetization

Simon Lia-Jonassen, LSDS-IR 2015

Our mission is to help companies understand their audience and build great online user experiences.

–  Find interesting articles.
–  Stay longer on the site.
–  Get relevant ads.
–  Sign up for subscriptions.

Some of our customers:

About Cxense

Our solutions

Cxense DMP

How does it work!? (JavaScript tag example)

Page view events (example)

Content profiles (example)

Custom events (example)

UI and API capabilities

User Segments

Data Volume and Traffic (monthly)

–  5 000 active websites
–  100 million pages
–  1 billion users
–  15 billion page views

Constraints and Requirements
–  Online and real-time processing
  •  Show, analyze and act on what is happening right now.
–  High and sustainable performance
  •  Peak load of 10K+ requests/sec.
  •  50 ms latency constraint for ads and recommendations.
–  Availability, reliability, durability
  •  Multi-DC and fault tolerance.
–  Security and privacy

Challenges

Heterogeneity and Reliability
–  Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
–  Multiple browsers and devices per user, cross-domain tracking (3rd party cookies are dying out).
–  Web pages (articles, image/video galleries, chats, search/front pages) and human language.
–  The Internet is Broken™

Customer success
–  Providing the right insights.
  •  Data, metrics and visualization.
–  Providing the right set of tools.
  •  Usability, brevity, expressiveness, completeness.
–  Best practices.
  •  Analytics, ads, recs, user engagement, personalization and subscription optimization.
–  Onboarding and support.

Challenges

Communication
–  HTTP with JSON payload.
–  Durable and idempotent.

Local storage
–  Atomically append to file.
–  Use a separate directory for each partition and a new file each hour.
–  Tail files and/or directories.
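
The local storage layout above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names, the `<partition>/<YYYY-MM-DD-HH>.log` naming scheme, and the assumption that each event carries a millisecond `time` field are all hypothetical. It relies on `O_APPEND` and a single `os.write` call per event, which on typical local POSIX file systems keeps small appends from interleaving; that guarantee should be verified for the target platform.

```python
import json
import os
import time


def hourly_log_path(base_dir, partition, ts_ms):
    """Return <base_dir>/<partition>/<YYYY-MM-DD-HH>.log for a millisecond UTC timestamp."""
    hour = time.strftime("%Y-%m-%d-%H", time.gmtime(ts_ms / 1000))
    return os.path.join(base_dir, str(partition), hour + ".log")


def append_event(base_dir, partition, event):
    """Append one event as a single JSON line to the partition's file for that hour."""
    path = hourly_log_path(base_dir, partition, event["time"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    # O_APPEND positions every write at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line)  # one write call per event
    finally:
        os.close(fd)
    return path
```

Because every event is one line in an append-only file, a consumer can tail the newest file of each partition directory and remember its byte offset in order to resume or rewind.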

Metadata
–  Keeps the state.
–  Rewind and re-feed when needed.

System
–  Configured via Upstart and Cron.
–  Monitoring via Graphite and log files.
–  Automatic alerting.

Architecture and Data Flow

Data Cubes
–  Partitioned column store database.
–  Efficient string handling and integer compression.
–  Fast filtering and aggregation over billions of data points.
–  Low update latency (100 ms).
–  Exists in multiple variants:
  •  Disk or memory based.
  •  Partitioned by site, by user or by both.
–  Low-level API.

Example:

The Cube

time           user    rnd     siteid  url                               browser
1409425329634  "4szi"  "xzst"  "9978"  "cxnews.com"                      "Chrome"
1409425329634  "zthp"  "fd0z"  "9978"  "cxnews.com/seahawks-win-again…"  "Firefox"
1409425329635  "4szi"  "tzdt"  "9978"  "cxnews.com/tesla-model-3-will-…" "Chrome"
1409425329640  "4szi"  "aext"  "9978"  "cxnews.com/elon-musk-is-awes…"   "Chrome"
1409425329640  "zx5t"  "dxrf"  "9978"  "cxnews.com/tesla-model-3-will-…" "Safari"

Frame of Reference Compression
–  Compress the numbers in groups of 64.
–  If the sequence is increasing – use the first number as the reference and compute the differences between each two consecutive numbers (deltas).
–  Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
–  For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.
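
The scheme above can be sketched in a few lines. This is an illustration under stated assumptions rather than the production encoder: the function names are hypothetical, a Python integer stands in for the packed bit buffer, and "increasing" is read as non-decreasing so that repeated timestamps still take the consecutive-delta path.

```python
def for_encode(block):
    """Encode one block of integers as (increasing, reference, width, packed, count)."""
    if not block:
        raise ValueError("empty block")
    increasing = all(a <= b for a, b in zip(block, block[1:]))
    if increasing:
        # Reference is the first value; deltas are differences between neighbours.
        reference = block[0]
        deltas = [b - a for a, b in zip(block, block[1:])]
    else:
        # Reference is the smallest value; deltas are offsets from it.
        reference = min(block)
        deltas = [v - reference for v in block]
    # Fixed width: enough bits for the largest delta.
    width = max((d.bit_length() for d in deltas), default=0)
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    return increasing, reference, width, packed, len(block)


def for_decode(encoded):
    """Invert for_encode."""
    increasing, reference, width, packed, count = encoded
    mask = (1 << width) - 1
    n_deltas = count - 1 if increasing else count
    deltas = [(packed >> (i * width)) & mask for i in range(n_deltas)]
    if not increasing:
        return [reference + d for d in deltas]
    out = [reference]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def compress(values, block_size=64):
    """Split the values into groups of block_size and encode each group."""
    return [for_encode(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def decompress(blocks):
    out = []
    for block in blocks:
        out.extend(for_decode(block))
    return out
```

For the five timestamps in the example cube, the deltas are 0, 1, 5 and 0, so each delta fits in 3 bits instead of the 41 bits needed for a raw millisecond timestamp.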

The Cube – Integer Columns

–  A global lexicon maps all strings to numbers and back.
–  For each column, map global keys to a smaller set of numbers and back.
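
A minimal sketch of the two-level mapping described above. The class and method names are hypothetical, and plain dictionaries and lists stand in for whatever compact structures the real system uses. The point of the per-column mapping is that a column with few distinct values gets small local codes, which then compress well as integers.

```python
class Lexicon:
    """Global mapping: string <-> integer key, shared by all columns."""

    def __init__(self):
        self._keys = {}      # string -> global key
        self._strings = []   # global key -> string

    def key(self, s):
        if s not in self._keys:
            self._keys[s] = len(self._strings)
            self._strings.append(s)
        return self._keys[s]

    def string(self, key):
        return self._strings[key]


class StringColumn:
    """Per-column mapping: global key <-> small local code, plus one code per row."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._local = {}    # global key -> local code
        self._global = []   # local code -> global key
        self.codes = []     # row -> local code

    def append(self, s):
        g = self.lexicon.key(s)
        if g not in self._local:
            self._local[g] = len(self._global)
            self._global.append(g)
        self.codes.append(self._local[g])

    def get(self, row):
        return self.lexicon.string(self._global[self.codes[row]])
```

With the browser column of the example cube, the stored codes are 0, 1, 0, 0, 2 regardless of how many other strings (such as URLs) the global lexicon already holds.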

The Cube – String Columns

Structured data
–  Can represent any simple JSON object (document).
–  Node types: Null, Object, Array, Integer, Float, String, Boolean.
–  Stored in a separate container, separate columns for each node type.
–  Each document is decomposed into a list of paths and nodes.
–  Each node is added to the corresponding column.
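
One way to picture the decomposition is the sketch below. It is an assumption-laden illustration: the function names, the tuple-based path representation, and the choice to store the child count for Object and Array nodes are all hypothetical and not taken from the slides.

```python
def node_type(value):
    """Classify a parsed JSON value into one of the seven node types."""
    if value is None:
        return "Null"
    if isinstance(value, bool):   # check before int: bool is a subclass of int
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        return "String"
    if isinstance(value, dict):
        return "Object"
    if isinstance(value, list):
        return "Array"
    raise TypeError("not a JSON value: %r" % (value,))


def decompose(value, path=()):
    """Yield (path, node type, payload) for every node of a document."""
    kind = node_type(value)
    if kind == "Object":
        yield path, kind, len(value)
        for name, child in value.items():
            yield from decompose(child, path + (name,))
    elif kind == "Array":
        yield path, kind, len(value)
        for index, child in enumerate(value):
            yield from decompose(child, path + (index,))
    else:
        yield path, kind, value


def add_document(columns, doc_id, document):
    """Append every node of the document to the column for its node type."""
    for path, kind, payload in decompose(document):
        columns.setdefault(kind, []).append((doc_id, path, payload))
```

A document such as `{"type": "click", "tags": ["a", "b"]}` yields one Object node, one Array node and three String nodes, each landing in its own typed column.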

The Cube – Advanced Data Types

Filtering operations and tricks:
–  Keep a bit-filter over a range of rows (1 = exclude).
–  By a number or range – unset bits where the numbers do not match. Can use binary search for ordered columns such as time, and inverted indexes for unordered columns such as user id.
–  By a key or set of keys – map keys to a number or bit-set and filter.
–  By pattern – filter by the set of keys matching the pattern.
–  Logical AND, OR, NOT – use a stack of filters and binary operations.
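
The filtering tricks above can be sketched with Python integers used as bit-sets. Everything here is illustrative: the function names are hypothetical, shell-style wildcards via `fnmatch` stand in for whatever pattern language the system supports, and a set bit is taken to mean "row still selected", which is an assumption of this sketch and may differ from the bit convention on the slide.

```python
from bisect import bisect_left, bisect_right
from fnmatch import fnmatchcase


def all_rows(n_rows):
    return (1 << n_rows) - 1


def range_filter_sorted(column, lo, hi):
    """Select rows with lo <= value <= hi in an ascending column via binary search."""
    start, end = bisect_left(column, lo), bisect_right(column, hi)
    if end <= start:
        return 0
    return ((1 << (end - start)) - 1) << start


def build_inverted_index(column):
    """Map each distinct value of an unordered column to the bit-set of its rows."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def key_filter(index, keys):
    bits = 0
    for key in keys:
        bits |= index.get(key, 0)
    return bits


def pattern_filter(index, pattern):
    """Filter by the set of keys that match a shell-style pattern."""
    return key_filter(index, [k for k in index if fnmatchcase(k, pattern)])


def evaluate(n_rows, program):
    """Evaluate a postfix list of bit-filters and the operators AND, OR, NOT."""
    stack = []
    for item in program:
        if item == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)
        elif item == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        elif item == "NOT":
            stack.append(all_rows(n_rows) & ~stack.pop())
        else:
            stack.append(item)
    return stack.pop()


def rows(bits):
    """List the row numbers whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]
```

On the example cube, combining a time-range filter with a filter on user "4szi" through `AND` leaves only the rows where both conditions hold, without touching the url or browser columns.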

The Cube – Filtering and Aggregation

Some aggregation operations and tricks:
–  Count, Sum, Cardinality – bit-counting. Can use HyperLogLog (HLL) for distributed cardinality.
–  Frequency, SumBy, CardinalityMap – sorting and bit-counting using pairs of integers.
–  FrequencyDistribution, SumByDistribution, CardinalityDistribution – histograms, more sorting and bit-counting.
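
A rough sketch of a few of these aggregations over a bit-filter, where a set bit marks a selected row. The function names are hypothetical, exact sets are used instead of HLL, and the code favours clarity over the packed integer-pair layout the slides allude to.

```python
from itertools import groupby


def selected(bits):
    """Yield the row numbers whose bit is set."""
    row = 0
    while bits:
        if bits & 1:
            yield row
        bits >>= 1
        row += 1


def count(bits):
    """Count: the number of set bits in the filter."""
    return bin(bits).count("1")


def total(column, bits):
    """Sum: add up an integer column over the selected rows."""
    return sum(column[r] for r in selected(bits))


def cardinality(column, bits):
    """Cardinality: exact number of distinct values among the selected rows."""
    return len({column[r] for r in selected(bits)})


def frequency(column, bits):
    """Frequency: (key, count) pairs, most frequent first, obtained by sorting."""
    keys = sorted(column[r] for r in selected(bits))
    pairs = [(key, sum(1 for _ in group)) for key, group in groupby(keys)]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


def frequency_distribution(column, bits):
    """Histogram: (count, number of keys that occur that many times)."""
    counts = sorted(c for _, c in frequency(column, bits))
    return [(c, sum(1 for _ in group)) for c, group in groupby(counts)]
```

Because the key columns hold small integer codes rather than strings, the sorting and grouping steps operate on integers only; the codes are mapped back to strings once, for the final result.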


Advanced operations
–  Use aggregation output as filtering input (e.g., top-list, histogram, etc.).
–  Join between cubes on one or multiple dimensions.


Partitioning
–  Most of the data structures are partitioned into chunks of data.
–  This improves memory allocation, materialization, skipping, compression and locking.

Static and dynamic parts
–  Each data column, lexicon or mapping consists of a static and a dynamic part.
–  The static part is ordered – use binary search and Minimal Perfect Hashing.
–  The dynamic part is read-write – it has to be searched exhaustively, but this is improved using Wavelet Trees.
–  Updates are mostly appends, but an update can also be done via a deletion and a new write.

Maintenance
–  Periodically flush the dynamic part into the static part.
–  Remove the old data, delete unused strings, optimize the mapping.
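
The static/dynamic split and the periodic flush can be illustrated with a small key set. This is a deliberately simplified sketch: the class name and the `keep` parameter are hypothetical, the static part uses only binary search (no Minimal Perfect Hashing), and the dynamic part is a plain list scanned linearly (no Wavelet Trees).

```python
from bisect import bisect_left


class TwoPartKeySet:
    """A set of keys split into a sorted static part and an append-only dynamic part."""

    def __init__(self):
        self.static = []    # sorted; only rebuilt by flush()
        self.dynamic = []   # unsorted; receives all new keys

    def contains(self, key):
        # Static part: binary search over the sorted keys.
        i = bisect_left(self.static, key)
        if i < len(self.static) and self.static[i] == key:
            return True
        # Dynamic part: exhaustive scan.
        return key in self.dynamic

    def add(self, key):
        if not self.contains(key):
            self.dynamic.append(key)

    def flush(self, keep=lambda key: True):
        """Merge the dynamic part into the static part, dropping keys that keep() rejects."""
        merged = set(self.static) | set(self.dynamic)
        self.static = sorted(k for k in merged if keep(k))
        self.dynamic = []
```

Writes stay cheap because they only touch the small dynamic part, while reads pay the linear scan only for keys added since the last flush.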

The Cube – Updates

Thank you! Questions?

Credits: Erik Gorset and the Oslo R&D Team

…btw, we are hiring!

cxense.com facebook.com/cxense

twitter.com/cxense linkedin.com/company/cxense

youtube.com/user/cxense

One more thing… the Internet of Things!

Technology
