Large-Scale Real-Time Data Management for Engagement and Monetization
Simon Lia-Jonassen, LSDS-IR 2015
Our mission is to help companies understand their audience and build great online user experiences.
– Find interesting articles.
– Stay longer on the site.
– Get relevant ads.
– Sign up for subscriptions.
Some of our customers:
About Cxense
Our solutions
Cxense DMP
How does it work!? (JavaScript tag example)
Page view events (example)
Content profiles (example)
Custom events (example)
UI and API capabilities
User Segments
Data Volume and Traffic (monthly)
– 5 000 active websites
– 100 million pages
– 1 billion users
– 15 billion page views
Constraints and Requirements
– Online and real-time processing
• Show, analyze and act on what is happening exactly right now.
– High and sustainable performance
• Peak load of 10K+ requests/sec.
• 50 ms latency constraint for ads and recs.
– Availability, reliability, durability
• Multi-DC and fault tolerance.
– Security and privacy
Challenges
Heterogeneity and Reliability
– Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
– Multiple browsers and devices per user, cross-domain tracking (3rd party cookies are dying out).
– Web pages (articles, image/video galleries, chats, search/front pages) and human language.
– The Internet is Broken™
Customer success
– Providing the right insights.
• Data, metrics and visualization.
– Providing the right set of tools.
• Usability, brevity, expressiveness, completeness.
– Best practices.
• Analytics, ads, recs, user engagement, personalization and subscription optimization.
– Onboarding and support.
Challenges
Communication
– HTTP with JSON payload.
– Durable and idempotent.
Local storage
– Atomically append to file.
– Use a separate directory for each partition and a new file each hour.
– Tail files and/or directories.
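The local storage scheme above can be sketched as follows. This is a minimal illustration, not the production code: the directory layout, the file naming (`YYYY-MM-DD-HH.log`), and the function names `event_log_path` and `append_event` are assumptions made for this example. It relies on POSIX `O_APPEND` semantics and writes each event with a single `write` call so that lines from concurrent writers are not interleaved.

```python
import json
import os
import time


def event_log_path(root, partition, timestamp):
    """Return <root>/<partition>/<YYYY-MM-DD-HH>.log for the event's hour (UTC)."""
    hour = time.strftime("%Y-%m-%d-%H", time.gmtime(timestamp))
    return os.path.join(root, str(partition), hour + ".log")


def append_event(root, partition, event, timestamp=None):
    """Append one JSON event as a single line to the partition's hourly file."""
    timestamp = time.time() if timestamp is None else timestamp
    path = event_log_path(root, partition, timestamp)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    # O_APPEND positions every write at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line)  # one write call per event
        os.fsync(fd)        # force the line to disk before acknowledging
    finally:
        os.close(fd)
    return path
```

Because every hour gets its own file, a consumer can tail the newest file per partition directory and treat older files as closed.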
Metadata
– Keeps the state.
– Rewind and re-feed when needed.
System
– Configured via Upstart and Cron.
– Monitoring via Graphite and log files.
– Automatic alerting.
Architecture and Data Flow
Data Cubes
– Partitioned column store database.
– Efficient string handling and integer compression.
– Fast filtering and aggregation over billions of data points.
– Low update latency (100 ms).
– Exists in multiple variants:
• Disk or memory based.
• Partitioned by site, by user or by both.
– Low-level API.
Example:
The Cube
time           user    rnd     siteid  url                                 browser
1409425329634  “4szi”  “xzst”  “9978”  “cxnews.com”                        “Chrome”
1409425329634  “zthp”  “fd0z”  “9978”  “cxnews.com/seahawks-win-again…”    “Firefox”
1409425329635  “4szi”  “tzdt”  “9978”  “cxnews.com/tesla-model-3-will-…”   “Chrome”
1409425329640  “4szi”  “aext”  “9978”  “cxnews.com/elon-musk-is-awes…”     “Chrome”
1409425329640  “zx5t”  “dxrf”  “9978”  “cxnews.com/tesla-model-3-will-…”   “Safari”
Frame of Reference Compression
– Compress the numbers in groups of 64.
– If the sequence is increasing, use the first number as the reference and compute the differences between each two consecutive numbers (deltas).
– Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
– For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.
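The scheme above can be sketched for a single group as follows. This is an illustrative sketch only: the tuple layout, the mode names "delta" and "offset", and the use of a Python integer as the packed bit buffer are assumptions of this example, and a non-decreasing group (repeated values allowed) is treated as "increasing".

```python
def for_encode(block):
    """Encode up to 64 non-negative integers.

    Returns (mode, reference, width, count, packed), where packed holds
    the deltas back to back, each in `width` bits.
    """
    assert 0 < len(block) <= 64
    increasing = all(a <= b for a, b in zip(block, block[1:]))
    if increasing:
        reference = block[0]
        deltas = [b - a for a, b in zip(block, block[1:])]
    else:
        reference = min(block)
        deltas = [x - reference for x in block]
    # Width of the largest delta; 0 when every delta is zero.
    width = max((d.bit_length() for d in deltas), default=0)
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    mode = "delta" if increasing else "offset"
    return (mode, reference, width, len(block), packed)


def for_decode(encoded):
    """Invert for_encode and return the original list of integers."""
    mode, reference, width, count, packed = encoded
    mask = (1 << width) - 1
    n = count - 1 if mode == "delta" else count
    deltas = [(packed >> (i * width)) & mask for i in range(n)]
    if mode == "delta":
        out = [reference]
        for d in deltas:
            out.append(out[-1] + d)
        return out
    return [reference + d for d in deltas]
```

For the five timestamps in the example table the deltas are 0, 0, 1, 5, 0, so each delta fits in 3 bits instead of the 41 bits needed for a raw timestamp.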
The Cube – Integer Columns
– A global lexicon maps all strings to numbers and back.
– For each column, map global keys to a smaller set of numbers and back.
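A minimal sketch of the two-level mapping described above, assuming plain in-memory dictionaries and lists. The class names `Lexicon` and `StringColumn` and their methods are hypothetical and chosen for this example only.

```python
class Lexicon:
    """Global mapping: string <-> global key."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def encode(self, s):
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def decode(self, key):
        return self._strings[key]


class StringColumn:
    """Per-column mapping: global key <-> small local code, plus the rows."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._local = {}    # global key -> local code
        self._globals = []  # local code -> global key
        self.rows = []      # one local code per row

    def append(self, s):
        key = self.lexicon.encode(s)
        if key not in self._local:
            self._local[key] = len(self._globals)
            self._globals.append(key)
        self.rows.append(self._local[key])

    def get(self, row):
        return self.lexicon.decode(self._globals[self.rows[row]])
```

A column with few distinct values, such as browser, ends up storing small local codes that compress well even when the global lexicon holds many strings.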
The Cube – String Columns
Structured data
– Can represent any simple JSON object (document).
– Node types: Null, Object, Array, Integer, Float, String, Boolean.
– Stored in a separate container, with separate columns for each node type.
– Each document is decomposed into a list of paths and nodes.
– Each node is added to the corresponding column.
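The decomposition into paths and typed nodes can be sketched as below. The function names `decompose` and `store`, the tuple-based path representation, and the choice to record the child count as the value of Object and Array nodes are assumptions of this example; the slides do not specify the actual layout.

```python
def decompose(doc, path=()):
    """Yield (path, node_type, value) for every node of a JSON-like value."""
    if doc is None:
        yield path, "Null", None
    elif isinstance(doc, bool):  # must be tested before int
        yield path, "Boolean", doc
    elif isinstance(doc, int):
        yield path, "Integer", doc
    elif isinstance(doc, float):
        yield path, "Float", doc
    elif isinstance(doc, str):
        yield path, "String", doc
    elif isinstance(doc, list):
        yield path, "Array", len(doc)  # assumption: store the child count
        for i, item in enumerate(doc):
            yield from decompose(item, path + (i,))
    elif isinstance(doc, dict):
        yield path, "Object", len(doc)  # assumption: store the child count
        for key, item in doc.items():
            yield from decompose(item, path + (key,))
    else:
        raise TypeError("not a JSON value: %r" % (doc,))


def store(columns, doc_id, doc):
    """Append every node of doc to the column for its node type."""
    for path, node_type, value in decompose(doc):
        columns.setdefault(node_type, []).append((doc_id, path, value))
```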
The Cube – Advanced Data Types
Filtering operations and tricks:
– Keep a bit-filter over a range of rows (1 = exclude).
– By a number or range: unset bits where the numbers do not match. Can use binary search for ordered columns such as time, and inverted indexes for unordered columns such as user id.
– By a key or set of keys: map the keys to a number or bit-set and filter.
– By pattern: filter by the set of keys matching the pattern.
– Logical AND, OR, NOT: use a stack of filters and binary operations.
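A minimal sketch of these filtering tricks, using a Python integer as the bit-filter. The function names are hypothetical, and this sketch assumes that a set bit means the row is kept; with the opposite polarity the same structure applies with AND and OR swapped. Inverted indexes and pattern matching are left out.

```python
from bisect import bisect_left, bisect_right


def range_filter_sorted(column, lo, hi):
    """Bitmask of rows with lo <= value <= hi in a sorted column."""
    start = bisect_left(column, lo)   # binary search for the first match
    end = bisect_right(column, hi)    # binary search past the last match
    return ((1 << (end - start)) - 1) << start


def key_filter(column, keys):
    """Bitmask of rows whose value is in the given set of keys."""
    mask = 0
    for row, value in enumerate(column):
        if value in keys:
            mask |= 1 << row
    return mask


def evaluate(program, n_rows):
    """Evaluate a postfix program of bitmasks and 'AND'/'OR'/'NOT' on a stack."""
    all_rows = (1 << n_rows) - 1
    stack = []
    for item in program:
        if item == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)
        elif item == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        elif item == "NOT":
            stack.append(all_rows & ~stack.pop())
        else:
            stack.append(item)
    (result,) = stack
    return result


def rows(mask):
    """List the row numbers whose bit is set."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

For example, "time within a range AND NOT browser = Chrome" becomes the postfix program `[time_mask, chrome_mask, "NOT", "AND"]`.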
The Cube – Filtering and Aggregation
Some aggregation operations and tricks:
– Count, Sum, Cardinality: bit-counting. Can use HLL for distributed cardinality.
– Frequency, SumBy, CardinalityMap: sorting and bit-counting using pairs of integers.
– Frequency-, SumBy- and CardinalityDistribution: histograms, more sorting and bit-counting.
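As a rough illustration of the first two groups of operations, the sketch below works on a column of small integer codes and a bit-filter held in a Python integer (set bit = row selected, an assumption of this example). The function names are hypothetical, HLL is not shown, and the real implementation is certainly more elaborate.

```python
def popcount(bits):
    """Number of set bits."""
    return bin(bits).count("1")


def selected(column, mask):
    """Values of the rows whose bit is set in mask."""
    return [v for row, v in enumerate(column) if (mask >> row) & 1]


def count(mask):
    """Count: the number of selected rows is a bit count of the filter."""
    return popcount(mask)


def cardinality(column, mask):
    """Cardinality: set one bit per distinct code, then count the bits."""
    seen = 0
    for code in selected(column, mask):
        seen |= 1 << code
    return popcount(seen)


def frequency(column, mask):
    """Frequency: sort the selected codes, count runs, return (code, count)
    pairs ordered by descending count."""
    pairs = []
    for code in sorted(selected(column, mask)):
        if pairs and pairs[-1][0] == code:
            pairs[-1] = (code, pairs[-1][1] + 1)
        else:
            pairs.append((code, 1))
    pairs.sort(key=lambda p: (-p[1], p[0]))
    return pairs
```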
The Cube – Filtering and Aggregation
Advanced operations
– Use aggregation output as filtering input (e.g., top-list, histogram, etc.).
– Join between cubes on one or multiple dimensions.
The Cube – Filtering and Aggregation
Partitioning
– Most of the data structures are partitioned into chunks of data.
– This improves memory allocation, materialization, skipping, compression and locking.
Static and dynamic parts
– Each data column, lexicon or mapping consists of a static and a dynamic part.
– The static part is ordered: use binary search and Minimal Perfect Hashing.
– The dynamic part is read-write: it has to be searched exhaustively, but this is improved using Wavelet Trees.
– Updates are mostly appends, but updates can also be done via deletion and a new write.
Maintenance
– Periodically flush the dynamic part into the static part.
– Remove the old data, delete unused strings, optimize the mapping.
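The static/dynamic split and the periodic flush can be sketched as follows. The class name `TwoPartKeySet` is hypothetical, and this sketch deliberately substitutes simpler techniques: plain binary search instead of Minimal Perfect Hashing for the static part, and a linear scan instead of Wavelet Trees for the dynamic part.

```python
from bisect import bisect_left


class TwoPartKeySet:
    """A key set split into a sorted static part and an append-only dynamic part."""

    def __init__(self):
        self.static = []   # sorted; only rebuilt by flush()
        self.dynamic = []  # unsorted; receives all new keys

    def contains(self, key):
        # Static part: binary search over the sorted keys.
        i = bisect_left(self.static, key)
        if i < len(self.static) and self.static[i] == key:
            return True
        # Dynamic part: exhaustive scan.
        return key in self.dynamic

    def add(self, key):
        """Append a key that is not yet present."""
        if not self.contains(key):
            self.dynamic.append(key)

    def flush(self):
        """Merge the dynamic part into the static part and empty it."""
        self.static = sorted(self.static + self.dynamic)
        self.dynamic = []
```

Keeping the dynamic part small through regular flushes bounds the cost of the exhaustive scan.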
The Cube – Updates
Thank you! Questions?
Credits: Erik Gorset and the Oslo R&D Team
[email protected] …btw, we are hiring!
cxense.com facebook.com/cxense
twitter.com/cxense linkedin.com/company/cxense
youtube.com/user/cxense
One more thing… the Internet of Things!