Exalead managing terabytes

Content

• Introduction
• Databases
  – ACID
  – Data structures, algorithms
  – Scalability issues
  – Scaling patterns
• Search engines
  – Data structures, algorithms
  – Pros & cons
• NoSQL Movement
  – Why and What

Content

• NoSQL Families
  – Key-value stores
  – Column stores
  – Document stores
  – Graph DB
• Principles: CAP, scaling patterns, high availability patterns, elasticity
• How to choose?
• Conclusion

Introduction

• Who we are:
  – Clément STENAC (Indexing and search techs)
  – Jérémie BORDIER (360 team (a bit of everything))
• Exalead:
  – Indexing technologies provider since 1998
  – Online search engine: http://www.exalead.com
  – Daily challenge: tackle information access problems for large companies

Introduction

• Universal answer to data storage: RELATIONAL DATABASES
• Well-known data representation: objects and relationships
• Powerful query language: SQL
• Open source implementations:
  – MySQL
  – PostgreSQL
  – …

Introduction

• Database scalability problems?
• Used to be a telco and bank problem…
• Until the internet came along!

Twitter whale, 2008

Introduction

• Thanks to the internet…
• …tables with millions of rows are common…
• …and so are real-time websites.

How do we deal with massive amounts of structured data? Are there alternatives?

What’s this NoSQL buzz?

Knowing your enemy: RELATIONAL DATABASES

Databases: ACID

ACID constraints
• Atomicity: transactions succeed or fail atomically
• Consistency: transactions leave the database in a consistent state
• Isolation: transactions do not see the effects of concurrent transactions
• Durability: once a transaction is committed, it can’t be lost

Database structures: Primary storage

CREATE TABLE author (
  id INTEGER PRIMARY KEY,
  nick VARCHAR(16),
  age INTEGER,
  firstname VARCHAR(128),
  biography TEXT
);

CREATE TABLE post (
  id INTEGER PRIMARY KEY,
  author_id INTEGER REFERENCES author(id),
  timestamp TIMESTAMP,
  title VARCHAR(256),
  text TEXT
);

Row layout of the author table (Row 1, Row 2, … all share it):

  Column      Stored in the row   Size
  id          value, 4 bytes      fixed
  age         value, 4 bytes      fixed
  nick        value, 16 bytes     fixed
  firstname   pointer             variable
  biography   pointer             variable

• Each value or pointer can be retrieved at a known offset in the row
• Pointers lead to the strings table, a sequence of (len, data) entries
• Heuristics change a column to variable-size storage

Searching in a database

SELECT * FROM author WHERE age=24;

The raw way: full scan
• Enumerate all records in the table
• For each record, fetch the condition value
  – Inline value: direct access at row_address + offset(column)
  – Outside value: fetch the pointer, then fetch the data
• Perform the comparison

Analysis
• Need to analyse the full table
• Very CPU intensive
• If the table does not fit in memory?
  – I/O on the whole table
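The full scan can be sketched in a few lines of Python. This is only an illustration: the fixed-size row layout (id, age, nick) and the names `ROW`, `pack_row` and `full_scan` are assumptions made for the example, not the on-disk format of any real database.

```python
import struct

# Hypothetical fixed-size row: id (4 bytes), age (4 bytes), nick (16 bytes).
ROW = struct.Struct("<ii16s")
AGE_OFFSET = 4  # offset of the age column inside a row


def pack_row(row_id, age, nick):
    """Serialize one row; the nick is padded with zero bytes to 16 bytes."""
    return ROW.pack(row_id, age, nick.encode("utf-8"))


def full_scan(table, age):
    """Return the ids of all rows whose age column equals `age`."""
    hits = []
    # Enumerate every record: the whole table is read, match or not.
    for row_address in range(0, len(table), ROW.size):
        # Inline value: direct access at row_address + offset(column).
        (row_age,) = struct.unpack_from("<i", table, row_address + AGE_OFFSET)
        if row_age == age:
            (row_id,) = struct.unpack_from("<i", table, row_address)
            hits.append(row_id)
    return hits


table = b"".join([
    pack_row(1, 24, "alice"),
    pack_row(2, 31, "bob"),
    pack_row(3, 24, "carol"),
])
matching_ids = full_scan(table, 24)
```

Every row is visited even though only two match, which is the cost the analysis above points at.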

Database structures: Indexes

What is an index?
• Primary storage: forward mapping row_id –> row data
• Index: reverse mapping row data –> row_id(s)
• Updated together with the primary storage

Searching with an index
• Retrieve the row ids using the index
• Fetch the row data from primary storage

Database structures: Indexes – Hash index

How it works
• Stores hashes of column values in a hash table
• Retrieve through the hash table

Pros
• Very easy and fast to update
• Fast lookup: a single hash table lookup

Cons
• Only provides equality matching
• Unable to answer inequality queries
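A minimal sketch of a hash index, assuming the primary storage is a plain dict from row id to row. The names `build_hash_index` and `lookup` are hypothetical, and a Python dict stands in for the hash table.

```python
from collections import defaultdict

# Primary storage: forward mapping row_id -> row data.
rows = {
    1: {"nick": "alice", "age": 24},
    2: {"nick": "bob", "age": 31},
    3: {"nick": "carol", "age": 24},
}


def build_hash_index(rows, column):
    """Reverse mapping: column value -> list of row ids."""
    index = defaultdict(list)
    for row_id, row in rows.items():
        index[row[column]].append(row_id)
    return index


def lookup(index, rows, value):
    """Equality search: one hash lookup, then fetch rows from primary storage."""
    return [rows[row_id] for row_id in index.get(value, [])]


age_index = build_hash_index(rows, "age")
authors_aged_24 = lookup(age_index, rows, 24)
```

Nothing in this structure helps with `age < 30`: the hash table has no notion of order, so such a query falls back to a scan.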

Database structures: Indexes – B-Tree index

Pros
• Provides range and inequality queries easily
• Quite fast (logarithmic) operations

Cons
• More complex and expensive to update
• B-Tree rebalancing

Figure: binary search tree compared with a B-Tree
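To illustrate why an ordered index answers range queries, the sketch below uses a sorted list searched with `bisect` as a stand-in for a B-Tree. This is an assumption made for brevity: a real B-Tree keeps the same ordering property in a balanced, page-oriented tree that can also be updated cheaply, which a sorted list cannot.

```python
import bisect

# row_id -> age, as in the author table.
ages = {1: 24, 2: 31, 3: 24, 4: 57}

# Ordered index: (age, row_id) pairs sorted by age.
sorted_index = sorted((age, row_id) for row_id, age in ages.items())
keys = [age for age, _ in sorted_index]


def range_query(low, high):
    """Return the row ids with low <= age < high using two binary searches."""
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_left(keys, high)
    return [row_id for _, row_id in sorted_index[start:end]]


young_authors = range_query(0, 32)
```

Both bounds are found in logarithmic time, and the matching entries are contiguous in the index.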

Choosing how to search

Is indexed search always better?
• SELECT * FROM author WHERE age < 300;

Analysis
• The query matches every row, so the whole table is fetched either way
• Index: random lookups
• Full scan: sequential fetch

Choosing wisely
• Identify the expensive queries
• Use the EXPLAIN statement
• Only add indexes where they are required
• Indexes are expensive to update

Joining

Goal
• Put together data from several tables
• For some values in table A, find matching values in table B

Example
• SELECT * FROM post INNER JOIN author ON author.id = post.author_id WHERE author.age = 42;

Join algorithms

Nested loops
• Algorithm:
  foreach (author WHERE age = 42) {
    foreach (post) {
      if (post.author_id == author.id) {
        append post to the result set;
      }
    }
  }
• Very naive algorithm: runs in P×A time (P posts, A matching authors)
• Supports any join predicate

Hash join
• Algorithm:
  – Make a hashtable of the author ids matching the « age = 42 » condition
  – Scan the post table once
  – For each post, look up the hashtable to check if it matches a valid author
• Faster than nested loops (2 scans instead of A scans of the post table)
• Requires memory to store the hashtable
• Only supports equality predicates
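The hash join steps can be sketched as follows. The in-memory lists of dicts and the function name `hash_join` are assumptions for the example; a database works on pages and may spill the hashtable to disk.

```python
authors = [
    {"id": 1, "age": 42},
    {"id": 2, "age": 30},
    {"id": 3, "age": 42},
]
posts = [
    {"id": 10, "author_id": 1},
    {"id": 11, "author_id": 2},
    {"id": 12, "author_id": 3},
    {"id": 13, "author_id": 1},
]


def hash_join(authors, posts, age):
    """Join posts to the authors of a given age with one scan of each table."""
    # Build phase: hashtable of the authors matching the condition.
    matching_authors = {a["id"]: a for a in authors if a["age"] == age}
    # Probe phase: single scan of the post table.
    result = []
    for post in posts:
        author = matching_authors.get(post["author_id"])
        if author is not None:
            result.append((post, author))
    return result


joined = hash_join(authors, posts, 42)
joined_post_ids = [post["id"] for post, _ in joined]
```

The probe is a dictionary lookup, which is why only equality predicates are supported.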

Join algorithms

Merge join
• Needs both tables sorted by the join key
  – Post sorted by author_id
  – Author sorted by id
• Perform a single parallel scan of the two tables and identify matches
• Fastest algorithm, but needs sorted data
  – Disk-based sort for large data sets
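A minimal merge join sketch, assuming author ids are unique and both inputs are already sorted on the join key. The function name and the tuple output format are choices made for the example.

```python
def merge_join(authors, posts):
    """Join two lists in one parallel scan.

    `authors` must be sorted by id (unique ids),
    `posts` must be sorted by author_id.
    Returns (post id, author id) pairs.
    """
    result = []
    i = j = 0
    while i < len(authors) and j < len(posts):
        author, post = authors[i], posts[j]
        if post["author_id"] < author["id"]:
            j += 1  # post belongs to an author we do not have: skip it
        elif post["author_id"] > author["id"]:
            i += 1  # no more posts for this author: move on
        else:
            result.append((post["id"], author["id"]))
            j += 1  # several posts may share the same author
    return result


authors_aged_42 = [{"id": 1}, {"id": 3}]
posts_by_author = [
    {"id": 10, "author_id": 1},
    {"id": 13, "author_id": 1},
    {"id": 11, "author_id": 2},
    {"id": 12, "author_id": 3},
]
pairs = merge_join(authors_aged_42, posts_by_author)
```

Each cursor only moves forward, so both tables are read exactly once.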

Choice of join algorithm
• Performed automatically by the query optimizer (see EXPLAIN)
• Main parameters:
  – Relation cardinalities
  – Data order (presence of an ORDER BY clause?)
  – Available indexes
• JOINs are always expensive -> schema denormalization

Database scaling: Typical workloads

Mostly read workloads
• Example: Wikipedia
• First solution: high-level (frontend tier) caching
• Database scaling: 1 master – N slaves
  – Replication of changes from master to slaves
• Does not solve the write bottleneck problem

High write workloads
• Examples: credit cards, Twitter (>1000 tweets/second, 1000s of deliveries)
• Performance limited by write I/O throughput
  – Because of the « D » constraint
  – Hard to have more than 1000-2000 writes/second

Database scaling: Scaling writes

Multiple master setups
• All masters have the same data and share the updates
  – « Share-all » cluster architecture
• Extremely complex synchronization
  – Bi-directional replication
  – Conflict detection
• Bad performance
• Complex resilience
  – Downtime of a master: need a resync
• Complex, heavy and expensive architectures

Figure: Client 1 and Client 2 write to Master 1 and Master 2, which are linked by a bi-directional replication flow

Database scaling: Scaling writes

Sharding
• Split the data between the masters based on a criterion
  – Date
  – User id
  – hash(url), …
• Clients query the correct master for each piece of data
• No shared data between masters (« share-nothing »)

Figure: Client 1 and Client 2 each query Master 1 or Master 2, depending on the data

Database scaling: Problems with SQL sharding

Complexity
• Not integrated in SQL
• Need to perform the sharding in application code

Resilience
• Several machines but no resilience
• Loss of one master = loss of data (compare to RAID-0)

Loss of features
• You can’t do cross-shard joins

Complex evolutions
• How do you keep scaling?
• To add another machine, you need to change the distribution function
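The cost of changing the distribution function can be made concrete with a small experiment. The sketch assumes a simple `hash(key) mod number_of_masters` rule; the function `shard_for` and the `user:N` keys are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Application-side routing: pick the master that owns `key`."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


keys = [f"user:{i}" for i in range(1000)]

# Going from 2 to 3 masters changes the distribution function,
# so many keys now belong to a different machine and must be migrated.
moved = sum(1 for key in keys if shard_for(key, 2) != shard_for(key, 3))
```

With a modulo rule a key stays in place only when its hash gives the same remainder modulo 2 and modulo 3, so in expectation about two thirds of the keys move when the third master is added.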

Database scaling: Other SQL shortcomings

Strict schema
• It is good: it provides strong typing
• But, migration hell!
  – Web applications change quickly
  – Not « Agile »

On the other side: SEARCH ENGINES

A quick look at search engines

Differences from a traditional database
• Not designed for OLTP
  – Update by batches
  – No transactions; updates are available to readers « later »
• Heavily read-optimized

Full text search
• It’s more complex than LIKE ’%myword%’;
• Needs specific data structures

Search engines: Inverted lists

Exalead S.A. © 2010 CONFIDENTIAL

What it is
• A data structure mapping a « word identifier » to a list of « document identifiers »
• For each word of each document, store the positions

Documents
• Document 1: The quick fox
• Document 2: The lazy dog
• Document 3: The dog quick dog

Dictionary (word = word id)
• the = 1
• quick = 2
• fox = 3
• lazy = 4
• dog = 5

Inverted lists
• Word 1 (the): doc 1 (at position 0), doc 2 (at position 0), doc 3 (at position 0)
• Word 2 (quick): doc 1 (at position 1), doc 3 (at position 2)
• Word 3 (fox): doc 1 (at position 2)
• Word 4 (lazy): doc 2 (at position 1)
• Word 5 (dog): doc 2 (at position 2), doc 3 (at positions 1, 3)
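The inverted lists for the three example documents can be rebuilt with a short sketch. Lowercasing and splitting on whitespace are simplifying assumptions (a real engine has a proper tokenizer), and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

documents = {
    1: "The quick fox",
    2: "The lazy dog",
    3: "The dog quick dog",
}


def build_inverted_index(documents):
    """Return (dictionary, inverted lists).

    dictionary: word -> word id (ids assigned in order of first appearance)
    inverted:   word id -> {doc id: [positions of the word in that doc]}
    """
    dictionary = {}
    inverted = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            word_id = dictionary.setdefault(word, len(dictionary) + 1)
            inverted[word_id].setdefault(doc_id, []).append(position)
    return dictionary, inverted


dictionary, inverted = build_inverted_index(documents)
dog_list = inverted[dictionary["dog"]]
```

On this input the word ids and positions come out the same as in the example lists.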

Search engines: Searching with inverted lists

Single word query: dog
• Resolve the word to its id using the dictionary (wid 5)
• Fetch the inverted list for this id and simply read it
• We have the hits: document 2 and document 3

Boolean query: the AND dog
• Resolve words, fetch inverted lists
  – The: 1, 2, 3
  – Dog: 2, 3
• Perform intersection: hits = 2, 3

Boolean query: the OR dog
• Resolve/fetch
• Perform union: hits = 1, 2, 3
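A sketch of the two boolean operators on sorted lists of document ids. The `postings` dict below hard-codes the lists from the example, and the two-pointer intersection is one common way to do it, not necessarily what a given engine implements.

```python
# Sorted document ids per word, taken from the example inverted lists.
postings = {
    "the": [1, 2, 3],
    "dog": [2, 3],
}


def intersect(a, b):
    """AND: walk both sorted lists in parallel, keep the common doc ids."""
    hits, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            hits.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return hits


def union(a, b):
    """OR: every doc id present in at least one list, in sorted order."""
    return sorted(set(a) | set(b))


and_hits = intersect(postings["the"], postings["dog"])
or_hits = union(postings["the"], postings["dog"])
```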

Search engines: Searching with inverted lists

Positional query: the NEXT dog
• Fetch the inverted lists and also read the positions
  – The: 1(0), 2(0), 3(0)
  – Dog: 2(2), 3(1,3)
• Identify “simple boolean” matches: docs 2 and 3
• For each possible match, check if positions form a sequence
• Only document 3 matches, on sequence (0,1)
• Positional queries are more expensive, and storing word positions is expensive (disk space, decoding CPU, I/O)
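The positional check can be sketched on top of the same example lists. The dict-of-positions representation and the name `next_query` are assumptions for the illustration.

```python
# doc id -> positions, taken from the example inverted lists.
the_list = {1: [0], 2: [0], 3: [0]}
dog_list = {2: [2], 3: [1, 3]}


def next_query(first, second):
    """Return the docs where a `second` word directly follows a `first` word."""
    hits = []
    # Step 1: simple boolean match (docs containing both words).
    for doc_id in sorted(first.keys() & second.keys()):
        second_positions = set(second[doc_id])
        # Step 2: check whether some pair of positions forms a sequence.
        if any(position + 1 in second_positions for position in first[doc_id]):
            hits.append(doc_id)
    return hits


phrase_hits = next_query(the_list, dog_list)
```

Document 2 passes the boolean step but fails the position check, because "dog" is at position 2 while "the" is at position 0.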

The revolution: THE NOSQL MOVEMENT

NoSQL Movement

• « NoSQL » © Eric EVANS (Rackspace, 2009)

“The name was an attempt to describe the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide ACID guarantees.”
Wikipedia

NoSQL Movement: Issue

• RDBMSs fail with huge amounts of data
  – Facebook’s 70TB of inbox
  – Digg’s 3TB
  – eBay’s 2PB…
• High scale SQL systems are either:
  – Very expensive to buy and quite expensive to maintain
  – Very expensive to maintain

NoSQL Movement

• We need new systems that:
  – Scale horizontally (both reads and writes)
  – Have no single point of failure
  – Are fault tolerant
  – Are elastic (adding nodes is easy)
  – Have flexible data schemas
  – Are friendlier to web applications

NoSQL: Families

• Different types of data stores:
  – Key-value stores (Dynamo, Redis, Voldemort…)
  – Column stores (BigTable, Cassandra, HBase…)
  – Document stores (CouchDB, MongoDB…)
  – Graph stores (Neo4J, Swarm…)

NoSQL: Key-value stores

• Distributed hashtables
  – B-Trees
  – Fixed-size tables
• Benefits:
  – Very simple API (get/put/delete/range)
  – Easily shardable
  – Fast reads
• Drawbacks:
  – No data schema (no joins, data flattening…)
  – No query language
• Implementations: Redis, Amazon Dynamo, Voldemort

NoSQL: Column Stores

  Id   Lastname   Firstname   Salary
  1    Smith      Joe         40000
  2    Jones      Mary        50000
  3    Johnson    Cathy       44000

• Row based storage:
  – 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
• Column based storage:
  – 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
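The two layouts can be compared with a small sketch built on the same three records. Python lists stand in for the on-disk layout here, which is an assumption: the point is only which values end up next to each other.

```python
COLUMN_NAMES = ("id", "lastname", "firstname", "salary")

# Row based storage: one tuple per record.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

# Column based storage: one list per column.
columns = {
    name: [row[i] for row in rows]
    for i, name in enumerate(COLUMN_NAMES)
}

# Aggregate: the column layout reads one contiguous list...
total_salary = sum(columns["salary"])


# ...but rebuilding a single record touches every column.
def read_record(columns, index):
    return tuple(columns[name][index] for name in COLUMN_NAMES)


second_record = read_record(columns, 1)
```

The aggregate only touches the salary list, while reading one record needs one access per column.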


• Benefits:
  – Reading all the values of a given column is faster (ex: aggregates)
  – Batch writes are faster
• Joins are faster
  – Comparing two columns is sequential
  – Many more L1 CPU cache hits
    L1 cache reference: 0.5ns
    L2 cache reference: 7ns


• Drawbacks:
  – Reading a single object is slower (multiple I/Os)
  – Writing a single object is slower (multiple I/Os)
  – Doesn’t fit most applications
• Finally:
  – Well suited for heavy write / read applications (e.g. Facebook inbox indexes)

NoSQL: Document Stores

• Can be seen as a schema-free, hierarchical database (usually represented as JSON)

SQL schema:
  Person: id, name, address, phone
  Animal: id, person_id, name, address, phone
  Relationship: 1 Person – N Animals

Document store:
  Person:
    - id
    - name
    - address
    - phone
    - animals =
        - id
        - person_id
        - name
        - address
        - phone

NoSQL: Document Stores

• Benefits:
  – Data locality! Everything in one place
  – Efficient writes and updates (in place)
  – Efficient reads
  – Highly flexible data schema
  – Usually provide indexes over each object key, which allows a powerful query language
• Drawbacks:
  – Doesn’t encourage well-designed data schemas

NoSQL: Graph Stores

• An entry is a node
• Nodes have properties
• Edges are links between nodes

NoSQL: Graph Stores

• Benefits:
  – Faster to fetch an entry and its related entries (links are already resolved, no need to join)
  – Flexible data schema
• Drawbacks:
  – Complex APIs
  – Slow for batch operations
  – Open source implementations are not that good…

SCALABILITY IN PRACTICE The real issues…


CAP Theorem

• CAP:
  – Consistency: operating fully or not at all.
  – Availability: the service must be reachable at any time.
  – Partition Tolerance: no set of failures less than total network failure is allowed to cause the system to respond incorrectly.

“Any shared-data system can only achieve two of these three.”
CAP Theorem, Dr. Eric Brewer, Berkeley (2000)

Consistent Hashing

• Ensuring data availability: replication!
• Reaching the right nodes? Hashing
• Consistent hashing: hash ring
  – Objects are mapped into a range
  – Nodes are mapped into that range
  – We write the object into the nearest node, clockwise

Data consistency

• Ensuring eventual data consistency: quorum writes
  – W = number of replicas that must acknowledge a write before returning OK
  – R = number of replicas that must answer a read
  – N = replication factor
• W < N == High write availability
  – Data may be lost or outdated if read from another node
• R < N == High read availability
  – Data may be outdated
• W + R > N == Full consistency!
  – But slower writes / reads
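Why W + R > N gives consistent reads can be checked by brute force: a read is guaranteed to see the latest write only if every possible set of R replicas shares at least one replica with every possible set of W replicas. The function name below is hypothetical, and the sketch ignores failures and timing.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if any W written replicas and any R read replicas share a member."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )


# N = 3: W = 2, R = 2 always overlap; W = 1, R = 2 may miss the write.
strict = quorums_always_overlap(3, 2, 2)
loose = quorums_always_overlap(3, 1, 2)
```

With N = 3, W = 1 and R = 2, a write stored only on replica 0 is invisible to a read that asks replicas 1 and 2, which is the "data may be outdated" case above.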

Conflict resolution

• What happens when R > 1 and two different versions are found?
• Conflict resolution!
• Common algorithm: vector clocks

Vector clocks

• Assign a unique ID to each node
• A node increments its own entry in the vector and keeps track of the old entries
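A minimal vector clock sketch, assuming a clock is a dict from node ID to counter. The helper names `increment`, `descends` and `compare` are choices made for the example, not the API of a particular store.

```python
def increment(clock, node_id):
    """Return a copy of `clock` where `node_id` has advanced its own entry."""
    updated = dict(clock)  # the old entries are kept as they are
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated


def descends(a, b):
    """True if clock `a` has seen everything that clock `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent updates: needs conflict resolution


v1 = increment({}, "A")   # first write, handled by node A
v2 = increment(v1, "B")   # update of v1 handled by node B
v3 = increment(v1, "C")   # concurrent update of v1 handled by node C
```

Here `v2` descends from `v1`, so it can safely replace it, while `v2` and `v3` were both derived from `v1` independently and are reported as a conflict.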

Elasticity: Gossip Membership

• When a node joins…

Elasticity: Gossip Membership

• When a node crashes!

WHAT’S THE BEST SYSTEM?

I’m starting the next big startup…

Choosing your storage system

• “Don’t optimize too early”
• MySQL is robust and works VERY well
  – You’ll know where bugs come from (you)
• Key-value stores are hyped, and often badly implemented
• Anyway, the most mature “NoSQL” systems are:
  – MongoDB
  – Cassandra

Questions?
