Content
• Introduction
• Databases
  – ACID
  – Data structures, algorithms
  – Scalability issues
  – Scaling patterns
• Search engines
  – Data structures, algorithms
  – Pros & cons
• NoSQL Movement
  – Why and What
Content
• NoSQL Families
  – Key value stores
  – Column stores
  – Document stores
  – Graph DB
• Principles: CAP, Scaling patterns, High availability patterns, Elasticity
• How to choose?
• Conclusion
Introduction
• Who we are:
  – Clément STENAC (Indexing and search techs)
  – Jérémie BORDIER (360 team (a bit of everything))
• Exalead:
  – Indexing technologies provider since 1998
  – Online search engine: http://www.exalead.com
  – Daily challenge: tackle information access problems for large companies.
Introduction
• Universal answer to data storage: RELATIONAL DATABASES
• Well known data representation: Objects and relationships
• Powerful query language: SQL
• Open source implementations:
  – MySQL
  – PostgreSQL
  – …
Introduction
• Database scalability problems?
• Used to be a telco and bank problem…
• Until the internet came along!
(Image: Twitter whale, 2008)
Introduction
• Thanks to the internet…
• …tables with millions of rows are common…
• …and so are real-time websites.
How do we deal with massive amounts of structured data? Are there alternatives?
What’s this NoSQL buzz?
Knowing your enemy: RELATIONAL DATABASES
Databases: ACID
ACID constraints
• Atomicity: transactions succeed or fail atomically
• Consistency: transactions leave the database in a consistent state
• Isolation: transactions do not see the effects of concurrent transactions
• Durability: once a transaction is committed, it can’t be lost
Database structures: Primary storage
CREATE TABLE author (
    id INTEGER PRIMARY KEY,
    nick VARCHAR(16),
    age INTEGER,
    firstname VARCHAR(128),
    biography TEXT
);
CREATE TABLE post (
    id INTEGER PRIMARY KEY,
    author_id INTEGER REFERENCES author(id),
    timestamp TIMESTAMP,
    title VARCHAR(256),
    text TEXT
);
Row layout of the author table (Row 1, Row 2, … all share the same layout):
• Fixed size, stored inline in the row:
  – id: 4 bytes
  – age: 4 bytes
  – nick: 16 bytes
• Variable size, stored as a pointer into the strings table:
  – firstname: pointer
  – biography: pointer
• Strings table: a sequence of (len, data) entries
Each value or pointer can be retrieved at a known offset in the row.
Heuristics can change a fixed-size column to variable-size storage.
Searching in a database
SELECT * FROM author WHERE age=24;
The raw way: full scan
• Enumerate all records in the table
• For each record, fetch the condition value
  – Inline value: direct access at row_address + offset(column)
  – Outside value: fetch pointer and fetch data
• Perform comparison
Analysis
• Need to analyse the full table
• Very CPU intensive
• If the table does not fit in memory?
  – I/O on the whole table
Database structures: Indexes
What is an index?
• Primary storage: forward mapping row_id –> row data
• Index: reverse mapping row data –> row_id(s)
• Updated together with the primary storage
Searching with an index
• Retrieve the row ids using the index
• Fetch the row data from primary storage
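The difference between a full scan and an index lookup can be sketched in a few lines of Python. This is only an in-memory illustration: the sample rows and the helper names (`full_scan`, `build_index`, `index_search`) are assumptions made for the example, not how a real database engine stores its data.

```python
from collections import defaultdict

# Primary storage: forward mapping row_id -> row data (illustrative rows).
authors = {
    1: {"nick": "alice", "age": 24},
    2: {"nick": "bob", "age": 42},
    3: {"nick": "carol", "age": 24},
}


def full_scan(table, column, value):
    """Enumerate every row and compare the condition value."""
    return [row_id for row_id, row in table.items() if row[column] == value]


def build_index(table, column):
    """Index: reverse mapping column value -> list of row ids."""
    index = defaultdict(list)
    for row_id, row in table.items():
        index[row[column]].append(row_id)
    return index


def index_search(table, index, value):
    """Retrieve the row ids from the index, then fetch rows from primary storage."""
    return [table[row_id] for row_id in index.get(value, [])]


age_index = build_index(authors, "age")
rows = index_search(authors, age_index, 24)
```

The full scan touches every row, while the index search only touches the rows whose ids the index returns. The price is that `age_index` must be updated whenever `authors` changes.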
Database structures: Indexes – Hash index
How it works
• Stores hashes of column values in a hash table
• Retrieve through the hash table
Pros
• Very easy and fast to update
• Fast lookup: a single hash table lookup
Cons
• Only provides equality matching
• Unable to answer inequality queries
Database structures: Indexes – BTree index
Pros
• Provides range and inequality queries easily
• Quite fast (logarithmic) operations
Cons
• More complex and expensive to update
• B-Tree rebalancing
(Figure: a binary search tree next to a B-Tree)
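To make the trade-off between the two index types concrete, here is a rough Python sketch. A plain `dict` plays the role of the hash index. A sorted list searched with `bisect` stands in for the B-Tree: it is not a B-Tree, but it offers the same kind of ordered, logarithmic lookup. The sample data and function names are assumptions for illustration.

```python
import bisect

ages = {1: 24, 2: 42, 3: 24, 4: 31}  # row_id -> age (illustrative data)

# Hash index: value -> row ids. It can only answer equality lookups.
hash_index = {}
for row_id, age in ages.items():
    hash_index.setdefault(age, []).append(row_id)

# Ordered index: sorted (value, row_id) pairs, standing in for a B-Tree.
ordered_index = sorted((age, row_id) for row_id, age in ages.items())


def lookup_equal(age):
    """Single hash table lookup."""
    return hash_index.get(age, [])


def lookup_less_than(age):
    """Binary search for the first entry with value >= age; all earlier entries match."""
    end = bisect.bisect_left(ordered_index, (age,))
    return [row_id for _, row_id in ordered_index[:end]]
```

`lookup_less_than` has no equivalent on the hash index, because hashing destroys the ordering of the values.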
Choosing how to search
Is indexed search always better?
• SELECT * from author where age < 300;
Analysis
• Every row matches, so the whole table is fetched either way
• Index: random lookups
• Full scan: sequential fetch
Choosing wisely
• Identify the expensive queries
• Use the EXPLAIN statement
• Only add indexes where they are required
• Indexes are expensive to update
Joining
Goal
• Put together data from several tables
• For some values in table A, find matching values in table B
Example
• SELECT * FROM post INNER JOIN author ON author.id = post.author_id WHERE author.age = 42;
Join algorithms
Nested loops
• foreach (author WHERE age=42) {
      foreach (post) {
          if (post.author_id == author.id) {
              append post to the result set;
          }
      }
  }
• Very naive algorithm: runs in P×A time (P posts, A matching authors)
• Supports all join predicates
Hash join
• Algorithm
  – Make a hashtable of author ids matching the « age = 42 » condition
  – Scan the post table once
  – For each post, look it up in the hashtable to check if it matches a valid author
• Faster than nested loops (2 scans instead of A)
• Requires memory to store the hashtable
• Only supports equality predicates
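The two algorithms above can be compared side by side in Python. This is a minimal in-memory sketch on made-up rows. The function names are illustrative, and a Python `set` stands in for the hashtable.

```python
authors = [
    {"id": 1, "age": 42},
    {"id": 2, "age": 30},
    {"id": 3, "age": 42},
]
posts = [
    {"id": 10, "author_id": 1},
    {"id": 11, "author_id": 2},
    {"id": 12, "author_id": 3},
    {"id": 13, "author_id": 1},
]


def nested_loop_join(authors, posts, age):
    """One full scan of posts for every matching author."""
    result = []
    for author in authors:
        if author["age"] != age:
            continue
        for post in posts:
            if post["author_id"] == author["id"]:
                result.append(post["id"])
    return result


def hash_join(authors, posts, age):
    """Build a hashtable of matching author ids, then scan posts once."""
    matching_ids = {a["id"] for a in authors if a["age"] == age}
    return [p["id"] for p in posts if p["author_id"] in matching_ids]
```

Both functions return the same set of posts, possibly in a different order. The hash join only works here because the join condition is an equality.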
Join algorithms
Merge join
• Need to have both tables sorted by join key
  – Post sorted by author_id
  – Author sorted by id
• Perform a single parallel scan of the two tables and identify matches
• Fastest algorithm, but needs sorted data
• Disk-based sort for large data sets
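A merge join over two already sorted inputs might look like the sketch below. It assumes that author ids are unique and that both lists are already sorted on the join key. The sample rows and field names are invented for the example.

```python
def merge_join(authors, posts):
    """Join authors (sorted by id, unique) with posts (sorted by author_id)."""
    result = []
    i = j = 0
    while i < len(authors) and j < len(posts):
        author_id = authors[i]["id"]
        post_author_id = posts[j]["author_id"]
        if author_id < post_author_id:
            i += 1  # this author has no more posts
        elif author_id > post_author_id:
            j += 1  # this post has no matching author
        else:
            result.append((authors[i]["nick"], posts[j]["title"]))
            j += 1  # author ids are unique, so only the post cursor advances
    return result


authors = [
    {"id": 1, "nick": "alice"},
    {"id": 2, "nick": "bob"},
    {"id": 4, "nick": "dave"},
]
posts = [
    {"author_id": 1, "title": "a1"},
    {"author_id": 1, "title": "a2"},
    {"author_id": 3, "title": "orphan"},
    {"author_id": 4, "title": "d1"},
]
joined = merge_join(authors, posts)
```

Each cursor only ever moves forward, so both tables are read exactly once.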
Choice of join algorithm
• Performed automatically by the query optimizer (EXPLAIN)
• Main parameters:
  – Relations cardinalities
  – Data order (presence of an ORDER BY clause?)
  – Available indexes
• JOINs are always expensive -> schema denormalization
Database scaling: Typical workloads
Mostly read workloads
• Example: Wikipedia
• First solution: high-level (frontend tier) caching
• Database scaling: 1 master – N slaves
• Replication of changes from master to slaves
• Does not solve the write bottleneck problem
High write workloads
• Examples: credit cards, Twitter (>1000 tweets/second, 1000s of deliveries)
• Performance limited by write I/O throughput
  – Because of the « D » (durability) constraint
  – Hard to have more than 1000-2000 writes/second
Database scaling: Scaling writes
Multiple master setups
• All masters have the same data and share the updates
  – « share-all » cluster architecture
• Extremely complex synchronization
  – Bi-directional replication
  – Conflict detection
• Bad performance
• Complex resilience
  – Downtime of a master: need a resync
• Complex, heavy and expensive architectures
(Diagram: Client 1 and Client 2 talk to Master 1 and Master 2, which are linked by a bi-directional replication flow)
Database scaling: Scaling writes
Sharding
• Split the data between the masters based on a criterion
  – Date
  – User id
  – hash(url), …
• Clients query the correct master for each piece of data
• No shared data between masters (« share-nothing »)
(Diagram: Client 1 and Client 2 querying Master 1 and Master 2)
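As an illustration of the hash(url) criterion, client-side routing could look roughly like this. The master names and the `shard_for` helper are hypothetical. The sketch uses `hashlib.md5` rather than Python's built-in `hash()`, because the latter is randomized per process for strings and would send the same URL to different masters from different clients.

```python
import hashlib

MASTERS = ["master1", "master2"]  # hypothetical node names


def shard_for(key, masters=MASTERS):
    """Pick the master responsible for a key using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return masters[int(digest, 16) % len(masters)]


target = shard_for("http://www.exalead.com")
```

Every client that applies the same function reaches the same master for a given key, without any coordination between the masters.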
Database scaling: Problems with SQL sharding
Complexity
• Not integrated in SQL
• Need to perform the sharding in application code
Resilience
• Several machines but no resilience
• Loss of one master = loss of data (compare to RAID-0)
Loss of features
• You can’t do cross-shard joins
Complex evolutions
• How do you keep scaling?
• To add another machine, you need to change the distribution function
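The cost of changing the distribution function can be estimated with a small experiment. The sketch below assumes a simple "hash modulo number of shards" function, which is one common choice and not necessarily the one a given system uses. It measures how many keys change shard when going from 4 to 5 machines. With a uniform hash, a key stays in place only if its hash gives the same remainder modulo 4 and modulo 5, which happens for about 1 key in 5, so about 80% of the keys move.

```python
import hashlib


def shard_index(key, shard_count):
    """Hash-modulo distribution function (an assumption for this sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_index(key, old_count) != shard_index(key, new_count)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]  # synthetic keys
fraction = moved_fraction(keys, 4, 5)
```

Moving most of the data for each added machine is what makes naive sharding hard to grow.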
Database scaling: Other SQL shortcomings
Strict schema
• It is good, it provides strong typing
• But, migration hell!
• Web applications change quickly
• Not « Agile »
On the other side: SEARCH ENGINES
A quick look at search engines
Differences from a traditional database
• Not designed for OLTP
  – Update by batches
  – No transactions, updates are available to readers « later »
• Heavily read-optimized
Full text search
• It’s more complex than LIKE '%myword%';
• Need specific data structures
Search engines: Inverted lists
Exalead S.A. © 2010 CONFIDENTIAL
What it is
• A data structure mapping a « word identifier » to a list of « document identifiers »
• For each word of each document, store the positions
Example documents
• Document 1: The quick fox
• Document 2: The lazy dog
• Document 3: The dog quick dog
Dictionary (word –> word id)
• the = 1
• quick = 2
• fox = 3
• lazy = 4
• dog = 5
Inverted lists
• List for word 1 (the): doc 1 (at position 0), doc 2 (at position 0), doc 3 (at position 0)
• List for word 2 (quick): doc 1 (at position 1), doc 3 (at position 2)
• List for word 3 (fox): doc 1 (at position 2)
• List for word 4 (lazy): doc 2 (at position 1)
• List for word 5 (dog): doc 2 (at position 2), doc 3 (at positions 1, 3)
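The example from the slide can be rebuilt with a short Python sketch. It assumes whitespace tokenization, lowercasing, and word ids assigned in order of first appearance. Those choices happen to reproduce the ids shown above, but a real engine's dictionary and storage format would be far more elaborate.

```python
from collections import defaultdict

documents = {
    1: "The quick fox",
    2: "The lazy dog",
    3: "The dog quick dog",
}


def build_inverted_index(documents):
    """Return (dictionary, inverted lists) for a {doc_id: text} mapping."""
    dictionary = {}  # word -> word id, assigned in order of first appearance
    inverted = defaultdict(lambda: defaultdict(list))  # word id -> doc id -> positions
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            word_id = dictionary.setdefault(word, len(dictionary) + 1)
            inverted[word_id][doc_id].append(position)
    return dictionary, inverted


dictionary, inverted = build_inverted_index(documents)
dog_list = dict(inverted[dictionary["dog"]])
```

`dog_list` holds, for each document containing "dog", the positions at which the word occurs.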
Search engines: Searching with inverted lists
Single word query: dog
• Resolve the word to its id using the dictionary (wid 5)
• Fetch the inverted list for this id
• Simply read the inverted list
• We have the hits: document 2 and document 3
Boolean query: the AND dog
• Resolve words, fetch inverted lists
  – The: 1,2,3
  – Dog: 2,3
• Perform intersection: hits = 2,3
Boolean query: the OR dog
• Resolve/fetch
• Perform union: hits = 1, 2, 3
Search engines: Searching with inverted lists
Positional query: the NEXT dog
• Fetch the inverted lists and also read the positions
  – The: 1(0), 2(0), 3(0)
  – Dog: 2(2), 3(1,3)
• Identify “simple boolean” matches: docs 2 and 3
• For each possible match, check if positions form a sequence
• Only document 3 matches, on sequence (0,1)
• Positional queries are more expensive and storing word positions is expensive (disk space, decoding CPU, I/O)
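The three query types can be sketched directly on the inverted lists quoted in the slides. The `postings` structure and the function names below are illustrative assumptions. Real engines work on compressed lists and merge them in a streaming fashion rather than with Python sets.

```python
# Inverted lists taken from the slides: word -> {doc id: [positions]}
postings = {
    "the": {1: [0], 2: [0], 3: [0]},
    "dog": {2: [2], 3: [1, 3]},
}


def query_and(a, b):
    """Documents containing both words: intersection of the two lists."""
    return sorted(postings[a].keys() & postings[b].keys())


def query_or(a, b):
    """Documents containing either word: union of the two lists."""
    return sorted(postings[a].keys() | postings[b].keys())


def query_next(a, b):
    """Documents where word b appears immediately after word a."""
    hits = []
    for doc_id in query_and(a, b):  # start from the simple boolean matches
        positions_b = set(postings[b][doc_id])
        if any(pos + 1 in positions_b for pos in postings[a][doc_id]):
            hits.append(doc_id)
    return hits
```

The positional query does strictly more work than the AND query: it first computes the same intersection, then decodes and compares positions for every candidate.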
The revolution: THE NOSQL MOVEMENT
NoSQL Movement
• « NoSQL » © Eric EVANS (Rackspace, 2009)
The name was an attempt to describe the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide ACID guarantees.
Wikipedia
NoSQL Movement: Issue
• RDBMSs fail with huge amounts of data
  – Facebook’s 70TB of inbox
  – Digg’s 3TB
  – eBay’s 2PB…
• High scale SQL systems are either:
  – Very expensive to buy and quite expensive to maintain
  – Very expensive to maintain
NoSQL Movement
• We need new systems that:
  – Scale horizontally (both reads and writes)
  – Have no single point of failure
  – Are fault tolerant
  – Are elastic (adding nodes is easy)
  – Have flexible data schemas
  – Are friendlier to web applications
NoSQL: Families
• Different types of data stores:
  – Key-Value stores (Dynamo, Redis, Voldemort…)
  – Column stores (BigTable, Cassandra, HBase…)
  – Document stores (CouchDB, MongoDB…)
  – Graph stores (Neo4J, Swarm…)
NoSQL: Key-Value stores
• Distributed hashtables
  – Btrees
  – Fixed sized tables
• Benefits:
  – Very simple API (get/put/delete/range)
  – Easily shardable
  – Fast reads
• Drawbacks:
  – No data schema (no joins, data flattening…)
  – No query language
• Implementations: Redis, Amazon Dynamo, Voldemort
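To show how small the get/put/delete/range API is, here is a single-node, in-memory sketch. The `KeyValueStore` class is hypothetical and does not mirror the API of Redis, Dynamo or Voldemort. It keeps a sorted key list next to a `dict` so that `range` can be answered with a binary search.

```python
import bisect


class KeyValueStore:
    """Toy single-node key-value store with ordered range scans."""

    def __init__(self):
        self._data = {}
        self._sorted_keys = []

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._sorted_keys, key)  # keep keys ordered
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._sorted_keys.remove(key)

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._sorted_keys, start)
        hi = bisect.bisect_left(self._sorted_keys, end)
        return [(k, self._data[k]) for k in self._sorted_keys[lo:hi]]


store = KeyValueStore()
store.put("user:2", "bob")
store.put("user:1", "alice")
store.put("post:1", "hello")
```

There is no schema and no query language here: anything beyond a key lookup or a key range has to be done in application code.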
NoSQL: Column Stores
Id  Lastname  Firstname  Salary
1   Smith     Joe        40000
2   Jones     Mary       50000
3   Johnson   Cathy      44000
• Row based storage:
  – 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
• Column based storage:
  – 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
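The two layouts can be reproduced from the same table with a short Python sketch. The serialization format simply follows the strings on the slide, and the helper names are assumptions made for the example.

```python
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_rows(rows):
    """Row based storage: all the fields of a record are adjacent."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def serialize_columns(rows):
    """Column based storage: all the values of a column are adjacent."""
    columns = zip(*rows)  # transpose rows into columns
    return "".join(",".join(str(v) for v in column) + ";" for column in columns)


# With a columnar layout, an aggregate only needs to read one column.
salary_column = list(zip(*rows))[3]
total_salary = sum(salary_column)
```

Summing salaries over the row layout would require walking through every field of every record. The column layout reads one contiguous run of values.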
NoSQL: Column Stores
• Benefits:
  – Reading all the values of a given column is faster (ex: aggregates)
  – Batch writes are faster
• Joins are faster
  – Comparing two columns is sequential
  – Many more L1 CPU cache hits
  – L1 cache reference: 0.5ns
  – L2 cache reference: 7ns
NoSQL: Column Stores
• Drawbacks:
  – Reading a single object is slower (multiple I/Os)
  – Writing a single object is slower (multiple I/Os)
  – Doesn’t fit most applications
• Finally:
  – Well suited for heavy write / read applications
    • (e.g. Facebook inbox indexes)
NoSQL: Document Stores
• Can be seen as a schema free, hierarchical database (usually represented as JSON)
SQL Schema:
Person:
  - id
  - name
  - address
  - phone
Animal (1 Person to N Animals):
  - id
  - person_id
  - name
  - address
  - phone
Document store:
Person:
  - id
  - name
  - address
  - phone
  - animals =
      - id
      - person_id
      - name
      - address
      - phone
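In concrete terms, the document version of the Person/Animal schema might look like the Python structure below, serialized as JSON. All field values are invented, and the `animal_names` helper is only there to show that the related entries are read without a join.

```python
import json

# One self-contained document: the person and their animals live together.
person_document = {
    "id": 1,
    "name": "Alice",
    "address": "1 Example Street",
    "phone": "555-0100",
    "animals": [
        {"id": 10, "person_id": 1, "name": "Rex",
         "address": "1 Example Street", "phone": "555-0100"},
        {"id": 11, "person_id": 1, "name": "Tom",
         "address": "1 Example Street", "phone": "555-0100"},
    ],
}


def animal_names(document):
    """Read the nested entries directly from the document, with no join."""
    return [animal["name"] for animal in document["animals"]]


encoded = json.dumps(person_document)  # what would be stored or sent over the wire
decoded = json.loads(encoded)
```

The relational version needs two tables and a join on person_id to answer the same question.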
NoSQL: Document Stores
• Benefits:
  – Data locality! Everything in one place
  – Efficient writes and updates (in place)
  – Efficient reads
  – Highly flexible data schema
  – Usually provides indexes over each object key, enabling a powerful query language
• Drawbacks:
  – Doesn’t encourage well designed data schemas
NoSQL: Graph Stores
• An entry is a node
• Nodes have properties
• Edges are links between nodes
NoSQL: Graph Stores
• Benefits:
  – Faster to fetch an entry and its related entries (links are already resolved, no need to join)
  – Flexible data schema
• Drawbacks:
  – Complex APIs
  – Slow for batch operations
  – Open source implementations are not that good…
The real issues… SCALABILITY IN PRACTICE
CAP Theorem
• CAP:
  – Consistency: Operating fully or not at all.
  – Availability: The service must be reachable at any time.
  – Partition Tolerance: No set of failures less than total network failure is allowed to cause the system to respond incorrectly.
Any shared-data system can only achieve two of these three.
CAP Theorem, Dr. Eric Brewer, Berkeley (2000)
Consistent Hashing
• Ensuring data availability: replication!
• Reaching the right nodes? Hashing
• Consistent hashing: Hash ring
  – Objects are mapped into a range
  – Nodes are mapped into that range
  – We write the object into the nearest node, clockwise
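A hash ring along these lines can be sketched in Python as follows. The `HashRing` class, the use of MD5 as the ring hash and the node names are assumptions for the example. Production systems typically also place several virtual points per node on the ring, which this sketch leaves out.

```python
import bisect
import hashlib


def ring_position(name):
    """Map an object key or a node name into the same integer range."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Nodes are placed on the ring, sorted by their position.
        self._ring = sorted((ring_position(node), node) for node in nodes)

    def node_for(self, key):
        """Return the nearest node clockwise from the key's position."""
        positions = [position for position, _ in self._ring]
        index = bisect.bisect_left(positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]  # wrap around past the end


old_ring = HashRing(["node-a", "node-b", "node-c"])
new_ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"object:{i}" for i in range(1000)]
moved = [k for k in keys if old_ring.node_for(k) != new_ring.node_for(k)]
```

When "node-d" joins, the only keys that move are the ones it takes over from its clockwise neighbour. Every other key keeps its node, unlike with a hash-modulo distribution.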
Data consistency
• Ensuring eventual data consistency: Quorum writes
  – W = number of replica writes to confirm before returning OK
  – R = number of replica reads to perform
  – N = replication factor
• W < N == High write availability
  – Data may be lost or outdated if read from another node
• R < N == High read availability
  – Data may be outdated
• W + R > N == Full consistency!
  – But slower writes / reads
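The reason W + R > N gives consistent reads is that every set of W written replicas must then share at least one replica with every set of R read replicas. The brute-force check below enumerates all possible write and read sets for small N to illustrate this. It is a demonstration of the overlap argument, not part of any real store, and the function name is an assumption.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every choice of w written replicas intersects every choice of r read replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# N = 3: W = 2, R = 2 always overlaps; W = 1, R = 1 may miss the latest write.
strong = quorums_always_overlap(3, 2, 2)
weak = quorums_always_overlap(3, 1, 1)
```

With N = 3, W = 2 and R = 2, a read always reaches at least one replica that holds the latest write. With W = 1 and R = 1 it may hit a replica that was never updated.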
Conflict resolution
• What happens when R > 1 and two different versions are found?
• Conflict resolution!
• Common algorithm: Vector clocks
Vector clocks
• Assign a unique ID to each node
• A node increments its own counter in the vector and keeps track of the old entries
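A minimal vector clock can be represented as a mapping from node ID to counter. The sketch below is an assumption about how such clocks are commonly compared and is not taken from the slides: one version is newer if it has seen everything the other has, and two versions conflict when neither has.

```python
def increment(clock, node_id):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent updates: the application must reconcile


v1 = increment({}, "A")   # written through node A
v2 = increment(v1, "B")   # then updated through node B
v3 = increment(v1, "C")   # concurrently updated through node C
```

Here `v2` supersedes `v1`, while `v2` and `v3` are concurrent versions that a read with R > 1 would report as a conflict.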
Elasticity: Gossip Membership
• When a node joins…
Elasticity: Gossip Membership
• When a node crashes!
I’m starting the next big startup… WHAT’S THE BEST SYSTEM?
Choosing your storage system
• “Don’t optimize too early”
• MySQL is robust and works VERY well
  – You’ll know where bugs come from (you)
• Key-Value stores are hyped, and often badly implemented
• Anyway, the most mature “NoSQL” systems:
  – MongoDB
  – Cassandra
Questions?