Vertica’s Design: Basics, Successes, and...

Preview:

Citation preview

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Vertica’s Design: Basics, Successes, and Failures Chuck Bear

CIDR 2015 ‟ January 5, 2015

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

1. Vertica Basics: Storage Format

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3

Design Goals „ SQL (for the ecosystem and knowledge pool)

„ Clusters of commodity hardware (for cost)

„ Linux, x86, Ethernet

„ Software-only solution (for flexibility)

„ Special purpose hardware has poor track record in databases

„ Shared Nothing MPP

„ Cheaper, but puts more complexity in the software

„ Analytics: Run large queries many times faster than a legacy DB, load as fast, but feel free to snarl and growl at small UPDATEs and DELETEs

„ Work smart, and work hard.

„ Robust algorithms, query optimizer, vectorize, JIT, etc.

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 4

Start from how data is stored on disk…

SELECT SUM(volume) FROM trades WHERE symbol = 'HPQ' AND date = '5/13/2011'

SYMBOL DATE TIME PRICE VOLUME ETC

… … … … … …

HPQ 05/13/11 01:02:02 PM 40.01 100 …

IBM 05/13/11 01:02:03 PM 171.22 10 …

AAPL 05/13/11 01:02:03 PM 338.02 5 …

GOOG 05/13/11 01:02:04 PM 524.03 150 …

HPQ 05/13/11 01:02:05 PM 39.97 40 …

AAPL 05/13/11 01:02:07 PM 338.02 20 …

GOOG 05/13/11 01:02:07 PM 524.02 40 …

… … … … … …

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 5

Sorted Data

Sort by Symbol, Date, and Time

SYMBOL DATE TIME PRICE VOLUME ETC

… … … … … …

AAPL 05/13/11 01:02:07 PM 338.02 20 …

AAPL 05/13/11 01:02:03 PM 338.02 5 …

… … … … … …

GOOG 05/13/11 01:02:04 PM 524.03 150 …

GOOG 05/13/11 01:02:07 PM 524.02 40 …

… … … … … …

HPQ 05/13/11 01:02:02 PM 40.01 100 …

HPQ 05/13/11 01:02:05 PM 39.97 40 …

… … … … … …

IBM 05/13/11 01:02:03 PM 171.22 10 …

… … … … … …

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 6

Column Files

Split into columns

SYMBOL DATE TIME PRICE VOLUME ETC

… … … … … …

AAPL 05/13/11 01:02:07 PM 338.02 20 …

AAPL 05/13/11 01:02:03 PM 338.02 5 …

… … … … … …

GOOG 05/13/11 01:02:04 PM 524.03 150 …

GOOG 05/13/11 01:02:07 PM 524.02 40 …

… … … … … …

HPQ 05/13/11 01:02:02 PM 40.01 100 …

HPQ 05/13/11 01:02:05 PM 39.97 40 …

… … … … … …

IBM 05/13/11 01:02:03 PM 171.22 10 …

… … … … … …

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 7

Compression + RLE

SYMBOL DATE VOLUME

(8K Distinct) (250/yr)

… 22

GOOG (x18M) 05/13/2011 (x150K) 150

… 40

… 99

HPQ (x22M) 05/13/2011 (x220K) 100

… 40

… 200

IBM (x19M) 05/13/2011 (x150K) 10

… 18

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 8

Position Index (NOT Row ID) Last PSize

Comp SzCRC

Min/MaxNull Count

Last PSize

Comp SzCRC

Min/MaxNull Count

Last PSize

Comp SzCRC

Min/MaxNull Count

CmpData

CmpData

CmpData

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

2. Vertica Basics: Updates & Deletes

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 10

Q: How do you update this?

SYMBOL DATE VOLUME

(8K Distinct) (250/yr)

… 22

GOOG (x18M) 05/13/2011 (x150K) 150

… 40

… 99

HPQ (x22M) 05/13/2011 (x220K) 100

… 40

… 200

IBM (x19M) 05/13/2011 (x150K) 10

… 18

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 11

A: You Do Not!

„ Multiple sets of sorted files loaded

‟ Or keep things in memory for a while

„ Update is INSERT+DELETE

„ Delete is just a mark ‟ nice sorted list of positions

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 12

So you need to compaction, or whatever. We call ours the tuple mover.

It’ll Get Dirty….

Maximum ROS size (~1/2 disk or less)

Negligible ROS size (1 MB/column or less)

ROS Size (Sum of all columns). Log scale.

Merge Strata (Blue dots represent ROSs)

Start Epoch, End Epoch

Full Stratum

(ready to merge)

...up to the number of strata

Stratum

Height

Stratum 0

Stratum 1

Stratum 3

Stratum 4

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 13

Not by glamour, etc.

How Do You Judge a Tuple Mover?

„ Magical: no problems, no backlogs, no errors

„ Latency and freshness: How much batching is needed?

„ Sustained load rate (consider machine capacity + retention interval)

„ Efficiency will be required

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

3. Vertica Basics: Transactions & Recovery

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 15

Transactions

Vertica offers full ACID ( just at low TPS)

Queries take a snapshot of the relevant list of files, and need no locks at READ COMMITTED isolation

Tuple Mover (etc.) doesn’t interfere

Loads do not conflict with each other

COMMIT ‟ keep the new files

ROLLBACK ‟ discard them

Table level locks for SERIALIZABLE

All Operations are On-Line

Database is essentially its own undo / redo log Recovery can be as simple as file copies

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 16

Node 1

1A 2B

Node 2

1B 2C

Node 4

1D 2A

Node 3

1C 2D

a) All nodes up

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 17

Node 1

1A 2B

Node 2

1B 2C

Node 4

1D 2A

Node 3

1C 2D

a) All nodes up

Node 1

1A 2B

Node 2

1B 2C

Node 4

1D 2A

Node 3

1C 2D

b) Node 2 down

All data still available, in several combinations: 2A, 2B, 1C, 1D (shown) 1A, 2B, 1C, 1D 2A, 2B, 1C, 2D 1A, 2B, 1C, 2D (never chosen)

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 18

Node 1

1A 2B

Node 2

1B 2C

Node 4

1D 2A

Node 3

1C 2D

a) All nodes up

Node 1

1A 2B

Node 2

1B 2C

Node 4

1D 2A

Node 3

1C 2D

b) Node 2 down

Node 1

1A 2B

Node 2

1B 2C

Node 4

1D 2A

Node 3

1C 2D

c) Recovery

All data still available, in several combinations: 2A, 2B, 1C, 1D (shown) 1A, 2B, 1C, 1D 2A, 2B, 1C, 2D 1A, 2B, 1C, 2D (never chosen)

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

4. Mistake: Execution Engine Design

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 20

Simple Design

„ Use iterators

‟ open

‟ getNext

‟ close

„ If there’s trouble, use a temp relation

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 21

(You have to vectorize. And do JIT compiling.)

Too Slow!

0 2 4 6 8 10 12

0

1000

2000

3000

4000

5000

6000

Original (ms)

Vectorized Copy (ms)

JIT Compiled (ms)

Number of Merge Streams

Merg

e T

ime (

ms,

Nehale

m)

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 22

“Push Model” DAG Executor You might even get parallelism for free

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 23

“Push Model” DAG Executor

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 24

“Push Model” DAG Executor

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 25

“Push Model” DAG Executor

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 26

“Push Model” DAG Executor

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 27

“Push Model” DAG Executor

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 28

Problems with DAG Execution

„ Free-for-all

‟ And that parallelism thing didn’t pan out after all

„ Resource usage: could do better

„ Diamond problem

„ Need to give clues to upstream operations

‟ Imagine subqueries?

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 29

We threw it away, and went back to the “pull” model

End Result?

„ Block iterators

‟ open

‟ getNextBatch (w/ optimizations to avoid tuple copies)

‟ close

‟ Also, send information back upstream

„ When it gets tricky, use coroutines or other tactics

„ We still push data when there are multiple targets

‟ Such as loading multiple projections, UPDATEs, etc.

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

5. Evolution of Joins in Vertica: The Good, Bad, and Ugly

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 31

Scan all columns

Perform Join of 'fk' against 'pk' IN list

Final 'sv' SUM

Pre-SUM 'sv' data Against 'fk' join key

(optional)

a) No SIPS, EMJ

SELECT SUM(sv) FROM fact WHERE fk IN (SELECT pk FROM d)

Early Materialized Joins

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 32

Scan all columns

Perform Join of 'fk' against 'pk' IN list

Final 'sv' SUM

Pre-SUM 'sv' data Against 'fk' join key

(optional)

Scan 'fk' key column

Perform Join of 'fk' against 'pk' IN list

Materialize columns from rows

that joined

SUM 'sv' from rows

a) No SIPS, EMJ b) No SIPS, LMJ

SELECT SUM(sv) FROM fact WHERE fk IN (SELECT pk FROM d)

Late Materialized Joins

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 33

SCANfact

SCANd

JOIN

1

23

Sideways Information Passing (SIPS)

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 34

Scan 'fk' key column (Filter out keys that

will not join)

Perform Join of 'fk' against 'pk' IN list

Materialize columns from rows that joined

SUM 'sv' from rows

Scan 'fk' key column (Filter out keys that

will not join)

Perform Join of 'fk' against 'pk' IN list

Final 'sv' SUM

Pre-SUM 'sv' data against 'fk' join key

(optional)

c) SIPS, EMJ d) SIPS, LMJ

SELECT SUM(sv) FROM fact WHERE fk IN (SELECT pk FROM d)

Late Materialized Joins

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 35

Outcome?

Selectivity Neither Feature LMJ only SIPS only SIPS+LMJ

0.00% 1206 39 23 271.00% 1202 63 33 39

2.00% 1200 75 50 57

3.00% 1208 121 75 79

5.00% 1207 151 93 116

10.00% 1200 195 141 191

20.00% 1202 362 405 360

50.00% 1202 1050 1086 1047

100.00% 1204 1720 1222 1724

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 36

a) Good join order, no SIPS

A B

CJoin

Join

10M

1M

10

100

10M

Robustness to Join Order Errors

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 37

b) Bad join order, no SIPS

A C

BJoin

Join

10M

100M

100

10

10M

Robustness to Join Order Errors

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 38

c) Good join order, w/ SIPS d) Bad join order, w/ SIPS

A B

CJoin

Join

1M

1M

10

100

10M

A C

BJoin

Join

1M

10M

100

10

10M

Robustness to Join Order Errors

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

6. Mistake: Partitioned Hash Join

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 40

Partitioned Hash Join vs. Sort Merge Join

„ (There are papers about these)

„ PHJ was the first one tried

„ SMJ was simpler to implement

„ Sometimes one relation is sorted already

„ Sometimes, you need to sort for other reasons

„ Much more compatible with SIPS

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 41

Also, There’s Performance

0 1 2 3 4 5 6 7 8 9

0

5000

10000

15000

20000

25000

PHJ

SMJ

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

7. Good Idea: Data Collection

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 43

A database that doesn’t self-collect is hypocrisy at its worst

Big Data Mentality

„ How busy is the machine compared to historical trends?

„ What have my users been doing?

„ How long will this job take to finish?

„ What is the most common error?

„ When was the last time we made a backup?

„ My request’s run-time changed… why?

„ Have there been changes from the standard configuration?

„ Are there problems that the customer hasn’t called about?

„ Which features have been used?

„ Where do customer machines burn the most CPU cycles?

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 44

Unexpected Questions

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 45

Don’t Compromise on the Design

„ Data collector can’t kill the system

„ Like a log, lots of little appends

„ Shouldn’t accidentally monitor itself

„ Should be able to analyze off-line

„ Result: Separate data management scheme

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

8. Good Idea: Dynamic Workload Management

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 47

Static (Known) Workload Management

Don't want reports to take over the entire system, preventing loads or tactical queries

Keep some resources (e.g. memory) reserved so that high-priority queries can always begin

Apply run-time prioritization to manage CPU and I/O

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 48

Unpredictable Workload: Short Query Bias

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 49

Q: Are optimizer cost model estimates really that bad?

Dynamic Prioritization

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 50

Q: Are optimizer cost model estimates really that bad?

A: Doesn’t matter!

Dynamic Prioritization

0 50 100 150 200 0

20

40

60

80

100

120

Time (s)

Cum

ula

tive

Co

mp

leti

on

(%

)

Unprioritized

Dynamic Priority

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Thank you

Please come visit our development team in:

Boston (Cambridge and Andover), MA

Pittsburgh, PA

Sunnyvale, CA