27
Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang

Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Insights of Approximate Query Processing Systems

Presented by: Huanyi Chen

Ruoxi Zhang

Page 2: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Agenda§ Introduction

§ Background

§ VerdictDB & SnappyData

§ Experiment Setup

§ Evaluation

§ Insights

Insights of Approximate Query Processing Systems PAGE 2

Page 3: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Why AQP?

Insights of Approximate Query Processing Systems PAGE 3

# of Day Income (CAD)1 1502 2403 1804 2005 2306 1907 180

Avg(Income) 195.71

shop income

Page 4: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Why AQP?

Insights of Approximate Query Processing Systems PAGE 4

# of Day Income (CAD)1 1502 2403 1804 2005 2306 1907 180

Avg(Income)

186.67shop income

more efficient (50% rows)accuracy > 95%

195.71

Page 5: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Why AQP?

Insights of Approximate Query Processing Systems PAGE 5

99.9% Identical100x-200x Faster

Page 6: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Sampling Based AQP

Insights of Approximate Query Processing Systems PAGE 6

Page 7: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Sampling Based AQP

Insights of Approximate Query Processing Systems PAGE 7

Query Column Set (QCS)

Page 8: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Why SnappyData & VerdictDB ?

Insights of Approximate Query Processing Systems PAGE 8

Name Online/Offline

Distributed/Standalone Platform Algorithm Skewed

BlinkDB Offline Distributed Hive/Hadoop (Shark) Stratified sampling Yes

Sapprox Online Distributed Hadoop Distribution-aware Online sampling No

Approxhadoop Online Distributed Hadoop Approximation-enabled MapReduce No

Quickr Online Distributed N/A ASALQA algorithm No

SnappyData Online Distributed Spark and GemFireSpark as a computational engine; GemFire as transactional store

No

FluoDB Online Distributed Spark Mini-batch execution OLA Model No

XDB Online Standalone PostgreSQL Wander join No

VerdictDB Online Standalone Spark SQL Database learning No

IDEA Online Standalone N/AReuse answers of past overlapping queries for new query

No

BEAS Online Standalone Commercial DBMS Approximability theorem NoABS Online Standalone N/A Bootstrap No

• Spark• Open-source*

Page 9: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

SnappyData

Insights of Approximate Query Processing Systems PAGE 9

SDE is NOT open

source

Page 10: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

SnappyData

Insights of Approximate Query Processing Systems PAGE 10

+ WITH ERROR

QCSFRACTION

Page 11: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

VerdictDB

Insights of Approximate Query Processing Systems PAGE 11

Page 12: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

VerdictDB

Insights of Approximate Query Processing Systems PAGE 12

Page 13: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Experiment Setup

Insights of Approximate Query Processing Systems PAGE 13

§ Cluster Setup§ SnappyData: 1 locator, 1 lead, and 2 servers

Page 14: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Experiment Setup

Insights of Approximate Query Processing Systems PAGE 14

§ Cluster Setup§ SnappyData: 1 locator, 1 lead, and 2 servers

§ VerdictDB on Spark: 1 master and 2 executors

§ Each Node§ 24/32 GB memory used

§ 500 GB HDD

Page 15: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Experiment Setup

Insights of Approximate Query Processing Systems PAGE 15

§ TPC-H Benchmark § OLAP

§ 22 queries includes Aggregation, Join, etc.

§ Well known and standard

§ Customizable

§ Data§ 1GB and 10GB

§ Uniformly distributed

Page 16: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Evaluation

Insights of Approximate Query Processing Systems PAGE 16

SnappyData• Stratified Sampling• In-memory

VerdictDB• Uniform Sampling• Not in-memory (bug?)

Page 17: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

SnappyData - Latency

Insights of Approximate Query Processing Systems PAGE 17

29439

1832

66298092

13993870

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

Q1 Q6 Q14

Execution time (ms) using TPC-H (SF=10, fraction 0.1)

SnappyData SnappyData_AQP (>95% accuracy)

Q1: Up to 3.6x speedup~0.0001 Error

Page 18: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

fraction0.1

SnappyData - Accuracy

Insights of Approximate Query Processing Systems PAGE 18

Base Table

fraction0.01

Sample Tables ...

0.00

0.00

0.01

0.01

0.02

0.02

fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3

Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction)

01,0002,0003,0004,0005,0006,0007,000

fraction 0.01

fraction 0.1

fraction 0.2

fraction 0.3

Snappy

Time (ms) for TPC-H Q14 result (SF=10) given different sample tables (fraction)

Page 19: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

SnappyData- Creating Sample Tables

Insights of Approximate Query Processing Systems PAGE 19

0

50,000

100,000

150,000

200,000

250,000

fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3

Time (ms) for creating SnappyData sample tables with different fractions

Page 20: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

VerdictDB - Latency

Insights of Approximate Query Processing Systems PAGE 20

206034

8891299195

17598 1521024355

0

50,000

100,000

150,000

200,000

250,000

Q1 Q6 Q14

Execution time (ms) using TPC-H (SF=10, fraction 0.1)

SparkSQL VerdictDB (> 95% accuracy)

Up to ~11x speedup!

Page 21: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

VerdictDB - Speedup

Insights of Approximate Query Processing Systems PAGE 21

0

2

4

6

8

10

12

14

Q1 Q6 Q14

Speedup for TPC-H (SF=10, fraction=0.1)

Speedup

0

2

4

6

8

10

12

Q1 Q6 Q14

Speedup for TPC-H (SF=1, fraction=0.1)

Speedup

Page 22: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

VerdictDB - Creating Sample Tables

Insights of Approximate Query Processing Systems PAGE 22

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3

Time (ms) for creating VerdictDB sample tables with different fraction

Page 23: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

fraction0.1

VerdictDB - Accuracy

Insights of Approximate Query Processing Systems PAGE 23

Base Table

fraction0.01

Sample Tables ...

0

0.05

0.1

0.15

0.2

0.25

fraction 0.01

fraction 0.05

fraction 0.1

fraction 0.2

fraction 0.3

Actual Error for TPC-H Q14 result (SF=10)given different sample tables (fraction)

0

20,000

40,000

60,000

80,000

100,000

120,000

fraction 0.01

fraction 0.05

fraction 0.1 fraction 0.2fraction 0.3 SparkSQL

Time (ms) for TPC-H Q14 result (SF=10)given different sample tables (fraction)

converge!

Page 24: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Other Queries?

Insights of Approximate Query Processing Systems PAGE 24

Q19Error: ~ 80%

Speedup: ~5.5X

Q14 Error: ~ 1.7%

Speedup: ~1.7X

Page 25: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Other Queries?

Insights of Approximate Query Processing Systems PAGE 25

Key missing in sample tables!

Careful design of sample tableor original table!

Q7AQP not working!

Page 26: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Insights

Insights of Approximate Query Processing Systems PAGE 26

§ AQP performs well:§ For aggregate functions such as SUM, AVG and COUNT

§ When WHERE is simple

§ Users’ foreseen is important!§ for both query and original table

Page 27: Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Future Work

Insights of Approximate Query Processing Systems PAGE 27

§ Test error estimation in sampling

§ Other sampling techniques§ Biased Sampling

§ Database learning

§ Approximate hardware