27
Adaptive Query Processing on RAW Data Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, Anastasia Ailamaki

Adaptive Query Processing on RAW Data

Embed Size (px)

DESCRIPTION

Presentation of RAW, a prototype query engine which enables querying heterogeneous data sources transparently using Just-In-Time access paths. Presentation given at the 40th International Conference on Very Large Databases (VLDB 2014)

Citation preview

Page 1: Adaptive Query Processing on RAW Data

Adaptive Query Processing on RAW Data

Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, Anastasia Ailamaki

Page 2: Adaptive Query Processing on RAW Data

Many “data problems” lack solutions

• Petabytes of data

• No a priori knowledge about data

• Multiple file formats

2

Page 3: Adaptive Query Processing on RAW Data

Many “data problems” lack solutions

• Petabytes of data

• No a priori knowledge about data

• Multiple file formats

3Need declarative, interactive access to arbitrary datasets

Page 4: Adaptive Query Processing on RAW Data

CSV JSONXML

DBMSTr

ansf

orm

QueryRAW: A query engine adapting to data

Raw files as native storage format

RAW: Generate plug-in per file

4

Traditional DBMS: Data adapts to query engine

Page 5: Adaptive Query Processing on RAW Data

CSV JSONXML

QueryRAW: A query engine adapting to data

Raw files as native storage format

RAW: Generate plug-in per file

5

Traditional DBMS: Data adapts to query engine

Page 6: Adaptive Query Processing on RAW Data

CSV JSONXML

QueryRAW: A query engine adapting to data

Raw files as native storage format

RAW: Generate plug-in per file

6

High-performance querying… while keeping data formats, files, and scripts

Traditional DBMS: Data adapts to query engine

Page 7: Adaptive Query Processing on RAW Data

How RAW adapts to data

CSVROOT

Page 8: Adaptive Query Processing on RAW Data

Vectorized Query Execution

CSVROOT

join

scanplaceholder

scanplaceholder

filter

How RAW adapts to data

scanplaceh.

SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND …

8

Page 9: Adaptive Query Processing on RAW Data

Code Generate the Access Paths

Vectorized Query Execution

CSVROOT

join

filter

How RAW adapts to data

scancsv

scanroot

SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan

csv

9

Page 10: Adaptive Query Processing on RAW Data

Code Generate the Access Paths

Build Position and Data Caches

Vectorized Query Execution

CSVROOT

join

filter

How RAW adapts to data

scancsv

scanroot

SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan

csv

10

Page 11: Adaptive Query Processing on RAW Data

Code Generate the Access Paths

Build Position and Data Caches

Vectorized Query Execution

CSVROOT

join

filter

How RAW adapts to data

scancsv

scanroot

SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan

csv

11

Adapt to format, file instance and queryjust-in-time

Page 12: Adaptive Query Processing on RAW Data

Adapting to schema & query

[CSV input]

∀col:

if col needed:

if col isInt

readInt();

if col isFloat

readFloat();

if ...

else:

skipField();

GENERAL-PURPOSE

readInt();

readInt();

skipField();

readFloat();

skipRestLine();

JUST-IN-TIME

Remove overhead of generic operators

12

Page 13: Adaptive Query Processing on RAW Data

Adapting to format

• Unroll Columns

• Free navigation in files

• Embedded indexes/existing APIs

13

Page 14: Adaptive Query Processing on RAW Data

Adapting to format

• Unroll Columns

• Free navigation in files

• Embedded indexes/existing APIs

∀col:if col needed:

if col isInt...

readInt();skipField();readFloat();skipRest();

14

Page 15: Adaptive Query Processing on RAW Data

Adapting to format

• Unroll Columns

• Free navigation in files

• Embedded indexes/existing APIs

∀col:if col needed:

if col isInt...

readInt();skipField();readFloat();skipRest();

- fieldLength:10- tupleLength:100- Need fields 2 & 5

of 2nd row

moveTo(110);readInt();moveTo(140);readFloat();

15

Page 16: Adaptive Query Processing on RAW Data

Adapting to format

• Unroll Columns

• Free navigation in files

• Embedded indexes/existing APIs

∀col:if col needed:

if col isInt...

readInt();skipField();readFloat();skipRest();

- fieldLength:10- tupleLength:100- Need fields 2 & 5

of 2nd row

moveTo(110);readInt();moveTo(140);readFloat();

- Bitmaps, R-Trees etc.- readNextField() vs. readField(filename,id)

16

Page 17: Adaptive Query Processing on RAW Data

Ad-Hoc Operators for Raw Data Fine-grained, raw-data-aware decisions

Page 18: Adaptive Query Processing on RAW Data

Ad-Hoc Operators for Raw Data

Col9

Scan CSVColumns 1,9

Filter

Tuple Construction

Col1

Col1

JIT – OPTION 1

Col1 Col9

CSV file:SELECT col9 WHERE col1 < [X]

Fine-grained, raw-data-aware decisions

18

Page 19: Adaptive Query Processing on RAW Data

JIT – OPTION 2

Col1

Ad-Hoc Operators for Raw Data

Col9

Scan CSVColumns 1,9

Filter

Tuple Construction

Col1

Col1

JIT – OPTION 1

Col1 Col9

CSV file:SELECT col9 WHERE col1 < [X]

Scan CSVColumn 1

Fine-grained, raw-data-aware decisions

19

Page 20: Adaptive Query Processing on RAW Data

JIT – OPTION 2

Filter

Col1

Col1

543

12

Ad-Hoc Operators for Raw Data

Col9

Scan CSVColumns 1,9

Filter

Tuple Construction

Col1

Col1

JIT – OPTION 1

Col1 Col9

CSV file:SELECT col9 WHERE col1 < [X]

Scan CSVColumn 1

Fine-grained, raw-data-aware decisions

20

Page 21: Adaptive Query Processing on RAW Data

JIT – OPTION 2

Filter

Tuple Construction

Col1 Col9

Col1

Col9

Col1

543

12

Ad-Hoc Operators for Raw Data

Col9

Scan CSVColumns 1,9

Filter

Tuple Construction

Col1

Col1

JIT – OPTION 1

Col1 Col9

CSV file:SELECT col9 WHERE col1 < [X]

Scan CSVColumn 1

Fine-grained, raw-data-aware decisions

Scan CSVColumn 9

21

Processes 3/5 raw col9 fields

Page 22: Adaptive Query Processing on RAW Data

Electron

eventID INT

eta FLOAT

pt FLOAT

Jet

eventID INT

eta FLOAT

pt FLOATEvent

eventID INT

runNumber INT

Muon

eventID INT

eta FLOAT

pt FLOAT

ROOT - C++ RAWclass Event {

class Muon {float pt, eta;…

}class Electron {

float pt, eta;…

}class Jet {

float pt, eta;…

}int runNumber;vector<Muon> muons;vector<Electron> electrons;vector<Jet> jets; }

Finding the Higgs Boson: Data

22

Page 23: Adaptive Query Processing on RAW Data

Finding the Higgs Boson: Queries“Identify events of interest → Filter out background events

→ Plot aggregated results in a histogram”

SELECT eventFROM root:/data1/ATLAS/*.root ,

csv:/data1/ATLAS/events.csvWHERE( csv.id = event.id AND

event.EF_e24vhi_medium1 OR event.EF_e60_medium1 OR event.EF_2e12Tvh_loose1 OR event.EF_mu24i_tight OR event.EF_mu36_tight OR event.EF_2mu13) ANDevent.muon.mu_ptcone20 < 0.1 * event.muon.mu_pt ANDevent.muon.mu_pt > 20000. ANDABS(event.muon.mu_eta) < 2.4 AND

…..

1000+ lines of C++

for (unsigned int imuon = 0 ; imuon<((*curr_entries)[jentry].mu_pt)->size(); imuon++) {if (((*curr_entries)[jentry].

mu_ptcone20)->at(imuon) < 0.1 * ((*curr_entries)[jentry].mu_pt)->at(imuon) && ((*curr_entries)[jentry].mu_pt)->at(imuon) > 20000. &&fabs(((*curr_entries)[jentry].mu_eta)->at(imuon)) < 2.4 &&…

}...

ROOT - C++ RAW

23

Page 24: Adaptive Query Processing on RAW Data

0.1

1

10

100

1000

10000

Query 1(Cold)

Query 2 Query 3 Query 4 Query 5 Query 6

Exe

cuti

on

Tim

e (

sec)

RAW ROOT

RAW vs. the ROOT framework[Xeon CPU E7-28867 @ 2.13GHz1TB HDD - 7200RPM,192GB RAM]

ROOT: 900 GB in 127 files

CSV: 1 “table” of IDs

24

Page 25: Adaptive Query Processing on RAW Data

0.1

1

10

100

1000

10000

Query 1(Cold)

Query 2 Query 3 Query 4 Query 5 Query 6

Exe

cuti

on

Tim

e (

sec)

RAW ROOT

RAW vs. the ROOT framework[Xeon CPU E7-28867 @ 2.13GHz1TB HDD - 7200RPM,192GB RAM]

ROOT: 900 GB in 127 files

CSV: 1 “table” of IDs

25Declarative access and up to 90x improvement

Page 26: Adaptive Query Processing on RAW Data

Accessing raw data is DBMS business

• Adapt to data and queries

• CERN: 90x

• Application in Banking / Healthcare / Astronomy

http://dias.epfl.ch/RAW

26

Page 27: Adaptive Query Processing on RAW Data

Image Sources

http://www.nasa.gov

http://home.web.cern.ch/

http://www.alacergroup.com/

http://www.humanbrainproject.eu

28