Scientific Project: Databases for Multi-dimensional …Team+Project/Material/...Scientiﬁc Project: Databases for Multi-dimensional Data, Genomics and modern Hardware David Broneske,

Scientific Project: Databases forMulti-dimensional Data, Genomics and

modern Hardware

David Broneske, Gabriel Campero, Bala Gurumurthy, AndreasMeister, Marcus Pinnecke, Sabine Wehnert, Roman Zoun,

Gunter SaakeOctober 9, 2018

Overview

I Concepts of this courseI Course of action (milestones, presentations)I Overview of project topics & forming project teamsI How to perform literature research?

I Further lectures:

I Academic writing (2-3 lectures)

David Broneske et al. Scientific Project 2

Overview

I Concepts of this courseI Course of action (milestones, presentations)I Overview of project topics & forming project teamsI How to perform literature research?I Further lectures:

I Academic writing (2-3 lectures)


Organization

Scientific Project: Modules

BachelorI Module: WPF FIN SMK (Schlussel- und

Methodenkompetenzen)I 5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h

autonomous workMasterI Module: Scientific Team Project (Inf, IngInf, WIF, CV)

I DKE: Methods 2 or ApplicationsI DE: Interdisciplinary Team Project

I 6 CP = 180h ⇒ 42h presence time (3 SWS) + 138hautonomous work

Grade at the end of the course for the whole project team


Scientific Project: Prerequisite

I Successful programming test in C++/JavaI 1h theoretical test in a seminar room (data and place to be

discussed)I Half of the team members have to pass the testI Topics:

I Some language specificsI General program understandingI Control flow understanding

I You can take both tests and have to pass at least one!


Scientific Project: Semester Plan

Introduction 08.10.2018 09.10.2018 10.10.2018 11.10.2018 12.10.2018Team Formation 15.10.2018 16.10.2018 17.10.2018 18.10.2018 19.10.2018

22.10.2018 23.10.2018 24.10.2018 25.10.2018 26.10.2018MS-I 29.10.2018 30.10.2018 31.10.2018 01.11.2018 02.11.2018

05.11.2018 06.11.2018 07.11.2018 08.11.2018 09.11.201812.11.2018 13.11.2018 14.11.2018 15.11.2018 16.11.2018

MS II 19.11.2018 20.11.2018 21.11.2018 22.11.2018 23.11.201826.11.2018 27.11.2018 28.11.2018 29.11.2018 30.11.201803.12.2018 04.12.2018 05.12.2018 06.12.2018 07.12.2018

MS III 10.12.2018 11.12.2018 12.12.2018 13.12.2018 14.12.201817.12.2018 18.12.2018 19.12.2018 20.12.2018 21.12.2018

Winter Break ... ... ... ... ...07.01.2019 08.01.2019 09.01.2019 10.01.2019 11.01.201914.01.2019 15.01.2019 16.01.2019 17.01.2019 18.01.2019

MS Final 21.01.2019 22.01.2019 23.01.2019 24.01.2019 25.01.2019

Monday Tuesday Wednesday Thursday Friday

Aca-IAca-II


Scientific Project: Milestones

I Milestone I - Topic, schedule, and team presentation & firstresults of literature research

I Milestone II - Concept & additional literature researchI Milestone III - Implementation & evaluation setupI Milestone IV - Final presentation (wrap-up + evaluation

results)


Concepts & Content

Lecture, Meetings & Presentation

Lecture & PresentationI Time/Place: Tuesday, 13:00-15:00, G22A - room 217I Lectures with content of course → allI Presentation of main milestones (see time table)

→ each project teamMeetings (Exercise)I Individual for each project teamI Time and room to be agreed in project teams!I Presentation of all intermediate results/milestones (informal)I Discussion, discussion, discussion . . .


Objectives & Qualification (I)

Acquired skills, specific to researchI Performing literature researchI Understanding and structured reviewing of scientific workI Autonomous, solution-based reasoning on research task (e.g.,

finding alternative solutions)I How to ask? How to adapt a task (extend/reduce)?I Academic writing


Objectives & Qualification (II)

Acquired skills, always neededI Team managementI Project and time schedulingI Presentation of resultsI Flexibility regarding changing conditionsI Reasoning about solutions (”Why is this the best/not

adequate. . . ”)


Progress of Course

DeliveriesI 4 milestone presentations (main milestones)I Each team member has to present at least onceI Reporting of (sub) milestones in exercises/meetingsI Written paper about literature research (technical report)I Prototypical implementation


Deliveries and Grading (I)

Technical ReportI Delivery of report at a given time (deadline)I Quality/Quantity of literature researchI Number of pagesI Quality of paper structure and evaluationI Own contribution


Deliveries and Grading (II)

Presentation & DiscussionI Quality of scientific presentation (structure, references, time)I Assessment regarding the content (e.g., results of particular

milestones)I Participation of discussion

OrganizationI StrictnessI Communication (just-in-time answers, satisfying time

constraints)I Self-organization (Sharing tasks, internal reporting of current

state-of-work, dealing with problems)I Autonomous working


Deliveries and Grading (III)

I Grade consists of:I Presentations: 30%,I Implementation: 30%,I Paper: 30%,I Soft Skills: 10%

I Binding registration: Second Milestone


Task & Time Management

Task ManagementI Main milestones have to be finished in timeI (Sub) milestones are less strict (but don’t be sloppy)I Pre-defined work packages ⇒ each project team

I . . . defines sub work packagesI . . . determines responsibilities for these packages

(divide&conquer)Time ManagementI Planning of periodsI Regarding capacities and resourcesI Considering other tasks and activitiesI Reporting of delays immediately to project members !


Role Management

I Possible roles: team leader, developer, researcher, . . .I work together vs. responsibilities: design, implementation,

testing, writing, . . .I Delegate for important roles/work packagesI Assignment of (sub) tasks to role for each milestone


Topic & Project Teams

I Teams with 3 to 5 students (depends on the task)I Most tasks can be chosen onceI Projects

I Theoretical partI State of the artI New ideas

I Practical partI Usually in C++ or JavaI Prototypical implementationI Evaluation part


Topic 1 - Information Extraction from Legal TextbooksIntro

I Textbooks contain valuable expert knowledge about lawsI Contextual information extraction can be a tedious taskI Find a way of automating the current process

We’ve gotI Rule-based annotation process implementation in GATEI Dbpedia and Stanford Core NLP plugins for GATE

Your TaskI Literature Research: Alternative implementation options for automation

of rule-based annotation within or outside of GATEI Understanding of the current workflow and its constraintsI Implementation of an automated alternative workflow to the current

GATE implementationI Assessment which techniques require adaptation or replacementI Documentation of time taken by current workflow compared to

alternative proposalDavid Broneske et al. Scientific Project 19

Topic 2 - Collaborative Adaptive LayoutsIntroI GridFormation is our work-in-progress solution to optimizing

physical design with reinforcement learning. We have severalimplementations using OpenAI Gym and TensorFlow-Slim.

I Collaborate with us in growing to a multi-agent setting, whereagents with different workloads collaborate to reach a Nashequilibrium.

Your TaskI Literature research: data partitioning, multi-agent

reinforcement learning (RL), deep RL, DQN, DPPO, MCTS(experimental).

I Prototypical implementation: adapt existing agents.I Experimental evaluation, with Hyrise or Peloton, and analysis.


Topic 3 - JOB19- Join Order Bot 2019IntroI Join order optimization is an NP-hard problem, where

approximate solutions are error-prone.I Traditionally optimizers do not gain from experience.I How well can (deep) reinforcement learning (RL) be applied,

so optimizers improve with time and learn how to avoidcomputation?

Your TaskI Literature research: join order optimization, deep RL.I Prototypical implementation: novel environment and agent,

using OpenAI Gym and TensorFlow-Slim.I Evaluate the performance of your solution using the JOB

benchmark, compare with a dynamic programming approach.I Suggest improvements and outline limitations.


Topic 4 - Don’t Bust the Memory Bank: RecyclingSlots in our String Dictionary

IntroI ”Remove” operations are required in operational systemsI A lot of ”Remove” operations leads to sparse vectors + book

keepingI Continuous elements are required

We’ve gotI String Dictionary ImplementationI Modularized frameworks and C data structure libraryI Vector-based Implementation as ”Ground Truth”

Your TaskI Implementation of Slot-Map-based AlternativeI Evaluation of As-Is and Slot-Map based Alternative (w.r.t.

runtime, memory consumption)I Feel free to plug-in own ideas and conceptsI Low-Level Implementation Skills in C (Experience is required!)


Topic 5 - YCSB Sharp: Driver ImplementationIntro

I Benchmarking database system under atomic operations and queries isreasonable

I Requires to have the same evaluation system at hand and systems tocompare

I Driver-implementations for systems of different classes (Document Stores,Key-Value Stores, Disk-based Relational, and In-Memory RelationalSystems) are required

We’ve gotI Benchmark interface to 12 queries + a few setup functions (”Driver API”)I Example drivers for ”Dummy”, ”REDIS”, ...

Your TaskI Implementation of 3-5 system drivers for our benchmarkI Writing setup environment scripts (install, prepare, remove,... scripts)

that are system-dependentI Execute and report what the benchmarks tell you about your systemsI Low-Level Implementation Skills in C (Experience is required!)David Broneske et al. Scientific Project 23

Topic 6 - One Predicate Many Queries: OptimizerIntro

I OLAP queries have multiple conjunctive selection statementsI Naive optimizer executes the selection together for efficient processingI This leads to re-running same selection predicate multiple timesI Optimizing multiple queries can reduce the number of executed predicates

We’ve gotI A nice framework for connecting OpenCL with C++I A growing set of primitives - primitives are granular operations within

each DB operation.Your Task

I Literature Research: Multi query optimization and ”One Operator ManyQuery”

I Identifying necessary selection primitives variantsI Identifying the implementation method for the selection primitivesI Implementation of a kernel generator based on the input characteristicI Develop and test the optimizer for efficient selection groups


Topic 6 - One Predicate Many Queries: Optimizer


Topic 7 - Cold Data Avoidance for ElfIntroI Cold data traversal for queries on a little amount of columnsI Worst case: Mono-column selection predicates

We’ve gotI Elf implementation

Your TaskI Literature Research: Related index structures and cold data

managementI Understanding of the Elf and its optimization conceptsI Implementation of Elfs for Mono-column selection predicates,

Pointers into TID listsI Performance evaluation of the variantsI Investigate ratio of storage overhead and performance gain


Topic 7 - Cold Data Avoidance for Elf

1 2

0 1Column C1

Column C2

(1)

(2) (3)

T2 T1 0

0 T3 1

Column C3

Column C4(5) 0(4)

+

0

01

Accelerating multi-column selection predicates inmain-memory – the Elf approach

David BroneskeUniversity of [email protected]

Veit KoppenUniversity of Magdeburg

[email protected]

Gunter SaakeUniversity of Magdeburg

[email protected]

Martin SchalerKarlsruhe Institute of Technology

[email protected]

Abstract—Evaluating selection predicates is a data-intensivetask that reduces intermediate results, which are the input forfurther operations. With analytical queries getting more andmore complex, the number of evaluated selection predicatesper query and table rises, too. This leads to numerous multi-column selection predicates. Recent approaches to increase theperformance of main-memory databases for selection-predicateevaluation aim at optimally exploiting the speed of the CPU byusing accelerated scans. However, scanning each column one byone leaves tuning opportunities open that arise if all predicatesare considered together. To this end, we introduce Elf, an indexstructure that is able to exploit the relation between severalselection predicates. Elf features cache sensitivity, an optimizedstorage layout, fixed search paths, and slight data compression.In our evaluation, we compare its query performance to state-of-the-art approaches and a sequential scan using SIMD capabilities.Our results indicate a clear superiority of our approach for multi-column selection predicate queries with a low combined selectivity.For TPC-H queries with multi-column selection predicates, weachieve a speed-up between a factor of five and two orders ofmagnitude, mainly depending on the selectivity of the predicates.

I. INTRODUCTION

Predicate evaluation is an important task in current OLAP(Online Analytical Processing) scenarios [1]. To extract nec-essary data for reports, fact and dimension tables are passedthrough several filter predicates involving several columns.For example, a typical TPC-H query involving several columnpredicates is Q6, whose WHERE-clause is visualized in Fig. 1(a).We name such a collection of predicates on several columns inthe WHERE-clause a multi-column selection predicate. Multi-column selection predicate evaluation is performed as early aspossible in the query plan, because it shrinks the intermediateresults to a more manageable size. This task has become evenmore important, when all data fits into main memory, becausethe I/O bottleneck is eliminated and, hence, a full table scanbecomes less expensive.

In case all data sets are available in main memory (e.g., ina main-memory database system [2], [3], [4]), the selectivitythreshold for using an index structure instead of an optimizedfull table scan is even smaller than for disk-based databasesystems. In a recent study, Das et al. propose to use anindex structure for very low selectivities only, such as valuesunder 2 % [5]. Hence, most OLAP queries would never usean index structure to evaluate the selection predicates. Toillustrate this, we visualize the selectivity of each selectionpredicate for the TPC-H Query Q6 in Fig. 1(b). All of itssingle predicate selectivities are above the threshold of 2 %and, thus, would prefer an accelerated scan per predicate.

(a)Q6.1Q6.2Q6.3

l shipdate >= [DATE] and l shipdate < [DATE] + ’1 year’and l discount between [DISCOUNT] � 0.01 and [DISCOUNT] + 0.01and l quantity < [QUANTITY]

(b)Q6

Q6.1 Q6.2 Q6.32 %

20 %

40 %

Sele

ctiv

ity

(c)SIM

DSeq Elf

50

100

150

Res

pons

eTi

me

inm

s

Fig. 1. (a) WHERE-clause, (b) selectivity, and (c) response time of TPC-Hquery Q6 and its predicates Q6.1 - Q6.3 on Lineitem table scale factor 100

However, an interesting fact neglected by this approach isthat the accumulated selectivity of the multi-column selectionpredicates (1.72 % for Q6) is below the 2 % threshold. Hence, anindex structure would be favored if it could exploit the relationbetween all selection predicates of the query. Consequently,when considering multi-column selection predicates, we achievethe selectivity required to use an index structure instead of anaccelerated scan.

In this paper, we examine the question: How can we exploitthe combined selectivity of multi-column selection predicatesin order to speed up predicate evaluation? As a solutionfor efficient multi-column selection predicate evaluation, wepropose Elf, an index structure that is able to exploit therelation between data of several columns. Using Elf resultsin performance benefits from several factors up to two ordersof magnitude in comparison to accelerated scans, e.g., a scanusing single instruction multiple data (SIMD). About factor 18can be achieved for Q6 on a Lineitem table of scale factors = 100, as visible in Fig. 1(c).

Elf is a tree structure combining prefix-redundancy elimina-tion with an optimized memory layout explicitly designed forefficient main-memory access. Since the upper levels representthe paths to a lot of data, we use a memory layout that resemblesthat of a column store. This layout allows to prune the searchspace efficiently in the upper layers. Following the paths deeperto the leaves of the tree, the node entries are representing lesserand lesser data. Thus, it makes sense to switch to a memorylayout that resembles a row store, because a row store ismore efficient when accessing several columns of one tuple.

Author Copy of: David Broneske, Veit Köppen, Gunter Saake, Martin Schäler. Accelerating multi-column selection predicates in main-memory – the Elf approach. IEEE 33rd International Conference on Data Engineering (ICDE), 2017, pp. 647-658, DOI 10.1109/ICDE.2017.118

A. Conceptual designIn the following, we explain the basic design with the

help of the example data in Table II. The data set shows fourcolumns to be indexed and a tuple identifier (TID) that uniquelyidentifies each row (e.g., the row id in a column store).

C1 C2 C3 C4 ... TID0 1 0 1 ... T1

0 2 0 0 ... T2

1 0 1 0 ... T3

TABLE II. RUNNING EXAMPLE DATA

In Fig. 2, we depict the resulting Elf for the four indexedcolumns of the example data from Table II. The Elf tree struc-ture maps distinct values of one column to DimensionListsat a specific level in the tree. In the first column, there aretwo distinct values, 0 and 1. Thus, the first DimensionList,L(1), contains two entries and one pointer for each entry. Therespective pointer points to the beginning of the respectiveDimensionList of the second column, L(2) and L(3). Note,as the first two points share the same value in the first column,we observe a prefix redundancy elimination. In the secondcolumn, we cannot eliminate any prefix redundancy, as allattribute combinations in this column are unique. As a result,the third column contains three DimensionLists: L(4), L(5),and L(6). In the final DimensionList, the structure of theentries changes. While in an intermediate DimensionList,an entry consists of a value and a pointer, the pointer in thefinal dimension is interpreted as a tuple identifier (TID).

1 2

0 1

0

Column C1

Column C2

(1)

(2) (3)

0 T3 0 T21 T1

0 1Column C3

Column C4(7)

(5)

(8) (9)

0(4) (6)

Fig. 2. Elf tree structure using prefix-redundancy elimination.

The conceptual Elf structure is designed from the idea ofprefix-redundancy elimination in Section II-B and the propertiesof multi-column selection predicates. To this end, it featuresthe following properties to accelerate multi-column selectionpredicates on the conceptual level:

Prefix-redundancy elimination: Attribute values are mainlyclustered, appear repeatedly, and share the same prefix.Thus, Elf exploits this redundancy as each distinct valueper prefix exists only once in a DimensionList toreduce the amount of stored and queried data.

Ordered node elements: Each DimensionList is an or-dered list of entries. This property is beneficial for equalityor range predicates, because we can stop the search ina list if the current value is bigger than the searchedconstant/range.

Fixed depth: Since, a column of a table corresponds to a levelin the Elf, for a table with n columns, we have to descendat most n nodes to find the corresponding TID. This setsan upper bound on the search cost that does not dependon the amount of stored tuples, but mostly on the amountof used columns.

In summary, our index structure is a bushy tree structureof a fixed height resulting in stable search paths that allowsfor efficient multi-column selection predicate evaluation on aconceptual level. To further optimize such queries, we alsoneed to optimize the memory layout of the Elf approach.

B. Improving Elf’s memory layoutThe straight-forward implementation of Elf is similar to data

structures used in other tree-based index structures. However,this creates an OLTP-optimized version of the Elf, which wecall InsertElf. To enhance OLAP query performance, we usean explicit memory layout, meaning that Elf is linearized intoan array of integer values. For simplicity of explanation, weassume that column values and pointers within Elf are 64-bitinteger values. However, our approach is not restricted to thisdata type. Thus, we can also use 64 bits for pointers and 32 bitsfor values, which is the most common case.

1) Mapping DimensionLists to arrays: To store the nodeentries – in the following named DimensionElements –of Elf, we use two integers. Since we expect the largestperformance impact for scanning these potentially longDimensionLists, our first design principle is adjacency ofthe DimensionElements of one DimensionList, whichleads to a preorder traversal during linearization. To illustratethis, we depict the linearized Elf from Fig. 2 in Fig. 3. Thefirst DimensionList, L(1), starts at position 0 and has twoDimensionElements: E(1), with the value 0 and the pointer04 (depicted with brackets around it), and E(2), with the value1 and the pointer 16 (the negativity of the value 1 marks theend of the list and is explained in the next subsection). Forexplanatory reasons, we highlight DimensionLists withalternating colors.

0 [04] -1 [16] 1 [08] -2 [12] -0 [10]

-1 -0 [14] -0 -0 [18]T1 T2

-0 T3

ELF[00]

ELF[10]

ELF[20]

0 91 2 3 4 5 6 7 8(1) (2) (4)

(7) (5) (8) (3)

(9)

-1 [20](6)

Fig. 3. Memory layout as an array of 64-bit integers

The pointers in the first list indicate that theDimensionLists in the second column, L(2) andL(3) (cf. Fig. 2), start at offset 04 and 16, respectively. Thismechanism works for any subsequent DimensionListanalogously, except for those in the final column (C4). In thefinal column, the second part of a DimensionElement isnot a pointer within the Elf array, but a TID, which we encodeas an integer as well. The order of DimensionLists isdefined to support a depth-first search with expected low hitrates within the DimensionLists. To this end, we firststore a complete DimensionList and then recursively storethe remaining lists starting at the first element. We repeat thisprocedure until we reach the final column.

2) Implicit length control of arrays: The second designprinciple is size reduction. To this end, we store only valuesand pointers, but not the size of the DimensionLists. Toindicate the end of such a list, we utilize the most significantbit (MSB) of the value. Thus, whenever we encounter a

Beispieldaten

Elf-DatenstrukturPerformanzgewinne

Reorganisation


Topic 8 - Joins on ElfIntroI Elf stores all information of a table → storage structureI Currently used for selections... what about joins?

We’ve gotI Elf implementationI Ideas and code for a straight-forward join on Elf

Your TaskI Literature Research: Join implementations, sort-based joinsI Understanding of the Elf and its optimization conceptsI Implement a straight-forward join based on selections and an

own native join algorithmI Performance comparison between state-of-the-art joins (e.g.,

Radix Join) and your implementationDavid Broneske et al. Scientific Project 28

Finding your Team

Topics:I Topic 1 - Information Extraction from Legal TextbooksI Topic 2 - Collaborative Adaptive LayoutsI Topic 3 - JoinOrderBot 2019I Topic 4 - Don’t Bust the Memory Bank: Recycling Slots in

our String DictionaryI Topic 5 - YCSB Sharp: Driver ImplementationI Topic 6 - One Predicate Many Queries: OptimizerI Topic 7 - Cold Data Avoidance for ElfI Topic 8 - Joins on Elf

When do we meet for the programming test?


Literature Research

How to Perform Literature Research

I Efficient literature research requiresI Knowledge of Where to searchI Knowledge of How to searchI Finding adequate search termsI Structured review of papersI Knowledge of how to find information in papers


Where to Search (I)

I Different websites available that provide large literaturedatabases

1. Google Scholar: http://scholar.google.de/I Key word and conrete paper searchI Often, PDFs are provided

2. DBLP: http://www.informatik.uni-trier.de/˜ley/db/I Search for keyword, conferences, journals, author(s)I BibTex and references to other websites

3. Citeseer: http://citeseerx.ist.psu.edu/about/siteI keyword, fulltext, author, and title searchI BibTex and (partially) PDFs are provided


http://scholar.google.de/

http://www.informatik.uni-trier.de/~ley/db/

http://citeseerx.ist.psu.edu/about/site

Where to Search (II)

I Publisher sites are also a suitable targetI ACM Digital Library: http://portal.acm.org/dl.cfm

I Keyword, author, conference/literature (proceedings), and titlesearch

I Bibtex, mostly PDFs and other information are providedI IEEE Xplore: http://ieeexplore.ieee.org/Xplore/

guesthome.jsp?reload=trueI Similar to ACM, but only few PDFsI Extended access within university network

I Springer: http://www.springerlink.de/I Similar to previousI Extended access within university Network

I Further search possibilities: on author, research group oruniversity sites


http://portal.acm.org/dl.cfm

http://ieeexplore.ieee.org/Xplore/guesthome.jsp?reload=true

http://ieeexplore.ieee.org/Xplore/guesthome.jsp?reload=true

http://www.springerlink.de/

How to Search

Some hints to not get lost in the jungleI Use distinct keywords (fingerprint vs. fingerprint data)I Keep keywords simple (at most three words)I Otherwise, search for whole titleI Read abstract (and maybe introduction) ⇒ decision for

relevanceFirst insightsI Read abstract, introduction and background/related work

(coarse-grained) toI . . . get a first idea of the approachI . . . find other relevant papers


Information Retrieval

Finding the required informationI Read the paper carefullyI Omit formal parts/sectionsI Try to classify (core idea, main characteristics) ⇒ develop

classification/evaluation in mindI Understand the big pictureI Make notesI Do NOT translate each sentence


Finding your Team

Topics:I Topic 1 - Information Extraction from Legal TextbooksI Topic 2 - Collaborative Adaptive LayoutsI Topic 3 - JoinOrderBot 2019I Topic 4 - Don’t Bust the Memory Bank: Recycling Slots in

our String DictionaryI Topic 5 - YCSB Sharp: Driver ImplementationI Topic 6 - One Predicate Many Queries: OptimizerI Topic 7 - Cold Data Avoidance for ElfI Topic 8 - Joins on Elf


Documents

Scientific Project: Databases for Multi-dimensional …Team+Project/Material/...Scientiﬁc Project: Databases for Multi-dimensional Data, Genomics and modern Hardware David Broneske,