
EDIC RESEARCH PROPOSAL 1

Micro-architectural Analysis of In-Memory OLTP
Utku Sirin
I&C, EPFL

Abstract—Traditional disk-based online transaction processing (OLTP) systems mostly under-utilize the available micro-architectural resources. In-memory OLTP systems process all the data in main memory and can therefore omit the buffer manager component. In addition, they usually adopt more lightweight concurrency control mechanisms, optimized index structures, optimized query compilation techniques, and optimized logging mechanisms. Therefore, it is not straightforward to extrapolate how OLTP benchmarks behave at the micro-architectural level when run on an in-memory engine. In this work, we present the main design considerations of three in-memory OLTP systems, H-Store [25], HyPer [13], and Hekaton [6], and compare their micro-architectural behavior with two traditional disk-based OLTP systems, Shore-MT [2] and Microsoft SQL Server [19]. The results show that despite all the design differences, in-memory OLTP systems significantly under-utilize micro-architectural resources, similar to traditional disk-based OLTP systems.

Index Terms—thesis proposal, candidacy exam write-up, EDIC, EPFL, in-memory OLTP systems, micro-architectural analysis

I. INTRODUCTION

Traditional on-line transaction processing (OLTP) systems were designed more than 30 years ago, when main memories were measured in megabytes and machines had single-core uniprocessor architectures. Their main design characteristics can be summarized as follows:

Proposal submitted to committee: June 1st, 2015; Candidacy exam date: June 8th, 2015; Candidacy exam committee: Willy Zwaenepoel, Anastasia Ailamaki, Edouard Bugnion.

This research plan has been approved:

Date: ————————————

Doctoral candidate: ————————————(name and signature)

Thesis director: ————————————(name and signature)

Thesis co-director: ————————————(if applicable) (name and signature)

Doct. prog. director:————————————(B. Falsafi) (signature)

EDIC-ru/05.05.2009

• Multi-threading and locking-based concurrency control mechanisms

• Disk-based index structures
• Iterator-based query execution model
• ARIES-style Write-Ahead Logging (WAL) protocol for recovery

Over the three decades, however, hardware characteristics have changed dramatically. Today, DRAM prices are low enough to buy 1TB of main memory for ∼$30K. This is enough to keep most OLTP workloads entirely in main memory. Moreover, most modern servers have multi-socket multi-core architectures, providing an abundance of parallelism for OLTP systems. This dramatic change in hardware characteristics has triggered alternative design opportunities for a new generation of OLTP systems optimized for the case where the hot dataset resides in memory [25].

In-memory OLTP systems have revised the four main design characteristics of traditional disk-based OLTP systems. As in-memory systems keep all of the data in main memory, they remove the disk I/O latency from the critical path. Therefore, they omit the buffer manager component, which is essential for traditional disk-based OLTP systems as it provides the infinite-memory illusion for the DBMS. Eliminating or minimizing disk I/O operations further eliminates the need for multi-threading to hide I/O latencies, which enables the use of lightweight concurrency control mechanisms instead of traditional centralized locking-based mechanisms. Additionally, in-memory OLTP systems implement optimized lock&latch-free index structures with low data access times. Moreover, they adopt optimized query compilation techniques that produce highly optimized machine code with high instruction and data locality, instead of using the traditional iterator-based query execution model, which has poor instruction and data locality. Lastly, in-memory OLTP systems optimize their logging mechanisms to minimize the time spent on logging during the normal execution of transactions, by either optimizing the traditional ARIES-style WAL mechanism or relying on high availability [25], [13], [6].

There is a large body of work analyzing the micro-architectural behavior of traditional disk-based OLTP systems [5], [7], [28], [29], [1]. However, due to the distinctive design features of in-memory OLTP systems, these studies are not representative of in-memory OLTP systems. Therefore, in this work, we present the main design considerations of three in-memory OLTP systems: H-Store [25], developed by researchers at MIT, Yale, and Brown University; HyPer, developed by the database research group at the Technical University of Munich (TUM) [13]; and Microsoft SQL Server's in-memory OLTP engine, Hekaton [6]. Then, we analyze the micro-architectural behavior of the three in-memory OLTP


systems, and compare them with two traditional disk-based OLTP systems: the academic storage manager Shore-MT [2] and the commercial Microsoft SQL Server [19]. Our analysis shows that, despite all of their optimizations, in-memory OLTP systems still suffer from the micro-architectural under-utilization problem, similar to traditional disk-based OLTP systems. The number of instructions retired per cycle barely reaches one, while L1 instruction (L1-I) misses and long-latency data misses from the last-level cache (LLC) are the dominant factors in the overall stall time. We therefore conclude that software-level optimizations are not enough to address the micro-architectural under-utilization problem. We propose to reconsider the traditional OLTP system design by co-engineering hardware and software, with maximizing the utilization of micro-architectural resources as the highest-priority goal.

The rest of the document is organized as follows. Section II presents the motivation for a micro-architectural analysis of in-memory OLTP systems. Section III presents specific design considerations of the three in-memory OLTP systems we analyzed: H-Store, HyPer, and Hekaton. Within Section III, Section III-A presents the lightweight concurrency control mechanisms, Section III-B presents the optimized index structures, Section III-C presents the optimized query compilation techniques, and Section III-D explains the optimized logging mechanisms. Section IV presents the micro-architectural analysis of the three in-memory OLTP systems, contrasting them with the two traditional disk-based OLTP systems. Lastly, Section V concludes with future research directions.

II. MOTIVATION

Harizopoulos et al. present a detailed instruction-level breakdown of the major components of a traditional disk-based OLTP system (Shore) running a subset of one of the standard TPC benchmarks, TPC-C. Figure 1 shows the breakdown of the instruction count for the New Order transaction of the TPC-C benchmark. As can be seen, only ∼7% of the instructions are actually spent doing useful work, shown by the bottom-most white box, while more than 75% of the instructions are spent in the buffer manager, locking, latching, and logging components. The hand-coded optimizations refer to B-tree package optimizations that improve the performance of the B-tree index [8].

In-memory OLTP systems remove the buffer manager component. Since they do not have to use multi-threading to hide the disk I/O latency, they implement lightweight concurrency control mechanisms. They optimize their logging mechanisms to reduce the time spent on logging during the normal execution of transactions. Additionally, they implement lock&latch-free optimized index structures, and optimized query compilation techniques producing highly optimized compiled machine code. Therefore, in-memory OLTP systems can potentially reduce the total instruction footprint by more than 65%, and significantly increase the ratio of useful work. In their study, Stonebraker & Weisberg show that the amount of useful work can be increased up to 95% in VoltDB, the commercial product of the H-Store system [26].

Hence, the micro-architectural behavior of OLTP workloads running on disk-based systems cannot be representative of in-memory systems. In this work, we outline the main design considerations of in-memory OLTP systems, and analyze their micro-architectural behavior by contrasting them with traditional disk-based OLTP systems.

Fig. 1: Instruction count breakdown for the New Order transaction from TPC-C.

III. DESIGN CONSIDERATIONS OF IN-MEMORY OLTP

In this section, we present the four design considerations of in-memory OLTP systems.

A. Light-weight Concurrency Control Mechanisms

This section describes the lightweight concurrency control mechanisms adopted by the in-memory OLTP systems we analyze. In an OLTP setting, the amount of useful work per transaction typically takes less than one millisecond, while disk-access latencies are usually several milliseconds. Therefore, traditional disk-based systems need multi-threading to overlap the long disk I/O latency with useful work and thereby increase CPU utilization. Having eliminated the disk-access latency from the critical path, in-memory systems mitigate the need for multi-threading. This makes it possible to use lightweight concurrency control mechanisms instead of traditional centralized locking-based concurrency control mechanisms. Locking-based concurrency control mechanisms are costly as they are pessimistic. Moreover, they are typically scalability bottlenecks for multi-core systems as they rely on highly contended shared data structures such as the lock manager [25].

1) Partitioning-based approaches: The first lightweight concurrency control mechanism partitions the data and runs everything single-threaded within each partition, with partitions executing in parallel. To do that, the system classifies each transaction into one of three types: single-sited, one-shot, and general. A single-sited transaction does all of its work within a single partition, without touching any other partition. A one-shot transaction touches more than one partition; however, it can be decomposed into a set of SQL statements that can be executed in parallel at different partitions. Lastly, a general transaction is any transaction that


is neither single-sited nor one-shot. The main characteristic of general transactions is that they touch multiple partitions of the data and require intermediate communication between the multiple phases of the transaction.

Once a transaction arrives at the system, the system identifies the class of the transaction at the client side, and sends the transaction to the correct node if it is single-sited, or to a global controller if it is a one-shot or general transaction. The global controller can be any node in the computing grid, and acts as an execution supervisor controlling the execution and communication of the sub-plans of the transaction at various nodes, so that the resulting execution schedule is serializable. The in-memory OLTP system H-Store and its commercial product VoltDB pioneered this approach for concurrency control in main-memory OLTP systems [25], [26].
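As a rough illustration, the classification above can be sketched as follows. The types and names here are hypothetical and do not reflect H-Store's actual code; a statement's partition set is assumed to be known up front from the stored-procedure analysis.

```cpp
#include <set>
#include <vector>

// Hypothetical transaction descriptor: which partitions each SQL
// statement touches, and whether statements need to exchange
// intermediate results between phases.
enum class TxnClass { SingleSited, OneShot, General };

struct TxnPlan {
    std::vector<std::set<int>> stmt_partitions; // partitions touched per statement
    bool needs_intermediate_exchange;           // cross-phase communication?
};

// Classify a transaction into the three categories described above.
TxnClass classify(const TxnPlan& t) {
    std::set<int> all;
    for (const auto& s : t.stmt_partitions)
        all.insert(s.begin(), s.end());
    if (all.size() <= 1)
        return TxnClass::SingleSited; // all work stays in one partition
    if (!t.needs_intermediate_exchange)
        return TxnClass::OneShot;     // decomposable into parallel statements
    return TxnClass::General;         // needs an execution supervisor
}
```

A single-sited transaction would then be routed directly to its node, while the other two classes would be handed to the global controller.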

A slightly different approach using the same partitioning mechanism handles general transactions by running them in exclusive mode. Thus, the system trades the overhead of communication across nodes for a higher average transaction latency. The in-memory OLTP system HyPer adds this approach to the main partitioning-based approach explained in the previous paragraph [13].

2) Optimistic approaches: Proponents of the other lightweight concurrency control mechanism argue that, though partitioning-based concurrency control mechanisms work well for partitionable workloads, their performance degrades quickly if the workload is not partitionable [6]. Therefore, they prefer lock&latch-free optimistic multi-version concurrency control (OMVCC) mechanisms, where all transactions can access any part of the data. OMVCC mechanisms optimistically assume that transactions mostly run without interfering with each other. Therefore, they allow transactions to run concurrently by reading from a common database and writing to a private copy. At the commit time of each transaction, OMVCC mechanisms validate whether the modifications of the transaction are applicable to the database with respect to the chosen isolation level.

The proposed OMVCC mechanism keeps a read set for each transaction, which contains pointers to the set of objects that the transaction reads during its execution. At the validation phase of a transaction, the system verifies that the objects in the read set of the transaction have not been changed by any other transaction during the execution of the validating transaction. Moreover, if two different transactions update the same variable concurrently, without one of them being validated/committed, the system applies a first-writer-wins rule and aborts the later writer. Additionally, the mechanism does not allow transactions to read an uncommitted version of an object. Lastly, the system keeps a scan set for each transaction, storing the information needed to repeat every scan. At the validation phase of a transaction, it repeats all the index scans that the transaction has done, and verifies that no phantom object was inserted during the transaction.

Note that the system also keeps a write set for each transaction, which includes pointers to the set of objects that the transaction has modified, in order to apply the modifications of a transaction that commits successfully.
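A minimal sketch of the read-set validation step, assuming each object carries a version counter bumped on every committed write; the names are illustrative and greatly simplify Hekaton's actual timestamp-based scheme.

```cpp
#include <cstdint>
#include <vector>

struct Object { std::uint64_t version = 0; }; // bumped on every committed write

// One read-set entry: the object read and the version observed.
struct ReadEntry { const Object* obj; std::uint64_t version_seen; };

struct Txn {
    std::vector<ReadEntry> read_set;

    std::uint64_t read(const Object& o) {
        read_set.push_back({&o, o.version}); // remember what we saw
        return o.version;
    }

    // Validation phase: every object read must be unchanged at commit time.
    bool validate() const {
        for (const auto& e : read_set)
            if (e.obj->version != e.version_seen)
                return false; // another transaction changed it: abort
        return true;
    }
};
```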

SQL Server's in-memory OLTP engine, Hekaton, implements this approach [6].

B. Optimized Index Structures

This section describes the optimized index structures adopted by the in-memory OLTP systems we analyze. There is a large body of work optimizing index structures for main-memory database systems. The first study, the T-Tree, dates back to 1986 and was proposed by Lehman & Carey [16]. This study was followed by several cache-conscious indexing strategies for main-memory databases [21], [22], [15]. More recently, several studies have been proposed to optimize index structures for modern hardware architecture features such as Single Instruction Multiple Data (SIMD) instructions [23], [14].

Hash tables are another popular index structure for main-memory database systems. They are particularly appealing due to their O(1) constant access time, as opposed to the O(log n) access time of search-based index structures, where the access time depends on the number of keys in the index, i.e., n.

Fig. 2: An example radix tree.

Although hash tables provide O(1) constant access time, they do not support range queries. On the other hand, cache-conscious search-based index structures support range scans as well as point queries; however, their time complexity is O(log n), where n is the number of keys in the index. A third class of index structures is radix trees. Radix trees provide O(k) access time, where k is the length of the key, and also support range scans. Therefore, similar to hash tables, their access time is independent of the number of keys in the index structure, and similar to search-based index structures, they support range scans [17]. Figure 2 shows an example radix tree. A search in a radix tree starts by checking the first character of the key and directly jumping to the node where all the keys start with the same character. It proceeds through the following characters of the key in the same way until it reaches the bottom-most level of the tree, where the values are stored. Therefore, the data access time depends only on the length of the key, i.e., k.
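The lookup described above can be sketched with a minimal classical radix tree over lowercase keys, using a fixed fan-out of 26; this illustrates the O(k) search only, and is not HyPer's ART.

```cpp
#include <array>
#include <memory>
#include <optional>
#include <string>

// Classical radix tree node: fixed fan-out of 26 (lowercase a-z keys).
struct RadixNode {
    std::array<std::unique_ptr<RadixNode>, 26> child{};
    std::optional<int> value; // set only at the node ending a stored key
};

void radix_insert(RadixNode& root, const std::string& key, int v) {
    RadixNode* n = &root;
    for (char c : key) {                // one tree level per character
        auto& slot = n->child[c - 'a'];
        if (!slot) slot = std::make_unique<RadixNode>();
        n = slot.get();
    }
    n->value = v;
}

std::optional<int> radix_lookup(const RadixNode& root, const std::string& key) {
    const RadixNode* n = &root;
    for (char c : key) {                // cost depends on key length k only
        const auto& slot = n->child[c - 'a'];
        if (!slot) return std::nullopt;
        n = slot.get();
    }
    return n->value;
}
```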

1) Adaptive radix tree: The first index structure that we explain is a variation of radix trees, the Adaptive Radix Tree (ART). The classical radix tree implementation allocates the same number of children for every node in the tree, i.e., all nodes are treated as having the same fan-out. Figure 3 shows an


example of the classical radix tree implementation. However, this implementation significantly suffers from space under-utilization since, most of the time, nodes have different fan-outs. Therefore, ART implements an adaptive version of radix trees, where each node is adaptively assigned one of four node types having different fan-outs. Figure 4 shows an example of ART. Thus, ART improves the space utilization of classical radix tree implementations while still benefiting from the radix tree's O(k) data access time and support for range queries. This index structure was proposed by Leis et al. and is integrated into the in-memory OLTP system HyPer [17], [13].

Fig. 3: A classical radix tree implementation.

Fig. 4: Adaptive radix tree (ART) implementation.

2) Bw-Tree: The second index structure that we explain is a variation of traditional B-Trees, the Bw-Tree. The Bw-Tree is a lock&latch-free B-Tree that supports atomic compare-and-swap (CAS) update operations. The Bw-Tree assigns each of its nodes a logical page identifier (LPID), which is mapped to the address of the page in physical main memory. All links within the Bw-Tree are based on LPIDs rather than physical pointers. The mapping between logical and physical pages is maintained by a mapping table data structure. Therefore, once a page is updated, the Bw-Tree only needs to update the corresponding logical-to-physical entry in the mapping table. This enables atomic CAS update operations. Updates to pages are prepended to the original pages in the form of delta records. Once the system decides to apply the update to the original page, it simply changes the mapping table entry of the corresponding logical page with an atomic CAS instruction. Figure 5 (left) shows how the atomic delta update is realized. The dashed line represents the prior address of page P, and the solid line represents the new address of page P, including the delta record. When required, multiple delta records of a page are consolidated into a single page, as shown in Figure 5 (right). Moreover, outdated delta records and pages are garbage-collected by a background process.
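The mapping-table idea can be sketched as follows; the structures are illustrative stand-ins rather than the real Bw-Tree layout, with each slot mapping an LPID to the physical address of the current page/delta chain.

```cpp
#include <atomic>

// A delta record prepended to a page: the update itself plus a pointer
// to the prior head of the chain (placeholder fields, for illustration).
struct Delta {
    int change;
    void* next; // prior page or delta record
};

// One mapping-table slot: logical page ID -> physical address.
using Slot = std::atomic<void*>;

// Prepend a delta record atomically; retry if another thread's CAS won.
void install_delta(Slot& slot, Delta* d) {
    void* old = slot.load();
    do {
        d->next = old; // chain the delta onto the current head
    } while (!slot.compare_exchange_weak(old, d));
}
```

On failure, the CAS refreshes `old` with the current head, so the retry re-chains the delta before trying again, which is why no latch is needed.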

SQL Server's in-memory OLTP engine, Hekaton, integrates this index structure into its system [18], [6]. Observe that

Fig. 5: Atomic delta updates (left) and consolidation of a set of delta updates (right) of the Bw-Tree.

the Bw-Tree works well with Hekaton's optimistic multi-version concurrency control (OMVCC) mechanism. The Bw-Tree provides the versioning required by OMVCC in a lock&latch-free way.

In addition to the Bw-Tree, Hekaton also supports hash tables as index structures for workloads with only point queries.

Lastly, the in-memory OLTP system H-Store implements a traditional B-Tree with the block size tuned to the width of an L2 cache line on the machine being used. Its designers argue that indexing code should be optimized only if it becomes a significant performance bottleneck.

C. Optimized Query Compilation Techniques

This section describes the optimized query compilation techniques adopted by the in-memory OLTP systems we analyze. Traditional database systems translate a query into an algebraic query plan that can be represented as an operator tree, and execute the query plan based on the classical iterator-based execution model. Figure 6 shows an example SQL query and its corresponding algebraic query plan as an operator tree, and Figure 7 shows a simplified definition of the abstract Iterator class. In the iterator execution model, every node, i.e., every operator, in the operator tree implements the abstract Iterator class. Every node calls the get_next() function to request, i.e., pull, tuples from its input operators, i.e., from its children, and these calls recursively propagate from the top to the bottom level of the operator tree. Therefore, for each tuple, the iterator model requires at least one call to the virtual get_next() function, amounting to millions of function calls. This results in poor cache locality, as each function call potentially invalidates many instruction cache lines [10], [20], [6].
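A stripped-down version of this model, assuming a scan feeding a filter over integer tuples (illustrative, not any system's actual interface), makes the per-tuple virtual call visible:

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Pull-based iterator model: every operator implements get_next(),
// and each tuple costs a chain of virtual calls down the operator tree.
struct Iterator {
    virtual std::optional<int> get_next() = 0; // one virtual call per tuple
    virtual ~Iterator() = default;
};

struct Scan : Iterator {
    std::vector<int> rows;
    std::size_t pos = 0;
    explicit Scan(std::vector<int> r) : rows(std::move(r)) {}
    std::optional<int> get_next() override {
        if (pos == rows.size()) return std::nullopt;
        return rows[pos++];
    }
};

struct Filter : Iterator { // keeps only even values, as an example predicate
    Iterator& child;
    explicit Filter(Iterator& c) : child(c) {}
    std::optional<int> get_next() override {
        while (auto t = child.get_next()) // pull from the child operator
            if (*t % 2 == 0) return t;
        return std::nullopt;
    }
};
```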

1) Iterator-like execution model: In-memory OLTP systems adopt optimized query compilation techniques that directly produce compiled machine code, instead of using the traditional iterator-based execution model. The first approach implements an interface similar to the abstract Iterator class shown in Figure 7. This interface includes get_first(), get_next(), return_row(), and return_done() methods. However, unlike the iterator model, the optimized compilation technique does not use function calls to implement these interface methods. Instead, it collapses the entire query execution plan into one function, using gotos and labels to connect the different interface methods. Thus, it avoids the expensive virtual function calls of the traditional iterator model, which waste many instructions without doing useful work. Once the entire execution plan is reduced to a single function, e.g., to C code, the


Fig. 6: Example SQL query and its corresponding algebraic execution plan.

Fig. 7: A simplified definition for the abstract Iterator class.

produced code is simply compiled into machine code, e.g., with a C compiler. SQL Server's in-memory OLTP engine, Hekaton, proposes and implements this approach [6].

2) Data-centric push-based execution model: Although the first optimized compilation technique improves on the traditional iterator model by eliminating the expensive virtual function calls, it still uses an operator-centric pull-based query execution model, where each operator requests, i.e., pulls, its input tuples from its children, and these requests are recursively propagated from the top to the bottom of the operator tree. This results in many goto statements within the code, leading to poor cache locality.

The second optimized query compilation technique, on the other hand, proposes a data-centric push-based query execution model. It takes an algebraic execution plan and breaks the algebraic expression tree into a set of pipelines. Figure 8 presents an example algebraic expression tree decomposed into a set of pipelines; its original version, without pipeline boundaries, and its corresponding SQL query are shown in Figure 6. The decomposed algebraic execution plan is translated into highly optimized C++/LLVM code, whose simplified pseudo-code version is shown in Figure 9. As can be seen, the generated code includes four tight for loops. Each for loop corresponds to one pipeline fragment in the decomposed algebraic expression tree. These tight for loops provide very high instruction and data locality, as small code segments iterate over a large amount of data. Once the code is translated into C++/LLVM code, it is compiled with a

C++/LLVM compiler into highly optimized machine code. As the optimized query compilation technique breaks the query plan into pipelines while trying to maximize data locality, the method is data-centric. Moreover, as the operators do not ask for, i.e., pull, tuples from their children, but the data is instead pushed towards the operators within each pipeline, the technique is push-based. The in-memory OLTP system HyPer integrates this query compilation technique [20], [13].
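For contrast, the pipeline shape that compilation produces can be sketched as a single tight loop for a hypothetical filter-and-aggregate query; this is an illustration of the idea, not code generated by HyPer.

```cpp
#include <vector>

// One pipeline: scan -> filter -> aggregate, fused into a tight loop.
// Tuples are pushed through the operators; a small code segment
// iterates over all the data, giving high instruction and data locality.
int sum_of_evens(const std::vector<int>& rows) {
    int sum = 0;
    for (int v : rows)
        if (v % 2 == 0)
            sum += v;
    return sum;
}
```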

Fig. 8: An example execution plan decomposed into a set of pipelines for the query in Figure 6.

Fig. 9: Pseudo-code for the compiled query for the algebraic execution plan shown in Figure 8.

Note that both of the optimized compilation techniques take the SQL query, parse it, translate it into an algebraic expression, and optimize the algebraic expression in exactly the same way that traditional query processing engines do. After that point, however, the two optimized compilation techniques deviate from the original execution model and compile the execution plan into highly optimized machine code in different ways.

The traditional iterator-based query execution model is simple to implement and easy to debug; however, it brings the significant overhead of extensive virtual function calls. By eliminating the virtual function calls, optimizing for high instruction and data locality, and compiling into highly optimized machine code, the optimized query compilation techniques significantly improve the query execution time of OLTP systems.

Lastly, the in-memory OLTP system H-Store does not implement an optimized query compilation technique. However,


it relies on stored procedures, where transactions are analyzed and classified at compile time, as explained in Section III-A1. Users execute transactions by calling the stored procedures and specifying the run-time parameters. However, the details of the query execution model are not revealed [13].

D. Optimized Logging Mechanisms

This section describes the optimized logging mechanisms adopted by the in-memory OLTP systems we consider. Traditional disk-based OLTP systems implement the classical Write-Ahead Logging (WAL) protocol for recovery. The three basic rules of the WAL protocol are the following:

1) Each modification to a database page generates a log record that is enqueued to an in-memory data structure, the log tail. The log record of a modification to a database page must be flushed to non-volatile storage before the database page itself is flushed.

2) Log records must be flushed to non-volatile storage in order.

3) A commit request by a transaction generates a commit log record. The commit log record has to be flushed to non-volatile storage before the transaction commits successfully.

While the first rule makes it possible to undo uncommitted changes to a database page, the second and third rules make it possible to redo committed changes to the database. Thereby, in case of a system failure, the database records can be recovered by redoing the committed transactions' logs, i.e., redo logging, and undoing the uncommitted transactions' logs, i.e., undo logging [10].

1) Optimizing the WAL protocol: In-memory OLTP systems implement several optimizations to the WAL protocol. First, they do not have to persist undo log information, i.e., log records for uncommitted modifications, since everything is in memory and no database page is flushed to non-volatile storage. Therefore, in-memory OLTP systems generate log records only at commit time, and persist only transaction-consistent redo log information. Secondly, in-memory systems log only the logical effects of transactions, rather than their entire physical effects, such as after-images for all modified byte ranges. Thereby, they push work to recovery time while minimizing the logging overhead during normal transaction execution. Third, in-memory systems implement a group commit protocol, where multiple transaction log records are grouped together and flushed to non-volatile storage as one big I/O, to minimize the per-transaction effect of the I/O latency. All three in-memory OLTP systems we analyze implement these three optimizations.
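The group commit idea can be sketched as a size-triggered flush of an in-memory log tail; the names are hypothetical, and real systems also flush on a timeout rather than only when a group fills up.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// In-memory log tail that batches commit records and issues one big
// write per full group instead of one write per transaction.
struct LogTail {
    std::vector<std::string> pending; // records not yet on stable storage
    std::size_t group_size;
    std::size_t flushes = 0;          // number of I/O operations issued

    explicit LogTail(std::size_t g) : group_size(g) {}

    void append(const std::string& rec) {
        pending.push_back(rec);
        if (pending.size() >= group_size) {
            ++flushes;       // stand-in for a single large write to disk
            pending.clear(); // these records are now durable
        }
    }
};
```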

Some systems also implement asynchronous logging, which relaxes the third rule of the WAL protocol: as soon as the log record of a transaction is enqueued to the log tail, the transaction is allowed to commit. However, this approach might cause data loss since, in case of a failure, the effects of the transactions whose log records have not been flushed to non-volatile storage are lost. While Hekaton and the commercial product of H-Store, VoltDB, can be configured to do asynchronous logging, HyPer always logs asynchronously [6], [26], [13].
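
The durability gap this opens can be made concrete with a sketch (names are again illustrative): the commit is acknowledged on enqueue, and only a later background flush makes it durable.

```python
# Sketch of asynchronous logging: commit is acknowledged as soon as the
# record is enqueued; a crash before the flush loses those transactions.
class AsyncLog:
    def __init__(self):
        self.tail = []      # in-memory log tail
        self.durable = []   # simulated non-volatile storage

    def commit(self, txn_id):
        self.tail.append(("COMMIT", txn_id))
        return True  # acknowledged immediately (relaxed rule 3)

    def background_flush(self):
        self.durable.extend(self.tail)
        self.tail.clear()

log = AsyncLog()
log.commit(1)
log.background_flush()
log.commit(2)  # acknowledged to the client, but not yet flushed
# What a crash right now would preserve: transaction 1 only.
survivors = [txn for (_, txn) in log.durable]
```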

One last optimization the in-memory systems implement is the parallelization of the logging process. Multiple concurrently running logging threads are responsible for generating log records, enqueuing them to a log tail, and flushing the log tails to non-volatile storage. Multiple log tails are kept to avoid a possible scalability bottleneck on the log tail. The correct order among the concurrently logged transactions is sustained based on the end timestamps assigned to each transaction at its commit time. SQL Server's in-memory OLTP engine, Hekaton, implements this optimization. Observe that this optimization works well with the optimistic multi-version concurrency control (OMVCC) mechanism of Hekaton, explained in Section III-A2, as OMVCC already assigns end timestamps to transactions for its own purposes [6].
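
How end timestamps restore a global order over multiple log tails can be sketched as a timestamp merge at recovery time; the tail contents below are hypothetical:

```python
import heapq

# Sketch of parallelized logging: each logging thread owns its own log
# tail, and recovery merges the tails by the end timestamp assigned to
# each transaction at commit time. Illustrative only.
def recover_order(tails):
    """Merge per-thread log tails into one globally ordered redo stream.

    Each tail is already sorted by end timestamp, since a thread appends
    commit records in timestamp order; heapq.merge combines the sorted
    streams without re-sorting everything.
    """
    return [txn for ts, txn in heapq.merge(*tails)]

# (end_timestamp, txn_id) records spread across three hypothetical tails:
tails = [[(1, "A"), (5, "D")], [(2, "B"), (6, "E")], [(3, "C")]]
order = recover_order(tails)  # ["A", "B", "C", "D", "E"]
```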

2) Relying on high-availability: Another approach to optimizing the logging process is to rely on high availability for real-time failover. Instead of keeping redo log information in non-volatile storage as in classical ARIES-style logging, in-memory OLTP systems keep multiple copies of exactly the same database on several redundant servers, i.e., hot standbys. Once a machine fails, it is simply replaced by one of the hot standbys. Since these redundant servers are identical copies of the primary database, they can also be used for load-balancing purposes. Thus, machine failures cause only degraded performance rather than a complete loss of the data. To provide consistency among the different copies of the same data, i.e., among replicas, systems using a single-threaded partitioning-based concurrency control mechanism simply execute the same transaction at all replicas in parallel. Since the system is single-threaded, the replicas executing the same transaction will end up with identical copies of the data. The in-memory OLTP system H-Store implements this approach [25].
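
The determinism argument can be demonstrated with a toy key-value store (the transaction format and operations are hypothetical): applying the same serial transaction stream to every replica yields identical states.

```python
# Sketch of single-threaded deterministic replication: every replica
# applies the same transaction stream serially, with no concurrency,
# so replica states stay identical. Illustrative only.
def apply(state, txn):
    op, key, val = txn
    if op == "PUT":
        state[key] = val
    elif op == "ADD":
        state[key] = state.get(key, 0) + val
    return state

txns = [("PUT", "x", 1), ("ADD", "x", 2), ("PUT", "y", 5)]
primary, standby = {}, {}
for t in txns:          # same order on both replicas: determinism holds
    apply(primary, t)
    apply(standby, t)
# primary == standby == {"x": 3, "y": 5}
```

The guarantee hinges on the transaction logic itself being deterministic; anything like a random number or a wall-clock read would break replica equality.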

To keep multiple replicas consistent, systems using optimistic multi-version concurrency control mechanisms, such as Hekaton, either do not use the redundant servers for load balancing and simply replay on the redundant servers the same transactions executed on the primary server, or implement more complex consistency mechanisms. Hekaton is integrated with SQL Server's AlwaysOn feature, which provides high availability and supports failover; however, the details of the high-availability and failover protocol are not revealed [6].

Note that although H-Store does not do redo logging, its commercial product VoltDB flushes redo log information to a non-volatile storage unit and takes asynchronous transaction-consistent checkpoints of the state of main memory. The reason is that relying on redundant replicas for failover is problematic in the case of cluster-wide errors that cause all the relied-upon nodes in the cluster to go down. Once a cluster-wide error occurs, the system restores the latest transaction-consistent checkpoint and then re-executes the transaction logs written after the checkpoint was taken. Checkpointing is also implemented by HyPer and Hekaton asynchronously as a background process to reduce the recovery time.
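
The checkpoint-plus-redo recovery path can be sketched as follows; the log sequence number (LSN) bookkeeping and the record layout are our own simplification:

```python
# Sketch of checkpoint-plus-redo-log recovery after a cluster-wide
# failure: restore the latest transaction-consistent checkpoint, then
# replay only the redo records logged after it was taken. Illustrative.
def recover(checkpoint, redo_log, checkpoint_lsn):
    state = dict(checkpoint)              # restore the snapshot
    for lsn, key, val in redo_log:
        if lsn > checkpoint_lsn:          # replay only the log suffix
            state[key] = val
    return state

checkpoint = {"x": 1, "y": 2}             # taken at LSN 10
redo_log = [(9, "x", 1), (11, "y", 7), (12, "z", 3)]
state = recover(checkpoint, redo_log, checkpoint_lsn=10)
# state == {"x": 1, "y": 7, "z": 3}
```

Transaction-consistent checkpoints make the replay simple: no undo pass is needed, since the snapshot never contains uncommitted effects.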


IV. MICRO-ARCHITECTURAL ANALYSIS OF IN-MEMORY OLTP

In this section, we analyze the micro-architectural behavior of the three in-memory OLTP systems, H-Store, HyPer, and Hekaton, and contrast them with the two traditional disk-based OLTP systems, Shore-MT and SQL Server.

A. Experimental Setup and Methodology

We run our experiments on a modern commodity server with an Intel Xeon CPU E5-2640 v2 (Ivy Bridge) processor with 2 sockets and 8 cores per socket. The machine has 256GB of main memory, 32KB per-core L1-I & L1-D caches, 256KB per-core L2 caches, and a 20MB shared last-level cache.

To collect numbers on various hardware events, we use Intel VTune Amplifier XE 2015 [12], which provides an API for lightweight hardware counter sampling. We run all the experiments on RHEL 6.5 with Linux kernel version 2.6.32.

We analyze three in-memory OLTP systems: VoltDB (Community Edition Version 4.8) [30], [26], HyPer (online demo version) [11], [13], and Hekaton, the in-memory OLTP engine of Microsoft SQL Server [9], [6], and two disk-based systems: the open-source academic Shore-MT storage manager [24], [2] and the commercial Microsoft SQL Server [19].

B. Results

To analyze the five systems, we implement a micro-benchmark where we read one row at a time from a randomly generated table with two columns (key and value) of type long. We use a single worker thread executing the benchmark transactions, and exclude the other threads that are responsible for background tasks, e.g., communication between the server and client, parsing transactions, etc.

[Figure: IPC values for Shore-MT, SQL Server, VoltDB, HyPer, and Hekaton at database sizes of 1MB, 10MB, 10GB, and 100GB; the y-axis ranges from 0 to 4, where 4 is marked as MAX.]

Fig. 10: Effect of database size on the IPC value.

We generate databases of size 1MB, 10MB, 10GB, and 100GB, run the micro-benchmark, and collect the VTune hardware counter results. Figure 10 shows the number of instructions retired within one CPU cycle, i.e., the instructions-per-cycle (IPC) value, across the five systems under analysis for different database sizes. As can be seen, except for HyPer with databases of size 1MB and 10MB, all the systems at all data sizes have an IPC value of less than one, although the server used in our experiments is 4-way issue, i.e., has the ability to retire up to four instructions per cycle.
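
IPC is simply the ratio of two hardware counters; the sketch below shows the arithmetic with hypothetical counter readings (the counter names follow common Intel conventions but are illustrative here):

```python
# Sketch of how IPC is derived from two hardware counters.
def ipc(instructions_retired, cpu_cycles):
    """Instructions retired per CPU cycle."""
    return instructions_retired / cpu_cycles

# Hypothetical counter readings for one sampling interval:
sample = {"INST_RETIRED": 3_000_000, "CPU_CLK_UNHALTED": 4_000_000}
value = ipc(sample["INST_RETIRED"], sample["CPU_CLK_UNHALTED"])  # 0.75
# Fraction of the 4-wide retire bandwidth actually used:
utilization = value / 4
```

An IPC of 0.75 on a 4-way issue core thus means less than a fifth of the available retire bandwidth is being used.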

To better understand the IPC values of the systems under analysis, Figure 11 shows the number of stalled CPU cycles caused by instruction and data misses per 1000 instructions for each level of the cache hierarchy. For example, L1-I stands for the stalled cycles caused by misses in the first level of the instruction cache.
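
The metric of Figure 11 normalizes stall cycles by instruction count; a minimal sketch of the arithmetic, with hypothetical counter values:

```python
# Sketch of the stalls-per-1000-instructions metric used in Fig. 11.
def stalls_per_kinstr(stall_cycles, instructions_retired):
    """Stalled CPU cycles per 1000 retired instructions."""
    return stall_cycles * 1000 / instructions_retired

# e.g., 600,000 L1-I stall cycles over 2,000,000 retired instructions:
l1i_stalls = stalls_per_kinstr(600_000, 2_000_000)  # 300.0
```

Normalizing per 1000 instructions makes runs of different lengths and systems with different instruction footprints directly comparable.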

From Figure 11, we observe that all the systems except HyPer suffer significantly from L1-I misses. HyPer, on the other hand, exhibits very good instruction locality, with almost no instruction misses, thanks to its aggressive query compilation optimizations. However, for large data sizes such as 10GB and 100GB, it suffers dramatically from LLC data misses.

Therefore, we conclude that, despite all of the optimizations explained in Section III, in-memory OLTP systems still suffer from a micro-architectural under-utilization problem similar to that of traditional disk-based OLTP systems. Both in-memory and traditional systems have IPC values around one, although the maximum number of instructions that can be retired per cycle is four. Except for HyPer, all the systems suffer most from L1-I misses, which are the main cause of the low IPC values. While HyPer almost eliminates instruction misses through aggressive query compilation optimizations, for large data sizes it suffers dramatically from LLC data misses, resulting in even lower IPC values.

V. RESEARCH PROPOSAL: CO-ENGINEERING HARDWARE AND SOFTWARE FOR DATA MANAGEMENT

Based on our micro-architectural analysis of in-memory OLTP systems, we observe that software-level optimizations, even when extensive enough to almost completely eliminate instruction misses, are not sufficient to address the micro-architectural under-utilization problem. Therefore, we propose to thoroughly analyze the micro-architectural under-utilization problem of traditional and in-memory OLTP systems, and to reconsider database design by setting maximizing the utilization of micro-architectural resources as the highest-priority goal.

More specifically, the first step we plan to follow is a finer-grained micro-architectural analysis of in-memory OLTP systems. In particular, we plan to analyze the direct effect of each optimization of in-memory systems on the utilization of micro-architectural resources under different workloads and configurations. Based on our fine-grained analysis, we plan to investigate what the boundaries of database architectures are in terms of their utilization of micro-architectural resources. For example, is it ultimately unavoidable to have low IPC values due to either instruction or data misses? Next, we plan to co-engineer hardware and software together, aiming to maximize the utilization of micro-architectural resources. Our goal is to understand and exploit domain-specific characteristics of data management systems to develop hardware and software particularly efficient for data management tasks, whose initial steps are taken by [27], [4], [3]. Finally, we plan to generalize our understanding of


[Figure: stall cycles per 1000 instructions, broken down into L1-I, L2-I, LLC-I, L1-D, L2-D, and LLC-D stalls, for Shore-MT, SQL Server, VoltDB, HyPer, and Hekaton at database sizes of 1MB, 10MB, 10GB, and 100GB; the y-axis ranges from 0 to 600, with two bars of roughly 900 and 1000 truncated.]

Fig. 11: Stall cycles from the different levels of the memory hierarchy as we increase the database size.

hardware-software interaction in data management systems to other computer systems.

REFERENCES

[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a Modern Processor: Where Does Time Go? In VLDB, pages 266–277, 1999.

[2] A. Ailamaki, R. Johnson, I. Pandis, and P. Tozun. Toward Scalable Transaction Processing: Evolution of Shore-MT. PVLDB, 6(11):1192–1193, 2013.

[3] I. Atta, P. Tozun, A. Ailamaki, and A. Moshovos. SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads. In MICRO, pages 188–198, 2012.

[4] I. Atta, P. Tozun, X. Tong, A. Ailamaki, and A. Moshovos. STREX: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution. In ISCA, 2013.

[5] L. A. Barroso, K. Gharachorloo, and E. Bugnion. Memory System Characterization of Commercial Workloads. In ISCA, pages 3–14, 1998.

[6] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's Memory-optimized OLTP Engine. In SIGMOD, pages 1243–1254, 2013.

[7] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In ASPLOS, pages 37–48, 2012.

[8] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. OLTP Through the Looking Glass, and What We Found There. In SIGMOD, pages 981–992, 2008.

[9] Hekaton Breaks Through. http://research.microsoft.com/en-us/news/features/hekaton-122012.aspx.

[10] J. M. Hellerstein, M. Stonebraker, and J. Hamilton. Architecture of a Database System. Foundations and Trends in Databases, 1(2):141–259, 2007.

[11] HyPer. http://hyper-db.de/.

[12] Intel. Intel VTune Amplifier XE Performance Profiler. http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/.

[13] A. Kemper and T. Neumann. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In ICDE, pages 195–206, 2011.

[14] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. In SIGMOD, pages 339–350, 2010.

[15] K. Kim, J. Shim, and I.-h. Lee. Cache Conscious Trees: How Do They Perform on Contemporary Commodity Microprocessors? In ICCSA, pages 189–200, 2007.

[16] T. J. Lehman and M. J. Carey. A Study of Index Structures for Main Memory Database Management Systems. In VLDB, pages 294–303, 1986.

[17] V. Leis, A. Kemper, and T. Neumann. The Adaptive Radix Tree: ARTful Indexing for Main-memory Databases. In ICDE, pages 38–49, 2013.

[18] J. Levandoski, D. Lomet, and S. Sengupta. The Bw-Tree: A B-Tree for New Hardware Platforms. In ICDE, pages 302–313, 2013.

[19] Microsoft SQL Server. http://www.microsoft.com/en-us/server-cloud/products/sql-server/.

[20] T. Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB, 4(9):539–550, 2011.

[21] J. Rao and K. A. Ross. Cache Conscious Indexing for Decision-Support in Main Memory. In VLDB, pages 78–89, 1999.

[22] J. Rao and K. A. Ross. Making B+-Trees Cache Conscious in Main Memory. SIGMOD Record, 29(2):475–486, 2000.

[23] B. Schlegel, R. Gemulla, and W. Lehner. K-ary Search on Modern Processors. In DaMoN, pages 52–60, 2009.

[24] Shore-MT. Shore-MT Official Website. http://diaswww.epfl.ch/shore-mt/.

[25] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The End of an Architectural Era: (It's Time for a Complete Rewrite). In VLDB, pages 1150–1160, 2007.

[26] M. Stonebraker and A. Weisberg. The VoltDB Main Memory DBMS. IEEE Data Engineering Bulletin, 36(2):21–27, 2013.

[27] P. Tozun, I. Atta, A. Ailamaki, and A. Moshovos. ADDICT: Advanced Instruction Chasing for Transactions. In VLDB, pages 1893–1904, 2014.

[28] P. Tozun, B. Gold, and A. Ailamaki. OLTP in Wonderland: Where Do Cache Misses Come from in Major OLTP Components? In DaMoN, pages 8:1–8:6, 2013.

[29] P. Tozun, I. Pandis, C. Kaynak, D. Jevdjic, and A. Ailamaki. From A to E: Analyzing TPC's OLTP Benchmarks — The Obsolete, the Ubiquitous, the Unexplored. In EDBT, pages 17–28, 2013.

[30] VoltDB. http://www.voltdb.com.