Eliminating the Bandwidth Bottleneck of Central Query ... · mSwitch - software switch Honda et al. mSwitch: A Highly-Scalable, Modular Software Switch. SOSR 2015. Experimental Evaluation

Eliminating the Bandwidth Bottleneck of Central Query Dispatching Through TCP Connection Hand-Over

Stefan Klauck1, Max Plauth1, Sven Knebel1, Marius Strobl2, Douglas Santry2, Lars Eggert2

1 Hasso Plattner Institute, University of Potsdam, Germany 2 .

March, 2019 Image: wolfro54 CC BY-NC-ND 2.0

In scale-out database systems, queries must be routed to individual servers.

Motivation

2

Central Dispatcher + Simple clients / dynamic backends - Central dispatcher is potential bottleneck

Direct Communication + Latency - Requires smart clients or static backends

In scale-out database systems, queries must be routed to individual servers.

Motivation

3

Client 1

Client m

DB Backend 1

DB Backend n

… …

DB Backend 1

DB Backend n

Dispatcher

Client 1

Client m

… …

Database

1

2 436

5

7 98 10

5

1 2 3 4

3 4 6

97 8

108 9

1

q1

q2

q3

q4

q5

q1 (100%)q5 (50%)

25%

Scale-Out

10%

15%

25%

30%

20%

Replica 1

1

2 436

7 98 10

5

Replica 2

1

2 436

7 98 10

5

Replica 3

1

2 436

7 98 10

5

Replica 4

1

2 436

7 98 10

5

q2 (100%)q5 (33.3%)

25%

q3 (100%)

25%

q4 (100%)q5 (16.6%)

25%

DispatcherClient 1

Client m

…

Client 1

Client m

…

■  Horizontal Partitioning / Sharded Database

■  Partially Replicated Database System

□  Maximize throughput

by balancing the load evenly

while minimizing memory footprint

Motivation – Use Cases for Central Dispatching

4 Rabl et Jacobsen. Query Centric Partitioning and Allocation for Partially Replicated Database Systems. SIGMOD 2017. Klauck et Schlosser. Workload-Driven Fragment Allocation for Partially Replicated Databases Using Linear Programming. ICDE 2019.

Shard 1

DispatcherClient 1

Client m

…

1234

Shard 2

678

5

>>> import psycopg2

>>> conn = psycopg2.connect("dbname='tpch' host='dispatcher'”)

■  Logical view

Motivation – Central Dispatching from a Network Perspective

5

DB Backend 1

DB Backend n

Dispatcher

Client 1

Client m

… …dispatcher:5432

database1:5432

database2:5432

client1:65140

client2:65144

dispatcher:65228

dispatcher:65231

>>> import psycopg2

>>> conn = psycopg2.connect("dbname='tpch' host='dispatcher'”)

■  Logical view

■  Physical view

Motivation – Central Dispatching from a Network Perspective

6

DB Backend 1

DB Backend n

Dispatcher

Client 1

Client m


database1:5432

database2:5432

client1:65140

client2:65144

dispatcher:65228

dispatcher:65231

DB Backend 1

DB Backend n

DispatcherClient 1

Client m


database1:5432

database2:5432

client1:65140

client2:65144

dispatcher:65228dispatcher:65231

Switch

■  Whether the dispatcher becomes a bottleneck depends on the workload

□  Number and size of queries/messages

□  Ratio of processed tuples and result set size

Motivation

7

DB Backend 1

DB Backend n

Dispatcher

Client 1

Client m

… …

■  Whether the dispatcher becomes a bottleneck depends on the workload

□  Number and size of queries/messages

□  Ratio of processed tuples and result set size

■  “Transferring a large amount of data out of a database system to a client program is a common task.”

□  Needed for statistical analyses or machine learning in clients

□  Main bottleneck is network bandwidth

Motivation

8

Raasveldt et Mühleisen. Don’t Hold My Data Hostage – A Case For Client Protocol Redesign. VLDB 2017.

DB Backend 1

DB Backend n

Dispatcher

Client 1

Client m

… …

■  Integration of a TCP connection hand-over by means of a reprogrammable network switch into a database

■  Comparison of query-based dispatching approaches in terms of

□  Throughput scaling

□  Processing flexibility

Research Goals

9

■  Traditional architecture with two separate TCP connections: client ßà dispatcher ßà database

1. HAProxy – free and open source TCP/HTTP load balancer 2. Hyrise dispatcher

■  Using a reprogrammable switch to perform TCP connection hand-over 3. Prism: exchange most packets directly between client and backend

Y. Hayakawa et al. Prism: A Proxy Architecture for Datacenter Networks. SoCC 2017.

Dispatcher Implementations

10

https://github.com/hyrise

■  Client query is initially sent/routed to Prism Controller

■  Prism Controller forwards connection to an appropriate backend and reprograms the switch

■  Backend processes query and sends result directly to the client (bypassing the Prism Controller)

■  Backend hands back connection to Prism Controller

Dispatcher Implementations - Prism

11 Client

Prism Switch

ClientDB Backend

Prism InterfaceSwitch Logic

Rewrite Paket Information

Lookup(Src IP, Src TCP Port, Dst IP, Dst Port)

Transform RulesUnmatched Packets ConnectionHand-Off/Hand-Back

Prism Controller

■  10Gb and 40Gb Ethernet experiments

□  Hyrise with a stored procedure https://github.com/hyrise

□  wrk - HTTP benchmarking tool https://github.com/wg/wrk

□  mSwitch - software switch Honda et al. mSwitch: A Highly-Scalable, Modular Software Switch. SOSR 2015.

Experimental Evaluation

12

Switch

Client 1 DB Backend 1

mSwitchLearning Bridge Mode

wrk 1

Client 2wrk 2

Hyrise 1

DB Backend 2Hyrise 2

Load-BalancerHyrise Dispatcher/

HAProxy

Switch

Client 1 DB Backend 1wrk 1

Client 2wrk 2

Hyrise 1

DB Backend 2Hyrise 2

mSwitchPrism Switch Module

Prism Controller

■  10 GbE results

□  Using TCP hand-over outperforms traditional approaches for large payloads

Experimental Evaluation with two Clients and Backends

13

1B 32B 1KiB 32KiB 1MiB 32MiB10-410-2

11.25

2.55

1020

Payload

Thro

ughp

ut [G

b/s] Prism

DispatcherHAProxy

ß limited by bandwidth of central dispatcher

ß scales up to bandwidth: min(Σ clients, Σ backends)

■  10 GbE results

□  Using TCP hand-over outperforms traditional approaches for large payloads

□  Hyrise dispatcher performs best for small payload sizes up to 4kB

Experimental Evaluation with two Clients and Backends

14

1B 32B 1KiB 32KiB 1MiB 32MiB10-410-2

11.25

2.55

1020

Payload

Thro

ughp

ut [G

b/s] Prism

DispatcherHAProxy

ß limited by bandwidth of central dispatcher

ß scales up to bandwidth: min(Σ clients, Σ backends)

ß Throughput for 512 B payload Prism: 50 Mb/s Dispatcher: 63 MB/s HAProxy: 42 MB/s

■  Implement transaction ordering inside the network switch (published @ SOSP 2017)

Other Uses of Software Defined Networking in Databases

15

The Case for Network-Accelerated Query Processing

Alberto Lerner Rana Hussein Philippe Cudre-Mauroux

eXascale Infolab, U. of Fribourg—Switzerland

ABSTRACTThe fastest plans in MPP databases are usually those withthe least amount of data movement across nodes, as datais not processed while in transit. The network switchesthat connect MPP nodes are hard-wired to perform packet-forwarding logic only. However, in a recent paradigm shift,network devices are becoming “programmable.” The quoteshere are cautionary. Switches are not becoming general pur-pose computers (just yet). But now the set of tasks they canperform can be encoded in software.

In this paper we explore this programmability to accel-erate OLAP queries. We determined that we can o✏oadonto the switch some very common and expensive querypatterns. Thus, for the first time, moving data throughnetworking equipment can contribute to query execution.Our preliminary results show that we can improve responsetimes on even the best agreed upon plans by more than 2xusing 25 Gbps networks. We also see the promise of linearperformance improvement with faster speeds. The use ofprogrammable switches can open new possibilities of archi-tecting rack- and datacenter-sized database systems, withimplications across the stack.

1. INTRODUCTIONNetworking is an area in constant evolution. New pro-

tocols keep arising from emerging fields such as virtualiza-tion [20], cloud computing [6], or the Internet-of-Things [27].Many such protocols are implemented in hardware-basedswitches. In the past, support for a new protocol would re-quire a new version of a switching silicon – something verycostly and time consuming. Recently, however, chips suchas Barefoot Tofino [1], Cavium Xpliant [2], and Cisco Quan-tum Flow [3] started to support programming protocols viasoftware.

At the core of this innovation is a packet-processing hard-ware called Match-Action Unit (MAU). A MAU combinesa match engine with an action engine. The match engineholds data in a table format and can match a packet’s fields

This article is published under a Creative Commons Attribution License(http://creativecommons.org/licenses/by/3.0/), which permits distributionand reproduction in any medium as well allowing derivative works, pro-vided that you attribute the original work to the author(s) and CIDR 2019.9th Biennial Conference on Innovative Data Systems Research (CIDR ‘19)January 13-16, 2019 , Asilomar, California, USA.

0 N… 0 N…parser deparseringress

MAUsegressMAUs

trafficmanager

packets

01:02:03:04:05:06 forward(port4)11:22:33:44:55:66 drop()

… …

06:05:04:03:02:01 forward(port1)field: dest MAC addr action

(a)

(b)

Figure 1: (a) A match-action table programmed to forwardor to drop a packet according to its destination MAC ad-dress. (b) Architecture of a programmable switch dataplaneholding that table.

with a row in this table using, for instance, exact match-ing. Other types of matches are also possible. The action

engine executes simple instructions over a packet or tabledata. Examples of such instructions are simple arithmeticor moving data within a packet. The MAU is programmablein the sense that one can specify its table layout, the typeof lookup to perform, and the processing done at a matchevent, as we illustrate in Figure 1(a). We say that a MAUimplements a match-action table (or, simply, a table) ab-straction.

Such a table abstraction is powerful enough to expressvery common computations in networking protocols. Forexample, it can encode many variations of IP lookup [26] orpacket classification [15] – two of the most recurring prob-lems in packet forwarding. To support full protocols, sev-eral MAUs can be combined in a pipelined fashion to forma programmable dataplane. The dataplane is complete witha programmable packet parser/deparser [13] and a tra�cmanager (e.g. a bu↵ered, routing element that moves pack-ets across switch lanes). We depict such a dataplane in Fig-ure 1(b). Some tables may use fields from the very packetrouting decision made by the tra�c manager. Therefore,there are MAUs in both the ingress and the egress sides ofthe routing element. Incidentally, the tra�c manager itselfcan also be a programmable. [25].

From a database perspective, the switch was historicallya passive element, routing packets for networking purposes.From a networking perspective, tuples generated by queryexecution were opaque payload. We argue here that a pro-

Eris: Coordination-Free Consistent TransactionsUsing In-Network Concurrency ControlJialin Li

University of [email protected]

Ellis MichaelUniversity of Washington

[email protected]

Dan R. K. PortsUniversity of [email protected]

ABSTRACTDistributed storage systems aim to provide strong consis-tency and isolation guarantees on an architecture that is parti-tioned across multiple shards for scalability and replicated forfault tolerance. Traditionally, achieving all of these goals hasrequired an expensive combination of atomic commitmentand replication protocols – introducing extensive coordina-tion overhead. Our system, Eris, takes a different approach.It moves a core piece of concurrency control functionality,which we term multi-sequencing, into the datacenter networkitself. This network primitive takes on the responsibility forconsistently ordering transactions, and a new lightweighttransaction protocol ensures atomicity.

The end result is that Eris avoids both replication and trans-action coordination overhead: we show that it can process alarge class of distributed transactions in a single round-trip

from the client to the storage system without any explicit co-

ordination between shards or replicas in the normal case. Itprovides atomicity, consistency, and fault tolerance with lessthan 10% overhead – achieving throughput 3.6–35⇥ higherand latency 72–80% lower than a conventional design onstandard benchmarks.

CCS CONCEPTS• Information systems ! Database transaction process-ing; Distributed database transactions; • Networks ! In-network processing; Data center networks; • Computer sys-tems organization ! Reliability;

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for componentsof this work owned by others than the author(s) must be honored. Abstractingwith credit is permitted. To copy otherwise, or republish, to post on servers orto redistribute to lists, requires prior specific permission and/or a fee. Requestpermissions from [email protected] ’17, October 28, 2017, Shanghai, China

© 2017 Copyright held by the owner/author(s). Publication rights licensed toAssociation for Computing Machinery.ACM ISBN 978-1-4503-5085-3/17/10. . . $15.00https://doi.org/10.1145/3132747.3132751

KEYWORDSdistributed transactions, in-network concurrency control, net-work multi-sequencing

ACM Reference Format:Jialin Li, Ellis Michael, and Dan R. K. Ports. 2017. Eris:Coordination-Free Consistent Transactions Using In-Network Con-currency Control. In Proceedings of SOSP ’17, Shanghai, China,

October 28, 2017, 17 pages.https://doi.org/10.1145/3132747.3132751

1 INTRODUCTIONDistributed storage systems today face a tension betweentransactional semantics and performance. To meet the de-mands of large-scale applications, these storage systems mustbe partitioned for scalability and replicated for availability.Supporting strong consistency and strict serializability wouldgive the system the same semantics as a single system exe-cuting each transaction in isolation – freeing programmersfrom the need to reason about consistency and concurrency.Unfortunately, doing so is often at odds with the performancerequirements of modern applications, which demand not justhigh scalability but also tight latency bounds. Interactive ap-plications now require contacting hundreds or thousands ofindividual storage services on each request, potentially leav-ing individual transactions with sub-millisecond latency bud-gets [23, 49].

The conventional wisdom is that transaction processingsystems cannot meet these performance requirements due tocoordination costs. A traditional architecture calls for eachtransaction to be carefully orchestrated through a dizzyingarray of coordination protocols – e.g., Paxos for replication,two-phase commit for atomicity, and two-phase locking forisolation – each adding its own overhead. As we show inSection 8, this can increase latency and reduce throughput byan order of magnitude or more.

This paper challenges that conventional wisdom with Eris,1a new system for high-performance distributed transactionprocessing. Eris is optimized for high throughput and lowlatency in the datacenter environment. Eris executes an im-portant class of transactions with no coordination overhead

1Eris takes its name from the ancient Greek goddess of discord, i.e., lack of

coordination.

■  Offload full SQL query segments onto a programmable dataplane (published @ CIDR 2019)

■  Scale-out database systems use central query dispatchers to hide backend complexity, but may be a bandwidth bottleneck

■  We compared dispatching architectures for database systems

□  Traditional dispatcher performs best for small payload sizes

□  Prism’s connection overhead pays off for larger payloads

-> Hybrid approach with on-demand connection hand-over for large results

Summary

16

Thanks Stefan Klauck

[email protected]

http://epic.hpi.de

Documents

Eliminating the Bandwidth Bottleneck of Central Query ... · mSwitch - software switch Honda et al. mSwitch: A Highly-Scalable, Modular Software Switch. SOSR 2015. Experimental Evaluation