Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Eliminating the Bandwidth Bottleneck of Central Query Dispatching Through TCP Connection Hand-Over
Stefan Klauck1, Max Plauth1, Sven Knebel1, Marius Strobl2, Douglas Santry2, Lars Eggert2
1 Hasso Plattner Institute, University of Potsdam, Germany 2 .
March, 2019 Image: wolfro54 CC BY-NC-ND 2.0
In scale-out database systems, queries must be routed to individual servers.
Motivation
2
Central Dispatcher + Simple clients / dynamic backends - Central dispatcher is potential bottleneck
Direct Communication + Latency - Requires smart clients or static backends
In scale-out database systems, queries must be routed to individual servers.
Motivation
3
Client 1
Client m
DB Backend 1
DB Backend n
… …
DB Backend 1
DB Backend n
Dispatcher
Client 1
Client m
… …
Database
1
2 436
5
7 98 10
5
1 2 3 4
3 4 6
97 8
108 9
1
q1
q2
q3
q4
q5
q1 (100%)q5 (50%)
25%
Scale-Out
10%
15%
25%
30%
20%
Replica 1
1
2 436
7 98 10
5
Replica 2
1
2 436
7 98 10
5
Replica 3
1
2 436
7 98 10
5
Replica 4
1
2 436
7 98 10
5
q2 (100%)q5 (33.3%)
25%
q3 (100%)
25%
q4 (100%)q5 (16.6%)
25%
DispatcherClient 1
Client m
…
Client 1
Client m
…
■ Horizontal Partitioning / Sharded Database
■ Partially Replicated Database System
□ Maximize throughput
by balancing the load evenly
while minimizing memory footprint
Motivation – Use Cases for Central Dispatching
4 Rabl et Jacobsen. Query Centric Partitioning and Allocation for Partially Replicated Database Systems. SIGMOD 2017. Klauck et Schlosser. Workload-Driven Fragment Allocation for Partially Replicated Databases Using Linear Programming. ICDE 2019.
Shard 1
DispatcherClient 1
Client m
…
1234
Shard 2
678
5
>>> import psycopg2
>>> conn = psycopg2.connect("dbname='tpch' host='dispatcher'”)
■ Logical view
Motivation – Central Dispatching from a Network Perspective
5
DB Backend 1
DB Backend n
Dispatcher
Client 1
Client m
… …dispatcher:5432
database1:5432
database2:5432
client1:65140
client2:65144
dispatcher:65228
dispatcher:65231
>>> import psycopg2
>>> conn = psycopg2.connect("dbname='tpch' host='dispatcher'”)
■ Logical view
■ Physical view
Motivation – Central Dispatching from a Network Perspective
6
DB Backend 1
DB Backend n
Dispatcher
Client 1
Client m
… …dispatcher:5432
database1:5432
database2:5432
client1:65140
client2:65144
dispatcher:65228
dispatcher:65231
DB Backend 1
DB Backend n
DispatcherClient 1
Client m
… …dispatcher:5432
database1:5432
database2:5432
client1:65140
client2:65144
dispatcher:65228dispatcher:65231
Switch
■ Whether the dispatcher becomes a bottleneck depends on the workload
□ Number and size of queries/messages
□ Ratio of processed tuples and result set size
Motivation
7
DB Backend 1
DB Backend n
Dispatcher
Client 1
Client m
… …
■ Whether the dispatcher becomes a bottleneck depends on the workload
□ Number and size of queries/messages
□ Ratio of processed tuples and result set size
■ “Transferring a large amount of data out of a database system to a client program is a common task.”
□ Needed for statistical analyses or machine learning in clients
□ Main bottleneck is network bandwidth
Motivation
8
Raasveldt et Mühleisen. Don’t Hold My Data Hostage – A Case For Client Protocol Redesign. VLDB 2017.
DB Backend 1
DB Backend n
Dispatcher
Client 1
Client m
… …
■ Integration of a TCP connection hand-over by means of a reprogrammable network switch into a database
■ Comparison of query-based dispatching approaches in terms of
□ Throughput scaling
□ Processing flexibility
Research Goals
9
■ Traditional architecture with two separate TCP connections: client ßà dispatcher ßà database
1. HAProxy – free and open source TCP/HTTP load balancer 2. Hyrise dispatcher
■ Using a reprogrammable switch to perform TCP connection hand-over 3. Prism: exchange most packets directly between client and backend
Y. Hayakawa et al. Prism: A Proxy Architecture for Datacenter Networks. SoCC 2017.
Dispatcher Implementations
10
https://github.com/hyrise
■ Client query is initially sent/routed to Prism Controller
■ Prism Controller forwards connection to an appropriate backend and reprograms the switch
■ Backend processes query and sends result directly to the client (bypassing the Prism Controller)
■ Backend hands back connection to Prism Controller
Dispatcher Implementations - Prism
11 Client
Prism Switch
ClientDB Backend
Prism InterfaceSwitch Logic
Rewrite Paket Information
Lookup(Src IP, Src TCP Port, Dst IP, Dst Port)
Transform RulesUnmatched Packets ConnectionHand-Off/Hand-Back
Prism Controller
■ 10Gb and 40Gb Ethernet experiments
□ Hyrise with a stored procedure https://github.com/hyrise
□ wrk - HTTP benchmarking tool https://github.com/wg/wrk
□ mSwitch - software switch Honda et al. mSwitch: A Highly-Scalable, Modular Software Switch. SOSR 2015.
Experimental Evaluation
12
Switch
Client 1 DB Backend 1
mSwitchLearning Bridge Mode
wrk 1
Client 2wrk 2
Hyrise 1
DB Backend 2Hyrise 2
Load-BalancerHyrise Dispatcher/
HAProxy
Switch
Client 1 DB Backend 1wrk 1
Client 2wrk 2
Hyrise 1
DB Backend 2Hyrise 2
mSwitchPrism Switch Module
Prism Controller
■ 10 GbE results
□ Using TCP hand-over outperforms traditional approaches for large payloads
Experimental Evaluation with two Clients and Backends
13
1B 32B 1KiB 32KiB 1MiB 32MiB10-410-2
11.25
2.55
1020
Payload
Thro
ughp
ut [G
b/s] Prism
DispatcherHAProxy
ß limited by bandwidth of central dispatcher
ß scales up to bandwidth: min(Σ clients, Σ backends)
■ 10 GbE results
□ Using TCP hand-over outperforms traditional approaches for large payloads
□ Hyrise dispatcher performs best for small payload sizes up to 4kB
Experimental Evaluation with two Clients and Backends
14
1B 32B 1KiB 32KiB 1MiB 32MiB10-410-2
11.25
2.55
1020
Payload
Thro
ughp
ut [G
b/s] Prism
DispatcherHAProxy
ß limited by bandwidth of central dispatcher
ß scales up to bandwidth: min(Σ clients, Σ backends)
ß Throughput for 512 B payload Prism: 50 Mb/s Dispatcher: 63 MB/s HAProxy: 42 MB/s
■ Implement transaction ordering inside the network switch (published @ SOSP 2017)
Other Uses of Software Defined Networking in Databases
15
The Case for Network-Accelerated Query Processing
Alberto Lerner Rana Hussein Philippe Cudre-Mauroux
eXascale Infolab, U. of Fribourg—Switzerland
ABSTRACTThe fastest plans in MPP databases are usually those withthe least amount of data movement across nodes, as datais not processed while in transit. The network switchesthat connect MPP nodes are hard-wired to perform packet-forwarding logic only. However, in a recent paradigm shift,network devices are becoming “programmable.” The quoteshere are cautionary. Switches are not becoming general pur-pose computers (just yet). But now the set of tasks they canperform can be encoded in software.
In this paper we explore this programmability to accel-erate OLAP queries. We determined that we can o✏oadonto the switch some very common and expensive querypatterns. Thus, for the first time, moving data throughnetworking equipment can contribute to query execution.Our preliminary results show that we can improve responsetimes on even the best agreed upon plans by more than 2xusing 25 Gbps networks. We also see the promise of linearperformance improvement with faster speeds. The use ofprogrammable switches can open new possibilities of archi-tecting rack- and datacenter-sized database systems, withimplications across the stack.
1. INTRODUCTIONNetworking is an area in constant evolution. New pro-
tocols keep arising from emerging fields such as virtualiza-tion [20], cloud computing [6], or the Internet-of-Things [27].Many such protocols are implemented in hardware-basedswitches. In the past, support for a new protocol would re-quire a new version of a switching silicon – something verycostly and time consuming. Recently, however, chips suchas Barefoot Tofino [1], Cavium Xpliant [2], and Cisco Quan-tum Flow [3] started to support programming protocols viasoftware.
At the core of this innovation is a packet-processing hard-ware called Match-Action Unit (MAU). A MAU combinesa match engine with an action engine. The match engineholds data in a table format and can match a packet’s fields
This article is published under a Creative Commons Attribution License(http://creativecommons.org/licenses/by/3.0/), which permits distributionand reproduction in any medium as well allowing derivative works, pro-vided that you attribute the original work to the author(s) and CIDR 2019.9th Biennial Conference on Innovative Data Systems Research (CIDR ‘19)January 13-16, 2019 , Asilomar, California, USA.
0 N… 0 N…parser deparseringress
MAUsegressMAUs
trafficmanager
packets
01:02:03:04:05:06 forward(port4)11:22:33:44:55:66 drop()
… …
06:05:04:03:02:01 forward(port1)field: dest MAC addr action
(a)
(b)
Figure 1: (a) A match-action table programmed to forwardor to drop a packet according to its destination MAC ad-dress. (b) Architecture of a programmable switch dataplaneholding that table.
with a row in this table using, for instance, exact match-ing. Other types of matches are also possible. The action
engine executes simple instructions over a packet or tabledata. Examples of such instructions are simple arithmeticor moving data within a packet. The MAU is programmablein the sense that one can specify its table layout, the typeof lookup to perform, and the processing done at a matchevent, as we illustrate in Figure 1(a). We say that a MAUimplements a match-action table (or, simply, a table) ab-straction.
Such a table abstraction is powerful enough to expressvery common computations in networking protocols. Forexample, it can encode many variations of IP lookup [26] orpacket classification [15] – two of the most recurring prob-lems in packet forwarding. To support full protocols, sev-eral MAUs can be combined in a pipelined fashion to forma programmable dataplane. The dataplane is complete witha programmable packet parser/deparser [13] and a tra�cmanager (e.g. a bu↵ered, routing element that moves pack-ets across switch lanes). We depict such a dataplane in Fig-ure 1(b). Some tables may use fields from the very packetrouting decision made by the tra�c manager. Therefore,there are MAUs in both the ingress and the egress sides ofthe routing element. Incidentally, the tra�c manager itselfcan also be a programmable. [25].
From a database perspective, the switch was historicallya passive element, routing packets for networking purposes.From a networking perspective, tuples generated by queryexecution were opaque payload. We argue here that a pro-
Eris: Coordination-Free Consistent TransactionsUsing In-Network Concurrency ControlJialin Li
University of [email protected]
Ellis MichaelUniversity of Washington
Dan R. K. PortsUniversity of [email protected]
ABSTRACTDistributed storage systems aim to provide strong consis-tency and isolation guarantees on an architecture that is parti-tioned across multiple shards for scalability and replicated forfault tolerance. Traditionally, achieving all of these goals hasrequired an expensive combination of atomic commitmentand replication protocols – introducing extensive coordina-tion overhead. Our system, Eris, takes a different approach.It moves a core piece of concurrency control functionality,which we term multi-sequencing, into the datacenter networkitself. This network primitive takes on the responsibility forconsistently ordering transactions, and a new lightweighttransaction protocol ensures atomicity.
The end result is that Eris avoids both replication and trans-action coordination overhead: we show that it can process alarge class of distributed transactions in a single round-trip
from the client to the storage system without any explicit co-
ordination between shards or replicas in the normal case. Itprovides atomicity, consistency, and fault tolerance with lessthan 10% overhead – achieving throughput 3.6–35⇥ higherand latency 72–80% lower than a conventional design onstandard benchmarks.
CCS CONCEPTS• Information systems ! Database transaction process-ing; Distributed database transactions; • Networks ! In-network processing; Data center networks; • Computer sys-tems organization ! Reliability;
Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for componentsof this work owned by others than the author(s) must be honored. Abstractingwith credit is permitted. To copy otherwise, or republish, to post on servers orto redistribute to lists, requires prior specific permission and/or a fee. Requestpermissions from [email protected] ’17, October 28, 2017, Shanghai, China
© 2017 Copyright held by the owner/author(s). Publication rights licensed toAssociation for Computing Machinery.ACM ISBN 978-1-4503-5085-3/17/10. . . $15.00https://doi.org/10.1145/3132747.3132751
KEYWORDSdistributed transactions, in-network concurrency control, net-work multi-sequencing
ACM Reference Format:Jialin Li, Ellis Michael, and Dan R. K. Ports. 2017. Eris:Coordination-Free Consistent Transactions Using In-Network Con-currency Control. In Proceedings of SOSP ’17, Shanghai, China,
October 28, 2017, 17 pages.https://doi.org/10.1145/3132747.3132751
1 INTRODUCTIONDistributed storage systems today face a tension betweentransactional semantics and performance. To meet the de-mands of large-scale applications, these storage systems mustbe partitioned for scalability and replicated for availability.Supporting strong consistency and strict serializability wouldgive the system the same semantics as a single system exe-cuting each transaction in isolation – freeing programmersfrom the need to reason about consistency and concurrency.Unfortunately, doing so is often at odds with the performancerequirements of modern applications, which demand not justhigh scalability but also tight latency bounds. Interactive ap-plications now require contacting hundreds or thousands ofindividual storage services on each request, potentially leav-ing individual transactions with sub-millisecond latency bud-gets [23, 49].
The conventional wisdom is that transaction processingsystems cannot meet these performance requirements due tocoordination costs. A traditional architecture calls for eachtransaction to be carefully orchestrated through a dizzyingarray of coordination protocols – e.g., Paxos for replication,two-phase commit for atomicity, and two-phase locking forisolation – each adding its own overhead. As we show inSection 8, this can increase latency and reduce throughput byan order of magnitude or more.
This paper challenges that conventional wisdom with Eris,1a new system for high-performance distributed transactionprocessing. Eris is optimized for high throughput and lowlatency in the datacenter environment. Eris executes an im-portant class of transactions with no coordination overhead
1Eris takes its name from the ancient Greek goddess of discord, i.e., lack of
coordination.
■ Offload full SQL query segments onto a programmable dataplane (published @ CIDR 2019)
■ Scale-out database systems use central query dispatchers to hide backend complexity, but may be a bandwidth bottleneck
■ We compared dispatching architectures for database systems
□ Traditional dispatcher performs best for small payload sizes
□ Prism’s connection overhead pays off for larger payloads
-> Hybrid approach with on-demand connection hand-over for large results
Summary
16