
Improving Asynchronous Invocation Performance in Client-server Systems

Shungeng Zhang†, Qingyang Wang†, Yasuhiko Kanemasa‡
†Computer Science and Engineering, Louisiana State University
‡Software Laboratory, FUJITSU LABORATORIES LTD.

Abstract—In this paper, we conduct an experimental study of asynchronous invocation on the performance of client-server systems. Through extensive measurements of both realistic macro- and micro-benchmarks, we show that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version for two non-trivial reasons. First, the traditional wisdom of one-event-one-handler event processing flow can create large amounts of intermediate context switches that significantly degrade the performance of an asynchronous server. Second, some runtime workload (e.g., response size) and network conditions (e.g., network latency) may cause significant negative performance impact on asynchronous event-driven servers, but not on thread-based ones. We provide a hybrid solution that takes advantage of different asynchronous architectures to adapt to varying workload and network conditions. Our hybrid solution searches for the most efficient execution path for each client request based on runtime request profiling and type checking. Our experimental results show that the hybrid solution outperforms all the other types of servers by up to 19%~90% in throughput, depending on the specific workload and network conditions.

Index Terms—Asynchronous, event-driven, threads, client-server applications, performance

I. INTRODUCTION

Asynchronous event-driven architecture for high performance internet servers has been studied extensively before [33], [37], [42], [32]. Many people advocate the asynchronous event-driven architecture as a better choice than the thread-based RPC version to handle highly concurrent workloads because of the reduced multi-threading overhead [42]. Though conceptually simple, taking advantage of the asynchronous event-driven architecture to construct high performance internet servers is a non-trivial task. For example, asynchronous servers are well-known to be difficult to program and debug due to their obscured control flow [16] compared to the thread-based version.

In this paper, we show that building high performance internet servers using the asynchronous event-driven architecture requires careful design of the event processing flow and the ability to adapt to runtime-varying workload and network conditions. Concretely, we conduct extensive benchmark experiments to study some non-trivial design deficiencies of asynchronous event-driven server architectures that lead to inferior performance compared to the thread-based version when facing high concurrency workload. For example, the traditional wisdom of one-event-one-handler event processing flow may generate a large amount of unnecessary intermediate

[Figure 1: throughput (req/s) and response time (s) vs. workload (2K-14K users) for SYS_tomcatV7 and SYS_tomcatV8; the asynchronous V8 shows a 28% throughput drop.]

Fig. 1: Upgrading Tomcat from a thread-based version (V7) to an asynchronous version (V8) in a 3-tier system leads to significant performance degradation.

events and context switches that significantly degrade the performance of an asynchronous server. We also observed that some runtime workload and network conditions may cause frequent unnecessary I/O system calls (due to the non-blocking nature of asynchronous function calls) for the asynchronous event-driven servers, but not for the thread-based ones.

The first contribution is an experimental illustration that simply upgrading a thread-based component server to its asynchronous version in an n-tier web system can lead to significant performance degradation of the whole system at high utilization levels. For instance, we observed that the maximum achievable throughput of a 3-tier system decreases by 23% after we upgrade the Tomcat application server from the traditional thread-based version (Version 7) to the latest asynchronous version (Version 8) (see Figure 1) in a standard n-tier application benchmark (RUBBoS [11]). Our analysis reveals that the unexpected performance degradation of the asynchronous Tomcat results from its poor design of event processing flow, which causes significantly higher CPU context switch overhead than the thread-based version when the server is approaching saturation. Our further study shows that such poor design of event processing flow also exists in other popular asynchronous servers/middleware such as Jetty [3], GlassFish [9], and the MongoDB Java Asynchronous Driver [6].

The second contribution is a detailed analysis of various workload and network conditions that impact the performance of asynchronous invocation. Concretely, we have observed that a moderate-sized response message (e.g., 100KB) for an asynchronous server can cause a non-trivial write-spin problem, where the server makes frequent unnecessary I/O

system calls resulting from the default small TCP send buffer size and the TCP wait-ACK mechanism, wasting about 12%~24% of the server CPU resource. We also observed that some network conditions such as latency can exacerbate the write-spin problem of asynchronous event-driven servers even further.

The third contribution is a hybrid solution that takes advantage of different asynchronous event-driven architectures to adapt to various runtime workload and network conditions. We studied a popular asynchronous network I/O library named "Netty [7]", which can mitigate the write-spin problem through some write operation optimizations. However, such optimization techniques in Netty also bring non-trivial overhead when the write-spin problem does not occur during the asynchronous invocation period. Our hybrid solution extends Netty by monitoring the occurrence of write-spin and oscillating between alternative asynchronous invocation mechanisms to avoid the unnecessary optimization overhead.

In general, given the strong economic interest in achieving high resource efficiency and high quality of service simultaneously in cloud data centers, our results suggest that asynchronous invocation has a potentially significant performance advantage over RPC synchronous invocation, but it still needs careful tuning according to various non-trivial runtime workload and network conditions. Our work also points to significant future research opportunities, since the asynchronous architecture has been widely adopted by many distributed systems (e.g., pub/sub systems [23], AJAX [26], and ZooKeeper [31]).

The rest of the paper is organized as follows. Section II shows a case study of performance degradation after a software upgrade from a thread-based version to an asynchronous version. Section III describes the context switch problem of servers with asynchronous architecture. Section IV explains the write-spin problem of asynchronous architecture when it handles requests with large size responses. Section V evaluates two practical solutions. Section VI summarizes related work and Section VII concludes the paper.

II. BACKGROUND AND MOTIVATION

A. RPC vs. Asynchronous Network I/O

Modern internet servers usually adopt a few connectors to communicate with other component servers or end users. The main activities of a connector include managing upstream and downstream network connections, reading/writing data through the established connections, and parsing and routing the incoming requests to the application (business logic) layer. Though similar in functionality, synchronous and asynchronous connectors have very different mechanisms to interact with the application layer logic.

Synchronous connectors are mainly adopted by RPC thread-based servers. Once accepting a new connection, the main thread dispatches the connection to a dedicated worker thread until the close of the connection. In this case, each connection consumes one worker thread, and the operating system transparently switches among worker threads for concurrent request processing. Although relatively easy to program due to the user-perceived sequential execution flow, synchronous connectors bring the well-known multi-threading overhead (e.g., context switches, scheduling, and lock contention).
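As a concrete illustration, a minimal thread-per-connection connector could look like the following sketch (class names, port, and the one-request-per-connection simplification are ours, not Tomcat's actual connector code):

    // Minimal sketch of a synchronous (thread-per-connection) connector.
    // Illustrative only; a real connector keeps the worker bound to the
    // connection until the connection closes.
    import java.io.*;
    import java.net.*;
    import java.util.concurrent.*;

    public class SyncConnector {
        public static void main(String[] args) throws IOException {
            ExecutorService workers = Executors.newFixedThreadPool(200); // worker thread pool
            try (ServerSocket server = new ServerSocket(8080)) {
                while (true) {
                    Socket conn = server.accept();        // main thread accepts new connections
                    workers.submit(() -> handle(conn));   // dedicate a worker to this connection
                }
            }
        }

        static void handle(Socket conn) {
            try (Socket c = conn;
                 BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
                 OutputStream out = c.getOutputStream()) {
                in.readLine();                            // blocking read of the request line
                byte[] body = "hello".getBytes();
                out.write(("HTTP/1.1 200 OK\r\nContent-Length: " + body.length + "\r\n\r\n").getBytes());
                out.write(body);                          // one blocking write sends the whole response
            } catch (IOException ignored) { }
        }
    }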

Asynchronous connectors accept new connections and manage all established connections through an event-driven mechanism using only one or a few threads. Given a pool of established connections in a server, an asynchronous connector handles requests received from these connections by repeatedly looping over two phases. The first phase (event monitoring phase) determines which connections have pending events of interest; these events typically indicate that a particular connection (i.e., socket) is readable or writable. The asynchronous connector pulls the connections with pending events by taking advantage of an event notification mechanism such as select, poll, or epoll supported by the underlying operating system. The second phase (event handling phase) iterates over each of the connections that have pending events. Based on the context information of each event, the connector dispatches the event to an appropriate event handler performing the actual business logic computation. More details can be found in previous asynchronous server research [42], [32], [38], [37], [28].
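The two-phase loop can be sketched against the java.nio Selector API roughly as follows (a simplified illustration; handler logic is omitted):

    // Minimal sketch of the two-phase asynchronous connector loop (java.nio).
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.*;
    import java.util.Iterator;

    public class AsyncConnector {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();                              // phase 1: event monitoring (select/epoll underneath)
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {                          // phase 2: event handling
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel conn = server.accept();
                        conn.configureBlocking(false);
                        conn.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        handleRead((SocketChannel) key.channel()); // dispatch to an event handler
                    }
                }
            }
        }

        static void handleRead(SocketChannel conn) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            if (conn.read(buf) < 0) conn.close();               // parse the request and produce a response here
        }
    }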

In practice, there are two general designs of asynchronous servers using the asynchronous connectors. The first one is a single-threaded asynchronous server, which only uses one thread to loop over the aforementioned two phases, for example Node.js [8] and Lighttpd [5]. Such a design is especially beneficial for in-memory workloads because context switches will be minimal while the single thread will not be blocked by disk I/O activities [35]. Multiple single-threaded servers (also called the N-copy approach [43], [28]) can be launched together to fully utilize multiple processors. The second design is to use a worker thread pool in the second phase to concurrently process connections that have pending events. Such a design is supposed to efficiently utilize CPU resources in case of transient disk I/O blocking or a multi-core environment [38]. Several variants of the second design have been proposed before, mostly known as the staged design adopted by SEDA [42] and WatPipe [38]. Instead of having only one worker thread pool, the staged design decomposes the request processing into a pipeline of stages separated by event queues, each of which has its own worker thread pool, with the aim of modular design and fine-grained management of worker threads.

In general, asynchronous event-driven servers are believed to be able to achieve higher throughput than the thread-based version because of reduced multi-threading overhead, especially when the server is handling high concurrency, CPU intensive workload. However, our experimental results in the following section will show the opposite.

B. Performance Degradation after Tomcat Upgrade

System software upgrade is a common practice for internet services due to the fast evolution of software components. In this section, we show that simply upgrading a thread-based component server to its asynchronous version in an n-tier system may cause unexpected performance degradation at high resource utilization. We show one such case using RUBBoS [11], a representative web-facing n-tier system benchmark modeled

[Figure 2: throughput (req/s) vs. workload concurrency (1-3200 connections) for TomcatSync and TomcatAsync; panels (a) 0.1KB, (b) 10KB, and (c) 100KB response size, each marking the crossover point.]
Fig. 2: Throughput comparison between TomcatSync and TomcatAsync under different workload concurrencies and response sizes. As the response size increases from 0.1KB in (a) to 100KB in (c), TomcatSync outperforms TomcatAsync over a wider concurrency range, indicating the performance degradation of TomcatAsync with large response sizes.

after the popular news website Slashdot [15]. Our experiments adopt a typical 3-tier configuration with 1 Apache web server, 1 Tomcat application server, and 1 MySQL database server (details in Appendix A). At the beginning we use Tomcat 7 (noted as TomcatSync), which uses a thread-based synchronous connector for inter-tier communication. We then upgrade the Tomcat server to Version 8 (the latest version at the time, noted as TomcatAsync), which by default uses an asynchronous connector, with the expectation of system performance improvement after the Tomcat upgrade.

Unfortunately, we observed a surprising system performance degradation after the Tomcat server upgrade, as shown in Figure 1. We call the system with TomcatSync SYS_tomcatV7 and the other one with TomcatAsync SYS_tomcatV8. This figure shows that SYS_tomcatV7 saturates at workload 11000 while SYS_tomcatV8 saturates at workload 9000. At workload 11000, SYS_tomcatV7 outperforms SYS_tomcatV8 by 28% in throughput, and its average response time is one order of magnitude lower (226ms vs. 2820ms). Such a result is counter-intuitive, since we upgrade Tomcat from an older thread-based version to a newer asynchronous one. We note that in both cases the Tomcat server CPU is the bottleneck resource in the system; all the hardware resources (e.g., CPU and memory) of the other component servers are far from saturation (< 60%).

We use Collectl [2] to collect system level metrics (e.g., CPU and context switches). Another interesting phenomenon we observed is that TomcatAsync encounters a significantly higher number of context switches than TomcatSync when the system is at the same workload. For example, at workload 10000, TomcatAsync encounters 12950 context switches per second, while TomcatSync encounters only 5930, less than half of the former. It is reasonable to suggest that the high context switches in TomcatAsync cause high CPU overhead, leading to inferior throughput compared to TomcatSync. However, traditional wisdom tells us that a server with asynchronous architecture should have fewer context switches than a thread-based server. So why did we observe the opposite here? We will discuss the cause in the next section.

[Figure 3: the reactor thread dispatches a read event to worker A (reading/processing the request); worker A generates a write event; the reactor dispatches the write event to worker B (sending the response); worker B returns control back to the reactor.]
Fig. 3: Illustration of the event processing flow when TomcatAsync processes one request. In total, there are four context switches between the reactor thread and worker threads.

III. INEFFICIENT EVENT PROCESSING FLOW IN ASYNCHRONOUS SERVERS

In this section we explain why the performance of the 3-tier benchmark system degrades after we upgrade Tomcat from the thread-based version TomcatSync to the asynchronous version TomcatAsync. To simplify and quantify our analysis, we design micro-benchmarks to test the performance of both versions of Tomcat.

We use JMeter [1] to generate HTTP requests to access the standalone Tomcat directly. These HTTP requests are categorized into three types: small, medium, and large, with which the Tomcat server (either TomcatSync or TomcatAsync) first conducts some simple computation before responding with 0.1KB, 10KB, and 100KB of in-memory data, respectively. We choose these three sizes because they are representative response sizes in our RUBBoS benchmark application. JMeter uses one thread to simulate each end-user. We set the think time between the consecutive requests sent from the same thread to be zero; thus we can precisely control the concurrency of the workload to the target Tomcat server by specifying the number of threads in JMeter.
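The closed-loop behavior described above (concurrency equal to the number of client threads, zero think time) can be illustrated with a simple load generator sketch; this is only an illustration of the setup, not JMeter itself, and the host name is a placeholder:

    // Closed-loop load generator sketch: N threads, zero think time, so the offered
    // concurrency at the server is exactly N.
    import java.io.*;
    import java.net.*;

    public class ClosedLoopClient {
        public static void main(String[] args) {
            int concurrency = Integer.parseInt(args[0]);   // e.g., 1 .. 3200
            for (int i = 0; i < concurrency; i++) {
                new Thread(() -> {
                    while (true) {
                        try (Socket s = new Socket("server-host", 8080)) {   // placeholder host
                            s.getOutputStream().write("GET /test HTTP/1.0\r\n\r\n".getBytes());
                            InputStream in = s.getInputStream();
                            while (in.read() != -1) { /* drain the response */ }
                        } catch (IOException ignored) { }
                        // zero think time: immediately issue the next request
                    }
                }).start();
            }
        }
    }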

We compare the server throughput between TomcatSync and TomcatAsync under different workload concurrencies and response sizes, as shown in Figure 2. The three sub-figures show that as the workload concurrency increases from 1 to 3200, TomcatAsync achieves lower throughput than TomcatSync before a certain workload concurrency. For example, TomcatAsync performs worse than TomcatSync

TABLE I: TomcatAsync has more context switches than TomcatSync under workload concurrency 8 (unit: x1000/sec).

Response size   TomcatAsync   TomcatSync
0.1KB           40            16
10KB            25            7
100KB           28            2

before the workload concurrency 64 when the response size is 10KB, and the crossover workload concurrency is even higher (1600) when the response size increases to 100KB. Returning to our previous 3-tier RUBBoS experiments, our measurements show that under the RUBBoS workload conditions, the average response size of Tomcat per request is about 20KB and the workload concurrency for Tomcat is about 35 when the system saturates. So, based on our micro-benchmark results in Figure 2, it is not surprising that TomcatAsync performs worse than TomcatSync. Since Tomcat is the bottleneck server of the 3-tier system, the performance degradation of Tomcat also leads to the performance degradation of the whole system (see Figure 1). The remaining question is why TomcatAsync performs worse than TomcatSync before a certain workload concurrency.

We found that the performance degradation of TomcatAsync results from its inefficient event processing flow, which generates significant amounts of intermediate context switches, causing non-trivial CPU overhead. Table I compares the context switches between TomcatAsync and TomcatSync at workload concurrency 8. This table shows results consistent with what we observed in the previous RUBBoS experiments: the asynchronous TomcatAsync encounters significantly higher context switches than the thread-based TomcatSync given the same workload concurrency and server response size. Our further analysis reveals that the high context switches of TomcatAsync result from its poor design of event processing flow. Concretely, TomcatAsync adopts the second design of asynchronous servers (see Section II-A), which uses a reactor thread for event monitoring and a worker thread pool for event handling. Figure 3 illustrates the event processing flow in TomcatAsync.

Thus, to handle one client request, there are in total 4 context switches among the user-space threads in TomcatAsync (see steps 1-4 in Figure 3). Such an inefficient event processing flow design also exists in many popular asynchronous servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3]. On the other hand, in TomcatSync each client request is handled by a dedicated worker thread, from the initial reading of the request to preparing the response to sending the response out. No context switch occurs during the processing of the request unless the worker thread is interrupted or swapped out by the operating system.

To better quantify the impact of context switches on the performance of different server architectures, we simplify the implementation of TomcatAsync and TomcatSync by removing all the unrelated modules (e.g., servlet life cycle management, cache management, and logging) and only keeping the essential code related to request processing, which we

TABLE II: Context switches among user-space threads when the server processes one client request.

Server type         Context switches   Note
sTomcat-Async       4                  Read and write events are handled by different worker threads (Figure 3).
sTomcat-Async-Fix   2                  Read and write events are handled by the same worker thread.
sTomcat-Sync        0                  Dedicated worker thread for each request; context switches occur only due to interrupts or CPU time slice expiration.
SingleT-Async       0                  No context switches; one thread handles both event monitoring and processing.

refer to as sTomcat-Async (simplified TomcatAsync) and sTomcat-Sync (simplified TomcatSync). As a reference, we implement two alternative designs of asynchronous servers aiming to reduce the frequency of context switches. The first alternative design, which we call sTomcat-Async-Fix, merges the processing of the read event and write event from the same request by using the same worker thread. In this case, once a worker thread finishes preparing the response, it continues to send the response out (steps 2 and 3 in Figure 3 no longer exist); thus processing one client request only requires two context switches, from the reactor thread to a worker thread and from the same worker thread back to the reactor thread. The second alternative design is the traditional single-threaded asynchronous server, where the single thread is responsible for both event monitoring and processing. The single-threaded implementation, which we refer to as SingleT-Async, is supposed to have the fewest context switches. Table II summarizes the context switches for each server type when it processes one client request.1 Interested readers can check out our server implementations on GitHub [13] for further reference.

We compare the throughput and context switches among the four types of servers under increasing workload concurrencies and server response sizes, as shown in Figure 4. Comparing Figure 4(a) and 4(d), the maximum achievable throughput of each server type is negatively correlated with its context switch frequency during the runtime experiments. For example, at workload concurrency 16, sTomcat-Async-Fix outperforms sTomcat-Async by 22% in throughput, while its context switch frequency is 34% lower. In our experiments, the CPU demand for each request is positively correlated with the response size; a small response size means a small CPU computation demand, thus the portion of CPU cycles wasted in context switches becomes large. As a result, the gap in context switches between sTomcat-Async-Fix and sTomcat-Async reflects their throughput difference. Such a hypothesis is further validated by the performance of SingleT-Async and sTomcat-Sync, which outperform sTomcat-Async by 91% and 57% in throughput, respectively (see Figure 4(a)). Such a performance difference is also because of fewer context switches, as shown in Figure 4(d). For example, the context switch frequency of SingleT-Async is a few hundred per second, three orders of magnitude less than that of sTomcat-Async.

1 To simplify analysis and reasoning, we do not count the context switches caused by interrupts or by swapping by the operating system.

[Figure 4: six panels plotting throughput (req/s) in (a)-(c) and context switches (per second) in (d)-(f) vs. workload concurrency (1-3200 connections) for SingleT-Async, sTomcat-Sync, sTomcat-Async-Fix, and sTomcat-Async, with response sizes 0.1KB, 10KB, and 100KB.]
Fig. 4: Throughput and context switch comparison among different server architectures as the server response size increases from 0.1KB to 100KB. (a) and (d) show that the maximum achievable throughput of each server type is negatively correlated with its context switch frequency when the server response size is small (0.1KB). However, as the response size increases to 100KB, (c) shows that sTomcat-Sync outperforms the other asynchronous servers before workload concurrency 400, indicating that factors other than context switches cause overhead in asynchronous servers.

We note that as the server response size becomes larger, the portion of CPU overhead caused by context switches becomes smaller, since more CPU cycles are consumed by processing requests and sending responses. This is the case shown in Figure 4(b) and 4(c), where the response sizes are 10KB and 100KB, respectively. The throughput difference becomes narrower among the four server architectures, indicating less performance impact from context switches.

In fact, an interesting phenomenon is observed as the response size increases to 100KB. Figure 4(c) shows that SingleT-Async performs worse than the thread-based sTomcat-Sync before workload concurrency 400, even though SingleT-Async has far fewer context switches than sTomcat-Sync, as shown in Figure 4(f). Such an observation suggests that there are other factors causing overhead in the asynchronous SingleT-Async but not in the thread-based sTomcat-Sync when the server response size is large, which we discuss in the next section.

IV. WRITE-SPIN PROBLEM OF ASYNCHRONOUS INVOCATION

In this section we study the performance degradation problem of an asynchronous server sending a large size response. We use fine-grained profiling tools such as Collectl [2] and JProfiler [4] to analyze the detailed CPU usage and some key system calls invoked by servers with different architectures. We found that the default small TCP send buffer size and the TCP wait-ACK mechanism lead to a severe write-spin problem when sending a relatively large size response, which causes significant CPU overhead for asynchronous servers. We also explored several network-related factors that can exacerbate the negative impact of the write-spin problem, which further degrades the performance of an asynchronous server.

A. Profiling Results

Recall that Figure 4(a) shows that when the response size is small (i.e., 0.1KB), the throughput of the asynchronous SingleT-Async is 20% higher than that of the thread-based sTomcat-Sync at workload concurrency 8. However, as the response size increases to 100KB, the throughput of SingleT-Async is surprisingly 31% lower than that of sTomcat-Sync under the same workload concurrency 8 (see Figure 4(c)). Since the only change is the response size, it is natural to speculate that a large response size brings significant overhead for SingleT-Async but not for sTomcat-Sync.

To investigate the performance degradation of SingleT-Async when the response size is large, we first use Collectl [2] to analyze the detailed CPU usage of the server with different server response sizes, as shown in Table III. The workload concurrency for both SingleT-Async and sTomcat-Sync is 100, and the CPU is 100% utilized under this workload concurrency. As the response size for both server architectures increases from 0.1KB to 100KB, the table shows that the user-space CPU utilization of sTomcat-Sync increases 25% (from 55% to 80%), versus 34% (from 58% to 92%) for SingleT-Async. Such a comparison suggests that increasing the response size has more impact on the asynchronous SingleT-Async than on the thread-based sTomcat-Sync in user-space CPU utilization.

We further use JProfiler [4] to profile the SingleT-Async case when the response size increases from 0.1KB to 100KB

TABLE III: SingleT-Async consumes more user-space CPU compared to sTomcat-Sync. The workload concurrency is kept at 100.

Server Type            sTomcat-Sync        SingleT-Async
Response Size          0.1KB     100KB     0.1KB     100KB
Throughput [req/sec]   35000     590       42800     520
User total [%]         55        80        58        92
System total [%]       45        20        42        8

TABLE IV: The write-spin problem occurs when the response size is 100KB. This table shows the measurement of the total number of socket.write() calls in SingleT-Async with different response sizes during a one-minute experiment.

Resp. size   # req     # socket.write()   write() per req
0.1KB        238530    238530             1
10KB         9400      9400               1
100KB        2971      303795             102

and see what has changed at the application level. We found that the frequency of the socket.write() system call is especially high in the 100KB case, as shown in Table IV. We note that socket.write() is called when a server sends a response back to the corresponding client. In the case of a thread-based server like sTomcat-Sync, socket.write() is called only once for each client request. While such one-write-per-request behavior also holds for the 0.1KB and 10KB cases in SingleT-Async, it calls socket.write() on average 102 times per request in the 100KB case. System calls in general are expensive due to the related kernel crossing overhead [20], [39]; thus the high frequency of socket.write() in the 100KB case helps explain the high user-space CPU overhead in SingleT-Async shown in Table III.

Our further analysis shows that the multiple-socket-write problem of SingleT-Async is due to the small TCP send buffer size (16KB by default) for each TCP connection and the TCP wait-ACK mechanism. When a processing thread tries to copy 100KB of data from user space to the kernel-space TCP send buffer through the system call socket.write(), the first socket.write() can only copy at most 16KB of data into the send buffer, which is organized as a byte buffer ring. A TCP sliding window is set by the kernel to decide how much data can actually be sent to the client; the sliding window can move forward and free up buffer space for new data to be copied in only if the server receives the ACKs of the previously sent-out packets. Since socket.write() is a non-blocking system call in SingleT-Async, every time it returns how many bytes were written to the TCP send buffer; the system call returns zero if the TCP send buffer is full, leading to the write-spin problem. The whole process is illustrated in Figure 5. On the other hand, when a worker thread in the synchronous sTomcat-Sync tries to copy 100KB of data from user space to the kernel-space TCP send buffer, only one blocking system call socket.write() is invoked for each request; the worker thread waits until the kernel sends the 100KB response out, and the write-spin problem is avoided.
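The write-spin behavior can be reproduced against the non-blocking java.nio API with a sketch like the following (illustrative only; the connector code of our simplified servers is on GitHub [13]):

    // Sketch of the write-spin: write() on a non-blocking channel copies at most the free
    // space of the TCP send buffer (16KB by default) and returns 0 once the buffer is full,
    // so a naive sender keeps re-issuing the system call until ACKs free up buffer space.
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class WriteSpin {
        static int writeSpinning(SocketChannel conn, ByteBuffer response) throws IOException {
            int writeCalls = 0;
            while (response.hasRemaining()) {      // e.g., ~102 iterations for a 100KB response
                conn.write(response);              // non-blocking: may return 0 while waiting for ACKs
                writeCalls++;                      // each iteration is another socket.write() system call
            }
            return writeCalls;
        }
    }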

[Figure 5: timeline between server and client: the server reads from the socket, parses and encodes, then repeatedly writes to the socket while sum < data size; each write system call returns the number of bytes written, unexpectedly returning zero while waiting for TCP ACKs, until the data copy to the kernel finishes.]
Fig. 5: Illustration of the write-spin problem in an asynchronous server. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, a worker thread write-spins on the system call socket.write() and can only send more data after ACKs come back from the client for the previously sent packets.

An intuitive solution is to increase the TCP send buffer size to the same size as the server response to avoid the write-spin problem. Our experimental results actually show the effectiveness of manually increasing the TCP send buffer size to solve the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance. For example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, the size of which can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]. For example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes, resulting from a large amount of static and dynamic content (e.g., images and database query results); all of this content can be pushed back by answering one client request. Third, setting a large TCP send buffer for each TCP connection to prepare for the peak response size consumes a large amount of memory on the server, which may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance and prevent the write-spin problem.
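For reference, the per-connection send buffer can be requested from the application side as in the sketch below (the 100KB value follows the experiments above; the kernel may clamp the request to its configured maximum):

    // Sketch: requesting a larger per-connection TCP send buffer so that a 100KB
    // response fits into one socket.write().
    import java.io.IOException;
    import java.net.StandardSocketOptions;
    import java.nio.channels.SocketChannel;

    public class SendBufferTuning {
        static void enlargeSendBuffer(SocketChannel conn) throws IOException {
            conn.setOption(StandardSocketOptions.SO_SNDBUF, 100 * 1024);
            int granted = conn.getOption(StandardSocketOptions.SO_SNDBUF);
            System.out.println("requested 100KB, kernel granted " + granted + " bytes");
        }
    }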

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on the runtime network conditions. Once turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the Bandwidth-

[Figure 6: throughput (req/s) vs. network latency (~0ms to ~20ms) for SingleT-Async-100KB and SingleT-Async-autotuning.]
Fig. 6: The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled.

Delay Product rule [17]; it lacks application information such as the response size. Therefore, the auto-tuned send buffer may be large enough to maximize the throughput over the link but still inadequate for applications, which can still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than the case with a fixed large TCP send buffer size (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference is even bigger if there is non-trivial network latency between the client and the server, which is the topic of the next subsection.

B. Network Latency Exacerbates the Write-Spin Problem

Network latency is common in cloud data centers. Consider the component servers in an n-tier application, which may run on VMs located on different physical nodes across different racks or even data centers; the network latency among them can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 all the time. The response size of each client request is 100KB; the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc (Traffic Control)" on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that the surprising throughput degradation results from response time amplification when the write-spin problem happens. This is because sending a relatively large size response requires multiple rounds of data transfer due to the small TCP send buffer size; each data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small network latency increase is amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency

[Figure 7: (a) throughput (req/sec) and (b) response time (s) vs. network latency (~0ms to ~20ms) for SingleT-Async, sTomcat-Async-Fix, and sTomcat-Sync.]
Fig. 7: Throughput degradation of two asynchronous servers in subfigure (a), resulting from the response time amplification in (b) as the network latency increases.

is added. According to Little's Law, a server's throughput is negatively correlated with the response time of the server given that the workload concurrency (queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18s to 3.60s) means a 95% decrease in server throughput for SingleT-Async, as shown in Figure 7(a).
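To make the arithmetic explicit using the measured numbers above (Little's Law X = N/R, with concurrency N and mean response time R):

    X(no added latency) ≈ 100 / 0.18 s ≈ 556 req/s
    X(~5 ms latency)    ≈ 100 / 3.60 s ≈ 28 req/s, i.e., roughly a 95% throughput drop.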

V. SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem widely exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which is supposed to mitigate the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but with non-trivial optimization overhead. Then we propose a hybrid solution which takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

A. Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server, using a reactor thread to accept new connections and a worker thread pool to process the various I/O events from each connection.

Though using a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes

[Figure 8: the application thread loops on the write-to-socket system call while three conditions hold: (1) the last write returned a non-zero size, (2) writeSpin < threshold, and (3) sum < data size; otherwise it moves on to the next event processing and lets the kernel finish sending the data.]
Fig. 8: Netty mitigates the write-spin problem by runtime checking. The write spin jumps out of the loop if any of the three conditions is not met.

[Figure 9: throughput (req/s) vs. workload concurrency (1-3200 connections) for SingleT-Async, NettyServer, and sTomcat-Sync; panel (a) 100KB response size, panel (b) 0.1KB response size.]
Fig. 9: Throughput comparison under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB. Subfigure (a) shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem, and (b) shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

the role of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase); then it dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to each worker thread. In this case, the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event; the output of each handler is the input to the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing the unnecessary context switches between the reactor thread and worker threads.
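A typical Netty 4 server bootstrap reflecting this design looks roughly like the sketch below: the boss group only accepts connections, each worker event loop both monitors and handles I/O for its connections, and handlers are chained in a pipeline (BusinessLogicHandler is a placeholder for application code, not part of Netty):

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.*;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;
    import io.netty.handler.codec.http.HttpRequestDecoder;
    import io.netty.handler.codec.http.HttpResponseEncoder;

    public class NettyStyleServer {
        public static void main(String[] args) throws InterruptedException {
            EventLoopGroup boss = new NioEventLoopGroup(1);    // accepts new connections only
            EventLoopGroup workers = new NioEventLoopGroup();  // each loop monitors and handles its connections
            try {
                ServerBootstrap b = new ServerBootstrap();
                b.group(boss, workers)
                 .channel(NioServerSocketChannel.class)
                 .childHandler(new ChannelInitializer<SocketChannel>() {
                     @Override
                     protected void initChannel(SocketChannel ch) {
                         ch.pipeline().addLast(new HttpRequestDecoder(),     // chained handlers (pipeline)
                                               new HttpResponseEncoder(),
                                               new BusinessLogicHandler());  // placeholder application handler
                     }
                 });
                b.bind(8080).sync().channel().closeFuture().sync();
            } finally {
                boss.shutdownGracefully();
                workers.shutdownGracefully();
            }
        }

        // Placeholder; a real handler would build and write the HTTP response.
        static class BusinessLogicHandler extends ChannelInboundHandlerAdapter { }
    }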

In order to mitigate the write-spin problem, Netty adopts a write-spin check when a worker thread calls socket.write() to copy a large size response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, the return_size is zero, indicating the TCP send buffer is already full; second, the counter writeSpin exceeds a pre-defined threshold (the default value is 16 in Netty-v4). Once it jumps out, the worker thread saves the context and resumes the current connection's data transfer after it loops over other connections with pending events. Such write optimization is able to avoid blocking the worker thread on a connection transferring a large size response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
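The bounded write loop can be re-implemented in a few lines to show the idea (an illustrative sketch, not Netty's actual code; the threshold of 16 follows Netty-v4's default writeSpin setting mentioned above):

    // Give up after the send buffer is full (write returns 0) or after WRITE_SPIN_THRESHOLD
    // attempts, and let the event loop resume this connection later instead of spinning.
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class BoundedWrite {
        static final int WRITE_SPIN_THRESHOLD = 16;

        // Returns true if the whole response was copied to the kernel; false means the caller
        // should register write interest and come back when the socket is writable again.
        static boolean writeBounded(SocketChannel conn, ByteBuffer response) throws IOException {
            for (int spin = 0; spin < WRITE_SPIN_THRESHOLD && response.hasRemaining(); spin++) {
                int written = conn.write(response);
                if (written == 0) {                 // send buffer full: stop spinning
                    break;
                }
            }
            return !response.hasRemaining();
        }
    }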

We validate the effectiveness of Netty in mitigating the write-spin problem, and also the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. This figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% less than that of SingleT-Async in the 0.1KB response case, indicating non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore, neither NettyServer nor SingleT-Async is able to achieve the best performance under various workload conditions.

B. A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, there is no single asynchronous solution that always performs the best. For example, SingleT-Async suffers from the write-spin problem for large size responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small size responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

[Figure 10: in the event monitoring phase, select() produces a pool of connections with pending events; in the event handling phase, the worker takes the next available connection, checks the request type, and either parses/encodes and applies the write operation optimization (the NettyServer path) or parses/encodes and calls socket.write() directly (the SingleT-Async path), then returns for the next connection.]
Fig. 10: Worker thread processing flow in the hybrid solution.

The first assumption excludes the server being initiated with a large but fixed TCP send buffer size for each connection in order to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) we have discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of near-zero latency requirements [30]; for example, MemCached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem during runtime. In the initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus, when HybridNetty receives a new incoming request, it first checks the map object to figure out which category the request belongs to, and then chooses the most efficient execution path. In practice, the response size even for the same type of request may change over time (due to runtime environment changes such as the dataset), so we update the map object during runtime once a request is detected to be classified into a wrong category, in order to keep track of the latest category of such requests.
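A minimal sketch of this bookkeeping is shown below (hypothetical class and method names, not the authors' HybridNetty code): requests whose responses previously triggered a write-spin are marked heavy and routed to the Netty-style bounded-write path, all others take the lightweight single-threaded path, and mislabeled requests are re-classified at runtime.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class HybridDispatcher {
        enum Category { LIGHT, HEAVY }
        private final ConcurrentMap<String, Category> requestMap = new ConcurrentHashMap<>();

        // Look up the category recorded during the warm-up phase (default to the light path).
        Category classify(String requestType) {
            return requestMap.getOrDefault(requestType, Category.LIGHT);
        }

        // Called after a response is sent: writeSpun says whether the write-spin check was
        // triggered, and corrects the map if the request was in the wrong category.
        void recordOutcome(String requestType, boolean writeSpun) {
            requestMap.put(requestType, writeSpun ? Category.HEAVY : Category.LIGHT);
        }
    }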

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have large response sizes (e.g., 100KB), and the light requests, which have small response sizes (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different scenarios of realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we adopt a normalized throughput comparison and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; other than that, HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution, where light requests dominate the workload [22], our hybrid solution makes even more sense in dealing with realistic workloads. In addition, SingleT-Async performs much worse than the other two cases when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance as the asynchronous event-driven one. For example, von Behren et al. develop a thread-based web server, Knot [40], which can compete with event-driven servers at high concurrency workload using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

The optimization of asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining underlying event notification mechanisms [18], [34] or simplifying the interfaces of network I/O for application-level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, epoll, or I/O operations under high concurrency workload. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating the event monitoring and handling into the kernel.

[Figure 11: normalized throughput vs. ratio of large size responses (0%, 2%, 5%, 10%, 20%, 100%) for SingleT-Async, NettyServer, and HybridNetty; panel (a) no network latency between client and server, panel (b) ~5ms network latency between client and server.]
Fig. 11: The hybrid solution performs the best in different mixes of light/heavy request workloads, with or without network latency. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare the normalized throughput and use HybridNetty as the baseline.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µServer) and pipeline (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and blocking/non-blocking sendfile system calls. Brecht et al. [21] improve the performance of the event-driven µServer by modifying the strategy of accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections. Our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version, resulting from an inferior event processing flow which creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers: the write-spin problem when handling large size responses, and the associated exacerbating factors such as network latency (Section IV). Since there is no one solution that fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload/network conditions into consideration.

ACKNOWLEDGMENT

This research has been partially funded by National ScienceFoundation by CISErsquos CNS (1566443) Louisiana Board ofRegents under grant LEQSF(2015-18)-RD-A-11 and gifts or

HTTP Requests

Apache Tomcat MySQLClients

(b) 111 Sample Topology

(a) Software and Hardware Setup

Fig 12 Details of the RUBBoS experimental setup

grants from Fujitsu Any opinions findings and conclusionsor recommendations expressed in this material are those ofthe author(s) and do not necessarily reflect the views of theNational Science Foundation or other funding agencies andcompanies mentioned above

APPENDIX ARUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark whichis modeled after the famous news website Slashdot Theworkload consists of 24 different web interactions The defaultworkload generator emulates a number of users interactingwith the web application layer Each userrsquos behavior follows aMarkov chain model to navigate between different web pagesthe think time between receiving a web page and submitting anew page download request is about 7-second Such workloadgenerator has a similar design as other standard n-tier bench-marks such as RUBiS [12] TPC-W [14] and Cloudstone [29]We run the RUBBoS benchmark on our testbed Figure 12outlines the software configurations hardware configurationsand a sample 3-tier topology used in the Subsection II-Bexperiments Each server in the 3-tier topology is deployedin a dedicated machine All other client-server experimentsare conducted with one client and one server machine

REFERENCES

[1] Apache JMeterTM httpjmeterapacheorg[2] Collectl httpcollectlsourceforgenet[3] Jetty A Java HTTP (Web) Server and Java Servlet Container http

wwweclipseorgjetty[4] JProfiler The award-winning all-in-one Java profiler rdquohttpswww

ej-technologiescomproductsjprofileroverviewhtmlrdquo[5] lighttpd httpswwwlighttpdnet[6] MongoDB Async Java Driver httpmongodbgithubio

mongo-java-driver35driver-async[7] Netty httpnettyio[8] Nodejs httpsnodejsorgen[9] Oracle GlassFish Server httpwwworaclecomtechnetwork

middlewareglassfishoverviewindexhtml[10] Project Grizzly NIO Event Development Simplified httpsjavaee

githubiogrizzly[11] RUBBoS Bulletin board benchmark httpjmobow2orgrubboshtml[12] RUBiS Rice University Bidding System httprubisow2org[13] sTomcat-NIO sTomcat-BIO and two alternative asynchronous

servers httpsgithubcomsgzhangAsynMessaging[14] TPC-W A Transactional Web e-Commerce Benchmark httpwwwtpc

orgtpcw[15] ADLER S The slashdot effect an analysis of three internet publications

Linux Gazette 38 (1999) 2[16] ADYA A HOWELL J THEIMER M BOLOSKY W J AND

DOUCEUR J R Cooperative task management without manual stackmanagement In Proceedings of the General Track of the AnnualConference on USENIX Annual Technical Conference (Berkeley CAUSA 2002) ATEC rsquo02 USENIX Association pp 289ndash302

[17] ALLMAN M PAXSON V AND BLANTON E Tcp congestion controlTech rep 2009

[18] BANGA G DRUSCHEL P AND MOGUL J C Resource containers Anew facility for resource management in server systems In Proceedingsof the Third Symposium on Operating Systems Design and Implemen-tation (Berkeley CA USA 1999) OSDI rsquo99 USENIX Associationpp 45ndash58

[19] BELSHE M THOMSON M AND PEON R Hypertext transferprotocol version 2 (http2)

[20] BOYD-WICKIZER S CHEN H CHEN R MAO Y KAASHOEK FMORRIS R PESTEREV A STEIN L WU M DAI Y ZHANGY AND ZHANG Z Corey An operating system for many coresIn Proceedings of the 8th USENIX Conference on Operating SystemsDesign and Implementation (Berkeley CA USA 2008) OSDIrsquo08USENIX Association pp 43ndash57

[21] BRECHT T PARIAG D AND GAMMO L Acceptable strategiesfor improving web server performance In Proceedings of the AnnualConference on USENIX Annual Technical Conference (Berkeley CAUSA 2004) ATEC rsquo04 USENIX Association pp 20ndash20

[22] BRESLAU L CAO P FAN L PHILLIPS G AND SHENKER SWeb caching and zipf-like distributions Evidence and implicationsIn INFOCOMrsquo99 Eighteenth Annual Joint Conference of the IEEEComputer and Communications Societies Proceedings IEEE (1999)vol 1 IEEE pp 126ndash134

[23] CANAS C ZHANG K KEMME B KIENZLE J AND JACOBSENH-A Publishsubscribe network designs for multiplayer games InProceedings of the 15th International Middleware Conference (NewYork NY USA 2014) Middleware rsquo14 ACM pp 241ndash252

[24] DUKKIPATI N REFICE T CHENG Y CHU J HERBERT TAGARWAL A JAIN A AND SUTIN N An argument for increasingtcprsquos initial congestion window SIGCOMM Comput Commun Rev 403 (June 2010) 26ndash33

[25] FISK M AND FENG W-C Dynamic right-sizing in tcp httplib-www lanl govla-pubs00796247 pdf (2001) 2

[26] GARRETT J J ET AL Ajax A new approach to web applications[27] HAN S MARSHALL S CHUN B-G AND RATNASAMY S

Megapipe A new programming interface for scalable network io InProceedings of the 10th USENIX Conference on Operating SystemsDesign and Implementation (Berkeley CA USA 2012) OSDIrsquo12USENIX Association pp 135ndash148

[28] HARJI A S BUHR P A AND BRECHT T Comparing high-performance multi-core web-server architectures In Proceedings of the5th Annual International Systems and Storage Conference (New YorkNY USA 2012) SYSTOR rsquo12 ACM pp 11ndash112

[29] HASSAN O A-H AND SHARGABI B A A scalable and efficientweb 20 reader platform for mashups Int J Web Eng Technol 7 4(Dec 2012) 358ndash380

[30] HUANG Q BIRMAN K VAN RENESSE R LLOYD W KUMAR SAND LI H C An analysis of facebook photo caching In Proceedingsof the Twenty-Fourth ACM Symposium on Operating Systems Principles(New York NY USA 2013) SOSP rsquo13 ACM pp 167ndash181

[31] HUNT P KONAR M JUNQUEIRA F P AND REED B ZookeeperWait-free coordination for internet-scale systems In Proceedings of the2010 USENIX Conference on USENIX Annual Technical Conference(Berkeley CA USA 2010) USENIXATCrsquo10 USENIX Associationpp 11ndash11

[32] KROHN M KOHLER E AND KAASHOEK M F Events can makesense In 2007 USENIX Annual Technical Conference on Proceedings ofthe USENIX Annual Technical Conference (Berkeley CA USA 2007)ATCrsquo07 USENIX Association pp 71ndash714

[33] KROHN M KOHLER E AND KAASHOEK M F Simplified eventprogramming for busy network applications In Proceedings of the 2007USENIX Annual Technical Conference (Santa Clara CA USA (2007)

[34] LEVER C ERIKSEN M A AND MOLLOY S P An analysis ofthe tux web server Tech rep Center for Information TechnologyIntegration 2000

[35] LI C SHEN K AND PAPATHANASIOU A E Competitive prefetch-ing for concurrent sequential io In Proceedings of the 2Nd ACMSIGOPSEuroSys European Conference on Computer Systems 2007(New York NY USA 2007) EuroSys rsquo07 ACM pp 189ndash202

[36] NISHTALA R FUGAL H GRIMM S KWIATKOWSKI M LEEH LI H C MCELROY R PALECZNY M PEEK D SAABP STAFFORD D TUNG T AND VENKATARAMANI V Scalingmemcache at facebook In Presented as part of the 10th USENIXSymposium on Networked Systems Design and Implementation (NSDI13) (Lombard IL 2013) USENIX pp 385ndash398

[37] PAI V S DRUSCHEL P AND ZWAENEPOEL W Flash An efficientand portable web server In Proceedings of the Annual Conferenceon USENIX Annual Technical Conference (Berkeley CA USA 1999)ATEC rsquo99 USENIX Association pp 15ndash15

[38] PARIAG D BRECHT T HARJI A BUHR P SHUKLA A ANDCHERITON D R Comparing the performance of web server archi-tectures In Proceedings of the 2Nd ACM SIGOPSEuroSys EuropeanConference on Computer Systems 2007 (New York NY USA 2007)EuroSys rsquo07 ACM pp 231ndash243

[39] SOARES L AND STUMM M Flexsc Flexible system call schedulingwith exception-less system calls In Proceedings of the 9th USENIXConference on Operating Systems Design and Implementation (BerkeleyCA USA 2010) OSDIrsquo10 USENIX Association pp 33ndash46

[40] VON BEHREN R CONDIT J AND BREWER E Why events area bad idea (for high-concurrency servers) In Proceedings of the 9thConference on Hot Topics in Operating Systems - Volume 9 (BerkeleyCA USA 2003) HOTOSrsquo03 USENIX Association pp 4ndash4

[41] VON BEHREN R CONDIT J ZHOU F NECULA G C ANDBREWER E Capriccio Scalable threads for internet services InProceedings of the Nineteenth ACM Symposium on Operating SystemsPrinciples (New York NY USA 2003) SOSP rsquo03 ACM pp 268ndash281

[42] WELSH M CULLER D AND BREWER E Seda An architecturefor well-conditioned scalable internet services In Proceedings of theEighteenth ACM Symposium on Operating Systems Principles (NewYork NY USA 2001) SOSP rsquo01 ACM pp 230ndash243

[43] ZELDOVICH N YIP A DABEK F MORRIS R MAZIERES DAND KAASHOEK M F Multiprocessor support for event-drivenprograms In USENIX Annual Technical Conference General Track(2003) pp 239ndash252

  • Introduction
  • Background and Motivation
    • RPC vs Asynchronous Network IO
    • Performance Degradation after Tomcat Upgrade
      • Inefficient Event Processing Flow in Asynchronous Servers
      • Write-Spin Problem of Asynchronous Invocation
        • Profiling Results
        • Network Latency Exaggerates the Write-Spin Problem
          • Solution
            • Mitigating Context Switches and Write-Spin Using Netty
            • A Hybrid Solution
            • Validation of HybridNetty
              • Related Work
              • Conclusions
              • Appendix A RUBBoS Experimental Setup
              • References
Page 2: Improving Asynchronous Invocation Performance in Client

system calls resulting from the default small TCP send buffer size and the TCP wait-ACK mechanism, wasting about 12%~24% of the server CPU. We also observed that some network conditions, such as latency, can exacerbate the write-spin problem of asynchronous event-driven servers even further.

The third contribution is a hybrid solution that takes advantage of different asynchronous event-driven architectures to adapt to various runtime workload and network conditions. We studied a popular asynchronous network I/O library named "Netty" [7], which can mitigate the write-spin problem through write operation optimizations. However, such optimization techniques in Netty also bring non-trivial overhead when the write-spin problem does not occur during the asynchronous invocation period. Our hybrid solution extends Netty by monitoring the occurrence of write-spin and switching between alternative asynchronous invocation mechanisms to avoid the unnecessary optimization overhead.

In general, given the strong economic interest in achieving high resource efficiency and high quality of service simultaneously in cloud data centers, our results suggest that asynchronous invocation has a potentially significant performance advantage over synchronous RPC invocation, but it still needs careful tuning according to various non-trivial runtime workload and network conditions. Our work also points to significant future research opportunities, since the asynchronous architecture has been widely adopted by many distributed systems (e.g., pub/sub systems [23], AJAX [26], and ZooKeeper [31]).

The rest of the paper is organized as follows. Section II shows a case study of performance degradation after a software upgrade from a thread-based version to an asynchronous version. Section III describes the context switch problem of servers with the asynchronous architecture. Section IV explains the write-spin problem of the asynchronous architecture when it handles requests with large responses. Section V evaluates two practical solutions. Section VI summarizes related work and Section VII concludes the paper.

II. BACKGROUND AND MOTIVATION

A. RPC vs. Asynchronous Network I/O

Modern internet servers usually adopt a few connectors to communicate with other component servers or end users. The main activities of a connector include managing upstream and downstream network connections, reading/writing data through the established connections, and parsing and routing the incoming requests to the application (business logic) layer. Though similar in functionality, synchronous and asynchronous connectors have very different mechanisms to interact with the application layer logic.

Synchronous connectors are mainly adopted by RPC thread-based servers. After accepting a new connection, the main thread dispatches the connection to a dedicated worker thread until the connection is closed. In this case each connection consumes one worker thread, and the operating system transparently switches among worker threads for concurrent request processing. Although relatively easy to program due to the user-perceived sequential execution flow, synchronous connectors bring the well-known multi-threading overhead (e.g., context switches, scheduling, and lock contention).
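As a rough illustration (our own minimal sketch, not the Tomcat implementation; the port, response body, and thread-pool choice are illustrative assumptions), a thread-per-connection synchronous connector in Java looks like the following:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ThreadPerConnectionServer {
        public static void main(String[] args) throws IOException {
            ExecutorService workers = Executors.newCachedThreadPool();   // one worker per active connection
            try (ServerSocket listener = new ServerSocket(8080)) {
                while (true) {
                    Socket conn = listener.accept();                      // main thread only accepts
                    workers.submit(() -> handle(conn));                   // dedicated worker until the connection closes
                }
            }
        }

        static void handle(Socket conn) {
            try {
                BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                OutputStream out = conn.getOutputStream();
                in.readLine();                                            // blocking read of the request line
                out.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok".getBytes());
                out.flush();                                              // blocking write of the response
            } catch (IOException ignored) {
            } finally {
                try { conn.close(); } catch (IOException e) { }
            }
        }
    }

Each in-flight connection here occupies one worker thread that blocks on read/write, which is exactly what causes the multi-threading overhead under high concurrency.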

Asynchronous connectors accept new connections and manage all established connections through an event-driven mechanism using only one or a few threads. Given a pool of established connections in a server, an asynchronous connector handles requests received from these connections by repeatedly looping over two phases. The first phase (event monitoring phase) determines which connections have pending events of interest; these events typically indicate that a particular connection (i.e., socket) is readable or writable. The asynchronous connector pulls the connections with pending events by taking advantage of an event notification mechanism such as select, poll, or epoll supported by the underlying operating system. The second phase (event handling phase) iterates over each of the connections that have pending events. Based on the context information of each event, the connector dispatches the event to an appropriate event handler, which performs the actual business logic computation. More details can be found in previous asynchronous server research [42], [32], [38], [37], [28].
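To make the two phases concrete, the following is a minimal, hypothetical Java NIO sketch of a single-threaded asynchronous connector (not code from Tomcat or Netty; the port and the trivial response are assumptions for illustration):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class MiniAsyncConnector {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();                        // phase 1: event monitoring (select/poll/epoll underneath)
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {                    // phase 2: event handling
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel conn = server.accept();
                        conn.configureBlocking(false);
                        conn.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel conn = (SocketChannel) key.channel();
                        ByteBuffer buf = ByteBuffer.allocate(4096);
                        if (conn.read(buf) < 0) { conn.close(); continue; }
                        // parse the request, then send a (small) response
                        conn.write(ByteBuffer.wrap("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok".getBytes()));
                    }
                }
            }
        }
    }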

In practice there are two general designs of asynchronous servers using asynchronous connectors. The first one is a single-threaded asynchronous server, which uses only one thread to loop over the aforementioned two phases, for example, Node.js [8] and Lighttpd [5]. Such a design is especially beneficial for in-memory workloads because context switches are minimal and the single thread will not be blocked by disk I/O activities [35]. Multiple single-threaded servers (the so-called N-copy approach [43], [28]) can be launched together to fully utilize multiple processors. The second design is to use a worker thread pool in the second phase to concurrently process connections that have pending events. Such a design is supposed to efficiently utilize CPU resources in case of transient disk I/O blocking or in a multi-core environment [38]. Several variants of the second design have been proposed before, mostly known as the staged design adopted by SEDA [42] and WatPipe [38]. Instead of having only one worker thread pool, the staged design decomposes the request processing into a pipeline of stages separated by event queues, each of which has its own worker thread pool, with the aim of modular design and fine-grained management of worker threads.

In general, asynchronous event-driven servers are believed to achieve higher throughput than their thread-based counterparts because of reduced multi-threading overhead, especially when the server is handling a high-concurrency, CPU-intensive workload. However, our experimental results in the following section show the opposite.

B. Performance Degradation after Tomcat Upgrade

System software upgrades are a common practice for internet services due to the fast evolution of software components. In this section we show that simply upgrading a thread-based component server to its asynchronous version in an n-tier system may cause unexpected performance degradation at high resource utilization. We show one such case using RUBBoS [11], a representative web-facing n-tier system benchmark modeled after the popular news website Slashdot [15]. Our experiments adopt a typical 3-tier configuration with one Apache web server, one Tomcat application server, and one MySQL database server (details in Appendix A). At the beginning we use Tomcat 7 (noted as TomcatSync), which uses a thread-based synchronous connector for inter-tier communication. We then upgrade the Tomcat server to Version 8 (the latest version at the time, noted as TomcatAsync), which by default uses an asynchronous connector, with the expectation of system performance improvement after the Tomcat upgrade.

Fig. 2: Throughput comparison between TomcatSync and TomcatAsync under different workload concurrencies and response sizes ((a) 0.1KB, (b) 10KB, (c) 100KB responses). As the response size increases from 0.1KB in (a) to 100KB in (c), TomcatSync outperforms TomcatAsync over a wider concurrency range, indicating the performance degradation of TomcatAsync with large response sizes.

Unfortunately, we observed a surprising system performance degradation after the Tomcat server upgrade, as shown in Figure 1. We call the system with TomcatSync SYS_tomcatV7 and the one with TomcatAsync SYS_tomcatV8. The figure shows that SYS_tomcatV7 saturates at workload 11,000 while SYS_tomcatV8 saturates at workload 9,000. At workload 11,000, SYS_tomcatV7 outperforms SYS_tomcatV8 by 28% in throughput, and its average response time is one order of magnitude lower (226ms vs. 2820ms). Such a result is counter-intuitive since we upgraded Tomcat from an older thread-based version to a newer asynchronous one. We note that in both cases the Tomcat server CPU is the bottleneck resource in the system; all the hardware resources (e.g., CPU and memory) of the other component servers are far from saturation (< 60%).

We use Collectl [2] to collect system-level metrics (e.g., CPU and context switches). Another interesting phenomenon we observed is that TomcatAsync encounters a significantly higher number of context switches than TomcatSync at the same workload. For example, at workload 10,000, TomcatAsync encounters 12,950 context switches per second, while TomcatSync encounters only 5,930, less than half of the former. It is reasonable to suggest that the high context switch rate in TomcatAsync causes high CPU overhead, leading to inferior throughput compared to TomcatSync. However, traditional wisdom tells us that a server with an asynchronous architecture should have fewer context switches than a thread-based server. So why did we observe the opposite here? We discuss the cause in the next section.

Fig. 3: Illustration of the event processing flow when TomcatAsync processes one request: the reactor thread dispatches the read event to worker A, which reads and processes the request and generates a write event; the reactor thread then dispatches the write event to worker B, which sends the response and returns control back. In total there are four context switches between the reactor thread and the worker threads.

III. INEFFICIENT EVENT PROCESSING FLOW IN ASYNCHRONOUS SERVERS

In this section we explain why the performance of the 3-tier benchmark system degrades after we upgrade Tomcat from the thread-based version TomcatSync to the asynchronous version TomcatAsync. To simplify and quantify our analysis, we design micro-benchmarks to test the performance of both versions of Tomcat.

We use JMeter [1] to generate HTTP requests that access the standalone Tomcat directly. These HTTP requests are categorized into three types, small, medium, and large, for which the Tomcat server (either TomcatSync or TomcatAsync) first conducts some simple computation before responding with 0.1KB, 10KB, and 100KB of in-memory data, respectively. We choose these three sizes because they are representative response sizes in our RUBBoS benchmark application. JMeter uses one thread to simulate each end-user. We set the think time between consecutive requests sent from the same thread to zero; thus we can precisely control the workload concurrency on the target Tomcat server by specifying the number of threads in JMeter.

We compare the server throughput of TomcatSync and TomcatAsync under different workload concurrencies and response sizes, as shown in Figure 2. The three subfigures show that, as the workload concurrency increases from 1 to 3200, TomcatAsync achieves lower throughput than TomcatSync below a certain workload concurrency. For example, TomcatAsync performs worse than TomcatSync below workload concurrency 64 when the response size is 10KB, and the crossover concurrency is even higher (1600) when the response size increases to 100KB. Returning to our previous 3-tier RUBBoS experiments, our measurements show that under the RUBBoS workload conditions the average response size of Tomcat per request is about 20KB and the workload concurrency for Tomcat is about 35 when the system saturates. So, based on our micro-benchmark results in Figure 2, it is not surprising that TomcatAsync performs worse than TomcatSync. Since Tomcat is the bottleneck server of the 3-tier system, the performance degradation of Tomcat also leads to the performance degradation of the whole system (see Figure 1). The remaining question is why TomcatAsync performs worse than TomcatSync before a certain workload concurrency.

TABLE I: TomcatAsync has more context switches than TomcatSync under workload concurrency 8 (unit: ×1000/sec).

    Response size    TomcatAsync    TomcatSync
    0.1KB            40             16
    10KB             25             7
    100KB            28             2

We found that the performance degradation of TomcatAsync results from its inefficient event processing flow, which generates a significant number of intermediate context switches and causes non-trivial CPU overhead. Table I compares the context switches of TomcatAsync and TomcatSync at workload concurrency 8. The table shows results consistent with what we observed in the previous RUBBoS experiments: the asynchronous TomcatAsync encounters significantly more context switches than the thread-based TomcatSync given the same workload concurrency and server response size. Our further analysis reveals that the high context switch rate of TomcatAsync is caused by its poor design of the event processing flow. Concretely, TomcatAsync adopts the second design of asynchronous servers (see Section II-A), which uses a reactor thread for event monitoring and a worker thread pool for event handling. Figure 3 illustrates the event processing flow in TomcatAsync.

Thus, to handle one client request, there are in total four context switches among the user-space threads in TomcatAsync (see steps 1-4 in Figure 3). Such an inefficient event processing flow design also exists in many popular asynchronous servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3]. On the other hand, in TomcatSync each client request is handled by a dedicated worker thread, from the initial reading of the request, to preparing the response, to sending the response out. There is no context switch during the processing of the request unless the worker thread is interrupted or swapped out by the operating system.
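The following minimal Java sketch (our own illustration, not Tomcat's source) shows the reactor/worker split that produces these intermediate context switches: the reactor dispatches the read event to one worker, and the write event that worker generates is dispatched again, possibly to a different worker.

    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of the "one-event-one-handler" flow: every hand-off below is a
    // context switch between the reactor thread and a worker thread.
    public class ReactorWithWorkerPool {
        private final ExecutorService workers = Executors.newFixedThreadPool(8); // pool size is illustrative
        private final Selector selector;

        ReactorWithWorkerPool(Selector selector) { this.selector = selector; }

        void reactorLoop() throws Exception {
            while (true) {
                selector.select();                                   // event monitoring phase
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isReadable()) {
                        workers.submit(() -> handleRead(key));       // switch 1: reactor -> worker A
                    } else if (key.isWritable()) {
                        workers.submit(() -> handleWrite(key));      // switch 3: reactor -> worker B
                    }
                }
                selector.selectedKeys().clear();
            }
        }

        void handleRead(SelectionKey key) {
            // read and process the request, prepare the response ...
            key.interestOps(SelectionKey.OP_WRITE);                  // generate a write event
            selector.wakeup();                                       // switch 2: control returns to the reactor
        }

        void handleWrite(SelectionKey key) {
            // write the response to the socket ...
            key.interestOps(SelectionKey.OP_READ);
            selector.wakeup();                                       // switch 4: control returns to the reactor
        }
    }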

To better quantify the impact of context switches on the performance of different server architectures, we simplify the implementations of TomcatAsync and TomcatSync by removing all unrelated modules (e.g., servlet life cycle management, cache management, and logging) and keeping only the essential code related to request processing; we refer to these as sTomcat-Async (simplified TomcatAsync) and sTomcat-Sync (simplified TomcatSync). As a reference, we implement two alternative designs of asynchronous servers aiming to reduce the frequency of context switches. The first alternative design, which we call sTomcat-Async-Fix, merges the processing of the read event and the write event from the same request by using the same worker thread. In this case, once a worker thread finishes preparing the response it continues to send the response out (steps 2 and 3 in Figure 3 no longer exist); thus processing one client request only requires two context switches, from the reactor thread to a worker thread and from the same worker thread back to the reactor thread. The second alternative design is the traditional single-threaded asynchronous server, in which a single thread is responsible for both event monitoring and processing. The single-threaded implementation, which we refer to as SingleT-Async, is supposed to have the fewest context switches. Table II summarizes the context switches for each server type when it processes one client request.¹ Interested readers can check out our server implementations on GitHub [13] for further reference.

TABLE II: Context switches among user-space threads when the server processes one client request.

    Server type          Context switches   Note
    sTomcat-Async        4                  Read and write events are handled by different worker threads (Figure 3)
    sTomcat-Async-Fix    2                  Read and write events are handled by the same worker thread
    sTomcat-Sync         0                  Dedicated worker thread for each request; context switches occur only due to interrupts or CPU time slice expiration
    SingleT-Async        0                  No context switches; one thread handles both event monitoring and processing

We compare the throughput and context switches of the four types of servers under increasing workload concurrencies and server response sizes, as shown in Figure 4. Comparing Figures 4(a) and 4(d), the maximum achievable throughput of each server type is negatively correlated with its context switch frequency during the runtime experiments. For example, at workload concurrency 16, sTomcat-Async-Fix outperforms sTomcat-Async by 22% in throughput while incurring 34% fewer context switches. In our experiments the CPU demand of each request is positively correlated with the response size; a small response size means a small CPU computation demand, so the portion of CPU cycles wasted on context switches becomes large. As a result, the gap in context switches between sTomcat-Async-Fix and sTomcat-Async is reflected in their throughput difference. This hypothesis is further validated by the performance of SingleT-Async and sTomcat-Sync, which outperform sTomcat-Async by 91% and 57% in throughput, respectively (see Figure 4(a)). This performance difference is also due to fewer context switches, as shown in Figure 4(d). For example, the context switch rate of SingleT-Async is a few hundred per second, three orders of magnitude lower than that of sTomcat-Async.

¹ To simplify analysis and reasoning, we do not count the context switches caused by interrupts or by swapping performed by the operating system.

Fig. 4: Throughput and context switch comparison among different server architectures as the server response size increases from 0.1KB to 100KB. Panels (a)-(c) show throughput and panels (d)-(f) show context switches for response sizes of 0.1KB, 10KB, and 100KB, respectively. (a) and (d) show that the maximum achievable throughput of each server type is negatively correlated with its context switch frequency when the server response size is small (0.1KB). However, as the response size increases to 100KB, (c) shows that sTomcat-Sync outperforms the asynchronous servers before workload concurrency 400, indicating that factors other than context switches cause overhead in asynchronous servers.

We note that as the server response size becomes larger, the portion of CPU overhead caused by context switches becomes smaller, since more CPU cycles are consumed by processing requests and sending responses. This is the case in Figures 4(b) and 4(c), where the response sizes are 10KB and 100KB, respectively: the throughput difference among the four server architectures becomes narrower, indicating less performance impact from context switches.

In fact, an interesting phenomenon is observed as the response size increases to 100KB. Figure 4(c) shows that SingleT-Async performs worse than the thread-based sTomcat-Sync before workload concurrency 400, even though SingleT-Async has far fewer context switches than sTomcat-Sync, as shown in Figure 4(f). This observation suggests that other factors cause overhead in the asynchronous SingleT-Async but not in the thread-based sTomcat-Sync when the server response size is large, which we discuss in the next section.

IV. WRITE-SPIN PROBLEM OF ASYNCHRONOUS INVOCATION

In this section we study the performance degradation of an asynchronous server sending a large response. We use fine-grained profiling tools such as Collectl [2] and JProfiler [4] to analyze the detailed CPU usage and some key system calls invoked by servers with different architectures. We found that the default small TCP send buffer size and the TCP wait-ACK mechanism lead to a severe write-spin problem when sending a relatively large response, which causes significant CPU overhead for asynchronous servers. We also explored several network-related factors that can exacerbate the negative impact of the write-spin problem, further degrading the performance of an asynchronous server.

A. Profiling Results

Recall that Figure 4(a) shows that when the response size is small (i.e., 0.1KB), the throughput of the asynchronous SingleT-Async is 20% higher than that of the thread-based sTomcat-Sync at workload concurrency 8. However, as the response size increases to 100KB, the throughput of SingleT-Async is surprisingly 31% lower than that of sTomcat-Sync under the same workload concurrency (see Figure 4(c)). Since the only change is the response size, it is natural to speculate that a large response size brings significant overhead for SingleT-Async but not for sTomcat-Sync.

To investigate the performance degradation of SingleT-Async when the response size is large, we first use Collectl [2] to analyze the detailed CPU usage of the server with different response sizes, as shown in Table III. The workload concurrency for both SingleT-Async and sTomcat-Sync is 100, and the CPU is 100% utilized under this workload concurrency. As the response size for both server architectures increases from 0.1KB to 100KB, the table shows that the user-space CPU utilization of sTomcat-Sync increases by 25% (from 55% to 80%), versus 34% (from 58% to 92%) for SingleT-Async. This comparison suggests that increasing the response size has more impact on the user-space CPU utilization of the asynchronous SingleT-Async than on that of the thread-based sTomcat-Sync.

We further use JProfiler [4] to profile the SingleT-Async case when the response size increases from 0.1KB to 100KB, to see what has changed at the application level. We found that the frequency of the socket.write() system call is especially high in the 100KB case, as shown in Table IV. We note that socket.write() is called when a server sends a response back to the corresponding client. In the case of a thread-based server like sTomcat-Sync, socket.write() is called only once for each client request. While such one write per request also holds for the 0.1KB and 10KB cases in SingleT-Async, it calls socket.write() 102 times per request on average in the 100KB case. System calls in general are expensive due to the associated kernel crossing overhead [20], [39]; thus the high frequency of socket.write() in the 100KB case helps explain the high user-space CPU overhead of SingleT-Async shown in Table III.

TABLE III: SingleT-Async consumes more user-space CPU than sTomcat-Sync. The workload concurrency is kept at 100.

    Server type             sTomcat-Sync          SingleT-Async
    Response size           0.1KB      100KB      0.1KB      100KB
    Throughput [req/sec]    35,000     590        42,800     520
    User CPU total [%]      55         80         58         92
    System CPU total [%]    45         20         42         8

TABLE IV: The write-spin problem occurs when the response size is 100KB. This table shows the total number of socket.write() calls in SingleT-Async with different response sizes during a one-minute experiment.

    Resp. size    # req      # write()    socket.write() per req
    0.1KB         238,530    238,530      1
    10KB          9,400      9,400        1
    100KB         2,971      303,795      102

Our further analysis shows that the multiple-socket-write problem of SingleT-Async is due to the small TCP send buffer size (16KB by default) for each TCP connection and the TCP wait-ACK mechanism. When a processing thread tries to copy 100KB of data from user space to the kernel-space TCP buffer through the system call socket.write(), the first socket.write() can copy at most 16KB of data into the send buffer, which is organized as a byte buffer ring. A TCP sliding window is maintained by the kernel to decide how much data can actually be sent to the client; the sliding window can move forward and free up buffer space for new data only when the server receives the ACKs of the previously sent-out packets. Since socket.write() is a non-blocking system call in SingleT-Async, each call returns how many bytes were written to the TCP send buffer; the call returns zero if the TCP send buffer is full, leading to the write-spin problem. The whole process is illustrated in Figure 5. On the other hand, when a worker thread in the synchronous sTomcat-Sync copies 100KB of data from user space to the kernel-space TCP send buffer, only one blocking socket.write() system call is invoked for each request; the worker thread waits until the kernel sends the 100KB response out, and the write-spin problem is avoided.
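As a hypothetical illustration of this behavior (not the actual server code), the following Java fragment spins on a non-blocking SocketChannel until the whole response has been copied into the kernel send buffer:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class WriteSpinDemo {
        // Naive non-blocking send: keeps calling write() until all bytes are copied
        // to the kernel. When the 16KB send buffer is full, write() returns 0 and
        // the loop "spins", burning CPU while waiting for client ACKs.
        static void spinningWrite(SocketChannel channel, ByteBuffer response) throws IOException {
            int writeCalls = 0;
            while (response.hasRemaining()) {            // keep going until the whole response is copied
                int written = channel.write(response);   // non-blocking: returns 0 when the send buffer is full
                writeCalls++;                            // each retry is another system call (the "write-spin")
            }
            System.out.println("socket.write() invoked " + writeCalls + " times for one response");
        }
    }

With a 100KB response and the default 16KB send buffer, this loop can easily invoke socket.write() on the order of a hundred times per request, matching the behavior reported in Table IV.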

Fig. 5: Illustration of the write-spin problem in an asynchronous server. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, a worker thread write-spins on the system call socket.write() (the call returns zero while the send buffer is full) and can only copy more data to the kernel after ACKs for the previously sent packets come back from the client.

An intuitive solution is to increase the TCP send buffer size to match the server response size and thus avoid the write-spin problem. Our experimental results indeed show that manually increasing the TCP send buffer size solves the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance. For example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, whose size can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]. For example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes, resulting from a large amount of static and dynamic content (e.g., images and database query results), all of which can be pushed back in answer to one client request. Third, setting a large TCP send buffer for each TCP connection to accommodate the peak response size consumes a large amount of server memory, since a server may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance to prevent the write-spin problem.
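For reference, a per-connection send buffer can be enlarged in Java as sketched below; the 100KB value mirrors the response size used in our experiments and is an illustrative assumption, and the kernel may still cap or adjust the effective size (e.g., via net.core.wmem_max on Linux):

    import java.io.IOException;
    import java.net.StandardSocketOptions;
    import java.nio.channels.SocketChannel;

    public class SendBufferConfig {
        // Request a 100KB kernel send buffer for this connection so that a 100KB
        // response can be copied with a single socket.write() call (no write-spin).
        static void enlargeSendBuffer(SocketChannel channel) throws IOException {
            channel.setOption(StandardSocketOptions.SO_SNDBUF, 100 * 1024);
            int effective = channel.getOption(StandardSocketOptions.SO_SNDBUF);
            System.out.println("Effective SO_SNDBUF: " + effective + " bytes");
        }
    }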

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on runtime network conditions. Once turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the Bandwidth-Delay Product rule [17]; it lacks application-level information such as the response size. Therefore the auto-tuned send buffer may be large enough to maximize throughput over the link but still inadequate for the application, which may still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than with a fixed large TCP send buffer size (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference is even bigger when there is non-trivial network latency between the client and the server, which is the topic of the next subsection.

Fig. 6: The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled (throughput of SingleT-Async with a 100KB send buffer vs. SingleT-Async with autotuning under ~0ms to ~20ms network latency).

B. Network Latency Exacerbates the Write-Spin Problem

Network latency is common in cloud data centers. The component servers of an n-tier application may run on VMs located on different physical nodes, across different racks, or even in different data centers, so the latency between them can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by such network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 at all times. The response size of each client request is 100KB, and the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc (Traffic Control)" on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

Fig. 7: Throughput degradation of the two asynchronous servers in subfigure (a), resulting from the response time amplification shown in (b), as the network latency increases from ~0ms to ~20ms (servers: SingleT-Async, sTomcat-Async-Fix, sTomcat-Sync).

We found that the surprising throughput degradation results from response time amplification when the write-spin problem happens. Sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small increase in network latency is amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency is added. According to Little's Law, a server's throughput is negatively correlated with its response time, given that the workload concurrency (queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18 to 3.60 seconds) means a 95% decrease in the throughput of SingleT-Async, as shown in Figure 7(a).
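As a quick sanity check (simple arithmetic, not an additional measurement), Little's Law X = N / R with N = 100 outstanding requests gives:

    X(~0ms latency) = 100 / 0.18 s ≈ 556 req/s
    X(~5ms latency) = 100 / 3.60 s ≈ 28 req/s

i.e., roughly a 95% throughput drop, consistent with the degradation shown in Figure 7(a).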

V. SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II), and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem also widely exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which is supposed to mitigate the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but which introduces non-trivial optimization overhead. We then propose a hybrid solution that takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

A. Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server, using a reactor thread to accept new connections and a worker thread pool to process the various I/O events of each connection.

Though it uses a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes the roles of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase); it then dispatches each event to an available worker thread for the proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to the worker threads. In this case the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event; the output of each handler is the input to the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing unnecessary context switches between the reactor thread and the worker threads.

Fig. 8: Netty mitigates the write-spin problem by runtime checking. The worker thread keeps writing only while the returned size is non-zero, the writeSpin counter is below the threshold, and not all data has been copied to the kernel; if any of the three conditions is not met, it jumps out of the write loop and moves on to the next event.

Fig. 9: Throughput comparison under various workload concurrencies and response sizes (servers: SingleT-Async, NettyServer, sTomcat-Sync). The default TCP send buffer size is 16KB. Subfigure (a), with 100KB responses, shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem; (b), with 0.1KB responses, shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

To mitigate the write-spin problem, Netty adopts a write-spin check when a worker thread calls socket.write() to copy a large response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, return_size is zero, indicating that the TCP send buffer is already full; second, the writeSpin counter exceeds a pre-defined threshold (the default value is 16 in Netty v4). Once it jumps out, the worker thread saves the context and resumes the data transfer of the current connection after it loops over the other connections with pending events. Such a write optimization mitigates the blocking of a worker thread by a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
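The bounded write loop can be sketched as follows (our own simplified rendering of the behavior described above, not Netty's actual source; the constant and method names are illustrative):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class BoundedWrite {
        private static final int WRITE_SPIN_THRESHOLD = 16;  // default writeSpin limit described in the text

        // Returns true if the whole response was copied to the kernel, false if the
        // caller should save the remaining buffer and come back to this connection later.
        static boolean boundedWrite(SocketChannel channel, ByteBuffer response) throws IOException {
            for (int writeSpin = 0; writeSpin < WRITE_SPIN_THRESHOLD; writeSpin++) {
                int returnSize = channel.write(response);     // non-blocking write
                if (returnSize == 0) {
                    break;                                    // send buffer full: stop spinning, wait for ACKs
                }
                if (!response.hasRemaining()) {
                    return true;                              // whole response copied to the kernel
                }
            }
            return false;                                     // give up for now; serve other connections first
        }
    }

A caller that receives false would typically keep the remaining buffer in the connection's context and continue with other pending events; this extra bookkeeping is the source of the optimization overhead when responses are small.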

We validate the effectiveness of Netty in mitigating the write-spin problem, as well as the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. The figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async while also avoiding the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% lower than that of SingleT-Async in the 0.1KB response case, indicating the non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore neither NettyServer nor SingleT-Async achieves the best performance under all workload conditions.

B. A Hybrid Solution

In the previous section we showed that an asynchronous solution, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, no single asynchronous solution always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary at runtime.
• The workload is an in-memory workload.

Fig. 10: Worker thread processing flow in the hybrid solution. In the event monitoring phase, the worker selects connections with pending events; in the event handling phase it checks the recorded request type for each connection and routes the request either through the SingleT-Async path (parse, encode, and socket.write() directly) or through the NettyServer path with the write operation optimization.

The first assumption excludes initializing the server with a large but fixed TCP send buffer size for each connection to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) we discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of near-zero latency requirements [30]; for example, Memcached servers have been widely adopted to reduce disk activities [36]. A solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles the different types of requests based on whether or not their responses cause a write-spin problem at runtime. In an initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus, when HybridNetty receives a new incoming request, it first checks the map object to figure out which category the request belongs to, and then chooses the most efficient execution path. In practice the response size, even for the same type of request, may change over time (due to runtime environment changes such as the dataset), so we update the map object at runtime once a request is detected to have been classified into the wrong category, in order to keep track of the latest category of such requests.
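A minimal sketch of this classification logic is shown below (our own illustration under the two assumptions above; the class and method names such as RequestKey-style strings, servePlain, and serveWithWriteOptimization are hypothetical, not HybridNetty's actual API):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class HybridDispatcher {
        enum Category { LIGHT, HEAVY }

        // Maps a request type (e.g., its URL pattern) to the category learned so far.
        private final ConcurrentMap<String, Category> categoryMap = new ConcurrentHashMap<>();

        void dispatch(String requestKey, Connection conn) {
            Category cat = categoryMap.getOrDefault(requestKey, Category.LIGHT);
            boolean writeSpinObserved;
            if (cat == Category.LIGHT) {
                writeSpinObserved = servePlain(conn);                  // SingleT-Async style path
            } else {
                writeSpinObserved = serveWithWriteOptimization(conn);  // NettyServer style path
            }
            // Re-learn the category if the request was misclassified (e.g., its response size changed).
            categoryMap.put(requestKey, writeSpinObserved ? Category.HEAVY : Category.LIGHT);
        }

        // Both paths report whether the write loop detected a write-spin for this response.
        boolean servePlain(Connection conn) { /* parse, encode, socket.write() */ return false; }
        boolean serveWithWriteOptimization(Connection conn) { /* bounded write with spin counter */ return false; }

        interface Connection { }
    }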

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: heavy requests, which have large response sizes (e.g., 100KB), and light requests, which have small response sizes (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different scenarios of realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we adopt a normalized throughput comparison and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution, in which light requests dominate the workload [22], our hybrid solution is even more attractive for realistic workloads. In addition, SingleT-Async performs much worse than the other two servers when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than an asynchronous event-driven one. For example, von Behren et al. developed a thread-based web server, Knot [40], which can compete with event-driven servers under high-concurrency workloads using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] point out that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

Optimizations for asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts are motivated by reducing the overhead incurred by system calls such as select, poll, and epoll, or by I/O operations, under high-concurrency workloads. For example, to avoid the kernel-crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

Fig. 11: The hybrid solution performs the best under different mixes of light/heavy request workloads, with or without network latency: (a) no network latency between client and server; (b) ~5ms network latency between client and server. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare normalized throughput and use HybridNetty as the baseline (servers: SingleT-Async, NettyServer, HybridNetty; x-axis: ratio of large-size responses).

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µServer) and pipelined (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µServer by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections. Our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than their thread-based counterparts because of an inferior event processing flow, which creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers, the write-spin problem when handling large responses, and the associated exacerbating factors such as network latency (Section IV). Since no one solution fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload/network conditions into consideration.

ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through CISE's CNS (1566443), by the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and by gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

Fig. 12: Details of the RUBBoS experimental setup: (a) software and hardware setup; (b) a sample 1/1/1 topology (clients, Apache web server, Tomcat, MySQL).

APPENDIX A
RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. This workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Subsection II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.

REFERENCES

[1] Apache JMeterTM httpjmeterapacheorg[2] Collectl httpcollectlsourceforgenet[3] Jetty A Java HTTP (Web) Server and Java Servlet Container http

wwweclipseorgjetty[4] JProfiler The award-winning all-in-one Java profiler rdquohttpswww

ej-technologiescomproductsjprofileroverviewhtmlrdquo[5] lighttpd httpswwwlighttpdnet[6] MongoDB Async Java Driver httpmongodbgithubio

mongo-java-driver35driver-async[7] Netty httpnettyio[8] Nodejs httpsnodejsorgen[9] Oracle GlassFish Server httpwwworaclecomtechnetwork

middlewareglassfishoverviewindexhtml[10] Project Grizzly NIO Event Development Simplified httpsjavaee

githubiogrizzly[11] RUBBoS Bulletin board benchmark httpjmobow2orgrubboshtml[12] RUBiS Rice University Bidding System httprubisow2org[13] sTomcat-NIO sTomcat-BIO and two alternative asynchronous

servers httpsgithubcomsgzhangAsynMessaging[14] TPC-W A Transactional Web e-Commerce Benchmark httpwwwtpc

orgtpcw[15] ADLER S The slashdot effect an analysis of three internet publications

Linux Gazette 38 (1999) 2[16] ADYA A HOWELL J THEIMER M BOLOSKY W J AND

DOUCEUR J R Cooperative task management without manual stackmanagement In Proceedings of the General Track of the AnnualConference on USENIX Annual Technical Conference (Berkeley CAUSA 2002) ATEC rsquo02 USENIX Association pp 289ndash302

[17] ALLMAN M PAXSON V AND BLANTON E Tcp congestion controlTech rep 2009

[18] BANGA G DRUSCHEL P AND MOGUL J C Resource containers Anew facility for resource management in server systems In Proceedingsof the Third Symposium on Operating Systems Design and Implemen-tation (Berkeley CA USA 1999) OSDI rsquo99 USENIX Associationpp 45ndash58

[19] BELSHE M THOMSON M AND PEON R Hypertext transferprotocol version 2 (http2)

[20] BOYD-WICKIZER S CHEN H CHEN R MAO Y KAASHOEK FMORRIS R PESTEREV A STEIN L WU M DAI Y ZHANGY AND ZHANG Z Corey An operating system for many coresIn Proceedings of the 8th USENIX Conference on Operating SystemsDesign and Implementation (Berkeley CA USA 2008) OSDIrsquo08USENIX Association pp 43ndash57

[21] BRECHT T PARIAG D AND GAMMO L Acceptable strategiesfor improving web server performance In Proceedings of the AnnualConference on USENIX Annual Technical Conference (Berkeley CAUSA 2004) ATEC rsquo04 USENIX Association pp 20ndash20

[22] BRESLAU L CAO P FAN L PHILLIPS G AND SHENKER SWeb caching and zipf-like distributions Evidence and implicationsIn INFOCOMrsquo99 Eighteenth Annual Joint Conference of the IEEEComputer and Communications Societies Proceedings IEEE (1999)vol 1 IEEE pp 126ndash134

[23] CANAS C ZHANG K KEMME B KIENZLE J AND JACOBSENH-A Publishsubscribe network designs for multiplayer games InProceedings of the 15th International Middleware Conference (NewYork NY USA 2014) Middleware rsquo14 ACM pp 241ndash252

[24] DUKKIPATI N REFICE T CHENG Y CHU J HERBERT TAGARWAL A JAIN A AND SUTIN N An argument for increasingtcprsquos initial congestion window SIGCOMM Comput Commun Rev 403 (June 2010) 26ndash33

[25] FISK M AND FENG W-C Dynamic right-sizing in tcp httplib-www lanl govla-pubs00796247 pdf (2001) 2

[26] GARRETT J J ET AL Ajax A new approach to web applications[27] HAN S MARSHALL S CHUN B-G AND RATNASAMY S

Megapipe A new programming interface for scalable network io InProceedings of the 10th USENIX Conference on Operating SystemsDesign and Implementation (Berkeley CA USA 2012) OSDIrsquo12USENIX Association pp 135ndash148

[28] HARJI A S BUHR P A AND BRECHT T Comparing high-performance multi-core web-server architectures In Proceedings of the5th Annual International Systems and Storage Conference (New YorkNY USA 2012) SYSTOR rsquo12 ACM pp 11ndash112

[29] HASSAN O A-H AND SHARGABI B A A scalable and efficientweb 20 reader platform for mashups Int J Web Eng Technol 7 4(Dec 2012) 358ndash380

[30] HUANG Q BIRMAN K VAN RENESSE R LLOYD W KUMAR SAND LI H C An analysis of facebook photo caching In Proceedingsof the Twenty-Fourth ACM Symposium on Operating Systems Principles(New York NY USA 2013) SOSP rsquo13 ACM pp 167ndash181

[31] HUNT P KONAR M JUNQUEIRA F P AND REED B ZookeeperWait-free coordination for internet-scale systems In Proceedings of the2010 USENIX Conference on USENIX Annual Technical Conference(Berkeley CA USA 2010) USENIXATCrsquo10 USENIX Associationpp 11ndash11

[32] KROHN M KOHLER E AND KAASHOEK M F Events can makesense In 2007 USENIX Annual Technical Conference on Proceedings ofthe USENIX Annual Technical Conference (Berkeley CA USA 2007)ATCrsquo07 USENIX Association pp 71ndash714

[33] KROHN M KOHLER E AND KAASHOEK M F Simplified eventprogramming for busy network applications In Proceedings of the 2007USENIX Annual Technical Conference (Santa Clara CA USA (2007)

[34] LEVER C ERIKSEN M A AND MOLLOY S P An analysis ofthe tux web server Tech rep Center for Information TechnologyIntegration 2000

[35] LI C SHEN K AND PAPATHANASIOU A E Competitive prefetch-ing for concurrent sequential io In Proceedings of the 2Nd ACMSIGOPSEuroSys European Conference on Computer Systems 2007(New York NY USA 2007) EuroSys rsquo07 ACM pp 189ndash202

[36] NISHTALA R FUGAL H GRIMM S KWIATKOWSKI M LEEH LI H C MCELROY R PALECZNY M PEEK D SAABP STAFFORD D TUNG T AND VENKATARAMANI V Scalingmemcache at facebook In Presented as part of the 10th USENIXSymposium on Networked Systems Design and Implementation (NSDI13) (Lombard IL 2013) USENIX pp 385ndash398

[37] PAI V S DRUSCHEL P AND ZWAENEPOEL W Flash An efficientand portable web server In Proceedings of the Annual Conferenceon USENIX Annual Technical Conference (Berkeley CA USA 1999)ATEC rsquo99 USENIX Association pp 15ndash15

[38] PARIAG D BRECHT T HARJI A BUHR P SHUKLA A ANDCHERITON D R Comparing the performance of web server archi-tectures In Proceedings of the 2Nd ACM SIGOPSEuroSys EuropeanConference on Computer Systems 2007 (New York NY USA 2007)EuroSys rsquo07 ACM pp 231ndash243

[39] SOARES L AND STUMM M Flexsc Flexible system call schedulingwith exception-less system calls In Proceedings of the 9th USENIXConference on Operating Systems Design and Implementation (BerkeleyCA USA 2010) OSDIrsquo10 USENIX Association pp 33ndash46

[40] VON BEHREN R CONDIT J AND BREWER E Why events area bad idea (for high-concurrency servers) In Proceedings of the 9thConference on Hot Topics in Operating Systems - Volume 9 (BerkeleyCA USA 2003) HOTOSrsquo03 USENIX Association pp 4ndash4

[41] VON BEHREN R CONDIT J ZHOU F NECULA G C ANDBREWER E Capriccio Scalable threads for internet services InProceedings of the Nineteenth ACM Symposium on Operating SystemsPrinciples (New York NY USA 2003) SOSP rsquo03 ACM pp 268ndash281

[42] WELSH M CULLER D AND BREWER E Seda An architecturefor well-conditioned scalable internet services In Proceedings of theEighteenth ACM Symposium on Operating Systems Principles (NewYork NY USA 2001) SOSP rsquo01 ACM pp 230ndash243

[43] ZELDOVICH N YIP A DABEK F MORRIS R MAZIERES DAND KAASHOEK M F Multiprocessor support for event-drivenprograms In USENIX Annual Technical Conference General Track(2003) pp 239ndash252

  • Introduction
  • Background and Motivation
    • RPC vs Asynchronous Network IO
    • Performance Degradation after Tomcat Upgrade
      • Inefficient Event Processing Flow in Asynchronous Servers
      • Write-Spin Problem of Asynchronous Invocation
        • Profiling Results
        • Network Latency Exaggerates the Write-Spin Problem
          • Solution
            • Mitigating Context Switches and Write-Spin Using Netty
            • A Hybrid Solution
            • Validation of HybridNetty
              • Related Work
              • Conclusions
              • Appendix A RUBBoS Experimental Setup
              • References
Page 3: Improving Asynchronous Invocation Performance in Client

Fig. 2: Throughput comparison between TomcatSync and TomcatAsync under different workload concurrencies and response sizes: (a) the 0.1KB response size case, (b) the 10KB response size case, (c) the 100KB response size case. Each subfigure plots throughput [req/s] against workload concurrency [# of connections] from 1 to 3200 and marks the crossover point between the two servers. As the response size increases from 0.1KB in (a) to 100KB in (c), TomcatSync outperforms TomcatAsync over a wider concurrency range, indicating the performance degradation of TomcatAsync with large response sizes.

after the popular news website Slashdot [15]. Our experiments adopt a typical 3-tier configuration with 1 Apache web server, 1 Tomcat application server, and 1 MySQL database server (details in Appendix A). At the beginning we use Tomcat 7 (noted as TomcatSync), which uses a thread-based synchronous connector for inter-tier communication. We then upgrade the Tomcat server to Version 8 (the latest version at the time, noted as TomcatAsync), which by default uses an asynchronous connector, with the expectation of a system performance improvement after the Tomcat upgrade.

Unfortunately, we observed a surprising system performance degradation after the Tomcat server upgrade, as shown in Figure 1. We call the system with TomcatSync SYS_tomcatV7 and the one with TomcatAsync SYS_tomcatV8. The figure shows that SYS_tomcatV7 saturates at workload 11,000 while SYS_tomcatV8 saturates at workload 9,000. At workload 11,000, SYS_tomcatV7 outperforms SYS_tomcatV8 by 28% in throughput, and its average response time is one order of magnitude lower (226ms vs. 2820ms). Such a result is counter-intuitive since we upgraded Tomcat from an older thread-based version to a newer asynchronous one. We note that in both cases the Tomcat server CPU is the bottleneck resource in the system; all the hardware resources (e.g., CPU and memory) of the other component servers are far from saturation (< 60%).

We use Collectl [2] to collect system-level metrics (e.g., CPU utilization and context switches). Another interesting phenomenon we observed is that TomcatAsync encounters a significantly higher number of context switches than TomcatSync at the same workload. For example, at workload 10,000, TomcatAsync encounters 12,950 context switches per second while TomcatSync encounters only 5,930, less than half as many. It is reasonable to suggest that the high context switch rate in TomcatAsync causes high CPU overhead, leading to inferior throughput compared to TomcatSync. However, traditional wisdom tells us that a server with an asynchronous architecture should have fewer context switches than a thread-based server. So why did we observe the opposite here? We discuss the cause in the next section.

Fig. 3: Illustration of the event processing flow when TomcatAsync processes one request: the reactor thread dispatches the read event to worker A (reading/processing the request), worker A generates a write event, the reactor thread dispatches the write event to worker B (sending the response), and control then returns to the reactor thread; in total, four context switches between the reactor thread and the worker threads.

III. INEFFICIENT EVENT PROCESSING FLOW IN ASYNCHRONOUS SERVERS

In this section we explain why the performance of the 3-tier benchmark system degrades after we upgrade Tomcat from the thread-based version TomcatSync to the asynchronous version TomcatAsync. To simplify and quantify our analysis, we design micro-benchmarks to test the performance of both versions of Tomcat.

We use JMeter [1] to generate HTTP requests that access the standalone Tomcat directly. These HTTP requests are categorized into three types: small, medium, and large, for which the Tomcat server (either TomcatSync or TomcatAsync) first conducts some simple computation before responding with 0.1KB, 10KB, and 100KB of in-memory data, respectively. We choose these three sizes because they are representative response sizes in our RUBBoS benchmark application. JMeter uses one thread to simulate each end-user. We set the think time between consecutive requests sent from the same thread to zero; thus we can precisely control the workload concurrency at the target Tomcat server by specifying the number of threads in JMeter.
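As an illustration only, the request handler behind these micro-benchmarks can be thought of as the following hedged sketch (the class name and the "size" request parameter are hypothetical, not our actual implementation): a small in-memory payload of 0.1KB, 10KB, or 100KB is selected per request and written out after a trivial computation.

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical sketch of the micro-benchmark request handler: the response
// payload is pre-generated in memory and the handler performs a small
// computation before writing it out, mirroring the small/medium/large types.
public class PayloadServlet extends HttpServlet {
    private static final byte[] SMALL  = new byte[100];        // ~0.1KB
    private static final byte[] MEDIUM = new byte[10 * 1024];  // 10KB
    private static final byte[] LARGE  = new byte[100 * 1024]; // 100KB

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws java.io.IOException {
        String size = req.getParameter("size");   // hypothetical request parameter
        byte[] payload = "large".equals(size) ? LARGE
                       : "medium".equals(size) ? MEDIUM : SMALL;
        resp.setContentLength(payload.length);    // simple computation precedes this point
        resp.getOutputStream().write(payload);
    }
}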

We compare the server throughput between TomcatSync and TomcatAsync under different workload concurrencies and response sizes, as shown in Figure 2. The three sub-figures show that as the workload concurrency increases from 1 to 3200, TomcatAsync achieves lower throughput than TomcatSync before a certain workload concurrency. For example, TomcatAsync performs worse than TomcatSync

TABLE I: TomcatAsync has more context switches than TomcatSync under workload concurrency 8.

Response size   TomcatAsync [x1000/sec]   TomcatSync [x1000/sec]
0.1KB           40                        16
10KB            25                        7
100KB           28                        2

before workload concurrency 64 when the response size is 10KB, and the crossover point occurs at an even higher workload concurrency (1600) when the response size increases to 100KB. Returning to our previous 3-tier RUBBoS experiments, our measurements show that under the RUBBoS workload conditions the average response size of Tomcat per request is about 20KB, and the workload concurrency for Tomcat is about 35 when the system saturates. So, based on our micro-benchmark results in Figure 2, it is not surprising that TomcatAsync performs worse than TomcatSync. Since Tomcat is the bottleneck server of the 3-tier system, the performance degradation of Tomcat also leads to the performance degradation of the whole system (see Figure 1). The remaining question is why TomcatAsync performs worse than TomcatSync before a certain workload concurrency.

We found that the performance degradation of TomcatAsync results from its inefficient event processing flow, which generates a significant amount of intermediate context switches, causing non-trivial CPU overhead. Table I compares the context switches of TomcatAsync and TomcatSync at workload concurrency 8. The table shows results consistent with what we observed in the previous RUBBoS experiments: the asynchronous TomcatAsync encounters significantly more context switches than the thread-based TomcatSync given the same workload concurrency and server response size. Our further analysis reveals that the high context switch rate of TomcatAsync is caused by its poor design of the event processing flow. Concretely, TomcatAsync adopts the second design of asynchronous servers (see Section II-A), which uses a reactor thread for event monitoring and a worker thread pool for event handling. Figure 3 illustrates the event processing flow in TomcatAsync.

Thus, to handle one client request, there are in total 4 context switches among the user-space threads in TomcatAsync (see steps 1-4 in Figure 3). Such an inefficient event processing flow design also exists in many popular asynchronous servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3]. On the other hand, in TomcatSync each client request is handled by a dedicated worker thread, from the initial reading of the request to preparing the response to sending the response out. No context switch occurs during the processing of the request unless the worker thread is interrupted or swapped out by the operating system.
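To make the four switches concrete, the following is a minimal sketch (not Tomcat's actual code) of the one-event-one-handler pattern of Figure 3, written with plain Java NIO; the pool size and structure are illustrative assumptions.

import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the one-event-one-handler flow: the reactor thread only monitors
// events; every event is handed to the worker pool, so one request costs four
// reactor<->worker context switches.
public class ReactorWithWorkerPool {
    private final ExecutorService workers = Executors.newFixedThreadPool(8);

    void eventLoop(Selector selector) throws java.io.IOException {
        while (true) {                                        // reactor thread
            selector.select();
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                key.interestOps(0);                           // stop monitoring while a worker owns the key
                if (key.isReadable()) {
                    workers.submit(() -> {                    // (1) read event dispatched to worker A
                        // ... read and process the request, prepare the response ...
                        key.interestOps(SelectionKey.OP_WRITE); // (2) write event registered back with the reactor
                        key.selector().wakeup();
                    });
                } else if (key.isWritable()) {
                    workers.submit(() -> {                    // (3) write event dispatched to worker B
                        // ... send the response ...
                        key.interestOps(SelectionKey.OP_READ);  // (4) control handed back to the reactor
                        key.selector().wakeup();
                    });
                }
            }
        }
    }
}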

To better quantify the impact of context switches on the performance of different server architectures, we simplify the implementation of TomcatAsync and TomcatSync by removing all the unrelated modules (e.g., servlet life cycle management, cache management, and logging) and keeping only the essential code related to request processing, which we

TABLE II: Context switches among user-space threads when the server processes one client request.

Server type         Context switches   Note
sTomcat-Async       4                  Read and write events are handled by different worker threads (Figure 3)
sTomcat-Async-Fix   2                  Read and write events are handled by the same worker thread
sTomcat-Sync        0                  Dedicated worker thread for each request; context switches occur only due to interrupts or CPU time slice expiration
SingleT-Async       0                  No context switches; one thread handles both event monitoring and processing

refer to as sTomcat-Async (simplified TomcatAsync) and sTomcat-Sync (simplified TomcatSync). As a reference, we implement two alternative designs of asynchronous servers aiming to reduce the frequency of context switches. The first alternative design, which we call sTomcat-Async-Fix, merges the processing of the read event and the write event of the same request by using the same worker thread. In this case, once a worker thread finishes preparing the response, it continues to send the response out (steps 2 and 3 in Figure 3 no longer exist); thus processing one client request only requires two context switches: from the reactor thread to a worker thread and from the same worker thread back to the reactor thread. The second alternative design is the traditional single-threaded asynchronous server, in which a single thread is responsible for both event monitoring and processing. The single-threaded implementation, which we refer to as SingleT-Async, is supposed to have the fewest context switches. Table II summarizes the context switches of each server type when it processes one client request.¹ Interested readers can check out our server implementations on GitHub [13] for further reference.
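For reference, the following is a minimal sketch of the SingleT-Async style (an echo-like loop rather than our actual implementation, which is available at [13]): a single thread performs both event monitoring and event handling, so serving a request requires no user-space context switch.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Sketch of a single-threaded asynchronous server: one thread accepts
// connections, monitors readiness events, and processes requests.
public class SingleThreadedAsyncServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));   // example port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        ByteBuffer buf = ByteBuffer.allocate(16 * 1024);
        while (true) {
            selector.select();
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel c = server.accept();
                    c.configureBlocking(false);
                    c.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel c = (SocketChannel) key.channel();
                    buf.clear();
                    if (c.read(buf) < 0) { c.close(); continue; }
                    buf.flip();
                    c.write(buf);   // process and respond on the same thread;
                                    // a non-blocking write may be partial (see Section IV)
                }
            }
        }
    }
}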

We compare the throughput and context switches among the four types of servers under increasing workload concurrencies and server response sizes, as shown in Figure 4. Comparing Figures 4(a) and 4(d), the maximum achievable throughput of each server type is negatively correlated with its context switch frequency during the runtime experiments. For example, at workload concurrency 16, sTomcat-Async-Fix outperforms sTomcat-Async by 22% in throughput while incurring 34% fewer context switches. In our experiments the CPU demand of each request is positively correlated with the response size; a small response size means a small CPU computation demand, so the portion of CPU cycles wasted on context switches becomes large. As a result, the gap in context switches between sTomcat-Async-Fix and sTomcat-Async reflects their throughput difference. This hypothesis is further validated by the performance of SingleT-Async and sTomcat-Sync, which outperform sTomcat-Async by 91% and 57% in throughput, respectively (see Figure 4(a)). This performance difference is also due to fewer context switches, as shown in Figure 4(d). For example, the context switch rate of SingleT-Async is a few hundred per second, three orders of magnitude less than that of sTomcat-Async.

¹ To simplify analysis and reasoning, we do not count the context switches caused by interrupts or by the operating system swapping threads out.

Fig. 4: Throughput and context switch comparison among the different server architectures (SingleT-Async, sTomcat-Sync, sTomcat-Async-Fix, sTomcat-Async) as the server response size increases from 0.1KB to 100KB: (a)-(c) throughput [req/s] and (d)-(f) context switches [/s] versus workload concurrency [# of connections] for 0.1KB, 10KB, and 100KB responses, respectively. Subfigures (a) and (d) show that the maximum achievable throughput of each server type is negatively correlated with its context switch frequency when the server response size is small (0.1KB). However, as the response size increases to 100KB, (c) shows that sTomcat-Sync outperforms the asynchronous servers before workload concurrency 400, indicating that factors other than context switches cause overhead in asynchronous servers.

We note that as the server response size becomes larger, the portion of CPU overhead caused by context switches becomes smaller, since more CPU cycles are consumed by processing requests and sending responses. This is the case shown in Figures 4(b) and 4(c), where the response sizes are 10KB and 100KB, respectively. The throughput difference among the four server architectures becomes narrower, indicating less performance impact from context switches.

In fact, an interesting phenomenon is observed as the response size increases to 100KB. Figure 4(c) shows that SingleT-Async performs worse than the thread-based sTomcat-Sync before workload concurrency 400, even though SingleT-Async has far fewer context switches than sTomcat-Sync, as shown in Figure 4(f). This observation suggests that there are other factors causing overhead in the asynchronous SingleT-Async but not in the thread-based sTomcat-Sync when the server response size is large, which we discuss in the next section.

IV. WRITE-SPIN PROBLEM OF ASYNCHRONOUS INVOCATION

In this section we study the performance degradation problem of an asynchronous server sending large responses. We use fine-grained profiling tools such as Collectl [2] and JProfiler [4] to analyze the detailed CPU usage and some key system calls invoked by servers with different architectures. We found that it is the default small TCP send buffer size and the TCP wait-ACK mechanism that lead to a severe write-spin problem when sending a relatively large response, which causes significant CPU overhead for asynchronous servers. We also explored several network-related factors that can exacerbate the negative impact of the write-spin problem, further degrading the performance of an asynchronous server.

A. Profiling Results

Recall that Figure 4(a) shows that when the response size is small (i.e., 0.1KB), the throughput of the asynchronous SingleT-Async is 20% higher than that of the thread-based sTomcat-Sync at workload concurrency 8. However, as the response size increases to 100KB, SingleT-Async's throughput is surprisingly 31% lower than sTomcat-Sync's under the same workload concurrency 8 (see Figure 4(c)). Since the only change is the response size, it is natural to speculate that a large response size brings significant overhead for SingleT-Async but not for sTomcat-Sync.

To investigate the performance degradation of SingleT-Async when the response size is large, we first use Collectl [2] to analyze the detailed CPU usage of the server with different response sizes, as shown in Table III. The workload concurrency for both SingleT-Async and sTomcat-Sync is 100, and the CPU is 100% utilized under this workload concurrency. As the response size for both server architectures increases from 0.1KB to 100KB, the table shows that the user-space CPU utilization of sTomcat-Sync increases by 25% (from 55% to 80%) versus 34% (from 58% to 92%) for SingleT-Async. This comparison suggests that increasing the response size has more impact on the user-space CPU utilization of the asynchronous SingleT-Async than on that of the thread-based sTomcat-Sync.

We further use JProfiler [4] to profile the SingleT-Async case when the response size increases from 0.1KB to 100KB

TABLE III: SingleT-Async consumes more user-space CPU than sTomcat-Sync. The workload concurrency is kept at 100.

Server Type              sTomcat-Sync          SingleT-Async
Response Size            0.1KB      100KB      0.1KB      100KB
Throughput [req/sec]     35,000     590        42,800     520
User CPU total [%]       55         80         58         92
System CPU total [%]     45         20         42         8

TABLE IV: The write-spin problem occurs when the response size is 100KB. This table shows the total number of socket.write() calls in SingleT-Async with different response sizes during a one-minute experiment.

Resp. size   # req     # write()   socket.write() per req
0.1KB        238,530   238,530     1
10KB         9,400     9,400       1
100KB        2,971     303,795     102

and examine what has changed at the application level. We found that the frequency of the socket.write() system call is especially high in the 100KB case, as shown in Table IV. We note that socket.write() is called when a server sends a response back to the corresponding client. In the case of a thread-based server like sTomcat-Sync, socket.write() is called only once per client request. While such one-write-per-request behavior also holds for the 0.1KB and 10KB cases in SingleT-Async, it calls socket.write() on average 102 times per request in the 100KB case. System calls in general are expensive due to the related kernel crossing overhead [20], [39]; thus the high frequency of socket.write() in the 100KB case helps explain the high user-space CPU overhead of SingleT-Async shown in Table III.

Our further analysis shows that the multiple-socket-write problem of SingleT-Async is due to the small default TCP send buffer size (16KB) of each TCP connection and the TCP wait-ACK mechanism. When a processing thread tries to copy 100KB of data from user space to the kernel-space TCP send buffer through the system call socket.write(), the first socket.write() can copy at most 16KB of data into the send buffer, which is organized as a byte buffer ring. A TCP sliding window is set by the kernel to decide how much data can actually be sent to the client; the sliding window can move forward and free up buffer space for new data only after the server receives the ACKs of the previously sent-out packets. Since socket.write() is a non-blocking system call in SingleT-Async, it returns the number of bytes written to the TCP send buffer each time it is called; the system call returns zero if the TCP send buffer is full, leading to the write-spin problem. The whole process is illustrated in Figure 5. On the other hand, when a worker thread in the synchronous sTomcat-Sync tries to copy 100KB of data from user space to the kernel-space TCP send buffer, only one blocking socket.write() system call is invoked for each request; the worker thread waits until the kernel sends the 100KB response out, and the write-spin problem is avoided.
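The write-spin can be seen directly in the shape of the non-blocking send loop; the following is a simplified sketch (class and method names are illustrative) of what an asynchronous worker effectively does when pushing a 100KB response through a 16KB send buffer.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Sketch of the write-spin: a non-blocking write() returns the number of bytes
// actually copied into the TCP send buffer (possibly zero when the buffer is
// full), so the loop keeps re-invoking the system call until ACKs from the
// client free up buffer space -- about 102 calls per 100KB response in Table IV.
class WriteSpinSketch {
    static int spinningWrite(SocketChannel channel, ByteBuffer response) throws IOException {
        int writeCalls = 0;
        while (response.hasRemaining()) {
            channel.write(response);   // may return 0; the loop then spins
            writeCalls++;
        }
        return writeCalls;
    }
}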

Fig. 5: Illustration of the write-spin problem in an asynchronous server. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, the worker thread write-spins on the system call socket.write() (each call returning the number of bytes written, or zero when the buffer is full) and can only send more data once ACKs for the previously sent packets come back from the client; the data copy to the kernel finishes only after several such rounds.

An intuitive solution is to increase the TCP send buffer size to match the server response size so as to avoid the write-spin problem. Our experimental results indeed show the effectiveness of manually increasing the TCP send buffer size to solve the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance. For example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, the size of which can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]. For example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes because of the large amount of static and dynamic content (e.g., images and database query results), all of which can be pushed back in answer to one client request. Third, setting a large TCP send buffer for each TCP connection to prepare for the peak response size consumes a large amount of memory on a server that may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance to prevent the write-spin problem.
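For completeness, with Java NIO the per-connection send buffer can be requested via the standard SO_SNDBUF socket option, as in the sketch below; the 100KB value is an assumption sized to our largest response, and the factors above explain why picking such a value in advance is hard (the kernel may also adjust the effective size).

import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

// Sketch: enlarge the per-connection TCP send buffer so a full response fits
// into a single socket.write(). The 100KB value is an example, not a general
// recommendation.
class SendBufferTuning {
    static void enlargeSendBuffer(SocketChannel channel) throws IOException {
        channel.setOption(StandardSocketOptions.SO_SNDBUF, 100 * 1024);
    }
}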

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on runtime network conditions. Once turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the Bandwidth-

Fig. 6: The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled. The figure compares the throughput [req/s] of SingleT-Async-100KB (fixed 100KB send buffer) and SingleT-Async-autotuning as the network latency increases from ~0ms to ~20ms.

Delay Product rule [17]; it lacks application-level information such as the response size. Therefore the auto-tuned send buffer may be large enough to maximize throughput over the link yet still inadequate for the application, which can still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than the same server with a large fixed TCP send buffer (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference is even bigger when there is non-trivial network latency between the client and the server, which is the topic of the next subsection.

B. Network Latency Exaggerates the Write-Spin Problem

Network latency is common in cloud data centers. The component servers of an n-tier application may run on VMs located on different physical nodes, across different racks, or even in different data centers, so the latency between them can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by network latency.

Figure 7 shows the impact of network latency on the performance of different types of servers. In this set of experiments we keep the workload concurrency from clients at 100 all the time. The response size of each client request is 100KB, and the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc" (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that the surprising throughput degradation results from the response time amplification that occurs when the write-spin problem happens. This is because sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each round has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small increase in network latency is amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency

Fig. 7: (a) Throughput comparison and (b) response time comparison of SingleT-Async, sTomcat-Async-Fix, and sTomcat-Sync as the network latency increases from ~0ms to ~20ms. The throughput degradation of the two asynchronous servers in subfigure (a) results from the response time amplification shown in (b).

is added. According to Little's Law, a server's throughput is negatively correlated with its response time given that the workload concurrency (the number of queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18 to 3.60 seconds) means a 95% decrease in the throughput of SingleT-Async, as shown in Figure 7(a).
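As a quick sanity check with Little's Law (N = X × R, where N is the workload concurrency, X the throughput, and R the response time), these numbers are consistent: with N fixed at 100, X ≈ 100 / 0.18s ≈ 556 req/s without added latency versus 100 / 3.60s ≈ 28 req/s with ~5ms latency, i.e., roughly a 95% throughput drop, matching Figure 7(a).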

V. SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem also widely exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which is designed to mitigate the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but which carries non-trivial optimization overhead. Then we propose a hybrid solution that takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

A. Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server, using a reactor thread to accept new connections and a worker thread pool to process the various I/O events of each connection.

Though it uses a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes

Fig. 8: Netty mitigates the write-spin problem by runtime checking. The application thread keeps calling socket.write() only while (1) the last return size is non-zero, (2) the writeSpin counter is below the threshold TH, and (3) the bytes copied so far are less than the data size; it jumps out of the write loop and moves on to the next event processing if any of the three conditions is not met, finishing once the data copy to the kernel completes.

Fig. 9: Throughput comparison of SingleT-Async, NettyServer, and sTomcat-Sync under various workload concurrencies and response sizes: (a) the 100KB response size case, (b) the 0.1KB response size case. The default TCP send buffer size is 16KB. Subfigure (a) shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem, and (b) shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

the role of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase); it then dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to the worker threads. In this case the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event, where the output of each handler is the input to the next (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing unnecessary context switches between the reactor thread and the worker threads.
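The following sketch shows this structure with the standard Netty 4 bootstrap API; the port and the pipeline contents are illustrative assumptions, and an application server built on Netty would append its own handlers to the same pipeline.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.HttpServerCodec;

// Sketch of the Netty (v4) design described above: the boss group plays the
// reactor role and only accepts connections; each worker event loop both
// monitors and handles the I/O events of the connections assigned to it, and
// handlers are chained in a per-connection pipeline, so no intermediate events
// need to be dispatched between threads.
public class NettyServerSketch {
    public static void main(String[] args) throws Exception {
        EventLoopGroup boss = new NioEventLoopGroup(1);    // accepts new connections
        EventLoopGroup workers = new NioEventLoopGroup();  // event monitoring + handling
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // handler chain: HTTP decode/encode first; application
                        // handlers (request processing, response writing) would
                        // be appended to the same pipeline
                        ch.pipeline().addLast(new HttpServerCodec());
                    }
                });
            bootstrap.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}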

In order to mitigate the write-spin problem, Netty adopts a write-spin check when a worker thread calls socket.write() to copy a large response into the kernel, as shown in Figure 8. Concretely, each worker thread in Netty

maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, the return_size is zero, indicating that the TCP send buffer is already full; second, the writeSpin counter exceeds a pre-defined threshold (the default value is 16 in Netty v4). Once it jumps out, the worker thread saves the context and resumes the current connection's data transfer after it loops over the other connections with pending events. Such write optimization prevents the worker thread from being blocked by a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
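A simplified sketch of this jump-out logic is given below; it paraphrases the behavior described above and is not Netty's actual source. The threshold parameter corresponds to the writeSpin limit, 16 by default in Netty v4.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Sketch of Netty-style write-spin checking: stop writing when the send buffer
// is full (write() returned 0) or when the spin counter reaches the threshold.
// A false return means the connection's remaining data is saved and resumed
// after the worker loops over other connections with pending events.
class WriteSpinCheckSketch {
    static boolean tryWrite(SocketChannel ch, ByteBuffer data, int writeSpinThreshold)
            throws IOException {
        for (int spin = 0; spin < writeSpinThreshold && data.hasRemaining(); spin++) {
            if (ch.write(data) == 0) {
                break;   // TCP send buffer full: give up for now
            }
        }
        return !data.hasRemaining();
    }
}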

We validate the effectiveness of Netty in mitigating the write-spin problem, and also the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. The figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best of the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% less than that of SingleT-Async in the 0.1KB response case, indicating the non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore neither NettyServer nor SingleT-Async is able to achieve the best performance under all workload conditions.

B. A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, no single asynchronous solution always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution that utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

Fig. 10: Worker thread processing flow in the hybrid solution. In the event monitoring phase the worker thread selects over the pool of connections with pending events and gets the next available connection; in the event handling phase it checks the request type and either follows the NettyServer path (parsing and encoding followed by the write operation optimization) or the SingleT-Async path (parsing and encoding followed by a plain socket.write()).

The first assumption rules out initiating the server with a large but fixed TCP send buffer size for each connection to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature of HTTP/2.0) discussed in Section IV-A. The second assumption rules out a worker thread being blocked by disk I/O activities. This assumption is also reasonable, since in-memory workloads have become common for modern internet services because of near-zero latency requirements [30]; for example, MemCached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem at runtime. In an initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category each request belongs to. Thus, when HybridNetty receives a new incoming request, it first checks the map object to determine the category and then chooses the most efficient execution path. In practice the response size of even the same type of request may change over time (due to runtime environment changes such as the dataset), so we update the map object at runtime once a request is detected to be classified into the wrong category, in order to keep track of the latest category of such requests.
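The dispatch logic can be sketched as follows (class and method names are illustrative, not our actual implementation): a concurrent map remembers whether a request type has been observed to write-spin, each incoming request takes the corresponding path, and misclassified request types are re-labeled at runtime.

import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of HybridNetty-style dispatch: requests whose responses
// have been observed to write-spin are marked "heavy" and take the Netty-style
// write-optimized path; all others take the lightweight single-threaded path.
class HybridDispatcherSketch {
    private final ConcurrentHashMap<String, Boolean> heavy = new ConcurrentHashMap<>();

    void handle(String requestType, Runnable nettyStylePath, Runnable singleThreadedPath) {
        if (heavy.getOrDefault(requestType, Boolean.FALSE)) {
            nettyStylePath.run();      // write-optimized path (NettyServer-like)
        } else {
            singleThreadedPath.run();  // lightweight path (SingleT-Async-like)
        }
    }

    // Called when the observed write behavior contradicts the current label,
    // e.g. a request classified as light actually write-spins (or vice versa).
    void reclassify(String requestType, boolean causedWriteSpin) {
        heavy.put(requestType, causedWriteSpin);
    }
}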

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have large response

sizes (e.g., 100KB), and the light requests, which have small response sizes (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different realistic workload scenarios. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we adopt a normalized throughput comparison and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the request distribution of real web applications typically follows a Zipf-like distribution in which light requests dominate the workload [22], our hybrid solution is all the more relevant for realistic workloads. In addition, SingleT-Async performs much worse than the other two when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than an asynchronous event-driven one. For example, von Behren et al. develop a thread-based web server, Knot [40], which can compete with event-driven servers on high-concurrency workloads by using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

Optimization of asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, epoll, or I/O operations under high-concurrency workloads. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

Fig. 11: The hybrid solution performs the best under different mixes of light/heavy request workloads, with or without network latency: (a) no network latency between client and server, (b) ~5ms network latency between client and server. Each subfigure plots the normalized throughput of SingleT-Async, NettyServer, and HybridNetty against the ratio of large-size responses (0%, 2%, 5%, 10%, 20%, 100%). The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference we compare the normalized throughput and use HybridNetty as the baseline.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µServer) and pipeline (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µServer by modifying the strategy for accepting new connections based on different workload characteristics. Our work is also closely related to the Google team's research on TCP's congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections; our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version as a result of an inferior event processing flow that creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers: the write-spin problem when handling large responses, together with the associated exaggerating factors such as network latency (Section IV). Since no one solution fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload/network conditions into consideration.

ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through CISE's CNS (1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or

Fig. 12: Details of the RUBBoS experimental setup: (a) software and hardware setup, and (b) a 1/1/1 sample topology in which clients send HTTP requests through the Apache web server, Tomcat, and MySQL.

grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

APPENDIX A
RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. The workload generator has a similar design to those of other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Subsection II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.

REFERENCES

[1] Apache JMeter. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] ADLER, S. The Slashdot effect: an analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] ADYA, A., HOWELL, J., THEIMER, M., BOLOSKY, W. J., AND DOUCEUR, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289-302.
[17] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP congestion control. Tech. rep., 2009.
[18] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45-58.
[19] BELSHE, M., THOMSON, M., AND PEON, R. Hypertext transfer protocol version 2 (HTTP/2).
[20] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43-57.
[21] BRECHT, T., PARIAG, D., AND GAMMO, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20-20.
[22] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126-134.
[23] CANAS, C., ZHANG, K., KEMME, B., KIENZLE, J., AND JACOBSEN, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241-252.
[24] DUKKIPATI, N., REFICE, T., CHENG, Y., CHU, J., HERBERT, T., AGARWAL, A., JAIN, A., AND SUTIN, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26-33.
[25] FISK, M., AND FENG, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] GARRETT, J. J., ET AL. Ajax: A new approach to web applications.
[27] HAN, S., MARSHALL, S., CHUN, B.-G., AND RATNASAMY, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135-148.
[28] HARJI, A. S., BUHR, P. A., AND BRECHT, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1-1:12.
[29] HASSAN, O. A.-H., AND SHARGABI, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358-380.
[30] HUANG, Q., BIRMAN, K., VAN RENESSE, R., LLOYD, W., KUMAR, S., AND LI, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167-181.
[31] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC '10, USENIX Association, pp. 11-11.
[32] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1-7:14.
[33] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA) (2007).
[34] LEVER, C., ERIKSEN, M. A., AND MOLLOY, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] LI, C., SHEN, K., AND PAPATHANASIOU, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189-202.
[36] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385-398.
[37] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15-15.
[38] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231-243.
[39] SOARES, L., AND STUMM, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33-46.
[40] VON BEHREN, R., CONDIT, J., AND BREWER, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4-4.
[41] VON BEHREN, R., CONDIT, J., ZHOU, F., NECULA, G. C., AND BREWER, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268-281.
[42] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230-243.
[43] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIERES, D., AND KAASHOEK, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239-252.

  • Introduction
  • Background and Motivation
    • RPC vs Asynchronous Network IO
    • Performance Degradation after Tomcat Upgrade
      • Inefficient Event Processing Flow in Asynchronous Servers
      • Write-Spin Problem of Asynchronous Invocation
        • Profiling Results
        • Network Latency Exaggerates the Write-Spin Problem
          • Solution
            • Mitigating Context Switches and Write-Spin Using Netty
            • A Hybrid Solution
            • Validation of HybridNetty
              • Related Work
              • Conclusions
              • Appendix A RUBBoS Experimental Setup
              • References
Page 4: Improving Asynchronous Invocation Performance in Client

TABLE I: TomcatAsync has more context switches than TomcatSync under workload concurrency 8. Context switches are reported in thousands per second.

Response size    TomcatAsync    TomcatSync
0.1KB            40             16
10KB             25             7
100KB            28             2

before the workload concurrency 64 when the response size is 10KB, and the crossover workload concurrency is even higher (1600) when the response size increases to 100KB. Returning to our previous 3-tier RUBBoS experiments, our measurements show that under the RUBBoS workload conditions the average response size of Tomcat per request is about 20KB and the workload concurrency for Tomcat is about 35 when the system saturates. Based on our micro-benchmark results in Figure 2, it is therefore not surprising that TomcatAsync performs worse than TomcatSync. Since Tomcat is the bottleneck server of the 3-tier system, the performance degradation of Tomcat also degrades the whole system (see Figure 1). The remaining question is why TomcatAsync performs worse than TomcatSync before a certain workload concurrency.

We found that the performance degradation of TomcatAsync results from its inefficient event processing flow, which generates a significant number of intermediate context switches and causes non-trivial CPU overhead. Table I compares the context switches between TomcatAsync and TomcatSync at workload concurrency 8. The table is consistent with what we observed in the previous RUBBoS experiments: the asynchronous TomcatAsync encounters significantly more context switches than the thread-based TomcatSync given the same workload concurrency and server response size. Our further analysis reveals that the high context switch rate of TomcatAsync is caused by its poor design of the event processing flow. Concretely, TomcatAsync adopts the second design of asynchronous servers (see Section II-A), which uses a reactor thread for event monitoring and a worker thread pool for event handling. Figure 3 illustrates the event processing flow in TomcatAsync.

As a result, handling one client request involves a total of 4 context switches among the user-space threads in TomcatAsync (see steps 1-4 in Figure 3). Such an inefficient event processing flow design also exists in many popular asynchronous servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3]. In TomcatSync, on the other hand, each client request is handled by a dedicated worker thread, from the initial reading of the request to preparing the response to sending the response out. No context switch occurs during the processing of the request unless the worker thread is interrupted or swapped out by the operating system.

To better quantify the impact of context switches on the performance of different server architectures, we simplify the implementation of TomcatAsync and TomcatSync by removing all the unrelated modules (e.g., servlet life cycle management, cache management, and logging) and only keeping the essential code related to request processing, which we

TABLE II: Context switches among user-space threads when the server processes one client request.

Server type          Context switches   Note
sTomcat-Async        4                  Read and write events are handled by different worker threads (Figure 3)
sTomcat-Async-Fix    2                  Read and write events are handled by the same worker thread
sTomcat-Sync         0                  Dedicated worker thread for each request; context switches occur only due to interrupts or CPU time slice expiration
SingleT-Async        0                  No context switches; one thread handles both event monitoring and processing

refer to as sTomcat-Async (simplified TomcatAsync) and sTomcat-Sync (simplified TomcatSync). As a reference, we implement two alternative designs of asynchronous servers aiming to reduce the frequency of context switches. The first alternative design, which we call sTomcat-Async-Fix, merges the processing of the read event and the write event of the same request into the same worker thread. In this case, once a worker thread finishes preparing the response, it continues to send the response out (steps 2 and 3 in Figure 3 no longer exist); thus processing one client request only requires two context switches, from the reactor thread to a worker thread and from the same worker thread back to the reactor thread. The second alternative design is the traditional single-threaded asynchronous server, in which a single thread is responsible for both event monitoring and processing. The single-threaded implementation, which we refer to as SingleT-Async, is supposed to have the fewest context switches. Table II summarizes the context switches for each server type when it processes one client request.1 Interested readers can check out our server implementations on GitHub [13] for further reference.
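To make the single-threaded design concrete, the following is a minimal sketch (our illustration, not the code from [13]) of a SingleT-Async-style server: one java.nio selector thread performs both event monitoring and event handling, so no user-space context switch is needed per request.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SingleThreadedAsyncServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer readBuf = ByteBuffer.allocate(4096);
        byte[] response = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok".getBytes();

        while (true) {
            selector.select();                                   // event monitoring
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {                        // new connection
                    SocketChannel ch = server.accept();
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {                   // event handling in the same thread
                    SocketChannel ch = (SocketChannel) key.channel();
                    readBuf.clear();
                    if (ch.read(readBuf) < 0) { ch.close(); continue; }
                    // parsing and encoding would happen here; a real server must also
                    // handle partial writes (see the write-spin discussion in Section IV)
                    ch.write(ByteBuffer.wrap(response));
                }
            }
        }
    }
}
```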

We compare the throughput and context switches among the four types of servers under increasing workload concurrencies and server response sizes, as shown in Figure 4. Comparing Figure 4(a) and 4(d), the maximum achievable throughput of each server type is negatively correlated with its context switch frequency during the runtime experiments. For example, at workload concurrency 16, sTomcat-Async-Fix outperforms sTomcat-Async by 22% in throughput while its context switch rate is 34% lower. In our experiments the CPU demand of each request is positively correlated with the response size; a small response size means a small CPU computation demand, so the portion of CPU cycles wasted in context switches becomes large. As a result, the gap in context switches between sTomcat-Async-Fix and sTomcat-Async reflects their throughput difference. This hypothesis is further validated by the performance of SingleT-Async and sTomcat-Sync, which outperform sTomcat-Async by 91% and 57% in throughput, respectively (see Figure 4(a)). This performance difference is again explained by fewer context switches, as shown in Figure 4(d). For example, the context switch rate of SingleT-Async is a few hundred per second, three orders of magnitude less than that of sTomcat-Async.

1To simplify analysis and reasoning, we do not count the context switches caused by interrupts or by the operating system swapping threads out.

[Figure 4 omitted: throughput (subfigures (a)-(c)) and context switches (subfigures (d)-(f)) of SingleT-Async, sTomcat-Sync, sTomcat-Async-Fix, and sTomcat-Async versus workload concurrency (1 to 3200 connections) for response sizes 0.1KB, 10KB, and 100KB.]

Fig. 4: Throughput and context switch comparison among different server architectures as the server response size increases from 0.1KB to 100KB. (a) and (d) show that the maximum achievable throughput of each server type is negatively correlated with its context switch frequency when the server response size is small (0.1KB). However, as the response size increases to 100KB, (c) shows that sTomcat-Sync outperforms the asynchronous servers before workload concurrency 400, indicating that factors other than context switches cause overhead in asynchronous servers.

We note that as the server response size becomes larger, the portion of CPU overhead caused by context switches becomes smaller, since more CPU cycles are consumed by processing requests and sending responses. This is the case in Figure 4(b) and 4(c), where the response sizes are 10KB and 100KB, respectively. The throughput difference among the four server architectures becomes narrower, indicating less performance impact from context switches.

In fact, an interesting phenomenon is observed as the response size increases to 100KB. Figure 4(c) shows that SingleT-Async performs worse than the thread-based sTomcat-Sync before workload concurrency 400, even though SingleT-Async has far fewer context switches than sTomcat-Sync, as shown in Figure 4(f). This observation suggests that other factors cause overhead in the asynchronous SingleT-Async, but not in the thread-based sTomcat-Sync, when the server response size is large; we discuss these factors in the next section.

IV. WRITE-SPIN PROBLEM OF ASYNCHRONOUS INVOCATION

In this section we study the performance degradation of an asynchronous server that sends large responses. We use fine-grained profiling tools such as Collectl [2] and JProfiler [4] to analyze the detailed CPU usage and some key system calls invoked by servers with different architectures. We found that the default small TCP send buffer size and the TCP wait-ACK mechanism lead to a severe write-spin problem when a relatively large response is sent, which causes significant CPU overhead for asynchronous servers. We also explored several network-related factors that can exacerbate the negative impact of the write-spin problem and further degrade the performance of an asynchronous server.

A. Profiling Results

Recall that Figure 4(a) shows that when the response size is small (i.e., 0.1KB), the throughput of the asynchronous SingleT-Async is 20% higher than that of the thread-based sTomcat-Sync at workload concurrency 8. However, as the response size increases to 100KB, the throughput of SingleT-Async is surprisingly 31% lower than that of sTomcat-Sync under the same workload concurrency 8 (see Figure 4(c)). Since the only change is the response size, it is natural to speculate that a large response size brings significant overhead for SingleT-Async but not for sTomcat-Sync.

To investigate the performance degradation of SingleT-Async when the response size is large, we first use Collectl [2] to analyze the detailed CPU usage of the server with different server response sizes, as shown in Table III. The workload concurrency for both SingleT-Async and sTomcat-Sync is 100, and the CPU is 100% utilized under this workload concurrency. As the response size of both server architectures increases from 0.1KB to 100KB, the table shows that the user-space CPU utilization of sTomcat-Sync increases by 25% (from 55% to 80%) while that of SingleT-Async increases by 34% (from 58% to 92%). This comparison suggests that increasing the response size has more impact on the user-space CPU utilization of the asynchronous SingleT-Async than on that of the thread-based sTomcat-Sync.

We further use JProfiler [4] to profile SingleT-Async when the response size increases from 0.1KB to 100KB

TABLE III: SingleT-Async consumes more user-space CPU than sTomcat-Sync. The workload concurrency is kept at 100.

Server Type            sTomcat-Sync          SingleT-Async
Response Size          0.1KB      100KB      0.1KB      100KB
Throughput [req/sec]   35,000     590        42,800     520
User total [%]         55         80         58         92
System total [%]       45         20         42         8

TABLE IV: The write-spin problem occurs when the response size is 100KB. This table shows the total number of socket.write() calls in SingleT-Async for different response sizes during a one-minute experiment.

Resp. size    # requests    # write()    socket.write() per req
0.1KB         238,530       238,530      1
10KB          9,400         9,400        1
100KB         2,971         303,795      102

and examine what has changed at the application level. We found that the frequency of the socket.write() system call is especially high in the 100KB case, as shown in Table IV. We note that socket.write() is called when a server sends a response back to the corresponding client. In the case of a thread-based server like sTomcat-Sync, socket.write() is called only once per client request. While such one-write-per-request behavior also holds for the 0.1KB and 10KB cases in SingleT-Async, SingleT-Async calls socket.write() 102 times per request on average in the 100KB case. System calls are in general expensive due to the related kernel crossing overhead [20], [39]; thus the high frequency of socket.write() in the 100KB case helps explain the high user-space CPU overhead of SingleT-Async shown in Table III.

Our further analysis shows that the multiple-socket-write problem of SingleT-Async is due to the small TCP send buffer size (16KB by default) of each TCP connection and the TCP wait-ACK mechanism. When a processing thread tries to copy 100KB of data from user space to the kernel-space TCP buffer through the system call socket.write(), the first socket.write() can only copy at most 16KB of data into the send buffer, which is organized as a byte buffer ring. A TCP sliding window is set by the kernel to decide how much data can actually be sent to the client; the sliding window can move forward and free up buffer space for new data to be copied in only after the server receives the ACKs of the previously sent-out packets. Since socket.write() is a non-blocking system call in SingleT-Async, every call returns how many bytes were written to the TCP send buffer, and it returns zero if the TCP send buffer is full, leading to the write-spin problem. The whole process is illustrated in Figure 5. On the other hand, when a worker thread in the synchronous sTomcat-Sync tries to copy 100KB of data from user space to the kernel-space TCP send buffer, only one blocking socket.write() system call is invoked for each request; the worker thread waits until the kernel sends the 100KB response out, and the write-spin problem is avoided.
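The following minimal sketch (our illustration under the assumptions above, not the servers' actual code) shows how a non-blocking write on a java.nio.channels.SocketChannel degenerates into a write-spin when the response is larger than the free space in the TCP send buffer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class WriteSpinDemo {
    /** Writes the whole response with non-blocking writes; returns how many calls returned zero. */
    static long writeFully(SocketChannel ch, ByteBuffer response) throws IOException {
        long zeroReturns = 0;
        while (response.hasRemaining()) {
            int n = ch.write(response);   // non-blocking: copies at most the free space left
                                          // in the (16KB by default) TCP send buffer
            if (n == 0) {
                // Send buffer full; it only drains after ACKs come back from the client.
                // Naively retrying here is exactly the write-spin measured in Table IV.
                zeroReturns++;
            }
        }
        return zeroReturns;
    }
}
```

A server that does not want to spin must instead save its progress and return to the event loop when write() returns zero, which is the behavior Netty approximates (Section V-A).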

[Figure 5 omitted: message sequence between client and server showing the server reading from the socket, parsing and encoding, then repeatedly calling socket.write() while "sum < data size" and waiting for TCP ACKs from the client before the data copy to the kernel can finish.]

Fig. 5: Illustration of the write-spin problem in an asynchronous server. Due to the small TCP send buffer size and the TCP wait-ACK mechanism, a worker thread write-spins on the socket.write() system call and can only send more data after ACKs come back from the client for the previously sent packets.

An intuitive solution is to increase the TCP send buffer size to match the size of the server response so that the write-spin problem is avoided. Our experimental results indeed show the effectiveness of manually increasing the TCP send buffer size to solve the write-spin problem for our RUBBoS workload. However, several factors make setting a proper TCP send buffer size a non-trivial challenge in practice. First, the response size of an internet server can be dynamic and is difficult to predict in advance. For example, the response of a Tomcat server may involve dynamic content retrieved from the downstream database, the size of which can range from hundreds of bytes to megabytes. Second, HTTP/2.0 enables a web server to push multiple responses for a single client request, which makes the response size for a client request even more unpredictable [19]. For example, the response of a typical news website (e.g., CNN.com) can easily reach tens of megabytes, resulting from a large amount of static and dynamic content (e.g., images and database query results); all of this content can be pushed back in answer to one client request. Third, setting a large TCP send buffer for each TCP connection to prepare for the peak response size consumes a large amount of memory on a server that may serve hundreds or thousands of end users (each with one or a few persistent TCP connections); such an over-provisioning strategy is expensive and wastes computing resources in a shared cloud computing platform. Thus it is challenging to set a proper TCP send buffer size in advance to prevent the write-spin problem.
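For reference, the per-connection send buffer can be enlarged from Java as sketched below (a hedged example: the method name and the idea of sizing the buffer to the response are ours, and the kernel may still clamp the effective size to its configured maximum):

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

final class SendBufferTuning {
    /** Requests a larger TCP send buffer for one connection; this avoids the write-spin
        only if the response fits and the kernel honors the requested size. */
    static void enlargeSendBuffer(SocketChannel ch, int responseBytes) throws IOException {
        ch.setOption(StandardSocketOptions.SO_SNDBUF, responseBytes);
    }
}
```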

In fact, Linux kernels above 2.4 already provide an auto-tuning function for the TCP send buffer size based on the runtime network conditions. Once it is turned on, the kernel dynamically resizes a server's TCP send buffer to provide optimized bandwidth utilization [25]. However, the auto-tuning function aims to efficiently utilize the available bandwidth of the link between the sender and the receiver based on the Bandwidth-Delay Product rule [17];

[Figure 6 omitted: throughput (req/s, 0-400) of SingleT-Async-100KB versus SingleT-Async-autotuning under network latencies of ~0ms, ~5ms, ~10ms, and ~20ms.]

Fig. 6: The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled.

it lacks sufficient application information, such as the response size. Therefore the auto-tuned send buffer may be large enough to maximize throughput over the link but still inadequate for the application, which can still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with auto-tuning performs worse than the same server with a fixed large TCP send buffer size (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference becomes even bigger when there is non-trivial network latency between the client and the server, which is the topic of the next subsection.

B. Network Latency Exaggerates the Write-Spin Problem

Network latency is common in cloud data centers. The component servers of an n-tier application may run on VMs located on different physical nodes, across different racks, or even in different data centers, so the network latency between them can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by such network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 all the time. The response size of each client request is 100KB, and the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc" (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that the surprising throughput degradation results from response time amplification when the write-spin problem happens. Sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each round of data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small increase in network latency can be amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency

[Figure 7 omitted: (a) throughput (req/sec, 0-800) and (b) response time (0-15s) of SingleT-Async, sTomcat-Async-Fix, and sTomcat-Sync under network latencies of ~0ms, ~5ms, ~10ms, and ~20ms.]

Fig. 7: Throughput degradation of the two asynchronous servers in subfigure (a), resulting from the response time amplification in (b) as the network latency increases.

is added. According to Little's Law, a server's throughput is negatively correlated with the response time of the server, given that the workload concurrency (the number of queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18 to 3.60 seconds) means a 95% decrease in server throughput for SingleT-Async, as shown in Figure 7(a).
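The back-of-the-envelope arithmetic behind this claim, using the measured numbers above (our calculation, applying Little's Law X = N/R with fixed concurrency N = 100):

```latex
X_{\sim 0\,\mathrm{ms}} = \frac{N}{R} = \frac{100}{0.18\,\mathrm{s}} \approx 556\ \mathrm{req/s},
\qquad
X_{\sim 5\,\mathrm{ms}} = \frac{100}{3.60\,\mathrm{s}} \approx 28\ \mathrm{req/s},
\qquad
1 - \frac{28}{556} \approx 95\%.
```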

V. SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II), and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem also exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which is designed to mitigate the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but at the cost of non-trivial optimization overhead. We then propose a hybrid solution that takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

A. Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework that provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server: it uses a reactor thread to accept new connections and a worker thread pool to process the various I/O events of each connection.

Though it also uses a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes

[Figure 8 omitted: flow chart of Netty's write path, in which the application thread loops on the socket.write() syscall while three conditions hold (the previous write returned a non-zero size, the writeSpin counter is below the threshold TH, and the bytes copied so far are less than the data size), and otherwise moves on to the next event before the data copy to the kernel finishes.]

Fig. 8: Netty mitigates the write-spin problem by runtime checking. The write spin jumps out of the loop if any of the three conditions is not met.

[Figure 9 omitted: throughput of SingleT-Async, NettyServer, and sTomcat-Sync versus workload concurrency (1 to 3200 connections); (a) response size is 100KB (0-800 req/s), (b) response size is 0.1KB (0-40K req/s).]

Fig. 9: Throughput comparison under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB. Subfigure (a) shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem, and (b) shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

the role of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events of each connection (event monitoring phase), then it dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to the worker threads. In this case, the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event, where the output of each handler is the input of the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing the unnecessary context switches between the reactor thread and the worker threads.
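A minimal sketch of this design (assuming Netty 4.x; the HTTP codec and the inbound handler are illustrative choices, not the NettyServer implementation evaluated below): the boss group only accepts connections, each worker event loop both monitors and handles I/O for its connections, and per-connection handlers are chained in a pipeline.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.HttpServerCodec;

public class NettyServerSketch {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);      // accepts new connections only
        EventLoopGroup workers = new NioEventLoopGroup();    // each loop monitors and handles events
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, workers)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // A chain of handlers attached to one connection (pipeline),
                     // avoiding unnecessary intermediate events between threads.
                     ch.pipeline().addLast(new HttpServerCodec());
                     ch.pipeline().addLast(new SimpleChannelInboundHandler<Object>() {
                         @Override
                         protected void channelRead0(ChannelHandlerContext ctx, Object msg) {
                             // application logic would go here (illustrative placeholder)
                         }
                     });
                 }
             });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```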

To mitigate the write-spin problem, Netty adds a write-spin check when a worker thread calls socket.write() to copy a large response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty

maintains a writeSpin counter that records how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, denoted return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, return_size is zero, indicating that the TCP send buffer is already full; second, the writeSpin counter exceeds a predefined threshold (the default value is 16 in Netty-v4). Once it jumps out, the worker thread saves the context and resumes the data transfer of the current connection after it loops over the other connections with pending events. Such write optimization mitigates the blocking of a worker thread by a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
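The logic just described can be summarized by the following simplified sketch (our paraphrase of the behavior described above, not Netty's actual source):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class WriteSpinGuard {
    static final int WRITE_SPIN_THRESHOLD = 16;   // Netty-v4 default mentioned in the text

    /** Returns true if the response was fully copied to the kernel,
        false if the transfer must be resumed after other connections are served. */
    static boolean tryWrite(SocketChannel ch, ByteBuffer response) throws IOException {
        for (int spin = 0; spin < WRITE_SPIN_THRESHOLD && response.hasRemaining(); spin++) {
            int written = ch.write(response);      // non-blocking write
            if (written == 0) {
                return false;   // send buffer full: give up the spin, serve other connections
            }
        }
        return !response.hasRemaining();           // may also give up after too many spins
    }
}
```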

We validate the effectiveness of Netty in mitigating the write-spin problem, as well as the associated optimization overhead, in Figure 9. We built a simple application server based on Netty, named NettyServer. The figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% lower than that of SingleT-Async in the 0.1KB response case, indicating non-trivial overhead from the unnecessary write operation optimization when there is no write-spin problem. Therefore, neither NettyServer nor SingleT-Async is able to achieve the best performance under all workload conditions.

B. A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, no single asynchronous solution always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution that utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
  • The response size of the server is unpredictable and can vary during runtime.
  • The workload is an in-memory workload.

[Figure 10 omitted: flow chart of the hybrid worker thread. In the event monitoring phase, select() yields a pool of connections with pending events; in the event handling phase, the thread takes the next available connection, checks the request type, and after parsing and encoding either applies the write operation optimization before socket.write() (the NettyServer path) or calls socket.write() directly (the SingleT-Async path).]

Fig. 10: Worker thread processing flow in the hybrid solution.

The first assumption rules out initiating the server with a large but fixed TCP send buffer size for each connection in order to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature of HTTP/2.0) discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of their near-zero latency requirements [30]; for example, Memcached servers have been widely adopted to reduce disk activities [36]. A solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem during runtime. In an initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus, when HybridNetty receives a new incoming request, it first checks the map object to determine which category the request belongs to and then chooses the most efficient execution path. In practice the response size, even for the same type of request, may change over time (due to runtime environment changes such as the dataset), so we update the map object during runtime once a request is detected to be classified into the wrong category, in order to keep track of the latest category of such requests.
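The dispatch idea can be sketched as follows (a hedged illustration: the class and method names, and the use of Runnable for the two execution paths, are our own simplifications rather than HybridNetty's actual code [13]):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class HybridDispatcher {
    // Map learned during warm-up: request type -> did it previously cause a write-spin?
    private final ConcurrentMap<String, Boolean> isHeavy = new ConcurrentHashMap<>();

    void dispatch(String requestType, Runnable nettyStylePath, Runnable singleThreadedPath) {
        if (isHeavy.getOrDefault(requestType, false)) {
            nettyStylePath.run();        // heavy responses: write-optimized (Netty-like) path
        } else {
            singleThreadedPath.run();    // light responses: lightweight single-threaded path
        }
    }

    /** Called after a response is sent; reclassifies the request type if its behavior changed. */
    void record(String requestType, boolean causedWriteSpin) {
        isHeavy.put(requestType, causedWriteSpin);
    }
}
```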

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have large responses (e.g., 100KB), and the light requests, which have small responses

(e.g., 0.1KB). Heavy requests can cause the write-spin problem, while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different mixes found in realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we compare normalized throughput and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the request distribution of real web applications typically follows a Zipf-like distribution in which light requests dominate the workload [22], our hybrid solution is all the more valuable for realistic workloads. In addition, SingleT-Async performs much worse than the other two when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than the asynchronous event-driven one. For example, von Behren et al. develop a thread-based web server, Knot [40], which can compete with event-driven servers under high-concurrency workloads by using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

Optimizations for asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, and epoll, or by I/O operations under high-concurrency workloads. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

[Figure 11 omitted: normalized throughput (0-1.0) of SingleT-Async, NettyServer, and HybridNetty as the ratio of large-size responses increases (0%, 2%, 5%, 10%, 20%, 100%); (a) no network latency between client and server, (b) ~5ms network latency between client and server.]

Fig. 11: The hybrid solution performs the best under different mixes of light/heavy request workloads, with or without network latency. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare normalized throughput and use HybridNetty as the baseline.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µserver) and pipelined (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the use of blocking/non-blocking sendfile system calls. Brecht et al. [21] improve the performance of the event-driven µserver by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections; our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than their thread-based counterparts due to an inferior event processing flow that creates high context switch overhead (Sections II and III). We also studied a general problem of asynchronous event-driven servers, the write-spin problem, which arises when handling large responses, together with the factors that exacerbate it, such as network latency (Section IV). Since no one solution fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to varying workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers requires taking both the event processing flow and the runtime varying workload/network conditions into consideration.

ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through CISE's CNS (1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or

[Figure 12 omitted: (a) software and hardware setup; (b) a 1/1/1 sample topology in which clients send HTTP requests to Apache, which forwards them to Tomcat and then MySQL.]

Fig. 12: Details of the RUBBoS experimental setup.

grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

APPENDIX A: RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. This workload generator has a similar design to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and the sample 3-tier topology used in the Section II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client machine and one server machine.

REFERENCES

[1] Apache JMeter™. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] ADLER, S. The slashdot effect: an analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] ADYA, A., HOWELL, J., THEIMER, M., BOLOSKY, W. J., AND DOUCEUR, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289–302.
[17] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP congestion control. Tech. rep., 2009.
[18] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45–58.
[19] BELSHE, M., THOMSON, M., AND PEON, R. Hypertext transfer protocol version 2 (HTTP/2).
[20] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI'08, USENIX Association, pp. 43–57.
[21] BRECHT, T., PARIAG, D., AND GAMMO, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20–20.
[22] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Proceedings, IEEE (1999), vol. 1, IEEE, pp. 126–134.
[23] CANAS, C., ZHANG, K., KEMME, B., KIENZLE, J., AND JACOBSEN, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241–252.
[24] DUKKIPATI, N., REFICE, T., CHENG, Y., CHU, J., HERBERT, T., AGARWAL, A., JAIN, A., AND SUTIN, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26–33.
[25] FISK, M., AND FENG, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] GARRETT, J. J., ET AL. Ajax: A new approach to web applications.
[27] HAN, S., MARSHALL, S., CHUN, B.-G., AND RATNASAMY, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI'12, USENIX Association, pp. 135–148.
[28] HARJI, A. S., BUHR, P. A., AND BRECHT, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1–1:12.
[29] HASSAN, O. A.-H., AND SHARGABI, B. A. A scalable and efficient web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358–380.
[30] HUANG, Q., BIRMAN, K., VAN RENESSE, R., LLOYD, W., KUMAR, S., AND LI, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167–181.
[31] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC'10, USENIX Association, pp. 11–11.
[32] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC'07, USENIX Association, pp. 7:1–7:14.
[33] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA) (2007).
[34] LEVER, C., ERIKSEN, M. A., AND MOLLOY, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] LI, C., SHEN, K., AND PAPATHANASIOU, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189–202.
[36] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385–398.
[37] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15–15.
[38] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231–243.
[39] SOARES, L., AND STUMM, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI'10, USENIX Association, pp. 33–46.
[40] VON BEHREN, R., CONDIT, J., AND BREWER, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS'03, USENIX Association, pp. 4–4.
[41] VON BEHREN, R., CONDIT, J., ZHOU, F., NECULA, G. C., AND BREWER, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268–281.
[42] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230–243.
[43] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIERES, D., AND KAASHOEK, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239–252.

  • Introduction
  • Background and Motivation
    • RPC vs Asynchronous Network IO
    • Performance Degradation after Tomcat Upgrade
      • Inefficient Event Processing Flow in Asynchronous Servers
      • Write-Spin Problem of Asynchronous Invocation
        • Profiling Results
        • Network Latency Exaggerates the Write-Spin Problem
          • Solution
            • Mitigating Context Switches and Write-Spin Using Netty
            • A Hybrid Solution
            • Validation of HybridNetty
              • Related Work
              • Conclusions
              • Appendix A RUBBoS Experimental Setup
              • References
Page 5: Improving Asynchronous Invocation Performance in Client

0

10K

20K

30K

40K

1 4 8 16 64 100 400 1000 3200

Thro

ughp

ut [

req

s]

Workload Concurrency [ of Connections]

SingleT-AsyncsTomcat-Sync

sTomcat-Async-FixsTomcat-Async

(a) Throughput when response size is 01KB

0

2K

4K

6K

8K

10K

1 4 8 16 64 100 400 1000 3200Workload Concurrency [ of Connections]

SingleT-AsyncsTomcat-Sync

sTomcat-Async-FixsTomcat-Async

(b) Throughput when response size is 10KB

0

200

400

600

1 4 8 16 64 100 400 1000 3200Workload Concurrency [ of Connections]

SingleT-AsyncsTomcat-Sync

sTomcat-Async-FixsTomcat-Async

(c) Throughput when response size is 100KB

0

40K

80K

120K

160K

1 4 8 16 64 100 400 1000 3200

Cont

ext

Switc

hing

[s

]

Workload Concurrency [ of Connections]

SingleT-AsyncsTomcat-Sync

sTomcat-Async-FixsTomcat-Async

(d) Context switch when response size is 01KB

0

20K

40K

60K

80K

100K

1 4 8 16 64 100 400 1000 3200Workload Concurrency [ of Connections]

SingleT-AsyncsTomcat-Sync

sTomcat-Async-FixsTomcat-Async

(e) Context switch when response size is 10KB

0

2K

4K

6K

8K

10K

1 4 8 16 64 100 400 1000 3200Workload Concurrency [ of Connections]

SingleT-AsyncsTomcat-Sync

sTomcat-Async-FixsTomcat-Async

(f) Context switch when response size is 100KBFig 4 Throughput and context switch comparison among different server architectures as the server response sizeincreases from 01KB to 100KB (a) and (d) show that the maximum achievable throughput by each server type is negativelycorrelated with their context switch freqency when the server response size is small (01KB) However as the response sizeincreases to 100KB (c) shows sTomcat-Sync outperforms other asynchronous servers before the workload concurrency400 indicating factors other than context switches cause overhead in asynchronous servers

We note that as the server response size becomes larger theportion of CPU overhead caused by context switches becomessmaller since more CPU cycles will be consumed by process-ing request and sending response This is the case as shownin Figure 4(b) and 4(c) where the response sizes are 10KBand 100KB respectively The throughput difference becomesnarrower among the four server architectures indicating lessperformance impact from context switches

In fact one interesting phenomenon has been observedas the response size increases to 100KB Figure 4(c) showsthat SingleT-Async performs worse than the thread-basedsTomcat-Sync before the workload concurrency 400 eventhough SingleT-Async has much less context switchesthan sTomcat-Sync as shown in Figure 4(f) Such obser-vation suggests that there are other factors causing overheadin asynchronous SingleT-Async but not in thread-basedsTomcat-Sync when the server response size is largewhich we will discuss in the next section

IV WRITE-SPIN PROBLEM OF ASYNCHRONOUSINVOCATION

In this section we study the performance degradation prob-lem of an asynchronous server sending a large size responseWe use fine-grained profiling tools such as Collectl [2] andJProfiler [4] to analyze the detailed CPU usage and some keysystem calls invoked by servers with different architecturesAs we found out that it is the default small TCP sendbuffer size and the TCP wait-ACK mechanism that leadsto a severe write-spin problem when sending a relativelylarge size response which causes significant CPU overheadfor asynchronous servers We also explored several network-

related factors that could exacerbate the negative impact of thewrite-spin problem which further degrades the performance ofan asynchronous server

A Profiling Results

Recall that Figure 4(a) shows when the response sizeis small (ie 01KB) the throughput of the asynchronousSingleT-Async is 20 higher than the thread-basedsTomcat-Sync at workload concurrency 8 However as re-sponse size increases to 100KB SingleT-Async through-put is surprisingly 31 lower than sTomcat-Sync underthe same workload concurrency 8 (see Figure 4(c)) Sincethe only change is the response size it is natural to specu-late that large response size brings significant overhead forSingleT-Async but not for sTomcat-Sync

To investigate the performance degradation ofSingleT-Async when the response size is large wefirst use Collectl [2] to analyze the detailed CPU usageof the server with different server response sizes asshown in Table III The workload concurrency for bothSingleT-Async and sTomcat-Sync is 100 and theCPU is 100 utilized under this workload concurrencyAs the response size for both server architectures increasedfrom 01KB to 100KB the table shows the user-space CPUutilization of sTomcat-Sync increases 25 (from 55 to80) while 34 (from 58 to 92) for SingleT-AsyncSuch comparison suggests that increasing response size hasmore impact on the asynchronous SingleT-Async than thethread-based sTomcat-Sync in user-space CPU utilization

We further use JProfiler [4] to profile the SingleT-Asynccase when the response size increases from 01KB to 100KB

TABLE III SingleT-Async consumes more user-spaceCPU compared to sTomcat-Sync The workload concur-rency keeps 100

Server Type sTomcat-Sync SingleT-Async

Response Size 01KB 100KB 01KB 100KBThroughput [reqsec] 35000 590 42800 520User total 55 80 58 92System total 45 20 42 8

TABLE IV The write-spin problem occurs when theresponse size is 100KB This table shows the measurementof total number of socketwrite() in SingleT-Asyncwith different response size during a one-minute experiment

Resp size req write()socketwrite() per req

01KB 238530 238530 110KB 9400 9400 1100KB 2971 303795 102

and see what has changed in application level We found thatthe frequency of socketwrite() system call is especiallyhigh in the 100KB case as shown in Table IV We note thatsocketwrite() is called when a server sends a responseback to the corresponding client In the case of a thread-based server like sTomcat-Sync socketwrite() iscalled only once for each client request While such onewrite per request is true for the 01KB and 10KB casein SingleT-Async it calls socketwrite() averagely102 times per request in the 100KB case System calls ingeneral are expensive due to the related kernel crossing over-head [20] [39] thus high frequency of socketwrite() inthe 100KB case helps explain high user-space CPU overheadin SingleT-Async as shown in Table III

Our further analysis shows that the multiple socket writeproblem of SingleT-Async is due to the small TCP sendbuffer size (16K by default) for each TCP connection and theTCP wait-ACK mechanism When a processing thread triesto copy 100KB data from the user space to the kernel spaceTCP buffer through the system call socketwrite() thefirst socketwrite() can only copy at most 16KB data tothe send buffer which is organized as a byte buffer ring ATCP sliding window is set by the kernel to decide how muchdata can actually be sent to the client the sliding windowcan move forward and free up buffer space for new data to becopied in only if the server receives the ACKs of the previouslysent-out packets Since socketwrite() is a non-blockingsystem call in SingleT-Async every time it returns howmany bytes are written to the TCP send buffer the systemcall will return zero if the TCP send buffer is full leadingto the write-spin problem The whole process is illustratedin Figure 5 On the other hand when a worker thread in thesynchronous sTomcat-Sync tries to copy 100KB data fromthe user space to the kernel space TCP send buffer only oneblocking system call socketwrite() is invoked for eachrequest the worker thread will wait until the kernel sends the100KB response out and the write-spin problem is avoided

Client

Receive Buffer Read from socket

Parsing and encoding

Write to socket

sum lt data size

syscall

syscall

syscall

Return of bytes written

Return zero

Un

expected

Wait fo

r TCP

AC

Ks

Server

Tim

e

Write to socket

sum lt data size

Write to socket

sum lt data size

The worker thread write-spins until ACKs come back from client

Data Copy to Kernel Finish

Return of bytes written

Fig 5 Illustration of the write-spin problem in an asyn-chronous server Due to the small TCP send buffer size andthe TCP wait-ACK mechanism a worker thread write-spins onthe system call socketwrite() and can only send moredata until ACKs back from the client for previous sent packets

An intuitive solution is to increase the TCP send buffer sizeto the same size as the server response to avoid the write-spinproblem Our experimental results actually show the effective-ness of manually increasing the TCP send buffer size to solvethe write-spin problem for our RUBBoS workload Howeverseveral factors make setting a proper TCP send buffer sizea non-trivial challenge in practice First the response size ofan internet server can be dynamic and is difficult to predictin advance For example the response of a Tomcat servermay involve dynamic content retrieved from the downstreamdatabase the size of which can range from hundreds of bytesto megabytes Second HTTP20 enables a web server to pushmultiple responses for a single client request which makes theresponse size for a client request even more unpredictable [19]For example the response of a typical news website (egCNNcom) can easily reach tens of megabytes resulting froma large amount of static and dynamic content (eg imagesand database query results) all these content can be pushedback by answering one client request Third setting a largeTCP send buffer for each TCP connection to prepare for thepeak response size consumes a large amount of memory ofthe server which may serve hundreds or thousands of endusers (each has one or a few persistent TCP connections) suchover-provisioning strategy is expensive and wastes computingresources in a shared cloud computing platform Thus it ischallenging to set a proper TCP send buffer size in advanceand prevent the write-spin problem

In fact Linux kernels above 24 already provide an auto-tuning function for TCP send buffer size based on the runtimenetwork conditions Once turned on the kernel dynamicallyresizes a serverrsquos TCP send buffer size to provide optimizedbandwidth utilization [25] However the auto-tuning functionaims to efficiently utilize the available bandwidth of the linkbetween the sender and the receiver based on Bandwidth-

0

100

200

300

400

~0ms ~5ms ~10ms ~20ms

Thro

ughp

ut [

req

s]

Network Latency

SingleT-Async-100KBSingleT-Async-autotuning

Fig 6 Write-spin problem still exists when TCP sendbuffer ldquoautotuningrdquo feature enabled

Delay Product rule [17] it lacks sufficient application infor-mation such as response size Therefore the auto-tuned sendbuffer could be enough to maximize the throughput over thelink but still inadequate for applications which may still causethe write-spin problem for asynchronous servers Figure 6shows SingleT-Async with auto-tunning performs worsethan the other case with a fixed large TCP send buffer size(100kB) suggesting the occurrence of the write-spin problemOur further study also shows the performance difference iseven bigger if there is non-trivial network latency between theclient and the server which is the topic of the next subsection

B. Network Latency Exaggerates the Write-Spin Problem

Network latency is common in cloud data centers: the component servers in an n-tier application may run on VMs located on different physical nodes, across different racks, or even in different data centers, so the latency between them can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 all the time. The response size of each client request is 100KB, and the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc" (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that the surprising throughput degradation results from response time amplification when the write-spin problem happens. This is because sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each data transfer has to wait until the server receives the ACKs of the previously sent-out packets (see Figure 5). Thus a small increase in network latency can be amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency is added.

(a) Throughput comparison. (b) Response time comparison.
Fig. 7: Throughput degradation of the two asynchronous servers in subfigure (a), resulting from the response time amplification in subfigure (b), as the network latency increases.

According to Little's Law, a server's throughput is negatively correlated with the response time of the server, given that the workload concurrency (queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18 to 3.60 seconds) means a 95% decrease in server throughput for SingleT-Async, as shown in Figure 7(a).
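
As a quick sanity check of these numbers, using only the measurements reported above and Little's Law with the concurrency fixed at N = 100:

  X = N / R
  X(~0ms) ≈ 100 / 0.18 s ≈ 556 req/s,   X(~5ms) ≈ 100 / 3.60 s ≈ 28 req/s
  (556 − 28) / 556 ≈ 0.95

which matches the roughly 95% throughput drop observed in Figure 7(a).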

V. SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem widely exist in other popular open-source asynchronous application servers/middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which mitigates the context switch overhead through an event processing flow optimization and mitigates the write-spin problem of asynchronous messaging through a write operation optimization, but at the cost of non-trivial optimization overhead. Then we propose a hybrid solution which takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

A. Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server: it uses a reactor thread to accept new connections and a worker thread pool to process the various I/O events from each connection.

Though using a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead.

Fig. 8: Netty mitigates the write-spin problem by runtime checking. The worker thread stays in the write loop only while all three conditions hold (return_size != 0, writeSpin < TH, and sum < data size); it jumps out of the loop if any of them is not met.

(a) Response size is 100KB. (b) Response size is 0.1KB.
Fig. 9: Throughput comparison under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB. Subfigure (a) shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem, and (b) shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

First, Netty changes the role of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase); it then dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to the worker threads. In this case, the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event; the output of each handler is the input to the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing the unnecessary context switches between the reactor thread and the worker threads.
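
As an illustration of this design, the sketch below wires a minimal Netty 4 server with a dedicated accept (boss) group, per-connection worker event loops, and a handler chain on each channel's pipeline. It is our own minimal example, not the NettyServer implementation evaluated in this paper; the HTTP codec stands in for the paper's parsing and encoding handlers.

  import io.netty.bootstrap.ServerBootstrap;
  import io.netty.channel.ChannelInitializer;
  import io.netty.channel.nio.NioEventLoopGroup;
  import io.netty.channel.socket.SocketChannel;
  import io.netty.channel.socket.nio.NioServerSocketChannel;
  import io.netty.handler.codec.http.HttpServerCodec;

  public class MinimalNettyServer {
      public static void main(String[] args) throws InterruptedException {
          NioEventLoopGroup boss = new NioEventLoopGroup(1);   // reactor: only accepts new connections
          NioEventLoopGroup workers = new NioEventLoopGroup(); // each worker both monitors and handles its connections
          try {
              ServerBootstrap bootstrap = new ServerBootstrap();
              bootstrap.group(boss, workers)
                       .channel(NioServerSocketChannel.class)
                       .childHandler(new ChannelInitializer<SocketChannel>() {
                           @Override
                           protected void initChannel(SocketChannel ch) {
                               // A chain of handlers attached to one connection: the output of
                               // each handler feeds the next, instead of one-event-one-handler
                               // dispatching through the reactor thread.
                               ch.pipeline().addLast(new HttpServerCodec());
                               // Application-level handlers (parsing, encoding, business logic)
                               // would be appended here in the same pipeline.
                           }
                       });
              bootstrap.bind(8080).sync().channel().closeFuture().sync();
          } finally {
              boss.shutdownGracefully();
              workers.shutdownGracefully();
          }
      }
  }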

In order to mitigate the write-spin problem, Netty adopts a write-spin check when a worker thread calls socket.write() to copy a large response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread will jump out of the write spin if either of two conditions is met: first, the return_size is zero, indicating the TCP send buffer is already full; second, the counter writeSpin exceeds a pre-defined threshold (the default value is 16 in Netty-v4). Once it jumps out, the worker thread saves the context and resumes the current connection's data transfer after it loops over the other connections with pending events. Such a write optimization is able to mitigate the blocking of the worker thread by a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
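
The following is a simplified sketch of such a bounded write loop, modeled on the description above rather than copied from Netty's source; the threshold constant mirrors the default write-spin count of 16 mentioned for Netty-v4.

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.nio.channels.SocketChannel;

  public class BoundedWriteLoop {
      static final int WRITE_SPIN_THRESHOLD = 16;  // mirrors the default write-spin count in Netty-v4

      // Tries to copy one response into the kernel send buffer, but gives up after
      // WRITE_SPIN_THRESHOLD attempts or as soon as write() returns 0 (buffer full),
      // so the worker thread can service other connections and resume this one later.
      static boolean tryWrite(SocketChannel channel, ByteBuffer response) throws IOException {
          int writeSpin = 0;
          while (response.hasRemaining() && writeSpin < WRITE_SPIN_THRESHOLD) {
              int returnSize = channel.write(response);  // non-blocking write
              if (returnSize == 0) {
                  break;  // TCP send buffer is full; wait for ACKs instead of spinning
              }
              writeSpin++;
          }
          return !response.hasRemaining();  // false: caller saves context and resumes later
      }
  }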

We validate the effectiveness of Netty in mitigating the write-spin problem, and also the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. This figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting that NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% less than that of SingleT-Async in the 0.1KB response case, indicating the non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore neither NettyServer nor SingleT-Async is able to achieve the best performance under all workload conditions.

B. A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, there is no single asynchronous solution that always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

Fig. 10: Worker thread processing flow in the hybrid solution.

The first assumption excludes initiating the server with a large but fixed TCP send buffer size for each connection in order to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) we have discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of near-zero latency requirements [30]; for example, MemCached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem during runtime. In the initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category a request belongs to. Thus when HybridNetty receives a new incoming request, it checks the map object first to figure out which category the request belongs to, and then chooses the most efficient execution path. In practice, the response size even for the same type of request may change over time (due to runtime environment changes such as the dataset), so we update the map object at runtime once a request is detected to have been classified into the wrong category, in order to keep track of the latest category of such requests.
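
A minimal sketch of this classification logic is shown below; the class name, the request key, and the two execution paths are our own illustrative placeholders, not the authors' HybridNetty code.

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;

  public class HybridDispatcher {
      // Records, per request type, whether its response caused a write-spin before.
      private final ConcurrentMap<String, Boolean> heavyRequest = new ConcurrentHashMap<>();

      // Picks the cheaper execution path for an incoming request. Unknown requests
      // default to the light path; warm-up profiling and the runtime correction
      // below keep the map up to date.
      void dispatch(String requestKey, Runnable lightPath, Runnable heavyPath) {
          if (heavyRequest.getOrDefault(requestKey, Boolean.FALSE)) {
              heavyPath.run();   // NettyServer-style path with the write-spin check
          } else {
              lightPath.run();   // SingleT-Async-style path without the write optimization
          }
      }

      // Called after a response has been sent: if the observed behavior contradicts
      // the recorded category, re-classify the request type.
      void recordObservation(String requestKey, boolean causedWriteSpin) {
          heavyRequest.put(requestKey, causedWriteSpin);
      }
  }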

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have large response sizes (e.g., 100KB), and the light requests, which have small response sizes (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different scenarios of realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we adopt a normalized throughput comparison and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; other than that, HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process a request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution where light requests dominate the workload [22], our hybrid solution makes even more sense in dealing with realistic workloads. In addition, SingleT-Async performs much worse than the other two cases when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance as the asynchronous event-driven one does. For example, von Behren et al. develop a thread-based web server, Knot [40], which can compete with event-driven servers under high concurrency workload by using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous research results suggest that the asynchronous event-driven architecture will continue to play an important role in building high performance and resource-efficient servers that meet the requirements of current cloud data centers.

The optimization of asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts have been motivated by reducing the overhead incurred by system calls such as select, poll, epoll, or I/O operations under high concurrency workload. For example, to avoid the kernel crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating the event monitoring and handling into the kernel.

(a) No network latency between client and server. (b) ~5ms network latency between client and server.
Fig. 11: The hybrid solution performs the best under different mixes of light/heavy request workload, with or without network latency. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare the normalized throughput and use HybridNetty as the baseline.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (μServer) and pipeline (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven μServer by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's initial congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections; our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version, resulting from an inferior event processing flow which creates high context switch overhead (Sections II and III). We also studied a general problem for all asynchronous event-driven servers, the write-spin problem when handling large responses, and the associated exaggerating factors such as network latency (Section IV). Since no one solution fits all, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload/network conditions into consideration.

ACKNOWLEDGMENT

This research has been partially funded by National Science Foundation by CISE's CNS (1566443), Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

(a) Software and hardware setup. (b) 1/1/1 sample topology (clients, Apache, Tomcat, MySQL).
Fig. 12: Details of the RUBBoS experimental setup.

APPENDIX A: RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages, and the think time between receiving a web page and submitting a new page download request is about 7 seconds. Such a workload generator has a similar design as other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Subsection II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.

REFERENCES

[1] Apache JMeter. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] ADLER, S. The slashdot effect: an analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] ADYA, A., HOWELL, J., THEIMER, M., BOLOSKY, W. J., AND DOUCEUR, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289–302.
[17] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP congestion control. Tech. rep., 2009.
[18] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45–58.
[19] BELSHE, M., THOMSON, M., AND PEON, R. Hypertext transfer protocol version 2 (HTTP/2).
[20] BOYD-WICKIZER, S., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43–57.
[21] BRECHT, T., PARIAG, D., AND GAMMO, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20–20.
[22] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126–134.
[23] CANAS, C., ZHANG, K., KEMME, B., KIENZLE, J., AND JACOBSEN, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241–252.
[24] DUKKIPATI, N., REFICE, T., CHENG, Y., CHU, J., HERBERT, T., AGARWAL, A., JAIN, A., AND SUTIN, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26–33.
[25] FISK, M., AND FENG, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] GARRETT, J. J., ET AL. Ajax: A new approach to web applications.
[27] HAN, S., MARSHALL, S., CHUN, B.-G., AND RATNASAMY, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135–148.
[28] HARJI, A. S., BUHR, P. A., AND BRECHT, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1–1:12.
[29] HASSAN, O. A.-H., AND SHARGABI, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358–380.
[30] HUANG, Q., BIRMAN, K., VAN RENESSE, R., LLOYD, W., KUMAR, S., AND LI, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167–181.
[31] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC '10, USENIX Association, pp. 11–11.
[32] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1–7:14.
[33] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA, 2007).
[34] LEVER, C., ERIKSEN, M. A., AND MOLLOY, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] LI, C., SHEN, K., AND PAPATHANASIOU, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189–202.
[36] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385–398.
[37] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15–15.
[38] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231–243.
[39] SOARES, L., AND STUMM, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33–46.
[40] VON BEHREN, R., CONDIT, J., AND BREWER, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4–4.
[41] VON BEHREN, R., CONDIT, J., ZHOU, F., NECULA, G. C., AND BREWER, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268–281.
[42] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230–243.
[43] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIERES, D., AND KAASHOEK, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239–252.


[40] VON BEHREN R CONDIT J AND BREWER E Why events area bad idea (for high-concurrency servers) In Proceedings of the 9thConference on Hot Topics in Operating Systems - Volume 9 (BerkeleyCA USA 2003) HOTOSrsquo03 USENIX Association pp 4ndash4

[41] VON BEHREN R CONDIT J ZHOU F NECULA G C ANDBREWER E Capriccio Scalable threads for internet services InProceedings of the Nineteenth ACM Symposium on Operating SystemsPrinciples (New York NY USA 2003) SOSP rsquo03 ACM pp 268ndash281

[42] WELSH M CULLER D AND BREWER E Seda An architecturefor well-conditioned scalable internet services In Proceedings of theEighteenth ACM Symposium on Operating Systems Principles (NewYork NY USA 2001) SOSP rsquo01 ACM pp 230ndash243

[43] ZELDOVICH N YIP A DABEK F MORRIS R MAZIERES DAND KAASHOEK M F Multiprocessor support for event-drivenprograms In USENIX Annual Technical Conference General Track(2003) pp 239ndash252

  • Introduction
  • Background and Motivation
    • RPC vs Asynchronous Network IO
    • Performance Degradation after Tomcat Upgrade
      • Inefficient Event Processing Flow in Asynchronous Servers
      • Write-Spin Problem of Asynchronous Invocation
        • Profiling Results
        • Network Latency Exaggerates the Write-Spin Problem
          • Solution
            • Mitigating Context Switches and Write-Spin Using Netty
            • A Hybrid Solution
            • Validation of HybridNetty
              • Related Work
              • Conclusions
              • Appendix A RUBBoS Experimental Setup
              • References
Page 7: Improving Asynchronous Invocation Performance in Client

Fig. 6: The write-spin problem still exists when the TCP send buffer "autotuning" feature is enabled (throughput [req/s] of SingleT-Async-100KB vs. SingleT-Async-autotuning under ~0ms, ~5ms, ~10ms, and ~20ms network latency).

Delay Product rule [17], it lacks sufficient application information such as response size. Therefore the auto-tuned send buffer may be large enough to maximize throughput over the link yet still inadequate for the application, which can still cause the write-spin problem for asynchronous servers. Figure 6 shows that SingleT-Async with autotuning performs worse than the same server with a fixed large TCP send buffer (100KB), suggesting the occurrence of the write-spin problem. Our further study also shows that the performance difference becomes even bigger if there is non-trivial network latency between the client and the server, which is the topic of the next subsection.
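If the maximum response size were known in advance, one way to sidestep autotuning would be to pin a sufficiently large per-connection send buffer. The following minimal Java NIO sketch is our own illustration under that assumption, not the configuration used in the experiments; note that explicitly setting SO_SNDBUF disables the kernel's send-buffer autotuning for that socket on Linux. As Section IV-A discussed, however, this is rarely practical because response sizes are generally unpredictable.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class FixedSendBufferServer {
    public static void main(String[] args) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        while (true) {
            SocketChannel conn = server.accept();
            // Assumption: responses are at most ~100KB. A fixed 100KB send buffer
            // avoids the write-spin for such responses, but also turns off Linux
            // send-buffer autotuning for this socket.
            conn.setOption(StandardSocketOptions.SO_SNDBUF, 100 * 1024);
            conn.configureBlocking(false);
            // ... register the connection with a selector and serve it asynchronously ...
        }
    }
}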

B. Network Latency Exaggerates the Write-Spin Problem

Network latency is common in cloud data centers. The component servers in an n-tier application may run on VMs located on different physical nodes, across different racks, or even in different data centers, so the latency between them can range from a few milliseconds to tens of milliseconds. Our experimental results show that the negative impact of the write-spin problem can be significantly exacerbated by such network latency.

The impact of network latency on the performance of different types of servers is shown in Figure 7. In this set of experiments we keep the workload concurrency from clients at 100 all the time. The response size of each client request is 100KB, and the TCP send buffer size of each server is the default 16KB, with which an asynchronous server encounters the write-spin problem. We use the Linux command "tc" (Traffic Control) on the client side to control the network latency between the client and the server. Figure 7(a) shows that the throughput of the asynchronous servers SingleT-Async and sTomcat-Async-Fix is sensitive to network latency. For example, when the network latency is 5ms, the throughput of SingleT-Async decreases by about 95%, which is surprising considering the small amount of latency added.

We found that this surprising throughput degradation results from response time amplification when the write-spin problem happens. Sending a relatively large response requires multiple rounds of data transfer due to the small TCP send buffer size, and each data transfer has to wait until the server receives the ACKs for the previously sent-out packets (see Figure 5). Thus a small increase in network latency can be amplified into a long delay for completing one response transfer. Such response time amplification for asynchronous servers can be seen in Figure 7(b). For example, the average response time of SingleT-Async for a client request increases from 0.18 seconds to 3.60 seconds when 5 milliseconds of network latency

Fig. 7: Throughput degradation of two asynchronous servers in subfigure (a), resulting from the response time amplification in (b), as the network latency increases from ~0ms to ~20ms. (a) Throughput comparison [req/sec] and (b) response time comparison [s] for SingleT-Async, sTomcat-Async-Fix, and sTomcat-Sync.

is added. According to Little's Law, a server's throughput is inversely related to its response time given that the workload concurrency (queued requests) stays the same. Since we always keep the workload concurrency for each server at 100, a 20-fold increase in server response time (from 0.18s to 3.60s) means a 95% decrease in the throughput of SingleT-Async, as shown in Figure 7(a).
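As a back-of-the-envelope check (a sketch using the measured averages quoted above, with N the fixed workload concurrency of 100 and R the average response time):

\[
X = \frac{N}{R}, \qquad
X_{\sim 0\,\text{ms}} \approx \frac{100}{0.18\,\text{s}} \approx 556\ \text{req/s}, \qquad
X_{\sim 5\,\text{ms}} \approx \frac{100}{3.60\,\text{s}} \approx 28\ \text{req/s},
\]

so the throughput drop is roughly \(1 - 28/556 \approx 95\%\), consistent with Figure 7(a).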

V. SOLUTION

So far we have discussed two problems of asynchronous invocation: the context switch problem caused by an inefficient event processing flow (see Table II) and the write-spin problem resulting from the unpredictable response size and the TCP wait-ACK mechanism (see Figure 5). Though our research is motivated by the performance degradation of the latest asynchronous Tomcat, we found that the inappropriate event processing flow and the write-spin problem also exist in other popular open-source asynchronous application servers and middleware, including the network framework Grizzly [10] and the application server Jetty [3].

An ideal asynchronous server architecture should avoid both problems under various workload and network conditions. We first investigate a popular asynchronous network I/O library named Netty [7], which mitigates the context switch overhead through an event processing flow optimization and the write-spin problem of asynchronous messaging through a write operation optimization, but at the cost of non-trivial optimization overhead. Then we propose a hybrid solution which takes advantage of different types of asynchronous servers, aiming to solve both the context switch overhead and the write-spin problem while avoiding the optimization overhead.

A. Mitigating Context Switches and Write-Spin Using Netty

Netty is an asynchronous event-driven network I/O framework which provides optimized read and write operations in order to mitigate the context switch overhead and the write-spin problem. Netty adopts the second design strategy (see Section II-A) to support an asynchronous server: it uses a reactor thread to accept new connections and a worker thread pool to process the various I/O events from each connection.

Though it uses a worker thread pool, Netty makes two significant changes compared to the asynchronous TomcatAsync to reduce the context switch overhead. First, Netty changes

Fig. 8: Netty mitigates the write-spin problem by runtime checking. The worker thread keeps copying data to the kernel only while all three conditions hold (the returned write size is non-zero, the writeSpin counter is below its threshold, and the bytes copied so far are less than the data size); the write spin jumps out of the loop if any of the three conditions is not met.

Fig. 9: Throughput comparison of SingleT-Async, NettyServer, and sTomcat-Sync under various workload concurrencies (1 to 3200 connections) and response sizes; the default TCP send buffer size is 16KB. Subfigure (a), with 100KB responses, shows that NettyServer performs the best, suggesting effective mitigation of the write-spin problem; (b), with 0.1KB responses, shows that NettyServer performs worse than SingleT-Async, indicating non-trivial write optimization overhead in Netty.

the roles of the reactor thread and the worker threads. In the asynchronous TomcatAsync case, the reactor thread is responsible for monitoring events for each connection (event monitoring phase) and then dispatches each event to an available worker thread for proper event handling (event handling phase). Such a dispatching operation always involves context switches between the reactor thread and a worker thread. Netty optimizes this dispatching process by letting a worker thread take care of both event monitoring and handling; the reactor thread only accepts new connections and assigns the established connections to worker threads. In this case the context switches between the reactor thread and the worker threads are significantly reduced. Second, instead of having a single event handler attached to each event, Netty allows a chain of handlers to be attached to one event, where the output of each handler is the input to the next handler (a pipeline). Such a design avoids generating unnecessary intermediate events and the associated system calls, thus reducing the unnecessary context switches between the reactor thread and worker threads.
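As an illustration of these two changes, the following minimal Netty 4 sketch (our own example, not the NettyServer used in the experiments; MyBusinessHandler is a hypothetical handler) uses one boss event loop that only accepts connections, worker event loops that both monitor and handle I/O for their assigned connections, and a ChannelPipeline that chains handlers so one handler's output feeds the next without intermediate event dispatching:

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.util.ReferenceCountUtil;

public class NettyServerSketch {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // reactor: accepts connections only
        EventLoopGroup workers = new NioEventLoopGroup(); // each worker monitors AND handles its connections
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, workers)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // Handlers form a pipeline: the output of one handler is the input
                     // of the next, with no intermediate events dispatched to other threads.
                     ch.pipeline()
                       .addLast(new HttpServerCodec())    // HTTP parsing and encoding
                       .addLast(new MyBusinessHandler()); // hypothetical application handler
                 }
             });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }

    // Hypothetical application handler; the business logic is omitted.
    static class MyBusinessHandler extends ChannelInboundHandlerAdapter {
        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            ReferenceCountUtil.release(msg); // placeholder: release the inbound message
        }
    }
}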

In order to mitigate the write-spin problem, Netty adopts a write-spin check when a worker thread calls socket.write() to copy a large response to the kernel, as shown in Figure 8. Concretely, each worker thread in Netty

maintains a writeSpin counter to record how many times it has tried to write a single response into the TCP send buffer. For each write, the worker thread also tracks how many bytes have been copied, noted as return_size. The worker thread jumps out of the write spin if either of two conditions is met: first, the return_size is zero, indicating the TCP send buffer is already full; second, the counter writeSpin exceeds a pre-defined threshold (the default value is 16 in Netty-v4). Once it jumps out, the worker thread saves the context and resumes the current connection's data transfer after it loops over other connections with pending events. Such write optimization is able to mitigate the blocking of the worker thread by a connection transferring a large response; however, it also brings non-trivial overhead when all responses are small and there is no write-spin problem.
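The behavior described above can be sketched as follows (a simplified illustration of the write-spin check, not Netty's actual implementation; the threshold value 16 mirrors the Netty-v4 default mentioned above):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class WriteSpinSketch {
    private static final int WRITE_SPIN_THRESHOLD = 16; // default writeSpinCount in Netty-v4

    /** Returns true if the whole response was copied to the kernel, false if the
     *  caller should remember its position and come back to this connection later. */
    static boolean writeWithSpinCheck(SocketChannel ch, ByteBuffer response) throws IOException {
        for (int writeSpin = 0;
             writeSpin < WRITE_SPIN_THRESHOLD && response.hasRemaining();
             writeSpin++) {
            int returnSize = ch.write(response); // non-blocking write into the TCP send buffer
            if (returnSize == 0) {
                break; // send buffer full: stop spinning so other connections can make progress
            }
        }
        return !response.hasRemaining();
    }
}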

We validate the effectiveness of Netty in mitigating the write-spin problem, and also the associated optimization overhead, in Figure 9. We build a simple application server based on Netty, named NettyServer. This figure compares NettyServer with the asynchronous SingleT-Async and the thread-based sTomcat-Sync under various workload concurrencies and response sizes. The default TCP send buffer size is 16KB, so there is no write-spin problem when the response size is 0.1KB and a severe write-spin problem in the 100KB case. Figure 9(a) shows that NettyServer performs the best among the three in the 100KB case; for example, when the workload concurrency is 100, NettyServer outperforms SingleT-Async and sTomcat-Sync by about 27% and 10% in throughput, respectively, suggesting NettyServer's write optimization effectively mitigates the write-spin problem encountered by SingleT-Async and also avoids the heavy multi-threading overhead encountered by sTomcat-Sync. On the other hand, Figure 9(b) shows that the maximum achievable throughput of NettyServer is 17% less than that of SingleT-Async in the 0.1KB response case, indicating the non-trivial overhead of the unnecessary write operation optimization when there is no write-spin problem. Therefore neither NettyServer nor SingleT-Async is able to achieve the best performance under all workload conditions.

B. A Hybrid Solution

In the previous section we showed that the asynchronous solutions, if chosen properly (see Figure 9), can always outperform the corresponding thread-based version under various workload conditions. However, there is no single asynchronous solution that always performs the best. For example, SingleT-Async suffers from the write-spin problem for large responses, while NettyServer suffers from the unnecessary write operation optimization overhead for small responses. In this section we propose a hybrid solution which utilizes both SingleT-Async and NettyServer and adapts to workload and network conditions.

Our hybrid solution is based on two assumptions:
• The response size of the server is unpredictable and can vary during runtime.
• The workload is an in-memory workload.

Fig. 10: Worker thread processing flow in the hybrid solution. In the event monitoring phase, the worker thread calls Select() over the pool of connections with pending events; in the event handling phase, it checks the request type of each available connection and routes it either through NettyServer's path (parsing and encoding, then the write operation optimization) or SingleT-Async's path (parsing and encoding, then a direct socket.write()), before returning to fetch the next connection.

The first assumption excludes initiating the server with a large but fixed TCP send buffer size for each connection in order to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) we have discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of near-zero latency requirements [30]; for example, MemCached servers have been widely adopted to reduce disk activities [36]. The solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not the response causes a write-spin problem at runtime. In the initial warm-up phase (i.e., when the workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two categories: the heavy requests that can cause the write-spin problem and the light requests that cannot. HybridNetty maintains a map object recording which category each request belongs to. Thus when HybridNetty receives a new incoming request, it checks the map object first to figure out which category the request belongs to, and then chooses the most efficient execution path. In practice, the response size even for the same type of request may change over time (due to runtime environment changes such as the dataset), so we update the map object at runtime once a request is detected to be classified into the wrong category, in order to keep track of the latest category of such requests.
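A minimal sketch of this classification-and-dispatch logic is shown below (our own illustration; names such as requestKey, heavyRequests, and the two path methods are hypothetical, and the write-spin feedback used for re-classification is simplified):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class HybridDispatchSketch {
    // true = heavy (previously caused a write-spin), false = light
    private final Map<String, Boolean> heavyRequests = new ConcurrentHashMap<>();

    void handle(String requestKey, Request req) {
        boolean heavy = heavyRequests.getOrDefault(requestKey, false);
        boolean spun;
        if (heavy) {
            spun = handleViaNettyPath(req);        // path with the write operation optimization
        } else {
            spun = handleViaSingleThreadPath(req); // plain single-threaded asynchronous path
        }
        // Re-classify at runtime if the observed behavior no longer matches the map.
        if (spun != heavy) {
            heavyRequests.put(requestKey, spun);
        }
    }

    // Both methods report whether writing the response spun on the TCP send buffer.
    private boolean handleViaNettyPath(Request req) { /* ... */ return true; }
    private boolean handleViaSingleThreadPath(Request req) { /* ... */ return false; }

    static final class Request { /* request fields omitted */ }
}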

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: the heavy requests, which have large response sizes (e.g., 100KB), and the light requests, which have small response

sizes (e.g., 0.1KB); heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% in order to simulate different mixes of realistic workloads. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we report normalized throughput and use the HybridNetty throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise HybridNetty always performs the best. For example, Figure 11(a) shows that when the heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path to process each request. Considering that the distribution of requests for real web applications typically follows a Zipf-like distribution, where light requests dominate the workload [22], our hybrid solution is even more attractive for realistic workloads. In addition, SingleT-Async performs much worse than the other two when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than the asynchronous event-driven one. For example, von Behren et al. develop a thread-based web server, Knot [40], which can compete with event-driven servers under high-concurrency workloads by using a scalable user-level threading package, Capriccio [41]. However, Krohn et al. [32] point out that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

Optimizations for asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts are motivated by reducing the overhead incurred by system calls such as select, poll, and epoll or by I/O operations under high-concurrency workloads. For example, to avoid the kernel-crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server by integrating event monitoring and handling into the kernel.

Fig. 11: The hybrid solution performs the best in different mixes of light/heavy request workload, with or without network latency. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare the normalized throughput and use HybridNetty as the baseline. Subfigure (a): no network latency between client and server; (b): ~5ms network latency between client and server. (Normalized throughput of SingleT-Async, NettyServer, and HybridNetty vs. the ratio of large-size responses: 0%, 2%, 5%, 10%, 20%, 100%.)

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µServer) and pipeline-based (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the use of the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µServer by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's initial congestion window [24]; they show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections; our work complements their research but focuses on more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than the thread-based version because of an inferior event processing flow which creates high context switch overhead (Sections II and III). We also studied a general problem for asynchronous event-driven servers, the write-spin problem when handling large responses, and its exaggerating factors such as network latency (Section IV). Since no single solution fits all conditions, we provide a hybrid solution that utilizes different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers needs to take both the event processing flow and the runtime-varying workload and network conditions into consideration.

ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through CISE's CNS grant 1566443, the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

Fig. 12: Details of the RUBBoS experimental setup: (a) software and hardware setup; (b) a sample 1/1/1 three-tier topology in which clients send HTTP requests to Apache, which dispatches to Tomcat and MySQL.

APPENDIX A
RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the famous news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. This workload generator has a design similar to other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and a sample 3-tier topology used in the Subsection II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client and one server machine.

REFERENCES

[1] Apache JMeter. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] Adler, S. The Slashdot effect: An analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] Adya, A., Howell, J., Theimer, M., Bolosky, W. J., and Douceur, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289–302.
[17] Allman, M., Paxson, V., and Blanton, E. TCP congestion control. Tech. rep., 2009.
[18] Banga, G., Druschel, P., and Mogul, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45–58.
[19] Belshe, M., Thomson, M., and Peon, R. Hypertext transfer protocol version 2 (HTTP/2).
[20] Boyd-Wickizer, S., Chen, H., Chen, R., Mao, Y., Kaashoek, F., Morris, R., Pesterev, A., Stein, L., Wu, M., Dai, Y., Zhang, Y., and Zhang, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43–57.
[21] Brecht, T., Pariag, D., and Gammo, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20–20.
[22] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126–134.
[23] Canas, C., Zhang, K., Kemme, B., Kienzle, J., and Jacobsen, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241–252.
[24] Dukkipati, N., Refice, T., Cheng, Y., Chu, J., Herbert, T., Agarwal, A., Jain, A., and Sutin, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26–33.
[25] Fisk, M., and Feng, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] Garrett, J. J., et al. Ajax: A new approach to web applications.
[27] Han, S., Marshall, S., Chun, B.-G., and Ratnasamy, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135–148.
[28] Harji, A. S., Buhr, P. A., and Brecht, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1–1:12.
[29] Hassan, O. A.-H., and Shargabi, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358–380.
[30] Huang, Q., Birman, K., van Renesse, R., Lloyd, W., Kumar, S., and Li, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167–181.
[31] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC '10, USENIX Association, pp. 11–11.
[32] Krohn, M., Kohler, E., and Kaashoek, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1–7:14.
[33] Krohn, M., Kohler, E., and Kaashoek, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA) (2007).
[34] Lever, C., Eriksen, M. A., and Molloy, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] Li, C., Shen, K., and Papathanasiou, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189–202.
[36] Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385–398.
[37] Pai, V. S., Druschel, P., and Zwaenepoel, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15–15.
[38] Pariag, D., Brecht, T., Harji, A., Buhr, P., Shukla, A., and Cheriton, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231–243.
[39] Soares, L., and Stumm, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33–46.
[40] von Behren, R., Condit, J., and Brewer, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4–4.
[41] von Behren, R., Condit, J., Zhou, F., Necula, G. C., and Brewer, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268–281.
[42] Welsh, M., Culler, D., and Brewer, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230–243.
[43] Zeldovich, N., Yip, A., Dabek, F., Morris, R., Mazières, D., and Kaashoek, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239–252.

  • Introduction
  • Background and Motivation
    • RPC vs Asynchronous Network IO
    • Performance Degradation after Tomcat Upgrade
      • Inefficient Event Processing Flow in Asynchronous Servers
      • Write-Spin Problem of Asynchronous Invocation
        • Profiling Results
        • Network Latency Exaggerates the Write-Spin Problem
          • Solution
            • Mitigating Context Switches and Write-Spin Using Netty
            • A Hybrid Solution
            • Validation of HybridNetty
              • Related Work
              • Conclusions
              • Appendix A RUBBoS Experimental Setup
              • References
Page 8: Improving Asynchronous Invocation Performance in Client

syscall Write to socket

Conditions

1 Return_size = 0 ampamp 2 writeSpin lt TH ampamp 3 sum lt data size

Application ThreadKernel

Next Event Processing

Data Copy to Kernel Finish

True

Data Sending Finish

Data to be written

False

False

syscall

Fig 8 Netty mitigates the write-spin problem by runtimechecking The write spin jumps out of the loop if any of thethree conditions is not met

0

200

400

600

800

1 4 8 16 64 100 400 10003200

Thr

ough

put [

req

s]

Workload Concurrency [ of Connections]

SingleT-AsyncNettyServer

sTomcat-Sync

(a) Response size is 100KB

0

10K

20K

30K

40K

1 4 8 16 64 100 400 10003200

Thr

ough

put [

req

s]

Workload Concurrency [ of Connections]

SingleT-AsyncNettyServer

sTomcat-Sync

(b) Response size is 01KB

Fig 9 Throughput comparison under various workloadconcurrencies and response sizes The default TCP sendbuffer size is 16KB Subfigure (a) shows that NettyServerperforms the best suggesting effective mitigation of the write-spin problem and (b) shows that NettyServer performsworse than SingleT-Async indicating non-trivial writeoptimization overhead in Netty

the role of the reactor thread and the worker threads Inthe asynchronous TomcatAsync case the reactor threadis responsible to monitor events for each connection (eventmonitoring phase) then it dispatches each event to an avail-able worker thread for proper event handling (event handlingphase) Such dispatching operation always involves the contextswitches between the reactor thread and a worker threadNetty optimizes this dispatching process by letting a workerthread take care of both event monitoring and handling thereactor thread only accepts new connections and assigns theestablished connections to each worker thread In this case thecontext switches between the reactor thread and the workerthreads are significantly reduced Second instead of havinga single event handler attached to each event Netty allows achain of handlers to be attached to one event the output ofeach handler is the input to the next handler (pipeline) Sucha design avoids generating unnecessary intermediate eventsand the associate system calls thus reducing the unnecessarycontext switches between reactor thread and worker threads

In order to mitigate the write-spin problem Nettyadopts a write-spin checking when a worker thread callssocketwrite() to copy a large size response to the kernelas shown in Figure 8 Concretely each worker thread in Netty

maintains a writeSpin counter to record how many timesit has tried to write a single response into the TCP sendbuffer For each write the worker thread also tracks how manybytes have been copied noted as return_size The workerthread will jump out the write spin if either of two conditionsis met first the return_size is zero indicating the TCPsend buffer is already full second the counter writeSpinexceeds a pre-defined threshold (the default value is 16 inNetty-v4) Once jumping out the worker thread will savethe context and resume the current connection data transferafter it loops over other connections with pending eventsSuch write optimization is able to mitigate the blocking ofthe worker thread by a connection transferring a large sizeresponse however it also brings non-trivial overhead whenall responses are small and there is no write-spin problem

We validate the effectiveness of Netty for mitigating thewrite-spin problem and also the associate optimization over-head in Figure 9 We build a simple application serverbased on Netty named NettyServer This figure comparesNettyServer with the asynchronous SingleT-Asyncand the thread-based sTomcat-Sync under various work-load concurrencies and response sizes The default TCP sendbuffer size is 16KB so there is no write-spin problem when theresponse size is 01KB and severe write-spin problem in the100KB case Figure 9(a) shows that NettyServer performsthe best among three in the 100KB case for example when theworkload concurrency is 100 NettyServer outperformsSingleT-Async and sTomcat-Sync by about 27 and10 in throughput respectively suggesting NettyServerrsquoswrite optimization effectively mitigates the write-spin problemencountered by SingleT-Async and also avoids the heavymulti-threading overhead encountered by sTomcat-SyncOn the other hand Figure 9(b) shows that the maximumachievable throughput of NettyServer is 17 less than thatof SingleT-Async in the 01KB response case indicatingnon-trivial overhead of unnecessary write operation optimiza-tion when there is no write-spin problem Therefore neitherNettyServer nor SingleT-Async is able to achieve thebest performance under various workload conditions

B A Hybrid Solution

In the previous section we showed that the asynchronoussolutions if chosen properly (see Figure 9) can always out-perform the corresponding thread-based version under variousworkload conditions However there is no single asynchronoussolution that can always perform the best For exampleSingleT-Async suffers from the write-spin problem forlarge size responses while NettyServer suffers from theunnecessary write operation optimization overhead for smallsize responses In this section we propose a hybrid solutionwhich utilizes both SingleT-Async and NettyServerand adapts to workload and network conditions

Our hybrid solution is based on two assumptionsbull The response size of the server is unpredictable and can

vary during runtimebull The workload is in-memory workload

Select()

Pool of connections with pending events

Conn is available

Check req type

Parsing and encoding

Parsing and encoding

Write operation optimization

Socketwrite()

Return to Return to

No

NettyServer SingleT-Async

Event Monitoring Phase

Event Handling Phase

Yes

get next conn

Fig 10 Worker thread processing flow in Hybrid solution

The first assumption excludes the server being initiated witha large but fixed TCP send buffer size for each connectionin order to avoid the write-spin problem This assumption isreasonable because of the factors (eg dynamically generatedresponse and the push feature in HTTP20) we have discussedin Section IV-A The second assumption excludes a workerthread being blocked by disk IO activities This assumption isalso reasonable since in-memory workload becomes commonfor modern internet services because of near-zero latencyrequirement [30] for example MemCached server has beenwidely adopted to reduce disk activities [36] The solutionfor more complex workloads that involve frequent disk IOactivities is challenging and will require additional research

The main idea of the hybrid solution is to take advan-tage of different asynchronous server architectures such asSingleT-Async and NettyServer to handle requestswith different response sizes and network conditions as shownin Figure 10 Concretely our hybrid solution which we callHybridNetty profiles different types of requests based onwhether or not the response causes a write-spin problem duringthe runtime In initial warm-up phase (ie workload is low)HybridNetty uses the writeSpin counter of the originalNetty to categorize all requests into two categories the heavyrequests that can cause the write-spin problem and the lightrequests that can not HybridNetty maintains a map objectrecording which category a request belongs to Thus whenHybridNetty receives a new incoming request it checks themap object first and figures out which category it belongs toand then chooses the most efficient execution path In practicethe response size even for the same type of requests maychange over time (due to runtime environment changes suchas dataset) so we update the map object during runtime oncea request is detected to be classified into a wrong category inorder to keep track of the latest category of such requests

C Validation of HybridNetty

To validate the effectiveness of our hybrid solution Fig-ure 11 compares HybridNetty with SingleT-Asyncand NettyServer under various workload conditions andnetwork latencies Our workload consists of two classes ofrequests the heavy requests which have large response sizes(eg 100KB) and the light requests which have small response

size (eg 01KB) heavy requests can cause the write-spinproblem while light requests can not We increase the percent-age of heavy requests from 0 to 100 in order to simulatingdifferent scenarios of realistic workloads The workload con-currency from clients in all cases keeps 100 under which theserver CPU is 100 utilized To clearly show the effectivenessof our hybrid solution we adopt the normalized throughputcomparison and use the HybridNetty throughput as thebaseline Figure 11(a) and 11(b) show that HybridNettybehaves the same as SingleT-Async when all requests arelight (0 heavy requests) and the same as NettyServerwhen all requests are heavy other than that HybridNettyalways performs the best For example Figure 11(a) showsthat when the heavy requests reach to 5 HybridNettyachieves 30 higher throughput than SingleT-Async and10 higher throughput than NettyServer This is becauseHybridNetty always chooses the most efficient path to pro-cess request Considering that the distribution of requests forreal web applications typically follows a Zipf-like distributionwhere light requests dominate the workload [22] our hybridsolution makes more sense in dealing with realistic workloadIn addition SingleT-Async performs much worse than theother two cases when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b))This is because of the write-spin problem exacerbated bynetwork latency (see Section IV-B for more details)

VI RELATED WORK

Previous research has shown that a thread-based serverif implemented properly can achieve the same or even bet-ter performance as the asynchronous event-driven one doesFor example Von et al develop a thread-based web serverKnot [40] which can compete with event-driven servers at highconcurrency workload using a scalable user-level threadingpackage Capriccio [41] However Krohn et al [32] show thatCapriccio is a cooperative threading package that exports thePOSIX thread interface but behaves like events to the underly-ing operating system The authors of Capriccio also admit thatthe thread interface is still less flexible than events [40] Theseprevious research results suggest that the asynchronous event-driven architecture will continue to play an important role inbuilding high performance and resource efficiency servers thatmeet the requirements of current cloud data centers

The optimization for asynchronous event-driven servers canbe divided into two broad categories improving operatingsystem support and tuning software configurations

Improving operating system support mainly focuseson either refining underlying event notification mecha-nisms [18] [34] or simplifying the interfaces of network IOfor application level asynchronous programming [27] Theseresearch efforts have been motivated by reducing the overheadincurred by system calls such as select poll epoll or IOoperations under high concurrency workload For example toavoid the kernel crossings overhead caused by system callsTUX [34] is implemented as a kernel-based web server byintegrating the event monitoring and handling into the kernel

0

02

04

06

08

10

0 2 5 10 20 100

Nor

mal

ized

Thr

ough

put

Ratio of Large Size Response

SingleT-Async NettyServer HybridNetty

(a) No network latency between client and server

0

02

04

06

08

10

0 2 5 10 20 100

Nor

mal

ized

Thr

ough

put

Ratio of Large Size Response

SingleT-Async NettyServer HybridNetty

(b) sim5ms network latency between client and server

Fig 11 Hybrid solution performs the best in different mixes of lightheavy request workload with or without networklatency The workload concurrency keeps 100 in all cases To clearly show the throughput difference we compare the normalizedthroughput and use HybridNetty as the baseline

Tuning software configurations to improve asynchronousweb serversrsquo performance has also been studied before Forexample Pariag et al [38] show that the maximum achievablethroughput of event-driven (microServer) and pipeline (WatPipe)servers can be significantly improved by carefully tuning thenumber of simultaneous TCP connections and blockingnon-blocking sendfile system call Brecht et al [21] improvethe performance of event-driven microServer by modifying thestrategy of accepting new connections based on differentworkload characteristics Our work is closely related to Googleteamrsquos research about TCPrsquos congestion window [24] Theyshow that increasing TCPrsquos initial congestion window to atleast ten segments (about 15KB) can improve average latencyof HTTP responses by approximately 10 in large-scaleInternet experiments However their work mainly focuses onshort-lived TCP connections Our work complements theirresearch but focuses on more general network conditions

VII CONCLUSIONS

We studied the performance impact of asynchronous in-vocation on client-server systems Through realistic macro-and micro-benchmarks we showed that servers with theasynchronous event-driven architecture may perform signif-icantly worse than the thread-based version resulting fromthe inferior event processing flow which creates high contextswitch overhead (Section II and III) We also studied a generalproblem for all the asynchronous event-driven servers thewrite-spin problem when handling large size responses andthe associate exaggeration factors such as network latency(Section IV) Since there is no one solution fits all weprovide a hybrid solution by utilizing different asynchronousarchitectures to adapt to various workload and network condi-tions (Section V) More generally our research suggests thatbuilding high performance asynchronous event-driven serversneeds to take both the event processing flow and the runtimevarying workloadnetwork conditions into consideration

ACKNOWLEDGMENT

This research has been partially funded by National ScienceFoundation by CISErsquos CNS (1566443) Louisiana Board ofRegents under grant LEQSF(2015-18)-RD-A-11 and gifts or

HTTP Requests

Apache Tomcat MySQLClients

(b) 111 Sample Topology

(a) Software and Hardware Setup

Fig 12 Details of the RUBBoS experimental setup

grants from Fujitsu Any opinions findings and conclusionsor recommendations expressed in this material are those ofthe author(s) and do not necessarily reflect the views of theNational Science Foundation or other funding agencies andcompanies mentioned above

APPENDIX ARUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark whichis modeled after the famous news website Slashdot Theworkload consists of 24 different web interactions The defaultworkload generator emulates a number of users interactingwith the web application layer Each userrsquos behavior follows aMarkov chain model to navigate between different web pagesthe think time between receiving a web page and submitting anew page download request is about 7-second Such workloadgenerator has a similar design as other standard n-tier bench-marks such as RUBiS [12] TPC-W [14] and Cloudstone [29]We run the RUBBoS benchmark on our testbed Figure 12outlines the software configurations hardware configurationsand a sample 3-tier topology used in the Subsection II-Bexperiments Each server in the 3-tier topology is deployedin a dedicated machine All other client-server experimentsare conducted with one client and one server machine

REFERENCES

[1] Apache JMeterTM httpjmeterapacheorg[2] Collectl httpcollectlsourceforgenet[3] Jetty A Java HTTP (Web) Server and Java Servlet Container http

wwweclipseorgjetty[4] JProfiler The award-winning all-in-one Java profiler rdquohttpswww

ej-technologiescomproductsjprofileroverviewhtmlrdquo[5] lighttpd httpswwwlighttpdnet[6] MongoDB Async Java Driver httpmongodbgithubio

mongo-java-driver35driver-async[7] Netty httpnettyio[8] Nodejs httpsnodejsorgen[9] Oracle GlassFish Server httpwwworaclecomtechnetwork

middlewareglassfishoverviewindexhtml[10] Project Grizzly NIO Event Development Simplified httpsjavaee

githubiogrizzly[11] RUBBoS Bulletin board benchmark httpjmobow2orgrubboshtml[12] RUBiS Rice University Bidding System httprubisow2org[13] sTomcat-NIO sTomcat-BIO and two alternative asynchronous

servers httpsgithubcomsgzhangAsynMessaging[14] TPC-W A Transactional Web e-Commerce Benchmark httpwwwtpc

orgtpcw[15] ADLER S The slashdot effect an analysis of three internet publications

Linux Gazette 38 (1999) 2[16] ADYA A HOWELL J THEIMER M BOLOSKY W J AND

DOUCEUR J R Cooperative task management without manual stackmanagement In Proceedings of the General Track of the AnnualConference on USENIX Annual Technical Conference (Berkeley CAUSA 2002) ATEC rsquo02 USENIX Association pp 289ndash302

[17] ALLMAN M PAXSON V AND BLANTON E Tcp congestion controlTech rep 2009

[18] BANGA G DRUSCHEL P AND MOGUL J C Resource containers Anew facility for resource management in server systems In Proceedingsof the Third Symposium on Operating Systems Design and Implemen-tation (Berkeley CA USA 1999) OSDI rsquo99 USENIX Associationpp 45ndash58

[19] BELSHE M THOMSON M AND PEON R Hypertext transferprotocol version 2 (http2)

[20] BOYD-WICKIZER S CHEN H CHEN R MAO Y KAASHOEK FMORRIS R PESTEREV A STEIN L WU M DAI Y ZHANGY AND ZHANG Z Corey An operating system for many coresIn Proceedings of the 8th USENIX Conference on Operating SystemsDesign and Implementation (Berkeley CA USA 2008) OSDIrsquo08USENIX Association pp 43ndash57

[21] BRECHT T PARIAG D AND GAMMO L Acceptable strategiesfor improving web server performance In Proceedings of the AnnualConference on USENIX Annual Technical Conference (Berkeley CAUSA 2004) ATEC rsquo04 USENIX Association pp 20ndash20

[22] BRESLAU L CAO P FAN L PHILLIPS G AND SHENKER SWeb caching and zipf-like distributions Evidence and implicationsIn INFOCOMrsquo99 Eighteenth Annual Joint Conference of the IEEEComputer and Communications Societies Proceedings IEEE (1999)vol 1 IEEE pp 126ndash134

[23] CANAS C ZHANG K KEMME B KIENZLE J AND JACOBSENH-A Publishsubscribe network designs for multiplayer games InProceedings of the 15th International Middleware Conference (NewYork NY USA 2014) Middleware rsquo14 ACM pp 241ndash252

[24] DUKKIPATI N REFICE T CHENG Y CHU J HERBERT TAGARWAL A JAIN A AND SUTIN N An argument for increasingtcprsquos initial congestion window SIGCOMM Comput Commun Rev 403 (June 2010) 26ndash33

[25] FISK M AND FENG W-C Dynamic right-sizing in tcp httplib-www lanl govla-pubs00796247 pdf (2001) 2

[26] GARRETT J J ET AL Ajax A new approach to web applications[27] HAN S MARSHALL S CHUN B-G AND RATNASAMY S

Megapipe A new programming interface for scalable network io InProceedings of the 10th USENIX Conference on Operating SystemsDesign and Implementation (Berkeley CA USA 2012) OSDIrsquo12USENIX Association pp 135ndash148

[28] HARJI A S BUHR P A AND BRECHT T Comparing high-performance multi-core web-server architectures In Proceedings of the5th Annual International Systems and Storage Conference (New YorkNY USA 2012) SYSTOR rsquo12 ACM pp 11ndash112

[29] HASSAN O A-H AND SHARGABI B A A scalable and efficientweb 20 reader platform for mashups Int J Web Eng Technol 7 4(Dec 2012) 358ndash380

[30] HUANG Q BIRMAN K VAN RENESSE R LLOYD W KUMAR SAND LI H C An analysis of facebook photo caching In Proceedingsof the Twenty-Fourth ACM Symposium on Operating Systems Principles(New York NY USA 2013) SOSP rsquo13 ACM pp 167ndash181

[31] HUNT P KONAR M JUNQUEIRA F P AND REED B ZookeeperWait-free coordination for internet-scale systems In Proceedings of the2010 USENIX Conference on USENIX Annual Technical Conference(Berkeley CA USA 2010) USENIXATCrsquo10 USENIX Associationpp 11ndash11

[32] KROHN M KOHLER E AND KAASHOEK M F Events can makesense In 2007 USENIX Annual Technical Conference on Proceedings ofthe USENIX Annual Technical Conference (Berkeley CA USA 2007)ATCrsquo07 USENIX Association pp 71ndash714

[33] KROHN M KOHLER E AND KAASHOEK M F Simplified eventprogramming for busy network applications In Proceedings of the 2007USENIX Annual Technical Conference (Santa Clara CA USA (2007)

[34] LEVER C ERIKSEN M A AND MOLLOY S P An analysis ofthe tux web server Tech rep Center for Information TechnologyIntegration 2000

[35] LI C SHEN K AND PAPATHANASIOU A E Competitive prefetch-ing for concurrent sequential io In Proceedings of the 2Nd ACMSIGOPSEuroSys European Conference on Computer Systems 2007(New York NY USA 2007) EuroSys rsquo07 ACM pp 189ndash202

[36] NISHTALA R FUGAL H GRIMM S KWIATKOWSKI M LEEH LI H C MCELROY R PALECZNY M PEEK D SAABP STAFFORD D TUNG T AND VENKATARAMANI V Scalingmemcache at facebook In Presented as part of the 10th USENIXSymposium on Networked Systems Design and Implementation (NSDI13) (Lombard IL 2013) USENIX pp 385ndash398

[37] PAI V S DRUSCHEL P AND ZWAENEPOEL W Flash An efficientand portable web server In Proceedings of the Annual Conferenceon USENIX Annual Technical Conference (Berkeley CA USA 1999)ATEC rsquo99 USENIX Association pp 15ndash15

[38] PARIAG D BRECHT T HARJI A BUHR P SHUKLA A ANDCHERITON D R Comparing the performance of web server archi-tectures In Proceedings of the 2Nd ACM SIGOPSEuroSys EuropeanConference on Computer Systems 2007 (New York NY USA 2007)EuroSys rsquo07 ACM pp 231ndash243

[39] SOARES L AND STUMM M Flexsc Flexible system call schedulingwith exception-less system calls In Proceedings of the 9th USENIXConference on Operating Systems Design and Implementation (BerkeleyCA USA 2010) OSDIrsquo10 USENIX Association pp 33ndash46

[40] VON BEHREN R CONDIT J AND BREWER E Why events area bad idea (for high-concurrency servers) In Proceedings of the 9thConference on Hot Topics in Operating Systems - Volume 9 (BerkeleyCA USA 2003) HOTOSrsquo03 USENIX Association pp 4ndash4

[41] VON BEHREN R CONDIT J ZHOU F NECULA G C ANDBREWER E Capriccio Scalable threads for internet services InProceedings of the Nineteenth ACM Symposium on Operating SystemsPrinciples (New York NY USA 2003) SOSP rsquo03 ACM pp 268ndash281

[42] WELSH M CULLER D AND BREWER E Seda An architecturefor well-conditioned scalable internet services In Proceedings of theEighteenth ACM Symposium on Operating Systems Principles (NewYork NY USA 2001) SOSP rsquo01 ACM pp 230ndash243

[43] ZELDOVICH N YIP A DABEK F MORRIS R MAZIERES DAND KAASHOEK M F Multiprocessor support for event-drivenprograms In USENIX Annual Technical Conference General Track(2003) pp 239ndash252


Fig. 10: Worker thread processing flow in the hybrid solution. In the event monitoring phase, select() watches a pool of connections with pending events; once a connection is available, the event handling phase checks the request type and dispatches it to either the NettyServer path (with write-operation optimization) or the SingleT-Async path (parsing and encoding followed by socket write), then returns to get the next connection.

The first assumption excludes the case where the server is initialized with a large but fixed TCP send buffer size for each connection in order to avoid the write-spin problem. This assumption is reasonable because of the factors (e.g., dynamically generated responses and the push feature in HTTP/2.0) discussed in Section IV-A. The second assumption excludes a worker thread being blocked by disk I/O activities. This assumption is also reasonable since in-memory workloads have become common for modern internet services because of their near-zero latency requirements [30]; for example, Memcached has been widely adopted to reduce disk activities [36]. A solution for more complex workloads that involve frequent disk I/O activities is challenging and will require additional research.

The main idea of the hybrid solution is to take advantage of different asynchronous server architectures, such as SingleT-Async and NettyServer, to handle requests with different response sizes and network conditions, as shown in Figure 10. Concretely, our hybrid solution, which we call HybridNetty, profiles different types of requests based on whether or not their responses cause the write-spin problem at runtime. In the initial warm-up phase (i.e., when workload is low), HybridNetty uses the writeSpin counter of the original Netty to categorize all requests into two classes: heavy requests that cause the write-spin problem and light requests that do not. HybridNetty maintains a map object recording which category each request belongs to. Thus, when HybridNetty receives a new incoming request, it first checks the map object to determine the request's category and then chooses the most efficient execution path. In practice, the response size of even the same type of request may change over time (due to runtime environment changes such as the dataset), so we update the map object at runtime whenever a request is detected to have been misclassified, in order to keep track of the latest category of each request type.
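The classification-and-dispatch logic can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation; the class name, the two path methods, and the spin threshold are hypothetical placeholders.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of HybridNetty-style request dispatching (assumed API).
 * Heavy requests (responses observed to cause write spins) are routed to
 * the Netty-style path with write-operation optimization; light requests
 * use the single-threaded asynchronous path. Misclassified requests are
 * re-labeled at runtime.
 */
public class HybridDispatcher {
    // Request type (e.g., URI pattern) -> true if its response is "heavy".
    private final ConcurrentHashMap<String, Boolean> heavyMap = new ConcurrentHashMap<>();

    /** Dispatch a request along the path currently believed to be most efficient. */
    public void dispatch(String requestType, byte[] response) {
        boolean heavy = heavyMap.getOrDefault(requestType, false);
        int writeSpins = heavy ? writeWithNettyPath(response)         // optimized write path
                               : writeWithSingleThreadPath(response); // low-overhead direct path
        // Re-categorize if the observed behavior contradicts the current label.
        boolean observedHeavy = writeSpins > 1;
        if (observedHeavy != heavy) {
            heavyMap.put(requestType, observedHeavy);
        }
    }

    /** Placeholder for the NettyServer execution path; returns observed write spins. */
    private int writeWithNettyPath(byte[] response) { return 1; }

    /** Placeholder for the SingleT-Async execution path; returns observed write spins. */
    private int writeWithSingleThreadPath(byte[] response) { return 1; }
}
```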

C. Validation of HybridNetty

To validate the effectiveness of our hybrid solution, Figure 11 compares HybridNetty with SingleT-Async and NettyServer under various workload conditions and network latencies. Our workload consists of two classes of requests: heavy requests, which have large responses (e.g., 100KB), and light requests, which have small responses (e.g., 0.1KB). Heavy requests can cause the write-spin problem while light requests cannot. We increase the percentage of heavy requests from 0% to 100% to simulate different realistic workload scenarios. The workload concurrency from clients is kept at 100 in all cases, under which the server CPU is 100% utilized. To clearly show the effectiveness of our hybrid solution, we compare normalized throughput and use HybridNetty's throughput as the baseline. Figures 11(a) and 11(b) show that HybridNetty behaves the same as SingleT-Async when all requests are light (0% heavy requests) and the same as NettyServer when all requests are heavy; otherwise, HybridNetty always performs the best. For example, Figure 11(a) shows that when heavy requests reach 5%, HybridNetty achieves 30% higher throughput than SingleT-Async and 10% higher throughput than NettyServer. This is because HybridNetty always chooses the most efficient path for each request. Considering that the request distribution of real web applications typically follows a Zipf-like distribution in which light requests dominate the workload [22], our hybrid solution is even more attractive for realistic workloads. In addition, SingleT-Async performs much worse than the other two servers when the percentage of heavy requests is non-zero and non-negligible network latency exists (Figure 11(b)). This is because of the write-spin problem exacerbated by network latency (see Section IV-B for more details).
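For intuition on why heavy responses separate the two execution paths, the sketch below shows the non-blocking write loop an asynchronous server relies on: when the response exceeds the free space in the kernel TCP send buffer, write() returns 0 and the loop "spins". This is a simplified illustration under assumed buffer and response sizes, not code from our benchmark servers.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public final class WriteSpinProbe {
    /**
     * Write a response on a non-blocking channel and count "write spins":
     * iterations in which write() made no progress because the TCP send
     * buffer was full. A large response (e.g., ~100KB) on a default-sized
     * send buffer typically needs many iterations; a ~0.1KB response
     * usually completes in one.
     */
    public static int writeAndCountSpins(SocketChannel channel, ByteBuffer response) throws IOException {
        int spins = 0;
        while (response.hasRemaining()) {
            int written = channel.write(response); // non-blocking: may write 0 bytes
            if (written == 0) {
                spins++; // buffer full; a real server would register OP_WRITE rather than retry immediately
            }
        }
        return spins;
    }
}
```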

VI. RELATED WORK

Previous research has shown that a thread-based server, if implemented properly, can achieve the same or even better performance than an asynchronous event-driven one. For example, von Behren et al. developed the thread-based web server Knot [40], which can compete with event-driven servers under high-concurrency workloads by using Capriccio [41], a scalable user-level threading package. However, Krohn et al. [32] show that Capriccio is a cooperative threading package that exports the POSIX thread interface but behaves like events to the underlying operating system. The authors of Capriccio also admit that the thread interface is still less flexible than events [40]. These previous results suggest that the asynchronous event-driven architecture will continue to play an important role in building high-performance and resource-efficient servers that meet the requirements of current cloud data centers.

Optimizations for asynchronous event-driven servers can be divided into two broad categories: improving operating system support and tuning software configurations.

Improving operating system support mainly focuses on either refining the underlying event notification mechanisms [18], [34] or simplifying the network I/O interfaces for application-level asynchronous programming [27]. These research efforts are motivated by reducing the overhead incurred by system calls such as select, poll, and epoll, or by I/O operations under high-concurrency workloads. For example, to avoid the kernel-crossing overhead caused by system calls, TUX [34] is implemented as a kernel-based web server that integrates event monitoring and handling into the kernel.

Fig. 11: The hybrid solution performs the best under different mixes of light/heavy request workloads, with or without network latency. The workload concurrency is kept at 100 in all cases. To clearly show the throughput difference, we compare normalized throughput and use HybridNetty as the baseline. (a) No network latency between client and server; (b) ~5ms network latency between client and server. Both panels plot normalized throughput against the ratio of large-size responses (0, 2, 5, 10, 20, 100%) for SingleT-Async, NettyServer, and HybridNetty.

Tuning software configurations to improve asynchronous web servers' performance has also been studied before. For example, Pariag et al. [38] show that the maximum achievable throughput of event-driven (µserver) and pipeline-based (WatPipe) servers can be significantly improved by carefully tuning the number of simultaneous TCP connections and the use of the blocking/non-blocking sendfile system call. Brecht et al. [21] improve the performance of the event-driven µserver by modifying the strategy for accepting new connections based on different workload characteristics. Our work is closely related to the Google team's research on TCP's initial congestion window [24]. They show that increasing TCP's initial congestion window to at least ten segments (about 15KB) can improve the average latency of HTTP responses by approximately 10% in large-scale Internet experiments. However, their work mainly focuses on short-lived TCP connections; our work complements their research but targets more general network conditions.

VII. CONCLUSIONS

We studied the performance impact of asynchronous invocation on client-server systems. Through realistic macro- and micro-benchmarks, we showed that servers with the asynchronous event-driven architecture may perform significantly worse than their thread-based counterparts because of an inferior event processing flow, which creates high context switch overhead (Sections II and III). We also studied a general problem for asynchronous event-driven servers, the write-spin problem when handling large responses, and the associated exacerbating factors such as network latency (Section IV). Since no single solution fits all conditions, we provide a hybrid solution that exploits different asynchronous architectures to adapt to various workload and network conditions (Section V). More generally, our research suggests that building high-performance asynchronous event-driven servers requires taking both the event processing flow and the runtime-varying workload and network conditions into consideration.

ACKNOWLEDGMENT

This research has been partially funded by the National Science Foundation through CISE's CNS (1566443), the Louisiana Board of Regents under grant LEQSF(2015-18)-RD-A-11, and gifts or grants from Fujitsu. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.

Fig. 12: Details of the RUBBoS experimental setup. (a) Software and hardware setup; (b) sample 1/1/1 topology, in which clients send HTTP requests to Apache, which forwards them to Tomcat and MySQL.

APPENDIX A: RUBBOS EXPERIMENTAL SETUP

We adopt the RUBBoS standard n-tier benchmark, which is modeled after the well-known news website Slashdot. The workload consists of 24 different web interactions. The default workload generator emulates a number of users interacting with the web application layer. Each user's behavior follows a Markov chain model to navigate between different web pages; the think time between receiving a web page and submitting a new page download request is about 7 seconds. This workload generator has a similar design to those of other standard n-tier benchmarks such as RUBiS [12], TPC-W [14], and Cloudstone [29]. We run the RUBBoS benchmark on our testbed. Figure 12 outlines the software configurations, hardware configurations, and the sample 3-tier topology used in the Section II-B experiments. Each server in the 3-tier topology is deployed on a dedicated machine. All other client-server experiments are conducted with one client machine and one server machine.
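For concreteness, a closed-loop user emulator of this kind follows the pattern sketched below; the page names, transition table, and fixed 7-second think time are illustrative assumptions rather than RUBBoS's actual navigation model or parameters.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.Random;

/**
 * Sketch of a closed-loop emulated user: fetch a page, "think" for ~7s,
 * then pick the next page from a Markov-style transition table.
 */
public class EmulatedUser implements Runnable {
    // Hypothetical page graph; RUBBoS defines 24 web interactions.
    private static final Map<String, String[]> TRANSITIONS = Map.of(
            "/StoriesOfTheDay", new String[]{"/ViewStory", "/Search"},
            "/ViewStory",       new String[]{"/PostComment", "/StoriesOfTheDay"},
            "/Search",          new String[]{"/ViewStory"},
            "/PostComment",     new String[]{"/StoriesOfTheDay"});

    private final HttpClient client = HttpClient.newHttpClient();
    private final Random rng = new Random();
    private final String baseUrl; // e.g., "http://webserver:8080" (hypothetical)

    public EmulatedUser(String baseUrl) { this.baseUrl = baseUrl; }

    @Override
    public void run() {
        String page = "/StoriesOfTheDay";
        try {
            while (!Thread.currentThread().isInterrupted()) {
                HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + page)).build();
                client.send(req, HttpResponse.BodyHandlers.discarding()); // issue one interaction
                Thread.sleep(7000);                                       // ~7-second think time
                String[] next = TRANSITIONS.get(page);                    // next page per Markov model
                page = next[rng.nextInt(next.length)];
            }
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }
}
```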

REFERENCES

[1] Apache JMeter. http://jmeter.apache.org
[2] Collectl. http://collectl.sourceforge.net
[3] Jetty: A Java HTTP (Web) Server and Java Servlet Container. http://www.eclipse.org/jetty
[4] JProfiler: The award-winning all-in-one Java profiler. https://www.ej-technologies.com/products/jprofiler/overview.html
[5] lighttpd. https://www.lighttpd.net
[6] MongoDB Async Java Driver. http://mongodb.github.io/mongo-java-driver/3.5/driver-async
[7] Netty. http://netty.io
[8] Node.js. https://nodejs.org/en
[9] Oracle GlassFish Server. http://www.oracle.com/technetwork/middleware/glassfish/overview/index.html
[10] Project Grizzly: NIO Event Development Simplified. https://javaee.github.io/grizzly
[11] RUBBoS: Bulletin board benchmark. http://jmob.ow2.org/rubbos.html
[12] RUBiS: Rice University Bidding System. http://rubis.ow2.org
[13] sTomcat-NIO, sTomcat-BIO, and two alternative asynchronous servers. https://github.com/sgzhang/AsynMessaging
[14] TPC-W: A Transactional Web e-Commerce Benchmark. http://www.tpc.org/tpcw
[15] Adler, S. The Slashdot effect: An analysis of three internet publications. Linux Gazette 38 (1999), 2.
[16] Adya, A., Howell, J., Theimer, M., Bolosky, W. J., and Douceur, J. R. Cooperative task management without manual stack management. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2002), ATEC '02, USENIX Association, pp. 289–302.
[17] Allman, M., Paxson, V., and Blanton, E. TCP congestion control. Tech. rep., 2009.
[18] Banga, G., Druschel, P., and Mogul, J. C. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 1999), OSDI '99, USENIX Association, pp. 45–58.
[19] Belshe, M., Thomson, M., and Peon, R. Hypertext Transfer Protocol Version 2 (HTTP/2).
[20] Boyd-Wickizer, S., Chen, H., Chen, R., Mao, Y., Kaashoek, F., Morris, R., Pesterev, A., Stein, L., Wu, M., Dai, Y., Zhang, Y., and Zhang, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2008), OSDI '08, USENIX Association, pp. 43–57.
[21] Brecht, T., Pariag, D., and Gammo, L. Acceptable strategies for improving web server performance. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2004), ATEC '04, USENIX Association, pp. 20–20.
[22] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99: Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999), vol. 1, IEEE, pp. 126–134.
[23] Canas, C., Zhang, K., Kemme, B., Kienzle, J., and Jacobsen, H.-A. Publish/subscribe network designs for multiplayer games. In Proceedings of the 15th International Middleware Conference (New York, NY, USA, 2014), Middleware '14, ACM, pp. 241–252.
[24] Dukkipati, N., Refice, T., Cheng, Y., Chu, J., Herbert, T., Agarwal, A., Jain, A., and Sutin, N. An argument for increasing TCP's initial congestion window. SIGCOMM Comput. Commun. Rev. 40, 3 (June 2010), 26–33.
[25] Fisk, M., and Feng, W.-C. Dynamic right-sizing in TCP. http://lib-www.lanl.gov/la-pubs/00796247.pdf (2001), 2.
[26] Garrett, J. J., et al. Ajax: A new approach to web applications.
[27] Han, S., Marshall, S., Chun, B.-G., and Ratnasamy, S. MegaPipe: A new programming interface for scalable network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association, pp. 135–148.
[28] Harji, A. S., Buhr, P. A., and Brecht, T. Comparing high-performance multi-core web-server architectures. In Proceedings of the 5th Annual International Systems and Storage Conference (New York, NY, USA, 2012), SYSTOR '12, ACM, pp. 1:1–1:12.
[29] Hassan, O. A.-H., and Shargabi, B. A. A scalable and efficient Web 2.0 reader platform for mashups. Int. J. Web Eng. Technol. 7, 4 (Dec. 2012), 358–380.
[30] Huang, Q., Birman, K., van Renesse, R., Lloyd, W., Kumar, S., and Li, H. C. An analysis of Facebook photo caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 167–181.
[31] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2010), USENIXATC '10, USENIX Association, pp. 11–11.
[32] Krohn, M., Kohler, E., and Kaashoek, M. F. Events can make sense. In Proceedings of the 2007 USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), ATC '07, USENIX Association, pp. 7:1–7:14.
[33] Krohn, M., Kohler, E., and Kaashoek, M. F. Simplified event programming for busy network applications. In Proceedings of the 2007 USENIX Annual Technical Conference (Santa Clara, CA, USA, 2007).
[34] Lever, C., Eriksen, M. A., and Molloy, S. P. An analysis of the TUX web server. Tech. rep., Center for Information Technology Integration, 2000.
[35] Li, C., Shen, K., and Papathanasiou, A. E. Competitive prefetching for concurrent sequential I/O. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 189–202.
[36] Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) (Lombard, IL, 2013), USENIX, pp. 385–398.
[37] Pai, V. S., Druschel, P., and Zwaenepoel, W. Flash: An efficient and portable web server. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), ATEC '99, USENIX Association, pp. 15–15.
[38] Pariag, D., Brecht, T., Harji, A., Buhr, P., Shukla, A., and Cheriton, D. R. Comparing the performance of web server architectures. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 231–243.
[39] Soares, L., and Stumm, M. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI '10, USENIX Association, pp. 33–46.
[40] von Behren, R., Condit, J., and Brewer, E. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th Conference on Hot Topics in Operating Systems - Volume 9 (Berkeley, CA, USA, 2003), HOTOS '03, USENIX Association, pp. 4–4.
[41] von Behren, R., Condit, J., Zhou, F., Necula, G. C., and Brewer, E. Capriccio: Scalable threads for internet services. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 268–281.
[42] Welsh, M., Culler, D., and Brewer, E. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2001), SOSP '01, ACM, pp. 230–243.
[43] Zeldovich, N., Yip, A., Dabek, F., Morris, R., Mazieres, D., and Kaashoek, M. F. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track (2003), pp. 239–252.
