
DEGREE PROJECT IN COMPUTER SCIENCE, SECOND LEVEL
LAUSANNE, SWITZERLAND 2015

Enhancing Quality of Service Metrics for High Fan-In Node.js Applications by Optimising the Network Stack

LEVERAGING IX: THE DATAPLANE OPERATING SYSTEM

FREDRIK PETER LILKAER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)


FREDRIK PETER LILKAER

Enhancing Quality of Service Metrics for High Fan-in Node.js Applications by Optimising the Network Stack

Leveraging IX: The Dataplane Operating System

DD221X, Master’s Thesis in Computer Science (30 ECTS credits)
Degree Programme in Computer Science and Engineering, 300 credits
Master Programme in Computer Science, 120 credits
Royal Institute of Technology, year 2015

Supervisor at EPFL was Edouard Bugnion

Supervisor at CSC was Carl-Henrik Ek

Examiner was Johan Håstad

Royal Institute of Technology
School of Computer Science and Communication

KTH CSC SE-100 44 Stockholm, Sweden

URL: www.kth.se/csc

Presented: 2015-10-01


Abstract

This thesis investigates the feasibility of porting Node.js, a JavaScript web application framework and server, to IX, a dataplane operating system specifically developed to meet the needs of high-performance, microsecond-computing applications in a datacentre setting. We show that porting requires extensions to the IX kernel to support UDS polling, which we implement. We develop a distributed load generator to benchmark the framework. The results show that running Node.js on IX improves throughput by up to 20.6%, latency by up to 5.23×, and tail latency by up to 5.68× compared to a Linux baseline. We show how server-side request-level reordering affects the latency distribution, predominantly in cases where the server is load-saturated. Finally, due to various limitations of IX1, we are unable at this time to recommend running Node.js on IX in a production environment, despite improved metrics in all test cases. However, the limitations are not fundamental, and could be resolved in future work.

Referat

Improving Quality of Service for heavily loaded Node.js web applications through a more efficient operating system

This thesis investigates the possibility of using IX, a specialised dataplane operating system intended for high-performance datacentre applications, to run Node.js, a web application framework for JavaScript applications. Porting Node.js to IX requires extending IX with functionality for concurrent polling of Unix Domain Sockets and network flows, which we show and implement. Furthermore, a distributed load generator is developed to evaluate the application framework on IX against a baseline consisting of an unmodified Linux distribution. The results show that throughput improves by up to 20.6%, latency by up to 5.23× and tail latency by up to 5.68×. We further investigate whether the latency variance has increased due to server-side request reordering, which appears to be the case under high server load, although other factors appear to have a greater impact under low server load. Finally, even though all metrics improved at every observed measurement point, widespread adoption of IX for running Node.js applications cannot yet be recommended, primarily due to problems with horizontal scaling and with acting as a frontend server in a classic tiered-datacentre architecture.

1 Mainly lack of outgoing TCP connections and multi-process execution, respectively preventing Node.js from acting as a frontend in a multi-tiered architecture and scaling horizontally within a single node.


Acknowledgments

Writing a thesis can be a long, and at times straining, task. I would therefore like to thank the people who helped me complete mine.

First, I would like to thank the Data Center Systems laboratory at École Polytechnique Fédérale de Lausanne, EPFL, which allowed me to work with them for the duration of my thesis. In particular, I would like to thank my supervisor Edouard Bugnion, who offered invaluable advice every time I was stuck in my work. I would also like to thank Mia Primorac and George Prekas, whom I had the pleasure of working alongside, and who also withstood all my questions on IX.

I would like to thank my supervisor at KTH, Carl-Henrik Ek, for offering good academic guidance and writing advice.

Finally, I would like to thank all my friends in Lausanne for their support and motivation during the semester. An extra thanks goes out to those of you who helped me proofread.


Contents

Glossary

1 Introduction
  1.1 Problem Statement
  1.2 Contribution

2 Background
  2.1 Operating Systems
  2.2 The IX Dataplane Operating System
    2.2.1 Requirements and Motivations
    2.2.2 What is a Dataplane Operating System?
    2.2.3 Results
  2.3 Web Servers
    2.3.1 Apache, the Traditional Forking Web Server
    2.3.2 Nginx - the Event Driven Web Server
    2.3.3 Node.js
  2.4 Queueing Theory

3 Software Foundation
  3.1 The IX Dataplane Operating System
    3.1.1 Architectural Overview
    3.1.2 Dune Process Virtualisation
    3.1.3 Execution Model
    3.1.4 IX System Call API
    3.1.5 IX Event Conditions
    3.1.6 libix Userspace API
    3.1.7 Limitations
  3.2 Node.js
    3.2.1 V8 Javascript Engine
    3.2.2 libuv

4 Design
  4.1 Design Overview
  4.2 Limitations
  4.3 Modifications of IX
    4.3.1 Motivation for IX Kernel Extensions
    4.3.2 Kernel Extension
    4.3.3 libix
  4.4 Modifications of Node.js
    4.4.1 Modifications of libuv
    4.4.2 Modifications of the V8 Javascript Engine

5 Evaluation
  5.1 Results
    5.1.1 Test Methodology
    5.1.2 Performance Metrics
    5.1.3 A Note on Poisson Distributed Arrival Rates
    5.1.4 Load Scaling
    5.1.5 Connection Scalability
  5.2 Result Tracing
    5.2.1 Throughput Increase
    5.2.2 Reordering & Tail Latency

6 Discussion
  6.1 Related Work
  6.2 Lessons Learned
  6.3 Future Work
  6.4 Conclusion

Bibliography

A Resources
  A.1 libuv - ix
  A.2 Node.js

B dialog - a high-concurrency, rate-controlled, Poisson-distributed load generator
  B.1 Purpose
  B.2 Implementation
  B.3 Evaluation
  B.4 Resources


Glossary

API Application Programming Interface

ASLR Address Space Layout Randomisation

FIFO First-In, First-Out

HTTP HyperText Transfer Protocol

IPC Inter-Process Communication

libOS library Operating System

LIFO Last-In, First-Out

NIC Network Interface Controller

OS Operating System

RPC Remote Procedure Call

RSS Receive Side Scaling

SIRO Service in Random Order

SLA Service Level Agreement

TCP Transmission Control Protocol

TLB Translation Lookaside Buffer

UDP User Datagram Protocol

UDS Unix Domain Socket


Chapter 1

Introduction

Almost everyone has probably heard of Moore’s law in one form or another: that computers double in processing power approximately every 18 months1. Consequently, we should by now be free of performance problems, since our computers ought to be super fast given an exponential growth in processing power. And they are. The problem is just that we are constantly telling our computers to solve bigger and harder problems. Around the year 2004, it stopped being efficient to scale CPU processing performance vertically, that is, by increasing the clock frequency. As a result, we are now constructing software to make use of multi-core processors, and we are engineering large, complex, distributed systems to deal with the gigantic datasets that we like to call “big data”. We find that it is important to bound the end-to-end latency, particularly in such systems. End-to-end latency is a key performance indicator with a direct correlation to user experience and thus, for a commercial system, to both customer conversion and customer retention, in particular in a realtime/online system.

In such distributed systems, computation is divided between multiple entities, which may be spread across a plethora of machines within a single datacentre, or across several. Therefore, one way to minimise the end-to-end latency and to control its distribution is to attempt to bound the latency of every participating component. The motivation is that latency, and variance in latency, is induced in every step of communication along the execution path.

Furthermore, in current computer cluster deployments, energy accounts for a significant portion of operational expenses. Consequently, if we can engineer systems that are able to perform the required tasks more efficiently, they can run with fewer hardware resources and thus consume less energy. It is therefore still desirable to improve the efficiency of our systems, even with extremely powerful computational resources at our disposal.

In this work we explore a method to improve the performance of web servers based on the Node.js application framework, which may or may not be used in the kind of distributed setting described in the first paragraph.

1 The number of transistors on a die doubles approximately every 18 months.


The performance metrics, or Quality of Service metrics, we study are mainly latency and its distribution, as motivated in the second paragraph, and throughput. Throughput is the number of transactions per unit of time, and correlates with the energy efficiency requirements described in the third paragraph.

The IX [1] dataplane operating system, a specialised operating system for enhanced network performance, is the result of a research collaboration between Stanford University and École Polytechnique Fédérale de Lausanne. It is designed to bridge the four-way trade-off between low latency, high throughput, strong protection and resource efficiency. Low latency and high throughput encourage the construction of scalable, maintainable and fault-tolerant micro-service oriented architectures. Improved resource efficiency in conjunction with strong protection reduces both capital and operational expenses, as it permits workload consolidation, and energy proportionality directly affects the operational expenses [2].

IX uses hardware virtualisation to provide strong protection between applications while retaining performance. Performance is further enhanced by techniques such as adaptive batching, run-to-completion, strict FIFO ordering and a native, zero-copy Application Programming Interface (API). The results show greatly enhanced throughput, as well as latency and tail latency reductions, compared to the standard Linux networking stack.

Low latency and tail latency considerations are predominantly important in a setting where a frontend or mid-tier layer fans out requests to a large number of servers in a backend layer. As such, the performance of IX has primarily been assessed for microsecond-computing applications, such as the key-value store memcached [3], where throughput is increased by a factor of 3.6× and tail latency is reduced by 2×.

Since the publication of [1], the IX team has realised that Linux performs poorly regarding fairness and Quality of Service, and IX also seems to handle connection scalability better. Therefore it seems that IX can also be more suitable than Linux for a high fan-in situation, such as the one faced by a web server.

Node.js [4] is a contemporary JavaScript web application framework that rose to popularity in recent years due to providing a non-blocking, scalable I/O mechanism with a low learning curve. By leveraging non-blocking I/O, Node provides a single-threaded execution model based on an event loop [5]. By not dedicating a thread per connection, the system saves resources, which enables it to scale to a high number of concurrent clients. Furthermore, it became popular by unifying the server and client codebases under a single development language. The motivation is that this increases developer productivity and eases hiring, by letting companies combine backend and frontend teams into single units [6].

1.1 Problem Statement

The IX dataplane operating system improves upon Linux in throughput and latency by up to an order of magnitude [1], apart from providing better connection scalability.


Such optimisations could benefit a broad range of network-bound applications. Node.js is designed to improve connection scalability for I/O-bound applications, and could potentially benefit from an underlying operating system engineered specifically for that purpose. However, IX assumes a specialised processing model in order to implement its optimisations, and is directly targeted towards microsecond-computing applications.

Thus, can a more general network-bound application, such as a web application framework, in particular Node.js, be effectively ported to IX to benefit from its advantages? If so, what are the benefits and limitations, and how is performance affected?

1.2 Contribution

We show that Node.js can, thanks to its event-driven design, be ported to IX. The results show that Node on IX brings performance enhancements in terms of throughput and latency, but most notably in terms of 99th percentile latencies and throughput under varying levels of 99th percentile Service Level Agreements (SLAs), versus a standard Linux installation. We investigate and partially account for the sources of the improvements in performance metrics. Namely, the throughput increase can primarily be traced to the improved efficiency of batched system calls. We show an increased rate of request reorderings on Linux, which could contribute to its increased 99th percentile latency, due to a change in the effective queueing discipline. However, we are unable to verify that this is the primary contributor to the increased tail latency on Linux.

While most functionality of Node.js can be directly supported on IX, there are a few shortcomings of IX that prevent us from supporting the full functionality of Node.js. Namely, support for concurrent event notification on Unix Domain Sockets and network flows, outbound network connections, and multiprocess/multiple address space applications, such as Node running the cluster module, require modifications to the IX kernel.

In this work we extend the IX kernel with an epoll-like interface to support concurrent polling of Unix Domain Sockets and network flows, but multiple address spaces and outbound connections are left for future work.

Additionally, the engineering objective has been to construct a port with the smallest possible changeset to the codebases of Node.js [7], libuv [8], V8 [9] and IX [1]. We accomplished the work with 946 changed or added lines in libuv, 1 in V8 and 422 in IX, of which 132 in libix and 290 in the IX kernel.

Finally, we developed Dialog2, a closed-loop load generator for request/response type server applications such as web services. Dialog combines high connection concurrency with a rate-controlled Poisson process load. The purpose is to enable load measurements for a high connection count, in order to measure the connection scalability of Node.js on IX compared to the Linux baseline.

2 See appendix B.


Chapter 2

Background

This chapter explains the necessary background required to understand Node.js and IX. More specifically, it explains what they do, rather than how; the details and software architecture are left to chapter 3. We start by looking at the background of Operating Systems (OSs) in section 2.1, followed by an introduction to the IX dataplane operating system in section 2.2. We follow up with a brief history of web servers in section 2.3, and conclude with a succinct queueing theory primer in section 2.4, as web servers essentially are queueing systems. We will see later that queueing theory has an impact on the performance metrics we study in chapter 5.

2.1 Operating Systems

The main purpose of an OS is to abstract the details of the underlying hardware and to multiplex the access to various resources between applications. It provides the application programmer with a clean interface that abstracts away the peculiarities of the underlying hardware [10]. In general, operating systems consist of a kernel that provides the core functionality of the operating system, such as multiplexing of CPU and memory resources. On top of the kernel, each operating system typically comes with a set of user space libraries that enable applications to request services from the OS. Such libraries may implement OS functionality in user space, or may perform system calls that transfer control to kernel space. Most mainstream operating systems provide applications with facilities for process scheduling, Inter-Process Communication (IPC), memory management, a file system and I/O, such as a networking stack. Furthermore, they often include high-level libraries designed to support the development of user space applications; such libraries may include sound players or, especially, GUI windowing toolkits that permit application developers to create applications with a unified look and feel.

The literature [10] classifies operating systems into three main types: the monolithic kernel, the microkernel and the exokernel. Of the three, the monolithic kernel is by far the most commonly used for commodity operating systems.


Windows, Linux, and Unix systems such as the BSD variants, Mac OS X and Solaris are all built on a monolithic architecture. Monolithic means that the kernel is a single large program with no internal information hiding; any procedure can essentially call any other procedure [10]. Not having to perform context switches while carrying out cross-module tasks in the kernel improves performance, and is, apart from the simpler engineering task compared to other designs, a reason that many commodity operating systems have chosen this design.

As a bug in kernel code can bring down the entire system, the idea that as much functionality as possible should be put outside the kernel naturally comes to mind. This idea gives birth to the microkernel, which improves system reliability by separating system functionality into different modules, isolated as different user space processes. Most notably, device drivers run outside the kernel, so that a bug in e.g. a video driver can only crash the driver itself, and not the entire kernel. In a monolithic kernel there is no protection between modules, so a bug in one module, or a rogue module, can easily corrupt the data of any arbitrary module and thus bring the entire kernel down. Note that microkernels have historically received criticism for being inefficient, due to cross-module calls causing context switches.

Finally, the Exokernel [11] makes an end-to-end argument that operating systems provide inefficient abstractions, and that applications know better which abstractions they need. This led the authors to a minimalistic kernel which exports the concept of secure bindings: secure allocations of hardware resources that allow efficient multiplexing of resources across applications. The secure bindings use physical names to remove a layer of indirection, and the exokernel furthermore exposes allocation and revocation to allow deeper optimisations of “client applications”. The exokernel architecture allows for the construction of library Operating Systems (libOSs), operating systems that run in user space, linked with the application. The concept allows a different libOS to be used for each application, tailored to its specific needs, exporting just the abstractions the application needs, in the most efficient manner. Additionally, since libOSs are untrusted, they allow faster innovation of operating system software, as bugs are not nearly as fatal as in a monolithic kernel; they can only bring down the application and not the entire system.

2.2 The IX Dataplane Operating System

IX is a specialised operating system designed for the aggressive networking requirements posed mainly by datacentre applications. It runs as a virtualised process with protected access to hardware, inside an environment called Dune1 [12]. Dune provides a (Linux) process abstraction with access to privileged hardware instructions through virtualisation hardware. Since IX is a Dune-extended Linux process, it does not need to implement everything an operating system needs to provide a process, such as a file system, device drivers or process multiplexing. IX implements a specialised processing model and its own optimised datapath for network I/O.

1 Dune is further described in section 3.1.2.


System calls not directly supported by IX can thus be supported by simply passing them through to the underlying Linux kernel. Therefore, IX can be seen as a library operating system specifically designed for datacentre application needs, eschewing inefficient Linux abstractions whilst keeping the acceptable ones.

In the remainder of this section, we look at the motivations behind IX (section 2.2.1), what a dataplane operating system is (section 2.2.2) and the results that it achieves (section 2.2.3).

2.2.1 Requirements and Motivations

The purpose of IX is to deal with the increasingly specific demands that large-scale datacentre applications put on infrastructure and underlying software layers. Specifically, microsecond tail latencies are required to allow the construction of distributed applications with predictable latencies, composed of a large number of participating nodes [13]. Dean and Barroso showed that the tail latencies of individual components are amplified by scale: if one request out of 100 is slow on a single server, and a request requires answers from 100 servers in parallel, then 63% of such distributed queries will, in fact, be slow. Therefore it is imperative for large-scale datacenter applications to control the latency distribution and limit the 99th percentile latency of their components. Modern datacenter applications also require high packet rates to be able to sustain throughput, since packet sizes often are small [14, 1]. Furthermore, the practice of co-locating applications induces the need to isolate applications for security reasons, and also demands resource efficiency, so that server resources can be shared and reallocated amongst co-located applications [15, 16] with varying resource demands.
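The 63% figure follows directly from treating the 100 parallel requests as independent: if each request is slow with probability 0.01, the probability that at least one of them, and hence the whole query, is slow is

$$P(\text{slow query}) = 1 - (1 - 0.01)^{100} \approx 1 - 0.366 \approx 0.63.$$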

Commodity operating systems2 were designed during an era with notably different hardware characteristics than what is readily available in datacentres today. Processors used to sport a single processing core, multiplexing different applications through timesharing, and for networking, packet inter-arrival times used to be much longer than interrupts and system calls [1]. With 10Gb Ethernet, packet inter-arrival times are reaching nanoseconds: the inter-arrival time of minimum-sized packets at 10GbE is 67ns.3 This is well below the time scales of interrupts and system calls, which therefore become significant sources of latency and diminished throughput in high-performance datacentre applications. Furthermore, as a single cache miss served by DRAM may occupy 100ns, the even lower inter-arrival times also encourage data-oriented design for such applications. It is possible to argue that with the advent of multicore processing, some applications no longer need the type of resource scheduling provided by legacy operating systems.

2 Readily available and production-deployed operating systems such as Linux, Windows Server, FreeBSD and the various other Unix flavours.

3 $\frac{(64 + 512 + 96)\,\text{bits}}{10\times10^{9}\,\text{bits/s}} = 67.2\,\text{ns}$: the preamble and start-of-frame delimiter (64 bits), minimum-sized Layer 2 Ethernet frame (512 bits) and interpacket gap (96 bits) added and divided by the bit rate.


We are therefore free to revisit operating system design in order to improve both throughput and latency of datacenter applications, by no longer trading them for fine-grained resource scheduling.

User space networking stacks could remove some of the overheads of system calls by avoiding kernel crossings, but they do not necessarily solve the trade-off between low latency and high throughput [1]. Moreover, they do not offer protection between the application and the networking stack, which could lead to corruption of the network stack due to application-level bugs. More critically, such corruption could enable a malicious user to exploit the network stack in ways normally reserved for users with root access to the system, such as transmitting raw packets or enabling promiscuous mode [17]. Belay, Prekas, Klimovic, et al. [1] argue that the improvements gained by removing kernel crossings are marginal compared to amortising the costs over multiple system calls by batching them, as proposed by Soares and Stumm [18].

2.2.2 What is a Dataplane Operating System?

IX is a dataplane operating system, which implicitly tells us that it distinguishes between the dataplane and the control plane. Along with other contemporary operating systems such as Arrakis [19], it borrows the nomenclature from the networking community, where the separation between dataplane and control plane is widespread. In networking, switches typically operate in two planes: the dataplane and the control plane. The dataplane is responsible for packet forwarding along the forwarding path, typically implemented in hardware, performing fast lookups in the forwarding tables. The control plane, on the other hand, is responsible for configuring the dataplane(s); in the case of a switch, for setting up the forwarding tables by means of a control plane routing protocol, such as BGP [20].

Likewise, IX separates the areas of responsibility, improving efficiency by removing the control plane from the data path. The control plane performs coarse-grained resource allocation, such as allocation of dedicated CPU cores and network queues. The dataplane(s) are responsible for everything on the datapath, from packet processing to application logic. In IX, the Linux kernel acts as the control plane through the Dune kernel module. By eliminating the Linux kernel from the datapath and replacing it with a specialised, optimised datapath, IX can improve upon the throughput achieved by Linux by up to an order of magnitude [1].

2.2.3 Results

IX improves the throughput of sustained connections by up to 1.9× over mTCP and 8.8× over Linux for 64-byte packets [1, p. 58]. For memcached [3], throughput for the given SLA of 500µs at the 99th percentile is increased by 3.6×, whilst the unloaded tail latency is reduced by 2×. For further descriptions, evaluation, and results of IX, please refer to [1].


2.3 Web Servers

A web server is a piece of software, running on a machine connected to a network, capable of serving resources over the HTTP [21] protocol [22]. In some literature, the term may refer to the physical hardware server running such software, or the combination of such a dedicated hardware server and the web server software. In this work, “web server” refers to the web server software.

Web servers are by tradition divided into static and dynamic servers, where the classification indicates the type of content the server may serve. Static web servers merely serve static content, such as files stored on disk. Dynamic web servers either perform some processing or may generate the full content on a per-request basis. Such servers may run arbitrarily complex server programs, but typically run an application program that performs application business logic, stores data in an underlying database and responds to requests with customised, dynamic web pages.

2.3.1 Apache, the Traditional Forking Web Server

The Apache HTTP server [23], the web server that the Apache foundation’s name has become synonymous with, was started in 1995, has historically been by far the most deployed web server, and still holds a majority share of web server deployments as of July 2015 [24].

Apache used a fork-and-execute model for its version 1 deployments, spawning a number of processes, each handling a single request at a time. Most modern deployments use the Apache MPM worker module [25], which is a hybrid multi-process/multi-threaded concurrency module. The server spawns multiple processes, each running multiple threads. Each thread serves a single request at a time, but is held ready in a thread pool whilst idle.

2.3.2 Nginx - the Event Driven Web Server

Nginx [26], launched in 2004, is an asynchronous, event-driven web server specifically engineered to have a small resource footprint and to solve the C10K [27] scalability problem. The C10K problem is Kegel’s encouragement that, with the hardware of that time4, web servers should be able to handle 10 thousand concurrent connections.

Event-driven programming is a programming paradigm where the execution of a program is driven by reactions to events. Such events include user input, network data or sensory input. Most commonly, the model is implemented by a main loop polling different event sources. Upon receiving an event from a source, the event loop calls the preregistered callback function for the triggered event.
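As a concrete illustration, here is a minimal sketch of this pattern in C using the POSIX poll(2) call. The source and callback structures are illustrative inventions for this sketch; they are not taken from Nginx, Node.js or libuv:

    #include <poll.h>
    #include <stddef.h>

    /* One registered event source: a file descriptor plus the callback
     * to invoke when the descriptor becomes readable. */
    typedef void (*event_cb)(int fd, void *arg);

    struct source {
        int      fd;   /* event source, e.g. a listening or client socket */
        event_cb cb;   /* preregistered callback */
        void    *arg;  /* user data passed to the callback */
    };

    /* Core of an event-driven main loop: block until any source is ready,
     * then dispatch the callback of each ready source. A real server
     * would also add and remove sources as connections come and go. */
    void event_loop(struct source *srcs, size_t n)
    {
        struct pollfd pfds[64];   /* assumes n <= 64 for brevity */

        for (;;) {
            for (size_t i = 0; i < n; i++) {
                pfds[i].fd     = srcs[i].fd;
                pfds[i].events = POLLIN;
            }
            if (poll(pfds, (nfds_t)n, -1) < 0)
                continue;   /* interrupted; a real loop would check errno */
            for (size_t i = 0; i < n; i++)
                if (pfds[i].revents & POLLIN)
                    srcs[i].cb(srcs[i].fd, srcs[i].arg);
        }
    }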

For a web server, the event-driven architecture means that requests are split up into smaller chunks of work, and that I/O operations are performed by asynchronous system calls.

4 1 GHz CPU, 2 GB RAM and 1GbE [27].


The processing model lets a single thread of execution handle more than one connection; therefore fewer resources are dedicated per connection, and the system can scale to a higher number of concurrent connections [28].

2.3.3 Node.js

Node.js takes the event-driven web server concept of Nginx and combines it with the V8 JavaScript engine [9] to create an event-driven application server for applications written in JavaScript. Node.js leverages the V8 engine to provide a platform for fast execution of JavaScript. JavaScript was designed for writing callback-driven programs with asynchronous execution, as employed in web frontend UI applications, and developers are already used to this style. Therefore it is a suitable language for event-driven web applications.

Node.js was created by Ryan Dahl in 2009 to ease the implementation of real-time web applications. The combination of adequate websocket support and high connection scalability allows a Node.js application to hold a high number of concurrent connections with web clients open simultaneously, which facilitates the creation of real-time web applications.

Note that Node.js, like other event-driven web server architectures, solves the I/O scalability problem [5], not the computation scalability problem. If a workload is CPU-bound, performance might decrease by running it on an event-driven architecture. Fast, short-running requests might be queued up behind a long-running CPU-intensive task, whereas on a threaded architecture, the long-running task would be preempted and the fast tasks would complete before it. Node.js could still be used as part of such compute-intensive applications, but since the event loop must not be blocked, it would write the request data to a computation backend through some message queue.

Among the users of Node.js we find renowned companies such as PayPal and LinkedIn. LinkedIn reduced their number of servers from 30 down to 3, while still having headroom to handle ten times their current traffic [6]. Moreover, they claim to have improved “speed” by a factor of 20 by moving away from their previous Rails-based solution to Node.js [6]. However, this claim should be treated with care, as LinkedIn, for political reasons, used a proxying architecture that blocked the entire process while performing a cross-datacenter request, for each and every request [29]. The moral of the story is thus that if an application spends a lot of time waiting for I/O, then efficiency can be improved by employing asynchronous, non-blocking I/O.

2.4 Queueing Theory

Queueing theory is a branch of statistical mathematics that models the dynamics of queues in service systems. Briefly, customers, or clients, arrive at a service point with an arrival rate λ, may be forced to wait in line (or might leave the system), eventually get serviced, and then leave the system.


Kendall notation is generally used to describe a queueing system, as follows (in its most basic form):

A/B/c

where A is the arrival process, B the service time distribution and c the number of service stations.

An arrival process is always a point process, that is, a process such that the arrivals are points, isolated in time. The arrival process is typically assigned to one of four categories:

M indicates a Markovian, or memoryless arrival process. For queueing systems thisimplies the utilisation of a Poisson process.

D indicates a deterministic arrival rate.

GI abbreviates “General Independent”, a general process with the requirement that interarrival times are independent and identically distributed.

G designates a general process, any arbitrary point process.

A Poisson process is usually assumed for the arrival process, as in many applications it is a reasonable model of reality whilst still offering a simple mathematical model. Furthermore, as we often look at the queue over short time horizons, the process is often assumed to be homogeneous, that is, having a constant expected rate.
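Concretely, a homogeneous Poisson process with rate λ can be generated by drawing independent, exponentially distributed interarrival gaps via inverse-transform sampling. A minimal C sketch follows; the rate and loop are illustrative, and a real load generator such as the one in appendix B would use a better random source:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One exponentially distributed interarrival gap with rate lambda
     * (arrivals per second), via inversion of the exponential CDF:
     * gap = -ln(U) / lambda, with U uniform on (0, 1]. */
    static double exp_gap(double lambda)
    {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 1.0); /* avoid ln(0) */
        return -log(u) / lambda;
    }

    int main(void)
    {
        double lambda = 1000.0;  /* e.g. 1000 requests per second */
        double t = 0.0;

        for (int i = 0; i < 10; i++) {
            t += exp_gap(lambda);  /* successive event times of the process */
            printf("arrival %d at t = %.6f s\n", i, t);
        }
        return 0;
    }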

The service time is generally described as belonging to one of the following three classes:

M indicates a Markovian, or memoryless, service distribution, which leads us to exponentially distributed service times.

G designates a general distribution.

D indicates a deterministic service time distribution, which means that the service time is constant.

The number of service stations affects the performance of the service if the service stations share a queue. If we have four service stations, each with its own queue, and a total arrival rate λ, then for a Markovian arrival process we will in fact observe four M/G/1 systems, each with arrival rate λᵢ = λ/4, instead of one M/G/4 system. Finally, Node is single-threaded, and even in the case of multiple service stations through use of the cluster module, client affinity in real-time websocket systems will render a (number of processes) × (M/M/1) system anyhow. Therefore we will not delve any deeper into the topic of multiple service stations in this thesis.
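To illustrate the difference with numbers of our own choosing (not from the thesis): the mean response time of an M/M/1 queue is $T = 1/(\mu - \lambda)$. With four stations of service rate $\mu = 100$ requests/s and a total arrival rate $\lambda = 360$ requests/s, four separate M/M/1 queues give

$$T_{M/M/1} = \frac{1}{\mu - \lambda/4} = \frac{1}{100 - 90}\,\text{s} = 100\,\text{ms},$$

whereas a single shared-queue M/M/4 system at the same load has a mean response time of roughly 30 ms by the Erlang C formula: sharing the queue prevents one station from idling while another has a backlog.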

The queueing discipline describes the rules for how the next client to be serviced is chosen from the queue when a service station is ready to service a new request. The most common discipline to assume is First-In, First-Out (FIFO), the mathematically simplest model. A request comes in and is placed at the back of the queue, and requests to be serviced are always taken from the front of the queue.


A queueing system may also utilise disciplines such as priority queueing or random order. Li, Sharma, Ports, et al. demonstrate that the FIFO discipline is optimal from a tail latency perspective [30]. The motivation is simple: it minimises queueing time variation. Queueing time variation increases tail latency, as longer queueing times become more likely. Under FIFO, for each given queue length at the time of arrival, a request has an expected queueing time of the expected service time times the length of the queue. If the service time is deterministic, the queueing time is even certain, given the queue length. For any other queueing discipline, including Last-In, First-Out (LIFO), priority queueing and Service in Random Order (SIRO), the queueing time variance increases. Even if the service time is deterministic, the queueing time is not bounded for these disciplines. In LIFO, the queueing time is determined by the arrival process, even for a given queue length at arrival: if an additional request arrives while the request in service is still being processed, the new request will be processed before ours, increasing the variance in queueing time. Under random order, a request may end up staying in the queue as long as there is at least one other request in the queue; such behaviour also increases variance.
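The variance argument can be checked empirically. The following self-contained Monte Carlo sketch (parameters illustrative, not from the thesis) simulates an M/D/1 queue at 90% utilisation under FIFO and non-preemptive LIFO over the same arrival sequence; both disciplines yield the same mean waiting time, but LIFO shows a markedly higher variance and 99th percentile:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N      100000
    #define LAMBDA 0.9      /* arrival rate */
    #define S      1.0      /* deterministic service time => utilisation 0.9 */

    static double urand(void)
    {
        return (rand() + 1.0) / ((double)RAND_MAX + 1.0);
    }

    /* Simulate N jobs through a single server; lifo selects the
     * discipline (0 = FIFO, 1 = non-preemptive LIFO). Waiting times
     * are stored in w[]. */
    static void simulate(int lifo, double *w)
    {
        static double arr[N];
        static int    buf[N];           /* FIFO ring / LIFO stack */
        int head = 0, tail = 0, next = 0, done = 0;
        double t = 0.0, server_free = 0.0;

        for (int i = 0; i < N; i++) {   /* Poisson arrival times */
            t += -log(urand()) / LAMBDA;
            arr[i] = t;
        }
        while (done < N) {
            /* queue every job that arrived before the server frees up */
            while (next < N && arr[next] <= server_free)
                buf[tail++] = next++;
            int j;
            if (tail > head) {          /* pick from queue front or stack top */
                j = lifo ? buf[--tail] : buf[head++];
            } else {                    /* idle: jump to next arrival */
                j = next++;
                server_free = arr[j];
            }
            w[j] = server_free - arr[j];
            server_free += S;
            done++;
        }
    }

    static int cmpd(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        static double w[N];
        const char *name[2] = { "FIFO", "LIFO" };

        for (int d = 0; d < 2; d++) {
            srand(42);                  /* identical arrivals for both runs */
            simulate(d, w);
            double sum = 0.0, sq = 0.0;
            for (int i = 0; i < N; i++) {
                sum += w[i];
                sq  += w[i] * w[i];
            }
            double mean = sum / N, var = sq / N - mean * mean;
            qsort(w, N, sizeof w[0], cmpd);
            printf("%s: mean %.2f, variance %.2f, 99th pct %.2f\n",
                   name[d], mean, var, w[(int)(0.99 * N)]);
        }
        return 0;
    }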


Chapter 3

Software Foundation

This chapter explains the inner workings of IX, including Dune, and of Node.js, including V8 and libuv. It extends chapter 2 by providing detail on how these systems achieve their functionality.

3.1 The IX Dataplane Operating System

This section is organised as follows: section 3.1.1 explains the overall software architecture of IX, and section 3.1.2 explains how Dune process virtualisation works, which eases the understanding of the IX dataplane (section 3.1.3), since the IX dataplanes are Dune threads1.

3.1.1 Architectural Overview

IX is divided into a control plane, responsible for resource control and allocation, and dataplanes, responsible for network I/O and application logic, as illustrated in fig. 3.1a. The control plane initialises network interfaces and provides an interface for dataplanes to request allocations of cores, network queues and memory. It consists of the full Linux kernel and a user-level program that implements resource allocation policies. The Linux kernel is run in VMX root ring 02, leveraging Dune (section 3.1.2) to provide the capability of exercising control over the dataplane, without interfering with its normal mode of operation.

Like in the Exokernel [11], the dataplane(s) can be seen as application-specific operating system(s), as each runs in VMX non-root ring 0 and provides a single address space specific to the application running on IX.

There are two fundamental thread types for applications running on IX: elastic threads and background threads. Elastic threads interact with the IX dataplane, commuting between dataplane operation and user application code, whereas background threads do not interact with the IX dataplane.

1 Dune describes itself as enabling processes to enter Dune mode, while in fact it allows threads to enter Dune mode [12].

2 Used for hypervisors in virtualised systems.


[Figure 3.1 consists of two diagrams: (a) protection and separation of control and dataplane; (b) interleaving of protocol processing and application execution.]

Figure 3.1: The IX dataplane operating system. Reprinted from A. Belay, G. Prekas, A. Klimovic, et al., “IX: a protected dataplane operating system for high throughput and low latency”, in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 49–65.

Both thread types can issue arbitrary POSIX system calls; however, elastic threads are assumed not to perform any long-running actions, as that may result in dropped packets.

3.1.2 Dune Process Virtualisation

Dune uses VT-x virtualisation hardware to expose a process, rather than a machine, abstraction, with access to privileged hardware features in a safe manner [12]. By privileged hardware features we mean functionality previously only available to kernel-level code, such as control over page tables, TLBs, ring protection (CPU privilege modes) and access to NIC hardware queues.

Dune works by extending the Linux kernel with the Dune kernel module, which puts the kernel into VMX root mode and provides the facility for a process to enter Dune mode, transferring it from VMX root, ring 3 to VMX non-root, ring 0, and allowing it hardware access through the underlying virtualisation support.

Dune includes the implementation of a sandbox. The Dune sandbox application leverages the privilege modes exposed by Dune to constrain untrusted 64-bit Linux applications to ring 3, whilst the trusted sandbox module itself runs in ring 0.

The Dune system allows us to view the Linux kernel as a form of optional exokernel. Since it exposes privileged hardware features, we are free to implement our own abstractions directly on top of the hardware interface if we are not content with the Linux abstractions. By providing access through virtualisation hardware, Dune can multiplex access to the hardware features in a safe way, similar to the secure bindings of the Exokernel [11]. Along with the sandbox module, Dune simplifies the creation and usage of libOSs.


It allows us to write libOSs that override and replace inefficient abstractions of the current platform, while retaining the ability to make downcalls to the underlying host if its abstractions are deemed suitable. At the same time, we are able to run completely unmodified applications using the standard abstraction set concurrently on the machine, which may significantly ease the adoption of an Exokernel-inspired application architecture in existing infrastructure. Finally, the privilege modes exposed to processes by Dune allow the construction of libOSs that are protected against application-level bugs by hardware protection, a feature not provided in the original Exokernel design.

3.1.3 Execution Model

We present the IX dataplane execution model by first introducing the execution model inside an IX elastic thread, in the following subsection, Intra-dataplane. In the subsequent subsection we look at the bigger picture: how performance is enhanced by a synchronisation-free execution model.

Intra-dataplane

IX assumes an event-driven application, where events can only be generated from the network interface. The application can, from elastic threads, synchronously poll the kernel for events. Upon network activity (1, fig. 3.1b) the kernel processes the packets (2) and notifies the application by writing event conditions into an array that is mapped read-only into userspace, before returning from the system call (3). At this point, the userspace library libix processes the returned event conditions, calling the associated callbacks to notify the application of the occurred events. The application may respond by issuing further system calls, whose arguments are written into the batched system call vector, to be issued when the application next polls the kernel for new events (4–6).

IX employs a run-to-completion model with (bounded) adaptive batching. Run-to-completion means that tasks are run until they finish, which reduces latency incurred by scheduling and improves throughput and latency due to data cache locality, since consecutive processing stages often access the same data [1]. Batching reduces the overhead of system call transitions and also improves instruction cache locality, since the same instruction sequences are reused for multiple packets, both of which lead to a higher packet rate. Furthermore, batching is adaptive, so that it is only used upon congestion, minimising the effect on latency in non-congested cases. Upon congestion, the efficiencies of batching can improve latency by reducing head-of-line blocking [1]. Bounding the batch size bounds the latency imposed by batching and effectively avoids exceeding the capacity of the data cache.
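The benefit of batching can be summarised with a simple cost model of our own (not from the IX paper): if a user-kernel transition costs $c_x$ cycles and a batch carries $B$ system calls, the transition overhead per call drops from $c_x$ to

$$c_{\text{per call}} = \frac{c_x}{B},$$

so the overhead is amortised linearly in the batch size, while the bound on $B$ keeps the added queueing delay and cache footprint in check.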

Inter-dataplane

Due to the coarse-grained allocation policy of IX, dataplanes are allocated entire CPU cores and NIC queues. Receive Side Scaling (RSS) is used to hash incoming flows to a consistent NIC queue.


sys_bpoll(struct bsys_desc __user *d, unsigned int nr): performs I/O processing and issues a batch of system calls
sys_bcall(struct bsys_desc __user *d, unsigned int nr): issues a batch of system calls
sys_baddr(void): gets the address of the batched syscall array
sys_mmap(void *addr, int nr, int size, int perm): maps pages of memory into userspace
sys_unmap(void *addr, int nr, int size): unmaps pages of memory from userspace
sys_spawnmode(bool spawn_cores): sets the spawn mode, i.e. whether clone spawns elastic or background threads
sys_nrcpus(void): returns the number of active CPUs

(a) Exception-driven system calls

bsys_tcp_connect: opens a connection
bsys_tcp_accept(hid_t handle, unsigned long cookie): accepts a connection
bsys_tcp_reject(hid_t handle): rejects a connection
bsys_tcp_send(hid_t handle, void *addr, size_t len): transmits an array of data
bsys_tcp_sendv(hid_t handle, struct sg_entry __user *ents, unsigned int nrents): transmits a scatter-gather array of data
bsys_tcp_recv_done(hid_t handle, size_t len): advances the receive window and frees memory buffers
bsys_tcp_close(hid_t handle): closes or rejects a connection

(b) Batched system calls

Table 3.1: IX system calls

The design, in conjunction with the omission of a POSIX-style socket API (a shared flow namespace between threads of execution), removes the need for synchronisation between dataplanes. Such a design scales well horizontally for servers with an increasing number of CPU cores. The dataplanes still share the memory namespace, which can be used to exchange messages and perform application-level synchronisation. IX itself exemplifies this possibility by implementing an in-kernel Remote Procedure Call (RPC) mechanism to synchronise execution of functionality on a foreign elastic thread.

3.1.4 IX System Call API

The IX system call API is divided into two sets: standard, exception-driven system calls (table 3.1a) and batched system calls (table 3.1b). The standard system calls behave like ordinary exception-driven system calls, and either provide IX-specific functionality, such as sys_bpoll or sys_baddr, or are IX-required overloads of standard Linux system calls such as mmap. These system calls are mainly meant to be used in the startup phase of the application, not in its hot path. For performance-critical paths the application should use the batching API, which amortises kernel transition costs over multiple system calls. Therefore, network communication system calls are only available as batched system calls.

The exception-driven system calls return results as normal. The batched system calls write their respective results into the system call vector sent to the kernel, and require the application to examine the results after the batch has been processed. If the application uses the event-based API, such processing is provided; otherwise the application is required to implement it.


3.1.5 IX Event Conditions

Apart from issuing a set of batched system calls to the kernel, the sys_bpoll call also polls the kernel for events. Such events include, but are not limited to, incoming packets, sent buffers, and accepted and dropped connections. The full list, along with explanations of the event conditions, can be seen in table 3.2.

3.1.6 libix Userspace API

The userspace library libix offers two APIs for the application to utilise IX: a plain API that simply follows the system call enumeration, and an event-based API that may be easier for an application programmer to use.

Plain API

The plain API mirrors the system calls described in table 3.1. The userspace library provides an application with the possibility of calling the IX system calls by implementing the userspace system call mechanisms.

Event API

The event-based API is modelled after the libevent [31] API, as Memcached [3] uses libevent and the API was developed when porting Memcached to IX. It builds on the plain API, which merely exports the system calls as functions, and augments it with a copying API, a flow abstraction with individual registration of event handlers, and system call return handling.

libix introduces the ixev_ctx struct, which abstracts a network flow. It is a bidirectional flow handle that enables reading and writing. Furthermore, it allows the user to bind an event handler on a per-flow basis, rather than using a global multiplexing handler.

The ixev_wait function polls the IX kernel using sys_bpoll, handles return values from eventual system calls, and handles generated events by calling user-registered callback functions.

ixev_recv and ixev_send provide I/O with copy semantics, which accelerates the implementation of some applications by eliminating the need to reference-count I/O buffers. Some software assumes that the buffer received by the read call needs to be deallocated, which poses an incompatibility with zero-copy APIs.

connected (cookie, outcome): a locally initiated connection was successfully established.
knock (handle, src IP, src port): a remotely initiated connection is requested.
recv (cookie, mbuf ptr, mbuf len): a message buffer was received.
sent (cookie, bytes sent, window size): a number of bytes was sent and/or the window size was changed.
dead (cookie, reason): a connection died; it was concluded or expired.

Table 3.2: IX Event Conditions


ssize_t ixev_recv(struct ixev_ctx *ctx, void *addr, size_t len);
    Read data with copying.
void *ixev_recv_zc(struct ixev_ctx *ctx, size_t len);
    Read an exact amount of data without copying.
ssize_t ixev_send(struct ixev_ctx *ctx, void *addr, size_t len);
    Send data with copying.
ssize_t ixev_send_zc(struct ixev_ctx *ctx, void *addr, size_t len);
    Send data using zero-copy.
void ixev_add_sent_cb(struct ixev_ctx *ctx, struct ixev_ref *ref);
    Register a callback for when all current sends complete.
void ixev_close(struct ixev_ctx *ctx);
    Close a context.
void ixev_dial(struct ixev_ctx *ctx, struct ip_tuple *id);
    Open a connection.
void ixev_ctx_init(struct ixev_ctx *ctx);
    Prepare a context for use.
void ixev_wait(void);
    Wait for new events.
void ixev_set_handler(struct ixev_ctx *ctx, unsigned int mask, ixev_handler_t handler);
    Set the event handler and which events trigger it.
int ixev_init_thread(void);
    Thread-local initialiser.
int ixev_init(struct ixev_conn_ops *ops);
    Global initialiser.

Table 3.3: libix Event API

ixev_recv_zc and ixev_send_zc export the zero-copy API. They provide higher performance than their copying counterparts, but may prove harder to integrate into an application.

ixev_add_sent_cb provides a facility to register a callback for when all outstanding transfers have been sent. This is useful for attaching a callback that deallocates memory after a zero-copy transfer has completed.
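As an illustration, a minimal echo server built on the event API might look as follows. This is a sketch: the header name, the IXEV_IN event mask, the handler signature and the field names of struct ixev_conn_ops are assumptions based on table 3.3, not verbatim libix code.

#include <stdlib.h>
#include <sys/types.h>
#include <ixev.h>   /* assumed header */

static void echo_handler(struct ixev_ctx *ctx, unsigned int reason)
{
        char buf[4096];
        ssize_t n = ixev_recv(ctx, buf, sizeof(buf));    /* copying read */
        if (n > 0)
                ixev_send(ctx, buf, n);                  /* copying write */
        else if (n < 0)
                ixev_close(ctx);                         /* flow failed; close */
}

static struct ixev_ctx *accept_cb(struct ip_tuple *id)
{
        struct ixev_ctx *ctx = malloc(sizeof(*ctx));
        ixev_ctx_init(ctx);                              /* prepare the context */
        ixev_set_handler(ctx, IXEV_IN, echo_handler);    /* per-flow handler */
        return ctx;                                      /* non-NULL accepts the flow */
}

static void release_cb(struct ixev_ctx *ctx)
{
        free(ctx);                                       /* flow is dead */
}

int main(void)
{
        struct ixev_conn_ops ops = { .accept = accept_cb, .release = release_cb };
        ixev_init(&ops);          /* global initialiser */
        ixev_init_thread();       /* per-elastic-thread initialiser */
        for (;;)
                ixev_wait();      /* sys_bpoll: poll IX, dispatch callbacks */
}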

3.1.7 Limitations

Currently we can observe a range of limitations of the IX platform. IX currently supports neither outgoing TCP connections nor any form of UDP communication, due to how the allocation of cores and NIC queues is performed for flows. When IX is launched, it occupies the entire NIC, which prevents multiple IX instances from running in parallel. This behaviour also prevents the machine running IX from performing any DNS lookups, unless an extra NIC is present: since IX occupies the entire NIC, Linux cannot perform such lookups, and since IX does not support outgoing connections, neither can IX. Furthermore, the listening port of IX is hardcoded in the kernel source code. Finally, IX currently only supports the Intel x520 and 82599ES NICs.

3.2 Node.js

Node.js is an event-driven JavaScript application server. It consists mainly of libuv [8] (see section 3.2.2 and fig. 3.2b) for core functionality such as the event loop, I/O and timers, and Google's V8 JavaScript engine (section 3.2.1), which supplies swift JavaScript execution. Node, as seen in fig. 3.2a, consists of a set of core JavaScript libraries and a small C kernel that glues libuv together with V8, so that the libuv functionality can be used from JavaScript via V8. The remaining parts of Node.js are, as mentioned, JavaScript libraries implementing functionality such as HyperText Transfer Protocol (HTTP) parsing, to export the Node.js API [32]. Since such libraries do not issue system calls directly, but do so through V8 and libuv, they are OS independent and are therefore omitted from the scope of this thesis.


(a) Overview of Node. (b) libuv architecture. Reprinted from http://docs.libuv.org/en/v1.x/design.html.

Figure 3.2: Node.js Application Structure.

3.2.1 V8 JavaScript Engine

V8 [9] is a high-performance JavaScript engine developed by Google, primarily for its web browser Chrome. It has been open-sourced as a separate project and can be run either standalone or embedded in any C++ project.

3.2.2 libuv

Libuv [8] is a multi-platform support library that abstracts common operating system functionality, such as network I/O, file system operations and multi-threading, over the supported OSs, including but not limited to Linux, Windows and Mac OS X. The library mainly focuses on asynchronous I/O, and aims to provide a platform for building highly scalable event-driven applications.

This subsection mainly focuses on the workings of the core event loop and the stream API, as those require modification to port libuv to IX. Since IX is an OS built on top of Linux, the file system, thread pool and other miscellaneous functionality can be left unmodified.

Event Loop

The event loop is the core of the libuv library. Libuv provides the abstraction of an event loop, which means that an application can use many event loops, but each libuv data structure or handle must belong to one and only one event loop. The libuv operations are reentrant but not thread safe, meaning that operations can be performed concurrently on objects residing in different event loops, but cross-thread/event-loop operations must not be performed without careful synchronisation. Naturally, there can only be as many event loops running concurrently as there are threads running concurrently. The event loop can be run either for a single iteration, or for as long as events can still be generated, by calling the int uv_run(uv_loop_t* loop, uv_run_mode mode); function. Each event loop iteration performs the following actions, in the following order [33]:


1. Update the loop time. Libuv caches the time once per iteration, to minimise the number of time-related system calls.

2. Activation check. The loop will only iterate if it is “alive”. A loop is alive if it has active and referenced handles, active requests or closing handles.

3. Run timers that are scheduled to run before the loop time established in (1).

4. Pending callbacks are called, for example if an I/O callback has for some reason been deferred to the next loop iteration.

5. “Idle handle callbacks” are run. Idle handles are handles whose callbacks are run on every loop iteration.

6. “Prepare handle callbacks” are run.

7. Calculate the loop timeout: 0 if the loop was triggered as UV_RUN_NOWAIT, if there are idle handles, if there are no active handles, etc.; for a full list please see [33]. Otherwise the timeout assumes the value of the next timer timeout, or infinity if there is no active timer.

8. BLOCKS FOR I/O, up to the timeout calculated in step 7. The I/O polling uses different polling mechanisms depending on the platform, e.g. epoll on Linux, kqueue on OpenBSD and Mac OS X, and IOCP on Windows.

9. “Check handle callbacks” are run.

10. “Close callbacks” are called for handles that were closed with uv_close().

11. If the loop was run as UV_RUN_ONCE, forward progress is guaranteed by the library; thus if no I/O callback fired, the library will re-test for due timers.

12. If the loop was invoked with the UV_RUN_DEFAULT run mode, go to (1); otherwise return.
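A minimal, self-contained libuv program exercising this loop: a single timer handle is started, the loop blocks in step 8 until the timer is due, the callback runs, and uv_run returns once the last handle is closed. This is standard libuv 1.x, unrelated to the IX port.

#include <stdio.h>
#include <uv.h>

static void on_timer(uv_timer_t *handle)
{
        printf("timer fired\n");
        /* closing the only handle leaves the loop "dead", ending uv_run */
        uv_close((uv_handle_t *)handle, NULL);
}

int main(void)
{
        uv_loop_t *loop = uv_default_loop();
        uv_timer_t timer;

        uv_timer_init(loop, &timer);
        uv_timer_start(&timer, on_timer, 100 /* due in ms */, 0 /* no repeat */);

        /* steps 1-12 above repeat until the loop is no longer alive (step 2) */
        return uv_run(loop, UV_RUN_DEFAULT);
}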

Network and UDS sockets

TCP network flows and Unix Domain Sockets are exposed as stream abstractions following an asynchronous API. Common to these stream types is that they are implemented using asynchronous system calls and polled using the (platform-dependent) scalable event notification mechanism used in step 8 in section 3.2.2.

Since this is the main API that needs to be reimplemented on top of the IX API rather than the Linux/POSIX API, it is presented in detail.

The stream API includes the following data types:

• uv_stream_t, a stream handle. “Subtypes” follow:

– uv_tcp_t: A TCP handle, used to represent TCP streams and servers.

– uv_pipe_t: A UDS handle on Unix systems and a named pipe handle on Windows.

– uv_tty_t: A handle for a stream to a console.

• uv_connect_t: A connect request.

• uv_shutdown_t: A shutdown request.

• uv_write_t: A write request.

• Callback function types:

– void (*uv_write_cb)(uv_write_t* req, int status): A write request callback. Status is negative for failed requests, 0 for successful requests.

– void (*uv_connect_cb)(uv_connect_t* req, int status): A connect request callback, called when a connection started by uv_connect has completed. Status is negative for failed requests, 0 for successful requests.

– void (*uv_shutdown_cb)(uv_shutdown_t* req, int status): A shutdown request callback. Status is negative for failed requests, 0 for successful requests.

– void (*uv_connection_cb)(uv_stream_t* server, int status): A connection callback, called when a stream server has an incoming connection.

Libuv streams, subtypes of uv_stream_t, support the following operations:

• int uv_shutdown(uv_shutdown_t* req, uv_stream_t* handle, uv_shutdown_cb cb);

Shuts down the write side of a duplex stream. Waits for pending requests to complete; when the shutdown has finished, the callback is called.

• int uv_listen(uv_stream_t* stream, int backlog, uv_connection_cb cb);

Starts listening for incoming connections on the server specified by stream. The callback is called upon incoming connections.

• int uv_accept(uv_stream_t* server, uv_stream_t* client);

Accepts incoming connections and creates new bidirectional TCP flows (handles). Should be called after receiving a uv_connection_cb callback; then it guarantees successful completion, a guarantee that does not hold if it is called more than once per uv_connection_cb.

• int uv_read_start(uv_stream_t* stream, uv_alloc_cb alloc_cb, uv_read_cb read_cb);

Start reading on a stream. The alloc_cb will be called to allocate read buffers, and the read_cb will be called when data is available. The read callback will be called repeatedly until there is no more data available, or until int uv_read_stop(uv_stream_t*) has been called.

• int uv_read_stop(uv_stream_t* stream);

Stop reading from the stream.

• int uv_write(uv_write_t*, uv_stream_t*, const uv_buf_t[], unsigned int, uv_write_cb);

Write the supplied buffers, in order, to the stream. The write callback will be called upon write completion.

• int uv_write2(uv_write_t* req, uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs, uv_stream_t* send_handle, uv_write_cb cb);

Extended write functionality to send handles over a pipe.

• int uv_try_write(uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs);

Same as int uv_write, but does not queue requests if they cannot complete immediately.

• int uv_is_readable(const uv_stream_t* handle);

Return a non-zero number if the stream is readable and zero if it is not.

• int uv_is_writable(const uv_stream_t* handle);

Return a non-zero number if the stream is writable and zero if it is not.

• int uv_stream_set_blocking(uv_stream_t* handle, int blocking);

Enable or disable blocking mode for stream operations; operations then complete synchronously rather than asynchronously, while the asynchronous interface itself remains unchanged.

TCP flows additionally support the following operations:

• int uv_tcp_init(uv_loop_t* loop, uv_tcp_t* handle);

Initialise the TCP handle data structure. Does not create a connection.

• int uv_tcp_open(uv_tcp_t* handle, uv_os_sock_t sock);

Open a file descriptor or socket as a libuv TCP handle.

• int uv_tcp_nodelay(uv_tcp_t* handle, int enable);

Enable / disable Nagle’s algorithm.

• int uv_tcp_keepalive(uv_tcp_t* handle, int enable, unsigned int delay);

Enable / disable TCP keep-alive.

• int uv_tcp_simultaneous_accepts(uv_tcp_t* handle, int enable);

Enable / disable simultaneous asynchronous accept requests that are queued by the operating system when listening for new TCP connections.

• int uv_tcp_bind(uv_tcp_t* handle, const struct sockaddr* addr, unsigned int flags);

Bind the handle to an IP tuple (IP address and port number).

• int uv_tcp_getsockname(const uv_tcp_t* handle, struct sockaddr* name, int* namelen);

Get the current address to which the handle is bound.

• int uv_tcp_getpeername(const uv_tcp_t* handle, struct sockaddr* name, int* namelen);

Get the address of the peer bound to the handle.

• int uv_tcp_connect(uv_connect_t*, uv_tcp_t*, const struct sockaddr*, uv_connect_cb);

Establish an outgoing TCP connection. The uv_connect_cb will be called upon completion or error.

“Pipe” flows additionally support the following operations:

• int uv_pipe_init(uv_loop_t* loop, uv_pipe_t* handle, int ipc);

Initialise the pipe data structure.

• int uv_pipe_open(uv_pipe_t* handle, uv_file file);

Open a FD or existing handle as a libuv pipe.

• int uv_pipe_bind(uv_pipe_t* handle, const char* name);

Bind the pipe to a file path.

• void uv_pipe_connect(uv_connect_t* req, uv_pipe_t* handle, const char* name, uv_connect_cb cb);

Make an outgoing connection to the specified Unix Domain Socket (UDS) (Unix) or named pipe (Windows).

• int uv_pipe_getsockname(const uv_pipe_t* handle, char* buffer, size_t* size);

Get the name of the local end of the pipe.

• int uv_pipe_getpeername(const uv_pipe_t* handle, char* buffer, size_t* size);

Get the name of the remote end of the pipe, i.e. the peer.

• void uv_pipe_pending_instances(uv_pipe_t* handle, int count);

Set the pipe queue size when the handle is used as a pipe server (the maximum number of pending connections).

• int uv_pipe_pending_count(uv_pipe_t* handle);

• uv_handle_type uv_pipe_pending_type(uv_pipe_t* handle);

Used together to receive a stream handle over an IPC pipe.
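To make the interplay of these calls concrete, here is a minimal TCP server in standard libuv 1.x that answers every read with a fixed payload, in the spirit of the Hello World workload benchmarked later in this thesis. This is plain Linux libuv, not the IX port, and error handling is largely elided.

#include <stdlib.h>
#include <uv.h>

static uv_tcp_t server;

static void alloc_cb(uv_handle_t *handle, size_t hint, uv_buf_t *buf)
{
        buf->base = malloc(hint);        /* read buffer requested by libuv */
        buf->len  = hint;
}

static void write_cb(uv_write_t *req, int status)
{
        free(req);                       /* one-shot request object */
}

static void read_cb(uv_stream_t *client, ssize_t nread, const uv_buf_t *buf)
{
        free(buf->base);                 /* we reply from a static buffer */
        if (nread <= 0) {
                if (nread < 0)           /* EOF or error */
                        uv_close((uv_handle_t *)client, NULL);
                return;
        }
        uv_write_t *req = malloc(sizeof(*req));
        uv_buf_t out = uv_buf_init((char *)"hello world!\n", 13);
        uv_write(req, client, &out, 1, write_cb);
}

static void on_connection(uv_stream_t *srv, int status)
{
        uv_tcp_t *client = malloc(sizeof(*client));
        uv_tcp_init(srv->loop, client);
        if (uv_accept(srv, (uv_stream_t *)client) == 0)  /* safe once per callback */
                uv_read_start((uv_stream_t *)client, alloc_cb, read_cb);
        else
                uv_close((uv_handle_t *)client, NULL);
}

int main(void)
{
        struct sockaddr_in addr;
        uv_tcp_init(uv_default_loop(), &server);
        uv_ip4_addr("0.0.0.0", 8000, &addr);             /* the port IX listens on */
        uv_tcp_bind(&server, (const struct sockaddr *)&addr, 0);
        uv_listen((uv_stream_t *)&server, 128, on_connection);
        return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}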

File System

File system operations follow the asynchronous API set by libuv for the other stream types, but can also be run synchronously if no callback function is supplied. However, even when file system operations are run asynchronously, they are executed using synchronous system calls in a separate worker thread from libuv's threadpool. The reason is that not all scalable I/O mechanisms (e.g. epoll) support file system file descriptors.

Thread Pool

Libuv implements a thread pool that facilitates asynchronous execution of inherently synchronous work such as file system system calls, DNS lookups or user-supplied tasks. The threadpool uses UDSs as its synchronisation mechanism with the main thread, to cause the blocking poll to return. The synchronisation is needed to allow the main thread to process callback functions in a timely and thread-safe manner.
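The wake-up pattern this relies on can be sketched generically: a worker thread writes one byte on a socketpair to make the main thread's blocking poll return. This is an illustration of the mechanism, not libuv's actual internal code.

#include <pthread.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

static int wake_fds[2]; /* [0] polled by the main loop, [1] written by workers */

static void *worker(void *arg)
{
        /* ... perform inherently blocking work (file I/O, DNS, user task) ... */
        char one = 1;
        write(wake_fds[1], &one, 1);    /* forces the main poll to return */
        return NULL;
}

int main(void)
{
        socketpair(AF_UNIX, SOCK_STREAM, 0, wake_fds);

        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.fd = wake_fds[0];
        epoll_ctl(epfd, EPOLL_CTL_ADD, wake_fds[0], &ev);

        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);

        struct epoll_event out;
        epoll_wait(epfd, &out, 1, -1);  /* blocks until the worker wakes us */
        char drain;
        read(wake_fds[0], &drain, 1);   /* now run the completed callbacks */
        return pthread_join(t, NULL);
}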


Chapter 4

Design

The design chapter covers the overall design of the port, as well as the modifications to each software module in detail. The modifications to Node.js, covered in section 4.4, are almost exclusively made to libuv, thanks to the good software modularisation of the Node.js project. For libuv, the major work is adapting it to use the libix API in lieu of POSIX system calls. Section 4.3 covers the modifications to IX, in particular the support for an epoll-like API for polling Unix Domain Sockets.

4.1 Design Overview

Node.js is adapted to use IX's system calls for networking through modifications to libuv, Node's core event loop library. We implement our version of libuv on top of the IX userspace library libix, in order to minimise the changeset to the codebases and simplify the implementation. An epoll-like interface is introduced in the IX kernel, exposed through a new system call in IX, and finally to the application through a userspace function in libix. Naturally, the libuv library leverages this new interface to give applications the possibility to register interest in events on UDSs, as well as to use libuv's threadpool.

4.2 Limitations

The libuv version designed in this thesis supports Node.js, but makes no claims to universally support all applications relying on the libuv API. Most limitations of the IX branch of libuv stem from various limitations in IX (see section 3.1.7) that prevent us from implementing the libuv API with the very same guarantees as standard libuv. Such limitations include, but are not limited to: only one libuv event loop can be used per IX elastic thread; event loops can only be run in IX elastic threads; multiple processes, as required by the Node cluster module [34], cannot run concurrently; and


finally, listening “sockets”, or handles, cannot be bound to any port other than port 8000. Neither do we support DNS queries on machines without a Network Interface Controller (NIC) left to Linux, since IX does not provide a DNS API. On machines with more than a single NIC, one can leave a NIC attached to Linux, and the functionality remains available through system call passthrough.

4.3 Modifications of IX

This section describes the implementation of UDS polling support in IX. Section 4.3.1 motivates the need to introduce changes to the IX kernel. Section 4.3.2 describes the architectural design of the kernel-level functionality and the newly introduced system call. Section 4.3.3 is a brief passage describing the user-level API interfacing with the introduced system call.

4.3.1 Motivation for IX Kernel Extensions

Node, and by extension libuv, supports a rich API of stream I/O operations on network flows, UDSs and files in the file system. IX provides a synchronous polling method that does not support a timeout. This is by design: the run-to-completion paradigm allows IX to assume that there is no more work to be done at the application level once a loop iteration has completed. Yielding control to the dataplane through the bpoll call has the semantics that nothing more can happen in userspace; all subsequent events will be network-triggered. This does not hold for a general Node.js application. For example, a web request might trigger a database call to a MySQL server running on localhost, with communication over a Unix Domain Socket. A user might request that a file be read through (synchronous) system calls in a background thread from libuv's threadpool; when the background task has completed, an event must be raised, and its origin is not the IX dataplane.

When the IX bpoll has returned and the elastic thread has processed all incoming packets, the application has to make the polling decision: should it call bpoll or not? If it does call bpoll, it risks getting stuck indefinitely in the IX bpoll if no new packets arrive. That would effectively prevent responses from being delivered to clients waiting for asynchronous work performed in the threadpool, or for data read from a UDS, e.g. one connected to another application on the same machine. If it does not call bpoll, or postpones it until there is no more queued work or waiting clients, it risks missing incoming packets in the dataplane, further reducing the throughput of the system. Therefore it is impossible to implement a fully functional port of Node.js to IX without modifying or extending the IX kernel.

However, note that if the IX kernel is extended with support for UDSs, both the problem of concurrently polling UDSs and network flows and the problem of synchronisation with other event sources can be resolved. The first problem is solved trivially by combining the polling APIs of the two pollable types. Synchronisation with arbitrary event sources, such as the completion of a task run asynchronously in the thread pool, can be implemented by writing to a UDS registered with IX.

Figure 4.1: The IX dataplane kernel including the UDS worker thread addition.

4.3.2 Kernel Extension

The IX kernel is extended with functionality to concurrently poll for events on both UDSs and network flows. IX does not discern between different polling sets for network flows the way the Linux epoll functionality does; it reports every flow with a change on the queues tied to the dataplane in question as an event condition (a change meaning that more data has been received on the flow since the last event condition; IX polling is thus neither edge- nor level-triggered). Therefore, in the name of API coherence, with the introduction of notification support for events on Unix Domain Sockets we do not introduce multiple polling sets, but provide a singleton polling set per IX dataplane. We introduce only one new system call, sys_uds_ctl, which replaces the role of epoll_ctl for the elastic thread's global polling set:

int sys_uds_ctl(int fd, int op, struct epoll_event *event);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

Events on the subscribed Unix Domain Sockets are returned as a new event condition upon calling the IX-specific system call sys_bpoll:

int sys_bpoll(struct bsys_desc *d, unsigned int nr);

The extension, as seen in fig. 4.1, is implemented by the addition of a kernel-level worker thread. The worker thread and the singleton polling instance are lazily spawned on the first call to int sys_uds_ctl(...);. Every elastic thread with a nonempty UDS polling set thus maintains a 1-to-1 mapping to a UDS worker thread, with which it shares an epoll instance, the UDS worker thread keeping an identifier to its parent elastic thread. The UDS worker thread continuously polls the shared epoll instance through int epoll_pwait(...); and, upon encountering a non-empty return set, it updates the shared PollingSetState. Upon modification, and if there is no outstanding notification to its corresponding elastic thread, it notifies the elastic thread, through IX's RPC mechanism for posting work to a specific elastic thread, that the PollingSetState has changed.

In the IX polling system call int sys_bpoll();, the system synchronises against the incoming work queue of the elastic thread in question once per iteration of the main loop. Thus, if the worker thread posts a notification of a change of the PollingSetState, the elastic thread will synchronise by running the callback. Subsequently, in the next polling loop iteration, the elastic thread will generate an event condition for each file descriptor marked in the PollingSetState and then reset the PollingSetState. After polling the NIC queue one additional time, it will let the polling system call return, since a non-empty set of event conditions has been generated.
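The worker loop can be sketched as follows. This paraphrases the mechanism described above in C; the PollingSetState layout and the notify_elastic_thread hook are illustrative stand-ins for the actual IX kernel structures and its RPC mechanism.

#include <stdatomic.h>
#include <stdbool.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

struct polling_set_state {
        atomic_bool marked[1024];  /* fds with pending activity (fd < 1024 assumed) */
        atomic_bool notified;      /* is a notification already outstanding? */
};

/* assumed hook: posts a callback to the parent elastic thread via IX's RPC */
extern void notify_elastic_thread(void);

static void uds_worker(int epfd, struct polling_set_state *pss)
{
        struct epoll_event evs[MAX_EVENTS];
        for (;;) {
                int n = epoll_pwait(epfd, evs, MAX_EVENTS, -1, NULL);
                for (int i = 0; i < n; i++)
                        atomic_store(&pss->marked[evs[i].data.fd], true);
                /* notify the parent only if no notification is outstanding */
                bool expected = false;
                if (n > 0 && atomic_compare_exchange_strong(&pss->notified,
                                                            &expected, true))
                        notify_elastic_thread();
        }
}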

4.3.3 libix

Libix provides a user-level function that enables the application to invoke the int sys_uds_ctl(...); system call of IX. Libix is also modified to accept a function pointer to a callback function for the added UDS_ACTIVITY event condition.

4.4 Modifications of Node.js

Since Node.js consists of a small core kernel that binds the V8 JavaScript engine together with the libuv event library, implementing as much functionality as possible in JavaScript, our porting efforts are largely concentrated on the libuv event library. Libuv implements most of the core functionality Node provides, whereas the core Node kernel exposes bindings for said functionality to JavaScript code through the V8 JavaScript engine.

4.4.1 Modifications of libuv

Libuv, as seen in fig. 4.2, is modified to use the libix event API instead of the various POSIX APIs (epoll, kqueue or event ports) or the Windows IOCP API. Networking is performed via the API described in section 3.1.6. Furthermore, the introduced epoll-like interface for Unix Domain Sockets is used, both to provide support for UDSs and to synchronise with libuv's internal threadpool.


Figure 4.2: libuv implemented on top of the libix event polling API.

Networking

The libuv port to IX uses the event API of libix to as large an extent as possible, as it provides an API fairly close to libuv's and performs reference counting of write buffers. We implement a subset of the libuv API, following the description in section 3.2.2, and describe the implementation of each function while motivating departures from the libuv API, in either functionality or contract. The implementation follows:

First, we supplement the uv_tcp_t struct, libuv's user-facing TCP stream handle, with a pointer to a libix ixev_ctx struct. Likewise, the ixev_ctx's user_data field is chosen to contain a pointer to a uv_tcp_t, if properly set up and paired. We also extend the libuv TCP struct with a linked list of pointers to libuv contexts, to facilitate the usage of a TCP handle as a server. The list serves as a buffer of connections accepted by the libuv layer from IX, awaiting acceptance by the user application.
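Schematically, the pairing looks as follows; the field and type names are illustrative, and the real structs carry many more members.

struct ixev_ctx;                       /* libix flow context (opaque here) */

struct pending_conn {                  /* node in the acceptance buffer */
        struct ixev_ctx *ctx;
        struct pending_conn *next;
};

struct uv_tcp_s {                      /* libuv's uv_tcp_t, abridged */
        /* ... standard libuv handle fields ... */
        struct ixev_ctx *ix_ctx;       /* added: paired libix flow context */
        struct pending_conn *accept_queue; /* added: flows accepted from IX,
                                              awaiting uv_accept by the user */
};

/* The reverse link: ixev_ctx.user_data holds a pointer to the uv_tcp_t. */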

The changes to the networking subsystem include the functions for libuv streams and libuv TCP streams:

UV stream handle API:

• int uv_shutdown(uv_shutdown_t* req, uv_stream_t* handle, uv_shutdown_cb cb);

Shuts down the write side of a duplex stream. Waits for pending requests to complete; when the shutdown has finished, the callback is called.

The call to uv__io_start is disabled; the function is otherwise left unmodified. It initialises the shutdown request and schedules its execution in the loop.

• int uv_listen(uv_stream_t* stream, int backlog, uv_connection_cb cb);

Starts listening for incoming connections on the server specified by stream. The callback is called upon incoming connections.

A setter function for the connection callback. Note that to effectively listen for incoming connections, the server must be bound to an address; for TCP flows this is done via uv_tcp_bind.

• int uv_accept(uv_stream_t* server, uv_stream_t* client);

Accepts incoming connections and creates new bidirectional TCP flows (handles). Should be called after receiving a uv_connection_cb callback; then it guarantees successful completion, a guarantee that does not hold if it is called more than once per uv_connection_cb.

The acceptance of new connections is a three-step procedure. First, a library-internal callback, struct ixev_ctx *ixuv__accept(struct ip_tuple *id), is registered with libix to fire upon a USYS_TCP_KNOCK event condition. Said callback will, if a listening server is registered with uv_tcp_bind, create a new ixev_ctx, enqueue it with the listening server, and enqueue the listening server for I/O. It then returns a pointer to the allocated ixev_ctx, which causes libix to issue a bsys_tcp_accept system call.

Secondly, an interim callback handling the readiness of the listening handle is called, which redirects control to the user-supplied connection callback.

The third phase is to call uv_accept from the connection callback. For TCP flows, the function simply dequeues an ixev_ctx from the supplied uv_tcp_t* server handle and connects it with the client handle.

• int uv_read_start(uv_stream_t* stream, uv_alloc_cb alloc_cb, uv_read_cb read_cb);

Start reading on a stream. The alloc_cb will be called to allocate read buffers, and the read_cb will be called when data is available. The read callback will be called repeatedly until there is no more data available, or until int uv_read_stop(uv_stream_t*) has been called.

The UV_READING flag is set for the stream and the read callback handler is registered. Compared to standard libuv, we disable the activation of listening on a file descriptor in the case of TCP streams.

• int uv_read_stop(uv_stream_t* stream);

Stop reading from the stream.

The UV_READING flag is cleared, along with the read callback function, blocking the user from notifications of available data.

• int uv_write(uv_write_t*, uv_stream_t*, const uv_buf_t[], unsigned int, uv_write_cb);

Write the supplied buffers, in order, to the stream. The write callback will be called upon write completion.

Writing to a uv_tcp_t uses the libix zero-copy API, under the assumption that the user may not modify or free the submitted buffer before the write completion callback has been called. With the submission of the write buffer (by reference) we submit an ixev_ref_t containing a pointer to the user's uv_write_t write request. When the IX kernel triggers a sent event condition with a transmission count that exceeds the position of our buffer, our internal callback is called. In that callback the write request is looked up and added to the write-complete queue. Eventually this leads to the user-supplied callback being called as notification of the completed write, allowing the user to free the related buffers or otherwise proceed with its processing. (A sketch of this path follows after this list.)

• int uv_write2(uv_write_t* req, uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs, uv_stream_t* send_handle, uv_write_cb cb);

Extended write functionality to send handles over a pipe.

NOT APPLICABLE. Unlike the Unix socket layer, IX does not provide the possibility of sending flows between processes.

• int uv_try_write(uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs);

Same as int uv_write, but does not queue requests if they cannot complete immediately.

NOT IMPLEMENTED. Always returns 0.

• int uv_is_readable(const uv_stream_t* handle);

Return a non-zero number if the stream is readable and zero if it is not.

No modification; reads a flag from a handle field.

• int uv_is_writable(const uv_stream_t* handle);

Return a non-zero number if the stream is writable and zero if it is not.

No modification; reads a flag from a handle field.

• int uv_stream_set_blocking(uv_stream_t* handle, int blocking);

Enable or disable blocking mode for stream operations; operations then complete synchronously rather than asynchronously, while the asynchronous interface itself remains unchanged.

NOT APPLICABLE. IX does not support blocking operation.
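As referenced from the uv_write entry above, the zero-copy write path can be sketched like this: the buffer is handed to IX by reference, and an ixev_ref callback completes the libuv request once the sent counter passes the buffer. The ixev_ref layout and the finish_write_request helper are illustrative assumptions, not actual libuv-ix code.

#include <stdlib.h>
#include <sys/types.h>
#include <uv.h>

struct ixev_ctx;                             /* libix flow context (opaque here) */
struct ixev_ref { void (*cb)(struct ixev_ref *ref); }; /* assumed layout */

extern ssize_t ixev_send_zc(struct ixev_ctx *ctx, void *addr, size_t len);
extern void ixev_add_sent_cb(struct ixev_ctx *ctx, struct ixev_ref *ref);
extern void finish_write_request(uv_write_t *req);  /* hypothetical helper that
                                                       enqueues the request on the
                                                       write-complete queue */

struct write_ctx {
        struct ixev_ref ref;                 /* must be first: cast below */
        uv_write_t *req;                     /* the user's libuv write request */
};

static void sent_cb(struct ixev_ref *ref)
{
        struct write_ctx *wc = (struct write_ctx *)ref;
        finish_write_request(wc->req);       /* user's uv_write_cb runs later */
        free(wc);
}

static int ixuv__write(uv_write_t *req, struct ixev_ctx *ctx,
                       void *base, size_t len)
{
        struct write_ctx *wc = malloc(sizeof(*wc));
        wc->req = req;
        wc->ref.cb = sent_cb;
        ixev_send_zc(ctx, base, len);        /* buffer handed over by reference */
        ixev_add_sent_cb(ctx, &wc->ref);     /* fires when sends up to here complete */
        return 0;
}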


UV TCP handle API:

• int uv_tcp_init(uv_loop_t* loop, uv_tcp_t* handle);

Initialise the TCP handle data structure. Does not create a connection.

The I/O handling is replaced with a handler that only performs callbacks for completed writes, since all other functionality of the I/O handler is provided by libix.

• int uv_tcp_open(uv_tcp_t* handle, uv_os_sock_t sock);

Open a file descriptor or socket as a libuv TCP handle.

NOT APPLICABLE. IX does not have a socket layer supporting the use of file descriptors as TCP streams.

• int uv_tcp_nodelay(uv_tcp_t* handle, int enable);

Enable / disable Nagle’s algorithm.

NOT APPLICABLE. IX does not implement Nagle’s algorithm, so libuv-ix does not offer a way to control it.

• int uv_tcp_keepalive(uv_tcp_t* handle, int enable, unsigned int delay);

Enable / disable TCP keep-alive.

NOT APPLICABLE. IX does not implement TCP keep-alive, so libuv-ix does not offer a way to control it.

• int uv_tcp_simultaneous_accepts(uv_tcp_t* handle, int enable);

Enable / disable simultaneous asynchronous accept requests that are queued by the operating system when listening for new TCP connections.

NOT IMPLEMENTED.

• int uv_tcp_bind(uv_tcp_t* handle, const struct sockaddr* addr, unsigned int flags);

Bind the handle to an IP tuple (IP address and port number).

Binding listening “sockets” is done by binding a mapping from an (IP, port) tuple to a listening handle. With libix we register a library-internal intermediary callback. Upon connection events, the intermediary callback looks up the listening handle for the connection tuple and directs the connection event to the concerned handle.

Since IX currently only supports listening on port 8000, the mechanism is implemented as a single pointer stored in the BSS segment, but it can easily be extended to support multiple listening handles by implementing the interface with a hash map, as all accesses to the singleton mapping are performed through the map interface.

• int uv_tcp_getsockname(const uv_tcp_t* handle, struct sockaddr* name, int* namelen);

Get the current address to which the handle is bound.

NOT IMPLEMENTED. Consider replacing with a suitable alternative and/or implementing a reverse lookup for TCP listening servers.

• int uv_tcp_getpeername(const uv_tcp_t* handle, struct sockaddr* name, int* namelen);

Get the address of the peer bound to the handle.

NOT IMPLEMENTED. Consider extending libix to save remote peer information upon accepting a connection.

• int uv_tcp_connect(uv_connect_t*, uv_tcp_t*, const struct sockaddr*, uv_connect_cb);

Establish an outgoing TCP connection. The uv_connect_cb will be called upon completion or error.

NOT APPLICABLE / NOT IMPLEMENTED. IX does not currently support outgoing connections. When outgoing connections are supported, this function needs to call void ixev_dial(struct ixev_ctx *ctx, struct ip_tuple *id);.

Unix Domain Sockets

Libuv uses a single “backend_fd” file descriptor for a singleton polling set per event loop, and Node.js uses a single libuv event loop per Node.js process. Implementing UDS support over the IX abstraction of a single polling set thus poses no unnecessary restrictions. All libuv calls to int epoll_ctl(...); are proxied through int uv__epoll_ctl(...);. Thus, all we need to do is replace the system call issued by int uv__epoll_ctl(...); from int epoll_ctl(...); to int sys_uds_ctl(...);, discarding the int epfd parameter. Finally, we register a callback with libix for handling USYS_UDS_ACTIVITY event conditions. Naturally, the code within that callback function is the same code that normally follows the call to int epoll_wait(...); in standard libuv: it looks up the user-registered callback function for the specified file descriptor and yields control to the user. Thus, no changes to the implementation of the stream or pipe abstractions in libuv are needed.
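Concretely, the proxying amounts to something like the following; the user-level entry point for sys_uds_ctl is assumed to be exposed by libix, and the dispatch body is elided.

#include <sys/epoll.h>

/* the new IX system call, exposed to user space through libix */
extern int sys_uds_ctl(int fd, int op, struct epoll_event *event);

/* replaces the epoll_ctl call inside libuv; epfd is discarded since IX
 * keeps a single polling set per dataplane */
int uv__epoll_ctl(int epfd, int op, int fd, struct epoll_event *events)
{
        (void)epfd;
        return sys_uds_ctl(fd, op, events);
}

/* registered with libix for USYS_UDS_ACTIVITY event conditions; runs the
 * same code that follows epoll_wait() in stock libuv */
void ixuv__uds_activity(int fd, unsigned int events)
{
        /* look up the watcher registered for fd and invoke its callback */
        (void)fd; (void)events;
}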


Timers

Timers continue to work since IX can pass the time-related system calls down to the Linux kernel, and all timer structures in libuv are time agnostic. However, with the IX processing model there is a risk that the polling blocks for a long time, if no request reaches the server for an extended period. That will prevent libuv from executing user-provided timer callback functions until the IX polling returns. Note that this behaviour does not break the API: timers are only guaranteed to run some time after they have expired, and there is no upper bound on the delay. Therefore we do no extra implementation work to handle timers. For users interested in having timers execute within a bound from the timer expiration point, we suggest implementing a worker thread that periodically wakes up and writes to a UDS to trigger an event condition, forcing IX to return from polling and enabling timer handling in the elastic thread (sketched below).
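A sketch of that suggestion follows; the wake-up UDS is assumed to have been registered with IX via sys_uds_ctl beforehand, so each write generates a UDS_ACTIVITY event condition.

#include <pthread.h>
#include <unistd.h>

static void *timer_waker(void *arg)
{
        int wake_fd = *(int *)arg;       /* UDS registered with sys_uds_ctl */
        char one = 1;
        for (;;) {
                usleep(1000);            /* bounds timer delay to roughly 1 ms */
                write(wake_fd, &one, 1); /* forces sys_bpoll to return */
        }
        return NULL;
}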

4.4.2 Modifications of the V8 JavaScript Engine

The V8 JavaScript engine employs Address Space Layout Randomisation (ASLR) [35] as a measure against buffer overflow attacks. The technique randomises the memory placement of data areas of the process, such as the stack, heap and libraries, in order to prevent a perpetrator performing a buffer overflow attack from reliably jumping between memory locations. The protection mechanism is implemented as a random memory placement hint to the mmap system call. Currently, IX does not support the hints that V8 generates. Therefore, ASLR is disabled by changing the function generating the randomised placement hint, void* OS::GetRandomMmapAddr() in v8/src/base/platform/platform-posix.cc, to always return 0.


Chapter 5

Evaluation

The evaluation chapter is organised as follows: section 5.1 presents the main performance evaluation along with its methodology. In section 5.2 we account for the mechanisms in IX that improve the various performance metrics, namely throughput in section 5.2.1 and the latency distribution in section 5.2.2.

5.1 Results

The main results section starts by introducing the test methodology in section 5.1.1. In section 5.1.4 we look at how the latency of the system depends on the arrival rate of requests. Connection scalability is explored in section 5.1.5, where we look at how the latency varies as the number of sustained connections increases under a sub-saturation load.

5.1.1 Test Methodology

We compared Node-on-IX with Node-on-Linux for a version of Node1 modified to disable ASLR in V8, dynamically built against libuv. Both test systems used Ubuntu 14.10 with Linux kernel version 3.16.0-41. For the Linux tests we used version 1.5.0 of libuv2, whereas IX uses the libuv version described in chapter 4. Since a Node application typically performs I/O-intensive rather than CPU-intensive work, we chose to benchmark Node with a Hello World type of application issuing an HTTP response body of 17 bytes.

For the benchmarks, an in-house scalable distributed load generator was developed. The application allows a rate-controlled, Poisson-distributed load to be generated from a set of load-generating machines, synchronised by a master node responsible for latency measurements and data reconciliation. Each physical machine is able to simulate a high number of virtual clients through a coroutine-based parallelisation model. For a further description of the load generator, please see appendix B.

1 git commit: 9010dd26529cea60b7ee55ddae12688f81a09fcb
2 git commit: db0624a465493931c790445c22227660b88c5a8e

Each test case used 4 slave load-generating physical machines and one master that distributed load and measured latency over non-saturated probe connections. The server and each of the load-generating machines used a dual-socket motherboard with 2× 2.60 GHz 8-core Intel Xeon E5-2650, for a total of 16 cores and 32 hyperthreads, with 64 GB of memory. The machines were equipped with Intel x520 10GbE NICs and connected by 10 Gb Ethernet over a Quanta/Cumulus 48x10GbE switch with a Broadcom Trident+ ASIC. The latency measurement machine held 4 concurrent connections open at an average request rate of 1000 requests per second. The remaining virtual clients and load were evenly distributed among the participating load generators. Experiments were run for 5 minutes per data point.

5.1.2 Performance Metrics

Throughput

Throughput is the number of transactions that can be completed in a given time frame. In the case of Node, we look at the number of HTTP responses that can be completed per second. We are interested both in the total throughput, that is, the maximum throughput the system can sustain, and in the throughput given some SLA. An SLA is often given as a 99th percentile latency. Throughput given an SLA thus means the maximal throughput attainable while not exceeding the given SLA.

Latency

Latency is the time it takes for the server to serve a given request. The latency of a request includes the network round-trip time, the time spent waiting while earlier requests are serviced (queueing time) and, finally, the service time of the request in question.

We measure and care about tail latency, that is, the far end of the latency distribution, not only because the world is becoming increasingly real-time oriented, with interactive applications inducing the need to guarantee timely service in almost all cases; every hiccup in a smooth user experience decreases user retention and thus profit [36]. We also care about tail latency because the 99th percentile case is far more common than intuition tells us.

The tail latency becomes more common, and thus more important, when we introduce dependencies between requests, such that e.g. the slowest request in a set determines the “response of the set”. As Dean and Barroso [13] demonstrated, if a frontend service fans out requests to an underlying layer such as a key-value store, and requires all responses before it can proceed and aggregate a response to the client, the end user will experience the 99th percentile often. Assume that the number of fan-out requests is 100: the chance to undercut the 99th percentile for a single request is naturally 0.99, whereas the chance to do so for all 100 requests is 0.99^100 ≈ 0.366.


Thus, in fact, 63.4% of all requests will observe the 99th percentile latency. Note that the math is identical for every scenario where the response time is given by the slowest response of a set of requests. For a frontend web server this occurs in at least two scenarios: if the web server serves all types of web objects3 needed to display a page, or if, in a standard tiered datacentre architecture, the underlying layer already provides a very tight bound on tail latencies.

A (web) client needs to load a set of web objects in order to display a web page. As of 2012, the average number of objects per web page was 100 [37]. Thus, by employing the same math as in the previous paragraph, we see that 63.4% of all requests will observe the 99th percentile latency.

If the tail latencies of the underlying layers have been severely tamed, a crude simplification gives that the responses come back to the web server “more or less at the same time”, which again ties the end-to-end latency to the web server’s processing of the slowest of those requests. Note that if the tail latency of the underlying service is not very controlled, the latency distribution of the end response will simply be the latency distribution of a single web server transaction plus the distribution of the underlying layer.

5.1.3 A Note on Poisson Distributed Arrival Rates

Note that for a uniform arrival rate below saturation there would be no queue build-up, and thus no variance in latency; all requests would observe the latency of an unloaded system. A Poisson-distributed arrival rate more accurately models reality, with requests from different clients being independent, which inherently induces non-uniformity in the arrival rate. Such behaviour causes momentary queue build-ups whenever the momentary arrival rate is greater than the service rate. These build-ups induce latency variance, which makes it interesting to study the latency distributions of the systems even for sub-saturation loads.

5.1.4 Load Scaling

We test how the systems respond to load scaling by fixing the number of connections to powers of 2, from 1 to 16384, and varying the load for each setting. The results shown are limited to 64 and 512 connections respectively; other concurrency levels yield similar-looking graphs. In fig. 5.1 we see those results as an xy-plot: the x-axis shows the achieved throughput, and the y-axis depicts the observed latencies at the specific load level for each of the systems. For both systems we study both the average, or expected, latency and the 99th percentile of the latency distribution. When a queueing system hits saturation in terms of throughput, queueing theory predicts that the latency approaches infinity at an exponential rate4. Therefore, we are interested in studying the latency response even in sub-saturation cases.

3 Such as images, stylesheets and scripts.
4 For a load with an open loop. With “enough” concurrency and a low SLO, the phenomenon will appear to take place even for the closed-loop loads we are studying.


[Figure 5.1 plots latency (µs) against load (10³ requests/sec) for IX-ZC and Linux, average and 99th percentile. (a) Load scaling under 64 clients. (b) Load scaling under 512 clients.]

Figure 5.1: Load scaling.

In fig. 5.1a we observe an increase in throughput of 16.75%, and we can see that the 99th percentile tail latency is reduced by 5.24× at 7000 requests per second. Note that given an SLA of 2 ms 99th percentile latency, the effective attainable throughput rises from 4000 requests per second on Linux to 11000 on IX, a 2.75× increase in throughput under the given SLA. In fig. 5.1b we see how both the average latencies and the 99th percentile latencies are much lower on IX for all load levels. Note how the 99th percentile line of IX gracefully dips below even the average latency of the system running on Linux. Furthermore, for this specific concurrency level, we observe a 20.62% increase in throughput, a 5.23× reduction in average latency (at 7000 req/s) and a 5.68× reduction in 99th percentile tail latency (at 6000 req/s).

5.1.5 Connection Scalability

Figure 5.2a shows an almost unloaded system running at approximately 20% of its throughput. Notice how the 99th percentile latency of Linux spikes at 16384 connections, while the latency of IX merely doubles, appearing comparatively constant. At a throughput of 5000 requests per second (fig. 5.2b) we start to see the disparity between the two systems at a much lower connection concurrency. At 1024 connections the IX system exhibits a 4.92× reduction in 99th percentile tail latency. Note that given an SLA of 2 ms at the 99th percentile, for this particular throughput the web server can handle 32 concurrent connections running on Linux, and 8192 running on IX.

[Figure 5.2 plots latency (µs) against the number of concurrent connections (1 to 16384) for IX-ZC and Linux, average and 99th percentile, at a fixed throughput. (a) Connection scalability for 2000 requests/s. (b) Connection scalability for 5000 requests/s.]

Figure 5.2: Connection scalability.

5.2 Result Tracing

In this section we try to account for the causes of the performance differences between the systems seen in section 5.1.4 and section 5.1.5. In particular, we look at the increased throughput of IX in section 5.2.1 and the lowered 99th percentile latency in section 5.2.2.

5.2.1 Throughput Increase

To determine the batched system calls' effect on the throughput of the server running on IX, we experimentally set the IX event condition batching size to 1. This has the implication that every packet delivery triggers a kernel crossing, and every buffer that has been sent also triggers a kernel crossing, just as if no batching had been performed at the system call layer. We run at a moderate load and a concurrency level of 512 concurrently connected virtual clients. Figure 5.3 shows the system running on the unmodified IX kernel in black, Linux in red, and finally IX with disabled batching in blue. The figure clearly shows how the non-batched IX system, in both the average (fig. 5.3a) and the 99th percentile (fig. 5.3b) case, reaches saturation at the same load level as Linux, as opposed to the elevated saturation point of unmodified IX. Note that the non-batched version still provides lower latency than Linux for sub-saturation loads.

5.2.2 Reordering & Tail Latency

Since the queueing discipline affects the latency distribution (see section 2.4), we investigate whether there are more request reorderings on Linux than on IX, and whether this causes the elevated 99th percentile latency. The motivation is that if a server reorders requests, it in fact changes the effective queueing discipline.

We define an ordering violation to be a pair of requests A and B such that request A reached the server’s NIC before request B, but B was processed before A. Let the total ordering violation count be the sum of all ordering violations.


[Figure 5.3 plots latency (ms) against load (10³ requests/sec) under 512 clients for IX, Linux and IX with batching disabled (IX-NB). (a) Average latencies. (b) 99th percentile latencies.]

Figure 5.3: Throughput plot for Linux and IX, with and without batching.

Furthermore, let the total number of reorderings be the number of requests that have been the victim of an ordering violation.

We mark each request’s processing with a sequence number in the server application program (since Node is single-threaded, there are no race conditions on the sequence number). We approximate the arrival order by a client-side timestamp taken once the request has been copied into the kernel-space send buffer. All requests on the load generator are saved with this metadata for offline processing. To find the number of violations, we process the requests in order of server-side processing, incrementally adding them to a set sorted in issue order. If a request is added to the end of that set, no request processed before it was issued after it, i.e. it was not violated. If it is added anywhere else than at the end of the set, we have identified a set of requests that were issued after it but processed before it. Since we traverse the requests in processing order, there can be no other request that was processed before it, and we have therefore found the violation subset for which the request in question is request A in the definition given above. Since violations are symmetric, we need only find the violation subsets for all requests such that, for a given global choice of whether to look for requests of type A or B, the request in question is the chosen end of the violation. If we sum the sizes of all such subsets we obtain the total number of violations.
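For concreteness, an equivalent, though asymptotically slower (O(n²)), formulation of this counting pass: visiting the requests in processing order and insertion-sorting their issue timestamps, every element shifted past is exactly one ordering violation. This is a hypothetical offline sketch, not the actual analysis code.

#include <stdint.h>
#include <stdlib.h>

/* issue[i] is the client-side issue timestamp of the i-th request,
 * with i running in server-side processing order */
uint64_t count_violations(const uint64_t *issue, size_t n)
{
        uint64_t *sorted = malloc(n * sizeof(*sorted));
        uint64_t violations = 0;
        size_t len = 0;

        for (size_t i = 0; i < n; i++) {
                size_t pos = len;
                while (pos > 0 && sorted[pos - 1] > issue[i]) {
                        sorted[pos] = sorted[pos - 1]; /* each shift: one violation */
                        violations++;
                        pos--;
                }
                sorted[pos] = issue[i];  /* added at the end means: not violated */
                len++;
        }
        free(sorted);
        return violations;
}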

Table 5.1 shows the ordering violation counts for three different workloads, 2000, 5000 and 7000 requests per second, with 8, 1024, 4096 and 16384 connected virtual clients. For the reordering test we let the master node continue to measure the latency, but the reordering request set is the request set issued by a single load generator. The Clients column gives the number of concurrently connected virtual clients, and TP the achieved total throughput of the system (not a maximum test, but aimed at a target load level). AVG gives the average latency and the first 99th column the 99th percentile latency, both measured in microseconds.


SYSTEM   Clients   TP       AVG [µs]   99th [µs]   Requests   Violations   99th   99 lat / 99 vio.
Linux    8         1999.5   287        633         300190     3117         1      353
IX       8         1998.8   172        280         300401     1174         0
Linux    8         5005     286        869         1201395    80945        1      412
IX       8         4995.7   212        457         1199043    7789         0
Linux    8         6994.3   335        1144        1799121    290947       2      511
IX       8         7001.6   264        633         1800112    18493        1
Linux    1024      2002     426        1363        300105     5834         1      853
IX       1024      2000.6   197        510         300115     2042         0
Linux    1024      4989.4   807        3696        1197414    193804       2      3015
IX       1024      4993.6   241        681         1197962    13411        1
Linux    1024      7000.2   963        7069        1799767    3076127      27     215
IX       1024      7003.2   412        1472        1801614    34357        1
Linux    4096      2003.3   421        1234        299918     4923         1      613
IX       4096      1997.2   214        621         298860     2273         0
Linux    4096      4987.6   771        3149        1197092    140086       1      NaN
IX       4096      5003.3   276        945         1201014    13451        1
Linux    4096      6996.7   954        5646        1799855    1112979      15     224
IX       4096      7003.8   547        2506        1800287    667747       1
Linux    16384     1995.6   431        1157        299217     4732         1      593
IX       16384     1995.8   213        564         298966     2568         0
Linux    16384     4998.9   844        4130        1199935    95874        1      NaN
IX       16384     4998.6   347        1923        1198846    14345        1
Linux    16384     6895.3   1105       7595        1800366    1011314      13     261
IX       16384     7006.6   614        4459        1802588    437480       1

Table 5.1: Ordering violations.

Requests and Violations give the total number of requests and the number of violations, respectively, in the load generator's request set. The second 99th column gives the 99th percentile of violations per request in the request set. The last column gives the ratio between the absolute difference in 99th percentile latency between IX and Linux and the absolute difference in the 99th percentile of violations per request. Observe that for a load rate of 7000 requests/s, the ratio is close to the unloaded average latency on Linux. Therefore, the increased request-level reordering may well be the contributing factor to the elevated tail latency relative to IX for loads close to saturation. For sub-saturation loads, the high ratio numbers suggest that more factors are at work.

Since it is known that IX processes packets in strict FIFO order, and Node.js handles events in the order they were generated, we know that the reorderings measured for IX are client-generated. Since the endpoint of measurement is the server’s processing sequence number, we also know that these reorderings happen during the client’s send phase. This observation highlights the problem of using a multi-threaded client to test reordering; for best results, a highly optimised IX client should have been used. However, that was not possible within the scope of the project, since IX currently does not support outgoing connections.


Chapter 6

Discussion

We set out to determine whether Node.js could be effectively ported to IX, and furthermore to chart the performance benefits and disadvantages of using the IX operating system to run Node.js. The results clearly show that it is not only possible, but that we can improve upon the performance of Node in all tested metrics by using IX instead of Linux. Naturally, the size of the benefit varies across the tested metrics. Unloaded latency is improved by roughly a factor of 2×, and by up to a factor of 5.23× for some sub-saturation loads1, and tail latency is also significantly improved, by up to 5.68×. For cases where we care about fast responses and a good distribution of such latencies, it makes sense to run Node.js on IX. Such cases include, but are not limited to: a single web server handling all the requested web objects (i.e. not using a Content Delivery Network)2, but also a standard web server in a classical tiered datacentre hierarchy, if the underlying key-value store is assumed to already keep a very tight bound on tail latency. However, one can question whether Node.js is the ideal framework for an application with tight requirements on low latency, as a considerable latency cost is induced by the execution of JavaScript, and the transition cost between executing JavaScript and C++ in V8 is high.

Note the very modest throughput increase of roughly 20%: significant, but not game-changing. If the increase came with no drawbacks, just plug an application into a new OS and get a 20% performance increase, switching might have been a no-brainer. Many of the drawbacks can be ignored depending on the use case, e.g. the inability to look up DNS without a supplementary NIC, which could easily and inexpensively be installed, or the lack of UDP, which applications that use TCP might accept. Likewise, IX's current inability to initiate remote network connections might be acceptable for a single web server setup. But it does pose a problem in a multi-tiered datacentre architecture, as it prevents the front-end web server from initiating connections to the nodes of services in the underlying layers, such as key-value store replicas.

1 See e.g. 512 clients @ 8000 RPS, fig. 5.1b.
2 See argument in section 5.1.2.




The major hindrance to viewing the 20% throughput improvement as a reason to immediately start running Node on IX is, however, the lack of support for horizontal scaling. The idiomatic way of horizontally scaling a Node.js application is the Node.js cluster module, which runs multiple Node processes in parallel: a set of worker processes that perform the application's business logic, and a single master process responsible for accepting connections and distributing them over the available workers. Theoretically, the cluster module scales throughput linearly with the number of CPU cores, since it utilises a process per core. Our practical results do not suggest otherwise: with 16 cores and 32 hyperthreads we should see anywhere from a 16× up to a 32× increase in throughput, and we observed a throughput improvement of up to 24× over a single-process system by employing the cluster module. This is a significant performance gap compared to the 20% optimisation achieved by IX. The two optimisations should be orthogonal, but the 20% improvement from running on IX only becomes relevant once IX supports a horizontally scalable execution model.

There is work in progress on using SR-IOV3 to, among other purposes, support concurrent execution of multiple IX instances, i.e. a multi-process execution model. However, this is not sufficient to horizontally scale Node.js on IX: IX processes do not share a single network flow namespace, so IX flows cannot be shared among multiple IX processes the way Node.js currently scales horizontally on Linux systems. Furthermore, IX dataplanes are tightly coupled to their respective NIC queues, and IX processes would likewise be tied to their respective Virtual Functions. Therefore, the pattern of having a master process accept connections and then distribute them over worker processes does not work on IX.

Note that SR-IOV assigns a different MAC address to each Virtual Function. Thus, using SR-IOV to multiplex packets over multiple processes requires a demultiplexing function, such as a load balancer, to distribute the connections over the per-core “mini nodes”. Depending on the characteristics of the connections and the mechanism of the load balancer, the latency increase might be acceptable for web servers such as Node.js. For other services, such as microsecond-computing applications, we might not want to scale horizontally if it requires an additional system passthrough.

In summary, horizontal scaling with a process per CPU core, each running a worker event loop as in the Node.js cluster module, is not currently possible for Node.js on IX. In particular, the Node.js cluster module paradigm does not work, and another solution is needed, which may or may not prove hard to engineer. This problem, apart from the software maturity, usability, and support of IX, remains the main issue preventing imminent adoption of IX for running network-bound Node.js applications.

3 Single-Root I/O Virtualization, an Intel technology to multiplex a NIC as multiple virtual NICs.




6.1 Related Work

The Exokernel [11] started the debate concerning radical operating system design to counter the inefficiencies of the general-purpose abstractions provided by operating systems of classical design. It created a new design paradigm for operating systems that aims to provide abstractions as thin as possible, ideally just exposing the hardware interface. The designers do realise that the kernel has to control resources in order to isolate applications. They design secure bindings, which provide secure allocation of hardware resources. The secure bindings are implemented differently for different resources, but what they provide is a decoupling of allocation/management from usage/access control.

IX [1] and Arrakis [19] are both, in a way, modern incarnations of the exokernel idea. Both leverage virtualisation hardware to export secure access to the underlying hardware interface. Where IX uses VT-x to export the interface of a process with access to privilege modes and NIC hardware rings, Arrakis utilises SR-IOV technology, which provides virtual functions appearing as NICs on the PCI bus. Consequently, Arrakis only exports the hardware interface of NICs and, assuming a technology similar to SR-IOV for storage, of storage controllers.

Cheetah is a sample web server application built to showcase Greg Ganger's extensible I/O library XIO for the second exokernel, Xok. By exploiting the extensibility of Xok, the team managed to build a web server that improved throughput by 8× over the best result they were able to achieve on OpenBSD.

6.2 Lessons Learned

We have seen that a specialised library operating system built for Memcached can be used for a more general networked event-driven application framework, Node.js. After our extensions regarding UDS, it is general enough to support the operations required by Node.

The abstractions exposed by IX, mainly flows, which replace the socket layer of a Unix system, correspond well with the abstractions required by Node.js in the single-threaded case, and therefore Node.js can benefit from the optimisations allowed by this new abstraction set. The exception is that the flow abstraction and IX's memory model make it difficult to scale Node.js horizontally, as done in the cluster module, even if multiple concurrently running IX processes were supported.

The fact that Node.js is less efficient than Memcached limits the performance improvements attainable by running it on IX. Since Node.js spends a larger fraction of its execution time in user space, an optimised kernel and system call layer yields smaller improvements.

It is important to analyse an application completely to design an efficient libOS for it. Even though the IX execution model is general enough to support the main features of Node, some of its current limitations prevent it from competing in throughput with a multicore Linux system. Designing an efficient library operating system tailored to a specific application thus requires analysing the application's full execution model.

6.3 Future Work

Limitations

The enumerated limitations of IX, including UDP, support for outgoing connections, a wider range of NICs, and multiple processes running in parallel, need to be addressed for IX to reach a wider public, and are required to enable adoption in production.

Horizontal Scaling

A way to scale Node horizontally on IX remains to be found. Even if the ongoing SR-IOV work makes it possible to run multiple instances of IX in parallel on different cores, the cluster module of Node.js will not provide horizontal scaling. The cluster module works by having a main process accept new connections and subsequently distribute them over the worker processes by sending the file descriptors over IPC. To enable horizontal scaling, either flows need to support migration between processes, multiple processes need to be able to listen to the same incoming port (on the same network interface), as sketched below, or a new scheme of horizontal scaling would need to be devised, perhaps based on elastic threads instead of processes. Note that the thread-safety of the JavaScript runtime comes into play if the last route is pursued. A final remark is that for event-driven architectures, a single crash will disrupt many in-flight requests, and the risk is higher with a multi-threaded model than with a multi-process model.
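As a point of reference for the second option, the following is a minimal Go sketch of how Linux lets multiple worker processes bind the same listening port via the SO_REUSEPORT socket option, with the kernel spreading new connections over the listeners. This illustrates the Linux mechanism only; IX would need an equivalent in its flow model, and the port number and handler are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	// Every worker process runs this same code and binds the same port.
	// With SO_REUSEPORT set before bind, the kernel load-balances new
	// connections over all listeners, so no master process needs to
	// pass accepted file descriptors over IPC.
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	ln, err := lc.Listen(context.Background(), "tcp", ":8080")
	if err != nil {
		fmt.Fprintln(os.Stderr, "listen:", err)
		os.Exit(1)
	}
	// Trivial handler so each worker can be identified by its PID.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "served by pid %d\n", os.Getpid())
	})
	http.Serve(ln, nil) // error handling elided in this sketch
}
```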

Nginx on IX

We have seen how Node can gain throughput from running on IX. However, Node.js is relatively slow, which limits the potential throughput gains. Nginx is an event-driven web server written in C for performance, and it does not inherently execute dynamic languages. It would be interesting to see what kind of performance improvements could be realised by running Nginx on IX. The tail-latency argument for static resources is more valid for Nginx, as it is commonly used for CDN deployments. On Linux, Nginx aggressively uses system calls like sendfile to improve performance by eliminating kernel crossings. Does IX still outperform Linux for serving static files, even without sendfile semantics? Could IX include its own abstractions for files and UDS to enable a sendfile-like interface?
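To make the sendfile point concrete, here is a sketch of the zero-copy pattern it enables, written in Go purely for illustration: in Go's standard library, copying from an *os.File to a TCP connection is satisfied by sendfile(2) on Linux, so the file data never passes through user space. The port and file name are assumptions; a real server would of course parse the HTTP request and emit response headers first.

```go
package main

import (
	"io"
	"log"
	"net"
	"os"
)

// serveFile streams the file straight to the connection. When dst is a
// *net.TCPConn and src is an *os.File, io.Copy is implemented with the
// kernel's sendfile(2) on Linux, avoiding user-space copies entirely.
func serveFile(conn net.Conn, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(conn, f)
	return err
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			// Hypothetical fixed resource, raw bytes only (no HTTP framing).
			if err := serveFile(c, "index.html"); err != nil {
				log.Print(err)
			}
		}(conn)
	}
}
```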

Improved Tools for Generating Load and Measuring Server-Side Reorderings

We have seen how Dialog could be used to generate a Poisson-distributed, rate-controlled load from a large set of virtual clients, easily implemented by leveraging coroutines. But the server-side reordering tests do show a significant number of reorderings, even on the IX system. We know that IX operates under a strict FIFO discipline, which implies that the reorderings occur client side; this is not very surprising given the extreme number of parallel threads of execution in the design. Thus, how would a load generator and probing suite be designed to maximise client-side control over, or at least knowledge of, the wire-time of each packet? Application knowledge of the time a packet was put on the wire could be used both to minimise client-side reordering mistakes, to better probe the server-side reorderings, and to generate a more accurate on-the-wire request rate distribution.
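As a sketch of how such a probe could quantify reordering, assume each request records its client-side send time and the server-assigned processing sequence number (the types and names below are illustrative, not Dialog's actual code); violations are then order inversions between the two sequences:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// sample pairs one request's client-side send time with the sequence
// number the server assigned when it processed the request.
type sample struct {
	sentAt    time.Time
	serverSeq int
}

// countViolations sorts the samples by client send time and counts
// pairs the server processed in the opposite order (inversions).
// Quadratic for clarity; a merge-sort based count would be O(n log n).
func countViolations(samples []sample) int {
	sort.Slice(samples, func(i, j int) bool {
		return samples[i].sentAt.Before(samples[j].sentAt)
	})
	violations := 0
	for i := 0; i < len(samples); i++ {
		for j := i + 1; j < len(samples); j++ {
			if samples[i].serverSeq > samples[j].serverSeq {
				violations++
			}
		}
	}
	return violations
}

func main() {
	t0 := time.Now()
	s := []sample{
		{t0, 0},
		{t0.Add(10 * time.Microsecond), 2}, // overtaken by the next request
		{t0.Add(20 * time.Microsecond), 1},
	}
	fmt.Println(countViolations(s)) // prints 1
}
```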

libOS for Node/VM Dynamic Languages

IX is essentially a library operating system developed to increase the performance of microsecond-computation types of applications in datacentre settings. The web server Cheetah demonstrates what an optimised libOS can do for the performance of a web server. How would a libOS designed to run a virtual machine for a dynamically interpreted language, or a JavaScript runtime, perhaps V8 in particular, be designed?

6.4 Conclusion

By extending IX with functionality for concurrent polling of network flows and Unix Domain Sockets, we effectively permit a larger set of applications to run on IX. We show that Node.js can now run on IX, through the implementation of a minimal port of libuv to IX's API. Furthermore, we show that Node.js on IX significantly outperforms its Linux baseline, especially regarding latency and latency distribution. However, due to the semantics of IX flows we are unable to scale horizontally within a single node, which restricts the attainable throughput of a single node by more than an order of magnitude. Nevertheless, we believe that these restrictions can be lifted, in order to show performance improvements that matter in a real-world setting.

Furthermore, we believe the project has reinforced the exokernel thesis that general-purpose abstractions hurt performance, and that a library operating system with improved performance can prove useful even to third-party applications it was not originally designed for.



Bibliography

[1] A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion, “IX: a protected dataplane operating system for high throughput and low latency”, in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 49–65 (cit. on pp. 2, 3, 7, 8, 14, 15, 45).

[2] G. Prekas, M. Primorac, A. Belay, C. Kozyrakis, and E. Bugnion, “Energy proportionality and workload consolidation for latency-critical applications”, in Proceedings of the Sixth ACM Symposium on Cloud Computing, ser. SoCC ’15, Kohala Coast, Hawaii: ACM, 2015, pp. 342–355, isbn: 978-1-4503-3651-2. doi: 10.1145/2806777.2806848. [Online]. Available: http://doi.acm.org/10.1145/2806777.2806848 (cit. on p. 2).

[3] Memcached – a distributed memory object caching system, http://memcached.org, 2015 (cit. on pp. 2, 8, 17).

[4] Node.js, https://nodejs.org/, 2015 (cit. on p. 2).

[5] T. Capan. (2013). Why the hell would I use Node.js? A case-by-case tutorial, [Online]. Available: http://www.toptal.com/nodejs/why-the-hell-would-i-use-node-js (visited on 07/02/2015) (cit. on pp. 2, 10).

[6] R. Paul. (2012). A behind-the-scenes look at LinkedIn's mobile engineering, [Online]. Available: http://arstechnica.com/information-technology/2012/10/a-behind-the-scenes-look-at-linkedins-mobile-engineering/2/ (visited on 07/14/2015) (cit. on pp. 2, 10).

[7] (Jun. 2015). Node.js, [Online]. Available: http://nodejs.org (cit. on p. 3).

[8] libuv, https://github.com/libuv/libuv (cit. on pp. 3, 18, 19).

[9] V8 JavaScript engine, https://code.google.com/p/v8/, 2015 (cit. on pp. 3, 10, 19).

[10] A. S. Tanenbaum, Modern Operating Systems, 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2007, isbn: 9780136006633 (cit. on pp. 5, 6).

[11] D. R. Engler, M. F. Kaashoek, and J. O'Toole, “Exokernel: An Operating System Architecture for Application-Level Resource Management”, in SOSP ’95, 1995, pp. 251–266 (cit. on pp. 6, 13, 14, 45).

49

Page 58: Enhancing Quality of Service Metrics for High Fan-In Node ...kth.diva-portal.org/smash/get/diva2:867903/FULLTEXT01.pdfEnhancing Quality of Service Metrics for High Fan-in Node.js Applications

BIBLIOGRAPHY

[12] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazières, and C. Kozyrakis, “Dune: safe user-level access to privileged CPU features”, in Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), USENIX, 2012, pp. 335–348 (cit. on pp. 6, 13, 14).

[13] J. Dean and L. A. Barroso, “The tail at scale”, Commun. ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013, issn: 0001-0782. doi: 10.1145/2408776.2408794. [Online]. Available: http://doi.acm.org/10.1145/2408776.2408794 (cit. on pp. 7, 36).

[14] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, “Workload analysis of a large-scale key-value store”, in Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS ’12, London, England, UK: ACM, 2012, pp. 53–64, isbn: 978-1-4503-1097-0. doi: 10.1145/2254756.2254766. [Online]. Available: http://doi.acm.org/10.1145/2254756.2254766 (cit. on p. 7).

[15] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, “Mesos: a platform for fine-grained resource sharing in the data center”, in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI ’11, Boston, MA: USENIX Association, 2011, pp. 295–308. [Online]. Available: http://dl.acm.org/citation.cfm?id=1972457.1972488 (cit. on p. 7).

[16] C. Delimitrou and C. Kozyrakis, “Quasar: Resource-Efficient and QoS-Aware Cluster Management”, in ASPLOS ’14, 2014, pp. 127–144 (cit. on p. 7).

[17] S. Dhar, “Sniffers, basics and detection”, [Online]. Available: http://www.just.edu.jo/~tawalbeh/nyit/incs745/presentations/Sniffers.pdf (visited on 07/16/2015) (cit. on p. 8).

[18] L. Soares and M. Stumm, “FlexSC: flexible system call scheduling with exception-less system calls”, in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’10, Vancouver, BC, Canada: USENIX Association, 2010, pp. 1–8. [Online]. Available: http://dl.acm.org/citation.cfm?id=1924943.1924946 (cit. on p. 8).

[19] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe, “Arrakis: the operating system is the control plane”, in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO: USENIX Association, Oct. 2014, pp. 1–16, isbn: 978-1-931971-16-4. [Online]. Available: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter (cit. on pp. 8, 45).

[20] Y. Rekhter, T. Li, and S. Hares, “RFC 4271: A Border Gateway Protocol 4 (BGP-4)”, IETF, Tech. Rep., 2006. [Online]. Available: www.ietf.org/rfc/rfc4271.txt (cit. on p. 8).

50

Page 59: Enhancing Quality of Service Metrics for High Fan-In Node ...kth.diva-portal.org/smash/get/diva2:867903/FULLTEXT01.pdfEnhancing Quality of Service Metrics for High Fan-in Node.js Applications

BIBLIOGRAPHY

[21] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, “Hypertext transfer protocol – HTTP/1.1”, United States, RFC 2616, 1999. [Online]. Available: http://tools.ietf.org/html/rfc2616 (cit. on p. 9).

[22] J. Patonnier, F. Culloca, A. Pfeiffer, and S. (sboroba). (2015). What is a web server, [Online]. Available: https://developer.mozilla.org/en-US/Learn/What_is_a_web_server (visited on 07/17/2015) (cit. on p. 9).

[23] The Apache HTTP server project, http://httpd.apache.org/ (cit. on p. 9).

[24] (2015). June 2015 web server survey, [Online]. Available: http://news.netcraft.com/archives/2015/06/25/june-2015-web-server-survey.html (visited on 07/17/2015) (cit. on p. 9).

[25] Apache MPM worker, http://httpd.apache.org/docs/2.4/mod/worker.html (cit. on p. 9).

[26] Nginx, http://nginx.org/ (cit. on p. 9).

[27] D. Kegel, The C10K Problem, http://www.kegel.com/c10k.html, 1999 (cit. on p. 9).

[28] The Architecture of Open Source Applications (Volume 2): nginx, http://www.aosabook.org/en/nginx.html (cit. on p. 10).

[29] I. Lan. (2012). Clearing up some things about LinkedIn mobile's move from Rails to Node.js, [Online]. Available: http://ikaisays.com/2012/10/04/clearing-up-some-things-about-linkedin-mobiles-move-from-rails-to-node-js/ (visited on 08/19/2015) (cit. on p. 10).

[30] J. Li, N. K. Sharma, D. R. K. Ports, and S. D. Gribble, “Tales of the tail: hardware, os, and application-level sources of tail latency”, in Proceedings of the ACM Symposium on Cloud Computing, ser. SOCC ’14, Seattle, WA, USA: ACM, 2014, 9:1–9:14, isbn: 978-1-4503-3252-1. doi: 10.1145/2670979.2670988. [Online]. Available: http://doi.acm.org/10.1145/2670979.2670988 (cit. on p. 12).

[31] N. Provos and N. Mathewson, libevent: an event notification library, http://libevent.org, 2003 (cit. on p. 17).

[32] Node.js API documentation, https://nodejs.org/api/, 2015 (cit. on p. 18).

[33] libuv design overview, http://docs.libuv.org/en/v1.x/design.html (cit. on pp. 19, 20).

[34] (Aug. 2015). Node.js API docs: cluster, [Online]. Available: https://nodejs.org/api/cluster.html (visited on 08/11/2015) (cit. on p. 25).

[35] (2005). Address space layout randomization (ASLR), [Online]. Available: https://developer.cisco.com/media/onepk_security_guide/GUID-527CB4BF-B5AC-41A3-92B1-883C09B8730D.html (visited on 07/17/2015) (cit. on p. 34).

51

Page 60: Enhancing Quality of Service Metrics for High Fan-In Node ...kth.diva-portal.org/smash/get/diva2:867903/FULLTEXT01.pdfEnhancing Quality of Service Metrics for High Fan-in Node.js Applications

BIBLIOGRAPHY

[36] (2011). How loading time affects your bottom line, [Online]. Available: https://blog.kissmetrics.com/loading-time/ (visited on 08/19/2015) (cit. on p. 36).

[37] (2012). Average number of web page objects breaks 100, [Online]. Available:http://www.websiteoptimization.com/speed/tweak/average-number-web-objects/ (visited on 08/03/2015) (cit. on p. 37).

[38] wrk – an HTTP benchmarking tool, https://github.com/wg/wrk (cit. on p. 57).



Appendix A

Resources

This appendix references the locations where the work can be found.

A.1 libuv - ix

The libuv branch with IX support can be found at https://github.com/Lilk/libuv.git.

A.2 Node.js

The Node.js version, with ASLR disabled in its V8 dependency and built against a dynamically loaded libuv, can be found at https://github.com/Lilk/node-ix.



Appendix B

dialog - high concurrency rate controlledpoisson distributed load generator

B.1 Purpose

Dialog is a tool for assessing the performance of servers running request-response protocols such as HTTP. It combines the ability to generate a rate-controlled (average) load according to a Poisson process1 with high concurrency (up to thousands of virtual clients per physical client). Furthermore, it allows a distributed mode of operation, where the expected load is farmed out over a set of worker machines and the latency measurements are carried out by a selected machine, to minimise client-side latency in the measurements.

One objective of the tool is to measure connection scalability, and therefore the system is implemented as a closed loop, to keep the number of connected (virtual) clients constant for a given parameterised experiment.

B.2 Implementation

Dialog is based on a coroutine execution model, implemented in Go2. The core module spawns a goroutine for each virtual client, handling exactly one connection per goroutine. The connection control routine draws a random waiting time between requests from an exponential distribution, to achieve a Poisson process with the expected rate. Scheduling of goroutines between cores and upon I/O is performed by the Go runtime. Furthermore, each virtual client keeps a moving average of its own scheduling overhead in order to self-tune its request rate.
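A minimal sketch of that per-client loop follows; the names and parameters are illustrative rather than Dialog's actual code, the shared http.DefaultClient stands in for Dialog's one-connection-per-goroutine handling, and the self-tuning against scheduling overhead is omitted. Exponential inter-arrival times with mean 1/λ yield a Poisson arrival process with rate λ.

```go
package main

import (
	"io"
	"math/rand"
	"net/http"
	"sync"
	"time"
)

// virtualClient issues n requests against url, waiting an exponentially
// distributed time between requests so that arrivals approximate a
// Poisson process with rate lambda (requests per second).
func virtualClient(url string, lambda float64, n int, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < n; i++ {
		// rand.ExpFloat64 has mean 1; dividing by lambda gives mean 1/lambda seconds.
		wait := time.Duration(rand.ExpFloat64() / lambda * float64(time.Second))
		time.Sleep(wait)
		resp, err := http.Get(url)
		if err != nil {
			continue // a real tool would record the error and latency
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
	}
}

func main() {
	const clients = 64 // one goroutine per virtual client
	var wg sync.WaitGroup
	wg.Add(clients)
	for i := 0; i < clients; i++ {
		go virtualClient("http://localhost:8080/", 10.0, 100, &wg)
	}
	wg.Wait()
}
```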

Since the load generating problem is embarrassingly parallel, for distributed execution the master divides the target load and number of virtual clients equally over the participating slave machines, keeping 4 virtual clients and a small share of the throughput on the probing machine to minimise client-side latency effects on the measurements.

1 The software could be modified to support any type of distribution where the time between two requests can be expressed as a distribution given a rate parameter λ.

2 http://golang.org/




Configuration   # Virtual Clients  TP (req/s)  AVG lat. (µs)  99th pp. lat. (µs)
1 probe + 0 LG                256      9863.7          25968               28434
1 probe + 4 LG                256     10047.3           1946                4060

Table B.1: Dialog: separation between latency measurement and load generation. Node.js server.

In all cases, the master synchronises the measurements with successful connections by all participating slaves, and only starts measuring once all connections are established.

The protocol implementation of Dialog is dependency-injected, which makes it easy for a user to employ Dialog as a framework, swapping the protocol depending on the service under test. By default, Dialog is bundled with two implementations of the HTTP protocol. The first uses the Go network stack “net/http”, providing compatibility with a large number of websites. Dialog is also bundled with the SimpleChunkedReader, a barebones implementation that reads only chunked encoding, significantly improving the client-side latency of measurements over the standard library HTTP implementation.
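One way such dependency injection could look is sketched below; the interface, the type names, and the chunked reader are illustrative reconstructions under stated assumptions, not Dialog's actual API (e.g. chunk-size extensions and trailers are not handled).

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"net"
	"strconv"
	"strings"
)

// Protocol is the injected behaviour: how to issue one request and how
// to consume one complete response on an open connection.
type Protocol interface {
	SendRequest(conn net.Conn) error
	ReadResponse(r *bufio.Reader) error
}

// barebonesHTTP mirrors the idea behind SimpleChunkedReader: a fixed
// GET request and a reader that only understands chunked encoding.
type barebonesHTTP struct{ host, path string }

func (p barebonesHTTP) SendRequest(conn net.Conn) error {
	_, err := fmt.Fprintf(conn, "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n", p.path, p.host)
	return err
}

func (p barebonesHTTP) ReadResponse(r *bufio.Reader) error {
	// Skip the status line and headers (assumes a chunked response body).
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return err
		}
		if line == "\r\n" {
			break
		}
	}
	// Consume chunks until the zero-length terminator.
	for {
		sizeLine, err := r.ReadString('\n')
		if err != nil {
			return err
		}
		size, err := strconv.ParseInt(strings.TrimSpace(sizeLine), 16, 64)
		if err != nil {
			return err
		}
		if size == 0 {
			_, err = r.ReadString('\n') // trailing CRLF after the last chunk
			return err
		}
		if _, err := io.CopyN(io.Discard, r, size+2); err != nil { // chunk data + CRLF
			return err
		}
	}
}

// runOnce drives one request/response pair using any injected Protocol.
func runOnce(addr string, p Protocol) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	if err := p.SendRequest(conn); err != nil {
		return err
	}
	return p.ReadResponse(bufio.NewReader(conn))
}

func main() {
	if err := runOnce("localhost:8080", barebonesHTTP{host: "localhost", path: "/"}); err != nil {
		log.Fatal(err)
	}
}
```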

B.3 Evaluation

All tests in this section were carried out using the same hardware and test setup as in chapter 5, unless otherwise described.

First, table B.1 shows how separating latency probing and load generation onto different physical machines affects latency measurements. The server used in table B.1 is a single-worker Node.js server running a Hello World application. The first row, depicting the scenario with no auxiliary load generating machines, exhibits a high client-side-induced latency on the probing machine (when it has to generate all load) that dominates the latency cost. The second row shows average latency reduced by 13.3× and the 99th percentile by 7×, demonstrating the importance of measuring latency from an unloaded physical machine.

Secondly, in table B.2, we show how our minimal HTTP reader SimpleChunkedReader outperforms the standard library's more complete implementation. The first two rows show this difference for non-loaded cases, where the server responds fast and client-side overhead is therefore visible in the measurements.

HTTP reader          # Virtual Clients  TP (req/s)  AVG lat. (µs)  99th pp. lat. (µs)
go net/http                         16      5001.8            657                2255
SimpleChunkedReader                 16      5002.8            423                1196
go net/http                        256     10195.1          10139               26948
SimpleChunkedReader                256     10271.9           9743               26837

Table B.2: Dialog: Go net/http stack vs SimpleChunkedReader. Node.js server.




Client         # Virtual Clients  TP (req/s)  AVG lat. (µs)  99th pp. lat. (µs)
Dialog: 1 + 0                  1        8627            116                 149
Dialog: 1 + 4                  1        8500            118                 151
wrk                            1       10080            101                 399
Dialog: 1 + 0                  8       46770            170                 666
Dialog: 1 + 4                  8       39815            172                 603
wrk                            8       67928            132                 561
Dialog: 1 + 0                128      162868            798                2525
Dialog: 1 + 4                128      201831            602                2353
wrk                          128      185787            765                2520
Dialog: 1 + 0                512      186739           2641                5911
Dialog: 1 + 4                512      246706           1542                5255
wrk                          512      238770           2200                5610
Dialog: 1 + 0               4096      214921          18144               45203
Dialog: 1 + 4               4096      277377           6202               34472
wrk                         4096      289575          14700               36220
Dialog: 1 + 0              16384      195478          76243              800383
Dialog: 1 + 4              16384      230700          11125              259100
wrk                        16384      232865         112139              872650

Table B.3: Dialog vs wrk. Go server.

Comparing rows 3 and 4 shows the server at a point of saturation, where server-side latency dominates. Therefore, no meaningful difference can be observed in that case.

To evaluate Dialog, we compare it to wrk [38], a widely used HTTP benchmarking tool. We compare both the minimal achievable latency against a common server and the achievable throughput. To show that Dialog is not the bottleneck when benchmarking Node.js applications, we test both Dialog and wrk against a higher-performing, horizontally scaled web server written in Go using goroutines and its standard net/http library. The results can be seen in table B.3. Note that wrk determines the maximal throughput, whereas Dialog tries to maintain a predefined global throughput, which makes the comparison hard to reason about.

Comparing minimal latency, we observe that Dialog exhibits approximately 15% higher minimum average latency in the case of a single connection. The throughput achieved by wrk is also higher; both observations suggest client-side inefficiencies in Dialog compared to wrk. Dialog shines at higher connection counts with distributed load: for 16384 concurrent connections, at almost identical throughput, we measure an average latency an order of magnitude lower, again reinforcing the need to measure latency on a machine separate from load generation. For the non-distributed version we observe roughly 20% higher latency, or correspondingly lower throughput.

We have previously motivated the need for Dialog: measuring latency and latency distribution at high connection counts, running at throughput levels below saturation, with realistic arrival processes. wrk does not provide this functionality.




Through this comparison, we have shown that the two load generators perform comparably, with a roughly 20% advantage for wrk in the single-node case. Moreover, both provide ample performance and do not act as a bottleneck for the tests performed in chapter 5.

B.4 Resources

Dialog can be found at https://github.com/Lilk/dialog.


