
All Quiet on the Internet Front?


IEEE Communications Magazine • October 2014 • 0163-6804/14/$25.00 © 2014 IEEE

The authors are with Delft University of Technology.

INTRODUCTION

The Internet constitutes a vital societal network-of-networks infrastructure in which even small hiccups could have detrimental consequences, resulting in significant economic damage to institutions and entire economies.

The importance of the Internet and its services to society makes it evident that it should be made resilient to failures. This awareness has instigated a large body of research on how to protect networks, although such studies typically consider a single simplistic failure model in which the network is represented by a graph consisting of nodes and links. In order to protect the Internet against failures, we believe that it is essential to understand what kinds of failures exist, the impact they have, and the frequency at which they occur; in other words, a taxonomy of Internet failures. Even though large-scale Internet incidents have been reported in the media, and some papers include a brief list of several such failures, a taxonomy of key Internet failures showing the cause, duration, range, and effect does not yet exist.

In this article, we present such an overview and discuss the resulting implications for effective challenge mitigation. Our findings indicate that even failure scenarios for which mitigation strategies exist still pose a major source of outages, indicating that more fine-grained network risk assessment methods and better resilience planning and responses are still needed.

The remainder of this article is organized as follows. First, we offer an overview of our findings on Internet failures and present our major conclusions. Based on this, we discuss the effectiveness of current mitigation strategies and give recommendations to better avoid Internet failures. Finally, we conclude the article.

A TIMELINE OF INTERNET FAILURES

To arrive at a comprehensive overview of Internet failures, a broad foundation is needed. For the work presented in this article, a variety of sources were consulted. We started by interviewing practitioners and representatives from regional Internet service providers (ISPs), national research and education network operators (NRENs), national incumbent operators, and multi-national networks about their experiences and incidents, and their root causes. Our findings and recommendations [4] were validated in a formative workshop hosted by the European Network and Information Security Agency (ENISA). Subsequently, we augmented our overview with operator reports and literature searches in academic and trade articles, as well as news websites, blogs, fora, and operator mailing lists about Internet incidents. In the following, we limit our discussion to “Internet” services as commonly referred to by the end user, and will not extend the discussion to IP-based enterprise networks.

From our list of Internet incidents, 54 major and representative Internet failures over the period of June 2007–December 2013 were chosen; they are displayed in Fig. 1. The figure visualizes the time, duration, impact size, and ultimate root cause of each event, denoted by a circle whose area is proportional to the approximate number of affected customers and whose color indicates the incident duration on a log scale. The markers are centered at the time and ultimate root cause; that is, if a service failed because of a database replication issue that was due to a defective core router, the event is marked as a networking issue. When no accurate number of the affected customer base was available and no meaningful estimate could be derived from operator reports or the literature, the figure only marks the time, root cause, and duration by a square. For details on these and other incidents beyond the space constraints of this article, we refer the reader to www.internetview.org.
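The figure's visual encoding (marker area proportional to the number of affected customers, color mapped to outage duration on a log scale from roughly a minute to a year) can be sketched as follows. The incident names and values here are purely illustrative, not taken from the article's dataset.

```python
import math

# Illustrative incidents: (name, affected_customers, outage_seconds).
incidents = [
    ("example-webmail", 1_000_000, 6 * 3600),  # hypothetical values
    ("example-cdn", 10_000, 5 * 60),
]

def marker_area(customers, scale=1e-4):
    # Area (not radius) proportional to affected customers, so an
    # outage affecting 100x more customers gets a 10x larger radius.
    return customers * scale

def duration_color_value(seconds, min_s=60, max_s=365 * 24 * 3600):
    # Map outage duration onto [0, 1] on a log scale (1 min .. 1 year),
    # mirroring the figure's logarithmic duration color bar.
    x = (math.log(seconds) - math.log(min_s)) / (math.log(max_s) - math.log(min_s))
    return min(max(x, 0.0), 1.0)

for name, cust, dur in incidents:
    print(name, marker_area(cust), round(duration_color_value(dur), 3))
```

Events of unknown magnitude would skip the area computation entirely and fall back to a fixed square marker, as in the figure.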

ABSTRACT

With the proliferation and increasing dependence of many services and applications on the Internet, this network has become a vital societal asset. This creates the need to protect this critical infrastructure, and over the past years a variety of resilience schemes have been proposed. The effectiveness of protection schemes, however, highly depends on the causes and circumstances of Internet failures, but a detailed comprehensive study of these is not yet available to date. This article provides a high-level summary of an evaluation of Internet failures over the past six years, and presents a number of recommendations for future network resilience research.

DISASTER RESILIENCE IN COMMUNICATION NETWORKS

Christian Doerr and Fernando A. Kuipers

There are a variety of ways to structure the most prevalent types of Internet failures. A first crude classification one could make is into intentional failures (i.e., attacks) and unintentional failures. However, by analyzing the listed incidents and their causes, it becomes apparent that most Internet failures were unintentional, and only a few of the incidents were the result of malicious attacks. We therefore adopted a slightly different categorization into infrastructure failures, Border Gateway Protocol (BGP)-related failures, and service failures resulting from an attack. Each category is further subdivided as follows.

1) Infrastructure failures list all instances where a component necessary to provide a particular service has failed, either directly as part of the operator's service deployment or outside of the operator's scope but still indirectly having an impact on the assets of the operator. Common failure types comprise network and cable failures, power failures, hardware failures (server failures, issues with storage systems, cooling facilities, structural failures, etc.), failures in the service architecture, or failures in software components necessary to provide a particular Internet service, ranging from server-side end-user applications to database applications. In addition, we also list in this category service impairments that specifically stem from an accident or natural disaster, such as a hurricane or a fire in a data center.

2) The Internet is a network of networks, where each network (called an autonomous system) possesses its own range of IP addresses and operates its own routing protocol. The BGP facilitates the routing between autonomous systems; it is the necessary “glue” that holds the tens of thousands of networks together into a commonly accessible Internet. Despite this key importance, the BGP is surprisingly susceptible to malfunctions; Internet service impairments and service failures due to the BGP are listed in this category. Most common are BGP hijacking events, where a network announces some IP address space that it actually does not own. As a result, traffic toward the particular network which is the actual user of that IP prefix is temporarily misdirected. Other previous incidents related to the BGP were hardware- and protocol-based; for example, unusual but valid BGP messages made key routers in the Internet crash due to software bugs, thereby also effectively cutting off networks from the overall Internet.
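A hijack of the kind described above can be flagged by comparing an observed announcement's origin AS against the expected origin for that prefix, including more-specific announcements of someone else's address space. The prefixes and AS numbers below are invented for illustration; real monitors such as BGPmon work from live routing-table feeds rather than a static table.

```python
import ipaddress

# Hypothetical ground truth: prefix -> legitimate origin AS.
expected_origin = {
    ipaddress.ip_network("203.0.113.0/24"): 64500,
}

def classify(announced_prefix, origin_as):
    """Classify an observed BGP announcement against expected origins."""
    net = ipaddress.ip_network(announced_prefix)
    for owned, legit_as in expected_origin.items():
        # Overlap: exact, more-specific, or covering announcement.
        if net.subnet_of(owned) or owned.subnet_of(net):
            if origin_as != legit_as:
                return "possible hijack"
            return "legitimate"
    return "unknown prefix"

print(classify("203.0.113.0/24", 64500))   # announced by the rightful AS
print(classify("203.0.113.0/25", 64666))   # more-specific from a wrong AS
```

The more-specific case is the dangerous one in practice: longest-prefix matching means the /25 would attract the traffic even while the legitimate /24 is still being announced.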

3) Finally, service-related failures list Internet service incidents stemming from either failures in some underlying enabling service or direct attacks on the service itself. The category assembles all incidents on the Domain Name System (DNS), which is necessary to translate domain names to their corresponding IP addresses (and without which websites become practically invisible to the end user), as well as impairments and outages of the Secure Sockets Layer (SSL) infrastructure that enables encryption between a service and the end user. This category also lists distributed denial of service (DDoS) attacks. These are malicious attacks executed from hundreds or thousands of hijacked computers simultaneously, with the intent to overload a system so that its real end users are denied service. In the classification “Miscellaneous,” we collect various events

Figure 1. A timeline of Internet failures between June 2007 and December 2013. [Figure: scatter timeline, 2007–2013. Rows group root causes into infrastructure failures (network/cable, energy, hardware, architecture, software, disasters), BGP (hijacking, hardware/protocol), and services (DDoS, DNS, SSL, misc.). Marker area encodes the affected number of customers (1,000 to 1,000,000; for large events only an outline is shown to maintain readability; “?” marks events of unknown magnitude but major impact); color encodes outage duration on a log scale from 1 min to 1 year. Labeled incidents include British Telecom, Level 3, Amazon AWS/EC2/RDS, GMail/Google Apps, Google Docs, Azure, Blackberry, Microsoft Sidekick, Diginotar, TurkTrust, Spamhouse/CloudFlare, China Telecom, Wikipedia, Wikileaks, LINX, Skype, Twitter, Netflix, Godaddy, Sony, Playstation Network, Apple MobileMe, Shaw Communications, 365 Main, Hostway, Hurricane Sandy, and the FLAG FEA, GO-1, and SEA-ME-WE 4 cable cuts.]


aimed at interrupting a particular service, such as insider attacks and hacks. For an overview of attack types in the Internet and their economic incentives, we refer to Kim et al. [6].

Figure 1 demonstrates that large-scale Internet service failures occur with regularity (at least, and usually more than, once a year), even when the plethora of usually unnoticeable smaller incidents and those events related to national security are not considered. It also becomes evident that the vast majority of events visible to the end user revolve mostly around the failure of the infrastructure and enabling services. This is noteworthy, as problems around the notoriously vulnerable BGP (for an excellent survey, see [2]) capture much attention, since it is theoretically possible to generate a large impact on the global interconnection system with comparatively little complexity. While such incidents do occur in practice, their frequency and impact are usually bounded, thanks to established monitoring infrastructures such as BGPmon.

Based on the analysis of the incidents and their root causes, we also arrive at several other surprising conclusions. Much of the recent research work on network resilience has focused on the development of algorithmic link/path protection schemes that try to place backup routers and fiber optic cables in the network in such a way that most end-to-end connections are protected while minimizing cost. In our review of Internet failures, however, almost no major incidents were identified that were ultimately caused by fiber cuts and that could have been prevented by such protection schemes. Major events such as the cuts of the “South East Asia–Middle East–Western Europe” (SEA-ME-WE), the “Fiber-Optic Link Around the Globe” (FLAG FEA), and GO-1 submarine cables in late 2008 in the Mediterranean Sea, or prior events such as the 2006 Taiwan earthquake (during which eight submarine cables were cut), are usually not in the scope of such protection schemes, which typically only plan for a limited number of simultaneous failures. On the other hand, these resilience methods do seem effective against small-scale localized events that, according to the conducted ISP interviews, probably are not directly visible due to their magnitude, successful mitigation, and routine status.

Network infrastructure failures, however, involve not only issues such as cable cuts, but also failures of core routers and switches, which we found to be a surprisingly common root cause of major outages. This is especially surprising, as it is common good practice in the ISP community [4] to deploy critical core components at least redundantly, or even with entire pools of hot spares. Nevertheless, there were multiple instances where a faulty networking element resulted in a failure of some higher-layer software component, such as a database breakdown, which ultimately caused an entire service to fail.

Part of this issue is due to the increasing complexity of Internet services and a tendency to build services by federating lower-level building blocks. While cost effective, from an availability standpoint this results in a tightly coupled system, and with the introduced co-dependence on multiple systems, the frequency and impact of breakdowns increase. On one hand, this is true for intra-organizational services that all rely on a common core component, so that in case of a failure a variety of services are impaired (e.g., the simultaneous failures of Google Apps, Gmail, etc. in early 2011). On the other hand, this is also the case for inter-organizational services and infrastructures, where a service from one organization critically depends on the availability of another one. How services depend on each other, as well as the strength and amount of co-dependence, is less and less known the higher one goes up the stack, so in the end multiple competing and apparently redundant services may actually be relying on the same infrastructure. With the advent of cloud providers, this issue seems to have been amplified, as repeatedly became visible over the past years. A failure in the Amazon Web Services (AWS) cloud infrastructure, for example, will render dozens of very diverse services unusable at the same time. This issue was illustrated in an exemplary manner on one occasion when several commercial uptime monitoring providers, which track and alert website and service providers about an outage, all failed simultaneously, as they had all procured an underlying but critical piece of their monitoring solution from the same cloud provider. In these cases, the common good practice of geographically distributing resources frequently does not seem to save the day, as the relatively less impactful connectivity and energy failures are traded against the apparently more frequent failures in the system architecture and software stack. In particular, when such diversification is done via the same providers and components, not much is gained. For instance, if an application is hosted in different data centers by the same cloud provider, a service might be more vulnerable, as it relies on a centralized system and now has an architectural single point of failure (SPoF).
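The hidden co-dependence described above can be made explicit with a transitive-closure check over a declared dependency graph: two "redundant" services are only as independent as the intersection of their transitive dependencies is empty. The service names below are hypothetical.

```python
# Hypothetical dependency graph: service -> services/infrastructure it needs.
deps = {
    "monitor-A": ["cloud-X"],
    "monitor-B": ["cdn-Y"],
    "cdn-Y": ["cloud-X"],   # the indirect dependency that creates the SPoF
}

def transitive_deps(service, graph):
    """All direct and indirect dependencies of a service."""
    seen, stack = set(), list(graph.get(service, []))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(graph.get(d, []))
    return seen

def shared_spofs(a, b, graph):
    # Anything in this intersection takes down both services at once.
    return transitive_deps(a, graph) & transitive_deps(b, graph)

print(shared_spofs("monitor-A", "monitor-B", deps))  # → {'cloud-X'}
```

Note that neither monitor declares `cloud-X` and `cdn-Y` together; the shared dependency only becomes visible through the closure, mirroring how such risks stay hidden "the higher one goes up the stack."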

As can also be seen in Fig. 1, the actual impact of many Internet failures is not known at all, predominantly because no global measurement and monitoring infrastructure exists as it does, for example, in the case of BGP, where monitors distributed worldwide record changes in the global routing table and allow an estimation of which networks are affected by BGP prefix hijacking and routing issues. While some monitoring providers exist that test the uptime of Internet services, we believe that their deployment sizes (a few hundred nodes in data centers) are not sufficient to get a good real-time view of the state of the Internet as experienced by the end user and a good localization of failures.

Finally, the results should prompt us to think differently about the mitigation strategies currently used in network resilience engineering. The fact that major events have much longer durations and different root causes (not predominantly network- and fiber-driven) than commonly assumed suggests that more attention should be directed at resilience engineering of the entire service stack, specifically at decoupling and challenge containment in tightly coupled systems. Our findings and recommendations for resilience optimization are further discussed in the next section.



RECOMMENDATIONS AND CHALLENGES

In this section, we discuss several commonly used failure mitigation strategies, exemplify under what circumstances they have failed, and provide recommendations and challenges on how to reach a more effective Internet failure mitigation plan. Since the Internet consists of a network of networks, some of our (intra-network) recommendations could be followed or implemented by ISPs and network operators to strengthen their infrastructures against accidental failures and malicious attacks, while other (inter-network) recommendations may warrant action by policy makers who govern the global Internet, to lead to a more resilient Internet ecosystem.

NETWORK RISK ASSESSMENT

The first step in obtaining a (more) robust network is creating a risk profile of the network that identifies possible network vulnerabilities, as well as a method to measure and assess the resilience of a network. Reference [3] provides a comprehensive overview of various resilience classification approaches in the literature. In addition to a suitable metric, obtaining an accurate risk profile that can serve as a solid foundation for resilience engineering requires attention to a number of aspects, discussed below.

Going Beyond a Graph Representation — A network typically consists of physical (point-of-presence) locations, the hardware at those locations, and the physical (optical fiber) connections between locations. On top of this network, the operator may run several logical network services, such as dense wavelength-division multiplexing (DWDM), synchronous digital hierarchy (SDH)/carrier Ethernet, and Ethernet, each constituting a layer on top of the previous one.

Regardless of the complexity of a network, networks are often modeled as a graph consisting of nodes and links, and as a result, much work on improving network robustness has directed its attention to improving various graph connectivity metrics. In practice, however, a connection typically does not form a straight line between the locations it connects, and such lines hide a number of underlying dependencies. For example, identifying the location of all single points of failure (SPoFs) in a network based on a graph representation of that network could miss the vulnerabilities of several links lying close together. Geographical SPoFs may exist, and should be identified at and across different layers.
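A geographic SPoF of this kind can be found by checking whether topologically disjoint links pass within a common failure radius of each other. The link coordinates below are hypothetical; distances use a simple equirectangular approximation, which is adequate at these scales.

```python
import math
from itertools import combinations

# Hypothetical fiber links as (name, midpoint latitude, midpoint longitude).
links = [
    ("A-B", 52.000, 4.350),
    ("C-D", 52.001, 4.351),  # a different graph link, only ~130 m away
    ("E-F", 48.850, 2.350),
]

def distance_km(p, q):
    # Equirectangular approximation; fine for short distances.
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371 * math.hypot(x, y)

def geographic_srlgs(links, radius_km=1.0):
    """Pairs of distinct links within one failure radius of each other:
    a single excavator or local disaster could cut both at once."""
    return [(a[0], b[0]) for a, b in combinations(links, 2)
            if distance_km(a[1:], b[1:]) <= radius_km]

print(geographic_srlgs(links))  # → [('A-B', 'C-D')]
```

In a pure graph model, A-B and C-D look fully disjoint; only the geographic check reveals that they form a shared risk link group.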

Data to Determine Shared Risks — During our study, it became evident that many — especially small — providers do not have sufficient information about the resources they use, which are typically leased, to detect shared risk groups and correspondingly provision a resilient network. In addition to such geo-localized data, inference tools need to be developed to efficiently determine shared risk groups and improve network design even for medium-sized operator networks. A noteworthy example in this direction is [1].

Probabilistic Embedded Risk Assessment — Not only is geo-information on the network important; so is its embedding in a geographical region and the context in which it operates. As network failures could be the result of natural disasters, or abound in densely populated areas where fiber cuts are more frequent, the geographic area in which the network is embedded clearly affects the risk to which the network is exposed. In addition, not all disasters and failures are created equal, and resilience engineering approaches should take the estimated likelihood and projected impact of a challenge into account for a cost- and risk-optimized mitigation strategy.
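The likelihood-and-impact weighting suggested above amounts to ranking challenges by expected loss, so that mitigation budget goes to the biggest expected risk rather than the most dramatic scenario. The probabilities and costs below are invented for illustration.

```python
# Hypothetical challenges: (name, annual probability, impact cost if it occurs).
challenges = [
    ("local fiber cut", 0.60, 50_000),
    ("core router failure", 0.20, 400_000),
    ("regional disaster", 0.01, 5_000_000),
]

def expected_annual_loss(p, impact):
    # Risk = likelihood x impact, the basis of a cost-optimized strategy.
    return p * impact

ranked = sorted(challenges,
                key=lambda c: expected_annual_loss(c[1], c[2]),
                reverse=True)
for name, p, impact in ranked:
    print(f"{name}: expected loss {expected_annual_loss(p, impact):,.0f}/yr")
```

With these illustrative numbers, the frequent-but-cheap fiber cut ranks below the rarer core router failure, matching the article's observation that "not all disasters and failures are created equal."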

BUSINESS CONTINUITY MANAGEMENT AND MUTUAL AID

From studying the crisis responses that have been published, and via several interviews with network operators, it became apparent that a business continuity management (BCM) plan does not always exist or is not up to date, leading to many failures being addressed in an ad hoc manner. When investigating the incidents that were successfully overcome with minimal impact, mutual aid between operators (e.g., temporarily lending equipment or routing traffic over another ISP's infrastructure) seemed to be a key factor in challenge containment. On one hand, this underlines the importance of BCM. Moreover, the extent of BCM policies and planned responses, if they exist at all (typically only at larger operators), tends to differ greatly between network operators. On the other hand, this also highlights the fact that resilience engineering is not, and should not be, limited to a single network. When addressing global incidents, it is important to have coordinated actions or agreements under which one can rely on someone else's network for offloading traffic. Following similar approaches on the technical side of network design and resilience optimization would allow higher resilience levels for a particular deployment to be achieved at a lower overall cost and network complexity; as in insurance, risk and impact are distributed over more shoulders.

RESILIENCE BY DESIGN

Depending on the outcome of the risk assessment, the network may need to be augmented (i.e., adding nodes and/or links) to improve its resilience against the identified risks. The art of network augmentation is how to best balance resilience and augmentation costs. An overview of network planning under traffic and risk uncertainty can be found in [7].

Resilience of the Entire Stack — Despite the importance of communication networks, their security and resilience have long been only marginally addressed, typically as a later add-on, while in other critical systems (like airplanes) resilience has been designed in from the start and tested continuously. As a result, several dependences have been introduced in communication networks that might cause a ripple-through effect when only a single component fails. For instance, in October 2011, a core switch within the Blackberry network failed. Such a hardware failure is in practice usually quickly resolved by proper fail-over schemes, but in this particular case it caused a malfunctioning of a database that was much harder to resolve and eventually led to an outage lasting three days. Hence, resilience engineering in networks should look at the entire networking and application stack, as even minor challenges that are remediated within the allowed mitigation margins may amplify and pose a large impact at other layers. Cross-layer resilience engineering — in contrast to, for example, cross-layer performance optimization in wireless networks — has unfortunately received little attention to date.

Spare Resources — The endpoints of interconnections between individual networks are located in data centers that follow a wide variety of practices to increase resilience. There, the level of redundancy and protection against typical failures is described by tiers, with specific guidelines as to what practices must be implemented for a data center to meet these levels and be certifiable as such. The Amsterdam Internet Exchange (AMS-IX), for example, has extended these available standards and further refined them into a list of 141 minimum baseline (technical design, operational, and business continuity) requirements for the data centers providing service to the exchange. While, as stipulated in these standards, it is recommended to overprovision network elements by a factor of two and create independent availability regions capable of securing network operations, there is currently an ongoing trend of providers operating their networks at higher and higher loads (e.g., as Google is doing with the software-defined wide area network connecting its data centers). The “hotter” the network is operated, the fewer backup resources are available, and the higher the risk in case of failure, since backup paths/resources might not be available. Moreover, running a network at high utilization introduces a risk of overload, as we have seen, for instance, with popular applications like Twitter in their early days. Finally, adopting new technologies such as software-defined networking (SDN, and its protocol, OpenFlow) could pose new vulnerabilities, for instance with respect to the robustness of the SDN controller now introducing a new SPoF, or the security of the OpenFlow protocol.
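The "hotter is riskier" argument can be quantified with a simple headroom check: with n load-sharing elements each running at utilization u, the failure of one element pushes the survivors to u * n / (n - 1). The utilization figures here are illustrative.

```python
def post_failure_utilization(u, n):
    """Utilization of each survivor after one of n load-sharing
    elements fails (its traffic spreads over the remaining n - 1)."""
    if n < 2:
        return float("inf")  # no redundancy at all
    return u * n / (n - 1)

def survives_single_failure(u, n):
    # The network rides out the failure only if survivors stay below 100%.
    return post_failure_utilization(u, n) <= 1.0

# Overprovisioning by a factor of two (u = 0.5, n = 2) just survives a
# single failure; running "hot" at u = 0.7 with the same redundancy does not.
print(survives_single_failure(0.5, 2))  # True
print(survives_single_failure(0.7, 2))  # False
```

This is why the factor-of-two overprovisioning recommendation cited above is exactly the break-even point for surviving one failure out of a redundant pair.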

Implications of Tightly Coupled Systems, Shared Infrastructure, and Unknown SPoFs — In the past few years, the role of cloud computing, in which the infrastructure, platform, and even software used by IT operations are outsourced services, has become more prominent. The flexibility of cloud services certainly has its advantages, since they can be used when, and only for how long, they are needed, and be leased at prices charged in small increments of actual usage. However, these shared infrastructures also pose a risk that failures of a data center could cripple many services; this is supported by some analysts proclaiming 2012 the year of cloud (computing) outages. For many Internet services building on such cloud infrastructures, this creates risks that cannot be mitigated, as customers typically do not have much insight into the concrete building blocks of the used infrastructure and potential architectural SPoFs. This general issue, however, extends well beyond this particular scenario of cloud computing. In recent years, services have become increasingly coupled and integrated, which has also increased the vulnerability of Internet services due to common shared or cross-dependent infrastructures. Similar to the intensified linkages among actors in the financial market that led to the bursting of the housing bubble in 2008, we might have created similar systemic or hyper risks [5] in Internet services, which might explain the comparatively large magnitude of outages. The resilience of such tightly coupled systems is, however, still largely unknown, both in general and specifically for the case of computer networks and the Internet as their most prominent example. More research is needed to understand risk and failure trajectories in these tightly coupled systems in order to develop effective challenge mitigation strategies for Internet services operating under such circumstances.

MONITORING OF INTERNETWORK RESILIENCE

The key to inter-domain routing resilience is the establishment of redundancy at multiple physical endpoints and, if possible, also across multiple levels. The most fundamental inter-domain protection concept is the establishment of multihoming, that is, the presence of at least two distinct uplink connections toward non-local destinations. To realize the maximum possible resilience from such a setup, the critical dependencies of the upstream providers should ideally be investigated (e.g., where the transit providers' fibers run, from which grid their equipment is powered, or where they interconnect), but obtaining a comprehensive view of this is frequently difficult.

If a network operator has established several interconnection points with another ISP, the BGP provides additional means to manage and thereby strengthen the interconnection. By tuning the individual BGP configuration at each location and influencing through which points traffic should enter or exit the autonomous system, for example via the BGP multi-exit discriminators, local preferences, or path attributes, providers can obtain a fine level of control over the traffic flows between networks, privileging or relieving particular hardware over others.
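The effect of these knobs can be sketched with a toy model of BGP's best-path selection: local preference is compared first (higher wins), then AS-path length (shorter wins), then MED (lower wins). This is a simplified subset of the real decision process, and the route values are invented.

```python
# Toy routes to the same prefix, learned at different interconnection points:
# (exit point, local_pref, as_path_len, med)
routes = [
    ("exit-1", 100, 3, 10),
    ("exit-2", 200, 5, 50),  # operator raised local-pref to prefer this exit
    ("exit-3", 100, 3, 5),
]

def best_path(routes):
    """Simplified BGP decision: highest local-pref, then shortest AS path,
    then lowest MED."""
    return max(routes, key=lambda r: (r[1], -r[2], -r[3]))

print(best_path(routes)[0])  # exit-2: local-pref overrides the shorter paths
```

This illustrates why local preference is the operator's strongest lever for steering outbound traffic: it is evaluated before path length and MED, so raising it at one interconnection point redirects traffic there regardless of the other attributes.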

For such setups at network operators, and to further deepen insight into the resilience and reliability of Internet services and their underlying infrastructures in academia, a large monitoring framework should be established. It would be able to build up an assessment and a track record of “how good” connections via a particular autonomous system are, what the stability of individual paths within an autonomous system is, and what particular hardware resides at certain geographical locations. Such monitoring systems have contributed a great deal to minimizing the impact of BGP hijacking incidents, as malicious and accidental prefix announcements can today be rapidly detected. Establishing a similar system to understand the exact mechanics, location, and impact of Internet failures could promise to generate a similar leap toward a more resilient Internet.

CONCLUSION

Is it “all quiet on the Internet front?” In this article, we have investigated a broad range of Internet failures during the course of the past six years, where it became apparent that failures abound, and analyzed their root causes, frequency, duration, and societal impact. Such a study was missing to date, but it is vital for establishing proper Internet failure mitigation schemes.

In the second part of this article, we have scrutinized currently employed mitigation schemes, exemplified in which cases they failed and why, and proposed recommendations and challenges on the road toward fine-grained network risk assessment methods and better resilience planning and responses.

ADDITIONAL RESOURCES

Details about the incidents described in this article as well as other resources can be found at https://www.internetview.org, a new website dedicated to Internet infrastructure monitoring and resilience.

ACKNOWLEDGMENTS

Part of this work has been supported by the EU FP7 EINS project under grant agreement No. 288021.

REFERENCES

[1] N. Adam et al., “Consequence Analysis of Complex Events on Critical U.S. Infrastructure,” Commun. ACM, vol. 56, no. 6, 2013, pp. 83–91.
[2] K. Butler et al., “A Survey of BGP Security Issues and Solutions,” Proc. IEEE, vol. 98, no. 1, Jan. 2010.
[3] P. Cholda et al., “A Survey of Resilience Differentiation Frameworks in Communication Networks,” IEEE Commun. Surveys, vol. 9, no. 4, 2007.
[4] C. Doerr et al., “Good Practices in Resilient Internet Interconnection,” ENISA report, June 2012.
[5] D. Helbing, “Globally Networked Risks and How to Respond,” Nature, vol. 497, 2013, pp. 51–59.
[6] W. Kim et al., “The Dark Side of the Internet: Attacks, Costs and Responses,” Information Systems, vol. 36, 2011, pp. 675–705.
[7] S. Yang and F. A. Kuipers, “Traffic Uncertainty Models in Network Planning,” IEEE Commun. Mag., vol. 52, no. 2, Feb. 2014, pp. 172–77.

BIOGRAPHIES

CHRISTIAN DOERR ([email protected]) is an assistant professor in the Network Architectures and Services group at Delft University of Technology (TUDelft). He received an M.Sc. degree in computer science and a Ph.D. degree in computer science and cognitive science from the University of Colorado at Boulder. His research interests revolve around critical infrastructure protection, cyber security, and resilience engineering.

FERNANDO A. KUIPERS [SM] ([email protected]) is an associate professor in the Network Architectures and Services group at TUDelft. He received his M.Sc. degree in electrical engineering from TUDelft in June 2000 and subsequently obtained his Ph.D. degree (cum laude) in 2004 at the same university. His research interests mainly revolve around network algorithms and cover routing, quality of service, network survivability, optical networks, and content distribution. His work on these subjects includes distinguished papers at IEEE INFOCOM 2003, Chinacom 2006, IFIP Networking 2008, IEEE FMN 2008, IEEE ISM 2008, and ITC 2009.
