DISASTER RESILIENCE IN COMMUNICATION NETWORKS

All Quiet on the Internet Front?

Christian Doerr and Fernando A. Kuipers

IEEE Communications Magazine • October 2014 • 0163-6804/14/$25.00 © 2014 IEEE

The authors are with Delft University of Technology.

ABSTRACT

With the proliferation and increasing dependence of many services and applications on the Internet, this network has become a vital societal asset. This creates the need to protect this critical infrastructure, and over the past years a variety of resilience schemes have been proposed. The effectiveness of protection schemes, however, highly depends on the causes and circumstances of Internet failures, yet a detailed, comprehensive study of these is not available to date. This article provides a high-level summary of an evaluation of Internet failures over the past six years and presents a number of recommendations for future network resilience research.

INTRODUCTION

The Internet constitutes a vital societal network-of-networks infrastructure in which even small hiccups can have detrimental consequences, resulting in significant economic damage to institutions and entire economies.

The importance of the Internet and its services to society makes it evident that it should be made resilient to failures. This awareness has instigated a large body of research on how to protect networks, although such work typically considers a single simplistic failure model in which the network is represented by a graph consisting of nodes and links. To protect the Internet against failures, we believe it is essential to understand what kinds of failures exist, the impact they have, and the frequency at which they occur; in other words, a taxonomy of Internet failures. Even though large-scale Internet incidents have been reported in the media, and some papers include a brief list of such failures, a taxonomy of key Internet failures showing the cause, duration, range, and effect does not yet exist.

In this article, we present such an overview and discuss the resulting implications for effective challenge mitigation. Our findings indicate that even failure scenarios for which mitigation strategies exist still pose a major source of outages, indicating that more fine-grained network risk assessment methods and better resilience planning and responses are still needed.

The remainder of this article is organized as follows. First, we offer an overview of our findings on Internet failures and present our major conclusions. Based on this, we discuss the effectiveness of current mitigation strategies and give recommendations to better avoid Internet failures. Finally, we conclude the article.

A TIMELINE OF INTERNET FAILURES

To arrive at a comprehensive overview of Internet failures, a broad foundation is needed. For the work presented in this article, a variety of sources were consulted. We started by interviewing practitioners and representatives from regional Internet service providers (ISPs), national research and education network (NREN) operators, national incumbent operators, and multinational networks about their experiences and incidents, and their root causes. Our findings and recommendations [4] were validated in a formative workshop hosted by the European Network and Information Security Agency (ENISA). Subsequently, we augmented our overview with operator reports and literature searches in academic and trade articles, as well as news websites, blogs, fora, and operator mailing lists about Internet incidents. In the following, we limit our discussion to "Internet" services as commonly referred to by the end user, and will not extend the discussion to IP-based enterprise networks.

From our list of Internet incidents, 54 major and representative Internet failures over the period of June 2007 to December 2013 were chosen, which are displayed in Fig. 1. The figure visualizes the time, duration, impact size, and ultimate root cause of each event, denoted by a circle whose area proportionately indicates the approximate number of affected customers and whose color encodes the incident duration on a log scale. The markers are centered at the time and ultimate root cause; that is, if a service failed because of a database replication issue that was due to a defective core router, the event is marked as a networking issue. When no accurate number of affected customers was available and no meaningful estimate could be derived from operator reports or the literature, the figure only marks the time, root cause, and duration by a square. For details on these and other incidents beyond the space constraints of this article, we refer the reader to www.internetview.org.

There are a variety of ways to structure the most prevalent types of Internet failures. A first crude classification one could make is into intentional failures (i.e., attacks) and unintentional failures. However, by analyzing the listed incidents and their causes, it becomes apparent that most Internet failures were unintentional, and only a few of the incidents were the result of malicious attacks. We therefore adopted a slightly different categorization into infrastructure failures, Border Gateway Protocol (BGP)-related failures, and service failures resulting from an attack. Each category is further subdivided as follows.
1) Infrastructure failures list all instances where a component necessary to provide a particular service has failed, either directly as part of the operator's service development or outside the operator's scope but still indirectly impacting the operator's assets. Common failure types comprise network and cable failures, power failures, hardware failures (server failures, issues with storage systems, cooling facilities, structural failures, etc.), failures in the service architecture, and failures in software components necessary to provide a particular Internet service, ranging from server-side end-user applications to database applications. In addition, we also list in this category service impairments that specifically stem from an accident or natural disaster, such as a hurricane or a fire in a data center.

2) The Internet is a network of networks, where each network (called an autonomous system) possesses its own range of IP addresses and operates its own routing protocol. The Border Gateway Protocol (BGP) facilitates routing between autonomous systems; it is the necessary "glue" that holds the tens of thousands of networks together into a commonly accessible Internet. Despite this key importance, BGP is surprisingly susceptible to malfunctions; Internet service impairments and service failures due to BGP are listed in this category. Most common are BGP hijacking events, where a network announces IP address space that it does not actually own. As a result, traffic toward the network that is the actual user of that IP prefix is temporarily misdirected. Other previous BGP-related incidents were hardware- and protocol-based; for example, unusual but valid BGP messages caused key routers in the Internet to crash due to software bugs, thereby also effectively cutting networks off from the overall Internet.

3) Finally, service-related failures list Internet service incidents stemming from either failures in some underlying enabling service or direct attacks on the service itself. The category assembles all incidents involving the Domain Name System (DNS), which is necessary to translate domain names to their corresponding IP addresses (and without which websites become practically invisible to the end user), as well as impairments and outages of the Secure Sockets Layer (SSL) infrastructure that enables encryption between a service and the end user. This category also lists distributed denial of service (DDoS) attacks: malicious attacks executed from hundreds or thousands of hijacked computers simultaneously, with the intent to overload a system so that its real end users are denied service. In the classification "Miscellaneous," we collect various events aimed at interrupting a particular service, such as insider attacks and hacks. For an overview of attack types in the Internet and their economic incentives, we refer to Kim et al. [6].

[Figure 1. A timeline of Internet failures between June 2007 and December 2013. Each incident is plotted by date against its ultimate root cause, grouped into infrastructure failures (network/cable, energy, hardware, architecture, software, disasters), BGP failures (hijacking, hardware/protocol), and service failures (DDoS, DNS, SSL, miscellaneous). Circle area indicates the approximate number of affected customers (roughly 1,000 to over 1,000,000); color indicates outage duration on a log scale (1 minute to 1 year); squares mark events of unknown but major magnitude. Incidents include, among others, British Telecom, Level 3, Amazon AWS/EC2, Google Gmail and Docs, Microsoft Azure and Sidekick, Blackberry, Netflix, CloudFlare, the SEA-ME-WE 4 and FLAG FEA cable cuts, Diginotar, TurkTrust, and Hurricane Sandy.]
Figure 1 demonstrates that large-scale Internet service failures occur with regularity (at least, and usually more than, once a year), even when the plethora of usually unnoticeable smaller incidents and events related to national security are not considered. It also becomes evident that the vast majority of events visible to the end user revolve around failures of the infrastructure and enabling services. This is noteworthy because problems around the notoriously vulnerable BGP (for an excellent survey, see [2]) capture much attention, as it is theoretically possible to generate a large impact on the global interconnection system with comparatively little complexity. While such incidents do occur in practice, their frequency and impact are usually bounded, thanks to established monitoring infrastructures such as BGPmon.
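The mechanics that make prefix hijacks effective in the first place follow from longest-prefix-match forwarding: a bogus announcement that is more specific than the legitimate one is preferred by every router that hears it. A minimal sketch (the prefixes and AS labels below are invented for illustration):

```python
import ipaddress

def best_route(dst, announcements):
    """Pick the announcement with the longest matching prefix,
    as a router does when selecting a forwarding entry."""
    matches = [(pfx, origin) for pfx, origin in announcements
               if dst in ipaddress.ip_network(pfx)]
    # The most specific prefix (largest prefix length) wins.
    return max(matches, key=lambda m: ipaddress.ip_network(m[0]).prefixlen)

# The legitimate origin announces a /22; a hijacker announces two
# more-specific /24s inside it (hypothetical prefixes and origins).
announcements = [
    ("203.0.112.0/22", "AS-victim"),
    ("203.0.112.0/24", "AS-hijacker"),
    ("203.0.113.0/24", "AS-hijacker"),
]

dst = ipaddress.ip_address("203.0.113.10")
print(best_route(dst, announcements))  # the /24 wins: traffic flows to AS-hijacker
```

Destinations outside the hijacked /24s still reach the victim, which is why partial, hard-to-notice traffic diversion is the typical symptom.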
Based on the analysis of the incidents and their root causes, we also arrive at several other surprising conclusions. Much of the recent research work on network resilience has focused on the development of algorithmic link/path protection schemes that try to place backup routers and fiber-optic cables in the network in such a way that most end-to-end connections are protected while minimizing cost. In our review of Internet failures, however, almost no major incidents were identified that were ultimately caused by fiber cuts and that could have been prevented by such protection schemes. Major events such as the cuts of the "South East Asia–Middle East–Western Europe" (SEA-ME-WE), "Fiber-Optic Link Around the Globe" (FLAG FEA), and GO-1 submarine cables in late 2008 in the Mediterranean Sea, or prior events such as the 2006 Taiwan earthquake (during which eight submarine cables were cut), are usually not in the scope of such protection schemes, which typically only plan for a limited number of simultaneous failures. On the other hand, these resilience methods do seem effective against small-scale localized events that, according to the conducted ISP interviews, are probably not directly visible due to their magnitude, successful mitigation, and routine status.
Network infrastructure failures, however, involve not only issues such as cable cuts, but also failures of core routers and switches, which we found to be a surprisingly common root cause of major outages. This is surprising because it is common good practice in the ISP community [4] to deploy critical core components at least redundantly, or even with entire pools of hot spares. Nevertheless, there were multiple instances where a faulty networking element resulted in the failure of some higher-layer software component, such as a database breakdown, which ultimately caused an entire service to fail.
Part of this issue is due to the increasing complexity of Internet services and a tendency to build services by federating lower-level building blocks. While cost-effective, from an availability standpoint this results in a tightly coupled system, and with the introduced co-dependence on multiple systems, the frequency and impact of breakdowns increase. On one hand, this is true for intra-organizational services that all rely on a common core component, so that in case of a failure a variety of services are impaired (e.g., the simultaneous failures of Google Apps, Gmail, etc. in early 2011). On the other hand, this is also the case for inter-organizational services and infrastructures, where a service from one organization critically depends on the availability of another. How services depend on each other, as well as the strength and amount of co-dependence, is less and less known the higher one goes up the stack, so in the end multiple competing and apparently redundant services may actually rely on the same infrastructure. With the advent of cloud providers, this issue seems to have amplified, as repeatedly became visible over the past years. A failure in the Amazon Web Services (AWS) cloud infrastructure, for example, will render dozens of very diverse services unusable at the same time. This issue was illustrated in an exemplary manner on one occasion when several commercial uptime monitoring providers, which track websites and alert service providers about outages, all failed simultaneously because they all procured an underlying but critical piece of their monitoring solution from the same cloud provider. In such cases, the common good practice of geographically distributing resources frequently does not save the day, as the relatively less impactful connectivity and energy failures are traded against the apparently more frequent failures in the system architecture and software stack. In particular, when such diversification is done via the same providers and components, not much is gained. For instance, if an application is hosted in different data centers of the same cloud provider, the service might be more vulnerable, as it relies on a centralized system and now has an architectural single point of failure (SPoF).
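The uptime-monitor anecdote generalizes: whether two apparently redundant services are actually independent can only be judged after expanding their transitive dependencies. A small sketch over an entirely hypothetical dependency graph:

```python
def transitive_deps(service, deps):
    """Expand a service's direct dependencies into the full set it
    transitively relies on (depth-first walk of the dependency graph)."""
    seen = set()
    stack = [service]
    while stack:
        for d in deps.get(stack.pop(), []):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

# Hypothetical stack: two "competing" monitoring providers both end up
# on the same cloud region several layers down.
deps = {
    "monitor-A": ["alerting-A", "probe-fleet-A"],
    "monitor-B": ["alerting-B"],
    "alerting-A": ["cloud-region-1"],
    "probe-fleet-A": ["cloud-region-1"],
    "alerting-B": ["queue-B"],
    "queue-B": ["cloud-region-1"],
}

shared = transitive_deps("monitor-A", deps) & transitive_deps("monitor-B", deps)
print(shared)  # {'cloud-region-1'}: an architectural SPoF invisible at the top layer
```

Neither service sees the shared component in its direct dependencies, which is exactly why such SPoFs go unnoticed until the common substrate fails.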
As can also be seen in Fig. 1, the actual impact of many Internet failures is not known at all, predominantly because no global measurement and monitoring infrastructure exists as it does, for example, for BGP, where monitors distributed worldwide record changes in the global routing table and allow an estimation of which networks are affected by BGP prefix hijacking and routing issues. While some monitoring providers test the uptime of Internet services, we believe that their deployment sizes (a few hundred nodes in data centers) are not sufficient to obtain a good real-time view of the state of the Internet as experienced by the end user, or good localization of failures.
Finally, the results should prompt us to think differently about the mitigation strategies currently used in network resilience engineering. The fact that major events have much longer durations and different root causes (not predominantly network- and fiber-driven) than commonly assumed suggests that more attention should be directed at resilience engineering of the entire service stack, specifically at decoupling and challenge containment in tightly coupled systems. Our findings and recommendations for resilience optimization are further discussed in the next section.
RECOMMENDATIONS AND CHALLENGES
In this section, we discuss several commonly used failure mitigation strategies, exemplify under what circumstances they have failed, and provide recommendations and challenges on how to reach a more effective Internet failure mitigation plan. Since the Internet consists of a network of networks, some of our (intranetwork) recommendations could be followed or implemented by ISPs and network operators to strengthen their infrastructures against accidental failures and malicious attacks, while other (internetwork) recommendations may warrant action by the policy makers who govern the global Internet, leading to a more resilient Internet ecosystem.
NETWORK RISK ASSESSMENT

The first step in obtaining a (more) robust network is creating a risk profile of the network that identifies possible network vulnerabilities, together with a method to measure and assess the resilience of a network. Reference [3] provides a comprehensive overview of the various resilience classification approaches in the literature. In addition to a suitable metric, obtaining an accurate risk profile that can serve as a solid foundation for resilience engineering requires attention to a number of aspects.
Going Beyond a Graph Representation — A network typically consists of physical (point-of-presence) locations, the hardware at those locations, and the physical (optical fiber) connections between locations. On top of this network, the operator may run several logical network services, such as dense wavelength-division multiplexing (DWDM), synchronous digital hierarchy (SDH)/carrier Ethernet, and Ethernet, each constituting a layer on top of the previous one.
Regardless of their complexity, networks are often modeled as a graph consisting of nodes and links, and as a result, much work on improving network robustness has directed its attention to improving various graph connectivity metrics. In practice, however, a connection typically does not form a straight line between the locations it connects, and such lines hide a number of underlying dependencies. For example, identifying all single points of failure (SPoFs) in a network based on a graph representation of that network could miss the vulnerability of several links that are physically routed close together. Such geographical SPoFs may exist, and should be identified at and across different layers.
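A toy example of why the graph view is not enough: the four-node ring below is two-edge-connected, so a purely topological check reports no single point of failure, yet two of its "disjoint" links share a duct and fail together (the topology and duct assignment are invented for illustration):

```python
# Ring topology A-B-C-D; each link is annotated with the duct it runs through.
links = {
    "A-B": "duct-north",
    "B-C": "duct-east",
    "C-D": "duct-south",
    "D-A": "duct-south",   # physically shares a duct with C-D
}

def connected_without(failed_links):
    """Check whether all four nodes stay mutually reachable after
    removing the given links (simple graph search from node A)."""
    adj = {}
    for link in links:
        if link in failed_links:
            continue
        u, v = link.split("-")
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, stack = {"A"}, ["A"]
    while stack:
        for n in adj.get(stack.pop(), ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen == {"A", "B", "C", "D"}

# Topological view: the ring survives every single LINK failure.
assert all(connected_without({link}) for link in links)

# Geographic view: group links by duct and fail whole ducts instead.
by_duct = {}
for link, duct in links.items():
    by_duct.setdefault(duct, set()).add(link)
survives_duct = {duct: connected_without(ls) for duct, ls in by_duct.items()}
print(survives_duct)  # duct-south is a geographic SPoF: losing it isolates D
```

The same reachability test gives opposite answers depending on whether failures are modeled at the link or at the duct level, which is the essence of a geographical SPoF.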
Data to Determine Shared Risks — During our study, it became evident that many providers, especially small ones, do not have sufficient information about the resources they use, which are typically leased, to detect shared risk groups and correspondingly provision a resilient network. In addition to collecting such geo-localized data, inference tools need to be developed to efficiently determine shared risk groups and improve network design, even for medium-sized operator networks. A noteworthy example in this direction is [1].
Probabilistic Embedded Risk Assessment — Not only is geo-information on the network itself important, but so is its embedding in a geographical region and the context in which it operates. As network failures can be the result of natural disasters, or abound in densely populated areas where fiber cuts are more frequent, the geographic area in which the network is embedded clearly affects the risk to which the network is exposed. In addition, not all disasters and failures are created equal, and resilience engineering approaches should take the estimated likelihood and projected impact of a challenge into account for a cost- and risk-optimized mitigation strategy.
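Weighting challenges by likelihood and impact amounts to ranking mitigations by expected loss avoided per unit of mitigation cost. A back-of-the-envelope sketch with entirely invented numbers:

```python
# Hypothetical yearly challenge profile for one network:
# name -> (annual probability, outage cost if it happens, cost to mitigate).
challenges = {
    "fiber cut in urban duct": (0.60, 50_000, 20_000),
    "regional flood":          (0.02, 2_000_000, 150_000),
    "core router failure":     (0.20, 400_000, 60_000),
}

def expected_loss(p, impact):
    """Annualized expected loss of an unmitigated challenge."""
    return p * impact

# Rank challenges by expected loss avoided per unit of mitigation spend.
ranked = sorted(
    challenges.items(),
    key=lambda kv: expected_loss(kv[1][0], kv[1][1]) / kv[1][2],
    reverse=True,
)
for name, (p, impact, cost) in ranked:
    ratio = expected_loss(p, impact) / cost
    print(f"{name}: E[loss]={expected_loss(p, impact):,.0f}, benefit/cost={ratio:.2f}")
```

With these numbers the frequent, low-impact fiber cut outranks the rare flood, illustrating the point that a cost-optimized plan does not simply chase the largest possible disaster.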
BUSINESS CONTINUITY MANAGEMENT AND MUTUAL AID
From studying published crisis responses and via several interviews with network operators, it became apparent that a business continuity management (BCM) plan does not always exist or is not up to date, leading to many failures being addressed in an ad hoc manner. When investigating the incidents that were successfully overcome with minimal impact, mutual aid between operators (e.g., temporarily lending equipment or routing traffic over another ISP's infrastructure) appeared to be a key factor in challenge containment. On one hand, this underlines the importance of BCM. Moreover, the extent of BCM policies and planned responses, if they exist at all (typically only at larger operators), tends to differ greatly between network operators. On the other hand, this also highlights that resilience engineering is not, and should not be, limited to a single network. When addressing global incidents, it is important to have coordinated actions or agreements under which one can rely on someone else's network for offloading traffic. Following similar approaches for the technical side of network design and resilience optimization would allow higher resilience levels for a particular deployment to be achieved at lower overall cost and network complexity; as in insurance, risk and impact are distributed over more shoulders.
RESILIENCE BY DESIGN

Depending on the outcome of the risk assessment, the network may need to be augmented (i.e., by adding nodes and/or links) to improve its resilience against the identified risks. The art of network augmentation is how to best balance resilience and augmentation costs. An overview of network planning under traffic and risk uncertainty can be found in [7].
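The cost/resilience trade-off can be made concrete on a tiny instance: given a chain topology (where every link is a SPoF) and a few priced candidate links, exhaustively search for the cheapest augmentation that lets the network survive any single link cut. The topology and prices below are invented for illustration; real planning tools use far more scalable optimization.

```python
from itertools import combinations

# Existing topology: a chain A-B-C-D, so every link is a single point of failure.
existing = {("A", "B"), ("B", "C"), ("C", "D")}
nodes = {"A", "B", "C", "D"}

# Hypothetical candidate augmentation links with installation cost.
candidates = {("A", "D"): 90, ("A", "C"): 70, ("B", "D"): 75}

def survives_any_single_cut(all_links):
    """True if the network stays connected after any one link failure."""
    def connected(ls):
        adj = {}
        for u, v in ls:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        seen, stack = {"A"}, ["A"]
        while stack:
            for n in adj.get(stack.pop(), ()):
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        return seen == nodes
    return all(connected(all_links - {cut}) for cut in all_links)

# Exhaustively find the cheapest augmentation that removes all link SPoFs.
best = None
for r in range(1, len(candidates) + 1):
    for combo in combinations(candidates, r):
        if survives_any_single_cut(existing | set(combo)):
            cost = sum(candidates[c] for c in combo)
            if best is None or cost < best[1]:
                best = (combo, cost)
print(best)  # adding A-D alone closes the chain into a ring
```

Here the single ring-closing link beats any cheaper combination, since neither A-C nor B-D alone protects both chain ends.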
Resilience of the Entire Stack — Despite the importance of communication networks, their security and resilience have long been only marginally addressed, typically as a later add-on, while in other critical systems (such as airplanes) resilience has been designed in from the outset and tested continuously. As a result, several dependencies have been introduced in communication networks that can cause a ripple-through effect when only a single component fails. For instance, in October 2011, a core switch within the Blackberry network failed. Such a hardware failure is in practice usually quickly resolved by proper fail-over schemes, but in this particular case it caused a malfunction of a database that was much harder to resolve and eventually led to an outage lasting three days. Hence, resilience engineering in networks should look at the entire networking and application stack, as even minor challenges that are remediated within the allowed mitigation margins may amplify and have a large impact at other layers. Cross-layer resilience engineering (in contrast to, for example, cross-layer performance optimization in wireless networks) has unfortunately received little attention to date.
Spare Resources — The interconnections between individual networks terminate in data centers, which follow a wide variety of practices to increase resilience. There, the level of redundancy and protection against typical failures is described by tiers, with specific guidelines as to what practices must be implemented for a data center to meet these levels and be certifiable as such. The Amsterdam Internet Exchange (AMS-IX), for example, has extended these available standards and further refined them into a list of 141 minimum baseline (technical design, operational, and business continuity) requirements for the data centers providing service to the exchange. While these standards recommend overprovisioning network elements by a factor of two and creating independent availability regions capable of securing network operations, there is currently an ongoing trend of providers operating their networks at higher and higher loads (e.g., as Google is doing with the software-defined wide area network connecting its data centers). The "hotter" the network is operated, the fewer backup resources are available, and the higher the risk in case of failure, since backup paths or resources might not be available. Moreover, running a network at high utilization introduces a risk of overload, as we have seen, for instance, with popular applications such as Twitter in their early days. Finally, adopting new technologies such as software-defined networking (SDN, with its OpenFlow protocol) could pose new vulnerabilities, for instance with respect to the robustness of the SDN controller, which introduces a new SPoF, or the security of the OpenFlow protocol.
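The connection between operating temperature and survivability can be stated numerically: headroom is what absorbs rerouted traffic, so with two parallel links a single failure is only survivable while utilization stays at or below 50 percent. A quick sketch with hypothetical capacities:

```python
def survives_single_failure(capacities, loads):
    """After any one link fails, can the surviving links together carry
    the total offered load without exceeding their combined capacity?"""
    total_load = sum(loads)
    for i in range(len(capacities)):
        surviving_capacity = sum(capacities) - capacities[i]
        if total_load > surviving_capacity:
            return False
    return True

# Two 100G links, operated "cold" at 40% vs. "hot" at 70% utilization.
print(survives_single_failure([100, 100], [40, 40]))  # True: 80G fits on one link
print(survives_single_failure([100, 100], [70, 70]))  # False: 140G cannot fit on 100G
```

The check is deliberately simplistic (it ignores routing constraints and assumes load can be freely shifted), but it captures why factor-of-two overprovisioning and hot operation are in direct tension.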
Implications of Tightly Coupled Systems, Shared Infrastructure, and Unknown SPoFs — In the past few years, the role of cloud computing, in which the infrastructure, platform, and even software used by IT operations are outsourced services, has become more prominent. The flexibility of cloud services certainly has its advantages, since they can be used when, and only for as long as, they are needed, and leased at prices charged in small increments of actual usage. However, these shared infrastructures also pose a risk that the failure of a data center could cripple many services; indeed, some analysts proclaimed 2012 the year of cloud (computing) outages. For many Internet services building on such cloud infrastructures, this creates risks that cannot be mitigated, as customers typically do not have much insight into the concrete building blocks of the used infrastructure and its potential architectural SPoFs. This general issue, however, extends well beyond the particular scenario of cloud computing. In recent years, services have become increasingly coupled and integrated, which has also increased the vulnerability of Internet services due to common shared or cross-dependent infrastructures. Similar to the intensified linkages among actors in the financial market that led to the bursting of the housing bubble in 2008, we might have created similar systemic or hyper-risks [5] in Internet services, which might explain the comparatively large magnitude of outages. The resilience of such tightly coupled systems is, however, still largely unknown, both in general and specifically for computer networks and the Internet as their most prominent example. More research is needed to understand risk and failure trajectories in these tightly coupled systems in order to develop effective challenge mitigation strategies for Internet services operating under such circumstances.
MONITORING OF INTERNETWORK RESILIENCE

The key to inter-domain routing resilience is the establishment of redundancy at multiple physical endpoints and, if possible, also across multiple levels. The most fundamental inter-domain protection concept is multihoming, that is, the presence of at least two distinct uplink connections toward non-local destinations. To realize the maximum possible resilience from such a setup, the critical dependencies of the upstream providers should ideally be investigated (e.g., where the transit providers' fibers run, from which grid their equipment is powered, or where they interconnect), but obtaining a comprehensive view of this is frequently difficult.
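One practical, if partial, check of those critical dependencies is to compare the AS paths learned from each upstream: if every path to important destinations traverses the same transit AS, the "multihomed" site still has a single upstream dependency. A sketch with made-up AS numbers:

```python
def shared_transit(paths_a, paths_b):
    """Return transit ASes appearing in EVERY observed path via both
    upstreams, i.e. single ASes whose failure would cut off both uplinks."""
    def common_transits(paths):
        # Intermediate hops only: drop the direct neighbor and the origin.
        hop_sets = [set(p[1:-1]) for p in paths]
        return set.intersection(*hop_sets) if hop_sets else set()
    return common_transits(paths_a) & common_transits(paths_b)

# AS paths toward a few destinations, as learned from upstream 1 and
# upstream 2 (all AS numbers are invented, from the private-use range).
via_upstream1 = [[64496, 64700, 64510], [64496, 64700, 64511]]
via_upstream2 = [[64497, 64700, 64510], [64497, 64700, 64511]]

print(shared_transit(via_upstream1, via_upstream2))  # {64700}: a hidden single dependency
```

Such a check covers only the routing layer; shared fiber, power, or colocation, as noted above, requires data that AS paths cannot reveal.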
If a network operator has established several interconnection points with another ISP, BGP provides additional means to manage, and thereby strengthen, the interconnection. By tuning the individual BGP configuration at each location and influencing through which points traffic should enter or exit the autonomous system, using mechanisms such as BGP multi-exit discriminators, local preferences, or path attributes, providers can obtain a fine level of control over the traffic flows between networks, privileging or relieving particular hardware.
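These knobs feed into BGP's deterministic best-path selection, which compares candidate routes attribute by attribute: highest local preference first, then shortest AS path, then lowest multi-exit discriminator (MED). A simplified sketch of that comparison order (real routers apply several further tie-breakers, and MED is only comparable between routes from the same neighboring AS, as is the case below):

```python
def best_path(routes):
    """Pick the preferred route, mimicking the first steps of the BGP
    decision process: highest LOCAL_PREF, then shortest AS path,
    then lowest MED."""
    return min(
        routes,
        key=lambda r: (-r["local_pref"], len(r["as_path"]), r["med"]),
    )

# Hypothetical routes to the same prefix via three interconnection points
# (AS numbers are invented; both IXP routes come from neighbor AS 64500).
routes = [
    {"peer": "IXP-A",   "local_pref": 100, "as_path": [64500, 64510], "med": 10},
    {"peer": "IXP-B",   "local_pref": 100, "as_path": [64500, 64510], "med": 5},
    {"peer": "transit", "local_pref": 90,  "as_path": [64999, 64510], "med": 0},
]
# Equal local preference and path length, so the lower MED at IXP-B wins:
print(best_path(routes)["peer"])  # IXP-B
```

By raising a MED on one session or lowering local preference on another, an operator steers traffic toward or away from specific interconnection hardware, which is exactly the fine-grained control described above.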
For such setups at network operators, and to further deepen academic insight into the resilience and reliability of Internet services and their underlying infrastructures, a large monitoring framework should be established. It would be able to build up an assessment and a track record of "how good" connections via a particular autonomous system are, what the stability of individual paths within an autonomous system is, and what particular hardware resides at certain geographical locations. Such monitoring systems have contributed a great deal to minimizing the impact of BGP hijacking incidents, as malicious and accidental prefix announcements can today be rapidly detected. Establishing a similar system to understand the exact mechanics, location, and impact of Internet failures promises a similar leap toward a more resilient Internet.
CONCLUSION

Is it "all quiet on the Internet front?" In this article, we have investigated a broad range of Internet failures over the course of the past six years and analyzed their root causes, frequency, duration, and societal impact; it became apparent that failures abound. Such a study was missing to date, yet it is vital for establishing proper Internet failure mitigation schemes.
In the second part of this article, we have scrutinized currently employed mitigation schemes, exemplified in which cases they failed and why, and proposed recommendations and challenges on the road toward fine-grained network risk assessment methods and better resilience planning and responses.
ADDITIONAL RESOURCES

Details about the incidents described in this article, as well as other resources, can be found at https://www.internetview.org, a new website dedicated to Internet infrastructure monitoring and resilience.
ACKNOWLEDGMENTS

Part of this work has been supported by the EU FP7 EINS project under grant agreement No. 288021.
REFERENCES

[1] N. Adam et al., "Consequence Analysis of Complex Events on Critical U.S. Infrastructure," Commun. ACM, vol. 56, no. 6, 2013, pp. 83–91.
[2] K. Butler et al., "A Survey of BGP Security Issues and Solutions," Proc. IEEE, vol. 98, no. 1, Jan. 2010.
[3] P. Cholda et al., "A Survey of Resilience Differentiation Frameworks in Communication Networks," IEEE Commun. Surveys, vol. 9, no. 4, 2007.
[4] C. Doerr et al., "Good Practices in Resilient Internet Interconnection," ENISA report, June 2012.
[5] D. Helbing, "Globally Networked Risks and How to Respond," Nature, vol. 497, 2013, pp. 51–59.
[6] W. Kim et al., "The Dark Side of the Internet: Attacks, Costs and Responses," Information Systems, vol. 36, 2011, pp. 675–705.
[7] S. Yang and F. A. Kuipers, "Traffic Uncertainty Models in Network Planning," IEEE Commun. Mag., vol. 52, no. 2, Feb. 2014, pp. 172–77.
BIOGRAPHIES

CHRISTIAN DOERR ([email protected]) is an assistant professor in the Network Architectures and Services group at Delft University of Technology (TU Delft). He received an M.Sc. degree in computer science and a Ph.D. degree in computer science and cognitive science from the University of Colorado at Boulder. His research interests revolve around critical infrastructure protection, cyber security, and resilience engineering.

FERNANDO A. KUIPERS [SM] ([email protected]) is an associate professor in the Network Architectures and Services group at TU Delft. He received his M.Sc. degree in electrical engineering from TU Delft in June 2000 and subsequently obtained his Ph.D. degree (cum laude) in 2004 at the same university. His research interests mainly revolve around network algorithms and cover routing, quality of service, network survivability, optical networks, and content distribution. His work on these subjects includes distinguished papers at IEEE INFOCOM 2003, Chinacom 2006, IFIP Networking 2008, IEEE FMN 2008, IEEE ISM 2008, and ITC 2009.