External Memory Algorithms and Data Structures:

Dealing with Massive Data

JEFFREY SCOTT VITTER

Duke University

Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this article we survey the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory. For the batched problem of sorting and related problems such as permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices (such as matrix multiplication and transposition), geometric data (such as finding intersections and constructing convex hulls), and graphs (such as list ranking, connected components, topological sorting, and shortest paths). In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide a convenient means in online data structures to make effective use of the data accessed from disk. We also reexamine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length (e.g., text strings), or when the allocated amount of internal memory can change dynamically. Programming tools and environments are available for simplifying the EM programming task. During the course of the survey, we report on some experiments in the domain of spatial databases using the TPIE system (transparent parallel I/O programming environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than methods currently used in practice.

Categories and Subject Descriptors: B.4.3 [Input/Output and Data Communications]: Interconnections—parallel I/O; E.1 [Data Structures]: graphs and networks, trees; E.5 [Files]: sorting/searching; F.1.1 [Computation by Abstract Devices]: Models of Computation—bounded action devices, relations between models; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems—computations on discrete structures, geometrical problems and computations, sorting and searching; H.2.2 [Database Management]: Physical Design—access methods; H.2.8 [Database Management]: Database Applications—spatial databases and GIS; H.3.2 [Information Storage and Retrieval]: Information Storage—file organization; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—information filtering, search process

General Terms: Algorithms, Design, Experimentation, Performance, Theory

Additional Key Words and Phrases: Batched, block, B-tree, disk, dynamic, extendible hashing, external memory, hierarchical memory, I/O, multidimensional access methods, multilevel memory, online, out-of-core, secondary storage, sorting

This work was supported in part by Army Research Office MURI grant DAAH04-96-1-0013 and by National Science Foundation research grants CCR-9522047, EIA-9870734, and CCR-9877133.

Part of this work was done at BRICS, University of Aarhus, Aarhus, Denmark and at INRIA, Sophia Antipolis, France. Earlier, shorter versions of some of this material appeared in Vitter [1998; 1999a; 1999b; 1999c].

Author's address: Department of Computer Science, Levine Science Research Center, Duke University, Box 90129, Durham, NC 27708-0129; e-mail: [email protected]; URL: http://www.cs.duke.edu/~jsv.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or [email protected].

© 2001 ACM 0360-5611/01/0700-0001 $5.00

ACM Computing Surveys, Vol. 33, No. 2, June 2001, pp. 209–271.

1. INTRODUCTION

1.1. Background

For reasons of economy, general-purpose computer systems usually contain a hierarchy of memory levels, each level with its own cost and performance characteristics. At the lowest level, CPU registers and caches are built with the fastest but most expensive memory. For internal main memory, dynamic random access memory (DRAM) is typical. At a higher level, inexpensive but slower magnetic disks are used for external mass storage, and even slower but larger-capacity devices such as tapes and optical disks are used for archival storage. Figure 1 depicts a typical memory hierarchy and its characteristics.

Most modern programming languages are based upon a programming model in which memory consists of one uniform address space. The notion of virtual memory allows the address space to be far larger than what can fit in the internal memory of the computer. Programmers have a natural tendency to assume that all memory references require the same access time. In many cases, such an assumption is reasonable (or at least doesn't do any harm), especially when the data sets are not large. The utility and elegance of this programming model are to a large extent why it has flourished, contributing to the productivity of the software industry.

However, not all memory references are created equal. Large address spaces span multiple levels of the memory hierarchy, and accessing the data in the lowest levels of memory is orders of magnitude faster than accessing the data at the higher levels. For example, loading a register takes on the order of a nanosecond (10^-9 seconds), and accessing internal memory takes tens of nanoseconds, but the latency of accessing data from a disk is several milliseconds (10^-3 seconds), which is about one million times slower! In applications that process massive amounts of data, the input/output communication (or simply I/O) between levels of memory is often the bottleneck.

Many computer programs exhibit some degree of locality in their pattern of memory references: Certain data are referenced repeatedly for a while, and then the program shifts attention to other sets of data. Modern operating systems take advantage of such access patterns by tracking the program's so-called "working set", a vague notion that roughly corresponds to the recently referenced data items [Denning 1980]. If the working set is small, it can be cached in high-speed memory so that access to it is fast. Caching and prefetching heuristics have been developed to reduce the number of occurrences of a "fault," in which the referenced data item is not in the cache and must be retrieved by an I/O from a higher level of memory. For example, in a page fault, an I/O is needed to retrieve a disk page from disk and bring it into internal memory.

Fig. 1. The memory hierarchy of a typical uniprocessor system, including registers, instruction cache, data cache (level 1 cache), level 2 cache, internal memory, and disks. Below each memory level is the range of typical sizes for that memory level. Each value of B at the top of the figure denotes the block transfer size between two adjacent levels of the hierarchy. All sizes are given in units of bytes (B), kilobytes (KB), megabytes (MB), gigabytes (GB), or terabytes (TB). (In the PDM model described in Section 2, we measure B in units of items rather than in units of bytes.) In this figure, 8 KB is the indicated physical block transfer size between internal memory and the disks. However, in batched applications it is often more appropriate to use a substantially larger logical block transfer size.

Caching and prefetching methods are typically designed to be general-purpose, and thus they cannot be expected to take full advantage of the locality present in every computation. Some computations themselves are inherently nonlocal, and even with omniscient cache management decisions they are doomed to perform large amounts of I/O and suffer poor performance. Substantial gains in performance may be possible by incorporating locality directly into the algorithm design and by explicit management of the contents of each level of the memory hierarchy, thereby bypassing the virtual memory system.

We refer to algorithms and data structures that explicitly manage data placement and movement as external memory (or EM) algorithms and data structures. Some authors use the terms I/O algorithms or out-of-core algorithms.

We concentrate in this survey on the I/O communication between the random access internal memory and the magnetic disk external memory, where the relative difference in access speeds is most apparent. We therefore use the term I/O to designate the communication between the internal memory and the disks.

1.2. Overview

In this article, we survey several paradigms for exploiting locality and thereby reducing I/O costs when solving problems in external memory. The problems we consider fall into two general categories:

(1) batched problems, in which no preprocessing is done and the entire file of data items must be processed, often by streaming the data through the internal memory in one or more passes; and

(2) online problems, in which computation is done in response to a continuous series of query operations. A common technique for online problems is to organize the data items via a hierarchical index, so that only a very small portion of the data needs to be examined in response to each query. The data being queried can be either static, which can be preprocessed for efficient query processing, or dynamic, where the queries are intermixed with updates such as insertions and deletions.

We base our approach upon the parallel disk model (PDM) described in the next section. PDM provides an elegant and reasonably accurate model for analyzing the relative performance of EM algorithms and data structures. The three main performance measures of PDM are the number of I/O operations, the disk space usage, and the CPU time. For reasons of brevity, we focus on the first two measures. Most of the algorithms we consider are also efficient in terms of CPU time. In Section 3, we list four fundamental I/O bounds that pertain to most of the problems considered here. In Section 4, we show why it is crucial for EM algorithms to exploit locality, and we discuss an automatic load balancing technique called disk striping for using multiple disks in parallel.

In Section 5, we look at the canonical batched EM problem of external sorting and the related problems of permuting and fast Fourier transform. The two important paradigms of distribution and merging account for all well-known external sorting algorithms. Sorting with a single disk is now well understood, so we concentrate on the more challenging task of using multiple (or parallel) disks, for which disk striping is nonoptimal. The challenge is to guarantee that the data in each I/O are spread evenly across the disks so that the disks can be used simultaneously. We also cover the fundamental lower bounds on the number of I/Os needed to perform sorting and related batched problems. In Section 6, we discuss grid and linear algebra batched computations.

For most problems, parallel disks can be utilized effectively by means of disk striping or the parallel disk techniques of Section 5, and hence we restrict ourselves starting in Section 7 to the conceptually simpler single-disk case. In Section 7, we mention several effective paradigms for batched EM problems in computational geometry. The paradigms include distribution sweep (for spatial join and finding all nearest neighbors), persistent B-trees (for batched point location and graph drawing), batched filtering (for 3-D convex hulls and batched point location), external fractional cascading (for red–blue line segment intersection), external marriage-before-conquest (for output-sensitive convex hulls), and randomized incremental construction with gradations (for line segment intersections and other geometric problems). In Section 8, we look at EM algorithms for combinatorial problems on graphs, such as list ranking, connected components, topological sorting, and finding shortest paths. One technique for constructing I/O-efficient EM algorithms is to simulate parallel algorithms; sorting is used between parallel steps in order to reblock the data for the simulation of the next parallel step.

In Sections 9 to 11, we consider data structures in the online setting. The dynamic dictionary operations of insert, delete, and lookup can be implemented by the well-known method of hashing. In Section 9, we examine hashing in external memory, in which extra care must be taken to pack data into blocks and to allow the number of items to vary dynamically. Lookups can generally be done with only one or two I/Os. Section 10 begins with a discussion of B-trees, the most widely used online EM data structure for dictionary operations and 1-D range queries. Weight-balanced B-trees provide a uniform mechanism for dynamically rebuilding substructures and are useful for a variety of online data structures. Level-balanced B-trees permit maintenance of parent pointers and support cut and concatenate operations, which are used in reachability queries on monotone subdivisions. The buffer tree is a so-called "batched dynamic" version of the B-tree for efficient implementation of search trees and priority queues in EM sweep line applications. In Section 11, we discuss spatial data structures for multidimensional data, especially those that support online range search. Multidimensional extensions of the B-tree, such as the popular R-tree and its variants, use a linear amount of disk space and often perform well in practice, although their worst-case performance is poor. A nonlinear amount of disk space is required to perform 2-D orthogonal range queries efficiently in the worst case, but several important special cases of range searching can be done efficiently using only linear space. A useful paradigm for developing an efficient EM data structure is to "externalize" an efficient data structure designed for internal memory; a key component of how to make the structure I/O-efficient is to "bootstrap" a static EM data structure for small-sized problems into a fully dynamic data structure of arbitrary size. This paradigm provides optimal linear-space EM data structures for several variants of 2-D orthogonal range search.

In Section 12, we discuss some additional EM approaches useful for dynamic data structures, and we also investigate kinetic data structures, in which the data items are moving. In Section 13, we focus on EM data structures for manipulating and searching text strings.

Table I. Paradigms for I/O Efficiency Discussed in this Survey

Paradigm                            Reference
Batched dynamic processing          §10.4
Batched filtering                   §7
Batched incremental construction    §7
Bootstrapping                       §11
Disk striping                       §4.2
Distribution                        §5.1
Distribution sweeping               §7
Externalization                     §11.3
Fractional cascading                §7
Filtering                           §11
Lazy updating                       §10.4
Load balancing                      §4
Locality                            §4
Marriage-before-conquest            §7
Merging                             §5.2
Parallel simulation                 §8
Persistence                         §7
Random sampling                     §5.1
Scanning (or streaming)             §2.2
Sparsification                      §8
Time-forward processing             §10.4

In Section 14, we discuss programming environments and tools that facilitate high-level development of efficient EM algorithms. We focus on the TPIE system (transparent parallel I/O environment), which we use in the various timing experiments in the article. In Section 15 we discuss EM algorithms that adapt optimally to dynamically changing internal memory allocations. We conclude with some final remarks and observations in Section 16. Table I lists several of the EM paradigms discussed in this survey.

2. PARALLEL DISK MODEL (PDM)

EM algorithms explicitly control data placement and movement, and thus it is important for algorithm designers to have a simple but reasonably accurate model of the memory system's characteristics. Magnetic disks consist of one or more rotating platters and one read/write head per platter surface. The data are stored on the platters in concentric circles called tracks, as shown in Figure 2. To read or write a data item at a certain address on disk, the read/write head must mechanically seek the correct track and then wait for the desired address to pass by. The seek time to move from one random track to another is often on the order of 3 to 10 milliseconds, and the average rotational latency, which is the time for half a revolution, has the same order of magnitude. In order to amortize this delay, it pays to transfer a large contiguous group of data items, called a block. Similar considerations apply to all levels of the memory hierarchy. Typical block sizes are shown in Figure 1.

Even if an application can structure its pattern of memory accesses to exploit locality and take full advantage of disk block transfer, there is still a substantial access gap between internal and external memory performance. In fact the access gap is growing, since the latency and bandwidth of memory chips are improving more quickly than those of disks. Use of parallel processors further widens the gap. Storage systems such as RAID deploy multiple disks in order to get additional bandwidth [Chen et al. 1994; Hellerstein et al. 1997].

In the next section, we describe the high-level parallel disk model (PDM), which we use throughout this survey for the design and analysis of EM algorithms and data structures. In Section 2.2, we consider some practical modeling issues dealing with the sizes of blocks and tracks and the corresponding parameter values in PDM. In Section 2.3, we review the historical development of models of I/O and hierarchical memory.

2.1. PDM and Problem Parameters

We can capture the main properties of magnetic disks and multiple disk systems by the commonly used parallel disk model (PDM) introduced by Vitter and Shriver [1994a]:

N = problem size (in units of data items),
M = internal memory size (in units of data items),
B = block transfer size (in units of data items),
D = number of independent disk drives, and
P = number of CPUs,

where M < N, and 1 ≤ DB ≤ M/2. The data items are assumed to be of fixed length. In a single I/O, each of the D disks can simultaneously transfer a block of B contiguous data items.

Fig. 2. Platter of a magnetic disk drive.

If P ≤ D, each of the P processors can drive about D/P disks; if D < P, each disk is shared by about P/D processors. The internal memory size is M/P per processor, and the P processors are connected by an interconnection network. For routing considerations, one desired property for the network is the capability to sort the M data items in the collective main memories of the processors in parallel in optimal O((M/P) log M) time.¹ The special cases of PDM for the case of a single processor (P = 1) and multiprocessors with one disk per processor (P = D) are pictured in Figure 3.

¹ We use the notation log n to denote the binary (base 2) logarithm log_2 n. For bases other than 2, the base is specified explicitly.

Queries are naturally associated with online computations, but they can also be done in batched mode. For example, in the batched orthogonal 2-D range searching problem discussed in Section 7, we are given a set of N points in the plane and a set of Q queries in the form of rectangles, and the problem is to report the points lying in each of the Q query rectangles. In both the batched and online settings, the number of items reported in response to each query may vary. We thus need to define two more performance parameters:

Q = number of input queries (for a batched problem), and
Z = query output size (in units of data items).

It is convenient to refer to some of the above PDM parameters in units of disk blocks rather than in units of data items; the resulting formulas are often simplified. We define the lowercase notation

n = N/B,   m = M/B,   q = Q/B,   z = Z/B                (1)

to be the problem input size, internal memory size, query specification size, and query output size, respectively, in units of disk blocks.
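
As a concrete illustration of these definitions, the short Python sketch below (ours, not part of the survey; the names and rounding choices are arbitrary) bundles the PDM parameters, checks the model's assumptions M < N and 1 ≤ DB ≤ M/2, and derives the block-unit quantities of Equation (1).

from dataclasses import dataclass
from math import ceil

@dataclass
class PDMParameters:
    """Illustrative container for the PDM parameters of Section 2.1 (in units of items)."""
    N: int       # problem size
    M: int       # internal memory size
    B: int       # block transfer size
    D: int = 1   # number of independent disk drives
    P: int = 1   # number of CPUs
    Q: int = 0   # number of input queries (batched problems)
    Z: int = 0   # query output size

    def __post_init__(self):
        # Constraints assumed by the model: M < N and 1 <= DB <= M/2.
        assert self.M < self.N, "PDM assumes the problem does not fit in internal memory"
        assert 1 <= self.D * self.B <= self.M / 2, "PDM assumes 1 <= DB <= M/2"

    # Lowercase quantities of Equation (1), in units of disk blocks.
    @property
    def n(self): return ceil(self.N / self.B)   # round a partial block up
    @property
    def m(self): return self.M // self.B        # memory holds whole blocks
    @property
    def q(self): return ceil(self.Q / self.B)
    @property
    def z(self): return ceil(self.Z / self.B)

# Example setting, reused numerically in Section 3.
pdm = PDMParameters(N=10**10, M=10**7, B=10**4)
print(pdm.n, pdm.m)    # 1000000 1000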

We assume that the input data are initially "striped" across the D disks, in units of blocks, as illustrated in Figure 4, and we require the output data to be similarly striped. Striped format allows a file of N data items to be read or written in O(N/DB) = O(n/D) I/Os, which is optimal.

Fig. 4. Initial data layout on the disks, for D = 5 disks and block size B = 2. The input data items are initially striped block by block across the disks. For example, data items 16 and 17 are stored in the second block (i.e., in stripe 1) of disk D3.
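
The striped layout itself is easy to state as code. The following sketch (our own helper functions, assuming disks are numbered D0 through D(D-1) as in Figure 4) maps an item to the disk and stripe that hold it under round-robin striping, and reproduces the example in the caption.

def block_location(block_index: int, D: int):
    """Round-robin striping: block j lives on disk (j mod D) in stripe (j div D)."""
    return block_index % D, block_index // D

def item_location(item: int, B: int, D: int):
    """Map an item number to (disk, stripe) under the blocked layout of Figure 4."""
    return block_location(item // B, D)

# Figure 4: D = 5 disks, block size B = 2.  Items 16 and 17 form block 8,
# which lands in stripe 1 of disk D3.
assert item_location(16, B=2, D=5) == (3, 1)
assert item_location(17, B=2, D=5) == (3, 1)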

Fig. 3. Parallel disk model: (a) P = 1, in which the D disks are connected to a common CPU; (b) P = D, in which each of the D disks is connected to a separate processor.

The primary measures of performance in PDM are

(1) the number of I/O operations performed,
(2) the amount of disk space used, and
(3) the internal (sequential or parallel) computation time.

For reasons of brevity in this survey we focus on only the first two measures. Most of the algorithms we mention run in optimal CPU time, at least for the single-processor case. Ideally algorithms and data structures should use linear space, which means O(N/B) = O(n) disk blocks of storage.

2.2. Practical Modeling Considerations

Track size is a fixed parameter of the disk hardware; for most disks it is in the range 50 to 200 KB. In reality, the track size for any given disk depends upon the radius of the track (cf. Figure 2). Sets of adjacent tracks are usually formatted to have the same track size, so there are typically only a small number of different track sizes for a given disk. A single disk can have a 3 : 2 variation in track size (and therefore bandwidth) between its outer and inner tracks.

The minimum block transfer size imposed by hardware is often 512 bytes, but operating systems generally use a larger block size, such as 8 KB, as in Figure 1. It is possible (and preferable in batched applications) to use logical blocks of larger size (sometimes called clusters) and further reduce the relative significance of seek and rotational latency, but the wall clock time per I/O will increase accordingly. For example, if we set PDM parameter B to be five times larger than the track size, so that each logical block corresponds to five contiguous tracks, the time per I/O will correspond to five revolutions of the disk plus the (now relatively less significant) seek time and rotational latency. If the disk is smart enough, rotational latency can even be avoided altogether, since the block spans entire tracks and reading can begin as soon as the read head reaches the desired track. Once the block transfer size becomes larger than the track size, the wall clock time per I/O grows linearly with the block size.
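
A back-of-the-envelope cost model makes the trade-off visible. The sketch below is purely illustrative; the seek, rotation, and track-size values are representative numbers taken from the ranges quoted in this section, not measurements of any particular drive.

def io_time_ms(block_kb, track_kb=100.0, seek_ms=6.0, revolution_ms=6.0,
               smart_disk=False):
    """Rough, illustrative cost of one I/O: seek + rotational latency + transfer.

    Transfer proceeds at one track per revolution; a 'smart' disk that starts
    reading a track-spanning block as soon as the head arrives avoids the
    rotational latency.  (All parameter values are assumptions, not data.)
    """
    rotational_latency = 0.0 if smart_disk else revolution_ms / 2
    transfer = (block_kb / track_kb) * revolution_ms
    return seek_ms + rotational_latency + transfer

for kb in (8, 100, 500):   # small block, roughly track-sized block, 5-track logical block
    print(kb, "KB:", round(io_time_ms(kb), 1), "ms per I/O")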

For best results in batched applications, especially when the data are streamed sequentially through internal memory, the block transfer size B in PDM should be considered to be a fixed hardware parameter a little larger than the track size (say, on the order of 100 KB for most disks), and the time per I/O should be adjusted accordingly. For online applications that use pointer-based indexes, a smaller B value such as 8 KB is appropriate, as in Figure 1. The particular block size that optimizes performance may vary somewhat from application to application.

PDM is a good generic programming model that facilitates elegant design of I/O-efficient algorithms, especially when used in conjunction with the programming tools discussed in Section 14. More complex and precise disk models, such as the ones by Ruemmler and Wilkes [1994], Ganger [1995], Shriver et al. [1998], Barve et al. [1999], and Farach et al. [1998], distinguish between sequential reads and random reads and consider the effects of features such as disk buffer caches and shared buses, which can reduce the time per I/O by eliminating or hiding the seek time. For example, algorithms for spatial join that access preexisting index structures (and thus do random I/O) can often be slower in practice than algorithms that access substantially more data but in a sequential order (as in streaming) [Arge et al. 2000]. It is thus helpful not only to consider the number of block transfers, but also to distinguish between the I/Os that are random versus those that are sequential. In some applications, automated dynamic block placement can improve disk locality and help reduce I/O time [Seltzer et al. 1995].

Another simplification of PDM is that the D block transfers in each I/O are synchronous; they are assumed to take the same amount of time. This assumption makes it easier to design and analyze algorithms for multiple disks. In practice, however, if the disks are used independently, some block transfers will complete more quickly than others. We can often improve overall elapsed time if the I/O is done asynchronously, so that disks get utilized as soon as they become available. Buffer space in internal memory can be used to queue the read and write requests for each disk.

2.3. Related Memory Models, Hierarchical Memory, and Caching

The study of problem complexity and algorithm analysis when using EM devices began more than 40 years ago with Demuth's PhD thesis on sorting [Demuth 1956; Knuth 1998]. In the early 1970s, Knuth [1998] did an extensive study of sorting using magnetic tapes and (to a lesser extent) magnetic disks. At about the same time, Floyd [1972] and Knuth [1998] considered a disk model akin to PDM for D = 1, P = 1, B = M/2 = Θ(N^c), for constant c > 0, and developed optimal upper and lower I/O bounds for sorting and matrix transposition. Hong and Kung [1981] developed a pebbling model of I/O for straightline computations, and Savage and Vitter [1987] extended the model to deal with block transfer. Aggarwal and Vitter [1988] generalized Floyd's I/O model to allow D simultaneous block transfers, but the model was unrealistic in that the D simultaneous transfers were allowed to take place on a single disk. They developed matching upper and lower I/O bounds for all parameter values for a host of problems. Since the PDM model can be thought of as a more restrictive (and more realistic) version of Aggarwal and Vitter's model, their lower bounds apply as well to PDM. In Section 5.3 we discuss a recent simulation technique due to Sanders et al. [2000]; the Aggarwal–Vitter model can be simulated probabilistically by PDM with only a constant factor more I/Os, thus making the two models theoretically equivalent in the randomized sense. Deterministic simulations on the other hand require a factor of log(N/D)/log log(N/D) more I/Os [Armen 1996].

Surveys of I/O models, algorithms, and challenges appear in Arge [1997], Gibson et al. [1996], and Shriver and Nodine [1996]. Several versions of PDM have been developed for parallel computation [Dehne et al. 1999; Li et al. 1996; Sibeyn and Kaufmann 1997]. Models of "active disks" augmented with processing capabilities to reduce data traffic to the host, especially during streaming applications, are given in Acharya et al. [1998] and Riedel et al. [1998]. Models of microelectromechanical systems (MEMS) for mass storage appear in Griffin et al. [2000].

Some authors have studied problems that can be solved efficiently by making only one pass (or a small number of passes) over the data [Feigenbaum et al. 1999; Henzinger et al. 1999]. One approach to reduce the internal memory requirements is to require only an approximate answer to the problem; the more memory available, the better the approximation. A related approach to reducing I/O costs for a given problem is to use random sampling or data compression in order to construct a smaller version of the problem whose solution approximates the original. These approaches are highly problem-dependent and somewhat orthogonal to our focus in this survey.

The same type of bottleneck that occurs between internal memory (DRAM) and external disk storage can also occur at other levels of the memory hierarchy, such as between registers and level 1 cache, between level 1 cache and level 2 cache, between level 2 cache and DRAM, and between disk storage and tertiary devices. The PDM model can be generalized to model the hierarchy of memories ranging from registers at the small end to tertiary storage at the large end. Optimal algorithms for PDM often generalize in a recursive fashion to yield optimal algorithms in the hierarchical memory models [Aggarwal et al. 1987a,b; Vitter and Shriver 1994b; Vitter and Nodine 1993]. Conversely, the algorithms for hierarchical models can be run in the PDM setting, and in that setting many have the interesting property that they use no explicit knowledge of the PDM parameters like M and B. Frigo et al. [1999] and Bender et al. [2000] develop cache-oblivious algorithms and data structures that require no knowledge of the storage parameters.

However, the match between theory and practice is harder to establish for hierarchical models and caches than for disks. The simpler hierarchical models are less accurate, and the more practical models are architecture-specific. The relative memory sizes and block sizes of the levels vary from computer to computer. Another issue is how blocks from one memory level are stored in the caches at a lower level. When a disk block is read into internal memory, it can be stored in any specified DRAM location. However, in level 1 and level 2 caches, each item can only be stored in certain cache locations, often determined by a hardware modulus computation on the item's memory address. The number of possible storage locations in cache for a given item is called the level of associativity. Some caches are direct-mapped (i.e., with associativity 1), and most caches have fairly low associativity (typically at most 4).
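
For readers less familiar with hardware caches, the standard set-indexing computation looks as follows (a generic textbook formula, not tied to any specific processor):

def cache_slots(address: int, line_size: int, num_sets: int, associativity: int):
    """Which cache locations may hold the block containing `address`?

    The block's set index is a modulus of its block number; within that set it
    may occupy any of `associativity` ways.  associativity == 1 is a
    direct-mapped cache: exactly one possible location.
    """
    block_number = address // line_size
    set_index = block_number % num_sets
    return [(set_index, way) for way in range(associativity)]

# A direct-mapped cache leaves no choice; a 4-way cache offers four locations.
print(len(cache_slots(0xABCDE0, line_size=64, num_sets=512, associativity=1)))   # 1
print(len(cache_slots(0xABCDE0, line_size=64, num_sets=128, associativity=4)))   # 4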

Another reason why the hierarchical models tend to be more architecture-specific is that the relative difference in speed between level 1 cache and level 2 cache or between level 2 cache and DRAM is orders of magnitude smaller than the relative difference in latencies between DRAM and the disks. Yet, it is apparent that good EM design principles are useful in developing cache-efficient algorithms. For example, sequential internal memory access is much faster than random access, by about a factor of 10, and the more we can build locality into an algorithm, the faster it will run in practice. By properly engineering the "inner loops," a programmer can often significantly speed up the overall running time. Tools such as simulation environments and system monitoring utilities [Knuth 1999; Rosenblum et al. 1997; Srivastava and Eustace 1994] can provide sophisticated help in the optimization process.

Table II. I/O Bounds for the Four Fundamental Operations. The PDM Parameters Are Defined in Section 2.1

Operation    I/O Bound, D = 1                             I/O Bound, General D ≥ 1
Scan(N)      Θ(N/B) = Θ(n)                                Θ(N/DB) = Θ(n/D)
Sort(N)      Θ((N/B) log_{M/B}(N/B)) = Θ(n log_m n)       Θ((N/DB) log_{M/B}(N/B)) = Θ((n/D) log_m n)
Search(N)    Θ(log_B N)                                   Θ(log_{DB} N)
Output(Z)    Θ(max{1, Z/B}) = Θ(max{1, z})                Θ(max{1, Z/DB}) = Θ(max{1, z/D})

For reasons of focus, we do not consider such hierarchical models and caching issues in this survey. We refer the reader to the following references. Aggarwal et al. [1987a] define an elegant hierarchical memory model, and Aggarwal et al. [1987b] augment it with block transfer capability. Alpern et al. [1994] model levels of memory in which the memory size, block size, and bandwidth grow at uniform rates. Vitter and Shriver [1994b] and Vitter and Nodine [1993] discuss parallel versions and variants of the hierarchical models. The parallel model of Li et al. [1996] also applies to hierarchical memory. Savage [1995] gives a hierarchical pebbling version of Savage and Vitter [1987]. Carter and Gatlin [1998] define pebbling models of nonassociative direct-mapped caches. Rahman and Raman [2000] and Sen and Chatterjee [2000] apply EM techniques to models of caches and translation lookaside buffers. Rao and Ross [1999; 2000] use B-tree techniques to exploit locality for the design of cache-conscious search trees.

3. FUNDAMENTAL I/O OPERATIONS AND BOUNDS

The I/O performance of many algorithms and data structures can be expressed in terms of the bounds for fundamental operations:

(1) scanning (a.k.a. streaming or touching) a file of N data items, which involves the sequential reading or writing of the items in the file;

(2) sorting a file of N data items, which puts the items into sorted order;

(3) searching online through N sorted data items; and

(4) outputting the Z answers to a query in a blocked "output-sensitive" fashion.

We give the I/O bounds for these operations in Table II. We single out the special case of a single disk (D = 1), since the formulas are simpler and many of the discussions in this survey are restricted to the single-disk case.

The first two of these I/O bounds, Scan(N) and Sort(N), apply to batched problems. The last two I/O bounds, Search(N) and Output(Z), apply to online problems and are typically combined into the form Search(N) + Output(Z). As mentioned in Section 2.1, some batched problems also involve queries, in which case the I/O bound Output(Z) may be relevant to them as well. In some pipelined contexts, the Z answers to a query do not need to be output to disk but rather can be "piped" to another process, in which case there is no I/O cost for output. Relational database queries are often processed in such a pipeline fashion. For simplicity, in this article we explicitly consider the output cost for queries.
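
For later back-of-the-envelope estimates, the four bounds of Table II can be written as small helper functions. The sketch below (our own notation) returns the leading term inside each Θ(·); constant factors are ignored.

from math import log

def scan_io(N, B, D=1):
    """Scan(N) = Theta(N / (D*B)) = Theta(n / D)."""
    return N / (D * B)

def sort_io(N, M, B, D=1):
    """Sort(N) = Theta((N / (D*B)) * log_{M/B}(N/B)) = Theta((n/D) * log_m n)."""
    n, m = N / B, M / B
    return (n / D) * (log(n) / log(m))

def search_io(N, B, D=1):
    """Search(N) = Theta(log_{D*B} N)."""
    return log(N) / log(D * B)

def output_io(Z, B, D=1):
    """Output(Z) = Theta(max{1, Z / (D*B)}) = Theta(max{1, z/D})."""
    return max(1.0, Z / (D * B))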

The I/O bound Scan(N) = O(n/D), which is clearly required to read or write a file of N items, represents a linear number of I/Os in the PDM model. An interesting feature of the PDM model is that almost all nontrivial batched problems require a nonlinear number of I/Os, even those that can be solved easily in linear CPU time in the (internal memory) RAM model. Examples we discuss later include permuting, transposing a matrix, list ranking, and several combinatorial graph problems. Many of these problems are equivalent in I/O complexity to permuting or sorting.

The linear I/O bounds for Scan(N) and Output(Z) are trivial. The algorithms and lower bounds for Sort(N) and Search(N) are relatively new and are discussed in later sections. As Table II indicates, the multiple-disk I/O bounds for Scan(N), Sort(N), and Output(Z) are D times smaller than the corresponding single-disk I/O bounds; such a speedup is clearly the best improvement possible with D disks. For Search(N), the speedup is less significant: the I/O bound Θ(log_B N) for D = 1 becomes Θ(log_{DB} N) for D ≥ 1; the resulting speedup is only Θ((log_B N)/(log_{DB} N)) = Θ((log DB)/(log B)) = Θ(1 + (log D)/(log B)), which is typically less than 2.

In practice, the logarithmic terms log_m n in the Sort(N) bound and log_{DB} N in the Search(N) bound are small constants. For example, in units of items, we could have N = 10^10, M = 10^7, and B = 10^4, and thus we get n = 10^6, m = 10^3, and log_m n = 2, in which case sorting can be done in a linear number of I/Os. If memory is shared with other processes, the log_m n term will be somewhat larger, but still bounded by a constant. In online applications, a smaller B value, such as B = 10^2, is more appropriate, as explained in Section 2.2. The corresponding value of log_B N for the example is 5, so even with a single disk, online search can be done in a small constant number of I/Os.

It still makes sense to explicitly identify terms like log_m n and log_B N in the I/O bounds and not hide them within the big-oh or big-theta factors, since the terms can have a significant effect in practice. (Of course, it is equally important to consider any other constants hidden in big-oh and big-theta notations!) The nonlinear I/O bound Θ(n log_m n) usually indicates that multiple or extra passes over the data are required. In truly massive problems, the data will reside on tertiary storage. As we suggested in Section 2.3, PDM algorithms can often be generalized in a recursive framework to handle multiple levels of memory. A multilevel algorithm developed from a PDM algorithm that does n I/Os will likely run at least an order of magnitude faster in hierarchical memory than would a multilevel algorithm generated from a PDM algorithm that does n log_m n I/Os [Vitter and Shriver 1994b].
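
Those constants are easy to verify; plugging the example values into the formulas (a two-line check, not from the original article) reproduces the numbers quoted above.

from math import log

N, M, B = 10**10, 10**7, 10**4          # batched setting from the example above
n, m = N // B, M // B                   # 10**6 input blocks, 10**3 memory blocks
print(n, m, log(n) / log(m))            # 1000000 1000 2.0  -> sorting in about 2 passes

B_online = 10**2                        # smaller block size for online applications
print(log(N) / log(B_online))           # ~5 -> Theta(log_B N) is a small constant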

4. EXPLOITING LOCALITY AND LOAD BALANCING

The key to achieving efficient I/O performance in EM applications is to design the application to access its data with a high degree of locality. Since each read I/O operation transfers a block of B items, we make optimal use of that read operation when all B items are needed by the application. A similar remark applies to write operations. An orthogonal form of locality more akin to load balancing arises when we use multiple disks, since we can transfer D blocks in a single I/O only if the D blocks reside on distinct disks.

An algorithm that does not exploit locality can be reasonably efficient when it is run on data sets that fit in internal memory, but it will perform miserably when deployed naively in an EM setting and virtual memory is used to handle page management. Examining such performance degradation is a good way to put the I/O bounds of Table II into perspective. In Section 4.1, we examine this phenomenon for the single-disk case, when D = 1.

In Section 4.2, we look at the multiple-disk case and discuss the important paradigm of disk striping [Kim 1986; Salem and Garcia-Molina 1986], for automatically converting a single-disk algorithm into an algorithm for multiple disks. Disk striping can be used to get optimal multiple-disk I/O algorithms for three of the four fundamental operations in Table II. The only exception is sorting. The optimal multiple-disk algorithms for sorting require more sophisticated load balancing techniques, which we cover in Section 5.

4.1. Locality Issues with a Single Disk

A good way to appreciate the fundamental I/O bounds in Table II is to consider what happens when an algorithm does not exploit locality. For simplicity, we restrict ourselves in this section to the single-disk case D = 1. For many of the batched problems we look at in this survey, such as sorting, FFT, triangulation, and computing convex hulls, it is well known how to write programs to solve the corresponding internal memory versions of the problems in O(N log N) CPU time. But if we execute such a program on a data set that does not fit in internal memory, relying upon virtual memory to handle page management, the resulting number of I/Os may be Ω(N log n), which represents a severe bottleneck. Similarly, in the online setting, many types of search queries, such as range search queries and stabbing queries, can be done using binary trees in O(log N + Z) query CPU time when the tree fits into internal memory, but the same data structure in an external memory setting may require Ω(log N + Z) I/Os per query.

We would like instead to incorporate locality directly into the algorithm design and achieve the desired I/O bounds of O(n log_m n) for the batched problems and O(log_B N + z) for online search, in line with the fundamental bounds listed in Table II. At the risk of oversimplifying, we can paraphrase the goal of EM algorithm design for batched problems in the following syntactic way: to derive efficient algorithms so that the N and Z terms in the I/O bounds of the naive algorithms are replaced by n and z, and so that the base of the logarithm terms is not 2 but instead m. For online problems, we want the base of the logarithm to be B and to replace Z by z. The relative speedup in I/O performance can be very significant, both theoretically and in practice. For example, for batched problems, the I/O performance improvement can be a factor of (N log n)/(n log_m n) = B log m, which is extremely large. For online problems, the performance improvement can be a factor of (log N + Z)/(log_B N + z); this value is always at least (log N)/(log_B N) = log B, which is significant in practice, and can be as much as Z/z = B for large Z.
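
The improvement factors can likewise be evaluated numerically. The snippet below (illustrative parameter values only) computes the batched factor B log m and the online factor log B for the same example parameters used in Section 3.

from math import log

N, M, B = 10**10, 10**7, 10**4
n, m = N // B, M // B

# Batched: Omega(N log n) I/Os naively versus O(n log_m n) with an EM algorithm.
naive_batched = N * log(n, 2)
em_batched = n * (log(n, 2) / log(m, 2))
print(naive_batched / em_batched)            # = B * log2(m), roughly 1e5 here

# Online: Omega(log N + Z) versus O(log_B N + z); with Z = 0 the ratio is log B.
print(log(N, 2) / (log(N, 2) / log(B, 2)))   # = log2(B), about 13.3 here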

4.2. Disk Striping for Multiple Disks

It is conceptually much simpler to program for the single-disk case (D = 1) than for the multiple-disk case (D ≥ 1). Disk striping [Kim 1986; Salem and Garcia-Molina 1986] is a practical paradigm that can ease the programming task with multiple disks: I/Os are permitted only on entire stripes, one stripe at a time. For example, in the data layout in Figure 4, data items 20 to 29 can be accessed in a single I/O step because their blocks are grouped into the same stripe. The net effect of striping is that the D disks behave as a single logical disk, but with a larger logical block size DB.

We can thus apply the paradigm of disk striping to automatically convert an algorithm designed to use a single disk with block size DB into an algorithm for use on D disks each with block size B: in the single-disk algorithm, each I/O step transmits one block of size DB; in the D-disk algorithm, each I/O step consists of D simultaneous block transfers of size B each. The number of I/O steps in both algorithms is the same; in each I/O step, the DB items transferred by the two algorithms are identical. Of course, in terms of wall clock time, the I/O step in the multiple-disk algorithm will be Θ(D) times faster than in the single-disk algorithm because of parallelism.
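
A minimal sketch of the conversion, assuming the D drives are modeled as plain byte arrays rather than a real disk API: one logical I/O on a block of size DB becomes D physical transfers of size B, one per disk, issued in the same I/O step.

def striped_read(disks, stripe_index, B):
    """Read one logical block of size D*B by reading block `stripe_index`
    of size B from every disk in the same parallel I/O step.

    `disks` is a list of D byte-arrays standing in for the drives; a real
    implementation would issue the D requests asynchronously and wait for all.
    """
    D = len(disks)
    logical_block = bytearray()
    for d in range(D):                       # the D transfers count as ONE I/O step
        start = stripe_index * B
        logical_block += disks[d][start:start + B]
    return bytes(logical_block)

# Striping makes D disks look like one disk with logical block size D*B,
# so any single-disk algorithm can be reused unchanged.
D, B = 4, 4
disks = [bytearray((d * 16 + i) % 256 for i in range(16)) for d in range(D)]
print(striped_read(disks, stripe_index=1, B=B).hex())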

Disk striping can be used to get optimal multiple-disk algorithms for three of the four fundamental operations of Section 3, namely streaming, online search, and output reporting, but it is nonoptimal for sorting. To see why, consider what happens if we use the technique of disk striping in conjunction with an optimal sorting algorithm for one disk, such as merge sort [Knuth 1998]. The optimal number of I/Os to sort using one disk with block size B is

Θ(n log_m n) = Θ(n (log n)/(log m)) = Θ((N/B) (log(N/B))/(log(M/B))).                (2)

With disk striping, the number of I/O steps is the same as if we use a block size of DB in the single-disk algorithm, which corresponds to replacing each B in (2) by DB, which gives the I/O bound

Θ((N/DB) (log(N/DB))/(log(M/DB))) = Θ((n/D) (log(n/D))/(log(m/D))).                (3)

On the other hand, the optimal bound for sorting is

Θ((n/D) log_m n) = Θ((n/D) (log n)/(log m)).                (4)

The striping I/O bound (3) is larger than the optimal sorting bound (4) by a multiplicative factor of

(log(n/D))/(log n) · (log m)/(log(m/D)) ≈ (log m)/(log(m/D)).                (5)

When D is on the order of m, the log(m/D) term in the denominator is small, and the resulting value of (5) is on the order of log m, which can be significant in practice.
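
The size of factor (5) is easy to tabulate. The snippet below (with an arbitrary choice of m) shows how the penalty grows as D approaches m.

from math import log

m = 1024                        # internal memory size, in blocks
for D in (2, 32, 256, 512):     # number of disks
    overhead = log(m) / log(m / D)
    print(f"D = {D:4d}: striping costs ~{overhead:.2f}x the optimal sort bound")
# As D approaches m, the denominator log(m/D) shrinks and the penalty
# approaches log m, which is why sorting needs independently controlled disks.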

It follows that the only way theoretically to attain the optimal sorting bound (4) is to forsake disk striping and to allow the disks to be controlled independently, so that each disk can access a different stripe in the same I/O step. Actually, the only requirement for attaining the optimal bound is that either reading or writing is done independently. It suffices, for example, to do only read operations independently and to use disk striping for write operations. An advantage of using striping for write operations is that it facilitates the writing of parity information for error correction and recovery, which is a big concern in RAID systems. (We refer the reader to Chen et al. [1994] and Hellerstein et al. [1994] for a discussion of RAID and error correction issues.)

In practice, sorting via disk striping can be more efficient than complicated techniques that utilize independent disks, especially when D is small, since the extra factor (log m)/log(m/D) of I/Os due to disk striping may be less than the algorithmic and system overhead of using the disks independently [Vengroff and Vitter 1996b]. In the next section we discuss algorithms for sorting with multiple independent disks. The techniques that arise can be applied to many of the batched problems addressed later in the article. Two such sorting algorithms, distribution sort with randomized cycling and simple randomized merge sort, have relatively low overhead and will outperform disk-striped approaches.

5. EXTERNAL SORTING AND RELATED PROBLEMS

The problem of external sorting (or sorting in external memory) is a central problem in the field of EM algorithms, partly because sorting and sorting-like operations account for a significant percentage of computer use [Knuth 1998], and also because sorting is an important paradigm in the design of efficient EM algorithms, as we show in Section 8. With some technical qualifications, many problems that can be solved easily in linear time in internal memory, such as permuting, list ranking, expression tree evaluation, and finding connected components in a sparse graph, require the same number of I/Os in PDM as does sorting.

THEOREM 5.1 [AGGARWAL AND VITTER 1988; NODINE AND VITTER 1995]. The average-case and worst-case number of I/Os required for sorting N = nB data items using D disks is

Sort(N) = Θ((n/D) log_m n).                (6)

We saw in Section 4.2 how to construct efficient sorting algorithms for multiple disks by applying the disk striping paradigm to an efficient single-disk algorithm. But in the case of sorting, the resulting multiple-disk algorithm does not meet the optimal Sort(N) bound of Theorem 5.1. In Sections 5.1 and 5.2, we discuss some recently developed external sorting algorithms that use disks independently. The algorithms are based upon the important distribution and merge paradigms, which are two generic approaches to sorting. The SRM method and its variants [Barve et al. 1997; Barve and Vitter 1999a; Sanders 2000], which are based upon a randomized merge technique, outperform disk striping in practice for reasonable values of D. All the algorithms use online load balancing strategies so that the data items accessed in an I/O operation are evenly distributed on the D disks. The same techniques can be applied to many of the batched problems we discuss later in this survey.

All the methods we cover for parallel disks, with the exception of Greed Sort in Section 5.2, provide efficient support for writing redundant parity information onto the disks for purposes of error correction and recovery. For example, some of the methods access the D disks independently during parallel read operations, but in a striped manner during parallel writes. As a result, if we write D − 1 blocks at a time, the exclusive-OR of the D − 1 blocks can be written onto the Dth disk during the same write operation.
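
The parity idea is the standard RAID-style exclusive-OR, sketched below for a single stripe (a simplification of what an actual system would do; the block contents here are made up):

def xor_blocks(blocks):
    """Byte-wise exclusive-OR of equal-length blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data_blocks = [bytes([d] * 4) for d in range(1, 5)]   # the D - 1 = 4 data blocks
parity = xor_blocks(data_blocks)                      # written to the D-th disk

# Any one lost block can be rebuilt from the surviving blocks plus the parity.
lost = data_blocks[2]
rebuilt = xor_blocks(data_blocks[:2] + data_blocks[3:] + [parity])
assert rebuilt == lost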

In Section 5.3, we show that, if we allow independent reads and writes, we can probabilistically simulate any algorithm written for the Aggarwal–Vitter model discussed in Section 2.3 by use of PDM with the same number of I/Os, up to a constant factor.

In Section 5.4, we consider the situation in which the items in the input file do not have unique keys. In Sections 5.5 and 5.6, we consider problems related to sorting, such as permuting, permutation networks, transposition, and fast Fourier transform. In Section 5.7, we give lower bounds for sorting and related problems.

5.1. Sorting by Distribution

Distribution sort [Knuth 1998] is a recursive process in which we use a set of S − 1 partitioning elements to partition the items into S disjoint buckets. All the items in one bucket precede all the items in the next bucket. We complete the sort by recursively sorting the individual buckets and concatenating them to form a single fully sorted list.

One requirement is that we choose the S − 1 partitioning elements so that the buckets are of roughly equal size. When that is the case, the bucket sizes decrease from one level of recursion to the next by a relative factor of Θ(S), and thus there are O(log_S n) levels of recursion. During each level of recursion, we scan the data. As the items stream through internal memory, they are partitioned into S buckets in an online manner. When a buffer of size B fills for one of the buckets, its block is written to the disks in the next I/O, and another buffer is used to store the next set of incoming items for the bucket. Therefore, the maximum number of buckets (and partitioning elements) is S = Θ(M/B) = Θ(m), and the resulting number of levels of recursion is Θ(log_m n).
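
The recursion can be sketched as a small in-memory toy that mimics only the control structure; actual EM code would stream blocks of size B through per-bucket buffers rather than build Python lists, and the pivot selection here is a naive random sample rather than the methods cited in the next paragraph.

import random
from bisect import bisect_right

def distribution_sort(items, M, B):
    """Toy distribution sort mirroring Section 5.1: pick S - 1 partitioning
    elements, stream the items into S buckets, recurse, and concatenate."""
    if len(items) <= M:                     # bucket now fits in internal memory
        return sorted(items)
    S = max(2, M // B)                      # at most Theta(m) buckets
    sample = sorted(random.sample(items, min(len(items), 4 * S)))
    pivots = [sample[i * len(sample) // S] for i in range(1, S)]

    buckets = [[] for _ in range(S)]
    for x in items:                         # one streaming pass over the data
        buckets[bisect_right(pivots, x)].append(x)
    if max(len(b) for b in buckets) == len(items):
        return sorted(items)                # toy-code guard for degenerate inputs

    # Theta(log_m n) levels of recursion when the buckets stay balanced.
    return [y for b in buckets for y in distribution_sort(b, M, B)]

data = [random.randrange(10**6) for _ in range(5000)]
assert distribution_sort(data, M=512, B=16) == sorted(data)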

It seems difficult to find S = Θ(m) partitioning elements using Θ(n/D) I/Os and guarantee that the bucket sizes are within a constant factor of one another. Efficient deterministic methods exist for choosing S = √m partitioning elements [Aggarwal and Vitter 1988; Nodine and Vitter 1993; Vitter and Shriver 1994a], which has the effect of doubling the number of levels of recursion. Probabilistic methods based upon random sampling can be found in Feller [1968]. A deterministic algorithm for the related problem of (exact) selection (i.e., given k, find the kth item in the file in sorted order) appears in Sibeyn [1999].

In order to meet the sorting bound (6), we must form the buckets at each level of recursion using O(n/D) I/Os, which is easy to do for the single-disk case. In the more general multiple-disk case, each read step and each write step during the bucket formation must involve on the average Θ(D) blocks. The file of items being partitioned is itself one of the buckets formed in the previous level of recursion. In order to read that file efficiently, its blocks must be spread uniformly among the disks, so that no one disk is a bottleneck. The challenge in distribution sort is to write the blocks of the buckets to the disks in an online manner and achieve a global load balance by the end of the partitioning, so that the bucket can be read efficiently during the next level of the recursion.

Partial striping is an effective technique for reducing the amount of information that must be stored in internal memory in order to manage the disks. The disks are grouped into clusters of size C and data are written in "logical blocks" of size CB, one per cluster. Choosing C = √D won't change the optimal sorting time by more than a constant factor, but as pointed out in Section 4.2, full striping (in which C = D) can be nonoptimal.

Vitter and Shriver [1994a] develop two randomized online techniques for the partitioning so that with high probability each bucket will be well balanced across the D disks. In addition, they use partial striping in order to fit in internal memory the pointers needed to keep track of the layouts of the buckets on the disks. Their first partitioning technique applies when the size N of the file to partition is sufficiently large or when M/DB = Ω(log D), so that the number Θ(n/S) of blocks in each bucket is Ω(D log D). Each parallel write operation writes its D blocks in independent random order to a disk stripe, with all D! orders equally likely. At the end of the partitioning, with high probability each bucket is evenly distributed among the disks. This situation is intuitively analogous to the classical occupancy problem, in which b balls are inserted independently and uniformly at random into d bins. It is well known that if the load factor b/d grows asymptotically faster than log d, the most densely populated bin contains b/d balls asymptotically on the average, which corresponds to an even distribution. However, if the load factor b/d is 1, the largest bin contains (ln d)/ln ln d balls, whereas an average bin contains only one ball [Vitter and Flajolet 1990]. Intuitively, the blocks in a bucket act as balls and the disks act as bins. In our case, the parameters correspond to b = Ω(d log d), which suggests that the blocks in the bucket should be evenly distributed among the disks.

By further analogy to the occupancy problem, if the number of blocks per bucket is not Ω(D log D), then the technique breaks down and the distribution of each bucket among the disks tends to be uneven, causing a bottleneck for I/O operations. For these smaller values of N, Vitter and Shriver [1994a] use their second partitioning technique: the file is read in one pass, one memoryload at a time. Each memoryload is independently and randomly permuted and written back to the disks in the new order. In a second pass, the file is accessed one memoryload at a time in a "diagonally striped" manner. Vitter and Shriver [1994a] show that with very high probability each individual "diagonal stripe" contributes about the same number of items to each bucket, so the blocks of the buckets in each memoryload can be assigned to the disks in a balanced round-robin manner using an optimal number of I/Os.

DeWitt et al. [1991] present a randomized distribution sort algorithm in a similar model to handle the case when sorting can be done in two passes. They use a sampling technique to find the partitioning elements and route the items in each bucket to a particular processor. The buckets are sorted individually in the second pass.

An even better way to do distribution sort, and deterministically at that, is the BalanceSort method developed by Nodine and Vitter [1993]. During the partitioning process, the algorithm keeps track of how evenly each bucket has been distributed so far among the disks. It maintains an invariant that guarantees good distribution across the disks for each bucket. For each bucket 1 ≤ b ≤ S and disk 1 ≤ d ≤ D, let num_b be the total number of items in bucket b processed so far during the partitioning and let num_b(d) be the number of those items written to disk d; that is, num_b = Σ_{1≤d≤D} num_b(d). By application of matching techniques from graph theory, the BalanceSort algorithm is guaranteed to write at least half of any given memoryload to the disks in a blocked manner and still maintain the invariant for each bucket b that the ⌊D/2⌋ largest values among num_b(1), num_b(2), . . . , num_b(D) differ by at most 1. As a result, each num_b(d) is at most about twice the ideal value num_b/D, which implies that the number of I/Os needed to read a bucket into memory during the next level of recursion will be within a small constant factor of optimal.

The distribution sort methods that we mentioned above for parallel disks perform write operations in complete stripes, which makes it easy to write parity information for use in error correction and recovery. But since the blocks written in each stripe typically belong to multiple buckets, the buckets themselves will not be striped on the disks, and we must use the disks independently during read operations. In the write phase, each bucket must therefore keep track of the last block written to each disk so that the blocks for the bucket can be linked together.

An orthogonal approach is to stripe the contents of each bucket across the disks so that read operations can be done in a striped manner. As a result, the write operations must use disks independently, since during each write, multiple buckets will be writing to multiple stripes. Error correction and recovery can still be handled efficiently by devoting to each bucket one block-sized buffer in internal memory. The buffer is continuously updated to contain the exclusive-OR (parity) of the blocks written to the current stripe, and after D − 1 blocks have been written, the parity information in the buffer can be written to the final (Dth) block in the stripe.

Under this new scenario, the basic loop of the distribution sort algorithm is, as before, to read one memoryload at a time and partition the items into S buckets. However, unlike before, the blocks for each individual bucket will reside on the disks in contiguous stripes. Each block therefore has a predefined place where it must be written. If we choose the normal round-robin ordering for the stripes (namely, 1, 2, 3, . . . , D, 1, 2, 3, . . . , D, . . .), the blocks of different buckets may "collide," meaning that they need to be written to the same disk, and subsequent blocks in those same buckets will also tend to collide. Vitter and Hutchinson [2001] solve this problem by the technique of randomized cycling. For each of the S buckets, they determine the ordering of the disks in the stripe for that bucket via a random permutation of {1, 2, . . . , D}. The S random permutations are chosen independently. If two blocks (from different buckets) happen to collide during a write to the same disk, one block is written to the disk and the other is kept on a write queue. With high probability, subsequent blocks in those two buckets will be written to different disks and thus will not collide. As long as there is a small pool of available buffer space to temporarily cache the blocks in the write queues, Vitter and Hutchinson show that with high probability the writing proceeds optimally.
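The following toy simulation sketches the randomized cycling idea under simplifying assumptions (every block takes one write step, and the write queues are unbounded); the per-bucket permutations and the queueing of colliding blocks follow the description above, but the data structures and function name are hypothetical.

    import random
    from collections import deque

    def randomized_cycling_writes(blocks, S, D, seed=0):
        # `blocks` is the stream of (bucket, block_id) pairs produced by the
        # partitioning.  Each bucket writes its blocks to disks in the order
        # given by its own random permutation of {0, ..., D-1}; blocks that
        # collide on a disk wait in that disk's write queue.  Returns the
        # number of parallel write steps used to drain all queues.
        rng = random.Random(seed)
        perm = [rng.sample(range(D), D) for _ in range(S)]  # one permutation per bucket
        pos = [0] * S                                       # position in each bucket's cycle
        queues = [deque() for _ in range(D)]
        for bucket, block_id in blocks:
            disk = perm[bucket][pos[bucket] % D]
            pos[bucket] += 1
            queues[disk].append((bucket, block_id))
        steps = 0
        while any(queues):
            for q in queues:                # one parallel I/O: at most one block per disk
                if q:
                    q.popleft()
            steps += 1
        return steps

    stream = [(b, i) for i in range(20) for b in range(4)]
    print(randomized_cycling_writes(stream, S=4, D=5))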

We expect that the randomized cycling method or the related merge sort methods discussed at the end of Section 5.2 will be the methods of choice for sorting with parallel disks. Experiments are under way to evaluate their relative performance. Distribution sort algorithms may have an advantage over the merge approaches presented in Section 5.2 in that they typically make better use of lower levels of cache in the memory hierarchy of real systems, based upon analysis of distribution sort and merge sort algorithms on models of hierarchical memory, such as the RUMH model of Vitter and Nodine [1993].

5.2. Sorting by Merging

The merge paradigm is somewhat orthogonal to the distribution paradigm of the previous section. A typical merge sort algorithm works as follows [Knuth 1998]. In the "run formation" phase, we scan the n blocks of data, one memoryload at a time; we sort each memoryload into a single "run," which we then output onto a series of stripes on the disks. At the end of the run formation phase, there are N/M = n/m (sorted) runs, each striped across the disks. (In actual implementations, we can use the "replacement-selection" technique to get runs of 2M data items, on the average, when M ≫ B [Knuth 1998].) After the initial runs are formed, the merging phase begins. In each pass of the merging phase, we merge groups of R runs. For each merge, we scan the R runs and merge the items in an online manner as they stream through internal memory. Double buffering is used to overlap I/O and computation. At most R = Θ(m) runs can be merged at a time, and the resulting number of passes is O(log_m n).
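A minimal single-disk rendition of this two-phase pattern is sketched below, with in-memory lists standing in for runs striped on the disks; memory_capacity plays the role of M and R is the merge arity. It illustrates the structure only, not the double buffering or parallel-disk issues discussed next.

    import heapq

    def external_merge_sort(items, memory_capacity, R):
        # Run formation: sort one memoryload at a time into a run.
        runs = [sorted(items[i:i + memory_capacity])
                for i in range(0, len(items), memory_capacity)]
        # Merging: repeatedly merge groups of R runs until one run remains.
        while len(runs) > 1:
            merged = []
            for i in range(0, len(runs), R):
                group = runs[i:i + R]
                merged.append(list(heapq.merge(*group)))  # streaming R-way merge
            runs = merged
        return runs[0] if runs else []

    print(external_merge_sort([9, 3, 7, 1, 8, 2, 6, 5, 4, 0],
                              memory_capacity=3, R=2))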

To achieve the optimal sorting bound (6), we must perform each merging pass in O(n/D) I/Os, which is easy to do for the single-disk case. In the more general multiple-disk case, each parallel read operation during the merging must on average bring in the next Θ(D) blocks needed for the merging. The challenge is to ensure that those blocks reside on different disks so that they can be read in a single I/O (or a small constant number of I/Os). The difficulty lies in the fact that the runs being merged were themselves formed during the previous merge pass. Their blocks were written to the disks in the previous pass without knowledge of how they would interact with other runs in later merges.

For the binary merging case R = 2 we can devise a perfect solution, in which the next D blocks needed for the merge are guaranteed to be on distinct disks, based upon the Gilbreath principle [Gardner 1977; Knuth 1998]: we stripe the first run into ascending order by disk number, and we stripe the other run into descending order. Regardless of how the items in the two runs interleave during the merge, it is always the case that we can access the next D blocks needed for the output via a single I/O operation, and thus the amount of internal memory buffer space needed for binary merging is minimized. Unfortunately there is no analogue to the Gilbreath principle for R > 2, and as we have seen above, we need the value of R to be large in order to get an optimal sorting algorithm.

The Greed Sort method of Nodine and Vitter [1995] was the first optimal deterministic EM algorithm for sorting with multiple disks. It handles the case R > 2 by relaxing the condition on the merging process. In each step, two blocks from each disk are brought into internal memory: the block b_1 with the smallest data item value and the block b_2 whose largest item value is smallest. If b_1 = b_2, only one block is read into memory, and it is added to the next output stripe. Otherwise, the two blocks b_1 and b_2 are merged in memory; the smaller B items are written to the output stripe, and the remaining B items are written back to the disk. The resulting run that is produced is only an "approximately" merged run, but its saving grace is that no two inverted items are too far apart. A final application of Columnsort [Leighton 1985] suffices to restore total order; partial striping is employed to meet the memory constraints. One disadvantage of Greed Sort is that the block writes and block reads involve independent disks and are not done in a striped manner, thus making it difficult to write parity information for error correction and recovery.

Aggarwal and Plaxton [1994] developed an optimal deterministic merge sort based upon the Sharesort hypercube parallel sorting algorithm [Cypher and Plaxton 1993]. To guarantee even distribution during the merging, it employs two high-level merging schemes in which the scheduling is almost oblivious. Like Greed Sort, the Sharesort algorithm is theoretically optimal (i.e., within a constant factor of optimal), but the constant factor is larger than that of the distribution sort methods.

One of the most practical methods for sorting is based upon the simple randomized merge sort (SRM) algorithm of Barve et al. [1997] and Barve and Vitter [1999a], referred to as "randomized striping" by Knuth [1998]. Each run is striped across the disks, but with a random starting point (the only place in the algorithm where randomness is utilized). During the merging process, the next block needed from each disk is read into memory, and if there is not enough room, the least needed blocks are "flushed" (without any I/Os required) to free up space. Barve et al. [1997] derive an asymptotic upper bound on the expected I/O performance, with no assumptions on the input distribution. A more precise analysis, which is related to the so-called cyclic occupancy problem, is an interesting open problem. The cyclic occupancy problem is similar to the classical occupancy problem we discussed in Section 5.1 in that there are b balls distributed into d bins. However, in the cyclic occupancy problem, the b balls are grouped into c chains of length b_1, b_2, . . . , b_c, where Σ_{1≤i≤c} b_i = b. Only the head of each chain is randomly inserted into a bin; the remaining balls of the chain are inserted into the successive bins in a cyclic manner (hence the name "cyclic occupancy"). It is conjectured that the expected maximum bin size in the cyclic occupancy problem is at most that of the classical occupancy problem [Barve et al. 1997; Knuth 1998, problem 5.4.9–28]. The bound has been established so far only in an asymptotic sense.

Table III. Ratio of the Number of I/Os Used by Simple Randomized Merge Sort (SRM) to the Number of I/Os Used by Merge Sort with Disk Striping, During a Merge of kD Runs, Where kD ≈ M/2B. The Figures Were Obtained by Simulation.

             D = 5    D = 10    D = 50
    k = 5     0.56     0.47      0.37
    k = 10    0.61     0.52      0.40
    k = 50    0.71     0.63      0.51

The expected performance of SRM is not optimal for some parameter values, but it significantly outperforms the use of disk striping for reasonable values of the parameters, as shown in Table III. Experimental confirmation of the speedup was obtained on a 500 megahertz CPU with six fast disk drives, as reported by Barve and Vitter [1999a].

We can get further improvements in merge sort by a more careful prefetching schedule for the runs. Barve et al. [2000] and Kallahalla and Varman [1999, 2000] have developed competitive and optimal methods for prefetching blocks in parallel I/O systems. Hutchinson et al. [2001a, 2001b] have demonstrated a powerful duality between parallel writing and parallel prefetching, which gives an easy way to compute optimal prefetching and caching schedules for multiple disks. More significantly, they show that the same duality exists between distribution and merging, which they exploit to get a provably optimal and very practical parallel disk merge sort. Rather than use random starting points and round-robin stripes as in SRM, Hutchinson et al. [2001a, 2001b] order the stripes for each run independently, based upon the randomized cycling strategy discussed in Section 5.1 for distribution sort.

5.3. A General Simulation

Sanders et al. [2000] and Sanders [2001] give an elegant randomized technique to simulate the Aggarwal–Vitter model of Section 2.3, in which D simultaneous block transfers are allowed regardless of where the blocks are located on the disks. On the average, the simulation realizes each I/O in the Aggarwal–Vitter model by only a constant number of I/Os in PDM. One property of the technique is that the read and write steps use the disks independently. Armen [1996] had earlier shown that deterministic simulations resulted in an increase in the number of I/Os by a multiplicative factor of log(N/D)/log log(N/D).

The technique of Sanders et al. [2000] consists of duplicating each disk block and storing the two copies on two independently and uniformly chosen disks (chosen by a hash function). In terms of the occupancy model, each ball (block) is duplicated and stored in two random bins (disks). Let us consider the problem of retrieving a specific set of D blocks from the disks. For each block, there is a choice of two disks from which it can be read.


Regardless of which D blocks are requested, Sanders et al. [2000] show how to choose the copies that permit the D blocks to be retrieved with high probability in only two parallel I/Os. A natural application of this technique is to the layout of data on multimedia servers in order to support multiple stream requests, as in video-on-demand.

When writing blocks of data to the disks, each block must be written to both the disks where a copy is stored. Sanders et al. [2000] prove that with high probability D blocks can be written in O(1) I/O steps, assuming that there are O(D) blocks of internal buffer space to serve as write queues. The read and write bounds can be improved with a corresponding trade-off in redundancy and internal memory space.
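To convey the flavor of the duplication technique, the sketch below hashes each block to two candidate disks and then assigns a batch of read requests to copies so as to balance the per-disk load. The greedy copy selection here is only a stand-in for the matching-based argument of Sanders et al.; it is illustrative, not their algorithm, and the helper names are my own.

    import hashlib

    def two_disks(block_id, D):
        # Pseudo-random pair of distinct disks holding the two copies of a block.
        h = int(hashlib.sha256(str(block_id).encode()).hexdigest(), 16)
        d1 = h % D
        d2 = (d1 + 1 + (h // D) % (D - 1)) % D   # guaranteed different from d1
        return d1, d2

    def assign_reads(requested_blocks, D):
        # Greedy copy selection: read each requested block from whichever of
        # its two disks currently has the lighter read load.
        load = [0] * D
        plan = {}
        for b in requested_blocks:
            d1, d2 = two_disks(b, D)
            d = d1 if load[d1] <= load[d2] else d2
            plan[b] = d
            load[d] += 1
        return plan, max(load)

    plan, worst_disk_load = assign_reads(range(100, 110), D=10)
    print(plan, worst_disk_load)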

5.4. Handling Duplicates

Arge et al. [1993] describe a single-disk merge sort algorithm for the problem of duplicate removal, in which there are a total of K distinct items among the N items. It runs in O(n max{1, log_m(K/B)}) I/Os, which is optimal in the comparison model. The algorithm can be used to sort the file, assuming that a group of equal items can be represented by a single item and a count.
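The key idea of representing a group of equal items by a single item and a count can be sketched as follows, with in-memory lists standing in for sorted runs on disk:

    import heapq

    def merge_with_counts(runs):
        # Merge sorted runs of (key, count) pairs, combining equal keys so
        # that at most K distinct keys ever appear in the output run.
        out = []
        for key, count in heapq.merge(*runs):
            if out and out[-1][0] == key:
                out[-1] = (key, out[-1][1] + count)   # absorb the duplicate
            else:
                out.append((key, count))
        return out

    runs = [[(1, 2), (3, 1), (5, 1)], [(1, 1), (3, 4), (7, 1)]]
    print(merge_with_counts(runs))   # [(1, 3), (3, 5), (5, 1), (7, 1)]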

A harder instance of sorting called bundle sorting arises when we have K distinct key values among the N items, but all the items have different secondary information. Abello et al. [1998] and Matias et al. [2000] develop optimal distribution sort algorithms for bundle sorting using BundleSort(N, K) = O(n max{1, log_m min{K, n}}) I/Os, and Matias et al. [2000] prove the matching lower bound. Matias et al. [2000] also show how to do bundle sorting (and sorting in general) in place (i.e., without extra disk space). In distribution sort, for example, the blocks for the subfiles can be allocated from the blocks freed up from the file being partitioned; the disadvantage is that the blocks in the individual subfiles are no longer consecutive on the disk. The algorithms can be adapted to run on D disks with a speedup of O(D) using the techniques described in Sections 5.1 and 5.2.

5.5. Permuting and Transposition

Permuting is the special case of sorting in which the key values of the N data items form a permutation of {1, 2, . . . , N}.

THEOREM 5.2 [AGGARWAL AND VITTER 1988]. The average-case and worst-case number of I/Os required for permuting N data items using D disks is

Θ(min{N/D, Sort(N)}). (7)

The I/O bound (7) for permuting can be realized by using one of the sorting algorithms from Section 5 except in the extreme case B log m = o(log n), in which case it is faster to move the data items one by one in a nonblocked way. The one-by-one method is trivial if D = 1, but with multiple disks there may be bottlenecks on individual disks; one solution for doing the permuting in O(N/D) I/Os is to apply the randomized balancing strategies of Vitter and Shriver [1994a].

Matrix transposition is the special case of permuting in which the permutation can be represented as a transposition of a matrix from row-major order into column-major order.

THEOREM 5.3 [AGGARWAL AND VITTER 1988]. With D disks, the number of I/Os required to transpose a p × q matrix from row-major order to column-major order is

Θ((n/D) log_m min{M, p, q, n}), (8)

where N = pq and n = N/B.

When B is relatively large (say, ½M) and N is O(M^2), matrix transposition can be as hard as general sorting, but for smaller B, the special structure of the transposition permutation makes transposition easier. In particular, the matrix can be broken up into square submatrices of B^2 elements such that each submatrix contains B blocks of the matrix in row-major order and also B blocks of the matrix in column-major order. Thus, if B^2 < M, the transpositions can be done in a simple one-pass operation by transposing the submatrices one at a time in internal memory.
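Assuming B^2 < M, the one-pass scheme can be sketched as follows; the nested loops visit one B × B tile at a time, so each tile is transposed entirely in internal memory. The function is a generic tiled transpose, written here only to illustrate the structure.

    def blocked_transpose(A, B):
        # Transpose matrix A (a list of rows) one B x B submatrix at a time.
        # Each submatrix touches only B blocks of the row-major input and
        # B blocks of the column-major output, so it fits in memory if B*B <= M.
        p, q = len(A), len(A[0])
        T = [[None] * p for _ in range(q)]
        for i0 in range(0, p, B):
            for j0 in range(0, q, B):
                for i in range(i0, min(i0 + B, p)):     # transpose one tile in memory
                    for j in range(j0, min(j0 + B, q)):
                        T[j][i] = A[i][j]
        return T

    print(blocked_transpose([[1, 2, 3], [4, 5, 6]], B=2))   # [[1, 4], [2, 5], [3, 6]]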

Matrix transposition is a special case of a more general class of permutations called bit-permute/complement (BPC) permutations, which in turn is a subset of the class of bit-matrix-multiply/complement (BMMC) permutations. BMMC permutations are defined by a log N × log N nonsingular 0–1 matrix A and a (log N)-length 0–1 vector c. An item with binary address x is mapped by the permutation to the binary address given by Ax ⊕ c, where ⊕ denotes bitwise exclusive-OR. BPC permutations are the special case of BMMC permutations in which A is a permutation matrix; that is, each row and each column of A contain a single 1. BPC permutations include matrix transposition, bit-reversal permutations (which arise in the FFT), vector-reversal permutations, hypercube permutations, and matrix reblocking. Cormen et al. [1999] characterize the optimal number of I/Os needed to perform any given BMMC permutation solely as a function of the associated matrix A, and they give an optimal algorithm for implementing it.
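To make the definition concrete, the sketch below evaluates the target address Ax ⊕ c over GF(2); the encoding (rows of A as lists of 0/1 with bit 0 the least significant) is my own choice for illustration.

    def bmmc_target(x, A, c, bits):
        # Map source address x to A*x XOR c over GF(2).  A[i][j] multiplies
        # bit j of x into bit i of the result; c is a 0/1 vector of length `bits`.
        xbits = [(x >> j) & 1 for j in range(bits)]
        y = 0
        for i in range(bits):
            bit = sum(A[i][j] & xbits[j] for j in range(bits)) % 2
            y |= (bit ^ c[i]) << i
        return y

    # A 3-bit bit-reversal permutation (a BPC permutation): A is the reversal
    # permutation matrix and c is the zero vector.
    A = [[0, 0, 1],
         [0, 1, 0],
         [1, 0, 0]]
    c = [0, 0, 0]
    print([bmmc_target(x, A, c, 3) for x in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]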

THEOREM 5.4 [CORMEN ET AL. 1999]. With D disks, the number of I/Os required to perform the BMMC permutation defined by matrix A and vector c is

Θ((n/D)(1 + rank(γ)/log m)), (9)

where γ is the lower-left log n × log B submatrix of A.

An interesting theoretical question is to determine the I/O cost for each individual permutation, as a function of some simple characterization of the permutation, such as number of inversions.

5.6. Fast Fourier Transform and Permutation Networks

Computing the fast Fourier transform (FFT) in external memory consists of a series of I/Os that permit each computation implied by the FFT directed graph (or butterfly) to be done while its arguments are in internal memory. A permutation network computation consists of an oblivious (fixed) pattern of I/Os such that any of the N! possible permutations can be realized; data items can only be reordered when they are in internal memory. A permutation network can be realized by a series of three FFTs [Wu and Feng 1981].

THEOREM 5.5. With D disks, the number of I/Os required for computing the N-input FFT digraph or an N-input permutation network is Sort(N).

Cormen and Nicol [1988] give some practical implementations for one-dimensional FFTs based upon the optimal PDM algorithm of Vitter and Shriver [1994a]. The algorithms for FFT are faster and simpler than for sorting because the computation is nonadaptive in nature, and thus the communication pattern is fixed in advance.

5.7. Lower Bounds on I/O

In this section we prove the lower bounds from Theorems 5.1 to 5.5 and mention some related I/O lower bounds for the batched problems in computational geometry and graphs that we cover later in Sections 7 and 8.

The most trivial batched problem is that of scanning (a.k.a. streaming or touching) a file of N data items, which can be done in a linear number O(N/DB) = O(n/D) of I/Os. Permuting is one of several simple problems that can be done in linear CPU time in the (internal memory) RAM model, but require a nonlinear number of I/Os in PDM because of the locality constraints imposed by the block parameter B.

The following proof of the permutation lower bound (7) of Theorem 5.2 is due to Aggarwal and Vitter [1988]. The idea of the proof is to calculate, for each t ≥ 0, the number of distinct orderings that are realizable by sequences of t I/Os. The value of t for which the number of distinct orderings first exceeds N!/2 is a lower bound on the average number of I/Os (and hence the worst-case number of I/Os) needed for permuting.

We assume for the moment that there is only one disk, D = 1. Let us consider how the number of realizable orderings can change as a result of an I/O. In terms of increasing the number of realizable orderings, the effect of reading a disk block is considerably more than that of writing a disk block, so it suffices to consider only the effect of read operations. During a read operation, there are at most B data items in the read block, and they can be interspersed among the M items in internal memory in at most (M choose B) ways, so the number of realizable orderings increases by a factor of (M choose B). If the block has never before resided in internal memory, the number of realizable orderings increases by an extra B! factor, since the items in the block can be permuted among themselves. (This extra contribution of B! can only happen once for each of the N/B original blocks.) There are at most n + t ≤ N log N ways to choose which disk block is involved in the tth I/O. (We allow the algorithm to use an arbitrary amount of disk space.) Hence, the number of distinct orderings that can be realized by all possible sequences of t I/Os is at most

(B!)^{N/B} ( N (log N) (M choose B) )^t. (10)

Setting the expression in (10) to be at least N!/2, and simplifying by taking the logarithm, we get

N log B + t (log N + B log(M/B)) = Ω(N log N). (11)

Solving for t, we get the matching lower bound Ω(n log_m n) for permuting for the case D = 1. The general lower bound (7) of Theorem 5.2 follows by dividing by D.
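For completeness, the algebra behind that last step can be written out as follows (a sketch in LaTeX notation; the final simplification uses the typical-case assumption B log m = Ω(log N) discussed below and the identity n = N/B):

    t = \Omega\left( \frac{N \log N - N \log B}{\log N + B \log(M/B)} \right)
      = \Omega\left( \frac{N \log(N/B)}{\log N + B \log m} \right)
      = \Omega\left( \frac{N \log(N/B)}{B \log m} \right)
      = \Omega(n \log_m n).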

We get a stronger lower bound from a more refined argument that counts input operations separately from output operations [Hutchinson et al. 2001c]. For the typical case in which B log m = ω(log N), the I/O lower bound, up to lower-order terms, is 2n log_m n. For the pathological case in which B log m = o(log N), the I/O lower bound, up to lower-order terms, is N/D.

Permuting is a special case of sorting, and hence, the permuting lower bound applies also to sorting. In the unlikely case that B log m = o(log n), the permuting bound is only Ω(N/D), and we must resort to the comparison model to get the full lower bound (6) of Theorem 5.1 [Aggarwal and Vitter 1988]. In the typical case in which B log m = Ω(log n), the comparison model is not needed to prove the sorting lower bound; the difficulty of sorting in that case arises not from determining the order of the data but from permuting (or routing) the data.

The proof used above for permuting also works for permutation networks, in which the communication pattern is oblivious (fixed). Since the choice of disk block is fixed for each t, there is no N log N term as there is in (10), and correspondingly there is no additive log N term in the inner expression as there is in (11). Hence, when we solve for t, we get the lower bound (6) rather than (7). The lower bound follows directly from the counting argument; unlike the sorting derivation, it does not require the comparison model for the case B log m = o(log n). The lower bound also applies directly to FFT, since permutation networks can be formed from three FFTs in sequence. The transposition lower bound involves a potential argument based upon a togetherness relation [Aggarwal and Vitter 1988].

Arge et al. [1993] show for the comparison model that any problem with an Ω(N log N) lower bound in the (internal memory) RAM model requires Ω(n log_m n) I/Os in PDM for a single disk. Their argument leads to a matching lower bound of Ω(n max{1, log_m(K/B)}) I/Os in the comparison model for duplicate removal with one disk.

For the problem of bundle sorting, in which the N items have a total of K distinct key values (but the secondary information of each item is different), Matias et al. [2000] derive the matching lower bound BundleSort(N, K) = Ω(n max{1, log_m min{K, n}}). The proof consists of the following parts. The first part is a simple proof of the same lower bound as for duplicate removal, but without resorting to the comparison model (except for the pathological case B log m = o(log n)). It suffices to set (10) to be at least N!/((N/K)!)^K, which is the maximum number of permutations of N numbers having K distinct values. Solving for t gives the lower bound Ω(n max{1, log_m(K/B)}), which is equal to the desired lower bound for BundleSort(N, K) when K = B^{1+Ω(1)} or M = B^{1+Ω(1)}. Matias et al. [2000] derive the remaining case of the lower bound for BundleSort(N, K) by a potential argument based upon the transposition lower bound. Dividing by D gives the lower bound for D disks.

Chiang et al. [1995], Arge [1995b], Arge and Miltersen [1999], and Munagala and Ranade [1999] give models and lower bound reductions for several computational geometry and graph problems. The geometry problems discussed in Section 7 are equivalent to sorting in both the internal memory and PDM models. Problems such as list ranking and expression tree evaluation have the same nonlinear I/O lower bound as permuting. Other problems such as connected components, biconnected components, and minimum spanning forests of sparse graphs with E edges and V vertices require as many I/Os as E/V instances of permuting V items. This situation is in contrast with the (internal memory) RAM model, in which the same problems can all be done in linear CPU time. (The known linear-time RAM algorithm for finding a minimum spanning tree is randomized.) In some cases there is a gap between the best-known upper and lower bounds, which we examine further in Section 8.

The lower bounds mentioned above assume that the data items are in some sense "indivisible," in that they are not split up and reassembled in some magic way to get the desired output. It is conjectured that the sorting lower bound (6) remains valid even if the indivisibility assumption is lifted. However, for an artificial problem related to transposition, Adler [1996] showed that removing the indivisibility assumption can lead to faster algorithms. A similar result is shown by Arge and Miltersen [1999] for the decision problem of determining if N data item values are distinct. Whether the conjecture is true is a challenging theoretical open problem.

6. MATRIX AND GRID COMPUTATIONS

Dense matrices are generally represented in memory in row-major or column-major order. Matrix transposition, which is the special case of sorting that involves conversion of a matrix from one representation to the other, was discussed in Section 5.5. For certain operations such as matrix addition, both representations work well. However, for standard matrix multiplication (using only semiring operations) and LU decomposition, a better representation is to block the matrix into square √B × √B submatrices, which gives the upper bound of the following theorem.

THEOREM 6.1 [HONG AND KUNG 1981; SAVAGE AND VITTER 1987; VITTER AND SHRIVER 1994a; WOMBLE ET AL. 1993]. The number of I/Os required for standard matrix multiplication of two K × K matrices or to compute the LU factorization of a K × K matrix is Θ(K^3/(min{K, √M} DB)).
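A standard tiled multiplication illustrates how the blocked layout leads to this bound: each step multiplies two t × t tiles that both fit in internal memory, with t playing the role of roughly √M. The sketch below is generic and illustrative, not the algorithm of the cited papers.

    def tiled_matmul(X, Y, t):
        # Multiply K x K matrices X and Y using t x t tiles, so the three
        # tiles involved in each step fit in internal memory when 3*t*t <= M.
        K = len(X)
        Z = [[0] * K for _ in range(K)]
        for i0 in range(0, K, t):
            for j0 in range(0, K, t):
                for k0 in range(0, K, t):
                    for i in range(i0, min(i0 + t, K)):   # one tile product in memory
                        for k in range(k0, min(k0 + t, K)):
                            xik = X[i][k]
                            for j in range(j0, min(j0 + t, K)):
                                Z[i][j] += xik * Y[k][j]
        return Z

    print(tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], t=1))   # [[19, 22], [43, 50]]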

Hong and Kung [1981] and Nodine et al. [1991] give optimal EM algorithms for iterative grid computations, and Leiserson et al. [1993] reduce the number of I/Os of naive multigrid implementations by a Θ(M^{1/5}) factor. Gupta et al. [1995] show how to derive efficient EM algorithms automatically for computations expressed in tensor form.

If a K × K matrix A is sparse, that is, if the number N_z of nonzero elements in A is much smaller than K^2, then it may be more efficient to store only the nonzero elements. Each nonzero element A_{i,j} is represented by the triple (i, j, A_{i,j}). Unlike the dense case, in which transposition can be easier than sorting (e.g., see Theorem 5.3 when B^2 ≤ M), transposition of sparse matrices is as hard as sorting.
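Concretely, converting between the two orders is just a sort of the N_z triples, first by (i, j) and then by (j, i); a small in-memory stand-in for that external sort:

    def sparse_transpose(triples):
        # Convert a sparse matrix stored as (i, j, value) triples from
        # row-major order (sorted by (i, j)) to column-major order (sorted
        # by (j, i)).  Externally this is one sorting pass over the Nz triples.
        return sorted(((j, i, v) for i, j, v in triples))

    A = [(0, 2, 7.0), (1, 0, 3.0), (1, 2, 5.0)]
    print(sparse_transpose(A))   # [(0, 1, 3.0), (2, 0, 7.0), (2, 1, 5.0)]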


THEOREM 6.2. For a matrix stored in sparse format and containing N_z nonzero elements, the number of I/Os required to convert the matrix from row-major order to column-major order, and vice versa, is Θ(Sort(N_z)).

The lower bound follows by reduction from sorting. If the ith item in the input of the sorting instance has key value x ≠ 0, there is a nonzero element in matrix position (i, x).

For further discussion of numerical EM algorithms we refer the reader to the survey by Toledo [1999]. Some issues regarding programming environments are covered in Corbett et al. [1996] and Section 14.

7. BATCHED PROBLEMS IN COMPUTATIONAL GEOMETRY

Problems involving massive amounts of geometric data are ubiquitous in spatial databases [Laurini and Thompson 1992; Samet 1989a,b], geographic information systems (GIS) [Laurini and Thompson 1992; Samet 1989a; Van Kreveld et al. 1997], constraint logic programming [Kanellakis et al. 1990; 1996], object-oriented databases [Zdonik and Maier 1990], statistics, virtual reality systems, and computer graphics [Funkhouser et al. 1992]. NASA's Earth Observing System [1999] project, the core part of the Earth Science Enterprise (formerly Mission to Planet Earth), produces petabytes (10^15 bytes) of raster data per year. Microsoft's TerraServer [1998] online database of satellite images is over one terabyte in size. A major challenge is to develop mechanisms for processing the data, or else much of the data will be useless.²

For systems of this size to be efficient, we need fast EM algorithms and data structures for basic problems in computational geometry. Luckily, many problems on geometric objects can be reduced to a small core of problems, such as computing intersections, convex hulls, or nearest neighbors. Useful paradigms have been developed for solving these problems in external memory.

² For brevity, in the remainder of this survey we deal only with the single-disk case D = 1. The single-disk I/O bounds for the batched problems can often be cut by a factor of Θ(D) for the case D ≥ 1 by using the load balancing techniques of Section 5. In practice, disk striping (cf. Section 4.2) may be sufficient. For online problems, disk striping will convert optimal bounds for the case D = 1 into optimal bounds for D ≥ 1.

THEOREM 7.1. Certain batched problems involving N = nB input items, Q = qB queries, and Z = zB output items can be solved using

O((n + q) log_m n + z) (12)

I/Os with a single disk:

(1) Computing the pairwise intersections of N segments in the plane and their trapezoidal decomposition;

(2) Finding all intersections between N nonintersecting red line segments and N nonintersecting blue line segments in the plane;

(3) Answering Q orthogonal 2-D range queries on N points in the plane (i.e., finding all the points within the Q query rectangles);

(4) Constructing the 2-D and 3-D convex hull of N points;

(5) Voronoi diagram and triangulation of N points in the plane;

(6) Performing Q point location queries in a planar subdivision of size N;

(7) Finding all nearest neighbors for a set of N points in the plane;

(8) Finding the pairwise intersections of N orthogonal rectangles in the plane;

(9) Computing the measure of the union of N orthogonal rectangles in the plane;

(10) Computing the visibility of N segments in the plane from a point; and

(11) Performing Q ray-shooting queries in 2-D constructive solid geometry (CSG) models of size N.

The parameters Q and Z are set to 0 if they are not relevant for the particular problem.

Goodrich et al. [1993], Zhu [1994], Arge et al. [1995; 1998b], and Crauser et al. [1998; 1999] develop EM algorithms for those problems using these EM paradigms for batched problems:

Distribution sweeping, a generalization of the distribution paradigm of Section 5 for "externalizing" plane sweep algorithms.

Persistent B-trees, an offline method for constructing an optimal-space persistent version of the B-tree data structure (see Section 10.1), yielding a factor of B improvement over the generic persistence techniques of Driscoll et al. [1989].

Batched filtering, a general method for performing simultaneous EM searches in data structures that can be modeled as planar layered directed acyclic graphs; it is useful for 3-D convex hulls and batched point location. Multisearch on parallel computers is considered in Dittrich et al. [1988].

External fractional cascading, an EM analogue to fractional cascading on a segment tree, in which the degree of the segment tree is O(m^α) for some constant 0 < α ≤ 1. Batched queries can be performed efficiently using batched filtering; online queries can be supported efficiently by adapting the parallel algorithms of the work of Tamassia and Vitter [1996] to the I/O setting.

External marriage-before-conquest, an EM analogue to the technique of Kirkpatrick and Seidel [1986] for performing output-sensitive convex hull constructions.

Batched incremental construction, a localized version of the randomized incremental construction paradigm of Clarkson and Shor [1989], in which the updates to a simple dynamic data structure are done in a random order, with the goal of fast overall performance on the average. The data structure itself may have bad worst-case performance, but the randomization of the update order makes worst-case behavior unlikely. The key for the EM version so as to gain the factor of B I/O speedup is to batch the incremental modifications.

We focus in the remainder of this section primarily on the distribution sweep paradigm [Goodrich et al. 1993], which is a combination of the distribution paradigm of Section 5.1 and the well-known sweeping paradigm from computational geometry [Preparata and Shamos 1985; de Berg et al. 1997]. As an example, let us consider computing the pairwise intersections of N orthogonal segments in the plane by the following recursive distribution sweep. At each level of recursion, the region under consideration is partitioned into Θ(m) vertical slabs, each containing Θ(N/m) of the segments' endpoints. We sweep a horizontal line from top to bottom to process the N segments. When the sweep line encounters a vertical segment, we insert the segment into the appropriate slab. When the sweep line encounters a horizontal segment h, as pictured in Figure 5, we report h's intersections with all the "active" vertical segments in the slabs that are spanned completely by h. (A vertical segment is "active" if it intersects the current sweep line; vertical segments that are found to be no longer active are deleted from the slabs.) The remaining two end portions of h (which "stick out" past a slab boundary) are passed recursively to the next level, along with the vertical segments. The downward sweep then proceeds. After the initial sorting (to get the segments with respect to the y-dimension), the sweep at each of the O(log_m n) levels of recursion requires O(n) I/Os, yielding the desired bound (12). Some timing experiments on distribution sweeping appear in Chiang [1998]. Arge et al. [1998b] develop a unified approach to distribution sweep in higher dimensions.
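An in-memory sketch of a single level of this sweep may help fix the idea; it omits the recursion on the end portions (they are merely collected), uses Python lists in place of the slab files on disk, and the segment encodings and function name are my own.

    from bisect import bisect_right

    def one_level_sweep(verticals, horizontals, slab_bounds):
        # One level of distribution sweep for orthogonal segments.
        #   verticals:   list of (x, y_top, y_bottom)
        #   horizontals: list of (y, x_left, x_right)
        #   slab_bounds: increasing x-coordinates delimiting the slabs.
        # Reports intersections in slabs completely spanned by a horizontal
        # segment; end portions are collected for the next level of recursion.
        slabs = [[] for _ in range(len(slab_bounds) + 1)]
        events = ([(yt, 0, (x, yt, yb)) for x, yt, yb in verticals] +
                  [(y, 1, (y, xl, xr)) for y, xl, xr in horizontals])
        events.sort(key=lambda e: (-e[0], e[1]))        # sweep from top to bottom
        out, leftover = [], []
        for _, kind, seg in events:
            if kind == 0:                                # vertical: insert into its slab
                slabs[bisect_right(slab_bounds, seg[0])].append(seg)
            else:
                y, xl, xr = seg
                for s, slab in enumerate(slabs):
                    lo = slab_bounds[s - 1] if s > 0 else float('-inf')
                    hi = slab_bounds[s] if s < len(slab_bounds) else float('inf')
                    if xl <= lo and hi <= xr:            # slab completely spanned by h
                        slab[:] = [v for v in slab if v[2] <= y]   # drop inactive verticals
                        out.extend((v, seg) for v in slab)
                    elif lo < xl < hi or lo < xr < hi:
                        leftover.append((s, seg))        # end portion: recurse on this slab
        return out, leftover

    verts = [(2.0, 9.0, 1.0), (5.0, 8.0, 3.0)]
    hors = [(4.0, 0.0, 10.0)]
    print(one_level_sweep(verts, hors, slab_bounds=[3.0, 6.0]))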

Fig. 5. Distribution sweep used for finding intersections among N orthogonal segments. The vertical segments currently stored in the slabs are indicated in bold (namely, s1, s2, . . . , s9). Segments s5 and s8 are not active, but have not yet been deleted from the slabs. The sweep line has just advanced to a new horizontal segment that completely spans slabs 2 and 3, so slabs 2 and 3 are scanned and all the active vertical segments in slabs 2 and 3 (namely, s2, s3, s4, s6, s7) are reported as intersecting the horizontal segment. In the process of scanning slab 3, segment s5 is discovered to be no longer active and can be deleted from slab 3. The end portions of the horizontal segment that "stick out" into slabs 1 and 4 are handled by the lower levels of recursion, where the intersection with s8 is eventually discovered.

A central operation in spatial databases is spatial join. A common preprocessing step is to find the pairwise intersections of the bounding boxes of the objects involved in the spatial join. The problem of intersecting orthogonal rectangles can be solved by combining the previous sweep line algorithm for orthogonal segments with one for range searching. Arge et al. [1998b] take a more unified approach using distribution sweep, which is extendible to higher dimensions: the active objects that are stored in the data structure in this case are rectangles, not vertical segments. The authors choose the branching factor to be Θ(√m). Each rectangle is associated with the largest contiguous range of vertical slabs that it spans. Each of the possible Θ((√m choose 2)) = Θ(m) contiguous ranges of slabs is called a multislab. The reason why the authors choose the branching factor to be Θ(√m) rather than Θ(m) is so that the number of multislabs is Θ(m), and thus there is room in internal memory for a buffer for each multislab. The height of the tree remains O(log_m n).

The algorithm proceeds by sweeping a horizontal line from top to bottom to process the N rectangles. When the sweep line first encounters a rectangle R, we consider the multislab lists for all the multislabs that R intersects. We report all the active rectangles in those multislab lists, since they are guaranteed to intersect R. (Rectangles no longer active are discarded from the lists.) We then extract the left and right end portions of R that partially "stick out" past slab boundaries, and we pass them down to process in the next lower level of recursion. We insert the remaining portion of R, which spans complete slabs, into the list for the appropriate multislab. The downward sweep then continues. After the initial sorting preprocessing, each of the O(log_m n) sweeps (one per level of recursion) takes O(n) I/Os, yielding the desired bound (12).

The resulting algorithm, called scalable sweeping-based spatial join (SSSJ) [Arge et al. 1998a; 1998b], outperforms other techniques for rectangle intersection. It was tested against two other sweep line algorithms: the partition-based spatial-merge (QPBSM) used in Paradise [Patel and DeWitt 1996] and a faster version called MPBSM that uses an improved dynamic data structure for intervals [Arge et al. 1998a]. The TPIE system described in Section 14 served as the common implementation platform. The algorithms were tested on several data sets. The timing results for the two data sets in Figures 6(a) and 6(b) are given in Figures 6(c) and 6(d), respectively. The first data set is the worst case for sweep line algorithms; a large fraction of the line segments in the file are active (i.e., they intersect the current sweep line). The second data set is a best case for sweep line algorithms, but the two PBSM algorithms have the disadvantage of making extra copies of the rectangles. In both cases, SSSJ shows considerable improvement over the PBSM-based methods. In other experiments done on more typical data, such as TIGER/line road data sets [TIGER 1992], SSSJ and MPBSM perform about 30% faster than does QPBSM. The conclusion we draw is that SSSJ is as fast as other known methods on typical data, but unlike other methods, it scales well even for worst-case data. If the rectangles are already stored in an index structure, such as the R-tree index structure we consider in Section 11.2, hybrid methods that combine distribution sweep with inorder traversal often perform best [Arge et al. 2000b].

Fig. 6. Comparison of scalable sweeping-based spatial join (SSSJ) with the original PBSM (QPBSM) and a new variant (MPBSM): (a) data set 1 consists of tall and skinny (vertically aligned) rectangles; (b) data set 2 consists of short and wide (horizontally aligned) rectangles; (c) running times on data set 1; (d) running times on data set 2.

For the problem of finding all intersections among N line segments, Arge et al. [1995] give an efficient algorithm based upon distribution sort, but the output component of the I/O bound is slightly nonoptimal: z log_m n rather than z. Crauser et al. [1998; 1999] attain the optimal I/O bound (12) by constructing the trapezoidal decomposition for the intersecting segments using an incremental randomized construction. For I/O efficiency, they do the incremental updates in a series of batches, in which the batch size is geometrically increasing by a factor of m.

8. BATCHED PROBLEMS ON GRAPHS

The first work on EM graph algorithms was by Ullman and Yannakakis [1991] for the problem of transitive closure.

Chiang et al. [1995] consider a variety of graph problems, several of which have upper and lower I/O bounds related to sorting and permuting. Abello et al. [1998] formalize a functional approach to EM graph problems, in which computation proceeds in a series of scan operations over the data; the scanning avoids side-effects and thus permits checkpointing to increase reliability. Kumar and Schwabe [1996], followed by Buchsbaum et al. [2000], develop graph algorithms based upon amortized data structures for binary heaps and tournament trees. Munagala and Ranade [1999] give improved graph algorithms for connectivity and undirected breadth-first search, and Arge et al. [2000a] extend the approach to compute the minimum spanning forest (MSF). Meyer [2001] provides some improvements for graphs of bounded degree. Arge [1995b] gives efficient algorithms for constructing ordered binary decision diagrams. Grossi and Italiano [1999] apply their multidimensional data structure to get dynamic EM algorithms for MSF and two-dimensional priority queues (in which the delete_min operation is replaced by delete_min_x and delete_min_y). Techniques for storing graphs on disks for efficient traversal and shortest path queries are discussed in Agarwal et al. [1998b], Goldman et al. [1998], Hutchinson et al. [1999], and Nodine et al. [1996]. Computing wavelet decompositions and histograms [Vitter and Wang 1999; Vitter et al. 1998; Wang et al. 2001] is an EM graph problem related to transposition that arises in online analytical processing (OLAP). Wang et al. [1998] give an I/O-efficient algorithm for constructing classification trees for data mining.

Table IV gives the best known I/O bounds for several graph problems, as a function of the number V = vB of vertices and the number E = eB of edges. The best known I/O lower bound for these problems is Ω((E/V) Sort(V)) = Ω(e log_m v), as mentioned in Section 5.7. A sparsification technique [Eppstein et al. 1997] can often be applied to convert I/O bounds of the form O(Sort(E)) to the improved form O((E/V) Sort(V)). For example, the actual I/O bounds for connectivity and MSF derived by Munagala and Ranade [1999] and Arge et al. [2000a] are O(max{1, log log(V/e)} Sort(E)). For the MSF problem, we can partition the edges of the graph into E/V sparse subgraphs on V vertices, and then apply the algorithm of Arge et al. [2000a] to each subproblem to create E/V spanning forests in a total of O(max{1, log log(V/e)} (E/V) Sort(V)) I/Os. We can then merge the E/V spanning forests, two at a time, in a balanced binary merging procedure by repeatedly applying the algorithm of Arge et al. [2000a]. After the first level of binary merging, the spanning forests collectively have at most E/2 edges; after two levels, they have at most E/4 edges, and so on in a geometrically decreasing manner. The total cost for the final spanning forest is thus O(max{1, log log(V/e)} (E/V) Sort(V)) I/Os. The reason why sparsification works is that the spanning forest output by each binary merge is only Θ(V) in size, yet it preserves the necessary information needed for the next merge step. The same approach works for connectivity.

Table IV. Best Known I/O Bounds for Batched Graph Problems for the Single-Disk Case D = 1. The Number of Vertices Is Denoted by V = vB and the Number of Edges by E = eB. The Terms Sort(N) and BundleSort(N, K) Are Defined in Sections 3 and 5.4. Lower Bounds Are Discussed in Section 5.7.

Graph Problem: I/O Bound, D = 1

List ranking, Euler tour of a tree, centroid decomposition, expression tree evaluation:
    Θ(Sort(V)) [Chiang et al. 1995]

Connected components, minimum spanning forest (MSF):
    O(max{1, log log(V/e)} (E/V) Sort(V)) [Arge et al. 2000a; Eppstein et al. 1997; Munagala and Ranade 1999] (deterministic)
    Θ((E/V) Sort(V)) [Chiang et al. 1995] (randomized)

Bottleneck MSF, biconnected components:
    O(min{V^2, max{1, log(V/M)} (E/V) Sort(V), (log B) (E/V) Sort(V) + e log V}) [Abello et al. 1998; Chiang et al. 1995; Eppstein et al. 1997; Kumar and Schwabe 1996] (deterministic)
    Θ((E/V) Sort(V)) [Chiang et al. 1995; Eppstein et al. 1997] (randomized)

Ear decomposition, maximal matching:
    O(min{V^2, max{1, log(V/M)} Sort(E), (log B) Sort(E) + e log V}) [Abello et al. 1998; Chiang et al. 1995; Kumar and Schwabe 1996] (deterministic)
    O(Sort(E)) [Chiang et al. 1995] (randomized)

Undirected breadth-first search:
    O(BundleSort(E, V) + V) [Munagala and Ranade 1999]

Undirected single-source shortest paths:
    O(e log e + V) [Kumar and Schwabe 1996]

Directed and undirected depth-first search, topological sorting, directed breadth-first search, directed single-source shortest paths:
    O(min{ve/m + V, (V + e) log v}) [Buchsbaum et al. 2000; Chiang et al. 1995; Kumar and Schwabe 1996]

Transitive closure:
    O(Vv√(e/m)) [Chiang et al. 1995]

In the case of semi-external graph problems [Abello et al. 1998], in which the vertices fit in internal memory but not the edges (i.e., V ≤ M < E), several of the problems in Table IV can be solved optimally in external memory. For example, finding connected components, biconnected components, and minimum spanning forests can be done in O(e) I/Os when V ≤ M. The I/O complexities of several problems in the general case remain open, including connected components, biconnected components, and minimum spanning forests in the deterministic case, as well as breadth-first search, topological sorting, shortest paths, depth-first search, and transitive closure. It may be that the I/O complexity for several of these problems is Θ((E/V) Sort(V) + V). For special cases, such as trees, planar graphs, outerplanar graphs, and graphs of bounded tree width, several of these problems can be solved substantially faster in O(Sort(E)) I/Os [Agarwal et al. 1998b; Chiang 1995; Maheshwari and Zeh 1999; 2001].

Chiang et al. [1995] exploit the key idea that efficient EM algorithms can often be developed by a sequential simulation of a parallel algorithm for the same problem. The intuition is that each step of a parallel algorithm specifies several operations and the data upon which they act. If we bring together the data arguments for each operation, which we can do by two applications of sorting, then the operations can be performed by a single linear scan through the data. After each simulation step, we sort again in order to reblock the data into the linear order required for the next simulation step. In list ranking, which is used as a subroutine in the solution of several other graph problems, the number of working processors in the parallel algorithm decreases geometrically with time, so the number of I/Os for the entire simulation is proportional to the number of I/Os used in the first phase of the simulation, which is Sort(N) = Θ(n log_m n). The optimality of the EM algorithm given in Chiang et al. [1995] for list ranking assumes that √m log m = Ω(log n), which is usually true in practice. That assumption can be removed by use of the buffer tree data structure [Arge 1995a] (see Section 10.4). A practical randomized implementation of list ranking appears in Sibeyn [1997].

Dehne et al. [1997; 1999] and Sibeyn and Kaufmann [1997] use a related approach and get efficient I/O bounds by simulating coarse-grained parallel algorithms in the BSP parallel model. Coarse-grained parallel algorithms may exhibit more locality than the fine-grained algorithms considered in Chiang et al. [1995], and as a result the simulation may require fewer sorting steps. Dehne et al. make certain assumptions, most notably that log_m n ≤ c for some small constant c (or equivalently that M^c < NB), so that the periodic sortings can each be done in a linear number of I/Os. Since the BSP literature is well developed, their simulation technique provides efficient single-processor and multiprocessor EM algorithms for a wide variety of problems.

In order for the simulation techniques to be reasonably efficient, the parallel algorithm being simulated must run in O((log N)^c) time using N processors. Unfortunately, the best known polylog-time algorithms for problems such as depth-first search and shortest paths use a polynomial number of processors, not a linear number. P-complete problems such as lexicographically first depth-first search are unlikely to have polylogarithmic-time algorithms even with a polynomial number of processors. The interesting connection between the parallel domain and the EM domain suggests that there may be relationships between computational complexity classes related to parallel computing (such as P-complete problems) and those related to I/O efficiency. It may thus be possible to show by reduction that certain groups of problems are "equally hard" to solve efficiently in terms of I/O and are thus unlikely to have solutions as fast as sorting.

9. EXTERNAL HASHING FOR ONLINE DICTIONARY SEARCH

We now turn our attention to online data structures for supporting the dictionary operations of insert, delete, and lookup. Given a value x, the lookup operation returns the item(s), if any, in the structure with key value x. The two main types of EM dictionaries are hashing, which we discuss in this section, and tree-based approaches, which we defer until Section 10. The advantage of hashing is that the expected number of probes per operation is a constant, regardless of the number N of items. The common element of all EM hashing algorithms is a predefined hash function

hash : {all possible keys} → {0, 1, 2, . . . , K − 1}

that assigns the N items to K address locations in a uniform manner. Hashing algorithms differ from each other in how they resolve the collision that results when there is no room to store an item at its assigned location.

The goals in EM hashing are to achieve an average of O(Output(Z)) = O(⌈z⌉) I/Os per lookup, where Z = zB is the number of items output, O(1) I/Os per insert and delete, and linear disk space. Most traditional hashing methods use a statically allocated table and are thus designed to handle only a fixed range of N. The challenge is to develop dynamic EM structures that can adapt smoothly to widely varying values of N.

EM hashing methods fall into one of two categories: directory methods and directoryless methods. Fagin et al. [1979] proposed a directory scheme called extendible hashing. Let us assume that the size K of the range of the hash function hash is sufficiently large. The directory, for a given d ≥ 0, consists of a table (array) of 2^d pointers. Each item is assigned to the table location corresponding to the d least significant bits of its hash address. The value of d, called the global depth, is set to the smallest value for which each table location has at most B items assigned to it. Each table location contains a pointer to a block where its items are stored. Thus, a lookup takes two I/Os: one to access the directory and one to access the block storing the item. If the directory fits in internal memory, only one I/O is needed.

Several table locations may have many fewer than B assigned items, and for purposes of minimizing storage utilization, they can share the same disk block for storing their items. A table location shares a disk block with all the other table locations having the same k least significant bits in their address, where the local depth k is chosen to be as small as possible so that the pooled items fit into a single disk block. Each disk block has its own local depth. An example is given in Figure 7.

When a new item is inserted, and its disk block overflows, the global depth d and the block's local depth k are recalculated so that the invariants on d and k once again hold. This process corresponds to "splitting" the block that overflows and redistributing its items. Each time the global depth d is incremented by 1, the directory doubles in size, which is how extendible hashing adapts to a growing N. The pointers in the new directory are initialized to point to the appropriate disk blocks. The important point is that the disk blocks themselves do not need to be disturbed during doubling, except for the one block that overflows.

More specifically, let hash_d be the hash function corresponding to the d least significant bits of hash; that is, hash_d(x) = hash(x) mod 2^d. Initially a single disk block is created to store the data items, and all the slots in the directory are initialized to point to the block. The local depth k of the block is set to 0.

When an item with key value x is inserted, it is stored in the disk block pointed to by directory slot hash_d(x). If as a result the block (call it b) overflows, then block b splits into two blocks—the original block b and a new block b′—and its items are redistributed based upon the (b.k + 1)st least significant bit of hash(x). (Here b.k refers to b's local depth k.) We increment b.k by 1 and store that value also in b′.k. In the unlikely event that b or b′ is still overfull, we continue the splitting procedure and increment the local depths appropriately. At this point, some of the data items originally stored in block b have been moved to other blocks, based upon their hash addresses. If b.k ≤ d, we simply update those directory pointers originally pointing to b that need changing, as shown in Figure 7(b). Otherwise, the directory isn't large enough to accommodate hash addresses with b.k bits, so we repeatedly double the directory size and increment the global depth d by 1 until d becomes equal to b.k, as shown in Figure 7(c). The pointers in the new directory are initialized to point to the appropriate disk blocks. As noted before, the disk blocks do not need to be modified during doubling, except for the block that overflows.

Fig. 7. Extendible hashing with block size B = 3. The keys are indicated in italics; the hash address of a key consists of its binary representation. For example, the hash address of key 4 is ". . . 000100" and the hash address of key 44 is ". . . 0101100." (a) The hash table after insertion of the keys 4, 23, 18, 10, 44, 32, 9. (b) Insertion of the key 76 into table location 100 causes the block with local depth 2 to split into two blocks with local depth 3. (c) Insertion of the key 20 into table location 100 causes a block with local depth 3 to split into two blocks with local depth 4. The directory doubles in size and the global depth d is incremented from 3 to 4.

Extendible hashing can handle deletions in a similar way. When two blocks with the same local depth k contain items whose hash addresses share the same k − 1 least significant bits and can fit into a single block, then their items are merged into a single block with a decremented value of k. The combined size of the blocks being merged must be sufficiently less than B to prevent immediate splitting after a subsequent insertion. The directory shrinks by half (and the global depth d is decremented by 1) when all the local depths are less than the current value of d.
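To make the directory mechanics concrete, here is a small in-memory sketch (our own illustration, not code from the survey) of the lookup and insert-with-split logic described above. The names Block and ExtendibleHash are invented, and in the EM setting each directory or block access would correspond to one I/O; directory shrinking on deletion is omitted.

```python
# Illustrative in-memory sketch of extendible hashing.
# Each Block models one disk block holding at most B items.

class Block:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # the block's local depth k
        self.items = {}                  # key -> item, at most B entries

class ExtendibleHash:
    def __init__(self, B, hash_fn=hash):
        self.B = B
        self.hash = hash_fn
        self.global_depth = 0            # global depth d
        self.directory = [Block(local_depth=0)]   # 2^d pointers, initially one

    def _slot(self, key, depth):
        # hash_d(x) = hash(x) mod 2^d: the d least significant bits.
        return self.hash(key) & ((1 << depth) - 1)

    def lookup(self, key):
        # One I/O for the directory (if on disk) plus one for the block.
        block = self.directory[self._slot(key, self.global_depth)]
        return block.items.get(key)

    def insert(self, key, item):
        block = self.directory[self._slot(key, self.global_depth)]
        block.items[key] = item
        while len(block.items) > self.B:          # block overflows: split it
            if block.local_depth == self.global_depth:
                # Directory too small: double it. Only pointers are copied;
                # no disk blocks other than the overflowing one are touched.
                self.directory = self.directory * 2
                self.global_depth += 1
            # Split block into b and b' on the new (b.k + 1)st bit.
            block.local_depth += 1
            sibling = Block(block.local_depth)
            bit = 1 << (block.local_depth - 1)
            stay, move = {}, {}
            for k, v in block.items.items():
                (move if self.hash(k) & bit else stay)[k] = v
            block.items, sibling.items = stay, move
            # Redirect the directory slots that should now point to sibling.
            for slot in range(len(self.directory)):
                if self.directory[slot] is block and slot & bit:
                    self.directory[slot] = sibling
            # In the unlikely event a half still overflows, keep splitting.
            block = self.directory[self._slot(key, self.global_depth)]
```

In this simulation a lookup touches exactly one directory slot and one block, matching the two-I/O bound stated above (one I/O when the directory fits in internal memory).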


The expected number of disk blocks required to store the data items is asymptotically n/ln 2 ≈ n/0.69; that is, the blocks tend to be about 69% full [Mendelson 1982]. At least Ω(n/B) blocks are needed to store the directory. Flajolet [1983] showed that on average the directory uses Θ(N^(1/B) n/B) = Θ(N^(1+1/B)/B^2) blocks, which can be superlinear in N asymptotically! However, for practical values of N and B, the N^(1/B) term is a small constant, typically less than 2, and directory size is within a constant factor of optimal.

The resulting directory is equivalent to the leaves of a perfectly balanced trie [Knuth 1998], in which the search path for each item is determined by its hash address, except that hashing allows the leaves of the trie to be accessed directly in a single I/O. Any item can thus be retrieved in a total of two I/Os. If the directory fits in internal memory, only one I/O is needed.

A disadvantage of directory schemes is that two I/Os rather than one I/O are required when the directory is stored in external memory. Litwin [1980] and Larson [1982] developed a directoryless method called linear hashing that expands the number of data blocks in a controlled regular fashion. For example, suppose that the disk blocks currently allocated are blocks 0, 1, 2, . . . , 2^d + p − 1, for some 0 ≤ p < 2^d. When N grows sufficiently larger (say, by 0.8B items), block p is split by allocating a new block 2^d + p. Some of the data items from block p are redistributed to block 2^d + p, based upon the value of hash_{d+1}, and p is incremented by 1. When p reaches 2^d, it is reset to 0 and the global depth d is incremented by 1. To search for an item with key value x, the hash address hash_d(x) is used if it is p or larger; otherwise the corresponding block has already been split, so hash_{d+1}(x) is used instead as the hash address.
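The addressing rule is the heart of linear hashing. The short sketch below is our own (the function names are invented, and the policy that decides when to trigger a split is elided); it shows how the block number for a key is computed from the split pointer p and global depth d, and how one controlled split advances them.

```python
# Minimal sketch of linear hashing addressing and splitting (illustrative).

def block_address(key, d, p, hash_fn=hash):
    """Block number that should hold `key` when the split pointer is p
    and the global depth is d."""
    addr = hash_fn(key) % (1 << d)               # hash_d(key)
    if addr < p:                                 # block addr was already split
        addr = hash_fn(key) % (1 << (d + 1))     # use hash_{d+1}(key) instead
    return addr

def split_next(blocks, d, p, hash_fn=hash):
    """Split block p into blocks p and 2^d + p, then advance (d, p).
    `blocks` is a list of key lists indexed 0 .. 2^d + p - 1."""
    old, new = blocks[p], []
    blocks.append(new)                           # allocate block number 2^d + p
    keep = [k for k in old if hash_fn(k) % (1 << (d + 1)) == p]
    new.extend(k for k in old if hash_fn(k) % (1 << (d + 1)) != p)
    blocks[p] = keep
    p += 1
    if p == (1 << d):                            # a full round of splits is done
        d, p = d + 1, 0
    return d, p
```

Because the block that splits is chosen by p rather than by which block overflowed, a realistic implementation also needs overflow chains for blocks that fill up before their turn, as the next paragraph explains.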

In contrast to directory schemes, the blocks in directoryless methods are chosen for splitting in a predefined order. Thus the block that splits is usually not the block that has overflowed, so some of the blocks may require auxiliary overflow lists to store items assigned to them. On the other hand, directoryless methods have the advantage that there is no need for access to a directory structure, and thus searches often require only one I/O. A related technique called spiral storage (or spiral hashing) [Martin 1979; Mullin 1985] combines constrained bucket splitting and overflowing buckets. More detailed surveys and analyses of methods for dynamic hashing appear in Baeza-Yates and Soza-Pollman [1998] and Enbody and Du [1988].

The above hashing schemes and their many variants work very well for dictionary applications in the average case, but have poor worst-case performance. They also do not support sequential search, such as retrieving all the items with key value in a specified range. Some clever work has been done on order-preserving hash functions, in which items with sequential keys are stored in the same block or in adjacent blocks, but the search performance is less robust and tends to deteriorate because of unwanted collisions. (See Gaede and Gunther [1998] for a survey, plus recent work in Indyk et al. [1997].) A more effective approach for sequential search is to use multiway trees, which we explore next.

10. MULTIWAY TREE DATA STRUCTURES

An advantage of search trees over hashing methods is that the data items in a tree are sorted, and thus the tree can be used readily for one-dimensional range search. The items in a range [x, y] can be found by searching for x in the tree, and then performing an inorder traversal in the tree from x to y. In this section we explore some important search-tree data structures in external memory.

10.1. B-Trees and Variants

Tree-based data structures arise naturally in the online setting, in which the data can be updated and queries must be processed immediately. Binary trees have a host of applications in the (internal memory) RAM model. In order to exploit block transfer, trees in external memory generally use a block for each node, which can store Θ(B) pointers and data values.

The well-known balanced multiway B-tree due to Bayer and McCreight [1972], Comer [1979], and Knuth [1998] is the most widely used nontrivial EM data structure. The degree of each node in the B-tree (with the exception of the root) is required to be Θ(B), which guarantees that the height of a B-tree storing N items is roughly logB N. B-trees support dynamic dictionary operations and one-dimensional range search optimally in linear space, O(logB N) I/Os per insert or delete, and O(logB N + z) I/Os per query, where Z = zB is the number of items output. When a node overflows during an insertion, it splits into two half-full nodes, and if the splitting causes the parent node to overflow, the parent node splits, and so on. Splittings can thus propagate up to the root, which is how the tree grows in height. Deletions are handled in a symmetric way by merging nodes.

In the B+-tree variant, pictured in Figure 8, all the items are stored in the leaves, and the leaves are linked together in symmetric order to facilitate range queries and sequential access. The internal nodes store only key values and pointers and thus can have a higher branching factor. In the most popular variant of B+-trees, called B*-trees, splitting can usually be postponed when a node overflows, by "sharing" the node's data with one of its adjacent siblings. The node needs to be split only if the sibling is also full; when that happens, the node splits into two, and its data and those of its full sibling are evenly redistributed, making each of the three nodes about 2/3 full. This local optimization reduces the number of times new nodes must be created and thus increases the storage utilization. And since there are fewer nodes in the tree, search I/O costs are lower. When no sharing is done (as in B+-trees), Yao [1978] shows that nodes are roughly ln 2 ≈ 69% full on the average, assuming random insertions. With sharing (as in B*-trees), the average storage utilization increases to about 2 ln(3/2) ≈ 81% [Baeza-Yates 1989; Kuspert 1983]. Storage utilization can be increased further by sharing among several siblings, at the cost of more complicated insertions and deletions. Some helpful space-saving techniques borrowed from hashing are partial expansions [Baeza-Yates and Larson 1989] and use of overflow nodes [Srinivasan 1991].

Fig. 8. B+-tree multiway search tree. Each internal and leaf node corresponds to a disk block. All the items are stored in the leaves; the darker portion of each leaf block indicates its relative fullness. The internal nodes store only key values and pointers, Θ(B) of them per node. Although not indicated here, the leaf blocks are linked together sequentially.
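The O(logB N + z) range-search bound follows directly from the linked-leaf organization: one root-to-leaf descent plus a scan of consecutive leaves. The sketch below is our own illustration (a toy Node type with one possible separator-key convention), not code from the survey or any particular B-tree library.

```python
# Illustrative B+-tree range query. Each node visit models one I/O.

class Node:
    def __init__(self, keys, children=None, next_leaf=None, items=None):
        self.keys = keys            # separator keys (internal) or item keys (leaf)
        self.children = children    # child pointers; None for a leaf
        self.next_leaf = next_leaf  # right-sibling pointer (leaves only)
        self.items = items          # data items stored in a leaf

def range_query(root, x, y):
    # Descend O(log_B N) levels to the leaf whose range contains x.
    node = root
    while node.children is not None:
        i = 0
        while i < len(node.keys) and x > node.keys[i]:
            i += 1
        node = node.children[i]
    # Scan the linked leaves, reporting items with keys in [x, y];
    # this costs O(z) additional I/Os for Z = zB output items.
    result = []
    while node is not None:
        for key, item in zip(node.keys, node.items):
            if key > y:
                return result
            if key >= x:
                result.append(item)
        node = node.next_leaf
    return result
```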

A cross between B-trees and hashing, where each subtree rooted at a certain level of the B-tree is instead organized as an external hash table, was developed by Litwin and Lomet [1987] and further studied in Baeza-Yates [1996] and Lomet [1988]. O'Neil [1992] proposed a B-tree variant called the SB-tree that clusters together on the disk symmetrically ordered nodes from the same level so as to optimize range queries and sequential access. Rao and Ross [1999; 2000] use similar ideas to exploit locality and optimize search tree performance in internal memory. Reducing the number of pointers allows a higher branching factor and thus faster search.

Partially persistent versions of B-trees have been developed by Becker et al. [1996] and Varman and Verma [1997]. By persistent data structure, we mean that searches can be done with respect to any timestamp y [Driscoll et al. 1989; Easton 1986]. In a partially persistent data structure, only the most recent version of the data structure can be updated. In a fully persistent data structure, any update done with timestamp y affects all future queries for any time after y. An interesting open problem is whether B-trees can be made fully persistent. Salzberg and Tsotras [1999] survey work done on persistent access methods and other techniques for time-evolving data. Lehman and Yao [1981], Mohan [1990], and Lomet and Salzberg [1997] explore mechanisms to add concurrency and recovery to B-trees.

10.2. Weight-Balanced B-Trees

Arge and Vitter [1996] introduce a powerful variant of B-trees called weight-balanced B-trees, with the property that the weight of any subtree at level h (i.e., the number of nodes in the subtree rooted at a node of height h) is Θ(a^h), for some fixed parameter a of order B. By contrast, the sizes of subtrees at level h in a regular B-tree can differ by a multiplicative factor that is exponential in h. When a node on level h of a weight-balanced B-tree gets rebalanced, no further rebalancing is needed until its subtree is updated Ω(a^h) times. Weight-balanced B-trees support a wide array of applications in which the I/O cost to rebalance a node of weight w is O(w); the rebalancings can be scheduled in an amortized (and often worst-case) way with only O(1) I/Os. Such applications are very common when the nodes have secondary structures, as in multidimensional search trees, or when rebuilding is expensive. Agarwal et al. [2001a] apply weight-balanced B-trees to convert partition trees such as kd-trees, BBD trees, and BAR trees, which were designed for internal memory, into efficient EM data structures.
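As a small numerical illustration (the bracketing constants 1/2 and 2 below are placeholders chosen by us, not the constants used by Arge and Vitter [1996]), the weight invariant amounts to a range check per node, and the gap between the post-rebalance weight and the violation thresholds is what pays for the amortization.

```python
# Illustrative weight-invariant check for a weight-balanced B-tree node.

def weight_bounds(a, h):
    """Allowed weight range for a subtree rooted at height h (toy constants)."""
    return (a ** h) // 2, 2 * (a ** h)

def needs_rebalance(weight, a, h):
    low, high = weight_bounds(a, h)
    return weight < low or weight > high

# After a rebalance the subtree's weight is restored to roughly a**h, so on
# the order of a**h further inserts or deletes must touch the subtree before
# either bound can be violated again. An O(a^h)-cost rebalancing charged to
# those updates is O(1) I/Os amortized per update.
```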

Weight-balanced trees called BB[α]-trees [Blum and Mehlhorn 1980; Nievergelt and Reingold 1973] have been designed for internal memory; they maintain balance via rotations, which is appropriate for binary trees, but not for the multiway trees needed for external memory. In contrast, weight-balanced B-trees maintain balance via splits and merges.

Weight-balanced B-trees were originally conceived as part of an optimal dynamic EM interval tree structure for stabbing queries and a related EM segment tree structure. We discuss their use for stabbing queries and other types of range queries in Sections 11.3 to 11.5. They also have applications in the (internal memory) RAM model [Arge and Vitter 1996; Grossi and Italiano 1997], where they offer a simpler alternative to BB[α]-trees. For example, by setting a to a constant in the EM interval tree based upon weight-balanced B-trees, we get a simple worst-case implementation of interval trees [Edelsbrunner 1983a; 1983b] in internal memory. Weight-balanced B-trees are also preferable to BB[α]-trees for purposes of augmenting one-dimensional data structures with range restriction capabilities [Willard and Lueker 1985].

10.3. Parent Pointers and Level-Balanced B-Trees

It is sometimes useful to augment B-trees with parent pointers. For example, if we represent a total order via the leaves in a B-tree, we can answer order queries such as, "Is x < y in the total order?" by walking upwards in the B-tree from the leaves for x and y until we reach their common ancestor. Order queries arise in online algorithms for planar point location and for determining reachability in monotone subdivisions [Agarwal et al. 1999]. If we augment a conventional B-tree with parent pointers, then each split operation costs Θ(B) I/Os to update parent pointers, although the I/O cost is only O(1) when amortized over the updates to the node. However, this amortized bound does not apply if the B-tree needs to support cut and concatenate operations, in which case the B-tree is cut into contiguous pieces and the pieces are rearranged arbitrarily. For example, reachability queries in a monotone subdivision are processed by maintaining two total orders, called the leftist and rightist orders, each of which is represented by a B-tree. When an edge is inserted or deleted, the tree representing each order is cut into four consecutive pieces, and the four pieces are rearranged via concatenate operations into a new total order. Doing cuts and concatenations via conventional B-trees augmented with parent pointers will require Θ(B) I/Os per level in the worst case. Node splits can occur with each operation (unlike the case where there are only inserts and deletes), and thus there is no convenient amortization argument that can be applied.

Agarwal et al. [1999] describe an interesting variant of B-trees called level-balanced B-trees for handling parent pointers and operations like cut and concatenate. The balancing condition is "global": the data structure represents a forest of B-trees in which the number of nodes on level h in the forest is allowed to be at most N_h = 2N/(b/3)^h, where b is some fixed parameter in the range 4 < b < B/2. It immediately follows that the total height of the forest is roughly logb N.

Unlike previous variants of B-trees, the degrees of individual nodes of level-balanced B-trees can be arbitrarily small, and for storage purposes, nodes are packed together into disk blocks. Each node in the forest is stored as a node record (which points to the parent's node record) and a doubly linked list of child records (which point to the node records of the children). There are also pointers between the node record and the list of child records. Every disk block stores only node records or only child records, but all the child records for a given node must be stored in the same block (possibly with child records for other nodes). The advantage of this extra level of indirection is that cuts and concatenates can usually be done in only O(1) I/Os per level of the forest. For example, during a cut, a node record gets split into two, and its list of child nodes is chopped into two separate lists. The parent node must therefore get a new child record to point to the new node. These updates require O(1) I/Os except when there is not enough space in the disk block of the parent's child records, in which case the block must be split into two, and extra I/Os are needed to update the pointers to the moved child records. The amortized I/O cost, however, is only O(1) per level, since each update creates at most one node record and child record at each level. The other dynamic update operations can be handled similarly.

All that remains is to reestablish the global level invariant when a level gets too many nodes as a result of an update. If level h is the lowest such level out of balance, then level h and all the levels above it are reconstructed via a postorder traversal in O(N_h) I/Os so that the new nodes get degree Θ(b) and the invariant is restored. The final trick is to construct the new parent pointers that point from the Θ(N_{h−1}) = Θ(bN_h) node records on level h − 1 to the Θ(N_h) level-h nodes. The parent pointers can be accessed in a blocked manner with respect to the new ordering of the nodes on level h. By sorting, the pointers can be rearranged to correspond to the ordering of the nodes on level h − 1, after which the parent pointer values can be written via a linear scan. The resulting I/O cost is O(N_h + Sort(bN_h) + Scan(bN_h)), which can be amortized against the Θ(N_h) updates that occurred since the last time the level-h invariant was violated, yielding an amortized update cost of O(1 + (b/B) logm n) I/Os per level.
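A back-of-the-envelope check of the level invariant N_h ≤ 2N/(b/3)^h, written as a small helper (ours, purely illustrative, and assuming levels are counted upward from the leaves), also makes the forest-height bound of roughly logb N visible: the capacity drops geometrically by a factor of b/3 per level.

```python
# Illustrative check of the level-balanced B-tree invariant.

def level_capacity(N, b, h):
    """Maximum number of nodes allowed on level h: 2N / (b/3)^h."""
    return 2 * N / (b / 3) ** h

def lowest_unbalanced_level(level_counts, N, b):
    """Return the lowest level whose node count exceeds its capacity, or
    None if the global invariant holds (level 0 taken as the leaf level)."""
    for h, count in enumerate(level_counts):
        if count > level_capacity(N, b, h):
            return h          # this level and all levels above it get rebuilt
    return None

# Example: with N = 10**6 and b = 16, level_capacity(N, b, h) drops below 1
# once h is about log_{b/3}(2N) ~ 8.7, so the forest height is O(log_b N).
```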

Order queries such as "Does leaf x precede leaf y in the total order represented by the tree?" can be answered using O(logB N) I/Os by following parent pointers starting at x and y. The update operations insert, delete, cut, and concatenate can be done in O((1 + (b/B) logm n) logb N) I/Os amortized, for any 2 ≤ b ≤ B/2, which is never worse than O((logB N)^2) by appropriate choice of b.

Using the multislab decomposition we discuss in Section 11.3, Agarwal et al. [1999] apply level-balanced B-trees in a data structure for point location in monotone subdivisions, which supports queries and (amortized) updates in O((logB N)^2) I/Os. They also use it to dynamically maintain planar st-graphs using O((1 + (b/B) logm n) logb N) I/Os (amortized) per update, so that reachability queries can be answered in O(logB N) I/Os (worst-case). (Planar st-graphs are planar directed acyclic graphs with a single source and a single sink.) An interesting open question is whether level-balanced B-trees can be implemented in O(logB N) I/Os per update. Such an improvement would immediately give an optimal dynamic structure for reachability queries in planar st-graphs.

10.4. Buffer Trees

An important paradigm for constructing algorithms for batched problems in an internal memory setting is to use a dynamic data structure to process a sequence of updates. For example, we can sort N items by inserting them one by one into a priority queue, followed by a sequence of N delete_min operations. Similarly, many batched problems in computational geometry can be solved by dynamic plane sweep techniques. For example, in Section 7 we showed how to compute orthogonal segment intersections by dynamically keeping track of the active vertical segments (i.e., those hit by the horizontal sweep line); we mentioned a similar algorithm for orthogonal rectangle intersections.

However, if we use this paradigm naively in an EM setting, with a B-tree as the dynamic data structure, the resulting I/O performance will be highly nonoptimal. For example, if we use a B-tree as the priority queue in sorting or to store the active vertical segments hit by the sweep line, each update and query operation will take O(logB N) I/Os, resulting in a total of O(N logB N) I/Os, which is larger than the optimal bound Sort(N) by a substantial factor of roughly B. One solution suggested in Vitter [1991] is to use a binary tree data structure in which items are pushed lazily down the tree in blocks of B items at a time. The binary nature of the tree results in a data structure of height O(log n), yielding a total I/O bound of O(n log n), which is still nonoptimal by a significant log m factor.

Arge [1995a] developed the elegant buffer tree data structure to support batched dynamic operations, as in the sweep line example, where the queries do not have to be answered right away or in any particular order. The buffer tree is a balanced multiway tree, but with degree Θ(m) rather than degree Θ(B), except possibly for the root. Its key distinguishing feature is that each node has a buffer that can store Θ(M) items (i.e., Θ(m) blocks of items). Items in a node are pushed down to the children when the buffer fills. Emptying a full buffer requires Θ(m) I/Os, which amortizes the cost of distributing the M items to the Θ(m) children. Each item thus incurs an amortized cost of O(m/M) = O(1/B) I/Os per level, and the resulting cost for queries and updates is O((1/B) logm n) I/Os amortized.
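The sketch below is our own much-simplified in-memory caricature of the lazy push-down that yields the amortized O((1/B) logm n) bound (the real structure also sorts buffered operations, handles rebalancing, and resolves queries against matching inserts and deletes): operations accumulate in a node's buffer and are only distributed to the children, in one batch, when the buffer fills.

```python
# Simplified buffer-tree push-down (illustrative in-memory simulation).
# A node has Theta(m) children and a buffer holding up to M operations;
# flushing a full buffer costs Theta(m) I/Os and moves M operations one
# level down, i.e., O(1/B) I/Os per operation per level.

class BufferNode:
    def __init__(self, keys, children, M):
        self.keys = keys            # separator keys for the children
        self.children = children    # empty list at a leaf
        self.buffer = []            # pending (op, key, value) tuples
        self.M = M

    def push(self, op):
        self.buffer.append(op)
        if len(self.buffer) >= self.M and self.children:
            self.flush()

    def flush(self):
        # Distribute the buffered operations to the children in one pass.
        pending, self.buffer = self.buffer, []
        for op in pending:
            _, key, _ = op
            i = 0
            while i < len(self.keys) and key > self.keys[i]:
                i += 1
            self.children[i].push(op)   # may trigger recursive flushes

# At the leaves, the buffered inserts, deletes, and queries are finally
# applied in bulk; queries are therefore answered lazily, which is exactly
# what batched problems such as plane sweeps can tolerate.
```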

Buffer trees have an ever-expanding list of applications. They can be used as a subroutine in the standard sweep line algorithm in order to get an optimal EM algorithm for orthogonal segment intersection. Arge showed how to extend buffer trees to implement segment trees [Bentley 1980] in external memory in a batched dynamic setting by reducing the node degrees to Θ(√m) and by introducing multislabs in each node, which were explained in Section 7 for the related batched problem of intersecting rectangles. Buffer trees provide a natural amortized implementation of priority queues for time-forward processing applications such as discrete event simulation, sweeping, and list ranking [Chiang et al. 1995]. Govindrajan et al. [2000] use time-forward processing to construct a well-separated pair decomposition of N points in d dimensions in O(Sort(N)) I/Os, and they apply it to the problems of finding the K nearest neighbors for each point and the K closest pairs. Brodal and Katajainen [1998] provide a worst-case optimal priority queue, in the sense that every sequence of B insert and delete_min operations requires only O(logm n) I/Os. Practical implementations of priority queues based upon these ideas are examined in Brengel et al. [1999] and Sanders [1999]. In Section 11.2 we report on some timing experiments involving buffer trees for use in bulk loading of R-trees. Further experiments on buffer trees appear in Hutchinson et al. [1997].

11. SPATIAL DATA STRUCTURES AND RANGE SEARCH

In this section we consider online EM data structures for storing and querying spatial data. A fundamental database primitive in spatial databases and geographic information systems (GIS) is range search, which includes dictionary lookup as a special case. An orthogonal range query, for a given d-dimensional rectangle, returns all the points in the interior of the rectangle. In this section we use range searching (especially for the orthogonal 2-D case when d = 2) as the canonical query operation on spatial data. Other types of spatial queries include point location, ray shooting, nearest neighbor, and intersection queries, but for brevity we restrict our attention primarily to range searching.

There are two types of spatial data structures: data-driven and space-driven. R-trees and kd-trees are data-driven since they are based upon a partitioning of the data items themselves, whereas space-driven methods such as quad trees and grid files are organized by a partitioning of the embedding space, akin to order-preserving hash functions. In this section we discuss primarily data-driven data structures.

Multidimensional range search is a fundamental primitive in several online geometric applications, and it provides indexing support for constraint and object-oriented data models. (See Kanellakis et al. [1996] for background.) We have already discussed multidimensional range searching in a batched setting in Section 7. In this section we concentrate on data structures for the online case.

For many types of range searching problems, it is very difficult to develop theoretically optimal algorithms and data structures. Many open problems remain. The primary design criteria are to achieve the same performance we get using B-trees for one-dimensional range search:

(1) to get a combined search and output cost for queries of O(logB N + z) I/Os,

(2) to use only a linear amount (namely, O(n) blocks) of disk storage space, and

(3) to support dynamic updates in O(logB N) I/Os (in the case of dynamic data structures).

Criterion 1 combines the I/O cost Search(N) = O(logB N) of the search component of queries with the I/O cost Output(Z) = O(⌈z⌉) for reporting the Z output items. Combining the costs has the advantage that when one cost is much larger than the other, the query algorithm has the extra freedom to follow a filtering paradigm [Chazelle 1986], in which both the search component and the output reporting are allowed to use the larger number of I/Os. For example, to do queries optimally when Output(Z) is large with respect to Search(N), the search component can afford to be somewhat sloppy as long as it doesn't use more than O(z) I/Os, and when Output(Z) is relatively small, the Z output items do not need to reside compactly in only O(⌈z⌉) blocks. Filtering is an important design paradigm for many of the algorithms we discuss in this section.

We find in Section 11.7 that, under a fairly general computational model for general 2-D orthogonal queries, as pictured in Figure 9(d), it is impossible to satisfy Criteria 1 and 2 simultaneously. At least Ω(n(log n)/log(logB N + 1)) blocks of disk space must be used to achieve a query bound of O((logB N)^c + z) I/Os per query, for any constant c [Subramanian and Ramaswamy 1995]. Three natural questions arise.

Fig. 9. Different types of 2-D orthogonal range queries: (a) diagonal corner two-sided 2-D query (equivalent to a stabbing query; cf. Section 11.3); (b) two-sided 2-D query; (c) three-sided 2-D query; (d) general four-sided 2-D query.

—What sort of performance can be achieved when using only a linear amount of disk space? In Sections 11.1 and 11.2, we discuss some of the linear-space data structures used extensively in practice. None of them come close to satisfying Criteria 1 and 3 for range search in the worst case, but in typical-case scenarios they often perform well. We devote Section 11.2 to R-trees and their variants, which are the most popular general-purpose spatial structures developed to date.

—Since the lower bound applies only to general 2-D rectangular queries, are there any data structures that meet Criteria 1 to 3 for the important special cases of 2-D range searching pictured in Figures 9(a) through (c)? Fortunately the answer is yes. We show in Sections 11.3 and 11.4 how to use a bootstrapping paradigm to achieve optimal search and update performance.

—Can we meet Criteria 1 and 2 for general four-sided range searching if the disk space allowance is increased to O(n(log n)/log(logB N + 1)) blocks? Yes again! In Section 11.5, we show how to adapt the optimal structure for three-sided searching in order to handle general four-sided searching in optimal search cost. The update cost, however, is not known to be optimal.

In Section 11.6, we discuss other scenarios of range search dealing with three dimensions and nonorthogonal queries. We discuss the lower bounds for 2-D range searching in Section 11.7.

11.1. Linear-Space Spatial Structures

Grossi and Italiano [1999] construct an elegant multidimensional version of the B-tree called the cross tree. Using linear space, it combines the data-driven partitioning of weight-balanced B-trees (cf. Section 10.2) at the upper levels of the tree with the space-driven partitioning of methods like quad trees at the lower levels of the tree. For d > 1, d-dimensional orthogonal range queries can be done in O(n^(1−1/d) + z) I/Os, and inserts and deletes take O(logB N) I/Os. The O-tree of Kanth and Singh [1999] provides similar bounds. Cross trees also support the dynamic operations of cut and concatenate in O(n^(1−1/d)) I/Os. In some restricted models for linear-space data structures, the 2-D range search query performance of cross trees and O-trees can be considered to be optimal, although it is much larger than the logarithmic bound of Criterion 1.

One way to get multidimensional EM data structures is to augment known internal memory structures, such as quad trees and kd-trees, with block-access capabilities. Examples include kd-B-trees [Robinson 1981], buddy trees [Seeger and Kriegel 1990], and hB-trees [Evangelides et al. 1997; Lomet and Salzberg 1990]. Grid files [Hinrichs 1985; Krishnamurthy and Wang 1985; Nievergelt et al. 1984] are a flattened data structure for storing the cells of a two-dimensional grid in disk blocks. Another technique is to "linearize" the multidimensional space by imposing a total ordering on it (a so-called space-filling curve), and then the total order is used to organize the points into a B-tree [Gargantini 1982; Kamel and Faloutsos 1994; Orenstein and Merrett 1984]. Linearization can also be used to represent nonpoint data, in which the data items are partitioned into one or more multidimensional rectangular regions [Abel 1984; Orenstein 1989]. All the methods described in this paragraph use linear space, and they work well in certain situations; however, their worst-case range query performance is no better than that of cross trees, and for some methods, such as grid files, queries can require Θ(n) I/Os, even if there are no points satisfying the query. We refer the reader to Agarwal and Erickson [1999], Gaede and Gunther [1998], and Nievergelt and Widmayer [1997] for a broad survey of these and other interesting methods. Space-filling curves arise again in connection with R-trees, which we describe next.

11.2. R-Trees

The R-tree of Guttman [1984] and its many variants are a practical multidimensional generalization of the B-tree for storing a variety of geometric objects, such as points, segments, polygons, and polyhedra, using linear disk space. Internal nodes have degree Θ(B) (except possibly the root), and leaves store Θ(B) items. Each node in the tree has associated with it a bounding box (or bounding polygon) of all the items in its subtree. A big difference between R-trees and B-trees is that in R-trees the bounding boxes of sibling nodes are allowed to overlap. If an R-tree is being used for point location, for example, a point may lie within the bounding box of several children of the current node in the search. In that case the search must proceed to all such children.

In the dynamic setting, there are several popular heuristics for where to insert new items into an R-tree and how to rebalance it; see Agarwal and Erickson [1999], Gaede and Gunther [1998], and Greene [1989] for a survey. The R*-tree variant of Beckmann et al. [1990] seems to give the best overall query performance. To insert an item, we start at the root and recursively insert the item into the subtree whose bounding box would expand the least in order to accommodate the item. In case of a tie (e.g., if the item already fits inside the bounding boxes of two or more subtrees), we choose the subtree with the smallest resulting bounding box. In the normal R-tree algorithm, if a leaf node gets too many items or if an internal node gets too many children, we split it, as in B-trees. Instead, in the R*-tree algorithm, we remove a certain percentage of the items from the overflowing node and reinsert them into the tree. The items we choose to reinsert are the ones whose centroids are farthest from the center of the node's bounding box. This forced reinsertion tends to improve global organization and reduce query time. If the node still overflows after the forced reinsertion, we split it. The splitting heuristics try to partition the items into nodes so as to minimize intuitive measures such as coverage, overlap, or perimeter. During deletion, in both the normal R-tree and R*-tree algorithms, if a leaf node has too few items or if an internal node has too few children, we delete the node and reinsert all its items back into the tree by forced reinsertion.
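The insertion descent described above reduces to a simple area computation at each node. The following sketch is our own (with a minimal axis-aligned box representation, not taken from any R-tree library); it picks the child whose bounding box needs the least enlargement, breaking ties by the smaller resulting box.

```python
# Illustrative 2-D choose-subtree step for R-tree/R*-tree insertion.

def area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def enlarge(box, item_box):
    """Smallest box covering both rectangles."""
    return (min(box[0], item_box[0]), min(box[1], item_box[1]),
            max(box[2], item_box[2]), max(box[3], item_box[3]))

def choose_subtree(children, item_box):
    """children is a list of (child, bounding_box) pairs. Pick the child
    whose box grows the least; ties go to the smaller enlarged box."""
    def cost(entry):
        _, box = entry
        grown = enlarge(box, item_box)
        return (area(grown) - area(box),   # least enlargement first
                area(grown))               # then smallest resulting box
    return min(children, key=cost)[0]
```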

The rebalancing heuristics perform well in many practical scenarios, especially in low dimensions, but they result in poor worst-case query bounds. An interesting open problem is whether nontrivial query bounds can be proven for the "typical-case" behavior of R-trees for problems such as range searching and point location. Similar questions apply to the methods discussed in Section 11.1. New R-tree partitioning methods by de Berg et al. [2000] and Agarwal et al. [2001b] provide some provable bounds on overlap and query performance.

In the static setting, in which there are no updates, constructing the R*-tree by repeated insertions, one by one, is extremely slow. A faster alternative to the dynamic R-tree construction algorithms mentioned above is to bulk-load the R-tree in a bottom-up fashion [Abel 1984; Kamel and Faloutsos 1993; Orenstein 1989]. Such methods use some heuristic for grouping the items into leaf nodes of the R-tree, and then recursively build the nonleaf nodes from bottom to top. As an example, in the so-called Hilbert R-tree of Kamel and Faloutsos [1993], each item is labeled with the position of its centroid on the Peano–Hilbert space-filling curve, and a B+-tree is built upon the totally ordered labels in a bottom-up manner. Bulk loading a Hilbert R-tree is therefore easy to do once the centroid points are presorted. These static construction methods are very different in spirit from the dynamic insertion methods. The dynamic methods explicitly try to reduce the coverage, overlap, or perimeter of the bounding boxes of the R-tree nodes, and as a result, they usually achieve good query performance. The static construction methods do not consider the bounding box information at all. Instead, the hope is that the improved storage utilization (up to 100%) of these packing methods compensates for a higher degree of node overlap. A dynamic insertion method related to Kamel and Faloutsos [1993] was presented in Kamel and Faloutsos [1994]. The quality of the Hilbert R-tree in terms of query performance is generally not as good as that of an R*-tree, especially for higher-dimensional data [Berchtold et al. 1998; Kamel et al. 1996].
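A bulk load in this style amounts to sorting by a space-filling-curve key and packing nodes greedily, level by level. The sketch below is our own simplification: for brevity we substitute a Z-order (Morton) key for a true Hilbert key, which keeps the idea of sorting by spatial locality even though the Hilbert curve clusters somewhat better, and we assume nonnegative integer-valued coordinates.

```python
# Illustrative bottom-up bulk loading in the Hilbert R-tree style.
# Items are (bounding_box, payload) pairs; boxes are (x1, y1, x2, y2).

def centroid(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def morton_key(point, bits=16):
    # Stand-in for a true Hilbert-curve key: interleave the bits of the
    # quantized (assumed nonnegative) x and y coordinates (Z-order).
    x = int(point[0]) & ((1 << bits) - 1)
    y = int(point[1]) & ((1 << bits) - 1)
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

def cover(boxes):
    # Smallest box enclosing all the given boxes.
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def bulk_load(items, B):
    """Sort items by space-filling-curve key, pack B at a time into leaves,
    then repeatedly pack B nodes at a time until a single root remains."""
    items = sorted(items, key=lambda it: morton_key(centroid(it[0])))
    level = [{"box": cover([it[0] for it in chunk]), "entries": chunk}
             for chunk in (items[i:i + B] for i in range(0, len(items), B))]
    while len(level) > 1:
        level = [{"box": cover([n["box"] for n in chunk]), "entries": chunk}
                 for chunk in (level[i:i + B] for i in range(0, len(level), B))]
    return level[0]
```

Once the items are presorted, the packing passes read and write each level sequentially, which is why these bottom-up methods are so fast and achieve near-100% storage utilization.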

Fig. 10. Costs for R-tree processing (in units of 1000 I/Os) using the naive repeated insertion method and the buffer R-tree for various buffer sizes: (a) cost for bulk-loading the R-tree; (b) query cost.

Table V. Summary of the Costs (in Number of I/Os) for R-Tree Updates and Queries.
"Packing" Refers to the Percentage Storage Utilization

                              Update with 50% of the Data
Data Set   Update Method     Building     Querying    Packing
RI         naive               259,263        6,670       64%
           Hilbert              15,865        7,262       92%
           buffer               13,484        5,485       90%
CT         naive               805,749       40,910       66%
           Hilbert              51,086       40,593       92%
           buffer               42,774       37,798       90%
NJ         naive             1,777,570       70,830       66%
           Hilbert             120,034       69,798       92%
           buffer              101,017       65,898       91%
NY         naive             3,736,601      224,039       66%
           Hilbert             246,466      230,990       92%
           buffer              206,921      227,559       90%


In order to get the best of both worlds—the query performance of R*-trees and the bulk construction efficiency of Hilbert R-trees—Arge et al. [1999a] and van den Bercken et al. [1997] independently devised fast bulk loading methods based upon buffer trees that do top-down construction in O(n logm n) I/Os, which matches the performance of the bottom-up methods within a constant factor. The former method is especially efficient and supports dynamic batched updates and queries. In Figure 10 and Table V, we report on some experiments that test the construction, update, and query performance of various R-tree methods. The experimental data came from TIGER/line data sets from four US states [TIGER 1992]; the implementations were done using the TPIE system, described in Section 14.

Figure 10 compares the construction cost for building R-trees and the resulting query performance in terms of I/Os for the naive sequential method for construction into R*-trees (labeled "naive") and the newly developed buffer R*-tree method [Arge et al. 1999a] (labeled "buffer"). An R-tree was constructed on the TIGER road data for each state and for each of four possible buffer sizes. The four buffer sizes were capable of storing 0, 600, 1250, and 5000 rectangles, respectively; buffer size 0 corresponds to the naive method and the larger buffers correspond to the buffer method. The query performance of each resulting R-tree was measured by posing rectangle intersection queries using rectangles taken from TIGER hydrographic data. The results, depicted in Figure 10, show that buffer R*-trees, even with relatively small buffers, achieve a tremendous speedup in number of I/Os for construction without any worsening in query performance, compared with the naive method. The CPU costs of the two methods are comparable. The storage utilization of buffer R*-trees tends to be in the 90% range, as opposed to roughly 70% for the naive method.

Bottom-up methods can build R-trees even more quickly and more compactly, but they generally do not support bulk dynamic operations, which is a big advantage of the buffer tree approach. Kamel et al. [1996] develop a way to do bulk updates with Hilbert R-trees, but at a cost in terms of query performance. Table V compares dynamic update methods for the naive method, for buffer R-trees, and for Hilbert R-trees [Kamel et al. 1996] (labeled "Hilbert"). A single R-tree was built for each of the four US states, containing 50% of the road data objects for that state. Using each of the three algorithms, the remaining 50% of the objects were inserted into the R-tree, and the construction time was measured. Query performance was then tested as before. The results in Table V indicate that the buffer R*-tree and the Hilbert R-tree achieve a similar degree of packing, but the buffer R*-tree provides better update and query performance.

11.3. Bootstrapping for 2-D Diagonal Corner and Stabbing Queries

An obvious paradigm for developing an efficient dynamic EM data structure, given an existing data structure that works well when the problem fits into internal memory, is to "externalize" the internal memory data structure. If the internal memory data structure uses a binary tree, then a multiway tree such as a B-tree must be used instead. However, when searching a B-tree, it can be difficult to report the outputs in an output-sensitive manner. For example, in certain searching applications, each of the Θ(B) subtrees of a given node in a B-tree may contribute one item to the query output, and as a result each subtree may need to be explored (costing several I/Os) just to report a single output item.

Fortunately, we can sometimes achieve output-sensitive reporting by augmenting the data structure with a set of filtering substructures, each of which is a data structure for a smaller version of the same problem. We refer to this approach, which we explain shortly in more detail, as the bootstrapping paradigm. Each substructure typically needs to store only O(B^2) items and to answer queries in O(logB B^2 + Z′/B) = O(⌈Z′/B⌉) I/Os, where Z′ is the number of items reported. A substructure can even be static if it can be constructed in O(B) I/Os, since we can keep updates in a separate buffer and do a global rebuilding in O(B) I/Os whenever there are Θ(B) updates. Such a rebuilding costs O(1) I/Os (amortized) per update. We can often remove the amortization and make it worst-case by using the weight-balanced B-trees of Section 10.2 as the underlying B-tree structure.

Arge and Vitter [1996] first uncovered the bootstrapping paradigm while designing an optimal dynamic EM data structure for diagonal corner two-sided 2-D queries (see Figure 9(a)) that meets all three design criteria listed in Section 11. Diagonal corner two-sided queries are equivalent to stabbing queries, which have the form: "Given a set of one-dimensional intervals, report all the intervals 'stabbed' by the query value x." (That is, report all intervals that contain x.) A diagonal corner query x on a set of 2-D points {(a1, b1), (a2, b2), . . .} is equivalent to a stabbing query x on the set of closed intervals {[a1, b1], [a2, b2], . . .}.

Fig. 11. Internal node v of the EM interval tree, for B = 64 with √B = 8 slabs. Node v is the lowest node in the tree completely containing the indicated interval. The middle piece of the interval is stored in the multislab list corresponding to slabs 3 to 5. (The multislab lists are not pictured.) The left and right end pieces of the interval are stored in the left-ordered list of slab 2 and the right-ordered list of slab 6, respectively.

The EM data structure for stabbing queries is a multiway version of the well-known interval tree data structure [Edelsbrunner 1983a; 1983b] for internal memory, which supports stabbing queries in O(log N + Z) CPU time and updates in O(log N) CPU time and uses O(N) space. We can externalize it by using a weight-balanced B-tree as the underlying base tree, where the nodes have degree Θ(√B). Each node in the base tree corresponds in a natural way to a one-dimensional range of x-values; its Θ(√B) children correspond to subranges called slabs, and the Θ((√B)^2) = Θ(B) contiguous sets of slabs are called multislabs, as in Section 7 for a similar batched problem. Each input interval is stored in the lowest node v in the base tree whose range completely contains the interval. The interval is decomposed by v's Θ(√B) slabs into at most three pieces: the middle piece that completely spans one or more slabs of v, the left end piece that partially protrudes into a slab of v, and the right end piece that partially protrudes into another slab of v, as shown in Figure 11. The three pieces are stored in substructures of v. In the example in Figure 11, the middle piece is stored in a list associated with the multislab it spans (corresponding to the contiguous range of slabs 3 to 5), the left end piece is stored in a one-dimensional list for slab 2 ordered by left endpoint, and the right end piece is stored in a one-dimensional list for slab 6 ordered by right endpoint.
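The decomposition step is purely arithmetic once a node's slab boundaries are known. The helper below is our own illustration (with invented names); it splits an interval into the left end piece, middle multislab piece, and right end piece described above.

```python
# Illustrative decomposition of an interval at an interval-tree node whose
# slabs are the half-open ranges [boundaries[i], boundaries[i+1]).

from bisect import bisect_right

def decompose(interval, boundaries):
    """Return (left_slab, middle_multislab, right_slab) for an interval
    [lo, hi] contained in the node's range. Each end piece is the index of
    the slab it protrudes into; the multislab is the (first, last) pair of
    slabs fully spanned, or None if no slab is fully spanned."""
    lo, hi = interval
    left_slab = bisect_right(boundaries, lo) - 1
    right_slab = bisect_right(boundaries, hi) - 1
    if right_slab <= left_slab + 1:
        middle = None           # interval touches at most two adjacent slabs
    else:
        middle = (left_slab + 1, right_slab - 1)
    return left_slab, middle, right_slab

# Example with 8 slabs of width 10 (boundaries 0, 10, ..., 80): the interval
# [23, 61] has its left piece in slab 2, spans the multislab (3, 5), and has
# its right piece in slab 6, matching the example in Figure 11.
```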

Given a query value x, the intervals stabbed by x reside in the substructures of the nodes of the base tree along the search path from the root to the leaf for x. For each such node v, we consider each of v's multislabs that contains x and report all the intervals in the multislab list. We also walk sequentially through the right-ordered and left-ordered lists for the slab of v that contains x, reporting intervals in an output-sensitive way.

The big problem with this approach is that we have to spend at least one I/O per multislab containing x, regardless of how many intervals are in the multislab lists. For example, there may be Θ(B) such multislab lists, with each list containing only a few stabbed intervals (or worse yet, none at all). The resulting query performance will be highly nonoptimal. The solution, according to the bootstrapping paradigm, is to use a substructure in each node consisting of an optimal static data structure for a smaller version of the same problem; a good choice is the corner data structure developed by Kanellakis et al. [1996]. The corner substructure in this case is used to store all the intervals from the "sparse" multislab lists, namely, those that contain fewer than B intervals, and thus the substructure contains only O(B^2) intervals. When visiting node v, we access only v's nonsparse multislab lists, each of which contributes Z′ ≥ B intervals to the output, at an output-sensitive cost of O(Z′/B) I/Os, for some Z′. The remaining Z′′ stabbed intervals stored in v can be found by a single query to v's corner substructure, at a cost of O(logB B^2 + Z′′/B) = O(⌈Z′′/B⌉) I/Os. Since there are O(logB N) nodes along the search path in the base tree, the total collection of Z stabbed intervals is reported in O(logB N + z) I/Os, which is optimal. Using a weight-balanced B-tree as the underlying base tree allows the static substructures to be rebuilt in worst-case optimal I/O bounds.

Stabbing queries are important because, when combined with one-dimensional range queries, they provide a solution to dynamic interval management, in which one-dimensional intervals can be inserted and deleted, and intersection queries can be performed. These operations support indexing of one-dimensional constraints in constraint databases. Other applications of stabbing queries arise in graphics and GIS. For example, Chiang and Silva [1999] apply the EM interval tree structure to extract at query time the boundary components of the isosurface (or contour) of a surface. A data structure for a related problem, which in addition has optimal output complexity, appears in Agarwal et al. [1998b]. The above bootstrapping approach also yields dynamic EM segment trees with optimal query and update bound and O(n logB N)-block space usage.

11.4. Bootstrapping for Three-Sided Orthogonal 2-D Range Search

Arge et al. [1999b] provide another example of the bootstrapping paradigm by developing an optimal dynamic EM data structure for three-sided orthogonal 2-D range searching (see Figure 9(c)) that meets all three design criteria. In internal memory, the optimal structure is the priority search tree [McCreight 1985], which answers three-sided range queries in O(log N + Z) CPU time, does updates in O(log N) CPU time, and uses O(N) space. The EM structure of Arge et al. [1999b] is an externalization of the priority search tree, using a weight-balanced B-tree as the underlying base tree. Each node in the base tree corresponds to a one-dimensional range of x-values, and its Θ(B) children correspond to subranges consisting of vertical slabs. Each node v contains a small substructure called a child cache that supports three-sided queries. Its child cache stores the "Y-set" Y(w) for each of the Θ(B) children w of v. The Y-set Y(w) for child w consists of the highest Θ(B) points in w's slab that are not already stored in the child cache of some ancestor of v. There are thus a total of Θ(B^2) points stored in v's child cache.

We can answer a three-sided query of the form [x1, x2] × [y1, +∞) by visiting a set of nodes in the base tree, starting with the root. For each visited node v, we pose the query [x1, x2] × [y1, +∞) to v's child cache and output the results. The following rules are used to determine which of v's children to visit. We visit v's child w if either

(1) w is along the leftmost search path for x1 or the rightmost search path for x2 in the base tree, or

(2) the entire Y-set Y(w) is reported when v's child cache is queried.

(See Figure 12.) There are O(logB N) nodes w that are visited because of Rule 1. When Rule 1 is not satisfied, Rule 2 provides an effective filtering mechanism to guarantee output-sensitive reporting. The I/O cost for initially accessing a child node w can be charged to the Θ(B) points of Y(w) reported from v's child cache; conversely, if not all of Y(w) is reported, then the points stored in w's subtree will be too low to satisfy the query, and there is no need to visit w. (See Figure 12(b).) Provided that each child cache can be queried in O(1) I/Os plus the output-sensitive cost to output the points satisfying the query, the resulting overall query time is O(logB N + z), as desired.
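The visiting rules translate into a short recursive procedure. The sketch below is our own toy paraphrase of that control flow (with invented names and the child caches modeled as plain point lists), not an implementation of the actual structure of Arge et al. [1999b].

```python
# Toy model of the EM priority search tree query control flow.
# Node.y_sets[i] caches, for children[i], its Y-set: the highest points of
# that child's slab not already cached at an ancestor.

class Node:
    def __init__(self, x_range, children=None, y_sets=None, points=None):
        self.x_range = x_range        # (xlo, xhi) covered by this node
        self.children = children or []
        self.y_sets = y_sets or []    # y_sets[i] = cached points of children[i]
        self.points = points or []    # points stored at a leaf

def three_sided_query(v, x1, x2, y1, out):
    """Report points in [x1, x2] x [y1, +inf) from the subtree of v."""
    if not v.children:                # leaf: scan its points directly
        out += [p for p in v.points if x1 <= p[0] <= x2 and p[1] >= y1]
        return
    for w, y_set in zip(v.children, v.y_sets):
        lo, hi = w.x_range
        if hi < x1 or lo > x2:        # slab disjoint from the query x-range
            continue
        hits = [p for p in y_set if x1 <= p[0] <= x2 and p[1] >= y1]
        out += hits                   # reported straight from v's child cache
        on_boundary = lo <= x1 <= hi or lo <= x2 <= hi       # Rule 1
        if on_boundary or len(hits) == len(y_set):           # Rule 2
            three_sided_query(w, x1, x2, y1, out)
        # Otherwise some cached point of Y(w) failed the query, so the
        # (lower) points in w's subtree cannot qualify; skip w entirely.
```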

Fig. 12. Internal node v of the EM priority search tree, with slabs (children) w1, w2, . . . , w5. The Y-sets of each child, which are stored collectively in v's child cache, are indicated by the bold points. (a) The three-sided query is completely contained in the x-range of w2. The relevant (bold) points are reported from v's child cache, and the query is recursively answered in w2. (b) The three-sided query spans several slabs. The relevant (bold) points are reported from v's child cache, and the query is recursively answered in w2, w3, and w5. The query is not extended to w4 in this case because not all of its Y-set Y(w4) (stored in v's child cache) satisfies the query, and as a result none of the points stored in w4's subtree can satisfy the query.

All that remains is to show how to query a child cache in a constant number of I/Os, plus the output-sensitive cost. Arge et al. [1999b] provide an elegant and optimal static data structure for three-sided range search, which can be used in the EM priority search tree described above to implement the child caches of size O(B^2). The static structure is a persistent B-tree optimized for batched construction. When used for O(B^2) points, it occupies O(B) blocks, can be built in O(B) I/Os, and supports three-sided queries in O(⌈Z′/B⌉) I/Os per query, where Z′ is the number of points reported. The static structure is so simple that it may be useful in practice on its own.

Both the three-sided structure developed by Arge et al. [1999b] and the structure for two-sided diagonal queries discussed in Section 11.3 satisfy Criteria 1 to 3 of Section 11. So in a sense, the three-sided query structure subsumes the diagonal two-sided structure, since three-sided queries are more general. However, diagonal two-sided structures may prove to be faster in practice, because in each of its corner substructures, the data accessed during a query are always in contiguous blocks, whereas the static substructures used in three-sided search do not guarantee block contiguity. Empirical work is ongoing to evaluate the performance of these data structures.

On a historical note, earlier work on two- and three-sided queries was done by Ramaswamy and Subramanian [1994] using the notion of path caching; their structure met Criterion 1 but had higher storage overheads and amortized and/or nonoptimal update bounds. Subramanian and Ramaswamy [1995] subsequently developed the p-range tree data structure for three-sided queries, with optimal linear disk space and nearly optimal query and amortized update bounds.

11.5. General Orthogonal 2-D Range Search

The dynamic data structure for three-sided range searching can be generalized using the filtering technique of Chazelle [1986] to handle general four-sided queries with optimal I/O query bound O(logB N + z) and optimal disk space usage O(n(log n)/log(logB N + 1)) [Arge et al. 1999b]. The update bound becomes O((logB N)(log n)/log(logB N + 1)), which may not be optimal.

The outer level of the structure is a balanced (logB N + 1)-way 1-D search tree with Θ(n) leaves, oriented, say, along the x-dimension. It therefore has about (log n)/log(logB N + 1) levels. At each level of the tree, each input point is stored in four substructures (described below) that are associated with the particular tree node at that level that spans the x-value of the point. The space and update bounds quoted above follow from the fact that the substructures use linear space and can be updated in O(logB N) I/Os.

To search for the points in a four-sided query rectangle [x1, x2] × [y1, y2], we decompose the four-sided query in the following natural way into two three-sided queries, a stabbing query, and logB N − 1 list traversals. We find the lowest node v in the tree whose x-range contains [x1, x2]. If v is a leaf, we can answer the query in a single I/O. Otherwise we query the substructures stored in those children of v whose x-ranges intersect [x1, x2]. Let 2 ≤ k ≤ logB N + 1 be the number of such children. The range query when restricted to the leftmost such child of v is a three-sided query of the form [x1, +∞] × [y1, y2], and when restricted to the rightmost such child of v, the range query is a three-sided query of the form [−∞, x2] × [y1, y2]. Two of the substructures at each node are devoted to three-sided queries of these types; using the linear-sized data structures of Arge et al. [1999b] in Section 11.4, each such query can be done in O(logB N + z) I/Os.

For the k − 2 intermediate children of v, their x-ranges are completely contained inside the x-range of the query rectangle, and thus we need only do k − 2 list traversals in y-order and retrieve the points whose y-values are in the range [y1, y2]. If we store the points in each node in y-order (in the third type of substructure), the Z′ output points from a node can be found in O(⌈Z′/B⌉) I/Os, once a starting point in the linear list is found. We can find all k − 2 starting points via a single query to a stabbing query substructure S associated with v. (This structure is the fourth type of substructure.) For each two y-consecutive points (ai, bi) and (ai+1, bi+1) associated with a child of v, we store the y-interval [bi, bi+1] in S. Note that S contains intervals contributed by each of the logB N + 1 children of v. By a single stabbing query with query value y1, we can thus identify the k − 2 starting points in only O(logB N) I/Os [Arge and Vitter 1996], as described in Section 11.3. (We actually get starting points for all the children of v, not just the k − 2 ones of interest, but we can discard the starting points we don't need.) The total number of I/Os to answer the range query is thus O(logB N + z), which is optimal.
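The decomposition at node v can be summarized in a few lines of control flow. The toy model below is ours only: each child is a dict holding its x-range and its points sorted by y, and naive scans stand in for the three-sided, stabbing, and y-ordered list substructures of the real data structure.

```python
# Toy model of the four-sided decomposition at the lowest node v whose
# x-range contains [x1, x2]; assumes k >= 2 children intersect the query.

import bisect

def three_sided(points, xlo, xhi, y1, y2):
    # Stand-in for a query to a three-sided substructure.
    return [p for p in points if xlo <= p[0] <= xhi and y1 <= p[1] <= y2]

def four_sided_query(children, x1, x2, y1, y2):
    kids = [w for w in children if w["xhi"] >= x1 and w["xlo"] <= x2]
    assert len(kids) >= 2, "v is the lowest node whose x-range contains [x1, x2]"
    out = []
    # Leftmost and rightmost intersecting children: one three-sided query each.
    out += three_sided(kids[0]["points"], x1, float("inf"), y1, y2)
    out += three_sided(kids[-1]["points"], float("-inf"), x2, y1, y2)
    # Intermediate children lie wholly inside [x1, x2]: walk each y-ordered
    # list from a starting point (found via the stabbing structure S in the
    # real data structure, via binary search here) until y exceeds y2.
    for w in kids[1:-1]:
        ys = [p[1] for p in w["points"]]
        start = bisect.bisect_left(ys, y1)
        for p in w["points"][start:]:
            if p[1] > y2:
                break
            out.append(p)
    return out
```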

11.6. Other Types of Range Search

For other types of range searching, such as in higher dimensions and for nonorthogonal queries, different filtering techniques are needed. So far, relatively little work has been done, and many open problems remain.

Vengroff and Vitter [1996a] develop the first theoretically near-optimal EM data structure for static three-dimensional orthogonal range searching. They create a hierarchical partitioning in which all the points that dominate a query point are densely contained in a set of blocks. Compression techniques are needed to minimize disk storage. With some recent modifications [Vitter and Vengroff 1999], queries can be done in O(logB N + z) I/Os, which is optimal, and the space usage is O(n(log n)^(k+1)/(log(logB N + 1))^k) disk blocks to support (3 + k)-sided 3-D range queries, in which k of the dimensions (0 ≤ k ≤ 3) have finite ranges. The result also provides optimal O(log N + Z)-time query performance for three-sided 3-D queries in the (internal memory) RAM model, but using O(N log N) space.

By the reduction in Chazelle and Edelsbrunner [1987], a data structure for three-sided 3-D queries also applies to 2-D homothetic range search, in which the queries correspond to scaled and translated (but not rotated) transformations of an arbitrary fixed polygon. An interesting special case is "fat" orthogonal 2-D range search, where the query rectangles are required to have bounded aspect ratio. For example, every rectangle with bounded aspect ratio can be covered by two overlapping squares. An interesting open problem is to develop linear-sized optimal data structures for fat orthogonal 2-D range search. By the reduction, one possible approach would be to develop optimal linear-sized data structures for three-sided 3-D range search.

Agarwal et al. [1998a] consider halfspace range searching, in which a query is specified by a hyperplane and a bit indicating one of its two sides, and the output of the query consists of all the points on that side of the hyperplane. They give various data structures for halfspace range searching in two, three, and higher dimensions, including one that works for simplex (polygon) queries in two dimensions, but with a higher query I/O cost. They have subsequently improved the storage bounds for halfspace range queries in two dimensions to obtain an optimal static data structure satisfying Criteria 1 and 2 of Section 11.

The number of I/Os needed to build the data structures for 3-D orthogonal range search and halfspace range search is rather large (more than Ω(N)). Still, the structures shed useful light on the complexity of range searching and may open the way to improved solutions. An open problem is to design efficient construction and update algorithms and to improve upon the constant factors.

Callahan et al. [1995] develop dynamic EM data structures for several online problems in d dimensions. For any fixed ε > 0, they can find an approximate nearest neighbor of a query point (within a 1 + ε factor of optimal) in O(logB N) I/Os; insertions and deletions can also be done in O(logB N) I/Os. They use a related approach to maintain the closest pair of points; each update costs O(logB N) I/Os. Govindarajan et al. [2000] achieve the same bounds for closest pair by maintaining a well-separated pair decomposition. For finding nearest neighbors and approximate nearest neighbors, two other approaches are partition trees [Agarwal et al. 1998a; 2000] and locality-sensitive hashing [Gionis et al. 1999]. Numerous other data structures have been developed for range queries and related problems on spatial data. We refer to Agarwal and Erickson [1999], Gaede and Gunther [1998], and Nievergelt and Widmayer [1997] for a broad survey.

11.7. Lower Bounds for Orthogonal Range Search

We mentioned in Section 11 that Subramanian and Ramaswamy [1995] prove that no EM data structure for 2-D range searching can achieve design Criterion 1 using less than O(n(log n)/ log(logB N + 1)) disk blocks, even if we relax the criterion to allow O((logB N)^c + z) I/Os per query, for any constant c. The result holds for an EM version of the pointer machine model, based upon the approach of Chazelle [1990] for the internal memory model.

Hellerstein et al. [1997] consider a generalization of the layout-based lower bound argument of Kanellakis et al. [1996] for studying the tradeoff between disk space usage and query performance. They develop a model for indexability, in which an "efficient" data structure is expected to contain the Z output points to a query compactly within O(⌈Z/B⌉) = O(⌈z⌉) blocks. One shortcoming of the model is that it considers only data layout and ignores the search component of queries, and thus it rules out the important filtering paradigm discussed earlier in Section 11. For example, it is reasonable for any query algorithm to perform at least logB N I/Os, so if the output size Z is at most B, an algorithm may still be able to satisfy Criterion 1 even if the output is contained within O(logB N) blocks rather than O(z) = O(1) blocks. Arge et al. [1999b] modify the model to rederive the same nonlinear space lower bound O(n(log n)/ log(logB N + 1)) of Subramanian and Ramaswamy [1995] for 2-D range searching by considering only output sizes Z larger than (logB N)^c B, for which the number of blocks allowed to hold the outputs is Z/B = O((logB N)^c + z). This approach ignores the complexity of how to find the relevant blocks, but as mentioned in Section 11.5 the authors separately provide an optimal 2-D range search data structure that uses the same amount of disk space and does queries in the optimal O(logB N + z) I/Os. Thus, despite its shortcomings, the indexability model is elegant and can provide much insight into the complexity of blocking data in external memory. Further results in this model appear in Koutsoupias and Taylor [1998] and Samoladas and Miranker [1998].

One intuition from the indexability model is that less disk space is needed to efficiently answer 2-D queries when the queries have bounded aspect ratio (i.e., when the ratio of the longest side length to the shortest side length of the query rectangle is bounded). An interesting question is whether R-trees and the linear-space structures of Sections 11.1 and 11.2 can be shown to perform provably well for such queries. Another interesting scenario is where the queries correspond to snapshots of the continuous movement of a sliding rectangle.

When the data structure is restricted to contain only a single copy of each point, Kanth and Singh [1999] show for a restricted class of index-based trees that d-dimensional range queries in the worst case require Ω(n^(1−1/d) + z) I/Os, and they provide a data structure with a matching bound. Another approach to achieve the same bound is the cross tree data structure [Grossi and Italiano 1999] mentioned in Section 11.1, which in addition supports the operations of cut and concatenate.

12. DYNAMIC AND KINETIC DATA STRUCTURES

In this section we consider two scenarios where data items change: dynamic (in which items are inserted and deleted) and kinetic (in which the data items move continuously along specified trajectories). In both cases, queries can be done at any time. It is often useful for kinetic data structures to allow insertions and deletions; for example, if the trajectory of an item changes, we must delete the old trajectory and insert the new one.

12.1. Logarithmic Method for Decomposable Search Problems

In Sections 9 to 11 we've already encountered several dynamic data structures for the problems of dictionary lookup and range search. In Section 11, we saw how to develop optimal EM range search data structures by externalizing some known internal memory data structures. The key idea was to use the bootstrapping paradigm, together with weight-balanced B-trees as the underlying data structure, in order to consolidate several static data structures for small instances of range searching into one dynamic data structure for the full problem. The bootstrapping technique is specific to the particular data structures involved. In this section we look at another technique that is based upon the properties of the problem itself rather than upon that of the data structure.

We call a problem decomposable if we can answer a query by querying individual subsets of the problem data and then computing the final result from the solutions to each subset. Dictionary search and range searching are obvious examples of decomposable problems. Bentley developed the logarithmic method [Bentley and Saxe 1980; Overmars 1983] to convert efficient static data structures for decomposable problems into general dynamic ones. In the internal memory setting, the logarithmic method consists of maintaining a series of static substructures, at most one each of size 1, 2, 4, 8, . . . . When a new item is inserted, it is initialized in a substructure of size 1. If a substructure of size 1 already exists, the two substructures are combined into a single substructure of size 2. If there is already a substructure of size 2, they in turn are combined into a single substructure of size 4, and so on. For the current value of N, it is easy to see that the kth substructure (i.e., of size 2^k) is present exactly when the kth bit in the binary representation of N is 1. Since there are at most log N substructures, the search time bound is log N times the search time per substructure. As the number of items increases from 1 to N, the kth structure is built a total of N/2^k times (assuming N is a power of 2). If it can be built in O(2^k) time, the total time for all insertions and all substructures is thus O(N log N), making the amortized insertion time O(log N). If we use up to three substructures of size 2^k at a time, we can do the reconstructions in advance and convert the amortized update bounds to worst-case [Overmars 1983].
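To make the internal memory version concrete, here is a minimal sketch of the logarithmic method for a decomposable problem (membership queries); the class name and the use of a sorted array as the static substructure are illustrative choices, not code from any of the cited papers.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>
#include <iostream>

// Logarithmic method for a decomposable problem (here: membership search).
// Level k holds either nothing or a *static* sorted array of exactly 2^k items;
// an insertion carries upward like incrementing a binary counter.
class LogMethodDictionary {
    std::vector<std::vector<int>> levels;   // levels[k] is empty or has 2^k items
public:
    void insert(int x) {
        std::vector<int> carry{x};          // new static substructure of size 1
        for (std::size_t k = 0; ; ++k) {
            if (k == levels.size()) levels.emplace_back();
            if (levels[k].empty()) {        // free slot: install the rebuilt structure
                levels[k] = std::move(carry);
                return;
            }
            // Combine two size-2^k substructures into one of size 2^(k+1).
            std::vector<int> merged;
            std::merge(levels[k].begin(), levels[k].end(),
                       carry.begin(), carry.end(), std::back_inserter(merged));
            levels[k].clear();
            carry = std::move(merged);
        }
    }
    // Decomposability: query every substructure and combine the answers.
    bool contains(int x) const {
        for (const auto& lvl : levels)
            if (!lvl.empty() && std::binary_search(lvl.begin(), lvl.end(), x))
                return true;
        return false;
    }
};

int main() {
    LogMethodDictionary d;
    for (int x : {5, 3, 9, 1, 7}) d.insert(x);
    std::cout << d.contains(7) << " " << d.contains(4) << "\n";  // prints: 1 0
}
```

As in the text, this basic form supports only insertions; deletions require extra machinery such as global rebuilding.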

In the EM setting, in order to eliminate the dependence upon the binary logarithm in the I/O bounds, the number of substructures must be reduced from log N to logB N, and thus the maximum size of the kth substructure must be increased from 2^k to B^k. As the number of items increases from 1 to N, the kth substructure has to be built NB/B^k times (when N is a power of B), each time taking O(B^k (logB N)/B) I/Os. The key point is that the extra factor of B in the numerator of the first term is cancelled by the factor of B in the denominator of the second term, and thus the resulting total insertion time over all N insertions and all logB N structures is O(N(logB N)^2) I/Os, which is O((logB N)^2) I/Os amortized per insertion. By global rebuilding we can do deletions in O(logB N) I/Os amortized. As in the internal memory case, the amortized updates can typically be made worst-case.
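As a sanity check on the arithmetic above (a worked restatement of the calculation in the previous paragraph, not a new result), the per-level and amortized costs can be written out as follows:

```latex
% Rebuild cost of the k-th substructure (size B^k) over N insertions:
\underbrace{\frac{NB}{B^k}}_{\text{rebuilds}}\cdot
\underbrace{O\!\left(\frac{B^k \log_B N}{B}\right)}_{\text{I/Os per rebuild}}
  \;=\; O(N \log_B N)\ \text{I/Os per level},
\qquad
\frac{1}{N}\sum_{k=1}^{\log_B N} O(N \log_B N)
  \;=\; O\bigl((\log_B N)^2\bigr)\ \text{I/Os amortized per insertion.}
```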

Arge and Vahrenhold [2000] obtain I/O bounds for dynamic point location in general planar subdivisions similar to those of Agarwal et al. [1999], but without use of level-balanced trees. Their method uses a weight-balanced base structure at the outer level and a multislab structure for storing segments similar to that of Arge and Vitter [1996] described in Section 11.3. They use the logarithmic method to construct a data structure to answer vertical rayshooting queries in the multislab structures. The method relies upon a total ordering of the segments, but such an ordering can be changed drastically by a single insertion. However, each substructure in the logarithmic method is static (until it is combined with another substructure), and thus a static total ordering can be used for each substructure. The authors show by a type of fractional cascading that the queries in the logB N substructures do not have to be done independently, which saves a factor of logB N in the I/O cost, but at the cost of making the data structures amortized instead of worst case.

Agarwal et al. [2001a] apply the logarithmic method (in both the binary form and B-way variant) to get EM versions of kd-trees, BBD trees, and BAR trees.

12.2. Continuously Moving Items

Early work on temporal data generally concentrated on time-series or multiversion data [Salzberg and Tsotras 1999]. A question of growing interest in this mobile age is how to store and index continuously moving items, such as mobile telephones, cars, and airplanes (e.g., see Jensen and Theodoridis [2000], Saltenis et al. [2000], and Wolfson et al. [1999]). There are two main approaches to storing moving items. The first technique is to use the same sort of data structure as for nonmoving data, but to update it whenever items move sufficiently far so as to trigger important combinatorial events that are relevant to the application at hand [Basch et al. 1999]. For example, an event relevant for range search might be triggered when two items move to the same horizontal displacement (which happens when the x-ordering of the two items is about to switch). A different approach is to store each item's location and speed trajectory, so that no updating is needed as long as the item's trajectory plan does not change. Such an approach requires fewer updates, but the representation for each item generally has higher dimension, and the search strategies are therefore less efficient.
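A minimal sketch of the second (trajectory-based) representation, assuming one-dimensional linear motion; the struct and its fields are illustrative only:

```cpp
#include <iostream>

// A point moving at constant speed along a line: the trajectory is stored once,
// and no update is needed unless the motion plan itself changes.
struct MovingPoint1D {
    double x0;    // position at reference time t0
    double v;     // speed (units per time unit)
    double t0;    // reference time

    double positionAt(double t) const { return x0 + v * (t - t0); }
};

// Time-slice range query against the stored trajectory: is the point inside
// [lo, hi] at time t?  (A moving-object index answers such queries without
// scanning every stored trajectory.)
bool inRangeAt(const MovingPoint1D& p, double lo, double hi, double t) {
    double x = p.positionAt(t);
    return lo <= x && x <= hi;
}

int main() {
    MovingPoint1D p{2.0, 0.5, 0.0};                         // starts at x=2, moves +0.5 per unit time
    std::cout << inRangeAt(p, 5.0, 10.0, 4.0) << " "        // at t=4, x=4 -> 0
              << inRangeAt(p, 5.0, 10.0, 8.0) << "\n";      // at t=8, x=6 -> 1
}
```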

Kollios et al. [1999] developed a linear-space indexing scheme for moving points along a (one-dimensional) line, based upon the notion of partition trees. Their structure supports a variety of range search and approximate nearest neighbor queries. For example, given a range and time, the points in that range at the indicated time can be retrieved in O(n^(1/2+ε) + k) I/Os, for arbitrarily small ε > 0. Updates require O((log n)^2) I/Os. Agarwal et al. [2000] extend the approach to handle range searches in two dimensions, and they improve the update bound to O((logB n)^2) I/Os. They also propose an event-driven data structure with the same query times as the range search data structure of Arge et al. [1999b] discussed in Section 11.5, but with the potential need to do many updates. A hybrid data structure combining the two approaches permits a tradeoff between query performance and update frequency.

R-trees offer a practical generic mechanism for storing multidimensional points and are thus a natural alternative for storing mobile items. One approach is to represent time as a separate dimension and to cluster trajectories using the R-tree heuristics. However, the orthogonal nature of the R-tree does not lend itself well to diagonal trajectories. For the case of points moving along linear trajectories, Saltenis et al. [2000] build the R-tree upon only the spatial dimensions, but parameterize the bounding box coordinates to account for the movement of the items stored within. They maintain an outer approximation of the true bounding box, which they periodically update to refine the approximation. Agarwal and Har-Peled [2001] show how to maintain a provably good approximation of the minimum bounding box with need for only a constant number of refinement events.
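The following sketch illustrates the outer-approximation idea behind such time-parameterized bounding boxes, shown for a single coordinate; the names (Moving1D, TimeParamBound) are invented for illustration, and this is a schematic of the concept rather than the actual structure of Saltenis et al.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// One coordinate of a linearly moving item: position x0 + v*(t - t0).
struct Moving1D { double x0, v; };

// A conservative, time-parameterized bounding interval for a group of moving
// items, valid for all t >= t0: the lower bound moves with the slowest velocity
// and the upper bound with the fastest, so the true extent is always contained
// in [lo(t), hi(t)].  The approximation widens over time and is periodically
// recomputed (a "refinement event").
struct TimeParamBound {
    double lo0, hi0, vlo, vhi, t0;

    // Assumes pts is nonempty.
    static TimeParamBound build(const std::vector<Moving1D>& pts, double t0) {
        TimeParamBound b{pts[0].x0, pts[0].x0, pts[0].v, pts[0].v, t0};
        for (const auto& p : pts) {
            b.lo0 = std::min(b.lo0, p.x0);  b.hi0 = std::max(b.hi0, p.x0);
            b.vlo = std::min(b.vlo, p.v);   b.vhi = std::max(b.vhi, p.v);
        }
        return b;
    }
    double lo(double t) const { return lo0 + vlo * (t - t0); }
    double hi(double t) const { return hi0 + vhi * (t - t0); }
};

int main() {
    std::vector<Moving1D> pts{{0.0, 1.0}, {5.0, -0.5}, {2.0, 0.2}};
    TimeParamBound b = TimeParamBound::build(pts, 0.0);
    std::cout << "[" << b.lo(10.0) << ", " << b.hi(10.0) << "]\n";  // contains all items at t = 10
}
```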

13. STRING PROCESSING

In this section we survey methods used to process strings in external memory, such as inverted files, search trees, suffix trees and suffix arrays, and sorting, with particular attention to more recent developments.

13.1. Inverted Files

The simplest and most commonly used method to index text in large documents or collections of documents is the inverted file, which is analogous to the index at the back of a book. The words of interest in the text are sorted alphabetically, and each item in the sorted list has a list of pointers to the occurrences of that word in the text. In an EM setting, a hybrid approach makes sense, in which the text is divided into large chunks (consisting of one or more blocks) and an inverted file is used to specify the chunks containing each word; the search within a chunk can be carried out by using a fast sequential method, such as the Knuth–Morris–Pratt [1977] or Boyer–Moore [1977] methods. This particular hybrid method was introduced as the basis of the widely used GLIMPSE search tool [Manber and Wu 1994]. Another way to index text is to use hashing to get small signatures for portions of text.

Fig. 13. Patricia trie representation of a single node of an SB-tree, with branching factor B = 8. The seven strings used for partitioning are pictured at the leaves; in the actual data structure, pointers to the strings, not the strings themselves, are stored at the leaves. The pointers to the B children of the SB-tree node are also stored at the leaves.

The reader is referred to Frakes and Baeza-Yates [1992] and Baeza-Yates and Ribeiro-Neto [1999] for more background on the above methods.
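A minimal in-memory sketch of the chunk-level inverted file described above, assuming whitespace-delimited words and a small vocabulary map (the ChunkIndex name and layout are illustrative; in a real EM setting the index itself would be blocked on disk):

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Chunk-level inverted file: the text is split into large chunks, and for each
// word we record only the chunks that contain it.  A query consults the (small)
// inverted file first and then scans only the listed chunks sequentially.
struct ChunkIndex {
    std::vector<std::string> chunks;
    std::map<std::string, std::vector<std::size_t>> postings;  // word -> chunk ids

    void addChunk(const std::string& text) {
        std::size_t id = chunks.size();
        chunks.push_back(text);
        std::istringstream in(text);
        std::string w;
        while (in >> w)
            if (postings[w].empty() || postings[w].back() != id)
                postings[w].push_back(id);
    }

    // Return the chunks containing the word; the caller would then run a fast
    // sequential matcher (e.g., Knuth-Morris-Pratt) inside each listed chunk.
    std::vector<std::size_t> candidateChunks(const std::string& word) const {
        auto it = postings.find(word);
        return it == postings.end() ? std::vector<std::size_t>{} : it->second;
    }
};

int main() {
    ChunkIndex idx;
    idx.addChunk("external memory algorithms");
    idx.addChunk("suffix trees and suffix arrays");
    for (std::size_t id : idx.candidateChunks("suffix")) std::cout << id << "\n";  // prints 1
}
```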

13.2. String B-Trees

In a conventional B-tree, Θ(B) unit-sized keys are stored in each internal node to guide the searching, and thus the entire node fits into one or two blocks. However, if the keys are variable-sized text strings, the keys can be arbitrarily long, and there may not be enough space to store Θ(B) strings per node. Pointers to Θ(B) strings could be stored instead in each node, but access to the strings during search would require more than a constant number of I/Os per node. In order to save space in each node, Bayer and Unterauer [1977] investigated the use of prefix representations of keys. Ferragina and Grossi [1996; 1999] recently developed an elegant generalization of the B-tree called the String B-tree or simply SB-tree (not to be confused with the SB-tree [O'Neil 1992] mentioned in Section 10.1). An SB-tree differs from a conventional B-tree in the way that each Θ(B)-way branching node is represented.

An individual node of Ferragina and Grossi's SB-tree is pictured in Figure 13. It is based upon a variant of the Patricia trie character-based data structure [Knuth 1998; Morrison 1968] along the lines of Ajtai et al. [1984]. It achieves B-way branching with a total storage of O(B) characters, which fit in O(1) blocks. Each of its internal nodes stores an index (a number from 0 to N) and a one-character label for each of its outgoing edges. For example, in Figure 13 the right child of the root has index 4 and its outgoing edges have character labels "a" and "b", which means that the node's left subtrie consists of strings whose position 4 (fifth character) is "a", and its right subtrie consists of strings whose position 4 (fifth character) is "b". The first four characters in all the strings in the node's subtrie are identically "bcbc". To find which of the B branches to take for a search string, a trie search is done in the Patricia trie; each binary branching decision is based upon the character indexed at that node. For search string "bcbabcba", a binary search on the trie in Figure 13 traverses the far-right path of the Patricia trie, examining character positions 0, 4, and 6.

Unfortunately, the leaf node that is eventually reached (in our example, the leaf at the far right, corresponding to "bcbcbbba") is not in general at the correct branching point, since only certain character positions in the string were examined during the search. The key idea to fix this situation is to sequentially compare the search string with the string associated with the leaf, and if they differ, the index where they differ can be found. In the example the search string "bcbabcba" differs from "bcbcbbba" in the fourth character (position 3), and therefore the search string is lexicographically smaller than the entire right subtrie of the root. It thus fits in between the leaves "abac" and "bcbcaba".
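To make the two-phase search concrete, here is a highly simplified sketch of the blind descent over indexed character positions followed by the correcting comparison; the node layout and tiny example trie are invented for illustration, and the sketch omits the blocking and the re-descent to the true branching point that the actual SB-tree node performs.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified Patricia-trie node: an internal node examines one character
// position and branches on its value; a leaf stores one of the partitioning strings.
struct TrieNode {
    int index = -1;                 // character position examined (-1 for a leaf)
    std::string leafString;         // set only at leaves
    std::vector<std::pair<char, std::unique_ptr<TrieNode>>> children;
};

// Phase 1 (blind descent): follow edges using only the indexed characters.
// If no edge label matches, any branch may be taken; the error is repaired in phase 2.
const TrieNode* blindDescent(const TrieNode* node, const std::string& q) {
    while (node->index >= 0) {
        char c = node->index < (int)q.size() ? q[node->index] : '\0';
        const TrieNode* next = nullptr;
        for (const auto& e : node->children)
            if (e.first == c) { next = e.second.get(); break; }
        if (!next) next = node->children.front().second.get();
        node = next;
    }
    return node;
}

// Phase 2 (correction): one sequential comparison of the query against the
// reached leaf's string finds the first mismatching position; in the SB-tree
// this position pins down where the query fits among the node's strings.
std::size_t firstMismatch(const std::string& a, const std::string& b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
    return i;
}

int main() {
    // Tiny hand-built trie over {"abc", "abd"}: a single branch on position 2.
    auto root = std::make_unique<TrieNode>();
    root->index = 2;
    auto l = std::make_unique<TrieNode>(); l->leafString = "abc";
    auto r = std::make_unique<TrieNode>(); r->leafString = "abd";
    root->children.emplace_back('c', std::move(l));
    root->children.emplace_back('d', std::move(r));

    std::string query = "axd";
    const TrieNode* leaf = blindDescent(root.get(), query);   // blindly reaches "abd"
    std::size_t p = firstMismatch(query, leaf->leafString);   // p == 1
    std::cout << leaf->leafString << " " << p << " "
              << (query < leaf->leafString ? "<" : ">=") << "\n";
}
```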

Searching each Patricia trie requires one I/O to load it into memory, plus additional I/Os to do the sequential scan of the string after the leaf of the Patricia trie is reached. Each block of the search string that is examined during a sequential scan does not have to be read in again for lower levels of the SB-tree, so the I/Os for the sequential scan can be charged to the blocks of the search string. The resulting query time to search in an SB-tree for a string of ℓ characters is therefore O(logB N + ℓ/B), which is optimal. Insertions and deletions can be done in the same I/O bound. Ferragina and Grossi [1996; 1999] apply SB-trees to the problems of string matching, prefix search, and substring search. Ferragina and Luccio [1998] apply SB-trees to get new results for dynamic dictionary matching; their structure even provides a simpler approach for the (internal memory) RAM model.

13.3. Suffix Trees and Suffix Arrays

Tries and Patricia tries are commonly used as internal memory data structures for storing sets of strings. One particularly interesting application of Patricia tries is to store the set of suffixes of a text string. The resulting data structure, called a suffix tree [McCreight 1976; Weiner 1973], can be built in linear time and supports search for an arbitrary substring of the text in time linear in the size of the substring. A more compact (but static) representation of a suffix tree, called a suffix array [Manber and Myers 1993], consisting of the leaves of the suffix tree in symmetric traversal order, can also be used for fast searching. (See Gusfield [1997] for general background.) Farach et al. [1998] show how to construct SB-trees, suffix trees, and suffix arrays on strings of total length N using O(n logm n) I/Os, which is optimal. Clark and Munro [1996] give a practical implementation of dynamic suffix trees that use about five bytes per indexed suffix. Crauser and Ferragina [1999] present an extensive set of experiments on various text collections in which they compare the practical performance of some novel and known suffix array construction algorithms.
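A minimal in-memory sketch of a suffix array and of substring search by binary search over the sorted suffixes; the naive construction shown here sorts suffixes by direct comparison and is only for illustration (the EM constructions cited above are far more efficient).

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

// Build the suffix array of `text`: the starting positions of all suffixes,
// in lexicographic order (the leaves of the suffix tree in symmetric order).
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Substring search: binary-search the pattern among the sorted suffixes and
// check whether the located suffix starts with the pattern.
bool contains(const std::string& text, const std::vector<std::size_t>& sa,
              const std::string& pat) {
    auto it = std::lower_bound(sa.begin(), sa.end(), pat,
        [&](std::size_t pos, const std::string& p) {
            return text.compare(pos, p.size(), p) < 0;
        });
    return it != sa.end() && text.compare(*it, pat.size(), pat) == 0;
}

int main() {
    std::string text = "mississippi";
    auto sa = buildSuffixArray(text);
    std::cout << contains(text, sa, "issip") << " "
              << contains(text, sa, "ssix") << "\n";   // prints: 1 0
}
```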

13.4. Sorting Strings

Arge et al. [1997] consider several models for the problem of sorting K strings of total length N in external memory. They develop efficient sorting algorithms in these models, making use of the SB-tree, buffer tree techniques, and a simplified version of the SB-tree for merging called the lazy trie. The problem can be solved in the (internal memory) RAM model in O(K log K + N) time. By analogy to the problem of sorting integers, it would be natural to expect that the I/O complexity would be O(k logm k + n), where k = max{1, K/B}. Arge et al. show somewhat counterintuitively that for sorting short strings (i.e., strings of length at most B) the I/O complexity depends upon the total number of characters, whereas for long strings the complexity depends upon the total number of strings.

THEOREM 13.1 [ARGE ET AL. 1997]. The number of I/Os needed to sort K strings of total length N, where there are K1 short strings of total length N1 and K2 long strings of total length N2 (i.e., N = N1 + N2 and K = K1 + K2), is

O(min{(N1/B) logm(N1/B + 1), K1 logM(K1 + 1)} + K2 logM(K2 + 1) + N/B).   (13)

Lower bounds for various models of how strings can be manipulated are given in Arge et al. [1997]. There are gaps in some cases between the upper and lower bounds for sorting.

14. THE TPIE EXTERNAL MEMORY PROGRAMMING ENVIRONMENT

In this section we describe the TPIE (transparent parallel I/O environment)3 [Arge et al. 1999a; TPIE 1999; Vengroff and Vitter 1996b], which serves as the implementation platform for the experiments described in Sections 7 and 11.2 as well as in several of the referenced papers. TPIE is a comprehensive set of C++ templates for EM paradigms and a run-time library. Its goal is to help programmers develop high-level, portable, and efficient implementations of EM algorithms.

3 The TPIE software distribution is available free of charge at http://www.cs.duke.edu/TPIE/ on the World Wide Web.

There are three basic approaches to supporting development of I/O-efficient code, which we call access-, array-, and framework-oriented. TPIE falls primarily into the third category with some elements of the first category. Access-oriented systems preserve the programmer abstraction of explicitly requesting data transfer. They often extend the read–write interface to include data type specifications and collective specification of multiple transfers, sometimes involving the memories of multiple processing nodes. Examples of access-oriented systems include the UNIX file system at the lowest level, higher-level parallel file systems such as Whiptail [Shriver and Wisniewski 1995], Vesta [Corbett and Feitelson 1996], PIOUS [Moyer and Sunderam 1996], and the High Performance Storage System [Watson and Coyne 1995], and I/O libraries MPI-IO [Corbett et al. 1996] and LEDA-SM [Crauser and Mehlhorn 1999].

Array-oriented systems access data stored in external memory primarily by means of compiler-recognized data types (typically arrays) and operations on those data types. The external computation is directly specified via iterative loops or explicitly data-parallel operations, and the system manages the explicit I/O transfers. Array-oriented systems are effective for scientific computations that make regular strides through arrays of data and can deliver high-performance parallel I/O in applications such as computational fluid dynamics, molecular dynamics, and weapon system design and simulation. Array-oriented systems are generally ill-suited to irregular or combinatorial computations. Examples of array-oriented systems include PASSION [Thakur et al. 1996], Panda [Seamons and Winslett 1996] (which also has aspects of access orientation), PI/OT [Parsons et al. 1997], and ViC* [Colvin and Cormen 1998].

TPIE [Arge et al. 1999a; TPIE 1999; Vengroff and Vitter 1996b] provides a framework-oriented interface for batched computation as well as an access-oriented interface for online computation. Instead of viewing batched computation as an enterprise in which code reads data, operates on them, and writes results, a framework-oriented system views computation as a continuous process during which a program is fed streams of data from an outside source and leaves trails of results behind. TPIE programmers do not need to worry about making explicit calls to I/O routines. Instead, they merely specify the functional details of the desired computation, and TPIE automatically choreographs a sequence of data movements to feed the computation.
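The following sketch conveys the framework-oriented style in the abstract; it is not TPIE's actual interface, just a schematic in which the user supplies only the per-item operation and the "framework" drives the block-by-block data movement.

```cpp
#include <cstdio>
#include <functional>
#include <vector>

// Schematic of a framework-oriented scan: the framework owns the I/O loop,
// reading one block of items at a time and feeding each item to a user-supplied
// operation.  The user never issues a read or write call directly.
template <typename T>
void frameworkScan(std::FILE* in, std::size_t blockItems,
                   const std::function<void(const T&)>& operate) {
    std::vector<T> block(blockItems);
    std::size_t got;
    while ((got = std::fread(block.data(), sizeof(T), blockItems, in)) > 0)
        for (std::size_t i = 0; i < got; ++i)
            operate(block[i]);      // user code sees a stream of items, not blocks
}

int main() {
    // Write a small binary file of ints, then scan it to sum the items.
    // (Error handling omitted for brevity.)
    std::FILE* f = std::fopen("stream.bin", "wb");
    for (int i = 1; i <= 10; ++i) std::fwrite(&i, sizeof(int), 1, f);
    std::fclose(f);

    long long sum = 0;
    std::FILE* in = std::fopen("stream.bin", "rb");
    frameworkScan<int>(in, 4, [&](const int& x) { sum += x; });   // block size of 4 items
    std::fclose(in);
    std::printf("%lld\n", sum);     // prints 55
}
```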

TPIE is written in C++ as a set of templated classes and functions. It consists of three main components: a block transfer engine (BTE), a memory manager (MM), and an access method interface (AMI). The BTE is responsible for moving blocks of data to and from the disk. It is also responsible for scheduling asynchronous read-ahead and write-behind when necessary to allow computation and I/O to overlap. The MM is responsible for managing main memory in coordination with the AMI and BTE. The AMI provides the high-level uniform interface for application programs. The AMI is the only component that programmers normally need to interact with directly. Applications that use the AMI are portable across hardware platforms, since they do not have to deal with the underlying details of how I/O is performed on a particular machine.

We have seen in the previous sections that many batched problems in spatial databases, GIS, scientific computing, graphs, and string processing can be solved optimally using a relatively small number of basic paradigms like scanning (or streaming), multiway distribution, and merging, which TPIE supports as access mechanisms. Batched programs in TPIE thus consist primarily of a call to one or more of these standard access mechanisms. For example, a distribution sort can be programmed by using the access mechanism for multiway distribution. The programmer has to specify the details as to how the partitioning elements are formed and how the buckets are defined. Then the multiway distribution is invoked, during which TPIE automatically forms the buckets and writes them to disk using double buffering.

For online data structures such as hashing, B-trees, and R-trees, TPIE supports more traditional block access like the access-oriented systems.

15. DYNAMIC MEMORY ALLOCATION

The amount of internal memory allocated to a program may fluctuate during the course of execution because of demands placed on the system by other users and processes. EM algorithms must be able to adapt dynamically to whatever resources are available so as to preserve good performance [Pang et al. 1993a]. The algorithms in the previous sections assume a fixed memory allocation; they must resort to virtual memory if the memory allocation is reduced, often causing a severe degradation in performance.

Barve and Vitter [1999b] discuss the design and analysis of EM algorithms that adapt gracefully to changing memory allocations. In their model, without loss of generality, an algorithm (or program) P is allocated internal memory in phases. During the ith phase, P is allocated mi blocks of internal memory, and this memory remains allocated to P until P completes 2mi I/O operations, at which point the next phase begins. The process continues until P finishes execution. The model makes the reasonable assumption that the duration for each memory allocation phase is long enough to allow all the memory in that phase to be used by the algorithm.

For sorting, the lower bound approach of (10) implies that

∑_i 2mi log mi = Ω(n log n).

We say that P is dynamically optimal for sorting if

∑_i 2mi log mi = O(n log n)

for all possible sequences m1, m2, . . . of memory allocation. Intuitively, if P is dynamically optimal, no other algorithm can perform more than a constant number of sorts in the worst case for the same sequence of memory allocations.
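For intuition, consider the special case of a constant allocation (a worked consequence of the definition above, not a claim from the cited paper beyond what the definition implies):

```latex
% Special case: constant allocation m_i = m in every phase.
% Dynamic optimality requires  \sum_i 2m \log m = O(n \log n),
% so the number of phases T satisfies  T \cdot 2m \log m = O(n \log n).
% The total I/O count is the combined length of all phases:
\text{total I/Os} \;=\; 2mT \;=\; O\!\left(\frac{n \log n}{\log m}\right) \;=\; O(n \log_m n),
% which coincides with the familiar sorting I/O bound for a fixed memory of m blocks.
```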

Barve and Vitter [1999b] define the model in generality and give dynamically optimal strategies for sorting, matrix multiplication, and buffer tree operations. Their work represents the first theoretical model of dynamic allocation and the first algorithms that can be considered dynamically optimal. Previous work was done on memory-adaptive algorithms for merge sort [Pang et al. 1993a; Zhang and Larson 1997] and hash join [Pang et al. 1993b], but the algorithms handle only special cases and can be made to perform nonoptimally for certain patterns of memory allocation.

16. CONCLUSIONS

In this survey we have described several useful paradigms for the design and implementation of efficient external memory algorithms and data structures. The problem domains we have considered include sorting, permuting, FFT, scientific computing, computational geometry, graphs, databases, geographic information systems, and text and string processing. Interesting challenges remain in virtually all these problem domains. One difficult problem is to prove lower bounds for permuting and sorting without the indivisibility assumption. Another promising area is the design and analysis of EM algorithms for efficient use of multiple disks. Optimal bounds have not yet been determined for several basic EM graph problems such as topological sorting, shortest paths, breadth- and depth-first search, and connected components. There is an intriguing connection between problems that have good I/O speedups and problems that have fast and work-efficient parallel algorithms. Several problems remain open in the dynamic and kinetic settings, such as range searching, ray shooting, point location, and finding nearest neighbors.

A continuing goal is to develop optimal EM algorithms and to translate theoretical gains into observable improvements in practice. For some of the problems that can be solved optimally up to a constant factor, the constant overhead is too large for the algorithm to be of practical use, and simpler approaches are needed. In practice, algorithms cannot assume a static internal memory allocation; they must adapt in a robust way when the memory allocation changes.

Many interesting challenges and opportunities in algorithm design and analysis arise from new architectures being developed, such as networks of workstations, hierarchical storage devices, disk drives with processing capabilities, and storage devices based upon microelectromechanical systems (MEMS). Active (or intelligent) disks, in which disk drives have some processing capability and can filter information sent to the host, have recently been proposed to further reduce the I/O bottleneck, especially in large database applications [Acharya et al. 1998; Riedel et al. 1998]. MEMS-based nonvolatile storage has the potential to serve as an intermediate level in the memory hierarchy between DRAM and disks. It could ultimately provide better latency and bandwidth than disks, at less cost per bit than DRAM [Schlosser et al. 2000; Vettiger et al. 2000].

ACKNOWLEDGMENTS

The author wishes to thank Pankaj Agarwal, Lars Arge, Ricardo Baeza-Yates, Adam Buchsbaum, Jeff Chase, David Hutchinson, Vasilis Samoladas, Amin Vahdat, the members of the Center for Geometric Computing at Duke University, and the referees for several helpful comments and suggestions. Figure 1 is a modified version of a figure by Darren Vengroff, and Figure 2 comes from Cormen et al. [1990]. Figures 6, 7, 9, 10, 12, and 13 are modified versions of figures in Arge et al. [1998a], Enbody and Du [1998], Kanellakis et al. [1996], Arge et al. [1999a,b], and Ferragina and Grossi [1999], respectively.

REFERENCES

ABEL, D. J. 1984. A B+-tree structure for large quadtrees. Computer Vision, Graphics, and Image Processing 27, 1 (July), 19–31.

ABELLO, J., BUCHSBAUM, A., AND WESTBROOK, J. 1998. A functional approach to external graph algorithms. In Proceedings of the European Symposium on Algorithms, Vol. 1461 of Lecture Notes in Computer Science (Venice, August). Springer-Verlag, 332–343.

ACHARYA, A., UYSAL, M., AND SALTZ, J. 1998. Active disks: Programming model, algorithms and evaluation. ACM SIGPLAN Notices 33, 11 (Nov.), 81–91.

ADLER, M. 1996. New coding techniques for im-proved bandwidth utilization. In Proceedings ofthe IEEE Symposium on Foundations of Com-puter Science, Vol. 37 (Burlington, VT, Oct.),173–182.

AGARWAL, P. K. AND ERICKSON, J. 1999. Geometricrange searching and its relatives. In B. Chazelle,J. E. Goodman, and R. Pollack, Eds, Advances inDiscrete and Computational Geometry, Vol. 23of Contemporary Mathematics, American Math-ematical Society, Providence, RI, 1–56.

AGARWAL, P. K. AND HAR-PELED, S. 2000. Maintain-ing the approximate extent measures of mov-ing points. In Proceedings of the ACM-SIAMSymposium on Discrete Algorithms, Vol. 12(Washington, Jan.), 148–157.

AGARWAL, P. K., ARGE, L., AND ERICKSON, J. 2000. In-dexing moving points. In Proceedings of the ACMSymposium on Principles of Database Systems,Vol. 19, 175–186.

AGARWAL, P. K., ARGE, L., BRODAL, G. S., AND VITTER,J. S. 1999. I/O-efficient dynamic point loca-tion in monotone planar subdivisions. In Pro-ceedings of the ACM-SIAM Symposium onDiscrete Algorithms, Vol. 10, 11–20.

AGARWAL, P. K., ARGE, L., ERICKSON, J., FRANCIOSA,P. G., AND VITTER, J. S. 1998a. Efficient search-ing with linear constraints. In Proceedings of theACM Symposium on Principles of Database Sys-tems, Vol. 17, 169–178.

AGARWAL, P. K., ARGE, L., MURALI, T. M., VARADARA-JAN, K., AND VITTER, J. S. 1998b. I/O-efficientalgorithms for contour line extraction and pla-nar graph blocking. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, Vol. 9,117–126.

AGARWAL, P. K., ARGE, L., PROCOPIUC, O., AND VITTER,J. S. 2001a. A framework for index dynamiza-tion. In Proceedings of the International Collo-quium on Automata, Languages, and Program-ming (Crete, Greece, July), Vol. 2076 of LectureNotes in Computer Science, Springer-Verlag.

AGARWAL, P. K., DE BERG, M., GUDMUNDSSON, J.,HAMMAR, M., AND HAVERKORT, H. J. 2001b.Constructing box trees and R-trees withlow stabbing number. In Proceedings of theACM Symposium on Computational Geometry(Medford, MA, June), Vol. 17.

AGGARWAL, A. AND PLAXTON, C. G. 1994. Optimalparallel sorting in multi-level storage. In Pro-ceedings of the ACM-SIAM Symposium on Dis-crete Algorithms, Vol. 5, 659–668.

AGGARWAL, A. AND VITTER, J. S. 1988. The in-put/output complexity of sorting and related

problems. Communications of the ACM 31, 9,1116–1127.

AGGARWAL, A., ALPERN, B., CHANDRA, A. K., AND SNIR,M. 1987a. A model for hierarchical memory.In Proceedings of the ACM Symposium on The-ory of Computing, Vol. 19 (New York), 305–314.

AGGARWAL, A., CHANDRA, A., AND SNIR, M. 1987b.Hierarchical memory with block transfer. In Pro-ceedings of the IEEE Symposium on Foundationsof Computer Science, Vol. 28 (Los Angeles), 204–216.

AJTAI, M., FREDMAN, M., AND KOMLOS, J. 1984. Hashfunctions for priority queues. Information andControl, 63, 3, 217–225.

ALPERN, B., CARTER, L., FEIG, E., AND SELKER, T. 1994.The uniform memory hierarchy model of compu-tation. Algorithmica 12, 2–3, 72–109.

ARGE, L. 1995a. The buffer tree: A new techniquefor optimal I/O-algorithms. In Proceedings of theWorkshop on Algorithms and Data Structures,Vol. 955 of Lecture Notes in Computer Science,Springer-Verlag, 334–345. A complete versionappears as BRICS Technical Report RS–96–28,University of Aarhus.

ARGE, L. 1995b. The I/O-complexity of orderedbinary-decision diagram manipulation. In Pro-ceedings of the International Symposium onAlgorithms and Computation, Vol. 1004 of Lec-ture Notes in Computer Science, Springer-Verlag,82–91.

ARGE, L. 1997. External-memory algorithms withapplications in geographic information systems.In M. van Kreveld, J. Nievergelt, T. Roos, andP. Widmayer, eds, Algorithmic Foundations ofGIS, Vol. 1340 of Lecture Notes in ComputerScience, Springer-Verlag, 213–254.

ARGE, L. AND MILTERSEN, P. 1999. On showinglower bounds for external-memory computa-tional geometry problems. In J. Abello and J. S.,Vitter, Eds., External Memory Algorithms andVisualization, DIMACS Series in Discrete Math-ematics and Theoretical Computer Science,American Mathematical Society, Providence, RI,139–159.

ARGE, L. AND VAHRENHOLD, J. 2000. I/O-efficientdynamic planar point location. In Proceedings ofthe ACM Symposium on Computational Geome-try (June), Vol. 9, 191–200.

ARGE, L. AND VITTER, J. S. 1996. Optimal dynamicinterval management in external memory. InProceedings of the IEEE Symposium on Foun-dations of Computer Science (Burlington, VT,Oct.), Vol. 37, 560–569.

ARGE, L., BRODAL, G. S., AND TOMA, L. 2000a. Onexternal memory MST, SSSP and multi-wayplanar graph separation. In Proceedings of theScandinavian Workshop on Algorithmic Theory(July) Vol. 1851 of Lecture Notes in ComputerScience, Springer-Verlag.

ARGE, L., FERRAGINA, P., GROSSI, R., AND VITTER, J. 1997. On sorting strings in external memory. In Proceedings of the ACM Symposium on Theory of Computing, Vol. 29, 540–548.

ARGE, L., HINRICHS, K. H., VAHRENHOLD, J., AND

VITTER, J. S. 1999a. Efficient bulk operationson dynamic R-trees. In Workshop on AlgorithmEngineering and Experimentation, Vol. 1619 ofLecture Notes in Computer Science (Baltimore,Jan.) Springer-Verlag, 328–348.

ARGE, L., KNUDSEN, M., AND LARSEN, K. 1993. Ageneral lower bound on the I/O-complexity ofcomparison-based algorithms. In Proceedings ofthe Workshop on Algorithms and Data Struc-tures, Vol. 709 of Lecture Notes in Computer Sci-ence, Springer-Verlag, 83–94.

ARGE, L., PROCOPIUC, O., RAMASWAMY, S., SUEL, T.,VAHRENHOLD, J., AND VITTER, J. S. 2000b. A uni-fied approach for indexed and non-indexed spa-tial joins. In Proceedings of the InternationalConference on Extending Database Technology(Konstanz, Germany, March), Vol. 7.

ARGE, L., PROCOPIUC, O., RAMASWAMY, S., SUEL, T., AND

VITTER, J. S. 1998a. Scalable sweeping-basedspatial join. In Proceedings of the InternationalConference on Very Large Databases (New York,August) Vol. 24, 570–581.

ARGE, L., PROCOPIUC, O., RAMASWAMY, S., SUEL, T.,AND VITTER, J. S. 1998b. Theory and practiceof I/O-efficient algorithms for multidimensionalbatched searching problems. In Proceedings ofthe ACM-SIAM Symposium on Discrete Algo-rithms, Vol. 9, 685–694.

ARGE, L., SAMOLADAS, V., AND VITTER, J. S. 1999b.Two-dimensional indexability and optimalrange search indexing. In Proceedings of theACM Conference Principles of Database Systems(Philadelphia, May–June), Vol. 18, 346–357.

ARGE, L., VENGROFF, D. E., AND VITTER, J. S. 1995.External-memory algorithms for processing linesegments in geographic information systems.Algorithmica (to appear). Special issue on car-tography and geographic information systems.An earlier version appeared in Proceedings ofthe Third European Symposium on Algorithms,(Sept.), Vol. 979 of Lecture Notes in ComputerScience, Springer-Verlag 295–310.

ARMEN, C. 1996. Bounds on the separation of twoparallel disk models. In Proceedings of the Work-shop on Input/Output in Parallel and Dis-tributed Systems (Philadelphia, May), Vol. 4,122–127.

BAEZA-YATES, R. 1996. Bounded disorder: The ef-fect of the index. Theoretical Computer Science168, 21–38.

BAEZA-YATES, R. AND RIBEIRO-NETO, B. Eds. 1999.Modern Information Retrieval. Addison-WesleyLongman: Chapter 8.

BAEZA-YATES, R. AND SOZA-POLLMAN, H. 1998. Anal-ysis of linear hashing revisited. Nordic Journalof Computing 5, 70–85.

BAEZA-YATES, R. A. 1989. Expected behaviour ofB+-trees under random insertions. Acta Infor-matica 26, 5, 439–472.

BAEZA-YATES, R. A. AND LARSON, P.-A. 1989. Per-formance of B+-trees with partial expansions.IEEE Transactions on Knowledge and Data En-gineering 1, 2 (June), 248–257.

BARVE, R. D. AND VITTER, J. S. 1999a. A simpleand efficient parallel disk mergesort. In Pro-ceedings of the ACM Symposium on ParallelAlgorithms and Architectures (St. Malo, France,June), Vol. 11, 232–241.

BARVE, R. D. AND VITTER, J. S. 1999b. A theoreticalframework for memory-adaptive algorithms. InProceedings of the IEEE Symposium on Foun-dations of Computer Science (New York, Oct.),Vol. 40, 273–284.

BARVE, R. D., GROVE, E. F., AND VITTER, J. S. 1997.Simple randomized mergesort on parallel disks.Parallel Computing 23, 4, 601–631.

BARVE, R. D., KALLAHALLA, M., VARMAN, P. J., AND

VITTER, J. S. 2000. Competitive analysis ofbuffer management algorithms. Journal ofAlgorithms 36 (August).

BARVE, R. D., SHRIVER, E. A. M., GIBBONS, P. B., HILLYER,B. K., MATIAS, Y., AND VITTER, J. S. 1999. Mod-eling and optimizing I/O throughput of multipledisks on a bus. In Procedings of ACM SIGMET-RICS Joint International Conference on Mea-surement and Modeling of Computer Systems(Atlanta, GA, May), 83–92.

BASCH, J., GUIBAS, L. J., AND HERSHBERGER, J. 1999.Data structures for mobile data. Journal ofAlgorithms 31, 1–28.

BAYER, R. AND MCCREIGHT, E. 1972. Organization oflarge ordered indexes. Acta Informatica 1, 173–189.

BAYER, R. AND UNTERAUER, K. 1977. Prefix B-trees.ACM Transactions on Database Systems 2, 1(March), 11–26.

BECKER, B., GSCHWIND, S., OHLER, T., SEEGER, B.,AND WIDMAYER, P. 1996. An asymptotically op-timal multiversion B-tree. VLDB Journal 5, 4(Dec.), 264–275.

BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND

SEEGER, B. 1990. The R*-tree: An efficient androbust access method for points and rectangles.In Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data, 322–331.

BENDER, M. A., DEMAINE, E. D., AND FARACH-COLTON,M. 2000. Cache-oblivious B-trees. In Proceed-ings of the IEEE Symposium on Foundations ofComputer Science (Redondo Beach, Cal., Nov.),Vol. 41, 12–14.

BENTLEY, J. L. 1980. Multidimensional divide andconquer. Communications of the ACM 23, 6, 214–229.

BENTLEY, J. L. AND SAXE, J. B. 1980. Decomposablesearching problems I: Static-to-dynamic trans-formations. Journal of Algorithms 1, 4 (Dec.),301–358.

BERCHTOLD, S., BOHM, C., AND KRIEGEL, H.-P. 1998. Improving the query performance of high-dimensional index structures by bulk load operations. In Proceedings of the International Conference on Extending Database Technology, Vol. 1377 of Lecture Notes in Computer Science, Springer-Verlag, 216–230.

BLUM, N. AND MEHLHORN, K. 1980. On the aver-age number of rebalancing operations in weight-balanced trees. Theoretical Computer Science 11,3 (July), 303–320.

BOYER, R. S. AND MOORE, J. S. 1977. A fast stringsearching algorithm. Communications of theACM 20, 10 (Oct.), 762–772.

BRENGEL, K., CRAUSER, A., FERRAGINA, P., AND MEYER,U. 1999. An experimental study of priorityqueues. In J. S. Vitter and C. Zaroliagis, Ed., Pro-ceedings of the Workshop on Algorithm Engineer-ing, Vol. 1668 of Lecture Notes in Computer Sci-ence (London, July), Springer-Verlag, 345–359.

BRODAL, G. S. AND KATAJAINEN, J. 1998. Worst-caseefficient external-memory priority queues. InProceedings of the Scandinavian Workshop onAlgorithmic Theory, Vol. 1432 of Lecture Notes inComputer Science (Stockholm, July), Springer-Verlag, 107–118.

BUCHSBAUM, A. L., GOLDWASSER, M., VENKATASUBRAMA-NIAN, S., AND WESTBROOK, J. R. 2000. On ex-ternal memory graph traversal. In Proceedingsof the ACM-SIAM Symposium on Discrete Algo-rithms (Jan.), Vol. 11.

CALLAHAN, P., GOODRICH, M. T., AND RAMAIYER, K.1995. Topology B-trees and their applications.In Proceedings of the Workshop on Algorithmsand Data Structures, Vol. 955 of Lecture Notesin Computer Science, Springer-Verlag, 381–392.

CARTER, L. AND GATLIN, K. S. 1998. Towards an op-timal bit-reversal permutation program. In Pro-ceedings of the IEEE Symposium on Foundationsof Computer Science (Palo Alto, Nov.), Vol. 39,544–553.

CHAZELLE, B. 1986. Filtering search: A new ap-proach to query-answering. SIAM Journal onComputing 15, 703–724.

CHAZELLE, B. 1990. Lower bounds for orthogonalrange searching: I. The reporting case. Journalof the ACM 37, 2 (April), 200–212.

CHAZELLE, B. AND EDELSBRUNNER, H. 1987. Linearspace data structures for two types of rangesearch. Discrete and Computational Geometry 2,113–126.

CHEN, P. M., LEE, E. K., GIBSON, G. A., KATZ,R. H., AND PATTERSON, D. A. 1994. RAID: High-performance, reliable secondary storage. ACMComputing Surveys 26, 2 (June), 145–185.

CHIANG, Y.-J. 1998. Experiments on the practicalI/O efficiency of geometric algorithms: Distri-bution sweep vs. plane sweep. ComputationalGeometry: Theory and Applications 8, 4, 211–236.

CHIANG, Y.-J., GOODRICH, M. T., GROVE, E. F.,TAMASSIA, R., VENGROFF, D. E., AND VITTER, J. S.1995. External-memory graph algorithms. In

Proceedings of the ACM-SIAM Symposium onDiscrete Algorithms (Jan.), Vol. 6, 139–149.

CHIANG, Y.-J. AND SILVA, C. T. 1999. External mem-ory techniques for isosurface extraction in scien-tific visualization. In J. Abello and J. S. Vitter,Eds., External Memory Algorithms and Visual-ization, DIMACS Series in Discrete Mathemat-ics and Theoretical Computer Science, AmericanMathematical Society, Providence, RI. 247–277.

CLARK, D. R. AND MUNRO, J. I. 1996. Efficient suffixtrees on secondary storage. In Proceedings of theACM-SIAM Symposium on Discrete Algorithms(Atlanta, June), Vol. 7, 383–391.

CLARKSON, K. L. AND SHOR, P. W. 1989. Applicationsof random sampling in computational geometry,II. Discrete and Computational Geometry 4, 387–421.

COLVIN, A. AND CORMEN, T. H. 1998. ViC*: A com-piler for virtual-memory c*. In Proceedings of theInternational Workshop on High-Level Program-ming Models and Supportive Environments,Vol. 3.

COMER, D. 1979. The ubiquitous B-tree. ACMComputing Surveys 11, 2, 121–137.

CORBETT, P., FEITELSON, D., FINEBERG, S., HSU, Y.,NITZBERG, B., PROST, J.-P., SNIR, M., TRAVERSAT,B., AND WONG, P. 1996. Overview of the MPI-IO parallel I/O interface. In R., Jain, J., Werth,and J. C. Browne, Eds., Input/Output in Paral-lel and Distributed Computer Systems, Vol. 362of The Kluwer International Series in Engineer-ing and Computer Science, Kluwer Academic,Chapter 5, 127–146.

CORBETT, P. F. AND FEITELSON, D. G. 1996. The Vestaparallel file system. ACM Transactions on Com-puter Systems 14, 3 (August), 225–264.

CORMEN, T. H. AND NICOL, D. M. 1998. Performingout-of-core FFTs on parallel disk systems. Par-allel Computing 24, 1 (Jan.), 5–20.

CORMEN, T. H., LEISERSON, C. E., AND RIVEST, R. L.1990. Introduction to Algorithms. MIT Press,Cambridge, MA.

CORMEN, T. H., SUNDQUIST, T., AND WISNIEWSKI, L. F.1999. Asymptotically tight bounds for perform-ing BMMC permutations on parallel disk sys-tems. SIAM Journal on Computing 28, 1, 105–136.

CRAUSER, A. AND FERRAGINA, P. 1999. On construct-ing suffix arrays in external memory. In Proceed-ings of the European Symposium on Algorithms,Vol. 1643 of Lecture Notes in Computer Science,Springer-Verlag.

CRAUSER, A. AND MEHLHORN, K. 1999. LEDA-SM:Extending LEDA to secondary memory. In J. S.,Vitter, and C., Zaroliagis, Eds., Proceedings of theEuropean Symposium on Algorithms, Vol. 1643of Lecture Notes in Computer Science (London,July), Springer-Verlag, 228–242.

CRAUSER, A., FERRAGINA, P., MEHLHORN, K., MEYER, U., AND RAMOS, E. A. 1998. Randomized external-memory algorithms for geometric problems. In Proceedings of the ACM Symposium on Computational Geometry (June), Vol. 14, 259–268.

CRAUSER, A., FERRAGINA, P., MEHLHORN, K., MEYER, U.,AND RAMOS, E. A. 1999. I/O-optimal computa-tion of segment intersections. In J., Abello, andJ. S. Vitter, Eds., External Memory Algorithmsand Visualization, DIMACS Series in DiscreteMathematics and Theoretical Computer Sci-ence, American Mathematical Society, Provi-dence, RI, 131–138.

CYPHER, R. AND PLAXTON, G. 1993. Deterministicsorting in nearly logarithmic time on the hy-percube and related computers. Journal of Com-puter and System Sciences 47, 3, 501–548.

DE BERG, M., GUDMUNDSSON, J., HAMMAR, M., AND

OVERMARS, M. 2000. On R-trees with low stab-bing number. In Proceedings of the EuropeanSymposium on Algorithms, Vol. 1879 of Lec-ture Notes in Computer Science (Saarbrucken,Germany, Sept.), Springer-Verlag, 167–178.

DE BERG, M., VAN KREVELD, M., OVERMARS, M., AND

SCHWARZKOPF, O. 1997. Computational Geom-etry Algorithms and Applications. Springer-Verlag, Berlin.

DEHNE, F., DITTRICH, W., AND HUTCHINSON, D. 1997.Efficient external memory algorithms by sim-ulating coarse-grained parallel algorithms. InProceedings of the ACM Symposium on Paral-lel Algorithms and Architectures (June), Vol. 9,106–115.

DEHNE, F., HUTCHINSON, D., AND MAHESHWARI, A. 1999.Reducing I/O complexity by simulating coarsegrained parallel algorithms. In Proceedings ofthe International Parallel Processing Symmpo-sium (April), Vol. 13, 14–20.

DEMUTH, H. B. 1956. Electronic data sorting.Ph.D., Stanford University. A shortened versionappears in IEEE Transactions on ComputingC-34, 4, (April), 296–310, 1985, Special Issue onSorting, E. E. Lindstrom, C. K. Wong, and J. S.Vitter, Eds.

DENNING, P. J. 1980. Working sets past andpresent. IEEE Transactions on Software Engi-neering SE-6, 64–84.

DEWITT, D. J., NAUGHTON, J. F., AND SCHNEIDER, D. A.1991. Parallel sorting on a shared-nothing ar-chitecture using probabilistic splitting. In Pro-ceedings of the International Conference onParallel and Distributed Information Systems(Dec.), Vol. 1, 280–291.

DITTRICH, W., HUTCHINSON, D., AND MAHESHWARI, A.1998. Blocking in parallel multisearch prob-lems. In Proceedings of the ACM Symposium onParallel Algorithms and Architectures, Vol. 10,98–107.

DRISCOLL, J. R., SARNAK, N., SLEATOR, D. D., AND

TARJAN, R. E. 1989. Making data structurespersistent. Journal of Computer and SystemSciences 38, 86–124.

EASTON, M. C. 1986. Key-sequence data sets onindelible storage. IBM Journal of Research andDevelopment 30, 230–241.

EDELSBRUNNER, H. 1983a. A new approach to rect-angle intersections, part I. International Journalof Computer Mathematics 13, 209–219.

EDELSBRUNNER, H. 1983b. A new approach to rect-angle intersections, part II. International Jour-nal of Computer Mathematics 13, 221–229.

ENBODY, R. J. AND DU, H. C. 1988. Dynamic hashingschemes. ACM Computing Surveys 20, 2 (June),85–113.

EPPSTEIN, D., GALIL, Z., ITALIANO, G. F., AND

NISSENZWEIG, A. 1997. Sparsification—A tech-nique for speeding up dynamic graph algo-rithms. Journal of the ACM 44, 5, 669–96.

EVANGELIDIS, G., LOMET, D. B., AND SALZBERG, B. 1997.The hB5-tree: A multi-attribute index support-ing concurrency, recovery and node consolida-tion. VLDB Journal 6, 1–25.

FAGIN, R., NIEVERGELT, J., PIPPINGER, N., AND STRONG,H. R. 1979. Extendible hashing—A fast ac-cess method for dynamic files. ACM Transactionson Database Systems 4, 3, 315–344.

FARACH, M., FERRAGINA, P., AND MUTHUKRISHNAN, S.1998. Overcoming the memory bottleneck insuffix tree construction. In Proceedings of theIEEE Symposium on Foundations of ComputerScience (Palo Alto, Nov.), Vol. 39, 174–183.

FEIGENBAUM, J., KANNAN, S., STRAUSS, M., AND

VISWANATHAN, M. 1999. An approximate l1-difference algorithm for massive data streams.In Proceedings of the IEEE Symposium on Foun-dations of Computer Science (New York, Oct.),Vol. 40, 501–511.

FELLER, W. 1968. An Introduction to ProbabilityTheory and its Applications, Vol. 1, 3rd edition.John Wiley, New York.

FERRAGINA, P. AND GROSSI, R. 1996. Fast stringsearching in secondary storage: Theoreticaldevelopments and experimental results. InProceedings of the ACM-SIAM Symposium onDiscrete Algorithms (Atlanta, June), Vol. 7, 373–382.

FERRAGINA, P. AND GROSSI, R. 1999. The string B-tree: A new data structure for string search inexternal memory and its applications. Journalof the ACM 46, 2 (March), 236–280.

FERRAGINA, P. AND LUCCIO, F. 1998. Dynamic dictio-nary matching in external memory. Informationand Computation 146, 2 (Nov.), 85–99.

FLAJOLET, P. 1983. On the performance evaluationof extendible hashing and trie searching. ActaInformatica 20, 4, 345–369.

FLOYD, R. W. 1972. Permuting information in ide-alized two-level storage. In R. Miller and J.Thatcher, Eds., Complexity of Computer Compu-tations. Plenum, New York, 105–109.

FRAKES, W. AND BAEZA-YATES, R. Eds. 1992. Infor-mation Retrieval: Data Structures and Algo-rithms. Prentice-Hall.

FRIGO, M., LEISERSON, C. E., PROKOP, H., AND RAMACHANDRAN, S. 1999. Cache-oblivious algorithms. In Proceedings of the IEEE Symposium on Foundations of Computer Science, Vol. 40.

FUNKHOUSER, T. A., SEQUIN, C. H., AND TELLER, S. J.1992. Management of large amounts of data ininteractive building walkthroughs. In Proceed-ings of the ACM SIGGRAPH Conference on Com-puter Graphics (Boston, March), 11–20.

GAEDE, V. AND GUNTHER, O. 1998. Multidimen-sional access methods. ACM Computing Surveys30, 2 (June), 170–231.

GANGER, G. R. 1995. Generating representativesynthetic workloads: An unsolved problem.In Proceedings of the Computer MeasurementGroup Conference (Dec.), Vol. 21, 1263–1269.

GARDNER, M. 1977. Magic Show. Knopf, New York,Chapter 7.

GARGANTINI, I. 1982. An effective way to representquadtrees. Communications of the ACM 25, 12(Dec.), 905–910.

GIBSON, G. A., VITTER, J. S., AND WILKES, J. 1996. Re-port of the working group on storage I/O issuesin large-scale computing. ACM Computing Sur-veys 28, 4 (Dec.), 779–793.

GIONIS, A., INDYK, P., AND MOTWANI, R. 1999. Simi-larity search in high dimensions via hashing. InProceedings of the International Conference onVery Large Databases (Edinburgh), Vol. 25, Mor-gan Kaufmann, San Mateo, CA, 78–89.

GOLDMAN, R., SHIVAKUMAR, N., VENKATASUBRAMANIAN,S., AND GARCIA-MOLINA, H. 1998. Proximitysearch in databases. In Proceedings of the In-ternational Conference on Very Large Databases(August), Vol. 24, 26–37.

GOODRICH, M. T., TSAY, J.-J., VENGROFF, D. E., AND

VITTER, J. S. 1993. External-memory compu-tational geometry. In Proceedings of the IEEESymposium on Foundations of Computer SciencePalo Alto (Nov.), Vol. 34, 714–723.

GOVINDARAJAN, S., LUKOVSZKI, T., MAHESHARI, A.,AND ZEH, N. 2000. I/O-efficient well-separatedpair decomposition and its applications. In Pro-ceedings of the European Symposium on Algo-rithms (Saarbrucken, Germany, Sept.), Vol. 1879of Lecture Notes in Computer Science, Springer-Verlag.

GREENE, D. 1989. An implementation and perfor-mance analysis of spatial data access methods.In Proceedings of IEEE International Conferenceon Data Engineering, Vol. 5, 606–615.

GRIFFIN, J. L., SCHLOSSER, S. W., GANGER, G. R., AND

NAGLE, D. F. 2000. Modeling and performanceof MEMS-based storage devices. In Procedings ofACM SIGMETRICS Joint International Confer-ence on Measurement and Modeling of ComputerSystems (Santa Clara, CA, June).

GROSSI, R. AND ITALIANO, G. F. 1999. Efficient cross-trees for external memory. In J. Abello andJ. S. Vitter, Eds., External Memory Algorithmsand Visualization, DIMACS Series in DiscreteMathematics and Theoretical Computer Sci-ence. American Mathematical Society, Provi-dence, RI, 87–106.

GROSSI, R. AND ITALIANO, G. F. 1997. Efficient split-ting and merging algorithms for order decompos-able problems. Information and Computation (inpress). An earlier version appears in Proceedingsof the International Colloquium on Automata,Languages and Programming, Vol. 1256 of Lec-ture Notes in Computer Science, Springer-Verlag,605–615.

GUPTA, S. K. S., LI, Z., AND REIF, J. H. 1995. Gen-erating efficient programs for two-level mem-ories from tensor-products. In Proceedings ofthe IASTED/ISMM International Conferenceon Parallel and Distributed Computing andSystems (Washington, DC, Oct.), Vol. 7, 510–513.

GUSFIELD, D. 1997. Algorithms on Strings, Trees,and Sequences. Cambridge University Press,Cambridge, UK.

GUTTMAN, A. 1984. R-trees: A dynamic index struc-ture for spatial searching. In Proceedings ofthe ACM SIGMOD International Conference onManagement of Data, 47–57.

HELLERSTEIN, J. M., KOUTSOUPIAS, E., AND

PAPADIMITRIOU, C. H. 1997. On the analy-sis of indexing schemes. In Proceedings of theACM Symposium on Principles of DatabaseSystems (Tucson, AZ, May), Vol. 16, 249–256.

HELLERSTEIN, L., GIBSON, G., KARP, R. M., KATZ, R. H.,AND PATTERSON, D. A. 1994. Coding techniquesfor handling failures in large disk arrays. Algo-rithmica 12, 2–3, 182–208.

HENZINGER, M. R., RAGHAVAN, P., AND RAJAGOPALAN, S.1999. Computing on data streams. In J. Abelloand J. S. Vitter, Eds., External Memory Algo-rithms and Visualization, DIMACS Series inDiscrete Mathematics and Theoretical Com-puter Science, American Mathematical Society,Providence, RI, 107–118.

HINRICHS, K. H. 1985. The grid file system: Im-plementation and case studies of applications.PhD Thesis, Dept. Information Science, ETH,Zurich.

HONG, J. W. AND KUNG, H. T. 1981. I/O complexity:The red-blue pebble game. In Proceedings of theACM Symposium on Theory of Computing (May),Vol. 13, 326–333.

HUTCHINSON, D., MAHESHWARI, A., SACK, J.-R., AND

VELICESCU, R. 1997. Early experiences in im-plementing the buffer tree. In Proceedings of theWorkshop on Algorithm Engineering, Vol. 1.

HUTCHINSON, D., MAHESHWARI, A., AND ZEH, N. 1999.An external memory data structure for short-est path queries. In Proceedings of the Inter-national Conference on Computing and Com-binatorics (July), Vol. 1627 of Lecture Notes inComputer Science, Springer-Verlag, 51–60.

HUTCHINSON, D. A., SANDERS, P., AND VITTER, J. S. 2001a. Duality between prefetching and queued writing with applications to integrated caching and prefetching and to external sorting. In Proceedings of the European Symposium on Algorithms (Arhus, Denmark, August), Vol. 2161 of Lecture Notes in Computer Science, Springer-Verlag.

HUTCHINSON, D. A., SANDERS, P., AND VITTER, J. S. 2001b. The power of duality for prefetching and sorting with parallel disks. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (Crete, Greece, July), Vol. 2076 of Lecture Notes in Computer Science, Springer-Verlag.

HUTCHINSON, D. A., SANDERS, P., AND VITTER, J. S. 2001c. Notes.

INDYK, P., MOTWANI, R., RAGHAVAN, P., AND VEMPALA, S. 1997. Locality-preserving hashing in multidimensional spaces. In Proceedings of the ACM Symposium on Theory of Computing (El Paso, TX, May), Vol. 29, 618–625.

PFOSER, D., JENSEN, C. S., AND THEODORIDIS, Y. 2000. Novel approaches to the indexing of moving object trajectories. In Proceedings of the International Conference on Very Large Databases (Cairo), Vol. 26, 395–406.

KALLAHALLA, M. AND VARMAN, P. J. 1999. Optimal read-once parallel disk scheduling. In Proceedings of the Workshop on Input/Output in Parallel and Distributed Systems (Atlanta, GA, May), Vol. 6, 68–77, ACM Press, New York.

KALLAHALLA, M. AND VARMAN, P. J. 2001. Optimal prefetching and caching for parallel I/O systems. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (Crete, July), Vol. 13.

KAMEL, I. AND FALOUTSOS, C. 1993. On packing R-trees. In Proceedings of the International ACM Conference on Information and Knowledge Management, Vol. 2, 490–499.

KAMEL, I. AND FALOUTSOS, C. 1994. Hilbert R-tree: An improved R-tree using fractals. In Proceedings of the International Conference on Very Large Databases, Vol. 20, 500–509.

KAMEL, I., KHALIL, M., AND KOURAMAJIAN, V. 1996. Bulk insertion in dynamic R-trees. In Proceedings of the International Symposium on Spatial Data Handling, Vol. 4, 3B, 31–42.

KANELLAKIS, P. C., KUPER, G. M., AND REVESZ, P. Z. 1990. Constraint query languages. In Proceedings of the ACM Conference Principles of Database Systems, Vol. 9, 299–313.

KANELLAKIS, P. C., RAMASWAMY, S., VENGROFF, D. E., AND VITTER, J. S. 1996. Indexing for data models with constraints and classes. Journal of Computer and System Sciences 52, 3, 589–612.

KANTH, K. V. R. AND SINGH, A. K. 1999. Optimal dynamic range searching in non-replicating index structures. In Proceedings of the International Conference on Database Theory (Jan.), Vol. 1540 of Lecture Notes in Computer Science, Springer-Verlag, 257–276.

KIM, M. Y. 1986. Synchronized disk interleaving. IEEE Transactions on Computers 35, 11 (Nov.), 978–988.

KIRKPATRICK, D. G. AND SEIDEL, R. 1986. The ultimate planar convex hull algorithm? SIAM Journal on Computing 15, 287–299.

KNUTH, D. E. 1998. Sorting and Searching, Vol. 3 of The Art of Computer Programming. 2nd ed., Addison-Wesley, Reading, MA.

KNUTH, D. E. 1999. MMIXware. Springer, Berlin.

KNUTH, D. E., MORRIS, J. H., AND PRATT, V. R. 1977. Fast pattern matching in strings. SIAM Journal on Computing 6, 323–350.

KOLLIOS, G., GUNOPULOS, D., AND TSOTRAS, V. J. 1999. On indexing mobile objects. In Proceedings of the ACM Symposium on Principles of Database Systems, Vol. 18, 261–272.

KOUTSOUPIAS, E. AND TAYLOR, D. S. 1998. Tight bounds for 2-dimensional indexing schemes. In Proceedings of the ACM Symposium on Principles of Database Systems (Seattle, June), Vol. 17, 52–58.

KRISHNAMURTHY, R. AND WANG, K.-Y. 1985. Multilevel grid files. Technical Report, IBM T. J. Watson Center, Yorktown Heights, NY, November.

KUMAR, V. AND SCHWABE, E. 1996. Improved algorithms and data structures for solving graph problems in external memory. In Proceedings of the IEEE Symposium on Parallel and Distributed Processing (Oct.), Vol. 8, 169–176.

KUSPERT, K. 1983. Storage utilization in B*-trees with a generalized overflow technique. Acta Informatica 19, 35–55.

LARSON, P.-A. 1982. Performance analysis of linear hashing with partial expansions. ACM Transactions on Database Systems 7, 4 (Dec.), 566–587.

LAURINI, R. AND THOMPSON, D. 1992. Fundamentals of Spatial Information Systems. Academic Press.

LEHMAN, P. L. AND YAO, S. B. 1981. Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems 6, 4 (Dec.), 650–670.

LEIGHTON, F. T. 1985. Tight bounds on the complexity of parallel sorting. IEEE Transactions on Computers C-34, 4 (April), 344–354. Special issue on sorting, E. E. Lindstrom, C. K. Wong, and J. S. Vitter, Eds.

LEISERSON, C. E., RAO, S., AND TOLEDO, S. 1993. Efficient out-of-core algorithms for linear relaxation using blocking covers. In Proceedings of the IEEE Symposium on Foundations of Computer Science, Vol. 34, 704–713.

LI, Z., MILLS, P. H., AND REIF, J. H. 1996. Models and resource metrics for parallel and distributed computation. Parallel Algorithms and Applications 8, 35–59.

LITWIN, W. 1980. Linear hashing: A new tool for files and tables addressing. In Proceedings of the International Conference on Very Large Databases (Montreal, Oct.), Vol. 6, 212–223.

LITWIN, W. AND LOMET, D. 1987. A new method for fast data searches with keys. IEEE Software 4, 2 (March), 16–24.


LOMET, D. 1988. A simple bounded disorder file organization with good performance. ACM Transactions on Database Systems 13, 4, 525–551.

LOMET, D. B. AND SALZBERG, B. 1990. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Transactions on Database Systems 15, 4, 625–658.

LOMET, D. B. AND SALZBERG, B. 1997. Concurrency and recovery for index trees. VLDB Journal 6, 3, 224–240.

MAHESHWARI, A. AND ZEH, N. 1999. External memory algorithms for outerplanar graphs. In Proceedings of the International Conference on Computing and Combinatorics (July), Vol. 1627 of Lecture Notes in Computer Science, Springer-Verlag, 51–60.

MAHESHWARI, A. AND ZEH, N. 2001. I/O-efficient algorithms for bounded treewidth graphs. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (Washington, DC, Jan.), Vol. 12.

MANBER, U. AND MYERS, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22, 5 (Oct.), 935–948.

MANBER, U. AND WU, S. 1994. GLIMPSE: A tool to search through entire file systems. In USENIX Association, Ed., Proceedings of the Winter USENIX Conference (San Francisco, Jan.), 23–32.

MARTIN, G. N. N. 1979. Spiral storage: Incrementally augmentable hash addressed storage. Technical Report CS-RR-027, University of Warwick, March.

MATIAS, Y., SEGAL, E., AND VITTER, J. S. 2000. Efficient bundle sorting. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (San Francisco, Jan.), Vol. 11, 839–848.

MCCREIGHT, E. M. 1976. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 2, 262–272.

MCCREIGHT, E. M. 1985. Priority search trees. SIAM Journal on Computing 14, 2 (May), 257–276.

MENDELSON, H. 1982. Analysis of extendible hashing. IEEE Transactions on Software Engineering SE-8 (Nov.), 611–619.

MEYER, U. 2001. External memory BFS on undirected graphs with bounded degree. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (Washington, DC, Jan.), Vol. 12.

Microsoft 1998. TerraServer online database of satellite images, available on the World-Wide Web at http://terraserver.microsoft.com/.

MOHAN, C. 1990. ARIES/KVL: A key-value locking method for concurrency control of multiaction transactions on B-tree indices. In Proceedings of the International Conference on Very Large Databases (Brisbane, August), Vol. 16, 392.

MORRISON, D. R. 1968. Patricia: Practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM 15, 514–534.

MOYER, S. A. AND SUNDERAM, V. 1996. Characterizing concurrency control performance for the PIOUS parallel file system. Journal of Parallel and Distributed Computing 38, 1 (Oct.), 81–91.

MULLIN, J. K. 1985. Spiral storage: Efficient dynamic hashing with constant performance. The Computer Journal 28, 3 (July), 330–334.

MUNAGALA, K. AND RANADE, A. 1999. I/O-complexity of graph algorithms. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (Baltimore, Jan.), Vol. 10, 687–694.

NASA 1999. Earth Observing System (EOS) web page, NASA Goddard Space Flight Center, http://eospso.gsfc.nasa.gov/.

NIEVERGELT, J. AND REINGOLD, E. M. 1973. Binary search trees of bounded balance. SIAM Journal on Computing 2, 1.

NIEVERGELT, J. AND WIDMAYER, P. 1997. Spatial data structures: Concepts and design choices. In M. van Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, Eds., Algorithmic Foundations of GIS, Vol. 1340 of Lecture Notes in Computer Science, Springer-Verlag, 153 ff.

NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, K. C. 1984. The grid file: An adaptable, symmetric multi-key file structure. ACM Transactions on Database Systems 9, 38–71.

NODINE, M. H. AND VITTER, J. S. 1993. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (Velen, Germany, June–July), Vol. 5, 120–129.

NODINE, M. H. AND VITTER, J. S. 1995. Greed Sort: An optimal sorting algorithm for multiple disks. Journal of the ACM 42, 4 (July), 919–933.

NODINE, M. H., GOODRICH, M. T., AND VITTER, J. S. 1996. Blocking for external graph searching. Algorithmica 16, 2 (August), 181–214.

NODINE, M. H., LOPRESTI, D. P., AND VITTER, J. S. 1991. I/O overhead and parallel VLSI architectures for lattice computations. IEEE Transactions on Communications 40, 7 (July), 843–852.

O’NEIL, P. E. 1992. The SB-tree: An index-sequential structure for high-performance sequential access. Acta Informatica 29, 3 (June), 241–265.

ORENSTEIN, J. A. 1989. Redundancy in spatial databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Portland, OR, June), 294–305.

ORENSTEIN, J. A. AND MERRETT, T. H. 1984. A class of data structures for associative searching. In Proceedings of the ACM Conference Principles of Database Systems, Vol. 3, 181–190.

OVERMARS, M. H. 1983. The Design of Dynamic Data Structures. Lecture Notes in Computer Science. Springer-Verlag.

PANG, H., CAREY, M., AND LIVNY, M. 1993a. Memory-adaptive external sorts. In Proceedings of the International Conference on Very Large Databases (Dublin), Vol. 19, 618–629.

PANG, H., CAREY, M. J., AND LIVNY, M. 1993b. Partially preemptive hash joins. In P. Buneman and S. Jajodia, Eds., Proceedings of the ACM SIGMOD International Conference on Management of Data (Washington, DC, May), 59–68.

PARSONS, I., UNRAU, R., SCHAEFFER, J., AND SZAFRON, D. 1997. PI/OT: Parallel I/O templates. Parallel Computing 23, 4 (June), 543–570.

PATEL, J. M. AND DEWITT, D. J. 1996. Partition based spatial-merge join. In Proceedings of the ACM SIGMOD International Conference on Management of Data (June), 259–270.

PREPARATA, F. P. AND SHAMOS, M. I. 1985. Computational Geometry. Springer-Verlag.

RAHMAN, N. AND RAMAN, R. 2000. Adapting radix sort to the memory hierarchy. In Workshop on Algorithm Engineering and Experimentation (Jan.), Vol. 1982 of Lecture Notes in Computer Science, Springer-Verlag.

RAMASWAMY, S. AND SUBRAMANIAN, S. 1994. Path caching: A technique for optimal external searching. In Proceedings of the ACM Conference Principles of Database Systems (Minneapolis, MN), Vol. 13, 25–35.

RAO, J. AND ROSS, K. 1999. Cache conscious indexing for decision-support in main memory. In M. Atkinson et al., Eds., Proceedings of the International Conference on Very Large Databases (Los Altos, CA), Vol. 25, 78–89. Morgan Kaufmann, San Mateo, CA.

RAO, J. AND ROSS, K. A. 2000. Making B+-trees cache conscious in main memory. In W. Chen, J. Naughton, and P. A. Bernstein, Eds., Proceedings of the ACM SIGMOD International Conference on Management of Data (Dallas), 475–486.

RIEDEL, E., GIBSON, G. A., AND FALOUTSOS, C. 1998. Active storage for large-scale data mining and multimedia. In Proceedings of the International Conference on Very Large Databases (August), Vol. 22, 62–73.

ROBINSON, J. T. 1981. The k-d-b-tree: A search structure for large multidimensional dynamic indexes. In Proceedings of the ACM Conference Principles of Database Systems, Vol. 1, 10–18.

ROSENBLUM, M., BUGNION, E., DEVINE, S., AND HERROD, S. A. 1997. Using the SimOS machine simulator to study complex computer systems. ACM Transactions on Modeling and Computer Simulation 7, 1 (Jan.), 78–103.

RUEMMLER, C. AND WILKES, J. 1994. An introduction to disk drive modeling. IEEE Computer (March), 17–28.

SALEM, K. AND GARCIA-MOLINA, H. 1986. Disk striping. In Proceedings of the IEEE International Conference on Data Engineering (Los Angeles), Vol. 2, 336–342.

SALTENIS, S., JENSEN, C. S., LEUTENEGGER, S. T., AND LOPEZ, M. A. 2000. Indexing the positions of continuously moving objects. In W. Chen, J. Naughton, and P. A. Bernstein, Eds., Proceedings of the ACM SIGMOD International Conference on Management of Data (Dallas), 331–342.

SALZBERG, B. AND TSOTRAS, V. J. 1999. Comparison of access methods for time-evolving data. ACM Computing Surveys 31 (June), 158–221.

SAMET, H. 1989a. Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley.

SAMET, H. 1989b. The Design and Analysis of Spatial Data Structures. Addison-Wesley.

SAMOLADAS, V. AND MIRANKER, D. 1998. A lower bound theorem for indexing schemes and its application to multidimensional range queries. In Proceedings of the ACM Symposium on Principles of Database Systems (Seattle, June), Vol. 17, 44–51.

SANDERS, P. 1999. Fast priority queues for cached memory. In Workshop on Algorithm Engineering and Experimentation (Jan.), Vol. 1619 of Lecture Notes in Computer Science, Springer-Verlag, 312–327.

SANDERS, P. 2000. Personal communication.

SANDERS, P. 2001. Reconciling simplicity and realism in parallel disk models. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (Washington, DC, Jan.), Vol. 12.

SANDERS, P., EGNER, S., AND KORST, J. 2000. Fast concurrent access to parallel disks. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (San Francisco, Jan.), Vol. 11, 849–858.

SAVAGE, J. E. 1995. Extending the Hong-Kung model to memory hierarchies. In Proceedings of the International Conference on Computing and Combinatorics (August), Vol. 959 of Lecture Notes in Computer Science, Springer-Verlag, 270–281.

SAVAGE, J. E. AND VITTER, J. S. 1987. Parallelism in space-time tradeoffs. In F. P. Preparata, Ed., Advances in Computing Research, Vol. 4, JAI Press, Greenwich, CT, 117–146.

SCHLOSSER, S. W., GRIFFIN, J. L., NAGLE, D. F., AND GANGER, G. R. 2000. Designing computer systems with MEMS-based storage. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (Nov.), Vol. 9.

SEAMONS, K. E. AND WINSLETT, M. 1996. Multidimensional array I/O in Panda 1.0. Journal of Supercomputing 10, 2, 191–211.

SEEGER, B. AND KRIEGEL, H.-P. 1990. The buddy-tree: An efficient and robust access method for spatial data base systems. In Proceedings of the International Conference on Very Large Databases, Vol. 16, 590–601.

SELTZER, M., SMITH, K. A., BALAKRISHNAN, H., CHANG, J., MCMAINS, S., AND PADMANABHAN, V. 1995. File system logging versus clustering: A performance comparison. In Proceedings of the Annual USENIX Technical Conference (New Orleans), 249–264.

SEN, S. AND CHATTERJEE, S. 2000. Towards a theory of cache-efficient algorithms. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (San Francisco, Jan.), Vol. 11.

SHRIVER, E., MERCHANT, A., AND WILKES, J. 1998. An analytic behavior model for disk drives with readahead caches and request reordering. In Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems (June), 182–191.

SHRIVER, E. A. M. AND NODINE, M. H. 1996. An introduction to parallel I/O models and algorithms. In R. Jain, J. Werth, and J. C. Browne, Eds., Input/Output in Parallel and Distributed Computer Systems, Kluwer Academic, Chapter 2, 31–68.

SHRIVER, E. A. M. AND WISNIEWSKI, L. F. 1995. An API for choreographing data accesses. Technical Report PCS-TR95-267, Dept. of Computer Science, Dartmouth College, November.

SIBEYN, J. F. 1997. From parallel to external list ranking. Technical Report MPI-I-97-1-021, Max-Planck-Institut, September.

SIBEYN, J. F. 1999. External selection. In Proceedings of the Symposium on Theoretical Aspects of Computer Science, Vol. 1563 of Lecture Notes in Computer Science, Springer-Verlag, 291–301.

SIBEYN, J. F. AND KAUFMANN, M. 1997. BSP-like external-memory computation. In Proceedings of the Italian Conference on Algorithms and Complexity, Vol. 3, 229–240.

SRINIVASAN, B. 1991. An adaptive overflow technique to defer splitting in B-trees. The Computer Journal 34, 5, 397–405.

SRIVASTAVA, A. AND EUSTACE, A. 1994. ATOM: A system for building customized program analysis tools. ACM SIGPLAN Notices 29, 6 (June), 196–205.

SUBRAMANIAN, S. AND RAMASWAMY, S. 1995. The P-range tree: A new data structure for range searching in secondary memory. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, Vol. 6, 378–387.

TAMASSIA, R. AND VITTER, J. S. 1996. Optimal cooperative search in fractional cascaded data structures. Algorithmica 15, 2 (Feb.), 154–171.

THAKUR, R., CHOUDHARY, A., BORDAWEKAR, R., MORE, S., AND KUDITIPUDI, S. 1996. Passion: Optimized I/O for parallel applications. IEEE Computer 29, 6 (June), 70–78.

TIGER/Line (tm). 1992. Technical documentation. Technical Report, US Bureau of the Census.

TOLEDO, S. 1999. A survey of out-of-core algorithms in numerical linear algebra. In J. Abello and J. S. Vitter, Eds., External Memory Algorithms and Visualization, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, Providence, RI, 161–179.

TPIE. 1999. User manual and reference. The manual and software distribution are available on the web at http://www.cs.duke.edu/TPIE/.

ULLMAN, J. D. AND YANNAKAKIS, M. 1991. The input/output complexity of transitive closure. Annals of Mathematics and Artificial Intelligence 3, 331–360.

VAN DEN BERCKEN, J., SEEGER, B., AND WIDMAYER, P. 1997. A generic approach to bulk loading multidimensional index structures. In Proceedings of the International Conference on Very Large Databases, Vol. 23, 406–415.

VAN KREVELD, M., NIEVERGELT, J., ROOS, T., AND WIDMAYER, P., Eds. 1997. Algorithmic Foundations of GIS, Vol. 1340 of Lecture Notes in Computer Science, Springer-Verlag.

VARMAN, P. J. AND VERMA, R. M. 1997. An efficient multiversion access structure. IEEE Transactions on Knowledge and Data Engineering 9, 3 (May–June), 391–409.

VENGROFF, D. E. AND VITTER, J. S. 1996a. Efficient 3-d range searching in external memory. In Proceedings of the ACM Symposium on Theory of Computing (Philadelphia, May), Vol. 28, 192–201.

VENGROFF, D. E. AND VITTER, J. S. 1996b. I/O-efficient scientific computation using TPIE. In Proceedings of the NASA Goddard Conference on Mass Storage Systems (Sept.), Vol. 5, II, 553–570.

VETTIGER, P., DESPONT, M., DRECHSLER, U., DURIG, U., HABERLE, W., LUTWYCHE, M. I., ROTHUIZEN, E., STUTZ, R., WIDMER, R., AND BINNIG, G. K. 2000. The “Millipede”—More than one thousand tips for future AFM data storage. IBM Journal of Research and Development 44, 3, 323–340.

VITTER, J. S. 1991. Efficient memory access in large-scale computation. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (invited paper), Vol. 480 of Lecture Notes in Computer Science, Springer-Verlag, 26–41.

VITTER, J. S. 1998. External memory algorithms. In Proceedings of the European Symposium on Algorithms (Venice, Italy, August), Vol. 1461 of Lecture Notes in Computer Science (invited paper), Springer-Verlag.

VITTER, J. S. 1999a. External memory algorithms and data structures. In J. Abello and J. S. Vitter, Eds., External Memory Algorithms and Visualization, DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, Providence, RI.

VITTER, J. S. 1999b. Online data structures in external memory. In Proceedings of the International Colloquium on Automata, Languages, and Programming (Prague, August), Vol. 1644 of Lecture Notes in Computer Science (invited paper), Springer-Verlag, 119–133.

VITTER, J. S. 1999c. Online data structures in external memory. In Proceedings of the Workshop on Algorithms and Data Structures (Vancouver, August), Vol. 1668 of Lecture Notes in Computer Science (invited paper), Springer-Verlag.

VITTER, J. S. AND FLAJOLET, P. 1990. Average-case analysis of algorithms and data structures. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, North-Holland, Chapter 9, 431–524.

VITTER, J. S. AND HUTCHINSON, D. A. 2001. Distribution sort with randomized cycling. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (Washington, DC, January), Vol. 12.

VITTER, J. S. AND NODINE, M. H. 1993. Large-scale sorting in uniform memory hierarchies. Journal of Parallel and Distributed Computing 17, 107–114.

VITTER, J. S. AND SHRIVER, E. A. M. 1994a. Algorithms for parallel memory I: Two-level memories. Algorithmica 12, 2–3, 110–147.

VITTER, J. S. AND SHRIVER, E. A. M. 1994b. Algorithms for parallel memory II: Hierarchical multilevel memories. Algorithmica 12, 2–3, 148–169.

VITTER, J. S. AND VENGROFF, D. E. 1999. Notes.

VITTER, J. S. AND WANG, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Philadelphia, June), 193–204.

VITTER, J. S., WANG, M., AND IYER, B. 1998. Data cube approximation and histograms via wavelets. In Proceedings of the International ACM Conference on Information and Knowledge Management (Washington, Nov.), Vol. 7, 96–104.

WANG, M., IYER, B., AND VITTER, J. S. 1998. Scalable mining for classification rules in relational databases. In Proceedings of the International Database Engineering & Application Symposium (Cardiff, July), 58–67.

WANG, M., VITTER, J. S., LIM, L., AND PADMANABHAN, S. 2001. Wavelet-based cost estimation for spatial queries, July.

WATSON, R. W. AND COYNE, R. A. 1995. The parallel I/O architecture of the high-performance storage system (HPSS). In Proceedings of the IEEE Symposium on Mass Storage Systems (Sept.), Vol. 14, 27–44.

WEINER, P. 1973. Linear pattern matching algorithm. In Proceedings of the IEEE Symposium on Switching and Automata Theory (Washington, DC), Vol. 14, 1–11.

WILLARD, D. AND LUEKER, G. 1985. Adding range restriction capability to dynamic data structures. Journal of the ACM 32, 3, 597–617.

WOLFSON, O., SISTLA, P., XU, B., ZHOU, J., AND CHAMBERLAIN, S. 1999. DOMINO: Databases fOr MovINg Objects tracking. In A. Delis, C. Faloutsos, and S. Ghandeharizadeh, Eds., Proceedings of the ACM SIGMOD International Conference on Management of Data (May), 547–549.

WOMBLE, D., GREENBERG, D., WHEAT, S., AND RIESEN, R. 1993. Beyond core: Making parallel computer I/O practical. In Proceedings of the DAGS Symposium on Parallel Computation (Hanover, NH, June), Vol. 2, 56–63. Dartmouth Institute for Advanced Graduate Studies.

WU, C. AND FENG, T. 1981. The universality of the shuffle-exchange network. IEEE Transactions on Computers C-30 (May), 324–332.

YAO, A. C. 1978. On random 2-3 trees. Acta Informatica 9, 159–170.

ZDONIK, S. B. AND MAIER, D., Eds. 1990. Readings in Object-Oriented Database Systems. Morgan Kaufmann, San Mateo, CA.

ZHANG, W. AND LARSON, P.-A. 1997. Dynamic memory adjustment for external mergesort. In Proceedings of the International Conference on Very Large Databases (Athens), Vol. 23, 376–385.

ZHU, B. 1994. Further computational geometry in secondary memory. In Proceedings of the International Symposium on Algorithms and Computation, Vol. 834 of Lecture Notes in Computer Science, Springer-Verlag, 514 ff.

Received May 1999; revised March 2000; accepted September 2000
