Sublinear Time Algorithm


    A

    Seminar Report

    On

    SUB-LINEAR TIME ALGORITHMS

    Submitted in partial fulfillment of the

    requirements for the award of the degree

    of

    BACHELOR OF TECHNOLOGY

    In

    COMPUTER SCIENCE AND ENGINEERING

    Submitted by

    AJAY YADAV

    Roll No. - 1012210009

    B.TECH (CS-61)

    Under the guidance of

    Ms. ANITA PAL

    Mr. KAMAL KUMAR SRIVASTAVA

    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

    SHRI RAMSWAROOP MEMORIAL COLLEGE OF ENGINEERING &

    MANAGEMENT, LUCKNOW

    FEBRUARY

    2013


    ABSTRACT

The area of sublinear-time algorithms is a new, rapidly emerging area of computer science. It has

    its roots in the study of massive data sets that occur more and more frequently in various

    applications. Financial transactions with billions of input data and Internet traffic analyses

    (Internet traffic logs, clickstreams, web data) are examples of modern data sets that show

    unprecedented scale. Managing and analyzing such data sets forces us to reconsider the

    traditional notions of efficient algorithms: processing such massive data sets in more than linear

    time is by far too expensive and often even linear time algorithms may be too slow. Hence, there

    is the desire to develop algorithms whose running times are not only polynomial, but in fact are

sublinear in n. Constructing a sublinear time algorithm may seem to be an impossible task, since it allows one to read only a small fraction of the input. However, in recent years, we have seen

    development of sublinear time algorithms for optimization problems arising in such diverse areas

    as graph theory, geometry, algebraic computations, and computer graphics. Initially, the main

research focus has been on designing efficient algorithms in the framework of property testing,

which is an alternative notion of approximation for decision problems. But more recently, we have seen

    some major progress in sublinear-time algorithms in the classical model of randomized and

approximation algorithms. In this report, we survey some of the recent advances in this area. Our

    main focus is on sublinear-time algorithms for combinatorial problems, especially for graph

    problems and optimization problems in metric spaces.


    ACKNOWLEDGEMENT

Acknowledgements provide us the opportunity to express our gratitude to the people who help such tasks come out as a success. I take this opportunity to thank Dr. (Mrs.) Vinodini Katiyar

    (Head of Department, Computer Science) for her immense support and help. I would also like to

    thank my seminar guides Ms. Anita Pal and Mr. Kamal Kumar Srivastava for their incessant and

    perpetual cooperation and encouragement which helped me overcome hurdles at various stages.

I would also like to thank my family for providing me with resources, support and

encouragement; this report would not have been possible without you.

    AJAY YADAV


    List of Symbols and Abbreviations

DFA Deterministic finite automaton

R.E. Regular expression

MST Minimum spanning tree

NP Nondeterministic polynomial time

PTAS Polynomial-time approximation scheme

O Asymptotic upper bound

Ω Asymptotic lower bound

Θ Asymptotic tight bound

< Less than

Σ Summation


    TABLE OF CONTENTS

Abstract

Acknowledgement

List of Symbols and Abbreviations

1. Introduction

1.1. Sublinear time algorithms

1.2. Property testing

1.2.1. Definition and variants

1.2.2. Features and limitations

1.3. Data streaming algorithms

2. Related work

2.1. Randomized algorithms

2.2. Approximation algorithms

2.3. Time complexity comparisons

3. Design

3.1. Algorithm for checking sortedness of a list

3.1.1. Algorithm

3.2. Successor search in a list

3.2.1. Search algorithm

3.3. Testing membership of a string in a R.E.

3.3.1. Algorithm

3.4. Searching in a sorted list

3.5. Geometry: intersection of two polygons

3.6. Sublinear time algorithms for graph problems

3.6.1. Approximating the average degree

3.6.2. Minimum spanning trees

3.7. Sublinear time approximation algorithms for problems in metric spaces

3.7.1. Clustering via random sampling

4. Conclusion

References


1. INTRODUCTION

The concept of sublinear-time algorithms has been known for a very long time, but initially it was

    used to denote pseudo-sublinear-time algorithms, where after an appropriate preprocessing, an

    algorithm solves the problem in sublinear-time. For example, if we have a set of n numbers, then

    after an O(n log n) preprocessing (sorting), we can trivially solve a number of problems

involving the input elements. And so, if after the preprocessing the elements are put in a

    sorted array, then in O(1) time we can find the kth smallest element, in O(log n) time we can test

    if the input contains a given element x, and also in O(log n) time we can return the number of

    elements equal to a given element x. Even though all these results are folklore, this is not what

    we call nowadays a sublinear-time algorithm.
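As a concrete illustration of this folklore phenomenon, the following Python sketch (added for this report, with made-up data) pays O(n log n) once for sorting and then answers the three queries above in O(1) or O(log n) time using the standard bisect module.

    import bisect

    data = [9, 4, 7, 4, 1, 8, 4]                     # example input
    arr = sorted(data)                               # O(n log n) preprocessing

    k = 3
    kth_smallest = arr[k - 1]                        # k-th smallest in O(1)

    x = 4
    lo, hi = bisect.bisect_left(arr, x), bisect.bisect_right(arr, x)
    contains_x = hi > lo                             # membership test in O(log n)
    count_x = hi - lo                                # multiplicity of x in O(log n)

    print(kth_smallest, contains_x, count_x)         # -> 4 True 3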

    In this Report, my goal is to study algorithms for which the input is taken to be in any standard

    representation and with no extra assumptions. Then, an algorithm does not have to read the entire

    input but it may determine the output by checking only a subset of the input elements. It is easy

    to see that for many natural problems it is impossible to give any reasonable answer if not all or

    almost all input elements are checked. But still, for some number of problems we can obtain

    good algorithms that do not have to look at the entire input. Typically, these algorithms are

    randomized (because most of the problems have a trivial linear-time deterministic lower bound)

    and they return only an approximate solution rather than the exact one (because usually, without

looking at the whole input we cannot determine the exact solution). In this report, we present

recently developed sublinear-time algorithms for some combinatorial optimization problems.

1.1. Sublinear time algorithms:

A sublinear time algorithm is an algorithm whose running time grows more slowly than its input size. That is,

if f(n) denotes its time complexity, then f(n) = o(n). Here o (little-oh) is defined as follows: for two functions

g1(n) and g2(n), we write g1 = o(g2) if and only if for every constant c > 0 there exists n0 such that

g1(n) < c·g2(n) for all n > n0. An algorithm is thus said to run in sub-linear time (often spelled sublinear time) if T(n) = o(n). In

    particular this includes algorithms with the time complexities defined above, as well as others

such as the O(√n) Grover's search algorithm. Typical algorithms that are exact and yet run in

    sub-linear time use parallel processing (as the NC1 matrix determinant calculation does), non-

    classical processing (as Grover's search does), or alternatively have guaranteed assumptions on

    the input structure (as the logarithmic time binary search and many tree maintenance algorithms


    do).

    Figure 1.1 Comparison between sublinear and linear algorithm

    However, languages such as the set of all strings that have a 1-bit indexed by the first log(n) bits

    may depend on every bit of the input and yet be computable in sub-linear time.

    The specific term sublinear time algorithm is usually reserved to algorithms that are unlike the

    above in that they are run over classical serial machine models and are not allowed prior

    assumptions on the input. They are however allowed to be randomized, and indeed must be

    randomized for all but the most trivial of tasks. As such an algorithm must provide an answer

    without reading the entire input, its particulars heavily depend on the access allowed to the input.

    Usually for an input that is represented as a binary string b1,...,bk it is assumed that the algorithm

    can in time O(1) request and obtain the value of bi for any i.

    Sub-linear time algorithms are typically randomized, and provide only approximate solutions. In

    fact, the property of a binary string having only zeros (and no ones) can be easily proved not to

    be decidable by a (non-approximate) sub-linear time algorithm. Sub-linear time algorithms arise

    naturally in the investigation of property testing.
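To make this concrete, here is a one-sided randomized tester for the all-zeros property (a Python sketch added for illustration, in the query model where any bit can be read in O(1) time): it always accepts an all-zeros string, reads only O(1/ε) bits, and rejects any string containing at least εn ones with probability at least 2/3.

    import random

    def all_zeros_tester(query, n, eps):
        """One-sided tester for the all-zeros property.
        Always accepts the all-zeros string; if at least eps*n bits are
        ones, rejects with probability >= 2/3 using O(1/eps) queries."""
        for _ in range(int(2 / eps) + 1):
            i = random.randrange(n)        # uniformly random position
            if query(i) == 1:
                return False               # a one is a certificate for rejection
        return True

    # usage: a string with a 10% fraction of ones is 0.1-far from all-zeros
    s = [0] * 1000
    s[::10] = [1] * 100
    print(all_zeros_tester(lambda i: s[i], len(s), 0.1))   # False (whp)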

    1.2. Property testing:

In broad terms, property testing is the study of the following class of problems:

    Given the ability to perform (local) queries concerning a particular object (e.g., a function, or a

graph), the task is to determine whether the object has a predetermined (global) property (e.g.


linearity or bipartiteness), or is far from having the property. The task should be performed by

    inspecting only a small (possibly randomly selected) part of the whole object, where a small

    probability of failure is allowed.

In order to define a property testing problem, we need to specify the types of queries that can be

performed by the algorithm and a distance measure between objects. The latter is required in order to define what it means for an object to be far from having the property. We assume that the algorithm

is given a distance parameter ε.

    In computer science, a property testing algorithm for a decision problem is an algorithm whose

    query complexity to its input is much smaller than the instance size of the problem. Typically

    property testing algorithms are used to decide if some mathematical object (such as a graph or a

    boolean function) has a "global" property, or is "far" from having this property, using only a

    small number of "local" queries to the object. For example, the following promise problem

admits an algorithm whose query complexity is independent of the instance size (for an arbitrary

constant ε > 0):

    "Given a graph G on n vertices, decide if G is bipartite, or G cannot be made bipartite even after

    removing an arbitrary subset of at most edges of G."

    Property testing algorithms are important in the theory of probabilistically checkable proofs.

1.2.1. Definition and variants

Formally, a property testing algorithm with query complexity q(n) and proximity parameter ε for

a decision problem L is a randomized algorithm that, on input x (an instance of L), makes at most

q(|x|) queries to x and behaves as follows:

If x is in L, the algorithm accepts x with probability at least 2/3.

If x is ε-far from L, the algorithm rejects x with probability at least 2/3.

Here, "x is ε-far from L" means that the Hamming distance between x and any string in L is at

least ε|x|.

A property testing algorithm is said to have one-sided error if it satisfies the stronger condition

that the accepting probability for instances x ∈ L is 1 instead of 2/3.

A property testing algorithm is said to be non-adaptive if it performs all its queries before it

    "observes" any answers to previous queries. Such an algorithm can be viewed as operating in the

    following manner. First the algorithm receives its input. Before looking at the input, using its

    internal randomness, the algorithm decides which symbols of the input are to be queried. Next,

    the algorithm observes these symbols. Finally, without making any additional queries (but

    possibly using its randomness), the algorithm decides whether to accept or reject the input.
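To make the distinction concrete, the following small Python skeleton (an illustration added for this report; query and decide are placeholder functions) operates in exactly the three phases just described.

    import random

    def nonadaptive_tester(query, n, q, decide):
        """Runs a non-adaptive test with query budget q:
        1) choose all positions up front, using internal randomness only,
        2) observe the queried symbols,
        3) accept/reject without making further queries."""
        positions = random.sample(range(n), q)             # phase 1
        answers = [(i, query(i)) for i in positions]       # phase 2
        return decide(answers)                             # phase 3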


1.2.2. Features and limitations

    The main efficiency parameter of a property testing algorithm is its query complexity, which is

    the maximum number of input symbols inspected over all inputs of a given length (and all

    random choices made by the algorithm). One is interested in designing algorithms whose query

complexity is as small as possible. In many cases the running time of property testing algorithms

is sublinear in the instance length. Typically, the goal is first to make the query complexity as

small as possible as a function of the instance size n, and then study the dependency on the

proximity parameter ε.

    Unlike other complexity-theoretic settings, the asymptotic query complexity of property testing

algorithms is affected dramatically by the representation of instances. For example, when ε =

    0.01, the problem of testing bipartiteness of dense graphs (which are represented by their

    adjacency matrix) admits an algorithm of constant query complexity. In contrast, sparse graphs

on n vertices (which are represented by their adjacency lists) require property testing algorithms of

query complexity Ω(√n).

The query complexity of property testing algorithms grows as the proximity parameter ε becomes

smaller for all non-trivial properties. This dependence on ε is necessary, as a change of fewer than

εn symbols in the input cannot be detected with constant probability using fewer than O(1/ε)

queries. Many interesting properties of dense graphs can be tested using query complexity that

depends only on ε and not on the graph size n. However, the query complexity can grow

enormously fast as a function of ε. For example, for a long time the best known algorithm for

testing if a graph does not contain any triangle had a query complexity which is a tower function

of poly(1/ε), and only in 2010 was this improved to a tower function of log(1/ε). One of the

reasons for this enormous growth in bounds is that many of the positive results for property

testing of graphs are established using the Szemerédi regularity lemma, which also has tower-

type bounds in its conclusions.

1.3. Data streaming algorithms:

    In computer science, streaming algorithms are algorithms for processing data streams in which

    the input is presented as a sequence of items and can be examined in only a few passes (typically

    just one). These algorithms have limited memory available to them (much less than the input

    size) and also limited processing time per item.

    These constraints may mean that an algorithm produces an approximate answer based on a

    summary or "sketch" of the data stream in memory.


2. Related work:

2.1. Randomized algorithms:

    A randomized algorithm is an algorithm which employs a degree of randomness as part of its

    logic. The algorithm typically uses uniformly random bits as an auxiliary input to guide its

    behaviour, in the hope of achieving good performance in the "average case" over all possible

    choices of random bits. Formally, the algorithm's performance will be a random variable

    determined by the random bits; thus either the running time, or the output (or both) are random

    variables.

    One has to distinguish between algorithms that use the random input to reduce the expected

    running time or memory usage, but always terminate with a correct result in a bounded amount

    of time, and probabilistic algorithms, which, depending on the random input, have a chance of

    producing an incorrect result (Monte Carlo algorithms) or fail to produce a result (Las Vegas

    algorithms) either by signalling a failure or failing to terminate.

    In the second case, random performance and random output, the term "algorithm" for a procedure

    is somewhat questionable. In the case of random output, it is no longer formally effective.

    However, in some cases, probabilistic algorithms are the only practical means of solving a

    problem.

In common practice, randomized algorithms are approximated using a pseudorandom number

generator in place of a true source of random bits; such an implementation may deviate from the

expected theoretical behaviour.
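A classical concrete instance of a Monte Carlo algorithm in the sense above is Freivalds' check for matrix multiplication (a Python sketch added for illustration): it verifies a claimed product A·B = C in O(n²) time per trial, always accepts a correct C, and a wrong C survives each independent trial with probability at most 1/2.

    import random

    def freivalds(A, B, C, trials=20):
        """Monte Carlo check of A @ B == C for n x n matrices (lists of lists).
        A wrong C passes each trial with probability <= 1/2, so the overall
        error probability is at most 2**-trials."""
        n = len(A)
        for _ in range(trials):
            r = [random.randint(0, 1) for _ in range(n)]              # random 0/1 vector
            Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
            ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
            Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
            if ABr != Cr:
                return False     # a witness vector proves C is wrong
        return True              # probably correct

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(freivalds(A, B, [[19, 22], [43, 50]]))   # True  (correct product)
    print(freivalds(A, B, [[19, 22], [43, 51]]))   # False (with high probability)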

2.2. Approximation algorithms:

    In computer science and operations research, approximation algorithms are algorithms used to

    find approximate solutions to optimization problems. Approximation algorithms are often

    associated with NP-hard problems; since it is unlikely that there can ever be efficient polynomial-

    time exact algorithms solving NP-hard problems, one settles for polynomial-time sub-optimal

    solutions. Unlike heuristics, which usually only find reasonably good solutions reasonably fast,

    one wants provable solution quality and provable run-time bounds. Ideally, the approximation is

    optimal up to a small constant factor (for instance within 5% of the optimal solution).

    Approximation algorithms are increasingly being used for problems where exact polynomial-

    time algorithms are known but are too expensive due to the input size. A typical example for an

    approximation algorithm is the one for vertex cover in graphs: find an uncovered edge and add

    both endpoints to the vertex cover, until none remain. It is clear that the resulting cover is at most

    twice as large as the optimal one. This is a constant factor approximation algorithm with a factor


of 2.
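The greedy procedure just described is short enough to implement directly; a minimal Python sketch (added here for illustration):

    def vertex_cover_2approx(edges):
        """Greedy 2-approximation for vertex cover: repeatedly take both
        endpoints of an uncovered edge. The chosen edges form a matching,
        and any optimal cover needs one endpoint per matching edge,
        hence the factor 2."""
        cover = set()
        for u, v in edges:
            if u not in cover and v not in cover:   # edge still uncovered
                cover.add(u)
                cover.add(v)
        return cover

    # usage: a path on 4 vertices; an optimal cover is {1, 2}
    print(vertex_cover_2approx([(0, 1), (1, 2), (2, 3)]))   # {0, 1, 2, 3}

NP-hard problems vary greatly in their approximability; some, such as the bin packing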

    problem, can be approximated within any factor greater than 1 (such a family of approximation

    algorithms is often called a polynomial time approximation scheme or PTAS). Others are

    impossible to approximate within any constant, or even polynomial factor unless P = NP, such as

the maximum clique problem. NP-hard problems can often be expressed as integer programs (IP)

and solved exactly in exponential time. Many approximation algorithms emerge from the linear

    programming relaxation of the integer program. Not all approximation algorithms are suitable for

    all practical applications. They often use IP/LP/Semidefinite solvers, complex data structures or

    sophisticated algorithmic techniques which lead to difficult implementation problems. Also,

    some approximation algorithms have impractical running times even though they are polynomial

time, for example O(n^2000). Yet the study of even very expensive algorithms is not a completely

    theoretical pursuit as they can yield valuable insights. A classic example is the initial PTAS for

    Euclidean TSP due to Sanjeev Arora which had prohibitive running time, yet within a year,

    Arora refined the ideas into a linear time algorithm. Such algorithms are also worthwhile in some

    applications where the running times and cost can be justified e.g. computational biology,

    financial engineering, transportation planning, and inventory management. In such scenarios,

    they must compete with the corresponding direct IP formulations.

    Another limitation of the approach is that it applies only to optimization problems and not to

    "pure" decision problems like satisfiability, although it is often possible to conceive optimization

    versions of such problems, such as the maximum satisfiability problem (Max SAT).

    Inapproximability has been a fruitful area of research in computational complexity theory since

the 1990 result of Feige, Goldwasser, Lovász, Safra and Szegedy on the inapproximability of

    Independent Set. After Arora et al. proved the PCP theorem a year later, it has now been shown

    that Johnson's 1974 approximation algorithms for Max SAT, Set Cover, Independent Set and

    Coloring all achieve the optimal approximation ratio, assuming P != NP.

    2.3. Time Complexity Comparisons:

    In computer science, the time complexity of an algorithm quantifies the amount of time taken by

    an algorithm to run as a function of the length of the string representing the input. The time

    complexity of an algorithm is commonly expressed using big O notation, which excludes

    coefficients and lower order terms. When expressed this way, the time complexity is said to be

    described asymptotically, i.e., as the input size goes to infinity. For example, if the time required

by an algorithm on all inputs of size n is at most 5n³ + 3n, the asymptotic time complexity is

O(n³).


    Time complexity is commonly estimated by counting the number of elementary operations

    performed by the algorithm, where an elementary operation takes a fixed amount of time to

    perform. Thus the amount of time taken and the number of elementary operations performed by

    the algorithm differ by at most a constant factor.

Since an algorithm's performance time may vary with different inputs of the same size, one

commonly uses the worst-case time complexity of an algorithm, denoted as T(n), which is

    defined as the maximum amount of time taken on any input of size n. Time complexities are

    classified by the nature of the function T(n). For instance, an algorithm with T(n) = O(n) is called

a linear time algorithm, and an algorithm with T(n) = O(2ⁿ) is said to be an exponential time

    algorithm.

    Table 1 Time complexity comparison


3. Design:

3.1. Algorithm for checking sortedness of a list in sublinear time:

Input: a list of n numbers x1, x2, ..., xn.

Question: Is the list sorted or ε-far from sorted? Two different O((log n)/ε)-time testers are already known,

and Ω(log n) queries are required for every constant ε ≤ 1/2. Here we show: Ω(log n) queries are required,

for every constant ε, for every 1-sided error nonadaptive test.

A test has 1-sided error if it always accepts all YES instances.

A test is nonadaptive if its queries do not depend on answers to previous queries.

A pair (xi, xj) with i < j is violated if xi > xj.

Claim. A 1-sided error test can reject only if it finds a violated pair.

Proof: Every sorted partial list can be extended to a sorted list.

Lola's distribution is uniform over the following log n lists (the figure showing these lists is omitted here).

Claim 1. All lists above are 1/2-far from sorted.

Claim 2. Every pair (xi, xj) is violated in exactly one list above.

Let Q be the set of positions the test queries, and let a1 < a2 < ... denote the positions of Q in increasing

order. The test must find a violated pair with probability at least 2/3 when the input is picked according to

Lola's distribution. By the structure of the lists, Q contains a violated pair if and only if (ai, ai+1) is violated

for some i, so by Claim 2

Pr[Q contains a violated pair] ≤ (|Q| − 1) / log n.

If |Q| ≤ (2/3)·log n then this probability is less than 2/3, a contradiction.


3.1.1. Algorithm:

1. Pick an entry uniformly at random. Let x be the value in that entry.

2. Perform a binary search for x.

3. If the binary search reaches the entry picked in step 1, accept; otherwise, reject.

Runtime: O(log n) per trial (repeating the test O(1/ε) times yields an ε-tester).
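A compact implementation of this tester (a Python sketch; it assumes distinct values, as the analysis does, and repeats the spot-check 2/ε times, a standard choice not stated in the text above):

    import random

    def sortedness_tester(xs, eps):
        """Accepts every sorted list; rejects lists that are eps-far from
        sorted with probability >= 2/3 (assuming distinct values)."""
        n = len(xs)
        for _ in range(int(2 / eps) + 1):      # O(1/eps) random spot-checks
            i = random.randrange(n)            # 1. random entry
            x, lo, hi = xs[i], 0, n            # 2. binary search for x
            while lo < hi:
                mid = (lo + hi) // 2
                if xs[mid] == x:
                    lo = mid
                    break
                elif xs[mid] < x:
                    lo = mid + 1
                else:
                    hi = mid
            if lo != i:                        # 3. search must end at entry i
                return False
        return True

    print(sortedness_tester(list(range(100)), 0.1))          # sorted -> True
    print(sortedness_tester(list(range(100, 0, -1)), 0.1))   # reversed -> False (whp)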

3.2. Successor search in a list in sublinear time:

    Input: a sorted doubly linked list of numbers and a search value k

    Output: first element greater than k

    Data structure:

Figure 3.1. Data structure for the doubly linked list

3.2.1. Search Algorithm:

Pick √n random list elements

    Determine closest predecessor and successor of k (within the random sample)

Perform a plain, linear search towards k (starting from these two elements)

    Figure 3.2. Algorithm running status


Good case:

Figure 3.3. Good case for the algorithm

Worst case:

Figure 3.4. Worst case for the algorithm

Runtime: O(√n)
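A sketch of this search in Python (the array-of-nodes representation, the known head index, and the √n sample size are assumptions filling in details the figures above do not show):

    import math
    import random

    def successor_search(nodes, head, k):
        """Successor search on a sorted linked list in O(sqrt(n)) expected time.
        nodes[i] = (value, next_index); `head` is the index of the smallest
        element; next_index is None at the tail. Returns the first value
        greater than k, or None if there is none."""
        n = len(nodes)
        sample = random.sample(range(n), min(n, int(math.sqrt(n)) + 1))
        # closest predecessor of k within the random sample
        start = head
        for i in sample:
            if nodes[i][0] <= k and nodes[i][0] >= nodes[start][0]:
                start = i
        # plain linear search towards k from that predecessor
        i = start
        while i is not None and nodes[i][0] <= k:
            i = nodes[i][1]
        return None if i is None else nodes[i][0]

    # usage: values 0,10,...,990 stored in scrambled array order
    slots = list(range(100))
    random.shuffle(slots)                                 # array slot of each rank
    nodes = [None] * 100
    for r in range(100):
        nxt = slots[r + 1] if r + 1 < 100 else None
        nodes[slots[r]] = (10 * r, nxt)
    print(successor_search(nodes, slots[0], 737))         # -> 740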

3.3. Testing Membership of a String in a R.E.

For a fixed regular language L ⊆ {0,1}*, the testing algorithm should accept w.h.p. every

word w ∈ L, and should reject w.h.p. every word w that differs on more than εn bits

(n = |w|) from every word w′ ∈ L. The algorithm can query any bit wi of w.

    Figure 3.5. DFA for testing

3.3.1. Algorithm (simplified version):

Uniformly and independently select O(r/ε) indices 1 ≤ i ≤ n.

For each selected i, check that the substring wi ... wi+r/ε is feasible, i.e., that it can occur at that position in some word of L.

If any substring is infeasible then reject; otherwise accept.


Runtime: Õ(1/ε) queries for fixed L, independent of n.
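The feasibility check in step 2 is the only nontrivial ingredient. Below is one possible implementation of it (my own sketch, assuming the DFA for L is given as a transition table delta[state][bit] with start state 0 and a list of accepting states): it decides whether a given substring can occur at position i of some length-n word in L, using boolean matrix powers for exactly-t-step reachability.

    def bool_matmul(A, B):
        """Boolean matrix product over the DFA's state set."""
        m = len(A)
        return [[any(A[i][k] and B[k][j] for k in range(m)) for j in range(m)]
                for i in range(m)]

    def reach_exactly(A, t):
        """A^t as a boolean matrix: reachability in exactly t steps."""
        m = len(A)
        R = [[i == j for j in range(m)] for i in range(m)]   # identity
        while t:
            if t & 1:
                R = bool_matmul(R, A)
            A = bool_matmul(A, A)
            t >>= 1
        return R

    def feasible(delta, accepting, n, i, sub):
        """Can `sub` occur at position i (0-based) of a length-n word in L?"""
        if i + len(sub) > n:
            return False
        m = len(delta)
        A = [[False] * m for _ in range(m)]
        for q in range(m):
            for b in (0, 1):
                A[q][delta[q][b]] = True                 # one-step reachability
        before = reach_exactly(A, i)[0]                  # from start in i steps
        after = reach_exactly(A, n - i - len(sub))       # completions to the end
        for q in range(m):
            if not before[q]:
                continue
            p = q
            for b in sub:                                # run the DFA on the substring
                p = delta[p][b]
            if any(after[p][f] for f in accepting):
                return True
        return False

    # usage: DFA whose state is the last bit read; L = words ending in 0
    delta = [[0, 1], [0, 1]]
    print(feasible(delta, accepting=[0], n=8, i=3, sub=[1, 0]))   # True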

3.4. Searching in a sorted list:

    It is well-known that if we can store the input in a sorted array, then we can solve various

problems on the input very efficiently. However, the assumption that the input array is sorted is

not natural in typical applications. Let us now consider a variant of this problem,

    where our goal is to search for an element x in a linked sorted list containing n distinct elements.

Here, we assume that the n elements are stored in a doubly-linked list, each list element has access to

    the next and preceding element in the list, and the list is sorted (that is, if x follows y in the list,

    then y < x). We also assume that we have access to all elements in the list, which for example,

    can correspond to the situation that all n list elements are stored in an array (but the array is not

    sorted and we do not impose any order for the array elements). How can we find whether a given

number x is in our input or not? At first glance, it seems that since we do not have direct

access to the rank of any element in the list, this problem requires Ω(n) time. And indeed, if our

goal is to design a deterministic algorithm, then it is impossible to do the search in o(n)

time. However, if we allow randomization, then we can complete the search in O(√n) expected

time (and this bound is asymptotically tight). Let us first sample uniformly at random a set S of

Θ(√n) elements from the input. Since we have access to all elements in the list, we can select the

set S in O(√n) time. Next, we scan all the elements in S and in O(√n) time we can find two

elements in S, p and q, such that p ≤ x < q, and there is no element in S that is between p and q.

Observe that since the input consists of n distinct numbers, p and q are uniquely defined. Next, we

    traverse the input list containing all the input elements starting at p until we find either the sought

    key x or we find element q.

    Lemma 1:

The algorithm above completes the search in expected O(√n) time. Moreover, no algorithm can

solve this problem in o(√n) expected time.

Proof. The running time of the algorithm is equal to O(√n) plus the number of the input elements

between p and q. Since S contains Θ(√n) elements, the expected number of input elements

between p and q is O(n/|S|) = O(√n). This implies that the expected running time of the algorithm

is O(√n).

For a proof of the lower bound of Ω(√n) expected time, we refer to [1].


3.5. Geometry: Intersection of Two Polygons:

Let us consider a related problem, but this time in a geometric setting. Given two convex

polygons A and B in R², each with n vertices, determine if they intersect, and if so, then find a

    point in their intersection.

    It is well known that this problem can be solved in O(n) time, for example, by observing that it

    can be described as a linear programming instance in 2-dimensions, a problem which is known to

have a linear-time algorithm. In fact, within the same time one can either find a point that is in the

    intersection of A and B, or find a line L that separates A from B (actually, one can even find a

    bitangent separating line L, i.e., a line separating A and B which intersects with each of A and B

    in exactly one point). The question is whether we can obtain a better running time. The

    complexity of this problem depends on the input representation. In the most powerful model, if

    the vertices of both polygons are stored in an array in cyclic order, Chazelle and Dobkin showed

that the intersection of the polygons can be determined in logarithmic time. However, a standard

    geometric representation assumes that the input is not stored in an array but rather A and B are

    given by their doubly-linked lists of vertices such that each vertex has as its successor the next

    vertex of the polygon in the clockwise order. Can we then test if A and B intersect?

Figure 1: (a) Bitangent line L separating CA and CB, and (b) the polygon PA.

Chazelle et al. [2] gave an O(√n)-time algorithm that reuses the approach discussed above for

searching in a sorted list. Let us first sample uniformly at random Θ(√n) vertices from each of A and

B, and let CA and CB be the convex hulls of the sample point sets for the polygons A and

B, respectively. Using the linear-time algorithm mentioned above, in O(√n) time we can check if

CA and CB intersect. If they do, then the algorithm will get us a point that lies in the intersection

of CA and CB, and hence, this point lies also in the intersection of A and B. Otherwise, let L be the

bitangent separating line returned by the algorithm. Let a and b be the points on L that belong to

A and B, respectively. Let a1 and a2 be the two vertices adjacent to a in A. We will now define a

new polygon PA. If neither a1 nor a2 is on the same side of L as CA, then we define PA to be empty.


Otherwise, exactly one of a1 and a2 is on the CA side of L; let it be a1. We define polygon PA by

walking from a to a1 and then continuing along the boundary of A until we cross L again

(see Figure 1 (b)). In a similar way we define polygon PB. Observe that the expected size of each

of PA and PB is at most O(√n). It is easy to see that A and B intersect if and only if either A

intersects PB or B intersects PA. We only consider the case of checking if A intersects PB. We

first determine if CA intersects PB. If yes, then we are done. Otherwise, let LA be a bitangent

separating line that separates CA from PB. We use the same construction as above to determine a

subpolygon QA of A that lies on the PB side of LA. Then, A intersects PB if and only if QA

intersects PB. Since QA has expected size O(√n) and so does PB, testing the intersection of these

two polygons can be done in O(√n) expected time. Therefore, by our construction above, we

have solved the problem of determining if two polygons of size n intersect by reducing it to a

constant number of problem instances of determining if two polygons of expected size O(√n)

intersect. This leads to the following lemma.

Lemma 2 [2]. The problem of determining whether two convex n-gons intersect can be solved in

O(√n) expected time, which is asymptotically optimal. Chazelle et al. [2] not only gave this

result, but they also showed how to apply a similar approach to design a number of sublinear-

time algorithms for some basic geometric problems. For example, one can extend the result

discussed above to test the intersection of two convex polyhedra in R³ with n vertices in O(√n)

expected time. One can also approximate the volume of an n-vertex convex polytope to within a

relative error ε > 0 in expected time O(√n/ε). Or even, for a pair of points on the boundary of

a convex polytope P with n vertices, one can estimate the length of an optimal shortest path

outside P between the given points in O(√n) expected time. In all the results mentioned above,

the input objects have been represented by a linked structure: either every point has access to its

adjacent vertices in the polygon in R², or the polytope is defined by a doubly-connected edge list,

or so. These input representations are standard in computational geometry, but a natural question

is whether this is necessary to achieve sublinear-time algorithms: what can we do if the input

polygon/polytope is represented by a set of points and no additional structure is provided to the

algorithm? In such a scenario, it is easy to see that no o(n)-time algorithm can solve exactly any

of the problems discussed above. That is, for example, to determine if two polygons with n

vertices intersect one needs Ω(n) time. However, still, we can obtain some approximation to this

problem, one which is described in the framework of property testing. Suppose that we relax our

task and instead of determining if two (convex) polytopes A and B in R^d intersect, we just want

to distinguish between two cases: either A and B are intersection-free, or one has to significantly

modify A and B to make them intersection-free. The definition of the notion of "significantly

modify" may depend on the application at hand, but the most natural characterization would be to


remove at least εn points in A and B, for an appropriate parameter ε (see [3] for a discussion

about other geometric characterizations). Czumaj et al. [4] gave a simple algorithm that, for any

ε > 0, can distinguish between the case when A and B do not intersect, and the case when at least

εn points have to be removed from A and B to make them intersection-free: the algorithm returns

the outcome of a test whether a random sample of O((d/ε) log(d/ε)) points from A intersects with a

random sample of O((d/ε) log(d/ε)) points from B.

3.6. Sublinear Time Algorithms for Graph Problems

    In the previous section, we introduced the concept of sublinear-time algorithms and we presented

    two basic sublinear-time algorithms for geometric problems. In this section, we will discuss

    sublinear-time algorithms for graph problems. Our main focus is on sublinear-time algorithms for

graphs, with special emphasis on sparse graphs represented by adjacency lists, where

    combinatorial algorithms are sought.

3.6.1. Approximating the Average Degree

    Assume we have access to the degree distribution of the vertices of an undirected connected

graph G = (V,E), i.e., for any vertex v ∈ V we can query for its degree. Can we achieve a good

    approximation of the average degree in G by looking at a sublinear number of vertices? At first

sight, this seems to be an impossible task. It seems that approximating the average degree is

equivalent to approximating the average of a set of n numbers with values between 1 and n − 1,

which is not possible in sublinear time. However, Feige [5] proved that one can approximate the

average degree in O(√n/ε) time within a factor of 2 + ε. The difficulty with approximating the

average of a set of n numbers can be illustrated with the following example. Assume that almost

all numbers in the input set are 1 and a few of them are n − 1. To approximate the average we

need to approximate how many occurrences of n − 1 exist. If there is only a constant number of

them, we can do this only by looking at Ω(n) numbers in the set. So, the problem is that these large

numbers can hide in the set and we cannot give a good approximation, unless we can find at

    them, we can do this only by looking at (n) numbers in the set. So, the problem is that these largenumbers can hide in the set and we cannot give a good approximation, unless we can find at

    least some of them. Why is the problem less difficult, if, instead of an arbitrary set of numbers,

    we have a set of numbers that are the vertex degrees of a graph? For example, we could still have

a few vertices of degree n − 1. The point is that in this case any edge incident to such a vertex can

    be seen at another vertex. Thus, even if we do not sample a vertex with high degree we will see

    all incident edges at other vertices in the graph. Hence, vertices with a large degree cannot

    hide.

    We will sketch a proof of a slightly weaker result than that originally proven by Feige [5]. Let d

    denote the average degree in G = (V,E) and let dS denote the random variable for the average


degree of a set S of s vertices chosen uniformly at random from V. We will show that if we set

s ≥ √n/ε^O(1), then dS ≥ (1/2 − ε)·d with probability at least 1 − ε/64.

Additionally, we observe that the Markov inequality immediately implies that dS ≤ (1 + ε)·d with

probability at least 1 − 1/(1 + ε) ≥ ε/2. Therefore, our algorithm will pick 8/ε sets Si, each of size

s, and output the set with the smallest average degree. Hence, the probability that all of the sets Si

have too high an average degree is at most (1 − ε/2)^(8/ε) ≤ 1/8. The probability that one of them has

too small an average degree is at most (8/ε) · (ε/64) = 1/8. Hence, the output value will satisfy both

inequalities with probability at least 3/4. By replacing ε with ε/2, this will yield a (2 + ε)-

approximation algorithm.
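In code, the scheme reads as follows (a Python sketch following the analysis above; degree(v) is assumed to be an O(1) oracle query and the constants are illustrative rather than tuned):

    import math
    import random

    def average_degree_estimate(degree, n, eps):
        """Estimate of the average degree in the spirit of Feige's scheme:
        take ~8/eps independent vertex samples of size ~sqrt(n)/eps and
        return the smallest sample average."""
        s = int(math.sqrt(n) / eps) + 1              # sample size per set
        best = float("inf")
        for _ in range(int(8 / eps) + 1):            # 8/eps candidate sets
            sample = [random.randrange(n) for _ in range(s)]
            avg = sum(degree(v) for v in sample) / s
            best = min(best, avg)                    # filters large overestimates
        return best

    # usage on a star graph with n = 1000 (true average degree ~ 2)
    n = 1000
    deg = lambda v: n - 1 if v == 0 else 1
    print(average_degree_estimate(deg, n, 0.5))      # ~1, within factor 2 + eps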

    Now, our goal is to show that with high probability one does not underestimate the average

degree too much. Let H be the set of the √(εn) vertices with highest degree in G and let L = V \ H

be the set of the remaining vertices. We first argue that the sum of the degrees of the vertices in L

is at least (1/2 − ε) times the sum of the degrees of all vertices. This can be easily seen by

distinguishing between edges incident to a vertex from L and edges within H. Edges incident to a

vertex from L contribute with at least 1 to the sum of degrees of vertices in L, which is fine as

this is at least 1/2 of their full contribution. So the only edges that may cause problems are edges

within H. However, since |H| = √(εn), there can be at most |H|² = εn such edges, which is small

compared to the overall number of edges (which is at least n − 1, since the graph is connected).

Now, let dH be the degree of a vertex with the smallest degree in H. Since we aim at giving a

lower bound on the average degree of the sampled vertices, we can safely assume that all

sampled vertices come from the set L. We know that each vertex in L has a degree between 1 and

dH. Let Xi, 1 ≤ i ≤ s, be the random variable for the degree of the ith vertex from S. Then it

follows from Hoeffding bounds that the sum of the Xi is sharply concentrated around its expectation.

We know that the average degree d is at least dH·|H|/n, because any vertex in H has at least

degree dH. Hence, the average degree of a vertex in L is at least (1/2 − ε)·dH·|H|/n. This just

means E[Xi] ≥ (1/2 − ε)·dH·|H|/n. By linearity of expectation we get E[Σ_{i=1..s} Xi] ≥ s·(1/2 − ε)·dH·|H|/n.

This implies that, for our choice of s, with high probability we have dS ≥ (1/2 − ε)·d. Feige [5]

showed the following result, which is stronger with respect to the dependence on ε.


3.6.2. Minimum Spanning Trees:

    One of the most fundamental graph problems is to compute a minimum spanning tree. Since the

minimum spanning tree is of size linear in the number of vertices, no sublinear algorithm for

sparse graphs can exist. It is also known that no constant factor approximation algorithm with

o(n²) query complexity in dense graphs (even in metric spaces) exists [6]. Given these facts, it is

somewhat surprising that it is possible to approximate the cost of a minimum spanning tree in

sparse graphs [7] as well as in metric spaces [8] to within a factor of (1 + ε). In the following we

    will explain the algorithm for sparse graphs by Chazelle et al. [7]. We will prove a slightly

    weaker result than in [7]. Let G = (V,E) be an undirected connected weighted graph with

    maximum degree D and integer edge weights from {1, . . . ,W}. We assume that the graph is

    given in adjacency list representation, i.e., for every vertex v there is a list of its at most D

    neighbors, which can be accessed from v. Furthermore, we assume that the vertices are stored in

    an array such that it is possible to select a vertex uniformly at random. We assume also that the

    values of D and W are known to the algorithm.

    The main idea behind the algorithm is to express the cost of a minimum spanning tree as the

    number of connected components in certain auxiliary subgraphs of G. Then, one runs a

    randomized algorithm to estimate the number of connected components in each of these

    subgraphs.

    To start with basic intuitions, let us assume that W = 2, i.e., the graph has only edges of weight 1

    or 2. Let G(1) = (V,E(1)) denote the subgraph that contains all edges of weight (at most) 1 and let

    c(1) be the number of connected components in G(1). It is easy to see that the minimum spanning

    tree has to link these connected components by edges of weight 2. Since any connected

    component in G(1) can be spanned by edges of weight 1, any minimum spanning tree of G has

c(1) − 1 edges of weight 2 and n − 1 − (c(1) − 1) edges of weight 1. Thus, the weight of a minimum

spanning tree is (n − 1 − (c(1) − 1)) · 1 + (c(1) − 1) · 2 = n − 2 + c(1).

Next, let us consider an arbitrary integer value for W. Defining G(i) = (V,E(i)), where E(i) is the

set of edges in G with weight at most i, one can generalize the formula above to obtain that the

cost MST of a minimum spanning tree can be expressed as

MST = n − W + Σ_{i=1}^{W−1} c(i),

where c(i) denotes the number of connected components of G(i).


This gives the following simple algorithm: estimate each c(i) and plug the estimates into the formula above.

Thus, the key question that remains is how to estimate the number of connected components.

This is done by the following algorithm.
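The pseudocode displayed at this point is missing from this copy; the Python sketch below is a simplified variant of the estimator (it uses the identity that the number of connected components equals the sum over vertices v of 1/|C(v)|, truncating each BFS after 2/ε visited vertices, which changes the estimate by at most εn/2; the algorithm of [7] uses a more refined coin-flipping exploration):

    import random
    from collections import deque

    def estimate_components(adj, n, eps, samples):
        """Estimate the number of connected components within ~eps*n.
        For a uniform vertex v, E[n / |C(v)|] equals the number of
        components; each BFS is truncated at 2/eps visited vertices."""
        cap = int(2 / eps) + 1
        total = 0.0
        for _ in range(samples):
            v = random.randrange(n)
            seen = {v}
            queue = deque([v])
            while queue and len(seen) < cap:       # truncated BFS from v
                u = queue.popleft()
                for w in adj[u]:
                    if w not in seen and len(seen) < cap:
                        seen.add(w)
                        queue.append(w)
            total += 1.0 / len(seen)               # 1/|C(v)|, truncated at cap
        return n * total / samples

    def estimate_mst_cost(adj_upto, n, W, eps, samples=2000):
        """MST cost via MST = n - W + sum_{i=1}^{W-1} c(i); adj_upto(i) is
        the adjacency list of the subgraph with edge weights <= i."""
        return n - W + sum(estimate_components(adj_upto(i), n, eps, samples)
                           for i in range(1, W))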

    To analyze this algorithm let us fix an arbitrary connected component C and let |C| denote the

    number of vertices in the connected component. Let c denote the number of connected

components in G. We can write the estimator as a sum of one term per sampled vertex [equation omitted],

and by linearity of expectation we obtain its expected value [equation omitted].

To show that the estimator is concentrated around its expectation, we apply the Chebyshev inequality. Since bi

is an indicator random variable, we have Var[bi] ≤ E[bi].


With this bound for the variance, we can use the Chebyshev inequality to obtain [equation omitted].

From this it follows that one can approximate the number of connected components within

additive error of εn in a graph with maximum degree D, in time independent of n, with high

probability. The following somewhat stronger result has been obtained in [7]. Notice that the obtained

running time is independent of the input size n.

3.7. Sublinear Time Approximation Algorithms for Problems in Metric Spaces

    One of the most widely considered models in the area of sublinear time approximation

    algorithms is the distance oracle model for metric spaces. In this model, the input of an algorithm

    is a set P of n points in a metric space (P, d). We assume that it is possible to compute the

    distance d(p, q) between any pair of points p, q in constant time. Equivalently, one could assume

that the algorithm is given access to the n × n distance matrix of the metric space, i.e., we have

oracle access to the matrix of a weighted undirected complete graph. Since the full description

size of this matrix is O(n²), we will call any algorithm with o(n²) running time a sublinear

    algorithm. Which problems can and cannot be approximated in sublinear time in the distance

    oracle model? One of the most basic problems is to find (an approximation) of the shortest or the

    longest pairwise distance in the metric space. It turns out that the shortest distance cannot be

    approximated. The counterexample is a uniform metric (all distances are 1) with one distance

being set to some very small value ε. Obviously, it requires Ω(n²) time to find this single short

distance. Hence, no sublinear time approximation algorithm for the shortest distance problem

exists. What about the longest distance? In this case, there is a very simple 1/2-approximation

    algorithm, which was first observed by Indyk [6]. The algorithm chooses an arbitrary point p and

    returns its furthest neighbor q. Let r, s be the furthest pair in the metric space. We claim that d(p,

q) ≥ 1/2 · d(r, s). By the triangle inequality, we have d(r, p) + d(p, s) ≥ d(r, s). This immediately

implies that either d(p, r) ≥ 1/2 · d(r, s) or d(p, s) ≥ 1/2 · d(r, s). This shows the approximation

    guarantee. In the following, we present some recent sublinear-time algorithms for a few

    optimization problems in metric spaces.
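In code, Indyk's observation is a few lines over the distance oracle (a Python sketch added for illustration; the point set and metric are made up):

    def furthest_pair_halfapprox(points, d):
        """1/2-approximation of the diameter using n-1 oracle calls:
        fix an arbitrary point p and return its furthest neighbor.
        d(p, q) >= diameter / 2 by the triangle inequality."""
        p = points[0]
        q = max(points[1:], key=lambda x: d(p, x))
        return p, q, d(p, q)

    # usage with a concrete metric (absolute difference on the line)
    pts = [3, 141, 59, 26, 5]
    print(furthest_pair_halfapprox(pts, lambda a, b: abs(a - b)))  # (3, 141, 138)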

3.7.1. Clustering via Random Sampling

The problem of clustering large data sets into subsets (clusters) of similar characteristics is one


    of the most fundamental problems in computer science, operations research, and related fields.

Clustering problems arise naturally in various massive dataset applications, including data

mining, bioinformatics, pattern classification, etc. In this section, we will discuss uniformly

random sampling for clustering problems in metric spaces, as analyzed in two recent papers [9, 10].

Figure: (a) a set of points in a metric space; (b) its 3-clustering (white points correspond to the center points); (c) the distances used in the cost of the 3-median.


    Let us consider a classical clustering problem known as the k-median problem. Given a finite

metric space (P, d), the goal is to find a set C ⊆ P of k centers (points in P) that minimizes

cost(C) = Σ_{p∈P} d(p, C), where d(p, C) denotes the distance from p to the nearest point in C. The k-median

problem has been studied in numerous research papers. It is known to be NP-hard and there exist

constant-factor approximation algorithms running in Õ(n·k) time. In two recent papers [9, 10],

the authors asked the question about the quality of the uniformly random sampling approach to k-

median, that is, about the quality of the following generic scheme:
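The displayed box with this scheme is missing from this copy; it is simply: draw a uniform sample S from P, solve (or approximate) k-median on S, and return the centers found. A Python sketch, with a naive exhaustive solver standing in for the black-box approximation algorithm (an assumption made for brevity; real implementations would use, e.g., local search):

    import itertools
    import random

    def kmedian_cost(P, C, d):
        """k-median objective: sum of distances to the nearest center."""
        return sum(min(d(p, c) for c in C) for p in P)

    def generic_sampling_scheme(P, k, s, d):
        """1) sample s points uniformly; 2) solve k-median on the sample;
        3) return the sample's centers as a solution for the whole input."""
        S = random.sample(P, s)
        best = min(itertools.combinations(S, k),
                   key=lambda C: kmedian_cost(S, C, d))   # stand-in exact solver
        return best

    # usage: two well-separated groups on the line, k = 2
    P = ([random.gauss(0, 1) for _ in range(500)] +
         [random.gauss(50, 1) for _ in range(500)])
    C = generic_sampling_scheme(P, 2, 20, lambda a, b: abs(a - b))
    print(sorted(C))   # one center near 0 and one near 50 (whp)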

    The goal is to show that already a sublinear-size sample set S will suffice to obtain a good

approximation guarantee. Furthermore, as observed in [10] (see also [9]), in order to have any

guarantee of the approximation, one has to consider the quality of the approximation as a

function of the diameter of the metric space. Therefore, we consider a model with the diameter of

the metric space given, that is, with d : P × P → [0, Δ].

Limitations: What Cannot be Done in Sublinear Time

The algorithms discussed in the previous sections may suggest that many

optimization problems in metric spaces have sublinear-time algorithms. However, it turns out

that the problems listed in the previous sections are more like exceptions than the norm. Indeed,

most of the problems have a trivial lower bound that excludes sublinear-time algorithms. We have

already mentioned earlier in this section that the problem of approximating the cost of the lightest edge in

a finite metric space (P, d) requires Ω(n²) time, even if randomization is allowed. The other problems

for which no sublinear-time algorithms are possible include estimation of the cost of minimum-

cost matching, the cost of minimum-cost bi-chromatic matching, the cost of minimum non-

uniform facility location, the cost of k-median for k = n/2; all these problems require Ω(n²)

(randomized) time to estimate the cost of their optimal solution to within any constant factor λ.

    To illustrate the lower bounds, we give two instances of the metric spaces which are indistin-

guishable by any o(n²)-time algorithm, for which the cost of the minimum-cost matching in one

instance is greater than λ times the one in the other instance (see Figure 3). Consider a metric

space (P, d) with 2n points, n points in L and n points in R. Take a random perfect matching M

between the points in L and R, and then choose an edge e ∈ M at random. Next, define the

distance in (P, d) as follows:

d(e) is either 1 or B, where we set B = n(λ − 1) + 2,


for any e* ∈ M \ {e} set d(e*) = 1, and

for any other pair of points p, q ∈ P not connected by an edge from M, d(p, q) = n³.

It is easy to see that both instances properly define a metric space (P, d). For such problem

instances, the cost of the minimum-cost matching problem will depend on the choice of d(e): if

d(e) = B then the cost will be n − 1 + B > λ·n, and if d(e) = 1, then the cost will be n. Hence any

λ-factor approximation algorithm for the matching problem must distinguish between these two

problem instances. However, this requires finding whether there is an edge of length B, and this is

known to require Ω(n²) time, even if a randomized algorithm is used.

4. Conclusion:

It would be impossible to present a complete picture of the large body of research known in the

area of sublinear-time algorithms in such a report. In this report, my main goal was to give some

flavor of the area, of the types of results achieved, and of the techniques used. For more

details, we refer to the original works listed in the references. I also did not discuss in depth two important

areas that are closely related to sublinear-time algorithms: property testing and data streaming

algorithms.


    References:

[1] A. Czumaj and C. Sohler. Sublinear-time algorithms. Bulletin of the EATCS, 89: 23–47, June 2006.

[2] B. Chazelle, D. Liu, and A. Magen. Sublinear geometric algorithms. SIAM Journal on Computing, 35(3): 627–646, 2006.

[3] A. Czumaj and C. Sohler. Property testing with geometric queries. Proceedings of the 9th Annual European Symposium on Algorithms (ESA), pp. 266–277, 2001.

[4] A. Czumaj, C. Sohler, and M. Ziegler. Property testing in computational geometry. Proceedings of the 8th Annual European Symposium on Algorithms (ESA), pp. 155–166, 2000.

[5] U. Feige. On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM Journal on Computing, 35(4): 964–984, 2006.

[6] P. Indyk. Sublinear time algorithms for metric space problems. Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), pp. 428–434, 1999.

[7] B. Chazelle, R. Rubinfeld, and L. Trevisan. Approximating the minimum spanning tree weight in sublinear time. SIAM Journal on Computing, 34(6): 1370–1379, 2005.

[8] A. Czumaj and C. Sohler. Estimating the weight of metric minimum spanning trees in sublinear-time. Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 175–183, 2004.

[9] A. Czumaj and C. Sohler. Sublinear-time approximation for clustering via random sampling. Proceedings of the 31st Annual International Colloquium on Automata, Languages and Programming (ICALP), pp. 396–407, 2004.

[10] N. Mishra, D. Oblinger, and L. Pitt. Sublinear time approximate clustering. Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 439–447, 2001.