
Fateme Shirazi, Spring 2010. Statistical structures for Internet-scale data management. Authors: Nikos Ntarmos, Peter Triantafillou, G. Weikum




Fateme Shirazi, Spring 2010

Statistical structures for Internet-scale data management
Authors: Nikos Ntarmos, Peter Triantafillou, G. Weikum

Outline
- Introduction
- Background: hash sketches
- Computing aggregates and building histograms
- Implementation
- Results
- Conclusion

Peer-to-Peer (P2P)

File sharing in overlay networks

Millions of users (peers) provide storage and bandwidth for searching and fetching files

Motivation

In P2P file sharing, the total number of (unique) documents shared by the users is often needed

Distributed P2P search engines need to evaluate the significance of keywords, i.e., the ratio of indexed documents containing each keyword to the total number of indexed documents

Motivation (cont.)

Internet-scale information retrieval systems need a method to deduce the rank/score of data items.

Sensor networks need methods to compute aggregates

Traditionally, query optimizers rely on histograms over stored data to estimate the size of intermediate results

Overview Sketch

A large number of nodes form the system's infrastructure

Nodes contribute and/or store data items and are involved in operations such as computing synopses and building histograms

In general, queries do not affect all nodes

Aggregation functions are computed over data sets defined dynamically by a filter predicate of the query

Problem Formulation

Relevant data items are stored in unpredictable ways on a subset of all nodes

A large number of different data sets are expected to exist, stored at (perhaps overlapping) subsets of the network

Relevant queries and synopses may be built and used over any of these data sets.

Computational Model

Data stored in the P2P network is structured in relations

Each relation R consists of (k+l) attributes or columns: R(a1, ..., ak, b1, ..., bl)

A tuple's ID is either one of the attributes of the tuple, or calculated otherwise (e.g., as a combination of its attributes)

[Figure: example relation with columns attr1, attr2, attr3.]

Outline
- Introduction
- Background: hash sketches
- Computing aggregates and building histograms
- Experimental setup
- Results
- Conclusion

Distributed Hash Tables (DHTs)

A family of structured P2P network overlays exposing a hash-table-like interface (lookup service)

Examples of DHTs include Chord, Kademlia, Pastry, CAN

Any node can efficiently retrieve the value associated with a given key

Chord

Nodes are assigned identifiers from a circular ID space, computed as the hash of the node's IP address

The node-ID space is partitioned among nodes so that each node is responsible for a well-defined set (arc) of identifiers

Each item is also assigned a unique identifier from the same ID space

Each item is stored at the node whose ID is closest to the item's ID.

Hash Sketches

Hash sketches estimate the number of distinct items in a (multi-)set D of data in a database

For application domains which need to count distinct elements:

- Approximate query answering in very large databases
- Data mining on the Internet graph
- Stream processing

Hash Sketches (cont.)

A hash sketch consists of a bit vector B[] of length L

To estimate the number n of distinct elements in D, ρ(h(d)) is applied to all d ∈ D and the results are recorded in the bit vector B[0 ... L−1], where ρ(x) is the position of the least-significant 1-bit of x

[Figure: hash-sketch insertion. Data items d1 ... dn are hashed to L-bit pseudo-random numbers; ρ of each number selects which bit of the bit vector B = bL−1 ... b1 b0 is set to 1.]

Hash Sketches (cont.)

Since h() distributes values uniformly over [0, 2^L)

P(ρ(h(d)) = k) = 2^(−k−1)

If R is the position of the least-significant 0-bit in B, then 2^R ≈ n
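For concreteness, here is a minimal hash-sketch implementation in Python (the truncated-SHA-1 hash and L = 32 are illustrative assumptions, not the authors' exact choices):

```python
import hashlib

L = 32  # length of the bit vector B (an illustrative choice)

def rho(x):
    """Position of the least-significant 1-bit of x (capped at L - 1)."""
    for k in range(L):
        if x & (1 << k):
            return k
    return L - 1

def h(item):
    """Illustrative uniform hash: the first 4 bytes of SHA-1 as an integer."""
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")

def insert(B, item):
    """Record rho(h(item)) in the bit vector; duplicates have no effect."""
    B[rho(h(item))] = 1

def estimate(B):
    """2^R, where R is the position of the least-significant 0-bit in B."""
    R = next((k for k, bit in enumerate(B) if bit == 0), L)
    return 2 ** R

B = [0] * L
for d in ["d1", "d2", "d3", "d4", "d1"]:  # "d1" twice: the sketch is duplicate-insensitive
    insert(B, d)
# estimate(B) gives a coarse power-of-two approximation of the 4 distinct items
```

A single bitmap is only accurate to within roughly a factor of two; in practice many independent bitmaps are kept and their estimates averaged.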

[Figure: after inserting d1 ... d4, B = 000011 (bits b0 and b1 set); the least-significant 0-bit is at position R = 2, so |D| ≈ 2^2 = 4.]

Distributing Data Synopses

(1) the conservative but popular rendezvous-based approach

(2) the fully decentralized Distributed Hash Sketches (DHS), in which no node has any special functionality

[Figure: mapping DHS bits to DHT nodes N1 ... N56; each bit position of the sketch is mapped to its own arc of the ring.]

[Figure: DHS counting. The counting node probes the arcs bit by bit: bits > 3 not set; bit 2 not set (retrying); bit 2 not set; bit 1 not set (retrying); bit 1 set.]

Outline
- Introduction
- Background: hash sketches
- Computing aggregates and building histograms
- Experimental setup
- Results
- Conclusion

Computing Aggregates

COUNT-DISTINCT: estimation of the number of (distinct) items in a multi-set

COUNT: adding the tuple IDs to the corresponding synopsis, instead of the values of the column in question

SUM: each node locally computes the sum of the values of the column over the tuples it stores, and populates a local hash sketch accordingly
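One way the SUM bullet can be realized (the pseudo-item encoding (node_id, i) below is an assumption for illustration, not necessarily the paper's exact scheme): each node expands its local sum into that many node-unique pseudo-items, inserts them into a local sketch, and OR-merging the sketches turns a distinct-count into a sum estimate.

```python
import hashlib

L = 32  # sketch length (illustrative)

def lsb(x):
    """Position of the least-significant 1-bit (capped at L - 1)."""
    for k in range(L):
        if x & (1 << k):
            return k
    return L - 1

def h(item):
    """Illustrative uniform hash (first 4 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")

def local_sum_sketch(node_id, values):
    """Insert one node-unique pseudo-item per unit of the local sum."""
    B = 0
    for i in range(sum(values)):
        B |= 1 << lsb(h((node_id, i)))
    return B

def estimate(B):
    """2^R, where R is the least-significant 0-bit of the merged sketch."""
    R = 0
    while (B >> R) & 1:
        R += 1
    return 2 ** R

# Two nodes holding column values [3, 2] and [4]; OR-merge, then
# estimate(g) approximates SUM = 9 (coarsely, as with any single bitmap).
g = local_sum_sketch("n1", [3, 2]) | local_sum_sketch("n2", [4])
```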

AVG: estimate the SUM and COUNT of the column and then take their ratio.

COUNT-DISTINCT

Both rendezvous-based hash sketches and DHS are applicable to the estimation of the number of (distinct) items in a multi-set

Assume we want to estimate the number of distinct values in a column C of a relation R stored in our Internet-scale data management system

Counting with the Rendezvous Approach

Nodes first compute a rendezvous ID (e.g., attr1 → h() → 47)

They then compute the synopsis locally and send it to the node whose ID is closest to the rendezvous ID (the rendezvous node)

The rendezvous node is responsible for combining the individual synopses (by bitwise OR) into the global synopsis

Interested nodes can then acquire the global synopsis by querying the rendezvous node.

[Figures: rendezvous-based counting, steps 1-3.]
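The rendezvous node's combination step (the bitwise OR described above) can be sketched as follows, with bit vectors encoded as Python ints (bit k of the int ↔ B[k], an implementation choice for the example):

```python
def combine(local_sketches):
    """Bitwise-OR the peers' local synopses into the global one."""
    g = 0
    for s in local_sketches:
        g |= s
    return g

def estimate(g, L=32):
    """2^R, where R is the least-significant 0-bit of the global synopsis."""
    R = 0
    while R < L and (g >> R) & 1:
        R += 1
    return 2 ** R

# Three peers' local sketches; OR is duplicate-insensitive, so items seen
# by several peers are not double-counted.
g = combine([0b000011, 0b000001, 0b000110])  # -> 0b000111
# estimate(g) == 8 (least-significant 0-bit at position 3)
```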

Counting with DHS

In the DHS-based case, nodes storing tuples of R insert them into the DHS as follows:

(1) Nodes hash their tuples and compute ρ(hash) for each tuple

(2) For each tuple, nodes send a set-to-1 message to a random ID in the corresponding arc

(3) Counting consists of probing random nodes in arcs corresponding to increasing bit positions until a 0-bit is found.

[Figures: DHS-based counting, steps 1-3.]
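A toy model of steps (2) and (3) (arc membership and sizes are simplified assumptions here; a real DHS maps each bit's arc onto the DHT ID space):

```python
import random

L = 16  # sketch length, i.e. number of arcs (illustrative)

def make_arcs(nodes_per_arc=4):
    """Toy DHS: bit k of the distributed sketch lives on the nodes of 'arc' k."""
    return [{node: 0 for node in range(nodes_per_arc)} for _ in range(L)]

def set_bit(arcs, k, rng=random):
    """Step (2): send a set-to-1 to a random node ID in the arc for bit k."""
    node = rng.choice(sorted(arcs[k]))
    arcs[k][node] = 1

def count(arcs):
    """Step (3), simplified: scan arcs for increasing bit positions until one
    with no set bit is found (the real protocol probes random nodes, retrying
    a few times, rather than visiting every node of an arc)."""
    for k in range(L):
        if not any(arcs[k].values()):
            return 2 ** k
    return 2 ** L

arcs = make_arcs()
set_bit(arcs, 0)   # in reality k = rho(hash(tuple)) for each inserted tuple
set_bit(arcs, 1)
# count(arcs) == 4: bits 0 and 1 are set somewhere, arc 2 is all-zero
```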

Histograms

The most common technique used by commercial databases as a statistical summary

An approximation of the distribution of values in base relations.

For a given attribute/column, a histogram is a grouping of attribute values into buckets.

[Figures: example histograms over Salary and Age attributes.]

Constructing Histogram Types

Equi-Width histograms

The most basic histogram variant

Partitions the attribute value domain into cells (buckets) of equal spread

Assigns to each bucket the number of tuples whose attribute value falls within it.
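A minimal equi-width construction in Python (explicit domain bounds and the function name are illustrative choices):

```python
def equi_width_histogram(values, lo, hi, num_buckets):
    """Partition the value domain [lo, hi) into buckets of equal spread and
    count the tuples whose attribute value falls into each bucket."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        b = min(int((v - lo) / width), num_buckets - 1)  # clamp hi into last bucket
        counts[b] += 1
    return counts

# e.g. values in [0, 10) split into two buckets of spread 5:
# equi_width_histogram([1, 2, 3, 7, 8], 0, 10, 2) == [3, 2]
```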

Other Histogram Types

Average Shifted Equi-Width histograms (ASH)

Consist of several Equi-Width histograms with different starting positions in value space

The frequency of each value in a bucket is computed as the average of the estimations given by the individual histograms

Equi-Depth histograms

In an Equi-Depth histogram all buckets have equal frequencies but not (necessarily) equal spreads
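By contrast with the equi-width case, an equi-depth construction picks the bucket boundaries from the sorted data so that each bucket holds roughly the same number of tuples (a minimal sketch; a production system would estimate these quantiles from synopses rather than sorting all data):

```python
def equi_depth_boundaries(values, num_buckets):
    """Boundaries splitting the sorted values into num_buckets buckets of
    (roughly) equal frequency; the buckets' spreads may differ."""
    s = sorted(values)
    n = len(s)
    # upper boundary of each bucket except the last
    return [s[(i * n) // num_buckets] for i in range(1, num_buckets)]

# Skewed data yields narrow buckets where values are dense:
# equi_depth_boundaries([1, 2, 3, 4, 5, 6, 7, 8], 4) == [3, 5, 7]
```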

Outline
- Introduction
- Background: hash sketches
- Computing aggregates and building histograms
- Implementation
- Results
- Conclusion

Implementation

1. Generating the workload

2. Populating the network with peers

3. Randomly assigning data tuples from the base data to nodes in the overlay

4. Inserting all nodes into the P2P overlay

5. Selecting random nodes, reconstructing histograms, and computing aggregates

Measures of Interest

To consider:

(1) The fairness of the load distribution across nodes in the network

(2) The accuracy of the estimation itself

(3) The number of hops needed to do the estimation

To show the trade-off of scalability vs. performance/load distribution between the DHS and rendezvous-based approaches

Fairness

To compute fairness, the load on any given node is measured as the insertion/query/probe hits on that node, i.e., the number of times this node is the target of insertion/query/probe operations

Several metrics are used. More specifically:

- The Gini coefficient
- The fairness index
- Maximum and total loads for the DHS- and rendezvous-based approaches

The Gini Coefficient

The mean of the absolute difference of every possible pair of node loads.

Takes values in the interval [0, 1), where a GC value of 0.0 is the best possible state, with 1.0 being the worst

The Gini Coefficient roughly represents the amount of imbalance in the system
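Using the standard pairwise formulation (the normalization by twice the mean is the textbook definition, implied but not spelled out on the slide):

```python
def gini(loads):
    """Gini coefficient of a load vector: mean absolute difference over all
    ordered pairs, normalized by twice the mean load."""
    n = len(loads)
    mean = sum(loads) / n
    if mean == 0:
        return 0.0  # no load at all: treat as perfectly balanced
    pair_diffs = sum(abs(a - b) for a in loads for b in loads)
    return pair_diffs / (2 * n * n * mean)

# gini([1, 1, 1, 1]) == 0.0   (perfect balance)
# gini([0, 0, 0, 4]) == 0.75  (one node carries the entire load)
```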

Gini = A/(A+B)

[Figure: Lorenz curve, with A the area between the line of equality and the curve, and B the area below the curve.]

Estimation Error

The mean error of the estimation is reported

Computed as the percentage by which the distributed estimation differs from the aggregate computed in a centralized manner

(i.e. as if all data were stored on a single host)

Hop-count Costs

The per-node average hop count for inserting all tuples into the distributed synopsis is measured; the per-node hop-count costs are higher for the DHS-based approach.

Outline
- Introduction
- Background
- Computing aggregates and building histograms
- Implementation
- Results
- Conclusion

Results

The hop-count efficiency and the accuracy of rendezvous-based hash sketches and of the DHS are measured

Initially, single-attribute relations are created, with integer values in the interval [0, 1000)

following either a uniform distribution (depicted as a Zipf with θ equal to 0.0) or a shuffled Zipf distribution with θ equal to 0.7, 1.0, and 1.2

[Figure: total query load (node hits) over time.]

Load Distribution

The extra hop-count cost of the DHS-based approach pays off when it comes to load-distribution fairness

The load on a node is the number of times it is visited (a.k.a. node hits) during data insertion and/or query processing.

[Figure: Gini coefficient for the rendezvous and DHS approaches.]

Evolution of the Gini Coefficient

In the rendezvous-based approach a single node carries all the query load

The DHS-based approaches converge to a GC of about 0.5, which equals the GC of the distribution of the distances between consecutive nodes in the ID space

These are thus the best values attainable by any algorithm using randomized assignment of items to nodes.

[Figure: evolution of the Gini coefficient.]

Error for Computing the COUNT Aggregate

[Figure: estimation error for the rendezvous and DHS approaches.]

In both cases the error is due to the use of hash sketches. Both approaches exhibit the same average error. As expected, the higher the number of bitmaps in the synopsis, the better the accuracy.

Insertion Hop Count

[Figure: insertion hop count for the rendezvous and DHS approaches.]

The insertion hop-count cost for all aggregates

Hop-count costs are higher for the DHS-based approach, by approximately 8, for both the insertion and query cases

Outline
- Introduction
- Background: hash sketches
- Computing aggregates and building histograms
- Experimental setup
- Results
- Conclusion

Conclusion

A framework for distributed statistical synopses for Internet-scale networks such as P2P systems

Extending techniques from centralized settings towards distributed settings

Developing DHT-based higher-level synopses like Equi-Width, ASH, and Equi-Depth histograms

Conclusion (cont.)

A fully distributed cardinality estimator, providing scalability, efficiency, and accuracy

It can be constructed efficiently, scales well with growing network size, and has high accuracy

It provides a trade-off between accuracy and construction/maintenance costs

Totally balanced (access and maintenance) load on nodes

Future Research

Examining auto-tuning capabilities for the histogram inference engine

Integrating it with Internet- scale query processing systems

Looking into implementations for other types of synopses, aggregates, and histogram variants

Finally, using these tools for approximate query answering

Thank you