Presentation overview PIER
- Core functionality and design principles
- Distributed join example
- CAN high level functions
- Application, PIER, DHT/CAN in detail
- Distributed join / operations in detail
- Simulation results
- Conclusion
What is PIER? Peer-to-Peer Information Exchange and Retrieval
- A distributed query engine for widely distributed environments
- A "general data retrieval system" usable with any source
- PIER internally uses a relational database format
- Read-only system
PIER overview (based on CAN), three tiers:
- Tier 1, Applications: Network Monitoring, other user apps
- Tier 2, PIER: Core Relational Execution Engine, Catalog Manager, Query Optimizer
- Tier 3, DHT: DHT Wrapper, Storage Manager, Overlay Routing, over the IP network
Relaxed Consistency
Brewer states that a distributed data system can only have two of the following three properties:
1. (C)onsistency
2. (A)vailability
3. Tolerance of network (P)artitions
PIER prioritizes A and P and sacrifices C, i.e. best-effort results.
Detailed in the distributed join part.
Scalability
- The amount of work scales with the number of nodes
- Network can grow easily
- Robust
- PIER doesn't require a-priori allocation of resources
Data sources
- Data remains in its original source
- Could be anything: a file system, a live feed from a process
- Wrappers or gateways have to be provided: Source -> Wrapper -> Tuple
Standard schemas
- Design goal for the application layer
- Pro: bypasses the standardization process
- Con: limited by current applications
Example: popular software (e.g. tcpdump) produces records such as

IP            | Payload  | TimeStamp
131.54.78.128 | 10101001 | 03-02-2004 18:24:50

which a wrapper turns into tuples of the relation

IP            | Payload
131.54.78.128 | 10101001
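The wrapper step above can be sketched in a few lines. This is a hypothetical illustration, not PIER's actual wrapper code: it assumes a tcpdump-like text record and a standard schema of (IP, Payload), dropping fields the schema does not define.

```python
# Hypothetical wrapper sketch: read a source record (here a text line,
# an illustrative assumption) and emit a tuple matching the standard
# relation schema (IP, Payload), discarding the TimeStamp field.
def wrap(line):
    """Turn 'ip payload timestamp' into a relation tuple (ip, payload)."""
    ip, payload, _timestamp = line.split(maxsplit=2)
    return (ip, payload)

t = wrap("131.54.78.128 10101001 03-02-2004 18:24:50")
# t == ("131.54.78.128", "10101001")
```

A real wrapper would additionally register its namespace with the DHT layer, as described on the next slide.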
PIER storage overview
- PIER is independent of the DHT; currently it is CAN.
- Currently using multicast; other strategies are possible. (Further explained in the DHT part.)
- Source: some data storage (a Windows file etc.). The application creates, updates and deletes source data.
- The wrapper reads the source data. The wrapper is preconfigured with knowledge about relation X: its namespace and multicast group. It informs the DHT layer that it has data belonging to namespace X.
- Tuple schema and namespace: DHT key = namespace || resourceID, where resourceID = tuple key (a unique key).
- The DHT layer informs the PIER layer that relation X (its namespace and multicast group) exists. The DHT creates multicast group X.
- The PIER layer keeps track of all relations (for each relation: namespace, multicast group).
- The application layer (query applications X and Y) issues a query; PIER turns it into DHT queries, i.e. requests for data.
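The registration flow above can be sketched as follows. All class and method names here are hypothetical stand-ins for the layer interactions the slide describes, shrunk to a single process:

```python
# Sketch (all names hypothetical) of the registration flow: the wrapper
# tells the DHT layer it holds data for namespace "X"; the DHT layer
# creates the multicast group and informs the PIER layer, which records
# the relation in its catalog.
class PIERLayer:
    def __init__(self):
        self.catalog = {}                      # relation name -> multicast group

    def relation_exists(self, namespace, multicast_group):
        self.catalog[namespace] = multicast_group

class DHTLayer:
    def __init__(self, pier):
        self.pier = pier
        self.multicast_groups = {}

    def announce(self, namespace):
        # DHT creates multicast group X and informs the PIER layer.
        self.multicast_groups.setdefault(namespace, set())
        self.pier.relation_exists(namespace, multicast_group=namespace + "-group")

pier = PIERLayer()
dht = DHTLayer(pier)
dht.announce("X")                              # wrapper informs the DHT layer
# pier.catalog == {"X": "X-group"}
```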
Presentation overview PIER
- Core functionality and design principles
- Distributed join example
- CAN high level functions
- Application, PIER, DHT/CAN in detail
- Distributed join / operations in detail
- Simulation results
- Conclusion
DHT based distributed join: example
• Distributed execution of relational database query operations, e.g. join, is the core functionality of the PIER system.
• A distributed join is a relational database join performed to some degree in parallel by a number of processors (machines) containing different parts of the relations on which the join is performed.
• Perspective: generic intelligent "keyword" based search built on distributed database query operations (e.g. like Google).
• The following example illustrates what PIER can do, and thus the main purpose of PIER, by means of a distributed join based on the DHT. Details of how, and of which layer is doing what, are provided later.
DHT based distributed join example: relational database join proper (1/2)
Relational database join of R and S on CPR:
SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr
Join result:
CPR         | Name             | Address
311056-0485 | Peter Hansen     | Århus
200770-2312 | Kirsten Andersen | København
DHT based distributed join example: relational database join proper (2/2)
Relation R (primary key = DHT resource id: A1 || A2 || CPR || A4; attributes selected in the join: CPR (join attribute), Name):
A1  | A2  | CPR         | A4  | A5  | Name             | A7
aaa | bbb | 311056-0485 | ccc | ddd | Peter Hansen     | eee
fff | ggg | 200770-2312 | hhh | iii | Kirsten Andersen | jjj
Relation S (primary key = DHT resource id: A2 || A3 || A4; attributes selected in the join: Address, CPR (join attribute)):
Address   | A2  | A3  | A4  | CPR
København | bbb | aaa | ccc | 200770-2312
Århus     | ggg | fff | hhh | 311056-0485
DHT based distributed join example (1/9)
Fig. 5.01: PIER: DHT-based pipelining symmetric hash join: R-table in DHT.
Entire DHT network; namespace NR = multicast group R-group.
Relation R:
A1  | A2  | CPR | A4  | A5  | Name | A7
aaa | bbb | 7   | ccc | ddd | xxx  | eee
fff | ggg | 9   | hhh | iii | yyy  | jjj
For each relation (table) R: the tuples of the relation are distributed and stored among some subset of nodes, which is assigned a specific namespace (NR) and a specific multicast group (R-group). The DHT layer knows all relations (tables), namespaces and multicast groups.
DHT based distributed join example (2/9)
Fig. 5.02: PIER: DHT-based pipelining symmetric hash join: S-table in DHT.
Entire DHT network; namespace NS = multicast group S-group.
Relation S:
Address | A2  | A3  | A4  | CPR
xxx     | bbb | aaa | ccc | 9
zzz     | ggg | fff | hhh | 7
DHT based distributed join example (3/9)
Fig. 5.03: PIER: DHT-based pipelining symmetric hash join: relational database join operation to be performed.
Entire DHT network; namespace NR = multicast group R-group; namespace NS = multicast group S-group.
Relational database join of R and S on CPR:
SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr
Join result [4b]:
CPR | Name | Address
7   | xxx  | xxx
9   | yyy  | zzz
(Relations R and S as on the previous slides.)
DHT based distributed join example (4/9)
Fig. 5.04: PIER: DHT-based pipelining symmetric hash join: step 1: activate operation on nodes in NR.
Node-7, originator of join:
[1]. multicast(R-group, "oper-5", "Select xxx");
DHT based distributed join example (5/9)
Fig. 5.05: PIER: DHT-based pipelining symmetric hash join: step 3: generate and push intermediate data to DHT on nodes in NR.
Node in NR, [3R]: build table: lscan R; for each local tuple:
[3Ra]: extract R.cpr, R.name. Create intermediate tuple (R.cpr, R.name, "R"). Create instanceID as a random number: random_no = random();
[3Rb]: PUSH: put("NQ", R.cpr, random_no, (R.cpr, R.name, "R"), "60 minutes").
R derived intermediate tuple [3Ra/b]:
CPR | Name | Source table name
9   | yyy  | R
DHT based distributed join example (6/9)
Fig. 5.06: PIER: DHT-based pipelining symmetric hash join: steps 2 and 3: activate operation, then generate and push intermediate data to DHT on nodes in NS. Note: step 3 is executed in parallel on nodes in NR and NS.
Node-7, originator of join:
[2]. multicast(S-group, "oper-5", "Select xxx");
Node in NS, [3S]: build table: lscan S; for each local tuple:
[3Sa]: extract S.cpr, S.address. Create intermediate tuple (S.cpr, S.address, "S"). Create instanceID as a random number: random_no = random();
[3Sb]: PUSH: put("NQ", S.cpr, random_no, (S.cpr, S.address, "S"), "60 minutes").
S derived intermediate tuple [3Sa/b]:
CPR | Address | Source table name
9   | zzz     | S
DHT based distributed join example (7/9)
Fig. 5.07: PIER: DHT-based pipelining symmetric hash join: step 4: pull intermediate data and perform join on nodes in namespace NQ.
Node in NQ, [4]: probe table: newData("NQ", item) is called by the DHT layer. For each call of newData:
[4a]: PULL: get("NQ", item.resourceID);
[4b]: if both an R derived tuple and an S derived tuple are returned:
  o extract R.cpr, R.name and S.address.
  o create the join result tuple (R.cpr, R.name, S.address).
Join result [4b]:
CPR | Name | Address
7   | xxx  | xxx
9   | yyy  | zzz
DHT based distributed join example (8/9)
Fig. 5.08: PIER: DHT-based pipelining symmetric hash join: step 5: join result tuples are pushed (sent) to the originator of the join.
Node in NQ, [5]: probe table: PUSH: send("oper-5", hashkey(Node-7), (R.cpr, R.name, S.address)).
Node-7, originator of join:
[5]. receive("oper-5", DHT_item) (one call for each tuple in the result);
DHT based distributed join example (9/9)
Fig. 5.09: PIER: DHT-based pipelining symmetric hash join: overview of the entire distributed join operation.
Node-7, originator of join:
[1]. multicast(R-group, "oper-5", "Select xxx");
[2]. multicast(S-group, "oper-5", "Select xxx");
[5]. receive("oper-5", DHT_item) (one call for each tuple in the result);
Node in NR, [3R]: build table: lscan R; for each local tuple:
[3Ra]: extract R.cpr, R.name; create intermediate tuple (R.cpr, R.name, "R"); create instanceID: random_no = random();
[3Rb]: PUSH: put("NQ", R.cpr, random_no, (R.cpr, R.name, "R"), "60 minutes").
Node in NS, [3S]: build table: lscan S; for each local tuple:
[3Sa]: extract S.cpr, S.address; create intermediate tuple (S.cpr, S.address, "S"); create instanceID: random_no = random();
[3Sb]: PUSH: put("NQ", S.cpr, random_no, (S.cpr, S.address, "S"), "60 minutes").
Node in NQ, [4,5]: probe table: newData("NQ", item) is called by the DHT layer. For each call of newData:
[4a]: PULL: get("NQ", item.resourceID);
[4b]: if both an R derived tuple and an S derived tuple are returned: extract R.cpr, R.name and S.address;
[5]: PUSH: send("oper-5", hashkey(Node-7), (R.cpr, R.name, S.address)).
(Relations R and S, intermediate tuples and join result as on the previous slides.)
PIER's use of the DHT layer in the distributed join:
- Addressing of a table by multicast group.
- Push/pull technique (intermediate results).
- Intermediate data used as a queue; the high degree of distribution and parallelism makes network delay much less important.
- Hashing on the primary key (rehashing) ensures related tuples end up on the same node (content addressable network / store based on hashed key).
- The DHT is used as the exchange medium between nodes (in database terms).
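As a toy illustration of steps [3R], [3S] and [4] above, the whole exchange can be simulated in a single process, with a Python dict standing in for namespace NQ. All names and values are illustrative; in real PIER these steps run on different nodes through the DHT Provider's put()/get():

```python
# Sketch of the pipelining symmetric hash join, with the DHT replaced
# by a dict keyed by the join attribute (cpr). Toy data follows the
# R and S tables of slides (1/9) and (2/9).
from collections import defaultdict

R = [("aaa", "bbb", 7, "ccc", "ddd", "xxx", "eee"),
     ("fff", "ggg", 9, "hhh", "iii", "yyy", "jjj")]
S = [("xxx", "bbb", "aaa", "ccc", 9),
     ("zzz", "ggg", "fff", "hhh", 7)]

NQ = defaultdict(list)                  # stand-in for namespace NQ in the DHT

# [3R] nodes in NR: push (cpr, name, "R") intermediate tuples.
for a1, a2, cpr, a4, a5, name, a7 in R:
    NQ[cpr].append((cpr, name, "R"))

# [3S] nodes in NS: push (cpr, address, "S") intermediate tuples.
for address, a2, a3, a4, cpr in S:
    NQ[cpr].append((cpr, address, "S"))

# [4] nodes in NQ: probe; wherever an R tuple and an S tuple share a
# cpr, emit a join result tuple (cpr, name, address).
result = []
for cpr, items in NQ.items():
    r_side = [t for t in items if t[2] == "R"]
    s_side = [t for t in items if t[2] == "S"]
    for _, name, _tag_r in r_side:
        for _, address, _tag_s in s_side:
            result.append((cpr, name, address))
```

Because all intermediate tuples for one cpr land under the same key, the probe step needs no further communication; that is the point of rehashing on the join attribute.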
Presentation overview PIER
- Core functionality and design principles
- Distributed join example
- CAN high level functions
- Application, PIER, DHT/CAN in detail
- Distributed join / operations in detail
- Simulation results
- Conclusion
CAN
- CAN is a DHT.
- Basic operations on CAN are insertion, lookup and deletion of (key, value) pairs.
- Each CAN node stores a chunk (called a zone) of the entire hash table.
- Every node is responsible for a small number of "adjacent" entries (its zone) in the table.
- Requests (insert, lookup, delete) for a particular key are routed by intermediate CAN nodes towards the CAN node that contains the key.
- The design centers around a virtual d-dimensional Cartesian coordinate space on a d-torus.
- The coordinate space is completely virtual and has no relation to any physical coordinate system.
- The keyspace is probably SHA-1.
- Key k1 is mapped onto a point p1 using a uniform hash function.
- (k1, v1) is stored at the node Nx that owns the zone containing p1.
Retrieving a value for a given key
- Apply the deterministic hash function to map the key onto a point P, then retrieve the corresponding value from the node owning P.
- One hash function per dimension.
get(key) -> value:
- lookup(key) -> ip address
- ip.retrieve(key) -> value
Storing (key, value)
- The key is deterministically mapped onto a point P in the coordinate space using a uniform hash function.
- The corresponding (key, value) pair is then stored at the node that owns the zone within which the point P lies.
put(key, value):
- lookup(key) -> ip address
- ip.store(key, value)
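The two-step put()/get() pseudocode above can be sketched as follows. Node addressing is simplified to a modulo over a fixed node list, which is an assumption for illustration; real CAN hashes keys to points and routes through its zone space:

```python
# Sketch of put/get built on lookup: lookup(key) resolves the owning
# node, then store/retrieve runs on that node.
class Node:
    def __init__(self):
        self.table = {}

    def store(self, key, value):
        self.table[key] = value

    def retrieve(self, key):
        return self.table.get(key)

NODES = [Node() for _ in range(4)]

def lookup(key):
    """Route to the node responsible for key (modulo hashing, illustrative)."""
    return NODES[hash(key) % len(NODES)]

def put(key, value):
    lookup(key).store(key, value)

def get(key):
    return lookup(key).retrieve(key)

put("k1", "v1")
# get("k1") == "v1"
```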
Routing
- A CAN node maintains a routing table that holds the IP address and virtual coordinate zone of each of its immediate neighbors in the coordinate space.
- Using its neighbor coordinate set, a node routes a message towards its destination by simple greedy forwarding to the neighbor with coordinates closest to the destination coordinates.
Figure: routing example in a 2-d coordinate space spanning (0,0) to (16,16).
- A node maintains a routing table with its neighbors.
- A request for Key = (15,14) follows the straight-line path through the Cartesian space to the zone holding the data.
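The greedy forwarding in the figure can be sketched on a 16x16 torus. Zones are shrunk to single grid cells here, an assumption for illustration (real CAN zones are rectangles of varying size, and each hop goes to a neighbor zone):

```python
# Greedy-routing sketch on a 2-d torus: each hop moves one step along
# each coordinate in the direction that reduces torus distance to the key.
SIZE = 16

def torus_step(a, b):
    """Signed single step from a towards b on a ring of length SIZE."""
    d = (b - a) % SIZE
    if d == 0:
        return 0
    return 1 if d <= SIZE // 2 else -1

def route(src, key):
    """Return the sequence of grid cells visited from src to key."""
    path = [src]
    x, y = src
    while (x, y) != key:
        x = (x + torus_step(x, key[0])) % SIZE
        y = (y + torus_step(y, key[1])) % SIZE
        path.append((x, y))
    return path

hops = route((0, 0), (15, 14))
# The torus wrap-around makes this short: (0,0) -> (15,15) -> (15,14).
```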
Node joining
- A new node must first find a node already in the CAN.
- Next, using the CAN routing mechanisms, it must find a node whose zone will be split.
- Neighbors of the split zone must be notified so that routing can include the new node.
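The zone split at the heart of a join can be sketched as halving a rectangle. The landmark discovery, routing to the target zone, and neighbor notification are elided; the class and the 16x16 bootstrap are illustrative assumptions:

```python
# Zone-split sketch for a CAN node join: the chosen zone is halved
# along its longer axis, and the joining node takes one half.
class Zone:
    def __init__(self, x0, x1, y0, y1):
        self.x0, self.x1, self.y0, self.y1 = x0, x1, y0, y1

    def contains(self, p):
        return self.x0 <= p[0] < self.x1 and self.y0 <= p[1] < self.y1

    def split(self):
        """Halve this zone along its longer axis; return the new half."""
        if self.x1 - self.x0 >= self.y1 - self.y0:
            mid = (self.x0 + self.x1) // 2
            new = Zone(mid, self.x1, self.y0, self.y1)
            self.x1 = mid
        else:
            mid = (self.y0 + self.y1) // 2
            new = Zone(self.x0, self.x1, mid, self.y1)
            self.y1 = mid
        return new

whole = Zone(0, 16, 0, 16)   # a single node bootstraps the CAN
half = whole.split()         # a joining node takes over one half
# keys hashing into the new half, e.g. (15, 14), now belong to the joiner
```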
Node departure
- A node explicitly hands over its zone and the associated (key, value) database to one of its neighbors.
- In case of network failure this is handled by a take-over algorithm.
- Problem: the take-over mechanism does not provide regeneration of data.
- Solution: every node keeps a backup of its neighbours.
Presentation overview PIER
- Core functionality and design principles
- Distributed join example
- CAN high level functions
- Application, PIER, DHT/CAN in detail
- Distributed join / operations in detail
- Simulation results
- Conclusion
Querying the Internet with PIER
(PIER = Peer-to-peer Information Exchange and Retrieval)
What is a DHT?
• Take an abstract ID space, and partition it among a changing set of computers (nodes).
• Given a message with an ID, route the message to the computer currently responsible for that ID.
• Messages can be stored at the nodes. This works like a "distributed hash table" and provides a put()/get() API.
Figure: a 2-d ID space from (0,0) to (16,16); data with Key = (15,14) is reached by routing the message to the computer currently responsible for that ID.
Lots of effort is put into making DHTs better:
• Scalable (thousands to millions of nodes)
• Resilient to failure
• Secure (anonymity, encryption, etc.)
• Efficient (fast access with minimal state)
• Load balanced
• etc.
Architecture figure (based on CAN): a declarative query such as
SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr
enters at the Applications tier (Network Monitoring, other user apps), is turned into a query plan by PIER (Core Relational Execution Engine, Catalog Manager, Query Optimizer), and is executed through the DHT (DHT Wrapper, Storage Manager, Overlay Routing) on the overlay network above the physical IP network.
Applications
- Any distributed relational database application.
- Feasible applications: network monitoring, intrusion detection, fingerprint queries, load monitoring of CPUs, split zone.
DHTs
- Implemented with CAN (Content Addressable Network).
- A node is identified by a rectangle in d-dimensional space.
- A key is hashed to a point and stored in the corresponding node.
- A routing table of neighbours is maintained: O(d) state.
DHT design
- Routing layer: mapping for keys (dynamic as nodes leave and join).
- Storage manager: DHT based data.
- Provider: storage access interface for the higher levels.
DHT – Routing
The routing layer maps a key into the IP address of the node currently responsible for that key. It provides exact lookups and calls back the higher levels when the set of keys has changed.
Routing layer API:
- lookup(key) -> ipaddr (asynchronous)
- join(landmarkNode), leave() (synchronous, local node)
- locationMapChange() (callback)
DHT – Storage
The storage manager stores and retrieves records, which consist of key/value pairs. Keys are used to locate items and can be any supported data type or structure.
Storage Manager API:
- store(key, item)
- retrieve(key) -> items (structure)
- remove(key)
DHT – Provider (1)
The provider ties the routing and storage manager layers together and provides an interface to them.
Each object in the DHT has a namespace, resourceID and instanceID:
- DHT key = (hash1(namespace, resourceID), ..., hashN(namespace, resourceID)), one hash function per dimension.
- namespace: application or group of objects, table or relation.
- resourceID: primary key or any attribute (object).
- instanceID: an integer, to separate items with the same namespace and resourceID.
- lifetime: item storage duration, in adherence to the principle of relaxed consistency.
CAN's mapping of resourceID/object is equivalent to an index. The keyspace size depends on the dimension; on CAN it is 0..2^160.
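The per-dimension key construction above can be sketched as follows. The use of hashlib/SHA-1, the per-dimension salt, and the coordinate size are assumptions, chosen to match the earlier "keyspace is probably SHA-1" note:

```python
# Sketch: one hash per CAN dimension, each mapping (namespace,
# resourceID) to one coordinate of the point that owns the object.
import hashlib

DIMENSIONS = 2
COORD_SPACE = 2 ** 16            # size of each coordinate axis (illustrative)

def can_point(namespace, resource_id):
    """Map (namespace, resourceID) to a d-dimensional CAN point."""
    point = []
    for dim in range(DIMENSIONS):
        digest = hashlib.sha1(f"{dim}:{namespace}:{resource_id}".encode()).digest()
        point.append(int.from_bytes(digest[:8], "big") % COORD_SPACE)
    return tuple(point)

p = can_point("NR", "311056-0485")
# deterministic: the same (namespace, resourceID) always maps to the same point
```

Because the mapping is deterministic, any node can compute where an object lives, which is what makes the resourceID mapping act like an index.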
DHT – Provider (2)
Provider API:
- get(namespace, resourceID) -> item
- put(namespace, resourceID, item, lifetime)
- renew(namespace, resourceID, instanceID, lifetime) -> bool
- multicast(namespace, resourceID, item)
- lscan(namespace) -> items (structure/iterator)
- newData(namespace, item) (callback)
Figure: table R (one namespace) with tuples 1..m; tuples 1..n are stored on node R1 and tuples n+1..m on node R2, each as an (rID, item) pair.
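A minimal in-memory sketch of this Provider API is below. Routing is elided (one local store stands in for the whole DHT), and the lifetime bookkeeping is a hypothetical rendering of the "item storage duration" rule; the real Provider spreads the store across nodes:

```python
# Hypothetical single-node sketch of the Provider API: items are keyed
# by (namespace, resourceID, instanceID) and expire after their
# lifetime, matching the relaxed-consistency soft-state idea.
import time

class Provider:
    def __init__(self):
        # (namespace, resourceID, instanceID) -> (item, expiry time)
        self.store = {}

    def put(self, namespace, resource_id, instance_id, item, lifetime):
        self.store[(namespace, resource_id, instance_id)] = (item, time.time() + lifetime)

    def get(self, namespace, resource_id):
        now = time.time()
        return [item for (ns, rid, _iid), (item, exp) in self.store.items()
                if ns == namespace and rid == resource_id and exp > now]

    def renew(self, namespace, resource_id, instance_id, lifetime):
        key = (namespace, resource_id, instance_id)
        if key not in self.store:
            return False
        item, _ = self.store[key]
        self.store[key] = (item, time.time() + lifetime)
        return True

    def lscan(self, namespace):
        now = time.time()
        return [item for (ns, _rid, _iid), (item, exp) in self.store.items()
                if ns == namespace and exp > now]

p = Provider()
p.put("NQ", "7", 42, ("7", "xxx", "R"), lifetime=3600)
# p.get("NQ", "7") == [("7", "xxx", "R")]
```

Unrefreshed items simply age out, which is how the DHT sheds state without coordinated deletes.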
Query Processor
How does it work?
• Performs selection, projection, joins, grouping and aggregation -> operators.
• Push and pull ways of operating.
• Simultaneous execution of multiple operators pipelined together.
• Results are produced and queued as quickly as possible.
How does it modify data?
• Inserts, updates and deletes items via the DHT interface.
How does it select data to process?
• Dilated-reachable snapshot: the data published by reachable nodes at the query arrival time.
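The pipelining of operators can be sketched with Python generators, an illustrative stand-in for PIER's operator framework: each operator yields tuples downstream as soon as they are available, rather than materializing its whole input:

```python
# Pipelined-operator sketch: selection and projection composed as
# generators, so a tuple flows through the full plan immediately.
def selection(tuples, pred):
    for t in tuples:
        if pred(t):
            yield t

def projection(tuples, cols):
    for t in tuples:
        yield tuple(t[c] for c in cols)

R = [("aaa", 7, "xxx"), ("fff", 9, "yyy")]

# Plan: sigma(cpr > 7) then pi(cpr, name).
plan = projection(selection(R, lambda t: t[1] > 7), cols=(1, 2))
# next(plan) == (9, "yyy"); results are produced as soon as available
```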
Figure: a composed query plan (R ⋈ S) ⋈ (T ⋈ Q), joined on 4b1.CPR = 4b2.CPR, with temporary data pushed to and pulled from namespaces in the DHT:
Join result [4b1] of R ⋈ S:
CPR | Name | Address
7   | xxx  | xxx
9   | yyy  | zzz
Relation T:
A1  | A2  | CPR | Dptno | A5  | Name | A7
aaa | bbb | 7   | 1     | ddd | xxx  | eee
fff | ggg | 9   | 2     | iii | yyy  | jjj
Relation Q:
Deptname | A2  | Total Salary | Worked | Deptno
xxx      | bbb | 30,000       | 30     | 1
zzz      | ggg | 40,000       | 40     | 2
Join result [4b2] of T ⋈ Q (after grouping & aggregation):
CPR | DeptName | Salary
7   | xxx      | 30,000
9   | yyy      | 40,000
Final join result [4c]:
CPR | Name | Deptname | Salary | Address
7   | xxx  | xxx      | 30,000 | xxx
9   | yyy  | yyy      | 40,000 | zzz
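The composition above can be sketched with one generic hash-join helper, applied to the two intermediate results. The helper and the naive tuple concatenation are illustrative assumptions; in PIER each join would exchange its intermediate tuples through its own DHT namespace:

```python
# Sketch: joining the two intermediate results [4b1] and [4b2] on CPR
# with a simple hash join (build on the left input, probe with the right).
def hash_join(left, right, lkey, rkey):
    index = {}
    for t in left:
        index.setdefault(t[lkey], []).append(t)
    for t in right:
        for l in index.get(t[rkey], []):
            yield l + t                        # naive concatenation of tuples

RS = [(7, "xxx", "xxx"), (9, "yyy", "zzz")]    # [4b1]: CPR, Name, Address
TQ = [(7, "xxx", 30000), (9, "yyy", 40000)]    # [4b2]: CPR, DeptName, Salary

final = list(hash_join(RS, TQ, lkey=0, rkey=0))  # [4c], joined on CPR
# final[0] == (7, "xxx", "xxx", 7, "xxx", 30000)
```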
DHT based distributed join detailed: PIER layer procedures in the node originating the join (1/2).
Operations on node Node-7, originating the join (1/2), PIER Core Relational Execution Engine:
- [0]: Assign to the join operation a globally unique operation id and a namespace for the result:
  o operation id = "oper-5".
  o the namespace is chosen to be equal to the operation id = "oper-5".
- [1]: Activate the join in all nodes belonging to NR:
  o Multicast group R-group was created when R was created (assumption).
  o DHT_resourceID = "oper-5".
  o DHT_item = "SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr".
  o Multicast syntax: multicast(namespace, resourceID, item).
  o multicast(R-group, DHT_resourceID, DHT_item);
DHT based distributed join detailed: PIER layer procedures in the node originating the join (2/2).
Operations on node Node-7, originating the join (2/2):
- [2]: Activate the join in all nodes belonging to NS:
  o Multicast group S-group was created when S was created (assumption).
  o DHT_resourceID = "oper-5".
  o DHT_item = "SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr" (as above).
  o multicast(S-group, DHT_resourceID, DHT_item);
- [5]: Receive the result in namespace "oper-5":
  o The DHT Provider calls the function receive in the PIER Core Relational Execution Engine.
  o Assumed syntax of receive: receive(namespace, item).
  o For each tuple in the join result, the receive function is called: receive("oper-5", DHT_item);
  o For each call of receive: store the received DHT_item in the result table.
DHT based distributed join detailed: PIER layer procedures in a node containing data of relation R.
Operations on a node containing data in relation R (multicast group NR), building of the hash table: lscan R; for each tuple in R:
  o extract R.cpr, R.name.
  o DHT_item = (R.cpr, R.name, "R").
  o DHT_instanceID = random();
  o Put syntax: put(namespace, resourceID, instanceID, item, lifetime):
  o [3R]: PUSH: put("NQ", R.cpr, DHT_instanceID, DHT_item, "60 minutes").
DHT based distributed join detailed: PIER layer procedures in a node containing data of relation S.
DHT based distributed join detailed: PIER layer procedures in a node containing intermediate data (namespace NQ).
Operations on a node containing intermediate data (namespace NQ): newData("NQ", item) in the PIER layer is called by the DHT layer. For each call of newData:
  o Get syntax: get(namespace, resourceID) -> list of items:
  o [4] PULL: get("NQ", item.resourceID);
  o if get returns both an R derived tuple and an S derived tuple, do the following:
    - extract R.cpr, R.name and S.address.
    - create the join result tuple (R.cpr, R.name, S.address); DHT_item = (R.cpr, R.name, S.address).
    - send DHT_item to the originator of the join. Assumed syntax of send: send(namespace, receiver-address, item); note: receiver-address is a hash key.
    - [5] PUSH: send("oper-5", hashkey(Node-7), DHT_item).
Presentation overview PIER
- Core functionality and design principles
- Distributed join example
- CAN high level functions
- Application, PIER, DHT/CAN in detail
- Distributed join / operations in detail
- Simulation results
- Conclusion
DHT based distributed joins detailed: important properties.
The distributed join is performed by the PIER layer.
DHT based distributed joins detailed: important properties.
- The PIER distributed joins are adaptations of well-known distributed database join algorithms.
- Distributed databases and database operations have been the object of research for many years.
- The PIER distributed join leverages the DHT in many aspects, because the DHT provides the desired scalability in network size.
DHT based distributed join detailed: CAN multicast design.
Fig. 6-det: PIER: CAN multicast design.
- Entire CAN network = the original CAN network (CAN network identifier = 0).
- Multicast group R-group = a separate CAN network (CAN network identifier = 1), with separate namespace NR.
- Multicast group S-group = a separate CAN network (CAN network identifier = 2), with separate namespace NS.
CAN multicast is done by an enhanced flooding mechanism, which is acceptably efficient.
DHT based distributed join consistency: DHT based distributed join consistency: in comparison: complete consistency in traditional in comparison: complete consistency in traditional distributed database systems.distributed database systems.
Fig. 5.11: PIER: DHT-based pipelining symmetric hash join: complete consistency in traditional distributed database systems.
Namespace NS = multicast group S-group
Namespace NR = multicast group R-group
Entire DHT network
Consistency in traditional distributed database systems:
• data consistency ensured by traditional database mechanisms (including committed update transactions and "keep alive protocol"). This includes clock synchronization.
• data are made unavailable by the database system if they are inconsistent.
• the correct set of data is the union of local data on the participating nodes, where each set of local data is from the same point of time: the time of the query, which is synchronized within a very short time interval on all participating nodes.
• nodes are either available or the query is discarded by the database system (network partition robustness sacrificed to consistency).
DHT based distributed join consistency: consistency defined by dilated-reachable snapshot.
Fig. 5.10: PIER: DHT-based pipelining symmetric hash join: consistency defined by dilated-reachable snapshot.
Namespace NS = multicast group S-group
Namespace NR = multicast group R-group
Entire DHT network
Node X
Node Y originated a tuple in S, which is stored on node X.
Node Y
Consistency defined by dilated-reachable snapshot:
• data consistency ensured by traditional database mechanisms (including committed update transactions and "keep alive protocol"). (Note: update is not part of PIER.) This includes clock synchronization.
o this may not be the case, but then query consistency is not interesting.
• the correct set of data is the (slightly time-dilated) union of local snapshots of data published by reachable nodes, where each local snapshot is from the time of the arrival of the multicast query message at that node.
• data inconsistency (consistency sacrificed to availability and network partition robustness):
o missing data: some nodes storing data of a relation may be unavailable.
o invalid data: some stored data originated on nodes which are no longer available.
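The dilated-reachable snapshot definition can be made concrete with a small sketch (all names invented for illustration): the query answer is computed from the union of per-node local snapshots, each taken at the moment the multicast query reaches that node, and unreachable nodes simply contribute nothing — the "best effort" semantics described above.

```python
# Illustrative sketch (invented names): the "correct" input to a PIER query is
# the union of local snapshots taken at each node when the multicast query
# arrives there; unreachable nodes are silently missing (best-effort results).
def dilated_reachable_snapshot(nodes, arrival_time):
    """nodes: mapping node -> (reachable?, function time -> set of local tuples).
    arrival_time: mapping node -> when the multicast query reached that node."""
    snapshot = set()
    for node, (reachable, local_data_at) in nodes.items():
        if not reachable:
            continue  # missing data: consistency sacrificed to availability
        # Each node contributes its local state as of its own arrival time,
        # so the overall snapshot is slightly "dilated" in time.
        snapshot |= local_data_at(arrival_time[node])
    return snapshot

nodes = {
    "X": (True,  lambda t: {("r1", t)}),   # snapshot depends on arrival time
    "Y": (False, lambda t: {("r2", t)}),   # partitioned away: data missing
}
result = dilated_reachable_snapshot(nodes, {"X": 10, "Y": 10})
# only X's local snapshot is included; Y's tuple is silently absent
```

This is exactly the A+P over C trade-off from Brewer's conjecture mentioned earlier: the query completes despite node Y being unreachable, at the cost of a possibly incomplete answer.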
64

DHT based distributed joins detailed: important properties.
PIER use of DHT layer in distributed join:
• Relation -> multicast group = namespace. The join operation is multicasted to all nodes containing data for a given relation (addressing of a table by multicast group). This multicast functionality is essential to achieve distribution and parallelism.
• Push / pull technique:
o Intermediate results are stored (pushed) in a hashed table.
o Nodes performing the next step in the workflow pull the intermediate results.
• DHT hashed tables with intermediate results are used as queues in the workflow.
• The DHT-leveraged queue and the high degree of distribution and parallelism make network delay much less important.
• Hashing on the primary key (rehashing) ensures related tuples end up on the same node (content addressable network / store based on hashed key).
• The DHT is used as an exchange medium between nodes (in database terms).
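The rehashing and push/pull ideas above can be sketched as a toy pipelined symmetric hash join over a DHT (class and namespace names are invented for illustration, not PIER's API): tuples of R and S are rehashed (put) into an intermediate namespace keyed on the join key, so matching tuples of both relations land on the same node; each arriving tuple first probes the other relation's entries already stored under that key, then publishes itself.

```python
# Illustrative sketch (invented names, not PIER's API): a pipelined symmetric
# hash join on top of a DHT used as the exchange medium between nodes.
from collections import defaultdict

class DHT:
    """Toy DHT: put/get on (namespace, key); a real DHT routes by hashed key."""
    def __init__(self):
        self.store = defaultdict(list)
    def put(self, namespace, key, value):
        self.store[(namespace, key)].append(value)
    def get(self, namespace, key):
        return self.store[(namespace, key)]

def symmetric_hash_join(dht, arrivals, key_of):
    """arrivals: list of (relation, tuple) in any interleaved arrival order."""
    results = []
    for relation, tup in arrivals:
        k = key_of(tup)
        other = "S" if relation == "R" else "R"
        # Probe (pull): check tuples of the other relation already rehashed
        # under this join key -- they live on the same node because both
        # relations are hashed on the same key.
        for match in dht.get("NQ:" + other, k):
            pair = (tup, match) if relation == "R" else (match, tup)
            results.append(pair)
        # Publish (push): store this tuple so later arrivals can probe it.
        dht.put("NQ:" + relation, k, tup)
    return results

R = [("R", ("k1", "r-a")), ("R", ("k2", "r-b"))]
S = [("S", ("k1", "s-a")), ("S", ("k3", "s-c"))]
joined = symmetric_hash_join(DHT(), R + S, key_of=lambda t: t[0])
# only key "k1" appears in both relations, so exactly one result pair
```

Because every tuple both probes and publishes, the join is symmetric: any arrival interleaving of R and S tuples yields the same result set, which is what makes the operator pipelined and tolerant of network delay.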
65

DHT based distributed join detailed: important properties.
Essential properties of the DHT protocol used:
• reliability in inserting data (put)
• reliability in retrieving data (get)
• reliability in storing data (preserving data at node failure)
• scalability in terms of the number of nodes and the amount of work handled
o traditional distributed database systems have this, but DHT (at least CAN) scalability is larger
• robustness to node and network failures
o traditional database systems do not have this (cf. Brewer's "CAP" conjecture: you can only have two of Consistency, Availability and tolerance of network Partitions; traditional distributed systems choose C and sacrifice A in the face of P).
• support of a quickly and frequently varying number of participating nodes
o traditional distributed databases normally do not have this.
66

DHT based distributed join detailed: important properties.
Comparison of traditional distributed databases and PIER distributed query:
• A PIER query (as the name shows) is read only, no update.
• Committed update transactions and operation support facilities like backup (e.g. checkpoint based):
o not a design goal of PIER and other P2P systems.
o an essential property of traditional distributed database systems for many years.
67

DHT based distributed join detailed: important properties.
PIER: supported distributed join algorithms and query rewrite mechanisms (1/2):
• Note initially:
o The basic principles of PIER's use of the DHT in distributed queries were already shown by the above example containing the pipelining symmetric hash join.
o The pipelining symmetric hash join is the most general purpose equi-join operation of PIER.
o The other distributed query mechanisms primarily differ with respect to distributed database operation strategy and bandwidth saving techniques.
o The other distributed query mechanisms are only briefly mentioned here.
68

DHT based distributed join detailed: important properties.
PIER: supported distributed join algorithms and query rewrite mechanisms (2/2):
• PIER supports DHT based versions of two distributed binary equi-join algorithms:
o pipelined symmetric hash join: shown in the above example.
o fetch matches: one of the tables is already hashed on the join key and attributes.
• PIER supports DHT based versions of two distributed bandwidth-reducing query rewrite strategies:
o symmetric semi-join: first a local projection of the "source" table's join keys and attributes, then a pipelined symmetric hash join.
o Bloom join: roughly, a compact filter keyed on the join keys of each participating table is put into an intermediate table and makes it possible to efficiently identify the matching set of tuples in the participating tables.
• Note: the pipelining symmetric hash join (above example) does in itself save much network bandwidth.
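The bandwidth-saving idea behind the Bloom join can be illustrated with a small sketch (all names invented; parameters like the filter size are arbitrary choices for the example): a relation first publishes a small Bloom filter of its join keys, and a tuple of the other relation is shipped over the network only if the filter reports a possible match. False positives cost some extra traffic but never lose results; false negatives cannot occur.

```python
# Illustrative sketch (invented names): the filtering idea behind the Bloom
# join rewrite. Only tuples whose join key *might* match the other relation
# are rehashed over the network, saving bandwidth.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=256, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bitset packed into a Python int

    def _positions(self, key):
        # Derive num_hashes independent bit positions from one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True for every added key; occasionally True for others (false positive).
        return all(self.bits >> pos & 1 for pos in self._positions(key))

r_keys = ["k1", "k2"]          # join keys present in relation R
s_keys = ["k1", "k3", "k4"]    # join keys present in relation S

r_filter = BloomFilter()
for k in r_keys:
    r_filter.add(k)

# Only S tuples whose key might match R are sent (rehashed) for the join.
s_sent = [k for k in s_keys if r_filter.might_contain(k)]
# "k1" is always sent; "k3"/"k4" are sent only on a (rare) false positive
```

Shipping the filter instead of the projected key column is what distinguishes the Bloom join from the symmetric semi-join: the filter is much smaller than the projection, at the cost of occasionally shipping a non-matching tuple.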
69

Presentation overview PIER
Core functionality and design principles
Distributed join example.
CAN high level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion
70

Evaluation of peer scalability and robustness by means of simulation and limited scale experimental operation:
• Subject of evaluation: the above two distributed join algorithms and the above two query rewrite methods.
• Simulation: 10,000 node network.
• Experimental operation: cluster of 64 PCs in a LAN.
• Simulation results:
o the network scales well with both a scaling number of nodes and a scaling amount of work.
o the network is robust to node failure. Robustness is provided by the DHT layer, not by the PIER layer.
• Evaluation of the presenters: the PIER simulation test cases seem relevant and the results seem acceptable.
• Experimental results: negligible (the experimental operation must be extended).
• Reference: "Querying the Internet with PIER", Ryan Huebsch et al., see course homepage. Hereafter "article".
The most important simulation tests and results are shown below.
71

Simulation testbed
• The same test distributed join is used in all tests.
• The amount of work, in terms of data traffic, of the test distributed join can be and is proportionally increased with an increasing number of nodes in the network.
72

Simulation: average join time when scaling network and work ("full mesh" topology) (1/2).
• Test case: average join time when both the amount of work and the number of nodes in the network scale.
• "Full mesh" network topology used.
• Result: the network scales well (performance degradation of a factor of 4 when the number of nodes, and proportionally the amount of work, scales from 2 to 10,000).
• See article fig. 3 below.
73

Simulation: average join time when scaling network and work ("full mesh" topology) (2/2).
74

Simulation: average join time for different mechanisms (1/2).
• Test case: average join time for the supported distributed join algorithms (pipelining symmetric hash join, fetch matches) and the supported query rewrite strategies (symmetric semi-join, Bloom join).
• Result: see article table 4 below.
75

Simulation: average join time for different mechanisms (2/2).
76

Simulation test cases and results: aggregate join traffic for different mechanisms (1/2).
• Test case: aggregate join traffic generated by the supported join algorithms and query rewrite strategies.
• Result: see article fig. 4 below.
77

Simulation test cases and results: aggregate join traffic for different mechanisms (2/2).
78

Simulation: average join time when scaling network and work ("transit stub" topology) (1/4).
• Initially note: the topologies examined are the underlying IP network topology, not the PIER / DHT overlay network topology.
• Test case: "transit stub" network topology compared to "full mesh" network topology.
• "Transit stub" is the only realistic network topology. "Full mesh" is easier to simulate.
• "Transit stub" topology: traditional hierarchical network topology:
o 4 "transit domains", each having
o 10 "transit nodes", each having
o 3 "stub nodes".
o PIER nodes are distributed equally upon the "stub nodes".
79

Simulation: average join time when scaling network and work ("transit stub" topology) (2/4).
• The comparison of the "transit stub" topology with the "full mesh" topology is done by comparing the following in the two topologies: average join time when both the amount of work and the number of nodes scale.
• The performance scaling ability of the "full mesh" topology was shown in fig. 3 above.
• The performance scaling ability of the "transit stub" topology is shown in fig. 7 below.
• Result:
o the "transit stub" topology's performance scaling ability is close to that of the "full mesh" topology.
o the "transit stub" absolute delay is bigger (due to the realistic topology and network delays).
o conclusion: "full mesh", which is easier to simulate, is used in the simulation tests, since it is sufficiently close to the realistic "transit stub" topology.
80

Simulation: "transit stub" topology's ability to keep performance high when scaling (fig. 7): average join time when scaling network and work (3/4).
81

Simulation: in comparison, the "full mesh" topology's ability to keep performance high when scaling (fig. 3): average join time when scaling network and work (4/4).
82

Presentation overview PIER
Core functionality and design principles
Distributed join example.
CAN high level functions.
Application, PIER, DHT/CAN in detail.
Distributed join / operations in detail.
Simulation results.
Conclusion.