Mauro Mezzini
ANSWERING SUM-QUERIES: A SECURE AND EFFICIENT APPROACH
University of Rome "La Sapienza", Computer Science Department
Introduction
Statistical database: users may ask for statistical information, such as sum, count, average, max and min queries, on a numerical attribute.
Retail
PRODUCT     SALES(€)
storage     90000
router      30000
server      30000
mainframe   25000

select sum( SALES ) from Retail
where PRODUCT = 'storage' or PRODUCT = 'router';
r = 120.000
Introduction
Definition: The target K of a query q is the set of categories selected by the where clause of q.
select sum( SALES ) from Retail
where PRODUCT = 'storage' or PRODUCT = 'router';
K = {storage, router}
The efficiency issue
To speed up answering a sum-query, the query system is endowed with a set of pre-computed sum-queries, called the set of materialised views.
q1: select sum( SALES ) from Retail;
r1 = 175.000
q2: select sum( SALES ) from Retail
where PRODUCT = 'storage' or PRODUCT = 'router';
r2 = 120.000
q: select sum( SALES ) from Retail
where PRODUCT = 'server' or PRODUCT = 'mainframe';
r = r1 - r2 = 55.000
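The subtraction of the two views can be sketched in a few lines of Python. This is only an illustration: the dictionary `retail` is a hypothetical in-memory copy of the Retail table, and `sum_query` is a helper name introduced here, not part of the slides.

```python
# Hypothetical in-memory copy of the Retail table from the slides.
retail = {"storage": 90000, "router": 30000, "server": 30000, "mainframe": 25000}

def sum_query(table, target):
    """Evaluate a sum-query by scanning the categories in its target."""
    return sum(table[category] for category in target)

# Materialised views: answers computed once and stored.
r1 = sum_query(retail, retail.keys())              # q1: whole table
r2 = sum_query(retail, {"storage", "router"})      # q2

# The target of q is the target of q1 minus the target of q2,
# so its answer follows from the views without scanning the table again.
r = r1 - r2
```

Here `r` equals the value a direct scan over {server, mainframe} would return.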
Protection issue
To protect the confidentiality of the numerical attribute, the query system is endowed with the list of all sensitive categories.
Confidential queries (each asks for a single category):
q1: select sum( SALES ) from Retail where PRODUCT = 'storage';
q2: select sum( SALES ) from Retail where PRODUCT = 'router';

PRODUCT     SALES(€)
storage     90000
router      30000
server      30000
mainframe   25000

Non-confidential queries:
q1: select sum( SALES ) from Retail where PRODUCT = 'router' or PRODUCT = 'server';
q2: select sum( SALES ) from Retail where PRODUCT = 'storage' or PRODUCT = 'server';
q3: select sum( SALES ) from Retail where PRODUCT = 'storage' or PRODUCT = 'router';
r1 = 60.000
r2 = 120.000
r3 = 120.000
Protection issue
x1 + x2 = r1
x2 + x3 = r2
x1 + x3 = r3
The value of all confidential information can be inferred from the answer of non–confidential queries {q1, q2, q3 }.
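To make the inference concrete, the three equations can be solved by elimination. The sketch below assumes x1, x2, x3 stand for router, server and storage, and takes the right-hand sides from the Retail table; `solve_three_pairs` is a name chosen here for illustration.

```python
from fractions import Fraction

def solve_three_pairs(r1, r2, r3):
    """Solve x1 + x2 = r1, x2 + x3 = r2, x1 + x3 = r3 by elimination."""
    # Adding the three equations gives 2 * (x1 + x2 + x3) = r1 + r2 + r3.
    total = Fraction(r1 + r2 + r3, 2)
    # Subtracting each pairwise sum from the total isolates the third unknown.
    x1 = total - r2
    x2 = total - r3
    x3 = total - r1
    return x1, x2, x3

# Right-hand sides computed from the Retail table (an assumption of this sketch):
# router + server, storage + server, storage + router.
router, server, storage = solve_three_pairs(60000, 120000, 120000)
```

Every individual sales figure is recovered although no query asked for a single category.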
The inference model
Efficiency: Given a set of sum-queries V = {q1,…,qn}, determine whether the result of q can be inferred from V.
Protection: Given a set of sum-queries V = {q1,…,qn}, determine for every inferable sum-query q whether the result of q is sensitive information.
The inference model
Let V = {q1, q2, …,qn}
Let Ki and ri be the target and the result of qi respectively
Let $\mathcal{C} = \{C_1, C_2, \dots, C_m\}$ be the coarsest partition of $\bigcup_{i=1,\dots,n} K_i$ such that each $K_i$ can be obtained as the union of one or more elements of $\mathcal{C}$
The inference model is based on the following system of linear constraints
$$\sum_{j=1,\dots,m} a_{i,j}\, x_j = r_i \quad i = 1,\dots,n, \qquad x \in F^m \qquad (1)$$
where $a_{i,j} = 1$ if $C_j \subseteq K_i$ and $a_{i,j} = 0$ otherwise
and F is the domain of the numerical attribute
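One way to obtain the coarsest partition is to group categories that belong to exactly the same targets. The sketch below takes that approach, which is an assumption about how the partition could be computed rather than the method used in the slides; the function names are hypothetical.

```python
from collections import defaultdict

def coarsest_partition(targets):
    """Group categories by the set of targets that contain them."""
    signature = defaultdict(set)
    for i, K in enumerate(targets):
        for category in K:
            signature[category].add(i)
    classes = defaultdict(set)
    for category, sig in signature.items():
        classes[frozenset(sig)].add(category)
    # Sort the classes so that the column order is deterministic.
    return sorted((frozenset(c) for c in classes.values()), key=sorted)

def constraint_matrix(targets, partition):
    """a[i][j] = 1 if class C_j is contained in target K_i, else 0."""
    return [[1 if C <= K else 0 for C in partition] for K in targets]

targets = [{"router", "server"}, {"storage", "server"}, {"storage", "router"}]
partition = coarsest_partition(targets)
A = constraint_matrix(targets, partition)
```

Two categories end up in the same class exactly when no target separates them, so no coarser partition could still express every target as a union of classes.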
The inference model. An example
q1: select sum( SALES ) from Retail where PRODUCT = 'router' or PRODUCT = 'server';
q2: select sum( SALES ) from Retail where PRODUCT = 'storage' or PRODUCT = 'server';
q3: select sum( SALES ) from Retail where PRODUCT = 'storage' or PRODUCT = 'router';
r1 = 60.000
r2 = 120.000
r3 = 120.000
K1 = {router, server}
K2 = {storage, server}
K3 = {storage, router}
C1 = {router}, C2 = {server}, C3 = {storage}
F is the set of non-negative reals
x1 + x2 = r1
x2 + x3 = r2
x1 + x3 = r3
The inference model
Definition: Given a subset S of {1,2,…,m}, the sum-expression
$\sum_{j \in S} x_j$
is an F-invariant if it takes on the same value for every solution x of (1).
An F-invariant sum is the result of the sum-query with target
$\bigcup_{j \in S} C_j$
The inference model
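Definitions: Given the partition $\mathcal{C} = \{C_1,\dots,C_m\}$ and a query q with target K, the two sets:
$S = \{ j : C_j \subseteq K \}$, the support of q
$\bar{S} = \{ j : C_j \cap K \neq \emptyset \text{ and } C_j - K \neq \emptyset \}$, the cosupport of q
The sum
$\sum_{j \in S \cup \bar{S}} x_j$
is called the sum-expression associated to q.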
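The two definitions translate directly into set operations. In this sketch the partition is a list of Python sets indexed from 1, as in the slides; the function names are chosen here for illustration.

```python
def support(partition, K):
    """Indices j (1-based) of the classes C_j entirely contained in K."""
    return {j for j, C in enumerate(partition, start=1) if C <= K}

def cosupport(partition, K):
    """Indices j (1-based) of the classes C_j that straddle K:
    they meet K but are not contained in it."""
    return {j for j, C in enumerate(partition, start=1) if C & K and C - K}

partition = [{"router"}, {"server"}, {"storage"}]
K = {"storage"}
S = support(partition, K)
S_bar = cosupport(partition, K)
```

A non-empty cosupport signals that the partition is too coarse to express the target exactly.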
The inference model. An example
q1: select sum( SALES ) from Retail where PRODUCT = 'router' or PRODUCT = 'server';
q2: select sum( SALES ) from Retail where PRODUCT = 'storage' or PRODUCT = 'server';
q3: select sum( SALES ) from Retail where PRODUCT = 'storage' or PRODUCT = 'router';
r1 = 60.000
r2 = 120.000
r3 = 120.000
K1 = {router, server}
K2 = {storage, server}
K3 = {storage, router}
C1 = {router}, C2 = {server}, C3 = {storage}
x1 + x2 = r1
x2 + x3 = r2
x1 + x3 = r3
q: select sum( SALES ) from Retail where PRODUCT = 'storage';
K = {storage}
The support of q is {3}, the cosupport is empty, and the sum-expression associated to q is trivially:
x3
Problem definitions
1) Given a sum-expression $\sum_{j \in S} x_j$, decide whether it is an F-invariant.
2) Given a sum-expression $\sum_{j \in S} x_j$ that is not an F-invariant, find a nonempty subset S' of S such that $\sum_{j \in S'} x_j$ is an F-invariant.
Problem (2)
Let S be a subset of {1,…,m} and let s be the characteristic vector of S. Then
$$s(i) = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \notin S \end{cases} \qquad i = 1,\dots,m$$
Problem (2)
We can rewrite system (1) as: A x = r, $x \in F^m$
An m-dimensional vector f is a linear combination of rows of A if
$f = \sum_{i=1,\dots,n} \lambda_i a_i$
where $\lambda_i \in R$ and $a_i$ is a row of A, i = 1,…,n
Problem (2)
Definition: A subset S of {1,2,…,m} is said to be algebraic if its characteristic vector can be expressed as a linear combination of the rows of the matrix A.
If F is R or Z then $\sum_{j \in S} x_j$ is F-invariant if and only if S is algebraic.
Problem definition: Given a sum-expression
$\sum_{j \in S} x_j$
that is not R-invariant, find a non-empty algebraic subset of S (NAS Problem).
NAS Problem: find a non-empty subset F of S such that the characteristic vector of F is expressible as a linear combination of rows of A
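For small instances, algebraicity can be checked with a rank test: the characteristic vector of S lies in the row space of A exactly when appending it to A does not increase the rank. The sketch below works over the rationals and enumerates subsets by brute force, so it is exponential and meant only to illustrate the definitions; all function names and the 2 × 4 example matrix are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a matrix over the rationals, by Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def is_algebraic(A, S):
    """True if the characteristic vector of S (1-based) is in the row space of A."""
    s = [1 if j in S else 0 for j in range(1, len(A[0]) + 1)]
    return rank(A + [s]) == rank(A)

def nonempty_algebraic_subsets(A, S):
    """Brute-force NAS: list every non-empty algebraic subset of S."""
    items = sorted(S)
    return [set(F) for k in range(1, len(items) + 1)
            for F in combinations(items, k) if is_algebraic(A, set(F))]

# Hypothetical example: two queries over four classes.
A = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
found = nonempty_algebraic_subsets(A, {1, 2, 3})
```

In this example x1 + x2 + x3 is not invariant, but its sub-expression x1 + x2 is.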
The NAS Problem
The subset sum problem (SSP):
Given a set S = {1,…,p} and a mapping
$a : S \to Z$
such that
a(i) > 0 for i = 1,…,p-1 and
a(i) < 0 for i = p
find a non-empty subset F of S such that
$\sum_{i \in F} a(i) = 0$
The NAS Problem
Let c be a q-dimensional vector, with q ≥ p, such that
c(1) = a(1), c(2) = a(2), …, c(p) = a(p)
and
$c(j) \in R$ for $p < j \le q$
Let M = (I, c) be the q × (q+1) matrix obtained from c.
The NAS Problem
Example: let S={1, 2, 3, 4} and
a(1) = 1
a(2) = 2
a(3) = 5
a(4)= -7
The subset F = { 2, 3, 4} of S is a solution of the SSP since
a(2) + a(3) + a(4) = 2 + 5 – 7 = 0.
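The example can be checked by exhaustive search. The sketch below enumerates every non-empty subset, which takes exponential time and is only practical for tiny instances; `subset_sum_zero` is a name introduced here.

```python
from itertools import combinations

def subset_sum_zero(a):
    """Return every non-empty index set F with sum of a(i) over F equal to 0."""
    indices = sorted(a)
    return [set(F) for k in range(1, len(indices) + 1)
            for F in combinations(indices, k)
            if sum(a[i] for i in F) == 0]

a = {1: 1, 2: 2, 3: 5, 4: -7}
solutions = subset_sum_zero(a)
```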
The NAS Problem
If we choose q = 5, the vector c is (1, 2, 5, -7, c(5)), with c(5) an arbitrary real, and the matrix M is
1 0 0 0 0    1
0 1 0 0 0    2
0 0 1 0 0    5
0 0 0 1 0   -7
0 0 0 0 1    c(5)
The NAS Problem
The vector $\hat{c} = (-c, 1)$ is a solution of the equation
M y = 0
y1 + 1 y6 = 0
y2 + 2 y6 = 0
y3 + 5 y6 = 0
y4 - 7 y6 = 0
y5 + c(5) y6 = 0
The NAS Problem
Theorem: Given the matrix M and the set S = {1,…,p}, the SSP has a solution if and only if there exists a nonempty algebraic subset of S.
Proof
The (q+1)-dimensional vector $\hat{c} = (-c, 1)$ spans the null space of M
M y = 0
and the null space of M has dimension equal to one.
The NAS Problem
If $F \subseteq S$ is an algebraic set then its characteristic vector f is expressible as a linear combination of rows of M. Since f and $\hat{c}$ are orthogonal then
$\sum_{i=1,\dots,q+1} f(i)\, \hat{c}(i) = 0$
that is
$0 = \sum_{i \in F} c(i) = \sum_{i \in F} a(i)$
qed.
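The argument can be replayed on the earlier example a = (1, 2, 5, -7). Because the first q columns of M form an identity block, the only combination of rows that matches a characteristic vector on those columns is the plain sum of the rows indexed by F, so F is algebraic exactly when the last entry of that sum is zero. In the sketch below the fifth entry of c is set to 3, an arbitrary value chosen for illustration, and the function names are hypothetical.

```python
def build_M(c):
    """M = (I, c): the q x q identity with c appended as a last column."""
    q = len(c)
    return [[1 if j == i else 0 for j in range(q)] + [c[i]] for i in range(q)]

def is_algebraic_in_M(M, F):
    """F holds 1-based row indices. Sum the rows of M indexed by F and
    compare with the (q+1)-dimensional characteristic vector of F."""
    q = len(M)
    f = [1 if j + 1 in F else 0 for j in range(q)] + [0]
    combo = [sum(M[i - 1][j] for i in F) for j in range(q + 1)]
    return combo == f

c = [1, 2, 5, -7, 3]          # c(5) = 3 is an arbitrary illustrative choice
M = build_M(c)
c_hat = [-v for v in c] + [1]

# c_hat lies in the null space of M: every row is orthogonal to it.
in_null_space = all(sum(m * y for m, y in zip(row, c_hat)) == 0 for row in M)
```

With these values, `is_algebraic_in_M(M, {2, 3, 4})` holds because 2 + 5 - 7 = 0, while `is_algebraic_in_M(M, {1, 2})` does not.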
The NAS Problem
Example: let S = {1, 2, 3} and
a(1) = 2, a(2) = 2, a(3) = -4
then
c0 = (2, 2, -4)
c1 = (-1, -1, 1)
c2 = (-1, -1, 1)
c3 = (2, 2, -2, 1, 1)
let
c = (c0, c1, c2, c3)
Then M would be
1 0 0 0 0 0 0 0 0 0 0 0 0 0    2
0 1 0 0 0 0 0 0 0 0 0 0 0 0    2
0 0 1 0 0 0 0 0 0 0 0 0 0 0   -4
0 0 0 1 0 0 0 0 0 0 0 0 0 0   -1
0 0 0 0 1 0 0 0 0 0 0 0 0 0   -1
0 0 0 0 0 1 0 0 0 0 0 0 0 0    1
0 0 0 0 0 0 1 0 0 0 0 0 0 0   -1
0 0 0 0 0 0 0 1 0 0 0 0 0 0   -1
0 0 0 0 0 0 0 0 1 0 0 0 0 0    1
0 0 0 0 0 0 0 0 0 1 0 0 0 0    2
0 0 0 0 0 0 0 0 0 0 1 0 0 0    2
0 0 0 0 0 0 0 0 0 0 0 1 0 0   -2
0 0 0 0 0 0 0 0 0 0 0 0 1 0    1
0 0 0 0 0 0 0 0 0 0 0 0 0 1    1
The NAS Problem
Step (1)
Row blocks: c0 (rows 1-3), c1 (rows 4-6), c2 (rows 7-9), c3 (rows 10-14)
1 0 0 1 1 0 0 0 0 0 0 0 0 0    0
0 1 0 0 0 0 0 0 0 0 0 0 0 0    2
0 0 1 0 0 0 0 0 0 0 0 0 0 0   -4
0 0 0 1 0 0 0 0 0 0 0 0 0 0   -1
0 0 0 0 1 0 0 0 0 0 0 0 0 0   -1
0 0 0 0 0 1 0 0 0 0 0 0 0 0    1
0 0 0 0 0 0 1 0 0 0 0 0 0 0   -1
0 0 0 0 0 0 0 1 0 0 0 0 0 0   -1
0 0 0 0 0 0 0 0 1 0 0 0 0 0    1
0 0 0 0 0 0 0 0 0 1 0 0 0 0    2
0 0 0 0 0 0 0 0 0 0 1 0 0 0    2
0 0 0 0 0 0 0 0 0 0 0 1 0 0   -2
0 0 0 0 0 0 0 0 0 0 0 0 1 0    1
0 0 0 0 0 0 0 0 0 0 0 0 0 1    1
The NAS Problem
Step (3)
Row blocks: c0 (rows 1-3), c1 (rows 4-6), c2 (rows 7-9), c3 (rows 10-14)
1 0 0 1 1 0 0 0 0 0 0 0 0 0    0
0 1 0 0 0 0 0 0 0 0 0 0 0 0    2
0 0 1 0 0 0 0 0 0 0 0 0 0 0   -4
0 0 0 1 0 1 0 0 0 0 0 0 0 0    0
0 0 0 0 1 1 0 0 0 0 0 0 0 0    0
0 0 0 0 0 1 0 0 0 0 0 0 0 0    1
0 0 0 0 0 0 1 0 0 0 0 0 0 0   -1
0 0 0 0 0 0 0 1 0 0 0 0 0 0   -1
0 0 0 0 0 0 0 0 1 0 0 0 0 0    1
0 0 0 0 0 0 0 0 0 1 0 0 0 0    2
0 0 0 0 0 0 0 0 0 0 1 0 0 0    2
0 0 0 0 0 0 0 0 0 0 0 1 0 0   -2
0 0 0 0 0 0 0 0 0 0 0 0 1 0    1
0 0 0 0 0 0 0 0 0 0 0 0 0 1    1
The NAS Problem
Step (4)
Row blocks: c0 (rows 1-3), c1 (rows 4-6), c2 (rows 7-9), c3 (rows 10-14)
1 0 0 1 1 0 0 0 0 0 0 0 0 0    0
0 1 0 0 0 0 1 1 0 0 0 0 0 0    0
0 0 1 0 0 0 0 0 0 0 0 0 0 0   -4
0 0 0 1 0 1 0 0 0 0 0 0 0 0    0
0 0 0 0 1 1 0 0 0 0 0 0 0 0    0
0 0 0 0 0 1 0 0 0 0 0 0 0 0    1
0 0 0 0 0 0 1 0 1 0 0 0 0 0    0
0 0 0 0 0 0 0 1 1 0 0 0 0 0    0
0 0 0 0 0 0 0 0 1 0 0 0 0 0    1
0 0 0 0 0 0 0 0 0 1 0 0 0 0    2
0 0 0 0 0 0 0 0 0 0 1 0 0 0    2
0 0 0 0 0 0 0 0 0 0 0 1 0 0   -2
0 0 0 0 0 0 0 0 0 0 0 0 1 0    1
0 0 0 0 0 0 0 0 0 0 0 0 0 1    1
The NAS Problem
Step (5)
Row blocks: c0 (rows 1-3), c1 (rows 4-6), c2 (rows 7-9), c3 (rows 10-14)
1 0 0 1 1 0 0 0 0 0 0 0 0 0    0
0 1 0 0 0 0 1 1 0 0 0 0 0 0    0
0 0 1 0 0 0 0 0 0 1 1 0 0 0    0
0 0 0 1 0 1 0 0 0 0 0 0 0 0    0
0 0 0 0 1 1 0 0 0 0 0 0 0 0    0
0 0 0 0 0 1 0 0 0 0 0 0 0 0    1
0 0 0 0 0 0 1 0 1 0 0 0 0 0    0
0 0 0 0 0 0 0 1 1 0 0 0 0 0    0
0 0 0 0 0 0 0 0 1 0 0 0 0 0    1
0 0 0 0 0 0 0 0 0 1 0 0 0 0    2
0 0 0 0 0 0 0 0 0 0 1 0 0 0    2
0 0 0 0 0 0 0 0 0 0 0 1 0 0   -2
0 0 0 0 0 0 0 0 0 0 0 0 1 0    1
0 0 0 0 0 0 0 0 0 0 0 0 0 1    1
The NAS Problem
Step (6)
Row blocks: c0 (rows 1-3), c1 (rows 4-6), c2 (rows 7-9), c3 (rows 10-14)
1 0 0 1 1 0 0 0 0 0 0 0 0 0    0
0 1 0 0 0 0 1 1 0 0 0 0 0 0    0
0 0 1 0 0 0 0 0 0 1 1 0 0 0    0
0 0 0 1 0 1 0 0 0 0 0 0 0 0    0
0 0 0 0 1 1 0 0 0 0 0 0 0 0    0
0 0 0 0 0 1 0 0 0 0 0 0 0 0    1
0 0 0 0 0 0 1 0 1 0 0 0 0 0    0
0 0 0 0 0 0 0 1 1 0 0 0 0 0    0
0 0 0 0 0 0 0 0 1 0 0 0 0 0    1
0 0 0 0 0 0 0 0 0 1 0 1 0 0    0
0 0 0 0 0 0 0 0 0 0 1 1 0 0    0
0 0 0 0 0 0 0 0 0 0 0 1 0 0   -2
0 0 0 0 0 0 0 0 0 0 0 0 1 0    1
0 0 0 0 0 0 0 0 0 0 0 0 0 1    1
The NAS Problem
Final step
Row blocks: c0 (rows 1-3), c1 (rows 4-6), c2 (rows 7-9), c3 (rows 10-14)
1 0 0 1 1 0 0 0 0 0 0 0 0 0    0
0 1 0 0 0 0 1 1 0 0 0 0 0 0    0
0 0 1 0 0 0 0 0 0 1 1 0 0 0    0
0 0 0 1 0 1 0 0 0 0 0 0 0 0    0
0 0 0 0 1 1 0 0 0 0 0 0 0 0    0
0 0 0 0 0 1 0 0 0 0 0 0 0 0    1
0 0 0 0 0 0 1 0 1 0 0 0 0 0    0
0 0 0 0 0 0 0 1 1 0 0 0 0 0    0
0 0 0 0 0 0 0 0 1 0 0 0 0 0    1
0 0 0 0 0 0 0 0 0 1 0 0 1 0    0
0 0 0 0 0 0 0 0 0 0 1 0 1 0    0
0 0 0 0 0 0 0 0 0 0 0 1 1 1    0
0 0 0 0 0 0 0 0 0 0 0 0 1 0    1
0 0 0 0 0 0 0 0 0 0 0 0 0 1    1
The NAS Problem
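a(i) > 1, i = 1,…,p-1
$k_i = \lfloor \log_2 a(i) \rfloor$
$$c_i = \left( -\left\lfloor \tfrac{a(i)}{2} \right\rfloor, -\left\lfloor \tfrac{a(i)}{2} \right\rfloor, \left\lfloor \tfrac{a(i)}{2} \right\rfloor, -\left\lfloor \tfrac{a(i)}{2^2} \right\rfloor, -\left\lfloor \tfrac{a(i)}{2^2} \right\rfloor, \left\lfloor \tfrac{a(i)}{2^2} \right\rfloor, \dots, -\left\lfloor \tfrac{a(i)}{2^{k_i}} \right\rfloor, -\left\lfloor \tfrac{a(i)}{2^{k_i}} \right\rfloor, \left\lfloor \tfrac{a(i)}{2^{k_i}} \right\rfloor \right)$$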
The NAS Problem
a(i) = 7
$k_i = \lfloor \log_2 7 \rfloor = 2$
ci = ( -3, -3, 3, -1, -1, 1 )
a(i) = 8
$k_i = \lfloor \log_2 8 \rfloor = 3$
ci = ( -4, -4, 4, -2, -2, 2, -1, -1, 1 )
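The two examples suggest that each block is built by repeated halving with rounding down. The sketch below encodes that reading for a positive integer a(i) > 1; the rule is inferred from the examples for 7 and 8, and `halving_block` is a name chosen here.

```python
def halving_block(a):
    """Return (-h, -h, h) for h = floor(a / 2**t), t = 1..floor(log2(a)).

    Assumes a is an integer greater than 1.
    """
    k = a.bit_length() - 1        # floor(log2(a)) for a positive integer
    block = []
    for t in range(1, k + 1):
        h = a >> t                # floor(a / 2**t)
        block += [-h, -h, h]
    return block
```

For example, `halving_block(7)` yields the six entries listed for a(i) = 7, and `halving_block(8)` the nine entries listed for a(i) = 8.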
The NAS Problem
B = max{ |a(i)| : i = 1,…,p }
The SSP has input dimension equal to O( p × log2(B) ).
$k_i \le \log_2 B$
The dimension of the matrix M is q × (q+1) where
$q \le (p + 1) \times 3 \log_2 B = O(p \times \log_2 B)$
Solving problem (1)
A x = r, $x \in F^m$
Is $\sum_{j \in S} x_j$ an F-invariant?
A is the vertex-edge incidence matrix of a graph, F is the set of reals and S is a singleton.
[Figure: a graph with vertices labelled r1,…,r6 and edges labelled x1,…,x8.]
Solving problem (1)
Consider the homogeneous system associated to system (1)
A y = 0, $y \in R^m$ (2)
We call circulation a solution y of system (2).
[Figure: a circulation on the graph, with values + and - on some edges and 0 on the others.]
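A small check makes the definition tangible: alternating +1 and -1 around an even cycle satisfies A y = 0, while the same pattern on a triangle does not. The graphs below are hypothetical examples, not the graph drawn in the slides, and the function names are chosen for this sketch.

```python
def incidence_matrix(num_vertices, edges):
    """Unsigned vertex-edge incidence matrix: A[v][j] = 1 if vertex v is an
    endpoint of edge j."""
    A = [[0] * len(edges) for _ in range(num_vertices)]
    for j, (u, v) in enumerate(edges):
        A[u][j] = 1
        A[v][j] = 1
    return A

def is_circulation(A, y):
    """True if A y = 0, i.e. the edge values cancel at every vertex."""
    return all(sum(a * b for a, b in zip(row, y)) == 0 for row in A)

square = [(0, 1), (1, 2), (2, 3), (3, 0)]      # even cycle
triangle = [(0, 1), (1, 2), (2, 0)]            # odd cycle

A_square = incidence_matrix(4, square)
A_triangle = incidence_matrix(3, triangle)
```

Here `is_circulation(A_square, [1, -1, 1, -1])` holds, whereas `is_circulation(A_triangle, [1, -1, 1])` fails at the vertex where two edges of the same sign meet.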
Solving problem (1)
Definition: given a circulation y, its support is the set
$C = \{ e : y_e \neq 0 \}$
[Figure: a circulation on the graph; the edges with value 0 lie outside its support.]
Solving problem (1)
Theorem 1: The unknown $x_e$ is an R-invariant if and only if $e \notin C$ for every circulation y with support C.
Proof: Let x* be a particular solution of (1). Then every solution x of (1) can be written as
x = x* + y
for some circulation y. So if $y_e = 0$ for every circulation y, then $x_e = x_e^*$ for every solution x of (1).
Conversely, if $x_e$ is invariant then
$x_e - x_e^* = 0 = y_e$
for every solution x of (1). Therefore $y_e = 0$ for every circulation y.
Solving problem (1)
Definition: A circulation y with support C is minimal if there is no circulation with support C' such that $C' \subset C$.
[Figure: example circulation with signed edge values.]
Solving problem (1)
The supports of minimal circulations are called circuits; they are the even cycles and the L-oddsets of the graph.
[Figure: examples of circuits with signed edge values.]
Solving problem (1)
Given a circulation y then
$y = \sum_{i=1,\dots,p} \lambda_i y_i$
where $\lambda_i \in R$
B = {y1,…, yp} is a basis of N, the null space of A
each yi is a circuit of G
Solving problem (1)
[Figure: circulations on the example graph, with edge values such as +2, +, - and +β, -β.]
Solving problem (1)
Theorem 2: The unknown $x_e$ is an R-invariant if and only if $e \notin C$ for every circuit yi with support C.
Proof:
$y_e = \sum_{i=1,\dots,p} \lambda_i\, y_{i,e} = 0$
Solving problem (1)
An odd edge is an edge of G belonging to every odd cycle of G.
A bridge is an edge of G whose removal disconnects G.
Solving problem (1)
Theorem 3: The unknown $x_e$ is an R-invariant if and only if e is an odd edge or is a bridge that disconnects a bipartite component of G.
Proof:
1) If e belongs to all odd cycles of G then G cannot contain an l-oddset.
2) If e is a bridge then it cannot belong to an even cycle.
Solving problem (1)
The case when e is an odd edge.
Suppose, for contradiction, that D is an even cycle containing e, and let C be an odd cycle of G (so C also contains e).
$D \triangle C$ is a set of edge-disjoint cycles not containing e.
$|D \triangle C| = |D| + |C| - 2\,|D \cap C|$
$|D \triangle C|$ is odd, so $D \triangle C$ must contain at least one odd cycle (contradiction).
Solving problem (1)
The case when e is a bridge disconnecting a bipartite component.
[Figure: a bridge e joining a non-bipartite component and a bipartite component.]
Solving problem (1)
E(H) = { e : e is a bridge of G }
V(H) = { v : v is a connected component of G - E(H) }
[Figure: the graph G and the graph H.]
Solving problem (1)
[Figures: Step 1 to Step 8, shown only graphically in the slides.]
Solving problem (1)
A DFS traversal of a graph gives a partition of the edges of G
tree edges
back edges
Each back edge e generates a cycle C(e)
The cycle C(e) is called a fundamental cycle with respect to the tree T
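A minimal sketch of this construction follows, assuming a simple connected undirected graph given as an adjacency dictionary. The iterative DFS and the function name are choices made here for illustration; each back edge is reported as a (descendant, ancestor) pair together with the edges of its fundamental cycle.

```python
def fundamental_cycles(adj, root):
    """DFS from root; map each back edge to the edge list of its cycle C(e)."""
    parent = {root: None}
    depth = {root: 0}
    cycles = {}
    stack = [(root, iter(adj[root]))]
    while stack:
        u, neighbours = stack[-1]
        for v in neighbours:
            if v not in parent:                          # tree edge (u, v)
                parent[v], depth[v] = u, depth[u] + 1
                stack.append((v, iter(adj[v])))
                break
            if v != parent[u] and depth[v] < depth[u]:   # back edge to an ancestor
                cycle, w = [(u, v)], u
                while w != v:                            # climb tree edges up to v
                    cycle.append((parent[w], w))
                    w = parent[w]
                cycles[(u, v)] = cycle
        else:
            stack.pop()                                  # all neighbours explored
    return cycles

# Hypothetical graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
cycles = fundamental_cycles(adj, 0)
```

The parity of a fundamental cycle is simply the parity of the length of its edge list.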
Solving problem (1)
Proposition: every cycle of G can be obtained as the symmetric difference of one or more fundamental cycles.
If e is an odd edge then
1) it must belong to every fundamental odd cycle of G
2) no fundamental even cycle of G contains e
Solving problem (1)
A back edge e belongs to every fundamental odd cycle of G if and only if C(e) is the only fundamental odd cycle.
For every tree edge e, we count the number of odd and even fundamental cycles containing e.
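The counting step can be sketched as follows. The DFS tree is assumed to be given as a child-to-parent map and the back edges as (descendant, ancestor) pairs; both input conventions, the function name and the example graph are assumptions of this sketch. It walks each fundamental cycle explicitly, which is simple but not the fastest possible way to obtain the counts.

```python
from collections import Counter

def count_cycles_per_tree_edge(parent, back_edges):
    """For each tree edge, count the odd and even fundamental cycles using it."""
    odd, even = Counter(), Counter()
    for u, v in back_edges:
        path, w = [], u
        while w != v:                     # tree path from u up to its ancestor v
            path.append((parent[w], w))
            w = parent[w]
        # The cycle C(e) is the tree path plus the back edge itself.
        bucket = odd if (len(path) + 1) % 2 else even
        bucket.update(path)
    return odd, even

# Hypothetical example: tree path 0-1-2-3 with back edges (2, 0) and (3, 0).
parent = {1: 0, 2: 1, 3: 2}
back_edges = [(2, 0), (3, 0)]
odd, even = count_cycles_per_tree_edge(parent, back_edges)
```

In this example the only odd fundamental cycle is the triangle closed by the back edge (2, 0), and every tree edge on it also lies on the even cycle closed by (3, 0), so no tree edge satisfies both necessary conditions for being an odd edge.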