Mauro Mezzini ANSWERING SUM-QUERIES : A SECURE AND EFFICIENT APPROACH University of Rome “La Sapienza” Computer Science Department


Introduction

Statistical database: users may ask for statistical information, such as sum, count, average, max and min queries, on a numerical attribute.

Retail

PRODUCT      SALES (€)
storage      90000
router       30000
server       30000
mainframe    25000

select sum(SALES) from Retail where PRODUCT = “storage” or PRODUCT = “router”;

r = 120.000

Introduction

Definition: The target K of a query q is the set of categories selected by q.

select sum(SALES) from Retail where PRODUCT = “storage” or PRODUCT = “router”;

K = {storage, router}

The efficiency issue

To speed up the answer of a sum-query, the query system is endowed with a set of pre-computed sum-queries called the set of materialised views.

q1: select sum(SALES) from Retail;

r1 = 175.000

q2: select sum(SALES) from Retail where PRODUCT = “storage” or PRODUCT = “router”;

r2 = 120.000

q: select sum(SALES) from Retail where PRODUCT = “server” or PRODUCT = “mainframe”;

r = r1 − r2 = 55.000
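The idea of answering q from the stored results of q1 and q2 can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the table is held in a dictionary, and the helper name `sum_query` is hypothetical, not part of the original slides.

```python
# Retail table from the example (sales in euros).
retail = {"storage": 90000, "router": 30000, "server": 30000, "mainframe": 25000}

def sum_query(table, target):
    """Evaluate a sum-query by scanning the table (the slow path)."""
    return sum(table[category] for category in target)

# Materialised views: results computed once and stored.
r1 = sum_query(retail, retail.keys())          # q1: all products
r2 = sum_query(retail, {"storage", "router"})  # q2

# The target of q is the target of q1 minus the target of q2,
# so its result follows from the stored results without touching the table.
r = r1 - r2
print(r1, r2, r)  # 175000 120000 55000
```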

Protection issue

To protect the confidentiality of the numerical attribute, the query system is endowed with the list of all sensitive categories.

q1: select sum(SALES) from Retail where PRODUCT = “storage”;

q2: select sum(SALES) from Retail where PRODUCT = “router”;

Retail

PRODUCT      SALES (€)
storage      90000
router       30000
server       30000
mainframe    25000

q1: select sum(SALES) from Retail where PRODUCT = “router” or PRODUCT = “server”;

q2: select sum(SALES) from Retail where PRODUCT = “storage” or PRODUCT = “server”;

q3: select sum(SALES) from Retail where PRODUCT = “storage” or PRODUCT = “router”;

r1 = 60.000

r2 = 120.000

r3 = 120.000

Protection issue

x1 + x2 = r1

x2 + x3 = r2

x1 + x3 = r3

The value of all confidential information can be inferred from the answer of non–confidential queries {q1, q2, q3 }.
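The inference can be made concrete with a small sketch. It assumes x1, x2, x3 stand for the sales of router, server and storage, computes the three right-hand sides directly from the Retail table, and solves the system with exact rational arithmetic. The function name `solve` is illustrative only.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over the rationals (square, nonsingular system)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [row[-1] for row in M]

sales = {"router": 30000, "server": 30000, "storage": 90000}

# Rows: x1 + x2, x2 + x3, x1 + x3 (x1 = router, x2 = server, x3 = storage).
A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
r = [sales["router"] + sales["server"],
     sales["server"] + sales["storage"],
     sales["router"] + sales["storage"]]

x = solve(A, r)
print([int(v) for v in x])  # [30000, 30000, 90000]
```

The three pairwise sums determine every individual value, which is exactly the disclosure the slide warns about.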

The inference model

Efficiency: Given a set of sum-queries V = {q1,…,qn}, determine whether the result of q can be inferred from V.

Protection: Given a set of sum-queries V = {q1,…,qn}, determine, for every inferable sum-query q, whether the result of q is sensitive information.

The inference model

Let $V = \{q_1, q_2, \dots, q_n\}$.

Let $K_i$ and $r_i$ be the target and the result of $q_i$, respectively.

Let $\{C_1, C_2, \dots, C_m\}$ be the coarsest partition of $\bigcup_{i=1}^{n} K_i$ such that each $K_i$ can be obtained as the union of one or more of the $C_j$.

The inference model is based on the following system of linear constraints:

\[
\sum_{j=1}^{m} a_{i,j}\, x_j = r_i \quad (i = 1, \dots, n), \qquad x \in F^m \qquad (1)
\]

where $a_{i,j} = 1$ if $C_j \subseteq K_i$ and $a_{i,j} = 0$ otherwise, and F is the domain of the numerical attribute.
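As a rough sketch of how the partition and the coefficients of system (1) could be derived from the targets: two categories end up in the same class exactly when they occur in the same set of targets. The function name `inference_system` and the representation of targets as Python sets are assumptions made for this illustration.

```python
from collections import defaultdict

def inference_system(targets):
    """Return (classes, A): the coarsest partition of the union of the targets
    in which every target is a union of classes, and the 0/1 matrix of system (1)."""
    by_signature = defaultdict(set)
    for item in set().union(*targets):
        # Signature of a category: the indices of the targets that contain it.
        signature = frozenset(i for i, K in enumerate(targets) if item in K)
        by_signature[signature].add(item)
    classes = sorted(by_signature.values(), key=sorted)
    A = [[1 if C <= K else 0 for C in classes] for K in targets]
    return classes, A

targets = [{"router", "server"}, {"storage", "server"}, {"storage", "router"}]
classes, A = inference_system(targets)
print(classes)  # [{'router'}, {'server'}, {'storage'}]
print(A)        # [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
```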

The inference model. An example

K1 = {router, server}

K2 = {storage, server}

K3 = {storage, router}

C1 = {router}, C2 = {server}, C3 = {storage}

F is the set of non-negative reals

r1 = 60.000

r2 = 120.000

r3 = 120.000

x1 + x2 = r1

x2 + x3 = r2

x1 + x3 = r3

The inference model

Definition: Given a subset S of {1, 2, …, m}, the sum-expression

\[
\sum_{j \in S} x_j
\]

is an F-invariant if it takes on the same value for every solution x of (1).

An F-invariant sum is the result of the sum-query with target

\[
\bigcup_{j \in S} C_j
\]

The inference model

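Definitions: Given the partition $\{C_1, \dots, C_m\}$ and a query q with target K, consider the two sets:

$S = \{\, j : C_j \subseteq K \,\}$, the support of q

$\bar{S} = \{\, j : C_j \cap K \neq \emptyset \text{ and } C_j - K \neq \emptyset \,\}$, the cosupport of q

The sum

\[
\sum_{j \in S \cup \bar{S}} x_j
\]

is called the sum-expression associated to q.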

The inference model. An example

K1 = {router, server}

K2 = {storage, server}

K3 = {storage, router}

C1 = {router}, C2 = {server}, C3 = {storage}

r1 = 60.000

r2 = 120.000

r3 = 120.000

x1 + x2 = r1

x2 + x3 = r2

x1 + x3 = r3

q: select sum(SALES) from Retail where PRODUCT = “storage”;

K = {storage}

The support of q is {3}, the cosupport is empty, and the sum-expression associated to q is trivially:

x3

Problems definitions

1) Given a sum-expression $\sum_{j \in S} x_j$, decide whether it is an F-invariant.

2) Given a sum-expression $\sum_{j \in S} x_j$ that is not an F-invariant, find a nonempty subset $S'$ of $S$ such that $\sum_{j \in S'} x_j$ is an F-invariant.

Problem (2)

Let S be a subset of {1,…,m} and let s be the characteristic vector of S. Then

\[
s(i) =
\begin{cases}
1 & \text{if } i \in S \\
0 & \text{if } i \notin S
\end{cases}
\qquad i = 1, \dots, m
\]

Problem (2)

We can rewrite system (1) as: $A x = r$, $x \in F^m$.

An m-dimensional vector f is a linear combination of the rows of A if

\[
f = \sum_{i=1}^{n} \lambda_i\, a_i
\]

where $\lambda_i \in \mathbb{R}$ and $a_i$ is the i-th row of A, for $i = 1, \dots, n$.

Problem (2)

Definition: A subset S of {1,2,…,m} is said to be algebraic if its characteristic vector can be expressed as a linear combination of the rows of the matrix A.

If F is $\mathbb{R}$ or $\mathbb{Z}$, then $\sum_{j \in S} x_j$ is F-invariant if and only if S is algebraic.

Problem definition: Given a sum-expression

\[
\sum_{j \in S} x_j
\]

that is not R-invariant, find a non-empty algebraic subset of S (NAS Problem).

NAS Problem: find a non-empty subset F of S such that the characteristic vector of F is expressible as a linear combination of the rows of A.
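A minimal sketch of both notions follows, assuming A is given as a list of rows and subsets use the 1-based indices of the slides. A set is algebraic when appending its characteristic vector to A does not raise the rank. The search for a non-empty algebraic subset is a plain exhaustive enumeration, used here only to illustrate the problem statement; the names `rank`, `is_algebraic` and `find_algebraic_subset` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a rational matrix by Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in rows]
    rk = 0
    for col in range(len(M[0]) if M else 0):
        pivot = next((r for r in range(rk, len(M)) if M[r][col] != 0), None)
        if pivot is None:
            continue
        M[rk], M[pivot] = M[pivot], M[rk]
        for r in range(rk + 1, len(M)):
            factor = M[r][col] / M[rk][col]
            M[r] = [v - factor * w for v, w in zip(M[r], M[rk])]
        rk += 1
    return rk

def is_algebraic(A, S):
    """True if the characteristic vector of S lies in the row space of A."""
    m = len(A[0])
    s = [1 if j in S else 0 for j in range(1, m + 1)]
    return rank(A + [s]) == rank(A)

def find_algebraic_subset(A, S):
    """Exhaustive search for a non-empty algebraic subset of S (exponential time)."""
    for size in range(1, len(S) + 1):
        for F in combinations(sorted(S), size):
            if is_algebraic(A, set(F)):
                return set(F)
    return None

# Two queries over four classes: x1 + x2 = r1 and x3 + x4 = r2.
A = [[1, 1, 0, 0], [0, 0, 1, 1]]
print(is_algebraic(A, {1, 2, 3}))           # False
print(find_algebraic_subset(A, {1, 2, 3}))  # {1, 2}
```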

The NAS Problem

The subset sum problem (SSP):

Given a set S = {1,…,p} and a mapping

\[
a : S \to \mathbb{Z}
\]

such that $a(i) > 0$ for $i = 1, \dots, p-1$ and $a(i) < 0$ for $i = p$, find a non-empty subset F of S such that

\[
\sum_{i \in F} a(i) = 0
\]

The NAS Problem

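Let c be a q-dimensional vector, with $q \ge p$, such that

\[
c(1) = a(1), \quad c(2) = a(2), \quad \dots, \quad c(p) = a(p)
\]

and

\[
c(j) \in \mathbb{R} \quad \text{for } p < j \le q
\]

Let $M = (I, c)$ be the $q \times (q+1)$ matrix obtained from c, where I is the $q \times q$ identity matrix.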

The NAS Problem

Example: let S={1, 2, 3, 4} and

a(1) = 1

a(2) = 2

a(3) = 5

a(4)= -7

The subset F = { 2, 3, 4} of S is a solution of the SSP since

a(2) + a(3) + a(4) = 2 + 5 – 7 = 0.
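For an instance this small, the solutions of the SSP can be listed by exhaustive enumeration. The sketch below is only a check of the example (it takes exponential time in general), and the function name `subset_sum_solutions` is hypothetical.

```python
from itertools import combinations

def subset_sum_solutions(a):
    """a maps each element of S to an integer; yield every non-empty F with zero sum."""
    S = sorted(a)
    for size in range(1, len(S) + 1):
        for F in combinations(S, size):
            if sum(a[i] for i in F) == 0:
                yield F

a = {1: 1, 2: 2, 3: 5, 4: -7}
solutions = list(subset_sum_solutions(a))
print(solutions)  # [(2, 3, 4)]
```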

The NAS Problem

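If we choose q = 5, the vector c is $(1, 2, 5, -7, c(5))$, where $c(5)$ is a real number, and the matrix M is

\[
M = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 2 \\
0 & 0 & 1 & 0 & 0 & 5 \\
0 & 0 & 0 & 1 & 0 & -7 \\
0 & 0 & 0 & 0 & 1 & c(5)
\end{pmatrix}
\]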

The NAS Problem

The vector $\bar{c} = (-c, 1)$ is a solution of the equation

\[
M y = 0
\]

\[
\begin{aligned}
y_1 + 1\, y_6 &= 0 \\
y_2 + 2\, y_6 &= 0 \\
y_3 + 5\, y_6 &= 0 \\
y_4 - 7\, y_6 &= 0 \\
y_5 + c(5)\, y_6 &= 0
\end{aligned}
\]

The NAS Problem

Theorem: Given the matrix M and the set S = {1,…,p}, the SSP has a solution if and only if there exists a nonempty algebraic subset of S.

Proof

The (q+1)-dimensional vector $\bar{c} = (-c, 1)$ spans the null space of M, that is, the solution set of

\[
M y = 0
\]

and the null space of M has dimension equal to one.

The NAS Problem

If $F \subseteq S$ is an algebraic set, then its characteristic vector f is expressible as a linear combination of the rows of M. Since f and $\bar{c} = (-c, 1)$ are orthogonal,

\[
\sum_{i=1}^{q+1} f(i)\, \bar{c}(i) = 0
\]

that is

\[
0 = \sum_{i \in F} c(i) = \sum_{i \in F} a(i)
\]

qed.

The NAS Problem

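Example: let S = {1, 2, 3} and

a(1) = 2, a(2) = 2, a(3) = -4

then

c0 = (2, 2, -4)

c1 = (-1, -1, 1)

c2 = (-1, -1, 1)

c3 = (2, 2, -2, 1, 1)

let

c = (c0, c1, c2, c3)

Then M would be

\[
\left(\begin{array}{*{15}{c}}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -4 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}\right)
\]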

The NAS Problem

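Step (1)

\[
\left(\begin{array}{*{15}{c}}
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -4 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}\right)
\]

Row blocks, top to bottom: c0 (rows 1 to 3), c1 (rows 4 to 6), c2 (rows 7 to 9), c3 (rows 10 to 14).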

The NAS Problem

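Step (3)

\[
\left(\begin{array}{*{15}{c}}
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -4 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}\right)
\]

Row blocks, top to bottom: c0 (rows 1 to 3), c1 (rows 4 to 6), c2 (rows 7 to 9), c3 (rows 10 to 14).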

The NAS Problem

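Step (4)

\[
\left(\begin{array}{*{15}{c}}
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -4 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}\right)
\]

Row blocks, top to bottom: c0 (rows 1 to 3), c1 (rows 4 to 6), c2 (rows 7 to 9), c3 (rows 10 to 14).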

The NAS Problem

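Step (5)

\[
\left(\begin{array}{*{15}{c}}
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}\right)
\]

Row blocks, top to bottom: c0 (rows 1 to 3), c1 (rows 4 to 6), c2 (rows 7 to 9), c3 (rows 10 to 14).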

The NAS Problem

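Step (6)

\[
\left(\begin{array}{*{15}{c}}
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}\right)
\]

Row blocks, top to bottom: c0 (rows 1 to 3), c1 (rows 4 to 6), c2 (rows 7 to 9), c3 (rows 10 to 14).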

The NAS Problem

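Final step

\[
\left(\begin{array}{*{15}{c}}
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{array}\right)
\]

Row blocks, top to bottom: c0 (rows 1 to 3), c1 (rows 4 to 6), c2 (rows 7 to 9), c3 (rows 10 to 14).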


The NAS Problem

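For $a(i) > 1$, $i = 1, \dots, p-1$:

\[
k_i = \lfloor \log_2 a(i) \rfloor
\]

\[
c_i = \left(
-\left\lfloor \frac{a(i)}{2} \right\rfloor,\;
-\left\lfloor \frac{a(i)}{2} \right\rfloor,\;
\left\lfloor \frac{a(i)}{2} \right\rfloor,\;
\dots,\;
-\left\lfloor \frac{a(i)}{2^{k_i}} \right\rfloor,\;
-\left\lfloor \frac{a(i)}{2^{k_i}} \right\rfloor,\;
\left\lfloor \frac{a(i)}{2^{k_i}} \right\rfloor
\right)
\]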

The NAS Problem

a(i) = 7

$k_i = \lfloor \log_2 7 \rfloor = 2$

$c_i = (-3, -3, 3, -1, -1, 1)$

a(i) = 8

$k_i = \lfloor \log_2 8 \rfloor = 3$

$c_i = (-4, -4, 4, -2, -2, 2, -1, -1, 1)$
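The pattern in the two examples above can be reproduced with a short sketch: for each power $2^t$ with $t = 1, \dots, k_i$, append the triple $(-h, -h, h)$ with $h = \lfloor a(i)/2^t \rfloor$. This reading is inferred from the examples, and the function name `block_vector` is hypothetical.

```python
def block_vector(a):
    """Vector c_i for an integer a > 1, following the pattern of the examples:
    triples (-h, -h, h) with h = floor(a / 2**t) for t = 1..floor(log2(a))."""
    if a <= 1:
        raise ValueError("a must be an integer greater than 1")
    k = a.bit_length() - 1  # floor(log2(a)) for a positive integer
    c = []
    for t in range(1, k + 1):
        h = a >> t          # floor(a / 2**t)
        c.extend([-h, -h, h])
    return c

print(block_vector(7))  # [-3, -3, 3, -1, -1, 1]
print(block_vector(8))  # [-4, -4, 4, -2, -2, 2, -1, -1, 1]
```

Each vector has $3 k_i$ entries, which is where the factor 3 in the size bound on q comes from.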

The NAS Problem

\[
B = \max\{\, |a(i)| : i = 1, \dots, p \,\}
\]

The SSP has input size $O(p \times \log_2 B)$.

\[
k_i \le \log_2 B
\]

The dimension of the matrix M is $q \times (q+1)$, where

\[
q \le (p + 1) \times 3 \log_2 B = O(p \times \log_2 B)
\]

Solving problem (1)

\[
A x = r, \qquad x \in F^m
\]

Is $\sum_{j \in S} x_j$ an F-invariant?

A is the vertex-edge incidence matrix of a graph, F is the set of reals and S is a singleton.

[Figure: a graph with edge unknowns x1, …, x8 and vertex values r1, …, r6.]

Solving problem (1)

Consider the homogeneous system associated to system (1)

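\[
A y = 0, \qquad y \in \mathbb{R}^m \qquad (2)
\]

We call a solution y of system (2) a circulation.

[Figure: a circulation on a graph; four edges are marked +, +, −, − and the remaining labels are 0.]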

Solving problem (1)

Definition: Given a circulation y, its support is the set

\[
C = \{\, e : y_e \neq 0 \,\}
\]

[Figure: a circulation with edge labels +, +, −, − and 0.]

Solving problem (1)

Theorem 1: The unknown $x_e$ is an R-invariant if and only if $e \notin C$ for every circulation y with support C.

Proof: Let x* be a particular solution of (1). Then every solution x of (1) can be written as

\[
x = x^* + y
\]

for some circulation y. So if $y_e = 0$ for every circulation y, then $x_e = x_e^*$ for every solution x of (1).

Conversely, if $x_e$ is invariant, then

\[
x_e - x_e^* = 0 = y_e
\]

for every solution x of (1). Therefore $y_e = 0$ for every circulation y.

Solving problem (1)

Definition: A circulation y with support C is minimal if there is no circulation with support $C'$ such that $C' \subset C$.

[Figure: a circulation with edge labels +, +3, −2, −2, +, −, +, −.]

Solving problem (1)

The supports of minimal circulations are called circuits; they are the even cycles and the L-oddsets of the graph.

[Figure: examples of circuits, with edge labels +, −, +2 and −2.]

Solving problem (1)

Given a circulation y, then

\[
y = \sum_{i=1}^{p} \lambda_i\, y_i
\]

where $\lambda_i \in \mathbb{R}$,

$B = \{y_1, \dots, y_p\}$ is a basis of N,

and each $y_i$ is a circuit of G.

Solving problem (1)

[Figure: circulations on the example graph, with edge labels +2, +, − and +β, +β, −β, −β.]

Solving problem (1)

Theorem 2: The unknown $x_e$ is an R-invariant if and only if $e \notin C$ for every circuit $y_i$ with support C.

Proof:

\[
y_e = \sum_{i=1}^{p} \lambda_i\, y_{i,e} = 0
\]

Solving problem (1)

An odd edge is an edge of G belonging to every odd cycle of G.

A bridge is an edge of G whose removal disconnects G.

Solving problem (1)

Theorem 3: The unknown $x_e$ is an R-invariant if and only if e is an odd edge or e is a bridge that disconnects a bipartite component of G.

Proof:

1) If e belongs to all odd cycles of G, then G cannot contain an L-oddset.

2) If e is a bridge, then it cannot belong to an even cycle.

Solving problem (1)

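The case when e is an odd edge.

Suppose, for contradiction, that D is an even cycle containing e, and let C be an odd cycle of G (so C contains e).

$D \,\triangle\, C$ is a set of edge-disjoint cycles not containing e.

\[
|D \,\triangle\, C| = |D| + |C| - 2\,|D \cap C|
\]

$|D \,\triangle\, C|$ is odd, so $D \,\triangle\, C$ must contain at least one odd cycle (contradiction).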

Solving problem (1)

The case when e is a bridge disconnecting a bipartite component.

[Figure: a bridge e joining a non-bipartite component and a bipartite component.]

Solving problem (1)

$E(H) = \{\, e : e \text{ is a bridge of } G \,\}$

$V(H) = \{\, v : v \text{ is a connected component of } G - E(H) \,\}$

[Figure: a graph G and the corresponding graph H.]

Solving problem (1)

[Figures: Step 1 to Step 8 of a worked example, one slide per step; the images are not included.]

Solving problem (1)

A DFS traversal of G partitions the edges of G into tree edges and back edges.

Each back edge e generates a cycle C(e)

The cycle C(e) is called a fundamental cycle with respect to the tree T
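A minimal sketch of this construction, assuming a simple connected undirected graph stored as an adjacency dictionary. The function name `fundamental_cycles` and the output format are choices made for this illustration; the recursion is adequate for small graphs only.

```python
def fundamental_cycles(adj, root):
    """DFS from root. Returns (tree_edges, cycles), where cycles maps each
    back edge (u, v) to the list of vertices of its fundamental cycle C(e)."""
    parent = {root: None}
    depth = {root: 0}
    tree_edges, cycles = [], {}

    def visit(u):
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                depth[v] = depth[u] + 1
                tree_edges.append((u, v))
                visit(v)
            elif v != parent[u] and depth[v] < depth[u]:
                # (u, v) is a back edge: v is a proper ancestor of u in the tree.
                path = [u]
                while path[-1] != v:
                    path.append(parent[path[-1]])
                cycles[(u, v)] = path

    visit(root)
    return tree_edges, cycles

# A triangle 1-2-3 sharing the edge 2-3 with a square 2-3-4-5.
adj = {1: [2, 3], 2: [1, 3, 5], 3: [2, 1, 4], 4: [3, 5], 5: [4, 2]}
tree_edges, cycles = fundamental_cycles(adj, 1)
print(tree_edges)  # [(1, 2), (2, 3), (3, 4), (4, 5)]
print(cycles)      # {(3, 1): [3, 2, 1], (5, 2): [5, 4, 3, 2]}
```

In this example the back edge (3, 1) generates an odd fundamental cycle and the back edge (5, 2) an even one.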

Solving problem (1)

Proposition: every cycle of G can be obtained as the symmetric difference of one or more fundamental cycles.

If e is an odd edge then

1) it must belong to every fundamental odd cycle of G

2) no fundamental even cycle of G contains e

Solving problem (1)

A back edge e belongs to every fundamental odd cycle of G if and only if C(e) is the only fundamental odd cycle.

For every tree edge e we count the number of odd and even fundamental cycles containing e.
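The counting step can be sketched as follows. The input format (a dictionary from back edges to the vertex list of each fundamental cycle) is an assumption of this example, and the function names are hypothetical. The result lists the edges that satisfy the two necessary conditions stated earlier: the edge lies on every odd fundamental cycle and on no even fundamental cycle.

```python
from collections import Counter

def parity_counts(cycles):
    """cycles: dict back_edge -> vertex list of its fundamental cycle.
    Returns (odd, even): for every edge, the number of odd and even
    fundamental cycles that contain it."""
    odd, even = Counter(), Counter()
    for path in cycles.values():
        # Consecutive vertices, plus the closing edge from last back to first.
        edges = [frozenset(pair) for pair in zip(path, path[1:] + path[:1])]
        bucket = odd if len(edges) % 2 else even
        bucket.update(edges)
    return odd, even

def odd_edge_candidates(cycles):
    """Edges on every odd fundamental cycle and on no even fundamental cycle."""
    odd, even = parity_counts(cycles)
    n_odd = sum(1 for path in cycles.values() if len(path) % 2)
    return {e for e, count in odd.items() if count == n_odd and even[e] == 0}

# Fundamental cycles of a triangle 1-2-3 sharing the edge 2-3 with a square 2-3-4-5.
cycles = {(3, 1): [3, 2, 1], (5, 2): [5, 4, 3, 2]}
candidates = odd_edge_candidates(cycles)
print(sorted(sorted(e) for e in candidates))  # [[1, 2], [1, 3]]
```

In this small graph the odd cycles are 1-2-3 and 1-2-5-4-3, and both contain the edges 1-2 and 1-3, so the candidates are indeed its odd edges.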
