Proof-Infused Streams: Authenticating Sliding Window Queries on Data Streams
Feifei Li, Florida State University
Ke Yi, Hong Kong University of Science & Technology
Marios Hadjieleftheriou, AT&T Labs Research
George Kollios, Boston University
Outsourced stream model: stock trading monitoring
Provider: a stock broker
Servers (e.g. Bloomberg)
Clients
Register queries (Q): sliding window query and/or one-shot query
Data Publishing Model [HIM02]
• Owner: publishes data
• Servers: host (or monitor) the data and provide query services
• Clients: query the owner's data through servers
H. Hacigumus, B. R. Iyer, and S. Mehrotra, ICDE02
Information Security Issues
The third-party (server) cannot be trusted
Lazy server
Malicious intent
Compromised equipment
Unintentional errors (e.g. bugs)
Problem 1: Injection
Query: Select * from T where 5<A<11
Table T, held by both the owner and the server:
A B
r1 …
… …
ri-1 4
ri 7
ri+1 9
ri+2 11
The server returns 7, 8, 9 to the client: the value 8 does not exist in the owner's table and has been injected.
Problem 2: Drop
Query: Select * from T where 5<A<11
Table T, held by both the owner and the server:
A B
r1 …
… …
ri-1 4
ri 7
ri+1 9
ri+2 11
The server returns only 7 to the client: the matching record ri+1 (value 9) has been dropped.
Query Authentication: Goals
• Query correctness: results do exist in the owner's database
• Query completeness: no records have been omitted from the result
General Approach
[Figure: owner, servers, clients. The owner's table is published together with authenticated structures; clients receive query results plus a verification object (VO).]
Sliding Window Query
SELECT SUM(stock_price) FROM Stock_trace
WHERE stock_name = A in last 5 Minutes
SLIDES every 1 minute
Time-based window

SELECT SUM(stock_price) FROM Stock_trace
WHERE stock_name = A in last 100 Trades
SLIDES every 1 trade
Tuple-based window

This talk concentrates on tuple-based windows; the generalization to time-based windows is in the paper. For a tuple-based window, the timestamp is simply the arrival id of the tuple.
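[Figure: stream of (price, stock) tuples 2,A 2,B 5,A 9,C 4,A 8,A 7,C 7,B … 2,D. The window covers the most recent n tuples; labels x_{t-n}, x_{t-n+1}, x_t and x_{t+1} mark how it slides when a new tuple arrives.]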
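As a rough illustration of the tuple-based window semantics only (not of the authentication), the following Python sketch maintains SUM(stock_price) for one stock over the n most recent tuples and emits a new value every time the window slides by one trade. The function name `sliding_sums` and the (price, name) tuple layout are assumptions made for this example.

```python
from collections import deque

def sliding_sums(stream, name, n):
    """Yield SUM(price) over the n most recent tuples whose stock name matches."""
    window = deque()   # the n most recent tuples, oldest on the left
    total = 0          # running sum of matching prices inside the window
    for price, stock in stream:
        window.append((price, stock))
        if stock == name:
            total += price
        if len(window) > n:
            # The window slides: the oldest tuple expires.
            old_price, old_stock = window.popleft()
            if old_stock == name:
                total -= old_price
        yield total

trades = [(2, "A"), (2, "B"), (5, "A"), (9, "C"), (4, "A")]
sums = list(sliding_sums(trades, "A", 3))
```

Each arriving tuple costs constant work here; the authenticated structures in the rest of the talk aim to let a client check such answers without trusting the server.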
One Shot Query
SELECT SUM(stock_price) FROM Stock_trace
WHERE stock_name = A in last 100 Trades
Tuple-based window
[Figure: stream of (price, stock) tuples 2,A 2,B 5,A 9,C 4,A 8,A 7,C 7,B …; the query is evaluated once over the most recent n tuples, x_{t-n} … x_t.]
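Merkle Hash Tree [M89]: Amortizing Signature Cost
[Figure: binary Merkle tree over messages m1 … m8 with leaf hashes h1 … h8, internal nodes h12, h34, h56, h78, then h1..4 and h5..8, and root h1..8, where e.g. h12 = H(h1|h2). The owner signs the root: σ = Sign(h1..8, SK).]
[Figure: verification path for m5, m6: h5, h6 → h56; with h78 → h5..8; with h1..4 → h1..8; check Ver(h1..8, σ, PK) = valid?]
[M89] R. C. Merkle. CRYPTO, 1989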
• Collision-resistant hash function: any change in the tree leads to a different hash value for the root
• Digital signature of the root: no one except the owner could produce the signature
• The hash function is publicly known
• A single signature signs many messages
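A minimal Python sketch of the idea, under a few assumptions: SHA-256 stands in for the hash function H, the number of messages is a power of two, and the signature on the root is left out (the client is assumed to hold a root whose signature it has already verified). The helper names `build_levels`, `proof` and `root_from_proof` are made up for this example.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(messages):
    """Return all tree levels, from the leaf hashes up to the single root."""
    level = [H(m) for m in messages]
    levels = [level]
    while len(level) > 1:
        # Parent hash = H(left child | right child), as in h12 = H(h1|h2).
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels, index):
    """Collect the sibling hashes along the path from leaf `index` to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))  # (hash, sibling is on the left?)
        index //= 2
    return path

def root_from_proof(message, path):
    """Recompute the root from one message and its sibling hashes."""
    h = H(message)
    for sibling, sibling_is_left in path:
        h = H(sibling + h) if sibling_is_left else H(h + sibling)
    return h

messages = [f"m{i}".encode() for i in range(1, 9)]
levels = build_levels(messages)
root = levels[-1][0]            # the owner would sign this value once
path_m6 = proof(levels, 5)      # three sibling hashes for m6: h5, h78, h1..4
```

A client holding the signed root accepts m6 only if `root_from_proof(b"m6", path_m6)` equals that root; any altered message yields a different root.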
Extends to Range Query: f=2 (f is the fanout)
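Select * from T where 5<A<11
[Figure: Merkle tree over the sorted values 1 2 3 4 5 6 9 12 with leaf hashes h1 … h8, internal nodes h12, h34, h56, h78, h1..4, h5..8, root h1..8 and Sign(h1..8, SK). The query q has left boundary LB(q) = 5 and right boundary RB(q) = 12.]
VO: 5, 12, h1..4, σ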
Client Side Verification
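Select * from T where 5<A<11
VO: 5, 12, h1..4, σ
Query results: 6, 9
Reconstruct the query subtree: from 5, 6, 9, 12 compute h5 … h8, then h56, h78 and h5..8; combine with h1..4 from the VO to obtain h1..8. The values under h1..4 remain unknown to the client.
Check: Ver(h1..8, PK, σ) valid?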
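The client-side check for this particular example can be sketched in Python as follows. This is an illustration hard-wired to the tree shape on the slide (the query subtree is the right half of the tree), with SHA-256 standing in for H and with the signature check replaced by a comparison against a root the client already trusts; `verify_range` and its parameters are hypothetical names.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf(value: int) -> bytes:
    return H(str(value).encode())

def root_of(hashes):
    """Fold a list of hashes pairwise up to a single hash (an odd node is promoted)."""
    while len(hashes) > 1:
        nxt = [H(hashes[i] + hashes[i + 1]) for i in range(0, len(hashes) - 1, 2)]
        if len(hashes) % 2:
            nxt.append(hashes[-1])
        hashes = nxt
    return hashes[0]

def verify_range(results, lb, rb, left_sibling, lo, hi, trusted_root):
    """Check the answer to lo < A < hi against an already-verified root."""
    values = [lb] + list(results) + [rb]
    if values != sorted(values):
        return False
    if not (lb <= lo and hi <= rb):              # boundaries must enclose the range
        return False
    if any(not (lo < v < hi) for v in results):  # results must lie inside the range
        return False
    subtree = root_of([leaf(v) for v in values])   # h5..8 on the slide
    return H(left_sibling + subtree) == trusted_root

# Owner side: build the tree over the sorted values once.
owner_leaves = [leaf(v) for v in [1, 2, 3, 4, 5, 6, 9, 12]]
h1_4 = root_of(owner_leaves[:4])
root = H(h1_4 + root_of(owner_leaves[4:]))

ok = verify_range([6, 9], 5, 12, h1_4, 5, 11, root)
dropped = verify_range([6], 5, 12, h1_4, 5, 11, root)
injected = verify_range([6, 8, 9], 5, 12, h1_4, 5, 11, root)
```

Dropping 9 or injecting 8 changes the reconstructed subtree hash, so the recomputed root no longer matches.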
Solution Overview
• Sign every tuple (with query attribute(s) and timestamp)
  - Expensive update cost for the data provider
  - Expensive communication cost between server and clients, as the VO size is large
  - But it provides timely answers on a per-tuple basis
• Amortize the signing cost by “proof-infusing” on a group of tuples: a delayed response can often be tolerated.
• A query with d query attributes is a query in d+1 dimensions.
• N: maximum window size; n: window size for a particular query; b: the delay
Tumbling Merkle Tree (TM-tree)
[Figure: along the time axis, one Merkle binary search tree is built for every b tuples.]
Each tree root is signed together with the timestamps of its first and last tuples: Sign(hroot|t1|tb), where ti is the timestamp of the ith tuple.
TM-tree Continues
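[Figure: tuples arriving over time, plotted against query attribute A.]
Each group of b tuples is sorted by A, and a Merkle tree is then built over the sorted group.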
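The construction described on the TM-tree slides (one tree per b tuples, sorted by the query attribute, root signed together with t1 and tb) might be sketched as below. This is an assumption-laden illustration: SHA-256 stands in for H, the leaf encoding is invented, incomplete trailing groups are simply skipped, and instead of calling a real signature scheme the code only returns the byte string hroot|t1|tb that the owner would sign.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(hashes):
    """Pairwise-hash a list of leaf hashes up to one root (an odd node is promoted)."""
    while len(hashes) > 1:
        nxt = [H(hashes[i] + hashes[i + 1]) for i in range(0, len(hashes) - 1, 2)]
        if len(hashes) % 2:
            nxt.append(hashes[-1])
        hashes = nxt
    return hashes[0]

def tumbling_trees(stream, b):
    """stream: list of (timestamp, value). Build one tree per full group of b tuples."""
    trees = []
    for start in range(0, len(stream) - b + 1, b):
        group = stream[start:start + b]
        t1, tb = group[0][0], group[-1][0]               # boundary timestamps
        by_value = sorted(group, key=lambda tv: tv[1])   # sort by query attribute
        root = merkle_root([H(f"{t}|{v}".encode()) for t, v in by_value])
        to_sign = root + f"|{t1}|{tb}".encode()          # hroot|t1|tb
        trees.append({"t1": t1, "tb": tb, "sorted": by_value, "to_sign": to_sign})
    return trees

stream = list(enumerate([2, 2, 5, 9, 4, 8, 7, 7], start=1))
trees = tumbling_trees(stream, b=4)
```

With b = 4 the eight tuples produce two trees, so the owner signs twice instead of eight times, at the price of a delay of up to b tuples.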
Sliding window query on the TM-tree
1. Initialization: query n/b trees
2. Window slides
3. Incremental update: query four boundary trees
[Figure legend: tuples to be removed from results; tuples to be added to results]
Query the TM-tree
[Figure: value vs. time; the query Q shifts by b. Legend: sent to clients, removed from results, added to results, false positives.]
Correctness and Completeness
• Correctness: guaranteed by each individual Merkle tree.
• Completeness: completeness within each small Merkle tree is guaranteed by the range-query technique covered in the first part of this talk.
• Overall completeness: check that the returned results were obtained by querying consecutive trees that fall within the query range on the time dimension, and that these trees completely cover the query range on the time dimension.
• This is possible because the timestamps of the two boundary tuples are signed in each tree (hence the server has to include these timestamps in the VO).
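The overall completeness check could look roughly like the following Python sketch. It assumes tuple-based timestamps (consecutive integer arrival ids) and that each (t1, tb) pair has already been authenticated through its tree's signature; `covers_window` is a hypothetical helper name.

```python
def covers_window(tree_spans, q_start, q_end):
    """tree_spans: (t1, tb) pairs taken from the signed trees in the VO.

    Returns True when the trees are consecutive (no gap in arrival ids)
    and together cover the whole time range [q_start, q_end].
    """
    spans = sorted(tree_spans)
    if not spans:
        return False
    if spans[0][0] > q_start or spans[-1][1] < q_end:
        return False                      # the window sticks out on one side
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if next_start != prev_end + 1:
            return False                  # a tree in between was left out
    return True

full = covers_window([(1, 4), (5, 8), (9, 12)], 3, 10)
gap = covers_window([(1, 4), (9, 12)], 3, 10)
short = covers_window([(5, 8)], 3, 8)
```

A server that silently skips the middle tree is caught by the gap test, because it cannot forge the signed boundary timestamps of the neighbouring trees.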
Limitation of TM-tree
• Only supports one-dimensional queries
• False positives lead to a large VO size, especially when each tuple has non-trivial size
Merkle kd tree (Mkd-tree)
• To get rid of false positives, we need a multi-dimensional indexing structure.
• KD-tree: an excellent candidate, with bounded query performance of $O(\sqrt{b})$ and $O(b \log b)$ cost to bulk-load.
• A space-partitioning structure: partition along each dimension in turn.
Mkd-tree and TMkd-tree
Incorporating the Merkle tree into the KD-tree:
• Leaf node: H(p), where p is the point contained in this node
• Index node u with children v, w and dividing line lu: H(hv|hw|lu)
Tumbling Merkle kd-tree (TMkd-tree):
• Same idea as the TM-tree, but each small tree is an Mkd-tree.
• Boundary trees no longer introduce false positives!
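The two hashing rules above (H(p) at a leaf, H(hv|hw|lu) at an index node) can be illustrated with a short recursive Python sketch. The median split, the byte encodings of points and dividing lines, and the use of SHA-256 are assumptions of this example rather than details given in the talk.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def mkd_hash(points, depth=0):
    """Return the hash of a kd-tree built over a non-empty list of points (tuples)."""
    if len(points) == 1:
        return H(repr(points[0]).encode())            # leaf node: H(p)
    axis = depth % len(points[0])                     # partition along each dimension in turn
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    line = f"{axis}:{pts[mid][axis]}".encode()        # dividing line l_u
    hv = mkd_hash(pts[:mid], depth + 1)               # left child v
    hw = mkd_hash(pts[mid:], depth + 1)               # right child w
    return H(hv + hw + line)                          # index node: H(hv|hw|lu)

points = [(1, 5), (2, 3), (4, 9), (7, 1)]
root = mkd_hash(points)
tampered = mkd_hash([(1, 5), (2, 3), (4, 9), (7, 2)])
```

Because the dividing line is hashed into every index node, a server cannot misreport where the space was split without changing the root hash.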
Is this good enough?
Tumbling trees are good for maintaining the update to sliding window queries
They both have linear space to N and log b update cost, and
But they are expensive for answering one-shot queries (or the initialization of sliding window queries): a query with window size n has to query n/b trees, which is linear in n and could be expensive for large values of n.
costquery or log kbbkbbb
Dyadic Merkle kd-tree (DMkd-tree): 1D queries
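[Figure: dyadic hierarchy of trees over the stream: trees covering b tuples at the bottom level, then 2b, 4b, … up to N+b. Legend: Merkle tree, Mkd-tree. A query Q is shown against the hierarchy, and expired trees are marked as discarded.]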
Exponential Merkle kd-tree (EMkd-tree): Multi-dimensional queries
[Figure: trees T0, T'0, T1, T'1, …, Tl, T'l covering b, 2b, 4b, … tuples. Legend: materialized Mkd-tree, non-materialized Mkd-tree. A new T0 is created as tuples arrive; a query Q is shown against the trees.]
Some Experiments
• We use real streams:
  - World Cup data (WC)
  - IP traces from the AT&T network (IP)
• We perform the following queries:
  - WC: the query attribute is the response size
  - IP: the query attribute is the packet size
• Hardware: Linux machine with a 2.8 GHz Intel Pentium 4 CPU
Tumbling trees: update cost
1. b=1000 is a sweet spot
2. This delay is small: in real streams it spans less than one or two seconds
Tumbling trees: size
Both have linear size (in the number of tuples covered by the maximum window size N)
Query cost per sliding period, b=1,000: fixed sliding period as b
A linear scan of the TM-tree at the leaf level results in locality, which greatly improves its performance
VO size per sliding period, b=1,000: fixed sliding period as b
TM-tree incurs roughly 4γb false positives
DMkd-tree, EMkd-tree: update cost
DMkd, EMkd trees: size
One Shot Query Cost
One Shot Query: VO size
Summary
• All trees support aggregations
• TM-tree and DMkd-tree support only one-dimensional queries
• TMkd-tree and EMkd-tree support multi-dimensional queries
• Tumbling trees are good for maintaining updates to sliding window queries, while DMkd-tree and EMkd-tree are good for one-shot queries
Thanks!
Questions
Intuition on Authenticating Aggregation Query
Query q
• Naïve solution: answer it as a range selection query; linear authentication cost k (k tuples in the range)!
• Find the canonical cover: authentication cost log k!
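To make the canonical-cover intuition concrete, here is a small Python sketch over leaf positions of a complete binary tree. It assumes that every tree node stores the aggregate of its subtree (an assumption of this example, not something stated on the slide), so a range SUM only needs the few nodes that tile the range exactly.

```python
def canonical_cover(lo, hi, node_lo, node_hi):
    """Maximal tree nodes [a, b) that exactly tile the leaf range [lo, hi)."""
    if hi <= node_lo or node_hi <= lo:
        return []                          # node lies outside the query
    if lo <= node_lo and node_hi <= hi:
        return [(node_lo, node_hi)]        # node lies fully inside: take it whole
    mid = (node_lo + node_hi) // 2
    return (canonical_cover(lo, hi, node_lo, mid)
            + canonical_cover(lo, hi, mid, node_hi))

values = [2, 2, 5, 9, 4, 8, 7, 7]          # eight leaves, positions 0..7
cover = canonical_cover(1, 7, 0, 8)        # query covers positions 1..6 (k = 6)
total = sum(sum(values[a:b]) for a, b in cover)
```

Here six tuples are covered by four nodes; in general the cover has a logarithmic number of nodes, which is where the log k authentication cost comes from.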
Public key digital signature schemes
[Figure: the sender runs KeyGen to obtain (SK, PK), computes σ = Sign(m, SK) and sends (m, σ) over an insecure channel; the recipient checks whether Ver(m, PK, σ) is valid.]
Merkle Tree: Verifying A Single Value
SELECT Airline FROM Flights WHERE price = $600
[Figure: Merkle tree over tuples t1 … t4 with prices $250, $320, $410, $600; search keys 320, 410, 600; hashes h3, h4, h12, h34 and root hroot.]
Query result: t4
Verification object: h3, h12 (sibling hash values along the query path)
Check: Ver(hroot, σ, PK) = valid?
Applying the Merkle tree to database authentication: [DGMS03] P. Devanbu, M. Gertz, C. Martel, and S. G. Stubblebine. Journal of Computer Security, 2003
Reduce S/C communication Cost [MNT04]
Aggregation signature: Condensed RSA
[Figure: messages m1 … mk with individual signatures σ1 … σk are replaced by m1 … mk with a single signature σ = combine(σ1, …, σk).]
Overhead: computation cost of modular multiplication with a large modulus, close to 100 μs
E. Mykletun, M. Narasimha, and G. Tsudik. NDSS'04
Condensed RSA [MNT04]
KeyGen:
• Choose two large primes $p$ and $q$, $p \neq q$
• Set $n = pq$
• Compute $\phi(n) = (p-1)(q-1)$
• Choose $e$ s.t. $1 < e < \phi(n)$ and $e$ is coprime to $\phi(n)$
• Compute $d$ s.t. $de \equiv 1 \pmod{\phi(n)}$
$(d, n)$ is the secret key and $(e, n)$ is the public key
Condensed RSA [MNT04]
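Sign:
• Given $m_i$, compute $h_i = H(m_i)$
• Compute $\sigma_i = h_i^{d} \bmod n$
• Compute $\sigma = \prod_{i=1}^{k} \sigma_i \bmod n$
Verify:
• Given $m_i$, compute $h_i = H(m_i)$
• Check that $\sigma^{e} \equiv \prod_{i=1}^{k} h_i \pmod{n}$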
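The KeyGen, Sign and Verify steps of Condensed RSA can be traced with a toy Python example. Everything about the parameters is an assumption chosen for readability: the two primes are the small Mersenne primes 2^31-1 and 2^61-1 (far too small for real use), e = 65537, and H is SHA-256 reduced modulo n. It needs Python 3.8+ for the modular inverse via `pow`.

```python
import hashlib
from math import gcd

# KeyGen with toy parameters (illustrative only, NOT secure).
p, q = 2**31 - 1, 2**61 - 1
n = p * q
phi = (p - 1) * (q - 1)
e = 65537
assert gcd(e, phi) == 1
d = pow(e, -1, phi)                 # d * e = 1 (mod phi(n))

def h(message: bytes) -> int:
    """h_i = H(m_i), mapped into Z_n."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(message: bytes) -> int:
    return pow(h(message), d, n)    # sigma_i = h_i^d mod n

def condense(signatures) -> int:
    sigma = 1
    for s in signatures:
        sigma = (sigma * s) % n     # sigma = product of sigma_i mod n
    return sigma

def verify(messages, sigma: int) -> bool:
    product = 1
    for m in messages:
        product = (product * h(m)) % n
    return pow(sigma, e, n) == product   # sigma^e = product of h_i (mod n)

msgs = [b"r6", b"r9"]
sigma = condense(sign(m) for m in msgs)
```

The server ships one condensed value instead of k signatures, and the client pays one modular exponentiation plus k modular multiplications to check them all.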
DMkd-tree vs. EMkd-tree
cost metrics
DMkd-tree EMkd-tree
Space
Update cost
Signing operations
One-shot query cost
One-shot VO size
b
NN log
Nlog
b
1
kbb
nn loglog
bb
nn loglog
N
Nb
Nloglog
b
1
kbn
bn
Tool 1: Collision-Resistant Hash Functions
• Example: SHA1, variable input size → 20 bytes (can also plug in any newer replacement)
• Observations:
  - Computation cost: 2-3 μs (for up to 500 bytes of input)
  - Storage cost: 20 bytes
[Figure: inputs x1 and x2 are each hashed by H; it is hard to find a collision.]
Tool 2: Public Key Digital Signature Schemes
• Formally defined by [GMR88]
  - The message has not been changed in any way
  - The message is indeed from the sender (corresponding to the public key)
  - No one except the secret key owner could produce a signature
• One such scheme: RSA [RSA78]
• Observations:
  - Computation cost: about 3-4 ms for signing and more than 100 μs for verifying
  - Storage cost: 128 bytes
[GMR88] S. Goldwasser, S. Micali, R. Rivest. SIAM Journal on Computing, 1988.
[RSA78] R. Rivest, A. Shamir, L. Adleman. Commun. ACM, 1978.
Problem 3: Omission
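Query: Select * from T where 5<A<11
Table T before the update:
A B
r1 …
… …
ri-1 4
ri 7
ri+1 9
ri+2 11
Table T after the update, which inserts a record with value 6:
A B
r1 …
… …
ri-1 4
ri 6
ri 7
ri+1 9
ri+2 11
The server returns 7, 9 to the client: it answers from the old table, so the newly inserted value 6 is omitted.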
Roadmap
• Solution overview
• Efficient authentication of sliding window queries when the window slides
• Efficient authentication of one-shot queries (also the sliding window query initialization)
• Experiments
• Conclusion
Query cost per sliding period, b=1,000: fixed query selectivity as 0.1
Query up to 2/b+2 boundary trees
VO size per sliding period, b=1,000: fixed query selectivity as 0.1
TM-tree incurs roughly (2/b+2)b false positives