Mário S. Alvim Ph.D. Thesis Defense École Polytechnique – LIX Supervised by Catuscia Palamidessi


12-Oct-2011

Formal approaches to information hiding: an analysis of interactive systems, statistical disclosure control, and refinement of specifications

Part I

Introduction


Information hiding

In many cases the broad and efficient dissemination of information is desirable.

But in several situations it is undesirable, or even unacceptable, that part of the information be leaked.

Information hiding deals with the problem of keeping secret part of the information processed by a computational system.


Subfields of information hiding

Subfields of information hiding vary depending on:
• What one wants to keep secret: An individual’s identity? A message’s contents? The link between an individual and an action?
• From which adversary or attacker: An external entity? A user of the system?
• How powerful the adversary is: Can he only observe the system? Can he interact with the system?

The subfields are not mutually exclusive. We observe an increasing convergence in the research.


Our focus

• Information flow: protecting the secret information w.r.t. what can be deduced from the observable behavior of the system. Ex: Election system
• Statistical disclosure control: protecting individual information within a statistical sample.

[Figure: election system. Secrets: the individual votes (Alice -> X, Cindy -> Y, Bob -> X). Observables: the tally (X=2, Y=1), plus the labels "Time" and "Heating".]
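The election example can be made concrete with a small sketch that enumerates which secrets are compatible with each observable. It assumes exactly the three voters and two candidates named on the slide and takes the tally as the only observable; the names `VOTERS`, `CANDIDATES`, `tally` and `consistent` are illustrative, not from the thesis.

```python
from collections import defaultdict
from itertools import product

VOTERS = ("Alice", "Bob", "Cindy")
CANDIDATES = ("X", "Y")

def tally(votes):
    """Observable: number of votes per candidate, in CANDIDATES order."""
    return tuple(votes.count(c) for c in CANDIDATES)

# Group every possible secret (one vote per voter) by the observable it produces.
consistent = defaultdict(list)
for votes in product(CANDIDATES, repeat=len(VOTERS)):
    consistent[tally(votes)].append(dict(zip(VOTERS, votes)))

for observable, secrets in sorted(consistent.items()):
    print(observable, len(secrets))
```

Under these assumptions the tally X=2, Y=1 is compatible with three different vote assignments, so the adversary cannot pin down who voted for Y, whereas a unanimous tally is compatible with a single assignment and reveals every vote.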


The qualitative approach

• The principle of confusion: “For every observable output generated by a secret input value, there is another secret value that could also have generated the same output.”
  - By observing the system’s behavior, the adversary cannot be sure of what the secret is.
  - Does not take into consideration the adversary’s level of (un)certainty about the secret.
• Noninterference: the secrets do not alter the observable behavior of the system.
  - Unachievable in practice.

[Figure: Partitioning. The secrets $a_1, a_2, a_3, a_4, \ldots$ are grouped by the observable they produce ($a_1 a_2$ under $o_1$, $a_3 a_4$ under $o_2$). A second panel sets two systems $P$ and $P'$, each with observables $o_1, o_2, o_3$, side by side with a question mark.]


The quantitative approach

• Takes into consideration the level of (un)certainty of the adversary.
• Allows us to compare two systems w.r.t. the level of security they provide.
• Makes use of probabilities.

Main approaches:
• Bayes risk
• Information theory

[Callout: our focus in this thesis.]


Plan of the presentation

Part II Information theory as a framework for information leakage

Part III Information flow in interactive systems

Part IV Differential privacy: the trade-off between privacy and utility

Part V Safe equivalences for security properties

Part VI Conclusion


Part II

Information theory as a framework for information leakage


Information theory and communication

Information theory originally focused on how to transmit information through unreliable (or noisy) channels.

It allows us to reason about:
• the degree of uncertainty of a random variable;
• the amount of information one random variable carries about another random variable.


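Noisy channels

• $\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$ is a finite input alphabet
• $\mathcal{B} = \{b_1, b_2, \ldots, b_m\}$ is a finite output alphabet
• $p(b_j \mid a_i)$ is the probability of output $b_j$ given input $a_i$
• The channel matrix is the $n \times m$ matrix whose entry $(i, j)$ is $p(b_j \mid a_i)$

[Figure: a noisy channel maps the inputs $a_1, a_2, \ldots, a_n$ to the outputs $b_1, b_2, \ldots, b_m$. In the security reading, the inputs are the secrets, the outputs are the observables, and the channel matrix is the system’s behavior.]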
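A channel matrix in the sense above is a row-stochastic table: each row is a secret and holds a probability distribution over the observables. The sketch below uses a made-up 2x3 channel and a uniform prior, both illustrative assumptions, and exact fractions so the checks are not affected by rounding.

```python
from fractions import Fraction as F

# Hypothetical channel: rows are secrets a_1, a_2; columns are observables b_1, b_2, b_3.
channel = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 4), F(1, 4), F(1, 2)],
]
prior = [F(1, 2), F(1, 2)]  # assumed a priori distribution on the secrets

def is_channel_matrix(matrix):
    """Every row must be a probability distribution over the observables."""
    return all(all(p >= 0 for p in row) and sum(row) == 1 for row in matrix)

def output_distribution(prior, matrix):
    """p(b_j) = sum_i p(a_i) * p(b_j | a_i)."""
    return [
        sum(prior[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(len(matrix[0]))
    ]

print(is_channel_matrix(channel))
print(output_distribution(prior, channel))
```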


Information leakage

General principle:

$\text{Leakage} = \text{Initial uncertainty} - \text{Remaining uncertainty}$

The uncertainty can be measured in different ways, corresponding to different models of attack.

Models of guessing attacks (Köpf and Basin):
• The adversary wants to determine the value of a random variable $A$.
• He can ask (adaptively) several yes/no questions to an oracle. A subsequent question may depend on the answer to a previous question.
• The attacker knows the a priori distribution of $A$.
• Different measures of uncertainty correspond to different models of attack.


Shannon entropy

Leakage as mutual information:

$I(A;B) = H(A) - H(A \mid B)$

where $I(A;B)$ is the leakage, $H(A)$ the initial uncertainty and $H(A \mid B)$ the remaining uncertainty.

Meaning in security:
• The adversary can ask questions of the type “Does $A$ belong to a given set?”
• $H(A)$ is the lower bound to the expected number of questions necessary to determine the value of $A$.
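As a minimal sketch of leakage as mutual information, the function below computes $H(A) - H(A \mid B)$ in bits from a prior on the secrets and a channel matrix given as a list of rows. The function names and the two test channels are illustrative assumptions.

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * log2(p) for p in dist if p > 0)

def shannon_leakage(prior, channel):
    """I(A;B) = H(A) - H(A|B) for prior p(a_i) and channel[i][j] = p(b_j | a_i)."""
    joint = [[pa * pba for pba in row] for pa, row in zip(prior, channel)]
    p_out = [sum(col) for col in zip(*joint)]
    # H(A|B) = sum_j p(b_j) * H(A | B = b_j)
    remaining = sum(
        pb * entropy([joint[i][j] / pb for i in range(len(prior))])
        for j, pb in enumerate(p_out)
        if pb > 0
    )
    return entropy(prior) - remaining

# A channel that reveals the secret leaks everything; a constant channel leaks nothing.
print(shannon_leakage([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))
print(shannon_leakage([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]]))
```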


Rényi min-entropy

Leakage as min-entropy leakage:

$I_\infty(A;B) = H_\infty(A) - H_\infty(A \mid B)$

where $I_\infty(A;B)$ is the leakage, $H_\infty(A)$ the initial uncertainty and $H_\infty(A \mid B)$ the remaining uncertainty (Smith).

Meaning in security:
• One-try attack: “Is $A = a$?”
• Closely related to the Bayes risk.
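The min-entropy quantities can be sketched in the same style, assuming Smith's definition of the remaining uncertainty: $H_\infty(A \mid B)$ is minus the logarithm of the probability that the adversary guesses the secret in one try after seeing the observable. The function names are illustrative.

```python
from math import log2

def min_entropy(prior):
    """H_inf(A) = -log2 of the probability of the adversary's best one-try guess."""
    return -log2(max(prior))

def conditional_min_entropy(prior, channel):
    """H_inf(A|B) = -log2 sum_j max_i p(a_i) * p(b_j | a_i)  (Smith's definition)."""
    success = sum(
        max(pa * row[j] for pa, row in zip(prior, channel))
        for j in range(len(channel[0]))
    )
    return -log2(success)

def min_entropy_leakage(prior, channel):
    return min_entropy(prior) - conditional_min_entropy(prior, channel)

uniform = [0.25] * 4
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
constant = [[0.5, 0.5] for _ in range(4)]
print(min_entropy_leakage(uniform, identity))
print(min_entropy_leakage(uniform, constant))
```

The quantity called `success` inside `conditional_min_entropy` is one minus the Bayes risk, which is the sense in which the two notions are closely related.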


Part III

Information flow in interactive systems


The problem of interactivity

So far the information-theoretic approach has been applied only to systems where secrets do not depend on observables.

In interactive systems secrets and observables can interleave and influence each other: auction protocols, web applications, command line programs, etc.

In such systems the classic information-theoretic approach fails.


The problem of interactivity: an example

Web-based application:
• A seller can offer a cheap or an expensive product (observables)
• Two possible buyers: rich or poor (secrets)

[Figure: probabilistic tree. The seller offers cheap or expensive with probability 0.5 each. After cheap, the buyer is poor with probability s and rich with probability s'. After expensive, the buyer is poor with probability t and rich with probability t'.]

Channel matrix: ?

For s=0.4, t=0.6:

        chp.    exp.
poor    0.4     0.6
rich    0.6     0.4

For s=0.1, t=0.3:

        chp.    exp.
poor    0.25    0.75
rich    0.56    0.44

Channel matrix is not invariant w.r.t. input distribution.

Capacity can no longer be calculated.
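The two matrices can be reproduced by conditioning the joint behavior of the tree on the buyer. The sketch below assumes the reading of the figure in which the seller moves first (cheap or expensive with probability 0.5 each) and the buyer is then poor with probability s after a cheap offer and with probability t after an expensive one; the function name `buyer_channel` is illustrative.

```python
def buyer_channel(s, t, p_cheap=0.5):
    """Return {buyer: (p(cheap | buyer), p(expensive | buyer))}."""
    p_expensive = 1 - p_cheap
    joint = {
        ("poor", "cheap"): p_cheap * s,
        ("rich", "cheap"): p_cheap * (1 - s),
        ("poor", "expensive"): p_expensive * t,
        ("rich", "expensive"): p_expensive * (1 - t),
    }
    matrix = {}
    for buyer in ("poor", "rich"):
        total = joint[(buyer, "cheap")] + joint[(buyer, "expensive")]
        matrix[buyer] = (
            joint[(buyer, "cheap")] / total,
            joint[(buyer, "expensive")] / total,
        )
    return matrix

print(buyer_channel(0.4, 0.6))
print(buyer_channel(0.1, 0.3))
```

Changing s and t changes how likely each buyer is, that is, the distribution on the secrets, and the conditional matrix changes with it. A classical channel matrix would stay fixed under such a change.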


Our contribution

• Extend the classic information-theoretic approach to interactive systems:
  - Modelling systems as Interactive Information-Hiding Systems (IIHSs);
  - Using channels with memory and feedback;
  - Re-interpreting the leakage in this more general scenario, finding a more adequate definition of leakage.
• Show that the capacity of the channels associated to IIHSs is a continuous function with respect to the Kantorovich metric.


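Some necessary technicalities

• $\mathcal{A}$ is a set of symbols
• In a sequence of symbols, $\alpha_t$ represents the symbol at time $t$, and $\alpha^t = \alpha_1 \alpha_2 \ldots \alpha_t$ is the sequence up to time $t$
• Example: In we have and
• $p(\alpha^t, \beta^t)$ contains all the information about the joint behavior of the sequences of inputs and outputs up to time $t$. By probability laws:

$p(\alpha^t, \beta^t) = \prod_{i=1}^{t} \underbrace{p(\alpha_i \mid \alpha^{i-1}, \beta^{i-1})}_{\text{feedback}} \, \underbrace{p(\beta_i \mid \alpha^{i}, \beta^{i-1})}_{\text{memory}}$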


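Channels with memory and feedback

[Figure: channel with memory and feedback. Labels: code-functions $\varphi_t$, "interactor", stochastic kernels, delay; signals $\alpha_t$, $\beta_t$, $\beta_{t-1}$.]

Mutual information can be split into its components:
• directed information from input to output, $I(A^T \to B^T)$
• directed information from output to input, $I(B^T \to A^T)$

It can be shown that $I(A^T; B^T) = I(A^T \to B^T) + I(B^T \to A^T)$.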
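For reference, the decomposition can be written out under the usual definitions of directed information (Massey's, with a one-step delay on the feedback direction, as is common for channels with feedback). The slide text does not preserve its own definitions, so the block below is an assumption about the intended notation, with $A^t = A_1 \ldots A_t$ and $B^t = B_1 \ldots B_t$.

```latex
\begin{align*}
I(A^T \to B^T) &= \sum_{t=1}^{T} I(A^t ; B_t \mid B^{t-1}) \\
I(B^T \to A^T) &= \sum_{t=1}^{T} I(A_t ; B^{t-1} \mid A^{t-1}) \\
I(A^T ; B^T)   &= I(A^T \to B^T) + I(B^T \to A^T)
\end{align*}
```

With these definitions the last line follows from expanding $H(A^T)$, $H(B^T)$ and $H(A^T, B^T)$ with the chain rule and pairing the terms time step by time step.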


Modelling IIHS’s as channels with memory and feedback

Theorem: Given a fully probabilistic IIHS, it is always possible to construct a joint prob. dist. s.t. it always holds ():

And a corollary shows how to construct .

[Figure: the channel diagram (code-functions $\varphi_t$, "interactor", stochastic kernels, delay; signals $\alpha_t$, $\beta_t$, $\beta_{t-1}$) with the annotations "Comes directly from the IIHS", "Combine altogether in a new joint probability distribution", "Behavior of the IIHS", "Behavior of the channel" and "Deterministic: how to embed into it?".]


Leakage

In the classical information-theoretic approach, the leakage is the a priori uncertainty of the input distribution minus the a posteriori uncertainty:

$I(A;B) = H(A) - H(A \mid B)$

In channels with memory and feedback, the leakage is the a priori uncertainty of the “reactor” minus the a posteriori uncertainty.

The worst-case leakage is the capacity of the channel: the maximum leakage over the set of all possible input distributions.

[Figure: chart "3 examples of Info. Leakage" for Ex. A, Ex. B and Ex. C, with values between 0 and 1, showing $I(A^T \to B^T)$ and $I(B^T \to A^T)$.]


Part IV

Differential privacy: the trade-off between privacy and utility


Statistical databases


A statistical database is a collection of data of several participants.

Users of the database can ask statistical queries, such as: Average height, maximum salary, most common disease.

Usually we consider the global information relative to the database as public, while the individual information about a participant is private.


An example

A statistical database contains the salary of several employees.

A user has some side information (previous knowledge: newspapers, common sense, previous queries, etc.):
• There are 100 people in the database (counting query)
• The average salary is 3.000 € (average query)

Then Robert is included in the database.
• The user repeats the queries and finds out that the average salary is now 3.050 €.
• And she can conclude that Robert earns 8.050 €: privacy breach!
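The inference in this example is plain arithmetic on the two exact averages: the sum of salaries after Robert joins minus the sum before. A minimal sketch, with the function name `infer_new_salary` as an illustrative assumption:

```python
def infer_new_salary(count_before, average_before, average_after):
    """Salary of the single person added between two exact average queries."""
    total_before = count_before * average_before
    total_after = (count_before + 1) * average_after
    return total_after - total_before

# 100 people at 3000 on average, then 101 people at 3050 on average.
print(infer_new_salary(100, 3000, 3050))
```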


General problem


How to ensure that the queries provide statistical information about the whole sample without harming the privacy of the participants?

Usually it is done by adding randomization: instead of reporting the real answer for the query, a noisy answer is reported to the user. The noise is carefully added to obfuscate the link between the values of participants in the database and the reported answer to the query.

Yet the noise should avoid reporting answers that are “too far away” from the real answers.
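One well-known way to add such noise is the Laplace mechanism, shown here only as an illustration; the slides do not say which mechanism is meant, and the thesis works in a discrete setting. The sketch assumes a counting query, whose answer changes by at most 1 between adjacent databases, and uses hypothetical function names.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng):
    """Report the count plus Laplace noise of scale 1/epsilon (sensitivity 1 assumed)."""
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(0)  # fixed seed so the run is repeatable
answers = [noisy_count(100, 1.0, rng) for _ in range(5)]
print(answers)
```

A smaller epsilon gives a larger noise scale, so individual answers hide more but drift further from the real count, which is the tension the slide describes.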


A model of utility and privacy

Participants: Values: Universe of databases: Randomized function:

where

Absence is included as a special symbol, e.g. null.

[Figure: the $\epsilon$-d.p. randomized function $K$ viewed as a channel from the dataset $X$ to the reported answer $Z$. Further labels: $x_1 \sim x_n$, "ratio".]


Differential Privacy

Differential privacy [Dwork]: the effect of the presence of any individual in a database will be negligible, even when an adversary has auxiliary knowledge.
• We can also consider presence/absence of any individual, or his value.
• It is a strong statistical guarantee.

Formally (discrete case):
• Two databases $x$ and $x'$ differing on the presence/value of at most one row are called neighbors or adjacent. We write $x \sim x'$.
• A function $K$ provides $\epsilon$-differential privacy if, for every $x \sim x'$, and for all possible answers $z$ to the query:

$Pr[K(x) = z] \le e^{\epsilon} \cdot Pr[K(x') = z]$
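In the discrete case the definition can be checked mechanically: for every pair of adjacent databases and every answer, the two probabilities must be within a factor $e^\epsilon$ of each other. The sketch below assumes the mechanism is given as a table of output distributions; the function name and the randomized-response example are illustrative assumptions.

```python
from math import exp, log

def satisfies_dp(mechanism, neighbors, epsilon, tol=1e-12):
    """mechanism[x][z] = Pr[K(x) = z]; neighbors is a list of adjacent pairs (x, x')."""
    bound = exp(epsilon)
    for x, x_prime in neighbors:
        for p, q in zip(mechanism[x], mechanism[x_prime]):
            # The inequality must hold in both directions, since adjacency is symmetric.
            if p > bound * q + tol or q > bound * p + tol:
                return False
    return True

# Hypothetical randomized response on one bit: report the true bit with probability 3/4.
K = {0: [0.75, 0.25], 1: [0.25, 0.75]}
print(satisfies_dp(K, [(0, 1)], log(3)))
print(satisfies_dp(K, [(0, 1)], 1.0))
```

The small tolerance `tol` only absorbs floating-point rounding when the ratio sits exactly on the bound, as it does here for $\epsilon = \ln 3$.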


A model of utility and privacy

Oblivious mechanisms: the reported answer depends only on the real answer, and not on the database.

[Figure: the $\epsilon$-diff. priv. randomized function $K$ decomposed as dataset $X$ -> query $f$ -> real answer $Y$ -> randomization mechanism $H$ -> reported answer $Z$. Leakage relates $X$ and $Z$; utility relates $Y$ and $Z$.]


Our contribution

(1) Does $\epsilon$-d.p. induce a bound on the information leakage of the randomized function $K$?

(2) Does $\epsilon$-d.p. induce a bound on the information leakage relative to an individual? (In the worst-case scenario where the attacker knows the values of all other participants.)

(3) Does $\epsilon$-d.p. induce a bound on the utility?

(4) Given a query $f$ and a value $\epsilon$, can we construct a randomized function $K$ satisfying $\epsilon$-d.p. and also presenting maximum utility?


The adopted measures of utility and leakage


Leakage is modeled as min-entropy leakage:

Utility is modeled with gain functions:

Binary gain function: if and otherwise. In the binary case is the Bayes risk.


Methodology

The adjacency relation on the database domain induces a graph .

The relation can be extended to the real answers domain : if and then is also a graph.

We consider two special types of graphs:
• Distance-regular
• VT+ (vertex transitive +)


Some theorems

Given a channel, we perform transformations which:
• Are valid for the uniform input distribution;
• Preserve the a posteriori min-entropy;
• Provide $\epsilon$-d.p.

This allows us to find very regular matrices, and therefore a bound.

[Figure labels: "any graph", "VT+", "dist-regular". Callout: "Corresponds to the maximum value of ."]


The proof technique

The previous theorems can be applied to any channel.
• Leakage: we apply the theorems to the channel from $X$ to $Z$
• Utility: we apply the theorems to the channel from $Y$ to $Z$

[Figure: dataset $X$ -> query $f$ -> real answer $Y$ -> randomization mechanism $H$ -> reported answer $Z$, the whole being the $\epsilon$-diff. priv. randomized function. Leakage relates $X$ and $Z$; utility relates $Y$ and $Z$.]


The bounds

Leakage: we apply the theorems to the channel from databases to reported answers.
• Proposition: the graph induced on the database domain is both distance-regular and VT+.

Utility: we apply the theorems to the channel from real answers to reported answers, when the graph induced on the real answers domain is distance-regular or VT+.


Our contribution

(1) Does $\epsilon$-d.p. induce a bound on the information leakage of the randomized function $K$?

Yes:

(2) Does $\epsilon$-d.p. induce a bound on the information leakage relative to an individual?

Yes:

It works in every case, as the graph induced on the database domain is always dist.-reg. and VT+.


Our contribution

(3) Does $\epsilon$-d.p. induce a bound on the utility?

Yes:

Only when the graph induced on the real answers domain is also dist.-reg. or VT+.

(4) Given a query $f$ and a value $\epsilon$, can we construct a randomized function $K$ satisfying $\epsilon$-d.p. and also presenting maximum utility?

Yes: , where

Only when the graph induced on the real answers domain is also dist.-reg. or VT+.


An example

• A database with tuples: voter id, voter city, candidate
• There are 6 cities: A, B, C, D, E, F
• Query: Which city had the most votes for a given candidate?
• Clearly the gain is binary
• The graph induced on the real answers is a clique

Optimal mechanism:

[Table: the mechanism as a matrix with rows $Y$ and columns $Z$, both indexed by the cities A to F.]
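For a clique, every two distinct real answers are adjacent, so one natural candidate mechanism reports the real answer with some probability and spreads the rest uniformly over the other answers, with the two probabilities differing by a factor of exactly $e^\epsilon$. The sketch below builds that matrix. The function name `clique_mechanism` and the value $\epsilon = 1$ are illustrative assumptions, and the code is not claimed to reproduce the matrix shown on the slide.

```python
from math import exp

def clique_mechanism(n, epsilon):
    """n x n matrix H with H[y][z] = Pr[Z = z | Y = y] on a clique of n answers."""
    e = exp(epsilon)
    diagonal = e / (n - 1 + e)      # probability of reporting the real answer
    off_diagonal = 1 / (n - 1 + e)  # probability of each other answer
    return [
        [diagonal if y == z else off_diagonal for z in range(n)]
        for y in range(n)
    ]

H = clique_mechanism(6, 1.0)  # six cities A..F
for row in H:
    print([round(p, 3) for p in row])
```

Under a uniform prior and a binary gain, the expected gain of guessing the reported answer is the diagonal entry, so raising $\epsilon$ trades privacy for utility in a single parameter.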


Part V

Safe equivalences for security properties


Equivalences in security

Equivalence relations are often used to formalize information hiding properties.

Examples: A system guarantees anonymity for users and if:

(trace equivalence)

Votes of users and for candidates and are confidential in a system if:

(bisimulation)


The role of nondeterminism

In the presence of nondeterminism, there is a (dangerous) implicit assumption: all the nondeterministic possibilities of the specification will be possible under every implementation of the specification (or at least that the adversary will believe so).

Nondeterminism can have different natures:
• Nondeterminism by design: preserved under refinement;
• Underspecification: not necessarily preserved under refinement.


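Nondeterminism by design:

[Figure: a mix, $Mix$, takes $s_1$ from user $U_1$ and $s_2$ from user $U_2$ and may output $s_1, s_2$ or $s_2, s_1$. Two instances are shown: $Mix[a,b/s_1,s_2]$, where $U_1$ sends $a$ and $U_2$ sends $b$, and $Mix[b,a/s_1,s_2]$, where $U_1$ sends $b$ and $U_2$ sends $a$. Both may output $a, b$ or $b, a$. Callout: "Should be preserved in the implementation".]

$Mix$ is secure.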


Underspecification:

[Figure: $BitTransfer[t/sec]$ and $BitTransfer[h/sec]$. Labels: User, $sec$, $\tau$, $t$, $h$, $R_t$, $R_h$, $C$. Callout: "May be eliminated in the implementation".]

But $BitTransfer$ is not secure.


Motivation

Two types of nondeterminism:
• Angelic: inherent to the system, like in $Mix$. The scheduler has freedom to help the system.
• Demonic: underspecification, like in $BitTransfer$. The design should guarantee that even in the worst case choice (by the scheduler), the security is still preserved.

Problem: in the equivalence approach the nondeterminism is considered only as angelic.


Contribution

A formalism to handle both angelic and demonic nondeterminism.

Notions of safe equivalences: safe trace-equivalence and safe-bisimulation.

We show that these notions of safe equivalences imply “no leakage”.


Admissible schedulers

• Global schedulers: global nondeterminism (implementation freedom)
  - Communication, interleaving
  - Cannot see the internal choices of the components
• Local schedulers: local nondeterminism (inherent to the system)
  - Randomness, noise
  - One for each component
  - Cannot see internal choices of the other components.


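Safe bisimulation

Safe bisimulation such that, whenever , then for all admissible global schedulers :

[Figure: under a scheduler $\zeta$, state $q$ moves to $q_1, q_2, q_3$ with labels $a_1, a_2, a_3$, and state $q'$ moves to $q_1', q_2', q_3'$ with the same labels.]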


Safe trace-equivalence

Safe trace-equivalence such that, whenever :

is but not

Theorem: safe-bisimulation implies safe trace-equivalence

[Figure: states $q$ and $q'$, each with traces $t_1, t_2, t_3$.]


Safe nondeterministic information hiding

Definition: A system is leakage-free if for all observables and secrets we have

• Example (binary secret):

[Figure: systems $P$ and $P'$, built from a User and a $Mix$ over the binary secret $sec$ / $\neg sec$.]

• is but not
• Now is also


Safe nondeterministic information hiding

Definition: A system is leakage-free if for all observables and secrets we have

• Theorem: If then is leakage-free.

• Corollary: If then is leakage-free.


Part VI

Conclusion


List of publications

Interactive systems:
• Quantitative Information Flow in Interactive Systems – Journal of Computer Security (to appear). Mário S. Alvim, Miguel E. Andrés, Catuscia Palamidessi
• Information Flow in Interactive Systems – CONCUR 2010. Mário S. Alvim, Miguel E. Andrés, Catuscia Palamidessi

Differential privacy:
• On the relation between Differential Privacy and Quantitative Information Flow – ICALP 2011. Mário S. Alvim, Miguel E. Andrés, Konstantinos Chatzikokolakis, Catuscia Palamidessi
• Differential Privacy: on the trade-off between Utility and Information Leakage – FAST 2011. Mário S. Alvim, Miguel E. Andrés, Konstantinos Chatzikokolakis, Pierpaolo Degano, Catuscia Palamidessi

Safe equivalences:
• Safe Equivalences for Security Properties – IFIP-TCS 2010. Mário S. Alvim, Miguel E. Andrés, Peter van Rossum, Catuscia Palamidessi

Others:
• Probabilistic Information Flow – LICS 2010. Mário S. Alvim, Miguel E. Andrés, Catuscia Palamidessi
• Quantitative Information Flow and Applications to Differential Privacy – FOSAD 2011. Mário S. Alvim, Miguel E. Andrés, Konstantinos Chatzikokolakis, Catuscia Palamidessi


Acknowledgments

The only people with whom you should try to get even are those who have helped you.

John E. Southard


Thank you

Questions?


Appendix I

Introduction


Philosophical problems

• Compromise between freedom and control. Anonymity: political activist vs. criminal
• But it is always helpful to measure the leakage.

The quantification of information leakage considers:
• The definition of protection;
• To which extent the information is protected;
• From whom it is protected.


Appendix II

Information theory as a framework for information leakage


Appendix III

Information flow in interactive systems


An example: the cocaine auction protocol [Stajano’99]

Several mob members and one drug dealer around a table

Rounds of biddings. At each round:
• the seller announces the bid price for that round;
• buyers have a fixed number of seconds to make an offer;
• when one buyer anonymously says yes, he becomes the winner of that round and a new round begins;
• if nobody says anything within that time, the round is concluded by timeout and the auction is won by the winner of the previous round.

The biddings are observable. The identity of the bidders should be secret.


Interactive information hiding systems

IIHS’s are a variant of probabilistic automata in which we indicate explicitly that each action is secret or observable

An example of the Cocaine Auction Protocol:
• Two mobsters: Candlemaker and Scarface
• Biddings increase by 1k euros or 2k euros.

[Figure: the IIHS automaton for the example, a tree whose transitions are labelled with bid increases (1k, 2k) and bidder identities (Cdmk, Scrfc).]


Modelling IIHS’s as channels with memory and feedback

Prop: every history determines a unique state ()

[Figure: the example automaton (bid increases 1k, 2k; bidders Cdmk, Scrfc) with two states $p_i$ and $p_j$ highlighted.]


Interactive systems: summary table

IIHS as automaton | IIHS as channel | Notion of leakage
Normalized IIHS with nondeterministic inputs and probabilistic outputs | Sequence of stochastic kernels | Leakage as capacity
Normalized IIHS with a deterministic scheduler solving the nondeterminism | Sequence of stochastic kernels together with a sequence |
Fully probabilistic normalized IIHS | Sequence of stochastic kernels together with a distribution | Leakage as directed information


Appendix IV

Differential privacy: the trade-off between privacy and utility


Dalenius’ desideratum

Dalenius’ desideratum: nothing about an individual should be learnable from the database that could not be learned without access to the database.
• This is, however, unachievable in practice.
• [Dwork’06]: There is always a piece of side information that alone does not leak information, but in combination with the database it does.


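Graph symmetries

Distance-regular graph: There exist integers $b_i$, $c_i$ such that for all vertices $u, v$ at distance $i$ there are exactly:
• $c_i$ neighbors of $v$ at distance $i-1$ from $u$
• $b_i$ neighbors of $v$ at distance $i+1$ from $u$

VT+ graph (Vertex transitive +): There exist $n$ automorphisms $\sigma_0, \ldots, \sigma_{n-1}$, where $n = |V|$, such that, for every vertex $v$, we have $\{\sigma_i(v) \mid 0 \le i \le n-1\} = V$.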


Appendix V

Safe equivalences for security properties


The framework

Components are similar to probabilistic CCS:

Systems are components in parallel:

Semantics:


Safe bisimulation:

is but not

Theorems:
• Safe bisimilarity is an equivalence;
• Safe bisimilarity is a congruence

[Figure: $BitTransfer[t/sec]$ and $BitTransfer[h/sec]$ (labels: User, $\tau$, $t$, $h$, $R_t$, $R_h$, $C$) under a scheduler $\zeta$. Callout: "No longer admissible".]


Safe equivalences: future work

• Extend our framework to the non-zero leakage case
• Model checking techniques to verify information hiding properties in our framework. Challenges:
  - Restricting to partial information schedulers may cause the loss of decidability
  - Unusual quantifications introduced to cope with global (demonic) and local (angelic) schedulers.
