Privacy preserving similarity detection for data analysis

Iraklis Leontiadis (1), Melek Önen (1), Refik Molva (1), M.J. Chorley (2), G.B. Colombo (2)

CSAR 2013

(1) Eurecom, France; (2) Cardiff, UK

Privacy vs Utility

[Figure: user data A1, A2, A3, …, An and B1, B2, B3, …, Bn (labelled "Personality test") feeding clustering and similarity analysis.]

Naïve solutions

• Encrypt data with standard crypto – Renders operations infeasible.

• Data separation – Vertical separation is not always applicable.

• Anonymizing techniques – Do not protect individuals' data.

Our Approach

• Combine crypto with data processing

User    Data                    Data analysis
Alice   $A'_1, \dots, A'_n$     $F(A'_1, \dots, A'_n)$
Bob     $B'_1, \dots, B'_n$     $F(B'_1, \dots, B'_n)$

$F(A_1, \dots, A_n) = F(A'_1, \dots, A'_n)$

[Figure: transformed data A'1, A'2, A'3, …, A'n and B'1, B'2, B'3, …, B'n.]

Outline

• Our solution
– Cosine similarity
– Privacy with Geometrical Transformations

• Security Analysis

• Performance Evaluation
– Hierarchical clustering
– Results

• Looking Ahead

Cosine similarity

F1: “Next CSAR workshop will be held in Karlsruhe”

F2: “Next CSAR workshop will be held in London”

Dictionary: w1, w2, w3, w4, …, wn

A, B: binary vectors of F1 and F2 over the dictionary (entry i is 1 if word wi occurs in the sentence, 0 otherwise)

[Figure: vectors A and B with the angle θ° between them.]
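The two example sentences can be turned into binary vectors and compared in a few lines. This is a minimal sketch: the dictionary order (words in order of first appearance) and the helper names `encode` and `cosine_similarity` are assumptions made for illustration, not taken from the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def encode(sentence, dictionary):
    """Binary vector: 1 if the dictionary word occurs in the sentence, else 0."""
    words = set(sentence.split())
    return [1 if w in words else 0 for w in dictionary]

f1 = "Next CSAR workshop will be held in Karlsruhe"
f2 = "Next CSAR workshop will be held in London"

# Assumed dictionary: the words of both sentences, in order of first appearance.
dictionary = list(dict.fromkeys(f1.split() + f2.split()))

a = encode(f1, dictionary)
b = encode(f2, dictionary)
similarity = cosine_similarity(a, b)
print(a, b, similarity)
```

The sentences share 7 of their 8 words, so the similarity is 7/8 = 0.875, up to floating-point rounding.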

Random Scaling

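• Data encoded as unique vectors in $\mathbb{R}^n$

• $\varphi_r : \mathbb{R}^n \to \mathbb{R}^n$ s.t. $\cos(a, b) = \cos(\varphi_{r_1}(a), \varphi_{r_2}(b))$

• Random scaling
– $r \leftarrow \mathbb{R}^n$
– $S(r, A) = r \cdot A = \begin{pmatrix} r & \cdots & \\ \vdots & r & \vdots \\ & \cdots & r \end{pmatrix} \cdot A$

[Figure: two diagrams, each marked with the angle θ°.]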
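A quick numerical check of the scaling property. This sketch assumes that r acts as a single, strictly positive scalar per vector, as the matrix with r repeated on its diagonal suggests. Under that assumption, $\cos(r_1 a, r_2 b) = \frac{r_1 r_2 \, (a \cdot b)}{r_1 \lVert a \rVert \, r_2 \lVert b \rVert} = \cos(a, b)$. The vectors, the seed, and the range of r are made up for illustration.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def scale(r, v):
    """S(r, v): multiply every coordinate of v by the same scalar r."""
    return [r * x for x in v]

rng = random.Random(2013)  # fixed seed so the run is repeatable
a = [1.0, 2.0, 0.0, 3.0]
b = [2.0, 1.0, 1.0, 0.5]

# Assumption: each user draws an independent, strictly positive scalar.
r1 = rng.uniform(0.5, 10.0)
r2 = rng.uniform(0.5, 10.0)

plain = cosine_similarity(a, b)
scaled = cosine_similarity(scale(r1, a), scale(r2, b))
print(plain, scaled)
```

A negative scalar on only one side would flip the sign of the cosine, which is why the sketch restricts r to positive values.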

Vector Rotation

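• Rotation by a common angle λ°
– $R_{\lambda^\circ}(a) = a \cdot \begin{pmatrix} \cos(\lambda^\circ) & \cdots & \sin(\lambda^\circ) \\ \vdots & & \vdots \\ -\sin(\lambda^\circ) & \cdots & \cos(\lambda^\circ) \end{pmatrix}$

• $\varphi_r(a) = R_{\lambda^\circ}(S_r(a))$

[Figure: vectors F1, F2 and their rotated images F1′, F2′, with the angle θ° between the vectors.]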
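The rotation matrix on the slide has its middle entries elided, so the exact n-dimensional rotation cannot be recovered from it. The sketch below assumes one simple choice: rotate each consecutive pair of coordinates by the same angle, which is a block-diagonal orthogonal matrix. It checks that a common angle preserves the cosine, while different angles for the two vectors generally do not. The vectors and angles are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rotate(v, angle_deg):
    """Rotate each consecutive coordinate pair of v by angle_deg degrees.

    Equivalent to the row-vector product v . R, where R is block-diagonal
    with 2x2 blocks [[cos, sin], [-sin, cos]] (an assumed construction).
    """
    if len(v) % 2 != 0:
        raise ValueError("this sketch assumes an even number of coordinates")
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    out = []
    for i in range(0, len(v), 2):
        x, y = v[i], v[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

a = [1.0, 2.0, 0.0, 3.0]
b = [2.0, 1.0, 1.0, 0.5]
angle = 37.0  # the common angle, shared by all users

plain = cosine_similarity(a, b)
rotated = cosine_similarity(rotate(a, angle), rotate(b, angle))
different = cosine_similarity(rotate(a, angle), rotate(b, angle + 90.0))
print(plain, rotated, different)
```

Here `rotated` matches `plain` up to rounding, while `different` does not, which illustrates why the angle has to be common to all users.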

Our solution

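Dimension reduction

$A \to A_1, A_2, A_3$

Random Scaling

$S(r_1, A_1) = r_1 \cdot A_1$
$S(r_2, A_2) = r_2 \cdot A_2$
$S(r_3, A_3) = r_3 \cdot A_3$

Rotation

$R_{\lambda^\circ}(r_1 \cdot A_1) = R_{\lambda^\circ} \cdot r_1 \cdot A_1$
$R_{\lambda^\circ}(r_2 \cdot A_2) = R_{\lambda^\circ} \cdot r_2 \cdot A_2$
$R_{\lambda^\circ}(r_3 \cdot A_3) = R_{\lambda^\circ} \cdot r_3 \cdot A_3$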

Security analysis

$V'_1 = R_{\lambda^\circ}(S(r_1, d_1, d_2), S(r_2, d_3, d_4), S(r_3, d_1, d_5))$

• Internal:

– Rotation angle is known.

• External:

– Rotation angle remains unknown.

Security analysis cont’d

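Per-user equivalent coefficients are exposed as auxiliary information.

[Figure: the components scaled by r1, r2, r3 next to the rotation matrix $\begin{pmatrix} \cos(\lambda^\circ) & \cdots & \sin(\lambda^\circ) \\ \vdots & & \vdots \\ -\sin(\lambda^\circ) & \cdots & \cos(\lambda^\circ) \end{pmatrix}$, with a "?".]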

Evaluation

• 173 users who agreed to take the 4sqPersonality test

• 5-factor personality test
– Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism.

Clustering approach

• Hierarchical Agglomerative clustering (HAC)
– Input: n points and an n×n similarity matrix
– Output: a single cluster containing all n points

C = MakeSingletonClusters();
for i = 1 to n − 1:
    Find "closest" clusters c1, c2;
    Merge(c1, c2);
    RecomputeDistances(C);
    if #C = 1: exit();

Agglomerative: O(n^3) Divisive: O(2^n)
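The HAC loop above can be written as a short, naive Python function. This sketch assumes single linkage (the slides do not say which linkage was used) and recomputes the linkage from the similarity matrix on every pass instead of maintaining a distance table. The function name and the 4×4 matrix are made up for illustration.

```python
def hac_single_link(sim):
    """Naive agglomerative clustering driven by an n x n similarity matrix.

    Returns the merges in order, each as (cluster_a, cluster_b, similarity),
    where clusters are tuples of point indices.
    """
    clusters = [(i,) for i in range(len(sim))]  # MakeSingletonClusters
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the "closest" pair: highest single-link similarity.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                link = max(sim[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        link, i, j = best
        merged = tuple(sorted(clusters[i] + clusters[j]))
        merges.append((clusters[i], clusters[j], link))
        # Merge: drop the two old clusters and add the combined one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Made-up similarity matrix with two obvious groups: {0, 1} and {2, 3}.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
merges = hac_single_link(sim)
for step in merges:
    print(step)
```

Each pass inspects at most n² point pairs and there are n − 1 merges, which gives the O(n^3) bound quoted for the agglomerative approach.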

Cosine Similarity

Presentation notes:

Single-linkage vs complete-linkage:
– In single-link (single-linkage) clustering, the similarity of two clusters is the similarity of their most similar members.
– In complete-link (complete-linkage) clustering, the similarity of two clusters is the similarity of their most dissimilar members.

Agglomerative vs divisive: O(n^3) vs O(2^n)

Why not K-means?
– K-means is extremely sensitive to cluster center initialization.
– The value of k is difficult to predict.
– Hierarchical clustering can give different partitionings depending on the level of resolution we are looking at.
– Flat clustering needs the number of clusters to be specified.
– Hierarchical clustering does not need the number of clusters to be specified.
– There is no clear consensus on which of the two produces better clustering.

Results

Presentation notes:

Equivalent clusters between encrypted and unencrypted data.
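One way to see why the clusters can coincide: if the pairwise cosine similarity matrix is the same before and after the transformation, any clustering driven only by that matrix sees the same input. The sketch below checks this on made-up data. It is not the study's data set, and it assumes a positive scalar per user plus a pairwise-coordinate rotation by a common angle. The users have 6 hypothetical scores each so that coordinates pair up evenly.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def transform(v, r, angle_deg):
    """Scale v by the scalar r, then rotate consecutive coordinate pairs."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    out = []
    for i in range(0, len(v), 2):
        x, y = r * v[i], r * v[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def similarity_matrix(vectors):
    """All pairwise cosine similarities."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

rng = random.Random(7)
# Hypothetical data: 5 users with 6 scores each (made-up numbers).
users = [[rng.uniform(1.0, 5.0) for _ in range(6)] for _ in range(5)]
angle = rng.uniform(0.0, 360.0)  # common to all users
protected = [transform(u, rng.uniform(0.5, 10.0), angle) for u in users]

plain_sim = similarity_matrix(users)
protected_sim = similarity_matrix(protected)
max_gap = max(
    abs(p - q)
    for row_p, row_q in zip(plain_sim, protected_sim)
    for p, q in zip(row_p, row_q)
)
print(max_gap)
```

The largest entry-wise gap stays at floating-point rounding level, so a matrix-driven clustering would make the same merges on both inputs, except possibly where two similarities are tied to within rounding error.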

Recap

1. Pairwise cosine similarity for multidimensional vectors.

2. Geometrical transformations compatible with cosine similarity.

Looking Ahead

• Other privacy preserving similarity detection algorithms.

• Privacy preserving data analysis algorithms: – MAX,MIN

Thank you! Iraklis Leontiadis

leontiad@eurecom.fr
