ICASSP 2018 1 / 28
Differentially-private Distributed Principal Component Analysis
Hafiz Imtiaz & Anand D. Sarwate
Department of Electrical and Computer Engineering, Rutgers, the State University of New Jersey
April 17, 2018
Rutgers Imtiaz & Sarwate
Outline
1 Motivation
2 Problem Formulation
3 Differential Privacy
4 Proposed Algorithm
5 Experimental Results
6 Conclusion
Why learn from private data?
• Much private and sensitive data is being digitized
• We want to learn about the population by using and reusing these data
• Free and open sharing faces ethical, legal, and technological obstacles
Why learn in a distributed setting?
Good feature learning requires large sample sizes.
• Data at a single site may not be sufficient for statistical learning
• Pooling data in one location may not be possible
An example in neuroimaging
• Multiple fMRI collection centers
• Each has a moderate number of samples, at best
• Goal: find a way to reduce the sample dimension
We can perform principal component analysis (PCA)
Principal Component Analysis
The PCA problem: pooled case
• Data matrix: X = [x_1 x_2 … x_N] ∈ ℝ^(D×N)
• Second-moment matrix: A = (1/N) X Xᵀ
• We can decompose A as A = V Λ Vᵀ
• Here, Λ = diag(λ_1, λ_2, …, λ_D) with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_D ≥ 0
• The best rank-K approximation of A: A_K = V_K Λ_K V_Kᵀ
• The top-K PCA subspace is the span of the corresponding columns of V, denoted V_K(A)
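The pooled computation above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' code; the random data matrix and the dimensions D, N, K are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 5, 100, 2

# Data matrix with samples as columns, X in R^(D x N)
X = rng.standard_normal((D, N))

# Second-moment matrix A = (1/N) X X^T
A = (X @ X.T) / N

# Eigendecomposition A = V Lambda V^T, reordered so eigenvalues descend
lam, V = np.linalg.eigh(A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Top-K subspace V_K and the best rank-K approximation A_K
V_K = V[:, :K]
A_K = V_K @ np.diag(lam[:K]) @ V_K.T
```

Note that `eigh` returns eigenvalues in ascending order, so the explicit reordering is needed to match the slide's convention λ_1 ≥ ⋯ ≥ λ_D.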
The PCA problem: distributed case
• One aggregator, S different sites with disjoint datasets
• Local data matrix: X_s = [x_(s,1) … x_(s,N_s)] ∈ ℝ^(D×N_s)
• Local second-moment matrix: A_s = (1/N_s) X_s X_sᵀ
• All parties: “nice but curious”
How can we compute a global V_K?
What are our options?
• Send X_s to the aggregator
→ huge communication cost
→ privacy violation
• Compute V_K using local data only
→ poor quality of the subspace
[Figure: an adversary observes the output of algorithm F run on {x_1, …, x_(n−1)} together with either x_n or x′_n, and must decide between hypotheses H_0 and H_1]
Differential Privacy
Differential privacy: a definition
[Figure: neighboring databases D = {x_1, …, x_(n−1), x_n} and D′ = {x_1, …, x_(n−1), x′_n} are each passed through the algorithm]
[Dwork et al. 2006] An algorithm A is (ε, δ)-differentially private if for any set of outputs F and all pairs (D, D′) differing in a single point,

P(A(D) ∈ F) ≤ exp(ε) · P(A(D′) ∈ F) + δ
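Not on the slide, but the scalar Gaussian mechanism makes the definition concrete: adding Gaussian noise calibrated to a function's L2 sensitivity is a standard way to satisfy (ε, δ) differential privacy. The numeric values below are illustrative only:

```python
import numpy as np

# Gaussian mechanism: release f(D) + N(0, sigma^2), where Delta is the
# L2 sensitivity of f. Choosing
#     sigma = Delta * sqrt(2 * log(1.25 / delta)) / eps
# suffices for (eps, delta)-differential privacy (Dwork et al.).
eps, delta, Delta = 0.5, 1e-2, 1.0
sigma = Delta * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

rng = np.random.default_rng(0)
# Noisy answer to a hypothetical query with true value f(D) = 3.7
private_release = 3.7 + rng.normal(0.0, sigma)
```

The same calibration, with Delta replaced by the sensitivity of the local second-moment matrix, is what the proposed algorithm uses at each site.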
Differential privacy: hypothesis testing
The adversary’s log-likelihood ratio is bounded (for δ = 0):

log [ P(A(D) ∈ F) / P(A(D′) ∈ F) ] ≤ ε
We want to design algorithms that satisfy differential privacy
Why do we need privacy in PCA?
[Figure: two scatter plots of two-class data in the (x, y) plane with the fitted principal direction overlaid. Left: Class 1 has 11 samples, Class 2 has 10 samples. Right: Class 1 has 11 samples, Class 2 has 11 samples. The principal direction differs noticeably between the two panels.]
Changing one sample can significantly change the principal direction
What are we trying to address?
Goal: compute an accurate VK
• want to exploit all samples across all sites
• want a lower communication cost
• want to preserve a formal privacy definition
Idea: send the differentially-private partial square root of A_s
Proposed Algorithm
Differentially-private Distributed PCA (DPdisPCA)
Input: Data matrix X_s for s ∈ [S]; privacy parameters ε, δ; intermediate dimension R; reduced dimension K

for s = 1, 2, …, S do:
• Compute A_s ← (1/N_s) X_s X_sᵀ
• Generate a D × D symmetric matrix E with {E_ij : i ∈ [D], j ≤ i} drawn i.i.d. ∼ N(0, Δ²_(ε,δ)), where Δ_(ε,δ) = (1/(N_s ε)) √(2 log(1.25/δ))
• Compute A_s ← A_s + E
• Perform SVD: A_s = U Σ Uᵀ
• Compute P_s ← U_R Σ_R^(1/2); send P_s to the aggregator

At the aggregator:
• Compute A_c ← (1/S) Σ_(s=1)^S P_s P_sᵀ
• Perform SVD: A_c = V Λ Vᵀ

Output: Differentially-private rank-K subspace V_K
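A minimal NumPy sketch of the pseudocode above. This is my own paraphrase, not the authors' implementation; the function name and the list-of-arrays interface are invented for illustration:

```python
import numpy as np

def dp_dis_pca(sites, eps, delta, R, K, rng=None):
    """Sketch of DPdisPCA: each site adds symmetric Gaussian noise to its
    second-moment matrix, sends a D x R partial square root P_s, and the
    aggregator averages P_s P_s^T before a final SVD."""
    if rng is None:
        rng = np.random.default_rng()
    P_list = []
    for X_s in sites:                        # local computation at site s
        D, N_s = X_s.shape
        A_s = (X_s @ X_s.T) / N_s
        # Noise std: Delta_(eps,delta) = (1/(N_s*eps)) * sqrt(2 log(1.25/delta))
        std = np.sqrt(2.0 * np.log(1.25 / delta)) / (N_s * eps)
        E = rng.normal(0.0, std, size=(D, D))
        E = np.tril(E) + np.tril(E, -1).T    # symmetric: mirror lower triangle
        U, S, _ = np.linalg.svd(A_s + E)
        P_list.append(U[:, :R] * np.sqrt(S[:R]))  # P_s = U_R Sigma_R^(1/2)
    # Aggregator: A_c = (1/S) sum_s P_s P_s^T, then SVD for the subspace
    A_c = sum(P @ P.T for P in P_list) / len(P_list)
    V, _, _ = np.linalg.svd(A_c)
    return V[:, :K]

rng = np.random.default_rng(0)
sites = [rng.standard_normal((20, 1000)) for _ in range(3)]  # S = 3, D = 20
V_K = dp_dis_pca(sites, eps=1.0, delta=0.01, R=10, K=5, rng=rng)
```

Since A_s + E is symmetric, its SVD yields nonnegative singular values, so the square root Σ_R^(1/2) is always well defined.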
Privacy guarantee of DPdisPCA
Theorem (Privacy of DPdisPCA Algorithm)
DPdisPCA computes an (ε, δ)-differentially private approximation to the optimal subspace V_K(A).
• The L2 sensitivity of A_s is 1/N_s
• By the AG algorithm [Dwork et al. 2014]: the computation of A_s is (ε, δ)-differentially private
• Differential privacy is invariant to post-processing: the computation of V_K also satisfies (ε, δ) differential privacy
Some comments on DPdisPCA
• P_s is D × R, so the communication cost is proportional to S × D × R
• If we sent A_s instead, the cost would be proportional to S × D². Typically, K < R < D
• Sending P_s instead of A_s does introduce some error, the cost of cheaper communication
Experimental Results
Datasets
• Synthetic dataset (D = 200, K = 50) generated with zero mean and a pre-determined covariance matrix
• MNIST dataset (D = 784, K = 50) (MNIST)
• Covertype dataset (D = 54, K = 10) (COVTYPE)
The trade-offs
We want to find out:
• how performance varies with “privacy risk” ε
• how performance varies with sample size Ns
Performance measures
Table: Notation of performance measures

Algorithm / Setting | Performance Index
Pooled Data         | q_pooled
DPdisPCA            | q_DPdisPCA
Local Data          | q_local
Sending A_s         | q_full

• Quality of a subspace V: captured energy of A, q(V) = tr(VᵀAV)
• We plot the ratio of these quantities to the true captured energy q_o
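The captured-energy metric is easy to compute directly. The sanity check below is my own (not from the slides): it verifies that the exact top-K eigenvectors attain the optimal energy q_o, so the plotted ratio is 1 for noiseless pooled PCA:

```python
import numpy as np

def captured_energy(V, A):
    """q(V) = tr(V^T A V): energy of A captured by the subspace span(V)."""
    return np.trace(V.T @ A @ V)

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T                       # symmetric positive semidefinite
lam, V = np.linalg.eigh(A)
V_K = V[:, -2:]                   # top-2 eigenvectors (eigh sorts ascending)
q = captured_energy(V_K, A)
q_o = lam[-2:].sum()              # optimal top-2 energy
print(np.isclose(q, q_o))         # True
```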
Performance variation
For synthetic data
[Figure: relative captured energy vs. (a) privacy parameter ε, Synthetic (δ = 0.01, S = 10, N_s = 1k), and (b) local sample size N_s in thousands, Synthetic (ε = 0.5, δ = 0.01, S = 10); curves: q_pooled/q_o, q_DPdisPCA/q_o, q_full/q_o, q_local/q_o]
Performance variation with ε
[Figure: relative captured energy vs. privacy parameter ε for (a) MNIST (δ = 0.01, S = 10, N_s = 1k) and (b) COVTYPE (δ = 0.01, S = 10, N_s = 0.5k); curves: q_pooled/q_o, q_DPdisPCA/q_o, q_full/q_o, q_local/q_o]
Performance variation with N_s
[Figure: relative captured energy vs. local sample size N_s in thousands for (a) MNIST (ε = 1, δ = 0.01, S = 10) and (b) COVTYPE (ε = 0.1, δ = 0.01, S = 10); curves: q_pooled/q_o, q_DPdisPCA/q_o, q_full/q_o, q_local/q_o]
Concluding Remarks
Concluding remarks
• The distributed algorithm clearly outperforms the local PCA algorithm
• Increasing ε improves performance at the cost of lower privacy
• Datasets with lower D allow a smaller ε for the same utility
• Increasing N_s improves performance for a fixed privacy level
• The cost of sending P_s instead of A_s is noticeable in all datasets
Questions
Thank you