Federated Learning: Privacy-Preserving Collaborative Machine Learning without Centralized Training Data
Jakub Konečný (konkey@google.com), presenting the work of many
Trends in Optimization Seminar, University of Washington in Seattle
Jan 30, 2018
This Talk
1. Why Federated Learning?
2. FL vs. other distributed optimization
3. FL Algorithms
4. Minimizing Communication
5. Secure Aggregation
6. Differential Privacy
Federated Learning
Our Goal: Imbue mobile devices with state-of-the-art machine learning systems, without centralizing data and with privacy by default.

A very personal computer
- 2015: 79% away from their phone ≤ 2 hours/day; 63% away ≤ 1 hour/day; 25% can't remember being away at all [1]
- 2013: 72% of users within 5 feet of their phone most of the time [2]
- Plethora of sensors
- Innumerable digital interactions

[1] 2015 Always Connected Research Report, IDC and Facebook
[2] 2013 Mobile Consumer Habits Study, Jumio and Harris Interactive

Deep Learning: non-convex, millions of parameters, complex structure (e.g. LSTMs)
Cloud-centric ML for Mobile
- The model (the current model parameters) lives in the cloud.
- We train models in the cloud.
- Make predictions in the cloud: the mobile device sends a request and gets back a prediction.
- Gather training data in the cloud, and make the models better.
On-device inference
On-device inference uses a cloud-distributed model to make predictions directly on an edge device, without a cloud round-trip. Instead of making predictions in the cloud, distribute the model and make predictions on the device.
Privacy, Offline, Latency, Power, Sensors, Data Caps
Machine Intelligence for Mobile Devices
- MI models in the data center: e.g. ranking a large inventory, speech transcription
- MI models on the device: e.g. keyboard suggestions, ambient audio analysis

But how do we continue to improve the model?
Interactions generate training data on the device (local training data), which we gather to the cloud, and make the model better (for everyone).
On-Device Learning (Personalization)
Instead of centralizing the training data, train models right on the device, using local training data. Better for everyone (individually).
But what about...
1. New user experience
2. Benefitting from peers' data
Federated computation and learning
Federated computation: a server coordinates a fleet of participating devices to compute aggregations of the devices' private data.
Federated learning: a shared global model is trained via federated computation.
Federated Learning
Setting: a cloud service provider and many mobile devices, each holding local training data and the current model parameters. Many devices will be offline at any given time.
1. Server selects a sample of e.g. 100 online devices.
2. Selected devices download the current model parameters.
3. Devices compute an update using local training data.
4. Server aggregates users' updates into a new model.
5. Repeat until convergence.
To make the model better (for everyone). And to personalize it, for every one.
Privacy, Offline, Latency, Power, In Vivo, Sensors, Data Caps, Personalization, Training & Evaluation
Applications of Federated Learning
What makes a good application?
● On-device data is more relevant than server-side proxy data
● On-device data is privacy sensitive or large
● Labels can be inferred naturally from user interaction
Example applications
● Language modeling for mobile keyboards and voice recognition
● Image classification for predicting which photos people will share
● ...
Federated Learning in Gboard
This Talk
1. Why Federated Learning?
2. FL vs. other distributed optimization
3. FL Algorithms
4. Minimizing Communication
5. Secure Aggregation
6. Differential Privacy
Atypical Federated Learning assumptions
- Massively Distributed: training data is stored across a very large number of devices
- Limited Communication: only a handful of rounds of unreliable communication with each device
- Unbalanced Data: some devices have few examples, some have orders of magnitude more
- Highly Non-IID Data: data on each device reflects one individual's usage pattern
- Unreliable Compute Nodes: devices go offline unexpectedly; expect faults and adversaries
- Dynamic Data Availability: the subset of data available is non-constant, e.g. time-of-day vs. country
This Talk
1. Why Federated Learning?
2. FL vs. other distributed optimization
3. FL Algorithms
4. Minimizing Communication
5. Secure Aggregation
6. Differential Privacy
The Federated Averaging algorithm

Server, until converged:
1. Select a random subset (e.g. 1000) of the (online) clients.
2. In parallel, send current parameters θt to those clients.
3. θt+1 = θt + data-weighted average of client updates.

Selected client k:
1. Receive θt from server.
2. Run some number of minibatch SGD steps, producing θ'.
3. Return θ' - θt to server.

H. B. McMahan, et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS 2017.
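A minimal simulation sketch of this round structure in Python/NumPy. The linear-model gradient, learning rate, client sampling, and the helper names (client_update, federated_averaging) are illustrative assumptions, not the paper's exact setup or the production system.

```python
import numpy as np

def gradient(theta, X, y):
    # Least-squares gradient for a linear model; stands in for any model's gradient.
    return X.T @ (X @ theta - y) / len(y)

def client_update(theta, X, y, lr=0.1, epochs=1, batch_size=32):
    """Client k: receive theta, run minibatch SGD locally, return (theta' - theta, n_k)."""
    theta_prime = theta.copy()
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            theta_prime -= lr * gradient(theta_prime, X[b], y[b])
    return theta_prime - theta, n

def federated_averaging(theta, clients, rounds=100, clients_per_round=10):
    """Server: sample clients, collect updates, apply the data-weighted average."""
    for _ in range(rounds):
        sampled = np.random.choice(len(clients), clients_per_round, replace=False)
        deltas, counts = [], []
        for k in sampled:
            delta, n_k = client_update(theta, *clients[k])
            deltas.append(delta)
            counts.append(n_k)
        theta = theta + np.average(deltas, axis=0, weights=counts)
    return theta

# Tiny simulation: 50 clients with unbalanced, slightly shifted data for a 5-dim linear model.
true_theta = np.arange(5, dtype=float)
clients = []
for k in range(50):
    X = np.random.randn(np.random.randint(20, 200), 5) + 0.5 * k / 50
    clients.append((X, X @ true_theta + 0.1 * np.random.randn(len(X))))
print(federated_averaging(np.zeros(5), clients, rounds=50))
```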
Large-scale LSTM for next-word prediction (dataset: a large social network, 10M public posts, grouped by author)
Rounds to reach 10.5% accuracy: FedSGD 820, FedAvg 35 (a 23x decrease in communication rounds).

CIFAR-10 convolutional model (IID and balanced data)
Updates to reach 82% accuracy: SGD 31,000; FedSGD 6,600; FedAvg 630 (a 49x decrease in communication, measured in updates, vs SGD).

H. B. McMahan, et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS 2017.
Open Questions
Privacy, Offline, Latency, Power, In Vivo, Sensors, Data Caps, Personalization, Training & Evaluation
Federated Learning + Personalized Learning
Take the global model as a recipe for learning personalized models (rather than as a task-useful model on its own).

1. Multitask Learning
The global model represents a regularizer for training the per-user models, i.e. smoothing among models for similar users.
P. Vanhaesebrouck, A. Bellet and M. Tommasi. Decentralized Collaborative Learning of Personalized Models over Networks. AISTATS, 2017.
V. Smith, C. K. Chiang, M. Sanjabi, & A. Talwalkar. Federated Multi-Task Learning. arXiv preprint, 2017.
2. Learning to Learn
The global model represents a learning procedure that is biased towards learning efficiently. This procedure either replaces standard SGD or controls any free parameters (e.g. learning rate) when training the per-user models.
O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, & J. Sohl-Dickstein. Learned Optimizers that Scale and Generalize. arXiv preprint, 2017.
N. Mishra, M. Rohaninejad, X. Chen, P. Abbeel. Meta-Learning with Temporal Convolutions. arXiv preprint, 2017.
3. Model-Agnostic Meta Learning
The global model represents a good initialization from which standard algorithms can very quickly learn a personalized model. "Personalize to new users quickly" becomes the training objective, optimized by back-propagating through the per-user training procedure.
Chelsea Finn, P. Abbeel, S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv preprint, 2017
Better algorithms than Federated Averaging?
- Massively Distributed: training data is stored across a very large number of devices
- Limited Communication: only a handful of rounds of unreliable communication with each device
- Unbalanced Data: some devices have few examples, some have orders of magnitude more
- Highly Non-IID Data: data on each device reflects one individual's usage pattern
- Unreliable Compute Nodes: devices go offline unexpectedly; expect faults and adversaries
- Dynamic Data Availability: the subset of data available is non-constant, e.g. time-of-day vs. country
This Talk
1. Why Federated Learning?
2. FL vs. other distributed optimization
3. FL Algorithms
4. Minimizing Communication
5. Secure Aggregation
6. Differential Privacy
Recall the Federated Averaging algorithm:

Server, until converged:
1. Select a random subset (e.g. 1000) of the (online) clients.
2. In parallel, send current parameters θt to those clients.
3. θt+1 = θt + data-weighted average of client updates.

Selected client k:
1. Receive θt from server.
2. Run some number of minibatch SGD steps, producing θ'.
3. Return θ' - θt to server. ← Potential bottleneck

H. B. McMahan, et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS 2017.
Upload time can be a bottleneck
- Asymmetric connection speed: download is generally faster than upload (http://www.speedtest.net/reports/)
- Some markets are more data sensitive, e.g. India, Nigeria, Brazil, ...
- Secure Aggregation further increases the data that needs to be communicated
- Energy consumption: data transmission is generally power intensive
Federated Learning: we are not interested in the updates themselves, only in their aggregate.
Distributed Mean Estimation with Limited Communication
Setting: each of n clients holds a vector xi; the server wants to estimate the mean x̄ = (1/n) Σi xi from compressed client messages.
Pipeline: each client applies stochastic quantization to xi, producing yi; the server unquantizes and averages the yi.

Stochastic Binary Quantization (1 bit per dimension): each coordinate of xi is randomly rounded to one of two values (the vector's minimum or maximum) so that it is unbiased in expectation.
Mean Squared Error: grows with the dimension d; if d is large, this is prohibitive.

A. T. Suresh, F. Yu, S. Kumar, & H. B. McMahan. Distributed Mean Estimation with Limited Communication. ICML 2017.
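A minimal sketch of 1-bit stochastic quantization along the lines described above; the exact encoding in the paper may differ, so treat this as an illustrative assumption rather than its reference implementation.

```python
import numpy as np

def binary_quantize(x, rng=np.random.default_rng()):
    """Quantize each coordinate to x.min() or x.max(), unbiased in expectation."""
    lo, hi = x.min(), x.max()
    p_hi = (x - lo) / (hi - lo + 1e-12)        # probability of rounding up
    bits = rng.random(x.shape) < p_hi          # 1 bit per dimension
    return bits, lo, hi                        # transmit the bits plus two floats

def binary_unquantize(bits, lo, hi):
    return np.where(bits, hi, lo)

# The decoded vector equals x in expectation:
x = np.random.randn(6)
decoded = np.mean([binary_unquantize(*binary_quantize(x)) for _ in range(20000)], axis=0)
print(np.round(x, 3), np.round(decoded, 3))  # the two should be close
```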
Add a random rotation: each client computes zi = R xi, quantizes zi, and the server applies the inverse rotation after averaging.
Pipeline: xi → random rotation → zi → stochastic quantization → yi → (transmit) → unquantize, average, inverse rotation → estimate of x̄.
With the rotation, the error is much better for large d, and the scheme can be modified to achieve O(1/n) error.

Efficiency: avoid representing, transmitting, or inverting R, which is d × d.

A. T. Suresh, F. Yu, S. Kumar, & H. B. McMahan. Distributed Mean Estimation with Limited Communication. ICML 2017.
Structured matrix: R = HD
- D: diagonal matrix of i.i.d. Rademacher entries (±1)
- H: Walsh-Hadamard matrix, defined recursively (H_2m = [[H_m, H_m], [H_m, -H_m]]), with inverse H^-1 = (1/d) H
- Rotation & inverse rotation: time O(d log d), additional space O(1), communication (a random seed) O(1)

A. T. Suresh, F. Yu, S. Kumar, & H. B. McMahan. Distributed Mean Estimation with Limited Communication. ICML 2017.
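A minimal sketch of the structured rotation R = HD using a seed-derived sign vector and a fast Walsh-Hadamard transform; the function names and the 1/√d normalization convention are illustrative assumptions.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d); d must be a power of 2."""
    x = x.copy()
    h, d = 1, len(x)
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def rotate(x, seed):
    """z = (1/sqrt(d)) * H @ (D @ x); D has ±1 entries derived from the shared seed."""
    d = len(x)
    signs = np.where(np.random.default_rng(seed).random(d) < 0.5, -1.0, 1.0)
    return fwht(signs * x) / np.sqrt(d)

def inverse_rotate(z, seed):
    """Invert the rotation; with this normalization H^{-1} = H / d and D^{-1} = D."""
    d = len(z)
    signs = np.where(np.random.default_rng(seed).random(d) < 0.5, -1.0, 1.0)
    return signs * (fwht(z * np.sqrt(d)) / d)

x = np.random.randn(8)
print(np.allclose(x, inverse_rotate(rotate(x, seed=42), seed=42)))  # True
```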
Add subsampling: after rotation and quantization, each client randomly selects a subset of elements to transmit; the server expands them back before averaging.
Efficiency: communicate only the subsampled values; the corresponding indices can be represented as a random seed.
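A minimal sketch of seed-based coordinate subsampling as described above; the unbiased d/k rescaling in the expansion step is an assumption about how the mean estimate is kept unbiased, not necessarily the paper's exact estimator.

```python
import numpy as np

def subsample(x, k, seed):
    """Keep k randomly chosen coordinates; only the values and the seed are transmitted."""
    idx = np.random.default_rng(seed).choice(len(x), size=k, replace=False)
    return x[idx]

def expand(values, d, k, seed):
    """Rebuild a d-dim vector from the k transmitted values; rescale to stay unbiased."""
    idx = np.random.default_rng(seed).choice(d, size=k, replace=False)
    out = np.zeros(d)
    out[idx] = values * (d / k)
    return out

x = np.random.randn(16)
# Averaging many independently seeded reconstructions recovers x in expectation.
est = np.mean([expand(subsample(x, 4, s), 16, 4, s) for s in range(20000)], axis=0)
print(np.round(x - est, 2))  # entries near zero
```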
In practice: some subsampling and moderate quantization beats no subsampling and aggressive quantization.

Existing related work
- Mostly alternatives to quantization
- Complementary to subsampling and rotation
- Not necessarily efficient (MPI GPU-to-GPU communication)

For instance:
D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. NIPS 2017.
W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, H. Li. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. NIPS 2017.
Federated Learning with Compressed Updates
J. Konečný, H. B. McMahan, F. Yu, P. Richtarik, A. T. Suresh, D. Bacon. Federated Learning: Strategies for Improving Communication Efficiency. arXiv:1610.05492.
This Talk
1. Why Federated Learning?
2. FL vs. other distributed optimization
3. FL Algorithms
4. Minimizing Communication
5. Secure Aggregation
6. Differential Privacy
Recall the aggregation step of federated learning: the server aggregates users' updates into a new model.
Might these updates contain privacy-sensitive data?
The updates are:
1. Ephemeral
2. Focused
3. Only needed in aggregate
→ Federated Learning is a privacy-preserving technology.
Wouldn't it be great if... the server could aggregate users' updates, but could not inspect the individual updates?

Secure Aggregation
Existing protocols either transmit a lot of data or fail when users drop out (or both).
A novel protocol: K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, K. Seth. Practical Secure Aggregation for Privacy-Preserving Machine Learning. CCS 2017.
Random positive/negative pairs, aka "antiparticles"
Devices (Alice, Bob, Carol) cooperate to sample random pairs of 0-sum perturbation vectors; each matched pair sums to 0.
Each device adds its antiparticles before sending its update to the server, so each contribution looks random on its own...
...but the paired antiparticles cancel out when the contributions are summed, revealing only the sum.
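A minimal numeric sketch of this cancellation idea, assuming three parties and illustrative vectors; the pair masks are sampled directly here rather than derived from key agreement (that comes next).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
users = ["alice", "bob", "carol"]
updates = {u: rng.normal(size=d) for u in users}

# For every pair of users, sample one random mask; one side adds it, the other subtracts it.
pair_masks = {(a, b): rng.normal(size=d) for i, a in enumerate(users) for b in users[i + 1:]}

def masked(u):
    m = updates[u].copy()
    for (a, b), mask in pair_masks.items():
        if u == a:
            m += mask      # "antiparticle"
        elif u == b:
            m -= mask      # its matched pair
    return m

# Each masked contribution looks random, but the masks cancel in the sum.
print(np.allclose(sum(masked(u) for u in users), sum(updates.values())))  # True
```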
But there are two problems...
1. These vectors are big! How do users agree on them efficiently?
2. What if someone drops out? Their partners' unmatched antiparticles no longer cancel, corrupting the sum. That's bad.
Pairwise Diffie-Hellman Key Agreement
- Each user (Alice, Bob, Carol) holds a secret scalar (a, b, c); the public parameters are g and the modulus p.
- The values g^a, g^b, g^c (mod p) are safe to reveal: it is cryptographically hard to infer the secret from them.
- Because the g^x are public, users can share them via the server, which saves a copy.
- Each user combines a peer's public value with their own secret; because the operation is commutative, g^ab = g^ba becomes a shared pairwise secret.
Secrets are scalars, but the masks need to be vectors...
Use each pairwise shared secret to seed a pseudorandom number generator and generate the paired antiparticle vectors: PRNG(g^ab) gives one user's mask, and its negation is the partner's matched mask.
Why this design?
1. Efficiency via the pseudorandom generator: only a seed is agreed, not a full-length vector.
2. Mobile phones typically don't support peer-to-peer communication anyhow, so routing the public values through the server is natural.
3. Fewer secrets = easier recovery.
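A minimal sketch of deriving a paired mask from a Diffie-Hellman style shared secret plus a PRNG. The tiny group parameters and the use of NumPy's generator as the PRG are toy assumptions for illustration; a real protocol would use a proper cryptographic group and pseudorandom generator.

```python
import numpy as np

# Toy public parameters (a real deployment would use a large cryptographic group).
g, p = 5, 2**31 - 1
d = 4  # mask dimension

def keygen(secret):
    return pow(g, secret, p)                  # public value g^secret mod p

def shared_seed(my_secret, their_public):
    return pow(their_public, my_secret, p)    # (g^b)^a = (g^a)^b = g^ab mod p

def mask_from_seed(seed, sign):
    # Expand the agreed scalar into a full-length vector; one side adds, the other subtracts.
    return sign * np.random.default_rng(seed).normal(size=d)

a, b = 123456, 654321                         # Alice's and Bob's secrets
A, B = keygen(a), keygen(b)                   # public values, exchanged via the server
seed_alice = shared_seed(a, B)
seed_bob = shared_seed(b, A)
assert seed_alice == seed_bob                 # same pairwise secret on both sides

alice_mask = mask_from_seed(seed_alice, +1)
bob_mask = mask_from_seed(seed_bob, -1)
print(np.allclose(alice_mask + bob_mask, 0))  # paired antiparticles cancel
```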
k-out-of-n Threshold Secret Sharing
Goal: break a secret s into n pieces, called shares.
- Fewer than k shares: learn nothing.
- At least k shares: recover s perfectly.
2-out-of-3 secret sharing: each line is a share; the x-coordinate of the intersection is the secret.
Each user (Alice, Bob, Carol) makes shares of their secret (a, b, c) and exchanges them with their peers.
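A minimal sketch of k-out-of-n sharing in the Shamir style (a random degree k-1 polynomial whose constant term is the secret), over a toy prime field; an illustrative assumption rather than the protocol's exact construction.

```python
import random

P = 2**31 - 1  # toy prime field

def make_shares(secret, k, n):
    """Shares are points (x, f(x)) of a random degree k-1 polynomial with f(0) = secret."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0; needs at least k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = make_shares(secret=424242, k=2, n=3)
print(recover(shares[:2]), recover(shares[1:]))  # any 2 of the 3 shares recover 424242
```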
Recovering from dropouts:
- If Bob drops out, his partners' unmatched antiparticles no longer cancel and the sum is corrupted. That's bad.
- But the surviving users hold shares of Bob's secret b, so the server asks them for those shares, reconstructs b, regenerates Bob's antiparticle vectors, cancels them, and reveals the correct sum.
- Enough honest users + a high enough threshold ⇒ dishonest users cannot reconstruct the secret.
- However... what if Bob wasn't gone, just late? His masked contribution arrives after the honest users have already given up their shares of b. Oops. And that's permanent (honest users already gave b): the server now knows b and can unmask Bob's individual contribution.
Secure Aggregation
The server aggregates users' updates, but cannot inspect the individual updates.

Communication efficient:
- 2^20 = 1M params, 16 bits/param, 2^10 = 1k users: 1.73x expansion
- 2^24 = 16M params, 16 bits/param, 2^14 = 16k users: 1.98x expansion
Secure: up to ⅓ malicious clients + a fully observed server
Robust: up to ⅓ of clients can drop out
Interactive cryptographic protocol: in each phase, 1000 clients + the server interchange messages over 4 rounds of communication.

K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, K. Seth. Practical Secure Aggregation for Privacy-Preserving Machine Learning. CCS 2017.
This Talk
1. Why Federated Learning?
2. FL vs. other distributed optimization
3. FL Algorithms
4. Minimizing Communication
5. Secure Aggregation
6. Differential Privacy
Federated Learning
1. Ephemeral
2. Focused
3. Only in aggregate
4. Differential Privacy
Might these updates contain privacy-sensitive data? Might the final model memorize a user's data?
Differential Privacy
Differential privacy is the statistical science of trying to learn as much as possible about a group while learning as little as possible about any individual in it.
Andy Greenberg, Wired, June 13, 2016

Differential Privacy (trusted aggregator): noise is added once, by the server, after aggregating the updates.
Differentially-Private Federated Averaging
McMahan, Ramage, Talwar, Zhang. Learning Differentially Private Recurrent Language Models.

Server, until converged:
1. Select each user independently with probability q, for say E[C] = 1000 clients (rather than a fixed sample of e.g. C = 100).
2. In parallel, send current parameters θt to those clients.
3. θt+1 = θt + bounded-sensitivity data-weighted average of client updates + Gaussian noise N(0, Iσ²).

Selected client k:
1. Receive θt from server.
2. Run some number of minibatch SGD steps, producing θ'.
3. Return Clip(θ' - θt) to server.
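A minimal sketch of the clipping-and-noise round in a NumPy simulation. The clip norm, noise multiplier, uniform weighting, and the helper names (clip, dp_fedavg_round) are illustrative assumptions, a simplification of the paper's bounded-sensitivity estimator rather than its exact form.

```python
import numpy as np

def clip(update, clip_norm):
    """Client side: scale θ' - θt so its L2 norm is at most clip_norm (bounds sensitivity)."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_fedavg_round(theta, clipped_updates, clip_norm=1.0, noise_multiplier=1.0,
                    rng=np.random.default_rng()):
    """Server side: bounded-sensitivity average of client updates + Gaussian noise."""
    n = len(clipped_updates)
    avg = np.mean(clipped_updates, axis=0)
    # With per-update norm <= clip_norm, a single user's influence on the average is on the
    # order of clip_norm / n; the Gaussian noise is scaled relative to that sensitivity.
    sigma = noise_multiplier * clip_norm / n
    return theta + avg + rng.normal(0.0, sigma, size=theta.shape)

theta = np.zeros(5)
raw_updates = [np.random.randn(5) for _ in range(1000)]   # stand-ins for θ' - θt
theta = dp_fedavg_round(theta, [clip(u, 1.0) for u in raw_updates])
print(theta)
```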
Privacy Accounting for Noisy SGD: Moments Accountant
M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, & L. Zhang. Deep Learning with Differential Privacy.CCS 2016.
Figure: privacy loss (epsilon) under the Moments Accountant vs. previous composition theorems; ← better (smaller epsilon = more privacy).

User Differential Privacy for Federated Language Models
- Baseline training: 100 users per round; 17.5% accuracy in 4120 rounds.
- (1.152, 1e-9)-DP training: 5k users per round; 17.5% accuracy in 5000 rounds.
- Private training achieves equal accuracy, but using 60x more computation.

McMahan, Ramage, Talwar, Zhang. Learning Differentially Private Recurrent Language Models.
Differential Privacy variants:
- Differential Privacy (trusted aggregator): the server adds noise after aggregating the updates.
- Local Differential Privacy: each client adds noise to its own update before sending it.
- Differential Privacy with Secure Aggregation: each client adds noise, but the server only ever sees the masked sum.
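A small numeric illustration of why a trusted aggregator (or secure aggregation) needs less noise than local DP: noise added independently by each of n clients only shrinks as 1/√n in the average, so matching a given central noise level requires √n-times larger per-client noise. The numbers are illustrative, not a calibrated privacy analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 5000
central_sigma = 0.01                            # noise a trusted aggregator would add to the average

# Local DP: each client adds its own noise before sending.
per_client_sigma = central_sigma * np.sqrt(n)   # needed to match the central noise level
local_noise_in_avg = rng.normal(0, per_client_sigma, size=(trials, n)).mean(axis=1).std()

print(f"central noise std: {central_sigma:.4f}")
print(f"avg of {n} per-client noises (std {per_client_sigma:.3f} each): {local_noise_in_avg:.4f}")
```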
Inherently Complementary Techniques

Federated Learning (FL)
- Touching each user's data infrequently is good for user-level DP guarantees
- FedAvg updates are simple averages, making DP analysis easier and allowing SA
- Averaging over many users allows high CU without losing too much signal

Differential Privacy (DP)
- Local noise may have a regularization effect for FL (open question)
- Requires bounding update norms, reducing the error of quantization in CU
- Bounding the norm of individual users may mitigate abuse potential under SA (open)

Secure Aggregation (SA)
- Less local noise needed for DP because individual updates are not observed
- Quantization (necessary for SA) is also used by CU (more efficient combination is an open question)

Compressed Updates (CU)
- Lower communication costs are good for FL settings, especially with SA
- The structured noise of CU may complement or replace local noise for DP (open question)
Federated Learning: Privacy-Preserving Collaborative Machine Learning without Centralized Training Data
Jakub Konečný / konkey@google.com

1. Why Federated Learning?
2. FL vs. other distributed optimization
3. FL Algorithms
4. Minimizing Communication
5. Secure Aggregation
6. Differential Privacy
- On-device learning has many advantages.
- Federated Learning: training a shared global model, from a federation of participating devices which maintain control of their own data, with the facilitation of a central server.
- Federated Learning is practical: FederatedAveraging often converges quickly (in terms of communication rounds).
- Inherently complementary techniques: Federated Learning, Differential Privacy, Secure Aggregation, Compressed Updates.