
Page 1: Sum-Product Networks: A New Deep Architecture


Sum-Product Networks: A New Deep Architecture

Pedro Domingos, Dept. of Computer Science & Engineering

University of Washington

Joint work with Hoifung Poon

Page 2: Sum-Product Networks: A New Deep Architecture


Graphical Models: Challenges

[Figure: a Bayesian network (nodes Sprinkler, Rain, Grass Wet), a Markov network, and a Restricted Boltzmann Machine (RBM).]

Advantage: Compactly represent probability distributions

Problem: Inference is intractable

Problem: Learning is difficult

Page 3: Sum-Product Networks: A New Deep Architecture

Deep Learning

Stack many layers. E.g.: DBN [Hinton & Salakhutdinov, 2006], CDBN [Lee et al., 2009], DBM [Salakhutdinov & Hinton, 2010]

Potentially much more powerful than shallow architectures [Bengio, 2009]

But … inference is even harder, and learning requires extensive effort

Page 4: Sum-Product Networks: A New Deep Architecture

Graphical Models

Learning: Requires approximate inference

Inference: Still approximate

Page 5: Sum-Product Networks: A New Deep Architecture

Existing Tractable Models (a small subset of Graphical Models)

E.g., hierarchical mixture models, thin junction trees, etc.

Problem: Too restricted

Page 6: Sum-Product Networks: A New Deep Architecture

This Talk: Sum-Product Networks

Compactly represent the partition function using a deep network

[Venn diagram: Sum-Product Networks cover the Existing Tractable Models and much more of the space of Graphical Models.]

Page 7: Sum-Product Networks: A New Deep Architecture

Exact inference: linear time in network size

Page 8: Sum-Product Networks: A New Deep Architecture

Can compactly represent many more distributions

Page 9: Sum-Product Networks: A New Deep Architecture

Learn the optimal way to reuse computation, etc.

Page 10: Sum-Product Networks: A New Deep Architecture

Outline

Sum-product networks (SPNs)
Learning SPNs
Experimental results
Conclusion

Page 11: Sum-Product Networks: A New Deep Architecture

Why Is Inference Hard?

Bottleneck: Summing out variables

E.g., the partition function is a sum of exponentially many products:

P(X_1, \ldots, X_N) = \frac{1}{Z} \prod_j \phi_j\big(X^{(j)}\big), \qquad Z = \sum_X \prod_j \phi_j\big(X^{(j)}\big)
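To see why this is a bottleneck, here is a brute-force sketch in Python (hypothetical factors and names of my own choosing) that computes Z exactly in O(2^N) time:

```python
import itertools

def partition_function(factors, n_vars):
    """Brute-force Z: sum over all 2^N states of the product of factors.

    factors: functions mapping a full 0/1 state tuple to a potential.
    The outer loop over all states is exactly the exponential bottleneck.
    """
    z = 0.0
    for state in itertools.product([0, 1], repeat=n_vars):
        prod = 1.0
        for phi in factors:
            prod *= phi(state)
        z += prod
    return z

# Hypothetical example: two pairwise factors over three variables.
factors = [
    lambda s: 2.0 if s[0] == s[1] else 1.0,
    lambda s: 3.0 if s[1] == s[2] else 1.0,
]
print(partition_function(factors, 3))  # sums 2^3 = 8 products
```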

Page 12: Sum-Product Networks: A New Deep Architecture

Alternative Representation

X1  X2  P(X)
 1   1  0.4
 1   0  0.2
 0   1  0.1
 0   0  0.3

P(X) = 0.4 I[X1=1] I[X2=1]
     + 0.2 I[X1=1] I[X2=0]
     + 0.1 I[X1=0] I[X2=1]
     + 0.3 I[X1=0] I[X2=0]
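A minimal sketch of this representation in Python (names are mine, not from the talk): the distribution is a polynomial in the indicator variables, with one term per state.

```python
# Network polynomial for the two-variable table above.
# Each term: coefficient * I[X1=v1] * I[X2=v2].
TERMS = [
    (0.4, (1, 1)),
    (0.2, (1, 0)),
    (0.1, (0, 1)),
    (0.3, (0, 0)),
]

def network_poly(ind1, ind2):
    """ind1[v] = indicator I[X1=v], ind2[v] = indicator I[X2=v]."""
    return sum(w * ind1[v1] * ind2[v2] for w, (v1, v2) in TERMS)

# Complete state X1=1, X2=0: indicators are one-hot.
print(network_poly({0: 0, 1: 1}, {0: 1, 1: 0}))  # 0.2 = P(X1=1, X2=0)
```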


Page 14: Sum-Product Networks: A New Deep Architecture

Shorthand for Indicators

X1  X2  P(X)
 1   1  0.4
 1   0  0.2
 0   1  0.1
 0   0  0.3

P(X) = 0.4 X1 X2 + 0.2 X1 X̄2 + 0.1 X̄1 X2 + 0.3 X̄1 X̄2

(X is shorthand for I[X=1]; X̄ is shorthand for I[X=0])

Page 15: Sum-Product Networks: A New Deep Architecture

Sum Out Variables

X1  X2  P(X)
 1   1  0.4
 1   0  0.2
 0   1  0.1
 0   0  0.3

P(e) = 0.4 X1 X2 + 0.2 X1 X̄2 + 0.1 X̄1 X2 + 0.3 X̄1 X̄2

e: X1 = 1

Set X1 = 1, X̄1 = 0, X2 = 1, X̄2 = 1

Easy: To sum out a variable, set both of its indicators to 1
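Continuing the sketch above: clamping the evidence indicators and setting both indicators of each summed-out variable to 1 yields the marginal directly.

```python
TERMS = [(0.4, (1, 1)), (0.2, (1, 0)), (0.1, (0, 1)), (0.3, (0, 0))]
poly = lambda i1, i2: sum(w * i1[a] * i2[b] for w, (a, b) in TERMS)

# Evidence e: X1 = 1 -> clamp X1 indicators; set both X2 indicators to 1.
print(poly({0: 0, 1: 1}, {0: 1, 1: 1}))  # 0.4 + 0.2 = 0.6 = P(X1 = 1)
```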

Page 16: Sum-Product Networks: A New Deep Architecture

Graphical Representation

[Figure: the table's distribution drawn as a two-layer network: a sum node with weights 0.4, 0.2, 0.1, 0.3 over four product nodes, each product connecting the indicators X1/X̄1 and X2/X̄2 for one state.]

X1  X2  P(X)
 1   1  0.4
 1   0  0.2
 0   1  0.1
 0   0  0.3

Page 17: Sum-Product Networks: A New Deep Architecture

But … Exponentially Large

Example: Parity, the uniform distribution over states with an even number of 1's

[Figure: shallow sum-of-products over indicators X1, X̄1, …, X5, X̄5: 2^(N-1) product nodes and N·2^(N-1) edges.]

Page 18: Sum-Product Networks: A New Deep Architecture

But … Exponentially Large

Example: Parity, the uniform distribution over states with an even number of 1's

Can we make this more compact?

Page 19: Sum-Product Networks: A New Deep Architecture

Use a Deep Network

Example: Parity, the uniform distribution over states with an even number of 1's

Deep network size: O(N)

Page 20: Sum-Product Networks: A New Deep Architecture

Use a Deep Network

Example: Parity, the uniform distribution over states with an even number of 1's

Induce many hidden layers

Page 21: Sum-Product Networks: A New Deep Architecture

Use a Deep Network

Example: Parity, the uniform distribution over states with an even number of 1's

Reuse partial computation (see the sketch below)
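To make the reuse concrete, here is a sketch (my own construction, not code from the talk) that builds an even-parity network recursively: a block has an even number of 1's iff both halves are even or both halves are odd, and caching shares each half-block node instead of duplicating it.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parity_node(lo, hi, even=True):
    """Node for 'even (or odd) number of 1s among X_lo .. X_{hi-1}'.

    Splitting the block in half gives a sum of two products; lru_cache
    shares the half-size nodes, so the network stays polynomial in N
    instead of enumerating 2^(N-1) terms.
    """
    if hi - lo == 1:
        return f"I[X{lo}={0 if even else 1}]"
    mid = (lo + hi) // 2
    a = ("PROD", parity_node(lo, mid, True),  parity_node(mid, hi, even))
    b = ("PROD", parity_node(lo, mid, False), parity_node(mid, hi, not even))
    return ("SUM", a, b)   # uniform 0.5 / 0.5 weights omitted

print(parity_node(0, 4))
```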

Page 22: Sum-Product Networks: A New Deep Architecture

Sum-Product Networks (SPNs)

Rooted DAG
Nodes: sum, product, input indicator
Weights on edges from each sum node to its children

[Figure: example SPN over X1, X2. The root is a sum node with weights 0.7 and 0.3 over two product nodes. Each product multiplies a sum over the X1 indicators (weights 0.6/0.4 and 0.9/0.1 on X1/X̄1) by a sum over the X2 indicators (weights 0.3/0.7 and 0.2/0.8 on X2/X̄2).]
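A minimal way to encode this structure in Python (class and variable names are my own, not from the talk), building the example SPN from the figure:

```python
# Nodes of a tiny SPN, bottom-up. A sum node holds (weight, child)
# pairs; a product node holds a list of children.
class Ind:                      # input indicator, e.g. I[X1=1]
    def __init__(self, var, val): self.var, self.val = var, val

class Sum:
    def __init__(self, children): self.children = children  # [(w, node)]

class Prod:
    def __init__(self, children): self.children = children  # [node]

# Example SPN from the figure: leaf sums over each variable's
# indicators, two products, and a root sum with weights 0.7 / 0.3.
x1, nx1 = Ind(1, 1), Ind(1, 0)
x2, nx2 = Ind(2, 1), Ind(2, 0)
root = Sum([
    (0.7, Prod([Sum([(0.6, x1), (0.4, nx1)]), Sum([(0.3, x2), (0.7, nx2)])])),
    (0.3, Prod([Sum([(0.9, x1), (0.1, nx1)]), Sum([(0.2, x2), (0.8, nx2)])])),
])
```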

Page 23: Sum-Product Networks: A New Deep Architecture

Distribution Defined by SPN

P(X) ∝ S(X)

[Same example SPN as above.]


Page 25: Sum-Product Networks: A New Deep Architecture

Can We Sum Out Variables?

e: X1 = 1 → set indicators X1 = 1, X̄1 = 0, X2 = 1, X̄2 = 1

Does P(e) = Σ_{X ⊨ e} P(X) equal S(e) (up to normalization)?


Page 29: Sum-Product Networks: A New Deep Architecture

Valid SPN

An SPN is valid if S(e) = Σ_{X ⊨ e} S(X) for every evidence e

Valid ⇒ can compute marginals efficiently

Partition function Z can be computed by setting all indicators to 1

Page 30: Sum-Product Networks: A New Deep Architecture

Valid SPN: General Conditions

Theorem: An SPN is valid if it is complete & consistent

Complete: Under a sum node, the children cover the same set of variables
Consistent: Under a product node, no variable appears in one child and its negation in another

[Figure: counterexamples. An incomplete SPN can undercount, S(e) < Σ_{X ⊨ e} S(X); an inconsistent SPN can overcount, S(e) > Σ_{X ⊨ e} S(X).]
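Both conditions can be checked in one bottom-up pass over node scopes. A sketch, reusing the Ind/Sum/Prod classes and the root node from the earlier snippet:

```python
def check(node):
    """Bottom-up validity check; returns (variables, literals) per node.

    Complete: every child of a sum has the same variable scope.
    Consistent: no product has X in one child and not-X in another.
    """
    if isinstance(node, Ind):
        return frozenset([node.var]), frozenset([(node.var, node.val)])
    if isinstance(node, Sum):
        infos = [check(c) for _, c in node.children]
        assert len({vars_ for vars_, _ in infos}) == 1, "incomplete sum"
        return infos[0][0], frozenset().union(*(lits for _, lits in infos))
    infos = [check(c) for c in node.children]          # product node
    for i, (_, li) in enumerate(infos):
        flipped = {(v, 1 - b) for v, b in li}          # negated literals
        for _, lj in infos[i + 1:]:
            assert not (flipped & lj), "inconsistent product"
    return (frozenset().union(*(vars_ for vars_, _ in infos)),
            frozenset().union(*(lits for _, lits in infos)))

check(root)   # passes: the example SPN is complete and consistent
```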

Page 31: Sum-Product Networks: A New Deep Architecture

Semantics of Sums and Products

Product = feature: forms a feature hierarchy

Sum = mixture, with a hidden variable summed out: a sum node i with weight w_ij on child j is equivalent to a mixture whose hidden variable Y_i selects child j with probability w_ij; multiplying child j by the indicator I[Y_i = j] and summing out Y_i recovers the sum node.

Page 32: Sum-Product Networks: A New Deep Architecture

Inference

Probability: P(X) = S(X) / Z

Example: X: X1 = 1, X2 = 0 → indicators X1 = 1, X̄1 = 0, X2 = 0, X̄2 = 1

The leaf sums evaluate to 0.6, 0.9, 0.7, 0.8; the two products give 0.6 × 0.7 = 0.42 and 0.9 × 0.8 = 0.72; the root gives S(X) = 0.7 × 0.42 + 0.3 × 0.72 = 0.51
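Continuing the running sketch, evaluation is a single bottom-up pass over the same node classes; `indicators` maps each (variable, value) pair to 0 or 1:

```python
import math

def value(node, indicators):
    """S(.) in one bottom-up pass: linear in network size."""
    if isinstance(node, Ind):
        return indicators[(node.var, node.val)]
    if isinstance(node, Sum):
        return sum(w * value(c, indicators) for w, c in node.children)
    return math.prod(value(c, indicators) for c in node.children)

# Complete state X1 = 1, X2 = 0 -> one-hot indicators.
print(value(root, {(1, 1): 1, (1, 0): 0, (2, 1): 0, (2, 0): 1}))  # 0.51
```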

Page 33: Sum-Product Networks: A New Deep Architecture

Inference

If the weights sum to 1 at each sum node, then Z = 1 and P(X) = S(X)

[Same example as above: every sum node's weights already sum to 1, so S(X) = 0.51 is itself the probability.]

Page 34: Sum-Product Networks: A New Deep Architecture

Inference

Marginal: P(e) = S(e) / Z

Example: e: X1 = 1 → indicators X1 = 1, X̄1 = 0, X2 = 1, X̄2 = 1

The leaf sums evaluate to 0.6, 0.9, 1, 1; the products give 0.6 and 0.9; the root gives S(e) = 0.7 × 0.6 + 0.3 × 0.9 = 0.69 = 0.51 + 0.18 = P(X1=1, X2=0) + P(X1=1, X2=1)
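In the running sketch, the marginal is the same evaluation with a different indicator setting:

```python
# Evidence e: X1 = 1. Both X2 indicators set to 1 (X2 summed out).
print(value(root, {(1, 1): 1, (1, 0): 0, (2, 1): 1, (2, 0): 1}))
# 0.7*(0.6*1.0) + 0.3*(0.9*1.0) = 0.69 = P(X1=1), since Z = 1
```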

Page 35: Sum-Product Networks: A New Deep Architecture

Inference

MAP: Replace sums with maxes

Example: e: X1 = 1. The max-leaf values are 0.6, 0.9, 0.7, 0.8; the products give 0.42 and 0.72; the root max compares 0.7 × 0.42 = 0.294 against 0.3 × 0.72 = 0.216

Page 36: Sum-Product Networks: A New Deep Architecture

Inference

MAX: Pick the child with the highest value (0.7 × 0.42 = 0.294 beats 0.3 × 0.72 = 0.216)

MAP state: X1 = 1, X2 = 0 (see the sketch below)
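A sketch of MAP inference over the same node classes: a max-product upward pass, then a downward pass that follows the argmax child at each sum node. For clarity the sketch recomputes child values instead of caching them, which a real implementation would avoid.

```python
import math

def map_value(node, indicators):
    """Max-product pass: sums replaced by max over weighted children."""
    if isinstance(node, Ind):
        return indicators[(node.var, node.val)]
    if isinstance(node, Sum):
        return max(w * map_value(c, indicators) for w, c in node.children)
    return math.prod(map_value(c, indicators) for c in node.children)

def map_state(node, indicators, assignment):
    """Backtrack from the root, taking the argmax child at each sum."""
    if isinstance(node, Ind):
        assignment[node.var] = node.val
    elif isinstance(node, Sum):
        _, best = max(node.children,
                      key=lambda wc: wc[0] * map_value(wc[1], indicators))
        map_state(best, indicators, assignment)
    else:                       # product: descend into every child
        for c in node.children:
            map_state(c, indicators, assignment)

ind = {(1, 1): 1, (1, 0): 0, (2, 1): 1, (2, 0): 1}   # e: X1 = 1
best = {}
map_state(root, ind, best)
print(best)                     # {1: 1, 2: 0}: MAP state X1 = 1, X2 = 0
```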

Page 37: Sum-Product Networks: A New Deep Architecture

Handling Continuous Variables

Sum → integral over the input

Simplest case: replace each indicator with a Gaussian

An SPN then compactly defines a very large mixture of Gaussians
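A minimal sketch of a continuous leaf, assuming a univariate Gaussian (names are mine; the talk only states the idea):

```python
import math

class GaussLeaf:
    """Continuous leaf: a density value instead of a 0/1 indicator.

    Summing out the variable corresponds to integrating the density,
    which evaluates to 1 (the analogue of setting an indicator to 1).
    """
    def __init__(self, var, mu, sigma):
        self.var, self.mu, self.sigma = var, mu, sigma

    def value(self, x=None):
        if x is None:           # variable summed out: integral = 1
            return 1.0
        z = (x - self.mu) / self.sigma
        return math.exp(-0.5 * z * z) / (self.sigma * math.sqrt(2 * math.pi))

print(GaussLeaf(1, mu=0.0, sigma=1.0).value(0.0))  # peak density ~0.3989
```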

Page 38: Sum-Product Networks: A New Deep Architecture

SPNs Everywhere

Graphical models

Existing tractable models, e.g.: hierarchical mixture models, thin junction trees, etc.

SPNs can compactly represent many more distributions

Page 39: Sum-Product Networks: A New Deep Architecture

SPNs Everywhere

Graphical models

Inference methods, e.g.: the junction-tree algorithm, message passing, …

SPNs can represent and combine these, and learn the optimal way

Page 40: Sum-Product Networks: A New Deep Architecture

SPNs Everywhere

Graphical models

SPNs can be more compact by leveraging determinism, context-specific independence, etc.

Page 41: Sum-Product Networks: A New Deep Architecture

SPNs Everywhere

Graphical models
Models for efficient inference, e.g., arithmetic circuits, AND/OR graphs, case-factor diagrams

SPNs: the first approach for learning such models directly from data

Page 42: Sum-Product Networks: A New Deep Architecture

SPNs Everywhere

Graphical models
Models for efficient inference
General, probabilistic convolutional networks

Sum: Average-pooling
Max: Max-pooling

Page 43: Sum-Product Networks: A New Deep Architecture

SPNs Everywhere

Graphical models
Models for efficient inference
General, probabilistic convolutional networks
Grammars in vision and language, e.g., object detection grammars, probabilistic context-free grammars

Sum: Non-terminal
Product: Production rule

Page 44: Sum-Product Networks: A New Deep Architecture

Outline

Sum-product networks (SPNs)
Learning SPNs
Experimental results
Conclusion

Page 45: Sum-Product Networks: A New Deep Architecture

General Approach

Start with a dense SPN

Find the structure by learning weights: zero weights signify absent connections

Can also learn with EM: each sum node is a mixture over its children

Page 46: Sum-Product Networks: A New Deep Architecture

The Challenge

In principle, can use gradient descent

But … the gradient quickly dilutes as it propagates through many layers

Similar problem with (soft) EM

Hard EM overcomes this problem

Page 47: Sum-Product Networks: A New Deep Architecture

Our Learning Algorithm

Online learning with hard EM

Each sum node maintains a count for each child

For each example:
Find the MAP instantiation with the current weights
Increment the count for each chosen child
Renormalize counts to obtain the new weights

Repeat until convergence (a sketch follows)
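A sketch of the bookkeeping for a single sum node; the talk gives only the outline above, so this is one plausible reading, and the uniform pseudo-counts for smoothing are my own assumption:

```python
class SumNodeCounts:
    """Online hard-EM bookkeeping for one sum node."""
    def __init__(self, n_children, prior=1.0):
        # Uniform pseudo-counts for smoothing (an assumption, not
        # stated in the talk).
        self.counts = [prior] * n_children

    def update(self, chosen_child):
        # MAP inference went through `chosen_child` for this example.
        self.counts[chosen_child] += 1

    def weights(self):
        # Renormalized counts are the node's new mixture weights.
        total = sum(self.counts)
        return [c / total for c in self.counts]

node = SumNodeCounts(2)
for winner in [0, 0, 1]:        # argmax children from three MAP passes
    node.update(winner)
print(node.weights())           # [0.6, 0.4]
```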

Page 48: Sum-Product Networks: A New Deep Architecture

Outline

Sum-product networks (SPNs)
Learning SPNs
Experimental results
Conclusion

Page 49: Sum-Product Networks: A New Deep Architecture

Task: Image Completion

Very challenging, and good for evaluating deep models

Methodology:
Learn a model from training images
Complete unseen test images
Measure mean squared error

Page 50: Sum-Product Networks: A New Deep Architecture

Datasets

Main evaluation: Caltech-101 [Fei-Fei et al., 2004]
101 categories, e.g., faces, cars, elephants
Each category: 30-800 images, with the last third held out for test

Also: Olivetti [Samaria & Harter, 1994] (400 faces)

Test images: Unseen objects

Page 51: Sum-Product Networks: A New Deep Architecture

SPN Architecture

[Figure: at the lowest level, a sum node (mixture) per pixel x.]


Page 53: Sum-Product Networks: A New Deep Architecture

SPN Architecture

[Figure: region-level sum and product nodes built on top of the pixel-level mixtures.]

Page 54: Sum-Product Networks: A New Deep Architecture

Decomposition

[Figure: a region is decomposed into a pair of subregions; each decomposition is a product node.]

Page 55: Sum-Product Networks: A New Deep Architecture

Decomposition

[Figure: a region's sum node mixes over many alternative decompositions into subregions.]

Page 56: Sum-Product Networks: A New Deep Architecture

SPN Architecture

Whole Image → Regions → Pixels

[Figure: the full hierarchy. Each region's sum nodes mix over its decompositions into subregions, down to per-pixel mixtures; a sketch of the construction follows.]
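A sketch of how such a hierarchy could be generated, simplified to 1-D spans of pixels with weights omitted (the talk's architecture is 2-D rectangular regions with many mixture components per region, so this is only illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def build_region(lo, hi):
    """Sum node over all two-way splits of the pixel span [lo, hi).

    lru_cache makes shared subregions reuse the same node: this reuse
    of partial computation keeps the network polynomial-sized.
    """
    if hi - lo == 1:
        return f"Pixel({lo})"               # leaf: per-pixel mixture
    splits = tuple(("PROD", build_region(lo, m), build_region(m, hi))
                   for m in range(lo + 1, hi))
    return ("SUM", splits)

print(build_region(0, 4))
```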

Page 57: Sum-Product Networks: A New Deep Architecture


Systems

SPN

DBM [Salakhutdinov & Hinton, 2010]

DBN [Hinton & Salakhutdinov, 2006]

PCA [Turk & Pentland, 1991]

Nearest neighbor [Hays & Efros, 2007]

Page 58: Sum-Product Networks: A New Deep Architecture

Preprocessing for DBN

DBN did not work well despite our best effort

Followed Hinton & Salakhutdinov [2006]:
Reduced scale: 64 × 64 → 25 × 25
Used a much larger dataset: 120,000 images

Note: the reduced scale artificially lowers errors

Page 59: Sum-Product Networks: A New Deep Architecture

Caltech: Mean-Square Errors

                  LEFT   BOTTOM
SPN               3551   3270
DBM               9043   9792
DBN               4778   4492
PCA               4234   4465
Nearest Neighbor  4887   5505

Page 60: Sum-Product Networks: A New Deep Architecture

SPN vs. DBM / DBN

SPN is an order of magnitude faster
No elaborate preprocessing or tuning
Reduced errors by 30-60%
Learned up to 46 layers

           SPN          DBM / DBN
Learning   2-3 hours    Days
Inference  < 1 second   Minutes or hours

Page 61: Sum-Product Networks: A New Deep Architecture

Scatter Plot: Complete Left


Page 62: Sum-Product Networks: A New Deep Architecture

Scatter Plot: Complete Bottom


Page 63: Sum-Product Networks: A New Deep Architecture

Olivetti: Mean-Square Errors

                  LEFT   BOTTOM
SPN                942    918
DBM               1866   2401
DBN               2386   1931
PCA               1076   1265
Nearest Neighbor  1527   1793

Page 64: Sum-Product Networks: A New Deep Architecture

Example Completions

[Figure: face completions by SPN, DBM, DBN, PCA, and nearest neighbor, shown alongside the original images.]

Page 65: Sum-Product Networks: A New Deep Architecture


Open Questions

Other learning algorithms

Discriminative learning

Architecture

Continuous SPNs

Sequential domains

Other applications

Page 66: Sum-Product Networks: A New Deep Architecture

End-to-End Comparison

Data → General Graphical Models → Performance (approximate learning, approximate inference)

Data → Sum-Product Networks → Performance (approximate learning, exact inference)

Given the same computation budget, which approach has better performance?

Page 67: Sum-Product Networks: A New Deep Architecture

Conclusion

Sum-product networks (SPNs):
DAG of sums and products
Compactly represent the partition function
Learn many layers of hidden variables

Exact inference: linear time in network size

Deep learning: online hard EM

Substantially outperforms the state of the art on image completion