Decomposition-by-Normalization (DBN): Leveraging Approximate Functional Dependencies for Efficient Tensor Decomposition
Mijung Kim (Arizona State University), K. Selçuk Candan (Arizona State University)
This work is supported by NSF Grant #1043583 'MiNC: NSDL Middleware for Network- and Context-aware Recommendations' and NSF Grant #1116394 'RanKloud: Data Partitioning and Resource Allocation Strategies for Scalable Multimedia and Social Media Analysis'.




Page 1:

Decomposition-by-Normalization (DBN): Leveraging Approximate Functional Dependencies for Efficient Tensor Decomposition

Mijung Kim (Arizona State University) K. Selçuk Candan (Arizona State University)


Page 2:

Tensor Decomposition

A tensor is a multi-dimensional array.

Tensor decomposition is widely used for multi-aspect analysis of multi-dimensional data.

Page 3:

High cost of tensor decomposition

Data is commonly high-dimensional and large-scale.

Dense tensor decomposition: the cost increases exponentially with the number of modes of the tensor.

Sparse tensor decomposition: the cost increases more slowly (linearly with the number of nonzero entries in the tensor), but can still be very expensive for large data sets.

Parallelization of the ALS method faces difficulties, such as communication cost.

How do we tackle this high computational cost of tensor decomposition?

Page 4:

Normalization

Reduce the dimensionality and the size of the input tensor based on functional dependencies (FDs) of the relation (tensor).
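As an illustrative sketch of this idea: an FD movie → genre lets a wide relation be split into two narrower ones, where R1 keeps only one tuple per unique value of the join attribute. The attribute names and values below are invented for the example.

```python
# Sketch: vertical partitioning of a relation along a functional dependency.
# Assumes an exact FD movie -> genre; attribute names are illustrative.
rows = [
    ("alice", "matrix", 5, "scifi"),
    ("bob",   "matrix", 4, "scifi"),
    ("alice", "amelie", 3, "romance"),
]

# R1: the join attribute plus the attributes it determines,
# one tuple per unique join-attribute value (duplicates eliminated)
r1 = sorted({(movie, genre) for _, movie, _, genre in rows})
# R2: the remaining attributes plus the join attribute
r2 = [(user, movie, rating) for user, movie, rating, _ in rows]

print(r1)  # one (movie, genre) pair per movie
print(r2)
```

Joining R1 and R2 back on movie reconstructs the original relation, which is what lets the two smaller tensors be decomposed separately.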

Page 5:

Join-by-Decomposition [Kim and Candan 2011]

Step 1a: Decomposition of the (user, movie, rating) relation

Step 1b: Decomposition of the (movie, genre) relation

Step 2: Combination of the two decompositions into a final decomposition

M. Kim and K. S. Candan. Approximate tensor decomposition within a tensor-relational algebraic framework. In CIKM, 2011.

Find all rank-R1 and rank-R2 decompositions of the two input tensors, where R1 × R2 = R, and choose the pair whose two decompositions are as independent from each other as possible.
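The rank-pair enumeration above can be sketched as follows (a minimal illustration; choosing the most independent pair is a separate step not shown here):

```python
# Sketch: enumerate all (R1, R2) pairs with R1 * R2 = R, as searched by
# join-by-decomposition when combining two sub-tensor decompositions.
def rank_pairs(R):
    return [(r1, R // r1) for r1 in range(1, R + 1) if R % r1 == 0]

print(rank_pairs(12))  # [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
```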

Page 6:

Decomposition-by-Normalization (DBN)

High-dimensional data set (5-mode tensor)

Page 7:

Decomposition-by-Normalization (DBN)

High-dimensional data set (5-mode tensor)

Normalization based on functional dependencies(vertical partitioning)

Page 8:

Decomposition-by-Normalization (DBN)


High-dimensional data set (5-mode tensor)

Lower-dimensional data sets (two 3-mode tensors)

Normalization based on functional dependencies(vertical partitioning)

Page 9:

Decomposition-by-Normalization (DBN)


High-dimensional data set (5-mode tensor)

Lower-dimensional data sets (two 3-mode tensors)

Tensor decomposition on each vertical partition (sub-tensor)

Normalization based on functional dependencies(vertical partitioning)

Page 10:

Decomposition-by-Normalization (DBN)


High-dimensional data set (5-mode tensor)

Lower-dimensional data sets (two 3-mode tensors)

Tensor decomposition on each vertical partition (sub-tensor)

Combined into the decomposition of the original data set (tensor)

Normalization based on functional dependencies(vertical partitioning)

Page 11:

Task 1: Normalization Process

Select an attribute X which functionally determines the other attributes (X → A), to prevent introducing spurious data.

An efficient method is needed to determine functional dependencies in the data, because the total number of functional dependencies in the data can be exponential.

We employ TANE [Huhtala et al. 1999], which finds a set of (approximate) pair-wise FDs in time linear in the size of the input; this is trivial compared to the decomposition cost.

Y. Huhtala et al. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J., 42(2):100-111, 1999.

Page 12:

Task 2: Find Approximate FD

Many data sets may not have perfect FDs to leverage for normalization.

Thus, we rely on approximate FDs in the data, with support (based on the minimum fraction of tuples that must be removed for the FD to hold).
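This error measure can be illustrated with a g3-style computation in the spirit of TANE; the sketch below and its attribute names are invented for the example:

```python
from collections import Counter, defaultdict

# Sketch of a g3-style error measure for an approximate FD X -> A:
# the minimum fraction of tuples that must be removed for X -> A to hold.
def fd_error(rows, x, a):
    """rows: list of dicts; x, a: attribute names."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[x]][row[a]] += 1
    # within each X-group, keep the most common A value, remove the rest
    kept = sum(max(c.values()) for c in groups.values())
    return 1 - kept / len(rows)

rows = [
    {"movie": "matrix", "genre": "scifi"},
    {"movie": "matrix", "genre": "scifi"},
    {"movie": "matrix", "genre": "action"},   # violates movie -> genre
    {"movie": "amelie", "genre": "romance"},
]
err = fd_error(rows, "movie", "genre")
print(err)  # 0.25: remove one of four tuples and the FD holds exactly
```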

Page 13:

Task 3: Partitioning

Partition the data into two partitions that will lead to the least amount of error.

Find partitions that are as independent from each other as possible:

minimize inter-partition (between-partition) pair-wise FDs

maximize intra-partition (within-partition) pair-wise FDs

Page 14:

Parallelized DBN

We parallelize the entire DBN operation by associating each pair of rank decompositions with an individual processor core.

Rank-1 × Rank-12, Rank-2 × Rank-6, Rank-3 × Rank-4, Rank-4 × Rank-3, Rank-6 × Rank-2, Rank-12 × Rank-1

Each pair can run on a separate core in parallel.
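The one-pair-per-core scheme can be sketched with a worker pool; `decompose_pair` below is a placeholder standing in for the two per-partition decompositions and their combination:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: dispatch each (R1, R2) rank pair to its own worker, mirroring
# the parallelized DBN scheme of one pair per core. decompose_pair is a
# placeholder, not the actual decomposition code.
def decompose_pair(pair):
    r1, r2 = pair
    return (pair, f"rank-{r1} x rank-{r2} decomposition")

pairs = [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
    results = list(pool.map(decompose_pair, pairs))

for pair, desc in results:
    print(desc)
```

Since the rank pairs are independent, no communication between workers is needed until the best pair is selected at the end.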

Page 15:

Desiderata

The vertical partitioning should be such that:

Approximate FDs have high support, to prevent over-thinning of the relation R.

Case 1: the join attribute X determines only a subset of the attributes of the relation R (|R|=|R2|, |R1|<=|R2|). For dense tensors, the number of attributes in each partition should be balanced; for sparse tensors, the total number of tuples of R1 and R2 is minimized.

Case 2: the join attribute X determines all attributes of the relation R (|R|=|R1|=|R2|). The support for the inter-partition FDs is minimized. (For dense tensors, the partitions should also be balanced.)

Page 16:

Vertical Partitioning Strategies

Partition with all the attributes determined by the join attribute with a support higher than the threshold.

Page 17:

Desiderata (repeated from Page 15)

Page 18:

Vertical Partitioning Strategies (Case 1: the join attribute X determines only a subset of the attributes of the relation R (|R|=|R2|, |R1|<=|R2|))

Sparse tensors: the size of R1 (X and all determined attributes) can be minimized down to the number of unique values of X by eliminating all duplicate tuples.

Dense tensors: promote balanced partitioning by relaxing or tightening the support threshold. If R2 has more attributes than R1, move the attributes of R2 with the highest support to R1 (relaxing); if R1 has more attributes than R2, move the attributes of R1 with the lowest support to R2 (tightening).
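The relaxing/tightening heuristic can be sketched as follows; the attribute names and support values are invented for the example:

```python
# Sketch of the balancing heuristic for dense tensors: move attributes
# between the two partitions until their attribute counts differ by at
# most one. support maps each attribute to the (illustrative) support of
# the approximate FD from the join attribute to that attribute.
def balance(part1, part2, support):
    """part1/part2: lists of attribute names; support: attr -> FD support."""
    p1, p2 = list(part1), list(part2)
    while abs(len(p1) - len(p2)) > 1:
        if len(p2) > len(p1):   # relax: pull the best-supported attr into R1
            best = max(p2, key=lambda a: support[a])
            p2.remove(best); p1.append(best)
        else:                   # tighten: push the worst-supported attr to R2
            worst = min(p1, key=lambda a: support[a])
            p1.remove(worst); p2.append(worst)
    return p1, p2

support = {"genre": 0.9, "year": 0.8, "studio": 0.4, "user": 0.1, "rating": 0.2}
p1, p2 = balance(["genre"], ["year", "studio", "user", "rating"], support)
print(p1, p2)  # ['genre', 'year'] ['studio', 'user', 'rating']
```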

Page 19:

Desiderata (repeated from Page 15)

Page 20:

Vertical Partitioning Strategies (Case 2: the join attribute X determines all attributes of the relation R)

We formulate interFD-based partitioning as a graph partitioning problem on a pairwise FD graph, Gpfd(V, E), where each vertex represents an attribute and the weight of each edge is the average support of the approximate FDs between the two attributes.

The problem is then to locate a cut on Gpfd with the minimum average weight. (For dense tensors, a balance criterion is imposed.)

We use a modified version of a minimum cut algorithm [Stoer and Wagner 1997] to seek a minimum average cut.

M. Stoer and F. Wagner. A simple min-cut algorithm. J. ACM, 44(4):585-591, 1997.
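For illustration only, here is a brute-force stand-in for the minimum-average-weight cut (the slides use a modified Stoer-Wagner algorithm; exhaustive search over bipartitions is shown here only because a relation typically has few attributes, and the attribute names and weights are invented):

```python
from itertools import combinations

# Brute-force sketch: over a small pairwise-FD graph, find the bipartition
# of attributes whose cut edges have minimum *average* support.
def min_avg_cut(attrs, weight):
    """weight: dict mapping frozenset({a, b}) -> avg FD support between a, b."""
    best = None
    for k in range(1, len(attrs)):
        for side in combinations(attrs, k):
            s = set(side)
            cut = [weight[frozenset({a, b})]
                   for a in s for b in attrs if b not in s]
            avg = sum(cut) / len(cut)
            if best is None or avg < best[0]:
                best = (avg, sorted(s), sorted(set(attrs) - s))
    return best

attrs = ["a", "b", "c", "d"]
w = {frozenset(p): v for p, v in [
    (("a", "b"), 0.9), (("c", "d"), 0.8),      # strong intra dependencies
    (("a", "c"), 0.1), (("a", "d"), 0.2),
    (("b", "c"), 0.1), (("b", "d"), 0.2),      # weak cross dependencies
]}
avg, p1, p2 = min_avg_cut(attrs, w)
print(p1, p2)  # the strongly dependent pairs end up on the same side
```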

Page 21:

Rank Pruning based on Intra-Partition Dependencies

The higher the overall dependency between the attributes in a partition, the smaller the decomposition rank of that partition should be.

Thus, we only consider rank pairs (r1, r2) such that r1 < r2 if the intra-partition FD support for R1 is larger than the support for R2, and vice versa.
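The pruning rule can be sketched directly; the support values in the example are invented:

```python
# Sketch of intraFD-based rank pruning: keep only rank pairs (r1, r2)
# where the partition with stronger internal dependencies gets the
# smaller rank.
def prune(pairs, support_r1, support_r2):
    if support_r1 > support_r2:
        return [(r1, r2) for r1, r2 in pairs if r1 < r2]
    if support_r2 > support_r1:
        return [(r1, r2) for r1, r2 in pairs if r2 < r1]
    return list(pairs)

pairs = [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
print(prune(pairs, 0.9, 0.3))  # [(1, 12), (2, 6), (3, 4)]
```

This halves the candidate pairs for rank 12, which is where the DBN2,3 variants in the experiments get their speedup.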

Page 22:

Experimental Setup (Data Sets)

UCI Machine Learning Repository [Frank and Asuncion 2010]

A. Frank and A. Asuncion. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Sciences, 2010.

Page 23:

Experimental Setup (Algorithms)

NNCP (Non-Negative CP) vs. DBN

Dense tensor [N-way Toolbox 2000]: NNCP-NWAY vs. DBN-NWAY

Sparse tensor [MATLAB Tensor Toolbox 2007]: NNCP-CP vs. DBN-CP

With parallelization: NNCP-NWAY/CP-GRID2,6 [Phan and Cichocki 2011] vs. pp-DBN-NWAY/CP

DBN with intraFD-based rank pruning: DBN2,3 (2-pair or 3-pair selection)

C. A. Andersson and R. Bro. The N-way Toolbox for MATLAB. Chemometr. Intell. Lab., 52(1):1-4, 2000.
B. W. Bader and T. G. Kolda. MATLAB Tensor Toolbox Ver. 2.2, 2007.
A. H. Phan and A. Cichocki. PARAFAC algorithms for large-scale problems. Neurocomputing, 74(11):1970-1984, 2011.

Page 24:

Experimental Setup (rank)

Rank-12 decomposition: DBN uses 6 combinations (1×12, 2×6, 3×4, 4×3, 6×2, and 12×1).

Page 25:

Experimental Setup (H/W and S/W)

H/W: 6-core Intel(R) Xeon(R) CPU X5355 @ 2.66GHz with 24GB of RAM.

S/W: MATLAB Version 7.11.0.584 (R2010b) 64-bit (glnxa64) for the general implementation; MATLAB Parallel Computing Toolbox for the parallel implementation of DBN and NNCP.

Page 26:

Key Results: Running Time (Dense Tensor) (Case 1: the join attribute X determines only a subset of the attributes of the relation R (|R|=|R2|, |R1|<=|R2|))

NNCP vs. DBN; NNCP vs. DBN with parallelization

Page 27:

Key Results: Running Time (Sparse Tensor) (Case 1: the join attribute X determines only a subset of the attributes of the relation R (|R|=|R2|, |R1|<=|R2|))

NNCP vs. DBN and DBN2,3 (DBN2,3: DBN with intraFD-based rank pruning); NNCP vs. DBN with parallelization

Page 28:

Key Results: Running Time (Case 2: the join attribute X determines all attributes of the relation R (|R|=|R1|=|R2|))

Dense tensor: NNCP vs. DBN3 with parallelization

Sparse tensor: NNCP vs. DBN3 with parallelization

NOTE: In both cases, most data points lie under the diagonal, indicating that DBN outperforms NNCP.

Page 29:

Key Results: Accuracy

NOTE: the higher, the better.

Page 30:

InterFD-based vertical partitioning

Note: the higher the value, the closer to the optimal partitioning strategy.

Page 31:

IntraFD-based rank pruning strategy

Note: the higher the value, the better the intraFD-based rank pruning works.

Page 32:

Conclusions

The lifecycle of data requires capture, integration, projection, decomposition, and analysis.

Tensor decomposition is a costly operation.

We proposed:

a highly efficient, effective, and easily parallelizable decomposition-by-normalization strategy for approximately evaluating decompositions;

interFD-based partitioning and intraFD-based rank pruning strategies.