46
A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Embed Size (px)

Citation preview

Page 1: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

A compression-boostingtransform for 2D data

Qiaofeng Yang Stefano LonardiUniversity of California, Riverside

Page 2: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Problem

• Given a matrix A over a binary alphabet

• Find an invertible transform T such that T(A) is more “compressible” than A

• Idea: try a reordering of the rows and columns of A

Page 3: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Applications

• Lossless compression of binary matrices (i.e., binary digital images)

• Evaluation of the randomness of a graph (represented by its adjacency matrix)

Page 4: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Compression-boosting transform

• Design an invertible transform that improves compression

• The transform must expose the “patterns” in the matrix

• [Storer and Helfgott 97] find uniform blocks to compress images (lossless)

Page 5: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Redundancy in binary matrices

• Look for uniform submatrices

• A uniform submatrix is a submatrix induced by a subset of rows and columns (i.e., not necessarily contiguous) solely composed by the same symbol (either 0 or 1)

Page 6: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

a

b

c

a b c

originalmatrix

findlargestuniformsubmatrix

reorder

decompose

Outline of the direct transform

Page 7: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Direct transform

• Find the largest uniform submatrix in A• Reorder the rows and the columns such that the

uniform submatrix is moved to the upper-left corner

• Recursively apply the transform on the rest of the matrix

• Stop when the partition produces a matrix which is smaller than rxc (r,c predetermined thresholds)

Page 8: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Does it help? Maybe …

• Each uniform submatrix can be represented by one bit and a list of its rows/columns

• If a matrix can be decomposed in a small number of uniform submatrices, the compression should improve

Page 9: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Does it help? Maybe not…

• Typically, in order to get good compression of 2D data (e.g., digital images) one has to exploit dependencies between adjacent rows and columns

• The reordering can “break” local dependencies between adjacent rows or columns, therefore could affecting negatively the compression

Page 10: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Complexity of the direct transform

• The computation cost depends on the complexity of finding the largest uniform submatrix

• This problem is a special case of the biclustering problem

Page 11: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Some related works

• Hartigan, ‘72• Aggarwal et al., SIGMOD’99 • Cheng & Church, ISMB’00• Wang et al., SIGMOD’02• Ben-Dor et al., RECOMB’02• Tanay et al., ISMB’02• Procopiuc et al., SIGMOD’02• Murali & Kasif, PSB’03• Sheng et al., ECCB’03• Mishra et al., COLT’03• Lonardi et al., CPM’04

Page 12: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Problem definition

LARGESTUNIFORMSUBMATRIX

• Instance: A matrix• Question: Find a row selection R and a

column selection C such that A(R, C) is uniform and |R||C| is maximized

This problem is also called Maximum Edge Biclique, and is computationally hard

{0,1}n mA

Page 13: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Randomized search

* * * * * *

* *

* *

Assume contains a maximal uniform submatrix

( , ). Assume that and are

known. Observation:

If we knew , then could be obtained

If we knew , then could be obtained

A

R C R r C c

R C

C R

Page 14: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Randomized search (step 1)

Select a random subset S of size k uniformly from the set of columns{1,2,…, m}

1

1

1

1

1

1

1

1

1

1

1

1

0 0 1

0

0

0 0

0

1

1

0

1

0

1

0

0

1

0 1

0 0 1 00

S

C*

Page 15: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Randomized search (step 2)

For all the subset U of S, check whether string 1|U| (resp., 0|U|) appears at least r times and record the rows R1 (resp., R0) in which it occurs

1

Page 16: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Randomized search (step 3)

• Given R1 (resp., R0) select the set of columns C which read 1 (resp., 0)

• Check whether C c 1

1

1

1

1

1

1

1

1

1

1

1

00 00 11

0

00

00 00

00

11

11

00

11

00

11

00

00

11

00 1

00 00 11 0000

C

Page 17: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Randomized search (step 4)

Save the solutions (R1,C) and (R0,C) and repeat step 1 to 4 for t iterations

Page 18: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Parameters

• Projection size k

• Column threshold c

• Row threshold r

• Number of iterations t

• See [Lonardi ‘04] for details on how to choose k, r and c

Page 19: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Selecting the number of iterations t

• We can miss a solution in two cases– S completely misses C*– when S overlaps C*, but the string 11…1

(00…0) selected by the algorithm also appears in a row outside R*

Page 20: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Selecting the number of iterations t

• The probability of missing the solution in one iteration is

which is

Page 21: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Selecting the number of iterations t

• Given the probability of missing the solution in t iterations to be smaller than ε

• For the experiments, we fixed the number of iterations to t=1,000 and t=10,000

Page 22: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Recursive decomposition

• Input: binary matrix A, row/column thresholds r/c, iterations t (or error ε)

• Run LARGESTUNIFORMSUBMATRIX on A• Reorder A and decompose A into four

smaller matrices U,a,b,c where is U is the uniform submatrix

• Should we recursively apply the transform to (a,b,c), (a+c,b) or (a,b+c) ?

Page 23: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

a

b

c

a b c

originalmatrix

findlargestuniformsubmatrix

reorder

decompose

Outline of the direct transform

Page 24: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

a

b

c

originalmatrix

findlargestuniformsubmatrix

reorder

(a,b+c)decomposition

a

a b

(a+c,b)decomposition

b

cc

Refined transform

Page 25: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Is (a+c,b) better than (a,b+c)?

• Let us call A1, A2, B1, B2 the area of the uniform submatrix found in a+c, b, a, and b+c respectively

• Choose decomposition (a+c,b) if– SUM: A1+A2>B1+B2

– MAX: max{A1,A2}>max{B1,B2}

– INDIV: A1>B2

Page 26: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Implementation

• Each uniform submatrix can be represented by one bit

• Non-decomposable matrices are saved in row-major order

• The content of both types of matrices is saved in a file called strings

Page 27: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Implementation

• Row and column indices for uniform and non-decomposable matrices are saved in a file called index

• For each set of rows/column indices the first index is saved as is, while the other are saved as difference between adjacent indices

Page 28: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Implementation

• The number of rows and columns for uniform and non-decomposable matrices is saved in a file called length

• The files string, index and length allow one to invert the transform

• Note that the complexity of the inverse transform is linear in the size of the matrix

Page 29: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Experiments on synthetic data

• We generated datasets composed by four random 256x256 binary matrices

• In each matrix of a dataset we embedded 1, 2, 3 and 4 uniform submatrices of size 64x64

• For each matrix we compared the compression size before and after the transform

Page 30: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Experiments on synthetic data

Matrixi is a random 256x256 binary matrix and containsi uniform 64x64 submatrices. Parameters are r=c=10, t=10,000

Page 31: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Experiments on synthetic data

• Failure in recovering all the embedded submatrices is typically due to the recursive partitioning

• If the partitioning happens to split an embedded submatrix, there is no hope in recovering it

Page 32: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Experiments on compressing images

Page 33: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Compressing images

Page 34: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Experiments on images

Parameters are r=c=60, t=10,000

Page 35: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Experiments on images

• How does the performance depend on– the partitioning strategy?– the row and column thresholds?– the number of iterations?

• The next graphs are about the image bird

Page 36: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Size as a function of decomposition strategy

1500

1600

1700

1800

1900

2000

2100

2200

10 20 30 40 50 60 70 80 90 100 110 120

Threshold

Fin

al

size

(b

yte

s)

sum+gzip sum+bzip2

indiv+gzip indiv+bzip2

max+gzip max+bzip2

Page 37: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Average area of uniform submatrices

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

10 20 30 40 50 60 70 80 90 100 110 120

Threshold

Av

era

ge

are

a

t=1,000 t=10,000

Page 38: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Number of uniform submatrices

0

5

10

15

20

25

30

35

40

45

10 20 30 40 50 60 70 80 90 100 110 120

Threshold

Co

un

t

t=1,000 t=10,000

Page 39: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Proportion of uniform submatrices

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

10 20 30 40 50 60 70 80 90 100 110 120

Threshold

Pro

po

rtio

n

t=1,000 t=10,000

Page 40: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Final compressed size

1500

1600

1700

1800

1900

2000

2100

2200

2300

10 20 30 40 50 60 70 80 90 100 110 120

Threshold

Fin

al s

ize

(b

yte

s)

gzip t=1,000

gzip t=10,000

bzip t=1,000

bzip t=10,000

Page 41: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Findings

• Proposed a transform to boost compression of 2D binary data

• The direct transform is hard, but a randomized algorithm can be used

• The inverse transform is very fast

• The transform boosts compression

Page 42: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Randomness of a graph

• Problem: determine whether a graph G is “random” or not (and how much)

• Use the Kolmogorov complexity

• Use the compressibility of the adjacency matrix of G to bound the Kolmogorov complexity of G

Page 43: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Page 44: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

The general problem

• Biclustering is the problem of finding a partition of the vectors and a subset of the dimensions such that the projections along those directions of the vectors in each cluster are close to one another

• The problem requires to cluster the vectors and the dimensions simultaneously, thus the name “biclustering”

Page 45: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Applications of biclustering

• Collaborative filtering and recommender systems

• Finding web communities

• Discovery association rules in databases

• Gene expression analysis

• …

Page 46: A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside

Haifa Stringology Research Workshop 2005Haifa Stringology Research Workshop 2005

Pseudo-code