Format Abstraction for Sparse Tensor Algebra Compilers Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe
2
Sparse tensors are a natural way of representing real-world data
[Figure: a sparse 3rd-order tensor of product reviews, with modes Users (Jack, Peter, Paul, Mary, Bob, Sam, Billy, Lilly, Hilde, ...), Products (Kindle, Dubliners, The Iliad, Monitor, Sweater, Laptop, Candide, ...), and Words (Quality, Durable, Poor, ...); almost all entries are zero.]
Dense storage: 107 exabytes. Sparse storage: 13 gigabytes.
3
Many different formats for storing tensors exist
[Figure: a landscape of tensor storage formats and the applications that use them.]
Vector formats: dense array, sparse vector, hash map
Matrix formats: dense array, coordinate, CSR, CSC, DCSR, DCSC, CSB, BCSR, BCOO, ELLPACK, SELL, BELL, LIL, DIA, block DIA, skyline
Higher-order tensor formats: dense array, coordinate, CSF, HiCOO, mode-generic, F-COO
Applications: thermal simulation, unstructured mesh simulation [Bell and Garland 2009], CNNs with block-sparse weights [Gray et al. 2017], image processing, data analytics
4
There is no universally superior tensor format
[Excerpt from the taco paper (Kjolstad, Kamil, Chou, Lugato, and Amarasinghe, OOPSLA 2017, p. 77:22), Fig. 18: performance of matrix-vector multiplication in taco on matrices with distinct sparsity patterns (dense, row-slicing, thermal, hypersparse, column-slicing, blocked). The left half of each subfigure depicts the sparsity pattern of the matrix, while the right half shows the normalized storage costs and normalized average execution times (relative to the optimal format) for the storage formats on the horizontal axis (DD, DS, SD, SS, SDᵀ, and blocked variants). The format labels follow the scheme described in Section 3; for instance, DS is short for (dense d1, sparse d2), while SDᵀ is equivalent to (sparse d2, dense d1). The dense matrix input has a density of 0.95, the hypersparse matrix 2.5 × 10⁻⁵, the row-slicing and column-slicing matrices 9.5 × 10⁻³, and the thermal and blocked matrices 1.0 × 10⁻³. No one format wins everywhere: a format that is optimal on one pattern can be orders of magnitude worse on another (e.g., up to 26542× on the hypersparse matrix).]
... and tensor-times-vector multiplication with matrices and 3rd-order tensors of varying sparsities as inputs. The tensors are randomly generated with every component having some probability d of being non-zero, where d is the density of the tensor (i.e., the fraction of components that are non-zero, and the complement of sparsity). As Fig. 17 shows, while computing with sparse tensor storage formats incurs some performance penalty compared to the same computation with dense formats when the inputs are highly dense, the penalty decreases and eventually turns into a performance gain as the sparsity of the inputs increases. For the two computations we evaluate, we observe that an input sparsity as low as approximately 35% is sufficient to make sparse formats that compress out all zeros, including DCSR and CSF, perform better than dense formats, which further emphasizes the practicality of sparse tensor storage formats.
[Proc. ACM Program. Lang., Vol. 1, No. OOPSLA, Article 77. Publication date: October 2017.]
[Chart: normalized run time (0.0 to 1.5) of y = Ax on three matrices, comparing CSR against DIA and CSR against BCSR; one of the CSR/DIA comparisons shows a 186× gap.]
6
[Kjolstad et al. 2017] This work
[Diagram: in Kjolstad et al. 2017, the code generator directly handles tensors stored with dense and compressed levels, described by pos and crd arrays (e.g., a CSR matrix with pos = [0 2 4 4] and crd = [0 1 1 0]). This work inserts a format abstraction between tensor formats and the code generator, so the same code generator also supports formats built from other level types, such as DIA storage described by offset arrays (e.g., offset = [-3 0 1]).]
Format abstraction
7
Storing sparse tensors efficiently requires additional metadata
[Figure: a 3×4 matrix with nonzeros A at (0,0), B at (0,2), C at (1,1), D at (1,2), E at (1,3), and F at (2,3), stored as a dense array of 12 values at positions 0–11.]
With dense row-major storage, each coordinate is implicit in the position:
row(6) = 6 / 4 = 1
col(6) = 6 % 4 = 2
and, conversely, locate maps a coordinate back to a position:
locate(1,2) = 1 * 4 + 2 = 6
8
Storing sparse tensors efficiently requires additional metadata
[Figure: the same 3×4 matrix stored compactly as just its six nonzero values A B C D E F at positions 0–5.]
Once the zeros are compressed out, positions alone no longer identify coordinates:
row(3) = ???
col(3) = ???
9
Coordinates of tensor elements can be encoded in many ways
[Figure: the 3×4 example matrix with nonzeros A–F, encoded in two formats.]
Coordinate: store the full coordinates of every nonzero.
vals: A B C D E F
rows: 0 0 1 1 1 2
cols: 0 2 1 2 3 3
CSR: compress the row coordinates into a positions array, where pos[i] and pos[i+1] delimit the entries of row i.
vals: A B C D E F
pos:  0 2 5 6
cols: 0 2 1 2 3 3
10
Computing with different formats can require very different code: A = B ∘ C

Coordinate ✕ Dense array:
for (int pB = B1_pos[0]; pB < B1_pos[1]; pB++) {
  int i = B1_crd[pB];
  int j = B2_crd[pB];
  int pC = i * N + j;
  int pA = i * N + j;
  A[pA] = B[pB] * C[pC];
}

CSR ✕ Dense array:
for (int i = 0; i < M; i++) {
  for (int pB = B2_pos[i]; pB < B2_pos[i + 1]; pB++) {
    int j = B2_crd[pB];
    int pC = i * N + j;
    int pA = i * N + j;
    A[pA] = B[pB] * C[pC];
  }
}

CSR ✕ Coordinate:
int pC1 = C1_pos[0];
while (pC1 < C1_pos[1]) {
  int i = C1_crd[pC1];
  int C1_segend = pC1 + 1;
  while (C1_segend < C1_pos[1] && C1_crd[C1_segend] == i)
    C1_segend++;
  int pB2 = B2_pos[i];
  int pC2 = pC1;
  while (pB2 < B2_pos[i + 1] && pC2 < C1_segend) {
    int jB2 = B2_crd[pB2];
    int jC2 = C2_crd[pC2];
    int j = min(jB2, jC2);
    int pA = i * N + j;
    if (jB2 == j && jC2 == j)
      A[pA] = B[pB2] * C[pC2];
    if (jB2 == j) pB2++;
    if (jC2 == j) pC2++;
  }
  pC1 = C1_segend;
}
11
Hand-coding support for a wide range of formats is infeasible: every kernel must be specialized to the combination of its operands' formats.

A = B ∘ C: Dense array ✕ Dense array, Coordinate ✕ Dense array, Coordinate ✕ Coordinate, CSR ✕ Dense array, CSR ✕ Coordinate, CSR ✕ CSR, DIA ✕ Dense array, DIA ✕ Coordinate, DIA ✕ CSR, DIA ✕ DIA, ELLPACK ✕ Dense array, ELLPACK ✕ Coordinate, ELLPACK ✕ CSR, ELLPACK ✕ DIA, ELLPACK ✕ ELLPACK, BCSR ✕ Dense array, BCSR ✕ Coordinate, BCSR ✕ BCSR, ...

A = B ∘ C ∘ D: Dense array ✕ CSR ✕ CSR, Coordinate ✕ CSR ✕ CSR, CSR ✕ CSR ✕ CSR, Dense array ✕ Coordinate ✕ CSR, Dense array ✕ Dense array ✕ CSR, Coordinate ✕ Coordinate ✕ CSR, DIA ✕ Coordinate ✕ Dense array, DIA ✕ Coordinate ✕ CSR, DIA ✕ Dense array ✕ CSR, DIA ✕ CSR ✕ CSR, DIA ✕ Coordinate ✕ Coordinate, DIA ✕ Dense array ✕ Dense array, DIA ✕ DIA ✕ CSR, DIA ✕ DIA ✕ Coordinate, DIA ✕ DIA ✕ Dense array, DIA ✕ DIA ✕ DIA, ELLPACK ✕ ELLPACK ✕ DIA, ELLPACK ✕ CSR ✕ DIA, ELLPACK ✕ BCSR ✕ DIA, ...

y = Ax + z: Dense array ✕ Dense array ✕ Dense array, Dense array ✕ Dense array ✕ Sparse vector, Dense array ✕ Dense array ✕ Hash map, Dense array ✕ Sparse vector ✕ Sparse vector, Dense array ✕ Sparse vector ✕ Hash map, Dense array ✕ Hash map ✕ Sparse vector, Dense array ✕ Sparse vector ✕ Dense array, Coordinate ✕ Dense array ✕ Dense array, Coordinate ✕ Sparse vector ✕ Dense array, Coordinate ✕ Dense array ✕ Hash map, Coordinate ✕ Sparse vector ✕ Hash map, Coordinate ✕ Hash map ✕ Sparse vector, CSR ✕ Dense array ✕ Dense array, CSR ✕ Dense array ✕ Sparse vector, CSR ✕ Hash map ✕ Sparse vector, CSR ✕ Hash map ✕ Dense array, CSR ✕ Sparse vector ✕ Dense array, DIA ✕ Dense array ✕ Dense array, DIA ✕ Hash map ✕ Dense array, ELLPACK ✕ Dense array ✕ Sparse vector, ...
12
Evaluation
Formats in the abstraction are composed from per-dimension level types, for example:
Mode-generic tensor: Compressed, Singleton, Dense, Dense
DIA: Dense, Range, Offset
123:16 Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe
[Flowchart residue: the left flowchart chooses among three strategies (co-iterate over x and y; iterate over x and locate into y; iterate over y and locate into x) based on whether x and y support locate and whether x is unordered while y is ordered. The right flowchart determines, per operand, what iterator conversions are needed: reorder an operand if it is unordered but must be ordered or accessed with locate, and aggregate duplicates if a non-unique operand is co-iterated.]
Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the right describes, for each operand, what iterator conversions are needed at runtime to compute the merge.
If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y. Lines 2–9 in Figure 1b show another example of this method, applied to merge the column dimensions of a CSR matrix and a dense matrix. This alternative method reduces the merge complexity from O(nnz(x) + nnz(y)) to O(nnz(x)), assuming locate runs in constant time. Moreover, this method does not require enumerating the coordinates of y in order. We do not even need to enumerate the coordinates of x in order, as long as there are no duplicates and we do not need to compute output components in order (e.g., if the output supports the insert capability). This method is thus ideal for computing intersection merges of unordered levels.
We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands. At a high level, any merge can be computed by co-iterating over some subset of its operands and, for every enumerated coordinate, locating that same coordinate in all the remaining operands with calls to locate. Which operands need to be co-iterated can be identified recursively from the expression expr that we want to compute. In particular, for each subexpression e = e1 op e2 in expr, let Coiter(e) denote the set of operand coordinate hierarchy levels that need to be co-iterated in order to compute e. If op is an operation that requires a union merge (e.g., addition), then computing e requires co-iterating over all the levels that would have to be co-iterated in order to separately compute e1 and e2; in other words, Coiter(e) = Coiter(e1) ∪ Coiter(e2). On the other hand, if op is an operation that requires an intersection merge (e.g., multiplication), then the set of coordinates of nonzeros in the result e must be a subset of the coordinates of nonzeros in either operand e1 or e2. Thus, in order to enumerate the coordinates of all nonzeros in the result, it suffices to co-iterate over all the levels merged by just one of the operands. Without loss of generality, this lets us compute e without having to co-iterate over levels merged by e2 that can instead be accessed with locate; in other words, Coiter(e) = Coiter(e1) ∪ (Coiter(e2) \ LocateCapable(e2)), where LocateCapable(e2) denotes the set of levels merged by e2 that support the locate capability.
4.5 Code Generation Algorithm
Figure 9a shows our code generation algorithm, which incorporates all of the concepts we presented in the previous subsections. Each part of the algorithm is labeled from 1 to 11; throughout the
Proc. ACM Program. Lang., Vol. 2, No. OOPSLA, Article 123. Publication date: November 2018.
Format Abstraction & Code Generation
12
Evaluation
123:16 Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe
[Flowchart from Figure 8: the left panel selects among "co-iterate over x and y", "iterate over x and locate into y", and "iterate over y and locate into x", based on whether x and y support locate and on whether x is unordered and y ordered. The right panel determines, for each operand, whether a converted iterator is needed: an operand that is not ordered or accessed with locate may need to be reordered (unless the output is unordered or supports insert), and an operand that is not unique may need its duplicates aggregated.]
Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the right describes, for each operand, what iterator conversions are needed at runtime to compute the merge.
13
[Figure: a sparse tensor with nonzero values A–J viewed as a coordinate hierarchy. Successive builds highlight its slices, rows, and columns and show the underlying storage arrays (pos = [0 3 5 9]; crd arrays [0 1 1 0 1 0 0 1 1] and [0 0 2 3 1 1 3 0 3]; values A B C D E F G H J at positions 0–8), with the levels labeled by the Compressed, Singleton, and Dense level formats.]
Tensor formats can be viewed as compositions of level formats
14
[Figure: a small matrix with nonzeros A–F. Storing its rows with a Dense level and its columns with a Compressed level (pos and crd arrays over the values A B C D E F) yields the CSR format.]
The same level formats can be composed in many ways
15
[Figure: the same matrix stored with a Compressed level of row coordinates and a Singleton level of column coordinates, which yields the Coordinate (COO) format.]
The same level formats can be composed in many ways
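The two compositions above can be made concrete with a short Python sketch; this is an illustration of the array layouts, with function names of my own choosing, not taco's API:

```python
def to_csr(dense):
    """Dense row level + Compressed column level: the CSR arrays.
    pos[i]..pos[i+1] delimits row i's entries in crd/vals."""
    pos, crd, vals = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                crd.append(j)
                vals.append(v)
        pos.append(len(crd))
    return pos, crd, vals

def to_coo(dense):
    """Compressed row level + Singleton column level: the COO arrays.
    Every nonzero stores its own row and column coordinate."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return [0, len(vals)], rows, cols, vals
```

Both encode exactly the same nonzeros; they differ only in how the row dimension is represented.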
16
Level formats: Dense, Compressed, Singleton, Hashed, Range, Offset

Tensor formats as compositions of level formats:
- CSR: Dense, Compressed [Tinney and Walker 1967]
- Coordinate matrix: Compressed, Singleton
- Coordinate tensor: Compressed, Singleton, Singleton
- Dense array tensor: Dense, Dense, Dense
- BCSR: Dense, Compressed, Dense, Dense [Im and Yelick 1998]
- ELLPACK: Dense, Dense, Singleton [Kincaid et al. 1989]
- CSB: Dense, Dense, Compressed, Singleton [Buluç et al. 2009]
- Mode-generic tensor: Compressed, Singleton, Dense, Dense [Baskaran et al. 2012]
- DIA: Dense, Range, Offset [Saad 2003]
- Block DIA: Dense, Range, Offset, Dense, Dense
- Hash map vector: Hashed
- Hash map matrix: Hashed, Hashed [Patwary et al. 2015]

The same level formats can be composed in many ways
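To make the composition idea concrete, here is a small Python sketch (a simplified model of my own, not taco's implementation) in which each level format answers one question: given a position in the parent level, which (position, coordinate) pairs are its children. Different stacks of the same level classes then enumerate the same nonzeros:

```python
class Dense:
    def __init__(self, n):
        self.n = n
    def children(self, p):
        # A dense level materializes every coordinate 0..n-1.
        return [(p * self.n + i, i) for i in range(self.n)]

class Compressed:
    def __init__(self, pos, crd):
        self.pos, self.crd = pos, crd
    def children(self, p):
        # pos[p]..pos[p+1] delimits the stored coordinates of parent p.
        return [(q, self.crd[q]) for q in range(self.pos[p], self.pos[p + 1])]

class Singleton:
    def __init__(self, crd):
        self.crd = crd
    def children(self, p):
        # Exactly one coordinate per parent position.
        return [(p, self.crd[p])]

def enumerate_coords(levels, vals, p=0, prefix=()):
    """Walk a stack of levels, yielding (coordinates, value) pairs."""
    if not levels:
        yield prefix, vals[p]
        return
    for q, i in levels[0].children(p):
        yield from enumerate_coords(levels[1:], vals, q, prefix + (i,))
```

With vals = [2.0, 3.0, 4.0], the stack [Dense(2), Compressed([0, 1, 3], [1, 0, 2])] (a CSR-style composition) and the stack [Compressed([0, 3], [0, 1, 1]), Singleton([1, 0, 2])] (a COO-style composition) enumerate the same three nonzeros.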
17
Tensor Algebra Compiler (taco) [Kjolstad et al. 2017]

$A_{ij} = \sum_k B_{ijk} c_k$, with A : (dense, dense), B : (dense, compressed, compressed), c : (compressed)

for (int i = 0; i < m; i++) {
  int pB1 = i;  // first level of B is dense, so its position equals the coordinate i
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) {
    int j = B2_idx[pB2];
    int pA2 = (i * n) + j;
    int pB3 = B3_pos[pB2];
    int pc1 = c1_pos[0];
    while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) {
      int kB = B3_idx[pB3];
      int kc = c1_idx[pc1];
      int k = min(kB, kc);
      if (kB == k && kc == k) {
        a[pA2] += b[pB3] * c[pc1];
      }
      if (kB == k) pB3++;
      if (kc == k) pc1++;
    }
  }
}
18
taco generates code dimension by dimension

$A_{ij} = \sum_k B_{ijk} \cdot c_k$

for (int i = 0; i < m; i++) {
  int pB1 = i;  // first level of B is dense, so its position equals the coordinate i
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) {
    int j = B2_idx[pB2];
    int pA2 = (i * n) + j;
    int pB3 = B3_pos[pB2];
    int pc1 = c1_pos[0];
    while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) {
      int kB = B3_crd[pB3];
      int kc = c1_crd[pc1];
      int k = min(kB, kc);
      if (kB == k && kc == k) {
        A[pA2] += B[pB3] * c[pc1];
      }
      if (kB == k) pB3++;
      if (kc == k) pc1++;
    }
  }
}
taco generates code dimension by dimension
18
for (int i = 0; i < m; i++) {
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) {
    int j = B2_idx[pB2];
    int pA2 = (i * n) + j;
    int pB3 = B3_pos[pB2];
    int pc1 = c1_pos[0];
    while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) {
      int kB = B3_crd[pB3];
      int kc = c1_crd[pc1];
      int k = min(kB, kc);
      if (kB == k && kc == k) {
        A[pA2] += B[pB3] * c[pc1];
      }
      if (kB == k) pB3++;
      if (kc == k) pc1++;
    }
  }
}
A_ij = Σ_k B_ijk · c_k
19
Hand-coding support for a wide range of level formats is also infeasible
Dense × Compressed
Dense + Compressed
Compressed × Hashed
Compressed + Singleton
Dense + Range
Compressed + Offset
Dense × Compressed × Hashed
Compressed × Hashed × Singleton
Compressed × Singleton + Dense
20
Code generation is performed in two stages
High-level algorithm
Runnable code
Compressed × Hashed
How to compute with different data structures
How to compute with multiple operands
21
Tensor algebra computations can be expressed in terms of high-level operations on tensor operands
[Figure: coordinate hierarchies of the tensor operands, with labeled nodes]
22
Level formats declare whether they support various high-level operations
Random access Iteration
Dense
Compressed
Hashed
Singleton
Range
Offset
23
Compiler constructs efficient algorithm by reasoning about whether operands support required high-level operations
Dense
Compressed
Hashed
Singleton
Range
Offset
Random access
B ∘ C
Compressed × Hashed: iterate over B and random access C
Compressed × Singleton: simultaneously iterate over B and C
24
Level formats also specify how they support high-level operations

Random access

Dense:
  int pB2 = pB1 * N + j;

Hashed:
  int pB2 = j % W + pB1 * W;
  if (crd[pB2] != j && crd[pB2] != -1) {
    int end = pB2;
    do {
      pB2 = (pB2 + 1) % W;
    } while (crd[pB2] != j && crd[pB2] != -1 && pB2 != end);
  }
  if (crd[pB2] == j) {

Iteration

Dense:
  for (int j = 0; j < N; j++) {
    int pB2 = pB1 * N + j;

Compressed:
  for (int pB2 = pos[pB1]; pB2 < pos[pB1+1]; pB2++) {
    int j = crd[pB2];

Hashed:
  for (int pB2 = pB1 * W; pB2 < (pB1 + 1) * W; pB2++) {
    int j = crd[pB2];
    if (j != -1) {

...
25
Compiler specializes constructed algorithm to operand formats by inlining code that implements required high-level operations

B ∘ C (Compressed × Hashed)

High-level algorithm:
  for every element b in B:
    find corresponding element c in C
    A[i][j] = b * c;

After inlining the compressed iteration over B:
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1+1]; pB2++) {
    int j = B2_crd[pB2];
    find corresponding element c in C
    A[i][j] = B[pB2] * c;
  }

After also inlining the hashed random access into C:
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1+1]; pB2++) {
    int j = B2_crd[pB2];
    int pC2 = j % W + pC1 * W;
    if (C2_crd[pC2] != j && C2_crd[pC2] != -1) {
      int end = pC2;
      do {
        pC2 = (pC2 + 1) % W;
      } while (C2_crd[pC2] != j && C2_crd[pC2] != -1 && pC2 != end);
    }
    if (C2_crd[pC2] == j) {
      A[i][j] = B[pB2] * C[pC2];
    }
  }
26
The same process can be repeated dimension by dimension
int iB = 0; int C0_pos = C0_pos_arr[0]; while (C0_pos < C0_pos_arr[1]) { int iC = C0_idx_arr[C0_pos]; int C0_end = C0_pos + 1; if (iC == iB) while ((C0_end < C0_pos_arr[1]) && (C0_idx_arr[C0_end] == iB)) { C0_end++; } if (iC == iB) { int B1_pos = B1_pos_arr[iB]; int C1_pos = C0_pos; while ((B1_pos < B1_pos_arr[iB + 1]) && (C1_pos < C0_end)) { int jB = B1_idx_arr[B1_pos]; int jC = C1_idx_arr[C1_pos]; int j = min(jB, jC); int A1_pos = (iB * A1_size) + j; int C1_end = C1_pos + 1; if (jC == j) while ((C1_end < C0_end) && (C1_idx_arr[C1_end] == j)) { C1_end++; } if ((jB == j) && (jC == j)) { int B2_pos = B2_pos_arr[B1_pos]; int C2_pos = C1_pos; while ((B2_pos < B2_pos_arr[B1_pos + 1]) && (C2_pos < C1_end)) { int kB = B2_idx_arr[B2_pos]; int kC = C2_idx_arr[C2_pos]; int k = min(kB, kC); int A2_pos = (A1_pos * A2_size) + k; if ((kB == k) && (kC == k)) { A_val_arr[A2_pos] = B_val_arr[B2_pos] + C_val_arr[C2_pos]; } else if (kB == k) { A_val_arr[A2_pos] = B_val_arr[B2_pos]; } else { A_val_arr[A2_pos] = C_val_arr[C2_pos]; } if (kB == k) B2_pos++; if (kC == k) C2_pos++; } while (B2_pos < B2_pos_arr[B1_pos + 1]) { int kB0 = B2_idx_arr[B2_pos]; int A2_pos0 = (A1_pos * A2_size) + kB0; A_val_arr[A2_pos0] = B_val_arr[B2_pos]; B2_pos++; } while (C2_pos < C1_end) { int kC0 = C2_idx_arr[C2_pos]; int A2_pos1 = (A1_pos * A2_size) + kC0; A_val_arr[A2_pos1] = C_val_arr[C2_pos]; C2_pos++; } } else if (jB == j) { for (int B2_pos0 = B2_pos_arr[B1_pos]; B2_pos0 < B2_pos_arr[B1_pos + 1]; B2_pos0++) { int kB1 = B2_idx_arr[B2_pos0]; int A2_pos2 = (A1_pos * A2_size) + kB1; A_val_arr[A2_pos2] = B_val_arr[B2_pos0]; } } else { for (int C2_pos0 = C1_pos; C2_pos0 < C1_end; C2_pos0++) { int kC1 = C2_idx_arr[C2_pos0]; int A2_pos3 = (A1_pos * A2_size) + kC1; A_val_arr[A2_pos3] = C_val_arr[C2_pos0]; } } if (jB == j) B1_pos++; if (jC == j) C1_pos = C1_end; }
while (B1_pos < B1_pos_arr[iB + 1]) { int jB0 = B1_idx_arr[B1_pos]; int A1_pos0 = (iB * A1_size) + jB0; for (int B2_pos1 = B2_pos_arr[B1_pos]; B2_pos1 < B2_pos_arr[B1_pos + 1]; B2_pos1++) { int kB2 = B2_idx_arr[B2_pos1]; int A2_pos4 = (A1_pos0 * A2_size) + kB2; A_val_arr[A2_pos4] = B_val_arr[B2_pos1]; } B1_pos++; } while (C1_pos < C0_end) { int jC0 = C1_idx_arr[C1_pos]; int A1_pos1 = (iB * A1_size) + jC0; int C1_end0 = C1_pos + 1; while ((C1_end0 < C0_end) && (C1_idx_arr[C1_end0] == jC0)) { C1_end0++; } for (int C2_pos1 = C1_pos; C2_pos1 < C1_end0; C2_pos1++) { int kC2 = C2_idx_arr[C2_pos1]; int A2_pos5 = (A1_pos1 * A2_size) + kC2; A_val_arr[A2_pos5] = C_val_arr[C2_pos1]; } C1_pos = C1_end0; } } else { for (int B1_pos0 = B1_pos_arr[iB]; B1_pos0 < B1_pos_arr[iB + 1]; B1_pos0++) { int jB1 = B1_idx_arr[B1_pos0]; int A1_pos2 = (iB * A1_size) + jB1; for (int B2_pos2 = B2_pos_arr[B1_pos0]; B2_pos2 < B2_pos_arr[B1_pos0 + 1]; B2_pos2++) { int kB3 = B2_idx_arr[B2_pos2]; int A2_pos6 = (A1_pos2 * A2_size) + kB3; A_val_arr[A2_pos6] = B_val_arr[B2_pos2]; } } } if (iC == iB) C0_pos = C0_end; iB++; } while (iB < B0_size) { for (int B1_pos1 = B1_pos_arr[iB]; B1_pos1 < B1_pos_arr[iB + 1]; B1_pos1++) { int jB2 = B1_idx_arr[B1_pos1]; int A1_pos3 = (iB * A1_size) + jB2; for (int B2_pos3 = B2_pos_arr[B1_pos1]; B2_pos3 < B2_pos_arr[B1_pos1 + 1]; B2_pos3++) { int kB4 = B2_idx_arr[B2_pos3]; int A2_pos7 = (A1_pos3 * A2_size) + kB4; A_val_arr[A2_pos7] = B_val_arr[B2_pos3]; } } iB++; }
A_ijk = B_ijk + C_ijk
27
Evaluation
[Slide figure: example format compositions — the mode-generic tensor format built from compressed, singleton, and dense levels, and DIA built from dense, range, and offset levels.]
Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the left selects a merge strategy based on which operands support locate: co-iterate over x and y, iterate over x and locate into y, or iterate over y and locate into x. The flowchart on the right describes, for each operand, what iterator conversions (reordering an unordered operand, aggregating duplicates) are needed at runtime to compute the merge.
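The strategy-selection flowchart can be read as a small decision procedure. The sketch below is our own illustration of one plausible reading of it, not taco's actual implementation; all names (`Level`, `Strategy`, `pick_strategy`) are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical summary of a vector level's capabilities and properties. */
typedef struct {
  bool locate;   /* supports the locate capability (e.g., a dense array) */
  bool ordered;  /* coordinates are enumerated in order */
} Level;

typedef enum { COITERATE, ITERATE_X_LOCATE_Y, ITERATE_Y_LOCATE_X } Strategy;

/* Pick a strategy for an intersection merge of vectors x and y. */
Strategy pick_strategy(Level x, Level y) {
  if (x.locate) {
    if (y.locate) {
      /* Both support locate: prefer iterating over the unordered operand,
       * since locate spares us from having to sort it first. */
      if (!x.ordered && y.ordered) return ITERATE_X_LOCATE_Y;
      return ITERATE_Y_LOCATE_X;
    }
    return ITERATE_Y_LOCATE_X;  /* iterate over y, locate into x */
  }
  if (y.locate) return ITERATE_X_LOCATE_Y;  /* iterate over x, locate into y */
  return COITERATE;  /* neither supports locate: two-way merge */
}
```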
If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y. Lines 2–9 in Figure 1b show another example of this method applied to merge the column dimensions of a CSR matrix and a dense matrix. This alternative method reduces the merge complexity from O(nnz(x) + nnz(y)) to O(nnz(x)), assuming locate runs in constant time. Moreover, this method does not require enumerating the coordinates of y in order. We do not even need to enumerate the coordinates of x in order, as long as there are no duplicates and we do not need to compute output components in order (e.g., if the output supports the insert capability). This method is thus ideal for computing intersection merges of unordered levels.
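As a concrete sketch of iterate-and-locate (our own illustration, not generated taco code; the identifiers are hypothetical), an intersection merge of a sparse vector x with a dense vector y computes a dot product in O(nnz(x)) time:

```c
#include <stddef.h>

/* Dot product of a sparse vector x (coordinate array x_idx, value array
 * x_val, nnz stored entries) with a dense vector y. Because a dense array
 * supports constant-time locate (plain indexing), we iterate only over the
 * nonzeros of x and never scan y. */
double sparse_dense_dot(const int *x_idx, const double *x_val, size_t nnz,
                        const double *y) {
  double sum = 0.0;
  for (size_t p = 0; p < nnz; p++) {
    sum += x_val[p] * y[x_idx[p]];  /* locate y's component by coordinate */
  }
  return sum;
}
```

Note that the loop is correct even if x's coordinates are unordered, as long as they contain no duplicates, matching the conditions stated above.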
We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands. At a high level, any merge can be computed by co-iterating over some subset of its operands and, for every enumerated coordinate, locating that same coordinate in all the remaining operands with calls to locate. Which operands need to be co-iterated can be identified recursively from the expression expr that we want to compute. In particular, for each subexpression e = e1 op e2 in expr, let Coiter(e) denote the set of operand coordinate hierarchy levels that need to be co-iterated in order to compute e. If op is an operation that requires a union merge (e.g., addition), then computing e requires co-iterating over all the levels that would have to be co-iterated in order to separately compute e1 and e2; in other words, Coiter(e) = Coiter(e1) ∪ Coiter(e2). On the other hand, if op is an operation that requires an intersection merge (e.g., multiplication), then the set of coordinates of nonzeros in the result e must be a subset of the coordinates of nonzeros in either operand e1 or e2. Thus, in order to enumerate the coordinates of all nonzeros in the result, it suffices to co-iterate over all the levels merged by just one of the operands. Without loss of generality, this lets us compute e without having to co-iterate over levels merged by e2 that can instead be accessed with locate; in other words, Coiter(e) = Coiter(e1) ∪ (Coiter(e2) \ LocateCapable(e2)), where LocateCapable(e2) denotes the set of levels merged by e2 that support the locate capability.
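The Coiter recursion can be sketched over a small expression tree, with level sets represented as bitmasks. This is our own illustration of the recurrences above; the `Expr` node layout and bitmask encoding are hypothetical, not taco's internal representation:

```c
#include <stdint.h>

typedef enum { LEAF, ADD, MUL } OpKind;

/* An expression is either a tensor operand (LEAF) covering a set of
 * coordinate hierarchy levels, or a binary operation on two subexpressions.
 * Level sets are bitmasks: bit i set means level i is involved. */
typedef struct Expr {
  OpKind kind;
  uint32_t levels;         /* for LEAF: levels merged by this operand */
  uint32_t locate_capable; /* for LEAF: subset of levels supporting locate */
  const struct Expr *e1, *e2;
} Expr;

/* LocateCapable(e): levels merged by e that support locate. */
static uint32_t locate_capable(const Expr *e) {
  if (e->kind == LEAF) return e->locate_capable;
  return locate_capable(e->e1) | locate_capable(e->e2);
}

/* Coiter(e): levels that must be co-iterated to compute e. */
uint32_t coiter(const Expr *e) {
  switch (e->kind) {
  case LEAF:
    return e->levels;
  case ADD:  /* union merge: Coiter(e1) ∪ Coiter(e2) */
    return coiter(e->e1) | coiter(e->e2);
  case MUL:  /* intersection merge: locate into e2 where possible */
    return coiter(e->e1) | (coiter(e->e2) & ~locate_capable(e->e2));
  }
  return 0;
}
```

For b * c + d with c dense (locate-capable), the recursion co-iterates only b's and d's levels; c is accessed with locate.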
4.5 Code Generation Algorithm
Figure 9a shows our code generation algorithm, which incorporates all of the concepts we presented in the previous subsections. Each part of the algorithm is labeled from 1 to 11; throughout the
Proc. ACM Program. Lang., Vol. 2, No. OOPSLA, Article 123. Publication date: November 2018.
Format Abstraction & Code Generation
28
Our technique supports a wide range of disparate tensor formats
[Table comparing format support across systems — taco (this work vs. [Kjolstad et al. 2017]), Intel MKL, SciPy, MTL4, Tensor Toolbox, and TensorFlow — for: sparse vector, hash map vector, coordinate matrix, CSR, DCSR, ELL, DIA, BCSR, CSB, DOK, LIL, skyline, banded, coordinate tensor, CSF, and mode-generic formats.]
29
Our technique generates efficient code
[Bar charts of normalized execution time: Coordinate SpMV (this work vs. SciPy, Intel MKL, MTL4, TensorFlow), DIA SpMV (this work vs. SciPy, Intel MKL), CSR Addition (this work vs. SciPy, Intel MKL, MTL4), and Coordinate MTTKRP (this work vs. Tensor Toolbox).]
30
In conclusion…
Supporting many disparate tensor formats is essential for performance
We can automatically generate kernels that compute with disparate tensor formats
Adding support for even more tensor formats is straightforward
This work supported by:
tensor-compiler.org