Format Abstraction for Sparse Tensor Algebra Compilers Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe
2
Sparse tensors are a natural way of representing real-world data
[Figure: a sparse 3rd-order tensor of product reviews, with modes Users (Jack, Peter, Paul, Mary, Bob, Sam, Billy, Lilly, Hilde, ...), Products (Kindle, Dubliners, The Iliad, Monitor, Sweater, Laptop, Candide, ...), and Words (Quality, Durable, Poor, ...); almost all entries are zero.]
Dense storage: 107 exabytes. Sparse storage: 13 gigabytes.
3
Many different formats for storing tensors exist
[Figure: a landscape of tensor storage formats and the applications that use them.]
Vector formats: dense array, sparse vector, hash map
Matrix formats: dense array, coordinate, CSR, CSC, DCSR, DCSC, CSB, BCSR, BCOO, ELLPACK, SELL, BELL, LIL, DIA, block DIA, skyline
Higher-order tensor formats: dense array, coordinate, CSF, HiCOO, mode-generic, F-COO
Applications: thermal simulation, unstructured mesh simulation [Bell and Garland 2009], CNNs with block-sparse weights [Gray et al. 2017], image processing, data analytics
4
There is no universally superior tensor format
[Excerpt from the taco paper (Kjolstad, Kamil, Chou, Lugato, and Amarasinghe, OOPSLA 2017, p. 77:22), Fig. 18: performance of matrix-vector multiplication in taco on matrices with distinct sparsity patterns (dense, row-slicing, thermal, hypersparse, column-slicing, blocked). The left half of each subfigure depicts the sparsity pattern of the matrix, while the right half shows the normalized storage costs and normalized average execution times (relative to the optimal format) for the storage formats on the horizontal axis (DD, DS, SD, SS, SDᵀ, and blocked variants). The format labels follow the scheme described in Section 3; for instance, DS is short for (dense d1, sparse d2), while SDᵀ is equivalent to (sparse d2, dense d1). The dense matrix input has a density of 0.95, the hypersparse matrix 2.5 × 10⁻⁵, the row-slicing and column-slicing matrices 9.5 × 10⁻³, and the thermal and blocked matrices 1.0 × 10⁻³. No one format wins everywhere: a format that is optimal on one pattern can be orders of magnitude worse on another (e.g., up to 26542× on the hypersparse matrix).]
... and tensor-times-vector multiplication with matrices and 3rd-order tensors of varying sparsities as inputs. The tensors are randomly generated with every component having some probability d of being non-zero, where d is the density of the tensor (i.e., the fraction of components that are non-zero, and the complement of sparsity). As Fig. 17 shows, while computing with sparse tensor storage formats incurs some performance penalty compared to the same computation with dense formats when the inputs are highly dense, the penalty decreases and eventually turns into a performance gain as the sparsity of the inputs increases. For the two computations we evaluate, we observe that an input sparsity as low as approximately 35% is sufficient to make sparse formats that compress out all zeros, including DCSR and CSF, perform better than dense formats, which further emphasizes the practicality of sparse tensor storage formats.
[Proc. ACM Program. Lang., Vol. 1, No. OOPSLA, Article 77. Publication date: October 2017.]
[Chart: normalized run time (0.0 to 1.5) of y = Ax on three matrices, comparing CSR against DIA and CSR against BCSR; one of the CSR/DIA comparisons shows a 186× gap.]
6
[Kjolstad et al. 2017] This work
[Diagram: in Kjolstad et al. 2017, the code generator directly handles tensors stored with dense and compressed levels, described by pos and crd arrays (e.g., a CSR matrix with pos = [0 2 4 4] and crd = [0 1 1 0]). This work inserts a format abstraction between tensor formats and the code generator, so the same code generator also supports formats built from other level types, such as DIA storage described by offset arrays (e.g., offset = [-3 0 1]).]
Format abstraction
7
Storing sparse tensors efficiently requires additional metadata
[Figure: a 3×4 matrix with nonzeros A at (0,0), B at (0,2), C at (1,1), D at (1,2), E at (1,3), and F at (2,3), stored as a dense array of 12 values at positions 0–11.]
With dense row-major storage, each coordinate is implicit in the position:
row(6) = 6 / 4 = 1
col(6) = 6 % 4 = 2
and, conversely, locate maps a coordinate back to a position:
locate(1,2) = 1 * 4 + 2 = 6
8
Storing sparse tensors efficiently requires additional metadata
[Figure: the same 3×4 matrix stored compactly as just its six nonzero values A B C D E F at positions 0–5.]
Once the zeros are compressed out, positions alone no longer identify coordinates:
row(3) = ???
col(3) = ???
9
Coordinates of tensor elements can be encoded in many ways
[Figure: the 3×4 example matrix with nonzeros A–F, encoded in two formats.]
Coordinate: store the full coordinates of every nonzero.
vals: A B C D E F
rows: 0 0 1 1 1 2
cols: 0 2 1 2 3 3
CSR: compress the row coordinates into a positions array, where pos[i] and pos[i+1] delimit the entries of row i.
vals: A B C D E F
pos:  0 2 5 6
cols: 0 2 1 2 3 3
10
Computing with different formats can require very different code: A = B ∘ C

Coordinate ✕ Dense array:
for (int pB = B1_pos[0]; pB < B1_pos[1]; pB++) {
  int i = B1_crd[pB];
  int j = B2_crd[pB];
  int pC = i * N + j;
  int pA = i * N + j;
  A[pA] = B[pB] * C[pC];
}

CSR ✕ Dense array:
for (int i = 0; i < M; i++) {
  for (int pB = B2_pos[i]; pB < B2_pos[i + 1]; pB++) {
    int j = B2_crd[pB];
    int pC = i * N + j;
    int pA = i * N + j;
    A[pA] = B[pB] * C[pC];
  }
}

CSR ✕ Coordinate:
int pC1 = C1_pos[0];
while (pC1 < C1_pos[1]) {
  int i = C1_crd[pC1];
  int C1_segend = pC1 + 1;
  while (C1_segend < C1_pos[1] && C1_crd[C1_segend] == i)
    C1_segend++;
  int pB2 = B2_pos[i];
  int pC2 = pC1;
  while (pB2 < B2_pos[i + 1] && pC2 < C1_segend) {
    int jB2 = B2_crd[pB2];
    int jC2 = C2_crd[pC2];
    int j = min(jB2, jC2);
    int pA = i * N + j;
    if (jB2 == j && jC2 == j)
      A[pA] = B[pB2] * C[pC2];
    if (jB2 == j) pB2++;
    if (jC2 == j) pC2++;
  }
  pC1 = C1_segend;
}
11
Hand-coding support for a wide range of formats is infeasible: every kernel must be specialized to the combination of its operands' formats.

A = B ∘ C: Dense array ✕ Dense array, Coordinate ✕ Dense array, Coordinate ✕ Coordinate, CSR ✕ Dense array, CSR ✕ Coordinate, CSR ✕ CSR, DIA ✕ Dense array, DIA ✕ Coordinate, DIA ✕ CSR, DIA ✕ DIA, ELLPACK ✕ Dense array, ELLPACK ✕ Coordinate, ELLPACK ✕ CSR, ELLPACK ✕ DIA, ELLPACK ✕ ELLPACK, BCSR ✕ Dense array, BCSR ✕ Coordinate, BCSR ✕ BCSR, ...

A = B ∘ C ∘ D: Dense array ✕ CSR ✕ CSR, Coordinate ✕ CSR ✕ CSR, CSR ✕ CSR ✕ CSR, Dense array ✕ Coordinate ✕ CSR, Dense array ✕ Dense array ✕ CSR, Coordinate ✕ Coordinate ✕ CSR, DIA ✕ Coordinate ✕ Dense array, DIA ✕ Coordinate ✕ CSR, DIA ✕ Dense array ✕ CSR, DIA ✕ CSR ✕ CSR, DIA ✕ Coordinate ✕ Coordinate, DIA ✕ Dense array ✕ Dense array, DIA ✕ DIA ✕ CSR, DIA ✕ DIA ✕ Coordinate, DIA ✕ DIA ✕ Dense array, DIA ✕ DIA ✕ DIA, ELLPACK ✕ ELLPACK ✕ DIA, ELLPACK ✕ CSR ✕ DIA, ELLPACK ✕ BCSR ✕ DIA, ...

y = Ax + z: Dense array ✕ Dense array ✕ Dense array, Dense array ✕ Dense array ✕ Sparse vector, Dense array ✕ Dense array ✕ Hash map, Dense array ✕ Sparse vector ✕ Sparse vector, Dense array ✕ Sparse vector ✕ Hash map, Dense array ✕ Hash map ✕ Sparse vector, Dense array ✕ Sparse vector ✕ Dense array, Coordinate ✕ Dense array ✕ Dense array, Coordinate ✕ Sparse vector ✕ Dense array, Coordinate ✕ Dense array ✕ Hash map, Coordinate ✕ Sparse vector ✕ Hash map, Coordinate ✕ Hash map ✕ Sparse vector, CSR ✕ Dense array ✕ Dense array, CSR ✕ Dense array ✕ Sparse vector, CSR ✕ Hash map ✕ Sparse vector, CSR ✕ Hash map ✕ Dense array, CSR ✕ Sparse vector ✕ Dense array, DIA ✕ Dense array ✕ Dense array, DIA ✕ Hash map ✕ Dense array, ELLPACK ✕ Dense array ✕ Sparse vector, ...
12
Evaluation
Formats in the abstraction are composed from per-dimension level types, for example:
Mode-generic tensor: Compressed, Singleton, Dense, Dense
DIA: Dense, Range, Offset
123:16 Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe
[Flowchart residue: the left flowchart chooses among three strategies (co-iterate over x and y; iterate over x and locate into y; iterate over y and locate into x) based on whether x and y support locate and whether x is unordered while y is ordered. The right flowchart determines, per operand, what iterator conversions are needed: reorder an operand if it is unordered but must be ordered or accessed with locate, and aggregate duplicates if a non-unique operand is co-iterated.]
Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the right describes, for each operand, what iterator conversions are needed at runtime to compute the merge.
If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y. Lines 2–9 in Figure 1b show another example of this method, applied to merge the column dimensions of a CSR matrix and a dense matrix. This alternative method reduces the merge complexity from O(nnz(x) + nnz(y)) to O(nnz(x)), assuming locate runs in constant time. Moreover, this method does not require enumerating the coordinates of y in order. We do not even need to enumerate the coordinates of x in order, as long as there are no duplicates and we do not need to compute output components in order (e.g., if the output supports the insert capability). This method is thus ideal for computing intersection merges of unordered levels.
We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands. At a high level, any merge can be computed by co-iterating over some subset of its operands and, for every enumerated coordinate, locating that same coordinate in all the remaining operands with calls to locate. Which operands need to be co-iterated can be identified recursively from the expression expr that we want to compute. In particular, for each subexpression e = e1 op e2 in expr, let Coiter(e) denote the set of operand coordinate hierarchy levels that need to be co-iterated in order to compute e. If op is an operation that requires a union merge (e.g., addition), then computing e requires co-iterating over all the levels that would have to be co-iterated in order to separately compute e1 and e2; in other words, Coiter(e) = Coiter(e1) ∪ Coiter(e2). On the other hand, if op is an operation that requires an intersection merge (e.g., multiplication), then the set of coordinates of nonzeros in the result e must be a subset of the coordinates of nonzeros in either operand e1 or e2. Thus, in order to enumerate the coordinates of all nonzeros in the result, it suffices to co-iterate over all the levels merged by just one of the operands. Without loss of generality, this lets us compute e without having to co-iterate over levels merged by e2 that can instead be accessed with locate; in other words, Coiter(e) = Coiter(e1) ∪ (Coiter(e2) \ LocateCapable(e2)), where LocateCapable(e2) denotes the set of levels merged by e2 that support the locate capability.
4.5 Code Generation Algorithm
Figure 9a shows our code generation algorithm, which incorporates all of the concepts we presented in the previous subsections. Each part of the algorithm is labeled from 1 to 11; throughout the
Proc. ACM Program. Lang., Vol. 2, No. OOPSLA, Article 123. Publication date: November 2018.
Format Abstraction & Code Generation
12
Evaluation
123:16 Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe
[Flowchart from Figure 8: the left panel selects among "co-iterate over x and y", "iterate over x and locate into y", and "iterate over y and locate into x", based on whether x and y support locate and on whether x is unordered and y ordered. The right panel determines, for each operand, whether a converted iterator is needed: an operand that is not ordered or accessed with locate may need to be reordered (unless the output is unordered or supports insert), and an operand that is not unique may need its duplicates aggregated.]
Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the right describes, for each operand, what iterator conversions are needed at runtime to compute the merge.
13
[Figure: a sparse tensor with nonzero values A–J viewed as a coordinate hierarchy. Successive builds highlight its slices, rows, and columns and show the underlying storage arrays (pos = [0 3 5 9]; crd arrays [0 1 1 0 1 0 0 1 1] and [0 0 2 3 1 1 3 0 3]; values A B C D E F G H J at positions 0–8), with the levels labeled by the Compressed, Singleton, and Dense level formats.]
Tensor formats can be viewed as compositions of level formats
14
[Figure: a small matrix with nonzeros A–F. Storing its rows with a Dense level and its columns with a Compressed level (pos and crd arrays over the values A B C D E F) yields the CSR format.]
The same level formats can be composed in many ways
15
[Figure: the same matrix stored with a Compressed level of row coordinates and a Singleton level of column coordinates, which yields the Coordinate (COO) format.]
The same level formats can be composed in many ways
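The two compositions above can be made concrete with a short Python sketch; this is an illustration of the array layouts, with function names of my own choosing, not taco's API:

```python
def to_csr(dense):
    """Dense row level + Compressed column level: the CSR arrays.
    pos[i]..pos[i+1] delimits row i's entries in crd/vals."""
    pos, crd, vals = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                crd.append(j)
                vals.append(v)
        pos.append(len(crd))
    return pos, crd, vals

def to_coo(dense):
    """Compressed row level + Singleton column level: the COO arrays.
    Every nonzero stores its own row and column coordinate."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return [0, len(vals)], rows, cols, vals
```

Both encode exactly the same nonzeros; they differ only in how the row dimension is represented.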
16
Level formats: Dense, Compressed, Singleton, Hashed, Range, Offset

Tensor formats as compositions of level formats:
- CSR: Dense, Compressed [Tinney and Walker 1967]
- Coordinate matrix: Compressed, Singleton
- Coordinate tensor: Compressed, Singleton, Singleton
- Dense array tensor: Dense, Dense, Dense
- BCSR: Dense, Compressed, Dense, Dense [Im and Yelick 1998]
- ELLPACK: Dense, Dense, Singleton [Kincaid et al. 1989]
- CSB: Dense, Dense, Compressed, Singleton [Buluç et al. 2009]
- Mode-generic tensor: Compressed, Singleton, Dense, Dense [Baskaran et al. 2012]
- DIA: Dense, Range, Offset [Saad 2003]
- Block DIA: Dense, Range, Offset, Dense, Dense
- Hash map vector: Hashed
- Hash map matrix: Hashed, Hashed [Patwary et al. 2015]

The same level formats can be composed in many ways
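To make the composition idea concrete, here is a small Python sketch (a simplified model of my own, not taco's implementation) in which each level format answers one question: given a position in the parent level, which (position, coordinate) pairs are its children. Different stacks of the same level classes then enumerate the same nonzeros:

```python
class Dense:
    def __init__(self, n):
        self.n = n
    def children(self, p):
        # A dense level materializes every coordinate 0..n-1.
        return [(p * self.n + i, i) for i in range(self.n)]

class Compressed:
    def __init__(self, pos, crd):
        self.pos, self.crd = pos, crd
    def children(self, p):
        # pos[p]..pos[p+1] delimits the stored coordinates of parent p.
        return [(q, self.crd[q]) for q in range(self.pos[p], self.pos[p + 1])]

class Singleton:
    def __init__(self, crd):
        self.crd = crd
    def children(self, p):
        # Exactly one coordinate per parent position.
        return [(p, self.crd[p])]

def enumerate_coords(levels, vals, p=0, prefix=()):
    """Walk a stack of levels, yielding (coordinates, value) pairs."""
    if not levels:
        yield prefix, vals[p]
        return
    for q, i in levels[0].children(p):
        yield from enumerate_coords(levels[1:], vals, q, prefix + (i,))
```

With vals = [2.0, 3.0, 4.0], the stack [Dense(2), Compressed([0, 1, 3], [1, 0, 2])] (a CSR-style composition) and the stack [Compressed([0, 3], [0, 1, 1]), Singleton([1, 0, 2])] (a COO-style composition) enumerate the same three nonzeros.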
17
Tensor Algebra Compiler (taco) [Kjolstad et al. 2017]

$A_{ij} = \sum_k B_{ijk} c_k$, with A : (dense, dense), B : (dense, compressed, compressed), c : (compressed)

for (int i = 0; i < m; i++) {
  int pB1 = i;  // first level of B is dense, so its position equals the coordinate i
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) {
    int j = B2_idx[pB2];
    int pA2 = (i * n) + j;
    int pB3 = B3_pos[pB2];
    int pc1 = c1_pos[0];
    while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) {
      int kB = B3_idx[pB3];
      int kc = c1_idx[pc1];
      int k = min(kB, kc);
      if (kB == k && kc == k) {
        a[pA2] += b[pB3] * c[pc1];
      }
      if (kB == k) pB3++;
      if (kc == k) pc1++;
    }
  }
}
18
taco generates code dimension by dimension

$A_{ij} = \sum_k B_{ijk} \cdot c_k$

for (int i = 0; i < m; i++) {
  int pB1 = i;  // first level of B is dense, so its position equals the coordinate i
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) {
    int j = B2_idx[pB2];
    int pA2 = (i * n) + j;
    int pB3 = B3_pos[pB2];
    int pc1 = c1_pos[0];
    while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) {
      int kB = B3_crd[pB3];
      int kc = c1_crd[pc1];
      int k = min(kB, kc);
      if (kB == k && kc == k) {
        A[pA2] += B[pB3] * c[pc1];
      }
      if (kB == k) pB3++;
      if (kc == k) pc1++;
    }
  }
}
taco generates code dimension by dimension
18
for (int i = 0; i < m; i++) {
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1 + 1]; pB2++) {
    int j = B2_idx[pB2];
    int pA2 = (i * n) + j;
    int pB3 = B3_pos[pB2];
    int pc1 = c1_pos[0];
    while (pB3 < B3_pos[pB2 + 1] && pc1 < c1_pos[1]) {
      int kB = B3_crd[pB3];
      int kc = c1_crd[pc1];
      int k = min(kB, kc);
      if (kB == k && kc == k) {
        A[pA2] += B[pB3] * c[pc1];
      }
      if (kB == k) pB3++;
      if (kc == k) pc1++;
    }
  }
}
A_ij = Σ_k B_ijk · c_k
19
Hand-coding support for a wide range of level formats is also infeasible
Dense × Compressed
Dense + Compressed
Compressed × Hashed
Compressed + Singleton
Dense + Range
Compressed + Offset
Dense × Compressed × Hashed
Compressed × Hashed × Singleton
Compressed × Singleton + Dense
20
Code generation is performed in two stages
High-level algorithm
Runnable code
Compressed × Hashed
How to compute with different data structures
How to compute with multiple operands
21
Tensor algebra computations can be expressed in terms of high-level operations on tensor operands
[Figure: coordinate hierarchies of the tensor operands, with labeled nodes]
22
Level formats declare whether they support various high-level operations
Random access Iteration
Dense
Compressed
Hashed
Singleton
Range
Offset
23
Compiler constructs efficient algorithm by reasoning about whether operands support required high-level operations
Dense
Compressed
Hashed
Singleton
Range
Offset
Random access
B ∘ C
Compressed × Hashed: iterate over B and random access C
Compressed × Singleton: simultaneously iterate over B and C
24
Level formats also specify how they support high-level operations

Random access

Dense:
  int pB2 = pB1 * N + j;

Hashed:
  int pB2 = j % W + pB1 * W;
  if (crd[pB2] != j && crd[pB2] != -1) {
    int end = pB2;
    do {
      pB2 = (pB2 + 1) % W;
    } while (crd[pB2] != j && crd[pB2] != -1 && pB2 != end);
  }
  if (crd[pB2] == j) {

Iteration

Dense:
  for (int j = 0; j < N; j++) {
    int pB2 = pB1 * N + j;

Compressed:
  for (int pB2 = pos[pB1]; pB2 < pos[pB1+1]; pB2++) {
    int j = crd[pB2];

Hashed:
  for (int pB2 = pB1 * W; pB2 < (pB1 + 1) * W; pB2++) {
    int j = crd[pB2];
    if (j != -1) {

...
25
Compiler specializes constructed algorithm to operand formats by inlining code that implements required high-level operations

B ∘ C (Compressed × Hashed)

High-level algorithm:
  for every element b in B:
    find corresponding element c in C
    A[i][j] = b * c;

After inlining the compressed iteration over B:
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1+1]; pB2++) {
    int j = B2_crd[pB2];
    find corresponding element c in C
    A[i][j] = B[pB2] * c;
  }

After also inlining the hashed random access into C:
  for (int pB2 = B2_pos[pB1]; pB2 < B2_pos[pB1+1]; pB2++) {
    int j = B2_crd[pB2];
    int pC2 = j % W + pC1 * W;
    if (C2_crd[pC2] != j && C2_crd[pC2] != -1) {
      int end = pC2;
      do {
        pC2 = (pC2 + 1) % W;
      } while (C2_crd[pC2] != j && C2_crd[pC2] != -1 && pC2 != end);
    }
    if (C2_crd[pC2] == j) {
      A[i][j] = B[pB2] * C[pC2];
    }
  }
26
The same process can be repeated dimension by dimension
int iB = 0; int C0_pos = C0_pos_arr[0]; while (C0_pos < C0_pos_arr[1]) { int iC = C0_idx_arr[C0_pos]; int C0_end = C0_pos + 1; if (iC == iB) while ((C0_end < C0_pos_arr[1]) && (C0_idx_arr[C0_end] == iB)) { C0_end++; } if (iC == iB) { int B1_pos = B1_pos_arr[iB]; int C1_pos = C0_pos; while ((B1_pos < B1_pos_arr[iB + 1]) && (C1_pos < C0_end)) { int jB = B1_idx_arr[B1_pos]; int jC = C1_idx_arr[C1_pos]; int j = min(jB, jC); int A1_pos = (iB * A1_size) + j; int C1_end = C1_pos + 1; if (jC == j) while ((C1_end < C0_end) && (C1_idx_arr[C1_end] == j)) { C1_end++; } if ((jB == j) && (jC == j)) { int B2_pos = B2_pos_arr[B1_pos]; int C2_pos = C1_pos; while ((B2_pos < B2_pos_arr[B1_pos + 1]) && (C2_pos < C1_end)) { int kB = B2_idx_arr[B2_pos]; int kC = C2_idx_arr[C2_pos]; int k = min(kB, kC); int A2_pos = (A1_pos * A2_size) + k; if ((kB == k) && (kC == k)) { A_val_arr[A2_pos] = B_val_arr[B2_pos] + C_val_arr[C2_pos]; } else if (kB == k) { A_val_arr[A2_pos] = B_val_arr[B2_pos]; } else { A_val_arr[A2_pos] = C_val_arr[C2_pos]; } if (kB == k) B2_pos++; if (kC == k) C2_pos++; } while (B2_pos < B2_pos_arr[B1_pos + 1]) { int kB0 = B2_idx_arr[B2_pos]; int A2_pos0 = (A1_pos * A2_size) + kB0; A_val_arr[A2_pos0] = B_val_arr[B2_pos]; B2_pos++; } while (C2_pos < C1_end) { int kC0 = C2_idx_arr[C2_pos]; int A2_pos1 = (A1_pos * A2_size) + kC0; A_val_arr[A2_pos1] = C_val_arr[C2_pos]; C2_pos++; } } else if (jB == j) { for (int B2_pos0 = B2_pos_arr[B1_pos]; B2_pos0 < B2_pos_arr[B1_pos + 1]; B2_pos0++) { int kB1 = B2_idx_arr[B2_pos0]; int A2_pos2 = (A1_pos * A2_size) + kB1; A_val_arr[A2_pos2] = B_val_arr[B2_pos0]; } } else { for (int C2_pos0 = C1_pos; C2_pos0 < C1_end; C2_pos0++) { int kC1 = C2_idx_arr[C2_pos0]; int A2_pos3 = (A1_pos * A2_size) + kC1; A_val_arr[A2_pos3] = C_val_arr[C2_pos0]; } } if (jB == j) B1_pos++; if (jC == j) C1_pos = C1_end; }
while (B1_pos < B1_pos_arr[iB + 1]) { int jB0 = B1_idx_arr[B1_pos]; int A1_pos0 = (iB * A1_size) + jB0; for (int B2_pos1 = B2_pos_arr[B1_pos]; B2_pos1 < B2_pos_arr[B1_pos + 1]; B2_pos1++) { int kB2 = B2_idx_arr[B2_pos1]; int A2_pos4 = (A1_pos0 * A2_size) + kB2; A_val_arr[A2_pos4] = B_val_arr[B2_pos1]; } B1_pos++; } while (C1_pos < C0_end) { int jC0 = C1_idx_arr[C1_pos]; int A1_pos1 = (iB * A1_size) + jC0; int C1_end0 = C1_pos + 1; while ((C1_end0 < C0_end) && (C1_idx_arr[C1_end0] == jC0)) { C1_end0++; } for (int C2_pos1 = C1_pos; C2_pos1 < C1_end0; C2_pos1++) { int kC2 = C2_idx_arr[C2_pos1]; int A2_pos5 = (A1_pos1 * A2_size) + kC2; A_val_arr[A2_pos5] = C_val_arr[C2_pos1]; } C1_pos = C1_end0; } } else { for (int B1_pos0 = B1_pos_arr[iB]; B1_pos0 < B1_pos_arr[iB + 1]; B1_pos0++) { int jB1 = B1_idx_arr[B1_pos0]; int A1_pos2 = (iB * A1_size) + jB1; for (int B2_pos2 = B2_pos_arr[B1_pos0]; B2_pos2 < B2_pos_arr[B1_pos0 + 1]; B2_pos2++) { int kB3 = B2_idx_arr[B2_pos2]; int A2_pos6 = (A1_pos2 * A2_size) + kB3; A_val_arr[A2_pos6] = B_val_arr[B2_pos2]; } } } if (iC == iB) C0_pos = C0_end; iB++; } while (iB < B0_size) { for (int B1_pos1 = B1_pos_arr[iB]; B1_pos1 < B1_pos_arr[iB + 1]; B1_pos1++) { int jB2 = B1_idx_arr[B1_pos1]; int A1_pos3 = (iB * A1_size) + jB2; for (int B2_pos3 = B2_pos_arr[B1_pos1]; B2_pos3 < B2_pos_arr[B1_pos1 + 1]; B2_pos3++) { int kB4 = B2_idx_arr[B2_pos3]; int A2_pos7 = (A1_pos3 * A2_size) + kB4; A_val_arr[A2_pos7] = B_val_arr[B2_pos3]; } } iB++; }
A_ijk = B_ijk + C_ijk
27
Evaluation
[Slide figure: example format compositions — the mode-generic tensor format built from compressed, singleton, and dense levels, and DIA built from dense, range, and offset levels.]
Fig. 8. The most efficient strategies for computing the intersection merge of two vectors x and y, depending on whether they support the locate capability and whether they are ordered and unique. The sparsity structure of y is assumed to not be a strict subset of the sparsity structure of x. The flowchart on the left selects a merge strategy based on which operands support locate: co-iterate over x and y, iterate over x and locate into y, or iterate over y and locate into x. The flowchart on the right describes, for each operand, what iterator conversions (reordering an unordered operand, aggregating duplicates) are needed at runtime to compute the merge.
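The strategy-selection flowchart can be read as a small decision procedure. The sketch below is our own illustration of one plausible reading of it, not taco's actual implementation; all names (`Level`, `Strategy`, `pick_strategy`) are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical summary of a vector level's capabilities and properties. */
typedef struct {
  bool locate;   /* supports the locate capability (e.g., a dense array) */
  bool ordered;  /* coordinates are enumerated in order */
} Level;

typedef enum { COITERATE, ITERATE_X_LOCATE_Y, ITERATE_Y_LOCATE_X } Strategy;

/* Pick a strategy for an intersection merge of vectors x and y. */
Strategy pick_strategy(Level x, Level y) {
  if (x.locate) {
    if (y.locate) {
      /* Both support locate: prefer iterating over the unordered operand,
       * since locate spares us from having to sort it first. */
      if (!x.ordered && y.ordered) return ITERATE_X_LOCATE_Y;
      return ITERATE_Y_LOCATE_X;
    }
    return ITERATE_Y_LOCATE_X;  /* iterate over y, locate into x */
  }
  if (y.locate) return ITERATE_X_LOCATE_Y;  /* iterate over x, locate into y */
  return COITERATE;  /* neither supports locate: two-way merge */
}
```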
If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y. Lines 2–9 in Figure 1b show another example of this method applied to merge the column dimensions of a CSR matrix and a dense matrix. This alternative method reduces the merge complexity from O(nnz(x) + nnz(y)) to O(nnz(x)), assuming locate runs in constant time. Moreover, this method does not require enumerating the coordinates of y in order. We do not even need to enumerate the coordinates of x in order, as long as there are no duplicates and we do not need to compute output components in order (e.g., if the output supports the insert capability). This method is thus ideal for computing intersection merges of unordered levels.
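As a concrete sketch of iterate-and-locate (our own illustration, not generated taco code; the identifiers are hypothetical), an intersection merge of a sparse vector x with a dense vector y computes a dot product in O(nnz(x)) time:

```c
#include <stddef.h>

/* Dot product of a sparse vector x (coordinate array x_idx, value array
 * x_val, nnz stored entries) with a dense vector y. Because a dense array
 * supports constant-time locate (plain indexing), we iterate only over the
 * nonzeros of x and never scan y. */
double sparse_dense_dot(const int *x_idx, const double *x_val, size_t nnz,
                        const double *y) {
  double sum = 0.0;
  for (size_t p = 0; p < nnz; p++) {
    sum += x_val[p] * y[x_idx[p]];  /* locate y's component by coordinate */
  }
  return sum;
}
```

Note that the loop is correct even if x's coordinates are unordered, as long as they contain no duplicates, matching the conditions stated above.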
We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands. At a high level, any merge can be computed by co-iterating over some subset of its operands and, for every enumerated coordinate, locating that same coordinate in all the remaining operands with calls to locate. Which operands need to be co-iterated can be identified recursively from the expression expr that we want to compute. In particular, for each subexpression e = e1 op e2 in expr, let Coiter(e) denote the set of operand coordinate hierarchy levels that need to be co-iterated in order to compute e. If op is an operation that requires a union merge (e.g., addition), then computing e requires co-iterating over all the levels that would have to be co-iterated in order to separately compute e1 and e2; in other words, Coiter(e) = Coiter(e1) ∪ Coiter(e2). On the other hand, if op is an operation that requires an intersection merge (e.g., multiplication), then the set of coordinates of nonzeros in the result e must be a subset of the coordinates of nonzeros in either operand e1 or e2. Thus, in order to enumerate the coordinates of all nonzeros in the result, it suffices to co-iterate over all the levels merged by just one of the operands. Without loss of generality, this lets us compute e without having to co-iterate over levels merged by e2 that can instead be accessed with locate; in other words, Coiter(e) = Coiter(e1) ∪ (Coiter(e2) \ LocateCapable(e2)), where LocateCapable(e2) denotes the set of levels merged by e2 that support the locate capability.
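The Coiter recursion can be sketched over a small expression tree, with level sets represented as bitmasks. This is our own illustration of the recurrences above; the `Expr` node layout and bitmask encoding are hypothetical, not taco's internal representation:

```c
#include <stdint.h>

typedef enum { LEAF, ADD, MUL } OpKind;

/* An expression is either a tensor operand (LEAF) covering a set of
 * coordinate hierarchy levels, or a binary operation on two subexpressions.
 * Level sets are bitmasks: bit i set means level i is involved. */
typedef struct Expr {
  OpKind kind;
  uint32_t levels;         /* for LEAF: levels merged by this operand */
  uint32_t locate_capable; /* for LEAF: subset of levels supporting locate */
  const struct Expr *e1, *e2;
} Expr;

/* LocateCapable(e): levels merged by e that support locate. */
static uint32_t locate_capable(const Expr *e) {
  if (e->kind == LEAF) return e->locate_capable;
  return locate_capable(e->e1) | locate_capable(e->e2);
}

/* Coiter(e): levels that must be co-iterated to compute e. */
uint32_t coiter(const Expr *e) {
  switch (e->kind) {
  case LEAF:
    return e->levels;
  case ADD:  /* union merge: Coiter(e1) ∪ Coiter(e2) */
    return coiter(e->e1) | coiter(e->e2);
  case MUL:  /* intersection merge: locate into e2 where possible */
    return coiter(e->e1) | (coiter(e->e2) & ~locate_capable(e->e2));
  }
  return 0;
}
```

For b * c + d with c dense (locate-capable), the recursion co-iterates only b's and d's levels; c is accessed with locate.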
4.5 Code Generation Algorithm
Figure 9a shows our code generation algorithm, which incorporates all of the concepts we presented in the previous subsections. Each part of the algorithm is labeled from 1 to 11; throughout the
Proc. ACM Program. Lang., Vol. 2, No. OOPSLA, Article 123. Publication date: November 2018.
Format Abstraction & Code Generation
28
Our technique supports a wide range of disparate tensor formats
[Table comparing format support across systems — taco (this work vs. [Kjolstad et al. 2017]), Intel MKL, SciPy, MTL4, Tensor Toolbox, and TensorFlow — for: sparse vector, hash map vector, coordinate matrix, CSR, DCSR, ELL, DIA, BCSR, CSB, DOK, LIL, skyline, banded, coordinate tensor, CSF, and mode-generic formats.]
29
Our technique generates efficient code
[Bar charts of normalized execution time: Coordinate SpMV (this work vs. SciPy, Intel MKL, MTL4, TensorFlow), DIA SpMV (this work vs. SciPy, Intel MKL), CSR Addition (this work vs. SciPy, Intel MKL, MTL4), and Coordinate MTTKRP (this work vs. Tensor Toolbox).]
30
In conclusion…
Supporting many disparate tensor formats is essential for performance
We can automatically generate kernels that compute with disparate tensor formats
Adding support for even more tensor formats is straightforward
This work supported by:
tensor-compiler.org