A Tensorized Transformer for Language Modeling
Tianjin University
Xindian Ma, Peng Zhang*, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, Ming Zhou
NeurIPS 2019
Background
• Transformer has led to breakthroughs in natural language processing tasks.
• However, the size of Transformer and its variant BERT limits effective deployment in resource-limited settings.
• Several compression methods have been proposed: TT-embedding, BT-RNN, and Tensorizing Neural Networks.
Some Compression Methods
• TT-Embedding [1]
  • Tensor-Train decomposition is used to compress the embedding layer (look-up table).
• BT-RNN [2]
  • Block-term tensor decomposition is used to compress the input layers in LSTM.
• Tensorizing Neural Networks [3]
  • Tensor-Train format is used to compress the fully-connected layers.
[1] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.
[2] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9378-9387, 2018.
[3] Novikov A., Podoprikhin D., Osokin A., et al. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442-450, 2015.
Problem Formulation
• The goals are:
  • To linearly represent self-attention by a group of basis vectors;
  • To compress the multi-head attention in Transformer;
  • After compression, the result can be directly integrated into the encoder and decoder framework of Transformer.
Methods
Basic ideas:
• Low-rank decomposition
• Parameter sharing

• Tucker decomposition is used to construct single-block attention.
• Block-term decomposition plus parameter sharing is used to construct the multi-head mechanism (multi-linear attention).
Transformer Language Modeling
• Scaled Dot-Product Attention
  • $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right) V$
• Multi-head Attention
  • $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$
  • where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
Vaswani A., Shazeer N., Parmar N., et al. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
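For concreteness, here is a minimal NumPy sketch of the two formulas above; the dimensions, random weights, and variable names are illustrative only, not the actual model configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(Q, K, V, heads, Wo):
    # heads: list of per-head projections (Wq, Wk, Wv); Wo: output projection
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
N, d, h = 5, 8, 2
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
heads = [tuple(rng.normal(size=(d, d // h)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(d, d))
print(attention(Q, K, V).shape, multi_head(Q, K, V, heads, Wo).shape)  # (5, 8) (5, 8)
```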
Note: multi-head attention requires multiple groups of projection parameters, one group per head.
Linear Representation
• Theorem: Scaled dot-product attention can be represented by a linear combination of a set of basis vectors:
  • $\mathrm{Attention}(Q, K, V) = (e_1, \dots, e_n)\, M$
  • where $M \in \mathbb{R}^{n \times d}$ is a coefficient matrix.
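A tiny numerical check of this linear-combination view, under the simple (assumed) reading that the rows of $V$ can serve as the generating vectors: each output row of scaled dot-product attention is a weighted combination of rows of $V$, so appending the attention output to $V$ does not increase the matrix rank.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
N, d = 6, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

A = softmax(Q @ K.T / np.sqrt(d)) @ V      # attention output, rows lie in the row space of V
stacked = np.vstack([V, A])
print(np.linalg.matrix_rank(stacked) == np.linalg.matrix_rank(V))  # True
```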
Tucker Decomposition
[Figure: Tucker decomposition — a third-order tensor is decomposed into a core tensor $\mathcal{G}$ and three factor matrices, whose columns act as basis vectors.]
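A small NumPy sketch of the structure in the figure: a third-order tensor assembled from a core tensor and three factor matrices via n-mode products. The matrix names A, B, C and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, M = 3, 4, 5            # core (rank) dimensions
D1, D2, D3 = 6, 7, 8         # sizes of the three tensor modes

G = rng.normal(size=(I, J, M))     # core tensor
A = rng.normal(size=(D1, I))       # factor matrices ("basis vectors")
B = rng.normal(size=(D2, J))
C = rng.normal(size=(D3, M))

# T = G x1 A x2 B x3 C
T = np.einsum('ijm,ai,bj,cm->abc', G, A, B, C)
print(T.shape)                     # (6, 7, 8)
```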
Single-Block Attention
[Figure: Single-block attention — the same Tucker structure, with a core tensor $\mathcal{G}$ and the matrices $Q$, $K$, $V$ as factor matrices.]
$\mathrm{Atten}_{\mathrm{TD}}(\mathcal{G}; Q, K, V) = \mathcal{G} \times_1 Q \times_2 K \times_3 V = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{m=1}^{M} \mathcal{G}_{ijm}\; Q_i \circ K_j \circ V_m$
where $\circ$ is the outer product and $Q_i$, $K_j$, $V_m$ are column vectors of $Q$, $K$, $V$.
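A direct NumPy translation of the formula above, with toy sizes; the second computation spells out the weighted sum of outer products to confirm it matches the n-mode-product form.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 4                                  # sequence length, column count of Q/K/V
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
G = rng.normal(size=(d, d, d))               # core tensor (here I = J = M = d)

# n-mode products in one einsum: the result is an N x N x N tensor
atten_td = np.einsum('ijm,ai,bj,cm->abc', G, Q, K, V)

# the same value as an explicit sum of outer products (slow, for clarity)
check = sum(G[i, j, m] * np.einsum('a,b,c->abc', Q[:, i], K[:, j], V[:, m])
            for i in range(d) for j in range(d) for m in range(d))
print(np.allclose(atten_td, check))          # True
```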
Single-Block Attention in Transformer
[Figure: Single-block attention integrated into the Transformer — $Q$, $K$, $V$ are produced by linear layers and combined with the core tensor $\mathcal{G}$.]
Lower-Rank Decomposition
• The core tensor $\mathcal{G}$ is defined as:
  • $\mathcal{G}_{ijm} = \begin{cases} \mathrm{rand}(0, 1), & i = j = m \\ 0, & \text{otherwise} \end{cases}$
• In this case, it can be set that $I = J = M = R$, where $R$ is the rank.
• The time complexity of single-block attention is $\mathcal{O}(N^3)$.
• The time complexity of scaled dot-product attention is $\mathcal{O}(N^2 d)$.
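A sketch of this diagonal core tensor with toy sizes: with nonzero entries only where $i = j = m$, the triple sum in single-block attention collapses to a single sum over $R$ terms, which is where the cost saving comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 5, 4
Q, K, V = (rng.normal(size=(N, R)) for _ in range(3))

G = np.zeros((R, R, R))
idx = np.arange(R)
G[idx, idx, idx] = rng.random(R)             # rand(0, 1) on the superdiagonal

full = np.einsum('ijm,ai,bj,cm->abc', G, Q, K, V)               # generic triple sum
diag = np.einsum('r,ar,br,cr->abc', G[idx, idx, idx], Q, K, V)  # single sum over R
print(np.allclose(full, diag))               # True
```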
Reconstruction for Scaled Dot-Product Attention
• Corollary: Scaled dot-product attention can be reconstructed from single-block attention:
  • $\mathrm{Attention}(Q, K, V)_{i,j} = \sum_{m=1}^{J} \mathrm{Atten}_{\mathrm{TD}}(\mathcal{G}; Q, K, V)_{i,j,m}$
  • where $i$ and $j$ are the indices (coordinates) of the output matrix of scaled dot-product attention.
How to Get a Richer Representation
• Tensor split
• Matrix concatenation
[Figure: the output tensor is split into matrices, which are then concatenated.]
Multi-linear Attention by Block-term Decomposition
• It is important to construct a multi-head mechanism for modeling long-range dependency.
• How to design the model with a higher compression ratio?
  • 1) Block-term decomposition (method)
  • 2) Parameter sharing (idea)
Block-term Decomposition
Chen Y., Jin X., Kang B., et al. Sharing residual units through collective tensor factorization to improve deep neural networks. In IJCAI, pages 635-641, 2018.
Multi-linear Attention by Block-term Decomposition
In order to construct the multi-head mechanism, multi-linear attention is formulated as follows:
$\mathrm{MultiLinearAttention}(\mathcal{G}; Q, K, V) = \mathrm{SplitConcat}\!\left(\frac{1}{h}\,(T_1 + \cdots + T_h)\right) W^O$
where $T_j = \mathrm{Atten}_{\mathrm{TD}}(\mathcal{G}_j; Q W^Q, K W^K, V W^V)$
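A rough NumPy sketch of this formulation under stated assumptions: diagonal random core tensors, projections $W^Q$, $W^K$, $W^V$ shared across the $h$ blocks (the parameter-sharing idea), and an illustrative way of splitting and concatenating the averaged tensor before the output projection. The shapes are toy values, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, R, h = 6, 8, 4, 2                      # sequence length, model dim, rank, number of blocks

X = rng.normal(size=(N, d))                  # token representations
Wq, Wk, Wv = (rng.normal(size=(d, R)) for _ in range(3))   # shared projections (parameter sharing)
Q, K, V = X @ Wq, X @ Wk, X @ Wv             # each N x R

def single_block(core, Q, K, V):
    # Atten_TD(G; Q, K, V) = G x1 Q x2 K x3 V  ->  N x N x N tensor
    return np.einsum('ijm,ai,bj,cm->abc', core, Q, K, V)

blocks = []
for _ in range(h):
    core = np.zeros((R, R, R))
    idx = np.arange(R)
    core[idx, idx, idx] = rng.random(R)      # diagonal core tensor for this block
    blocks.append(single_block(core, Q, K, V))

T = sum(blocks) / h                          # average the h block outputs
concat = T.reshape(N, N * N)                 # "split and concat" the last two modes (illustrative)
Wo = rng.normal(size=(N * N, d))             # output projection (illustrative shape)
out = concat @ Wo                            # N x d, same shape as standard attention output
print(out.shape)                             # (6, 8)
```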
Parameters Sharing
Experimental Results in Language Modeling
WMT-16 English-to-German
Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891, 2016.
Conclusion
• We proposed a novel self-attention method, namely multi-linear attention.
• The Tensorized Transformer model combines two compression ideas: parameter sharing and low-rank decomposition.
• Our method achieves higher compression ratios and better experimental results in language modeling.
• With further optimization, the Tensorized Transformer model can be applied to more NLP tasks under limited resources.