
A Tensorized Transformer for Language Modeling

Tianjin University

Xindian Ma, Peng Zhang*, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, Ming Zhou

NeurIPS 2019

Background

• Transformer has led to breakthroughs in natural language processing tasks.
• The large size of Transformer, and of its variant BERT, limits the effective deployment of these models in resource-constrained settings.

• Several compression methods have been proposed:
  • TT-embedding
  • BTRNN
  • Tensorizing Neural Networks

Some Compression Methods

• TT-Embedding [1]
  • Tensor-Train decomposition is used to compress the embedding layer (look-up table).
• BTRNN [2]
  • Block-term tensor decomposition is used to compress the input layers of LSTMs.
• Tensorizing Neural Networks [3]
  • The Tensor-Train format is used to compress the fully-connected layers.

[1] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.
[2] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9378–9387, 2018.
[3] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.

Problem Formulation

• The goals are:
  • To linearly represent self-attention by a group of basis vectors
  • To compress multi-head attention in Transformer
  • After compression, the result can be directly integrated into the encoder and decoder framework of Transformer

Methods: Basic Ideas

• Low-rank decomposition
• Parameter sharing

The Tucker decomposition formulation is used to construct single-block attention.

Block-term decomposition together with parameter sharing is used to construct the multi-head mechanism (Multi-linear attention).

Transformer Language Modeling

• Scaled Dot-Product Attention

  $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$

• Multi-head Attention

  $\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_k)\,W^{O}$

  where $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

Note: multi-head attention uses multiple groups of parameters (one $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ per head).
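For concreteness, a minimal NumPy sketch of the two formulas above; the function names and shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: lists of per-head projection matrices (one group per head); Wo: output projection
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo
```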

Linear Representation

• Theorem: Scaled dot-product attention can be represented by a linear combination of a set of basis vectors:

  $\mathrm{Attention}(Q, K, V) = (e_1, \cdots, e_n)\,M$

  where $M \in \mathbb{R}^{n \times d}$ is a coefficient matrix.

Tucker Decomposition

๐ท๐ท1

๐ท๐ท2

๐ท๐ท3

โ‰ˆ๐’œ๐’œ

๐พ๐พ1

๐พ๐พ2

๐พ๐พ3

๐’ข๐’ข

๐‘ฟ๐‘ฟ๐Ÿ๐Ÿ

๐‘ฟ๐‘ฟ๐Ÿ๐Ÿ

๐‘ฟ๐‘ฟ๐Ÿ‘๐Ÿ‘

๐ถ๐ถ๐ด๐ด๐‘ค๐‘ค๐ด๐ด tensor

๐น๐น๐‘ ๐‘ ๐ถ๐ถ๐ด๐ด๐ด๐ด๐‘ค๐‘ค ๐‘€๐‘€๐‘ ๐‘ ๐ด๐ด๐‘ค๐‘ค๐ด๐ด๐ถ๐ถ๐ด๐ด๐‘ ๐‘ 

Basic vectors
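For reference, a generic NumPy sketch of Tucker reconstruction matching the figure's notation (core 𝒢 of size K1 × K2 × K3, factor matrices X1, X2, X3); this is not the paper's code.

```python
import numpy as np

def tucker_reconstruct(G, X1, X2, X3):
    # G: (K1, K2, K3) core tensor; Xn: (Dn, Kn) factor matrix for mode n.
    # Returns the (D1, D2, D3) tensor  G x_1 X1 x_2 X2 x_3 X3.
    return np.einsum('abc,ia,jb,kc->ijk', G, X1, X2, X3)

G = np.random.randn(2, 2, 2)
X1, X2, X3 = np.random.randn(5, 2), np.random.randn(6, 2), np.random.randn(7, 2)
A_hat = tucker_reconstruct(G, X1, X2, X3)   # approximation of the full tensor, shape (5, 6, 7)
```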

Single-Block Attention

๐ท๐ท1

๐ท๐ท2

๐ท๐ท3

โ‰ˆ๐’œ๐’œ

๐พ๐พ1

๐พ๐พ2

๐พ๐พ3

๐’ข๐’ข

๐‘ฟ๐‘ฟ๐Ÿ๐Ÿ

๐‘ฟ๐‘ฟ๐Ÿ๐Ÿ

๐‘ฟ๐‘ฟ๐Ÿ‘๐Ÿ‘

๐ถ๐ถ๐ด๐ด๐‘ค๐‘ค๐ด๐ด tensor

๐น๐น๐‘ ๐‘ ๐ถ๐ถ๐ด๐ด๐ด๐ด๐‘ค๐‘ค ๐‘€๐‘€๐‘ ๐‘ ๐ด๐ด๐‘ค๐‘ค๐ด๐ด๐ถ๐ถ๐ด๐ด๐‘ ๐‘ 

๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ดTD ๐’ข๐’ข;๐‘„๐‘„,๐พ๐พ,๐‘‰๐‘‰ = ๐’ข๐’ข ๏ฟฝ1 ๐‘„๐‘„ โ‹…2 ๐พ๐พ ๏ฟฝ3 ๐‘‰๐‘‰

= ๏ฟฝ๐‘–๐‘–=1

๐ผ๐ผ๏ฟฝ

๐‘—๐‘—=1

๐ฝ๐ฝ๏ฟฝ

๐‘š๐‘š=1

๐‘€๐‘€๐’ข๐’ข๐‘–๐‘–๐‘—๐‘—๐‘š๐‘š ๐‘„๐‘„๐‘–๐‘– โˆ˜ ๐พ๐พ๐‘—๐‘— โˆ˜ ๐‘‰๐‘‰๐‘š๐‘š

โˆ˜ ๐ด๐ด๐‘ ๐‘  ๐ด๐ดโ„Ž๐ด๐ด ๐ด๐ด๐‘€๐‘€๐ด๐ด๐ด๐ด๐‘ค๐‘ค ๐‘๐‘๐‘ค๐‘ค๐ด๐ด๐‘€๐‘€๐‘€๐‘€๐ถ๐ถ๐ด๐ด.

Single-Block Attention in Transformer

๐ด๐ด๐‘€๐‘€๐ด๐ด๐‘๐‘๐‘€๐‘€๐ด๐ด๐ด๐ด

๐ด๐ด๐ด๐ด

๐‘„๐‘„ ๐พ๐พ

๐‘‰๐‘‰

๐‘€๐‘€๐‘˜๐‘˜

๐‘€๐‘€๐‘˜๐‘˜

๐‘…๐‘…

๐’ข๐’ข

๐‘…๐‘…

๐‘…๐‘…

๐‘…๐‘…

๐ด๐ด

๐ด๐ด

๐‘€๐‘€๐‘ฃ๐‘ฃ

๐ด๐ด

Lower-Rank Decomposition

• The core tensor $\mathcal{G}$ is defined as follows:

  $\mathcal{G}_{ijm} = \begin{cases} \mathrm{rand}(0, 1), & i = j = m \\ 0, & \text{otherwise} \end{cases}$

• In this case we can set $I = J = M = R$, where $R$ is the rank.

The time complexity of single-block attention is $\mathcal{O}(N^3)$; the time complexity of scaled dot-product attention is $\mathcal{O}(N^2 d)$.

Presenter note: N is the length of a sentence.
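A minimal sketch of the diagonal core tensor defined above, assuming I = J = M = R; only the R superdiagonal entries are nonzero.

```python
import numpy as np

def diagonal_core(R, rng=np.random.default_rng()):
    # G[i, j, m] = rand(0, 1) if i == j == m, else 0.
    G = np.zeros((R, R, R))
    idx = np.arange(R)
    G[idx, idx, idx] = rng.random(R)   # R random superdiagonal entries in (0, 1)
    return G
```

With only R nonzero entries, the triple sum in single-block attention collapses to R outer-product (rank-one) terms.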

Reconstruction of Scaled Dot-Product Attention

• Corollary: Scaled dot-product attention can be reconstructed from single-block attention:

  $\mathrm{Attention}(Q, K, V)_{i,m} = \sum_{j=1}^{J} \mathrm{Attention}_{\mathrm{TD}}(\mathcal{G}; Q, K, V)_{i,j,m}$

  where $i$, $j$, $m$ are the indices of the single-block attention's output.

Presenter note: Scaled dot-product attention can be seen as the marginal probability of the tensor (single-block attention).
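A sketch of the marginalization in the corollary: the single-block attention tensor is summed over its second index j. Shapes and the diagonal core are carried over from the sketches above; the code illustrates the formula's right-hand side rather than the full conditions of the corollary.

```python
import numpy as np

# Marginalize the second mode of the single-block attention tensor, as in the corollary:
# Attention(Q, K, V)[i, m] = sum_j Attention_TD(G; Q, K, V)[i, j, m]
N, R = 5, 3
G = np.zeros((R, R, R))
G[np.arange(R), np.arange(R), np.arange(R)] = np.random.rand(R)   # diagonal core from the previous slide
Q, K, V = (np.random.randn(N, R) for _ in range(3))
T = np.einsum('ijm,ai,bj,cm->abc', G, Q, K, V)   # single-block attention tensor
A_reconstructed = T.sum(axis=1)                  # sum over the second index j
```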

Graphical Illustration of the Reconstruction

How to Get a Richer Representation

• Tensor split
• Matrix concat

[Figure: the attention tensor is split into slices T1, T2, …, Tn, which are then concatenated into a matrix.]

Multi-linear Attention by Block-term Decomposition

• It is important to construct a multi-head mechanism for modeling long-range dependencies.
• How can the model be designed with a higher compression ratio?
  • 1) Block-term decomposition (method)
  • 2) Parameter sharing (idea)

Block-term Decomposition

Yunpeng Chen, Xiaojie Jin, Bingyi Kang, et al. Sharing residual units through collective tensor factorization to improve deep neural networks. In IJCAI, pages 635–641, 2018.

Presenter note: Block-term decomposition combines CP decomposition with Tucker decomposition.
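Following the note, a generic sketch of block-term reconstruction as a sum of Tucker blocks (a CP-style sum of Tucker terms); names and shapes are illustrative, not the paper's code.

```python
import numpy as np

def block_term_reconstruct(cores, factors):
    # cores: list of P core tensors of shape (R, R, R); factors: list of P triples (X1, X2, X3).
    # Returns  sum_p  G_p x_1 X1_p x_2 X2_p x_3 X3_p  -- a sum of Tucker blocks (BTD).
    return sum(np.einsum('abc,ia,jb,kc->ijk', G, X1, X2, X3)
               for G, (X1, X2, X3) in zip(cores, factors))
```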

Multi-linear Attention by Block-term Decomposition

In order to construct the multi-head mechanism, Multi-linear attention can be formulated as follows:

$\mathrm{MultiLinearAttention}(\mathcal{G}; Q, K, V) = \mathrm{SplitConcat}\!\left(\frac{1}{h}\,(T_1 + \cdots + T_h)\right) W^{O}$

where $T_j = \mathrm{Attention}_{\mathrm{TD}}(\mathcal{G}_j;\, QW^{q},\, KW^{K},\, VW^{V})$

Parameter Sharing

Multi-Linear Attention

Experimental Results in Language Modeling

Datasets: PTB, WikiText-103, and One-Billion Word.

[Tables of results on these datasets appeared here.]

Experimental Results: WMT-16 English-to-German

Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891, 2016.

Conclusion

• We proposed a novel self-attention method, namely Multi-linear attention.
• The Tensorized Transformer model combines two compression ideas: parameter sharing and low-rank decomposition.
• Our method achieves a higher compression ratio and better experimental results in language modeling.
• With further optimization, the Tensorized Transformer model can be applied to more NLP tasks under limited resources.

Thanks!