


RETURNN overview

Albert Zeyer, zeyer@cs.rwth-aachen.de

January 22nd, 2019

Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University


Outline

What is RETURNN

Training execution guide

Dataset

Network topology / construction, layers, LayerBase, Data

RecLayer, beam search, automatic optimization

Native CUDA LSTM kernel

Other functions

Technological overview, working with the code

Working with RETURNN

Final words, further resources


What is RETURNN

- RWTH extensible training framework for universal recurrent neural networks
- https://github.com/rwth-i6/returnn
- Python, based on TensorFlow / Theano, custom C++ / CUDA code
- both framework and standalone tool
  - framework: embed in existing software (e.g. RASR), or write custom logic; similar: Keras, tf.layers, ...
  - tool: write config, run given tool: rnn.py <config>; similar: Caffe, Tensor2Tensor, ...
- very generic, but some of our main applications are:
  - automatic speech recognition (ASR): hybrid models
  - ASR / translation encoder-decoder-attention models
  - language modeling
- high flexibility & high training speed & high inference speed


History of RETURNN

- 2013-2014 (?): Patrick Doetsch started the project (Theano)
- Jan 2014: Paul Voigtlaender joined
- Jan 2015: Albert Zeyer joined
- Mar 2015: SprintDataset, interface to RASR
- Jul 2015: fast CUDA LSTM kernel
- Dec 2016: start on TensorFlow support (TF 0.12.0; initial working support already in that month)
- May 2017: flexible RecLayer, encoder-decoder attention, beam search


Usage as a tool

- call rnn.py <config> [other options]
- training, forwarding, eval, beam search / decoding, etc. (task option)
- config formats: JSON or Python
- config and/or command line options define (see the sketch below):
  - dataset (for train, cv, eval, ...)
  - batching / chunking
  - model/network topology (independent from search or training), including losses (for eval or training), constraints (L2, ...) and regularization (dropout, ...)
  - where to load/save the model checkpoints
  - training optimizer (SGD, Adam, momentum, ...)
  - learning rate scheduling logic (Newbob, or constant decay, ...)
  - pretraining logic
  - ...
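To make this concrete, here is a minimal sketch of what a Python-format config could look like. The option names and dataset parameters below are illustrative assumptions and may differ between RETURNN versions; treat it as a rough template, not a verified config.

# Minimal, hypothetical training config sketch (option names are assumptions).
use_tensorflow = True
task = "train"                      # or "forward", "search", "eval", ...

# datasets (illustrative; see the Dataset slide)
train = {"class": "HDFDataset", "files": ["train.hdf"]}
dev = {"class": "HDFDataset", "files": ["dev.hdf"]}

# batching
batch_size = 5000
max_seqs = 40

# network topology (independent of training vs. search)
network = {
    "lstm1": {"class": "rec", "unit": "lstm", "n_out": 500, "from": "data"},
    "output": {"class": "softmax", "loss": "ce", "from": "lstm1"},
}

# optimizer and learning rate scheduling
adam = True
learning_rate = 0.001
learning_rate_control = "newbob"

# checkpoints
model = "/tmp/my-model/net"         # hypothetical path
num_epochs = 80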


Training execution guide

1. setup (load config, load/set up TF, ...)
2. construct network for training (TF computation graph)
3. construct optimizer for losses (TF computation graph)
4. randomly initialize params
5. maybe load model checkpoint file and load params
6. train one epoch (epoch defined by dataset)
7. cross validation
8. learning rate scheduling update
9. save model checkpoint
10. repeat with next epoch
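In plain Python pseudocode, this schedule could look roughly like the following; the function names are invented for illustration and do not correspond to the actual RETURNN engine API.

# Hypothetical pseudocode of the training schedule above (not the real engine API).
def run_training(config, dataset_train, dataset_cv):
    network = construct_network(config)           # TF computation graph
    optimizer = construct_optimizer(network)      # optimizer ops for all losses
    init_params_randomly(network)
    if config.get("load"):                        # maybe resume from a checkpoint
        load_checkpoint(network, config["load"])
    for epoch in range(config["start_epoch"], config["num_epochs"] + 1):
        train_one_epoch(network, optimizer, dataset_train, epoch)
        cv_score = evaluate(network, dataset_cv)
        update_learning_rate(config, epoch, cv_score)   # e.g. Newbob-style decay
        save_checkpoint(network, epoch)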


Dataset

- generic interface:
  - data keys (e.g. source, target, ...)
  - for each data key: shape (including variable-length axes) and dtype
  - get next utterance: either end-of-epoch, or values for all data keys, matching the shape/dtype
  - initialize new epoch
  - control sorting/order of the utterances
- existing implementations:
  - ExternSprintDataset: runs RASR in a background process
  - HDFDataset: data stored in HDF files
  - ...
  - MetaDataset: combine multiple datasets
  - create artificial data on-the-fly
  - ...
- one epoch: arbitrarily defined by the dataset, also partition epoch
- new corpora, possible options:
  - convert such that e.g. HDFDataset can read it
  - write your own dataset implementation (see the interface sketch below)
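The generic interface can be pictured as something like the following sketch. It is not the actual RETURNN Dataset base class (whose method names differ); it only illustrates what an implementation has to provide: data keys with shapes/dtypes, epoch initialization, ordering, and per-utterance values.

# Illustrative sketch of the dataset contract (not the real RETURNN Dataset API).
import numpy

class ToyDataset:
    def __init__(self):
        # data keys with (shape, dtype); None marks a variable-length axis
        self.data_keys = {"data": ((None, 40), "float32"),      # e.g. audio features
                          "classes": ((None,), "int32")}        # e.g. label indices
        self.num_seqs = 100

    def init_epoch(self, epoch, seq_order="sorted"):
        """Initialize a new epoch; seq_order controls the order of the utterances."""
        self.epoch = epoch
        self.seq_order = seq_order

    def get_seq(self, seq_idx):
        """Return values for all data keys of one utterance, or None at end of epoch."""
        if seq_idx >= self.num_seqs:
            return None
        length = 50 + seq_idx % 10
        return {"data": numpy.zeros((length, 40), dtype="float32"),
                "classes": numpy.zeros((length,), dtype="int32")}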


Network topology

example:

network = {
    "lstm0_fw": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": 1},
    "lstm0_bw": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": -1},
    "lstm1_fw": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": 1,
                 "from": ["lstm0_fw", "lstm0_bw"]},
    "lstm1_bw": {"class": "rec", "unit": "lstm", "n_out": 500, "direction": -1,
                 "from": ["lstm0_fw", "lstm0_bw"]},
    "output": {"class": "softmax", "loss": "ce", "from": ["lstm1_fw", "lstm1_bw"]}
}

- consists of layers which can be interconnected in any possible way
- defined in the config as a dict
  - key: (str) layer name
  - value: (dict) kwargs for the layer
- some kwargs have special handling:
  - "class": (str) Python layer class
  - "from": (str|list[str]) source(s) as layer name(s)
- defines the model for inference, and also for training


Network construction

example:

network = {
    "conv0": {"class": "conv", "filter_size": [5], "padding": "same", "n_out": 100},
    "lstm1": {"class": "rec", "unit": "lstm", "n_out": 500, "from": ["conv0"]},
    "lstm2": {"class": "rec", "unit": "lstm", "n_out": 500, "from": ["lstm1"]},
    "output": {"class": "softmax", "loss": "ce", "from": ["lstm2"]}
}

- start by constructing the "output" layer
- pop out "class" from the kwargs dict, get the Python layer class (e.g. SoftmaxLayer, RecLayer, ...)
- pop out "from", collect the list of source layers, or recursively construct them; then add the list of layers as "sources" back to the kwargs dict; default "from": use "data" (implicit layer for the input data)
- basically call layer_class(**kwargs) (a condensed sketch of this recursion follows below)
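A condensed sketch of this recursive construction, with helper functions (get_layer_class, make_data_layer) passed in as hypothetical arguments purely for illustration:

# Sketch of the recursive network construction described above (simplified).
def construct_layer(name, net_dict, constructed, get_layer_class, make_data_layer):
    """Recursively construct layer `name` and its sources."""
    if name in constructed:                              # already built, reuse it
        return constructed[name]
    if name == "data":                                   # implicit layer for the input data
        constructed[name] = make_data_layer()
        return constructed[name]
    kwargs = dict(net_dict[name])
    layer_class = get_layer_class(kwargs.pop("class"))   # e.g. "softmax" -> SoftmaxLayer
    sources = kwargs.pop("from", "data")                 # default "from": the input data
    if isinstance(sources, str):
        sources = [sources]
    kwargs["sources"] = [construct_layer(s, net_dict, constructed,
                                         get_layer_class, make_data_layer)
                         for s in sources]
    constructed[name] = layer_class(name=name, **kwargs)  # basically layer_class(**kwargs)
    return constructed[name]

# Construction starts from the "output" layer:
# construct_layer("output", network, {}, get_layer_class, make_data_layer)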


Layer

every layer class:
- base class: LayerBase
- classmethod transform_config_dict: handling of "from", ...
- classmethod get_out_data_from_opts(**kwargs): return output format description (Data instance)
- __init__(**kwargs) must set the output attribute
- layer.output is also an instance of Data, but now also layer.output.placeholder is set (tf.Tensor)
- can define a loss, or multiple losses; any Loss class can be used with any layer (CrossEntropyLoss, ...)
- can define constraints (L2, ...)
- can have a subnetwork, with sub layers (SubnetworkLayer, RecLayer)
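Schematically, a custom layer then looks roughly like this. The layer itself is a made-up example, the module path of LayerBase is an assumption for this RETURNN version, and the surrounding details (registration, Data handling) are simplified; take it as an outline of the hooks listed above rather than copy-paste code.

# Rough outline of a hypothetical custom layer, following the hooks listed above.
from TFNetworkLayer import LayerBase   # assumed RETURNN module name

class ScaleLayer(LayerBase):            # hypothetical example layer
    layer_class = "scale"               # the "class" name used in the config

    def __init__(self, factor=1.0, **kwargs):
        super(ScaleLayer, self).__init__(**kwargs)
        x = self.sources[0].output.placeholder     # tf.Tensor of the source layer
        self.output.placeholder = x * factor       # must set the output tensor

    @classmethod
    def get_out_data_from_opts(cls, sources, **kwargs):
        # output format description (Data instance); here: same format as the input
        return sources[0].output.copy()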


Data

Example TF tensors are of shape:
- (Batch, Time, Feature): audio, or any sequence
- (Time, Batch, Feature): more efficient for RNNs
- (Batch, Time): e.g. class indices for each frame
- (Batch,): e.g. class indices for the whole seq (speaker id or so)
- (Batch, Width, Height, Feature|Channel): image
- (Batch, Feature|Channel, Width|Time, (Height)): more efficient for CNNs

Data objects describe:
- (batch) shape of the tensor
- marking of special axes:
  - batch-dim-axis
  - time-dim-axis (or any spatial axis) (can be of variable length)
  - feature-dim-axis
- (sequence) lengths (of any variable dim axis)
- dtype (float, int, anything that TF supports)
- sparse → dtype is int, values are class indices
- dim: number of classes, or feature dim
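For example, the Data descriptions for a batch of audio feature sequences and for one class index per whole sequence might be constructed roughly as follows; the keyword names follow the fields listed above, but minor details (module name, defaults) may differ between RETURNN versions.

# Rough sketch of Data descriptions for two of the tensor layouts above.
from TFUtil import Data   # RETURNN module (TF backend); name as of this version

# (Batch, Time, Feature): e.g. 40-dim audio features, variable-length time axis
features = Data(name="data", shape=(None, 40), dtype="float32",
                batch_dim_axis=0, time_dim_axis=1)

# (Batch,): one class index per whole sequence (e.g. a speaker id), sparse
speaker = Data(name="speaker", shape=(), dtype="int32",
               sparse=True, dim=1000, time_dim_axis=None)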


Layer conventions

- should be generic / reusable in many different contexts, simple functionality, only apply a single function
  - good examples: LinearLayer, ConvLayer, PoolLayer, RecLayer, DropoutLayer, MergeDimsLayer, SplitDimsLayer, PadLayer, ReduceLayer, ResizeLayer, EvalLayer, DotLayer, ...
  - bad examples: AllophoneStateIdxParserLayer, NeuralTransducerLayer, NoiseEstimationByFirstTFramesLayer, ...
- should accept any possible input, e.g. LinearLayer accepts:
  - (B,T,F), (B,F), (T,B,F), (B,F,W,H), ...
  - multiple sources (common behavior: concatenate in the feature dim; example below)
  - sparse (int, indices) or dense
- automatically transforms the input if it is more efficient for the operation, e.g. RecLayer converts to (T,B,F) and then also outputs as (T,B,F)
- results:
  - quite verbose configs
  - model/training mostly becomes clear just from looking at the config, not at the code
  - all layers well tested, often used; no rotten code
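For example, feeding two layers into one LinearLayer just lists both as sources; their feature dims are concatenated before the linear transform. The layer names here are made up for illustration.

# Two sources into one linear layer: features are concatenated along the feature dim.
network_fragment = {
    "enc_fwd": {"class": "rec", "unit": "lstm", "n_out": 256, "from": "data"},
    "enc_bwd": {"class": "rec", "unit": "lstm", "n_out": 256, "from": "data", "direction": -1},
    "joint": {"class": "linear", "activation": "tanh", "n_out": 512,
              "from": ["enc_fwd", "enc_bwd"]},   # input feature dim: 256 + 256
}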


RecLayer

- besides the predefined units (LSTM, ...), one can define any possible recurrent formula, via a subnetwork, applied frame by frame
- generic wrapper around tf.while_loop
- use "prev:layer" to access a layer's output from the previous frame
- example (LSTM):

network = {
    "lstm": {"class": "rec", "from": "data", "unit": {
        "input": {"class": "copy", "from": ["prev:output", "data:source"]},
        "input_gate": {"class": "linear", "from": "input", "activation": "sigmoid", "n_out": 10},
        "forget_gate": {"class": "linear", "from": "input", "activation": "sigmoid", "n_out": 10},
        "output_gate": {"class": "linear", "from": "input", "activation": "sigmoid", "n_out": 10},
        "cell_in": {"class": "linear", "from": "input", "activation": "tanh", "n_out": 10},
        "c": {"class": "eval", "from": ["input_gate", "cell_in", "forget_gate", "prev:c"],
              "eval": "source(0) * source(1) + source(2) * source(3)"},
        "output": {"class": "eval", "from": ["output_gate", "c"],
                   "eval": "source(0) * source(1)"},
    }},
    "output": {"class": "softmax", "loss": "ce", "from": "lstm"}
}

- automatic optimization
- beam search


Beam search decoding

example (simple encoder-decoder model, no attention):

network = {
    "input": {"class": "rec", "unit": "standardlstm", "n_out": 20},
    "input_last": {"class": "get_last_hidden_state", "from": "input", "n_out": 40},
    "output": {"class": "rec", "from": [], "target": "classes", "unit": {
        "embed": {"class": "linear", "activation": None, "from": "output", "n_out": 10},
        "s": {"class": "rnn_cell", "unit": "standardlstm", "n_out": 20,
              "from": ["prev:embed", "base:input_last"]},
        "p": {"class": "softmax", "from": "s", "target": "classes", "loss": "ce"},
        "output": {"class": "choice", "from": "p", "target": "classes", "beam_size": 8}
    }}
}

ChoiceLayer:

with the search flag enabled:
- select the N best (in this frame) (tf.nn.top_k)
- beam (N hyps) hidden in the batch dim (all other layers just work as usual)
- output shape (per frame): (Batch,), int32

with the search flag disabled, in training:
- returns the ground truth labels
- output shape (per frame): (Batch,), int32
- no dependency on "p" or anything else in the rec layer


Beam search decoding (continued)

- SearchChoices:
  - stores the scores of the current hyps in the beam (beam_scores)
  - ref to the layer where the last choice was taken
  - indices of the source beam hyps (src_beams)
  - a layer can have this; basically only ChoiceLayer so far
  - TFNetwork.get_search_choices(sources, ...)
  - SearchChoices.translate_to_common_search_beam
- RecLayer remembers the src_beams for each frame, the class indices, and the final beam scores
- RecLayer does another inverse tf.while_loop to backtrack the final beam


Automatic optimization during network construction

- mode flags: training, beam search, ...
- TF computation graph construction differs depending on the mode flags
  - e.g. dropout is used in training only
- layer dependencies depend on the mode flags:
  - prev:output is the true label in training, or the predicted label at inference
  - huge effect in RecLayer, e.g. Transformer training fully parallel
  - example (encoder-decoder, no attention), all sub layers end up outside the loop:

network = {
    "input": {"class": "rec", "unit": "standardlstm", "n_out": 20},
    "input_last": {"class": "get_last_hidden_state", "from": "input", "n_out": 40},
    "output": {"class": "rec", "from": [], "target": "classes", "unit": {
        "embed": {"class": "linear", "activation": None, "from": "output", "n_out": 10},
        "s": {"class": "rnn_cell", "unit": "standardlstm", "n_out": 20,
              "from": ["prev:embed", "base:input_last"]},
        "p": {"class": "softmax", "from": "s", "target": "classes", "loss": "ce"},
        "output": {"class": "choice", "from": "p", "target": "classes", "beam_size": 8}
    }}
}


native LSTM, CUDA implementation

- initial CUDA implementation by Paul Voigtlaender for Theano
- Theano NativeOp: general framework to write code for both CPU and GPU
- Theano NativeLstm based on NativeOp (CPU + GPU)
- TF TFNativeOp: porting NativeOp to TF, including NativeLstm
- NativeLstm2: rewrite, faster, more flexible, more options

TF LSTM implementations, can all be used in RecLayer:
- TF official: LSTMCell (StandardLSTM), BasicLSTMCell (BasicLSTM)
- TF contrib: LSTMBlockCell, LSTMBlockFusedCell
- TF contrib: CudnnLSTM (GPU only)
- our natives: NativeLstm, NativeLstm2
- benchmark: ours is about the same speed as CuDNN
  https://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html
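Selecting an implementation is just a matter of the "unit" string in a RecLayer. The unit names below are how they are commonly spelled around this RETURNN version ("standardlstm" also appears in the examples above); check the documentation for the exact current spellings.

# Same layer, different LSTM implementations via the "unit" option.
network = {
    "lstm_native": {"class": "rec", "unit": "nativelstm2", "n_out": 500, "from": "data"},
    # alternatives (same interface): "standardlstm", "basiclstm", "lstmblock", ...
    "output": {"class": "softmax", "loss": "ce", "from": "lstm_native"},
}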


Other functions

- Pretraining
  - supervised
  - different network topology every epoch, e.g. start with one layer, add more and more
  - automatically copies over params from one epoch to the next as far as possible
  - more generic support:
    - write a custom Python function in the config which returns the network topology for each pretrain epoch (see the sketch below)
    - overwrite anything (loss, hyper params) for each epoch
- Learning rate scheduling
  - constant, constant decay
  - warmup
  - decay if the cross-validation score is under some threshold (Newbob, and variants)
  - easy to extend with your own custom logic
- Multi-GPU support: based on Horovod (currently)
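A sketch of such a pretrain function in the config: the function below grows the network depth with the pretrain epoch. The exact hook name and call convention ("construction_algo" taking (idx, net_dict) and returning None when done) should be checked against the RETURNN documentation; this only illustrates the idea.

# Sketch: grow from 1 to 4 LSTM layers over the pretrain epochs (hook name may differ).
def custom_construction_algo(idx, net_dict):
    num_layers = idx + 1
    if num_layers > 4:
        return None                     # pretraining finished, use the full network
    net = {"output": {"class": "softmax", "loss": "ce", "from": "lstm%i" % num_layers}}
    for i in range(1, num_layers + 1):
        src = "data" if i == 1 else "lstm%i" % (i - 1)
        net["lstm%i" % i] = {"class": "rec", "unit": "lstm", "n_out": 500, "from": src}
    return net

pretrain = {"construction_algo": custom_construction_algo}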


Technological overview

- rnn.main is the main entry point (when used as a tool)
- Config: loads the config
- Util: lots of generic utilities (independent from RETURNN / (mostly) TF)
- Dataset, *Dataset: dataset base class + implementations (TF indep.)
- LearningRateControl: learning rate scheduling (TF indep.)
- Pretrain: pretraining logic (TF indep.)
- TFUtil: lots of TF utilities (independent from RETURNN)
- Data: tf.Tensor + meta info
- TFNetwork: network (dict of layers), network construction
- ExternData: describes what the dataset is providing (Data instances)
- LayerBase, *Layer (in TF-prefixed files): layer base class + layers
- Loss, *Loss: loss base class + losses
- NativeOp, TFNativeOp: native CUDA LSTM, and other native ops
- TFUpdater: param update / optimizer logic (SGD, ...)
- TFEngine: logic around everything, ..., TF session.run

(Time for some code browsing?)


Working with the code

- be familiar with Python, TensorFlow, Git, GitHub
- use PyCharm (or an equally powerful IDE):
  - it must automatically check syntax, PEP, common Python conventions, typos, simple errors, correct types, missing documentation, ... (and do not ignore the hints/warnings which PyCharm gives you)
  - easily browse through the code
  - automatically show you the usages of functions, classes, ...
  - automatically infer the types of all variables (both for checking and autocompletion)
- push simple changes directly (those which cannot possibly break something for anyone (else))
- pull requests for more critical changes
- automatic tests (via Travis, on every push or pull request)
- write tests
- write generic layers, losses, ...


Working with RETURNN

- what's the task? clearly define the dataset keys and format; prepare the dataset
  - use ExternSprintDataset, or HDFDataset
  - use TranslationDataset, or LmDataset
  - write your own; see e.g. LibriSpeechDataset
  - see tools/hdf_dump.py
- need a new layer?
  - are you really sure? maybe it is actually equivalent to a convolution, or some other op...
  - for simple equations: just use EvalLayer (see the snippet below)
  - think of the generic functions needed, write generic layers
- need a new loss?
  - are you sure? often it's basically just cross entropy
  - ViaLayerLoss: define a custom gradient
  - AsIsLoss: the layer output itself is the loss
  - again: try to introduce it as generically as possible
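For instance, an element-wise gating of two layers needs no new layer class at all; EvalLayer evaluates a simple formula over its sources, with the same source(i) mechanism used in the LSTM example earlier. The layer names here are made up.

# EvalLayer: simple element-wise formula over its sources, no new layer class needed.
network_fragment = {
    "gate": {"class": "linear", "activation": "sigmoid", "n_out": 256, "from": "enc"},
    "gated": {"class": "eval", "from": ["gate", "enc"],
              "eval": "source(0) * source(1)"},
}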


Working with RETURNN (continued)

- need some other custom logic?
  - can it be part of the network itself? always try to prefer this, even if it requires other work, like a new layer or so
  - can it be formulated as pretrain logic?
  - can it be formulated as param sharing, param import, ...?
  - does the dataset need to provide some additional information? see also MetaDataset
  - again: if introducing a new option/function, try to make it generic
- debugging:
  - useful options: debug_print_layer_output_template, log_batch_size, tf_log_memory_usage, debug_add_check_numerics_on_output, debug_add_check_numerics_ops, ... (many more, see the relevant code; an example snippet follows below)
  - log_verbosity = 5
  - write a test case
  - interactively step through the code, interactive IPython shell (debug_shell_in_runner)
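In a config these debug options are plain global settings; the flag names are the ones listed above, while the boolean values are the obvious guesses (check the code for the exact expected types).

# Typical debugging settings in a config (flag names as listed above).
log_verbosity = 5
debug_print_layer_output_template = True   # print the Data template of every layer
log_batch_size = True
tf_log_memory_usage = True
debug_add_check_numerics_on_output = True  # add numeric checks on layer outputs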


Further resources

- Homepage, code, issues, pull requests, further links: https://github.com/rwth-i6/returnn
- Documentation: http://returnn.readthedocs.io/
- Papers: https://arxiv.org/abs/1608.00895, https://arxiv.org/abs/1805.05225
- Code itself, comments in code, demos, tests, tools (all in the repo)
- Real experiments: https://github.com/rwth-i6/returnn-experiments
- StackOverflow: https://stackoverflow.com/questions/tagged/returnn


Thank you for your attention

Albert Zeyer

<surname>@cs.rwth-aachen.de
