Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Abstract: Transformers have the potential to learn longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a res ...
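The segment-level recurrence mechanism mentioned above can be illustrated with a minimal sketch: hidden states cached from the previous segment are concatenated with the current segment to form the keys and values, so attention can reach beyond the current segment's boundary. This is a simplified single-head illustration of the idea, not the paper's full architecture; the function and weight names here are made up for the example.

```python
import numpy as np

def attend_with_memory(segment, memory, Wq, Wk, Wv):
    """One attention step with segment-level recurrence (sketch).

    Queries come from the current segment, but keys and values span
    the concatenation of the cached previous segment and the current
    one, so the effective context extends past the segment boundary.
    """
    context = np.concatenate([memory, segment], axis=0)  # [mem_len + seg_len, d]
    q = segment @ Wq                                     # [seg_len, d]
    k = context @ Wk                                     # [mem_len + seg_len, d]
    v = context @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # softmax over the extended context
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
memory = rng.standard_normal((8, d))    # hidden states cached from the previous segment
segment = rng.standard_normal((8, d))   # current segment
out = attend_with_memory(segment, memory, Wq, Wk, Wv)
assert out.shape == (8, d)
```

In the paper, the cached states are additionally treated as constants (no gradient flows into them), which keeps training cost bounded while still extending the usable context.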
Generalizing attention length beyond training data length
A long-standing challenge in NLP has been capturing long-term dependencies in sequential data with a neural network. The authors introduce a relative positional encoding that generalizes to attention lengths longer than the maximum attention length seen in the training data.
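One way to see why such generalization is plausible: when the encoding is a fixed (e.g. sinusoidal) function of the relative distance rather than a learned per-position table, it is defined for every distance, including ones never encountered during training. The sketch below illustrates this with standard sinusoidal encodings indexed by relative distance; the function name is made up for the example and this is not the paper's exact parameterization.

```python
import numpy as np

def relative_positional_encoding(max_distance, d_model):
    """Sinusoidal encodings indexed by relative distance.

    Each row is a deterministic function of the distance, not an
    entry in a learned lookup table, so encodings for distances
    beyond any training-time maximum are still well defined.
    """
    positions = np.arange(max_distance)[:, None]  # distances 0 .. max_distance-1
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    enc = np.zeros((max_distance, d_model))
    enc[:, 0::2] = np.sin(positions * div)
    enc[:, 1::2] = np.cos(positions * div)
    return enc

# Encodings computed for a short "training" range and a longer
# "evaluation" range agree on the overlap; the longer range simply
# extends the same function to unseen distances.
short = relative_positional_encoding(128, 64)
long = relative_positional_encoding(512, 64)
assert np.allclose(short, long[:128])
```

A learned absolute-position table, by contrast, has no entries past the training maximum, which is one intuition for why a relative, functional encoding extrapolates better.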
How and why should the authors' relative positional encoding generalize to attention lengths longer than those seen in the training data?