Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

Abstract: Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a res ...
Generalizing attention length beyond training data length
2.8 years ago
Dustin 125

A big challenge in NLP has been capturing long-term dependencies in sequential data with a neural network. The authors introduce a relative positional encoding that they report generalizes to attention lengths longer than the maximum attention length seen during training.

How and why should the authors' relative positional encoding generalize to attention lengths longer than those seen in the training data?
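
To make the question concrete, here is a minimal sketch of how I currently picture it. It assumes the encoding is a fixed sinusoidal function of the relative distance between two positions, with no learned per-position parameters. The helper names `relative_sinusoid` and `relative_table`, the base of 10000, and the interleaved sin/cos layout are my own assumptions for illustration, not the authors' code.

```python
import math

def relative_sinusoid(distance, d_model):
    """Sinusoidal embedding of one relative distance (hypothetical helper)."""
    vec = []
    for k in range(0, d_model, 2):
        # Assumed frequency schedule, borrowed from standard sinusoidal encodings.
        inv_freq = 1.0 / (10000 ** (k / d_model))
        vec.append(math.sin(distance * inv_freq))
        vec.append(math.cos(distance * inv_freq))
    return vec[:d_model]

def relative_table(attn_len, d_model):
    """One embedding per distance 0..attn_len-1; nothing here is learned."""
    return [relative_sinusoid(d, d_model) for d in range(attn_len)]

# A short attention length (as in training) and a longer one (as in evaluation).
train_table = relative_table(attn_len=4, d_model=8)
eval_table = relative_table(attn_len=16, d_model=8)

# Distances already seen keep exactly the same embeddings at the longer length;
# the extra rows are just the same function evaluated at larger distances.
assert eval_table[:4] == train_table
print(len(eval_table), len(eval_table[0]))
```

If this picture is right, the table itself can be extended to any length for free. What I do not understand is why the attention weights, which were only ever trained on the shorter distances, should behave sensibly on the new rows.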

NLP • 514 views
