Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

Abstract: Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a res ...

Generalizing attention length beyond training data length

3.0 years ago

Dustin 125

A big challenge in NLP has been capturing long-term dependencies in sequential data with a neural network. The authors introduce a relative positional encoding that generalizes to attention lengths longer than the maximum attention length seen during training.

How and why should the authors' relative positional encoding generalize to attention lengths longer than those seen in the training data?
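
To make the question concrete, here is a minimal sketch of one property that seems relevant, not the authors' implementation. It assumes a sinusoidal embedding of the relative distance i - j and leaves out the learned projections and bias terms that the paper's attention score also uses. The names `sinusoid`, `position_score`, and `query` are hypothetical, chosen only for illustration. The sketch shows that the position term depends only on the offset between query and key, and that the embedding is defined for any distance, including distances never seen in training.

```python
import math


def sinusoid(distance, d_model=8):
    """Sinusoidal embedding of a relative distance (sines first, then cosines).

    The function is defined for any integer distance, so nothing in it is
    tied to a maximum length chosen at training time.
    """
    inv_freqs = [1.0 / (10000 ** (k / d_model)) for k in range(0, d_model, 2)]
    sines = [math.sin(distance * f) for f in inv_freqs]
    cosines = [math.cos(distance * f) for f in inv_freqs]
    return sines + cosines


def position_score(i, j, query):
    """Position-dependent part of an attention score (simplified).

    Assumption: a plain dot product between a query vector and the embedding
    of the offset i - j, with no learned projection or bias.
    """
    rel = sinusoid(i - j, len(query))
    return sum(q * r for q, r in zip(query, rel))


# Hypothetical fixed query vector, used only to compare scores.
query = [0.1 * (n + 1) for n in range(8)]

# Same offset (3) at very different absolute positions.
near_start = position_score(10, 7, query)
far_along = position_score(1010, 1007, query)

# An offset far beyond a (hypothetical) training attention length.
long_range = sinusoid(5000)
```

In this sketch `near_start` and `far_along` are equal because only the offset enters the score, and `long_range` is an ordinary bounded vector. Whether the learned weights actually make good use of such unseen distances is the open part of the question.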

NLP • 574 views

last updated by Admin User 1 • posted by Dustin

We appreciate the help!
