Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Abstract: Transformers have the potential to learn longer-term dependencies, but are
limited by a fixed-length context in the setting of language modeling. We
propose a novel neural architecture, Transformer-XL, that enables learning
dependency beyond a fixed length without disrupting temporal coherence. It
consists of a segment-level recurrence mechanism and a novel positional
encoding scheme. Our method not only enables capturing longer-term dependency,
but also resolves the context fragmentation problem. As a res ...
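The segment-level recurrence described above can be illustrated with a toy sketch: hidden states computed for the previous segment are cached (with gradients stopped) and prepended as extra context when processing the current segment, so information can flow beyond a single fixed-length window. The function and parameter names below are illustrative stand-ins, not from the paper's actual implementation, and a single tanh projection stands in for a full Transformer layer.

```python
import numpy as np

def segment_recurrence(segments, d_model=4, mem_len=3, seed=0):
    """Toy sketch of Transformer-XL-style segment-level recurrence.

    `segments` is a list of arrays of shape (seg_len, d_model).
    Hidden states from each segment are cached and reused as extra
    context for the next segment; in training, no gradient would flow
    through the cache (here we just copy it).
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d_model, d_model)) * 0.1  # stand-in for one layer
    memory = np.zeros((0, d_model))                    # cache starts empty
    outputs = []
    for seg in segments:
        # Extended context: cached states from the previous segment + current input.
        context = np.concatenate([memory, seg], axis=0)
        h = np.tanh(context @ W)                       # stand-in "layer" computation
        h_cur = h[-seg.shape[0]:]                      # keep current-segment outputs only
        outputs.append(h_cur)
        # Cache the most recent mem_len states for the next segment.
        memory = h_cur[-mem_len:].copy()
    return outputs
```

Because each segment attends over the cached states as well as its own positions, the effective context length grows with depth instead of being capped at the segment length, which is what lets the model avoid the context fragmentation the abstract mentions.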