2 min read · from Machine Learning

Is Attention sink without Positional Encoding unavoidable? [D]

Our take

In exploring the nuances of Transformer models, a critical question arises: when Positional Encoding (PE) is removed, is an attention sink unavoidable? The issue manifests as vertical hot lines in attention heatmaps, suggesting that every query vector attends to the same key tokens. The phenomenon occurs in both self- and cross-attention configurations.

TL;DR: As soon as I remove Positional Encoding (PE) from Self or Cross-attention, I start seeing vertical hot lines in attention heatmaps. Is there any way to make a model have query-conditioned attention without PE?

So, I've been trying to pre-train a couple of small, tinkering-level Transformer-based models: an Encoder-Decoder model and a cross-attention-memory-only model (basically, removing the FFNs and using cross-attended vectors as memory banks instead). But every time I try to train cross-attention, I see vertical lines, as shown in the attached image, which I'm guessing means every query vector is attending to the same key tokens. This happens when I don't use RoPE or any other PE during cross-attention. I start to see some diagonals when I add PE, though I don't think I should need it during cross-attention, since the queries and keys are representations of different data.
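To make the symptom measurable, here's a minimal NumPy sketch (toy random tensors, not weights from any actual model) of scaled dot-product cross-attention without PE, plus a simple statistic for how "vertical" the attention map is:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K):
    """Scaled dot-product attention weights, no positional encoding."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Toy stand-ins for query/key projections of two different sequences
Q = rng.normal(size=(8, 16))   # 8 query tokens, dim 16
K = rng.normal(size=(10, 16))  # 10 key tokens

A = attention(Q, K)            # shape (8, 10); each row sums to 1

# A "vertical hot line" means one column dominates for every query.
# The average mass each key receives makes that quantifiable:
col_mass = A.mean(axis=0)      # attention each key receives on average
sink_score = col_mass.max()    # close to 1.0 => all queries hit one key
```

Tracking `sink_score` during training (instead of eyeballing heatmaps) makes it easier to compare runs with and without PE.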

And this shows up in simple Causal Self-attention too, as soon as I remove PE.
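One way to see why diagonals can't appear without PE: attention is then permutation-equivariant in the keys, so the weights depend only on token content, never on key position. A tiny NumPy check on toy random tensors (again, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(1)

def attn(Q, K):
    """Scaled dot-product attention weights over content alone."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s -= s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

Q = rng.normal(size=(5, 8))    # toy queries
K = rng.normal(size=(6, 8))    # toy keys
perm = rng.permutation(6)      # shuffle the key order

A = attn(Q, K)
A_perm = attn(Q, K[perm])
# Shuffling the keys just shuffles the attention columns identically:
# the weights carry no information about key positions at all.
```

A diagonal pattern is a statement about positions ("query i attends to key i"), so without PE the model can only produce it if position happens to be encoded in the content itself.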

My question is: how do I force the model to attend to key tokens dynamically, based on the query token?

I've already tried regularizing so that attention is more spread out. It does spread the attention, but the pattern stays vertical lines: no diagonals or any other structure.

submitted by /u/PreetamSing


Tagged with

#Attention sink#Positional Encoding#query-conditioned attention