Looking for arXiv endorsement (cs.CV) to post my ViT positional embeddings paper [R]

Our take

Hello everyone, I am seeking an endorsement for my paper submission to arXiv in the cs.CV (computer vision) or cs.LG category. Titled "Positional Encodings in Vision Transformers: A Geometric Account of Spatial Organization and Robustness," this work explores how various positional encoding schemes influence the representations within Vision Transformers. Through innovative metrics and controlled interventions, I demonstrate how spatial organization is affected by these embeddings. For further details, please refer to the full paper [here](https://github.com/mahmoud-mannes/neurips-geometry-paper/blob/main/p

The request for arXiv endorsement by /u/Octacinth represents a common yet often overlooked hurdle in the academic AI research pipeline. For researchers working outside traditional institutional affiliations, the endorsement requirement can feel like a gatekeeping mechanism that slows the dissemination of potentially valuable work. Yet what makes this particular request noteworthy is not the procedural challenge it describes, but rather the substance of the research itself, which tackles a fundamental question in modern computer vision: how do Vision Transformers understand space?

The paper, titled "Positional Encodings in Vision Transformers: A Geometric Account of Spatial Organization and Robustness," addresses a question that gets less attention than it deserves. While much of the Vision Transformer literature focuses on architectural variations or training recipes, this work examines how different positional encoding schemes (learned absolute, sinusoidal, and rotary) shape the internal representations that models develop. It introduces a metric called Spatial Similarity Distance Correlation (SSDC) to quantify spatial structure in token representations, adding a tool for understanding transformer internals. This kind of work matters because it moves the field beyond performance benchmarking toward mechanistic understanding, a shift that becomes more important as AI systems are integrated into high-stakes applications.

The findings described in the summary matter beyond interpretability research. The paper reports that Vision Transformers develop non-trivial spatial structure even without positional embeddings, but that this structure is content-driven and collapses under token permutation. When positional encodings are introduced, models shift toward an index-anchored spatial organization that persists even when visual content is disrupted. Most notably, robustness to distributional shifts such as JPEG compression and Gaussian blur is primarily associated with the presence of a stable positional reference frame, and it correlates with the paper's Spatial Similarity Distance Correlation (SSDC) metric as measured under intervention. For practitioners in fields like healthcare AI, where models must generalize across varied imaging conditions, these results suggest that the choice of positional encoding is a robustness decision and not only an accuracy one. Similarly, anyone working with data pipelines where missing data creates distribution shifts can appreciate the importance of understanding what makes models robust.

The broader context is worth considering. As AI tools become more accessible through platforms that simplify complex workflows, the gap between implementation and understanding widens. Work like this paper helps close that gap by revealing the dynamics that determine whether a model generalizes or fails. The arXiv endorsement system exists to provide a basic quality check, and a paper that introduces a new metric and reports careful experimental methodology (per the summary: ImageNet-100 experiments with ViT-S models, multiple random seeds, and full statistical reporting) is the kind of contribution the system should facilitate. The open question is whether the AI research community can develop more efficient pathways for sharing such work while maintaining meaningful quality standards. A related one is how tools that promise to simplify data management will account for the representational complexities this paper describes.

Hi everyone,

I'm looking for someone to endorse me for arXiv submission in cs.CV (computer vision) or cs.LG. I have a completed paper and want to upload it as a preprint.

About the paper:

Title: Positional Encodings in Vision Transformers: A Geometric Account of Spatial Organization and Robustness

Summary: This paper investigates how different positional encoding schemes (learned absolute, sinusoidal, and rotary) shape the internal representations of Vision Transformers. We introduce a metric called Spatial Similarity Distance Correlation (SSDC) to quantify spatial structure in token representations. Using controlled interventions (random permutation at inference, random permutation training, and positional magnitude scaling), we show that:

- ViTs develop non‑trivial spatial structure even without positional embeddings, but this structure is content‑driven and collapses under token permutation.
- All positional encodings shift models toward index‑anchored spatial organization that persists under content disruption.
- Robustness to distributional shifts (JPEG compression, Gaussian blur) is primarily associated with the presence of a stable positional reference frame and correlates directly with SSDC as measured under intervention.
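
For readers who want a concrete picture of what a similarity-distance correlation can look like, here is a minimal sketch. It is an illustration under assumptions, not the SSDC definition from the paper (the exact formula is in the PDF, not in this post). The sketch assumes cosine similarity between token representations, Euclidean distance between patch positions on the grid, and a Pearson correlation over all unordered token pairs. The function names and the synthetic "tokens" are hypothetical; no real ViT is involved.

```python
import numpy as np

def grid_distances(h, w):
    """Pairwise Euclidean distances between patch positions on an h x w grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)  # (N, N)

def cosine_similarity_matrix(tokens):
    """Cosine similarity between every pair of token vectors; tokens is (N, D)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return normed @ normed.T

def similarity_distance_correlation(tokens, h, w):
    """Pearson correlation between token similarity and grid distance.

    Assumed definition for illustration only. A strongly negative value means
    tokens at nearby grid indices have more similar representations.
    """
    iu = np.triu_indices(h * w, k=1)  # each unordered pair once
    sim = cosine_similarity_matrix(tokens)[iu]
    dist = grid_distances(h, w)[iu]
    return float(np.corrcoef(sim, dist)[0, 1])

# Synthetic stand-in for token representations: each token is a vector of
# Gaussian bumps centred on grid positions, so nearby tokens look alike.
h, w, sigma = 8, 8, 2.0
tokens = np.exp(-grid_distances(h, w) ** 2 / (2 * sigma ** 2))  # (64, 64)

structured = similarity_distance_correlation(tokens, h, w)

# Permutation intervention (toy version): shuffle which representation sits
# at which grid index, then measure against the index positions again.
rng = np.random.default_rng(0)
shuffled = similarity_distance_correlation(tokens[rng.permutation(h * w)], h, w)

print(f"structured: {structured:.3f}, after permutation: {shuffled:.3f}")
```

In this toy setup the structure lives entirely in the token content, so scoring the shuffled tokens against their new grid indices should drive the correlation toward zero. That loosely mirrors the "content-driven" case in the first bullet. An index-anchored model would instead keep a strong correlation with index position after permutation.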

The paper includes experiments on ImageNet‑100 with ViT‑S models, multiple random seeds, and full statistical reporting.

PDF available at: https://github.com/mahmoud-mannes/neurips-geometry-paper/blob/main/paper/main.pdf

submitted by /u/Octacinth