Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]

Our take

In the evolving landscape of vision-language models (VLMs), a pertinent question arises: do leading models still rely on fixed-patch Vision Transformers (ViTs) for their vision capabilities? The research community has proposed more efficient tokenization methods, but it is unclear whether the big players are adopting them. Marginal gains, pipelines built around a fixed token count per image, and poorly understood scaling laws for input-adaptive patching may all deter the shift. For a deeper look at the challenges of productionizing AI, see "Six Sessions at QCon AI Boston 2026."

The discussion around VLMs and their use of fixed-patch ViTs marks a crossroads in how these models handle visual data. Researchers keep proposing tokenization strategies that appear more efficient and effective, yet the question remains: are leading models adopting them, or are they staying with fixed-patch designs? The question ties into ongoing conversations about AI deployment, such as those in "Six Sessions at QCon AI Boston 2026 That Take Productionizing AI Seriously," and about the implications of automation, as discussed in "Presentation: The Ironies of A^2 I^2."

The post raises several potential barriers to adopting more dynamic approaches. First, the gains from switching to non-fixed-patch methods may be too marginal to justify reworking established workflows. Second, many pipelines need to know the number of tokens per image upfront for efficiency, which complicates adaptive tokenization, where that number varies with the input. Third, scaling laws for input-adaptive patching are not yet well understood, so companies may prefer to stay with what they know rather than bet on unproven strategies.
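To make the point about pipelines needing a fixed number of tokens per image concrete, here is a minimal sketch in Python. The 224-pixel input and 16-pixel patch are common ViT defaults used purely for illustration, and the function names are hypothetical; none of this reflects a specific production pipeline.

```python
import math
from typing import Sequence


def fixed_patch_tokens(image_size: int = 224, patch_size: int = 16) -> int:
    """Token count when every image is resized to a square and cut into a regular grid."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def native_resolution_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Token count when the image keeps its own resolution (padded up to a patch multiple)."""
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)


def padding_waste(token_counts: Sequence[int]) -> float:
    """Fraction of slots that are padding if a batch is padded to its longest sequence."""
    longest = max(token_counts)
    total_slots = longest * len(token_counts)
    return 1.0 - sum(token_counts) / total_slots
```

With these assumptions, `fixed_patch_tokens()` gives 196 tokens for every image, while `native_resolution_tokens(480, 640)` gives 1200. Once counts differ per image, a naively padded batch wastes compute, which is one plausible reason a pipeline might prefer a fixed count.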

This situation underscores a broader theme in the AI landscape: the tension between innovation and risk management. As the capabilities of models expand, organizations must weigh the benefits of adopting cutting-edge techniques against the stability of established practices. The reluctance to abandon fixed-patch methods could be indicative of a larger fear of disrupting existing systems, especially in environments where reliability is paramount. This dynamic is essential to consider as we navigate an era defined by rapid technological advancement and the need for continuous adaptation.

Looking ahead, it will be interesting to see how the industry responds to these challenges. Will we see a gradual shift toward dynamic tokenization as researchers continue to validate its effectiveness? Or will fixed-patch systems persist, limiting more nuanced and effective visual data processing? The answers will affect not only the development of VLMs but possibly also the broader field of AI, shaping how we think about data management, user interaction, and the capabilities of machine learning models.

As these developments unfold, practitioners and stakeholders in the AI community should stay engaged with ongoing research so that they can adopt the most effective tools available. The conversation about tokenization and model architecture is just one facet of a much larger narrative about the future of AI technology, one that invites exploration, innovation, and ultimately transformation.

The research community has, for some time now, provided seemingly more efficient and effective tokenizations for vision. Do we have any hint as to whether non-fixed-patch tokenization is being applied in the big players' models?

I imagine not, and I'm trying to think why:

- marginal gains?

- pipelines needing a fixed number of tokens per image upfront for efficiency reasons (or even harder limitations)?

- scaling laws are not well understood for input-adaptive patching, so big players do not bet on it?

Or am I simply wrong, and under the hood all the big players are already doing dynamic tokenization for vision?
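For readers unfamiliar with the term, here is a minimal sketch of what input-adaptive patching can look like: a quadtree that keeps splitting a region only while its pixel variance is high. This is one illustrative scheme chosen for simplicity, not a method attributed to any paper or production model. The function names, the variance criterion, and the threshold are all assumptions, and the image is assumed to be square with a power-of-two side.

```python
from typing import List, Optional, Tuple

Patch = Tuple[int, int, int]  # (top, left, size) of a square patch


def region_variance(img: List[List[float]], top: int, left: int, size: int) -> float:
    """Population variance of the pixel values inside one square region."""
    vals = [img[r][c] for r in range(top, top + size) for c in range(left, left + size)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def adaptive_patches(
    img: List[List[float]],
    top: int = 0,
    left: int = 0,
    size: Optional[int] = None,
    min_size: int = 2,
    threshold: float = 0.01,
) -> List[Patch]:
    """Split a region into four quadrants only while it is 'busy' (high variance)."""
    if size is None:
        size = len(img)  # assumption: square image, power-of-two side
    if size <= min_size or region_variance(img, top, left, size) <= threshold:
        return [(top, left, size)]  # flat or already minimal: keep as one token
    half = size // 2
    patches: List[Patch] = []
    for dt in (0, half):
        for dl in (0, half):
            patches += adaptive_patches(img, top + dt, left + dl, half, min_size, threshold)
    return patches
```

On an 8x8 image that is flat except for one bright pixel in the corner, this yields 7 patches instead of the 16 that a fixed 2x2 grid would produce. The count also changes from image to image, which is exactly the "fixed number of tokens upfront" difficulty raised in the list above.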

submitted by /u/howtorewriteaname