
ROCm with PyTorch and PyTorch Lightning seems to still suck for research [D]

Our take

In a recent exploration of ROCm with PyTorch and PyTorch Lightning, user experiences reveal ongoing challenges in the research community. After procuring an RX 7900XTX reference card, one user found that moving a working codebase from an RTX 3090 to ROCm surfaced persistent issues, particularly NaN errors while training flow matching models. While well-known scripts such as nanoGPT ran smoothly, the fragility of ROCm on less conventional codebases remains a concern. For those facing similar difficulties, our article "Struggling with Overfitting on Medical Imaging Task" may offer valuable insights.

The recent experiences shared by users on the ROCm platform, particularly with PyTorch and PyTorch Lightning, highlight challenges with broader implications for the machine learning community. The account of a user who moved from NVIDIA's CUDA to ROCm with an RX 7900XTX graphics card points to persistent stability and compatibility issues. Despite ROCm's promise of high performance on AMD hardware, the transition produced pervasive NaN errors in previously working training code. This raises essential questions about the readiness of ROCm for serious research applications, particularly when compared with the more mature CUDA ecosystem. As researchers increasingly adopt AI technologies and explore new approaches, the reliability of their tools will play a pivotal role in their success.

This discussion aligns with ongoing themes in our publication, such as the challenges covered in [Struggling with Overfitting on Medical Imaging Task [D]](/post/struggling-with-overfitting-on-medical-imaging-task-d-cmp7l1qta03e3jwhporh5rzf4) and the exploration of AI's potential for financial analysis in How to Analyze Company Earnings with AI in 2026. The common thread across these discussions is the need for robust, reliable tools that support researchers without adding unnecessary complexity or uncertainty.

The user's experience with ROCm suggests that while the framework has made strides, it still lags in adaptability and user-friendliness for less common codebases. That standard scripts, like those for nanoGPT, perform well indicates progress in certain areas. However, the fragility exhibited when running more customized models could deter researchers who depend on specific architectures or methodologies, and it poses a barrier to adoption for those without the resources or time to troubleshoot persistent errors. As the AI landscape evolves, ensuring that tools are not only innovative but also practical and reliable will be crucial to fostering a more inclusive research environment.

The implications of these technological hurdles extend beyond individual frustration; they bear on the competitive landscape between AMD and NVIDIA in machine learning. As researchers navigate these challenges, the perception of ROCm will influence the broader community's willingness to embrace AMD's offerings. It raises the question of whether ROCm can evolve to meet researchers' needs or whether it will remain a secondary option in a field still dominated by CUDA and NVIDIA. As we continue to explore the intersection of technology and research, it will be worth watching how ROCm addresses these challenges and whether it can cultivate a more robust ecosystem.

Looking ahead, the future of ROCm and its impact on the machine learning landscape remains to be seen. Will AMD invest the resources needed to improve ROCm's stability and compatibility across diverse applications, or will it continue to fall short of its competitors? As researchers seek tools that empower their work, developments in ROCm could serve as a barometer for broader trends in AI tooling. The key takeaway is that the challenges faced today may well shape the tools of tomorrow, which makes it essential for both developers and users to stay engaged in this ongoing dialogue.

So I asked about people's experiences with ROCm in a post a few weeks or so ago

https://www.reddit.com/r/MachineLearning/comments/1t6cng3/rocm_status_in_mid_2026_d/

I actually went and procured an RX 7900XTX reference version to give it a try

My discovery is that it kind of still sucks

I have a small codebase for training flow matching models, which runs fine on my RTX 3090s. But the moment I ported it across to ROCm, it was NaNs absolutely everywhere. The code was kept identical, apart from altering the pip environment to point to torch2.12 with ROCm7.2 instead of CUDA.
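As a general debugging aid (not something the poster describes using), one way to narrow down where NaNs first appear is to register forward hooks that check every module's output, and to wrap a training step in torch.autograd.detect_anomaly. A minimal sketch; the model, compute_loss, and batch names are placeholders rather than anything from the original codebase:

```python
import torch

def install_nan_hooks(model: torch.nn.Module):
    """Raise as soon as any module produces a non-finite output, naming the module."""
    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite output first seen in module '{name}'")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Optional: make the backward pass raise at the op that produced a NaN gradient.
# detect_anomaly is slow, so enable it only while debugging.
# with torch.autograd.detect_anomaly():
#     loss = compute_loss(model, batch)   # placeholder for the actual training step
#     loss.backward()
```

This at least tells you whether the NaNs originate in a specific layer or kernel rather than in the loss or optimizer step.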

Trying everything from switching between bf16 and fp32 to tweaking various environment variables yielded nothing.
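For context, the bf16/fp32 switch in a PyTorch Lightning setup is usually just the Trainer's precision argument; a minimal sketch assuming Lightning 2.x, where MyLightningModule and my_dataloader are hypothetical stand-ins for the poster's flow matching code:

```python
import lightning.pytorch as pl

# Hypothetical module and dataloader standing in for the flow matching codebase.
model = MyLightningModule()

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16-mixed",   # swap to "32-true" to rule out reduced-precision overflow
    max_steps=1000,
)
trainer.fit(model, train_dataloaders=my_dataloader)
```

If NaNs persist even at full fp32 precision with identical code, that points more toward the backend than toward the usual mixed-precision overflow suspects.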

Unless there's some trick I'm missing, I get the feeling that ROCm is still seriously behind.

I tried running the nanoGPT training script, which ran perfectly

My intuition is that the ROCm people have probably tested their stack on established, well-known codebases, but it's still remarkably fragile on even slightly uncommon code.

submitted by /u/QuantumQuokka


Tagged with

#ROCm, #PyTorch, #PyTorch Lightning, #training flow matching models, #NaNs, #RX 7900XTX, #bf16, #fp32, #nanoGPT, #CUDA, #environment variables, #torch2.12, #codebase, #pip environment