2 min read · from Machine Learning

[D] Why does it seem like open source materials on ML are incomplete? this is not enough...

Our take

Many practitioners in the machine learning community share a common frustration: the open source materials often feel incomplete or insufficient for a deep understanding of complex topics. Users frequently encounter repositories lacking essential code, critical training details, and comprehensive documentation. This leaves them with a sense that open source contributions in ML prioritize basic functionality over thorough reproducibility and understanding. It raises questions about whether this trend stems from competitive pressures, rapid advancements, or a culture that values publications over true educational resources.

Many times when I try to deeply understand a topic in machine learning — whether it's a new architecture, a quantization method, a full training pipeline, or simply reproducing someone’s experiment — I find that the available open source materials are clearly insufficient. Often I notice:

- Repositories lack the complete code needed to reproduce the results
- Missing critical training details (datasets, hyperparameters, preprocessing steps, random seeds, etc.)
- Documentation is superficial or outdated
- Blog posts and tutorials only show the "happy path", while real edge cases, bugs, and production nuances are completely ignored
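To make the "missing training details" point concrete, here is a minimal sketch of the kind of run record that would cover the items listed above (dataset, hyperparameters, preprocessing steps, random seed). It does not come from any particular repository: the `RunManifest` schema, its field names, and the helper functions are all hypothetical, and only the Python standard library is used.

```python
import hashlib
import json
import platform
import random
import sys
from dataclasses import asdict, dataclass, field


@dataclass
class RunManifest:
    """Hypothetical record of what a reader needs to rerun one experiment."""
    dataset_name: str
    dataset_sha256: str          # fingerprint of the exact data that was used
    hyperparameters: dict
    preprocessing_steps: list    # ordered, human-readable step names
    seed: int
    python_version: str = field(default_factory=lambda: sys.version.split()[0])
    os_platform: str = field(default_factory=platform.platform)


def build_manifest(dataset_name, dataset_bytes, hyperparameters,
                   preprocessing_steps, seed):
    """Seed the stdlib RNG and capture the run settings in one place."""
    # Only Python's built-in RNG is seeded here; a real pipeline would
    # also need to seed whatever numeric or ML libraries it uses.
    random.seed(seed)
    return RunManifest(
        dataset_name=dataset_name,
        dataset_sha256=hashlib.sha256(dataset_bytes).hexdigest(),
        hyperparameters=dict(hyperparameters),
        preprocessing_steps=list(preprocessing_steps),
        seed=seed,
    )


def to_json(manifest):
    """Serialize the manifest so it can be committed next to the code."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


if __name__ == "__main__":
    manifest = build_manifest(
        dataset_name="toy-dataset",            # placeholder name
        dataset_bytes=b"example,rows\n1,2\n",  # placeholder data
        hyperparameters={"learning_rate": 3e-4, "batch_size": 32},
        preprocessing_steps=["lowercase", "strip_whitespace"],
        seed=1234,
    )
    print(to_json(manifest))
```

A file like this does not make an experiment reproducible by itself, but committing it alongside the weights would answer most of the "which settings did they actually use?" questions.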

This creates the feeling that open source in ML is mostly just "weights + basic inference code", rather than fully reproducible science or engineering.

The only big exception I see is Andrej Karpathy: his repositories (like nanoGPT, llm.c, etc.) and YouTube lectures are exceptionally clean, educational, and go much deeper. But even he mostly focuses on one specific direction (LLM training from scratch and neural net fundamentals).

What bothers me even more is that I don’t just want the code; I want to understand the logic and reasoning behind the decisions: why certain choices were made, what trade-offs were considered, what failed attempts happened along the way, and how the authors actually thought about the problem.

Does anyone else feel the same way? In your opinion, what’s the main reason behind this widespread issue?

- Do companies and researchers deliberately hide important details (to protect competitive advantage or because the code is messy)?
- Does everything move so fast that no one has time (or incentive) to properly document their thought process?
- Is it the culture in the community: publishing for citations, hype, and leaderboard scores rather than true reproducibility and deep understanding?
- Or is it simply that “doing it properly (clean code + full reasoning) is hard, time-consuming, and expensive”?

I’d really appreciate opinions from people who have been in the field for a while, especially those working in industry or research. What’s your take on the underlying mindset and motivations? (Translated with AI; English is not my native language.)

submitted by /u/Kalli_animation
