[D] Why does it seem like open source materials on ML are incomplete? this is not enough...
Many times when I try to deeply understand a topic in machine learning — whether it's a new architecture, a quantization method, a full training pipeline, or simply reproducing someone’s experiment — I find that the available open source materials are clearly insufficient. Often I notice:
- Repositories lack complete code needed to reproduce the results
- Missing critical training details (datasets, hyperparameters, preprocessing steps, random seeds, etc.)
- Documentation is superficial or outdated
- Blog posts and tutorials only show the "happy path", while real edge cases, bugs, and production nuances are completely ignored
This creates the feeling that open source in ML is mostly just "weights + basic inference code", rather than fully reproducible science or engineering. The only big exception I see is Andrej Karpathy: his repositories (like nanoGPT, llm.c, etc.) and YouTube lectures are exceptionally clean, educational, and go much deeper. But even he mostly focuses on one specific direction (LLM training from scratch and neural net fundamentals).
What bothers me even more is that I don’t just want the code. I want to understand the logic and reasoning behind the decisions: why certain choices were made, what trade-offs were considered, what failed attempts happened along the way, and how the authors actually thought about the problem.
Does anyone else feel the same way? In your opinion, what’s the main reason behind this widespread issue?
- Do companies and researchers deliberately hide important details (to protect competitive advantage or because the code is messy)?
- Does everything move so fast that no one has time (or incentive) to properly document their thought process?
- Is it the culture in the community: publishing for citations, hype, and leaderboard scores rather than true reproducibility and deep understanding?
- Or is it simply that “doing it properly (clean code + full reasoning) is hard, time-consuming, and expensive”?
I’d really appreciate opinions from people who have been in the field for a while, especially those working in industry or research. What’s your take on the underlying mindset and motivations? (Translated with AI; English is not my native language.)