LLMs for data pipelines without losing control (API → DuckDB in ~10 mins)

Hey folks, I’ve been doing data engineering long enough to have believed that “real” pipelines meant writing every parser by hand, dealing with pagination myself, and debugging nested JSON until it finally stopped exploding. I’ve also been pretty skeptical of the “just prompt it” approach.

Lately, though, I’ve been experimenting with a workflow that feels less like hype and more like controlled engineering: instead of starting with a blank [...]

The LLM does the mechanical work; I stay in charge of structure + validation.

We’re doing a live session on Feb 17 to test this in real time, going from empty folder → GitHub commits dashboard (DuckDB + dlt + marimo) and walking through the full loop live. If you’ve got an annoying API (weird pagination, nested structures, bad docs), bring it; that’s more interesting than the happy path.

We wrote up the full workflow with examples here.

Curious: what’s the dealbreaker for you using LLMs in pipelines?
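To make the "LLM does the mechanical work, I stay in charge of structure + validation" split concrete, here is a minimal sketch of what the human-owned part could look like. It is not the workflow from the write-up: the schema, the `fake_fetch` stand-in for an API client, and every function name below are hypothetical, and it uses only the standard library rather than dlt or DuckDB. The idea is that pagination and flattening are the mechanical pieces you might let an LLM draft, while `EXPECTED_SCHEMA` and `validate` are the gate you write and review yourself before anything is loaded.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator

# Hypothetical contract, written by hand: columns and types expected
# after flattening. Rows that do not match are rejected, not loaded.
EXPECTED_SCHEMA = {"sha": str, "author__name": str, "author__date": str}


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with '__'."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def paginate(fetch_page: Callable[[int], list], start: int = 1) -> Iterator[dict]:
    """Yield records page by page until the source returns an empty page."""
    page = start
    while True:
        rows = fetch_page(page)
        if not rows:
            return
        yield from rows
        page += 1


def validate(row: dict[str, Any], schema: dict[str, type] = EXPECTED_SCHEMA) -> list[str]:
    """Return a list of problems with the row; an empty list means it passes."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    return problems


def run(fetch_page: Callable[[int], list]) -> tuple[list, list]:
    """Split fetched records into rows that pass the contract and rows that fail."""
    good, rejected = [], []
    for record in paginate(fetch_page):
        row = flatten(record)
        problems = validate(row)
        if problems:
            rejected.append((row, problems))
        else:
            good.append(row)
    return good, rejected


# Stand-in for a real paginated API; the data is made up for illustration.
FAKE_PAGES = {
    1: [{"sha": "abc", "author": {"name": "Ada", "date": "2024-01-01"}}],
    2: [{"sha": "def", "author": {"name": None, "date": "2024-01-02"}}],
}


def fake_fetch(page: int) -> list:
    return FAKE_PAGES.get(page, [])


good, rejected = run(fake_fetch)
for row, problems in rejected:
    print("rejected", row["sha"], problems)
```

Only the rows in `good` would go on to the loading step. Keeping the rejected rows together with their reasons gives you something to inspect when the API returns a shape nobody planned for.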
Related Articles
- Thoughts on how to validate Data Insights while leveraging LLMs: I wrote up a blog post on a framework for thinking about the fact that, even though we can use LLMs to generate code to DO Data Science, we need additional tools to verify that the inferences generated are valid. I'm sure a lot of other members of this subreddit are having similar thoughts and concerns, so I am sharing in case it helps process how to work with LLMs. Maybe this is obvious, but I'm trying to write more to help my own thinking. Let me know if you disagree! "Data Science is a multiplicative process, not an additive one": I’ve worked in Statistics, Data Science, and Machine Learning for 12 years, and like most other Data Scientists I’ve been thinking about how LLMs impact my workflow and my career. The more my job becomes asking an AI to accomplish tasks, the more I worry about getting called in to see The Bobs. I’ve been struggling with how to leverage these tools, which are certainly increasing my capabilities and productivity, to produce more output while also verifying the result. And I think I’ve figured out a framework to think about it. Like a logical AND operation, Data Science is a multiplicative process; the output is only valid if all the input steps are also valid. I think this separates Data Science from other software-dependent tasks. submitted by /u/millsGT49
- [P] Unix philosophy for ML pipelines: modular, swappable stages with typed contracts. We built an open-source prototype that applies Unix philosophy to retrieval pipelines. Each stage (PII redaction, chunking, dedup, embeddings, eval) is its own plugin with a typed contract, like pipes between Unix tools. The motivation: we swapped a chunker and retrieval got worse, but could not isolate whether it was the chunking or something breaking downstream. With each stage independently swappable, you change one option, re-run eval, and compare precision/recall directly.

  ```python
  Feature("docs__pii_redacted__chunked__deduped__embedded__evaluated", options={
      "redaction_method": "presidio",
      "chunking_method": "sentence",
      "embedding_method": "tfidf",
  })
  ```

  Each `__` is a stage boundary. Swap any piece, the rest stays the same. Still a prototype, not production. Looking for feedback on whether the design assumptions hold up. Repo: [https://github.com/mloda-ai/rag_integration](https://github.com/mloda-ai/rag_integration) submitted by /u/coldoven