[R] How are you managing long-running preprocessing jobs at scale? Curious what's actually working

Our take

Long-running preprocessing jobs are hard to manage at scale, especially for small machine learning teams. With datasets in the 50–100GB range, a failure halfway through a multi-hour run is costly. Teams have to decide whether to stay on a single machine or distribute the work across multiple workers. Tools like Prefect and Temporal exist, but they can demand significant DevOps effort to set up and maintain. The post asks the community for effective strategies, internal solutions, and common failure points.

We're a small ML team and we keep running into the same wall: large preprocessing jobs (think 50–100GB datasets) running on a single machine take hours, and when something fails halfway through, it's painful.
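To make the "fails halfway through" problem concrete, here is a minimal sketch of chunk-level checkpointing on a single machine: each finished chunk is recorded in a manifest, so a rerun skips completed work instead of starting over. The names `run_resumable`, `load_done`, `process_chunk`, and the `manifest.jsonl` layout are all hypothetical, and the sketch assumes the dataset can be split into independent chunks with string ids.

```python
import json
import os
from pathlib import Path


def load_done(manifest: Path) -> set:
    """Return the ids of chunks already recorded as finished."""
    if not manifest.exists():
        return set()
    with manifest.open() as f:
        return {json.loads(line)["chunk"] for line in f if line.strip()}


def run_resumable(chunks, process_chunk, out_dir: Path) -> list:
    """Process (chunk_id, payload) pairs, skipping ids found in the manifest.

    Returns the ids processed during this call.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = out_dir / "manifest.jsonl"
    done = load_done(manifest)
    processed = []
    for chunk_id, payload in chunks:
        if chunk_id in done:
            continue  # finished in an earlier run
        result = process_chunk(payload)
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written output under the final name.
        tmp = out_dir / f"{chunk_id}.tmp"
        tmp.write_text(json.dumps(result))
        os.replace(tmp, out_dir / f"{chunk_id}.json")
        # Record the chunk only after its output is safely in place.
        with manifest.open("a") as f:
            f.write(json.dumps({"chunk": chunk_id}) + "\n")
        processed.append(chunk_id)
    return processed
```

If the job dies on chunk 700 of 1000, calling `run_resumable` again with the same `out_dir` only processes the chunks missing from the manifest. This assumes `process_chunk` returns something JSON-serializable; a real pipeline would more likely write Parquet or similar.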

We've looked at Prefect, Temporal, and a few others — but they all feel like they require a full-time DevOps person to set up and maintain properly. And most of our team is focused on the models, not the infrastructure.

Curious how other teams are handling this:

- Are you distributing these jobs across multiple workers, or still running on single machines?

- If you are distributing — what are you using and is it actually worth the setup overhead?

- Has anyone built something internal to handle this, and was it worth it?

- What's the biggest failure point in your current setup?
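As a point of reference for the "distribute or not" question, the lightest form of fan-out needs no orchestrator at all: submit chunks to an executor from Python's standard `concurrent.futures` module and collect failures per chunk instead of letting one bad chunk abort the run. This is only a sketch under the assumption that chunks are independent; `run_chunks` and `process_chunk` are hypothetical names.

```python
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed


def run_chunks(executor: Executor, process_chunk, chunks):
    """Submit every (chunk_id, payload) pair; return (results, failures) by id."""
    futures = {
        executor.submit(process_chunk, payload): chunk_id
        for chunk_id, payload in chunks
    }
    results, failures = {}, {}
    for fut in as_completed(futures):
        chunk_id = futures[fut]
        try:
            results[chunk_id] = fut.result()
        except Exception as exc:
            # Keep going: one failed chunk is recorded, not fatal.
            failures[chunk_id] = repr(exc)
    return results, failures
```

For CPU-bound preprocessing, a `ProcessPoolExecutor` can be passed in place of the thread pool, provided `process_chunk` is a picklable module-level function. The `failures` dict is the list of chunk ids to retry. This stays on one machine; spreading work across several hosts is where the heavier tools come in.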

Trying to figure out if we're solving this the wrong way or if this is just a painful problem everyone deals with. Would love to hear what's actually working for people.

submitted by /u/krishnatamakuwala
