
What should a PyTorch training end-of-run performance summary show? [D]

Our take

In PyTorch training, a concise end-of-run performance summary is vital for understanding the efficiency of your model's execution. Instead of sifting through extensive trace events, users need clear answers to key questions: Where did the step time go? Was the run input-bound, compute-bound, or wait-heavy? Were ranks balanced? Additionally, monitoring memory stability is crucial. A compact summary should be lightweight enough for every job, enhancing accessibility and allowing users to pinpoint issues quickly.

For most slow PyTorch runs, the first question isn't "show me every trace event"; it's simply: where do I even start?

- where did step time go?
- was the run input-bound, compute-bound, or wait-heavy?
- were ranks imbalanced?
- was memory stable or creeping up?

I have been thinking about what a compact end-of-run summary would look like: lightweight enough to run on every job, not just on dedicated profiling runs.
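As a rough illustration of how lightweight such a thing could be, here is a minimal sketch of a per-phase step timer that accumulates wall time and prints an end-of-run breakdown with a crude bound verdict. All names (`StepTimer`, the phase labels, the 50% dominance threshold) are my own assumptions, not an existing API:

```python
import time
from collections import defaultdict


class StepTimer:
    """Accumulates wall time per phase (e.g. data, compute, wait) across steps.

    Hypothetical sketch: phases and thresholds are illustrative choices.
    """

    def __init__(self):
        self.totals = defaultdict(float)
        self.steps = 0

    def record(self, phase, seconds):
        # Add measured wall time (e.g. from time.perf_counter deltas) to a phase.
        self.totals[phase] += seconds

    def end_step(self):
        self.steps += 1

    def summary(self):
        total = sum(self.totals.values())
        if total == 0:
            return "no time recorded"
        lines = [f"steps: {self.steps}, total: {total:.3f}s"]
        for phase, t in sorted(self.totals.items(), key=lambda kv: -kv[1]):
            lines.append(f"  {phase:>8}: {t:.3f}s ({100 * t / total:.1f}%)")
        # Crude classification: call the run X-bound if one phase exceeds
        # half the total step time, otherwise call it mixed.
        top = max(self.totals, key=self.totals.get)
        if self.totals[top] > 0.5 * total:
            lines.append(f"verdict: {top}-bound")
        else:
            lines.append("verdict: mixed")
        return "\n".join(lines)
```

In a training loop you would wrap the dataloader fetch and the forward/backward pass with `time.perf_counter()` and `record()` the deltas; the summary then answers "where did step time go" and "input-, compute-, or wait-bound" in a handful of lines.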

Here's one example of what that output could look like:

https://preview.redd.it/2q71s9ltkvzg1.png?width=533&format=png&auto=webp&s=cde99ed3224d723bb6dba200b326da826ba4f587
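For the "was memory stable or creeping up" question specifically, one cheap approach is a least-squares slope over per-step memory readings (which on GPU could come from `torch.cuda.memory_allocated()`). The function names and the 0.5 MB/step creep threshold below are illustrative assumptions:

```python
def memory_trend(samples_mb):
    """Least-squares slope, in MB per step, over per-step memory readings."""
    n = len(samples_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_mb))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def memory_verdict(samples_mb, creep_mb_per_step=0.5):
    """One-line stable-vs-creeping verdict; threshold is an arbitrary choice."""
    slope = memory_trend(samples_mb)
    label = "creeping" if slope > creep_mb_per_step else "stable"
    return f"memory: {label} ({slope:+.2f} MB/step)"
```

A slope near zero over the run reads as "stable"; a persistent positive slope is the creep signal, which a trace viewer makes you eyeball instead.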

Curious how others are solving this today. What would make something like this useful? What is missing?

submitted by /u/traceml-ai

