1 min read · from Machine Learning
What should a PyTorch training end-of-run performance summary show? [D]
Our take
In PyTorch training, a concise end-of-run performance summary is vital for understanding how efficiently your model executed. Instead of sifting through extensive trace events, users need clear answers to a few key questions: Where did the step time go? Was the run input-bound, compute-bound, or wait-heavy? Were the ranks balanced, and did memory stay stable? A compact summary should be lightweight enough to run on every job, so users can pinpoint issues quickly without a dedicated profiling run.
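To make "lightweight enough for every job" concrete, here is a minimal sketch of per-phase step timing. This is an illustrative instrumentation idea, not a PyTorch API: `StepTimer` and its phase names (`"data"`, `"compute"`) are assumptions, and the `time.sleep` calls merely stand in for real dataloader and forward/backward work.

```python
import time
from collections import defaultdict


class StepTimer:
    """Records wall-clock seconds per named phase of a training step
    so an end-of-run summary can attribute step time to data loading,
    compute, and waiting. Illustrative sketch only."""

    def __init__(self):
        self.phase_seconds = defaultdict(list)

    def time_phase(self, phase):
        return _Phase(self, phase)


class _Phase:
    """Context manager that times one phase of one step."""

    def __init__(self, timer, phase):
        self.timer, self.phase = timer, phase

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.timer.phase_seconds[self.phase].append(
            time.perf_counter() - self.start)
        return False


# Hypothetical training loop: wrap each phase of every step.
timer = StepTimer()
for step in range(3):
    with timer.time_phase("data"):
        time.sleep(0.001)   # stands in for a dataloader fetch
    with timer.time_phase("compute"):
        time.sleep(0.002)   # stands in for forward/backward/optimizer

totals = {p: sum(v) for p, v in timer.phase_seconds.items()}
```

The overhead is two `perf_counter` calls per phase per step, cheap enough to leave enabled on every run rather than only in profiling jobs.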
> For most slow PyTorch runs the first question isn't "show me every trace event," it's just: where do I even start? Where did step time go? I've been thinking about what a compact end-of-run summary would look like: lightweight enough to run on every job, not just dedicated profiling runs. Here's one example of what that output could look like: Curious how others are solving this today. What would make something like this useful? What is missing?
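One way to sketch such an output: aggregate the per-phase totals into a short, human-readable report that answers the post's questions directly. Everything here is an assumption for illustration — the `summarize_run` function, its phase names, and the 5% memory-growth threshold are invented, not a PyTorch convention.

```python
def summarize_run(phase_seconds, peak_mem_mb_per_step=None):
    """Build a compact end-of-run summary from per-phase totals.
    Illustrative sketch; thresholds and labels are arbitrary choices."""
    total = sum(phase_seconds.values())
    shares = {p: t / total for p, t in phase_seconds.items()}

    # Classify the run by whichever phase dominated step time.
    dominant = max(shares, key=shares.get)
    label = {"data": "input-bound",
             "compute": "compute-bound",
             "wait": "wait-heavy"}.get(dominant, dominant)

    lines = [f"total step time: {total:.1f}s ({label})"]
    for p, s in sorted(shares.items(), key=lambda kv: -kv[1]):
        lines.append(f"  {p:>8}: {phase_seconds[p]:.1f}s ({100 * s:.0f}%)")

    # Flag memory growth: "stable" if the last step's peak stayed
    # within 5% of the first step's peak (arbitrary cutoff).
    if peak_mem_mb_per_step:
        first, last = peak_mem_mb_per_step[0], peak_mem_mb_per_step[-1]
        trend = "stable" if last <= 1.05 * first else "growing"
        lines.append(f"  peak memory: {last:.0f} MB ({trend})")
    return "\n".join(lines)


report = summarize_run(
    {"data": 12.0, "compute": 30.0, "wait": 8.0},
    peak_mem_mb_per_step=[4096, 4100, 4101],
)
print(report)
```

A few lines like these — dominant-phase label, percentage breakdown, memory trend — would answer "where do I even start?" without opening a trace viewer; a multi-rank version could add min/max step time across ranks to surface imbalance.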
Tagged with
#PyTorch#training#end-of-run#performance summary#input-bound#compute-bound#step time#memory stable#profiling runs#wait-heavy#compact summary#ranks imbalanced#slow runs