Monitoring
==========

DeepSpeed’s Monitor module can log training details to a
TensorBoard-compatible file, to WandB, or to simple CSV files. Below is an
overview of what DeepSpeed logs automatically.

.. csv-table:: Automatically Logged Data
    :header: "Field", "Description", "Condition"
    :widths: 20, 20, 10

    `Train/Samples/train_loss`,The training loss.,None
    `Train/Samples/lr`,The learning rate during training.,None
    `Train/Samples/loss_scale`,The loss scale when training using `fp16`.,`fp16` must be enabled.
    `Train/Eigenvalues/ModelBlockParam_{i}`,Eigenvalues per parameter block.,`eigenvalue` must be enabled.
    `Train/Samples/elapsed_time_ms_forward`,The global duration of the forward pass.,`flops_profiler.enabled` or `wall_clock_breakdown`.
    `Train/Samples/elapsed_time_ms_backward`,The global duration of the backward pass.,`flops_profiler.enabled` or `wall_clock_breakdown`.
    `Train/Samples/elapsed_time_ms_backward_inner`,"The backward time excluding gradient reduction. This differs from the total backward time only when gradient reduction is not overlapped; when it is overlapped, the inner time should be about the same as the total backward time.",`flops_profiler.enabled` or `wall_clock_breakdown`.
    `Train/Samples/elapsed_time_ms_backward_allreduce`,The global duration of the allreduce operation.,`flops_profiler.enabled` or `wall_clock_breakdown`.
    `Train/Samples/elapsed_time_ms_step`,The optimizer step time.,`flops_profiler.enabled` or `wall_clock_breakdown`.
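
The Condition column can be read as a lookup against the DeepSpeed
configuration. The following is a minimal sketch, not DeepSpeed code:
``expected_tags`` and ``_enabled`` are hypothetical helpers, and the sketch
assumes the conditions map to the config entries ``fp16.enabled``,
``eigenvalue.enabled``, ``flops_profiler.enabled``, and
``wall_clock_breakdown``. The per-block eigenvalue tag is kept as a single
template string rather than expanded for each ``{i}``.

```python
# Hypothetical helper (not part of DeepSpeed): predicts which tags from the
# "Automatically Logged Data" table should appear for a given config dict.

# Timing tags that share the same condition in the table.
TIMER_TAGS = [
    "Train/Samples/elapsed_time_ms_forward",
    "Train/Samples/elapsed_time_ms_backward",
    "Train/Samples/elapsed_time_ms_backward_inner",
    "Train/Samples/elapsed_time_ms_backward_allreduce",
    "Train/Samples/elapsed_time_ms_step",
]


def _enabled(config, section):
    """Return True if config[section]["enabled"] is present and truthy."""
    return bool(config.get(section, {}).get("enabled", False))


def expected_tags(config):
    """List the tags the table says are logged under this config."""
    # Logged unconditionally according to the table.
    tags = ["Train/Samples/train_loss", "Train/Samples/lr"]
    if _enabled(config, "fp16"):
        tags.append("Train/Samples/loss_scale")
    if _enabled(config, "eigenvalue"):
        tags.append("Train/Eigenvalues/ModelBlockParam_{i}")
    # Assumption: either setting alone is enough to enable the timers.
    if _enabled(config, "flops_profiler") or config.get("wall_clock_breakdown", False):
        tags.extend(TIMER_TAGS)
    return tags


example = {"fp16": {"enabled": True}, "wall_clock_breakdown": True}
for tag in expected_tags(example):
    print(tag)
```

A check like this can help when a dashboard is missing a series: if the tag is
absent from the predicted list, the corresponding feature is probably not
enabled in the config.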

TensorBoard
-----------

.. _TensorBoardConfig:

.. autopydantic_model:: deepspeed.monitor.config.TensorBoardConfig

WandB
-----

.. _WandbConfig:

.. autopydantic_model:: deepspeed.monitor.config.WandbConfig

CSV Monitor
-----------

.. _CSVConfig:

.. autopydantic_model:: deepspeed.monitor.config.CSVConfig
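
Each of the three backends is switched on through its own section of the
DeepSpeed configuration. The sketch below builds such a configuration as a
Python dict and serializes it to JSON. It rests on assumptions that should be
checked against the generated model documentation: the top-level keys
``tensorboard``, ``wandb``, and ``csv_monitor``, and the field names
``enabled``, ``output_path``, ``job_name``, ``group``, and ``project``. All
values, including the batch size, are placeholders.

```python
import json

# Assumed monitor sections; key and field names should be verified against
# TensorBoardConfig, WandbConfig, and CSVConfig. All values are placeholders.
monitor_sections = {
    "tensorboard": {
        "enabled": True,
        "output_path": "output/ds_logs/",
        "job_name": "example_job",
    },
    "wandb": {
        "enabled": True,
        "group": "example_group",
        "project": "example_project",
    },
    "csv_monitor": {
        "enabled": True,
        "output_path": "output/ds_logs/",
        "job_name": "example_job",
    },
}

# Merge the monitor sections into an otherwise minimal config.
ds_config = {"train_batch_size": 8, **monitor_sections}

# Serialize to the JSON text that would go into a DeepSpeed config file.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

Enabling only one backend is a matter of dropping the other sections or
setting their ``enabled`` field to ``False``.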