To run benchmarks, there are two options:
1. Run a single communication operation.

   For example, run with a single large message size:

   ```bash
   deepspeed all_reduce.py
   ```

   Scan across message sizes:

   ```bash
   deepspeed all_reduce.py --scan
   ```

2. Run all available communication benchmarks:

   ```bash
   deepspeed run_all.py
   ```
Like the individual benchmarks, `run_all.py` supports scanning arguments for the max message size, bw-unit, etc. Simply pass the desired arguments to `run_all.py` and they will be propagated to each comm op.
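For example, using flag names from the usage output shown below, the following would scan message sizes up to 2^24 bytes and report bandwidth in Gbps:

```bash
deepspeed run_all.py --scan --maxsize 24 --bw-unit Gbps
```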
```
usage: ds_bench [-h] [--local_rank LOCAL_RANK] [--trials TRIALS] [--warmups WARMUPS]
                [--maxsize MAXSIZE] [--async-op] [--bw-unit {Gbps,GBps}] [--backend {nccl}]
                [--dist {deepspeed,torch}] [--scan] [--raw] [--all-reduce] [--all-gather]
                [--all-to-all] [--pt2pt] [--broadcast] [--dtype DTYPE]
                [--mem-factor MEM_FACTOR] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  --local_rank LOCAL_RANK
  --trials TRIALS       Number of timed iterations
  --warmups WARMUPS     Number of warmup (non-timed) iterations
  --maxsize MAXSIZE     Max message size as a power of 2
  --async-op            Enables non-blocking communication
  --bw-unit {Gbps,GBps}
  --backend {nccl}      Communication library to use
  --dist {deepspeed,torch}
                        Distributed DL framework to use
  --scan                Enables scanning all message sizes
  --raw                 Print the message size and latency without units
  --all-reduce          Run all_reduce
  --all-gather          Run all_gather
  --all-to-all          Run all_to_all
  --pt2pt               Run pt2pt
  --broadcast           Run broadcast
  --dtype DTYPE         PyTorch tensor dtype
  --mem-factor MEM_FACTOR
                        Proportion of max available GPU memory to use for single-size evals
  --debug               Enables all_to_all debug prints
```
Note that `ds_bench` is a pre-packaged wrapper around `run_all.py`. Users can pass the same arguments as well:

```bash
/bin/ds_bench --scan --trials=10
```
Finally, users can choose specific communication operations to run in `run_all.py` or `ds_bench` by passing them as arguments (all operations are run by default). For example:

```bash
deepspeed run_all.py --scan --all-reduce --all-to-all --broadcast
```
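Since `ds_bench` accepts the same arguments, an equivalent invocation through the wrapper (assuming `ds_bench` is on your PATH) would be:

```bash
ds_bench --scan --all-reduce --all-to-all --broadcast
```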
To add new communication benchmarks, follow this general procedure:
1. Copy an existing benchmark as a starting point (e.g., to add `reduce_scatter`, copy `all_reduce.py` as a template)
2. Add a new bandwidth formula in `utils.get_bw`, a new maximum tensor element formula in `utils.max_numel`, and a new arg in `utils.benchmark_parser` (see the sketch after this list)
3. Find a good default `mem_factor` for use in the `run_<collective>_single()` function
4. Add the new comm op to `run_all.py`
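As a rough illustration of step 2, the sketch below shows the kind of formulas involved for a hypothetical `reduce_scatter` benchmark. This is not the repository's actual code: the function names, signatures, and memory model here are assumptions for explanation only; the real logic belongs inside `utils.get_bw` and `utils.max_numel`.

```python
# Hypothetical sketch only -- names and signatures are illustrative, not DeepSpeed's API.

def reduce_scatter_busbw(size_bytes: int, duration_s: float, world_size: int) -> float:
    # Algorithmic bus bandwidth: each rank exchanges (n - 1) / n of the message,
    # the same scaling factor commonly used for all_gather-style collectives.
    return (size_bytes / duration_s) * (world_size - 1) / world_size


def reduce_scatter_max_numel(gpu_mem_bytes: int, mem_factor: float,
                             dtype_bytes: int, world_size: int) -> int:
    # Largest element count that fits in memory under this sketch's assumptions:
    # one full input tensor plus the per-rank output shard, scaled by the
    # user-supplied --mem-factor safety margin.
    usable = gpu_mem_bytes * mem_factor
    return int(usable / (dtype_bytes * (1 + 1 / world_size)))
```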