The `CheckpointEngine` was designed to modularize checkpoint serialization. This way, the checkpoint serialization methods can be replaced or refined easily.
CheckpointEngine
For checkpoint management (save/load by DeepSpeed with a given tag), the `CheckpointEngine` will:
1. Prepare for the checkpoint by calling `create(tag)`. For `torch`, this step only logs some extra information, since `torch` can call `save`/`load` directly without further preparation.
2. After `create(tag)`, DeepSpeed can call `save`/`load` to write files to, or read them from, disk, memory, or other storage.
3. When all the files for a tag are ready, the DeepSpeed engine calls `commit()` to tell the checkpoint engine that the current checkpoint is complete. For the original `torch` engine, `commit()` also serves as a logger.
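The call order described above can be sketched as follows. This is only an illustration under stated assumptions, not DeepSpeed's actual code: `InMemoryEngine` and `save_checkpoint` are hypothetical names, and the engine keeps everything in a dictionary so that the sequence of calls is easy to inspect.

```python
class InMemoryEngine:
    """Hypothetical stand-in engine: records calls and keeps data in memory."""

    def __init__(self):
        self.calls = []      # ordered log of (method, argument) pairs
        self.storage = {}    # path -> saved state_dict
        self.committed = []  # tags whose checkpoints are complete

    def create(self, tag):
        self.calls.append(("create", tag))

    def save(self, state_dict, path):
        self.calls.append(("save", path))
        self.storage[path] = dict(state_dict)

    def load(self, path, map_location=None):
        self.calls.append(("load", path))
        return self.storage[path]

    def commit(self, tag):
        self.calls.append(("commit", tag))
        self.committed.append(tag)
        return True


def save_checkpoint(engine, tag, files):
    """Drive one checkpoint: create, save every file, then commit."""
    engine.create(tag)                      # step 1: prepare for this tag
    for path, state_dict in files.items():  # step 2: persist each file
        engine.save(state_dict, path)
    return engine.commit(tag)               # step 3: mark the tag complete


engine = InMemoryEngine()
ok = save_checkpoint(engine, "step100", {
    "step100/model.pt": {"weight": [1.0, 2.0]},
    "step100/optim.pt": {"lr": 0.01},
})
```

After this runs, `engine.calls` starts with the `create` entry and ends with the `commit` entry, with one `save` entry per file in between.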
```python
class CheckpointEngine(object):

    # init checkpoint engine for save/load
    def __init__(self, config_params=None):
        pass

    def create(self, tag):
        # create checkpoint on given tag for save/load.
        pass

    def save(self, state_dict, path: str):
        pass

    def load(self, path: str, map_location=None):
        pass

    def commit(self, tag):
        # to tell checkpoint services if all files are ready.
        pass
```
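To illustrate how a different serialization method could be plugged in, here is a rough sketch of an engine that exposes the same five methods but writes files with Python's `pickle` module. `PickleCheckpointEngine` is a hypothetical name, not part of DeepSpeed. The class is written standalone so the snippet runs without DeepSpeed installed; a real engine would presumably subclass `CheckpointEngine`.

```python
import os
import pickle
import tempfile


class PickleCheckpointEngine:
    """Hypothetical engine with the same interface, backed by pickle files."""

    def __init__(self, config_params=None):
        self.config_params = config_params
        self.current_tag = None

    def create(self, tag):
        # Remember which tag the following save() calls belong to.
        self.current_tag = tag

    def save(self, state_dict, path: str):
        # Make sure the target directory exists, then serialize to disk.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(state_dict, f)

    def load(self, path: str, map_location=None):
        # map_location is accepted to match the interface but is unused here.
        with open(path, "rb") as f:
            return pickle.load(f)

    def commit(self, tag):
        # Nothing to flush for local files; confirm the tag that was created.
        return tag == self.current_tag


with tempfile.TemporaryDirectory() as root:
    engine = PickleCheckpointEngine()
    engine.create("step100")
    path = os.path.join(root, "step100", "model_states.pkl")
    engine.save({"epoch": 3, "weights": [0.1, 0.2]}, path)
    committed = engine.commit("step100")
    restored = engine.load(path)
```

Because callers only depend on `create`, `save`, `load`, and `commit`, swapping the storage backend in this way does not require changes on the calling side.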