Checkpoint Engine

The `CheckpointEngine` is designed to modularize checkpoint serialization, so that the serialization methods can easily be replaced or refined.

Interface for CheckpointEngine

For checkpoint management (saving and loading by DeepSpeed with a given tag), the `CheckpointEngine` is used as follows:

1. DeepSpeed calls `create(tag)` to get any preliminaries ready. For `torch`, this step only logs some extra information, because `torch` can call `save`/`load` directly without other preparation.

2. After `create(tag)`, DeepSpeed can call `save`/`load` to persist files to disk, memory, etc.

3. When all the files for a tag are ready, the DeepSpeed engine calls `commit(tag)` to tell the checkpoint engine that the current checkpoint is complete. For the original `torch` engine, this call also plays the role of a logger.

```python
class CheckpointEngine(object):
    # Initialize the checkpoint engine for save/load.
    def __init__(self, config_params=None):
        pass

    def create(self, tag):
        # Prepare a checkpoint for the given tag before save/load.
        pass

    def save(self, state_dict, path: str):
        # Persist state_dict to path.
        pass

    def load(self, path: str, map_location=None):
        # Load and return the state stored at path.
        pass

    def commit(self, tag):
        # Tell the checkpoint service that all files for this tag are ready.
        pass
```
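
To illustrate how a replacement engine could plug into this interface, here is a minimal sketch that follows the `create` / `save` / `commit` sequence described above. Everything beyond the five interface methods is an assumption made for illustration: `PickleCheckpointEngine`, `save_checkpoint`, the `committed_tags` attribute, and the `model_states.pkl` file name are hypothetical and are not part of DeepSpeed. The base class is a stand-in copy of the interface so that the example runs without DeepSpeed installed, and `pickle` is swapped in for `torch` serialization to keep it standard-library only.

```python
import os
import pickle
import tempfile


class CheckpointEngine(object):
    # Stand-in copy of the interface from this README (not imported from DeepSpeed).
    def __init__(self, config_params=None):
        pass

    def create(self, tag):
        pass

    def save(self, state_dict, path: str):
        pass

    def load(self, path: str, map_location=None):
        pass

    def commit(self, tag):
        pass


class PickleCheckpointEngine(CheckpointEngine):
    """Hypothetical engine that serializes with pickle instead of torch."""

    def __init__(self, config_params=None):
        super().__init__(config_params)
        self.committed_tags = []  # hypothetical record of completed tags

    def create(self, tag):
        # Plain files need no preparation, so this only logs the tag.
        print(f"[pickle] preparing checkpoint {tag}")

    def save(self, state_dict, path: str):
        directory = os.path.dirname(path)
        if directory:
            os.makedirs(directory, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(state_dict, f)

    def load(self, path: str, map_location=None):
        # map_location is accepted for interface compatibility and ignored here.
        with open(path, "rb") as f:
            return pickle.load(f)

    def commit(self, tag):
        # All files for this tag are written; record that it is complete.
        self.committed_tags.append(tag)
        print(f"[pickle] checkpoint {tag} is complete")


def save_checkpoint(engine, save_dir, tag, state_dict):
    """Hypothetical caller-side sequence: create, then save, then commit."""
    path = os.path.join(save_dir, tag, "model_states.pkl")
    engine.create(tag)
    engine.save(state_dict, path)
    engine.commit(tag)
    return path


if __name__ == "__main__":
    engine = PickleCheckpointEngine()
    with tempfile.TemporaryDirectory() as save_dir:
        path = save_checkpoint(engine, save_dir, "global_step10", {"step": 10})
        restored = engine.load(path)
        print(restored)
```

Because the caller only relies on the five interface methods, swapping in a different engine should not require changes to `save_checkpoint`. A real engine would likely also need to handle `map_location` and partially written checkpoints, which this sketch leaves out.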