CogVideoX diffusers Fine-tuning Guide

Read in Chinese

Read in Japanese

This feature is not yet fully complete. If you want to check the fine-tuning for the SAT version, please see here. The SAT version's dataset format is different from this one.

Hardware Requirements

  • CogVideoX-2B / 5B LoRA: 1 * A100 (the 5B model requires --use_8bit_adam)
  • CogVideoX-2B SFT: 8 * A100 (work in progress)
  • CogVideoX-5B-I2V is not supported yet.

Install Dependencies

Since the related code has not been merged into a diffusers release yet, you need to fine-tune against the diffusers main branch installed from source. Please follow the steps below to install the dependencies:

git clone https://github.com/huggingface/diffusers.git
cd diffusers # Now on the main branch
pip install -e .
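
Optionally, you can confirm that the source install is the one being picked up. The check below is a minimal sketch and assumes only that the editable install above succeeded:

import diffusers
from diffusers import CogVideoXPipeline  # raises ImportError on releases that predate CogVideoX support

print(diffusers.__version__)  # an install from source typically carries a ".dev0" suffix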

Prepare the Dataset

First, you need to prepare the dataset. It should have the following structure, where videos.txt lists the video files in the videos directory and prompts.txt contains one caption per line, aligned with the corresponding line of videos.txt:

.
├── prompts.txt
├── videos
└── videos.txt

You can download the Disney Steamboat Willie dataset from here.

This video dataset is intended as a test set for fine-tuning.
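
Before launching training, it can be worth checking that the two text files are line-aligned and that every listed video exists. The sketch below is a hypothetical helper, not part of the official scripts; it assumes the dataset root is passed as its only argument:

# check_dataset.py -- hypothetical helper, not part of this repository
import sys
from pathlib import Path

root = Path(sys.argv[1])  # the directory later passed as --instance_data_root
prompts = (root / "prompts.txt").read_text().strip().splitlines()
videos = (root / "videos.txt").read_text().strip().splitlines()

assert len(prompts) == len(videos), "prompts.txt and videos.txt must have the same number of lines"
for rel_path in videos:
    assert (root / rel_path).is_file(), f"missing video file: {rel_path}"
print(f"OK: {len(videos)} caption/video pairs")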

Configuration Files and Execution

The accelerate configuration files are as follows:

  • accelerate_config_machine_multi.yaml: Suitable for multi-GPU use
  • accelerate_config_machine_single.yaml: Suitable for single-GPU use

The fine-tuning script is configured as follows:

accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu \  # Use accelerate to launch multi-GPU training with the config file accelerate_config_machine_single.yaml
  train_cogvideox_lora.py \  # Training script train_cogvideox_lora.py for LoRA fine-tuning on CogVideoX model
  --gradient_checkpointing \  # Enable gradient checkpointing to reduce memory usage
  --pretrained_model_name_or_path $MODEL_PATH \  # Path to the pretrained model, specified by $MODEL_PATH
  --cache_dir $CACHE_PATH \  # Cache directory for model files, specified by $CACHE_PATH
  --enable_tiling \  # Enable tiling technique to process videos in chunks, saving memory
  --enable_slicing \  # Enable slicing to further optimize memory by slicing inputs
  --instance_data_root $DATASET_PATH \  # Dataset path specified by $DATASET_PATH
  --caption_column prompts.txt \  # Specify the file prompts.txt for video descriptions used in training
  --video_column videos.txt \  # Specify the file videos.txt for video paths used in training
  --validation_prompt "" \  # Prompt used for generating validation videos during training
  --validation_prompt_separator ::: \  # Set ::: as the separator for validation prompts
  --num_validation_videos 1 \  # Generate 1 validation video per validation round
  --validation_epochs 100 \  # Perform validation every 100 training epochs
  --seed 42 \  # Set random seed to 42 for reproducibility
  --rank 128 \  # Set the rank for LoRA parameters to 128
  --lora_alpha 64 \  # Set the LoRA alpha to 64; together with the rank it determines the LoRA scaling factor (lora_alpha / rank)
  --mixed_precision bf16 \  # Use bf16 mixed precision for training to save memory
  --output_dir $OUTPUT_PATH \  # Specify the output directory for the model, defined by $OUTPUT_PATH
  --height 480 \  # Set video height to 480 pixels
  --width 720 \  # Set video width to 720 pixels
  --fps 8 \  # Set video frame rate to 8 frames per second
  --max_num_frames 49 \  # Set the maximum number of frames per video to 49
  --skip_frames_start 0 \  # Skip 0 frames at the start of the video
  --skip_frames_end 0 \  # Skip 0 frames at the end of the video
  --train_batch_size 4 \  # Set training batch size to 4
  --num_train_epochs 30 \  # Total number of training epochs set to 30
  --checkpointing_steps 1000 \  # Save model checkpoint every 1000 steps
  --gradient_accumulation_steps 1 \  # Accumulate gradients for 1 step, updating after each batch
  --learning_rate 1e-3 \  # Set learning rate to 0.001
  --lr_scheduler cosine_with_restarts \  # Use cosine learning rate scheduler with restarts
  --lr_warmup_steps 200 \  # Warm up the learning rate for the first 200 steps
  --lr_num_cycles 1 \  # Set the number of learning rate cycles to 1
  --optimizer AdamW \  # Use the AdamW optimizer
  --adam_beta1 0.9 \  # Set Adam optimizer beta1 parameter to 0.9
  --adam_beta2 0.95 \  # Set Adam optimizer beta2 parameter to 0.95
  --max_grad_norm 1.0 \  # Set maximum gradient clipping value to 1.0
  --allow_tf32 \  # Enable TF32 to speed up training
  --report_to wandb  # Use Weights and Biases (wandb) for logging and monitoring the training

Running the Script to Start Fine-tuning

Single-node (single-GPU or multi-GPU) fine-tuning:

bash finetune_single_rank.sh

Multi-node fine-tuning:

bash finetune_multi_rank.sh # Needs to be run on each node

Loading the Fine-tuned Model

  • Please refer to cli_demo.py for how to load the fine-tuned model; a minimal loading sketch is also shown below.
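
The following is a minimal sketch along the lines of cli_demo.py, not a drop-in replacement for it; the model ID, LoRA path, prompt, and sampling arguments are placeholders and may differ from what your setup expects:

# Minimal LoRA loading sketch -- model ID, paths, and prompt below are placeholders
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("path/to/output_dir")  # the directory written via --output_dir during training
pipe.to("cuda")

video = pipe(
    prompt="A black-and-white cartoon steamboat sails down a river",  # placeholder prompt
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)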

Best Practices

  • Our experiments included 70 training videos with a resolution of 200 x 480 x 720 (frames x height x width). By skipping frames during data preprocessing, we created two smaller datasets with 49 and 16 frames to speed up experimentation, since the maximum frame count recommended by the CogVideoX team is 49. We split the 70 videos into three groups of 10, 25, and 50 videos of similar conceptual nature.
  • Using 25 or more videos works best when training new concepts and styles.
  • It works better to train using identifier tokens specified with --id_token. This is similar to Dreambooth training, but regular fine-tuning without such tokens also works.
  • The original repository used lora_alpha set to 1. We found this value ineffective across multiple runs, likely due to differences in the backend and training setup. Our recommendation is to set lora_alpha equal to rank or rank // 2 (see the short example after this list).
  • We recommend using a rank of 64 or higher.
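
For intuition, the effective scale applied to the LoRA update is lora_alpha / rank, so the recommendations above correspond to a scale of 1.0 or 0.5. A small worked example (an illustration only, not an official utility):

# Effective LoRA scaling factor -- illustrative only
def lora_scale(lora_alpha: int, rank: int) -> float:
    return lora_alpha / rank

print(lora_scale(64, 128))   # 0.5 -- the values used in the launch command above
print(lora_scale(128, 128))  # 1.0 -- lora_alpha equal to rank
print(lora_scale(64, 64))    # 1.0 -- a rank of 64 with matching alpha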