SRUM


SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Weiyang Jin*, Yuwei Niu*, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu 📧

contact: xihuiliu@hku.hk

We present SRUM, a post-training reward fine-tuning method for Unified Multimodal Models (UMMs) that leverages a UMM's inherent understanding capabilities to boost its generative abilities, creating a cost-effective, self-iterative optimization loop and bridging the performance gaps caused by conflicts during earlier training phases. SRUM demonstrates strong generalization across both compositional generation and world knowledge. The figure below showcases SRUM's qualitative performance compared with SFT and the base model.

📢 News

We sincerely thank all contributors from the open community for their valuable support.

📮 Notice

Following Bagel's original settings, you should pay attention to the following.

About Inference Hyperparameters (a usage sketch follows this list):

  • cfg_text_scale: Controls how strongly the model follows the text prompt. 1.0 disables text guidance. Typical range: 4.0โ€“8.0.
  • cfg_image_scale: Controls how much the model preserves input image details. 1.0 disables image guidance. Typical range: 1.0โ€“2.0.
  • cfg_interval: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: [0.4, 1.0].
  • timestep_shift: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
  • num_timesteps: Total denoising steps. Typical: 50.
  • cfg_renorm_min: Minimum value for CFG-Renorm. 1.0 disables renorm. Typical: 0.
  • cfg_renorm_type: CFG-Renorm method:
    • global: Normalize over all tokens and channels (default for T2I).
    • channel: Normalize across channels for each token.
    • text_channel: Like channel, but only applies to text condition (good for editing, may cause blur).
  • If edited images appear blurry, try the global CFG-Renorm type, decrease cfg_renorm_min, or decrease cfg_scale.
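
As a rough illustration, these hyperparameters might be collected into a dictionary for a text-to-image call. This is only a sketch: the keys mirror the list above, but the inferencer entry point and its signature are assumptions and may differ from the actual Bagel/SRUM inference code.

# Illustrative only: parameter names follow the list above; the inference
# entry point below is a placeholder, not the repo's actual API.
t2i_hyper = dict(
    cfg_text_scale=4.0,        # text guidance strength (1.0 disables)
    cfg_image_scale=1.0,       # input-image preservation, mainly for editing (1.0 disables)
    cfg_interval=[0.4, 1.0],   # fraction of denoising steps where CFG is applied
    timestep_shift=3.0,        # >1 spends more steps on layout, <1 on fine details
    num_timesteps=50,          # total denoising steps
    cfg_renorm_min=0.0,        # CFG-Renorm floor (1.0 disables renorm)
    cfg_renorm_type="global",  # "global" | "channel" | "text_channel"
)
# image = inferencer(text="a red cube on a blue sphere", **t2i_hyper)  # hypothetical call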

🔥 Quick Start

1️⃣ Set up environment

git clone https://github.com/WayneJin0918/SRUM
cd SRUM
conda env create -f environment.yaml
conda activate SRUM
pip install -r requirements.txt

If flash-attention is hard to install via pip, download and install a prebuilt wheel instead:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
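
After installation, a quick import check (a minimal sketch) can confirm that the wheel matches your PyTorch and CUDA build:

# Sanity check: the printed versions should match the wheel you installed
# (torch 2.5 / CUDA 12 / Python 3.10 for the wheel above).
import torch
import flash_attn
print(torch.__version__, torch.version.cuda, flash_attn.__version__)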

Alternatively, you can follow the environment settings of Bagel.

2️⃣ Download the Bagel pretrained checkpoint or our SRUM checkpoint

# Bagel base checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(cache_dir=cache_dir,
  local_dir=save_dir,
  repo_id=repo_id,
  local_dir_use_symlinks=False,
  resume_download=True,
  allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
# SRUM checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/SRUM_BAGEL_7B_MoT"
repo_id = "Wayne-King/SRUM_BAGEL_7B_MoT"
cache_dir = save_dir + "/cache"

snapshot_download(cache_dir=cache_dir,
  local_dir=save_dir,
  repo_id=repo_id,
  local_dir_use_symlinks=False,
  resume_download=True,
  allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
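
A quick way to confirm the downloads completed (a minimal sketch; paths follow the snippets above):

# List the downloaded files; you should see config JSONs and .safetensors shards.
import os
for d in ["models/BAGEL-7B-MoT", "models/SRUM_BAGEL_7B_MoT"]:
    if os.path.isdir(d):
        print(d, sorted(os.listdir(d)))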

🔥 Train & Eval

Train

1️⃣ Data preparation

Use srum_data_infer/compt2i.sh for multi-GPU image inference. Change the output directory --output_dir to ./your_images_address.

bash srum_data_infer/compt2i.sh

You will then get the image folder ./your_images_address. Next, use srum_data_infer/vlm.sh for scoring; in general, --image_dir in the bash file should be the same as ./your_images_address.

Before running VLM inference, download the SAM weights into the SRUM root directory:

wget https://huggingface.co/HCMUE-Research/SAM-vit-h/resolve/main/sam_vit_h_4b8939.pth
bash srum_data_infer/vlm.sh

You now have the JSONL file your_vlm_output.jsonl and the image folder ./your_images_address; register them in data/dataset_info.py:

        'comp_data': {
            'jsonl_path': './your_vlm_output.jsonl',
            # Replace 'image_base_dir' with the 'image_dirs' dictionary
            'image_dirs': {
                # Key 'good' for the ground-truth images
                'good': './your_images_address',
                # Key 'bad' for the input images, which carry rewards just like the 'good' ones (here it points to the same folder)
                'bad': './your_images_address' 
            },
            'num_total_samples': 5911, # total number of samples in dataset
        },
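
num_total_samples should match the number of records in your JSONL file; a minimal sketch for counting them (the file path follows the config above):

# Count records in the scoring output to fill in 'num_total_samples'.
with open("./your_vlm_output.jsonl") as f:
    num_total_samples = sum(1 for line in f if line.strip())
print(num_total_samples)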

Alternatively, you can directly use our HF training data on Hugging Face or our MS training data on ModelScope.

2️⃣ Start training

Download the base model. Then add the YAML file scripts/data/rft_comp.yaml:

regional_reward:
  dataset_names:
  - comp_data
  image_transform_args:
    image_stride: 256
    max_image_size: 1024
    min_image_size: 512
  num_used_data: # The sum should be larger than NUM_GPUS x NUM_WORKERS
  - 8
  weight: 1

Then launch training:

bash scripts/train_reg_comp.sh

Please do not forget to change PYTHONPATH to your root SRUM path, e.g. /mnt/SRUM. If you are not using 8 GPUs on one node, change --num_shard to your number of GPUs.

We highly recommend that --save_every be at most --total_steps minus one.

3️⃣ Convert to HF weights

bash tool/trans2hf.sh 

If you want to use the generated data to SFT the base model, use tool/trans2parquet.py to convert the images and JSON files into Parquet.
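
To verify the conversion, you can open the resulting Parquet file with pandas (a minimal sketch; the file name is a placeholder, and the column layout depends on tool/trans2parquet.py):

# Inspect the converted Parquet file; column names depend on tool/trans2parquet.py.
import pandas as pd
df = pd.read_parquet("your_sft_data.parquet")  # placeholder path
print(len(df), list(df.columns))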

You can replace the variables in the script with your own before running. See TRAIN for more details.

Eval

Bagel provides scripts for evaluating VLM, T2I, and editing benchmarks. Please see EVAL for more details.

If you want to evaluate on T2I-CompBench, refer to the files in SRUM/CompBench_eval; it is easy to get started. We highly recommend using Qwen2.5-VL-72B-Instruct for evaluation, but you can use Qwen2.5-VL-32B-Instruct instead if you do not have enough memory; the conclusions and overall scores are similar.

Then, run the following command:

bash CompBench_eval/comp_eval_infer.sh

The image output will be saved in BASE_OUTPUT_DIR.

bash CompBench_eval/qwen_eval.sh

The score output will be saved in OUTPUT_DIR.

📊 Benchmarks

1. Composition

| T2I Model | 3D Spatial | Color | Complex | Non-Spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | 76.39 | 90.63 | 83.51 | 87.47 | 75.30 | 80.20 | 84.23 | 87.07 | 83.10 |
| FLUX.1-schnell | 79.38 | 84.53 | 81.96 | 85.55 | 72.82 | 82.20 | 85.49 | 86.38 | 82.29 |
| SD-3-medium | 77.83 | 91.63 | 84.73 | 86.12 | 72.80 | 83.72 | 88.20 | 89.03 | 84.26 |
| SD-xl-base-1 | 72.25 | 77.75 | 75.00 | 85.28 | 57.14 | 72.18 | 77.08 | 78.38 | 74.38 |

| Unified Model | 3D Spatial | Color | Complex | Non-Spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 76.17 | 84.25 | 80.28 | 80.47 | 56.43 | 65.14 | 79.67 | 69.67 | 74.01 |
| Show-o2 | 88.61 | 87.73 | 87.88 | 85.91 | 69.74 | 73.99 | 86.60 | 82.17 | 82.83 |
| BLIP3o | 81.73 | 89.92 | 85.55 | 84.78 | 71.67 | 83.75 | 92.47 | 87.45 | 84.66 |
| OmniGen2 | 82.21 | 92.22 | 86.87 | 88.51 | 72.00 | 83.95 | 90.07 | 90.88 | 85.84 |
| Bagel | 77.98 | 89.30 | 83.32 | 85.03 | 70.40 | 81.94 | 81.52 | 87.93 | 82.18 |
| Bagel (CoT) | 84.66 | 88.85 | 86.10 | 85.64 | 75.36 | 84.33 | 82.71 | 88.07 | 84.46 |
| BLIP3o+SRUM | 83.78↑ | 90.22↑ | 86.57↑ | 85.10↑ | 74.52↑ | 85.44↑ | 93.88↑ | 86.52↓ | 85.75↑ |
| Bagel+SRUM | 83.10↑ | 92.90↑ | 88.69↑ | 88.47↑ | 78.52↑ | 84.23↑ | 86.92↑ | 89.57↑ | 86.55↑ |
| Bagel+SRUM (CoT) 🏆 | 88.60↑ | 92.90↑ | 91.31↑ | 90.48↑ | 80.12↑ | 84.47↑ | 89.93↑ | 89.15↑ | 88.37↑ |

2. Reasoning-informed

| Model | Entity | Idiom | Scientific | Textual Image | Average |
|---|---|---|---|---|---|
| Bagel | 49.70 | 34.46 | 47.52 | 43.59 | 43.82 |
| Bagel+SFT | 50.53 | 39.43 | 47.45 | 44.08 | 45.37 |
| Bagel+SRUM | **52.85** | **40.51** | **47.83** | **45.83** | **46.75** |

Performance comparison of Bagel models across four categories and their average scores. Bold values indicate the best performance in each column.

✍️ Citation

@article{jin2025srum,
  title   = {SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models},
  author  = {Jin, Weiyang and Niu, Yuwei and Liao, Jiaqi and Duan, Chengqi and Li, Aoxue and Gao, Shenghua and Liu, Xihui},
  journal = {arXiv preprint arXiv:2510.12784},
  year    = {2025}
}

📜 License

SRUM is licensed under the Apache License 2.0.
