SRUM


SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Weiyang Jin*, Yuwei Niu*, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu 📧

contact: xihuiliu@hku.hk

We present SRUM, a post-training reward fine-tuning method for Unified Multimodal Models (UMMs) that leverages a UMM's inherent understanding capabilities to boost its generative abilities, creating a cost-effective, self-iterative optimization loop and bridging the performance gaps caused by conflicts during earlier training phases. SRUM demonstrates strong generalization across both compositional generation and world knowledge. The figure below showcases SRUM's qualitative performance compared with SFT and the base model.

📢 News

We sincerely thank all contributors from the open community for their valuable support.

📮 Notice

Following Bagel's original settings, you should pay attention to the following.

About Inference Hyperparameters (a usage sketch follows this list):

  • cfg_text_scale: Controls how strongly the model follows the text prompt. 1.0 disables text guidance. Typical range: 4.0โ€“8.0.
  • cfg_image_scale: Controls how much the model preserves input image details. 1.0 disables image guidance. Typical range: 1.0โ€“2.0.
  • cfg_interval: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: [0.4, 1.0].
  • timestep_shift: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
  • num_timesteps: Total denoising steps. Typical: 50.
  • cfg_renorm_min: Minimum value for CFG-Renorm. 1.0 disables renorm. Typical: 0.
  • cfg_renorm_type: CFG-Renorm method:
    • global: Normalize over all tokens and channels (default for T2I).
    • channel: Normalize across channels for each token.
    • text_channel: Like channel, but only applies to text condition (good for editing, may cause blur).
  • If edited images appear blurry, try the global CFG-Renorm type, decrease cfg_renorm_min, or decrease cfg_scale.
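
As a rough illustration, these hyperparameters might be collected into a dictionary for a text-to-image call. This is only a sketch: the keys mirror the list above, but the inferencer entry point and its signature are assumptions and may differ from the actual Bagel/SRUM inference code.

# Illustrative only: parameter names follow the list above; the inference
# entry point below is a placeholder, not the repo's actual API.
t2i_hyper = dict(
    cfg_text_scale=4.0,        # text guidance strength (1.0 disables)
    cfg_image_scale=1.0,       # input-image preservation, mainly for editing (1.0 disables)
    cfg_interval=[0.4, 1.0],   # fraction of denoising steps where CFG is applied
    timestep_shift=3.0,        # >1 spends more steps on layout, <1 on fine details
    num_timesteps=50,          # total denoising steps
    cfg_renorm_min=0.0,        # CFG-Renorm floor (1.0 disables renorm)
    cfg_renorm_type="global",  # "global" | "channel" | "text_channel"
)
# image = inferencer(text="a red cube on a blue sphere", **t2i_hyper)  # hypothetical call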

🔥 Quick Start

1️⃣ Set up environment

git clone https://github.com/WayneJin0918/SRUM
cd SRUM
conda env create -f environment.yaml
conda activate SRUM
pip install -r requirements.txt

If flash-attention is hard to install via pip, download and install a prebuilt wheel instead:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
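
After installation, a quick import check (a minimal sketch) can confirm that the wheel matches your PyTorch and CUDA build:

# Sanity check: the printed versions should match the wheel you installed
# (torch 2.5 / CUDA 12 / Python 3.10 for the wheel above).
import torch
import flash_attn
print(torch.__version__, torch.version.cuda, flash_attn.__version__)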

Alternatively, you can follow the environment settings of Bagel.

2️⃣ Download the Bagel pretrained checkpoint or our SRUM checkpoint

# Bagel base checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(cache_dir=cache_dir,
  local_dir=save_dir,
  repo_id=repo_id,
  local_dir_use_symlinks=False,
  resume_download=True,
  allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
# SRUM checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/SRUM_BAGEL_7B_MoT"
repo_id = "Wayne-King/SRUM_BAGEL_7B_MoT"
cache_dir = save_dir + "/cache"

snapshot_download(cache_dir=cache_dir,
  local_dir=save_dir,
  repo_id=repo_id,
  local_dir_use_symlinks=False,
  resume_download=True,
  allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
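
A quick way to confirm the downloads completed (a minimal sketch; paths follow the snippets above):

# List the downloaded files; you should see config JSONs and .safetensors shards.
import os
for d in ["models/BAGEL-7B-MoT", "models/SRUM_BAGEL_7B_MoT"]:
    if os.path.isdir(d):
        print(d, sorted(os.listdir(d)))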

🔥 Train & Eval

Train

1️⃣ Data preparation

Use srum_data_infer/compt2i.sh for multi-GPU image inference. Change the output directory --output_dir to ./your_images_address.

bash srum_data_infer/compt2i.sh

You will then get the image folder ./your_images_address. Next, use srum_data_infer/vlm.sh for scoring; in general, --image_dir in the bash file should be the same as ./your_images_address.

Before running VLM inference, download the SAM weights into the SRUM root directory:

wget https://huggingface.co/HCMUE-Research/SAM-vit-h/resolve/main/sam_vit_h_4b8939.pth
bash srum_data_infer/vlm.sh

You now have the JSONL file your_vlm_output.jsonl and the image folder ./your_images_address; register them in data/dataset_info.py:

        'comp_data': {
            'jsonl_path': './your_vlm_output.jsonl',
            # Replace 'image_base_dir' with the 'image_dirs' dictionary
            'image_dirs': {
                # Key 'good' for the ground-truth images
                'good': './your_images_address',
                # Key 'bad' for the input images, which carry rewards just like the 'good' ones (here it points to the same folder)
                'bad': './your_images_address' 
            },
            'num_total_samples': 5911, # total number of samples in dataset
        },
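
num_total_samples should match the number of records in your JSONL file; a minimal sketch for counting them (the file path follows the config above):

# Count records in the scoring output to fill in 'num_total_samples'.
with open("./your_vlm_output.jsonl") as f:
    num_total_samples = sum(1 for line in f if line.strip())
print(num_total_samples)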

Alternatively, you can directly use our HF training data on Hugging Face or our MS training data on ModelScope.

2️⃣ Start training

Download the base model. Then add the YAML file scripts/data/rft_comp.yaml:

regional_reward:
  dataset_names:
  - comp_data
  image_transform_args:
    image_stride: 256
    max_image_size: 1024
    min_image_size: 512
  num_used_data: # The sum should be larger than NUM_GPUS x NUM_WORKERS
  - 8
  weight: 1

Then launch training:

bash scripts/train_reg_comp.sh

Please do not forget to change PYTHONPATH to your root SRUM path, e.g. /mnt/SRUM. If you are not using 8 GPUs on one node, change --num_shard to your number of GPUs.

We highly recommend that --save_every be at most --total_steps minus one.

3️⃣ Convert to HF weights

bash tool/trans2hf.sh 

If you want to use the generated data to SFT the base model, use tool/trans2parquet.py to convert the images and JSON files into Parquet.
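
To verify the conversion, you can open the resulting Parquet file with pandas (a minimal sketch; the file name is a placeholder, and the column layout depends on tool/trans2parquet.py):

# Inspect the converted Parquet file; column names depend on tool/trans2parquet.py.
import pandas as pd
df = pd.read_parquet("your_sft_data.parquet")  # placeholder path
print(len(df), list(df.columns))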

You can replace the variables in the script with your own before running. See TRAIN for more details.

Eval

Bagel provides scripts for evaluating VLM, T2I, and editing benchmarks. Please see EVAL for more details.

If you want to evaluate on T2I-CompBench, refer to the files in SRUM/CompBench_eval; it is easy to get started. We highly recommend using Qwen2.5-VL-72B-Instruct for evaluation, but you can use Qwen2.5-VL-32B-Instruct instead if you do not have enough memory; the conclusions and overall scores are similar.

Then, run the following command:

bash CompBench_eval/comp_eval_infer.sh

The image output will be saved in BASE_OUTPUT_DIR.

bash CompBench_eval/qwen_eval.sh

The score output will be saved in OUTPUT_DIR.

📊 Benchmarks

1. Composition

| T2I Model | 3D Spatial | Color | Complex | Non-Spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | 76.39 | 90.63 | 83.51 | 87.47 | 75.30 | 80.20 | 84.23 | 87.07 | 83.10 |
| FLUX.1-schnell | 79.38 | 84.53 | 81.96 | 85.55 | 72.82 | 82.20 | 85.49 | 86.38 | 82.29 |
| SD-3-medium | 77.83 | 91.63 | 84.73 | 86.12 | 72.80 | 83.72 | 88.20 | 89.03 | 84.26 |
| SD-xl-base-1 | 72.25 | 77.75 | 75.00 | 85.28 | 57.14 | 72.18 | 77.08 | 78.38 | 74.38 |

| Unified Model | 3D Spatial | Color | Complex | Non-Spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 76.17 | 84.25 | 80.28 | 80.47 | 56.43 | 65.14 | 79.67 | 69.67 | 74.01 |
| Show-o2 | 88.61 | 87.73 | 87.88 | 85.91 | 69.74 | 73.99 | 86.60 | 82.17 | 82.83 |
| BLIP3o | 81.73 | 89.92 | 85.55 | 84.78 | 71.67 | 83.75 | 92.47 | 87.45 | 84.66 |
| OmniGen2 | 82.21 | 92.22 | 86.87 | 88.51 | 72.00 | 83.95 | 90.07 | 90.88 | 85.84 |
| Bagel | 77.98 | 89.30 | 83.32 | 85.03 | 70.40 | 81.94 | 81.52 | 87.93 | 82.18 |
| Bagel (CoT) | 84.66 | 88.85 | 86.10 | 85.64 | 75.36 | 84.33 | 82.71 | 88.07 | 84.46 |
| BLIP3o+SRUM | 83.78↑ | 90.22↑ | 86.57↑ | 85.10↑ | 74.52↑ | 85.44↑ | 93.88↑ | 86.52↓ | 85.75↑ |
| Bagel+SRUM | 83.10↑ | 92.90↑ | 88.69↑ | 88.47↑ | 78.52↑ | 84.23↑ | 86.92↑ | 89.57↑ | 86.55↑ |
| Bagel+SRUM (CoT) 🏆 | 88.60↑ | 92.90↑ | 91.31↑ | 90.48↑ | 80.12↑ | 84.47↑ | 89.93↑ | 89.15↑ | 88.37↑ |

2. Reasoning-informed

| Model | Entity | Idiom | Scientific | Textual Image | Average |
|---|---|---|---|---|---|
| Bagel | 49.70 | 34.46 | 47.52 | 43.59 | 43.82 |
| Bagel+SFT | 50.53 | 39.43 | 47.45 | 44.08 | 45.37 |
| Bagel+SRUM | **52.85** | **40.51** | **47.83** | **45.83** | **46.75** |

Performance comparison of Bagel models across four categories and their average scores. Bold values indicate the best performance in each column.

✍️ Citation

@article{jin2025srum,
  title   = {SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models},
  author  = {Jin, Weiyang and Niu, Yuwei and Liao, Jiaqi and Duan, Chengqi and Li, Aoxue and Gao, Shenghua and Liu, Xihui},
  journal = {arXiv preprint arXiv:2510.12784},
  year    = {2025}
}

📜 License

SRUM is licensed under the Apache License 2.0.
