MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval


This repo contains the annotation data and evaluation code for the paper "MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval".

🔔 News:

  • 🥳 2025-10-24: We have released the MR²-Bench Benchmark and Paper! 🔥

License

Our dataset is under the CC-BY-NC-SA-4.0 license.

⚠️ To access and use our dataset, you must understand and agree to the following: this dataset is for research purposes only and may not be used for commercial or any other purposes. The user assumes full responsibility for any consequences arising from other uses or from dissemination of the data.

We do not own the copyright of the source materials (images and text) used in this benchmark; they are derived from public datasets or online platforms such as Stack Exchange. We respect and acknowledge the copyrights of the original authors of all data used.

If the original authors believe their content should be removed, please raise an issue in this repository.

Introduction

We introduce MR²-Bench, the first benchmark designed to evaluate reasoning-intensive multimodal retrieval. Existing benchmarks primarily test surface-level semantic matching, failing to assess the deeper reasoning required for complex real-world scenarios. MR²-Bench addresses this gap by providing tasks that require logical, spatial, and causal inference.

The benchmark includes 1,309 curated queries across 3 meta-tasks and 12 sub-tasks. It features diverse data types, including natural images, diagrams, charts, and visual puzzles, moving beyond the natural images that dominate other benchmarks. Our evaluation of state-of-the-art models reveals a significant performance drop on MR²-Bench compared to existing benchmarks (e.g., a top model falls from 77.78% Recall@1 on MMEB to 9.91% on MR²-Bench), highlighting the need for more advanced reasoning-intensive retrievers. We anticipate that MR²-Bench will guide the community in developing more capable and robust multimodal retrieval systems.

Figure: Statistical overview of MR²-Bench, with visual examples from the three meta-tasks: Multimodal Knowledge Retrieval, Visual Illustration Search, and Visual Relation Reasoning.

🏆 Mini-Leaderboard

This table shows the average nDCG@10 scores across all 12 sub-tasks. For full results, please refer to our paper.

| Model | Type | Avg. nDCG@10 |
|---|---|---|
| Full Mark | - | 100 |
| **Multimodal Embedding Models** | | |
| Seed-1.6-Embedding | Multimodal | 30.68 |
| MM-Embed | Multimodal | 30.23 |
| VLM2Vec-v2 | Multimodal | 23.72 |
| GME | Multimodal | 21.59 |
| BGE-VL | Multimodal | 19.53 |
| CLIP | Multimodal | 18.59 |
| **Text Embedding Models (+Captions)** | | |
| ReasonIR + Captions | Textual (Reasoning) | 25.72 |
| BGE-Reasoner + Captions | Textual (Reasoning) | 25.35 |
| Diver-Embed + Captions | Textual (Reasoning) | 23.59 |
| Qwen3 + Captions | Textual | 20.17 |
| BGE-M3 + Captions | Textual | 18.71 |
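
For reference, the sketch below shows how nDCG@10 can be computed for a single query under binary relevance labels. It is illustrative only and may differ in detail from the evaluation code in this repository.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k for one query with binary (0/1) relevance labels.

    ranked_ids:   candidate ids ordered by the retriever's score (best first)
    relevant_ids: set of ids labeled relevant for this query
    """
    # DCG: each retrieved relevant item contributes 1 / log2(rank + 2)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, cand in enumerate(ranked_ids[:k])
        if cand in relevant_ids
    )
    # Ideal DCG: all relevant items placed at the top of the ranking
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant candidate is retrieved at rank 3
print(ndcg_at_k(["d7", "d2", "d9", "d4"], {"d9"}))  # 0.5
```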

MR²-Bench

Before you access our dataset, we kindly ask you to thoroughly read and understand the license outlined above. If you cannot agree to these terms, we request that you refrain from downloading our data.

The annotation files are available in this GitHub repository. The full dataset, including images, can be accessed via this 🤗 HF Link.
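
Once you have located the dataset on Hugging Face, it can be loaded with the `datasets` library. This is a minimal sketch: the repo id, split name, and field names below are assumptions, so please check the HF page for the actual identifiers and schema.

```python
from datasets import load_dataset

# NOTE: the repo id and split below are assumptions for illustration only;
# consult the HF dataset page for the actual identifier, configs, and schema.
ds = load_dataset("VectorSpaceLab/MR2-Bench", split="test")

example = ds[0]
print(example.keys())  # e.g. query text/image, candidate pool, relevance labels
```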

MR²-Bench encompasses three meta-tasks designed to test different facets of reasoning in multimodal retrieval:

  1. Multimodal Knowledge Retrieval: Requires retrieving documents where the relevance is established by reasoning over concepts that bridge the query and the document, often relying on essential visual information.
  2. Visual Illustration Search: Involves finding an image (like a chart or visual proof) that intuitively explains or solves a problem posed in a domain-specific textual query.
  3. Visual Relation Reasoning: Assesses vision-centric reasoning capabilities through tasks like spatial reasoning, visual puzzles, and analogies, where intent is conveyed through visual structures rather than language.

Examples of the tasks are displayed below.

Figure: Task examples from MR²-Bench.

Evaluation

Please refer to our GitHub repository for evaluation code and instructions.
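
For illustration, a minimal retrieval loop with CLIP (one of the baselines in the leaderboard above) might look as follows. This is a sketch, not the official evaluation script: the model checkpoint, file paths, and query/candidate format are placeholders, and the actual pipeline should follow the instructions in this repository.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch only; paths and the query/candidate format are placeholders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "Which diagram intuitively illustrates the Pythagorean theorem?"
candidate_paths = ["cand_0.png", "cand_1.png", "cand_2.png"]  # placeholder images

with torch.no_grad():
    # Encode the text query and the candidate images into a shared space
    q_inputs = processor(text=[query], return_tensors="pt", padding=True)
    q_emb = model.get_text_features(**q_inputs)

    images = [Image.open(p).convert("RGB") for p in candidate_paths]
    c_inputs = processor(images=images, return_tensors="pt")
    c_emb = model.get_image_features(**c_inputs)

# Rank candidates by cosine similarity; the ranking feeds metrics such as
# Recall@1 and nDCG@10 (see the sketch in the leaderboard section).
q_emb = torch.nn.functional.normalize(q_emb, dim=-1)
c_emb = torch.nn.functional.normalize(c_emb, dim=-1)
scores = (q_emb @ c_emb.T).squeeze(0)
ranking = scores.argsort(descending=True).tolist()
print(ranking)  # candidate indices from most to least similar
```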

Hosting and Maintenance

The annotation files will be permanently retained.

If copyright holders request the removal of source materials (images or text), we will take them down from our public dataset. We will still keep the relevant annotation files and will actively seek more reliable, risk-free data sources to replace the removed materials, ensuring the long-term validity of the benchmark.

Citation

If you find this repository useful, please consider giving a star ⭐ and citation:

@article{mr2bench,
  title={MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval},
  author={Zhou, Junjie and Liu, Ze and Xiong, Lei and Yao, Jin-Ge and Wang, Yueze and Xiao, Shitao and Lin, Fenfen and Chen, Miguel Hu and Dou, Zhicheng and Bao, Siqi and Lian, Defu and Xiong, Yongping and Liu, Zheng},
  journal={arXiv preprint arXiv:2509.26378},
  year={2025}
}
