MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval


This repo contains the annotation data and evaluation code for the paper "MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval".

🔔 News:

  • 🥳 2025-10-24: We have released the MR²-Bench Benchmark and Paper! 🔥

License

Our dataset is under the CC-BY-NC-SA-4.0 license.

⚠️ To access and use our dataset, you must understand and agree to the following: this dataset is for research purposes only and may not be used for commercial or any other purposes. The user assumes full responsibility for any consequences arising from other uses or from dissemination of the data.

We do not own the copyright of the source materials (images and text) used in this benchmark; they are derived from public datasets or online platforms such as Stack Exchange. We respect and acknowledge the copyrights of the original authors of all data used.

If the original authors believe their content should be removed, please raise an issue in this repository.

Introduction

We introduce MR²-Bench, the first benchmark designed to evaluate reasoning-intensive multimodal retrieval. Existing benchmarks primarily test surface-level semantic matching, failing to assess the deeper reasoning required for complex real-world scenarios. MR²-Bench addresses this gap by providing tasks that require logical, spatial, and causal inference.

The benchmark includes 1,309 curated queries across 3 meta-tasks and 12 sub-tasks. It features diverse data types, including natural images, diagrams, charts, and visual puzzles, moving beyond the natural images that dominate other benchmarks. Our evaluation of state-of-the-art models reveals a significant performance drop on MR²-Bench compared to existing benchmarks (e.g., a top model falls from 77.78% Recall@1 on MMEB to 9.91% on MR²-Bench), highlighting the need for more advanced reasoning-intensive retrievers. We anticipate that MR²-Bench will guide the community in developing more capable and robust multimodal retrieval systems.

Figure: Statistical overview of MR²-Bench, with visual examples from the three meta-tasks: Multimodal Knowledge Retrieval, Visual Illustration Search, and Visual Relation Reasoning.

🏆 Mini-Leaderboard

This table shows the average nDCG@10 scores across all 12 sub-tasks. For full results, please refer to our paper.

| Model | Type | Avg. nDCG@10 |
|---|---|---|
| Full Mark | - | 100 |
| **Multimodal Embedding Models** | | |
| Seed-1.6-Embedding | Multimodal | 30.68 |
| MM-Embed | Multimodal | 30.23 |
| VLM2Vec-v2 | Multimodal | 23.72 |
| GME | Multimodal | 21.59 |
| BGE-VL | Multimodal | 19.53 |
| CLIP | Multimodal | 18.59 |
| **Text Embedding Models (+Captions)** | | |
| ReasonIR + Captions | Textual (Reasoning) | 25.72 |
| BGE-Reasoner + Captions | Textual (Reasoning) | 25.35 |
| Diver-Embed + Captions | Textual (Reasoning) | 23.59 |
| Qwen3 + Captions | Textual | 20.17 |
| BGE-M3 + Captions | Textual | 18.71 |
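
For reference, the sketch below shows how nDCG@10 can be computed for a single query under binary relevance labels. It is illustrative only and may differ in detail from the evaluation code in this repository.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k for one query with binary (0/1) relevance labels.

    ranked_ids:   candidate ids ordered by the retriever's score (best first)
    relevant_ids: set of ids labeled relevant for this query
    """
    # DCG: each retrieved relevant item contributes 1 / log2(rank + 2)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, cand in enumerate(ranked_ids[:k])
        if cand in relevant_ids
    )
    # Ideal DCG: all relevant items placed at the top of the ranking
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant candidate is retrieved at rank 3
print(ndcg_at_k(["d7", "d2", "d9", "d4"], {"d9"}))  # 0.5
```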

MR²-Bench

Before you access our dataset, we kindly ask you to thoroughly read and understand the license outlined above. If you cannot agree to these terms, we request that you refrain from downloading our data.

The annotation files are available in this GitHub repository. The full dataset, including images, can be accessed via this 🤗 HF Link.
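
Once you have located the dataset on Hugging Face, it can be loaded with the `datasets` library. This is a minimal sketch: the repo id, split name, and field names below are assumptions, so please check the HF page for the actual identifiers and schema.

```python
from datasets import load_dataset

# NOTE: the repo id and split below are assumptions for illustration only;
# consult the HF dataset page for the actual identifier, configs, and schema.
ds = load_dataset("VectorSpaceLab/MR2-Bench", split="test")

example = ds[0]
print(example.keys())  # e.g. query text/image, candidate pool, relevance labels
```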

MR²-Bench encompasses three meta-tasks designed to test different facets of reasoning in multimodal retrieval:

  1. Multimodal Knowledge Retrieval: Requires retrieving documents where the relevance is established by reasoning over concepts that bridge the query and the document, often relying on essential visual information.
  2. Visual Illustration Search: Involves finding an image (like a chart or visual proof) that intuitively explains or solves a problem posed in a domain-specific textual query.
  3. Visual Relation Reasoning: Assesses vision-centric reasoning capabilities through tasks like spatial reasoning, visual puzzles, and analogies, where intent is conveyed through visual structures rather than language.

Examples of the tasks are displayed below.

Figure: Task examples from MR²-Bench.

Evaluation

Please refer to our GitHub repository for evaluation code and instructions.
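
For illustration, a minimal retrieval loop with CLIP (one of the baselines in the leaderboard above) might look as follows. This is a sketch, not the official evaluation script: the model checkpoint, file paths, and query/candidate format are placeholders, and the actual pipeline should follow the instructions in this repository.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch only; paths and the query/candidate format are placeholders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "Which diagram intuitively illustrates the Pythagorean theorem?"
candidate_paths = ["cand_0.png", "cand_1.png", "cand_2.png"]  # placeholder images

with torch.no_grad():
    # Encode the text query and the candidate images into a shared space
    q_inputs = processor(text=[query], return_tensors="pt", padding=True)
    q_emb = model.get_text_features(**q_inputs)

    images = [Image.open(p).convert("RGB") for p in candidate_paths]
    c_inputs = processor(images=images, return_tensors="pt")
    c_emb = model.get_image_features(**c_inputs)

# Rank candidates by cosine similarity; the ranking feeds metrics such as
# Recall@1 and nDCG@10 (see the sketch in the leaderboard section).
q_emb = torch.nn.functional.normalize(q_emb, dim=-1)
c_emb = torch.nn.functional.normalize(c_emb, dim=-1)
scores = (q_emb @ c_emb.T).squeeze(0)
ranking = scores.argsort(descending=True).tolist()
print(ranking)  # candidate indices from most to least similar
```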

Hosting and Maintenance

The annotation files will be permanently retained.

If copyright holders request the removal of source materials (images or text), we will take them down from our public dataset. We will still keep the relevant annotation files and will actively seek more reliable, risk-free data sources to replace the removed materials, ensuring the long-term validity of the benchmark.

Citation

If you find this repository useful, please consider giving a star ⭐ and citation:

@article{mr2bench,
  title={MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval},
  author={Zhou, Junjie and Liu, Ze and Xiong, Lei and Yao, Jin-Ge and Wang, Yueze and Xiao, Shitao and Lin, Fenfen and Chen, Miguel Hu and Dou, Zhicheng and Bao, Siqi and Lian, Defu and Xiong, Yongping and Liu, Zheng},
  journal={arXiv preprint arXiv:2509.26378},
  year={2025}
}
