This repo contains the annotation data and evaluation code for the paper "MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval".
Our dataset is under the CC-BY-NC-SA-4.0 license.
We do not own the copyright of the source materials (images, text) used in this benchmark, which are derived from public datasets or online platforms like Stack Exchange. For the data used, we respect and acknowledge any copyrights of the original authors.
If the original authors of these works believe their content should be removed, please raise an issue in this repository.
We introduce MR²-Bench, the first benchmark designed to evaluate reasoning-intensive multimodal retrieval. Existing benchmarks primarily test surface-level semantic matching, failing to assess the deeper reasoning required for complex real-world scenarios. MR²-Bench addresses this gap by providing tasks that require logical, spatial, and causal inference.
The benchmark includes 1,309 curated queries across 3 meta-tasks and 12 sub-tasks. It features diverse data types, including natural images, diagrams, charts, and visual puzzles, moving beyond the natural images typical of other benchmarks. Our evaluation of state-of-the-art models reveals a significant performance drop on MR²-Bench compared to existing benchmarks (e.g., the top model falls from 77.78% Recall@1 on MMEB to 9.91% on MR²-Bench), highlighting the need for more advanced reasoning-intensive retrievers. We anticipate that MR²-Bench will guide the community in developing more capable and robust multimodal retrieval systems.
This table shows the average nDCG@10 scores across all 12 sub-tasks. For full results, please refer to our paper.
| Model | Type | Avg. nDCG@10 |
|---|---|---|
| Full Mark | - | 100 |
| **Multimodal Embedding Models** | | |
| Seed-1.6-Embedding | Multimodal | 30.68 |
| MM-Embed | Multimodal | 30.23 |
| VLM2Vec-v2 | Multimodal | 23.72 |
| GME | Multimodal | 21.59 |
| BGE-VL | Multimodal | 19.53 |
| CLIP | Multimodal | 18.59 |
| **Text Embedding Models (+Captions)** | | |
| ReasonIR + Captions | Textual (Reasoning) | 25.72 |
| BGE-Reasoner + Captions | Textual (Reasoning) | 25.35 |
| Diver-Embed + Captions | Textual (Reasoning) | 23.59 |
| Qwen3 + Captions | Textual | 20.17 |
| BGE-M3 + Captions | Textual | 18.71 |
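
The scores above are nDCG@10 averaged over the 12 sub-tasks. For reference, here is a minimal sketch of one common nDCG@k formulation (linear gain with a log2 discount); it is provided for illustration only and is not the official scoring script.

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of a ranked relevance list, truncated at k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # ranks 1..k -> log2(2..k+1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k: DCG of the predicted ranking normalized by the ideal DCG."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Toy example: a single relevant document retrieved at rank 3.
print(ndcg_at_k([0, 0, 1, 0], [1, 0, 0, 0], k=10))  # 0.5
```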
Before you access our dataset, we kindly ask you to thoroughly read and understand the license outlined above. If you cannot agree to these terms, we request that you refrain from downloading our data.
The annotation files are available in our GitHub repository. The full dataset, including images, can be accessed via this 🤗 HF Link.
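
Below is a minimal sketch of how one might load the data with the `datasets` library; the repository id, split, and field names are placeholders, so please consult the HF dataset card for the actual identifiers.

```python
from datasets import load_dataset

# Placeholder repository id and split; see the Hugging Face dataset card
# for the actual identifiers and configuration names.
dataset = load_dataset("ORG/MR2-Bench", split="test")

sample = dataset[0]
print(sample.keys())  # e.g., query text/image, candidate pool, relevance labels
```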
MR²-Bench encompasses three meta-tasks designed to test different facets of reasoning in multimodal retrieval:
- Multimodal Knowledge Retrieval: Requires retrieving documents where the relevance is established by reasoning over concepts that bridge the query and the document, often relying on essential visual information.
- Visual Illustration Search: Involves finding an image (like a chart or visual proof) that intuitively explains or solves a problem posed in a domain-specific textual query.
- Visual Relation Reasoning: Assesses vision-centric reasoning capabilities through tasks like spatial reasoning, visual puzzles, and analogies, where intent is conveyed through visual structures rather than language.
Examples of the tasks are displayed below.
Please refer to our GitHub repository for evaluation code and instructions.
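
As a rough illustration of the retrieval protocol (embed queries and candidates, rank candidates by cosine similarity, then score with nDCG@10), here is a sketch using an off-the-shelf CLIP model via `sentence-transformers`. The model choice, toy data, and field layout are illustrative assumptions, not the official pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # any multimodal embedding model works here

# Hypothetical toy data: one textual query and three candidate captions.
queries = ["Which diagram explains the Pythagorean theorem visually?"]
candidates = [
    "A right triangle with squares drawn on each of its three sides.",
    "A bar chart of quarterly revenue.",
    "A photo of a cat on a sofa.",
]
relevance = np.array([[1, 0, 0]])  # gold labels, shape (num_queries, num_candidates)

q_emb = model.encode(queries, normalize_embeddings=True)
c_emb = model.encode(candidates, normalize_embeddings=True)
scores = q_emb @ c_emb.T               # cosine similarity (embeddings are normalized)
ranking = np.argsort(-scores, axis=1)  # candidate indices, best first

# Score each query with the ndcg_at_k helper from the sketch above.
for qi, order in enumerate(ranking):
    ranked_rel = relevance[qi][order].tolist()
    print(ndcg_at_k(ranked_rel, relevance[qi].tolist(), k=10))
```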
The annotation files will be permanently retained.
If copyright holders request the removal of source materials (images or text), we will take them down from our public dataset. We will retain the relevant annotation files and actively seek more reliable, risk-free data sources as replacements to ensure the long-term validity of the benchmark.
If you find this repository useful, please consider giving a star ⭐ and citation:
@article{mr2bench,
  title={MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval},
  author={Zhou, Junjie and Liu, Ze and Xiong, Lei and Yao, Jin-Ge and Wang, Yueze and Xiao, Shitao and Lin, Fenfen and Chen, Miguel Hu and Dou, Zhicheng and Bao, Siqi and Lian, Defu and Xiong, Yongping and Liu, Zheng},
journal={arXiv preprint arXiv:2509.26378},
year={2025}
}

