Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks involving multimodal context. However, whether they exhibit a modality preference when processing multimodal contexts remains understudied. To study this question, we first build a
Download and install the specific Transformers package from our 🤗Huggingface repository:

```bash
unzip transformers-main.zip
pip install -r requirements.list
cd transformers-main
pip install -e .
```

We provide the data in
The complete data of
Add a new key named 'icc' inside the "context" field of evaluat_final.json, and move the original content of 'context' into the new 'icc' dictionary, as sketched below.
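Here is a minimal sketch of this conversion, assuming evaluat_final.json is a JSON list of samples that each carry a `context` field (the exact schema may differ in the released data):

```python
import json

# Load the evaluation file (path and schema are assumptions; adjust to your data layout).
with open("evaluat_final.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Move the original content of "context" under a new 'icc' key.
for sample in data:
    original_context = sample["context"]
    sample["context"] = {"icc": original_context}

with open("evaluat_final.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```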
1. model type: one of `single`, `base_text`, `base_image`, and `icc`
   - `single`: both text and vision context as input, with no instruction towards any modality
   - `base_text`: only the text context as input
   - `base_image`: only the vision context as input
   - `icc`: both text and vision context as input, with an instruction towards the vision or text modality
2. inference_type: one of `single_multi_no_specific` and `base_multi`
   - `single_multi_no_specific`: used when model type is `single`
   - `base_multi`: used for the other model types
```bash
bash qwen_vl_7b.sh 0 icc base_multi
bash qwen_vl_7b.sh 0 single single_multi_no_specific
bash qwen_vl_7b.sh 0 base_text base_multi
bash qwen_vl_7b.sh 0 base_image base_multi
```

The hidden states are saved to `results/$version/$mode_type/$inference_type/$model.h5`.
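To inspect a saved file, one option is a minimal h5py sketch like the one below; the concrete path and the dataset layout inside the `.h5` file are assumptions.

```python
import h5py

# Placeholder path following the results/$version/$mode_type/$inference_type/$model.h5 pattern.
path = "results/v1/icc/base_multi/qwen_vl_7b.h5"

with h5py.File(path, "r") as f:
    # Walk the file and print the name, shape, and dtype of every stored dataset.
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```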
```bash
python pca.py
```
For pca.py, set the file paths for `file_single` and `file_instruction` as follows (a rough sketch of the analysis follows this list):
- To generate the hidden states for `file_single`, run: `bash qwen_vl_7b.sh 0 single single_multi_no_specific`
- To generate the hidden states for `file_instruction`, run: `bash qwen_vl_7b.sh 0 icc base_multi`
- To generate the hidden states for
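For orientation, here is a rough sketch of the kind of projection pca.py computes, assuming each HDF5 file stores a 2-D array of hidden states under a dataset named `hidden_states` (the paths and dataset name are assumptions; the actual script may organize the data differently):

```python
import h5py
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical paths following the results/... layout produced by the commands above.
file_single = "results/v1/single/single_multi_no_specific/qwen_vl_7b.h5"
file_instruction = "results/v1/icc/base_multi/qwen_vl_7b.h5"

def load_hidden_states(path):
    # Assumes a dataset named "hidden_states" of shape (num_samples, hidden_dim).
    with h5py.File(path, "r") as f:
        return np.asarray(f["hidden_states"])

single_states = load_hidden_states(file_single)
instruction_states = load_hidden_states(file_instruction)

# Fit one 2-component PCA on both conditions so they share a projection space.
pca = PCA(n_components=2)
projected = pca.fit_transform(np.concatenate([single_states, instruction_states], axis=0))
single_2d = projected[: len(single_states)]
instruction_2d = projected[len(single_states):]

print("explained variance ratio:", pca.explained_variance_ratio_)
```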