Giuseppe Cartella,
Vittorio Cuculo,
Alessandro D'Amelio,
Marcella Cornia,
Giuseppe Boccignone,
Rita Cucchiara
Official implementation of "Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction", ICCV 2025 🌺
Abstract:
Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research.
```bash
# Clone the repository
git clone https://github.com/aimagelab/ScanDiff.git
cd ScanDiff

# Install dependencies
conda create --name scandiff python=3.10
conda activate scandiff
pip install -r requirements.txt

# Install PyTorch and Torchvision (CUDA 12.1)
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu121
```

This project uses Hydra to manage configurations in a flexible and composable way.
All experiment settings (e.g., data, model, diffusion parameters) are defined in YAML files inside the configs/ directory.
Hydra lets you override or combine configuration options directly from the command line, without modifying the source files.
For more details, visit the Hydra documentation!
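For instance, the configuration can also be loaded and overridden programmatically through Hydra's compose API. The sketch below is a minimal illustration only: the top-level config name (`config`) is an assumption, so check the files in configs/ for the actual names.

```python
# Minimal sketch of Hydra's compose API. Assumptions: the top-level config
# is named "config" and num_output_scanpaths is a valid key (it appears in
# the demo commands below); check configs/ for the actual names.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="config",                   # assumed top-level config name
        overrides=["num_output_scanpaths=20"],  # same syntax as CLI overrides
    )
print(OmegaConf.to_yaml(cfg))
```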
Download the free-viewing and visual search checkpoints with the following command:
```bash
wget -O checkpoints.zip https://ailb-web.ing.unimore.it/publicfiles/ScanDiff_ICCV2025/checkpoints.zip && unzip -j checkpoints.zip -d checkpoints && rm checkpoints.zip
```

Download the task embeddings with the following command:
```bash
wget -O data.zip https://ailb-web.ing.unimore.it/publicfiles/ScanDiff_ICCV2025/data.zip && unzip -j data.zip -d data && rm data.zip
```

At this point, the project root should look like:
```
ScanDiff/
├── data/
│   └── task_embeddings.npy
└── checkpoints/
    ├── scandiff_freeview.pth
    └── scandiff_visualsearch.pth
```
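As a quick sanity check, you can inspect the downloaded task embeddings from Python. This is a hedged sketch: the file layout (plain array vs. pickled dictionary) is an assumption, so adapt the call accordingly.

```python
# Sanity check for data/task_embeddings.npy. Assumption: the file holds a
# plain NumPy array; if it is instead a pickled dict, keep allow_pickle=True
# and call .item() on the result to recover it.
import numpy as np

emb = np.load("data/task_embeddings.npy", allow_pickle=True)
print(type(emb), getattr(emb, "shape", None))
```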
We provide a simple demo.py script to generate scanpaths for a given image.

- Generate scanpaths in the free-viewing setting:

```bash
python demo.py image_path=./sample_images/dog.jpg viewing_task="" checkpoint_path=./checkpoints/scandiff_freeview.pth num_output_scanpaths=10
```

- Generate scanpaths in the visual search setting:
```bash
python demo.py image_path=./sample_images/car.jpg viewing_task="car" checkpoint_path=./checkpoints/scandiff_visualsearch.pth num_output_scanpaths=10
```
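After running the demo, the predicted scanpaths can be visualized on top of the input image. The snippet below is illustrative only: it assumes each scanpath is a sequence of (x, y) fixation coordinates in pixels (demo.py's actual output format may differ), and the fixation values shown are placeholders.

```python
# Illustrative overlay of predicted scanpaths on the input image.
# Assumption: each scanpath is a list of (x, y) fixations in pixel
# coordinates; replace the placeholder below with demo.py's actual output.
import matplotlib.pyplot as plt
from PIL import Image

image = Image.open("./sample_images/dog.jpg")
scanpaths = [[(120, 80), (340, 210), (290, 400)]]  # placeholder fixations

plt.imshow(image)
for path in scanpaths:
    xs, ys = zip(*path)
    plt.plot(xs, ys, "-o", linewidth=2)                   # saccades + fixations
    plt.scatter(xs[0], ys[0], s=120, c="lime", zorder=3)  # starting fixation
plt.axis("off")
plt.savefig("scanpath_overlay.png", bbox_inches="tight")
```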
Planned updates:

- Release data and train-val-test splits.
- Release training code for ScanDiff.
- Release evaluation scripts for benchmark comparisons.
If you find this work useful for your research, please cite our paper:
```bibtex
@inproceedings{cartella2025modeling,
  title={Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction},
  author={Cartella, Giuseppe and Cuculo, Vittorio and D'Amelio, Alessandro and Cornia, Marcella and Boccignone, Giuseppe and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```