# CLIPCAM

Official implementation of **CLIPCAM: A Simple Baseline for Zero-shot Text-guided Object and Action Localization** (ICASSP 2022).
## Table of Contents

- Environment Setup
- Quick Demo
- Supported models for CLIPCAM
- CAM Variations
- Dataset Preparation
- Evaluation
  - Grid-view Zero-shot Object Localization
    - OpenImage
  - Grid-view Zero-shot Action Localization
    - HICO-DET
  - Single-image Zero-shot Object Localization
    - OpenImage
    - ILSVRC
    - COSMOS
    - Custom Images
- Other features
## Environment Setup

- Create a conda environment with Python 3.7:

```bash
conda create -n clipcam python=3.7
conda activate clipcam
```

- Install PyTorch 1.9.0 and torchvision 0.10.0 with a compatible CUDA version (or any compatible torch version):

```bash
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```

- Install the required packages:

```bash
pip install -r requirements.txt
```
## Quick Demo

Please go to this link for a quick demo.

*P.S. First-time users: please follow the instructions at the top of the demo website to allow your browser to connect to the demo server.*
You can also run CLIPCAM locally:

```bash
python clipcam.py \
    --image_path "{single image path or grid image directory (4 images)}" \
    --sentence "{input sentence}" \
    --gpu_id 0 \
    --clip_model_name "ViT-B/16" \
    --cam_model_name "GradCAM"
```

## Supported models for CLIPCAM

CLIP Models (from OpenAI):
example: `--clip_model_name ViT-B/16`

- ViT-B/16
- ViT-B/32
- RN50
- RN101
- RN50x4
- RN50x16
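
These identifiers are the backbone names used by OpenAI's CLIP package. As a quick sanity check (a standalone snippet, not part of this repository), you can list the backbones your installed `clip` package provides:

```python
import clip

# Prints the backbone identifiers available in the installed clip package,
# e.g. 'RN50', 'RN101', 'ViT-B/32', 'ViT-B/16', ...
print(clip.available_models())
```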
 
To use a backbone with standard pretrained (non-CLIP) weights, append `-pretrained` to the model name (these are used together with the `_original` CAM variants below).

example: `--clip_model_name ViT-B/16-pretrained`
## CAM Variations

CAMs for CLIP (CLIPCAMs) (from pytorch-grad-cam):

example: `--cam_model_name GradCAM`

- GradCAM
- GradCAMPlusPlus
- XGradCAM
- ScoreCAM
- EigenCAM
- EigenGradCAM
- GuidedBackpropReLUModel
- LayerCAM

CAMs for other models (from pytorch-grad-cam):

example: `--cam_model_name GradCAM_original`

- GradCAM_original
- GradCAMPlusPlus_original
- XGradCAM_original
- ScoreCAM_original
- EigenGradCAM_original
- EigenCAM_original
- GuidedBackpropReLUModel_original
- LayerCAM_original
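
As a rough illustration of the idea (a minimal sketch using the current pytorch-grad-cam API, not the repository's actual implementation; the image path, prompt, and choice of `RN50`'s `layer4` as target layer are assumptions), a CAM can be driven by CLIP's image-text similarity by wrapping CLIP so that its forward pass returns the similarity score:

```python
# Minimal sketch: text-guided Grad-CAM on CLIP. This is NOT the repository's
# exact code; file name, prompt, and target layer are illustrative assumptions.
import torch
import clip
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("RN50", device=device)
clip_model = clip_model.float()  # keep everything in fp32 so gradients are stable


class CLIPSimilarity(torch.nn.Module):
    """Wraps CLIP so that forward() returns the image-text similarity score."""

    def __init__(self, clip_model, sentence):
        super().__init__()
        self.clip_model = clip_model
        with torch.no_grad():
            tokens = clip.tokenize([sentence]).to(device)
            text = clip_model.encode_text(tokens)
        self.register_buffer("text_features", text / text.norm(dim=-1, keepdim=True))

    def forward(self, images):
        image_features = self.clip_model.encode_image(images)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        return image_features @ self.text_features.T  # shape (batch, 1)


model = CLIPSimilarity(clip_model, "a photo of a dog")
target_layers = [clip_model.visual.layer4]  # last stage of the RN50 image encoder

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
cam = GradCAM(model=model, target_layers=target_layers)
# Back-propagate from the single similarity score (index 0) to get the heatmap.
heatmap = cam(input_tensor=image, targets=[ClassifierOutputTarget(0)])[0]
```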
 
## Dataset Preparation

- OpenImage V6: download the OpenImage V6 validation set with `data_prep/openimage.py`.
- HICO-DET: download HICO-DET from this link.
- ILSVRC (optional): download the ILSVRC validation set.
- COSMOS (optional): download the COSMOS validation set.
## Evaluation

### Grid-view Zero-shot Object Localization

#### OpenImage

- Dataset structure (OpenImage):

```
|--OpenImage
    |--validation
        |--data
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--labels
            |--detections.csv
        |--metadata
            |--classes.csv
```

- Run `evaluate_grid_openimage.py` with any model selection:

```bash
python evaluate_grid_openimage.py \
    --data_dir Dataset/OpenImage/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/grid/openimage/vitb32-grad' \
    --mask_threshold 0.2 \
    --sentence_prefix 'a photo of ' \
    --attack 'None' \
    --save_result 1
```
### Grid-view Zero-shot Action Localization

#### HICO-DET

- Dataset structure (HICO-DET):

```
|--HICO-DET
    |--images
        |--test
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--train
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
    |--anno.mat
    |--anno_bbox.mat
```

- Run `verb_grid.py` for the pre-trained model.
  Train the model with half of the classes in HICO-DET, or download the fine-tuned checkpoints from this OneDrive.
  `--train_mode` with 'full', 'few' or 'half' specifies the setting used when loading the classes of the HICO-DET dataset.

```bash
python verb_grid.py \
    --data_dir datasets/hico-det \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32-pretrained' \
    --cam_model_name 'GradCAM_original' \
    --save_dir 'eval_result/grid/hicodet/vitb32-pretrained-grad' \
    --mask_threshold 0.2 \
    --train_mode 'half' \
    --model_name checkpoints/models/vitb32-pretrained-half-1e-6.pth \
    --save_result 1
```

- Run `verb_grid.py` for CLIPCAM:

```bash
python verb_grid.py \
    --data_dir dataset/hico-det \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/grid/hicodet/vitb32-grad' \
    --mask_threshold 0.2 \
    --save_result 1
```
### Single-image Zero-shot Object Localization

#### OpenImage

a. Run `evaluate_openimage.py`:

```bash
python evaluate_openimage.py \
    --data_dir datasets/OpenImage/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/openimage/vitb32-grad' \
    --save_result 1 \
    --sentence_prefix 'a photo of ' \
    --distill_num 0 \
    --attack 'None'
```
#### ILSVRC

a. Dataset structure (ImageNet):

```
|--ImageNet
    |--validation
        |--{label_1}
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--{label_2}
        |-- ...
    |--bbox
        |--val
            |--{image_path_1}.xml
            |--{image_path_2}.xml
            |-- ...
```

b. Run `evaluate_imagenet.py`:

```bash
python evaluate_imagenet.py \
    --data_dir dataset/ImageNet/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/imagenet/vitb32-grad' \
    --batch 128 \
    --save_result 1 \
    --sentence_prefix 'sentence' \
    --attack 'None'
```
#### COSMOS, OpenImage and custom images

a. Run `evaluate.py` with `--dataset cosmos` or `--dataset openimage`:

```bash
python evaluate.py \
    --data_dir datasets/COSMOS/val \
    --gpu_id 0 \
    --dataset cosmos \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/cosmos/vitb32-grad' \
    --distill_num 0 \
    --attack 'None'
```

#### Custom Images

Test on images with custom guiding text:

a. Put the images in a folder.

b. Run `evaluate.py` without specifying `--dataset`:

```bash
python evaluate.py \
    --data_dir {path to folder} \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/custom-input-vitb32-grad' \
    --distill_num 0
```
## Other features

### Iterative masking

We propose an iterative refinement method that masks out regions of high neural importance in order to expand the attention or enhance weak response regions.

Set `--distill_num {n}` to iteratively mask out `{n}` times, as in the example below.
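
For example, reusing the custom-image command from above with three masking iterations (the `--save_dir` value is just an illustrative name):

```bash
python evaluate.py \
    --data_dir {path to folder} \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/custom-input-vitb32-grad-distill3' \
    --distill_num 3
```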
### Image attacks

We also evaluated CLIPCAM's ability to handle attacked (corrupted) images.

Set `--attack fog` or `--attack snow` to apply a fog or snow attack, as in the example below.
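
For example, the single-image OpenImage evaluation from above run under a fog attack (the `--save_dir` value is just an illustrative name):

```bash
python evaluate_openimage.py \
    --data_dir datasets/OpenImage/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/openimage/vitb32-grad-fog' \
    --save_result 1 \
    --sentence_prefix 'a photo of ' \
    --distill_num 0 \
    --attack 'fog'
```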
## Citation

If you find the paper or the code useful for your study, please consider citing the CLIPCAM paper:

```bibtex
@inproceedings{clipcam_hsia_icassp2022,
    author    = {Hsia, Hsuan-An and Lin, Che-Hsien and Kung, Bo-Han and Chen, Jhao-Ting and Tan, Daniel Stanley and Chen, Jun-Cheng and Hua, Kai-Lung},
    title     = {{CLIPCAM: A Simple Baseline for Zero-shot Text-guided Object and Action Localization}},
    booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year      = {2022}
}
```

## Contact

If you have questions regarding the paper or code, please open an issue or email us: Jhao-Ting Chen or Che-Hsien Lin. We will get back to you as soon as possible.





