# CLIPCAM

Official implementation of **CLIPCAM: A Simple Baseline for Zero-shot Text-guided Object and Action Localization** (ICASSP 2022).
## Table of Contents

- Environment Setup
- Quick Demo
- Supported models for CLIPCAM
- CAM Variations
- Dataset Preparation
- Evaluation
  - Grid-view Zero-shot Object Localization
    - OpenImage
  - Grid-view Zero-shot Action Localization
    - HICO-DET
  - Single-image Zero-shot Object Localization
    - OpenImage
    - ILSVRC
    - COSMOS
    - Custom Images
- Other features
## Environment Setup

- Create a conda environment with Python 3.7:

```bash
conda create -n clipcam python=3.7
conda activate clipcam
```

- Install PyTorch 1.9.0 and torchvision 0.10.0 with a compatible CUDA version (or any compatible torch version):

```bash
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```

- Install the required packages:

```bash
pip install -r requirements.txt
```
## Quick Demo

Please go to this link for a quick demo.

*P.S. First-time users: please follow the instructions at the top of the demo website to allow your browser to connect to the demo server.*
You can also run CLIPCAM locally:

```bash
python clipcam.py \
    --image_path "{single image path or grid image directory (4 images)}" \
    --sentence "{input sentence}" \
    --gpu_id 0 \
    --clip_model_name "ViT-B/16" \
    --cam_model_name "GradCAM"
```

## Supported models for CLIPCAM

CLIP Models (from OpenAI):
example: `--clip_model_name ViT-B/16`

- ViT-B/16
- ViT-B/32
- RN50
- RN101
- RN50x4
- RN50x16
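
These identifiers are the backbone names used by OpenAI's CLIP package. As a quick sanity check (a standalone snippet, not part of this repository), you can list the backbones your installed `clip` package provides:

```python
import clip

# Prints the backbone identifiers available in the installed clip package,
# e.g. 'RN50', 'RN101', 'ViT-B/32', 'ViT-B/16', ...
print(clip.available_models())
```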
 
To use a backbone with standard pretrained (non-CLIP) weights, append `-pretrained` to the model name (these are used together with the `_original` CAM variants below).

example: `--clip_model_name ViT-B/16-pretrained`
## CAM Variations

CAMs for CLIP (CLIPCAMs) (from pytorch-grad-cam):

example: `--cam_model_name GradCAM`

- GradCAM
- GradCAMPlusPlus
- XGradCAM
- ScoreCAM
- EigenCAM
- EigenGradCAM
- GuidedBackpropReLUModel
- LayerCAM

CAMs for other models (from pytorch-grad-cam):

example: `--cam_model_name GradCAM_original`

- GradCAM_original
- GradCAMPlusPlus_original
- XGradCAM_original
- ScoreCAM_original
- EigenGradCAM_original
- EigenCAM_original
- GuidedBackpropReLUModel_original
- LayerCAM_original
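
As a rough illustration of the idea (a minimal sketch using the current pytorch-grad-cam API, not the repository's actual implementation; the image path, prompt, and choice of `RN50`'s `layer4` as target layer are assumptions), a CAM can be driven by CLIP's image-text similarity by wrapping CLIP so that its forward pass returns the similarity score:

```python
# Minimal sketch: text-guided Grad-CAM on CLIP. This is NOT the repository's
# exact code; file name, prompt, and target layer are illustrative assumptions.
import torch
import clip
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("RN50", device=device)
clip_model = clip_model.float()  # keep everything in fp32 so gradients are stable


class CLIPSimilarity(torch.nn.Module):
    """Wraps CLIP so that forward() returns the image-text similarity score."""

    def __init__(self, clip_model, sentence):
        super().__init__()
        self.clip_model = clip_model
        with torch.no_grad():
            tokens = clip.tokenize([sentence]).to(device)
            text = clip_model.encode_text(tokens)
        self.register_buffer("text_features", text / text.norm(dim=-1, keepdim=True))

    def forward(self, images):
        image_features = self.clip_model.encode_image(images)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        return image_features @ self.text_features.T  # shape (batch, 1)


model = CLIPSimilarity(clip_model, "a photo of a dog")
target_layers = [clip_model.visual.layer4]  # last stage of the RN50 image encoder

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
cam = GradCAM(model=model, target_layers=target_layers)
# Back-propagate from the single similarity score (index 0) to get the heatmap.
heatmap = cam(input_tensor=image, targets=[ClassifierOutputTarget(0)])[0]
```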
 
## Dataset Preparation

- OpenImage V6: download the OpenImage V6 validation set with `data_prep/openimage.py`.
- HICO-DET: download HICO-DET from this link.
- ILSVRC (optional): download the ILSVRC validation set.
- COSMOS (optional): download the COSMOS validation set.
## Evaluation

### Grid-view Zero-shot Object Localization

#### OpenImage

- Dataset structure (OpenImage):

```
|--OpenImage
    |--validation
        |--data
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--labels
            |--detections.csv
        |--metadata
            |--classes.csv
```

- Run `evaluate_grid_openimage.py` with any model selection:

```bash
python evaluate_grid_openimage.py \
    --data_dir Dataset/OpenImage/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/grid/openimage/vitb32-grad' \
    --mask_threshold 0.2 \
    --sentence_prefix 'a photo of ' \
    --attack 'None' \
    --save_result 1
```
### Grid-view Zero-shot Action Localization

#### HICO-DET

- Dataset structure (HICO-DET):

```
|--HICO-DET
    |--images
        |--test
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--train
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
    |--anno.mat
    |--anno_bbox.mat
```

- Run `verb_grid.py` for the pre-trained model.
  Train the model with half of the classes in HICO-DET, or download the fine-tuned checkpoints from this OneDrive.
  `--train_mode` with 'full', 'few' or 'half' specifies the setting used when loading the classes of the HICO-DET dataset.

```bash
python verb_grid.py \
    --data_dir datasets/hico-det \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32-pretrained' \
    --cam_model_name 'GradCAM_original' \
    --save_dir 'eval_result/grid/hicodet/vitb32-pretrained-grad' \
    --mask_threshold 0.2 \
    --train_mode 'half' \
    --model_name checkpoints/models/vitb32-pretrained-half-1e-6.pth \
    --save_result 1
```

- Run `verb_grid.py` for CLIPCAM:

```bash
python verb_grid.py \
    --data_dir dataset/hico-det \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/grid/hicodet/vitb32-grad' \
    --mask_threshold 0.2 \
    --save_result 1
```
### Single-image Zero-shot Object Localization

#### OpenImage

a. Run `evaluate_openimage.py`:

```bash
python evaluate_openimage.py \
    --data_dir datasets/OpenImage/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/openimage/vitb32-grad' \
    --save_result 1 \
    --sentence_prefix 'a photo of ' \
    --distill_num 0 \
    --attack 'None'
```
#### ILSVRC

a. Dataset structure (ImageNet):

```
|--ImageNet
    |--validation
        |--{label_1}
            |--{image_path_1}
            |--{image_path_2}
            |-- ...
        |--{label_2}
        |-- ...
    |--bbox
        |--val
            |--{image_path_1}.xml
            |--{image_path_2}.xml
            |-- ...
```

b. Run `evaluate_imagenet.py`:

```bash
python evaluate_imagenet.py \
    --data_dir dataset/ImageNet/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/imagenet/vitb32-grad' \
    --batch 128 \
    --save_result 1 \
    --sentence_prefix 'sentence' \
    --attack 'None'
```
#### COSMOS, OpenImage and custom images

a. Run `evaluate.py` with `--dataset cosmos` or `--dataset openimage`:

```bash
python evaluate.py \
    --data_dir datasets/COSMOS/val \
    --gpu_id 0 \
    --dataset cosmos \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/cosmos/vitb32-grad' \
    --distill_num 0 \
    --attack 'None'
```

#### Custom Images

Test on images with custom guiding text:

a. Put the images in a folder.

b. Run `evaluate.py` without specifying `--dataset`:

```bash
python evaluate.py \
    --data_dir {path to folder} \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/custom-input-vitb32-grad' \
    --distill_num 0
```
## Other features

### Iterative masking

We propose an iterative refinement method that masks out regions of high neural importance in order to expand the attention or enhance weak response regions.

Set `--distill_num {n}` to iteratively mask out `{n}` times, as in the example below.
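
For example, reusing the custom-image command from above with three masking iterations (the `--save_dir` value is just an illustrative name):

```bash
python evaluate.py \
    --data_dir {path to folder} \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/custom-input-vitb32-grad-distill3' \
    --distill_num 3
```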
### Image attacks

We also evaluated CLIPCAM's ability to handle attacked (corrupted) images.

Set `--attack fog` or `--attack snow` to apply a fog or snow attack, as in the example below.
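
For example, the single-image OpenImage evaluation from above run under a fog attack (the `--save_dir` value is just an illustrative name):

```bash
python evaluate_openimage.py \
    --data_dir datasets/OpenImage/validation \
    --gpu_id 0 \
    --clip_model_name 'ViT-B/32' \
    --cam_model_name 'GradCAM' \
    --save_dir 'eval_result/single/openimage/vitb32-grad-fog' \
    --save_result 1 \
    --sentence_prefix 'a photo of ' \
    --distill_num 0 \
    --attack 'fog'
```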
## Citation

If you find the paper or the code useful for your study, please consider citing the CLIPCAM paper:

```bibtex
@inproceedings{clipcam_hsia_icassp2022,
    author    = {Hsia, Hsuan-An and Lin, Che-Hsien and Kung, Bo-Han and Chen, Jhao-Ting and Tan, Daniel Stanley and Chen, Jun-Cheng and Hua, Kai-Lung},
    title     = {{CLIPCAM: A Simple Baseline for Zero-shot Text-guided Object and Action Localization}},
    booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year      = {2022}
}
```

## Contact

If you have questions regarding the paper or code, please open an issue or email us: Jhao-Ting Chen or Che-Hsien Lin. We will get back to you as soon as possible.





