This repository demonstrates how to adapt Gemma 3, a powerful vision-language model (VLM), for object detection. By treating bounding boxes as discrete <locXXXX> tokens, we enable the model to reason about spatial information in a language-native way — inspired by PaliGemma.
Here's a glimpse of what our fine-tuned model can do. These images are generated by the predict.py script:
*Detected license plates (sample outputs; images omitted here).*
Most traditional object detection models output continuous bounding box coordinates using regression heads. In contrast, we follow the PaliGemma approach of treating bounding boxes as sequences of discrete tokens (e.g., <loc0512>), allowing detection to be framed as text generation.
However, unlike PaliGemma, Gemma 3 does not natively include these spatial tokens in its tokenizer.
We support two fine-tuning modes:

1. **Without location tokens (default).** The model is trained on <locXXXX> tokens even though they are not in the tokenizer vocabulary, forcing it to learn spatial grounding implicitly. Although lightweight, this approach still produces promising results.
2. **With location tokens (`--include_loc_tokens`).** We extend the tokenizer to explicitly include all <locXXXX> tokens (from <loc0000> to <loc1023>) and first fine-tune their embeddings, then fine-tune the entire model, following a two-stage training procedure. This enables the model to learn spatial grounding more effectively.
> 💡 Both modes are supported; toggle with `--include_loc_tokens` in `train.py`.
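When the flag is set, the tokenizer has to be extended and the embedding matrix resized before training begins. Here is a minimal sketch of that step, assuming a recent Hugging Face transformers API; the checkpoint name is a placeholder, and the real logic lives in `train.py`:

```python
# Sketch of extending the tokenizer with location tokens (explicit mode).
# The checkpoint name is an assumption; the actual value comes from config.py.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3-4b-it"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

# Register all 1024 location tokens, <loc0000> through <loc1023>.
loc_tokens = [f"<loc{i:04d}>" for i in range(1024)]
num_added = processor.tokenizer.add_tokens(loc_tokens)

# Grow the embedding matrix so each new token gets a trainable row.
model.resize_token_embeddings(len(processor.tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(processor.tokenizer)}")
```

In the two-stage procedure, the new token embeddings are fine-tuned first, and the entire model is fine-tuned afterwards.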
We use the ariG23498/license-detection-paligemma dataset, a modified version of keremberke/license-plate-object-detection, reformatted to match the expectations of text-based object detection.
Each bounding box is encoded as a sequence of location tokens like <loc0123>, following the PaliGemma format.
To reproduce the dataset or modify it for your use case, refer to the script create_dataset.py.
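To make the format concrete, here is an illustrative helper (the function name is hypothetical) that quantizes a pixel-space box into PaliGemma's token layout; `create_dataset.py` contains the actual conversion:

```python
# Hypothetical helper showing how a pixel-space box becomes PaliGemma-style
# location tokens; see create_dataset.py for the actual implementation.
def encode_box(xmin, ymin, xmax, ymax, width, height):
    """Quantize corner coordinates into 1024 bins and emit <locXXXX> tokens.

    PaliGemma orders the tokens y_min, x_min, y_max, x_max.
    """
    def to_bin(value, size):
        # Map [0, size] to an integer bin in [0, 1023].
        return min(int(value / size * 1024), 1023)

    bins = [to_bin(ymin, height), to_bin(xmin, width),
            to_bin(ymax, height), to_bin(xmax, width)]
    return "".join(f"<loc{b:04d}>" for b in bins)

# A 640x480 image with a plate at (100, 200) - (300, 260):
print(encode_box(100, 200, 300, 260, 640, 480))
# -> <loc0426><loc0160><loc0554><loc0480>
```

The training target then appends the class label, e.g. `<loc0426><loc0160><loc0554><loc0480> license plate`.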
Get your environment ready to fine-tune Gemma 3:

```bash
git clone https://github.com/ariG23498/gemma3-object-detection.git
cd gemma3-object-detection
uv venv .venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
```

Follow these steps to configure, train, and run predictions:
1. Configure your training via `config.py`. All major parameters, like the learning rate, model path, and dataset, are defined here.
2. Train the model:

   ```bash
   python train.py --include_loc_tokens
   ```

   Toggle `--include_loc_tokens` based on your strategy (see the explanation above).
3. Run inference:

   ```bash
   python predict.py
   ```

   This script uses the fine-tuned model to detect license plates and writes annotated images to the `outputs/` folder. A sketch of how the generated location tokens decode back into pixel boxes follows these steps.
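For intuition, here is a hedged sketch of the reverse mapping an inference script like `predict.py` has to perform, turning generated <locXXXX> tokens back into pixel coordinates (the function name and details are illustrative, not the script's actual code):

```python
# Illustrative post-processing: parse <locXXXX> tokens from generated text
# back into pixel coordinates (predict.py implements the real version).
import re

def decode_boxes(text, width, height):
    """Yield (xmin, ymin, xmax, ymax) for each group of four loc tokens."""
    bins = [int(b) for b in re.findall(r"<loc(\d{4})>", text)]
    for i in range(0, len(bins) - 3, 4):
        ymin, xmin, ymax, xmax = bins[i:i + 4]
        yield (xmin / 1024 * width, ymin / 1024 * height,
               xmax / 1024 * width, ymax / 1024 * height)

generated = "<loc0426><loc0160><loc0554><loc0480> license plate"
for box in decode_boxes(generated, width=640, height=480):
    print(box)  # ~ (100.0, 199.7, 300.0, 259.7); quantization costs sub-bin precision
```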
Here are some tasks we would like to investigate further (a LoRA starting-point sketch follows the list):

- Low-Rank Adaptation (LoRA) training.
- Quantized Low-Rank Adaptation (QLoRA) training.
- Training with a bigger object detection dataset.
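As a starting point for the LoRA item, here is a hedged sketch using the peft library; the checkpoint, target modules, and hyperparameters are assumptions that would need validation on Gemma 3:

```python
# Assumed starting point for LoRA fine-tuning; not part of this repo yet.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
lora_config = LoraConfig(
    r=16,               # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor for the updates
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```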
We welcome contributions to enhance this project! If you have ideas for improvements, bug fixes, or new features, please:
- Fork the repository.
- Create a new branch for your feature or fix:

  ```bash
  git checkout -b feature/my-new-feature
  ```

- Implement your changes.
- Commit your changes with clear messages:

  ```bash
  git commit -am 'Add some amazing feature'
  ```

- Push your branch to your fork:

  ```bash
  git push origin feature/my-new-feature
  ```

- Open a Pull Request against the main repository.
If you use our work, please cite us.
```bibtex
@misc{gosthipaty_gemma3_object_detection_2025,
  author       = {Aritra Roy Gosthipaty and Sergio Paniego},
  title        = {Fine-tuning Gemma 3 for Object Detection},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ariG23498/gemma3-object-detection.git}}
}
```




