πŸ¦› Chonkie ✨


The no-nonsense ultra-light RAG pipeline that CHONKs, refines, and ships your texts!

Installation β€’ Usage β€’ Chunkers β€’ Integrations β€’ Benchmarks

Tired of making your gazillionth chunker? Sick of the overhead of large libraries? Want to chunk your texts quickly and efficiently? Chonkie the mighty hippo is here to help!

πŸš€ Feature-rich: All the CHONKs you'd ever need
πŸ”„ End-to-end: Fetch, CHONK, refine, embed and ship straight to your vector DB!
✨ Easy to use: Install, Import, CHONK
⚑ Fast: CHONK at the speed of light! zooooom
πŸͺΆ Light-weight: No bloat, just CHONK
πŸ”Œ 32+ integrations: Works with your favorite tools and vector DBs out of the box!
πŸ’¬ Multilingual: Out-of-the-box support for 56 languages
☁️ Cloud-Friendly: CHONK locally or in the Cloud
πŸ¦› Cute CHONK mascot: psst it's a pygmy hippo btw
❀️ Moto Moto's favorite python library

Chonkie is a chunking library that "just works" ✨

πŸ“¦ Installation

Basic Installation

Using pip:

```bash
pip install chonkie
```

Or using uv (faster):

```bash
uv pip install chonkie
```

Full Installation

Chonkie follows the rule of minimum installs. Have a favorite chunker? Read our docs to install only what you need. Don't want to think about it? Simply install everything (not recommended for production environments).

Using pip:

```bash
pip install "chonkie[all]"
```

Or using uv:

```bash
uv pip install "chonkie[all]"
```

πŸš€ Usage

Basic Usage

Here's a basic example to get you started:

```python
# First import the chunker you want from Chonkie
from chonkie import RecursiveChunker

# Initialize the chunker
chunker = RecursiveChunker()

# Chunk some text
chunks = chunker("Chonkie is the goodest boi! My favorite chunking hippo hehe.")

# Access chunks
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
```

Pipeline Usage

You can also use the chonkie.Pipeline to chain components together and handle complex workflows. Read more about pipelines in the docs!

```python
from chonkie import Pipeline

# Create a pipeline with multiple chunking and refinement steps
pipe = (
    Pipeline()
    .chunk_with("recursive", tokenizer="gpt2", chunk_size=2048, recipe="markdown")
    .chunk_with("semantic", chunk_size=512)
    .refine_with("overlap", context_size=128)
    .refine_with("embeddings", embedding_model="sentence-transformers/all-MiniLM-L6-v2")
)

# CHONK some Texts!
doc = pipe.run(texts="Chonkie is the goodest boi! My favorite chunking hippo hehe.")

# Access the processed chunks in the `doc` object
for chunk in doc.chunks:
    print(chunk.text)
```

Check out more usage examples in the docs!

βœ‚οΈ Chunkers

Chonkie provides several chunkers to help you split your text efficiently for RAG applications. Here's a quick overview of the available chunkers:

| Name | Alias | Description |
|------|-------|-------------|
| TokenChunker | token | Splits text into fixed-size token chunks. |
| SentenceChunker | sentence | Splits text into chunks based on sentences. |
| RecursiveChunker | recursive | Splits text hierarchically using customizable rules to create semantically meaningful chunks. |
| SemanticChunker | semantic | Splits text into chunks based on semantic similarity. Inspired by the work of Greg Kamradt. |
| LateChunker | late | Embeds text and then splits it to have better chunk embeddings. |
| CodeChunker | code | Splits code into structurally meaningful chunks. |
| NeuralChunker | neural | Splits text using a neural model. |
| SlumberChunker | slumber | Splits text using an LLM to find semantically meaningful chunks. Also known as "AgenticChunker". |

More on these methods and the approaches behind them can be found in the docs.
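To build intuition for the hierarchy-first idea behind RecursiveChunker, here's a toy sketch (illustrative only; Chonkie's real implementation is token-aware and far more capable). It tries coarse separators first, recurses with finer ones only when a piece is still too large, and greedily merges small pieces back up to the size limit. The separator list and character budget below are assumptions chosen for the demo:

```python
# Toy sketch of hierarchical (recursive) splitting -- NOT Chonkie's
# RecursiveChunker. Separators are ordered from coarse to fine.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_chunk(text: str, max_chars: int, level: int = 0) -> list[str]:
    """Split text into pieces of at most max_chars characters, preferring
    coarse boundaries (paragraphs) over fine ones (words)."""
    if len(text) <= max_chars:
        return [text]
    if level == len(SEPARATORS):
        # Out of separators: fall back to a hard character split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep = SEPARATORS[level]
    pieces = []
    for part in text.split(sep):
        if part:
            pieces.extend(recursive_chunk(part, max_chars, level + 1))
    # Greedily merge adjacent pieces back up to max_chars so chunks
    # stay as large (and as coherent) as possible.
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

chunks = recursive_chunk("First paragraph.\n\nSecond paragraph with more words in it.", 30)
```

Note how the paragraph boundary survives intact, and only the oversized second paragraph gets split at finer boundaries.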

πŸ”Œ Integrations

Chonkie boasts 32+ integrations across tokenizers, embedding providers, LLMs, refineries, porters, vector databases, and utilities, ensuring it fits seamlessly into your existing workflow.

πŸ‘¨β€πŸ³ Chefs & πŸ“ Fetchers! Text preprocessing and data loading!

Chefs handle text preprocessing, while Fetchers load data from various sources.

| Component | Class | Description | Optional Install |
|-----------|-------|-------------|------------------|
| chef | TextChef | Text preprocessing and cleaning. | default |
| fetcher | FileFetcher | Load text from files and directories. | default |
🏭 Refine your CHONKs with Context and Embeddings! Chonkie supports 2+ refineries!

Refineries help you post-process and enhance your chunks after initial chunking.

| Refinery Name | Class | Description | Optional Install |
|---------------|-------|-------------|------------------|
| overlap | OverlapRefinery | Merge overlapping chunks based on similarity. | default |
| embeddings | EmbeddingsRefinery | Add embeddings to chunks using any provider. | chonkie[semantic] |
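As a mental model for overlap refinement (a generic sketch, not Chonkie's OverlapRefinery, which works on tokens and is configurable), each chunk can be prefixed with the tail of its predecessor so neighbors share context:

```python
# Toy sketch of overlap refinement -- NOT Chonkie's OverlapRefinery.
# Each chunk is prefixed with the last `context_size` characters of the
# previous chunk so neighboring chunks share context at their boundary.
def add_overlap(chunks: list[str], context_size: int) -> list[str]:
    refined = [chunks[0]]  # assumes a non-empty chunk list
    for prev, curr in zip(chunks, chunks[1:]):
        refined.append(prev[-context_size:] + curr)
    return refined

refined = add_overlap(["Chonkie is a hippo.", "Hippos love chunks."], 7)
```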
🐴 Exporting CHONKs! Chonkie supports 2+ Porters!

Porters help you save your chunks easily.

| Porter Name | Class | Description | Optional Install |
|-------------|-------|-------------|------------------|
| json | JSONPorter | Export chunks to a JSON file. | default |
| datasets | DatasetsPorter | Export chunks to HuggingFace datasets. | chonkie[datasets] |
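Conceptually, a porter just serializes chunks to a destination format. A minimal stdlib sketch of the JSON case (the record shape here is hypothetical; JSONPorter's actual output schema may differ):

```python
# Toy sketch of exporting chunks to JSON with the stdlib -- in practice,
# Chonkie's JSONPorter handles this for you.
import json

def export_chunks(chunks: list[dict], path: str) -> None:
    """Write a list of chunk records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)

# Hypothetical records with the fields shown in Basic Usage above.
records = [
    {"text": "Chonkie is the goodest boi!", "token_count": 7},
    {"text": "My favorite chunking hippo hehe.", "token_count": 8},
]
export_chunks(records, "chunks.json")
```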
🀝 Shake hands with your DB! Chonkie connects with 8+ vector stores!

Handshakes provide a unified interface to ingest chunks directly into your favorite vector databases.

| Handshake Name | Class | Description | Optional Install |
|----------------|-------|-------------|------------------|
| chroma | ChromaHandshake | Ingest chunks into ChromaDB. | chonkie[chroma] |
| elastic | ElasticHandshake | Ingest chunks into Elasticsearch. | chonkie[elastic] |
| mongodb | MongoDBHandshake | Ingest chunks into MongoDB. | chonkie[mongodb] |
| pgvector | PgvectorHandshake | Ingest chunks into PostgreSQL with pgvector. | chonkie[pgvector] |
| pinecone | PineconeHandshake | Ingest chunks into Pinecone. | chonkie[pinecone] |
| qdrant | QdrantHandshake | Ingest chunks into Qdrant. | chonkie[qdrant] |
| turbopuffer | TurbopufferHandshake | Ingest chunks into Turbopuffer. | chonkie[tpuf] |
| weaviate | WeaviateHandshake | Ingest chunks into Weaviate. | chonkie[weaviate] |
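The handshake pattern itself is simple: one uniform write interface in front of whatever store you use. Here's a toy sketch with an in-memory list standing in for a real vector database (the class and method names are made up for illustration and are not Chonkie's API):

```python
# Toy sketch of the "handshake" pattern -- a uniform ingest interface over
# a store. The in-memory list stands in for ChromaDB/Qdrant/etc.; this is
# NOT Chonkie's actual Handshake API.
class InMemoryHandshake:
    def __init__(self) -> None:
        self.store: list[dict] = []

    def write(self, chunks: list[str]) -> int:
        # A real handshake would embed each chunk and upsert it into the DB.
        for i, chunk in enumerate(chunks):
            self.store.append({"id": i, "text": chunk})
        return len(self.store)

handshake = InMemoryHandshake()
count = handshake.write(["chunk one", "chunk two"])
```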
πŸͺ“ Slice 'n' Dice! Chonkie supports 5+ ways to tokenize!

Choose from supported tokenizers or provide your own custom token counting function. Flexibility first!

| Name | Description | Optional Install |
|------|-------------|------------------|
| character | Basic character-level tokenizer. Default tokenizer. | default |
| word | Basic word-level tokenizer. | default |
| tokenizers | Load any tokenizer from the Hugging Face tokenizers library. | chonkie[tokenizers] |
| tiktoken | Use OpenAI's tiktoken library (e.g., for gpt-4). | chonkie[tiktoken] |
| transformers | Load tokenizers via AutoTokenizer from HF transformers. | chonkie[neural] |

default indicates that the feature is available with the default pip install chonkie.

To use a custom token counter, you can pass in any function that takes a string and returns an integer! Something like this:

```python
from chonkie import RecursiveChunker

def custom_token_counter(text: str) -> int:
    return len(text)

chunker = RecursiveChunker(tokenizer=custom_token_counter)
```

You can use this to extend Chonkie to support any tokenization scheme you want!

🧠 Embed like a boss! Chonkie links up with 8+ embedding pals!

Seamlessly works with various embedding model providers. Bring your favorite embeddings to the CHONK party! Use AutoEmbeddings to load models easily.

| Provider / Alias | Class | Description | Optional Install |
|------------------|-------|-------------|------------------|
| model2vec | Model2VecEmbeddings | Use Model2Vec models. | chonkie[model2vec] |
| sentence-transformers | SentenceTransformerEmbeddings | Use any sentence-transformers model. | chonkie[st] |
| openai | OpenAIEmbeddings | Use OpenAI's embedding API. | chonkie[openai] |
| azure-openai | AzureOpenAIEmbeddings | Use Azure OpenAI embedding service. | chonkie[azure-openai] |
| cohere | CohereEmbeddings | Use Cohere's embedding API. | chonkie[cohere] |
| gemini | GeminiEmbeddings | Use Google's Gemini embedding API. | chonkie[gemini] |
| jina | JinaEmbeddings | Use Jina AI's embedding API. | chonkie[jina] |
| voyageai | VoyageAIEmbeddings | Use Voyage AI's embedding API. | chonkie[voyageai] |
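Whichever provider you pick, chunks end up as vectors, and retrieval then compares those vectors, most commonly by cosine similarity. A provider-agnostic sketch of that comparison:

```python
import math

# Generic cosine-similarity sketch: how retrieval compares chunk
# embeddings. Provider-agnostic; any embedding class above produces
# vectors that can be compared this way.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal directions score 0.0.
score = cosine([0.1, 0.9, 0.2], [0.1, 0.9, 0.2])
```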
πŸ§žβ€β™‚οΈ Power Up with Genies! Chonkie supports 3+ LLM providers!

Genies provide interfaces to interact with Large Language Models (LLMs) for advanced chunking strategies or other tasks within the pipeline.

| Genie Name | Class | Description | Optional Install |
|------------|-------|-------------|------------------|
| gemini | GeminiGenie | Interact with Google Gemini APIs. | chonkie[gemini] |
| openai | OpenAIGenie | Interact with OpenAI APIs. | chonkie[openai] |
| azure-openai | AzureOpenAIGenie | Interact with Azure OpenAI APIs. | chonkie[azure-openai] |

You can also use OpenAIGenie with any LLM provider that supports the OpenAI API format by changing the model, base_url, and api_key parameters. For example, here's how to use OpenAIGenie with the Llama-4-Maverick model via OpenRouter:

```python
from chonkie import OpenAIGenie

genie = OpenAIGenie(
    model="meta-llama/llama-4-maverick",
    base_url="https://openrouter.ai/api/v1",
    api_key="your_api_key",
)
```

πŸ› οΈ Utilities & Helpers! Chonkie includes handy tools!

Additional utilities to enhance your chunking workflow.

| Utility Name | Class | Description | Optional Install |
|--------------|-------|-------------|------------------|
| hub | Hubbie | Simple wrapper for HuggingFace Hub operations. | chonkie[hub] |
| viz | Visualizer | Rich console visualizations for chunks. | chonkie[viz] |

With Chonkie's wide range of integrations, you can easily plug it into your existing infrastructure and start CHONKING!

πŸ“Š Benchmarks

"I may be smol hippo, but I pack a big punch!" πŸ¦›

Chonkie is not just cute, it's also fast and efficient! Here's how it stacks up against the competition:

SizeπŸ“¦

  • Wheel Size: 505KB (vs 1-12MB for alternatives)
  • Installed Size: 49MB (vs 80-171MB for alternatives)
  • With Semantic: Still 10x lighter than the closest competition!

Speed⚑

  • Token Chunking: 33x faster than the slowest alternative
  • Sentence Chunking: Almost 2x faster than competitors
  • Semantic Chunking: Up to 2.5x faster than others

Check out our detailed benchmarks to see how Chonkie races past the competition! πŸƒβ€β™‚οΈπŸ’¨

🀝 Contributing

Want to help grow Chonkie? Check out CONTRIBUTING.md to get started! Whether you're fixing bugs, adding features, or improving docs, every contribution helps make Chonkie a better CHONK for everyone.

Remember: No contribution is too small for this tiny hippo! πŸ¦›

πŸ™ Acknowledgements

Chonkie would like to CHONK its way through a special thanks to all the users and contributors who have helped make this library what it is today! Your feedback, issue reports, and improvements have helped make Chonkie the CHONKIEST it can be.

And of course, special thanks to Moto Moto for endorsing Chonkie with his famous quote:

"I like them big, I like them chonkie." ~ Moto Moto

πŸ“ Citation

If you use Chonkie in your research, please cite it as follows:

```bibtex
@software{chonkie2025,
  author = {Minhas, Bhavnick and Nigam, Shreyash},
  title = {Chonkie: A no-nonsense fast, lightweight, and efficient text chunking library},
  year = {2025},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/chonkie-inc/chonkie}},
}
```