Extreme Text Classifier

A Python text classifier for large-scale multi-class classification tasks with 1000+ classes using Amazon Bedrock. Supports basic classification, reranking with LLMs, and PDF document processing.

For a live demo of the library's capabilities, see ui/README.md for the interactive frontend showcase.

Features

Large-Scale Classification: Handle 1000+ classes with high accuracy
LLM-Based Reranking: Improve accuracy using Amazon Bedrock models
Attribute Validation: Enhanced confidence scoring with detailed condition checking
PDF Document Processing: Extract and classify content from PDF files using multimodal LLMs
Dataset Generation: Create synthetic datasets for any domain
Step-by-Step Scoring: Transparent similarity, rerank, and attribute scores

How It Works

The classifier uses a 3-step pipeline that combines semantic understanding with optional quality enhancements:

                   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────┐
  Input Text   ───▶│ Semantic Search │───▶│  LLM Reranking  │───▶│ Attribute Validation│
  "Invoice..."     │                 │    │   (Optional)    │    │    (Optional)       │
                   │ Find Similar    │    │ Refine Top-K    │    │ Validate Rules      │
                   │ Classes 1000+   │    │ Classes 1-5     │    │ Business Logic      │
                   └─────────────────┘    └─────────────────┘    └─────────────────────┘
                            ↓                      ↓                       ↓                       
                    Similarity Score          Rerank Score           Attribute Score

1. 🔍 Semantic Search (Required)

Uses embedding-based similarity to find the most relevant classes from your dataset. This step handles the core classification by computing cosine similarity between the input text and all class descriptions. Fast and scalable across 1000+ classes.

2. 🎯 LLM Reranking (Optional)

Refines the top candidates using Amazon Bedrock models for deeper semantic understanding. This step improves accuracy by having LLMs evaluate the contextual relevance of each candidate, focusing computational resources on the most promising options.

3. ✅ Attribute Validation (Optional)

Validates that the predicted class matches specific business rules and conditions. This step provides additional confidence by checking if the text satisfies class-specific requirements using LLM-based condition evaluation and logical operators.

Example in practice

If a document is classified as an Invoice, it must have:

A monetary amount to be paid (e.g., "$500.00", "Total: $1,250")
Both parties identified (e.g., "From: ABC Corp" and "To: XYZ Inc")
Either a request for payment OR proof of completed payment

This validation catches instances where receipts have been incorrectly classified as invoices.

Why This Pipeline Works

⚡ Scalable: Embedding search handles 1000+ classes in milliseconds, with optional LLM steps only processing top candidates
🎯 Accurate: Multi-step validation catches edge cases that pure similarity might miss
🔒 Reliable: Attribute validation ensures predictions meet business requirements, not just semantic similarity
📊 Transparent: Each step provides its own confidence score, allowing you to understand and tune the decision process
🛠️ Flexible: Use just semantic search for speed, or add reranking and validation for maximum accuracy

Installation

Prerequisites:
- Python 3.8+
- AWS CLI installed and configured
- AWS account with Bedrock access and required models activated (Amazon Nova Lite, Amazon Nova Pro, Amazon Rerank, Cohere Rerank, Anthropic Claude Sonnet 4)

Install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set up AWS credentials:
```
aws configure
```

Run examples:

# Try the getting started examples
python examples/getting_started/1_basic_classification.py
python examples/getting_started/2_classification_with_reranking.py
python examples/getting_started/3_pdf_document_classification.py

Quick Start

Basic Classification Examples

1. Basic Text Classification

Start with simple text classification using a pre-built dataset. The classifier uses embeddings to find the most similar class for your input text.

See full example: examples/getting_started/1_basic_classification.py

from multi_class_text_classifier import TextClassifier

classifier = TextClassifier(
    dataset_path="datasets/document_classification_dataset.json",
    embeddings_path="datasets/document_classification_embeddings.pkl.gz"
)

result = classifier.predict("Please remit payment of $1,250.00 within 30 days.")
print(f"Predicted Class: {result.predicted_class.name}")
print(f"Confidence: {result.effective_score:.4f}")

2. Classification with Reranking

Improve classification accuracy by adding a reranking step using Amazon Bedrock models. This refines the initial similarity search results.

See full example: examples/getting_started/2_classification_with_reranking.py

from multi_class_text_classifier import TextClassifier
from multi_class_text_classifier.models.data_models import RerankingConfig

reranking_config = RerankingConfig(
    model_type="amazon_rerank",
    model_id="arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0",
    top_k_candidates=5,
    aws_region="us-west-2"
)

classifier = TextClassifier(
    embeddings_path="datasets/document_classification_embeddings.pkl.gz",
    reranking_config=reranking_config
)

result = classifier.predict("This quarterly financial report shows revenue growth.")
print(f"Predicted Class: {result.predicted_class.name}")
print(f"Reranked: {result.reranked}")

3. PDF Document Classification

Extract content from PDF files using multimodal LLMs that can process both text and images, then classify the extracted content.

See full example: examples/getting_started/3_pdf_document_classification.py

from multi_class_text_classifier import create_pdf_extractor, TextClassifier

pdf_extractor = create_pdf_extractor(
    model_id="us.amazon.nova-lite-v1:0"
)

pdf_content = pdf_extractor.extract_from_file("datasets/dummy_invoice.pdf")

classifier = TextClassifier(
    embeddings_path="datasets/document_classification_embeddings.pkl.gz"
)

result = classifier.predict(pdf_content.full_content)
print(f"Document type: {result.predicted_class.name}")
print(f"Confidence: {result.effective_score:.2%}")

Attribute Validation Examples

4a. Simple Attribute Validation

Enhance classification confidence by validating that the predicted class matches specific attributes. This provides additional scoring and detailed feedback on why a classification succeeded or failed.

See full example: examples/getting_started/4a_simple_attribute_validation.py

from multi_class_text_classifier import TextClassifier
from multi_class_text_classifier.models.data_models import (
    AttributeValidationConfig, AttributeEvaluationConfig
)

attribute_config = AttributeValidationConfig(
    enabled=True,
    attributes_path="datasets/document_classification_attributes.json",
    evaluation_config=AttributeEvaluationConfig(
        model_id="us.amazon.nova-lite-v1:0"
    )
)

classifier = TextClassifier(
    dataset_path="datasets/document_classification_dataset.json",
    embeddings_path="datasets/document_classification_embeddings.pkl.gz",
    attribute_config=attribute_config
)

result = classifier.predict("INVOICE #001 - Amount Due: $500 - From: ABC Corp")
print(f"Predicted: {result.predicted_class.name}")
print(f"Similarity Score: {result.similarity_score:.3f}")
print(f"Attribute Score: {result.attribute_score:.3f}")

if result.attribute_validation:
    print(f"Conditions Met: {result.attribute_validation.conditions_met}")
    print(f"Conditions Not Met: {result.attribute_validation.conditions_not_met}")

4c. Generate Attributes with LLM

Automatically generate attribute definitions for existing classes using LLMs. This can save you time if you don't have attributes for your classes and want to use attribute validation to enhance classification confidence.

See full example: examples/getting_started/4c_attribute_generation_example.py

from multi_class_text_classifier.attribute_generator import AttributeGenerator
from multi_class_text_classifier.dataset_loader import ClassesDataset
from multi_class_text_classifier.models.data_models import AttributeGenerationConfig

# Configure attribute generation
generation_config = AttributeGenerationConfig(
    model_id="us.amazon.nova-pro-v1:0",
    temperature=0.1,
    max_tokens=4000
)

generator = AttributeGenerator(generation_config)

# Load existing classes from dataset
dataset_loader = ClassesDataset("datasets/document_classification_dataset.json")
classes = dataset_loader.load_classes_from_json()

# Generate attributes for all classes (returns standard format)
all_attributes = generator.generate_attributes_for_classes(
    classes,
    domain_context="document classification"
)

# Save generated attributes
generator.save_attributes_to_file(all_attributes, "output/generated_attributes.json")

4d. Use Generated Attributes

Use the LLM-generated attributes for classification and validation. This demonstrates the complete workflow from attribute generation to practical usage.

See full example: examples/getting_started/4d_use_generated_attributes.py

from multi_class_text_classifier import TextClassifier
from multi_class_text_classifier.models.data_models import (
    AttributeValidationConfig, AttributeEvaluationConfig
)

# Convert generated attributes to standard format
standard_attributes_path = convert_generated_to_standard_format()

attribute_config = AttributeValidationConfig(
    enabled=True,
    attributes_path=standard_attributes_path,
    evaluation_config=AttributeEvaluationConfig(
        model_id="us.amazon.nova-lite-v1:0"
    )
)

classifier = TextClassifier(
    dataset_path="datasets/document_classification_dataset.json",
    embeddings_path="datasets/document_classification_embeddings.pkl.gz",
    attribute_config=attribute_config
)

result = classifier.predict("INVOICE #INV-2024-001 - Amount Due: $2,500.00")
print(f"Predicted: {result.predicted_class.name}")
print(f"Attribute Score: {result.attribute_validation.overall_score:.3f}")

Reranking Models

You can use different types of models for reranking through Amazon Bedrock:

Dedicated Reranking Models

Models specifically designed for reranking tasks:

# Amazon Rerank
RerankingConfig(
    model_type="amazon_rerank",
    model_id="arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0"
)

# Cohere Rerank
RerankingConfig(
    model_type="cohere_rerank",
    model_id="cohere.rerank-v3-5:0"
)

Text-to-Text LLMs

General language models that can perform reranking:

# Amazon Nova family
RerankingConfig(
    model_type="llm",
    model_id="us.amazon.nova-lite-v1:0",
    model_parameters={"temperature": 0.1, "max_tokens": 1000}
)

# Anthropic Claude family
RerankingConfig(
    model_type="llm",
    model_id="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    model_parameters={"temperature": 0.1, "max_tokens": 1000}
)

PDF Processing

PDF processing uses multimodal LLMs that can understand both text and images. These models analyze images within PDFs to extract text content and describe visual elements.

Supported Multimodal Models

Nova Lite: Amazon's multimodal model
Nova Pro: Higher quality image analysis
Claude Sonnet 4: Anthropic's multimodal model
Other models available on Bedrock may work, but have not been tested

PDF Extractor Configuration

from multi_class_text_classifier import create_pdf_extractor

pdf_extractor = create_pdf_extractor(
    model_id="us.amazon.nova-lite-v1:0"
)

pdf_content = pdf_extractor.extract_from_file("document.pdf")

Dataset Generation

Dataset generation creates synthetic class definitions for your domain using LLMs. This is useful when you don't have existing class definitions but need to classify text into domain-specific categories. The generated classes include realistic names and descriptions that capture the nuances of your domain.

generator = DatasetGenerator()

# Technology domain
tech_dataset = generator.generate_dummy_dataset("software tools", 50)

# Healthcare domain  
health_dataset = generator.generate_dummy_dataset("medical devices", 75)

# E-commerce domain
ecommerce_dataset = generator.generate_dummy_dataset("product categories", 100)

Understanding Results

Score Types

Similarity Score: Cosine similarity from embedding search
Rerank Score: Semantic evaluation from LLM models
Effective Score: Best available score (rerank if available, otherwise similarity)

Accessing Detailed Results

result = classifier.predict("Sample text")

print(f"Predicted: {result.predicted_class.name}")
print(f"Confidence: {result.effective_score:.4f} ({result.get_confidence_level()})")

for alt in result.alternatives[1:3]:
    print(f"Alternative: {alt.class_definition.name} ({alt.effective_score:.4f})")

Attribute Validation

Attribute validation provides an additional layer of confidence by checking if the classified text meets specific conditions defined for each class. This helps identify misclassifications and provides detailed feedback.

Example in Practice

If a document is classified as an Invoice, it must have:

A monetary amount to be paid (e.g., "$500.00", "Total: $1,250")
Both parties identified (e.g., "From: ABC Corp" and "To: XYZ Inc")
Either a request for payment OR proof of completed payment

If classified as a Technical Manual, it must have:

Instructions for operating or maintaining something
Step-by-step procedures (numbered or bulleted lists)
Technical specifications, requirements, or interface descriptions

This validation catches cases where similarity search might classify a receipt as an invoice, or a blog post as a technical manual.

How It Works

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Predicted Class │───▶│ Load Attributes │───▶│ LLM Evaluation  │───▶│ Logical Scoring │
│                 │    │                 │    │                 │    │                 │
│ "Invoice"       │    │ Load conditions │    │ Check each      │    │ Apply AND/OR    │
│ Score: 0.85     │    │ for Invoice     │    │ condition       │    │ logic           │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
                                ↓                       ↓                       ↓
                        JSON Definitions        Individual Results      Final Attribute Score
                        • Amount required       • Amount: ✓ YES         Score: 1.0 (all met)
                        • Parties required      • Parties: ✓ YES        Conditions Met: 3
                        • Payment type          • Payment: ✓ YES        Conditions Not Met: 0

Attribute Score Types

Attribute Score: 0.0-1.0 based on how well the text matches class-specific conditions
Combined Confidence: Average of similarity and attribute scores for comprehensive assessment
Conditions Met/Not Met: Detailed breakdown of which conditions passed or failed with reasoning

Example Attribute Definition

{
  "name": "Invoice",
  "required_attributes": {
    "operator": "AND",
    "conditions": [
      "must specify monetary amount to be paid",
      "must identify both parties (payer and payee)",
      {
        "operator": "OR", 
        "conditions": [
          "must be a request for payment",
          "must be a proof of completed payment"
        ]
      }
    ]
  }
}

Examples

The examples/getting_started/ directory contains step-by-step examples:

Basic Classification Examples

Basic Classification - Simple text classification using pre-built datasets
Classification with Reranking - Improved accuracy with LLM reranking
PDF Document Classification - Extract and classify PDF content using multimodal LLMs

Attribute Validation Examples

4a. Simple Attribute Validation - Enhanced classification confidence with predefined attributes 4c. Generate Attributes with LLM - Automatically create attribute definitions using LLMs 4d. Use Generated Attributes - Complete workflow from generation to practical usage

Advanced Examples

5a. Complete Pipeline - Full workflow with dataset generation, embeddings, and classification 5b. Reranking Comparison - Compare different reranking models and their performance

Use Cases

Document Classification: Invoices, contracts, reports, manuals
Product Classification: E-commerce, inventory, parts catalogs
Content Categorization: Articles, support tickets, emails
Medical Classification: Supplies, equipment, procedures
Technical Classification: Software tools, hardware components

Performance Tips

Use existing embeddings when possible for faster initialization
Limit top_k_candidates (3-10) for optimal reranking performance
Batch process multiple texts for better throughput

Project Structure

multi_class_text_classifier/
├── multi_class_text_classifier/     # Core package
├── examples/getting_started/        # Step-by-step examples
├── datasets/                        # Sample datasets
├── tests/                          # Unit tests
└── output/                         # Generated files

The binary files datasets/*_embeddings.pkl.gz contain the embeddings of the class descriptions contained in the corresponding json file. These are generated automatically when the dataset is loaded, but are are provided in the repository to speed up initialization.

Third-Party Dependencies and Licenses

This package depends on and may incorporate or retrieve a number of third-party software packages (such as open source packages) at install-time or build-time or run-time ("External Dependencies"). The External Dependencies are subject to license terms that you must accept in order to use this package. If you do not accept all of the applicable license terms, you should not use this package. We recommend that you consult your company’s open source approval policy before proceeding.

Provided below is a list of External Dependencies and the applicable license identification as indicated by the documentation associated with the External Dependencies as of Amazon's most recent review.

THIS INFORMATION IS PROVIDED FOR CONVENIENCE ONLY. AMAZON DOES NOT PROMISE THAT THE LIST OR THE APPLICABLE TERMS AND CONDITIONS ARE COMPLETE, ACCURATE, OR UP-TO-DATE, AND AMAZON WILL HAVE NO LIABILITY FOR ANY INACCURACIES. YOU SHOULD CONSULT THE DOWNLOAD SITES FOR THE EXTERNAL DEPENDENCIES FOR THE MOST COMPLETE AND UP-TO-DATE LICENSING INFORMATION.

YOUR USE OF THE EXTERNAL DEPENDENCIES IS AT YOUR SOLE RISK. IN NO EVENT WILL AMAZON BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, CONSEQUENTIAL, SPECIAL, INCIDENTAL, OR PUNITIVE DAMAGES (INCLUDING FOR ANY LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, OR COMPUTER FAILURE OR MALFUNCTION) ARISING FROM OR RELATING TO THE EXTERNAL DEPENDENCIES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, EVEN IF AMAZON HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS AND DISCLAIMERS APPLY EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW.

AGPL Dependencies

PyMuPDF (1.26.3) - GNU Affero General Public License v3.0 (AGPL-3.0) - https://github.com/pymupdf/PyMuPDF

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
datasets		datasets
examples		examples
multi_class_text_classifier		multi_class_text_classifier
tests		tests
ui		ui
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Uh oh!

License

Uh oh!

aws-samples/sample-extreme-text-classifier

Folders and files

Latest commit

History

Repository files navigation

Extreme Text Classifier

Features

How It Works

1. 🔍 Semantic Search (Required)

2. 🎯 LLM Reranking (Optional)

3. ✅ Attribute Validation (Optional)

Example in practice

Why This Pipeline Works

Installation

Quick Start

Basic Classification Examples

1. Basic Text Classification

2. Classification with Reranking

3. PDF Document Classification

Attribute Validation Examples

4a. Simple Attribute Validation

4c. Generate Attributes with LLM

4d. Use Generated Attributes

Reranking Models

Dedicated Reranking Models

Text-to-Text LLMs

PDF Processing

Supported Multimodal Models

PDF Extractor Configuration

Dataset Generation

Understanding Results

Score Types

Accessing Detailed Results

Attribute Validation

Example in Practice

How It Works

Attribute Score Types

Example Attribute Definition

Examples

Basic Classification Examples

Attribute Validation Examples

Advanced Examples

Use Cases

Performance Tips

Project Structure

Third-Party Dependencies and Licenses

AGPL Dependencies

Security

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages