CodeConCat

Transform codebases into AI-ready formats with intelligent parsing, compression, and security analysis

What is CodeConCat?

CodeConCat is a Python tool that transforms codebases into formats optimized for AI consumption and analysis. It automatically processes your code to extract functions, classes, imports, and documentation across 25+ programming languages, providing structured output that makes code analysis intuitive and efficient.

Why CodeConCat?

Multi-Language Intelligence: Parse 25+ languages including smart contracts
AI-Optimized Output: Optional compression and AI summarization
Production-Grade Security: Path traversal protection, XXE prevention, Semgrep integration
High Performance: Process 100+ files in under 5 seconds with parallel execution

Primary Use Cases:

Preparing codebases for AI assistant analysis
Code review and documentation generation
Repository migration and refactoring planning
Security auditing and vulnerability scanning

Quick Start

Installation

Using Poetry (Recommended)

# Install Poetry if needed
curl -sSL https://install.python-poetry.org | python3 -

# Clone and install
git clone https://github.com/biostochastics/codeconcat.git
cd codeconcat
poetry install

# Run CodeConCat
poetry run codeconcat --help

Using pip

# Install from source
git clone https://github.com/biostochastics/codeconcat.git
cd codeconcat
pip install -e .

# Or install from PyPI (when published)
pip install codeconcat

Your First Command

# Process current directory with default settings
codeconcat run

# Generate markdown output
codeconcat run --format markdown --output my-code.md

Expected output: a structured file containing code analysis, function signatures, and documentation.

Common Examples

# Process with filtering
codeconcat run --include-language python javascript --exclude-path "*/tests/*"

# With AI summarization (requires API key)
codeconcat run --ai-summary --ai-provider anthropic --output analyzed-code.md

# Security scanning with compression
codeconcat run --security --semgrep --compress --output secure-report.json

⚠️ Security Note: CodeConCat processes code for AI model consumption. While built-in security features protect against common threats, always review output before sharing it with AI assistants or third parties. See the Security section for detailed information.

Next Steps:

See Usage Guide for detailed workflows
Check Configuration for customization options
Review Advanced Features for AI and compression capabilities

Core Features

Multi-Language Parsing - 25+ languages using tree-sitter and regex engines with intelligent result merging
AI Summarization (Optional) - Code summaries via OpenAI, Anthropic, OpenRouter, or local models (Ollama, llama.cpp)
Differential Outputs - Generate diffs between Git refs with AI-powered change summaries
Code Compression - Pattern recognition reduces token usage by 35-40% with configurable levels
Security Scanning - Integrated Semgrep support with configurable thresholds and comprehensive validation
Multiple Output Formats - Markdown, JSON, XML, and text with format-specific optimizations
Remote Repository Support - Analyze GitHub repositories directly via URL or shorthand notation
Parallel Processing - Configurable worker threads for optimal performance
File Reconstruction - Restore source files from output with path traversal protection
REST API - FastAPI-based server for programmatic access
Modern CLI - Typer-powered interface with shell completion and rich help
Smart Caching - TTL-based cache management for repeated operations

Language Support

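CodeConCat provides comprehensive parsing for 25+ programming languages. The table below lists, for each language, the parser engine used, the key constructs recognized, and the documentation format extracted: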

Language	Parser Type	Key Features	Documentation
Python	Tree-sitter + Enhanced Regex	Type hints, async/await, decorators, dataclasses	✓ Docstrings
JavaScript/TypeScript	Tree-sitter + Enhanced Regex	JSX/TSX, ES6+, decorators, modules	✓ JSDoc
Java	Tree-sitter + Enhanced Regex	Generics, annotations, lambdas, records	✓ Javadoc
C/C++	Tree-sitter + Enhanced Regex	Templates, modern C++ features, preprocessor	✓ Doxygen
C#	Tree-sitter + Enhanced Regex	Generics, attributes, async/await, LINQ	✓ XML docs
Go	Tree-sitter + Enhanced Regex	Interfaces, embedded types, generics	✓ GoDoc
Rust	Tree-sitter + Enhanced Regex	Traits, lifetimes, const generics, GATs	✓ Rustdoc
PHP	Tree-sitter + Enhanced Regex	Traits, attributes, typed properties	✓ PHPDoc
Julia	Tree-sitter + Enhanced Regex	Multiple dispatch, parametric types, macros	✓ Docstrings
R	Tree-sitter + Enhanced Regex	S3/S4 OOP, functions, tidyverse patterns	✓ Roxygen
Swift	Tree-sitter + Enhanced Regex	Property wrappers, actors, async/await	✓ SwiftDoc
Kotlin	Tree-sitter	Extension functions, suspend functions, sealed classes	✓ KDoc
Ruby	Tree-sitter	Modules, classes, blocks, metaprogramming, mixins	✓ RDoc
Solidity	Tree-sitter	Smart contracts, inheritance, modifiers, events, security patterns	✓ NatSpec
Crystal	Tree-sitter	Type annotations, union types, C bindings, macros	✓ Comments
Dart	Tree-sitter	Null safety, Flutter patterns, mixins	✓ Dartdoc
Elixir	Tree-sitter	GenServer, LiveView, pattern matching, pipe operators, macros	✓ @doc/@moduledoc
Zig	Tree-sitter	Comptime blocks, async/await (suspend/resume), error unions, inline for, struct methods	✓ /// comments
SQL	Tree-sitter	Multi-dialect (PostgreSQL, MySQL, SQLite), CTEs	✓ Comments
HCL/Terraform	Tree-sitter	Resources, modules, providers, variables	✓ Comments
GraphQL	Tree-sitter	Schema definitions, operations, directives	✓ Descriptions
GLSL	Tree-sitter	Vertex/fragment/compute shaders, uniforms, samplers, textures, in/out variables, layout qualifiers	✓ Comments
HLSL	Tree-sitter	Compute/vertex/pixel shaders, cbuffer/tbuffer, RWTextures, structured buffers, typedefs, semantics	✓ Comments
Bash/Shell	Tree-sitter	Functions, variables, source imports	✓ Comments
TOML	Enhanced Regex	Configuration parsing, nested tables	✓ Comments
WAT (WebAssembly Text)	Tree-sitter	Modules, functions, imports/exports, memory, types	✓ Comments

Crystal Support: CodeConCat parses Crystal using a dynamically compiled tree-sitter grammar from crystal-lang-tools/tree-sitter-crystal. The grammar is downloaded and compiled automatically on first use, and the cache directory is configurable via the CODECONCAT_CACHE_DIR environment variable. The parser extracts classes, modules, structs, methods, macros, type aliases, and C library bindings (lib blocks). It tracks Crystal-specific features including type annotations, union types, nilable types, and generic type parameters. The grammar build uses file locking to prevent race conditions, atomic file operations to prevent TOCTOU vulnerabilities, and automatic caching for improved performance.

WebAssembly Support: CodeConCat parses the WebAssembly Text (WAT) format using a dynamically compiled tree-sitter grammar from wasm-lsp/tree-sitter-wasm. The grammar is downloaded and compiled automatically on first use, and the cache directory is configurable via the CODECONCAT_CACHE_DIR environment variable. The parser extracts module structure, function signatures with parameter/result types, import/export statements, type definitions, and global/table declarations. The grammar build uses file locking to prevent race conditions, commit hash pinning for reproducible builds, and atomic file operations to prevent TOCTOU vulnerabilities.

Parser Architecture

CodeConCat uses an intelligent multi-tier parser system with result merging for maximum reliability:

Tree-sitter Parsers (Primary) - Full syntax tree parsing for accurate structure recognition
Enhanced Regex Parsers (Fallback) - Pattern-based parsing with state tracking for edge cases
Intelligent Merging (v0.8.4+) - Combines results from multiple parsers with confidence scoring
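
As a rough illustration of confidence-based merging, the sketch below combines declarations from two parsers and keeps the higher-confidence copy of each duplicate. The `Declaration` type, its fields, and `merge_by_confidence` are hypothetical names invented for this example; CodeConCat's actual merge logic and scoring may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Declaration:
    """Hypothetical parser result record (not CodeConCat's real data model)."""
    kind: str          # e.g. "function" or "class"
    name: str
    confidence: float  # 0.0-1.0, assigned by the parser that produced it


def merge_by_confidence(*parser_results):
    """Union declarations from several parsers, keeping the highest-confidence duplicate."""
    best = {}
    for result in parser_results:
        for decl in result:
            key = (decl.kind, decl.name)
            if key not in best or decl.confidence > best[key].confidence:
                best[key] = decl
    # Sort for stable, reproducible output.
    return sorted(best.values(), key=lambda d: (d.kind, d.name))
```

Declarations found by only one parser survive the merge, which is what makes a regex fallback useful when the tree-sitter pass misses something.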

Key Capabilities:

Automatic fallback between parser engines on errors
Unicode support with BOM removal and NFC normalization
Error recovery continues parsing even with syntax errors
Signature extraction for functions/methods with full parameter lists
Docstring extraction in language-specific formats
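
The Unicode handling mentioned above (BOM removal and NFC normalization) can be approximated with the Python standard library. This is a minimal sketch that assumes UTF-8 input; `normalize_source` is a hypothetical helper name, not a CodeConCat function.

```python
import unicodedata


def normalize_source(raw: bytes) -> str:
    """Decode UTF-8 source bytes, drop a leading BOM, and normalize to NFC."""
    # The "utf-8-sig" codec strips a UTF-8 byte order mark if one is present.
    text = raw.decode("utf-8-sig")
    # NFC composes sequences such as "e" + combining acute into a single code point.
    return unicodedata.normalize("NFC", text)
```

Normalizing before parsing means identifiers that look identical also compare equal, regardless of how the editor that saved the file encoded them.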

For detailed parser implementation history, architecture decisions, and refactoring documentation, see docs/PARSER_DETAILS.md.

Usage Guide

Basic Usage

Process Files and Directories

# Process current directory
codeconcat run

# Process specific directory or file
codeconcat run /path/to/project
codeconcat run src/main.py

# Process GitHub repository (NEW in v0.8.3)
codeconcat run https://github.com/owner/repo
codeconcat run owner/repo  # Shorthand notation
codeconcat run git@github.com:owner/repo.git  # SSH URL
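
The three remote target forms above can be told apart with a few checks. The sketch below is only an assumption about how such a resolver could work: `to_github_https_url` is a hypothetical helper, and the rule that an existing local path takes precedence over the owner/repo shorthand is a guess made for this example, not documented CodeConCat behavior.

```python
import os
import re

_SHORTHAND = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")
_SSH = re.compile(r"^git@github\.com:(?P<owner>[^/]+)/(?P<repo>.+?)(?:\.git)?$")


def to_github_https_url(target: str):
    """Return an HTTPS GitHub URL for a remote-looking target, or None for local paths."""
    if target.startswith("https://github.com/"):
        return target
    ssh = _SSH.match(target)
    if ssh:
        return f"https://github.com/{ssh['owner']}/{ssh['repo']}"
    # Assumption: "src/main.py" also looks like owner/repo, so prefer existing local paths.
    if os.path.exists(target):
        return None
    if _SHORTHAND.match(target):
        return f"https://github.com/{target}"
    return None
```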

Output Formats

# Markdown (default) - Best for human readability
codeconcat run --format markdown --output code.md

# JSON - Best for programmatic access
codeconcat run --format json --output code.json

# XML - Best for structured processing
codeconcat run --format xml --output code.xml

# Text - Best for terminal output
codeconcat run --format text --output code.txt

Filtering Options

# Include specific languages
codeconcat run --include-language python javascript rust

# Exclude paths (glob patterns)
codeconcat run --exclude-path "*/tests/*" "*/node_modules/*" "*/__pycache__/*"

# Include specific paths
codeconcat run --include-path "src/**/*.py" "lib/**/*.py"

# Combine filters
codeconcat run \
    --include-language python \
    --exclude-path "**/test_*.py" \
    --output filtered-code.md

For complete command reference and all available options, see the CLI Reference section.
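
To get a feel for how include and exclude globs interact, the sketch below applies the patterns from the examples above using Python's `fnmatch` module. It is an approximation: `is_selected` is a hypothetical helper, and CodeConCat's own matcher may treat `**` and path separators differently (with `fnmatch`, `*` also matches `/`).

```python
from fnmatch import fnmatchcase


def is_selected(path, include=(), exclude=()):
    """True if path matches an include pattern (or none are given) and no exclude pattern."""
    posix_path = path.replace("\\", "/")  # compare using forward slashes on every platform
    if any(fnmatchcase(posix_path, pattern) for pattern in exclude):
        return False  # excludes always win in this sketch
    if include:
        return any(fnmatchcase(posix_path, pattern) for pattern in include)
    return True


print(is_selected("src/app/main.py", include=["src/**/*.py"], exclude=["*/tests/*"]))
print(is_selected("src/tests/test_main.py", include=["src/**/*.py"], exclude=["*/tests/*"]))
```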

Advanced Workflows

Differential Analysis

Compare Git branches, tags, or commits to see only changed files:

# Compare branches
codeconcat run --diff-from main --diff-to feature-branch --output changes.md

# Compare commits with AI summaries
codeconcat run \
    --diff-from HEAD~10 \
    --diff-to HEAD \
    --ai-summary \
    --ai-provider anthropic \
    --output recent-changes.md

# Compare tags for release notes
codeconcat run --diff-from v1.0.0 --diff-to v2.0.0 --output release-diff.md

AI-Powered Summarization

Generate intelligent code summaries for better understanding:

# Enable AI summarization with default provider
codeconcat run --ai-summary --ai-provider openai

# Use specific model
codeconcat run \
    --ai-summary \
    --ai-provider anthropic \
    --ai-model claude-3-5-haiku-20241022

# Generate meta-overview of entire codebase
codeconcat run \
    --ai-summary \
    --ai-meta-overview \
    --ai-provider anthropic \
    --output full-analysis.md

# Save summaries for caching
codeconcat run \
    --ai-summary \
    --ai-save-summaries \
    --ai-provider openai \
    --output analyzed-code.md

See Advanced Features - AI Summarization for detailed AI configuration.

Code Compression

Reduce token usage while preserving structure:

# Enable compression with contextual mode
codeconcat run --compress --compression-level medium --output compressed.md

# Aggressive compression (signatures and essential code only)
codeconcat run --compress --compression-level aggressive --output minimal.json

Smart Contract Analysis (Solidity)

Analyze Ethereum/blockchain smart contracts with security pattern detection:

# Process Solidity contracts with security pattern flagging
codeconcat run contracts/ --include-language solidity --output contracts-analysis.md

# Combine with security scanning for comprehensive analysis
codeconcat run \
    --include-language solidity \
    --security \
    --output smart-contract-audit.md

# Process DeFi protocol with AI summarization
codeconcat run defi-protocol/ \
    --include-language solidity javascript \
    --ai-summary \
    --ai-provider anthropic \
    --output protocol-analysis.md

The Solidity parser automatically flags security-relevant patterns including:

selfdestruct and delegatecall usage
Assembly blocks requiring manual review
External calls (potential reentrancy points)
Inheritance hierarchies and modifier chains

Security Scanning

Scan code for security issues and sensitive data:

# Basic security scanning
codeconcat run --security --output secure-report.md

# With Semgrep integration
codeconcat run --security --semgrep --output detailed-security.json

# Set severity threshold
codeconcat run --security --security-threshold HIGH --output critical-only.md
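
A severity threshold amounts to an ordered comparison over the four documented levels. The sketch below shows the idea; the shape of a finding (a dict with a `severity` key) and the helper names are assumptions for illustration, not CodeConCat's report schema.

```python
# Documented severity levels, from least to most severe.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def meets_threshold(severity: str, threshold: str) -> bool:
    """True if severity is at or above threshold in SEVERITY_ORDER."""
    return SEVERITY_ORDER.index(severity.upper()) >= SEVERITY_ORDER.index(threshold.upper())


def filter_findings(findings, threshold):
    """Keep only findings whose severity meets the threshold (hypothetical finding shape)."""
    return [f for f in findings if meets_threshold(f["severity"], threshold)]
```

With a threshold of HIGH, only HIGH and CRITICAL findings remain, which matches the intent of the critical-only report in the last command above.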

Common Recipes

Production Build with Full Analysis

codeconcat run \
    --format json \
    --compress \
    --compression-level high \
    --security \
    --semgrep \
    --security-threshold HIGH \
    --test-security-report \
    --output production-audit.json

Quick Documentation Extraction

codeconcat run \
    --docs \
    --merge-docs \
    --remove-docstrings \
    --remove-comments \
    --preset lean \
    --output docs-only.md

GitHub Repository Analysis

# Public repository
codeconcat run owner/repo --ai-summary --output analysis.md

# Private repository (requires token)
codeconcat run \
    --source-url owner/private-repo \
    --github-token $GITHUB_TOKEN \
    --output private-analysis.md

Configuration

Configuration File

Create .codeconcat.yml in your project root for persistent settings:

version: "1.0"

# Output settings
output_preset: medium  # Options: lean, medium, full
format: markdown       # Options: markdown, json, xml, text

# Filtering
use_gitignore: true
use_default_excludes: true
include_languages:
  - python
  - javascript
  - rust
exclude_paths:
  - "*/tests/*"
  - "*/node_modules/*"
  - "*/__pycache__/*"

# Processing
parser_engine: tree_sitter  # Options: tree_sitter, regex
max_workers: 4              # Parallel processing threads (1-32)
enable_result_merging: true # Intelligent parser result merging
merge_strategy: confidence  # Options: confidence, union, fast_fail, best_of_breed

# Compression
enable_compression: false
compression_level: medium  # Options: low, medium, high/aggressive

# Security
enable_security_scanning: true
security_scan_severity_threshold: MEDIUM  # Options: LOW, MEDIUM, HIGH, CRITICAL

# AI Features (optional)
enable_ai_summary: false
ai_provider: anthropic           # Options: openai, anthropic, openrouter, ollama,
                                 #          local_server, vllm, lmstudio, llamacpp_server (llamacpp deprecated)
ai_model: ""                     # Optional, uses provider defaults
ai_meta_overview: false          # Generate project-wide overview
ai_save_summaries: false         # Save summaries to disk for caching
ai_min_file_lines: 20           # Skip small files

Initialize configuration interactively:

codeconcat init                    # Interactive setup
codeconcat init --preset medium    # Use specific preset
codeconcat validate .codeconcat.yml  # Validate existing config
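
The option values listed in the sample configuration can also be checked in a few lines of Python once the YAML has been loaded into a dict. This is a sketch under stated assumptions: `check_config` is a hypothetical helper that only knows the values shown in this README, and it does not reproduce whatever `codeconcat validate` checks internally.

```python
# Allowed values copied from the comments in the sample .codeconcat.yml above.
ALLOWED = {
    "output_preset": {"lean", "medium", "full"},
    "format": {"markdown", "json", "xml", "text"},
    "parser_engine": {"tree_sitter", "regex"},
    "merge_strategy": {"confidence", "union", "fast_fail", "best_of_breed"},
    "compression_level": {"low", "medium", "high", "aggressive"},
    "security_scan_severity_threshold": {"LOW", "MEDIUM", "HIGH", "CRITICAL"},
}


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key in config and config[key] not in allowed:
            problems.append(f"{key}: {config[key]!r} is not one of {sorted(allowed)}")
    workers = config.get("max_workers", 4)
    if not isinstance(workers, int) or not 1 <= workers <= 32:
        problems.append(f"max_workers: {workers!r} must be an integer from 1 to 32")
    return problems
```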

Configuration Presets

Preset	Use Case	Features
lean	Minimal output for quick reviews	Basic parsing, no metadata, compressed output
medium	Balanced analysis (default)	Standard parsing with context, moderate detail
full	Comprehensive documentation	Complete parsing, all metadata, full features

Environment Variables

# API Configuration
export GITHUB_TOKEN=your_token_here

# AI Provider Keys (optional, see AI Summarization section)
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export OPENROUTER_API_KEY=sk-or-...
export LOCAL_LLM_API_KEY=""       # Optional: generic OpenAI-compatible local servers
export VLLM_API_KEY=""            # Optional: vLLM preset
export LMSTUDIO_API_KEY=""        # Optional: LM Studio preset
export LLAMACPP_SERVER_API_KEY="" # Optional: llama.cpp server preset

# API Server Configuration
export CODECONCAT_HOST=127.0.0.1
export CODECONCAT_PORT=8000
export CODECONCAT_ALLOW_LOCAL_PATH=false  # Enable local paths in API (dev only)

# Environment Mode
export ENV=production  # Options: production, development, test

CLI Reference

Global Options

Option	Short	Description
`--version`	`-V`	Show version and exit
`--verbose`	`-v`	Increase verbosity (-v for INFO, -vv for DEBUG)
`--quiet`	`-q`	Suppress progress information
`--config`	`-c`	Path to configuration file
`--help`		Show help message

`codeconcat run`

Process files and generate AI-optimized output.

Usage: codeconcat run [OPTIONS] [TARGET]

Arguments:

TARGET - Path to process, GitHub URL, or owner/repo shorthand (default: current directory)

Processing Options

Option	Short	Description
`--format`	`-f`	Output format: `markdown`, `json`, `xml`, `text`
`--output`	`-o`	Output file path
`--preset`	`-p`	Configuration preset: `lean`, `medium`, `full`
`--parser-engine`		Parser engine: `tree_sitter`, `regex`
`--max-workers`		Parallel workers (1-32, default: 4)
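
For intuition about what a worker count controls, here is a minimal thread-pool sketch. It assumes a per-file `handler` callable supplied by the caller; `process_files` is a hypothetical name, and CodeConCat's real scheduling may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def process_files(paths, handler, max_workers=4):
    """Apply handler to every path using a bounded thread pool; results keep input order."""
    workers = max(1, min(32, max_workers))  # clamp to the documented 1-32 range
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handler, paths))
```

Threads help most when the per-file work is I/O-bound (reading files, calling external tools); raising the count beyond the number of files gives no further benefit.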

Filtering Options

Option	Short	Description
`--include-path`	`-ip`	Glob patterns to include (repeatable)
`--exclude-path`	`-ep`	Glob patterns to exclude (repeatable)
`--include-language`	`-il`	Languages to include
`--exclude-language`	`-el`	Languages to exclude

Feature Options

Option	Description
`--compress`	Enable code compression
`--compression-level`	Level: `low`, `medium`, `high`/`aggressive`
`--security` / `--no-security`	Enable security scanning (default: true)
`--semgrep`	Enable Semgrep security scanning
`--security-threshold`	Severity: `LOW`, `MEDIUM`, `HIGH`, `CRITICAL`
`--docs` / `--no-docs`	Extract standalone documentation files
`--remove-docstrings`	Strip docstrings from code
`--remove-comments`	Strip comments from code

Git Differential Options

Option	Description
`--diff-from`	Starting Git ref (branch, tag, commit)
`--diff-to`	Ending Git ref (branch, tag, commit)

AI Options

Option	Description
`--ai-summary`	Enable AI summarization
`--ai-provider`	Provider: `openai`, `anthropic`, `openrouter`, `ollama`, `local_server`, `vllm`, `lmstudio`, `llamacpp_server`, `llamacpp` (deprecated)
`--ai-model`	Specific model (uses provider defaults if omitted)
`--ai-meta-overview`	Generate project-wide meta-overview
`--ai-save-summaries`	Save summaries to disk for caching
`--ai-api-key`	API key (alternative to environment variable)
`--ai-api-base`	Override the API base URL for local servers

Remote Repository Options

Option	Description
`--source-url`	GitHub URL or owner/repo
`--github-token`	GitHub PAT for private repos

`codeconcat init`

Initialize configuration interactively.

Usage: codeconcat init [OPTIONS]

Option	Short	Description
`--output`	`-o`	Output path (default: .codeconcat.yml)
`--interactive` / `--no-interactive`		Use interactive wizard (default: true)
`--force`	`-f`	Overwrite existing configuration
`--preset`	`-p`	Use preset: `lean`, `medium`, `full`

`codeconcat config local-llm`

Interactive wizard for configuring local OpenAI-compatible servers (vLLM, LM Studio, llama.cpp server, or generic runtimes).

Usage: codeconcat config local-llm

Step	What happens
Preset selection	Choose vLLM, LM Studio, llama.cpp server, or generic
Auto-detect	Probes common ports and health endpoints
Model discovery	Lists models from `/v1/models` or `/models`
Config update	Writes `enable_ai_summary`, `ai_provider`, `ai_api_base`, `ai_model`, and `ai_api_key`

Refer to docs/LOCAL_MODELS.md for a detailed guide.

`codeconcat validate`

Validate a configuration file.

Usage: codeconcat validate CONFIG_FILE

`codeconcat reconstruct`

Reconstruct source files from CodeConCat output with security validation.

Usage: codeconcat reconstruct [OPTIONS] INPUT_FILE

Security: All reconstructed files are validated to prevent path traversal attacks.

Option	Short	Description
`--output-dir`	`-o`	Directory for files (default: ./reconstructed)
`--format`	`-f`	Input format (auto-detected if not specified)
`--force`		Overwrite existing files
`--dry-run`		Preview without creating files
`--verbose`	`-v`	Show detailed progress

Supported Formats: Markdown v2.0, XML v2.0, JSON v2.0

`codeconcat api`

Manage the CodeConCat API server.

Sub-commands:

`codeconcat api start`

Start the API server.

Option	Short	Description
`--host`	`-h`	Host to bind (default: 127.0.0.1)
`--port`	`-p`	Port to bind (default: 8000)
`--reload`		Enable auto-reload (development)
`--workers`	`-w`	Number of worker processes
`--log-level`	`-l`	Logging level

`codeconcat api info`

Display API information and endpoints.

`codeconcat diagnose`

Diagnostic and verification tools.

Sub-commands:

codeconcat diagnose verify - Verify tree-sitter dependencies
codeconcat diagnose parser LANGUAGE [-f FILE] - Test language parser
codeconcat diagnose system - Display system information
codeconcat diagnose languages - List supported languages

`codeconcat keys`

Manage API keys for AI providers with secure storage.

Sub-commands:

codeconcat keys setup - Interactive key setup with encryption
codeconcat keys list [--show-values] - List configured keys
codeconcat keys set PROVIDER [API_KEY] - Set or update key
codeconcat keys delete PROVIDER - Delete key
codeconcat keys reset [--force] - Reset all keys
codeconcat keys test PROVIDER - Test key validity
codeconcat keys export [-o FILE] [--show-keys] - Export configuration

Storage Methods: Encrypted file (default), system keyring, environment variables

Shell Completion

Enable tab completion for your shell:

# Bash
codeconcat --install-completion bash

# Zsh
codeconcat --install-completion zsh

# Fish
codeconcat --install-completion fish

Advanced Features

AI Summarization

Generate intelligent code summaries to enhance understanding and reduce context size.

Supported Providers

Provider	Default Model (Files)	Default Model (Meta)	Notes
OpenAI	gpt-5-mini-2025-08-07	gpt-5-2025-08-07	Fast with reasoning capabilities
Anthropic	claude-3-5-haiku-20241022	claude-sonnet-4-5-20250929	Fast with extended thinking
OpenRouter	qwen/qwen3-coder	z-ai/glm-4.6	Access to 100+ models
Ollama	llama3.2	llama3.2	Local, private, no API needed

Quick Start

# Enable AI summarization
codeconcat run --ai-summary --ai-provider openai

# Use specific model
codeconcat run --ai-summary --ai-provider anthropic --ai-model claude-3-haiku-20240307

# Local model with Ollama (privacy-focused)
ollama pull llama3.2  # First-time setup: download the model
codeconcat run --ai-summary --ai-provider ollama --ai-model llama3.2

AI Meta-Overview

Generate a comprehensive project-wide overview:

# Enable meta-overview with default prompt
codeconcat run --ai-summary --ai-meta-overview --ai-provider openai

# Custom prompt for focused analysis
codeconcat run --ai-summary --ai-meta-overview \
  --ai-meta-prompt "Focus on security architecture and data flow" \
  --ai-provider anthropic

# Save summaries to disk for caching
codeconcat run --ai-summary --ai-meta-overview --ai-save-summaries \
  --ai-provider anthropic \
  --output report.md

The meta-overview synthesizes all file summaries into:

High-level architectural understanding
Key components and relationships
Design patterns and technologies used
Onboarding insights for new developers

Summary Persistence

Save AI-generated summaries for caching and reuse:

codeconcat run --ai-summary --ai-save-summaries --ai-provider openai

# Summaries saved to {output_dir}/codeconcat_summaries/
# ├── individual/      # Per-file summaries (JSON)
# └── meta_overview.json  # Project-wide overview
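
Saved summaries can be read back with the standard library. The sketch below relies only on the directory layout shown above; the JSON structure inside each file is not documented here, so the contents are returned as parsed, and `load_summaries` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_summaries(summaries_dir):
    """Load per-file summaries and the optional meta-overview from a summaries directory."""
    root = Path(summaries_dir)
    individual = {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted((root / "individual").glob("*.json"))
    }
    meta_path = root / "meta_overview.json"
    # The meta-overview only exists when --ai-meta-overview was used.
    meta = json.loads(meta_path.read_text(encoding="utf-8")) if meta_path.exists() else None
    return individual, meta
```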

API Key Setup

Interactive Setup (Recommended)

codeconcat keys setup

# Prompts for:
# 1. Master password (for encryption)
# 2. Provider selection
# 3. API key entry

Environment Variables

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENROUTER_API_KEY="sk-or-..."
export LOCAL_LLM_API_KEY=""       # Optional
export VLLM_API_KEY=""            # Optional
export LMSTUDIO_API_KEY=""        # Optional
export LLAMACPP_SERVER_API_KEY="" # Optional
# Ollama doesn't require an API key

Local LLM Support

Run codeconcat config local-llm to configure vLLM, LM Studio, llama.cpp server, or any OpenAI-compatible runtime. The wizard probes common ports, auto-discovers models, and writes updated fields to .codeconcat.yml.
See docs/LOCAL_MODELS.md for a full integration guide and troubleshooting tips.

Ollama (Easiest)

# Start Ollama (install it first) and pull a model
ollama serve
ollama pull codellama  # or deepseek-coder, mistral, etc.

# Use with CodeConCat (auto-discovers best model)
codeconcat run --ai-summary --ai-provider ollama

# Or specify model explicitly
codeconcat run --ai-summary --ai-provider ollama --ai-model deepseek-coder

OpenAI-compatible servers (vLLM / LM Studio / llama.cpp server)

# Start your preferred server first (examples)
python -m vllm.entrypoints.openai.api_server --model path/to/model --port 8000
./llama-server -m models/llama-2-7b-chat.gguf -c 4096 -ngl 35
# In LM Studio: Start the "OpenAI Compatible Server" from the UI

# Run the configuration wizard
codeconcat config local-llm

# Then use CodeConCat with the saved settings
codeconcat run --ai-summary

⚠️ The legacy --ai-provider llamacpp integration is deprecated and will be removed in a future version. Run the llama.cpp HTTP server and use the wizard instead for better compatibility and auto-discovery.

Recommended Models:

Code-specific: DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder
General: Mistral/Mixtral, Llama 3, Phi-3

Configuration

# In .codeconcat.yml
enable_ai_summary: true
ai_provider: openai
ai_model: gpt-5-mini-2025-08-07  # Optional

# Meta-overview settings
ai_meta_overview: false
ai_meta_prompt: ""  # Custom prompt
ai_meta_overview_use_higher_tier: true  # Use premium models
ai_save_summaries: false
ai_summaries_dir: "codeconcat_summaries"

# Processing limits
ai_min_file_lines: 20
ai_summarize_functions: true
ai_max_functions_per_file: 10

# Performance
ai_max_concurrent: 5
ai_cache_enabled: true
ai_timeout: 600  # 10 minutes (default) for AI operations

Note: AI summaries are saved in the codeconcat_summaries/ directory adjacent to your output file. The default timeout of 10 minutes accommodates both cloud and local AI models for comprehensive analysis.

Cost Optimization

# Use free local models
codeconcat run --ai-summary --ai-provider ollama --ai-model llama3.2

# Use low-cost cloud models
codeconcat run --ai-summary --ai-provider openai --ai-model gpt-5-mini-2025-08-07

# Limit scope
codeconcat run --ai-summary --ai-min-file-lines 50

# Enable caching
codeconcat run --ai-summary --ai-cache-enabled --ai-cache-ttl 86400

Security Considerations

⚠️ Important:

API keys are encrypted using Fernet (AES-128) with PBKDF2 key derivation
Keys are stored in ~/.codeconcat/keys.enc with 0600 permissions
Never commit API keys to version control
With cloud providers, your code is sent to third-party services; local providers (Ollama, llama.cpp) keep it on your machine
Use local models for sensitive/proprietary code
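
To make the key-storage scheme above more concrete, the sketch below derives a Fernet-compatible key from a master password with PBKDF2 and writes a file with owner-only permissions, using only the standard library. The hash function, iteration count, and helper names are assumptions for illustration; CodeConCat's actual parameters are not stated in this README.

```python
import base64
import hashlib
import os


def derive_fernet_key(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 32-byte key with PBKDF2-HMAC-SHA256 and encode it in Fernet's key format."""
    raw = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations, dklen=32)
    # Fernet expects the 32 key bytes as URL-safe base64.
    return base64.urlsafe_b64encode(raw)


def write_private_file(path: str, data: bytes) -> None:
    """Create the file with owner-only (0600) permissions and write data to it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
```

The salt should be random (for example `os.urandom(16)`) and stored alongside the encrypted file, since the same salt is needed to re-derive the key from the password.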

Code Compression

Reduce token usage while preserving code structure:

# Contextual compression (keeps important code with context)
codeconcat run --compress --compression-level medium

# Essential compression (signatures, security, errors only)
codeconcat run --compress --compression-level aggressive

Compression Levels:

low: Remove obvious redundancy, keep most context
medium: Contextual compression, balance of reduction and readability
high/aggressive: Essential code only, maximum token reduction

Typical results: 35-40% token reduction on large codebases.
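
As an intuition for what "signatures only" output could look like for Python input, the sketch below uses the standard `ast` module to keep just class names and function signatures. It illustrates the idea only and is not CodeConCat's compression algorithm, which works across many languages; `signatures_only` is a hypothetical helper.

```python
import ast


def signatures_only(source: str) -> list:
    """Return 'class Name' and 'def name(args)' lines for every definition in source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            # ast.unparse renders the parameter list back to source text (Python 3.9+).
            found.append((node.lineno, f"{prefix} {node.name}({ast.unparse(node.args)})"))
        elif isinstance(node, ast.ClassDef):
            found.append((node.lineno, f"class {node.name}"))
    # Report definitions in source order.
    return [text for _, text in sorted(found)]
```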

Security Scanning

Built-in security features protect against common vulnerabilities:

# Basic pattern-based scanning
codeconcat run --security --output secure-report.md

# Advanced Semgrep integration
codeconcat run --security --semgrep --security-threshold HIGH

# Separate test file security report
codeconcat run --security --test-security-report --output analysis.json

Security Features:

Pattern-based credential detection
Path traversal protection (CWE-22)
XXE attack prevention
ReDoS protection
Command injection prevention
Zip Slip protection
File integrity verification (SHA-256)
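
File integrity verification with SHA-256 typically means hashing a file's bytes and comparing the digest with a recorded value. A minimal standard-library sketch follows; `sha256_of_file` is a hypothetical helper, and where CodeConCat records its digests is not covered here.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```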

See the Security section for detailed architecture and best practices.

File Reconstruction

Restore original source files from CodeConCat output:

# Reconstruct from output file
codeconcat reconstruct output.md

# Preview without creating files
codeconcat reconstruct output.json --dry-run

# Reconstruct to specific directory
codeconcat reconstruct output.xml -o ./restored

# Force overwrite existing files
codeconcat reconstruct output.md --force

Security Features:

Path traversal protection prevents ../../../etc/passwd attacks
All file writes validated against target directory boundary
Supports Markdown, XML, and JSON formats

Differential Outputs

Generate diffs between Git references:

# Compare branches
codeconcat run --diff-from main --diff-to feature-branch -o changes.md

# Compare commits with AI summaries
codeconcat run \
  --diff-from HEAD~10 \
  --diff-to HEAD \
  --ai-summary \
  --ai-provider anthropic \
  --output recent-changes.md

# Compare tags for release notes
codeconcat run --diff-from v1.0.0 --diff-to v2.0.0 -o release-diff.md

Features:

Works with branches, tags, and commit hashes
Shows file-level changes with unified diff format
AI summaries explain change impact
Supports all output formats

API Access

REST API

Start the API server:

# Default settings (localhost:8000)
codeconcat api start

# Production with workers
codeconcat api start --host 0.0.0.0 --port 8080 --workers 4

# Development with auto-reload
codeconcat api start --reload

API Endpoints:

Endpoint	Method	Description
`/api/concat`	POST	Process codebase from URL or local path
`/api/upload`	POST	Upload and process archive (zip/tar)
`/api/ping`	GET	Health check
`/api/config/presets`	GET	Available presets
`/api/config/formats`	GET	Supported formats
`/api/config/languages`	GET	Supported languages
`/api/config/defaults`	GET	Default configuration
`/docs`	GET	Interactive API docs (Swagger UI)
`/redoc`	GET	Alternative docs (ReDoc)

Example Usage:

import requests

# Process GitHub repository
response = requests.post(
    "http://localhost:8000/api/concat",
    json={
        "source_url": "https://github.com/owner/repo",
        "format": "markdown",
        "include_languages": ["python", "javascript"],
        "enable_compression": True,
        "compression_level": "medium"
    }
)

# Upload and process archive
with open("code.zip", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/upload",
        files={"file": f},
        data={
            "format": "json",
            "output_preset": "medium"
        }
    )

Python API

Use CodeConCat programmatically:

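# Drives the Typer CLI in-process via Typer's test runner
from codeconcat.cli import app
from typer.testing import CliRunner

runner = CliRunner()

# Equivalent to: codeconcat run /path/to/project --format json --compress --output output.json
result = runner.invoke(app, [
    "run",
    "/path/to/project",
    "--format", "json",
    "--compress",
    "--output", "output.json"
])

if result.exit_code == 0:
    print("Success!")
else:
    # Surface the CLI's own output to help diagnose the failure
    print(result.output)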

Development

Setup Development Environment

# Clone repository
git clone https://github.com/biostochastics/codeconcat.git
cd codeconcat

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies (including dev, test, docs)
poetry install --all-extras

# Install pre-commit hooks
poetry run pre-commit install

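# Activate virtual environment
# (Poetry 2.x moved `shell` into the poetry-plugin-shell plugin; `poetry env activate` is the built-in alternative)
poetry shell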

Code Quality & Testing

# Run all tests with coverage
poetry run pytest --cov=codeconcat --cov-report=term-missing

# Run specific test file
poetry run pytest tests/unit/test_cli.py -xvs

# Format code (auto-fix)
poetry run ruff format codeconcat tests
poetry run ruff check codeconcat tests --fix

# Lint code (check only)
poetry run ruff check codeconcat
poetry run mypy codeconcat

# Run all pre-commit hooks
poetry run pre-commit run --all-files

CI/CD Pipeline

GitHub Actions runs on every push:

Testing: Python 3.10, 3.11, 3.12
Code Quality: Ruff linting and formatting, mypy type checking
Coverage: Automatic coverage reporting
Pre-commit: All hooks enforced
Publishing: Automatic PyPI deployment on tagged releases

Building & Publishing

# Build package
poetry build

# Check package
poetry check

# Publish to PyPI (requires token)
poetry publish

# Build and publish together
poetry publish --build

Contributing

We welcome contributions! Please ensure:

All tests pass (pytest)
Code is formatted (ruff format)
Type hints are complete (mypy)
Pre-commit hooks pass
New features include tests and documentation

Security

Security Architecture

CodeConCat implements defense-in-depth security with multiple protection layers:

Path Traversal Prevention (CWE-22)

Canonical path resolution with os.path.realpath()
Strict boundary enforcement via os.path.commonpath()
Symlink escape detection and blocking
Null byte injection prevention
Cross-platform path normalization
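The canonicalize-then-compare pattern listed above can be sketched roughly as follows. This is a minimal illustration, not CodeConCat's actual code; the function name `is_within_root` and its signature are hypothetical.

```python
import os


def is_within_root(root: str, candidate: str) -> bool:
    """Return True if candidate resolves to a location inside root.

    Hypothetical helper for illustration only.
    """
    # Reject null bytes before the path reaches any filesystem call
    if "\x00" in candidate:
        return False
    # realpath() resolves symlinks and ".." segments to a canonical path
    real_root = os.path.realpath(root)
    real_candidate = os.path.realpath(os.path.join(real_root, candidate))
    try:
        # The candidate is inside root only if root is their common prefix
        return os.path.commonpath([real_root, real_candidate]) == real_root
    except ValueError:
        # Raised on Windows when the two paths are on different drives
        return False
```

Because both paths are canonicalized first, a symlink inside the root that points elsewhere is compared by its real target, so it fails the boundary check.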

Content Sanitization

Dangerous content replaced with [REDACTED] placeholders
Never exposes original secret values
API keys, tokens, and credentials detected and removed
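As a rough sketch of placeholder-based redaction, assuming two illustrative patterns (the pattern set and the `redact_secrets` name are hypothetical, and a real scanner would use a much broader rule set):

```python
import re

# Illustrative patterns only: a quoted value assigned to a credential-like
# name, and the general shape of an AWS access key ID
_ASSIGNMENT = re.compile(
    r"(?i)\b(api[_-]?key|secret|token|password)\b(\s*[:=]\s*)([\"'])(.+?)\3"
)
_AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact_secrets(text: str) -> str:
    """Replace detected secret values with a [REDACTED] placeholder."""

    def _mask(match: re.Match) -> str:
        name, operator, quote = match.group(1), match.group(2), match.group(3)
        # Keep the name and quoting so the output still reads as code
        return f"{name}{operator}{quote}[REDACTED]{quote}"

    text = _ASSIGNMENT.sub(_mask, text)
    return _AWS_KEY_ID.sub("[REDACTED]", text)
```

The original value is never copied into the replacement string, so the output cannot leak it.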

Supply Chain Security

Semgrep pinned to tested version (1.52.0)
Apiiro ruleset pinned to verified commit hash
Uses sys.executable -m pip to prevent PATH hijacking
Network timeouts prevent hanging attacks
Post-install version verification
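The pinned install and post-install check described above might look something like this. The helper names are hypothetical; only `sys.executable`, `pip install package==version`, and `importlib.metadata` are standard.

```python
import sys
from importlib import metadata


def pip_install_command(package: str, version: str) -> list[str]:
    """Build a pip command pinned to one exact version.

    sys.executable runs pip from the current interpreter instead of
    whichever `pip` happens to come first on PATH.
    """
    return [sys.executable, "-m", "pip", "install", f"{package}=={version}"]


def installed_version_matches(package: str, version: str) -> bool:
    """Post-install check: is exactly the pinned version installed?"""
    try:
        return metadata.version(package) == version
    except metadata.PackageNotFoundError:
        return False
```

The command list can be passed to `subprocess.run(cmd, check=True, timeout=...)`; the timeout value is left to the caller.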

Additional Protections

XXE attack prevention using defusedxml
ReDoS protection with regex timeout limits
Command injection prevention with sanitized inputs
Zip Slip protection for archive extraction
File integrity verification (SHA-256)
Rate limiting for API deployments
Thread-safe concurrent operations
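For the SHA-256 integrity check in particular, a minimal sketch could read the file in chunks and compare digests in constant time. The function names here are hypothetical and not taken from the CodeConCat source.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the expected one in constant time."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```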

Security Features

API Access Controls: Configurable path access and authentication
Thread-Safe Operations: Concurrent request handling with isolated configurations
Memory Management: File size limits (10MB max) and resource controls
Path Validation: Traversal protection and symlink handling
Secure Git Operations: All repository URLs and tokens sanitized

⚠️ Important Security Warnings

Prompt Injection Risks

This tool processes code for AI model consumption. Malicious code comments, docstrings, or file contents could potentially manipulate AI behavior when the output is used as LLM input.

Recommended Security Practices

Review Output - Always review generated output before sharing with AI models or third parties
Sanitize Sensitive Data - Use security scanning features (--security, --semgrep) to identify sensitive information
Validate AI Responses - Treat all AI-generated code suggestions as untrusted input requiring validation
Restrict Access - Use API access controls and authentication in production
Environment Isolation - Process untrusted codebases in isolated environments

Security Disclosure

To report security vulnerabilities, please open a GitHub issue with the security label or contact maintainers directly. We follow responsible disclosure practices and credit reporters.

Liability Disclaimer

NO WARRANTY: This software is provided "as is" without warranty of any kind, express or implied.

NO LIABILITY: In no event shall the authors be liable for any claim, damages, or other liability arising from the use of this software.

USE AT YOUR OWN RISK: Users assume all responsibility for security implications, validation of AI responses, protection of sensitive information, and compliance with applicable regulations.

Reference

Output Formats

Markdown Format

Table of contents with anchor links
Summary statistics and project overview
Collapsible sections for large content
Syntax highlighting
Security issue badges

JSON Format

Nested structure for traversal
Indexes by language, type, directory
Relationship mapping
Metadata for filtering

XML Format

Semantic tags for structure
Hierarchical navigation
Optional processing instructions (--xml-pi)
CDATA sections for content preservation

Text Format

Box drawing for visual hierarchy
80-character width for terminals
Visual file type indicators
Compact metadata display

Parser Architecture Details

For comprehensive documentation on parser implementation, historical refactoring, and architectural decisions, see:

docs/PARSER_DETAILS.md

Includes:

Multi-tier parser system design
Tree-sitter and regex integration
Result merging algorithms (v0.8.4+)
Phase-by-phase refactoring history (v0.9.4-dev)
Modern tree-sitter API migration details
Language-specific implementation notes

Recent Parser Improvements (v0.8.6+)

Recent releases hardened the parser system's security, improved its performance, and fixed several critical bugs:

Critical Bug Fixes:

Cross-Platform Compatibility: Removed POSIX-only signal timeout that caused AttributeError on Windows systems
Tree-sitter Version Compatibility: Fixed QueryCursor API compatibility across tree-sitter 0.23.x and 0.24.0+ versions
Query Performance: Optimized LRU cache eviction from O(n) to O(1) using deque.popleft() (128x improvement)
Error Search Performance: Reduced BFS complexity from O(n²) to O(n) for deeply nested AST structures
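To illustrate why `deque.popleft()` makes eviction O(1), here is a simplified bounded cache. It is a sketch under stated assumptions: the class and method names are hypothetical, and it evicts the oldest-inserted entry. A strict LRU would also reorder entries on access, for example with `collections.OrderedDict.move_to_end()`.

```python
from collections import deque
from typing import Callable, Deque, Dict


class BoundedQueryCache:
    """Size-limited cache that evicts the oldest-inserted entry (hypothetical)."""

    def __init__(self, max_size: int = 128) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._values: Dict[str, object] = {}
        self._order: Deque[str] = deque()  # keys in insertion order

    def get_or_compile(self, key: str, compile_fn: Callable[[str], object]) -> object:
        if key in self._values:
            return self._values[key]
        if len(self._values) >= self._max_size:
            # popleft() is O(1); list.pop(0) would shift every element, O(n)
            oldest = self._order.popleft()
            del self._values[oldest]
        value = compile_fn(key)
        self._values[key] = value
        self._order.append(key)
        return value

    def __contains__(self, key: str) -> bool:
        return key in self._values

    def __len__(self) -> int:
        return len(self._values)
```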

Security Enhancements:

ReDoS Protection: All regex patterns now include timeout limits and input sanitization
Path Traversal Prevention: Robust validation prevents directory traversal attacks
Memory Exhaustion Protection: Content size limits (10MB) prevent resource exhaustion attacks
Input Sanitization: Comprehensive sanitization of all user inputs

Performance Optimizations:

LRU Cache Implementation: Query caching with configurable size limits prevents memory leaks
Memory Management: Automatic cache eviction using collections.deque for O(1) operations
Content Size Validation: Two-tier approach (character count then UTF-8 encoding) for efficiency
C++ Parser: Direct lookup for access specifiers and modifiers (reduced tree traversals)
Rust Parser: Dict-based deduplication for O(1) lookup performance
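The two-tier size check can be sketched as below. It relies on the fact that UTF-8 uses between one and four bytes per character, so the character count alone settles most cases. The constant and function name are hypothetical, reading 10MB as 10 * 1024 * 1024 bytes is an assumption, and the early-accept shortcut is an addition not described above.

```python
# Assumption: "10MB" is interpreted as 10 * 1024 * 1024 bytes
MAX_CONTENT_BYTES = 10 * 1024 * 1024


def content_within_limit(text: str, max_bytes: int = MAX_CONTENT_BYTES) -> bool:
    """Check the UTF-8 size of text, encoding only when the answer is unclear."""
    char_count = len(text)
    # Tier 1: every character needs at least one byte, so this is too large
    if char_count > max_bytes:
        return False
    # Shortcut: no character needs more than four bytes, so this must fit
    if char_count * 4 <= max_bytes:
        return True
    # Tier 2: borderline input, so pay for the exact encoded length
    return len(text.encode("utf-8")) <= max_bytes
```

Only inputs between a quarter of the limit and the full limit (measured in characters) reach the encoding step.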

Error Handling Improvements:

Standardized Error Reporting: Unified error handling across all parsers with consistent error flags
Partial Parse Recovery: Graceful handling of syntax errors with continued parsing
Quality Indicators: Parser quality metrics for result reliability assessment

Type System Standardization:

Unified Type Mapping: Consistent type classification across 25+ programming languages
Declaration Type Hierarchy: Standardized type relationships (function, class, module, etc.)
Language-Agnostic Interface: Consistent API regardless of source language

Documentation Extraction:

Unified Docstring Parsing: Consistent extraction across all documentation formats
Multi-Language Support: Handles Python docstrings, JSDoc, Javadoc, Rustdoc, and similar formats
Smart Context Detection: Automatic detection of docstring location (preceding, internal, following)

Enhanced Parser Implementations:

Modern Tree-sitter API: Full compatibility with tree-sitter 0.23.x and 0.24.0+ (handles both dict and tuple capture formats)
Improved Error Recovery: Better handling of malformed code and syntax errors
Optimized Query Processing: Efficient AST traversal with memory-safe operations
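Handling both capture formats could be done by normalizing them to a single shape before further processing. The sketch below assumes the dict form maps capture names to lists of nodes and the tuple form is a sequence of (node, name) pairs; `normalize_captures` is a hypothetical name, and the code does not import tree-sitter itself.

```python
from typing import Any, Dict, List, Sequence, Tuple, Union

# Either {capture_name: [nodes]} or [(node, capture_name), ...]
Captures = Union[Dict[str, List[Any]], Sequence[Tuple[Any, str]]]


def normalize_captures(captures: Captures) -> Dict[str, List[Any]]:
    """Return captures as {capture_name: [nodes]} whatever the input shape."""
    if isinstance(captures, dict):
        # Dict form: copy the lists so callers cannot mutate the original
        return {name: list(nodes) for name, nodes in captures.items()}
    grouped: Dict[str, List[Any]] = {}
    for node, name in captures:
        # Tuple form: group nodes under their capture name, keeping order
        grouped.setdefault(name, []).append(node)
    return grouped
```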

For detailed technical documentation of all fixes, see PARSER_FIXES_SUMMARY.md.

Version History

See CHANGELOG.md for complete version history and release notes.

Current Version: 0.8.6

Troubleshooting

Tree-sitter Grammar Issues

codeconcat diagnose verify  # Verify all grammars
codeconcat diagnose parser python  # Test specific parser

Performance Issues

# Increase workers for large codebases
codeconcat run --max-workers 8

# Enable compression to reduce output size
codeconcat run --compress --compression-level medium

API Connection Issues

# Check API server status
codeconcat api info

# Start with debug logging
codeconcat api start --log-level debug

AI Summary Failures

# Test API key validity
codeconcat keys test openai

# Use local model as fallback
codeconcat run --ai-summary --ai-provider ollama

Acknowledgments

CodeConCat is built with these open-source tools:

Typer - CLI framework
Rich - Terminal formatting
Poetry - Dependency management
Tree-sitter - Incremental parsing
FastAPI - Web API framework
Pydantic - Data validation
GitPython - Git integration
Semgrep - Security scanning

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

GitHub Issues: https://github.com/biostochastics/codeconcat/issues

Part of the Biostochastics collection of tools for translational science and biomarker discovery
