A high-performance, production-ready Python documentation scraper with async architecture, FastAPI REST API, JavaScript rendering, and comprehensive export formats. Built for enterprise-scale documentation archival and processing.
## Features

**Performance**

- Async scraping: 2.5 pages/sec (vs. 0.5 pages/sec synchronous)
- Connection pooling: Reusable HTTP connections with DNS caching
- Priority queue: Intelligent task scheduling and resource management
- Rate limiting: Non-blocking token bucket algorithm with backoff
- Worker pool: Concurrent processing with semaphore-based control

**REST API**

- Async job management: Create, monitor, and cancel scraping jobs
- Real-time progress: WebSocket streaming for live updates
- Multiple export formats: PDF, EPUB, HTML, JSON
- Authentication: Token-based API security
- System monitoring: Health checks, metrics, and diagnostics

**JavaScript Rendering**

- Hybrid rendering: Automatic detection of static vs. dynamic content
- Playwright integration: Full JavaScript execution with browser pool
- SPA detection: React, Vue, Angular, and Ember framework support
- Resource optimization: Intelligent browser lifecycle management

**Export Formats**

- Markdown: Clean, consolidated documentation
- PDF: Professional documents via WeasyPrint
- EPUB: E-book format for offline reading
- HTML: Standalone HTML with embedded styles
- JSON: Structured data for programmatic access

**Security**

- SSRF prevention: URL validation and private IP blocking
- robots.txt compliance: Automatic crawl delay and permission checks
- Content sanitization: XSS protection and safe HTML handling
- Rate limiting: Configurable request throttling per domain

**Deployment & Operations**

- Docker: Multi-stage builds for optimized images
- Kubernetes: Complete deployment manifests with autoscaling
- CI/CD: GitHub Actions with automated testing and security scans
- Monitoring: Prometheus metrics and alerting rules
## Table of Contents

- Quick Start
- Installation
- Usage
- Features
- Configuration
- Deployment
- Development
- Documentation
- Architecture
## Quick Start

```bash
# Install
pip install git+https://github.com/thepingdoctor/scrape-api-docs.git

# Scrape with async (5-10x faster)
scrape-docs https://docs.example.com

# Launch web UI
scrape-docs-ui
```

```bash
# Using Docker
docker-compose up -d

# API available at http://localhost:8000
curl -X POST "http://localhost:8000/api/v1/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "output_format": "pdf"}'
```

From Python:

```python
import asyncio
from scrape_api_docs import AsyncDocumentationScraper

async def main():
    scraper = AsyncDocumentationScraper(max_workers=10)
    result = await scraper.scrape_site('https://docs.example.com')
    print(f"Scraped {result.total_pages} pages at {result.throughput:.2f} pages/sec")

asyncio.run(main())
```

## Installation

Prerequisites:

- Python 3.11 or higher
- Poetry (recommended) or pip
With Poetry:

```bash
git clone https://github.com/thepingdoctor/scrape-api-docs
cd scrape-api-docs
poetry install

# For all export formats (PDF, EPUB)
poetry install --extras all-formats

# Activate virtual environment
poetry shell
```

With pip:

```bash
pip install git+https://github.com/thepingdoctor/scrape-api-docs.git

# With all export formats
pip install "git+https://github.com/thepingdoctor/scrape-api-docs.git#egg=scrape-api-docs[all-formats]"
```

With Docker:

```bash
# Basic scraper
docker pull ghcr.io/thepingdoctor/scrape-api-docs:latest

# API server
docker-compose -f docker-compose.api.yml up -d
```

## Usage

Launch the interactive web interface:
```bash
scrape-docs-ui
```

Features:
- 📝 URL input with real-time validation
- ⚙️ Advanced configuration (timeout, max pages, output format)
- 📊 Real-time progress tracking with visual feedback
- 📄 Results preview and downloadable output
- 🎨 Modern, user-friendly interface
For a detailed UI guide, see `STREAMLIT_UI_GUIDE.md`.
Start the API server:
```bash
# Development
uvicorn scrape_api_docs.api.main:app --reload

# Production with Docker
docker-compose -f docker-compose.api.yml up -d

# Using make
make docker-api
```

Scraping Operations:
```bash
# Create async scraping job
curl -X POST "http://localhost:8000/api/v1/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "output_format": "markdown",
    "max_pages": 100
  }'

# Get job status
curl "http://localhost:8000/api/v1/jobs/{job_id}"

# WebSocket progress streaming
wscat -c "ws://localhost:8000/api/v1/jobs/{job_id}/stream"
```
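The same stream can be consumed from Python instead of wscat. A minimal sketch using the third-party `websockets` package; the job ID is the placeholder from the examples above, and messages are printed as received rather than assuming a particular payload schema:

```python
import asyncio
import websockets  # pip install websockets

async def watch_job(job_id: str) -> None:
    uri = f"ws://localhost:8000/api/v1/jobs/{job_id}/stream"
    async with websockets.connect(uri) as ws:
        # Print each progress message until the server closes the stream.
        async for message in ws:
            print(message)

asyncio.run(watch_job("abc123"))
```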
Export Formats:

```bash
# Export to PDF
curl -X POST "http://localhost:8000/api/v1/exports/pdf" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "abc123"}'

# Export to EPUB
curl -X POST "http://localhost:8000/api/v1/exports/epub" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "abc123", "title": "API Documentation"}'
```

System Endpoints:
```bash
# Health check
curl "http://localhost:8000/api/v1/system/health"

# Metrics
curl "http://localhost:8000/api/v1/system/metrics"
```

Full API documentation: http://localhost:8000/docs

Command-line usage:

```bash
# Basic usage
scrape-docs https://docs.example.com
# With options
scrape-docs https://docs.example.com \
  --output my-docs.md \
  --max-pages 50 \
  --timeout 30
# Enable JavaScript rendering
scrape-docs https://spa-app.example.com \
  --enable-js \
  --browser-pool-size 3
# Export to PDF
scrape-docs https://docs.example.com \
  --format pdf \
  --output docs.pdf
```

Python API:

```python
import asyncio
from scrape_api_docs import AsyncDocumentationScraper
async def main():
    # Initialize with custom settings
    scraper = AsyncDocumentationScraper(
        max_workers=10,
        rate_limit=10.0,  # requests per second
        timeout=30,
        enable_js=True
    )
    # Scrape site
    result = await scraper.scrape_site(
        'https://docs.example.com',
        output_file='output.md',
        max_pages=100
    )
    # Results
    print(f"Pages scraped: {result.total_pages}")
    print(f"Throughput: {result.throughput:.2f} pages/sec")
    print(f"Errors: {len(result.errors)}")
    print(f"Duration: {result.duration:.2f}s")
asyncio.run(main())
```
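The `rate_limit` parameter above corresponds to the non-blocking token bucket mentioned in the feature list. The snippet below is an illustrative sketch of that pattern, not the library's internal `AsyncRateLimiter`:

```python
import asyncio
import time

class TokenBucket:
    """Allow up to `rate` acquisitions per second without blocking the event loop."""

    def __init__(self, rate: float, capacity: float | None = None):
        self.rate = rate
        self.capacity = capacity or rate
        self.tokens = self.capacity
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill proportionally to elapsed time, capped at capacity.
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep (not busy-wait) until roughly one token is available.
                await asyncio.sleep((1 - self.tokens) / self.rate)
```

Each worker would `await bucket.acquire()` before issuing a request, so bursts drain the bucket and sustained load settles at roughly `rate` requests per second.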
Simple synchronous API:

```python
from scrape_api_docs import scrape_site

# Simple usage
scrape_site('https://docs.example.com')
# With options
scrape_site(
    'https://docs.example.com',
    output_file='custom-output.md',
    max_pages=50,
    timeout=30
)
```

JavaScript rendering for single-page apps:

```python
import asyncio
from scrape_api_docs import AsyncDocumentationScraper
async def scrape_spa():
    scraper = AsyncDocumentationScraper(
        enable_js=True,
        browser_pool_size=3,
        browser_timeout=30000
    )
    result = await scraper.scrape_site('https://react-docs.example.com')
    print(f"Scraped SPA: {result.total_pages} pages")
asyncio.run(scrape_spa())
```
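Whether a page takes the Playwright path or the static BeautifulSoup path comes down to detecting client-side rendering. The heuristic below is an illustrative sketch (not the project's actual SPA detector): it looks for framework fingerprints and near-empty server-rendered markup.

```python
from bs4 import BeautifulSoup

FRAMEWORK_HINTS = ("react", "vue", "angular", "ember")

def looks_like_spa(html: str) -> bool:
    soup = BeautifulSoup(html, "html.parser")

    # Framework fingerprints in script URLs or attributes.
    for script in soup.find_all("script", src=True):
        if any(hint in script["src"].lower() for hint in FRAMEWORK_HINTS):
            return True
    if soup.find(attrs={"ng-version": True}) or soup.find(id=["root", "app"]):
        return True

    # Very little visible text usually means content is rendered client-side.
    return len(soup.get_text(strip=True)) < 200
```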
Export formats:

```python
from scrape_api_docs.exporters import (
    PDFExporter,
    EPUBExporter,
    HTMLExporter,
    ExportOrchestrator
)
# Export to PDF
pdf_exporter = PDFExporter()
pdf_exporter.export('output.md', 'output.pdf', metadata={
    'title': 'API Documentation',
    'author': 'Your Name'
})
# Export to EPUB
epub_exporter = EPUBExporter()
epub_exporter.export('output.md', 'output.epub', metadata={
    'title': 'API Documentation',
    'language': 'en'
})
# Multi-format export
orchestrator = ExportOrchestrator()
orchestrator.export_multiple('output.md', ['pdf', 'epub', 'html'])
```

## Configuration

Environment variables:

```bash
# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
API_WORKERS=4
# Scraper Settings
MAX_WORKERS=10
RATE_LIMIT=10.0
REQUEST_TIMEOUT=30
MAX_PAGES=1000
# JavaScript Rendering
ENABLE_JS=false
BROWSER_POOL_SIZE=3
BROWSER_TIMEOUT=30000
# Security
ENABLE_ROBOTS_TXT=true
BLOCK_PRIVATE_IPS=true
```

YAML configuration file:

```yaml
# config/default.yaml
scraper:
  max_workers: 10
  rate_limit: 10.0
  timeout: 30
  user_agent: "DocumentationScraper/2.0"
javascript:
  enabled: false
  pool_size: 3
  timeout: 30000
security:
  robots_txt: true
  block_private_ips: true
  max_content_size: 10485760  # 10MB
export:
  default_format: markdown
  pdf_options:
    page_size: A4
    margin: 20mm
```
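To drive the Python API from a file like this, the values can be loaded with PyYAML and passed to the constructor parameters shown in the usage examples above. This is a sketch assuming those parameter names; the project may ship its own configuration loader.

```python
import asyncio
import yaml  # pip install pyyaml

from scrape_api_docs import AsyncDocumentationScraper

with open("config/default.yaml") as fh:
    config = yaml.safe_load(fh)

scraper = AsyncDocumentationScraper(
    max_workers=config["scraper"]["max_workers"],
    rate_limit=config["scraper"]["rate_limit"],
    timeout=config["scraper"]["timeout"],
    enable_js=config["javascript"]["enabled"],
)

asyncio.run(scraper.scrape_site("https://docs.example.com"))
```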
## Deployment

Docker:

```bash
# Build image
docker build -t scrape-api-docs .
# Run scraper
docker run -v $(pwd)/output:/output scrape-api-docs \
  https://docs.example.com
# Run API server
docker-compose -f docker-compose.api.yml up -d
```

Kubernetes:

```bash
# Deploy to Kubernetes
kubectl apply -f k8s/namespace.yml
kubectl apply -f k8s/secrets.yml
kubectl apply -f k8s/deployment.yml
kubectl apply -f k8s/ingress.yml
# Scale workers
kubectl scale deployment scraper-worker --replicas=5 -n scraper
# Using make
make k8s-deploy
```

Docker Compose:

```yaml
# docker-compose.yml
version: '3.8'
services:
  api:
    image: scrape-api-docs:latest
    ports:
      - "8000:8000"
    environment:
      - MAX_WORKERS=10
      - ENABLE_JS=true
    volumes:
      - ./output:/output
```

## Development

```bash
# Clone repository
git clone https://github.com/thepingdoctor/scrape-api-docs
cd scrape-api-docs
# Install dependencies
poetry install --with dev
# Activate virtual environment
poetry shell
```

Running tests:

```bash
# Run all tests
make test
# Run specific test suite
pytest tests/unit/
pytest tests/integration/
pytest tests/e2e/
# Run with coverage
make test-coverage
```

Code quality:

```bash
# Format code
make format
# Lint code
make lint
# Type checking
make typecheck
# Security scan
make security-scan
```

Pre-commit hooks:

```bash
# Install pre-commit hooks
pre-commit install
# Run manually
pre-commit run --all-files
```

## Documentation

- API Implementation Summary
- Async Quick Start
- JavaScript Rendering Guide
- Export Formats Usage
- Deployment Guide
- System Overview
- FastAPI Architecture
- Async Refactor Plan
- JavaScript Rendering
- Export Formats
- Deployment Architecture

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                   Client Layer                          │
│  (CLI, Web UI, REST API, Python SDK)                   │
└─────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────┐
│              Scraping Engine (Async)                    │
│  • AsyncHTTPClient (Connection Pooling)                 │
│  • AsyncWorkerPool (Concurrency Control)                │
│  • AsyncRateLimiter (Token Bucket)                      │
│  • Priority Queue (BFS Scheduling)                      │
└─────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────┐
│           Rendering Layer (Hybrid)                      │
│  • Static HTML Parser (BeautifulSoup)                   │
│  • JavaScript Renderer (Playwright)                     │
│  • SPA Detector (Framework Detection)                   │
└─────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────┐
│           Export Layer (Multi-format)                   │
│  • Markdown, PDF, EPUB, HTML, JSON                      │
│  • Template Engine (Jinja2)                             │
│  • Export Orchestrator                                  │
└─────────────────────────────────────────────────────────┘
```

Technology stack:

- Core: Python 3.11+, asyncio, aiohttp (see the connection-pooling sketch after this list)
- API: FastAPI, Pydantic, uvicorn
- Rendering: BeautifulSoup4, Playwright, markdownify
- Export: WeasyPrint (PDF), EbookLib (EPUB), Jinja2
- Storage: SQLite (jobs), filesystem (output)
- Deployment: Docker, Kubernetes, GitHub Actions
- Monitoring: Prometheus, structured logging
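The "reusable HTTP connections with DNS caching" behavior in the feature list maps onto aiohttp's pooled connector. A minimal sketch of that setup, with illustrative limits rather than the scraper's actual defaults:

```python
import asyncio
import aiohttp

async def fetch_all(urls: list[str]) -> list[str]:
    # One pooled connector: up to 100 sockets total, DNS answers cached for 5 minutes.
    connector = aiohttp.TCPConnector(limit=100, ttl_dns_cache=300)
    async with aiohttp.ClientSession(connector=connector) as session:

        async def fetch(url: str) -> str:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                return await resp.text()

        # All requests share the same pooled connections.
        return await asyncio.gather(*(fetch(u) for u in urls))
```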
Performance comparison:

| Metric | Sync Scraper | Async Scraper | Improvement |
|---|---|---|---|
| Throughput | 0.5 pages/sec | 2.5 pages/sec | 5x | 
| 100-page site | 200 seconds | 40 seconds | 5x faster | 
| Memory usage | ~100 MB | ~150 MB | Acceptable | 
| CPU usage | 15% | 45% | Efficient | 
Security measures:

- SSRF Prevention: Private IP blocking, URL validation (see the sketch after this list)
- robots.txt Compliance: Automatic crawl delay and permission checks
- Rate Limiting: Token bucket algorithm with per-domain limits
- Content Sanitization: XSS protection, safe HTML handling
- Input Validation: Pydantic models, URL whitelisting
- Authentication: Token-based API security (JWT)
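Two of these controls can be illustrated with the standard library alone. The sketch below is not the project's actual validator; it rejects URLs that resolve to private, loopback, or link-local addresses and consults robots.txt before fetching (the user agent string comes from the sample configuration above):

```python
import ipaddress
import socket
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_safe_url(url: str) -> bool:
    """Reject non-HTTP schemes and hosts that resolve to private or local ranges."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    for info in socket.getaddrinfo(parsed.hostname, None):
        ip = ipaddress.ip_address(info[4][0].split("%")[0])  # strip IPv6 scope id
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return True

def allowed_by_robots(url: str, user_agent: str = "DocumentationScraper/2.0") -> bool:
    """Check the site's robots.txt before fetching a page."""
    parsed = urlparse(url)
    robots = RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)
```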
## Examples

See the `examples/` directory for:
- Integration examples
- Authentication managers
- Caching strategies
- Rate limiting configurations
- Custom export pipelines
## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Disclaimer

This tool is designed for legitimate purposes such as documentation archival for personal or internal team use. Users are responsible for:
- Ensuring they have the right to scrape any website
- Complying with the website's terms of service and robots.txt
- Respecting rate limits and server resources
The author is not responsible for any misuse of this tool. This software is provided "as is" without warranty of any kind.
## Acknowledgments

- Built with FastAPI, Playwright, and BeautifulSoup
- Inspired by documentation tools like Docusaurus and MkDocs
- Performance optimizations based on async best practices
## Support

- Issues: https://github.com/thepingdoctor/scrape-api-docs/issues
- Documentation: https://github.com/thepingdoctor/scrape-api-docs/tree/main/docs
- Discussions: https://github.com/thepingdoctor/scrape-api-docs/discussions
Made with ❤️ for the developer community