Documentation Scraper

A high-performance, production-ready Python documentation scraper with async architecture, FastAPI REST API, JavaScript rendering, and comprehensive export formats. Built for enterprise-scale documentation archival and processing.

🚀 Key Features

⚡ High-Performance Async Architecture (5-10x Faster)

  • Async scraping: 2.5 pages/sec (vs 0.5 sync)
  • Connection pooling: Reusable HTTP connections with DNS caching
  • Priority queue: Intelligent task scheduling and resource management
  • Rate limiting: Non-blocking token bucket algorithm with backoff
  • Worker pool: Concurrent processing with semaphore-based control (the rate limiter and worker pool are sketched after this list)
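
A minimal sketch of the two pieces above that do the heavy lifting, a non-blocking token bucket and a semaphore-bounded worker, is shown below. It is illustrative only; the class and parameter names are assumptions, not the package's actual AsyncRateLimiter or worker pool API.

import asyncio
import time

class TokenBucketLimiter:
    """Illustrative non-blocking token bucket: refills at `rate` tokens per second."""
    def __init__(self, rate: float, capacity: int = 10):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at the bucket capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Yield to the event loop until roughly one token is available.
            await asyncio.sleep((1 - self.tokens) / self.rate)

async def fetch_worker(queue: asyncio.Queue, limiter: TokenBucketLimiter,
                       semaphore: asyncio.Semaphore) -> None:
    while True:
        url = await queue.get()
        async with semaphore:        # cap concurrent fetches
            await limiter.acquire()  # respect the requests-per-second budget
            ...                      # the real HTTP request would go here
        queue.task_done()

# Wiring (illustrative): start N workers sharing one queue, limiter, and semaphore, e.g.
#   asyncio.gather(*(fetch_worker(queue, limiter, sem) for _ in range(10)))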

🔌 FastAPI REST API (23+ Endpoints)

  • Async job management: Create, monitor, and cancel scraping jobs
  • Real-time progress: WebSocket streaming for live updates
  • Multiple export formats: PDF, EPUB, HTML, JSON
  • Authentication: Token-based API security
  • System monitoring: Health checks, metrics, and diagnostics

🎨 JavaScript Rendering & SPA Support

  • Hybrid rendering: Automatic detection of static vs dynamic content (see the sketch after this list)
  • Playwright integration: Full JavaScript execution with browser pool
  • SPA detection: React, Vue, Angular, Ember framework support
  • Resource optimization: Intelligent browser lifecycle management
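
The hybrid rendering decision can be sketched roughly as: parse the static HTML first and only hand the page to a browser when it looks like a JavaScript-driven shell. The marker list and threshold below are illustrative assumptions, not the scraper's actual detection rules.

from bs4 import BeautifulSoup

# Root-element markers commonly left in the HTML shell by SPA frameworks
# (React, Vue, Angular, Ember); the list and the 200-character threshold are assumed.
SPA_MARKERS = ('data-reactroot', 'ng-version', 'id="__next"', 'id="app"', 'class="ember-view"')

def needs_js_rendering(html: str) -> bool:
    """Heuristic: an SPA marker plus a near-empty visible body suggests client-side rendering."""
    visible_text = BeautifulSoup(html, 'html.parser').get_text(strip=True)
    return any(marker in html for marker in SPA_MARKERS) and len(visible_text) < 200

Pages flagged this way would be handed to a Playwright renderer from the browser pool; everything else stays on the fast static path.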

📦 Export Formats

  • Markdown: Clean, consolidated documentation
  • PDF: Professional documents via WeasyPrint
  • EPUB: E-book format for offline reading
  • HTML: Standalone HTML with embedded styles
  • JSON: Structured data for programmatic access

🔒 Security & Compliance

  • SSRF prevention: URL validation and private IP blocking
  • robots.txt compliance: Automatic crawl delay and permission checks
  • Content sanitization: XSS protection and safe HTML handling
  • Rate limiting: Configurable request throttling per domain

🐳 Production Deployment

  • Docker: Multi-stage builds for optimized images
  • Kubernetes: Complete deployment manifests with autoscaling
  • CI/CD: GitHub Actions with automated testing and security scans
  • Monitoring: Prometheus metrics and alerting rules

📋 Table of Contents

  • 🎯 Quick Start
  • 📦 Installation
  • 🎮 Usage
  • 🔧 Configuration
  • 🐳 Deployment
  • 🛠️ Development
  • 📚 Documentation
  • 🏗️ Architecture
  • 📊 Performance Benchmarks
  • 🔐 Security Features
  • 📝 Examples
  • 🤝 Contributing
  • 📄 License
  • ⚠️ Disclaimer
  • 🙏 Acknowledgments
  • 📞 Support

🎯 Quick Start

Basic Scraping

# Install
pip install git+https://github.com/thepingdoctor/scrape-api-docs.git

# Scrape with async (5-10x faster)
scrape-docs https://docs.example.com

# Launch web UI
scrape-docs-ui

REST API

# Using Docker
docker-compose up -d

# API available at http://localhost:8000
curl -X POST "http://localhost:8000/api/v1/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "output_format": "pdf"}'

Python API

import asyncio
from scrape_api_docs import AsyncDocumentationScraper

async def main():
    scraper = AsyncDocumentationScraper(max_workers=10)
    result = await scraper.scrape_site('https://docs.example.com')
    print(f"Scraped {result.total_pages} pages at {result.throughput:.2f} pages/sec")

asyncio.run(main())

📦 Installation

Requirements

  • Python 3.11 or higher
  • Poetry (recommended) or pip

Using Poetry (Recommended)

git clone https://github.com/thepingdoctor/scrape-api-docs
cd scrape-api-docs
poetry install

# For all export formats (PDF, EPUB)
poetry install --extras all-formats

# Activate virtual environment
poetry shell

Using pip

pip install git+https://github.com/thepingdoctor/scrape-api-docs.git

# With all export formats
pip install "git+https://github.com/thepingdoctor/scrape-api-docs.git#egg=scrape-api-docs[all-formats]"

Using Docker

# Basic scraper
docker pull ghcr.io/thepingdoctor/scrape-api-docs:latest

# API server
docker-compose -f docker-compose.api.yml up -d

🎮 Usage

Web Interface (Streamlit UI)

Launch the interactive web interface:

scrape-docs-ui

Features:

  • 📝 URL input with real-time validation
  • ⚙️ Advanced configuration (timeout, max pages, output format)
  • 📊 Real-time progress tracking with visual feedback
  • 📄 Results preview and downloadable output
  • 🎨 Modern, user-friendly interface

For a detailed UI guide, see STREAMLIT_UI_GUIDE.md.

REST API

Start the API server:

# Development
uvicorn scrape_api_docs.api.main:app --reload

# Production with Docker
docker-compose -f docker-compose.api.yml up -d

# Using make
make docker-api

API Endpoints (23+ total)

Scraping Operations:

# Create async scraping job
curl -X POST "http://localhost:8000/api/v1/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "output_format": "markdown",
    "max_pages": 100
  }'

# Get job status
curl "http://localhost:8000/api/v1/jobs/{job_id}"

# WebSocket progress streaming
wscat -c "ws://localhost:8000/api/v1/jobs/{job_id}/stream"
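
The same progress stream can also be consumed from Python. The sketch below assumes the websockets package and the stream endpoint shown above; the JSON shape of the progress events (status, pages_scraped) is an assumption.

import asyncio
import json
import websockets  # pip install websockets

async def follow_progress(job_id: str) -> None:
    url = f"ws://localhost:8000/api/v1/jobs/{job_id}/stream"
    async with websockets.connect(url) as ws:
        async for message in ws:
            event = json.loads(message)  # assumed JSON progress payload
            print(event.get("status"), event.get("pages_scraped"))

asyncio.run(follow_progress("abc123"))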

Export Formats:

# Export to PDF
curl -X POST "http://localhost:8000/api/v1/exports/pdf" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "abc123"}'

# Export to EPUB
curl -X POST "http://localhost:8000/api/v1/exports/epub" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "abc123", "title": "API Documentation"}'

System Endpoints:

# Health check
curl "http://localhost:8000/api/v1/system/health"

# Metrics
curl "http://localhost:8000/api/v1/system/metrics"

Full API documentation: http://localhost:8000/docs
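
For scripted use, the curl calls above translate into a short Python client. This is a sketch built on the documented endpoints; the response fields (job_id, status) and the terminal status values are assumptions, and the requests library is used for brevity.

import time
import requests

BASE = "http://localhost:8000/api/v1"
# e.g. headers={'Authorization': 'Bearer <token>'} if token-based auth is enabled

# Create an async scraping job.
resp = requests.post(f"{BASE}/scrape", json={
    "url": "https://docs.example.com",
    "output_format": "markdown",
    "max_pages": 100,
})
resp.raise_for_status()
job_id = resp.json()["job_id"]  # assumed response field

# Poll until the job reaches a terminal state (assumed status values).
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}").json()
    if job.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)
print(job)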

Command-Line Interface

# Basic usage
scrape-docs https://docs.example.com

# With options
scrape-docs https://docs.example.com \
  --output my-docs.md \
  --max-pages 50 \
  --timeout 30

# Enable JavaScript rendering
scrape-docs https://spa-app.example.com \
  --enable-js \
  --browser-pool-size 3

# Export to PDF
scrape-docs https://docs.example.com \
  --format pdf \
  --output docs.pdf

Python API

Async Scraper (Recommended - 5-10x Faster)

import asyncio
from scrape_api_docs import AsyncDocumentationScraper

async def main():
    # Initialize with custom settings
    scraper = AsyncDocumentationScraper(
        max_workers=10,
        rate_limit=10.0,  # requests per second
        timeout=30,
        enable_js=True
    )

    # Scrape site
    result = await scraper.scrape_site(
        'https://docs.example.com',
        output_file='output.md',
        max_pages=100
    )

    # Results
    print(f"Pages scraped: {result.total_pages}")
    print(f"Throughput: {result.throughput:.2f} pages/sec")
    print(f"Errors: {len(result.errors)}")
    print(f"Duration: {result.duration:.2f}s")

asyncio.run(main())

Synchronous Scraper (Legacy)

from scrape_api_docs import scrape_site

# Simple usage
scrape_site('https://docs.example.com')

# With options
scrape_site(
    'https://docs.example.com',
    output_file='custom-output.md',
    max_pages=50,
    timeout=30
)

JavaScript Rendering

import asyncio
from scrape_api_docs import AsyncDocumentationScraper

async def scrape_spa():
    scraper = AsyncDocumentationScraper(
        enable_js=True,
        browser_pool_size=3,
        browser_timeout=30000
    )

    result = await scraper.scrape_site('https://react-docs.example.com')
    print(f"Scraped SPA: {result.total_pages} pages")

asyncio.run(scrape_spa())

Export Formats

from scrape_api_docs.exporters import (
    PDFExporter,
    EPUBExporter,
    HTMLExporter,
    ExportOrchestrator
)

# Export to PDF
pdf_exporter = PDFExporter()
pdf_exporter.export('output.md', 'output.pdf', metadata={
    'title': 'API Documentation',
    'author': 'Your Name'
})

# Export to EPUB
epub_exporter = EPUBExporter()
epub_exporter.export('output.md', 'output.epub', metadata={
    'title': 'API Documentation',
    'language': 'en'
})

# Multi-format export
orchestrator = ExportOrchestrator()
orchestrator.export_multiple('output.md', ['pdf', 'epub', 'html'])

🔧 Configuration

Environment Variables

# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
API_WORKERS=4

# Scraper Settings
MAX_WORKERS=10
RATE_LIMIT=10.0
REQUEST_TIMEOUT=30
MAX_PAGES=1000

# JavaScript Rendering
ENABLE_JS=false
BROWSER_POOL_SIZE=3
BROWSER_TIMEOUT=30000

# Security
ENABLE_ROBOTS_TXT=true
BLOCK_PRIVATE_IPS=true

Configuration File (YAML)

# config/default.yaml
scraper:
  max_workers: 10
  rate_limit: 10.0
  timeout: 30
  user_agent: "DocumentationScraper/2.0"

javascript:
  enabled: false
  pool_size: 3
  timeout: 30000

security:
  robots_txt: true
  block_private_ips: true
  max_content_size: 10485760  # 10MB

export:
  default_format: markdown
  pdf_options:
    page_size: A4
    margin: 20mm
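
A hedged sketch of how a file like config/default.yaml might be loaded and overlaid on built-in defaults is shown below; it is illustrative and not necessarily how the package reads its configuration.

import yaml  # pip install pyyaml

DEFAULTS = {
    "scraper": {"max_workers": 10, "rate_limit": 10.0, "timeout": 30},
    "javascript": {"enabled": False, "pool_size": 3, "timeout": 30000},
}

def load_config(path: str = "config/default.yaml") -> dict:
    """Read the YAML file and overlay it onto the defaults, section by section."""
    with open(path) as fh:
        user_cfg = yaml.safe_load(fh) or {}
    merged = {section: {**values, **user_cfg.get(section, {})}
              for section, values in DEFAULTS.items()}
    # Sections only present in the file (e.g. security, export) pass through unchanged.
    merged.update({k: v for k, v in user_cfg.items() if k not in merged})
    return merged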

🐳 Deployment

Docker

# Build image
docker build -t scrape-api-docs .

# Run scraper
docker run -v $(pwd)/output:/output scrape-api-docs \
  https://docs.example.com

# Run API server
docker-compose -f docker-compose.api.yml up -d

Kubernetes

# Deploy to Kubernetes
kubectl apply -f k8s/namespace.yml
kubectl apply -f k8s/secrets.yml
kubectl apply -f k8s/deployment.yml
kubectl apply -f k8s/ingress.yml

# Scale workers
kubectl scale deployment scraper-worker --replicas=5 -n scraper

# Using make
make k8s-deploy

Docker Compose

# docker-compose.yml
version: '3.8'
services:
  api:
    image: scrape-api-docs:latest
    ports:
      - "8000:8000"
    environment:
      - MAX_WORKERS=10
      - ENABLE_JS=true
    volumes:
      - ./output:/output

🛠️ Development

Setup Development Environment

# Clone repository
git clone https://github.com/thepingdoctor/scrape-api-docs
cd scrape-api-docs

# Install dependencies
poetry install --with dev

# Activate virtual environment
poetry shell

Running Tests

# Run all tests
make test

# Run specific test suite
pytest tests/unit/
pytest tests/integration/
pytest tests/e2e/

# Run with coverage
make test-coverage

Code Quality

# Format code
make format

# Lint code
make lint

# Type checking
make typecheck

# Security scan
make security-scan

Pre-commit Hooks

# Install pre-commit hooks
pre-commit install

# Run manually
pre-commit run --all-files

📚 Documentation

Comprehensive Guides

Architecture Documentation

Phase Summaries

🏗️ Architecture

System Components

┌─────────────────────────────────────────────────────────┐
│                   Client Layer                          │
│  (CLI, Web UI, REST API, Python SDK)                   │
└─────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────┐
│              Scraping Engine (Async)                    │
│  • AsyncHTTPClient (Connection Pooling)                 │
│  • AsyncWorkerPool (Concurrency Control)                │
│  • AsyncRateLimiter (Token Bucket)                      │
│  • Priority Queue (BFS Scheduling)                      │
└─────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────┐
│           Rendering Layer (Hybrid)                      │
│  • Static HTML Parser (BeautifulSoup)                   │
│  • JavaScript Renderer (Playwright)                     │
│  • SPA Detector (Framework Detection)                   │
└─────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────┐
│           Export Layer (Multi-format)                   │
│  • Markdown, PDF, EPUB, HTML, JSON                      │
│  • Template Engine (Jinja2)                             │
│  • Export Orchestrator                                  │
└─────────────────────────────────────────────────────────┘

Technology Stack

  • Core: Python 3.11+, asyncio, aiohttp
  • API: FastAPI, Pydantic, uvicorn
  • Rendering: BeautifulSoup4, Playwright, markdownify
  • Export: WeasyPrint (PDF), EbookLib (EPUB), Jinja2
  • Storage: SQLite (jobs), filesystem (output)
  • Deployment: Docker, Kubernetes, GitHub Actions
  • Monitoring: Prometheus, structured logging

📊 Performance Benchmarks

Metric          Sync Scraper     Async Scraper    Improvement
Throughput      0.5 pages/sec    2.5 pages/sec    5x
100-page site   200 seconds      40 seconds       5x faster
Memory usage    ~100 MB          ~150 MB          Acceptable
CPU usage       15%              45%              Efficient

🔐 Security Features

  • SSRF Prevention: Private IP blocking, URL validation (sketched below, together with the robots.txt check)
  • robots.txt Compliance: Automatic crawl delay and permission checks
  • Rate Limiting: Token bucket algorithm with per-domain limits
  • Content Sanitization: XSS protection, safe HTML handling
  • Input Validation: Pydantic models, URL whitelisting
  • Authentication: Token-based API security (JWT)
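
The first two checks can be sketched with the standard library alone, as shown below; this is an illustrative outline, not the package's actual validation code.

import ipaddress
import socket
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_safe_url(url: str) -> bool:
    """Reject non-HTTP schemes and hosts that resolve to private, loopback, or link-local addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    for info in socket.getaddrinfo(parsed.hostname, None):
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return True

def robots_allows(url: str, user_agent: str = "DocumentationScraper/2.0") -> bool:
    """Check the site's robots.txt before scheduling a fetch."""
    parsed = urlparse(url)
    rp = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)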

📝 Examples

See the examples/ directory for:

  • Integration examples
  • Authentication managers
  • Caching strategies
  • Rate limiting configurations
  • Custom export pipelines

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This tool is designed for legitimate purposes such as documentation archival for personal or internal team use. Users are responsible for:

  • Ensuring they have the right to scrape any website
  • Complying with the website's terms of service and robots.txt
  • Respecting rate limits and server resources

The author is not responsible for any misuse of this tool. This software is provided "as is" without warranty of any kind.

🙏 Acknowledgments

  • Built with FastAPI, Playwright, and BeautifulSoup
  • Inspired by documentation tools like Docusaurus and MkDocs
  • Performance optimizations based on async best practices

📞 Support


Made with ❤️ for the developer community
