Read and extract text and other content from PDFs in C# (port of PDFBox)
A Gtk/Qt front-end to tesseract-ocr.
OCR engine for all the languages
Web interface for recognizing text, proofreading OCR, and creating fully-digitized documents.
Resources for Document Layout Analysis development with PdfPig.
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
Text Overlay plugin for Mirador 3
Convert between Tesseract hOCR and ALTO XML using XSL stylesheets
Ergonomic line-by-line transcription of scanned text.
Text-to-tibble
Some basic data and text extraction from the New York City Directories
The data for guides to breweries across the United States from 1896 to 1918
Python parser for hOCR files using lxml
CLI tool that recognises handwritten text from answer sheets using Tesseract OCR, then uses the extracted text to evaluate marks with NLP
Graphical hOCR editor that produces minimal diffs when proofreading Tesseract OCR output