albbas/pdf-strings

pdf-extract

Extract text from PDFs with position data.

Usage

// Simple extraction
let output = pdf_strings::from_path("file.pdf")?;
println!("{}", output);  // Plain text

// With password
let output = pdf_strings::PdfExtractor::builder()
    .password("secret")
    .build()
    .from_path("encrypted.pdf")?;

// Preserve spatial layout
println!("{}", output.to_string_pretty());

// Access structured data with bounding boxes
for line in output.lines() {
    for span in line {
        println!("{} at {:?}", span.text, span.bbox);
    }
}

Features

  • Plain text extraction
  • Spatial layout preservation
  • Bounding box coordinates for every text span
  • Font encoding resolution (ToUnicode, Type1, TrueType, CID, Type3)
  • Password-protected PDF support
  • Handles complex fonts, rotated text, and multi-column layouts
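To illustrate what per-span bounding boxes make possible, here is a minimal sketch of grouping spans into lines by vertical overlap. The `BBox`, `Span`, and `group_into_lines` names are hypothetical and defined only for this example; they are not the crate's types, and the crate's own line detection may work differently. The sketch assumes PDF-style coordinates where y grows upward.

```rust
/// Hypothetical axis-aligned box (illustration only, not the crate's type).
/// Assumes PDF-style coordinates: origin at bottom-left, y grows upward.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

impl BBox {
    /// Smallest box containing both `self` and `other`.
    pub fn union(self, other: BBox) -> BBox {
        BBox {
            x0: self.x0.min(other.x0),
            y0: self.y0.min(other.y0),
            x1: self.x1.max(other.x1),
            y1: self.y1.max(other.y1),
        }
    }

    /// Rough "same line" test: the vertical extents of the boxes intersect.
    pub fn overlaps_vertically(self, other: BBox) -> bool {
        self.y0 < other.y1 && other.y0 < self.y1
    }
}

/// Hypothetical text span (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub bbox: BBox,
}

/// Groups spans into lines, top to bottom, with each line ordered left to right.
pub fn group_into_lines(mut spans: Vec<Span>) -> Vec<Vec<Span>> {
    // Highest top edge first; ties are broken left to right.
    spans.sort_by(|a, b| {
        b.bbox
            .y1
            .total_cmp(&a.bbox.y1)
            .then(a.bbox.x0.total_cmp(&b.bbox.x0))
    });

    let mut lines: Vec<(BBox, Vec<Span>)> = Vec::new();
    for span in spans {
        if let Some((line_box, line)) = lines.last_mut() {
            if line_box.overlaps_vertically(span.bbox) {
                // Same line: grow the line's box and keep the span.
                *line_box = line_box.union(span.bbox);
                line.push(span);
                continue;
            }
        }
        // No vertical overlap with the current line: start a new one.
        lines.push((span.bbox, vec![span]));
    }

    lines
        .into_iter()
        .map(|(_, mut line)| {
            line.sort_by(|a, b| a.bbox.x0.total_cmp(&b.bbox.x0));
            line
        })
        .collect()
}
```

This simple pass does not handle rotated text or multi-column pages; those would need extra steps such as splitting on wide horizontal gaps before grouping.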

API

Three output formats:

  • to_string() - Plain text
  • to_string_pretty() - Character grid rendering that preserves spatial layout
  • lines() - Structured data with TextSpan objects containing text, bounding boxes, and font sizes
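As a rough picture of what a character grid rendering can look like, the sketch below places text fragments onto a monospace grid based on their positions. It is not the crate's implementation of to_string_pretty(): the `Placed` struct, the `render_grid` function, and the fixed cell sizes are assumptions made for this example, and y is measured downward from the top of the page to keep the arithmetic simple.

```rust
/// Hypothetical positioned fragment (illustration only, not the crate's TextSpan).
pub struct Placed<'a> {
    pub text: &'a str,
    /// Left edge in points.
    pub x: f32,
    /// Distance from the top of the page in points (assumed for this sketch).
    pub y: f32,
}

/// Renders fragments onto a grid whose cells are `cell_w` by `cell_h` points.
/// Later fragments overwrite earlier ones when they land on the same cells.
pub fn render_grid(items: &[Placed<'_>], cell_w: f32, cell_h: f32) -> String {
    let mut rows: Vec<Vec<char>> = Vec::new();

    for item in items {
        // Map page coordinates to grid cells, clamping negatives to zero.
        let row = (item.y / cell_h).floor().max(0.0) as usize;
        let col = (item.x / cell_w).floor().max(0.0) as usize;

        if rows.len() <= row {
            rows.resize(row + 1, Vec::new());
        }
        let line = &mut rows[row];

        for (i, ch) in item.text.chars().enumerate() {
            let c = col + i;
            if line.len() <= c {
                // Pad with spaces up to the target column.
                line.resize(c + 1, ' ');
            }
            line[c] = ch;
        }
    }

    rows.iter()
        .map(|r| r.iter().collect::<String>().trim_end().to_string())
        .collect::<Vec<_>>()
        .join("\n")
}
```

With 6-point-wide cells, a fragment at x = 60 starts in column 10, so two fragments on the same row keep their horizontal gap as spaces. Rows with no text come out as empty lines, which preserves vertical spacing.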

Acknowledgements

This is a fork of pdf-extract. Thanks to its authors for laying the groundwork. PDFs are ... something else.
