Webapp Java (Tomcat + JSP/Servlet) PARSING IMAGE

Image Letter Parser is a lightweight OCR-oriented tool that takes an input image, cleans it (binarization and denoising), segments characters, and outputs the extracted letters. It is designed to be simple to run and easy to extend with different OCR backends or preprocessing steps.
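To make the cleaning step concrete, here is a minimal binarization sketch. It is not taken from the project's source: the class name `Binarizer`, the method `binarize`, and the fixed-threshold approach are assumptions chosen for illustration, and the real pipeline may use a different technique (for example an adaptive threshold).

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not the project's own code.
final class Binarizer {

    // Converts an image to pure black/white using a fixed luminance threshold (0-255).
    static BufferedImage binarize(BufferedImage input, int threshold) {
        int w = input.getWidth();
        int h = input.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = input.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the Rec. 601 luma weights.
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                // Pixels at or above the threshold become white, the rest black.
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

A call such as `Binarizer.binarize(image, 128)` would typically run before the image is handed to the OCR backend; the threshold value 128 is only a starting point and usually needs tuning per image source.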

OCR/Tesseract (Faculty of Engineering)

Developers and Creators

Simone Remoli

Built With

This system is built on the following stack, as reflected in the setup steps: Java (JSP/Servlet) on Apache Tomcat, Maven, Tesseract OCR, and Tess4J.

🚦 Overview

This project uses Tesseract OCR to extract text from images, which supports tasks such as document automation and data extraction. The parsing pipeline is modular, so you can customize and extend it to suit your needs.

1. Clone the Repository

git clone https://github.com/SimoneRemoli/Image-Letter-Parser.git

2. Install Tesseract (macOS, via Homebrew)

brew install tesseract

3. Add Tess4J to your POM

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.10.0</version>
</dependency>

4. Add VM options to Tomcat

IntelliJ → Run/Debug Configurations → Configuration SmartTomcat → VM options

-Djna.library.path=/usr/local/lib -Dtessdata.dir=/usr/local/share/tessdata

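Note: Check that the tessdata directory contains eng.traineddata. Homebrew uses /opt/homebrew on Apple Silicon and /usr/local on Intel Macs, so adjust the paths in the VM options if needed.

ls /opt/homebrew/share/tessdata (or /usr/local/share/tessdata)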
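The same check can be done from Java at startup. The sketch below assumes the application reads the `tessdata.dir` system property set through the VM options; that is an inference from the option name, not something confirmed by the project's source. `TessdataLocator` and `locate` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration; not the project's own code.
final class TessdataLocator {

    // Returns the first candidate directory that contains eng.traineddata.
    static Optional<Path> locate(List<Path> candidates) {
        return candidates.stream()
                .filter(dir -> Files.isRegularFile(dir.resolve("eng.traineddata")))
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumption: tessdata.dir is the property passed with -Dtessdata.dir=...
        String configured = System.getProperty("tessdata.dir", "/usr/local/share/tessdata");
        List<Path> candidates = List.of(
                Path.of(configured),
                Path.of("/opt/homebrew/share/tessdata"),
                Path.of("/usr/local/share/tessdata"));
        System.out.println(locate(candidates)
                .map(p -> "Using tessdata at " + p)
                .orElse("eng.traineddata not found; check the Tesseract installation"));
    }
}
```

Failing early with a clear message tends to be easier to diagnose than a native-library error raised later during recognition.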

Let’s turn images into actionable data!

Example: A — [15,104,44,48] means “the letter A is inside a bounding box that starts 15 px from the left and 104 px from the top, 44 px wide and 48 px tall.”
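The bounding-box notation can be read as [x, y, width, height]. The following sketch parses that notation into a `java.awt.Rectangle` and derives the right and bottom edges. `LetterBox` and its methods are hypothetical names introduced for illustration; the project may represent results differently.

```java
import java.awt.Rectangle;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value type for one recognized letter; not the project's own class.
final class LetterBox {

    // Matches "[x,y,w,h]" with optional spaces around the numbers.
    private static final Pattern BOX = Pattern.compile(
            "\\[\\s*(\\d+)\\s*,\\s*(\\d+)\\s*,\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\]");

    final String letter;
    final Rectangle box;

    LetterBox(String letter, Rectangle box) {
        this.letter = letter;
        this.box = box;
    }

    // Parses coordinates such as "[15,104,44,48]" for the given letter.
    static LetterBox parse(String letter, String coords) {
        Matcher m = BOX.matcher(coords.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected [x,y,w,h] but got: " + coords);
        }
        return new LetterBox(letter, new Rectangle(
                Integer.parseInt(m.group(1)),   // x: distance from the left edge
                Integer.parseInt(m.group(2)),   // y: distance from the top edge
                Integer.parseInt(m.group(3)),   // width
                Integer.parseInt(m.group(4)))); // height
    }

    // Right and bottom edges follow from the origin plus the size.
    int right() { return box.x + box.width; }
    int bottom() { return box.y + box.height; }
}
```

For the example above, `LetterBox.parse("A", "[15,104,44,48]")` yields a box whose right edge is at 59 px and whose bottom edge is at 152 px.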

Tess4J provides Java APIs (via JNA) to invoke Tesseract without writing native code. OCR (Optical Character Recognition) is the technology that “reads” text in an image or PDF and converts it into digital text.
