Pytheract

Optical character recognition using Tesseract.

Introduction

An application that extracts meaningful data from uploaded files such as PDFs, images, and zip archives.

Usage

For end users.

An environment for end users is currently being set up.

Flow

1. Upload a file using the frontend.
2. Tesseract extracts the text available in the uploaded file.
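
The flow above can be sketched as a small helper that takes the path of an uploaded image and hands it to an OCR function. This is a minimal sketch, not the project's actual code: `extract_text`, `tesseract_ocr`, the `ocr` parameter, and the list of accepted extensions are hypothetical names chosen for illustration, and the default OCR path assumes the `pytesseract` and `Pillow` packages are installed alongside Tesseract.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical set of image types accepted by this sketch.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def tesseract_ocr(path: Path) -> str:
    """Run Tesseract on an image file (assumes pytesseract and Pillow are installed)."""
    # Imported lazily so the rest of the module works without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image)


def extract_text(path: str, ocr: Optional[Callable[[Path], str]] = None) -> str:
    """Return the text found in an uploaded image, stripped of surrounding whitespace."""
    file_path = Path(path)
    if file_path.suffix.lower() not in IMAGE_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {file_path.suffix}")
    ocr = ocr or tesseract_ocr
    return ocr(file_path).strip()
```

Passing the OCR function as a parameter keeps the helper testable without Tesseract installed; calling `extract_text("uploads/scan.png")` with no second argument falls back to Tesseract.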

Installation

For developers.

Prerequisites

The application has a number of dependencies. Kindly ensure you have the following installed on your machine:

- Python (official download)
- Tesseract
- MongoDB (official download)
- MongoDB Compass (official download)
- Git (official download)
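
To confirm that the command-line prerequisites are reachable before going further, a short script along the following lines may help. It is only a sketch: the executable names in `REQUIRED_TOOLS` are assumptions and may differ on your system, `find_missing` is a hypothetical helper, and MongoDB Compass is not checked because it is a desktop application.

```python
import shutil
import sys
from typing import Iterable, List

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ("tesseract", "mongod", "git")


def find_missing(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    missing = find_missing()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

A tool reported as missing may still be installed but absent from `PATH`, which is common for `mongod` on Windows.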

Running the Application

1. Install Python if it is not already installed. Add it to your environment variables and check the version.
```
  C:\Users\username> python
  Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)] on win32
  Type "help", "copyright", "credits" or "license" for more information.
```
2. Install MongoDB if it is not already installed.
3. Install MongoDB Compass (the client).
4. Go to the MongoDB bin folder and start the server.
```
C:\Program Files\MongoDB\Server\4.4\bin> mongod
```
The server will be available on port 27017.
5. Open Compass and connect to the database.
```
mongodb://localhost:27017
```
6. Install Tesseract.
7. Clone the repository.
```
git clone https://github.com/SandeepBalachandran/Pytheract.git
```
8. Change into the cloned repository.
```
cd Pytheract
```
9. If you are using Pipenv, set up the virtual environment and start it as follows:
```
pipenv install
pipenv shell
```
10. Run Flask.
```
set FLASK_APP=app.py
set FLASK_ENV=development
flask run
```
The application will be available on port 5000.
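
Once both servers have been started, a quick way to check that they are listening is to attempt a TCP connection to each port. The snippet below is a sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and the host `localhost` is an assumption based on the connection string used for Compass.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the steps above: MongoDB on 27017, Flask on 5000.
    for name, port in (("MongoDB", 27017), ("Flask", 5000)):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

A successful connection only shows that something is listening on the port, not that it is the expected service.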

Features

- Extract text from PDF files.
- Extract text from zip files containing both images and PDF files.
- Show the webcam feed in the UI.
- Capture an image and extract text from the captured image.
- Locate specific content using regular expressions, e.g. email addresses and phone numbers.
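
As an illustration of the regex feature, the sketch below pulls email addresses and phone-like numbers out of extracted text. The patterns and the `find_contacts` helper are assumptions made for this example and are not taken from the project's code; they are deliberately simple and will miss or over-match some real-world formats.

```python
import re
from typing import Dict, List

# Deliberately simple patterns; real-world emails and phone numbers vary widely.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_contacts(text: str) -> Dict[str, List[str]]:
    """Return the email addresses and phone-like numbers found in OCR output."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "phones": PHONE_PATTERN.findall(text),
    }
```

For the input `"Contact: jane.doe@example.com, tel. +1 555-123-4567"`, `find_contacts` returns `jane.doe@example.com` under `"emails"` and `+1 555-123-4567` under `"phones"`. OCR output often contains misread characters, so stricter or locale-specific patterns may be needed in practice.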

Contribute

Please check the Contributing Guidelines before contributing.
