This project is a lightweight, modular pipeline for extracting and processing data from various sources like PDFs, text files, directories, and web pages using LangChain and Groq's LLMs.
- ✅ PDF data extraction with `PyPDFLoader`
- ✅ Directory-wise PDF processing using `DirectoryLoader`
- ✅ Raw `.txt` file summarization
- ✅ Web scraping + LLM-based question answering
- ✅ Uses `ChatGroq` with DeepSeek or LLaMA-3 models
```
.
├── dataloader/
│   ├── directory_loader.py   # Load multiple PDFs from a folder
│   ├── pypdf_loader.py       # Load and query a single PDF
│   ├── text_loader.py        # Summarize .txt files
│   ├── webbase_loader.py     # Extract info from websites
│   ├── extra.py              # (Optional utility file)
│   ├── text.txt              # Sample text file
│   ├── data.pdf              # Sample PDF
│   └── .env                  # Stores your GROQ_API_KEY
```
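As a rough illustration of how `pypdf_loader.py` might work, here is a minimal sketch that loads one PDF with `PyPDFLoader` and asks `ChatGroq` a question about it. The model name, file path, and prompt wording are assumptions, not taken from the repository.

```python
# Hypothetical sketch of pypdf_loader.py, not the repository's actual code.

def build_prompt(question: str, context: str) -> str:
    """Combine the user's question with the text extracted from the PDF."""
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def ask_pdf(path: str, question: str) -> str:
    # Imported lazily so build_prompt stays testable without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_groq import ChatGroq

    pages = PyPDFLoader(path).load()  # one Document per page
    context = "\n".join(page.page_content for page in pages)
    llm = ChatGroq(model="llama-3.3-70b-versatile")  # assumed model name
    return llm.invoke(build_prompt(question, context)).content

if __name__ == "__main__":
    print(ask_pdf("data.pdf", "Tell me all the education institute names of the person"))
```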
Install dependencies via:

```
pip install -r requirements.txt
```

`requirements.txt` contains:

```
langchain-groq
groq
python-dotenv
langchain_community
pypdf
bs4
```
- Create a `.env` file:

```
GROQ_API_KEY=your_groq_api_key_here
```
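Since `python-dotenv` is listed in `requirements.txt`, the scripts presumably read the key with `load_dotenv()`. A minimal sketch, assuming that setup (the helper name is mine, not the repository's):

```python
# Sketch of how the scripts might pick up GROQ_API_KEY from .env.
import os

def get_groq_key() -> str:
    try:
        from dotenv import load_dotenv  # third-party; listed in requirements.txt
        load_dotenv()                   # reads .env from the current directory
    except ImportError:
        pass                            # fall back to the plain environment
    key = os.environ.get("GROQ_API_KEY", "")
    if not key:
        raise RuntimeError("GROQ_API_KEY not set; add it to .env")
    return key

if __name__ == "__main__":
    print("key loaded:", bool(get_groq_key()))
```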
- Run any of the scripts as needed:

```
python pypdf_loader.py
python webbase_loader.py
python text_loader.py
python directory_loader.py
```
Example prompts:

- PDF: `Tell me all the education institute names of the person`
- Web: `Name of the darkest coffee`
- Text: `Summarize the following text`
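The web example (`webbase_loader.py`) might follow the same pattern: scrape the page with `WebBaseLoader`, then ask the model about its content. The URL, model name, and truncation limit below are assumptions for illustration.

```python
# Hypothetical sketch of webbase_loader.py, not the repository's actual code.

def first_n_chars(text: str, n: int = 4000) -> str:
    """Trim scraped text so the prompt stays within the model's context window."""
    return text[:n]

def ask_website(url: str, question: str) -> str:
    # Imported lazily so first_n_chars stays testable without LangChain installed.
    from langchain_community.document_loaders import WebBaseLoader
    from langchain_groq import ChatGroq

    docs = WebBaseLoader(url).load()
    context = first_n_chars("\n".join(doc.page_content for doc in docs))
    llm = ChatGroq(model="llama-3.3-70b-versatile")  # assumed model name
    return llm.invoke(f"{question}\n\nPage content:\n{context}").content

if __name__ == "__main__":
    print(ask_website("https://example.com", "Name of the darkest coffee"))
```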
This project is licensed under the MIT License.
Author: Nitesh Kumar Singh
Built with ❤️ using LangChain, Groq, and Python