Spaces:
Sleeping
title: Pdf Extractor
emoji: π
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: pdf_extractor
PDF-to-JSON Extractor with AI
Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.
π Table of Contents
π― Overview
This application converts PDF documents into structured JSON format using:
- OpenAI GPT-4 Vision: For intelligent content extraction
- Template-based extraction: Customizable JSON schemas for different document types
- Streamlit UI: Interactive web interface for easy PDF processing
- Docker support: Containerized deployment for production environments
Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.
β¨ Features
- AI-Powered Extraction: Uses GPT-4 Vision to understand document structure
- Template System: Pre-configured JSON templates for common document types
- Batch Processing: Handle multiple PDFs efficiently
- Image Preview: Visual confirmation of PDF pages before extraction
- Format Validation: Ensures extracted JSON matches defined schema
- Hugging Face Spaces: Ready for cloud deployment
π Technology Stack
- Python 3.9+ - Primary programming language
- OpenAI API - GPT-4 Vision for intelligent extraction
- pypdfium2 - PDF rendering and image conversion
- Streamlit - Interactive web UI framework
- Pillow (PIL) - Image processing
- Pandas - Data manipulation
π Installation
Prerequisites
- Python 3.9 or higher
- OpenAI API key (Get one here)
Setup
Clone the repository: ```bash git clone https://github.com/pradyten/pdf-extractor.git cd pdf-extractor ```
Install dependencies: ```bash pip install -r requirements.txt ```
Configure OpenAI API key: ```bash export OPENAI_API_KEY='your-api-key-here' ```
π» Usage
Command Line
```bash python extractor.py path/to/document.pdf ```
Streamlit Web UI
```bash streamlit run src/streamlit_app.py ```
Docker
```bash docker build -t pdf-extractor . docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor ```
βοΈ Configuration
Define custom templates in `extractor.py` for different document types (resumes, invoices, forms).
π Use Cases
- HR & Recruitment: Batch process resume PDFs
- Accounting: Extract invoice data
- Data Entry: Automate form digitization
- Document Management: Convert scanned documents to searchable JSON
π Security & Privacy
- Never commit API keys - use environment variables
- PDFs are processed in-memory, not stored
- Review OpenAI's data usage policies for compliance
π¨βπ» Author
Pradyumn Tendulkar
Data Science Graduate Student | ML Engineer
- GitHub: @pradyten
- LinkedIn: Pradyumn Tendulkar
- Email: pktendulkar@wpi.edu
β If you found this project helpful, please consider giving it a star!
π License: MIT