Spaces:

pradyten
/

pdf-extractor

Sleeping

App Files Files Community

pdf-extractor / README.md

github-actions[bot]

Sync from GitHub

8e52fc5 2 months ago

preview code

raw

history blame contribute delete

3.36 kB

metadata

title: Pdf Extractor
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: pdf_extractor

PDF-to-JSON Extractor with AI

Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.

🎯 Overview

This application converts PDF documents into structured JSON format using:

OpenAI GPT-4 Vision: For intelligent content extraction
Template-based extraction: Customizable JSON schemas for different document types
Streamlit UI: Interactive web interface for easy PDF processing
Docker support: Containerized deployment for production environments

Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.

✨ Features

AI-Powered Extraction: Uses GPT-4 Vision to understand document structure
Template System: Pre-configured JSON templates for common document types
Batch Processing: Handle multiple PDFs efficiently
Image Preview: Visual confirmation of PDF pages before extraction
Format Validation: Ensures extracted JSON matches defined schema
Hugging Face Spaces: Ready for cloud deployment

🛠 Technology Stack

Python 3.9+ - Primary programming language
OpenAI API - GPT-4 Vision for intelligent extraction
pypdfium2 - PDF rendering and image conversion
Streamlit - Interactive web UI framework
Pillow (PIL) - Image processing
Pandas - Data manipulation

🚀 Installation

Prerequisites

Python 3.9 or higher
OpenAI API key (Get one here)

Setup

Clone the repository: ```bash git clone https://github.com/pradyten/pdf-extractor.git cd pdf-extractor ```
Install dependencies: ```bash pip install -r requirements.txt ```
Configure OpenAI API key: ```bash export OPENAI_API_KEY='your-api-key-here' ```

💻 Usage

Command Line

```bash python extractor.py path/to/document.pdf ```

Streamlit Web UI

```bash streamlit run src/streamlit_app.py ```

Docker

```bash docker build -t pdf-extractor . docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor ```

⚙️ Configuration

Define custom templates in `extractor.py` for different document types (resumes, invoices, forms).

🎓 Use Cases

HR & Recruitment: Batch process resume PDFs
Accounting: Extract invoice data
Data Entry: Automate form digitization
Document Management: Convert scanned documents to searchable JSON

🔒 Security & Privacy

Never commit API keys - use environment variables
PDFs are processed in-memory, not stored
Review OpenAI's data usage policies for compliance

👨‍💻 Author

Pradyumn Tendulkar

Data Science Graduate Student | ML Engineer

⭐ If you found this project helpful, please consider giving it a star!

📝 License: MIT