pdf-extractor / README.md
github-actions[bot]
Sync from GitHub
8e52fc5
metadata
title: Pdf Extractor
emoji: πŸš€
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: pdf_extractor

PDF-to-JSON Extractor with AI

Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.

πŸ“‹ Table of Contents

🎯 Overview

This application converts PDF documents into structured JSON format using:

  • OpenAI GPT-4 Vision: For intelligent content extraction
  • Template-based extraction: Customizable JSON schemas for different document types
  • Streamlit UI: Interactive web interface for easy PDF processing
  • Docker support: Containerized deployment for production environments

Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.

✨ Features

  • AI-Powered Extraction: Uses GPT-4 Vision to understand document structure
  • Template System: Pre-configured JSON templates for common document types
  • Batch Processing: Handle multiple PDFs efficiently
  • Image Preview: Visual confirmation of PDF pages before extraction
  • Format Validation: Ensures extracted JSON matches defined schema
  • Hugging Face Spaces: Ready for cloud deployment

πŸ›  Technology Stack

  • Python 3.9+ - Primary programming language
  • OpenAI API - GPT-4 Vision for intelligent extraction
  • pypdfium2 - PDF rendering and image conversion
  • Streamlit - Interactive web UI framework
  • Pillow (PIL) - Image processing
  • Pandas - Data manipulation

πŸš€ Installation

Prerequisites

Setup

  1. Clone the repository: ```bash git clone https://github.com/pradyten/pdf-extractor.git cd pdf-extractor ```

  2. Install dependencies: ```bash pip install -r requirements.txt ```

  3. Configure OpenAI API key: ```bash export OPENAI_API_KEY='your-api-key-here' ```

πŸ’» Usage

Command Line

```bash python extractor.py path/to/document.pdf ```

Streamlit Web UI

```bash streamlit run src/streamlit_app.py ```

Docker

```bash docker build -t pdf-extractor . docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor ```

βš™οΈ Configuration

Define custom templates in `extractor.py` for different document types (resumes, invoices, forms).

πŸŽ“ Use Cases

  • HR & Recruitment: Batch process resume PDFs
  • Accounting: Extract invoice data
  • Data Entry: Automate form digitization
  • Document Management: Convert scanned documents to searchable JSON

πŸ”’ Security & Privacy

  • Never commit API keys - use environment variables
  • PDFs are processed in-memory, not stored
  • Review OpenAI's data usage policies for compliance

πŸ‘¨β€πŸ’» Author

Pradyumn Tendulkar

Data Science Graduate Student | ML Engineer


⭐ If you found this project helpful, please consider giving it a star!

πŸ“ License: MIT