OCR Vision Character Model
This model is a character-level language model trained on OCR-extracted text from historical JFK documents.
Overview
This model was fine-tuned from GPT-2 using Andrej Karpathy's nanoGPT framework. The training data consists of text extracted from declassified JFK documents with Google Vision OCR.
Training Process
- Source Documents: PDF files were downloaded from the National Archives JFK document releases
- Text Extraction: Google Vision API was used to perform OCR on the PDF documents
- Model Training: The extracted text was used to fine-tune a GPT-2 model using the nanoGPT framework
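The data-preparation part of the last step can be sketched as follows. This is a minimal illustration only, assuming a character-level vocabulary and a 90/10 train/validation split in the style of nanoGPT's character-level example. The helper names `prepare_char_dataset` and `decode`, the split ratio, and the sample string are hypothetical; the actual preparation script for this model is not documented here. Because the model is fine-tuned from GPT-2, the real pipeline may instead use GPT-2's BPE tokenizer.

```python
from array import array


def prepare_char_dataset(text: str, train_fraction: float = 0.9):
    """Encode text as character ids and split it into train/val parts.

    Hypothetical helper: the vocabulary scheme and the split ratio are
    assumptions, not the documented settings for this model.
    """
    # Vocabulary: every distinct character in the corpus, sorted for stability.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    # Encode the whole corpus as unsigned 16-bit integer ids.
    ids = array("H", (stoi[ch] for ch in text))

    # The first part of the corpus is used for training, the rest for validation.
    split = int(len(ids) * train_fraction)
    return ids[:split], ids[split:], stoi, itos


def decode(ids, itos):
    """Map a sequence of character ids back to a string."""
    return "".join(itos[i] for i in ids)


# Made-up placeholder text standing in for OCR output.
sample = "SECRET\nMEMORANDUM FOR THE RECORD\n"
train_ids, val_ids, stoi, itos = prepare_char_dataset(sample)
```

Decoding the two parts and concatenating them should reproduce the input text, which is a quick sanity check before training.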
Training Data
The model was trained on text extracted from the following National Archives JFK files:
- 206-10001-10001.pdf
- 206-10001-10002.pdf
- 206-10001-10003.pdf
- 206-10001-10004.pdf
- 206-10001-10005.pdf
- 206-10001-10006.pdf
- 206-10001-10007.pdf
- 206-10001-10008.pdf
- 206-10001-10009.pdf
- 206-10001-10010.pdf
- 206-10001-10011.pdf
- 206-10001-10012.pdf
- 206-10001-10013.pdf
- 206-10001-10014.pdf
- 206-10001-10015.pdf
- 206-10001-10016.pdf
- 206-10001-10017.pdf
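The file names above are numbered consecutively, so the list can be regenerated programmatically, for example when scripting downloads. A small sketch; the variable name `TRAINING_FILES` is illustrative and not part of the original pipeline.

```python
# File numbers run consecutively from 10001 to 10017 under the 206-10001 prefix.
TRAINING_FILES = [f"206-10001-{n}.pdf" for n in range(10001, 10018)]
```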
All training documents come from the National Archives' March 18, 2025 JFK document release.
Note: This is a work in progress. Future versions will be trained on all documents from the JFK release.
Model tree for jpruiz114/jfk-release-2025-small-2025-10-15-v1
Base model
openai-community/gpt2