Apply for a GPU community grant: Personal project

by dtufail - opened Mar 11

Owner Mar 11

Nuremberg Scholar is a RAG-based research assistant for the Nuremberg Trials transcripts (1945–1946), scraped from the Yale Avalon Project. The corpus covers all 221 trial sessions, the judgment, and key prosecution documents, for a total of 46,325 indexed passages. The dataset is public domain and already published on Hugging Face at dtufail/nuremberg-trials-corpus.

Why I need a GPU: The retrieval pipeline runs BGE-M3 (hybrid dense + sparse search with RRF fusion) and bge-reranker-v2-m3 as a cross-encoder reranker. Both models need a GPU at query time: inference takes roughly 1 second per query, and the query function is decorated with @spaces.GPU(duration=10). Everything else (FAISS lookup, Groq API call for Llama-3.1-8B generation, citation verification) runs on CPU.
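
For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of how a dense ranking and a sparse ranking can be merged. It is an illustration, not the code from the Space: the function name `rrf_fuse`, the passage IDs, and the constant `k=60` (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Merge several ranked lists of passage IDs with reciprocal rank fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default, assumed here).
    Returns (passage_id, score) pairs sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every passage it contains.
            scores[passage_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if top_n is None else fused[:top_n]


# Hypothetical passage IDs from the dense and sparse retrievers.
dense_hits = ["p3", "p1", "p2"]
sparse_hits = ["p1", "p4", "p3"]
fused_ids = [pid for pid, _ in rrf_fuse([dense_hits, sparse_hits])]
```

Passages that rank well in both lists rise to the top, so the fused candidates passed to the reranker favour agreement between the two retrievers without needing their raw scores to be comparable.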

What users get: Ask a question about the Nuremberg Trials, get an answer with specific source citations from the actual transcripts. A post-generation verifier strips any hallucinated citations before the response is shown.
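
As a rough illustration of the verifier idea, the sketch below removes citation markers that do not point at a retrieved passage. The bracketed-number format (`[2]`), the function name `strip_unverified_citations`, and the sample text are assumptions for this example; the actual Space may format and check citations differently.

```python
import re

# Assumed citation format: a bracketed passage number such as "[2]".
CITATION_RE = re.compile(r"\s*\[(\d+)\]")


def strip_unverified_citations(answer, retrieved_ids):
    """Drop any citation marker whose number is not among the retrieved passages."""

    def keep_or_drop(match):
        cited = int(match.group(1))
        # Keep the marker (with its leading space) only if the passage was retrieved.
        return match.group(0) if cited in retrieved_ids else ""

    return CITATION_RE.sub(keep_or_drop, answer)


# Placeholder answer text; passage 7 was never retrieved, so its marker is removed.
raw_answer = "First claim [2]. Second claim [7]."
clean_answer = strip_unverified_citations(raw_answer, retrieved_ids={1, 2, 3})
```

This only checks that a cited passage exists in the retrieved set; checking that the passage actually supports the claim would need a further step.
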
This is an open-source portfolio project. The code is public, the corpus is public domain, and I'm building it as an AI engineer to make these historical records more accessible.

GitHub: github.com/DTufail | Dataset: huggingface.co/datasets/dtufail/nuremberg-trials-corpus