Apply for a GPU community grant: Personal project
Nuremberg Scholar is a RAG-based research assistant for the Nuremberg Trials transcripts (1945–1946), scraped from the Yale Avalon Project. The corpus covers all 221 trial sessions, the judgment, and key prosecution documents, for a total of 46,325 indexed passages. The dataset is public domain and already published on Hugging Face at dtufail/nuremberg-trials-corpus.
Why I need a GPU: The retrieval pipeline runs BGE-M3 (hybrid dense + sparse search with RRF fusion) and bge-reranker-v2-m3 as a cross-encoder reranker. Both models need a GPU at query time. Inference takes roughly 1 second per query, and the inference function is decorated with @spaces.GPU(duration=10) so the GPU is only held for that call. Everything else (FAISS lookup, Groq API call for Llama-3.1-8B generation, citation verification) runs on CPU.
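For anyone unfamiliar with the RRF fusion step mentioned above, here is a minimal sketch of reciprocal rank fusion in plain Python. It is an illustration, not the project's actual code: the function name rrf_fuse, the passage IDs, and the constant k=60 (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse several ranked lists of passage IDs with reciprocal rank fusion.

    Each passage scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the order is deterministic.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


# Hypothetical passage IDs from the dense and sparse retrievers.
dense_hits = ["p3", "p1", "p7"]
sparse_hits = ["p1", "p9", "p3"]
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Passages that appear in both lists rise to the top, which is why fusion runs before the fused candidates go to the cross-encoder reranker.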
What users get: Ask a question about the Nuremberg Trials, get an answer with specific source citations from the actual transcripts. A post-generation verifier strips any hallucinated citations before the response is shown.
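A rough sketch of how a post-generation citation check like this could work is below. The bracketed numeric citation format such as [2], the function name strip_unverified_citations, and the idea of checking against the set of retrieved passage IDs are assumptions for illustration; the real verifier may work differently.

```python
import re

# Assumed citation format: a passage number in square brackets, e.g. "[2]".
CITATION = re.compile(r"\[(\d+)\]")


def strip_unverified_citations(answer, retrieved_ids):
    """Drop any citation whose ID was not among the retrieved passages."""

    def keep_or_drop(match):
        return match.group(0) if int(match.group(1)) in retrieved_ids else ""

    cleaned = CITATION.sub(keep_or_drop, answer)
    # Tidy the spacing left behind where a citation was removed.
    cleaned = re.sub(r"\s+([.,;:])", r"\1", cleaned)
    return re.sub(r" {2,}", " ", cleaned).strip()


# Placeholder answer text; passage 9 was never retrieved, so it is removed.
answer = "Claim A [2]. Claim B [9]."
cleaned = strip_unverified_citations(answer, retrieved_ids={1, 2, 3})
print(cleaned)
```

This only confirms that a cited passage was actually retrieved. Checking that the passage supports the claim would need a separate step.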
This is an open-source portfolio project. The code is public, the corpus is public domain, and I'm building it as an AI engineer to make these historical records more accessible.
GitHub: github.com/DTufail | Dataset: huggingface.co/datasets/dtufail/nuremberg-trials-corpus