vn6295337's picture
Initial commit: RAG Document Assistant with Zero-Storage Privacy
f866820
# RAG PoC — Ingestion
This folder contains Day-3 ingestion pipeline components.
Files:
- load_docs.py : Markdown loader -> returns cleaned text + metadata
- chunker.py : Deterministic whitespace chunker (approx tokens->chars)
- test_ingestion.py : End-to-end loader -> chunker smoke test
- embeddings.py : Offline deterministic pseudo-embedding stub (provider="local")
- save_embeddings.py : Persist chunk embeddings to data/embeddings.jsonl
- search_local.py : Local cosine-similarity retrieval against embeddings.jsonl
- data/embeddings.jsonl : Generated embeddings (JSONL)
Quick run (from `RAG-document-assistant/ingestion`, with `aienv` active):
1. Activate venv:
source ~/aienv/bin/activate
2. Load & summarize docs:
python3 load_docs.py /full/path/to/your/markdown_folder
3. End-to-end ingestion test:
python3 test_ingestion.py /full/path/to/your/markdown_folder
4. Generate & save embeddings:
python3 save_embeddings.py /full/path/to/your/markdown_folder local 64
5. Search locally:
python3 search_local.py data/embeddings.jsonl "your query" 3 64
Notes:
- Replace `/full/path/to/your/markdown_folder` with your real path (e.g. /home/vn6295337/RAG-document-assistant/sample_docs).
- This pipeline uses a local pseudo-embedding for offline testing. Replace provider branches when ready to use real APIs.