
# Chakravyuh Serving Harnesses

Three deployment paths for the Chakravyuh Analyzer LoRA, illustrating the on-device, on-server, on-cloud spectrum claimed in the README.

| Harness | File | Target | Hardware | Status |
|---|---|---|---|---|
| HF Space (Gradio + FastAPI) | `server/app.py` | Live demo | HF Spaces (Docker SDK, port 8000) | ✅ deployed at https://ujjwalpardeshi-chakravyuh.hf.space |
| vLLM (server-grade) | `serving/vllm_compose.yml` | OpenAI-compatible `/v1/chat/completions` endpoint | A10G or better; CUDA 12 | ✅ scaffolded |
| Ollama (laptop / phone-class) | `serving/ollama_modelfile` | `ollama run` on a Pixel 8 / M1 MacBook | CPU + 8 GB RAM (q4_k_m) | ⚠️ requires GGUF release (see C.8 in WIN_PLAN.md) |

The HF Space path is the canonical demo for judges. The vLLM path is for anyone wanting to integrate Chakravyuh into a production-grade inference pipeline. The Ollama path closes the "fits on a phone" claim.


## vLLM (server-grade)

Boots a vLLM server with the v2 LoRA pre-loaded against the Qwen2.5-7B-Instruct base, exposing an OpenAI-compatible /v1/chat/completions endpoint.

```bash
docker compose -f serving/vllm_compose.yml up
```

After ~60 s warm-up:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chakravyuh-analyzer-lora-v2",
    "messages": [
      {"role": "system", "content": "You are Chakravyuh Analyzer..."},
      {"role": "user", "content": "Hi I am from SBI, your account is frozen, share OTP"}
    ],
    "max_tokens": 160,
    "temperature": 0.0
  }' | jq
```

Expected: a chat-completions JSON response whose message content is the Analyzer's strict-JSON output containing `score`, `signals`, and `explanation`.
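
For orientation, the Analyzer payload inside `choices[0].message.content` should look roughly like the sketch below. Field values and signal names are illustrative placeholders, not captured model output; the authoritative schema is whatever the Analyzer system prompt enforces.

```json
{
  "score": 0.93,
  "signals": ["bank_impersonation", "otp_request", "urgency"],
  "explanation": "The message impersonates SBI, uses a frozen-account threat to create urgency, and asks the user to share an OTP."
}
```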

GPU requirement: A10G or better, CUDA 12. ~14 GB VRAM for bf16 inference, ~8 GB with AWQ quantization.

Limitations: vLLM 0.6+ supports LoRA adapters, but the load-time syntax has shifted across versions; the compose file pins to a known-good version. Update `vllm_compose.yml`'s `image:` field if your environment needs a different vLLM version.
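
For orientation, a compose file of roughly this shape works against vLLM's OpenAI-server image with run-time LoRA loading via `--enable-lora` and `--lora-modules`. The image tag, adapter mount path, and rank below are placeholder assumptions; `serving/vllm_compose.yml` in this repo is authoritative:

```yaml
# Hypothetical sketch only; the real file is serving/vllm_compose.yml.
services:
  vllm:
    image: vllm/vllm-openai:v0.6.3        # pin to a known-good vLLM release
    ports:
      - "8000:8000"
    volumes:
      # adapter directory mounted into the container (placeholder path)
      - ./adapters/chakravyuh-analyzer-lora-v2:/adapters/chakravyuh-analyzer-lora-v2
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --enable-lora
      --lora-modules chakravyuh-analyzer-lora-v2=/adapters/chakravyuh-analyzer-lora-v2
      --max-lora-rank 64
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```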


## Ollama (phone / laptop-class)

Boots a local Ollama instance running the merged-and-quantized Chakravyuh v2.

```bash
# One-time setup: pull the GGUF artifact (when published)
ollama pull hf.co/ujjwalpardeshi/chakravyuh-v2-gguf:q4_k_m

# Run interactively
ollama run hf.co/ujjwalpardeshi/chakravyuh-v2-gguf:q4_k_m

# Or build locally from this Modelfile (if the Hub artifact is not yet up)
ollama create chakravyuh -f serving/ollama_modelfile
ollama run chakravyuh
```
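
For reference, an Ollama Modelfile for this setup uses the standard `FROM` / `PARAMETER` / `SYSTEM` directives. The sketch below is only a guess at its shape; the GGUF filename, parameter values, and system prompt are placeholders, and `serving/ollama_modelfile` in this repo is the real definition:

```
# Hypothetical sketch; see serving/ollama_modelfile for the actual file.
FROM ./chakravyuh-v2-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
SYSTEM """You are Chakravyuh Analyzer. Reply only with strict JSON containing score, signals, explanation."""
```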

Hardware: Tested target is Pixel 8 (8 GB RAM, Tensor G3) at q4_k_m quantization, ~10 tok/s.

Limitation: the GGUF artifact at the Hub path above is planned but not yet shipped. C.8 in WIN_PLAN.md covers the GGUF release workflow.
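
The usual route to such an artifact is to merge the LoRA into the base weights, convert the merged checkpoint to GGUF with llama.cpp, then quantize. The commands below are a generic sketch of that route (llama.cpp's converter and quantizer, with placeholder paths), not necessarily the exact workflow C.8 prescribes:

```bash
# Generic merge -> GGUF -> quantize sketch; all paths are placeholders.
# Step 1 (not shown): merge the LoRA into Qwen2.5-7B-Instruct, e.g. via peft's
# merge_and_unload(), and save the merged checkpoint to ./chakravyuh-v2-merged.

# Step 2: convert the merged HF checkpoint to an f16 GGUF with llama.cpp
python convert_hf_to_gguf.py ./chakravyuh-v2-merged --outtype f16 --outfile chakravyuh-v2-f16.gguf

# Step 3: quantize to Q4_K_M for phone-class RAM budgets
./llama-quantize chakravyuh-v2-f16.gguf chakravyuh-v2-q4_k_m.gguf Q4_K_M
```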


## Why three harnesses?

The README claim is "on-device, on-server, on-cloud — same model." Each harness backs one wedge of that claim with a runnable artifact rather than a paragraph:

- HF Space → on-cloud (zero-install demo).
- vLLM → on-server (production integration story).
- Ollama → on-device (the "fits on a phone" claim).

Judges who care about deployability will probe one of the three; we ship all three so no probe goes unanswered.

## Cross-references