---
title: nautilus-compass demo
emoji: 🧭
colorFrom: blue
colorTo: purple
sdk: static
app_file: index.html
pinned: false
license: mit
---

# nautilus-compass · drift detector live demo

Static, in-browser demo for nautilus-compass v1.0 · the persona-drift detector + tamper-evident memory log for long-running LLM agents.

## What you can try here

**Drift detection.** Paste a (system_prompt, response) pair. Both are split into character n-grams, and the response is Jaccard-scored against the 25 positive + 35 negative persona anchors shipped with nautilus-compass.

  • Green = response sits inside the persona anchor cone (aligned)
  • Yellow = neutral, weak signal either way
  • Red = response is closer to the negative anchors (sycophancy, fake-completion, root-cause skipping, "user won't notice", etc.)

The verdict, plus an alignment / deviation / drift_score breakdown, renders instantly. All scoring runs client-side in your browser — no upload, no tracking, no API key needed.
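The scoring described above can be sketched as follows. This is a minimal illustration, not the shipped implementation: the anchor texts, n-gram size, and the exact way alignment/deviation are combined into drift_score are assumptions here.

```javascript
// Minimal sketch of in-browser char-n-gram Jaccard scoring.
// Assumptions: 3-grams, best-match against each anchor set, and
// drift_score = deviation - alignment; nautilus-compass may differ.

function ngrams(text, n = 3) {
  const s = text.toLowerCase();
  const grams = new Set();
  for (let i = 0; i + n <= s.length; i++) grams.add(s.slice(i, i + n));
  return grams;
}

function jaccard(a, b) {
  let inter = 0;
  for (const g of a) if (b.has(g)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Score a response against positive and negative anchor sets;
// a positive drift_score means the response sits closer to the
// negative anchors (the "red" verdict above).
function driftScore(response, positiveAnchors, negativeAnchors, n = 3) {
  const r = ngrams(response, n);
  const best = (anchors) =>
    Math.max(...anchors.map((a) => jaccard(r, ngrams(a, n))));
  const alignment = best(positiveAnchors);
  const deviation = best(negativeAnchors);
  return { alignment, deviation, drift_score: deviation - alignment };
}
```

Everything here is plain `Set` arithmetic, which is why it runs client-side with no model download.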

Two pre-baked sample buttons load a clean case and a drifted case from the same fixtures the unit tests use, so you can sanity-check that the verdict matches what nautilus-compass ships.

## What needs the local install

The full pipeline used in the paper (BGE-m3 dense + bge-reranker-v2-m3 cross-encoder, 570M params, ~2GB model weights) doesn't fit a free Space and isn't this demo's point. Same for Merkle hash chain verification — it needs filesystem access to your `/.claude/projects/` session logs.

For the full stack:

```shell
pip install nautilus-compass==1.0.0
bash daemon_start.sh        # one-time per boot · downloads BGE-m3 ~2GB
compass-verify --all        # Merkle integrity scan
```

Or in any of 6 MCP-compatible clients (Claude Code · Claude Desktop · Cline · Cursor · Continue.dev · Zed) — see examples/mcp_configs/ in the repo for paste-ready configs.
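MCP client configs generally follow the standard `mcpServers` shape shown below. The server name and command here are purely hypothetical placeholders — use the paste-ready files in examples/mcp_configs/ for the real entries.

```json
{
  "mcpServers": {
    "nautilus-compass": {
      "command": "compass-mcp",
      "args": []
    }
  }
}
```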

## Headline eval numbers (locked v1.0 · 2026-05-08)

| metric | nautilus-compass | best public baseline |
|---|---|---|
| LongMemEval-S (n=500) | 56.6% | Zep 55-60% (different judge) |
| EverMemBench-Dynamic Run 1 | 44.4% (n=500) | MemOS 42.55 |
| EverMemBench-Dynamic Run 2 | 47.3% (n=497) | — |
| Drift detector ROC AUC (held-out) | 0.83 | — (no other black-box drift work) |
| Reproduction cost | $3.50 end-to-end | $50+ for GPT-4o-judge stacks |

Two papers on arXiv (drift detection · memory recall). All 228 pytest tests green. MIT license (anchors CC0).

## Local repo

github.com/chunxiaoxx/nautilus-compass