Add 7-layer caching acceleration guide
docs/CACHING.md (+780 -0)
ADDED
@@ -0,0 +1,780 @@
# ScholarMind Advanced Caching Acceleration Guide

> This document extends the existing 4-level cache hierarchy with a **7-layer caching architecture**, covering the full stack from application-level semantic caching down to the GPU KV-cache hardware layer.

---

## Cache Overview

```
ScholarMind 7-Layer Caching Acceleration Stack

Layer 1 ─── Semantic cache (GPTCache)
    "What is SnapKV?" ≈ "Explain SnapKV" → both hit the same cache entry
    Latency: ~5ms (cache hit) vs ~1500ms (LLM call)
    Savings: ~300× latency, ~100% of API cost on hits

Layer 2 ─── Retrieval-result cache (semantic)
    Similar queries reuse retrieval results, skipping vector search + graph traversal + reranking
    Latency: ~2ms vs ~300ms
    Savings: avoids repeated retrieval computation

Layer 3 ─── API provider prompt caching
    OpenAI: automatic, ≥1024 tokens → 50% input-token discount
    Anthropic: cache_control markers → 90% discount
    DeepSeek: automatic on-disk KV cache → 90% discount
    Savings: 50-90% of API cost, up to 80% lower latency

Layer 4 ─── vLLM prefix caching (APC)
    Shared KV cache for the system prompt + document chunks
    Savings: 2-8× lower TTFT (time to first token)

Layer 5 ─── RAG KV-cache reuse (LMCache/CacheBlend)
    Precompute the KV state of each document chunk; selectively recompute 5-15% when combining
    Savings: 2.2-3.3× lower TTFT, 2.8-5× higher throughput

Layer 6 ─── KV-cache compression (SnapKV/Quest)
    Long-document scenarios: keep only the ~20% of KV positions attention actually focuses on
    Savings: 3.6× faster decoding, 8.2× better memory efficiency

Layer 7 ─── Multi-turn conversation state cache
    Cache agent intermediate state + retrieval context + partial answers
    Follow-up questions skip routing + retrieval + graph queries
    Savings: ~60% lower latency on follow-ups

Full request path:
User Query
  → [L1 semantic cache]     hit? → return directly (~5ms)
  → [L2 retrieval cache]    hit? → skip retrieval, generate directly
  → [L7 conversation cache] follow-up? → reuse context
  → Retriever (vector + graph + RAPTOR)
  → Prompt assembly (static prefix + retrieved chunks + query)
  → [L3 provider cache]     system prompt + document chunks already cached → 50-90% discount
  → [L4 vLLM APC]           shared-prefix KV hit → skip prefill
  → [L5 CacheBlend]         non-prefix chunk KV reuse → partial recompute
  → [L6 SnapKV]             long-context KV compression → faster decode
  → Response → written back to L1 + L2
```

---

## Layer 1: Semantic Cache (GPTCache)

### How it works

A traditional Redis cache can only match keys exactly. In academic Q&A, however, the same question comes in many phrasings:
- "What is SnapKV?" ≈ "Explain how SnapKV works" ≈ "What does SnapKV do?"

A **semantic cache** encodes each query as a vector and runs a **similarity search** over past queries; if a semantically equivalent query is found, the cached answer is returned directly.

```
Query → embedding → vector similarity search → similarity > threshold?
          ├── yes: return cached answer (~5ms)
          └── no:  call the LLM → store the result in the cache
```

### Implementation

```python
# ===== Option A: GPTCache (recommended, 7k+ stars) =====
# pip install gptcache

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.processor.pre import get_prompt

# Initialization: FAISS vector index + SQLite storage (development setup)
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension)
)

cache.init(
    pre_embedding_func=get_prompt,       # cache key is the user query only (excludes retrieval context)
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)

# Usage: a drop-in replacement for the openai call
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are ScholarMind..."},
        {"role": "user", "content": "What is attention mechanism?"}
    ]
)
# A second, semantically similar query → cache hit, returns in ~5ms
```

```python
# ===== Option A (production): Milvus/PostgreSQL backend =====
from gptcache.embedding import OpenAI as OpenAIEmbedding
from gptcache.manager import CacheBase, VectorBase, get_data_manager

# Use Milvus as the vector backend (or Qdrant, to reuse existing infrastructure)
data_manager = get_data_manager(
    CacheBase("postgresql", sql_url="postgresql://user:pass@localhost/gptcache"),
    VectorBase("milvus", host="localhost", port=19530, dimension=1536)
    # Alternatively: VectorBase("qdrant", url="http://localhost:6333", collection_name="gptcache")
)

# Use OpenAI embeddings (same model as the retrieval pipeline, for maximum consistency)
openai_emb = OpenAIEmbedding()
cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=openai_emb.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
```

```python
# ===== Option B: LangChain SemanticCache (simpler) =====
from langchain_community.cache import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings
import langchain

langchain.llm_cache = RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.85,  # use a fairly strict threshold for the academic domain
)
# All LangChain LLM calls now go through the semantic cache automatically
```

### Key parameter tuning

| Parameter | Recommended value | Notes |
|------|--------|------|
| Similarity threshold | **0.85** (academic) | Too low → wrong answers served; too high → low hit rate; 0.75 is fine for general-purpose use |
| Embedding model | text-embedding-3-small | Same model as the retrieval pipeline, avoiding semantic drift |
| TTL | **24h** | Academic knowledge is relatively stable |
| Eviction policy | **LRU** | Least recently used |
| Cache key | **user query only** | Exclude the retrieval context; otherwise the same question with different retrieval results never hits |

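The integration example at the end of this document wires the semantic cache in through a `GPTCacheWrapper(threshold=0.85, ttl_hours=24)` helper. A minimal sketch of such a wrapper is shown below, applying the threshold and TTL values from the table; the class name, the `embed_fn` callable, and the brute-force in-memory index are illustrative assumptions rather than part of GPTCache's API.

```python
import time
import numpy as np


class GPTCacheWrapper:
    """Minimal semantic-cache sketch applying the threshold/TTL policy from the table above."""

    def __init__(self, embed_fn, threshold: float = 0.85, ttl_hours: float = 24.0):
        self.embed_fn = embed_fn          # query -> np.ndarray; same model as the retrieval pipeline
        self.threshold = threshold
        self.ttl_seconds = ttl_hours * 3600
        self.entries: list[dict] = []     # [{"vec": ..., "answer": ..., "ts": ...}]

    async def get(self, query: str) -> str | None:
        vec = self.embed_fn(query)
        now = time.time()
        # Drop expired entries (TTL), then look for the most similar live entry
        self.entries = [e for e in self.entries if now - e["ts"] < self.ttl_seconds]
        best, best_sim = None, -1.0
        for e in self.entries:
            sim = float(np.dot(vec, e["vec"]) / (np.linalg.norm(vec) * np.linalg.norm(e["vec"])))
            if sim > best_sim:
                best, best_sim = e, sim
        if best is not None and best_sim >= self.threshold:
            return best["answer"]         # semantic hit
        return None

    async def set(self, query: str, answer: str) -> None:
        # Cache key is the user query only, as recommended above
        self.entries.append({"vec": self.embed_fn(query), "answer": answer, "ts": time.time()})
```

A production version would back this with the GPTCache data manager shown earlier instead of an in-memory list; the interface (`get`/`set`) is what the integration section relies on.
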
### Expected impact

| Scenario | Hit rate | Latency saved |
|------|--------|---------|
| Rephrased follow-ups from the same user | ~70% | ~300× (5ms vs 1.5s) |
| Popular questions across users | ~30-40% | ~300× |
| Entirely new questions | 0% | None (plus ~10ms of embedding overhead) |
| **Weighted average** | **~35%** | **~50% overall QPS gain** |

---

## Layer 2: Semantic Retrieval-Result Cache

### How it works

Hybrid retrieval (vector + graph + RAPTOR + reranking) takes roughly 300ms. If two queries are semantically similar, their retrieval results are usually similar as well, so the results can be reused.

```python
import time

from qdrant_client import models


class SemanticRetrievalCache:
    """Semantic retrieval-result cache — similar queries reuse retrieval results."""

    def __init__(self, qdrant_client, collection="retrieval_cache", threshold=0.90):
        self.client = qdrant_client
        self.collection = collection
        self.threshold = threshold
        self.embed_model = load_embedding_model()  # project helper: same embedding model as retrieval

        # Create the cache collection if it does not exist yet
        if not self.client.collection_exists(collection):
            self.client.create_collection(
                collection_name=collection,
                vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
            )

    async def get_or_fetch(self, query: str, retriever) -> list:
        """Check the cache; on a miss, run retrieval and store the result."""
        query_vec = self.embed_model.encode(query)

        # 1. Search the cache for a semantically similar past query
        hits = self.client.search(
            collection_name=self.collection,
            query_vector=query_vec,
            limit=1,
            score_threshold=self.threshold,
        )

        if hits and hits[0].score >= self.threshold:
            # Cache hit — return the stored retrieval results directly
            return hits[0].payload["results"]

        # 2. Cache miss — run the full retrieval pipeline
        results = await retriever.retrieve(query, mode="hybrid")

        # 3. Store the result in the cache
        self.client.upsert(
            collection_name=self.collection,
            points=[models.PointStruct(
                id=hash(query) % (2**63),
                vector=query_vec,
                payload={
                    "query": query,
                    "results": results,
                    "timestamp": time.time(),
                },
            )]
        )

        return results
```

### Cache invalidation

```python
async def invalidate_on_new_papers(self, paper_ids: list):
    """When new papers are ingested, drop cache entries that may be affected."""
    # Strategy 1: clear everything (simple but aggressive)
    # self.client.delete_collection(self.collection)

    # Strategy 2: targeted invalidation (more work, but precise)
    # Find and delete cache entries whose results reference these paper_ids
    for paper_id in paper_ids:
        self.client.delete(
            collection_name=self.collection,
            points_selector=models.FilterSelector(
                filter=models.Filter(
                    must=[models.FieldCondition(
                        key="results[].metadata.paper_id",
                        match=models.MatchValue(value=paper_id),
                    )]
                )
            )
        )
```

---

## Layer 3: API Provider Prompt Caching

### Core idea

```
Prompt structure:
┌──────────────────────────────────────────────────
│ System prompt (fixed, ~500 tokens)                 ← identical on every request; cached by the provider
│   "You are ScholarMind, a research..."
├──────────────────────────────────────────────────
│ Retrieved paper chunks (semi-fixed, ~2000 tokens)  ← popular papers are retrieved over and over → likely cache hits
│   Paper chunk A: "Attention is..."
│   Paper chunk B: "We propose BERT..."
├──────────────────────────────────────────────────
│ User question (dynamic, ~50 tokens)                ← different every time; never cached
│   "Compare BERT and GPT-2 on GLUE"
└──────────────────────────────────────────────────

Key rule: put fixed content first and dynamic content last!
```

### OpenAI — fully automatic (zero configuration)

```python
from openai import OpenAI
client = OpenAI()

# Trick: build a static prefix of ≥1024 tokens; OpenAI caches it automatically
SYSTEM_PROMPT = """You are ScholarMind, an expert academic research assistant.
You have access to a knowledge base of 1000+ academic papers spanning NLP,
computer vision, and machine learning. When answering questions:
1. Always cite specific papers with [Author, Year] format
2. Include quantitative results where available
3. Compare methods objectively
4. Acknowledge limitations and open questions
... (pad to ≥1024 tokens)
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # ~1200 tokens, cached automatically
        {"role": "user", "content": f"""
Based on these papers:
{retrieved_chunks}

Question: {user_question}
"""}
    ]
)

# Check the cache effect
usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens
total = usage.prompt_tokens
print(f"Cache hit: {cached}/{total} tokens ({cached/total:.0%})")
# First call: 0%; subsequent calls with the same prefix: ~70-90%
```

**Cost savings**:
| Scenario | Regular price | Cached price | Savings |
|------|---------|-------------|------|
| GPT-4o input | $2.50/M | $1.25/M | 50% |
| GPT-4o-mini input | $0.15/M | $0.075/M | 50% |
| GPT-4o long prefixes (>128K) | — | up to 90% off | 90% |

### Anthropic — explicit markers (fine-grained control)

```python
import anthropic
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,                      # system instructions
            "cache_control": {"type": "ephemeral"}      # ← cache breakpoint 1
        }
    ],
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": retrieved_chunks,               # retrieved paper chunks
                "cache_control": {"type": "ephemeral"}  # ← cache breakpoint 2
            },
            {
                "type": "text",
                "text": user_question                   # dynamic part, not cached
            }
        ]
    }]
)

# Inspect cache statistics
print(f"Cache write: {response.usage.cache_creation_input_tokens} tokens")
print(f"Cache read: {response.usage.cache_read_input_tokens} tokens")
# First call: everything is a cache write; calling again within 5 minutes with
# the same prefix: everything is a cache read → 90% discount
```

**Anthropic cache pricing**:
| Type | Sonnet 4 | Haiku 3.5 |
|------|----------|-----------|
| Regular input | $3/M | $0.80/M |
| Cache write (first time) | $3.75/M (+25%) | $1.00/M (+25%) |
| Cache read (hit) | $0.30/M (**-90%**) | $0.08/M (**-90%**) |
| TTL | 5 minutes (refreshed on every hit) | 5 minutes |

### DeepSeek — automatic disk cache

```python
from openai import OpenAI
client = OpenAI(api_key="sk-xxx", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": user_question}
    ]
)

# DeepSeek caches to disk automatically, no configuration needed
usage = response.usage
print(f"Cache hit: {usage.prompt_cache_hit_tokens} tokens @ 10% price")
print(f"Cache miss: {usage.prompt_cache_miss_tokens} tokens @ 100% price")
# The disk cache persists longer than Anthropic's 5 minutes, which suits low-traffic scenarios
```

---

## Layer 4: vLLM Prefix Caching (APC)

### How it works

vLLM's Automatic Prefix Caching splits the KV cache into blocks (16-32 tokens each), with every block indexed by hash(tokens + position). When a new request arrives, vLLM matches cached blocks from the start of the prompt and skips prefill computation for the blocks it already holds.

```
Request 1: [System 500 tokens] + [Chunk A 200 tokens] + [Query 1]
    ↓ compute KV for everything, cache all blocks

Request 2: [System 500 tokens] + [Chunk A 200 tokens] + [Query 2]
    ↓ the first 700 tokens hit the cache! Only Query 2's KV needs to be computed
    → prefill drops from ~700 tokens to ~50 tokens → ~14× lower TTFT
```

|
| 410 |
+
### 配置
|
| 411 |
+
|
| 412 |
+
```bash
|
| 413 |
+
# vLLM serving — APC默认开启 (v1+)
|
| 414 |
+
vllm serve meta-llama/Llama-3.1-8B-Instruct \
|
| 415 |
+
--enable-prefix-caching \
|
| 416 |
+
--gpu-memory-utilization 0.95 \
|
| 417 |
+
--max-model-len 32768
|
| 418 |
+
|
| 419 |
+
# 查看缓存统计
|
| 420 |
+
curl http://localhost:8000/metrics | grep prefix_cache
|
| 421 |
+
# vllm_prefix_cache_hit_rate: 0.73
|
| 422 |
+
# vllm_prefix_cache_queries_total: 10000
|
| 423 |
+
```
|
| 424 |
+
|
| 425 |
+
```python
|
| 426 |
+
# Python API
|
| 427 |
+
from vllm import LLM, SamplingParams
|
| 428 |
+
|
| 429 |
+
llm = LLM(
|
| 430 |
+
model="meta-llama/Llama-3.1-8B-Instruct",
|
| 431 |
+
enable_prefix_caching=True,
|
| 432 |
+
gpu_memory_utilization=0.95,
|
| 433 |
+
)
|
| 434 |
+
|
| 435 |
+
# 关键: 所有请求共享相同的长前缀
|
| 436 |
+
SHARED_PREFIX = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
|
| 437 |
+
|
| 438 |
+
You are ScholarMind, an expert academic research assistant with access to
|
| 439 |
+
1000+ research papers. Answer precisely with citations.
|
| 440 |
+
|
| 441 |
+
<|eot_id|><|start_header_id|>user<|end_header_id|>
|
| 442 |
+
|
| 443 |
+
Based on the following paper excerpts:
|
| 444 |
+
{FREQUENTLY_RETRIEVED_CHUNKS}
|
| 445 |
+
"""
|
| 446 |
+
|
| 447 |
+
# 每次请求只改变 query 部分 — 前缀全部缓存命中
|
| 448 |
+
responses = llm.generate([
|
| 449 |
+
SHARED_PREFIX + "What is the main contribution of BERT?<|eot_id|>",
|
| 450 |
+
SHARED_PREFIX + "Compare BERT and GPT-2 on GLUE<|eot_id|>",
|
| 451 |
+
])
|
| 452 |
+
```
|
| 453 |
+
|
### ⚡ Optimization: consistent document-chunk ordering

```python
def order_chunks_for_cache(chunks: list, query: str) -> list:
    """
    Arrange retrieved chunks in a deterministic order to maximize the prefix-cache hit rate.

    Strategy: sort by chunk ID (rather than by relevance)
    → when different queries retrieve the same set of chunks, the prefix is identical → cache hit
    """
    # Deterministic sort by paper_id + page index
    return sorted(chunks, key=lambda c: (c["metadata"]["paper_id"], c["metadata"]["page"]))

# Note: this gives up a little of the "most relevant chunk first" advantage.
# Compromise: keep the top-N by relevance, sort the rest by ID — see the sketch below.
```

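A minimal sketch of that compromise ordering follows: the top `keep_by_relevance` chunks stay in relevance order, and the tail is sorted deterministically. The assumption that each chunk carries a `score` field alongside the `paper_id`/`page` metadata mirrors the function above but is not otherwise specified in this document.

```python
def order_chunks_hybrid(chunks: list, keep_by_relevance: int = 2) -> list:
    """Top-N chunks stay in relevance order; the rest are sorted deterministically for cache reuse."""
    by_relevance = sorted(chunks, key=lambda c: c.get("score", 0.0), reverse=True)
    head = by_relevance[:keep_by_relevance]
    tail = sorted(
        by_relevance[keep_by_relevance:],
        key=lambda c: (c["metadata"]["paper_id"], c["metadata"]["page"]),
    )
    return head + tail
```

With `keep_by_relevance=0` this degenerates to the fully deterministic ordering above; larger values trade prefix-cache hits for keeping the most relevant chunks up front.
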
---

## Layer 5: RAG KV-Cache Reuse (LMCache / CacheBlend)

### The problem

Standard prefix caching requires the chunks to appear in **exactly the same order**. The order of RAG retrieval results, however, changes all the time:
- Query 1 → [Chunk A, B, C]
- Query 2 → [Chunk B, A, D] ← Chunks A and B both appear, but the prefix differs

CacheBlend solves this: **precompute an independent KV state for each chunk**, then selectively recompute only 5-15% of the tokens when chunks are combined.

### Implementation

```bash
# Install LMCache (the production implementation of CacheBlend)
pip install lmcache lmcache-vllm
```

```python
# Example of integrating LMCache with vLLM
import lmcache_vllm
from vllm import LLM, SamplingParams

# LMCache manages the KV cache transparently on top of vLLM and
# supports a three-tier hierarchy: GPU VRAM → host RAM → SSD
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    # LMCache is wired in via environment variables or a config file
)

# Warm-up: precompute and cache the KV of all high-frequency document chunks
for chunk in top_1000_chunks:
    llm.encode(chunk["text"])  # precompute KV → cached in GPU + RAM

# Inference: cached chunk KV is reused automatically; only 5-15% is recomputed
response = llm.generate(
    system_prompt + chunk_a + chunk_b + user_query
)
# The KV for chunk_a and chunk_b is loaded from cache; only cross-chunk attention is selectively recomputed
```

| 513 |
+
### 效果
|
| 514 |
+
|
| 515 |
+
| 指标 | 无缓存 | vLLM APC | CacheBlend |
|
| 516 |
+
|------|--------|----------|------------|
|
| 517 |
+
| TTFT (首token) | 基准 | 2-8× 降低 (仅限相同前缀) | **2.2-3.3× 降低** (任意chunk组合) |
|
| 518 |
+
| 吞吐 | 基准 | 1.5× 提升 | **2.8-5× 提升** |
|
| 519 |
+
| 适用场景 | — | 固定前缀 | **任意RAG检索结果** |
|
| 520 |
+
|
| 521 |
+
> **论文**: CacheBlend (arxiv:2405.16444), LMCache GitHub: `github.com/LMCache/LMCache`
|
| 522 |
+
|
| 523 |
+
---
|
| 524 |
+
|
## Layer 6: KV-Cache Compression (Long-Document Scenarios)

### SnapKV — compression by attention voting

When the input is very long (several full papers), keeping the complete KV cache exhausts GPU memory. SnapKV keeps only the **~20% of positions that each attention head actually focuses on**.

```
Full KV:  [tok1, tok2, tok3, ..., tok10000]   → 160GB VRAM (70B model)
SnapKV:   [tok3, tok45, tok202, ..., tok9998] →  32GB VRAM (keeps only 20%)
          ~5× memory savings, ~3.6× faster decoding
```

```python
# pip install snapkv
# Core parameters
config = {
    "compression_ratio": 0.2,     # keep 20% of the KV positions
    "observation_window": 32,     # the last 32 tokens vote on which positions to keep
    "kernel_size": 5,             # pooling window used during voting
}

# SnapKV is integrated by patching the attention layers
# Use when: processing several full papers at once (>16K tokens)
# Result: a single A100 can handle a ~380K-token context (vs ~32K before)
```

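To illustrate the voting step, here is a minimal, framework-free sketch of how an observation window can select which KV positions to keep: the attention weights of the prompt's last few queries are summed per position, smoothed with a pooling window, and the top fraction is retained. The tensor shapes and single-head view are simplifying assumptions; the SnapKV repository implements this inside the model's attention layers.

```python
import numpy as np


def snapkv_keep_indices(attn_weights: np.ndarray, compression_ratio: float = 0.2,
                        observation_window: int = 32, kernel_size: int = 5) -> np.ndarray:
    """
    attn_weights: [num_queries, num_kv_positions] attention probabilities of the prompt's
                  last queries over all earlier KV positions (one head, for illustration).
    Returns the sorted indices of the KV positions to keep.
    """
    num_kv = attn_weights.shape[1]
    votes = attn_weights[-observation_window:].sum(axis=0)        # per-position importance

    # Smooth the votes with a small pooling window so neighbouring tokens are kept together
    pad = kernel_size // 2
    padded = np.pad(votes, (pad, pad), mode="edge")
    smoothed = np.array([padded[i:i + kernel_size].max() for i in range(num_kv)])

    keep = max(observation_window, int(num_kv * compression_ratio))
    top = np.argsort(smoothed)[-keep:]                            # highest-vote positions

    # Always keep the observation window itself (the most recent tokens)
    recent = np.arange(num_kv - observation_window, num_kv)
    return np.unique(np.concatenate([top, recent]))


# Usage sketch: 32 observation queries over 10000 KV positions → keep ~20% plus the recent window
attn = np.random.rand(32, 10000)
attn /= attn.sum(axis=1, keepdims=True)
kept = snapkv_keep_indices(attn)
print(f"kept {kept.size} of 10000 KV positions")
```
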
### Quest — query-aware sparse attention

```
Each KV page keeps metadata (per-dimension min/max of its K vectors)
→ when a new query arrives, use Q and the metadata to estimate each page's importance
→ load only the Top-K most important pages
→ 7× faster self-attention
```

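The sketch below illustrates the page-importance estimate: for each page, an upper bound on q·k is computed from the stored per-dimension min/max of the page's keys, and only the top-scoring pages are selected. The page size and the brute-force loop are illustrative assumptions; the Quest repository implements this with fused GPU kernels.

```python
import numpy as np

PAGE_SIZE = 16  # KV positions per page (illustrative)


def build_page_metadata(keys: np.ndarray) -> list[tuple[np.ndarray, np.ndarray]]:
    """keys: [num_tokens, head_dim] → per-page (k_min, k_max) over the page's tokens."""
    pages = []
    for start in range(0, keys.shape[0], PAGE_SIZE):
        page = keys[start:start + PAGE_SIZE]
        pages.append((page.min(axis=0), page.max(axis=0)))
    return pages


def select_pages(query: np.ndarray, metadata, top_k: int) -> np.ndarray:
    """Upper-bound score per page: sum_d max(q_d*k_min_d, q_d*k_max_d); keep the Top-K pages."""
    scores = np.array([
        np.maximum(query * k_min, query * k_max).sum()
        for k_min, k_max in metadata
    ])
    return np.argsort(scores)[-top_k:]


# Usage sketch: 8192 cached positions, head_dim 128 → attend over only 64 of 512 pages
keys = np.random.randn(8192, 128).astype(np.float32)
q = np.random.randn(128).astype(np.float32)
pages = select_pages(q, build_page_metadata(keys), top_k=64)
print(f"attending over {len(pages)} of {8192 // PAGE_SIZE} pages")
```
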
> **Paper**: SnapKV (arxiv:2404.14469, GitHub: `fasterdecoding/snapkv`)
> **Paper**: Quest (arxiv:2406.10774, GitHub: `mit-han-lab/quest`)

---

## Layer 7: Multi-Turn Conversation State Cache

### How it works

Follow-up questions are common in academic Q&A:
1. "How does BERT perform on GLUE?" → retrieval + reasoning + generation
2. "How about compared to GPT-2?" → **no need to retrieve the BERT information again!**

```python
import time

from langgraph.checkpoint.memory import MemorySaver


class ConversationCache:
    """Multi-turn conversation cache — avoids repeated retrieval on follow-up questions."""

    def __init__(self):
        self.checkpointer = MemorySaver()  # LangGraph's built-in state persistence
        self.context_cache = {}            # session_id → {retrieved_docs, entities, graph_context}

    async def handle_query(self, session_id: str, query: str):
        # Check whether this is a follow-up question
        if session_id in self.context_cache:
            prev = self.context_cache[session_id]

            # Follow-up detection: if the query refers to entities from the previous turn, reuse the context
            if self._is_followup(query, prev["entities"]):
                # Reuse the previous retrieval results + graph context directly
                return await self._generate_with_cached_context(
                    query=query,
                    retrieved_docs=prev["retrieved_docs"],
                    graph_context=prev["graph_context"],
                    history=prev["history"],
                )

        # Not a follow-up: run the full retrieval pipeline
        result = await full_retrieval_and_generation(query)

        # Cache the context for future follow-ups
        self.context_cache[session_id] = {
            "retrieved_docs": result["retrieved_docs"],
            "graph_context": result["graph_context"],
            "entities": result["entities"],
            "history": result["messages"],
            "timestamp": time.time(),
        }

        return result

    def _is_followup(self, query: str, prev_entities: list) -> bool:
        """Follow-up detection: pronouns, comparison phrases, or references to previous-turn entities."""
        # Mixed Chinese/English cue words, since queries arrive in both languages
        followup_signals = ["compared to", "和...比", "那", "它", "this method", "上面提到的"]
        has_pronoun = any(s in query.lower() for s in followup_signals)
        references_entity = any(e["name"].lower() in query.lower() for e in prev_entities)
        return has_pronoun or references_entity
```

### LangGraph built-in checkpoints

```python
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# Persist agent state in PostgreSQL
checkpointer = PostgresSaver.from_conn_string("postgresql://...")

# Pass the checkpointer when compiling the graph
agent = build_agent_graph().compile(checkpointer=checkpointer)

# Identify the conversation with a thread_id on every call
config = {"configurable": {"thread_id": session_id}}

# First turn: full execution
result1 = await agent.ainvoke({"query": "How does BERT perform on GLUE?"}, config)

# Follow-up: LangGraph automatically restores the previous state
result2 = await agent.ainvoke({"query": "How about compared to GPT-2?"}, config)
# → the agent already has last turn's retrieved_docs and only needs the extra GPT-2 retrieval
```

---

## Integrated Architecture: the Full Caching Pipeline

```python
class CachedQAEngine:
    """Q&A engine with all 7 cache layers wired in."""

    def __init__(self):
        # L1: semantic cache
        self.semantic_cache = GPTCacheWrapper(threshold=0.85, ttl_hours=24)

        # L2: retrieval-result cache
        self.retrieval_cache = SemanticRetrievalCache(qdrant, threshold=0.90)

        # L7: conversation state cache
        self.conversation_cache = ConversationCache()

        # L3-L6 are handled automatically by LiteLLM / vLLM / the provider

    async def query(self, session_id: str, query: str) -> dict:

        # === L1: semantic cache check ===
        cached_answer = await self.semantic_cache.get(query)
        if cached_answer:
            return {"answer": cached_answer, "cache": "L1_semantic", "latency_ms": 5}

        # === L7: follow-up check ===
        if self.conversation_cache.is_followup(session_id, query):
            result = await self.conversation_cache.handle_followup(session_id, query)
            if result:
                return {**result, "cache": "L7_followup"}

        # === L2: retrieval cache check ===
        retrieved = await self.retrieval_cache.get_or_fetch(query, self.retriever)

        # === Prompt assembly (structured to be friendly to L3/L4/L5) ===
        prompt = self._build_prompt(
            system=STATIC_SYSTEM_PROMPT,                      # → L3 provider cache
            chunks=order_chunks_for_cache(retrieved, query),  # → L4 vLLM APC cache
            query=query                                       # → dynamic part
        )

        # === LLM call (L3/L4/L5/L6 kick in automatically) ===
        answer = await self.llm.complete(prompt, task="generation")

        # === Write back to the caches ===
        await self.semantic_cache.set(query, answer)
        self.conversation_cache.update(session_id, query, retrieved, answer)

        return {"answer": answer, "cache": "miss", "retrieved": retrieved}

    def _build_prompt(self, system: str, chunks: list, query: str) -> list:
        """
        Prompt structure optimization:
        1. Fixed system prompt first (provider cache + vLLM prefix cache)
        2. Retrieved chunks in deterministic order (maximizes vLLM prefix hits)
        3. User query last (the dynamic part)
        """
        return [
            {"role": "system", "content": system},  # ≥1024 tokens → cached automatically by OpenAI
            {"role": "user", "content":
                "Based on these paper excerpts:\n\n" +
                "\n---\n".join([c["text"] for c in chunks]) +
                f"\n\nQuestion: {query}"
            }
        ]
```

---

## Performance Summary

Estimated gains from the 7 cache layers (1000 papers, ~500 queries/day):

| Cache layer | Hit rate | Latency saved | Cost saved | Implementation effort |
|------|------|------|------|------|
| L1 semantic | ~35% | 300× (5ms) | ~100% (on hits) | ⭐⭐ medium |
| L2 retrieval | ~25% | ~60× (5ms) | retrieval compute | ⭐⭐ medium |
| L3 API provider | ~70% | up to 80% | 50-90% | ⭐ low |
| L4 vLLM APC | ~60% | 2-8× TTFT | GPU compute | ⭐ low |
| L5 CacheBlend | ~40% | 2-3× TTFT | GPU compute | ⭐⭐⭐ high |
| L6 SnapKV | N/A | 3.6× decode | 5× VRAM | ⭐⭐ medium |
| L7 conversation | ~20% | ~60% (follow-ups) | retrieval + inference | ⭐⭐ medium |
| **Combined** | — | P50: 1.5s → ~400ms; P99: 4s → ~1.5s | API cost down 60%+ | |

---

## Implementation Priority

| Priority | Cache layer | Rationale | Effort |
|--------|--------|------|--------|
| **P0 (now)** | L3 provider cache | No code changes, only restructure the prompt | 2 hours |
| **P0 (now)** | L4 vLLM APC | Enabled by default, just verify the configuration | 1 hour |
| **P1 (this week)** | L1 semantic cache | A few dozen lines with GPTCache, highest payoff | 1 day |
| **P1 (this week)** | L7 conversation cache | LangGraph checkpointer; a step change for follow-up UX | 1 day |
| **P2 (next week)** | L2 retrieval cache | Reuses the existing Qdrant infrastructure | 2 days |
| **P3 (later)** | L6 SnapKV | Only needed for long-document scenarios | 3 days |
| **P3 (later)** | L5 CacheBlend | Requires LMCache integration, fairly invasive | 1 week |

+
---
|
| 753 |
+
|
| 754 |
+
## 相关论文
|
| 755 |
+
|
| 756 |
+
| 论文 | ArXiv ID | 核心贡献 |
|
| 757 |
+
|------|---------|---------|
|
| 758 |
+
| GPTCache | ACL 2023 NLPOSS | 语义缓存框架 |
|
| 759 |
+
| GPT Semantic Cache | 2411.05276 | 语义缓存基准评测 |
|
| 760 |
+
| PagedAttention (vLLM) | 2309.06180 | 分页KV Cache管理 |
|
| 761 |
+
| RAGCache | 2404.12457 | RAG专用多级KV缓存 |
|
| 762 |
+
| CacheBlend | 2405.16444 | 非前缀KV复用 |
|
| 763 |
+
| SnapKV | 2404.14469 | 注意力投票KV压缩 |
|
| 764 |
+
| Quest | 2406.10774 | 查询感知稀疏注意力 |
|
| 765 |
+
| StreamingLLM | 2309.17453 | 注意力sink+滚动窗口 |
|
| 766 |
+
| Prompt Cache | 2311.04934 | 模块化KV状态复用 |
|
| 767 |
+
| KV Cache Survey | 2412.19442 | KV Cache管理全面综述 |
|
| 768 |
+
|
---

## Open-Source Projects

| Project | GitHub | Stars | Purpose |
|------|--------|-------|------|
| GPTCache | zilliztech/GPTCache | 7k+ | Semantic caching |
| LMCache | LMCache/LMCache | — | Production implementation of CacheBlend |
| SnapKV | fasterdecoding/snapkv | 311 | KV compression |
| Quest | mit-han-lab/quest | 382 | Sparse attention |
| vLLM | vllm-project/vllm | 45k+ | APC prefix caching |
| KV Cache Survey | TreeAI-Lab/Awesome-KV-Cache-Management | 314 | Survey index |