# ScholarMind Agent Latency Optimization Guide
> Goal: compress end-to-end QA latency from ~3s (composite questions) / ~1.5s (simple questions) down to **<800ms**
---
## Current Latency Bottleneck Analysis
```
Current request timeline (composite question, ~3000ms):

  Stage                     Mechanism        Latency   Parallelizable?
  complexity check          LLM call         ~150ms    serial ✗
  query decomposition       LLM call         ~300ms    serial ✗
  sub-question retrieval    vector + graph   ~300ms    parallel ✓
  sub-answer generation     LLM × N          ~700ms    parallel ✓
  fusion + rerank           reranker         ~200ms    serial ✗
  final generation          LLM call         ~500ms    streamable ✓

Optimization goals:
- eliminate unnecessary serial waiting
- merge / parallelize LLM calls
- speculative execution
- streaming output (lower perceived latency)
```
---
## The 8 Optimization Strategies
| # | Strategy | Idea | Saving |
|---|----------|------|--------|
| 1 | Speculative Execution | Start retrieval with the raw query without waiting for decomposition | ~300ms (decomposition and retrieval overlap) |
| 2 | LLM Call Merging (Batch & Merge) | Fold complexity check + decomposition into one call; generate sub-answers in parallel | ~300ms (one fewer LLM round-trip) |
| 3 | Parallel Retrieval + Prefetch | Fire vector/graph/RAPTOR paths simultaneously; preload hot-entity subgraphs | ~200ms (three serial paths → parallel) |
| 4 | Streaming + Incremental Output | First token returns immediately; perceived latency drops from 3s to ~500ms | ~2500ms perceived |
| 5 | Lightweight Router (rules + small model) | Complexity check via regex + heuristics, no LLM call | ~150ms (one LLM call eliminated) |
| 6 | Decomposition Cache | Similar composite questions reuse past decomposition structures | ~300ms on a hit (skips the decomposition LLM) |
| 7 | Async Pipelining (Pipeline Parallelism) | Feed partial results of each step to the next; don't wait for full completion | ~300ms (eliminates inter-step waiting) |
| 8 | Model-Level Acceleration (smaller models + speculative decoding) | 3B model for routing/decomposition; 8B + speculative decoding for generation | ~40% of generation latency |
---
## Strategy 1: Speculative Execution
### Principle
Instead of waiting for the LLM decomposition to finish, **optimistically assume** that the user's raw query is itself a valid retrieval query and launch retrieval immediately. When decomposition completes, run top-up retrieval only for the sub-queries it adds.
```
Before (serial):
[decompose 300ms] ──wait──▶ [retrieve 300ms] ──wait──▶ [generate]
Total: generation starts only after 600ms

After (speculative, parallel):
[decompose 300ms] ─────────────────────────▶ decomposition done, top-up retrieval
[raw-query retrieval 300ms] ──▶ results ready! (only newly added sub-queries left)
        ↓
merge both result sets → generate (almost zero waiting)
Total: generation can start after ~300ms (saves 300ms)
```
### Implementation
```python
import asyncio

async def speculative_retrieval(query: str) -> dict:
    """Speculative execution: decomposition and retrieval run in parallel."""
    # Launch both at once: (1) query decomposition, (2) direct retrieval on the raw query
    decompose_task = asyncio.create_task(decompose_query(query))
    speculative_task = asyncio.create_task(
        hybrid_retriever.retrieve(query, mode="hybrid", top_k=10)
    )
    # Wait for both (whichever finishes first simply waits for the other)
    decomposition, speculative_docs = await asyncio.gather(
        decompose_task, speculative_task
    )
    # With the decomposition in hand, retrieve only the sub-queries
    # that the speculative pass did not cover
    covered_by_speculation = estimate_coverage(speculative_docs, decomposition)
    additional_tasks = []
    for sq in decomposition.sub_questions:
        if sq.id not in covered_by_speculation and not sq.depends_on:
            additional_tasks.append(
                hybrid_retriever.retrieve(sq.question, mode=sq.type, top_k=5)
            )
    # Top-up retrieval, also in parallel
    if additional_tasks:
        additional_docs = await asyncio.gather(*additional_tasks)
        all_docs = speculative_docs + [d for docs in additional_docs for d in docs]
    else:
        all_docs = speculative_docs  # the speculative pass was enough!
    return {"docs": all_docs, "decomposition": decomposition}
```
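`estimate_coverage` is referenced above but not shown in this snippet. A minimal keyword-overlap sketch, assuming documents are dicts with a `"text"` field and that half of a sub-question's key terms appearing in the speculative results counts as "covered" (both are tunable assumptions):

```python
from dataclasses import dataclass

@dataclass
class SubQuestion:
    id: str
    question: str

@dataclass
class Decomposition:
    sub_questions: list

def estimate_coverage(speculative_docs: list, decomposition) -> set:
    """Mark a sub-question as covered when at least half of its key terms
    (tokens longer than 3 chars) already appear in the speculative docs.
    The doc shape (dicts with a "text" key) is an assumption."""
    corpus = " ".join(d["text"].lower() for d in speculative_docs)
    covered = set()
    for sq in decomposition.sub_questions:
        terms = [t for t in sq.question.lower().split() if len(t) > 3]
        hits = sum(1 for t in terms if t in corpus)
        if terms and hits / len(terms) >= 0.5:
            covered.add(sq.id)
    return covered
```

Anything below the overlap threshold triggers a top-up retrieval for that sub-question.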
---
## Strategy 2: LLM Call Merging
### Principle
Merge several serial LLM calls into fewer round-trips:
```
Before (4 LLM calls):
1. complexity check (~150ms)
2. query decomposition (~300ms)
3. sub-answer generation × N (~700ms, but parallelizable)
4. final synthesis (~500ms)
Total: at least 1650ms of pure LLM time (assuming 3 of the 4 run serially)

After (2-3 calls):
1. complexity + decomposition merged into one call (~350ms) ← one round-trip saved
2. sub-answers generated in parallel (~700ms in parallel → bounded by the slowest)
3. final synthesis, streamed (~500ms, but first token at ~50ms)
Total: ~1050ms, and the user starts seeing output after ~400ms
```
### Implementation
```python
# Merged: complexity check + decomposition → one call
UNIFIED_ROUTING_PROMPT = """Analyze this query in ONE step:
1. Is it SIMPLE (single focused question) or COMPOSITE (multiple sub-questions)?
2. If COMPOSITE, decompose into sub-questions with types and dependencies.
3. If SIMPLE, classify as: factual | reasoning | global
Query: {query}
Return JSON:
- If simple: {{"complexity": "simple", "type": "factual|reasoning|global"}}
- If composite: {{"complexity": "composite", "sub_questions": [...], ...}}
"""

async def unified_route_and_decompose(query: str) -> dict:
    """One LLM call covers both the complexity check and routing/decomposition."""
    result = await llm.complete(
        UNIFIED_ROUTING_PROMPT.format(query=query),
        task="routing",  # local model, low latency
    )
    return result  # two steps, one call
```
```python
# Sub-answers: generated in parallel (not serially!)
async def parallel_sub_answers(sub_questions: list, retrieved_docs: dict) -> list:
    """Generate answers for all dependency-free sub-questions in parallel."""
    tasks = []
    for sq in sub_questions:
        if not sq.depends_on:  # no dependencies → parallelizable
            tasks.append(generate_sub_answer(sq, retrieved_docs[sq.id]))
    # asyncio.gather runs all LLM calls concurrently
    results = await asyncio.gather(*tasks)
    return results
```
---
## Strategy 3: Parallel Retrieval + Prefetch
### Principle
The three retrieval paths (vector + graph + RAPTOR) should never run serially:
```
Before:
[vector 50ms] → [graph 200ms] → [RAPTOR 100ms] → [rerank 100ms]
Total: ~450ms

After:
[vector 50ms  ]─┐
[graph 200ms  ]─┼── wait for the slowest (200ms) ──▶ [rerank 100ms]
[RAPTOR 100ms ]─┘
Total: ~300ms (saves 150ms)
```
### Implementation
```python
async def parallel_hybrid_retrieve(query: str, mode: str = "hybrid") -> list:
    """All retrieval paths fully in parallel."""
    tasks = {}
    if mode in ("factual", "hybrid"):
        tasks["vector"] = asyncio.create_task(qdrant_search(query, top_k=10))
        tasks["sparse"] = asyncio.create_task(qdrant_sparse_search(query, top_k=10))
    if mode in ("reasoning", "hybrid"):
        tasks["graph"] = asyncio.create_task(neo4j_graph_search(query, limit=10))
    if mode in ("global", "hybrid"):
        tasks["raptor"] = asyncio.create_task(raptor_tree_search(query, limit=5))
    # Wait for every parallel task to finish
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    # Keep each source's ranked list and fuse with RRF
    ranked_lists = [r for r in results if not isinstance(r, Exception)]
    return rrf_fusion(ranked_lists)
```
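The `rrf_fusion` helper is not defined in this snippet. A minimal Reciprocal Rank Fusion sketch, assuming each source hands back a ranked list of document ids (`k=60` is the constant from the original RRF formulation):

```python
def rrf_fusion(ranked_lists: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    ranked_lists is a list of per-source ranked id lists (an assumption
    about the document shape used elsewhere in this guide)."""
    scores = {}
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked high by multiple sources rise to the top without any score calibration between sources, which is exactly why RRF suits heterogeneous retrievers like vector + graph + RAPTOR.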
### Hot-Entity Prefetch
```python
class GraphPrefetcher:
    """Prefetch 2-hop subgraphs of hot entities in the background,
    so queries that hit them pay zero graph latency."""

    def __init__(self, neo4j_driver, top_k: int = 500):
        self.driver = neo4j_driver
        self.cache = {}  # entity_name → subgraph
        self.top_k = top_k

    async def start(self):
        """Preload high-frequency entities at startup
        (async, so it cannot run inside __init__)."""
        # Top entities ranked by degree (how often they are referenced)
        cypher = """
        MATCH (n)-[r]-()
        WITH n, count(r) AS degree
        ORDER BY degree DESC LIMIT $k
        RETURN n.name AS name
        """
        top_entities = await self.driver.run(cypher, k=self.top_k)
        # Prefetch each entity's 2-hop subgraph
        for entity in top_entities:
            subgraph = await self._fetch_subgraph(entity["name"], depth=2)
            self.cache[entity["name"]] = subgraph

    def get_instant(self, entity_name: str):
        """At query time: cache hit → 0ms, miss → fall back to the normal path."""
        return self.cache.get(entity_name)
```
---
## Strategy 4: Streaming Output
### Principle
Perceived latency is the **time to first token (TTFT)**, not the total response time.
```
Before (non-streaming):
user asks ──── [3000ms of blank waiting] ──── the full answer appears at once
User perception: "it hung for 3 seconds"

After (streaming):
user asks ── [500ms] ── first token arrives, text streams in ── [2500ms of output]
User perception: "it started answering after half a second"
```
### Implementation
```python
from fastapi.responses import StreamingResponse
import json

async def stream_answer(query: str):
    """Streaming: start generating as soon as retrieval completes."""
    # Phase 1: fast retrieval (not streamed, but quick)
    docs = await parallel_hybrid_retrieve(query)
    # Phase 2: push the retrieval sources first ("thinking..." state)
    yield json.dumps({"type": "sources", "data": format_sources(docs[:3])}) + "\n"
    # Phase 3: stream the generated answer
    prompt = build_prompt(query, docs)
    async for chunk in llm.stream(prompt, task="generation"):
        yield json.dumps({"type": "token", "data": chunk}) + "\n"
    # Phase 4: append citation info
    yield json.dumps({"type": "citations", "data": extract_citations(docs)}) + "\n"

@app.post("/api/v1/query/stream")
async def query_stream(req: QueryRequest):
    return StreamingResponse(
        stream_answer(req.query),
        media_type="application/x-ndjson"
    )
```
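On the client side, the NDJSON stream is consumed line by line. A sketch of how a consumer might fold it back into an answer (the event names match the `stream_answer` generator above; the function itself is hypothetical, and with `httpx` the `lines` argument would be `resp.aiter_lines()` instead of an in-memory list):

```python
import json

def parse_ndjson_events(lines):
    """Fold NDJSON lines from /api/v1/query/stream into (answer_text, sources).
    A real client would render each "token" event incrementally; collecting
    them here keeps the sketch testable."""
    tokens, sources = [], []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        event = json.loads(line)
        if event["type"] == "token":
            tokens.append(event["data"])
        elif event["type"] == "sources":
            sources = event["data"]
    return "".join(tokens), sources
```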
### Incremental Pipeline Streaming (advanced)
```python
async def incremental_stream(query: str):
    """Incremental pipeline: push each sub-answer as soon as it is ready,
    without waiting for all of them."""
    # Decompose
    decomp = await unified_route_and_decompose(query)
    yield {"type": "plan", "data": [sq.question for sq in decomp.sub_questions]}
    # Run sub-questions in parallel; push each one the moment it completes
    tasks = {
        sq.id: asyncio.create_task(process_sub_question(sq))
        for sq in decomp.sub_questions if not sq.depends_on
    }
    all_results = []
    for coro in asyncio.as_completed(tasks.values()):
        result = await coro
        all_results.append(result)
        yield {"type": "sub_answer", "data": result}
        # the user sees partial answers immediately!
    # Final synthesis (streamed)
    async for token in synthesize_stream(query, all_results):
        yield {"type": "token", "data": token}
```
---
## Strategy 5: Lightweight Router (eliminating an LLM call)
### Principle
The complexity check does not need an LLM: **rules + features** can decide in under 1ms:
```python
import re

def fast_complexity_check(query: str) -> str:
    """Rules engine: <1ms, no LLM call."""
    # Feature extraction (covers both halfwidth and fullwidth question marks)
    question_marks = query.count("?") + query.count("๏ผŸ")
    has_conjunction = bool(re.search(
        r'(and also|ไปฅๅŠ|ๅฆๅค–|ๅŒๆ—ถ|besides|moreover|่ฟ˜ๆœ‰|ๅนถไธ”|versus|compared to|ๅฏนๆฏ”)',
        query, re.IGNORECASE
    ))
    has_multiple_verbs = len(re.findall(
        r'(explain|compare|list|describe|who|what|how|why|่งฃ้‡Š|ๅฏนๆฏ”|ๅˆ—ไธพ|ๆ่ฟฐ|ๆ˜ฏ่ฐ|ๆ˜ฏไป€ไนˆ)',
        query, re.IGNORECASE
    )) >= 2
    # Entity count (via a cached GLiNER pass or a simple NER)
    entity_count = count_entities_fast(query)
    # Decision rule
    composite_score = (
        (question_marks >= 2) * 3 +
        has_conjunction * 2 +
        has_multiple_verbs * 2 +
        (entity_count >= 3) * 1 +
        (len(query) > 100) * 1
    )
    return "composite" if composite_score >= 3 else "simple"

def fast_query_type(query: str) -> str:
    """Rule-based routing: <1ms."""
    query_lower = query.lower()
    if any(w in query_lower for w in ["่ถ‹ๅŠฟ", "overview", "็ปผ่ฟฐ", "ๅ‘ๅฑ•", "trend", "survey"]):
        return "global"
    elif any(w in query_lower for w in ["ไธบไป€ไนˆ", "how does", "ๅŽŸ็†", "ๆœบๅˆถ", "compare", "ๅฏนๆฏ”", "ๅŒบๅˆซ"]):
        return "reasoning"
    else:
        return "factual"
```
**Effect**: eliminates one LLM call (150ms → 0ms) at ~90% accuracy; borderline cases fall back to the LLM.
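`count_entities_fast` is referenced above but not shown. A crude stand-in, using capitalization heuristics instead of the cached GLiNER pass the comment mentions (so the counts are only illustrative):

```python
import re

def count_entities_fast(query: str) -> int:
    """Cheap entity-count proxy: distinct capitalized tokens and acronyms.
    A real implementation would run a cached NER model; this heuristic only
    exists to give the router a sub-millisecond signal."""
    candidates = re.findall(r"\b(?:[A-Z][a-zA-Z0-9-]+|[A-Z]{2,})\b", query)
    # Question words etc. that capitalization alone would miscount
    stopwords = {"What", "How", "Why", "Who", "Is", "Compare", "Explain", "The"}
    return len({c for c in candidates if c not in stopwords})
```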
---
## Strategy 6: Decomposition Cache
```python
class DecompositionCache:
    """Cache past decomposition structures so similar composite questions reuse them."""

    def __init__(self, threshold: float = 0.88):
        self.entries = []  # list of (embedding, decomposition) pairs
        self.threshold = threshold

    async def get_or_decompose(self, query: str) -> DecomposedQuery:
        # Cache check: a semantically similar past decomposition?
        query_vec = embed(query)
        best_match = find_nearest(self.entries, query_vec)
        if best_match and best_match[0] > self.threshold:
            # Hit! Reuse the structure, but swap in the new entity names
            _score, cached_decomp = best_match
            return adapt_entities(cached_decomp, query)
        # Miss: decompose with the LLM, then store
        decomp = await decompose_query(query)
        self.entries.append((query_vec, decomp))
        return decomp
```
**Example**:
- "Compare BERT and GPT-2 on GLUE" → decomposition structure gets cached
- "Compare T5 and BART on SuperGLUE" → cache hit! reuse the structure, swap in the new entity names
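`find_nearest` above is also undefined in this snippet. A linear-scan cosine-similarity sketch over `(vector, value)` pairs, fine at cache sizes of a few thousand entries (beyond that, swap in an ANN index):

```python
import math

def find_nearest(entries, query_vec):
    """Return the (score, value) pair whose vector has the highest cosine
    similarity to query_vec, or None if the cache is empty."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    best = None
    for vec, value in entries:
        score = cosine(vec, query_vec)
        if best is None or score > best[0]:
            best = (score, value)
    return best
```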
---
## Strategy 7: Async Pipelining
### Principle
Don't wait for one step to finish completely before starting the next: **partial results are enough to trigger the next step**.
```
Before (full wait between steps):
[retrieve all 10 docs] ──wait──▶ [generate from all 10 docs]
Total: 300ms + 500ms = 800ms

After (pipelined):
[retrieve: doc1 50ms]   → fed to the generator immediately, prefill starts
[retrieve: doc2 80ms]   → appended to the generator context
[retrieve: doc3-10 ...] → keep appending
        ↓
[generate: first token after ~200ms] → streaming output
Total: the user starts seeing output after ~250ms!
```
### Implementation
```python
import asyncio
from asyncio import Queue

async def pipeline_retrieve_and_generate(query: str):
    """Pipelining: retrieval results are streamed into the generator."""
    doc_queue = Queue()

    async def retriever_producer():
        """Retriever: push each batch of documents onto the queue as it arrives."""
        # Vector search is the fastest, so it returns first
        vec_docs = await qdrant_search(query, top_k=5)
        await doc_queue.put(("vector", vec_docs))
        # Graph is slower
        graph_docs = await neo4j_search(query, limit=5)
        await doc_queue.put(("graph", graph_docs))
        # RAPTOR is the slowest
        raptor_docs = await raptor_search(query, limit=3)
        await doc_queue.put(("raptor", raptor_docs))
        await doc_queue.put(None)  # signal: retrieval finished

    async def generator_consumer():
        """Generator: start as soon as enough documents are available."""
        all_docs = []
        min_docs_to_start = 3  # three documents are enough to begin
        while True:
            item = await doc_queue.get()
            if item is None:
                break
            all_docs.extend(item[1])
            if len(all_docs) >= min_docs_to_start:
                # Start streaming generation (later docs can still be
                # appended via context extension)
                break
        # Generate with the documents gathered so far
        prompt = build_prompt(query, all_docs)
        async for token in llm.stream(prompt):
            yield token

    # Run producer and consumer concurrently
    asyncio.create_task(retriever_producer())
    async for token in generator_consumer():
        yield token
```
---
## Strategy 8: Model-Level Acceleration
### 8a. Tiered Model Selection
```python
MODEL_LATENCY_MAP = {
    # task → (model, estimated latency)
    "routing":    ("Qwen2.5-3B-Instruct",  "~50ms"),   # smallest model, fastest
    "decompose":  ("Qwen2.5-7B-Instruct",  "~150ms"),  # mid-size model
    "sub_answer": ("Qwen2.5-14B-Instruct", "~300ms"),  # quality matters here
    "synthesis":  ("GPT-4o-mini",          "~500ms"),  # final synthesis via API
}
# For comparison, if every step used the 14B model:
# routing(14B): ~150ms, decompose(14B): ~300ms → 200ms slower with no quality gain
```
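The `task=` argument seen in earlier snippets (`llm.complete(..., task="routing")`) can be dispatched against this table. A trivial sketch; the mid-tier fallback for unknown tasks is an assumption:

```python
MODEL_TIERS = {
    "routing":    "Qwen2.5-3B-Instruct",
    "decompose":  "Qwen2.5-7B-Instruct",
    "sub_answer": "Qwen2.5-14B-Instruct",
    "synthesis":  "GPT-4o-mini",
}

def pick_model(task: str) -> str:
    """Map a pipeline task to the smallest model that is good enough for it;
    unknown tasks fall back to the mid-tier model."""
    return MODEL_TIERS.get(task, "Qwen2.5-7B-Instruct")
```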
### 8b. Speculative Decoding
```bash
# vLLM speculative decoding: a small model drafts, the large model verifies
# (note: the 1B draft model comes from the Llama 3.2 family; 3.1 has no 1B size)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-disable-mqa-scorer
# Effect: 1.5-2.5× faster generation with unchanged output quality
```
### 8c. Structured-Output Acceleration
```python
# Constrain generation with Outlines / vLLM structured output:
# fewer wasted tokens, the JSON structure comes out directly
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
# JSON-schema constraint → faster generation (no unnecessary filler tokens)
guided_params = GuidedDecodingParams(json_schema=DecomposedQuery.model_json_schema())
params = SamplingParams(max_tokens=512, guided_decoding=guided_params)
```
---
## ไผ˜ๅŒ–ๅŽ็š„ๅฎŒๆ•ดๆ—ถ้—ด็บฟ
```
=== Simple question (optimized, ~600ms to first token) ===
0ms      rule-based routing            (<1ms)
1ms      HyDE rewrite                  (in parallel with retrieval)
50ms     three-way parallel retrieval  (bounded by the slowest path, ~200ms)
250ms    rerank                        (~100ms)
600ms    streaming generation starts   ← user sees the first token
1200ms   ...output complete

=== Composite question (optimized, ~800ms to first token) ===
0ms      rule-based complexity check          (<1ms)
1ms      decomposition + speculative retrieval (in parallel, ~300ms)
350ms    top-up retrieval                     (parallel, ~150ms)
500ms    sub-answers in parallel              (~500ms)
800ms    streaming synthesis starts           ← first token
2000ms   ...output complete

Speculation: retrieval on the raw query runs at the same time as LLM
decomposition, so retrieval never waits for decomposition to finish.
```
---
## Latency Comparison Summary

| Stage | Before | After | Technique |
|-------|--------|-------|-----------|
| Complexity check | 150ms | <1ms | rules engine instead of LLM |
| Query decomposition | 300ms (serial) | 0ms (overlapped) | speculative execution + call merging |
| Retrieval | 450ms (serial) | 200ms (parallel) | asyncio.gather |
| Sub-answer generation | 700ms (serial) | 350ms (parallel) | parallel LLM calls + smaller models |
| Rerank | 200ms | 100ms | quantized reranker |
| Final generation (first token) | 500ms (blocking) | 50ms (streaming) | streaming + pipelining |
| **Total (simple)** | ~1500ms | ~600ms | 60% reduction |
| **Total (composite)** | ~3000ms | ~800ms | 73% reduction |
| **Perceived (first token)** | ~1500ms | ~400ms | 73% reduction |
| **Cache hit** | n/a | ~5ms | L1 semantic cache |
---
## Implementation Priorities

| Priority | Strategy | Latency Gain | Effort | Risk |
|----------|----------|--------------|--------|------|
| **P0** | Strategy 4: streaming output | perceived -73% | half a day | none |
| **P0** | Strategy 3: parallel retrieval | -150ms | half a day | none |
| **P0** | Strategy 5: rule-based router | -150ms | 2 hours | borderline cases |
| **P1** | Strategy 1: speculative execution | -300ms | 1 day | wasted retrieval on a miss |
| **P1** | Strategy 2: LLM call merging | -300ms | 1 day | prompt engineering |
| **P2** | Strategy 7: pipelining | -200ms | 2 days | high complexity |
| **P2** | Strategy 8: model acceleration | -40% generation | 2 days | config tuning |
| **P3** | Strategy 6: decomposition cache | -300ms on hit | 1 day | cache invalidation |
---
## Putting It Together: The Low-Latency Agent Pipeline
```python
async def ultra_fast_query(query: str, session_id: str):
    """
    The full low-latency agent:
    - simple questions:    ~400ms to first token
    - composite questions: ~800ms to first token
    - cache hit:           ~5ms
    """
    # ===== L1: semantic cache (5ms) =====
    cached = await semantic_cache.get(query)
    if cached:
        yield {"type": "cached", "data": cached}
        return

    # ===== Rule-based routing (<1ms) =====
    complexity = fast_complexity_check(query)

    if complexity == "simple":
        query_type = fast_query_type(query)
        # Retrieval + HyDE in parallel
        retrieve_task = asyncio.create_task(
            parallel_hybrid_retrieve(query, mode=query_type)
        )
        hyde_task = asyncio.create_task(hyde_expand(query))
        docs, hyde_docs = await asyncio.gather(retrieve_task, hyde_task)
        all_docs = rrf_fusion([docs, hyde_docs])
        reranked = await fast_rerank(query, all_docs, top_k=5)
        # Streaming generation
        async for token in llm.stream(build_prompt(query, reranked)):
            yield {"type": "token", "data": token}

    else:  # composite
        # Speculative execution: decomposition and raw-query retrieval in parallel
        decompose_task = asyncio.create_task(
            unified_route_and_decompose(query)
        )
        speculative_task = asyncio.create_task(
            parallel_hybrid_retrieve(query, mode="hybrid")
        )
        # Push the retrieval sources first ("thinking..." state)
        spec_docs = await speculative_task
        yield {"type": "sources", "data": format_sources(spec_docs[:3])}
        # Decomposition done
        decomp = await decompose_task
        yield {"type": "plan", "data": [sq.question for sq in decomp.sub_questions]}
        # Top-up retrieval for the sub-queries (parallel)
        additional = await parallel_sub_retrieval(decomp, already_have=spec_docs)
        all_docs = deduplicate(spec_docs + additional)
        reranked = await fast_rerank(query, all_docs, top_k=10)
        # Parallel sub-answers + streaming synthesis
        sub_answers = await parallel_sub_answers(decomp, reranked)
        async for token in synthesize_stream(query, sub_answers, reranked):
            yield {"type": "token", "data": token}

    # Write the answer back to the cache
    full_answer = collect_tokens()  # helper assumed to have accumulated the streamed tokens
    await semantic_cache.set(query, full_answer)
```
---
## ๅ‚่€ƒ่ฎบๆ–‡
| ่ฎบๆ–‡ | ๆŠ€ๆœฏ | ๅบ”็”จ |
|------|------|------|
| Speculative Decoding (2302.01318) | ๅฐๆจกๅž‹่‰็จฟ+ๅคงๆจกๅž‹้ชŒ่ฏ | ็ญ–็•ฅ8b |
| Medusa (2401.10774) | ๅคšๅคดๆŠ•ๆœบ่งฃ็  | ็ญ–็•ฅ8b |
| ChunkAttention (2402.15220) | ๅ…ฑไบซๅ‰็ผ€KV็ฎก็† | ็ญ–็•ฅ7 |
| Skeleton-of-Thought (2307.15337) | ๅนถ่กŒๅ†…ๅฎน็”Ÿๆˆ | ็ญ–็•ฅ7 |
| SGLang (2312.07104) | RadixAttention + ๅ‰็ซฏ่ฐƒๅบฆ | ็ญ–็•ฅ4/7 |