# ScholarMind Agent Latency Optimization Playbook

> Goal: compress QA end-to-end latency from ~3s (composite questions) / ~1.5s (simple questions) down to **<800ms**

---
## Current Latency Bottleneck Analysis

```
Current request timeline (composite question, ~3000ms):

0ms        200ms       500ms        1200ms       2000ms      2800ms  3000ms
 |----------|-----------|------------|------------|-----------|-------|
 complexity  query       sub-question sub-answer   fusion +    final
 check       decompose   retrieval    generation   rerank      generation
 LLM call    LLM call    vector+graph LLM × N      Reranker    LLM call
 ~150ms      ~300ms      ~300ms       ~700ms       ~200ms      ~500ms
 serial      serial      parallelizable parallelizable serial  streamable

Optimization levers:
- eliminate unnecessary serial waits
- merge / parallelize LLM calls
- speculative execution
- streaming output (perceived latency ↓)
```

---

## The 8 Optimization Strategies

1. **Speculative Execution** — don't wait for decomposition; optimistically start retrieval with the raw query. Saves ~300ms (decomposition runs in parallel with retrieval).
2. **Merged LLM Calls (Batch & Merge)** — complexity check + decomposition collapsed into one call; sub-answers generated in parallel. Saves ~300ms (one fewer LLM round trip).
3. **Parallel Retrieval + Prefetch** — vector / graph / RAPTOR fired together; hot-entity subgraphs preloaded. Saves ~200ms (three paths serial → parallel).
4. **Streaming Generation + Incremental Output** — return from the first token; perceived latency drops from 3s to ~500ms. Perceived saving ~2500ms.
5. **Lightweight Router (rules + small model instead of an LLM)** — complexity check via regex + heuristics, no LLM call. Saves ~150ms (eliminates one LLM call).
6. **Decomposition Cache** — similar composite questions reuse historical decomposition results. Saves ~300ms on a hit (skips the decomposition LLM).
7. **Async Pipelining (Pipeline Parallelism)** — partial results of one step feed the next immediately, without waiting for full completion. Saves ~300ms (eliminates inter-step waits).
8. **Model-Level Acceleration (smaller models + speculative decoding)** — routing/decomposition on a 3B model; generation on 8B + speculative decoding. Saves ~40% of generation latency.

---

## Strategy 1: Speculative Execution

### Principle

Don't wait for the LLM decomposition to finish. **Optimistically assume** the user's raw query is itself a valid retrieval query and launch retrieval immediately. When decomposition completes, run supplemental retrieval only for the sub-queries it exposed.

```
Before (serial):
[decompose 300ms] ──wait──► [retrieve 300ms] ──wait──► [generate]
Total: 600ms before generation starts

After (speculative parallel):
[decompose 300ms]           ─────────────► done → supplemental retrieval
[raw-query retrieval 300ms] ──► results ready! (then fetch only the
                                 sub-queries speculation didn't cover)
          ↓
merge both result sets → generate (near-zero wait)
Total: generation can start after ~300ms (saves 300ms)
```

### Implementation

```python
import asyncio

async def speculative_retrieval(query: str) -> dict:
    """Speculative execution: decomposition and retrieval run in parallel."""

    # Launch both at once: (1) query decomposition (2) direct retrieval with the raw query
    decompose_task = asyncio.create_task(decompose_query(query))
    speculative_task = asyncio.create_task(
        hybrid_retriever.retrieve(query, mode="hybrid", top_k=10)
    )

    # Wait for both to complete
    decomposition, speculative_docs = await asyncio.gather(
        decompose_task, speculative_task
    )

    # Once decomposition is in, retrieve only the sub-queries
    # the speculative pass did not already cover
    covered_by_speculation = estimate_coverage(speculative_docs, decomposition)

    additional_tasks = []
    for sq in decomposition.sub_questions:
        if sq.id not in covered_by_speculation and not sq.depends_on:
            additional_tasks.append(
                hybrid_retriever.retrieve(sq.question, mode=sq.type, top_k=5)
            )

    # Supplemental retrieval, also in parallel
    if additional_tasks:
        additional_docs = await asyncio.gather(*additional_tasks)
        all_docs = speculative_docs + [d for docs in additional_docs for d in docs]
    else:
        all_docs = speculative_docs  # the speculative retrieval was already enough!

    return {"docs": all_docs, "decomposition": decomposition}
```
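
`estimate_coverage` is referenced above but never defined. A minimal lexical sketch, assuming each retrieved doc is a dict with a `text` field and each sub-question carries `id` and `question` attributes (both assumptions, not from the original); an embedding-based overlap check would be a natural upgrade:

```python
def estimate_coverage(speculative_docs: list, decomposition) -> set:
    """Return ids of sub-questions whose key terms already appear in the
    speculatively retrieved docs (a crude lexical heuristic)."""
    corpus = " ".join(d["text"].lower() for d in speculative_docs)
    covered = set()
    for sq in decomposition.sub_questions:
        terms = [t for t in sq.question.lower().split() if len(t) > 3]
        hits = sum(1 for t in terms if t in corpus)
        if terms and hits / len(terms) >= 0.6:  # ≥60% of terms present → treat as covered
            covered.add(sq.id)
    return covered
```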

---

## Strategy 2: Merging LLM Calls

### Principle

Merge multiple serial LLM calls into fewer calls:

```
Before (4 LLM calls):
1. complexity check          (~150ms)
2. query decomposition       (~300ms)
3. sub-answer generation ×N  (~700ms, but parallelizable)
4. final synthesis           (~500ms)
Total: at least 1650ms of pure LLM time (assuming 3 of the 4 run serially)

After (2-3 calls):
1. complexity + decomposition merged into 1 call (~350ms) → one fewer round trip
2. sub-answers generated in parallel (~700ms in parallel → as slow as the slowest)
3. final synthesis, streamed (~500ms total, but first token at ~50ms)
Total: ~1050ms, and the user starts seeing output after ~400ms
```

### Implementation

```python
# Merged: complexity check + decomposition → one call does both
UNIFIED_ROUTING_PROMPT = """Analyze this query in ONE step:

1. Is it SIMPLE (single focused question) or COMPOSITE (multiple sub-questions)?
2. If COMPOSITE, decompose into sub-questions with types and dependencies.
3. If SIMPLE, classify as: factual | reasoning | global

Query: {query}

Return JSON:
- If simple: {{"complexity": "simple", "type": "factual|reasoning|global"}}
- If composite: {{"complexity": "composite", "sub_questions": [...], ...}}
"""

async def unified_route_and_decompose(query: str) -> dict:
    """One LLM call covers both complexity check and routing/decomposition."""
    result = await llm.complete(
        UNIFIED_ROUTING_PROMPT.format(query=query),
        task="routing",  # local model, low latency
    )
    return result  # one call, two steps' worth of work
```
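
LLMs occasionally wrap JSON in prose or emit malformed output, and a merged call makes that a single point of failure. A defensive parser with a graceful fallback (`parse_routing_result` and the simple/factual default are illustrative additions, not from the original):

```python
import json
import re

def parse_routing_result(raw: str) -> dict:
    """Parse the unified routing JSON; fall back to treating the query as simple/factual."""
    match = re.search(r'\{.*\}', raw, re.DOTALL)  # tolerate prose around the JSON
    if match:
        try:
            parsed = json.loads(match.group(0))
            if parsed.get("complexity") in ("simple", "composite"):
                return parsed
        except json.JSONDecodeError:
            pass
    # Fallback: degrade gracefully instead of failing the whole request
    return {"complexity": "simple", "type": "factual"}
```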

```python
# Sub-question answers: generated in parallel (not serially!)
async def parallel_sub_answers(sub_questions: list, retrieved_docs: dict) -> list:
    """Generate answers for all dependency-free sub-questions in parallel."""
    tasks = []
    for sq in sub_questions:
        if not sq.depends_on:  # no dependencies → safe to parallelize
            tasks.append(generate_sub_answer(sq, retrieved_docs[sq.id]))

    # asyncio.gather runs all the LLM calls concurrently
    results = await asyncio.gather(*tasks)
    return results
```
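
Sub-questions with `depends_on` set can still be parallelized in waves: each wave runs everything whose dependencies are already answered. A sketch of that scheduler (the wave loop and `answer_fn` callback are assumptions; the original only handles the dependency-free case):

```python
import asyncio

async def answer_in_waves(sub_questions: list, answer_fn) -> dict:
    """Answer dependency-free sub-questions first, then each wave whose deps are done."""
    done: dict = {}
    pending = list(sub_questions)
    while pending:
        wave = [sq for sq in pending if all(d in done for d in sq.depends_on)]
        if not wave:
            raise ValueError("dependency cycle among sub-questions")
        # Everything inside one wave runs concurrently
        results = await asyncio.gather(*(answer_fn(sq) for sq in wave))
        done.update({sq.id: r for sq, r in zip(wave, results)})
        pending = [sq for sq in pending if sq.id not in done]
    return done
```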

---

## Strategy 3: Parallelized Retrieval + Prefetch

### Principle

The three retrieval paths (vector + graph + RAPTOR) should not run serially:

```
Before:
[vector 50ms] → [graph 200ms] → [RAPTOR 100ms] → [rerank 100ms]
Total: ~450ms

After:
[vector 50ms ]──┐
[graph 200ms ]──┼── wait for the slowest (200ms) ──► [rerank 100ms]
[RAPTOR 100ms]──┘
Total: ~300ms (saves 150ms)
```

### Implementation

```python
async def parallel_hybrid_retrieve(query: str, mode: str = "hybrid") -> list:
    """Fully parallel three-path retrieval."""

    tasks = {}

    if mode in ("factual", "hybrid"):
        tasks["vector"] = asyncio.create_task(
            qdrant_search(query, top_k=10)
        )
        tasks["sparse"] = asyncio.create_task(
            qdrant_sparse_search(query, top_k=10)
        )

    if mode in ("reasoning", "hybrid"):
        tasks["graph"] = asyncio.create_task(
            neo4j_graph_search(query, limit=10)
        )

    if mode in ("global", "hybrid"):
        tasks["raptor"] = asyncio.create_task(
            raptor_tree_search(query, limit=5)
        )

    # Wait for every parallel task to finish
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)

    # Merge + RRF fusion (a failed path is skipped, not fatal)
    all_docs = []
    for result in results:
        if not isinstance(result, Exception):
            all_docs.extend(result)

    return rrf_fusion(all_docs)
```
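
`rrf_fusion` is referenced but undefined. A standard Reciprocal Rank Fusion sketch; note that RRF needs each path's *ranked list* (per-list ranks), so in practice the un-flattened per-source results would be passed rather than the merged `all_docs`. Doc ids stand in for full doc objects here:

```python
def rrf_fusion(ranked_lists: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict = {}
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```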

### Hot-Entity Prefetch

```python
class GraphPrefetcher:
    """Prefetch 2-hop subgraphs for hot entities ahead of time → near-zero graph latency at query time."""

    def __init__(self, neo4j_driver, top_k: int = 500):
        self.driver = neo4j_driver
        self.top_k = top_k
        self.cache = {}  # entity_name → subgraph

    async def start(self):
        """Preload high-frequency entities at startup (async: run via `await` or `asyncio.create_task`)."""
        # Top entities ranked by reference count (relationship degree)
        cypher = """
        MATCH (n)-[r]-()
        WITH n, count(r) AS degree
        ORDER BY degree DESC LIMIT $k
        RETURN n.name AS name
        """
        top_entities = await self.driver.run(cypher, k=self.top_k)

        # Prefetch every entity's 2-hop subgraph in parallel
        names = [e["name"] for e in top_entities]
        subgraphs = await asyncio.gather(
            *(self._fetch_subgraph(name, depth=2) for name in names)
        )
        self.cache = dict(zip(names, subgraphs))

    def get_instant(self, entity_name: str):
        """At query time: cache hit → 0ms, miss → fall back to the normal path."""
        return self.cache.get(entity_name)
```

---

## Strategy 4: Streaming Output

### Principle

Perceived latency = **time to first token (TTFT)**, not total response time.

```
Before (non-streaming):
user asks ──── [3000ms of blank waiting] ──── the full answer appears at once
Perceived: "it hung for 3 seconds"

After (streaming):
user asks ── [500ms] ── first token arrives, text appears word by word ── [2500ms of steady output]
Perceived: "it started answering within half a second"
```

### Implementation

```python
from fastapi.responses import StreamingResponse
import json

async def stream_answer(query: str):
    """Streaming output: start generating the moment retrieval completes."""

    # Phase 1: fast retrieval (not streamed, but quick)
    docs = await parallel_hybrid_retrieve(query)

    # Phase 2: emit the retrieval sources first, so the UI can show "thinking..."
    yield json.dumps({"type": "sources", "data": format_sources(docs[:3])}) + "\n"

    # Phase 3: stream the generated answer token by token
    prompt = build_prompt(query, docs)
    async for chunk in llm.stream(prompt, task="generation"):
        yield json.dumps({"type": "token", "data": chunk}) + "\n"

    # Phase 4: append the citation info
    yield json.dumps({"type": "citations", "data": extract_citations(docs)}) + "\n"

@app.post("/api/v1/query/stream")
async def query_stream(req: QueryRequest):
    return StreamingResponse(
        stream_answer(req.query),
        media_type="application/x-ndjson"
    )
```
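
On the wire the endpoint emits one JSON object per line (NDJSON). A stdlib-only sketch of how a client can fold that stream back together — the event names match the handler above, while the grouped output shape is an illustrative assumption:

```python
import json

def parse_ndjson_events(payload: str) -> dict:
    """Group a raw NDJSON stream into {tokens, sources, citations}, concatenating tokens."""
    grouped = {"tokens": "", "sources": [], "citations": []}
    for line in payload.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event["type"] == "token":
            grouped["tokens"] += event["data"]  # incremental answer text
        elif event["type"] == "sources":
            grouped["sources"] = event["data"]
        elif event["type"] == "citations":
            grouped["citations"] = event["data"]
    return grouped

# Example: events in the order stream_answer would emit them
raw = (
    '{"type": "sources", "data": ["doc1"]}\n'
    '{"type": "token", "data": "BERT uses "}\n'
    '{"type": "token", "data": "bidirectional attention."}\n'
)
print(parse_ndjson_events(raw)["tokens"])  # BERT uses bidirectional attention.
```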

### Incremental Pipeline Streaming (Advanced)

```python
async def incremental_stream(query: str):
    """Incremental pipeline: push each sub-answer as soon as it is generated — don't wait for all of them."""

    # Decompose
    decomp = await unified_route_and_decompose(query)
    yield {"type": "plan", "data": [sq.question for sq in decomp.sub_questions]}

    # Run sub-questions in parallel; push each one the moment it completes
    tasks = {
        sq.id: asyncio.create_task(process_sub_question(sq))
        for sq in decomp.sub_questions if not sq.depends_on
    }

    all_results = []
    for coro in asyncio.as_completed(tasks.values()):
        result = await coro
        all_results.append(result)
        yield {"type": "sub_answer", "data": result}
        # The user sees partial answers immediately!

    # Final synthesis (streamed)
    async for token in synthesize_stream(query, all_results):
        yield {"type": "token", "data": token}
```

---

## Strategy 5: Lightweight Router (Eliminating an LLM Call)

### Principle

The complexity check doesn't need an LLM — rules + features can do it in under 1ms:
```python
import re

def fast_complexity_check(query: str) -> str:
    """Rule engine: <1ms, zero LLM calls."""

    # Feature extraction (the patterns keep Chinese keywords for Chinese-language queries)
    question_marks = query.count("?") + query.count("？")
    has_conjunction = bool(re.search(
        r'(and also|以及|另外|同时|besides|moreover|还有|并且|versus|compared to|对比)',
        query, re.IGNORECASE
    ))
    has_multiple_verbs = len(re.findall(
        r'(explain|compare|list|describe|who|what|how|why|解释|对比|列举|描述|是谁|是什么)',
        query, re.IGNORECASE
    )) >= 2

    # Entity count (via cached GLiNER or a simple NER pass)
    entity_count = count_entities_fast(query)

    # Decision rules
    composite_score = (
        (question_marks >= 2) * 3 +
        has_conjunction * 2 +
        has_multiple_verbs * 2 +
        (entity_count >= 3) * 1 +
        (len(query) > 100) * 1
    )

    return "composite" if composite_score >= 3 else "simple"

def fast_query_type(query: str) -> str:
    """Rule-based routing: <1ms."""
    query_lower = query.lower()

    if any(w in query_lower for w in ["趋势", "overview", "综述", "发展", "trend", "survey"]):
        return "global"
    elif any(w in query_lower for w in ["为什么", "how does", "原理", "机制", "compare", "对比", "区别"]):
        return "reasoning"
    else:
        return "factual"
```

**Effect**: eliminates one LLM call (150ms → 0ms) at ~90% accuracy; edge cases fall back to the LLM.
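
For a self-contained sanity check, here is an English-only condensation of the router with `count_entities_fast` stubbed by a naive capitalized-token counter — the stub and the `_en` variant are hypothetical; a real deployment would use a cached NER model:

```python
import re

def count_entities_fast(query: str) -> int:
    """Hypothetical stand-in for cached NER: count capitalized tokens (BERT, GPT-2, ...)."""
    return len(re.findall(r'\b[A-Z][A-Za-z0-9-]+\b', query))

def fast_complexity_check_en(query: str) -> str:
    """English-only condensation of the rule router above."""
    question_marks = query.count("?")
    has_conjunction = bool(re.search(
        r'(and also|besides|moreover|versus|compared to)', query, re.I))
    multi_verbs = len(re.findall(
        r'(explain|compare|list|describe|who|what|how|why)', query, re.I)) >= 2
    score = ((question_marks >= 2) * 3 + has_conjunction * 2 + multi_verbs * 2
             + (count_entities_fast(query) >= 3) + (len(query) > 100))
    return "composite" if score >= 3 else "simple"

print(fast_complexity_check_en("What is BERT?"))  # simple
print(fast_complexity_check_en(
    "Compare BERT and GPT-2 on GLUE, and also explain why RoBERTa helps?"
))  # composite
```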

---

## Strategy 6: Decomposition Cache

```python
class DecompositionCache:
    """Cache historical decompositions → similar composite questions reuse them."""

    def __init__(self, threshold: float = 0.88):
        # (embedding, decomposition) pairs; vectors aren't hashable, so no dict keys
        self.entries = []
        self.threshold = threshold

    async def get_or_decompose(self, query: str) -> DecomposedQuery:
        # Cache check: find the most semantically similar historical decomposition
        query_vec = embed(query)
        best_match = find_nearest(self.entries, query_vec)

        if best_match and best_match.score > self.threshold:
            # Hit! Reuse the decomposition structure, swapping in the new entity names
            cached_decomp = best_match.value
            adapted = adapt_entities(cached_decomp, query)
            return adapted

        # Miss: decompose with the LLM → store for future reuse
        decomp = await decompose_query(query)
        self.entries.append((query_vec, decomp))
        return decomp
```

**Example**:
- "Compare BERT and GPT-2 on GLUE" → the decomposition gets cached
- "Compare T5 and BART on SuperGLUE" → hit! reuse the structure, swap the entity names
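
`find_nearest` is left undefined above; a minimal linear-scan sketch over (embedding, value) pairs using cosine similarity — fine for a few thousand cache entries, after which an ANN index would be the natural replacement (the `Match` result type is illustrative):

```python
import math
from dataclasses import dataclass

@dataclass
class Match:
    score: float
    value: object

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def find_nearest(entries, query_vec):
    """Linear scan: return the entry most similar to query_vec, or None if empty."""
    best = None
    for vec, value in entries:
        score = cosine(vec, query_vec)
        if best is None or score > best.score:
            best = Match(score=score, value=value)
    return best
```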

---

## Strategy 7: Async Pipelining

### Principle

Don't wait for one step to fully finish before starting the next. **Partial results are enough to trigger the next step.**

```
Before (full wait between steps):
[retrieve: all 10 docs returned] ──wait──► [generate the answer from 10 docs]
Total: 300ms + 500ms = 800ms

After (pipelined):
[retrieve: doc1  50ms]  → fed to the generator immediately, prefill starts
[retrieve: doc2  80ms]  → appended to the generator context
[retrieve: doc3-10 ...] → keep appending
        ↓
[generate: first token after ~200ms] → streamed out
Total: the user sees output after ~250ms!
```

### Implementation

```python
import asyncio
from asyncio import Queue

async def pipeline_retrieve_and_generate(query: str):
    """Pipelining: retrieval results are fed to the generator as they stream in."""

    doc_queue = Queue()

    async def retriever_producer():
        """Retriever: push each batch of docs into the queue as soon as it lands."""
        # Fire all three paths at once instead of awaiting them one by one
        paths = {
            "vector": asyncio.create_task(qdrant_search(query, top_k=5)),  # fastest
            "graph": asyncio.create_task(neo4j_search(query, limit=5)),    # slower
            "raptor": asyncio.create_task(raptor_search(query, limit=3)),  # slowest
        }
        for name, task in paths.items():  # vector is fastest, so this order ≈ completion order
            await doc_queue.put((name, await task))

        await doc_queue.put(None)  # signal: retrieval finished

    async def generator_consumer():
        """Generator: start generating as soon as there are enough docs."""
        all_docs = []
        min_docs_to_start = 3  # three docs are enough to begin

        while True:
            item = await doc_queue.get()
            if item is None:
                break
            all_docs.extend(item[1])

            if len(all_docs) >= min_docs_to_start:
                # Start streaming generation (later docs can be appended via context extension)
                break

        # Generate with whatever docs we have so far
        prompt = build_prompt(query, all_docs)
        async for token in llm.stream(prompt):
            yield token

    # Run producer and consumer concurrently
    asyncio.create_task(retriever_producer())
    async for token in generator_consumer():
        yield token
```
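
The queue handoff can be exercised without any retrieval backend. A toy version in which `asyncio.sleep` stands in for the three retrieval paths (names, delays, and doc ids are all illustrative): the consumer stops pulling as soon as three "docs" have arrived, exactly like the real consumer above:

```python
import asyncio

async def demo_pipeline() -> list:
    """Toy pipeline: consumer starts once 3 'docs' have arrived, ignoring later ones."""
    queue: asyncio.Queue = asyncio.Queue()

    async def producer():
        for name, delay, docs in [
            ("vector", 0.01, ["v1", "v2"]),
            ("graph", 0.02, ["g1"]),
            ("raptor", 0.03, ["r1"]),
        ]:
            await asyncio.sleep(delay)  # stand-in for retrieval latency
            await queue.put((name, docs))
        await queue.put(None)

    async def consumer():
        docs = []
        while True:
            item = await queue.get()
            if item is None:
                break
            docs.extend(item[1])
            if len(docs) >= 3:  # enough context to start generating
                break
        return docs

    asyncio.create_task(producer())
    return await consumer()

print(asyncio.run(demo_pipeline()))  # ['v1', 'v2', 'g1']
```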

---

## Strategy 8: Model-Level Acceleration

### 8a. Tiered Model Selection

```python
MODEL_LATENCY_MAP = {
    # task → (model, estimated latency)
    "routing": ("Qwen2.5-3B-Instruct", "~50ms"),       # smallest model, fastest
    "decompose": ("Qwen2.5-7B-Instruct", "~150ms"),    # mid-size model
    "sub_answer": ("Qwen2.5-14B-Instruct", "~300ms"),  # higher quality bar
    "synthesis": ("GPT-4o-mini", "~500ms"),            # final synthesis via API
}

# For comparison, if every step used the 14B model:
# routing(14B): ~150ms, decompose(14B): ~300ms → 200ms wasted with no quality gain
```
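
The map implies a small dispatch step at call sites. A hypothetical `pick_model` helper (the name and the synthesis-tier default are illustrative, not from the original):

```python
MODEL_LATENCY_MAP = {
    "routing": ("Qwen2.5-3B-Instruct", "~50ms"),
    "decompose": ("Qwen2.5-7B-Instruct", "~150ms"),
    "sub_answer": ("Qwen2.5-14B-Instruct", "~300ms"),
    "synthesis": ("GPT-4o-mini", "~500ms"),
}

def pick_model(task: str) -> str:
    """Resolve a pipeline task to its assigned model, defaulting to the synthesis tier."""
    model, _latency = MODEL_LATENCY_MAP.get(task, MODEL_LATENCY_MAP["synthesis"])
    return model

print(pick_model("routing"))  # Qwen2.5-3B-Instruct
```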

### 8b. Speculative Decoding

```bash
# vLLM speculative decoding: a small model drafts, the large model verifies
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-disable-mqa-scorer

# Effect: 1.5-2.5× faster generation with unchanged output quality
```

### 8c. Structured-Output Acceleration

```python
# Constrain generation with Outlines / vLLM structured output:
# no wasted filler tokens — the model emits the JSON structure directly

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# JSON-schema constraint → faster generation (skips unnecessary "thinking out loud" tokens)
guided_params = GuidedDecodingParams(json_schema=DecomposedQuery.model_json_schema())
params = SamplingParams(max_tokens=512, guided_decoding=guided_params)
```

---

## End-to-End Timeline After Optimization

```
=== Simple question (optimized, ~600ms to first token) ===

0ms    1ms      50ms        250ms        600ms             1200ms
 |------|--------|-----------|------------|-----------------|
 rule    HyDE     3-path      rerank       streaming          ...output done
 routing expand   parallel    (100ms)      generation starts
 <1ms    (par.)   retrieval                → user sees first token
                  (slowest 200ms)

=== Composite question (optimized, ~800ms to first token) ===

0ms    1ms     350ms          500ms         800ms         1500ms    2000ms
 |------|-------|--------------|-------------|-------------|---------|
 rule    decompose +    supplemental   parallel      streaming        ...done
 check   speculative    retrieval      sub-answers   synthesis starts
 <1ms    retrieval      (par. 150ms)   (par. 500ms)  (first token)
         (par. 300ms)

 └─ speculation: raw-query retrieval runs concurrently with LLM decomposition,
    so retrieval never waits for decomposition to finish
```

---

## Latency Comparison Summary

| Stage | Before | After | Technique |
|-------|--------|-------|-----------|
| Complexity check | 150ms | <1ms | rule engine instead of LLM |
| Query decomposition | 300ms (serial) | 0ms (parallel) | speculative execution + merged calls |
| Retrieval | 450ms (serial) | 200ms (parallel) | asyncio.gather |
| Sub-answer generation | 700ms (serial) | 350ms (parallel) | parallel LLM calls + small models |
| Reranking | 200ms | 100ms | lightweight reranker |
| Final generation (first token) | 500ms (wait) | 50ms (streamed) | streaming + pipelining |
| **Total (simple)** | ~1500ms | ~600ms | 60% reduction |
| **Total (composite)** | ~3000ms | ~800ms | 73% reduction |
| **Perceived (first token)** | ~1500ms | ~400ms | 73% reduction |
| **Cache hit** | — | ~5ms | L1 semantic cache |

---

## Implementation Priorities

| Priority | Strategy | Latency gain | Effort | Risk |
|----------|----------|--------------|--------|------|
| **P0** | Strategy 4: streaming output | perceived −73% | 0.5 day | none |
| **P0** | Strategy 3: parallel retrieval | −150ms | 0.5 day | none |
| **P0** | Strategy 5: rule-based router | −150ms | 2 hours | edge cases |
| **P1** | Strategy 1: speculative execution | −300ms | 1 day | wasted retrievals |
| **P1** | Strategy 2: merged LLM calls | −300ms | 1 day | prompt engineering |
| **P2** | Strategy 7: pipelining | −200ms | 2 days | high complexity |
| **P2** | Strategy 8: model acceleration | −40% generation | 2 days | config tuning |
| **P3** | Strategy 6: decomposition cache | −300ms on hit | 1 day | cache invalidation |

---

## Final Architecture: the Low-Latency Agent Pipeline

```python
async def ultra_fast_query(query: str, session_id: str):
    """
    Final low-latency agent:
    - simple questions: ~400ms to first token
    - composite questions: ~800ms to first token
    - cache hit: ~5ms
    """

    # ===== L1: semantic cache (5ms) =====
    cached = await semantic_cache.get(query)
    if cached:
        yield {"type": "cached", "data": cached}
        return

    collected_tokens = []  # accumulate the streamed answer for the cache write-back

    # ===== Rule-based routing (<1ms) =====
    complexity = fast_complexity_check(query)

    if complexity == "simple":
        query_type = fast_query_type(query)

        # Retrieval + HyDE expansion in parallel
        retrieve_task = asyncio.create_task(
            parallel_hybrid_retrieve(query, mode=query_type)
        )
        hyde_task = asyncio.create_task(hyde_expand(query))

        docs, hyde_docs = await asyncio.gather(retrieve_task, hyde_task)
        all_docs = rrf_fusion(docs + hyde_docs)
        reranked = await fast_rerank(query, all_docs, top_k=5)

        # Streaming generation
        async for token in llm.stream(build_prompt(query, reranked)):
            collected_tokens.append(token)
            yield {"type": "token", "data": token}

    else:  # composite
        # Speculative execution: decomposition ∥ raw-query retrieval
        decompose_task = asyncio.create_task(
            unified_route_and_decompose(query)
        )
        speculative_task = asyncio.create_task(
            parallel_hybrid_retrieve(query, mode="hybrid")
        )

        # Push the retrieval sources first so the UI can show progress
        spec_docs = await speculative_task
        yield {"type": "sources", "data": format_sources(spec_docs[:3])}

        # Decomposition done
        decomp = await decompose_task
        yield {"type": "plan", "data": [sq.question for sq in decomp.sub_questions]}

        # Supplemental sub-query retrieval (parallel)
        additional = await parallel_sub_retrieval(decomp, already_have=spec_docs)
        all_docs = deduplicate(spec_docs + additional)
        reranked = await fast_rerank(query, all_docs, top_k=10)

        # Parallel sub-answers + streaming synthesis
        sub_answers = await parallel_sub_answers(decomp.sub_questions, reranked)
        async for token in synthesize_stream(query, sub_answers, reranked):
            collected_tokens.append(token)
            yield {"type": "token", "data": token}

    # Write the full answer back to the cache
    await semantic_cache.set(query, "".join(collected_tokens))
```

---

## Reference Papers

| Paper | Technique | Applied in |
|-------|-----------|------------|
| Speculative Decoding (2302.01318) | small-model drafting + large-model verification | Strategy 8b |
| Medusa (2401.10774) | multi-head speculative decoding | Strategy 8b |
| ChunkAttention (2402.15220) | shared-prefix KV cache management | Strategy 7 |
| Skeleton-of-Thought (2307.15337) | parallel content generation | Strategy 7 |
| SGLang (2312.07104) | RadixAttention + frontend scheduling | Strategies 4/7 |
|