# 🏗️ ScholarMind: Production-Grade Academic Knowledge Base QA & Knowledge Graph System

## System Overview

ScholarMind is a production-grade intelligent knowledge system for **1000+ academic PDF papers**. It integrates:

- **Deep PDF parsing**: high-accuracy OCR (formulas, tables, figures) based on the MinerU 2.5 VLM
- **Automatic knowledge graph construction**: extracts entities and relations from papers to build a domain knowledge graph
- **Hybrid retrieval QA**: three-way fusion of GraphRAG, vector retrieval, and BM25 sparse retrieval
- **Multi-model support**: local deployment (vLLM/Ollama) and external APIs (OpenAI/Anthropic/DeepSeek)
- **Agent orchestration**: multi-agent collaboration built on LangGraph, with multi-hop reasoning

> **Key metric**: a single A100 80G can fully parse 1000 papers (~10,000 pages) in about 80 minutes

---

## Table of Contents

1. [System Architecture Overview](#1-system-architecture-overview)
2. [PDF Parsing Layer: MinerU Pipeline](#2-pdf-parsing-layer)
3. [Knowledge Extraction Layer: Entity and Relation Extraction](#3-knowledge-extraction-layer)
4. [Knowledge Graph Layer: Graph Construction and Storage](#4-knowledge-graph-layer)
5. [Index Layer: Multi-Index Construction](#5-index-layer)
6. [Retrieval Layer: Hybrid Retrieval and Reranking](#6-retrieval-layer)
7. [Agent Orchestration Layer: Intelligent QA](#7-agent-orchestration-layer)
8. [Unified LLM Access Layer](#8-unified-llm-access-layer)
9. [System Deployment Architecture](#9-system-deployment-architecture)
10. [Technology Comparison](#10-technology-comparison)
11. [Key Papers and Open-Source Projects](#11-key-papers-and-open-source-projects)
12. [Project Structure](#12-project-structure)

---

## 1. System Architecture Overview

```
ScholarMind system architecture (top to bottom)

User layer: Web UI, API calls
    │
    ▼
FastAPI Gateway: /upload  /query  /graph  /status  /chat  (WebSocket, SSE)
    │
    ▼
┌─ Agent orchestration layer (LangGraph)
│    Router Agent       intent classification
│    Retrieval Agent    hybrid retrieval
│    Reasoning Agent    multi-hop reasoning
│    Graph Agent        graph queries
│    Summary Agent      answer generation
└───┬─
    ▼
┌─ Unified LLM access layer (LiteLLM Proxy)
│    vLLM               local
│    Ollama             local
│    OpenAI / Claude    external API
│    DeepSeek           external API
│    Gemini             external API
└─

┌─ Retrieval layer (Hybrid Retrieval)
│    Dense Vector (Qdrant) ──┐
│    Sparse BM25 (Qdrant)  ──┼──▶ RRF / weighted fusion ──▶ Cross-Encoder Reranker
│    Graph Query (Neo4j)   ──┘
└─

┌─ Index layer (Multi-Index)
│    Vector index             Qdrant, Dense + Sparse
│    Knowledge graph index    Neo4j 5.x, Entity/Relation
│    RAPTOR summary tree      recursive clustering → summary → re-embedding
│                             (Paper → Section → Paragraph)
└───┬─
    ▼
┌─ Knowledge extraction layer (Knowledge Extraction)
│    Entity extraction (NER)     GLiNER 440M, zero-shot, custom labels
│    Relation extraction (RE)    LLMGraphTransformer + Graphusion fusion and deduplication
└─

┌─ PDF parsing layer (MinerU Pipeline)
│    PDF queue (Celery + Redis)
│      ─▶ MinerU 2.5 VLM (vLLM backend, 2 pg/s; layout + OCR + formulas + tables)
│      ─▶ Format conversion (JSON → MD, structuring)
│      ─▶ Metadata extraction (title / authors / DOI, sections / references)
└─

┌─ Storage layer (Storage)
│    PostgreSQL    metadata
│    Qdrant        vector index
│    Neo4j         knowledge graph
│    Redis         cache / queue
│    MinIO / S3    PDF storage
└─
```

---

## 2. PDF Parsing Layer

### 2.1 Technology Choice: MinerU 2.5 VLM

| Metric | MinerU 2.5 | Marker | Nougat | PyMuPDF |
|------|-----------|--------|--------|---------|
| Text accuracy on academic papers (Edit Distance ↓) | **0.047** | 0.080 | 0.365 | N/A (digital PDFs only) |
| Formula recognition (CDM ↑) | **88.46** | 17.6 | 15.1 | ❌ |
| Table recognition (TEDS ↑) | **88.22** | 67.6 | 39.9 | ❌ |
| Throughput (A100, pages/s) | **2.12** | ~5 | ~0.5 | ~100 |
| Scanned document support | ✅ | ⚠️ | ❌ | ❌ |

> **Benchmark source**: OmniDocBench (CVPR 2025, arxiv:2412.07626)

### 2.2 Architecture Design

```python
import fitz  # PyMuPDF


# Hybrid routing: digital PDFs go to PyMuPDF (fast), complex PDFs go to MinerU 2.5 (accurate)
class PDFRouter:
    """Choose a parsing engine based on PDF features."""

    def route(self, pdf_path: str) -> str:
        # The context manager closes the document once the features are computed
        with fitz.open(pdf_path) as doc:
            page_count = len(doc)
            if page_count == 0:
                raise ValueError(f"PDF has no pages: {pdf_path}")
            avg_chars = sum(len(p.get_text()) for p in doc) / page_count
            has_images = any(p.get_images() for p in doc)

        if avg_chars > 500 and not has_images:
            return "pymupdf_fast"      # text-only digital PDF, parsed by PyMuPDF in seconds
        elif avg_chars > 200:
            return "mineru_pipeline"   # digital PDF with figures/tables, pipeline mode (CPU)
        else:
            return "mineru_vlm"        # scanned or complex layout, VLM mode (GPU)
```

### 2.3 Batch Processing Pipeline

```
PDF batch processing pipeline
─────────────────────────────
PDF files (upload / batch)
    │
    ▼
PDF router (feature detection)
    │
    ▼
Celery worker pool  ◀────▶  Redis task queue
    W1: MinerU VLM
    W2: MinerU VLM
    W3: Pipeline
    W4: PyMuPDF
    │
    ├──▶ Structured output (JSON + MD)
    └──▶ Metadata extraction

Monitoring: progress / failure retries / throughput statistics

Key configuration:
- MinerU VLM worker: one process per GPU, vLLM asynchronous batching
- gpu_memory_utilization: 0.7 (keeps 30% as an OOM safety margin)
- max_num_batched_tokens: 16384 (raises GPU utilization)
- Failure retries: at most 3, with exponential backoff
- Timeout: 300 seconds per PDF
```

### 2.4 Output Data Model

```python
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel


class ContentType(str, Enum):
    TITLE = "title"
    TEXT = "text"
    TABLE = "table"
    EQUATION = "equation"
    EQUATION_BLOCK = "equation_block"
    IMAGE = "image"
    CODE = "code"
    LIST = "list"
    REFERENCE = "reference"


class ContentBlock(BaseModel):
    type: ContentType
    content: str                    # Markdown/LaTeX/HTML
    page_idx: int
    bbox: List[float]               # [x0, y0, x1, y1]
    reading_order: int
    section_hierarchy: List[str]    # ["3", "3.1", "Methods"]


class PaperMetadata(BaseModel):
    paper_id: str
    title: str
    authors: List[str]
    abstract: str
    # Explicit None defaults keep these fields optional under both Pydantic v1 and v2
    doi: Optional[str] = None
    year: Optional[int] = None
    venue: Optional[str] = None
    keywords: List[str]
    references: List[str]           # titles of the cited papers


class ParsedPaper(BaseModel):
    metadata: PaperMetadata
    content_blocks: List[ContentBlock]
    markdown: str
    page_count: int
    parse_engine: str               # "mineru_vlm" | "mineru_pipeline" | "pymupdf_fast"
    parse_time_seconds: float
```

---

## 3. Knowledge Extraction Layer

### 3.1 Two-Stage Extraction Strategy

```
Knowledge extraction pipeline

┌─ Stage 1: fast entity extraction (GLiNER, local, 440M parameters)
│    Input:  paper text chunks
│    Model:  urchade/gliner_large-v2.1 (zero-shot NER)
│    Labels: [Author, Method, Dataset, Metric, Task,
│             Model, Concept, Venue, Score, Tool]
│    Output: [(text, label, score, span), ...]
│    Speed:  ~1000 chunks/min (CPU), ~5000 chunks/min (GPU)
└───┬─
    ▼
┌─ Stage 2: LLM relation extraction (LLMGraphTransformer, local or API)
│    Input:  text chunk + Stage 1 entity hints
│    Relation types: [PROPOSED_BY, USED_FOR, EVALUATED_ON,
│             TRAINED_WITH, COMPARED_TO, PART_OF, ACHIEVED_SCORE,
│             HYPONYM_OF, CITES, IMPROVES_ON]
│    Local:  Ollama (Qwen2.5-14B) or vLLM (Llama-3.1-8B)
│    API:    GPT-4o-mini or DeepSeek-V3
│    Output: [(head, relation, tail, properties), ...]
└───┬─
    ▼
┌─ Stage 3: Graphusion fusion (entity normalization + conflict resolution)
│    - Embedding-similarity merging: "NMT" ↔ "neural machine translation"
│    - LLM conflict resolution: contradictory relations for the same entity pair
│    - New triple inference: fill in missing relations from context
└─
```

### 3.2 Entity-Relation Schema for Academic Papers

Entity types (Node Types):

| Entity type | Description | Attributes |
|-------------|-------------|------------|
| Paper | Paper | title, year, doi |
| Author | Author | name, affiliation |
| Method | Method / algorithm | name, description |
| Dataset | Dataset | name, size, domain |
| Task | Task | name, domain |
| Metric | Evaluation metric | name, value |
| Model | Concrete model instance | name, params |
| Concept | Academic concept | name, definition |
| Tool | Tool / framework | name, version |
| Venue | Publication venue | name, type |

Relation types (Edge Types):

| Relation type | Head → Tail | Description |
|---------------|-------------|-------------|
| PROPOSED_BY | Method → Author | The method was proposed by the author |
| PUBLISHED_IN | Paper → Venue | The paper was published at a conference or in a journal |
| USED_FOR | Method → Task | The method is used for a task |
| EVALUATED_ON | Method → Dataset | The method is evaluated on a dataset |
| ACHIEVED_SCORE | Method → Metric | The method reaches a metric value |
| TRAINED_WITH | Model → Dataset | The model is trained on a dataset |
| COMPARED_TO | Method → Method | Comparison between methods |
| IMPROVES_ON | Method → Method | Method A improves on method B |
| PART_OF | Concept → Concept | Concept hierarchy |
| CITES | Paper → Paper | Citation |
| AUTHORED_BY | Paper → Author | Paper authorship |
| HYPONYM_OF | Concept → Concept | Hypernym/hyponym relation |
| USES_TOOL | Method → Tool | The method uses a tool |

### 3.3 Core Extraction Code

```python
from gliner import GLiNER
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer


class KnowledgeExtractor:
    """Two-stage knowledge extractor."""

    def __init__(self, llm_backend: str = "local_ollama"):
        # Stage 1: fast NER
        self.ner_model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")
        self.entity_labels = [
            "author", "method", "dataset", "metric", "task",
            "model", "concept", "tool", "venue", "score",
        ]
        # Stage 2: LLM relation extraction
        self.llm = self._init_llm(llm_backend)
        self.graph_transformer = LLMGraphTransformer(
            llm=self.llm,
            # "Paper" is listed because CITES, AUTHORED_BY and PUBLISHED_IN start at a
            # Paper node; with strict_mode=True, unlisted node types are filtered out
            allowed_nodes=["Paper", "Author", "Method", "Dataset", "Metric",
                           "Task", "Model", "Concept", "Tool", "Venue"],
            allowed_relationships=[
                "PROPOSED_BY", "USED_FOR", "EVALUATED_ON", "ACHIEVED_SCORE",
                "TRAINED_WITH", "COMPARED_TO", "IMPROVES_ON", "PART_OF",
                "CITES", "AUTHORED_BY", "HYPONYM_OF", "USES_TOOL", "PUBLISHED_IN",
            ],
            node_properties=["description", "year"],
            relationship_properties=["score_value", "metric_name", "confidence"],
            strict_mode=True,
        )

    def _init_llm(self, backend: str):
        """Unified LLM initialization: supports local models and external APIs."""
        if backend == "local_ollama":
            # A chat model with tool calling is used here because LLMGraphTransformer
            # needs it to extract node/relationship properties (langchain-ollama package)
            from langchain_ollama import ChatOllama
            return ChatOllama(model="qwen2.5:14b-instruct", temperature=0)
        elif backend == "local_vllm":
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(
                base_url="http://localhost:8000/v1",
                api_key="token",
                model="meta-llama/Llama-3.1-8B-Instruct",
                temperature=0,
            )
        elif backend == "openai":
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(model="gpt-4o-mini", temperature=0)
        elif backend == "deepseek":
            import os
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(
                base_url="https://api.deepseek.com/v1",
                api_key=os.environ["DEEPSEEK_API_KEY"],  # assumed environment variable name
                model="deepseek-chat",
                temperature=0,
            )
        raise ValueError(f"Unknown LLM backend: {backend}")

    async def extract(self, text: str, paper_id: str) -> dict:
        """Two-stage extraction."""
        # Stage 1: fast NER with GLiNER
        entities = self.ner_model.predict_entities(
            text, self.entity_labels, threshold=0.5
        )
        # Stage 2: LLM relation extraction, with the Stage 1 entities passed as hints.
        # The hints are appended to page_content because the transformer sends only
        # page_content (not metadata) to the LLM.
        entity_hint = ", ".join(f"{e['text']}({e['label']})" for e in entities[:20])
        hinted_text = f"{text}\n\nCandidate entities: {entity_hint}" if entity_hint else text
        doc = Document(
            page_content=hinted_text,
            metadata={"paper_id": paper_id, "entity_hints": entity_hint},
        )
        graph_docs = await self.graph_transformer.aconvert_to_graph_documents([doc])
        return {
            "entities": entities,
            "graph_documents": graph_docs,
            "paper_id": paper_id,
        }
```

---

## 4. Knowledge Graph Layer

### 4.1 Graph Database Choice: Neo4j 5.x

| Graph database | License | Query language | Python driver | Ecosystem integration | Suitable scale |
|---------|--------|---------|-----------|---------|---------|
| **Neo4j 5.x** | Community: GPLv3 | Cypher | `neo4j` | Native LangChain/LlamaIndex support | <100M nodes |
| ArangoDB | Apache 2.0 (BSL 1.1 from v3.12) | AQL | `python-arango` | Multi-model (document + graph) | <100M nodes |
| NebulaGraph | Apache 2.0 | nGQL | `nebula3-python` | Native LlamaIndex support | 1B+ nodes |
| Kuzu | MIT | Cypher | `kuzu` | Embedded, lightweight | <10M nodes |

> **Recommended: Neo4j 5.x**. It has the most complete native LangChain/LlamaIndex integration and the most mature Cypher query ecosystem, and it fits a corpus of about 1000 papers.

### 4.2 Graph Data Model

```cypher
// ===== Nodes =====
(:Paper {id, title, year, doi, venue, abstract, embedding})
(:Author {id, name, affiliation, h_index})
(:Method {id, name, description, year_proposed, embedding})
(:Dataset {id, name, domain, size, description})
(:Task {id, name, domain, description})
(:Metric {id, name, description})
(:Concept {id, name, definition, embedding})
(:Venue {id, name, type})

// ===== Relationships =====
(:Paper)-[:AUTHORED_BY {order}]->(:Author)
(:Paper)-[:PUBLISHED_IN {year}]->(:Venue)
(:Paper)-[:CITES]->(:Paper)
(:Paper)-[:PROPOSES]->(:Method)
(:Method)-[:USED_FOR]->(:Task)
(:Method)-[:EVALUATED_ON {score, metric}]->(:Dataset)
(:Method)-[:IMPROVES_ON {delta, metric}]->(:Method)
(:Method)-[:COMPARED_TO {result}]->(:Method)
(:Concept)-[:PART_OF]->(:Concept)
(:Concept)-[:HYPONYM_OF]->(:Concept)

// ===== Indexes =====
CREATE VECTOR INDEX paper_embedding FOR (p:Paper) ON (p.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};

CREATE VECTOR INDEX method_embedding FOR (m:Method) ON (m.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};

CREATE FULLTEXT INDEX paper_fulltext FOR (p:Paper) ON EACH [p.title, p.abstract];
```

### 4.3 Graph Construction Pipeline

```
Parsed papers ──▶ Knowledge extraction ──▶ Triple normalization ──▶ Neo4j write
                                               │
                                               └─ Graphusion fusion engine
                                                    1. Entity normalization
                                                       - embedding similarity > 0.92
                                                       - LLM confirms the merge
                                                         "BERT" = "bert model"
                                                    2. Relation conflict resolution
                                                       - several relations for one entity pair
                                                       - keep the one with the highest confidence
                                                    3. Missing relation inference
                                                       - based on graph structure patterns
                                                       - completed by the LLM
```

### 4.4 Graph Visualization Options

```python
# Option 1: Neo4j Browser (development stage)
# Built-in Cypher queries + interactive graph visualization

# Option 2: vis-network (front-end integration)
# pip install pyvis
from pyvis.network import Network


def visualize_subgraph(nodes, edges, output_path="graph.html"):
    net = Network(height="800px", width="100%", directed=True)
    color_map = {
        "Method": "#ff6b6b", "Dataset": "#4ecdc4", "Task": "#45b7d1",
        "Author": "#96ceb4", "Paper": "#ffeaa7", "Concept": "#dfe6e9",
    }
    for node in nodes:
        net.add_node(node["id"], label=node["name"],
                     color=color_map.get(node["type"], "#95a5a6"))
    for edge in edges:
        net.add_edge(edge["from"], edge["to"], label=edge["type"])
    # write_html saves the file without trying to open a notebook or browser view
    net.write_html(output_path)


# Option 3: React + D3-force (production front end)
# Recommended: react-force-graph or neo4j-viz
```

---

## 5. Index Layer

### 5.1 Three-Way Index Architecture

```
Parsed paper content
│
├──▶ Vector index (Qdrant, Dense + Sparse)
│        Best for: factual queries, exact retrieval, similar papers
│
├──▶ Graph index (Neo4j, Cypher + Vector Index)
│        Best for: multi-hop reasoning, relation tracing, comparative analysis
│
└──▶ RAPTOR tree (hierarchical summaries: recursive clustering → summary → re-embedding)
         Best for: global overview, topic summaries, trend analysis
```

### 5.2 Document Chunking Strategy

```python
import re


class AcademicChunker:
    """Chunker for academic papers that preserves the section hierarchy.

    Uses ParsedPaper and ContentType from section 2.4.
    """

    def __init__(self, chunk_size: int = 256, overlap: int = 50):
        self.chunk_size = chunk_size  # 256 tokens (reported best in experiments, arxiv:2502.11371)
        self.overlap = overlap

    def chunk(self, parsed_paper: ParsedPaper) -> list:
        chunks = []
        for block in parsed_paper.content_blocks:
            if block.type == ContentType.TABLE:
                # Keep each table as one complete chunk, prefixed with a [TABLE] tag
                chunks.append({
                    "text": f"[TABLE] {block.content}",
                    "metadata": {
                        "paper_id": parsed_paper.metadata.paper_id,
                        "type": "table",
                        "section": block.section_hierarchy,
                        "page": block.page_idx,
                    }
                })
            elif block.type == ContentType.EQUATION_BLOCK:
                # Keep each equation block as its own chunk, prefixed with an [EQUATION] tag
                chunks.append({
                    "text": f"[EQUATION] {block.content}",
                    "metadata": {
                        "paper_id": parsed_paper.metadata.paper_id,
                        "type": "equation",
                        "section": block.section_hierarchy,
                        "page": block.page_idx,
                    }
                })
            else:
                # Plain text: fixed-size chunks aligned to sentence boundaries
                text_chunks = self._split_text(block.content)
                for tc in text_chunks:
                    chunks.append({
                        "text": tc,
                        "metadata": {
                            "paper_id": parsed_paper.metadata.paper_id,
                            "type": block.type.value,
                            "section": block.section_hierarchy,
                            "page": block.page_idx,
                        }
                    })
        return chunks

    def _split_text(self, text: str) -> list:
        """Split on sentence boundaries, keeping chunks near chunk_size tokens."""
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks, current = [], []
        current_len = 0
        for sent in sentences:
            sent_len = len(sent.split())  # simplified token count (whitespace-separated words)
            if current_len + sent_len > self.chunk_size and current:
                chunks.append(" ".join(current))
                # Carry trailing sentences over as overlap for the next chunk
                overlap_sents = []
                overlap_len = 0
                for s in reversed(current):
                    if overlap_len + len(s.split()) > self.overlap:
                        break
                    overlap_sents.insert(0, s)
                    overlap_len += len(s.split())
                current = overlap_sents
                current_len = overlap_len
            current.append(sent)
            current_len += sent_len
        if current:
            chunks.append(" ".join(current))
        return chunks
```

### 5.3 RAPTOR Hierarchical Summary Tree

```
Paper collection (1000 papers)
│
├── Level 0: raw text chunks (256 tokens)
│   │
│   ▼  SBERT embedding → UMAP dimensionality reduction → GMM clustering
│
├── Level 1: paragraph-level summaries (~50 clusters)
│   │  LLM generates summaries → re-embed
│   ▼  cluster again
│
├── Level 2: topic-level summaries (~15 clusters)
│   │  "Directions for improving the Transformer architecture"
│   ▼  "Survey of large-scale pretraining datasets"
│
└── Level 3: domain-level summaries (~5 clusters)
       "Main research directions and breakthroughs in NLP in recent years"

At query time: retrieve the most relevant nodes from all levels (Collapsed Tree mode)
Advantage: answers both detail questions (Level 0) and global questions (Levels 2-3)
```

---

## 6. Retrieval Layer

### 6.1 Hybrid Retrieval Architecture

```
User query
│
├──▶ HyDE (generate a hypothetical document, then embed it) ──▶ Vector + BM25 search
│
└──▶ Query classifier (Router LLM)
       ├── factual   ──▶ Vector + BM25 (Qdrant hybrid)
       ├── reasoning ──▶ Graph traversal (Neo4j, Cypher)
       └── global    ──▶ RAPTOR summary tree (global retrieval)
                │
                ▼
       RRF fusion ranking (Reciprocal Rank Fusion)
                │
                ▼
       Cross-Encoder reranker (bge-reranker-large)
                │
                ▼  Top-5 results
       LLM answer generation + citation tracing
```

### 6.2 Core Retrieval Code

```python
from collections import defaultdict

from neo4j import GraphDatabase
from qdrant_client import QdrantClient, models


class HybridRetriever:
    """Three-way hybrid retriever.

    The helpers _load_reranker, _load_embedder, _raptor_search and
    _format_graph_result are not shown in this listing.
    """

    def __init__(self):
        self.qdrant = QdrantClient("localhost", port=6333)
        self.neo4j = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
        self.reranker = self._load_reranker()
        self.embed_model = self._load_embedder()

    async def retrieve(self, query: str, mode: str = "hybrid", top_k: int = 20) -> list:
        """mode: "factual" | "reasoning" | "global" | "hybrid" """
        results = []

        if mode in ("factual", "hybrid"):
            # 1. Vector retrieval (only the dense vector is queried in this listing)
            query_vec = self.embed_model.encode(query)
            vec_results = self.qdrant.search(
                collection_name="papers",
                # Qdrant expects a plain list of floats, not a NumPy array
                query_vector=models.NamedVector(name="dense", vector=query_vec.tolist()),
                limit=top_k,
                with_payload=True,
            )
            results.extend([{"text": r.payload["text"], "score": r.score,
                             "source": "vector", "metadata": r.payload} for r in vec_results])

        if mode in ("reasoning", "hybrid"):
            # 2. Graph retrieval: entities plus relation paths
            graph_results = self._graph_search(query, limit=top_k // 2)
            results.extend(graph_results)

        if mode in ("global", "hybrid"):
            # 3. RAPTOR hierarchical summary retrieval
            raptor_results = self._raptor_search(query, limit=top_k // 3)
            results.extend(raptor_results)

        # 4. RRF fusion ranking
        fused = self._rrf_fusion(results)

        # 5. Cross-Encoder reranking
        reranked = self._rerank(query, fused[:top_k])
        return reranked[:5]

    def _graph_search(self, query: str, limit: int = 10) -> list:
        """Neo4j subgraph retrieval."""
        # First use the vector index to find the most relevant entity nodes,
        # then follow their direct relationships with Cypher
        cypher = """
        CALL db.index.vector.queryNodes('method_embedding', $limit, $query_vec)
        YIELD node, score
        MATCH (node)-[r]-(neighbor)
        RETURN node, r, neighbor, score
        ORDER BY score DESC
        LIMIT $limit
        """
        with self.neo4j.session() as session:
            result = session.run(cypher,
                                 query_vec=self.embed_model.encode(query).tolist(),
                                 limit=limit)
            return [{"text": self._format_graph_result(r), "score": r["score"],
                     "source": "graph"} for r in result]

    def _rrf_fusion(self, results: list, k: int = 60) -> list:
        """Reciprocal Rank Fusion: merge the ranked lists of several retrievers."""
        # Rank each source separately: raw scores from different retrievers
        # are not comparable, so only the per-source rank is used
        by_source = defaultdict(list)
        for r in results:
            by_source[r["source"]].append(r)

        doc_scores = {}
        for source_results in by_source.values():
            ranked = sorted(source_results, key=lambda x: x["score"], reverse=True)
            for rank, r in enumerate(ranked):
                doc_key = r["text"][:200]  # deduplication key
                if doc_key not in doc_scores:
                    doc_scores[doc_key] = {"result": r, "rrf_score": 0.0}
                doc_scores[doc_key]["rrf_score"] += 1.0 / (k + rank + 1)

        return [v["result"] | {"score": v["rrf_score"]}
                for v in sorted(doc_scores.values(), key=lambda x: x["rrf_score"], reverse=True)]

    def _rerank(self, query: str, results: list) -> list:
        """Rerank with the BAAI/bge-reranker-large cross-encoder."""
        pairs = [(query, r["text"]) for r in results]
        scores = self.reranker.predict(pairs)
        for r, s in zip(results, scores):
            r["rerank_score"] = float(s)
        return sorted(results, key=lambda x: x["rerank_score"], reverse=True)
```

---

## 7. Agent Orchestration Layer

### 7.1 LangGraph Multi-Agent Architecture

```
User query
    │
    ▼
Router Agent (intent classification)
    │
    ├──▶ Simple QA:            vector retrieval → answer generation → citation tracing
    ├──▶ Multi-hop reasoning:  graph traversal → chained reasoning → evidence collection
    └──▶ Global analysis:      RAPTOR + KG → synthesized summary → trend insights
    │
    ▼
Self-check Agent (answer verification)
    Is the answer sufficient?
    Is there any hallucination?
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ ๅ……ๅˆ† โ”‚ ไธๅ……ๅˆ† โ”‚ โ–ผ โ–ผ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ ่พ“ๅ‡บ็ญ”ๆกˆ โ”‚ โ”‚ ่กฅๅ……ๆฃ€็ดข โ”‚โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ + ๅผ•็”จ โ”‚ โ”‚ (ๆ›ดๅคšๆบ) โ”‚ (ๆœ€ๅคš3่ฝฎ) โ”‚ + ๅ›พ่ฐฑ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` ### 7.2 LangGraph ็Šถๆ€ๆœบๅฎšไน‰ ```python from typing import TypedDict, Annotated, Literal from langgraph.graph import StateGraph, END from langgraph.graph.message import add_messages class AgentState(TypedDict): messages: Annotated[list, add_messages] query: str query_type: Literal["factual", "reasoning", "global"] retrieved_docs: list graph_context: list answer: str citations: list confidence: float iteration: int def build_agent_graph(): graph = StateGraph(AgentState) # ๆทปๅŠ ่Š‚็‚น graph.add_node("router", route_query) graph.add_node("retriever", hybrid_retrieve) graph.add_node("graph_explorer", explore_knowledge_graph) graph.add_node("generator", generate_answer) graph.add_node("validator", validate_answer) graph.add_node("supplementer", supplement_retrieval) # ๅฎšไน‰่พน graph.set_entry_point("router") graph.add_edge("router", "retriever") graph.add_edge("retriever", "graph_explorer") graph.add_edge("graph_explorer", "generator") graph.add_edge("generator", "validator") # ๆกไปถ่พน: ้ชŒ่ฏ้€š่ฟ‡โ†’็ป“ๆŸ, ไธ้€š่ฟ‡โ†’่กฅๅ……ๆฃ€็ดข(ๆœ€ๅคš3่ฝฎ) graph.add_conditional_edges( "validator", lambda state: "end" if state["confidence"] > 0.8 or state["iteration"] >= 3 else "supplement", {"end": END, "supplement": "supplementer"} ) graph.add_edge("supplementer", "retriever") return graph.compile() async def route_query(state: AgentState) -> AgentState: """LLMๆ„ๅ›พๅˆ†็ฑป""" classification_prompt = f""" ๅฐ†ไปฅไธ‹ๅญฆๆœฏ้—ฎ้ข˜ๅˆ†็ฑปไธบไธ‰็ง็ฑปๅž‹ไน‹ไธ€: - factual: ๅ…ทไฝ“ไบ‹ๅฎžๆŸฅ่ฏข 
(ๆŸไธชๆ–นๆณ•็š„ๆ•ˆๆžœใ€ๆŸ็ฏ‡่ฎบๆ–‡็š„ไฝœ่€…) - reasoning: ้œ€่ฆๅคšๆญฅๆŽจ็† (ๆ–นๆณ•Aๅ’ŒB็š„ๅŒบๅˆซใ€ๆŸๆŠ€ๆœฏ็š„ๅ‘ๅฑ•่„‰็ปœ) - global: ๅ…จๅฑ€ๆ€งๅˆ†ๆž (ๆŸ้ข†ๅŸŸ็š„็ ”็ฉถ่ถ‹ๅŠฟใ€ไธป่ฆๆŒ‘ๆˆ˜) ้—ฎ้ข˜: {state['query']} ็ฑปๅž‹: """ query_type = await llm.ainvoke(classification_prompt) return {"query_type": query_type.content.strip()} async def validate_answer(state: AgentState) -> AgentState: """Self-RAG ๆจกๅผ: LLM่‡ชๆฃ€็ญ”ๆกˆ่ดจ้‡""" validation_prompt = f""" ่ฏ„ไผฐไปฅไธ‹็ญ”ๆกˆ็š„่ดจ้‡(0-1ๅˆ†): ้—ฎ้ข˜: {state['query']} ็ญ”ๆกˆ: {state['answer']} ๆฃ€็ดขไพๆฎ: {state['retrieved_docs'][:3]} ่ฏ„ๅˆ†ๆ ‡ๅ‡†: - ๆ˜ฏๅฆๅฎŒๆ•ดๅ›ž็ญ”ไบ†้—ฎ้ข˜ - ๆ˜ฏๅฆๆœ‰ไพๆฎๆ”ฏๆ’‘ - ๆ˜ฏๅฆๅญ˜ๅœจๅนป่ง‰ ่ฟ”ๅ›žJSON: {{"confidence": 0.X, "issues": ["..."]}} """ result = await llm.ainvoke(validation_prompt) confidence = parse_confidence(result.content) return {"confidence": confidence, "iteration": state["iteration"] + 1} ``` ### 7.3 Agent ๅทฅๅ…ท้›† ```python from langchain.tools import tool @tool def vector_search(query: str, top_k: int = 5) -> str: """ๅœจ่ฎบๆ–‡ๅ‘้‡ๅบ“ไธญ่ฟ›่กŒ่ฏญไน‰ๆœ็ดข""" results = retriever.search_vectors(query, top_k) return format_search_results(results) @tool def graph_query(cypher: str) -> str: """ๆ‰ง่กŒCypherๆŸฅ่ฏข, ๅœจ็Ÿฅ่ฏ†ๅ›พ่ฐฑไธญๆฃ€็ดขๅฎžไฝ“ๅ’Œๅ…ณ็ณป""" with neo4j_driver.session() as session: result = session.run(cypher) return format_graph_results(result) @tool def find_related_methods(method_name: str) -> str: """ๆŸฅๆ‰พไธŽๆŒ‡ๅฎšๆ–นๆณ•็›ธๅ…ณ็š„ๆ‰€ๆœ‰ๆ–นๆณ•(ๆ”น่ฟ›ใ€ๅฏนๆฏ”ใ€ไฝฟ็”จ)""" cypher = """ MATCH (m:Method {name: $name})-[r]-(related) RETURN type(r) as relation, labels(related) as type, related.name as name, r.score_value as score ORDER BY r.score_value DESC LIMIT 20 """ return execute_and_format(cypher, {"name": method_name}) @tool def get_paper_summary(paper_id: str) -> str: """่Žทๅ–่ฎบๆ–‡็š„ๆ‘˜่ฆๅ’Œๆ ธๅฟƒ่ดก็Œฎ""" return paper_store.get_summary(paper_id) @tool def compare_methods(method_a: str, method_b: str) -> str: 
"""ๅฏนๆฏ”ไธคไธชๆ–นๆณ•ๅœจ็›ธๅŒๆ•ฐๆฎ้›†ไธŠ็š„่กจ็Žฐ""" cypher = """ MATCH (a:Method {name: $a})-[r1:EVALUATED_ON]->(d:Dataset)<-[r2:EVALUATED_ON]-(b:Method {name: $b}) RETURN d.name as dataset, r1.score as score_a, r2.score as score_b, r1.metric as metric """ return execute_and_format(cypher, {"a": method_a, "b": method_b}) @tool def research_trend(topic: str, years: int = 5) -> str: """ๅˆ†ๆžๆŸไธช็ ”็ฉถไธป้ข˜ๅœจ่ฟ‘Nๅนด็š„ๅ‘ๅฑ•่ถ‹ๅŠฟ""" raptor_results = raptor_index.search(topic, level="high") graph_stats = get_temporal_graph_stats(topic, years) return synthesize_trend(raptor_results, graph_stats) ``` --- ## 8. LLM ็ปŸไธ€ๆŽฅๅ…ฅๅฑ‚ ### 8.1 ๆžถๆž„่ฎพ่ฎก ``` โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ LiteLLM Proxy Server โ”‚ โ”‚ (็ปŸไธ€ OpenAI ๅ…ผๅฎนๆŽฅๅฃ) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ โ”‚ model_list: โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ "local/qwen2.5-14b" โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ Ollama :11434 โ”‚ โ”‚ "local/llama-3.1-8b" โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ vLLM :8000 โ”‚ โ”‚ "gpt-4o-mini" โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ OpenAI API โ”‚ โ”‚ "claude-3-5-sonnet" โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ Anthropic API โ”‚ โ”‚ "deepseek-chat" โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ DeepSeek API โ”‚ โ”‚ "gemini-2.0-flash" โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ Google API โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ ๅŠŸ่ƒฝ: โ”‚ โ”‚ - ็ปŸไธ€ /chat/completions ๆŽฅๅฃ โ”‚ โ”‚ - ่‡ชๅŠจfallback (ๆœฌๅœฐโ†’API) โ”‚ โ”‚ - ่ดŸ่ฝฝๅ‡่กก (ๅคšvLLMๅฎžไพ‹) โ”‚ โ”‚ - ้€Ÿ็އ้™ๅˆถ & ๆˆๆœฌ่ฟฝ่ธช โ”‚ โ”‚ - ็ผ“ๅญ˜ (็›ธๅŒqueryๅค็”จ) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` ### 8.2 LiteLLM ้…็ฝฎ ```yaml # litellm_config.yaml model_list: # ===== ๆœฌๅœฐๆจกๅž‹ ===== - model_name: "local/qwen2.5-14b" 
    litellm_params:
      # Must match the tag pulled into Ollama (see Quick Start)
      model: "openai/qwen2.5:14b-instruct"
      # Ollama; use http://ollama:11434/v1 when LiteLLM runs inside the Compose network
      api_base: "http://localhost:11434/v1"
      api_key: "ollama"
    model_info:
      max_tokens: 32768
      input_cost_per_token: 0  # local, free

  - model_name: "local/llama-3.1-8b"
    litellm_params:
      model: "openai/meta-llama/Llama-3.1-8B-Instruct"
      api_base: "http://localhost:8000/v1"  # vLLM
      api_key: "token"
    model_info:
      max_tokens: 131072

  # ===== External APIs =====
  - model_name: "gpt-4o-mini"
    litellm_params:
      model: "gpt-4o-mini"
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: "deepseek-chat"
    litellm_params:
      model: "deepseek/deepseek-chat"
      api_key: "os.environ/DEEPSEEK_API_KEY"

# Routing strategy
router_settings:
  routing_strategy: "latency-based-routing"  # pick the lowest-latency deployment
  num_retries: 3
  fallbacks:
    - "local/qwen2.5-14b": ["gpt-4o-mini"]  # local failure → API
    - "gpt-4o-mini": ["deepseek-chat"]      # OpenAI failure → DeepSeek

  # Different models for different tasks
  model_group_alias:
    "extraction": "local/qwen2.5-14b"   # knowledge extraction: local (cheap)
    "generation": "gpt-4o-mini"         # answer generation: API (higher quality)
    "routing": "local/llama-3.1-8b"     # intent classification: small local model (fast)
```

### 8.3 Unified Call Interface

```python
import litellm


class UnifiedLLM:
    """Unified LLM call layer: routes each task to a local or API model."""

    def __init__(self, config_path: str = "litellm_config.yaml",
                 proxy_url: str = "http://localhost:4000",
                 proxy_key: str = "sk-placeholder"):
        # The model names below are aliases from litellm_config.yaml, so calls
        # go through the LiteLLM proxy, which loads that file. `proxy_url` and
        # `proxy_key` are assumed parameters; port 4000 matches the Compose file.
        self.config_path = config_path
        self.proxy_url = proxy_url
        self.proxy_key = proxy_key
        litellm.set_verbose = False
        # Enable caching
        litellm.cache = litellm.Cache(type="redis", host="localhost", port=6379)

    async def complete(
        self,
        messages: list,
        task: str = "generation",  # extraction | generation | routing
        temperature: float = 0,
        max_tokens: int = 4096,
        stream: bool = False,
    ):
        """
        Unified entry point; the model is chosen from `task`.
        Returns the answer text, or an async chunk iterator when stream=True.
        """
        model = self._select_model(task)
        response = await litellm.acompletion(
            model=f"openai/{model}",  # OpenAI-compatible call to the proxy
            api_base=self.proxy_url,
            api_key=self.proxy_key,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            stream=stream,
            metadata={"task": task},  # used for cost tracking
        )
        if stream:
            return response  # async iterator of chunks
        return response.choices[0].message.content

    def _select_model(self, task: str) -> str:
        model_map = {
            "extraction": "local/qwen2.5-14b",
            "generation": "gpt-4o-mini",
            "routing": "local/llama-3.1-8b",
            "fusion": "gpt-4o-mini",          # Graphusion fusion needs a strong model
            "rewrite": "local/llama-3.1-8b",  # HyDE query rewriting
        }
        return model_map.get(task, "local/qwen2.5-14b")
```

---

## 9. Deployment Architecture

### 9.1 Docker Compose Deployment

```yaml
# docker-compose.yml
version: '3.8'

services:
  # ===== Core service =====
  api:
    build: ./services/api
    ports: ["8080:8080"]
    environment:
      - REDIS_URL=redis://redis:6379
      - QDRANT_URL=http://qdrant:6333
      - NEO4J_URL=bolt://neo4j:7687
      - LITELLM_URL=http://litellm:4000
    depends_on: [redis, qdrant, neo4j, litellm]

  # ===== PDF parsing service =====
  mineru-worker:
    build: ./services/mineru
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MINERU_MODEL_SOURCE=local
      - CELERY_BROKER_URL=redis://redis:6379
    volumes:
      - mineru-models:/models
      - pdf-storage:/pdfs

  # ===== LLM services =====
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]
    volumes:
      - ./config/litellm_config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]

  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ollama-data:/root/.ollama

  # ===== Storage services =====
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
    volumes:
      - qdrant-data:/qdrant/storage

  neo4j:
    image: neo4j:5-community
    ports: ["7474:7474", "7687:7687"]
    environment:
      - NEO4J_AUTH=neo4j/password  # change before any real deployment
      - NEO4J_PLUGINS=["apoc", "graph-data-science"]
    volumes:
      - neo4j-data:/data

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  postgres:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=scholarmind
      - POSTGRES_PASSWORD=password  # change before any real deployment
    volumes:
      - postgres-data:/var/lib/postgresql/data

  minio:
    image: minio/minio:latest
    ports: ["9000:9000", "9001:9001"]
    command: server /data --console-address ":9001"
    volumes:
      - minio-data:/data

volumes:
  qdrant-data:
  neo4j-data:
  postgres-data:
  minio-data:
  ollama-data:
  mineru-models:
  pdf-storage:
```

### 9.2 Recommended Hardware

```
┌───────────────────────────────────────────────────────────────┐
│                     Recommended hardware                      │
├──────────────────┬──────────────┬──────────────┬──────────────┤
│ Configuration    │ Development  │ Prod (small) │ Prod (large) │
├──────────────────┼──────────────┼──────────────┼──────────────┤
│ PDF parsing GPU  │ RTX 3090     │ A100 80G     │ 2×A100 80G   │
│ LLM inference GPU│ RTX 4090     │ A100 80G     │ 2×H100 80G   │
│ CPU              │ 16 cores     │ 32 cores     │ 64 cores     │
│ RAM              │ 64GB         │ 128GB        │ 256GB        │
│ SSD              │ 1TB NVMe     │ 2TB NVMe     │ 4TB NVMe     │
├──────────────────┼──────────────┼──────────────┼──────────────┤
│ 1,000-paper      │ ~3 hours     │ ~80 min      │ ~40 min      │
│ parse time       │              │              │              │
│ QA latency       │ ~5s          │ ~2s          │ ~1s          │
│ Concurrent users │ 1-5          │ 10-50        │ 50-200       │
└──────────────────┴──────────────┴──────────────┴──────────────┘
```

### 9.3 API Design

```python
from fastapi import FastAPI, UploadFile, BackgroundTasks
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI(title="ScholarMind API", version="1.0")


# ===== PDF upload and parsing =====
@app.post("/api/v1/papers/upload")
async def upload_papers(files: list[UploadFile], bg:
                        BackgroundTasks):
    """Upload a batch of PDF papers; parsing runs asynchronously."""
    task_ids = []
    for f in files:
        task_id = await save_and_queue(f)
        task_ids.append(task_id)
    return {"task_ids": task_ids, "status": "processing"}


@app.get("/api/v1/papers/{task_id}/status")
async def get_parse_status(task_id: str):
    """Query parsing progress."""
    result = celery_app.AsyncResult(task_id)
    # `info` can be None or an exception object, so make it JSON-safe
    info = result.info if isinstance(result.info, dict) else str(result.info)
    return {"task_id": task_id, "state": result.state, "info": info}


# ===== Knowledge-base QA =====
class QueryRequest(BaseModel):
    query: str
    mode: str = "hybrid"          # factual | reasoning | global | hybrid
    llm_backend: str = "auto"     # auto | local | openai | deepseek
    top_k: int = 5
    stream: bool = False
    include_citations: bool = True
    include_graph: bool = False   # whether to return the related subgraph


import json


async def _sse_events(state: dict):
    """Wrap graph updates as server-sent events (hypothetical helper)."""
    async for chunk in agent.astream(state):
        yield f"data: {json.dumps(chunk, default=str)}\n\n"


@app.post("/api/v1/query")
async def query_knowledge_base(req: QueryRequest):
    """Answer a question against the knowledge base."""
    # Assumes `agent` is the compiled LangGraph from section 7.2,
    # which takes a state dict rather than the request model.
    state = {"query": req.query, "messages": [], "iteration": 0}
    if req.stream:
        return StreamingResponse(
            _sse_events(state),
            media_type="text/event-stream",
        )
    result = await agent.ainvoke(state)
    return {
        "answer": result["answer"],
        "citations": result["citations"],
        "confidence": result["confidence"],
        "graph_snippet": result.get("graph_snippet"),
    }


# ===== Knowledge graph =====
@app.get("/api/v1/graph/entity/{name}")
async def get_entity(name: str, depth: int = 2):
    """Get an entity and its N-hop subgraph."""
    subgraph = await graph_service.get_subgraph(name, depth)
    return subgraph


@app.get("/api/v1/graph/path")
async def find_path(source: str, target: str, max_hops: int = 4):
    """Find the shortest path between two entities."""
    path = await graph_service.shortest_path(source, target, max_hops)
    return path


@app.get("/api/v1/graph/stats")
async def graph_statistics():
    """Knowledge-graph statistics."""
    return await graph_service.get_stats()


# ===== Graph visualization =====
@app.get("/api/v1/graph/visualize")
async def visualize_graph(center: str, depth: int = 2, layout: str = "force"):
    """Return visualization data (vis.js format)."""
    data = await graph_service.get_vis_data(center, depth)
    return {"nodes": data["nodes"], "edges": data["edges"]}
```

---

## 10. Technology Selection

### 10.1 Full Technology Stack

| Layer | Component | Choice | Alternatives | Rationale |
|----|------|------|---------|---------|
| PDF parsing | OCR engine | **MinerU 2.5 VLM** | Marker, Nougat, Docling | SOTA on academic papers (0.047 Edit Dist), formulas 88.46 CDM |
| PDF parsing | Fast path | **PyMuPDF** | pdfplumber | Parses born-digital PDFs in seconds, no GPU needed |
| Knowledge extraction | NER | **GLiNER** (440M) | spaCy, DeepKE | Zero-shot, custom labels, runs locally |
| Knowledge extraction | RE | **LLMGraphTransformer** | REBEL, GLiREL, ReLiK | Supports local and API LLMs, schema constraints |
| Knowledge extraction | Fusion | **Graphusion** | None | Entity normalization plus conflict resolution, 9.2% better than naive |
| Knowledge graph | Graph database | **Neo4j 5.x** | ArangoDB, NebulaGraph | Native LangChain integration, most mature Cypher ecosystem |
| Vector index | Vector store | **Qdrant** | Milvus, Weaviate | High-performance Rust, native hybrid search, simple deployment |
| Retrieval | Reranker | **bge-reranker-large** | Cohere, jina | Open-source SOTA, no API dependency |
| Retrieval | Query enhancement | **HyDE** | Query2Doc | +10 NDCG at zero cost |
| Index | Hierarchical index | **RAPTOR** | GraphRAG Communities | Suits hierarchical documents, +20% accuracy |
| RAG | Graph augmentation | **LightRAG** | NodeRAG, GraphRAG | 34k⭐, incremental updates, multiple backends |
| Agent | Orchestration | **LangGraph** | smolagents, AutoGen | Stateful graph, conditional branches, production-grade |
| LLM | Unified access | **LiteLLM** | OpenRouter | 20k⭐, one interface for all providers |
| LLM | Local inference | **vLLM + Ollama** | SGLang, llama.cpp | vLLM for throughput, Ollama for ease of use |
| Backend | Web framework | **FastAPI** | Flask, Django | Async-native, high performance, automatic OpenAPI docs |
| Queue | Task queue | **Celery + Redis** | RQ, Dramatiq | Mature and stable, distributed support |
| Storage | Object storage | **MinIO** | S3 | S3-compatible, self-hosted |
| Storage | Relational database | **PostgreSQL** | MySQL | JSON support, full-text search |
| Frontend | Graph visualization | **react-force-graph** | vis.js, D3 | React ecosystem, 3D support |

### 10.2 Performance Estimates

```
┌─────────────────────────────────────────────────────────────────────┐
│          Estimated performance for 1,000 papers (A100 80G)          │
├──────────────────────────┬──────────────────────────────────────────┤
│ PDF parsing              │ ~80 min (MinerU 2.5, 2.12 pages/s)       │
│ Extraction (GLiNER NER)  │ ~15 min (GPU batch)                      │
│ Extraction (LLM RE)      │ ~60 min (local 14B model)                │
│                          │ ~30 min (GPT-4o-mini API)                │
│ Vector index build       │ ~10 min (text-embedding-3-small)         │
│ Knowledge graph build    │ ~20 min (including fusion)               │
│ RAPTOR tree build        │ ~30 min                                  │
├──────────────────────────┼──────────────────────────────────────────┤
│ Total (end to end)       │ ~3.5 h (all local) / ~2.5 h (hybrid API) │
├──────────────────────────┼──────────────────────────────────────────┤
│ QA latency (P50)         │ ~1.5 s (local LLM) / ~0.8 s (API)        │
│ QA latency (P99)         │ ~4 s (local LLM) / ~2 s (API)            │
│ Graph query latency      │ ~200 ms (2-hop subgraph)                 │
│ Vector search latency    │ ~50 ms (Qdrant, 1M vectors)              │
└──────────────────────────┴──────────────────────────────────────────┘
```

---

## 11. Key Papers and Open-Source Projects

### 11.1 Core Papers

| Paper | ArXiv ID | Contribution | Rating |
|------|---------|------|--------|
| **MinerU 2.5** | 2509.22186 | Unified VLM document parsing, SOTA | ⭐⭐⭐⭐⭐ |
| **OmniDocBench** | 2412.07626 | Document parsing benchmark (CVPR 2025) | ⭐⭐⭐⭐ |
| **Graphusion** | 2410.17600 | Zero-shot KG construction and fusion | ⭐⭐⭐⭐⭐ |
| **GLiNER** | 2311.08526 | Zero-shot NER, 440M | ⭐⭐⭐⭐⭐ |
| **SciER** | 2410.21155 | IE dataset and benchmark for academic papers | ⭐⭐⭐⭐ |
| **ReLiK** | 2408.00103 | Fast entity linking and relation extraction | ⭐⭐⭐⭐ |
| **NodeRAG** | 2504.11544 | Heterogeneous-graph RAG, SOTA | ⭐⭐⭐⭐⭐ |
| **LightRAG** | 2410.05779 | Lightweight graph RAG, incremental updates | ⭐⭐⭐⭐⭐ |
| **Microsoft GraphRAG** | 2404.16130 | Community summaries and global retrieval | ⭐⭐⭐⭐ |
| **RAPTOR** | 2401.18059 | Recursive summary tree | ⭐⭐⭐⭐ |
| **Self-RAG** | 2310.11511 | Self-reflective retrieval and generation | ⭐⭐⭐ |
| **HyDE** | 2212.10496 | Hypothetical document embeddings | ⭐⭐⭐⭐ |
| **RAG vs GraphRAG** | 2502.11371 | RAG + GraphRAG fusion experiments | ⭐⭐⭐⭐ |
| **LLM-KGC Survey** | 2510.20345 | Survey of LLM-based knowledge graph construction | ⭐⭐⭐⭐ |

### 11.2 Core Open-Source Projects

| Project | GitHub | Stars | Purpose |
|------|--------|-------|------|
| **MinerU** | opendatalab/MinerU | 61k+ | Deep PDF parsing |
| **LightRAG** | hkuds/lightrag | 34k+ | Graph-augmented RAG |
| **RAGFlow** | infiniflow/ragflow | 36k+ | Full-stack RAG platform (with UI) |
| **LiteLLM** | BerriAI/litellm | 20k+ | Unified LLM proxy |
| **Neo4j LLM Graph Builder** | neo4j-labs/llm-graph-builder | 3k+ | PDF→KG→QA |
| **Kotaemon** | Cinnamon/kotaemon | 18k+ | Document QA (with GraphRAG) |
| **Dify** | langgenius/dify | 70k+ | AI application development platform |
| **LangGraph** | langchain-ai/langgraph | 10k+ | Agent state-machine orchestration |
| **GLiNER** | urchade/GLiNER | 2k+ | Zero-shot NER |
| **Graphusion** | irenezihuili/graphusion | 27 | KG fusion and deduplication |
| **RAPTOR** | parthsarthi03/raptor | 1.6k+ | Hierarchical summary tree |
| **NodeRAG** | Terry-Xu-666/NodeRAG | 412 | Heterogeneous-graph RAG |
| **Qdrant** | qdrant/qdrant | 22k+ | Vector database |
| **vLLM** | vllm-project/vllm | 45k+ | High-throughput LLM inference |

---

## 12. Project Structure

```
scholarmind/
├── docker-compose.yml              # One-command deployment
├── config/
│   ├── litellm_config.yaml         # LLM routing configuration
│   ├── mineru_config.yaml          # MinerU parsing configuration
│   └── settings.py                 # Global settings
│
├── services/
│   ├── api/                        # FastAPI main service
│   │   ├── main.py                 # Entry point
│   │   ├── routers/
│   │   │   ├── papers.py           # PDF upload/parsing API
│   │   │   ├── query.py            # Knowledge-base QA API
│   │   │   ├── graph.py            # Knowledge graph API
│   │   │   └── admin.py            # Admin API
│   │   └── middleware/
│   │       ├── auth.py             # Authentication
│   │       └── rate_limit.py       # Rate limiting
│   │
│   ├── parser/                     # PDF parsing service
│   │   ├── router.py               # Routing by PDF characteristics
│   │   ├── mineru_worker.py        # MinerU VLM worker
│   │   ├── pymupdf_worker.py       # PyMuPDF fast parsing
│   │   ├── metadata_extractor.py   # Metadata extraction
│   │   └── tasks.py                # Celery task definitions
│   │
│   ├── extractor/                  # Knowledge extraction service
│   │   ├── ner_engine.py           # GLiNER NER
│   │   ├── re_engine.py            # LLM relation extraction
│   │   ├── fusion_engine.py        # Graphusion fusion
│   │   └── schema.py               # Entity/relation schema
│   │
│   ├── graph/                      # Knowledge graph service
│   │   ├── neo4j_client.py         # Neo4j connection management
│   │   ├── graph_builder.py        # Graph construction
│   │   ├── graph_query.py          # Graph queries
│   │   └── visualization.py        # Graph visualization
│   │
│   ├── indexer/                    # Indexing service
│   │   ├── chunker.py              # Academic paper chunker
│   │   ├── vector_indexer.py       # Qdrant vector index
│   │   ├── raptor_builder.py       # RAPTOR hierarchical summary tree
│   │   └── embedder.py             # Embedding model management
│   │
│   ├── retriever/                  # Retrieval service
│   │   ├── hybrid_retriever.py     # Three-way hybrid retrieval
│   │   ├── hyde.py                 # HyDE query enhancement
│   │   ├── reranker.py             # Cross-encoder reranking
│   │   └── rrf.py                  # RRF fusion
│   │
│   ├── agent/                      # Agent orchestration service
│   │   ├── graph_definition.py     # LangGraph state machine
│   │   ├── nodes.py                # Agent node definitions
│   │   ├── tools.py                # Agent tool set
│   │   └── prompts.py              # Prompt templates
│   │
│   └── llm/                        # Unified LLM access
│       ├── unified_llm.py          # LiteLLM wrapper
│       ├── model_router.py         # Task → model routing
│       └── cache.py                # LLM cache
│
├── models/                         # Data models
│   ├── paper.py                    # Paper data model
│   ├── graph.py                    # Graph data model
│   └── query.py                    # Query data model
│
├── tests/
│   ├── test_parser.py
│   ├── test_extractor.py
│   ├── test_retriever.py
│   └── test_agent.py
│
├── scripts/
│   ├── setup_neo4j.cypher          # Neo4j initialization script
│   ├── batch_parse.py              # Batch parsing script
│   └── build_index.py              # Index build script
│
├── frontend/                       # Frontend (React/Next.js)
│   ├── components/
│   │   ├── ChatInterface.tsx       # QA interface
│   │   ├── GraphViewer.tsx         # Knowledge graph visualization
│   │   ├── PaperUploader.tsx       # PDF upload
│   │   └── SearchResults.tsx       # Search results view
│   └── ...
│
├── requirements.txt
├── Dockerfile
└── README.md
```

---

## Quick Start

```bash
# 1. Clone the project
git clone https://github.com/your-org/scholarmind.git
cd scholarmind

# 2. Configure environment variables
cp .env.example .env
# Edit .env: set OPENAI_API_KEY, DEEPSEEK_API_KEY, etc.

# 3. Download the MinerU models
pip install mineru
mineru-models-download -s huggingface -m all

# 4. Start all services
docker-compose up -d

# 5. Download the local LLM (optional)
# Uses the Compose service name, since the Compose file sets no container_name
docker-compose exec ollama ollama pull qwen2.5:14b-instruct

# 6. Batch-import papers
python scripts/batch_parse.py --input /path/to/pdfs/ --workers 4

# 7. Build the indexes
python scripts/build_index.py --vector --graph --raptor

# 8. Access the system
# API:   http://localhost:8080/docs
# Neo4j: http://localhost:7474
# MinIO: http://localhost:9001
```

---

## License

MIT License

---

*The architecture is based on research results and open-source practice from 2024-2025; all paper citations and benchmark figures can be traced to their sources.*