# ๐Ÿ—๏ธ ScholarMind โ€” ็”Ÿไบง็บงๅญฆๆœฏ็Ÿฅ่ฏ†ๅบ“้—ฎ็ญ” & ็Ÿฅ่ฏ†ๅ›พ่ฐฑ็ณป็ปŸ
## System Overview
ScholarMind is a production-grade intelligent knowledge system for **1,000+ academic PDF papers**, integrating:
- **Deep PDF parsing**: high-accuracy OCR (formulas/tables/figures) built on the MinerU 2.5 VLM
- **Automatic knowledge-graph construction**: entities and relations are extracted from papers to build a domain knowledge graph
- **Hybrid retrieval QA**: three-way fusion of GraphRAG, dense vector retrieval, and BM25 sparse retrieval
- **Multi-model support**: both local deployment (vLLM/Ollama) and external APIs (OpenAI/Anthropic/DeepSeek)
- **Agent orchestration**: multi-agent collaboration on LangGraph, with multi-hop reasoning
> **Headline metric**: a single A100 80G can fully parse 1,000 papers (~10,000 pages) in roughly 80 minutes
---
## Table of Contents
1. [System Architecture Overview](#1-system-architecture-overview)
2. [PDF Parsing Layer — MinerU Pipeline](#2-pdf-parsing-layer)
3. [Knowledge Extraction Layer — Entity & Relation Extraction](#3-knowledge-extraction-layer)
4. [Knowledge Graph Layer — Graph Construction & Storage](#4-knowledge-graph-layer)
5. [Index Layer — Multi-Index Construction](#5-index-layer)
6. [Retrieval Layer — Hybrid Retrieval & Reranking](#6-retrieval-layer)
7. [Agent Orchestration Layer — Intelligent QA](#7-agent-orchestration-layer)
8. [Unified LLM Access Layer](#8-unified-llm-access-layer)
9. [Deployment Architecture](#9-deployment-architecture)
10. [Technology Choices Compared](#10-technology-choices-compared)
11. [Key Papers & Open-Source Projects](#11-key-papers--open-source-projects)
12. [Project Structure](#12-project-structure)
---
## 1. System Architecture Overview
```
                    ScholarMind system architecture

User layer (Web UI / API clients)
        │
        ▼
FastAPI gateway
  /upload  /query  /graph  /status  /chat · WebSocket · SSE
        │
        ▼
Agent orchestration layer (LangGraph)
  Router agent (intent classification)    Retrieval agent (hybrid retrieval)
  Reasoning agent (multi-hop reasoning)   Graph agent (graph queries)
  Summary agent (answer generation)
        │
        ▼
Unified LLM access layer (LiteLLM Proxy)
  vLLM (local)   Ollama (local)   OpenAI/Claude (API)   DeepSeek (API)   Gemini (API)
        │
        ▼
Retrieval layer (hybrid retrieval)
  Dense vectors (Qdrant)   Sparse BM25 (Qdrant)   Graph queries (Neo4j)
  → RRF / weighted fusion → cross-encoder reranker
        │
        ▼
Index layer (multi-index)
  Vector index: Qdrant, dense + sparse
  Knowledge-graph index: Neo4j 5.x, entities/relations
  RAPTOR hierarchical summary tree: recursive clustering → summarize → re-embed
                                    (Paper → Section → Paragraph)
        │
        ▼
Knowledge extraction layer
  Entity extraction (NER): GLiNER 440M, zero-shot, custom labels
  Relation extraction (RE): LLMGraphTransformer + Graphusion fusion & dedup
        │
        ▼
PDF parsing layer (MinerU pipeline)
  PDF queue (Celery + Redis) ─▶ MinerU 2.5 VLM (vLLM backend, ~2 pages/s:
  layout + OCR + formulas + tables) ─▶ format conversion (JSON → MD,
  structuring) ─▶ metadata extraction (title/authors/DOI, sections/citations)
        │
        ▼
Storage layer
  PostgreSQL (metadata)   Qdrant (vector index)   Neo4j (knowledge graph)
  Redis (cache/queues)    MinIO/S3 (PDF storage)
```
---
## 2. PDF Parsing Layer
### 2.1 Technology Choice: MinerU 2.5 VLM
| Metric | MinerU 2.5 | Marker | Nougat | PyMuPDF |
|------|-----------|--------|--------|---------|
| Academic-paper text accuracy (edit distance ↓) | **0.047** | 0.080 | 0.365 | N/A (digital PDFs only) |
| Formula recognition (CDM ↑) | **88.46** | 17.6 | 15.1 | ❌ |
| Table recognition (TEDS ↑) | **88.22** | 67.6 | 39.9 | ❌ |
| Throughput (A100, pages/s) | **2.12** | ~5 | ~0.5 | ~100 |
| Scanned documents | ✅ | ⚠️ | ❌ | ❌ |
> **Benchmark source**: OmniDocBench (CVPR 2025, arXiv:2412.07626)
### 2.2 Architecture Design
```python
# Hybrid routing strategy: digital PDFs go to PyMuPDF (fast),
# complex PDFs go to MinerU 2.5 (accurate)
class PDFRouter:
    """Pick a parsing engine based on PDF characteristics."""

    def route(self, pdf_path: str) -> str:
        import fitz  # PyMuPDF

        doc = fitz.open(pdf_path)
        avg_chars = sum(len(p.get_text()) for p in doc) / len(doc)
        has_images = any(p.get_images() for p in doc)
        if avg_chars > 500 and not has_images:
            return "pymupdf_fast"     # pure digital PDF: PyMuPDF parses in seconds
        elif avg_chars > 200:
            return "mineru_pipeline"  # digital PDF with figures/tables: pipeline mode (CPU)
        else:
            return "mineru_vlm"       # scanned file / complex layout: VLM mode (GPU)
```
### 2.3 Batch Processing Pipeline
```
PDF files (upload / batch)
        │
        ▼
PDF router (feature detection)
        │
        ▼
Celery worker pool ◀──▶ Redis task queue
  W1: MinerU VLM
  W2: MinerU VLM
  W3: MinerU Pipeline
  W4: PyMuPDF
        │
        ▼
Structured output (JSON + MD) ──▶ Metadata extraction

Monitoring: progress / failure retries / throughput stats

Key configuration:
- MinerU VLM workers: one process per GPU, vLLM async batching
- gpu_memory_utilization: 0.7 (keep a 30% OOM safety margin)
- max_num_batched_tokens: 16384 (raises GPU utilization)
- Failure retries: at most 3, with exponential backoff
- Timeout: 300 s per PDF
```
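The routing and retry policy above can be sketched in plain Python. The queue names and the `dispatch`/`backoff_seconds` helpers below are illustrative assumptions, not the project's actual Celery task API:

```python
# Hypothetical queue layout mirroring the worker pool above (W1-W4).
QUEUES = {
    "mineru_vlm": "gpu_vlm",            # W1/W2: one worker process per GPU
    "mineru_pipeline": "cpu_pipeline",  # W3: MinerU pipeline mode (CPU)
    "pymupdf_fast": "cpu_fast",         # W4: PyMuPDF fast path
}


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff for retries: 2 s, 4 s, 8 s, ... capped at 60 s."""
    return min(base ** attempt, cap)


def dispatch(engine: str) -> dict:
    """Task options for a routed PDF: target queue, retry budget, hard timeout."""
    return {
        "queue": QUEUES[engine],
        "max_retries": 3,    # at most 3 retries, as configured above
        "time_limit": 300,   # 300 s per-PDF upper bound
    }
```

A real deployment would pass these options to the task framework (e.g. as Celery `apply_async` arguments), with `backoff_seconds(attempt)` as the retry countdown.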
### 2.4 Output Data Model
```python
from pydantic import BaseModel
from typing import List, Optional
from enum import Enum


class ContentType(str, Enum):
    TITLE = "title"
    TEXT = "text"
    TABLE = "table"
    EQUATION = "equation"
    EQUATION_BLOCK = "equation_block"
    IMAGE = "image"
    CODE = "code"
    LIST = "list"
    REFERENCE = "reference"


class ContentBlock(BaseModel):
    type: ContentType
    content: str                  # Markdown/LaTeX/HTML
    page_idx: int
    bbox: List[float]             # [x0, y0, x1, y1]
    reading_order: int
    section_hierarchy: List[str]  # e.g. ["3", "3.1", "Methods"]


class PaperMetadata(BaseModel):
    paper_id: str
    title: str
    authors: List[str]
    abstract: str
    doi: Optional[str]
    year: Optional[int]
    venue: Optional[str]
    keywords: List[str]
    references: List[str]         # titles of cited papers


class ParsedPaper(BaseModel):
    metadata: PaperMetadata
    content_blocks: List[ContentBlock]
    markdown: str
    page_count: int
    parse_engine: str             # "mineru_vlm" | "mineru_pipeline" | "pymupdf"
    parse_time_seconds: float
```
---
## 3. Knowledge Extraction Layer
### 3.1 Two-Stage Extraction Strategy
```
Knowledge extraction pipeline

Stage 1: fast entity extraction (GLiNER, local, 440M parameters)
  Input:  paper text chunks
  Model:  urchade/gliner_large-v2.1 (zero-shot NER)
  Labels: [Author, Method, Dataset, Metric, Task,
           Model, Concept, Venue, Score, Tool]
  Output: [(text, label, score, span), ...]
  Speed:  ~1000 chunks/min (CPU), ~5000 chunks/min (GPU)
        │
        ▼
Stage 2: LLM relation extraction (LLMGraphTransformer, local or API)
  Input:  text chunk + Stage 1 entity hints
  Relation types: [PROPOSED_BY, USED_FOR, EVALUATED_ON,
                   TRAINED_WITH, COMPARED_TO, PART_OF, ACHIEVED_SCORE,
                   HYPONYM_OF, CITED_BY, IMPROVES_ON]
  Local:  Ollama (Qwen2.5-14B) or vLLM (Llama-3.1-8B)
  API:    GPT-4o-mini or DeepSeek-V3
  Output: [(head, relation, tail, properties), ...]
        │
        ▼
Stage 3: Graphusion fusion (entity normalization + conflict resolution)
  - Embedding-similarity merging: "NMT" ↔ "neural machine translation"
  - LLM conflict resolution: contradictory relations on the same entity pair
  - New-triple inference: fill in missing relations from context
```
### 3.2 Entity-Relation Schema for Academic Papers
Entity types (node types):

| Entity type | Description | Properties |
|-------------|-------------|------------|
| Paper | paper | title, year, doi |
| Author | author | name, affiliation |
| Method | method/algorithm | name, description |
| Dataset | dataset | name, size, domain |
| Task | task | name, domain |
| Metric | evaluation metric | name, value |
| Model | concrete model instance | name, params |
| Concept | academic concept | name, definition |
| Tool | tool/framework | name, version |
| Venue | publication venue | name, type |

Relation types (edge types):

| Relation type | Description (Head → Tail) |
|---------------|---------------------------|
| PROPOSED_BY | Method → Author (method proposed by an author) |
| PUBLISHED_IN | Paper → Venue (paper published at a conference/journal) |
| USED_FOR | Method → Task (method applied to a task) |
| EVALUATED_ON | Method → Dataset (method evaluated on a dataset) |
| ACHIEVED_SCORE | Method → Metric (method reaches a metric value) |
| TRAINED_WITH | Model → Dataset (model trained on a dataset) |
| COMPARED_TO | Method → Method (comparison between methods) |
| IMPROVES_ON | Method → Method (method A improves on method B) |
| PART_OF | Concept → Concept (concept hierarchy) |
| CITES | Paper → Paper (citation) |
| AUTHORED_BY | Paper → Author (the paper's authors) |
| HYPONYM_OF | Concept → Concept (hyponym/hypernym relation) |
| USES_TOOL | Method → Tool (method uses a tool) |
### 3.3 Core Extraction Code
```python
from gliner import GLiNER
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_core.documents import Document


class KnowledgeExtractor:
    """Two-stage knowledge extractor."""

    def __init__(self, llm_backend: str = "local_ollama"):
        # Stage 1: fast NER
        self.ner_model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")
        self.entity_labels = [
            "author", "method", "dataset", "metric",
            "task", "model", "concept", "tool", "venue", "score"
        ]
        # Stage 2: LLM relation extraction
        self.llm = self._init_llm(llm_backend)
        self.graph_transformer = LLMGraphTransformer(
            llm=self.llm,
            allowed_nodes=["Author", "Method", "Dataset", "Metric", "Task",
                           "Model", "Concept", "Tool", "Venue"],
            allowed_relationships=[
                "PROPOSED_BY", "USED_FOR", "EVALUATED_ON", "ACHIEVED_SCORE",
                "TRAINED_WITH", "COMPARED_TO", "IMPROVES_ON", "PART_OF",
                "CITES", "AUTHORED_BY", "HYPONYM_OF", "USES_TOOL", "PUBLISHED_IN"
            ],
            node_properties=["description", "year"],
            relationship_properties=["score_value", "metric_name", "confidence"],
            strict_mode=True,
        )

    def _init_llm(self, backend: str):
        """Unified LLM initialization — supports local and external-API backends."""
        if backend == "local_ollama":
            from langchain_community.llms import Ollama
            return Ollama(model="qwen2.5:14b-instruct", temperature=0)
        elif backend == "local_vllm":
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(
                base_url="http://localhost:8000/v1",
                api_key="token",
                model="meta-llama/Llama-3.1-8B-Instruct",
                temperature=0,
            )
        elif backend == "openai":
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(model="gpt-4o-mini", temperature=0)
        elif backend == "deepseek":
            import os
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(
                base_url="https://api.deepseek.com/v1",
                api_key=os.environ["DEEPSEEK_API_KEY"],
                model="deepseek-chat",
                temperature=0,
            )
        raise ValueError(f"unknown LLM backend: {backend}")

    async def extract(self, text: str, paper_id: str) -> dict:
        """Run the two extraction stages."""
        # Stage 1: GLiNER fast NER
        entities = self.ner_model.predict_entities(
            text, self.entity_labels, threshold=0.5
        )
        # Stage 2: LLM relation extraction (pass entities in as hints)
        entity_hint = ", ".join(f"{e['text']}({e['label']})" for e in entities[:20])
        doc = Document(
            page_content=text,
            metadata={"paper_id": paper_id, "entity_hints": entity_hint},
        )
        graph_docs = await self.graph_transformer.aconvert_to_graph_documents([doc])
        return {
            "entities": entities,
            "graph_documents": graph_docs,
            "paper_id": paper_id,
        }
```
---
## 4. Knowledge Graph Layer
### 4.1 Graph Database Choice: Neo4j 5.x
| Graph database | License | Query language | Python driver | Ecosystem | Suitable scale |
|---------|--------|---------|-----------|---------|---------|
| **Neo4j 5.x** | Community, AGPL | Cypher | `neo4j` | native LangChain/LlamaIndex | <100M nodes |
| ArangoDB | Apache 2.0 | AQL | `python-arango` | multi-model (document + graph) | <100M nodes |
| NebulaGraph | Apache 2.0 | nGQL | `nebula3-python` | native LlamaIndex | 1B+ nodes |
| Kuzu | MIT | Cypher | `kuzu` | embedded, lightweight | <10M nodes |
> **Recommended: Neo4j 5.x** — the most complete native LangChain/LlamaIndex integration and the most mature Cypher ecosystem; a good fit for the 1,000-paper scale
### 4.2 Graph Data Model
```cypher
// ===== Nodes =====
(:Paper {id, title, year, doi, venue, abstract, embedding})
(:Author {id, name, affiliation, h_index})
(:Method {id, name, description, year_proposed, embedding})
(:Dataset {id, name, domain, size, description})
(:Task {id, name, domain, description})
(:Metric {id, name, description})
(:Concept {id, name, definition, embedding})
// ===== Relationships =====
(:Paper)-[:AUTHORED_BY {order}]->(:Author)
(:Paper)-[:PUBLISHED_IN {year}]->(:Venue)
(:Paper)-[:CITES]->(:Paper)
(:Paper)-[:PROPOSES]->(:Method)
(:Method)-[:USED_FOR]->(:Task)
(:Method)-[:EVALUATED_ON {score, metric}]->(:Dataset)
(:Method)-[:IMPROVES_ON {delta, metric}]->(:Method)
(:Method)-[:COMPARED_TO {result}]->(:Method)
(:Concept)-[:PART_OF]->(:Concept)
(:Concept)-[:HYPONYM_OF]->(:Concept)
// ===== Indexes =====
CREATE VECTOR INDEX paper_embedding FOR (p:Paper) ON (p.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
CREATE VECTOR INDEX method_embedding FOR (m:Method) ON (m.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
CREATE FULLTEXT INDEX paper_fulltext FOR (p:Paper) ON EACH [p.title, p.abstract];
```
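As a sketch of how normalized triples could be upserted against this schema, the helper below builds a parameterized `MERGE` statement. The function name and label defaults are assumptions; the relation type is interpolated from a whitelist because Cypher cannot parameterize relationship types (labels would need the same treatment in real code):

```python
# Relation whitelist taken from the schema in section 3.2.
ALLOWED_RELS = {
    "PROPOSED_BY", "PUBLISHED_IN", "USED_FOR", "EVALUATED_ON",
    "ACHIEVED_SCORE", "TRAINED_WITH", "COMPARED_TO", "IMPROVES_ON",
    "PART_OF", "CITES", "AUTHORED_BY", "HYPONYM_OF", "USES_TOOL",
}


def triple_to_cypher(head: str, rel: str, tail: str,
                     head_label: str = "Method", tail_label: str = "Dataset"):
    """Return (query, params) for upserting one normalized triple with MERGE."""
    if rel not in ALLOWED_RELS:
        raise ValueError(f"relation not in schema: {rel}")
    query = (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[:{rel}]->(t)"
    )
    return query, {"head": head, "tail": tail}
```

The resulting pair would be passed to `session.run(query, **params)` with the official `neo4j` driver; `MERGE` makes repeated ingestion of the same triple idempotent.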
### 4.3 Graph Construction Pipeline
```
Parsed papers ──▶ Knowledge extraction ──▶ Triple normalization ──▶ Neo4j write
                                                  │
                                                  ▼
                                  Graphusion fusion engine
                                  1. Entity normalization
                                     - embedding similarity > 0.92
                                     - LLM-confirmed merges
                                       ("BERT" = "bert model")
                                  2. Relation conflict resolution
                                     - multiple relations on one entity pair
                                     - keep the highest-confidence one
                                  3. Missing-relation inference
                                     - from graph-structure patterns
                                     - completed by the LLM
```
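Step 1 of the fusion engine — embedding-similarity merging at the 0.92 threshold — can be sketched as a greedy pass that maps each surface form to a canonical entity; the follow-up LLM confirmation is omitted, and `merge_entities` is an illustrative helper rather than the Graphusion API:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def merge_entities(embeddings, threshold=0.92):
    """Greedy normalization: map each entity name to the first previously
    seen entity whose embedding clears the similarity threshold."""
    canonical = {}  # surface name -> canonical name
    kept = []       # (name, vector) of canonical entities seen so far
    for name, vec in embeddings.items():
        for cname, cvec in kept:
            if cosine(vec, cvec) >= threshold:
                canonical[name] = cname
                break
        else:
            canonical[name] = name
            kept.append((name, vec))
    return canonical
```

In the real pipeline the vectors would come from the same embedding model used for the vector index, and candidate merges above the threshold would still be confirmed by the LLM before rewriting the graph.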
### 4.4 Graph Visualization Options
```python
# Option 1: Neo4j Browser (development)
# built-in Cypher queries + interactive graph visualization

# Option 2: pyvis / vis-network (frontend integration)
# pip install pyvis
from pyvis.network import Network


def visualize_subgraph(nodes, edges, output_path="graph.html"):
    net = Network(height="800px", width="100%", directed=True)
    color_map = {
        "Method": "#ff6b6b", "Dataset": "#4ecdc4",
        "Task": "#45b7d1", "Author": "#96ceb4",
        "Paper": "#ffeaa7", "Concept": "#dfe6e9",
    }
    for node in nodes:
        net.add_node(node["id"], label=node["name"],
                     color=color_map.get(node["type"], "#95a5a6"))
    for edge in edges:
        net.add_edge(edge["from"], edge["to"], label=edge["type"])
    net.show(output_path)


# Option 3: React + D3-force (production frontend)
# recommended: react-force-graph or neo4j-viz
```
---
## 5. Index Layer
### 5.1 Three-Way Index Architecture

| | Vector index | Knowledge-graph index | RAPTOR tree |
|---|---|---|---|
| Backend | Qdrant (dense + sparse) | Neo4j (Cypher + vector index) | Hierarchical summaries (recursive clustering → summarize → re-embed) |
| Best for | Factual queries, precise retrieval, similar papers | Multi-hop reasoning, relation tracing, comparative analysis | Global overviews, topic summaries, trend analysis |
### 5.2 Document Chunking Strategy
```python
class AcademicChunker:
    """Chunker specialized for academic papers — preserves section hierarchy."""

    def __init__(self, chunk_size: int = 256, overlap: int = 50):
        self.chunk_size = chunk_size  # 256 tokens (empirically best, arXiv:2502.11371)
        self.overlap = overlap

    def chunk(self, parsed_paper: ParsedPaper) -> list:
        chunks = []
        for block in parsed_paper.content_blocks:
            if block.type == ContentType.TABLE:
                # keep each table as one complete chunk, with a descriptive tag
                chunks.append({
                    "text": f"[TABLE] {block.content}",
                    "metadata": {
                        "paper_id": parsed_paper.metadata.paper_id,
                        "type": "table",
                        "section": block.section_hierarchy,
                        "page": block.page_idx,
                    }
                })
            elif block.type == ContentType.EQUATION_BLOCK:
                # equation block plus its surrounding context
                chunks.append({
                    "text": f"[EQUATION] {block.content}",
                    "metadata": {
                        "paper_id": parsed_paper.metadata.paper_id,
                        "type": "equation",
                        "section": block.section_hierarchy,
                    }
                })
            else:
                # plain text: fixed-size chunks aligned to sentence boundaries
                text_chunks = self._split_text(block.content)
                for tc in text_chunks:
                    chunks.append({
                        "text": tc,
                        "metadata": {
                            "paper_id": parsed_paper.metadata.paper_id,
                            "type": block.type.value,
                            "section": block.section_hierarchy,
                            "page": block.page_idx,
                        }
                    })
        return chunks

    def _split_text(self, text: str) -> list:
        """Split on sentence boundaries, keeping chunks near 256 tokens."""
        import re

        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks, current = [], []
        current_len = 0
        for sent in sentences:
            sent_len = len(sent.split())  # simplified token count
            if current_len + sent_len > self.chunk_size and current:
                chunks.append(" ".join(current))
                # carry the overlap into the next chunk
                overlap_sents = []
                overlap_len = 0
                for s in reversed(current):
                    if overlap_len + len(s.split()) > self.overlap:
                        break
                    overlap_sents.insert(0, s)
                    overlap_len += len(s.split())
                current = overlap_sents
                current_len = overlap_len
            current.append(sent)
            current_len += sent_len
        if current:
            chunks.append(" ".join(current))
        return chunks
```
### 5.3 RAPTOR Hierarchical Summary Tree
```
Paper collection (1,000 papers)
  │
  ├── Level 0: raw text chunks (256 tokens)
  │     │
  │     ▼ SBERT embeddings → GMM clustering → UMAP dimensionality reduction
  │
  ├── Level 1: paragraph-level summaries (~50 clusters)
  │     │ LLM-generated summaries → re-embedded
  │     ▼ clustered again
  │
  ├── Level 2: topic-level summaries (~15 clusters)
  │     e.g. "directions for improving the Transformer architecture"
  │          "survey of large-scale pre-training datasets"
  │
  └── Level 3: domain-level summaries (~5 clusters)
        "major recent research directions and breakthroughs in NLP"

At query time: retrieve the most relevant nodes from all levels (collapsed-tree mode)
Advantage: answers both detail questions (Level 0) and global questions (Levels 2-3)
```
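The recursive build loop can be sketched with the embedding, clustering, and summarization steps injected as callables. The function names are illustrative; the real pipeline would plug in SBERT embeddings, UMAP + GMM clustering, and an LLM summarizer:

```python
def build_level(chunks, embed, cluster, summarize):
    """One RAPTOR level: embed the chunks, group them, summarize each group,
    and return the summaries as the next level's chunks."""
    vectors = [embed(c) for c in chunks]
    groups = cluster(vectors)  # list of index lists, one per cluster
    return [summarize([chunks[i] for i in idxs]) for idxs in groups]


def build_tree(chunks, embed, cluster, summarize, levels=3):
    """Recursively summarize until the level budget or a single root is reached."""
    tree = [chunks]  # tree[0] = Level 0 raw chunks
    for _ in range(levels):
        if len(tree[-1]) <= 1:
            break
        tree.append(build_level(tree[-1], embed, cluster, summarize))
    return tree
```

For collapsed-tree retrieval, every node from every level of `tree` is embedded into the same vector collection and searched jointly.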
---
## 6. ๆฃ€็ดขๅฑ‚
### 6.1 ๆททๅˆๆฃ€็ดขๆžถๆž„
```
                User query
                    │
            ┌───────┴───────┐
            ▼               ▼
     ┌─────────────┐ ┌──────────────┐
     │    HyDE     │ │ Query router │
     │ hypothetical│ │ (Router LLM) │
     │ doc + embed │ │              │
     └──────┬──────┘ └──────┬───────┘
            │               │
            │     ┌─────────┼─────────┐
            │     ▼         ▼         ▼
            │  factual  reasoning  global
            │     │         │         │
            ▼     ▼         ▼         ▼
     ┌──────────┐ ┌──────────┐ ┌──────────┐
     │ Vector + │ │  Graph   │ │  RAPTOR  │
     │  BM25    │ │ traversal│ │ summary  │
     │ (Qdrant  │ │ (Neo4j   │ │ tree     │
     │  hybrid) │ │  Cypher) │ │ (global) │
     └────┬─────┘ └────┬─────┘ └────┬─────┘
          │            │            │
          └────────────┴────────────┘
                       │
              ┌────────▼────────┐
              │   RRF fusion    │
              │  (Reciprocal    │
              │   Rank Fusion)  │
              └────────┬────────┘
                       │
              ┌────────▼────────┐
              │  Cross-Encoder  │
              │    reranker     │
              │  (bge-reranker  │
              │    -large)      │
              └────────┬────────┘
                       │
                 Top-5 results
                       │
              ┌────────▼────────┐
              │  LLM answer     │
              │  generation +   │
              │  cited sources  │
              └─────────────────┘
```
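At the fusion step, Reciprocal Rank Fusion needs only each retriever's ranking, not score values on a shared scale — which is what makes it suitable for merging vector, graph, and RAPTOR results. A minimal standalone sketch:

```python
def rrf(rankings, k=60):
    """rankings: one ranked list of document ids per retriever.
    Each document accumulates 1 / (k + rank + 1) per list it appears in;
    ids are returned sorted by the summed score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

With `k=0` the effect is easy to read off by hand: a document ranked first by one retriever and second by another beats one that only a single retriever ranked first. The usual `k=60` damps the advantage of top ranks so no single list dominates.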
### 6.2 Core Retrieval Code
```python
from qdrant_client import QdrantClient, models
from neo4j import GraphDatabase
from sentence_transformers import CrossEncoder, SentenceTransformer


class HybridRetriever:
    """Three-way hybrid retriever."""

    def __init__(self):
        self.qdrant = QdrantClient("localhost", port=6333)
        self.neo4j = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
        self.reranker = CrossEncoder("BAAI/bge-reranker-large")
        self.embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # example embedder

    async def retrieve(self, query: str, mode: str = "hybrid", top_k: int = 20) -> list:
        """
        mode: "factual" | "reasoning" | "global" | "hybrid"
        """
        results = []
        if mode in ("factual", "hybrid"):
            # 1. Dense vector search (a sparse/BM25 leg can be added via Qdrant named vectors)
            query_vec = self.embed_model.encode(query)
            vec_results = self.qdrant.search(
                collection_name="papers",
                query_vector=models.NamedVector(name="dense", vector=query_vec.tolist()),
                limit=top_k,
                with_payload=True,
            )
            results.extend([{"text": r.payload["text"], "score": r.score,
                             "source": "vector", "metadata": r.payload} for r in vec_results])
        if mode in ("reasoning", "hybrid"):
            # 2. Graph search — entity and relation paths
            results.extend(self._graph_search(query, limit=top_k // 2))
        if mode in ("global", "hybrid"):
            # 3. RAPTOR hierarchical-summary search
            results.extend(self._raptor_search(query, limit=top_k // 3))
        # 4. RRF fusion
        fused = self._rrf_fusion(results)
        # 5. Cross-encoder reranking
        reranked = self._rerank(query, fused[:top_k])
        return reranked[:5]

    def _graph_search(self, query: str, limit: int = 10) -> list:
        """Neo4j subgraph search: the vector index finds the most relevant
        entity nodes, then Cypher expands their 1-2 hop neighborhoods."""
        cypher = """
        CALL db.index.vector.queryNodes('method_embedding', $limit, $query_vec)
        YIELD node, score
        MATCH (node)-[r]-(neighbor)
        RETURN node, r, neighbor, score
        ORDER BY score DESC LIMIT $limit
        """
        with self.neo4j.session() as session:
            result = session.run(cypher, query_vec=self.embed_model.encode(query).tolist(), limit=limit)
            return [{"text": self._format_graph_result(r), "score": r["score"],
                     "source": "graph"} for r in result]

    def _rrf_fusion(self, results: list, k: int = 60) -> list:
        """Reciprocal Rank Fusion — merge multi-source results.
        Ranks are computed per source so no single retriever's raw
        score scale dominates the fused ordering."""
        doc_scores = {}
        by_source = {}
        for r in results:
            by_source.setdefault(r["source"], []).append(r)
        for source_results in by_source.values():
            ranked = sorted(source_results, key=lambda x: x["score"], reverse=True)
            for rank, r in enumerate(ranked):
                doc_key = r["text"][:200]  # dedup key
                if doc_key not in doc_scores:
                    doc_scores[doc_key] = {"result": r, "rrf_score": 0.0}
                doc_scores[doc_key]["rrf_score"] += 1.0 / (k + rank + 1)
        return [v["result"] | {"score": v["rrf_score"]}
                for v in sorted(doc_scores.values(), key=lambda x: x["rrf_score"], reverse=True)]

    def _rerank(self, query: str, results: list) -> list:
        """Cross-encoder reranking with BAAI/bge-reranker-large."""
        pairs = [(query, r["text"]) for r in results]
        scores = self.reranker.predict(pairs)
        for r, s in zip(results, scores):
            r["rerank_score"] = float(s)
        return sorted(results, key=lambda x: x["rerank_score"], reverse=True)
```
---
## 7. Agent Orchestration Layer
### 7.1 LangGraph Multi-Agent Architecture
```
                 ┌──────────────┐
                 │  User query  │
                 └──────┬───────┘
                        │
                ┌───────▼────────┐
                │  Router agent  │
                │ (intent class.)│
                └───────┬────────┘
                        │
      ┌─────────────────┼─────────────────┐
      │                 │                 │
┌─────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
│ Simple QA  │   │  Multi-hop  │   │   Global    │
│            │   │  reasoning  │   │  analysis   │
│ vector     │   │ graph       │   │ RAPTOR + KG │
│ retrieval  │   │ traversal   │   │ → synthesis │
│ → answer   │   │ → chained   │   │ → trend     │
│ → citations│   │   reasoning │   │   insights  │
│            │   │ → evidence  │   │             │
└─────┬──────┘   └──────┬──────┘   └──────┬──────┘
      │                 │                 │
      └─────────────────┼─────────────────┘
                        │
                ┌───────▼────────┐
                │ Self-check     │
                │ agent (answer  │
                │ validation):   │
                │ Sufficient?    │
                │ Hallucinated?  │
                └───────┬────────┘
                        │
          ┌─────────────┼─────────────┐
          │ sufficient  │ insufficient│
          ▼             ▼             │
  ┌────────────┐  ┌────────────┐     │
  │ Emit answer│  │ Supplement │─────┘
  │ + citations│  │ retrieval  │  (max 3 rounds)
  │ + subgraph │  │ (more srcs)│
  └────────────┘  └────────────┘
```
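The self-check loop above terminates on either of two conditions: the validator's confidence clears the threshold, or the three-round supplement-retrieval budget is exhausted. That exit rule can be exercised in isolation; `self_check_loop` is a hypothetical stand-in that replays a sequence of validator confidences, not part of the real agent code:

```python
def self_check_loop(confidences, threshold=0.8, max_rounds=3):
    """Replay the validator's exit rule over a sequence of confidence
    scores. Returns (rounds_used, accepted): accepted is True only if
    some round cleared the threshold before the budget ran out."""
    iteration = 0
    for conf in confidences:
        iteration += 1
        if conf > threshold or iteration >= max_rounds:
            return iteration, conf > threshold
    return iteration, False
```

An answer that improves to 0.9 on the second round exits early and is accepted; one that stalls at 0.7 is forced out after round 3 and flagged as not accepted, so the caller can attach a low-confidence warning.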
### 7.2 LangGraph State Machine Definition
```python
from typing import TypedDict, Annotated, Literal

from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    query: str
    query_type: Literal["factual", "reasoning", "global"]
    retrieved_docs: list
    graph_context: list
    answer: str
    citations: list
    confidence: float
    iteration: int


def build_agent_graph():
    graph = StateGraph(AgentState)
    # Nodes
    graph.add_node("router", route_query)
    graph.add_node("retriever", hybrid_retrieve)
    graph.add_node("graph_explorer", explore_knowledge_graph)
    graph.add_node("generator", generate_answer)
    graph.add_node("validator", validate_answer)
    graph.add_node("supplementer", supplement_retrieval)
    # Edges
    graph.set_entry_point("router")
    graph.add_edge("router", "retriever")
    graph.add_edge("retriever", "graph_explorer")
    graph.add_edge("graph_explorer", "generator")
    graph.add_edge("generator", "validator")
    # Conditional edge: validation passes → end; fails → supplement retrieval (max 3 rounds)
    graph.add_conditional_edges(
        "validator",
        lambda state: "end" if state["confidence"] > 0.8 or state["iteration"] >= 3 else "supplement",
        {"end": END, "supplement": "supplementer"},
    )
    graph.add_edge("supplementer", "retriever")
    return graph.compile()


async def route_query(state: AgentState) -> AgentState:
    """LLM intent classification."""
    classification_prompt = f"""
Classify the following academic question as one of three types:
- factual: a concrete fact lookup (a method's results, a paper's authors)
- reasoning: requires multi-step reasoning (the difference between methods A and B, how a technique evolved)
- global: corpus-wide analysis (research trends in a field, major open challenges)
Question: {state['query']}
Type: """
    query_type = await llm.ainvoke(classification_prompt)
    # Initialize the iteration counter so the validator's conditional edge is safe
    return {"query_type": query_type.content.strip(), "iteration": 0}


async def validate_answer(state: AgentState) -> AgentState:
    """Self-RAG style: the LLM grades its own answer quality."""
    validation_prompt = f"""
Rate the quality of the following answer (score 0-1):
Question: {state['query']}
Answer: {state['answer']}
Retrieved evidence: {state['retrieved_docs'][:3]}
Criteria:
- Does it fully answer the question?
- Is it supported by evidence?
- Does it contain hallucinations?
Return JSON: {{"confidence": 0.X, "issues": ["..."]}}
"""
    result = await llm.ainvoke(validation_prompt)
    confidence = parse_confidence(result.content)
    return {"confidence": confidence, "iteration": state["iteration"] + 1}
```
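The validator calls `parse_confidence`, which is not defined in this document. A defensive sketch (hypothetical helper) that tolerates prose around the JSON object, since LLMs often wrap structured output in commentary:

```python
import json
import re


def parse_confidence(raw: str, default: float = 0.0) -> float:
    """Pull the {"confidence": ...} object out of an LLM reply,
    tolerating surrounding text; fall back to `default` on any failure
    so a malformed reply triggers another supplement round rather than
    crashing the graph."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return default
    try:
        return float(json.loads(match.group(0)).get("confidence", default))
    except (json.JSONDecodeError, TypeError, ValueError):
        return default
```

Defaulting to 0.0 is a deliberate fail-closed choice: an unparseable validation is treated as a failed validation.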
### 7.3 Agent Tool Set
```python
from langchain.tools import tool


@tool
def vector_search(query: str, top_k: int = 5) -> str:
    """Semantic search over the paper vector store."""
    results = retriever.search_vectors(query, top_k)
    return format_search_results(results)


@tool
def graph_query(cypher: str) -> str:
    """Run a Cypher query against the knowledge graph for entities and relations."""
    with neo4j_driver.session() as session:
        result = session.run(cypher)
        return format_graph_results(result)


@tool
def find_related_methods(method_name: str) -> str:
    """Find all methods related to a given method (improves, compares with, uses)."""
    cypher = """
    MATCH (m:Method {name: $name})-[r]-(related)
    RETURN type(r) as relation, labels(related) as type,
           related.name as name, r.score_value as score
    ORDER BY r.score_value DESC
    LIMIT 20
    """
    return execute_and_format(cypher, {"name": method_name})


@tool
def get_paper_summary(paper_id: str) -> str:
    """Fetch a paper's abstract and core contributions."""
    return paper_store.get_summary(paper_id)


@tool
def compare_methods(method_a: str, method_b: str) -> str:
    """Compare two methods' results on the datasets they share."""
    cypher = """
    MATCH (a:Method {name: $a})-[r1:EVALUATED_ON]->(d:Dataset)<-[r2:EVALUATED_ON]-(b:Method {name: $b})
    RETURN d.name as dataset, r1.score as score_a, r2.score as score_b,
           r1.metric as metric
    """
    return execute_and_format(cypher, {"a": method_a, "b": method_b})


@tool
def research_trend(topic: str, years: int = 5) -> str:
    """Analyze how a research topic has developed over the last N years."""
    raptor_results = raptor_index.search(topic, level="high")
    graph_stats = get_temporal_graph_stats(topic, years)
    return synthesize_trend(raptor_results, graph_stats)
```
---
## 8. Unified LLM Access Layer
### 8.1 Architecture
```
┌──────────────────────────────────────┐
│        LiteLLM Proxy Server          │
│  (unified OpenAI-compatible API)     │
├──────────────────────────────────────┤
│                                      │
│  model_list:                         │
│  ┌────────────────────────────┐      │
│  │ "local/qwen2.5-14b"        │──────│──▶ Ollama :11434
│  │ "local/llama-3.1-8b"       │──────│──▶ vLLM :8000
│  │ "gpt-4o-mini"              │──────│──▶ OpenAI API
│  │ "claude-3-5-sonnet"        │──────│──▶ Anthropic API
│  │ "deepseek-chat"            │──────│──▶ DeepSeek API
│  │ "gemini-2.0-flash"         │──────│──▶ Google API
│  └────────────────────────────┘      │
│                                      │
│  Features:                           │
│  - unified /chat/completions API     │
│  - automatic fallback (local → API)  │
│  - load balancing (multi-vLLM)       │
│  - rate limiting & cost tracking     │
│  - caching (reuse identical queries) │
└──────────────────────────────────────┘
```
### 8.2 LiteLLM Configuration
```yaml
# litellm_config.yaml
model_list:
  # ===== Local models =====
  - model_name: "local/qwen2.5-14b"
    litellm_params:
      model: "openai/Qwen2.5-14B-Instruct"
      api_base: "http://localhost:11434/v1"  # Ollama
      api_key: "ollama"
    model_info:
      max_tokens: 32768
      input_cost_per_token: 0  # local, free
  - model_name: "local/llama-3.1-8b"
    litellm_params:
      model: "openai/meta-llama/Llama-3.1-8B-Instruct"
      api_base: "http://localhost:8000/v1"  # vLLM
      api_key: "token"
    model_info:
      max_tokens: 131072
  # ===== External APIs =====
  - model_name: "gpt-4o-mini"
    litellm_params:
      model: "gpt-4o-mini"
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: "deepseek-chat"
    litellm_params:
      model: "deepseek/deepseek-chat"
      api_key: "os.environ/DEEPSEEK_API_KEY"

# Routing strategy
router_settings:
  routing_strategy: "latency-based-routing"  # pick the lowest-latency deployment
  num_retries: 3
  fallbacks:
    - "local/qwen2.5-14b": ["gpt-4o-mini"]   # local fails → API
    - "gpt-4o-mini": ["deepseek-chat"]       # OpenAI fails → DeepSeek

# Different models for different tasks
model_group_alias:
  "extraction": "local/qwen2.5-14b"   # knowledge extraction: local (cheap)
  "generation": "gpt-4o-mini"         # answer generation: API (high quality)
  "routing": "local/llama-3.1-8b"     # intent classification: small local model (fast)
```
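To make the `fallbacks` semantics concrete, here is a small sketch (a hypothetical helper, not part of LiteLLM) that expands the configured chains into the full try-order for a given primary model, following chained fallbacks without revisiting a model:

```python
def fallback_chain(primary, fallbacks):
    """fallbacks: a list of {model: [fallback, ...]} dicts, mirroring the
    shape used in litellm_config.yaml. Returns the ordered list of models
    to try for `primary`, breadth-first through chained fallbacks."""
    table = {}
    for entry in fallbacks:
        table.update(entry)
    chain, seen = [primary], {primary}
    queue = list(table.get(primary, []))
    while queue:
        model = queue.pop(0)
        if model in seen:
            continue  # guard against fallback cycles
        chain.append(model)
        seen.add(model)
        queue.extend(table.get(model, []))
    return chain
```

Under the configuration above, a request aimed at `local/qwen2.5-14b` would, on repeated failure, walk through `gpt-4o-mini` and then `deepseek-chat`.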
### 8.3 Unified Invocation Interface
```python
import litellm


class UnifiedLLM:
    """Unified LLM invocation layer — routes between local and API models."""

    def __init__(self, config_path: str = "litellm_config.yaml"):
        litellm.set_verbose = False
        # Enable response caching
        litellm.cache = litellm.Cache(type="redis", host="localhost", port=6379)

    async def complete(
        self,
        messages: list,
        task: str = "generation",  # extraction | generation | routing
        temperature: float = 0,
        max_tokens: int = 4096,
        stream: bool = False,
    ) -> str:
        """Unified entry point; the model is selected automatically from the task."""
        model = self._select_model(task)
        response = await litellm.acompletion(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            stream=stream,
            metadata={"task": task},  # for cost tracking
        )
        if stream:
            return response  # async generator
        return response.choices[0].message.content

    def _select_model(self, task: str) -> str:
        model_map = {
            "extraction": "local/qwen2.5-14b",
            "generation": "gpt-4o-mini",
            "routing": "local/llama-3.1-8b",
            "fusion": "gpt-4o-mini",          # Graphusion fusion needs a strong model
            "rewrite": "local/llama-3.1-8b",  # HyDE query rewriting
        }
        return model_map.get(task, "local/qwen2.5-14b")
```
---
## 9. System Deployment Architecture
### 9.1 Docker Compose Deployment
```yaml
# docker-compose.yml
version: '3.8'

services:
  # ===== Core services =====
  api:
    build: ./services/api
    ports: ["8080:8080"]
    environment:
      - REDIS_URL=redis://redis:6379
      - QDRANT_URL=http://qdrant:6333
      - NEO4J_URL=bolt://neo4j:7687
      - LITELLM_URL=http://litellm:4000
    depends_on: [redis, qdrant, neo4j, litellm]

  # ===== PDF parsing service =====
  mineru-worker:
    build: ./services/mineru
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MINERU_MODEL_SOURCE=local
      - CELERY_BROKER_URL=redis://redis:6379
    volumes:
      - mineru-models:/models
      - pdf-storage:/pdfs

  # ===== LLM services =====
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]
    volumes:
      - ./config/litellm_config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]

  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ollama-data:/root/.ollama

  # ===== Storage services =====
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
    volumes:
      - qdrant-data:/qdrant/storage

  neo4j:
    image: neo4j:5-community
    ports: ["7474:7474", "7687:7687"]
    environment:
      - NEO4J_AUTH=neo4j/password
      - NEO4J_PLUGINS=["apoc", "graph-data-science"]
    volumes:
      - neo4j-data:/data

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  postgres:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=scholarmind
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data

  minio:
    image: minio/minio:latest
    ports: ["9000:9000", "9001:9001"]
    command: server /data --console-address ":9001"
    volumes:
      - minio-data:/data

volumes:
  qdrant-data:
  neo4j-data:
  postgres-data:
  minio-data:
  ollama-data:
  mineru-models:
  pdf-storage:
```
### 9.2 Recommended Hardware
```
┌──────────────────────────────────────────────────────────────────┐
│              Recommended hardware configurations                 │
├────────────────┬──────────────┬──────────────┬──────────────────┤
│ Component      │ Development  │ Prod (small) │ Prod (large)     │
├────────────────┼──────────────┼──────────────┼──────────────────┤
│ PDF-parsing GPU│ RTX 3090     │ A100 80G     │ 2×A100 80G       │
│ LLM-infer. GPU │ RTX 4090     │ A100 80G     │ 2×H100 80G       │
│ CPU            │ 16 cores     │ 32 cores     │ 64 cores         │
│ RAM            │ 64GB         │ 128GB        │ 256GB            │
│ SSD            │ 1TB NVMe     │ 2TB NVMe     │ 4TB NVMe         │
├────────────────┼──────────────┼──────────────┼──────────────────┤
│ 1,000-paper    │ ~3 hours     │ ~80 minutes  │ ~40 minutes      │
│ parse time     │              │              │                  │
│ QA latency     │ ~5s          │ ~2s          │ ~1s              │
│ Concurrent     │ 1-5          │ 10-50       │ 50-200           │
│ users          │              │              │                  │
└────────────────┴──────────────┴──────────────┴──────────────────┘
```
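The ~80-minute figure for the small production tier follows directly from MinerU 2.5's quoted throughput of roughly 2.12 pages/s on a single A100, applied to the corpus size of ~10,000 pages. A quick sanity check (`parse_minutes` is an illustrative helper):

```python
def parse_minutes(pages, pages_per_second):
    """Wall-clock estimate in minutes for parsing the whole corpus."""
    return pages / pages_per_second / 60


# 1,000 papers ≈ 10,000 pages at ~2.12 pages/s on one A100
minutes = parse_minutes(10_000, 2.12)  # ≈ 78.6 minutes, i.e. the quoted ~80 min
```

The ~40-minute large-tier figure then assumes near-linear scaling across two parsing GPUs.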
### 9.3 API Design
```python
from fastapi import FastAPI, UploadFile, BackgroundTasks
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI(title="ScholarMind API", version="1.0")


# ===== PDF upload & parsing =====
@app.post("/api/v1/papers/upload")
async def upload_papers(files: list[UploadFile], bg: BackgroundTasks):
    """Batch-upload PDF papers for asynchronous parsing."""
    task_ids = []
    for f in files:
        task_id = await save_and_queue(f)
        task_ids.append(task_id)
    return {"task_ids": task_ids, "status": "processing"}


@app.get("/api/v1/papers/{task_id}/status")
async def get_parse_status(task_id: str):
    """Check parsing progress."""
    return celery_app.AsyncResult(task_id).info


# ===== Knowledge-base QA =====
class QueryRequest(BaseModel):
    query: str
    mode: str = "hybrid"          # factual | reasoning | global | hybrid
    llm_backend: str = "auto"     # auto | local | openai | deepseek
    top_k: int = 5
    stream: bool = False
    include_citations: bool = True
    include_graph: bool = False   # also return the relevant subgraph?


@app.post("/api/v1/query")
async def query_knowledge_base(req: QueryRequest):
    """Knowledge-base question answering."""
    if req.stream:
        return StreamingResponse(
            agent.astream(req), media_type="text/event-stream"
        )
    result = await agent.ainvoke(req)
    return {
        "answer": result["answer"],
        "citations": result["citations"],
        "confidence": result["confidence"],
        "graph_snippet": result.get("graph_snippet"),
    }


# ===== Knowledge graph =====
@app.get("/api/v1/graph/entity/{name}")
async def get_entity(name: str, depth: int = 2):
    """Fetch an entity and its N-hop subgraph."""
    subgraph = await graph_service.get_subgraph(name, depth)
    return subgraph


@app.get("/api/v1/graph/path")
async def find_path(source: str, target: str, max_hops: int = 4):
    """Find the shortest path between two entities."""
    path = await graph_service.shortest_path(source, target, max_hops)
    return path


@app.get("/api/v1/graph/stats")
async def graph_statistics():
    """Knowledge-graph statistics."""
    return await graph_service.get_stats()


# ===== Graph visualization =====
@app.get("/api/v1/graph/visualize")
async def visualize_graph(center: str, depth: int = 2, layout: str = "force"):
    """Return visualization data (vis.js format)."""
    data = await graph_service.get_vis_data(center, depth)
    return {"nodes": data["nodes"], "edges": data["edges"]}
```
---
## 10. Technology Choices
### 10.1 Full Stack
| Layer | Component | Choice | Alternatives | Rationale |
|----|------|------|---------|---------|
| PDF parsing | OCR engine | **MinerU 2.5 VLM** | Marker, Nougat, Docling | SOTA on academic papers (0.047 edit distance), 88.46 CDM on formulas |
| PDF parsing | Fast path | **PyMuPDF** | pdfplumber | parses born-digital PDFs in seconds, no GPU required |
| Knowledge extraction | NER | **GLiNER** (440M) | spaCy, DeepKE | zero-shot, custom labels, runs locally |
| Knowledge extraction | RE | **LLMGraphTransformer** | REBEL, GLiREL, ReLiK | supports local + API LLMs, schema constraints |
| Knowledge extraction | Fusion | **Graphusion** | — | entity normalization + conflict resolution, 9.2% better than naive merging |
| Knowledge graph | Graph DB | **Neo4j 5.x** | ArangoDB, NebulaGraph | native LangChain integration, most mature Cypher ecosystem |
| Vector index | Vector DB | **Qdrant** | Milvus, Weaviate | high-performance Rust core, native hybrid search, simple deployment |
| Retrieval | Reranker | **bge-reranker-large** | Cohere, jina | open-source SOTA, no API dependency |
| Retrieval | Query expansion | **HyDE** | Query2Doc | ~+10 NDCG at zero training cost |
| Indexing | Hierarchical index | **RAPTOR** | GraphRAG Communities | suits hierarchical documents, ~+20% accuracy |
| RAG | Graph-augmented | **LightRAG** | NodeRAG, GraphRAG | 34k★, incremental updates, multiple backends |
| Agent | Orchestration | **LangGraph** | smolagents, AutoGen | stateful graphs, conditional branching, production-grade |
| LLM | Unified access | **LiteLLM** | OpenRouter | 20k★, one interface for all providers |
| LLM | Local inference | **vLLM + Ollama** | SGLang, llama.cpp | vLLM for throughput, Ollama for ease of use |
| Backend | Web framework | **FastAPI** | Flask, Django | async-native, high performance, automatic OpenAPI docs |
| Queue | Task queue | **Celery + Redis** | RQ, Dramatiq | mature and stable, distributed support |
| Storage | Object storage | **MinIO** | S3 | S3-compatible, self-hosted |
| Storage | Relational DB | **PostgreSQL** | MySQL | JSON support, full-text search |
| Frontend | Graph visualization | **react-force-graph** | vis.js, D3 | React ecosystem, 3D support |
### 10.2 Performance Estimates
```
┌──────────────────────────────────────────────────────────────────────┐
│        Estimated system performance, 1,000 papers (A100 80G)         │
├──────────────────────────┬───────────────────────────────────────────┤
│ PDF parsing              │ ~80 min (MinerU 2.5, 2.12 pages/s)        │
│ Extraction (GLiNER NER)  │ ~15 min (GPU batch)                       │
│ Extraction (LLM RE)      │ ~60 min (local 14B model)                 │
│                          │ ~30 min (GPT-4o-mini API)                 │
│ Vector index build       │ ~10 min (text-embedding-3-small)          │
│ Knowledge graph build    │ ~20 min (incl. fusion)                    │
│ RAPTOR tree build        │ ~30 min                                   │
├──────────────────────────┼───────────────────────────────────────────┤
│ Total (end to end)       │ ~3.5 h (all-local) / ~2.5 h (mixed API)   │
├──────────────────────────┼───────────────────────────────────────────┤
│ QA latency (P50)         │ ~1.5s (local LLM) / ~0.8s (API)           │
│ QA latency (P99)         │ ~4s (local LLM) / ~2s (API)               │
│ Graph query latency      │ ~200ms (2-hop subgraph)                   │
│ Vector search latency    │ ~50ms (Qdrant, 1M vectors)                │
└──────────────────────────┴───────────────────────────────────────────┘
```
---
## 11. Key Papers and Open-Source Projects
### 11.1 Core Papers
| Paper | ArXiv ID | Contribution | Rating |
|------|---------|------|--------|
| **MinerU 2.5** | 2509.22186 | unified-VLM document parsing SOTA | ⭐⭐⭐⭐⭐ |
| **OmniDocBench** | 2412.07626 | document-parsing benchmark (CVPR 2025) | ⭐⭐⭐⭐ |
| **Graphusion** | 2410.17600 | zero-shot KG construction + fusion | ⭐⭐⭐⭐⭐ |
| **GLiNER** | 2311.08526 | zero-shot NER, 440M | ⭐⭐⭐⭐⭐ |
| **SciER** | 2410.21155 | scientific-paper IE dataset + benchmark | ⭐⭐⭐⭐ |
| **ReLiK** | 2408.00103 | fast entity linking + relation extraction | ⭐⭐⭐⭐ |
| **NodeRAG** | 2504.11544 | heterogeneous-graph RAG SOTA | ⭐⭐⭐⭐⭐ |
| **LightRAG** | 2410.05779 | lightweight graph RAG, incremental updates | ⭐⭐⭐⭐⭐ |
| **Microsoft GraphRAG** | 2404.16130 | community summaries + global retrieval | ⭐⭐⭐⭐ |
| **RAPTOR** | 2401.18059 | recursive summary tree | ⭐⭐⭐⭐ |
| **Self-RAG** | 2310.11511 | self-reflective retrieval-augmented generation | ⭐⭐⭐ |
| **HyDE** | 2212.10496 | hypothetical document embeddings | ⭐⭐⭐⭐ |
| **RAG vs GraphRAG** | 2502.11371 | experiments on combining RAG + GraphRAG | ⭐⭐⭐⭐ |
| **LLM-KGC Survey** | 2510.20345 | survey of LLM-based KG construction | ⭐⭐⭐⭐ |
### 11.2 Core Open-Source Projects
| Project | GitHub | Stars | Purpose |
|------|--------|-------|------|
| **MinerU** | opendatalab/MinerU | 61k+ | deep PDF parsing |
| **LightRAG** | hkuds/lightrag | 34k+ | graph-augmented RAG |
| **RAGFlow** | infiniflow/ragflow | 36k+ | full-stack RAG platform (with UI) |
| **LiteLLM** | BerriAI/litellm | 20k+ | unified LLM proxy |
| **Neo4j LLM Graph Builder** | neo4j-labs/llm-graph-builder | 3k+ | PDF → KG → QA |
| **Kotaemon** | Cinnamon/kotaemon | 18k+ | document QA (incl. GraphRAG) |
| **Dify** | langgenius/dify | 70k+ | AI application platform |
| **LangGraph** | langchain-ai/langgraph | 10k+ | agent state-machine orchestration |
| **GLiNER** | urchade/GLiNER | 2k+ | zero-shot NER |
| **Graphusion** | irenezihuili/graphusion | 27 | KG fusion & deduplication |
| **RAPTOR** | parthsarthi03/raptor | 1.6k+ | hierarchical summary tree |
| **NodeRAG** | Terry-Xu-666/NodeRAG | 412 | heterogeneous-graph RAG |
| **Qdrant** | qdrant/qdrant | 22k+ | vector database |
| **vLLM** | vllm-project/vllm | 45k+ | high-throughput LLM inference |
---
## 12. Project Structure
```
scholarmind/
├── docker-compose.yml            # one-command deployment
├── config/
│   ├── litellm_config.yaml       # LLM routing config
│   ├── mineru_config.yaml        # MinerU parsing config
│   └── settings.py               # global settings
│
├── services/
│   ├── api/                      # FastAPI main service
│   │   ├── main.py               # entry point
│   │   ├── routers/
│   │   │   ├── papers.py         # PDF upload/parsing API
│   │   │   ├── query.py          # knowledge-base QA API
│   │   │   ├── graph.py          # knowledge-graph API
│   │   │   └── admin.py          # admin API
│   │   └── middleware/
│   │       ├── auth.py           # authentication
│   │       └── rate_limit.py     # rate limiting
│   │
│   ├── parser/                   # PDF parsing service
│   │   ├── router.py             # PDF feature-based routing
│   │   ├── mineru_worker.py      # MinerU VLM worker
│   │   ├── pymupdf_worker.py     # PyMuPDF fast parsing
│   │   ├── metadata_extractor.py # metadata extraction
│   │   └── tasks.py              # Celery task definitions
│   │
│   ├── extractor/                # knowledge extraction service
│   │   ├── ner_engine.py         # GLiNER NER
│   │   ├── re_engine.py          # LLM relation extraction
│   │   ├── fusion_engine.py      # Graphusion fusion
│   │   └── schema.py             # entity/relation schema
│   │
│   ├── graph/                    # knowledge-graph service
│   │   ├── neo4j_client.py       # Neo4j connection management
│   │   ├── graph_builder.py      # graph construction
│   │   ├── graph_query.py        # graph queries
│   │   └── visualization.py      # graph visualization
│   │
│   ├── indexer/                  # indexing service
│   │   ├── chunker.py            # academic-paper chunker
│   │   ├── vector_indexer.py     # Qdrant vector index
│   │   ├── raptor_builder.py     # RAPTOR hierarchical summary tree
│   │   └── embedder.py           # embedding-model management
│   │
│   ├── retriever/                # retrieval service
│   │   ├── hybrid_retriever.py   # three-way hybrid retrieval
│   │   ├── hyde.py               # HyDE query expansion
│   │   ├── reranker.py           # cross-encoder reranking
│   │   └── rrf.py                # RRF fusion
│   │
│   ├── agent/                    # agent orchestration service
│   │   ├── graph_definition.py   # LangGraph state machine
│   │   ├── nodes.py              # agent node definitions
│   │   ├── tools.py              # agent tool set
│   │   └── prompts.py            # prompt templates
│   │
│   └── llm/                      # unified LLM access
│       ├── unified_llm.py        # LiteLLM wrapper
│       ├── model_router.py       # task → model routing
│       └── cache.py              # LLM cache
│
├── models/                       # data models
│   ├── paper.py                  # paper model
│   ├── graph.py                  # graph model
│   └── query.py                  # query model
│
├── tests/
│   ├── test_parser.py
│   ├── test_extractor.py
│   ├── test_retriever.py
│   └── test_agent.py
│
├── scripts/
│   ├── setup_neo4j.cypher        # Neo4j initialization script
│   ├── batch_parse.py            # batch parsing script
│   └── build_index.py            # index-building script
│
├── frontend/                     # frontend (React/Next.js)
│   ├── components/
│   │   ├── ChatInterface.tsx     # QA interface
│   │   ├── GraphViewer.tsx       # knowledge-graph visualization
│   │   ├── PaperUploader.tsx     # PDF upload
│   │   └── SearchResults.tsx     # search-result display
│   └── ...
│
├── requirements.txt
├── Dockerfile
└── README.md
```
---
## Quick Start
```bash
# 1. Clone the project
git clone https://github.com/your-org/scholarmind.git
cd scholarmind

# 2. Configure environment variables
cp .env.example .env
# Edit .env: set OPENAI_API_KEY, DEEPSEEK_API_KEY, etc.

# 3. Download the MinerU models
pip install mineru
mineru-models-download -s huggingface -m all

# 4. Start all services
docker-compose up -d

# 5. Pull a local LLM (optional)
docker exec -it scholarmind-ollama ollama pull qwen2.5:14b-instruct

# 6. Batch-import papers
python scripts/batch_parse.py --input /path/to/pdfs/ --workers 4

# 7. Build the indexes
python scripts/build_index.py --vector --graph --raptor

# 8. Access the system
# API:   http://localhost:8080/docs
# Neo4j: http://localhost:7474
# MinIO: http://localhost:9001
```
---
## ่ฎธๅฏ่ฏ
MIT License
---
*ๆžถๆž„่ฎพ่ฎกๅŸบไบŽ 2024-2025 ๅนดๆœ€ๆ–ฐ็ ”็ฉถๆˆๆžœๅ’Œๅผ€ๆบๅฎž่ทต๏ผŒๆ‰€ๆœ‰่ฎบๆ–‡ๅผ•็”จๅ’ŒBenchmarkๆ•ฐๆฎๅ‡ๅฏๆบฏๆบใ€‚*