
๐Ÿ—๏ธ ScholarMind โ€” ็”Ÿไบง็บงๅญฆๆœฏ็Ÿฅ่ฏ†ๅบ“้—ฎ็ญ” & ็Ÿฅ่ฏ†ๅ›พ่ฐฑ็ณป็ปŸ

System Overview

ScholarMind is a production-grade intelligent knowledge system for collections of 1000+ academic PDF papers. It integrates:

  • Deep PDF parsing: high-accuracy OCR (formulas/tables/figures) built on MinerU 2.5 VLM
  • Automatic knowledge graph construction: entities and relations are extracted from the papers to build a domain knowledge graph
  • Hybrid retrieval QA: three-way fusion of GraphRAG + vector retrieval + BM25 sparse retrieval
  • Multi-model support: both local deployment (vLLM/Ollama) and external APIs (OpenAI/Anthropic/DeepSeek)
  • Agent orchestration: LangGraph-based multi-agent collaboration with multi-hop reasoning

Key metric: a single A100 80G completes a full parse of 1000 papers (10,000 pages) within 80 minutes
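As a back-of-envelope sanity check of that headline number (my arithmetic, assuming ~10 pages per paper as the 10,000-page figure implies):

```python
# 1000 papers at ~10 pages each, parsed in 80 minutes.
pages = 1000 * 10
seconds = 80 * 60
throughput = pages / seconds  # pages per second
print(round(throughput, 2))   # ~2.08 pg/s, in line with MinerU 2.5's ~2.12 pg/s on A100
```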


Table of Contents

  1. System Architecture Overview
  2. PDF Parsing Layer - MinerU Pipeline
  3. Knowledge Extraction Layer - Entity & Relation Extraction
  4. Knowledge Graph Layer - Graph Construction & Storage
  5. Indexing Layer - Multi-Index Construction
  6. Retrieval Layer - Hybrid Retrieval & Reranking
  7. Agent Orchestration Layer - Intelligent QA
  8. Unified LLM Access Layer
  9. System Deployment Architecture
  10. Technology Selection Comparison
  11. Key Papers & Open-Source Projects
  12. Project Structure

1. System Architecture Overview

┌──────────────────────────────────────────────────────────────────────────┐
│                      ScholarMind System Architecture                     │
├──────────────────────────────────────────────────────────────────────────┤
│  User Layer (Web UI / API calls)                                         │
│        │                                                                 │
│        ▼                                                                 │
│  FastAPI Gateway                                                         │
│      /upload  /query  /graph  /status  /chat   WebSocket / SSE           │
│        │                                                                 │
│        ▼                                                                 │
│  Agent Orchestration Layer (LangGraph)                                   │
│      Router Agent (intent classification) · Retrieval Agent (hybrid      │
│      retrieval) · Reasoning Agent (multi-hop reasoning) · Graph Agent    │
│      (graph queries) · Summary Agent (answer generation)                 │
│        │                                                                 │
│        ▼                                                                 │
│  Unified LLM Access Layer (LiteLLM Proxy)                                │
│      vLLM (local) · Ollama (local) · OpenAI/Claude (external API) ·      │
│      DeepSeek (external API) · Gemini (external API)                     │
├──────────────────────────────────────────────────────────────────────────┤
│  Retrieval Layer (Hybrid Retrieval)                                      │
│      Dense Vector (Qdrant) · Sparse BM25 (Qdrant) · Graph Query (Neo4j)  │
│      · Cross-Encoder Reranker  →  RRF / weighted fusion                  │
├──────────────────────────────────────────────────────────────────────────┤
│  Indexing Layer (Multi-Index)                                            │
│      Vector index: Qdrant, dense + sparse                                │
│      Knowledge graph index: Neo4j 5.x, entities/relations                │
│      RAPTOR hierarchical summary tree: recursive clustering → summarize  │
│        → re-embed; Paper → Section → Paragraph                           │
├──────────────────────────────────────────────────────────────────────────┤
│  Knowledge Extraction Layer                                              │
│      Entity extraction (NER): GLiNER 440M, zero-shot, custom labels      │
│      Relation extraction (RE): LLMGraphTransformer + Graphusion          │
│        fusion & deduplication                                            │
├──────────────────────────────────────────────────────────────────────────┤
│  PDF Parsing Layer (MinerU Pipeline)                                     │
│      PDF queue (Celery + Redis) ─▶ MinerU 2.5 VLM (vLLM backend,         │
│      2 pg/s; layout + OCR + formulas + tables) ─▶ format conversion      │
│      (JSON → MD, structured output) ─▶ metadata extraction (title /      │
│      authors / DOI + sections / citations)                               │
├──────────────────────────────────────────────────────────────────────────┤
│  Storage Layer                                                           │
│      PostgreSQL (metadata) · Qdrant (vector index) · Neo4j (knowledge    │
│      graph) · Redis (cache / queues) · MinIO/S3 (PDF storage)            │
└──────────────────────────────────────────────────────────────────────────┘

2. PDF Parsing Layer

2.1 Technology Choice: MinerU 2.5 VLM

| Metric | MinerU 2.5 | Marker | Nougat | PyMuPDF |
|---|---|---|---|---|
| Academic text accuracy (Edit Distance ↓) | 0.047 | 0.080 | 0.365 | N/A (digital PDFs only) |
| Formula recognition (CDM ↑) | 88.46 | 17.6 | 15.1 | ❌ |
| Table recognition (TEDS ↑) | 88.22 | 67.6 | 39.9 | ❌ |
| Throughput (A100, pages/s) | 2.12 | ~5 | ~0.5 | ~100 |
| Scanned-document support | ✅ | ⚠️ | ❌ | ❌ |

Benchmark source: OmniDocBench (CVPR 2025, arXiv:2412.07626)

2.2 Architecture Design

# ๆททๅˆ่ทฏ็”ฑ็ญ–็•ฅ๏ผšๆ•ฐๅญ—PDF่ตฐPyMuPDF(ๅฟซ), ๅคๆ‚PDF่ตฐMinerU 2.5(็ฒพ)
class PDFRouter:
    """ๆ นๆฎPDF็‰นๅพๆ™บ่ƒฝ้€‰ๆ‹ฉ่งฃๆžๅผ•ๆ“Ž"""
    
    def route(self, pdf_path: str) -> str:
        import fitz
        doc = fitz.open(pdf_path)
        avg_chars = sum(len(p.get_text()) for p in doc) / len(doc)
        has_images = any(p.get_images() for p in doc)
        
        if avg_chars > 500 and not has_images:
            return "pymupdf_fast"      # ็บฏๆ•ฐๅญ—PDF๏ผŒPyMuPDF็ง’็บง่งฃๆž
        elif avg_chars > 200:
            return "mineru_pipeline"    # ๆ•ฐๅญ—PDF+ๅ›พ่กจ๏ผŒPipelineๆจกๅผ(CPU)
        else:
            return "mineru_vlm"        # ๆ‰ซๆไปถ/ๅคๆ‚ๅธƒๅฑ€๏ผŒVLMๆจกๅผ(GPU)

2.3 Batch Processing Pipeline

 ┌────────────┐      ┌────────────┐      ┌──────────────────┐      ┌────────────┐
 │ PDF files  │─────▶│ PDF router │─────▶│ Celery worker    │─────▶│ Structured │
 │ (upload /  │      │ (feature   │      │ pool             │      │ JSON + MD  │
 │  batch)    │      │  detection)│      │  W1: MinerU VLM  │      ├────────────┤
 └────────────┘      └────────────┘      │  W2: MinerU VLM  │      │ Metadata   │
                                         │  W3: Pipeline    │      │ extraction │
 ┌────────────┐                          │  W4: PyMuPDF     │      └────────────┘
 │ Redis task │◀────────────────────────▶│                  │
 │ queue      │                          └──────────────────┘
 └────────────┘
 Monitoring: progress / failure retries / throughput statistics

Key configuration:
- MinerU VLM worker: one process per GPU, async batching via vLLM
- gpu_memory_utilization: 0.7 (keeps a 30% safety margin against OOM)
- max_num_batched_tokens: 16384 (improves GPU utilization)
- Failure retry: up to 3 attempts with exponential backoff
- Timeout: 300-second cap per PDF
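The retry policy above can be sketched in plain Python (an illustrative sketch; the real pipeline would express this through Celery options such as `autoretry_for`, `retry_backoff`, and `soft_time_limit`, and the function names here are assumptions):

```python
import time

def run_with_retry(task, *, max_attempts=3, base_delay=1.0):
    """Retry `task` up to max_attempts times with exponential backoff (1s, 2s, 4s, ...).

    The 300-second per-PDF timeout is assumed to be enforced by the caller,
    e.g. via Celery's soft_time_limit.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Toy task that fails twice, then succeeds.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient parse failure")
    return "ok"

result = run_with_retry(flaky, base_delay=0.01)
print(result)  # "ok" on the 3rd attempt
```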

2.4 Output Data Model

from pydantic import BaseModel
from typing import List, Optional
from enum import Enum

class ContentType(str, Enum):
    TITLE = "title"
    TEXT = "text"
    TABLE = "table"
    EQUATION = "equation"
    EQUATION_BLOCK = "equation_block"
    IMAGE = "image"
    CODE = "code"
    LIST = "list"
    REFERENCE = "reference"

class ContentBlock(BaseModel):
    type: ContentType
    content: str             # Markdown/LaTeX/HTML
    page_idx: int
    bbox: List[float]        # [x0, y0, x1, y1]
    reading_order: int
    section_hierarchy: List[str]  # ["3", "3.1", "Methods"]

class PaperMetadata(BaseModel):
    paper_id: str
    title: str
    authors: List[str]
    abstract: str
    doi: Optional[str] = None
    year: Optional[int] = None
    venue: Optional[str] = None
    keywords: List[str]
    references: List[str]    # titles of cited papers

class ParsedPaper(BaseModel):
    metadata: PaperMetadata
    content_blocks: List[ContentBlock]
    markdown: str
    page_count: int
    parse_engine: str        # "mineru_vlm" | "mineru_pipeline" | "pymupdf"
    parse_time_seconds: float

3. Knowledge Extraction Layer

3.1 Two-Stage Extraction Strategy

┌──────────────────────────────────────────────────────────────────┐
│                  Knowledge Extraction Pipeline                   │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Stage 1: Fast entity extraction (GLiNER, local, 440M params)    │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Input:  paper text chunks                                 │  │
│  │  Model:  urchade/gliner_large-v2.1 (zero-shot NER)         │  │
│  │  Labels: [Author, Method, Dataset, Metric, Task,           │  │
│  │           Model, Concept, Venue, Score, Tool]              │  │
│  │  Output: [(text, label, score, span), ...]                 │  │
│  │  Speed:  ~1000 chunks/min (CPU), ~5000 chunks/min (GPU)    │  │
│  └────────────────────────────────────────────────────────────┘  │
│                           │                                      │
│                           ▼                                      │
│  Stage 2: LLM relation extraction (LLMGraphTransformer,          │
│           local or API)                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  Input: text chunk + Stage-1 entity hints                  │  │
│  │  Relation types: [PROPOSED_BY, USED_FOR, EVALUATED_ON,     │  │
│  │     TRAINED_WITH, COMPARED_TO, PART_OF, ACHIEVED_SCORE,    │  │
│  │     HYPONYM_OF, CITED_BY, IMPROVES_ON]                     │  │
│  │  Local: Ollama (Qwen2.5-14B) or vLLM (Llama-3.1-8B)        │  │
│  │  API:   GPT-4o-mini or DeepSeek-V3                         │  │
│  │  Output: [(head, relation, tail, properties), ...]         │  │
│  └────────────────────────────────────────────────────────────┘  │
│                           │                                      │
│                           ▼                                      │
│  Stage 3: Graphusion fusion (entity normalization +              │
│           conflict resolution)                                   │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  - Embedding-similarity merging:                           │  │
│  │      "NMT" ↔ "neural machine translation"                  │  │
│  │  - LLM conflict resolution: contradictory relations        │  │
│  │    for the same entity pair                                │  │
│  │  - New triple inference: fill in missing relations         │  │
│  │    from context                                            │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

3.2 Academic Paper Entity-Relation Schema

Entity types (Node Types):

| Entity type | Description | Properties |
|---|---|---|
| Paper | Paper | title, year, doi |
| Author | Author | name, affiliation |
| Method | Method / algorithm | name, description |
| Dataset | Dataset | name, size, domain |
| Task | Task | name, domain |
| Metric | Evaluation metric | name, value |
| Model | Concrete model instance | name, params |
| Concept | Academic concept | name, definition |
| Tool | Tool / framework | name, version |
| Venue | Publication venue | name, type |

Relation types (Edge Types):

| Relation type | Description (Head → Tail) |
|---|---|
| PROPOSED_BY | Method → Author (method proposed by an author) |
| PUBLISHED_IN | Paper → Venue (paper published at a conference/journal) |
| USED_FOR | Method → Task (method used for a task) |
| EVALUATED_ON | Method → Dataset (method evaluated on a dataset) |
| ACHIEVED_SCORE | Method → Metric (method achieves a metric value) |
| TRAINED_WITH | Model → Dataset (model trained on a dataset) |
| COMPARED_TO | Method → Method (comparison between methods) |
| IMPROVES_ON | Method → Method (method A improves on method B) |
| PART_OF | Concept → Concept (concept hierarchy) |
| CITES | Paper → Paper (citation relation) |
| AUTHORED_BY | Paper → Author (paper authorship) |
| HYPONYM_OF | Concept → Concept (hyponym/hypernym relation) |
| USES_TOOL | Method → Tool (method uses a tool) |
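Before triples reach the graph, the schema can double as a cheap filter on LLM output. A minimal sketch of a type-signature check (the function name and dict are illustrative, not part of the codebase):

```python
# Allowed (head_type, tail_type) signatures per relation, from the schema above.
SCHEMA = {
    "PROPOSED_BY": ("Method", "Author"),
    "PUBLISHED_IN": ("Paper", "Venue"),
    "USED_FOR": ("Method", "Task"),
    "EVALUATED_ON": ("Method", "Dataset"),
    "ACHIEVED_SCORE": ("Method", "Metric"),
    "TRAINED_WITH": ("Model", "Dataset"),
    "COMPARED_TO": ("Method", "Method"),
    "IMPROVES_ON": ("Method", "Method"),
    "PART_OF": ("Concept", "Concept"),
    "CITES": ("Paper", "Paper"),
    "AUTHORED_BY": ("Paper", "Author"),
    "HYPONYM_OF": ("Concept", "Concept"),
    "USES_TOOL": ("Method", "Tool"),
}

def is_valid_triple(head_type: str, relation: str, tail_type: str) -> bool:
    """Reject extracted triples whose node types violate the schema."""
    return SCHEMA.get(relation) == (head_type, tail_type)

print(is_valid_triple("Method", "EVALUATED_ON", "Dataset"))  # True
print(is_valid_triple("Author", "EVALUATED_ON", "Dataset"))  # False
```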

3.3 Core Extraction Code

from gliner import GLiNER
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_core.documents import Document

class KnowledgeExtractor:
    """Two-stage knowledge extractor."""

    def __init__(self, llm_backend: str = "local_ollama"):
        # Stage 1: fast zero-shot NER
        self.ner_model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")
        self.entity_labels = [
            "author", "method", "dataset", "metric", 
            "task", "model", "concept", "tool", "venue", "score"
        ]
        
        # Stage 2: LLM relation extraction
        self.llm = self._init_llm(llm_backend)
        self.graph_transformer = LLMGraphTransformer(
            llm=self.llm,
            allowed_nodes=["Author","Method","Dataset","Metric","Task","Model","Concept","Tool","Venue"],
            allowed_relationships=[
                "PROPOSED_BY","USED_FOR","EVALUATED_ON","ACHIEVED_SCORE",
                "TRAINED_WITH","COMPARED_TO","IMPROVES_ON","PART_OF",
                "CITES","AUTHORED_BY","HYPONYM_OF","USES_TOOL","PUBLISHED_IN"
            ],
            node_properties=["description", "year"],
            relationship_properties=["score_value", "metric_name", "confidence"],
            strict_mode=True,
        )
    
    def _init_llm(self, backend: str):
        """็ปŸไธ€LLMๅˆๅง‹ๅŒ– โ€” ๆ”ฏๆŒๆœฌๅœฐๅ’Œๅค–้ƒจAPI"""
        if backend == "local_ollama":
            from langchain_community.llms import Ollama
            return Ollama(model="qwen2.5:14b-instruct", temperature=0)
        elif backend == "local_vllm":
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(
                base_url="http://localhost:8000/v1",
                api_key="token",
                model="meta-llama/Llama-3.1-8B-Instruct",
                temperature=0
            )
        elif backend == "openai":
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(model="gpt-4o-mini", temperature=0)
        elif backend == "deepseek":
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(
                base_url="https://api.deepseek.com/v1",
                model="deepseek-chat",
                temperature=0
            )
        else:
            raise ValueError(f"Unsupported LLM backend: {backend}")
    
    async def extract(self, text: str, paper_id: str) -> dict:
        """ไธค้˜ถๆฎตๆŠฝๅ–"""
        # Stage 1: GLiNERๅฟซ้€ŸNER
        entities = self.ner_model.predict_entities(
            text, self.entity_labels, threshold=0.5
        )
        
        # Stage 2: LLM relation extraction (pass Stage-1 entities as hints)
        entity_hint = ", ".join([f"{e['text']}({e['label']})" for e in entities[:20]])
        doc = Document(
            page_content=text,
            metadata={"paper_id": paper_id, "entity_hints": entity_hint}
        )
        graph_docs = await self.graph_transformer.aconvert_to_graph_documents([doc])
        
        return {
            "entities": entities,
            "graph_documents": graph_docs,
            "paper_id": paper_id
        }

4. Knowledge Graph Layer

4.1 Graph Database Choice: Neo4j 5.x

| Graph database | License | Query language | Python driver | Ecosystem integration | Suitable scale |
|---|---|---|---|---|---|
| Neo4j 5.x Community | AGPL | Cypher | neo4j | Native LangChain/LlamaIndex | <100M nodes |
| ArangoDB | Apache 2.0 | AQL | python-arango | Multi-model (document + graph) | <100M nodes |
| NebulaGraph | Apache 2.0 | nGQL | nebula3-python | Native LlamaIndex | 1B+ nodes |
| Kuzu | MIT | Cypher | kuzu | Embedded, lightweight | <10M nodes |

Recommended: Neo4j 5.x. It has the most complete native LangChain/LlamaIndex integration and the most mature Cypher query ecosystem, and it fits the 1000-paper scale comfortably.

4.2 Graph Data Model

// ===== Nodes =====
(:Paper {id, title, year, doi, venue, abstract, embedding})
(:Author {id, name, affiliation, h_index})
(:Method {id, name, description, year_proposed, embedding})
(:Dataset {id, name, domain, size, description})
(:Task {id, name, domain, description})
(:Metric {id, name, description})
(:Concept {id, name, definition, embedding})

// ===== Relationships =====
(:Paper)-[:AUTHORED_BY {order}]->(:Author)
(:Paper)-[:PUBLISHED_IN {year}]->(:Venue)
(:Paper)-[:CITES]->(:Paper)
(:Paper)-[:PROPOSES]->(:Method)
(:Method)-[:USED_FOR]->(:Task)
(:Method)-[:EVALUATED_ON {score, metric}]->(:Dataset)
(:Method)-[:IMPROVES_ON {delta, metric}]->(:Method)
(:Method)-[:COMPARED_TO {result}]->(:Method)
(:Concept)-[:PART_OF]->(:Concept)
(:Concept)-[:HYPONYM_OF]->(:Concept)

// ===== Indexes =====
CREATE VECTOR INDEX paper_embedding FOR (p:Paper) ON (p.embedding)
  OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
CREATE VECTOR INDEX method_embedding FOR (m:Method) ON (m.embedding)
  OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
CREATE FULLTEXT INDEX paper_fulltext FOR (p:Paper) ON EACH [p.title, p.abstract];
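Writes into this model are naturally expressed as idempotent MERGEs. A hedged sketch of turning one normalized triple into a parameterized Cypher statement (the helper is illustrative, not the project's actual writer; labels and relationship types must be interpolated because Cypher does not allow them as parameters, while entity names are passed as parameters):

```python
def triple_to_cypher(head_label: str, rel_type: str, tail_label: str) -> str:
    """Build an idempotent, parameterized MERGE for one (head, rel, tail) triple."""
    return (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[r:{rel_type}]->(t) "
        "SET r.confidence = $confidence"
    )

query = triple_to_cypher("Method", "EVALUATED_ON", "Dataset")
print(query)
# MERGE (h:Method {name: $head}) MERGE (t:Dataset {name: $tail})
# MERGE (h)-[r:EVALUATED_ON]->(t) SET r.confidence = $confidence
```

The query would then be run through the official `neo4j` driver with `session.run(query, head=..., tail=..., confidence=...)`.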

4.3 Graph Construction Pipeline

Parsed papers ──▶ Knowledge extraction ──▶ Triple normalization ──▶ Neo4j write
                                    │
                        ┌───────────┴─────────────┐
                        │ Graphusion fusion engine│
                        ├─────────────────────────┤
                        │ 1. Entity normalization │
                        │    - embedding          │
                        │      similarity > 0.92  │
                        │    - LLM-confirmed merge│
                        │    "BERT" = "bert model"│
                        │                         │
                        │ 2. Relation conflict    │
                        │    resolution           │
                        │    - same entity pair,  │
                        │      multiple relations │
                        │    - keep the highest-  │
                        │      confidence one     │
                        │                         │
                        │ 3. Missing-relation     │
                        │    inference            │
                        │    - graph structure    │
                        │      patterns           │
                        │    - LLM completion     │
                        └─────────────────────────┘
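The entity-normalization step can be sketched as alias clustering over embedding cosine similarity with a union-find structure (toy 3-d vectors for illustration; the 0.92 threshold comes from the pipeline above, everything else is an assumption):

```python
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def merge_aliases(embeddings, threshold=0.92):
    """Union-find over entity names: merge pairs whose similarity exceeds threshold."""
    parent = {name: name for name in embeddings}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in embeddings:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())

# Toy embeddings: the first two aliases should merge, the third stays separate.
emb = {
    "BERT": [0.9, 0.1, 0.0],
    "bert model": [0.89, 0.12, 0.01],
    "ResNet": [0.0, 0.1, 0.9],
}
groups = merge_aliases(emb)
print(groups)
```

In the real pipeline, borderline merges would additionally be confirmed by an LLM before the aliases are collapsed.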

4.4 Graph Visualization Options

# Option 1: Neo4j Browser (development stage)
# Built-in Cypher queries + interactive graph visualization

# Option 2: vis-network (frontend integration)
# pip install pyvis
from pyvis.network import Network

def visualize_subgraph(nodes, edges, output_path="graph.html"):
    net = Network(height="800px", width="100%", directed=True)
    color_map = {
        "Method": "#ff6b6b", "Dataset": "#4ecdc4", 
        "Task": "#45b7d1", "Author": "#96ceb4",
        "Paper": "#ffeaa7", "Concept": "#dfe6e9"
    }
    for node in nodes:
        net.add_node(node["id"], label=node["name"], 
                     color=color_map.get(node["type"], "#95a5a6"))
    for edge in edges:
        net.add_edge(edge["from"], edge["to"], label=edge["type"])
    net.write_html(output_path)  # 新版 pyvis 的 show() 依赖 notebook 参数, write_html 更稳妥


# ๆ–นๆกˆ3: React + D3-force (็”Ÿไบงๅ‰็ซฏ)
# ๆŽจ่ react-force-graph ๆˆ– neo4j-viz

5. ็ดขๅผ•ๅฑ‚

5.1 ไธ‰่ทฏ็ดขๅผ•ๆžถๆž„

                  ่งฃๆžๅŽ็š„่ฎบๆ–‡ๅ†…ๅฎน
                        โ”‚
            โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
            โ–ผ           โ–ผ           โ–ผ
     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
     โ”‚ ๅ‘้‡็ดขๅผ•  โ”‚ โ”‚ ๅ›พ่ฐฑ็ดขๅผ•  โ”‚ โ”‚ RAPTORๆ ‘ โ”‚
     โ”‚          โ”‚ โ”‚          โ”‚ โ”‚          โ”‚
     โ”‚ Qdrant   โ”‚ โ”‚ Neo4j    โ”‚ โ”‚ ๅฑ‚ๆฌกๆ‘˜่ฆ  โ”‚
     โ”‚ Dense +  โ”‚ โ”‚ Cypher + โ”‚ โ”‚ ้€’ๅฝ’่š็ฑป  โ”‚
     โ”‚ Sparse   โ”‚ โ”‚ Vector   โ”‚ โ”‚ โ†’ ๆ‘˜่ฆ    โ”‚
     โ”‚          โ”‚ โ”‚ Index    โ”‚ โ”‚ โ†’ ๅ†ๅตŒๅ…ฅ  โ”‚
     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
     
     ้€‚ๅˆ:         ้€‚ๅˆ:         ้€‚ๅˆ:
     ไบ‹ๅฎžๆŸฅ่ฏข      ๅคš่ทณๆŽจ็†      ๅ…จๅฑ€ๆฆ‚่งˆ
     ็ฒพ็กฎๆฃ€็ดข      ๅ…ณ็ณป่ฟฝๆบฏ      ไธป้ข˜ๆ€ป็ป“
     ็›ธไผผ่ฎบๆ–‡      ๅฏนๆฏ”ๅˆ†ๆž      ่ถ‹ๅŠฟๅˆ†ๆž

5.2 ๆ–‡ๆกฃๅˆ†ๅ—็ญ–็•ฅ

class AcademicChunker:
    """ๅญฆๆœฏ่ฎบๆ–‡ไธ“็”จๅˆ†ๅ—ๅ™จ โ€” ไฟ็•™็ซ ่Š‚ๅฑ‚็บง"""
    
    def __init__(self, chunk_size: int = 256, overlap: int = 50):
        self.chunk_size = chunk_size  # 256 tokens (ๅฎž้ชŒ้ชŒ่ฏๆœ€ไฝณ, arxiv:2502.11371)
        self.overlap = overlap
    
    def chunk(self, parsed_paper: ParsedPaper) -> list:
        chunks = []
        
        for block in parsed_paper.content_blocks:
            if block.type == ContentType.TABLE:
                # ่กจๆ ผไฝœไธบๅฎŒๆ•ดchunk, ้™„ๅŠ ๆ่ฟฐ
                chunks.append({
                    "text": f"[TABLE] {block.content}",
                    "metadata": {
                        "paper_id": parsed_paper.metadata.paper_id,
                        "type": "table",
                        "section": block.section_hierarchy,
                        "page": block.page_idx,
                    }
                })
            elif block.type == ContentType.EQUATION_BLOCK:
                # ๅ…ฌๅผๅ— + ไธŠไธ‹ๆ–‡
                chunks.append({
                    "text": f"[EQUATION] {block.content}",
                    "metadata": {
                        "paper_id": parsed_paper.metadata.paper_id,
                        "type": "equation",
                        "section": block.section_hierarchy,
                    }
                })
            else:
                # ๆ™ฎ้€šๆ–‡ๆœฌ: ๅ›บๅฎšๅคงๅฐๅˆ†ๅ—, ๆŒ‰ๅฅๅญ่พน็•Œๅฏน้ฝ
                text_chunks = self._split_text(block.content)
                for tc in text_chunks:
                    chunks.append({
                        "text": tc,
                        "metadata": {
                            "paper_id": parsed_paper.metadata.paper_id,
                            "type": block.type.value,
                            "section": block.section_hierarchy,
                            "page": block.page_idx,
                        }
                    })
        
        return chunks
    
    def _split_text(self, text: str) -> list:
        """ๆŒ‰ๅฅๅญ่พน็•Œๅˆ†ๅ—, ไฟๆŒ256 tokenๅคงๅฐ"""
        import re
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks, current = [], []
        current_len = 0
        
        for sent in sentences:
            sent_len = len(sent.split())  # ็ฎ€ๅŒ–็š„token่ฎกๆ•ฐ
            if current_len + sent_len > self.chunk_size and current:
                chunks.append(" ".join(current))
                # ไฟ็•™overlap
                overlap_sents = []
                overlap_len = 0
                for s in reversed(current):
                    if overlap_len + len(s.split()) > self.overlap:
                        break
                    overlap_sents.insert(0, s)
                    overlap_len += len(s.split())
                current = overlap_sents
                current_len = overlap_len
            current.append(sent)
            current_len += sent_len
        
        if current:
            chunks.append(" ".join(current))
        return chunks
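分块与重叠逻辑可以脱离 ParsedPaper 独立验证。下面是与 _split_text 等价的独立函数和一个缩小参数的示例 (假设性演示, 用词数近似 token 数):

```python
import re

def split_text(text: str, chunk_size: int = 8, overlap: int = 3) -> list[str]:
    """按句子边界分块, 相邻块之间保留约 overlap 个词的重叠"""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        sent_len = len(sent.split())
        if current_len + sent_len > chunk_size and current:
            chunks.append(" ".join(current))
            # 从块尾回收不超过 overlap 个词的句子作为下一块开头
            overlap_sents, overlap_len = [], 0
            for s in reversed(current):
                if overlap_len + len(s.split()) > overlap:
                    break
                overlap_sents.insert(0, s)
                overlap_len += len(s.split())
            current, current_len = overlap_sents, overlap_len
        current.append(sent)
        current_len += sent_len
    if current:
        chunks.append(" ".join(current))
    return chunks

# 每句2词, 块上限5词, 重叠2词: 相邻块共享一个句子
chunks = split_text("A b. C d. E f. G h.", chunk_size=5, overlap=2)
```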

5.3 RAPTOR ๅฑ‚ๆฌกๆ‘˜่ฆๆ ‘

่ฎบๆ–‡้›†ๅˆ (1000็ฏ‡)
    โ”‚
    โ”œโ”€โ”€ Level 0: ๅŽŸๅง‹ๆ–‡ๆœฌๅ— (256 tokens)
    โ”‚       โ”‚
    โ”‚       โ–ผ SBERTๅตŒๅ…ฅ โ†’ GMM่š็ฑป โ†’ UMAP้™็ปด
    โ”‚
    โ”œโ”€โ”€ Level 1: ๆฎต่ฝ็บงๆ‘˜่ฆ (~50ไธช่š็ฑป)
    โ”‚       โ”‚ LLM็”Ÿๆˆๆ‘˜่ฆ โ†’ ้‡ๆ–ฐๅตŒๅ…ฅ
    โ”‚       โ–ผ ๅ†ๆฌก่š็ฑป
    โ”‚
    โ”œโ”€โ”€ Level 2: ไธป้ข˜็บงๆ‘˜่ฆ (~15ไธช่š็ฑป)
    โ”‚       โ”‚ "Transformerๆžถๆž„็š„ๆ”น่ฟ›ๆ–นๅ‘"
    โ”‚       โ–ผ "ๅคง่ง„ๆจก้ข„่ฎญ็ปƒๆ•ฐๆฎ้›†็ปผ่ฟฐ"
    โ”‚
    โ””โ”€โ”€ Level 3: ้ข†ๅŸŸ็บงๆ‘˜่ฆ (~5ไธช่š็ฑป)
            "NLP้ข†ๅŸŸ่ฟ‘ๅนดไธป่ฆ็ ”็ฉถๆ–นๅ‘ไธŽ็ช็ ด"

ๆŸฅ่ฏขๆ—ถ: ไปŽๆ‰€ๆœ‰ๅฑ‚็บงไธญๆฃ€็ดขๆœ€็›ธๅ…ณ่Š‚็‚น (Collapsed Treeๆจกๅผ)
ไผ˜ๅŠฟ: ๆ—ข่ƒฝๅ›ž็ญ”็ป†่Š‚้—ฎ้ข˜(Level 0), ไนŸ่ƒฝๅ›ž็ญ”ๅ…จๅฑ€้—ฎ้ข˜(Level 2-3)
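RAPTOR 每一层"聚类 → 摘要 → 再嵌入"的数据流可以用玩具数据演示 (假设性代码: 真实实现使用 SBERT 嵌入、UMAP 降维、GMM 聚类与 LLM 摘要, 这里以二维向量、最近质心聚类和文本拼接代替, 仅示意结构):

```python
def build_raptor_level(chunks: list[dict], k: int = 2, iters: int = 10) -> list[dict]:
    """对一层节点做聚类, 每个簇产出一个上层"摘要"节点"""
    # 演示用: 等间隔取 k 个点作初始质心
    step = max(1, (len(chunks) - 1) // max(1, k - 1))
    centroids = [chunks[min(i * step, len(chunks) - 1)]["vec"] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for c in chunks:  # 最近质心分配
            dists = [sum((a - b) ** 2 for a, b in zip(c["vec"], ctr)) for ctr in centroids]
            groups[dists.index(min(dists))].append(c)
        centroids = [  # 质心更新 (空簇保持原质心)
            [sum(v) / len(g) for v in zip(*(c["vec"] for c in g))] if g else ctr
            for g, ctr in zip(groups, centroids)
        ]
    # "摘要"用文本拼接代替 LLM, "再嵌入"用质心代替重新编码
    return [{"text": " | ".join(c["text"] for c in g), "vec": ctr}
            for g, ctr in zip(groups, centroids) if g]

level0 = [
    {"text": "attention",   "vec": [0.0, 0.1]},
    {"text": "transformer", "vec": [0.1, 0.0]},
    {"text": "imagenet",    "vec": [5.0, 5.1]},
    {"text": "resnet",      "vec": [5.1, 5.0]},
]
level1 = build_raptor_level(level0, k=2)
```

对 level1 再递归调用即可得到 Level 2/3; Collapsed Tree 检索时把所有层级的节点放进同一个向量集合统一打分。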

6. ๆฃ€็ดขๅฑ‚

6.1 ๆททๅˆๆฃ€็ดขๆžถๆž„

                     ็”จๆˆทๆŸฅ่ฏข
                        โ”‚
                โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                โ–ผ               โ–ผ
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ”‚  HyDE    โ”‚   โ”‚ ๆŸฅ่ฏขๅˆ†็ฑปๅ™จ    โ”‚
         โ”‚ ๅ‡่ฎพๆ–‡ๆกฃ  โ”‚   โ”‚ (Router LLM) โ”‚
         โ”‚ ็”Ÿๆˆ+ๅตŒๅ…ฅ โ”‚   โ”‚              โ”‚
         โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
              โ”‚                โ”‚
              โ”‚    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ”‚    โ–ผ           โ–ผ           โ–ผ
              โ”‚  factual    reasoning    global
              โ”‚  (ไบ‹ๅฎž)     (ๆŽจ็†)      (ๅ…จๅฑ€)
              โ”‚    โ”‚           โ”‚           โ”‚
              โ–ผ    โ–ผ           โ–ผ           โ–ผ
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ”‚ ๅ‘้‡+BM25 โ”‚  โ”‚ ๅ›พ่ฐฑ้ๅކ  โ”‚  โ”‚ RAPTOR   โ”‚
         โ”‚ Qdrant   โ”‚  โ”‚ Neo4j    โ”‚  โ”‚ ๆ‘˜่ฆๆ ‘    โ”‚
         โ”‚ Hybrid   โ”‚  โ”‚ Cypher   โ”‚  โ”‚ ๅ…จๅฑ€ๆฃ€็ดข  โ”‚
         โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜
              โ”‚             โ”‚              โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚  RRF ่žๅˆๆŽ’ๅบ   โ”‚
                    โ”‚ (Reciprocal   โ”‚
                    โ”‚  Rank Fusion) โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚ Cross-Encoder โ”‚
                    โ”‚   Reranker   โ”‚
                    โ”‚ bge-reranker โ”‚
                    โ”‚   -large     โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                        Top-5 ็ป“ๆžœ
                            โ”‚
                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚  LLM ็ญ”ๆกˆ็”Ÿๆˆ  โ”‚
                    โ”‚ + ๅผ•็”จๆบฏๆบ     โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

6.2 ๆ ธๅฟƒๆฃ€็ดขไปฃ็ 

from qdrant_client import QdrantClient, models
from neo4j import GraphDatabase

class HybridRetriever:
    """ไธ‰่ทฏๆททๅˆๆฃ€็ดขๅ™จ"""
    
    def __init__(self):
        self.qdrant = QdrantClient("localhost", port=6333)
        self.neo4j = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
        self.reranker = self._load_reranker()
        self.embed_model = self._load_embedder()
    
    async def retrieve(self, query: str, mode: str = "hybrid", top_k: int = 20) -> list:
        """
        mode: "factual" | "reasoning" | "global" | "hybrid"
        """
        results = []
        
        if mode in ("factual", "hybrid"):
            # 1. Dense + Sparse ๅ‘้‡ๆฃ€็ดข
            query_vec = self.embed_model.encode(query).tolist()  # ndarray → list, Qdrant要求
            vec_results = self.qdrant.search(
                collection_name="papers",
                query_vector=models.NamedVector(name="dense", vector=query_vec),
                limit=top_k,
                with_payload=True,
            )
            results.extend([{"text": r.payload["text"], "score": r.score, 
                           "source": "vector", "metadata": r.payload} for r in vec_results])
        
        if mode in ("reasoning", "hybrid"):
            # 2. ๅ›พ่ฐฑๆฃ€็ดข โ€” ๅฎžไฝ“+ๅ…ณ็ณป่ทฏๅพ„
            graph_results = self._graph_search(query, limit=top_k // 2)
            results.extend(graph_results)
        
        if mode in ("global", "hybrid"):
            # 3. RAPTOR ๅฑ‚ๆฌกๆ‘˜่ฆๆฃ€็ดข
            raptor_results = self._raptor_search(query, limit=top_k // 3)
            results.extend(raptor_results)
        
        # 4. RRF ่žๅˆๆŽ’ๅบ
        fused = self._rrf_fusion(results)
        
        # 5. Cross-Encoder ้‡ๆŽ’
        reranked = self._rerank(query, fused[:top_k])
        
        return reranked[:5]
    
    def _graph_search(self, query: str, limit: int = 10) -> list:
        """Neo4j ๅญๅ›พๆฃ€็ดข"""
        # ๅ…ˆ็”จๅ‘้‡็ดขๅผ•ๆ‰พๅˆฐๆœ€็›ธๅ…ณ็š„ๅฎžไฝ“่Š‚็‚น
        # ๅ†็”จCypher้ๅކ1-2่ทณ้‚ปๅฑ…
        cypher = """
        CALL db.index.vector.queryNodes('method_embedding', $limit, $query_vec)
        YIELD node, score
        MATCH (node)-[r]-(neighbor)
        RETURN node, r, neighbor, score
        ORDER BY score DESC LIMIT $limit
        """
        with self.neo4j.session() as session:
            result = session.run(cypher, query_vec=self.embed_model.encode(query).tolist(), limit=limit)
            return [{"text": self._format_graph_result(r), "score": r["score"], 
                     "source": "graph"} for r in result]
    
    def _rrf_fusion(self, results: list, k: int = 60) -> list:
        """Reciprocal Rank Fusion: 多路结果融合"""
        # 不同检索路的分数量纲不可比, RRF 须先在各路内部排名, 再按 1/(k+rank) 求和
        by_source = {}
        for r in results:
            by_source.setdefault(r["source"], []).append(r)
        
        doc_scores = {}
        for source_results in by_source.values():
            ranked = sorted(source_results, key=lambda x: x["score"], reverse=True)
            for rank, r in enumerate(ranked):
                doc_key = r["text"][:200]  # 去重key
                if doc_key not in doc_scores:
                    doc_scores[doc_key] = {"result": r, "rrf_score": 0.0}
                doc_scores[doc_key]["rrf_score"] += 1.0 / (k + rank + 1)
        
        return [v["result"] | {"score": v["rrf_score"]} 
                for v in sorted(doc_scores.values(), key=lambda x: x["rrf_score"], reverse=True)]
    
    def _rerank(self, query: str, results: list) -> list:
        """BAAI/bge-reranker-large ไบคๅ‰็ผ–็ ๅ™จ้‡ๆŽ’"""
        pairs = [(query, r["text"]) for r in results]
        scores = self.reranker.predict(pairs)
        for r, s in zip(results, scores):
            r["rerank_score"] = float(s)
        return sorted(results, key=lambda x: x["rerank_score"], reverse=True)
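架构图中的 HyDE 查询增强在上面的检索器里未展开。其核心只有两步: 先让 LLM 写一段"假设性答案文档", 再用该文档 (而非原始查询) 的嵌入去检索, 利用文档-文档相似度通常高于查询-文档相似度的特性。下面是一个最小示意 (假设性代码: generate/embed/search 以函数参数注入, 演示用桩函数代替真实 LLM 与向量库):

```python
def hyde_search(query: str, generate, embed, search, top_k: int = 20) -> list:
    """HyDE: 用假设文档的嵌入代替查询嵌入进行向量检索"""
    prompt = f"请写一段能够回答以下学术问题的论文段落:\n{query}"
    hypothetical_doc = generate(prompt)   # LLM 生成假设文档
    doc_vec = embed(hypothetical_doc)     # 对假设文档 (而非查询) 做嵌入
    return search(doc_vec, top_k)         # 在向量库中按文档嵌入检索

# 桩函数演示数据流 (真实场景替换为 LLM / 嵌入模型 / Qdrant)
calls = {}

def fake_generate(prompt: str) -> str:
    return "Transformer uses self-attention ..."

def fake_embed(text: str) -> list[float]:
    calls["embedded"] = text   # 记录被嵌入的是假设文档而非原始查询
    return [0.1, 0.2]

def fake_search(vec: list[float], top_k: int) -> list[dict]:
    return [{"text": "Attention Is All You Need", "score": 0.9}]

results = hyde_search("Transformer的核心机制是什么?", fake_generate, fake_embed, fake_search)
```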

7. Agent ็ผ–ๆŽ’ๅฑ‚

7.1 LangGraph ๅคšAgentๆžถๆž„

                        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                        โ”‚   ็”จๆˆทๆŸฅ่ฏข    โ”‚
                        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                               โ”‚
                        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                        โ”‚  ่ทฏ็”ฑ Agent   โ”‚
                        โ”‚  (ๆ„ๅ›พๅˆ†็ฑป)   โ”‚
                        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                               โ”‚
              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ”‚                โ”‚                โ”‚
       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ”‚ ็ฎ€ๅ•้—ฎ็ญ”     โ”‚ โ”‚ ๅคš่ทณๆŽจ็†     โ”‚ โ”‚ ๅ…จๅฑ€ๅˆ†ๆž     โ”‚
       โ”‚             โ”‚ โ”‚             โ”‚ โ”‚             โ”‚
       โ”‚ ๅ‘้‡ๆฃ€็ดข    โ”‚ โ”‚ ๅ›พ่ฐฑ้ๅކ    โ”‚ โ”‚ RAPTOR+KG  โ”‚
       โ”‚ โ†’ ็”Ÿๆˆ็ญ”ๆกˆ  โ”‚ โ”‚ โ†’ ้“พๅผๆŽจ็†  โ”‚ โ”‚ โ†’ ็ปผๅˆๆ€ป็ป“  โ”‚
       โ”‚ โ†’ ๅผ•็”จๆบฏๆบ  โ”‚ โ”‚ โ†’ ่ฏๆฎๆ”ถ้›†  โ”‚ โ”‚ โ†’ ่ถ‹ๅŠฟๆดžๅฏŸ  โ”‚
       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜
              โ”‚                โ”‚                โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                               โ”‚
                        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                        โ”‚  ่‡ชๆฃ€ Agent   โ”‚
                        โ”‚ (็ญ”ๆกˆ้ชŒ่ฏ)    โ”‚
                        โ”‚ ๆ˜ฏๅฆๅ……ๅˆ†?     โ”‚
                        โ”‚ ๆ˜ฏๅฆๆœ‰ๅนป่ง‰?   โ”‚
                        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                               โ”‚
                   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                   โ”‚ ๅ……ๅˆ†       โ”‚ ไธๅ……ๅˆ†     โ”‚
                   โ–ผ           โ–ผ           โ”‚
            โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”‚
            โ”‚ ่พ“ๅ‡บ็ญ”ๆกˆ  โ”‚ โ”‚ ่กฅๅ……ๆฃ€็ดข  โ”‚โ”€โ”€โ”€โ”€โ”€โ”˜
            โ”‚ + ๅผ•็”จ   โ”‚ โ”‚ (ๆ›ดๅคšๆบ)  โ”‚ (ๆœ€ๅคš3่ฝฎ)
            โ”‚ + ๅ›พ่ฐฑ   โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
            โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

7.2 LangGraph ็Šถๆ€ๆœบๅฎšไน‰

from typing import TypedDict, Annotated, Literal
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    query: str
    query_type: Literal["factual", "reasoning", "global"]
    retrieved_docs: list
    graph_context: list
    answer: str
    citations: list
    confidence: float
    iteration: int

def build_agent_graph():
    graph = StateGraph(AgentState)
    
    # ๆทปๅŠ ่Š‚็‚น
    graph.add_node("router", route_query)
    graph.add_node("retriever", hybrid_retrieve)
    graph.add_node("graph_explorer", explore_knowledge_graph)
    graph.add_node("generator", generate_answer)
    graph.add_node("validator", validate_answer)
    graph.add_node("supplementer", supplement_retrieval)
    
    # ๅฎšไน‰่พน
    graph.set_entry_point("router")
    graph.add_edge("router", "retriever")
    graph.add_edge("retriever", "graph_explorer")
    graph.add_edge("graph_explorer", "generator")
    graph.add_edge("generator", "validator")
    
    # ๆกไปถ่พน: ้ชŒ่ฏ้€š่ฟ‡โ†’็ป“ๆŸ, ไธ้€š่ฟ‡โ†’่กฅๅ……ๆฃ€็ดข(ๆœ€ๅคš3่ฝฎ)
    graph.add_conditional_edges(
        "validator",
        lambda state: "end" if state["confidence"] > 0.8 or state["iteration"] >= 3 else "supplement",
        {"end": END, "supplement": "supplementer"}
    )
    graph.add_edge("supplementer", "retriever")
    
    return graph.compile()

async def route_query(state: AgentState) -> AgentState:
    """LLMๆ„ๅ›พๅˆ†็ฑป"""
    classification_prompt = f"""
    ๅฐ†ไปฅไธ‹ๅญฆๆœฏ้—ฎ้ข˜ๅˆ†็ฑปไธบไธ‰็ง็ฑปๅž‹ไน‹ไธ€:
    - factual: ๅ…ทไฝ“ไบ‹ๅฎžๆŸฅ่ฏข (ๆŸไธชๆ–นๆณ•็š„ๆ•ˆๆžœใ€ๆŸ็ฏ‡่ฎบๆ–‡็š„ไฝœ่€…)
    - reasoning: ้œ€่ฆๅคšๆญฅๆŽจ็† (ๆ–นๆณ•Aๅ’ŒB็š„ๅŒบๅˆซใ€ๆŸๆŠ€ๆœฏ็š„ๅ‘ๅฑ•่„‰็ปœ)
    - global: ๅ…จๅฑ€ๆ€งๅˆ†ๆž (ๆŸ้ข†ๅŸŸ็š„็ ”็ฉถ่ถ‹ๅŠฟใ€ไธป่ฆๆŒ‘ๆˆ˜)
    
    ้—ฎ้ข˜: {state['query']}
    ็ฑปๅž‹: """
    
    query_type = await llm.ainvoke(classification_prompt)
    return {"query_type": query_type.content.strip()}

async def validate_answer(state: AgentState) -> AgentState:
    """Self-RAG ๆจกๅผ: LLM่‡ชๆฃ€็ญ”ๆกˆ่ดจ้‡"""
    validation_prompt = f"""
    ่ฏ„ไผฐไปฅไธ‹็ญ”ๆกˆ็š„่ดจ้‡(0-1ๅˆ†):
    ้—ฎ้ข˜: {state['query']}
    ็ญ”ๆกˆ: {state['answer']}
    ๆฃ€็ดขไพๆฎ: {state['retrieved_docs'][:3]}
    
    ่ฏ„ๅˆ†ๆ ‡ๅ‡†:
    - ๆ˜ฏๅฆๅฎŒๆ•ดๅ›ž็ญ”ไบ†้—ฎ้ข˜
    - ๆ˜ฏๅฆๆœ‰ไพๆฎๆ”ฏๆ’‘
    - ๆ˜ฏๅฆๅญ˜ๅœจๅนป่ง‰
    
    ่ฟ”ๅ›žJSON: {{"confidence": 0.X, "issues": ["..."]}}
    """
    result = await llm.ainvoke(validation_prompt)
    confidence = parse_confidence(result.content)
    return {"confidence": confidence, "iteration": state["iteration"] + 1}
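7.1 中"最多3轮"的自检循环由 validator 之后的条件边终止, 其判定逻辑与 build_agent_graph 里条件边的 lambda 等价, 可以单独验证 (假设性示例):

```python
def next_step(state: dict) -> str:
    """验证通过(置信度>0.8)或已达3轮 → 结束; 否则走补充检索"""
    if state["confidence"] > 0.8 or state["iteration"] >= 3:
        return "end"
    return "supplement"

decisions = [
    next_step({"confidence": 0.9, "iteration": 1}),  # 置信度达标 → 结束
    next_step({"confidence": 0.5, "iteration": 3}),  # 轮数耗尽 → 强制结束
    next_step({"confidence": 0.5, "iteration": 1}),  # 否则补充检索
]
```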

7.3 Agent ๅทฅๅ…ท้›†

from langchain.tools import tool

@tool
def vector_search(query: str, top_k: int = 5) -> str:
    """ๅœจ่ฎบๆ–‡ๅ‘้‡ๅบ“ไธญ่ฟ›่กŒ่ฏญไน‰ๆœ็ดข"""
    results = retriever.search_vectors(query, top_k)
    return format_search_results(results)

@tool
def graph_query(cypher: str) -> str:
    """ๆ‰ง่กŒCypherๆŸฅ่ฏข, ๅœจ็Ÿฅ่ฏ†ๅ›พ่ฐฑไธญๆฃ€็ดขๅฎžไฝ“ๅ’Œๅ…ณ็ณป"""
    with neo4j_driver.session() as session:
        result = session.run(cypher)
        return format_graph_results(result)

@tool
def find_related_methods(method_name: str) -> str:
    """ๆŸฅๆ‰พไธŽๆŒ‡ๅฎšๆ–นๆณ•็›ธๅ…ณ็š„ๆ‰€ๆœ‰ๆ–นๆณ•(ๆ”น่ฟ›ใ€ๅฏนๆฏ”ใ€ไฝฟ็”จ)"""
    cypher = """
    MATCH (m:Method {name: $name})-[r]-(related)
    RETURN type(r) as relation, labels(related) as type, 
           related.name as name, r.score_value as score
    ORDER BY r.score_value DESC
    LIMIT 20
    """
    return execute_and_format(cypher, {"name": method_name})

@tool  
def get_paper_summary(paper_id: str) -> str:
    """่Žทๅ–่ฎบๆ–‡็š„ๆ‘˜่ฆๅ’Œๆ ธๅฟƒ่ดก็Œฎ"""
    return paper_store.get_summary(paper_id)

@tool
def compare_methods(method_a: str, method_b: str) -> str:
    """ๅฏนๆฏ”ไธคไธชๆ–นๆณ•ๅœจ็›ธๅŒๆ•ฐๆฎ้›†ไธŠ็š„่กจ็Žฐ"""
    cypher = """
    MATCH (a:Method {name: $a})-[r1:EVALUATED_ON]->(d:Dataset)<-[r2:EVALUATED_ON]-(b:Method {name: $b})
    RETURN d.name as dataset, r1.score as score_a, r2.score as score_b, 
           r1.metric as metric
    """
    return execute_and_format(cypher, {"a": method_a, "b": method_b})

@tool
def research_trend(topic: str, years: int = 5) -> str:
    """ๅˆ†ๆžๆŸไธช็ ”็ฉถไธป้ข˜ๅœจ่ฟ‘Nๅนด็š„ๅ‘ๅฑ•่ถ‹ๅŠฟ"""
    raptor_results = raptor_index.search(topic, level="high")
    graph_stats = get_temporal_graph_stats(topic, years)
    return synthesize_trend(raptor_results, graph_stats)

8. LLM ็ปŸไธ€ๆŽฅๅ…ฅๅฑ‚

8.1 ๆžถๆž„่ฎพ่ฎก

                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚      LiteLLM Proxy Server         โ”‚
                    โ”‚      (็ปŸไธ€ OpenAI ๅ…ผๅฎนๆŽฅๅฃ)        โ”‚
                    โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
                    โ”‚                                  โ”‚
                    โ”‚  model_list:                     โ”‚
                    โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
                    โ”‚  โ”‚ "local/qwen2.5-14b"        โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ Ollama :11434
                    โ”‚  โ”‚ "local/llama-3.1-8b"       โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ vLLM   :8000
                    โ”‚  โ”‚ "gpt-4o-mini"              โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ OpenAI API
                    โ”‚  โ”‚ "claude-3-5-sonnet"        โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ Anthropic API
                    โ”‚  โ”‚ "deepseek-chat"            โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ DeepSeek API
                    โ”‚  โ”‚ "gemini-2.0-flash"         โ”‚โ”€โ”€โ”‚โ”€โ”€โ–ถ Google API
                    โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
                    โ”‚                                  โ”‚
                    โ”‚  ๅŠŸ่ƒฝ:                            โ”‚
                    โ”‚  - ็ปŸไธ€ /chat/completions ๆŽฅๅฃ     โ”‚
                    โ”‚  - ่‡ชๅŠจfallback (ๆœฌๅœฐโ†’API)         โ”‚
                    โ”‚  - ่ดŸ่ฝฝๅ‡่กก (ๅคšvLLMๅฎžไพ‹)            โ”‚
                    โ”‚  - ้€Ÿ็އ้™ๅˆถ & ๆˆๆœฌ่ฟฝ่ธช              โ”‚
                    โ”‚  - ็ผ“ๅญ˜ (็›ธๅŒqueryๅค็”จ)             โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

8.2 LiteLLM ้…็ฝฎ

# litellm_config.yaml
model_list:
  # ===== ๆœฌๅœฐๆจกๅž‹ =====
  - model_name: "local/qwen2.5-14b"
    litellm_params:
      model: "openai/Qwen2.5-14B-Instruct"
      api_base: "http://localhost:11434/v1"  # Ollama
      api_key: "ollama"
    model_info:
      max_tokens: 32768
      input_cost_per_token: 0  # ๆœฌๅœฐๅ…่ดน
      
  - model_name: "local/llama-3.1-8b"
    litellm_params:
      model: "openai/meta-llama/Llama-3.1-8B-Instruct"
      api_base: "http://localhost:8000/v1"  # vLLM
      api_key: "token"
    model_info:
      max_tokens: 131072

  # ===== ๅค–้ƒจAPI =====
  - model_name: "gpt-4o-mini"
    litellm_params:
      model: "gpt-4o-mini"
      api_key: "os.environ/OPENAI_API_KEY"
      
  - model_name: "deepseek-chat"
    litellm_params:
      model: "deepseek/deepseek-chat"
      api_key: "os.environ/DEEPSEEK_API_KEY"

# ่ทฏ็”ฑ็ญ–็•ฅ
router_settings:
  routing_strategy: "latency-based-routing"  # ้€‰ๆ‹ฉๅปถ่ฟŸๆœ€ไฝŽ็š„
  num_retries: 3
  fallbacks:
    - "local/qwen2.5-14b": ["gpt-4o-mini"]  # ๆœฌๅœฐๅคฑ่ดฅโ†’API
    - "gpt-4o-mini": ["deepseek-chat"]       # OpenAIๅคฑ่ดฅโ†’DeepSeek
  
  # ไธๅŒไปปๅŠก็”จไธๅŒๆจกๅž‹
  model_group_alias:
    "extraction": "local/qwen2.5-14b"      # ็Ÿฅ่ฏ†ๆŠฝๅ–: ๆœฌๅœฐ(็œ้’ฑ)
    "generation": "gpt-4o-mini"             # ็ญ”ๆกˆ็”Ÿๆˆ: API(้ซ˜่ดจ้‡)
    "routing": "local/llama-3.1-8b"         # ๆ„ๅ›พๅˆ†็ฑป: ๆœฌๅœฐๅฐๆจกๅž‹(ๅฟซ)

8.3 ็ปŸไธ€่ฐƒ็”จๆŽฅๅฃ

import litellm
from typing import Optional

class UnifiedLLM:
    """็ปŸไธ€LLM่ฐƒ็”จๅฑ‚ โ€” ่‡ชๅŠจ่ทฏ็”ฑๆœฌๅœฐ/API"""
    
    def __init__(self, config_path: str = "litellm_config.yaml"):
        litellm.set_verbose = False
        # ๅฏ็”จ็ผ“ๅญ˜
        litellm.cache = litellm.Cache(type="redis", host="localhost", port=6379)
    
    async def complete(
        self, 
        messages: list,
        task: str = "generation",     # extraction | generation | routing
        temperature: float = 0,
        max_tokens: int = 4096,
        stream: bool = False,
    ) -> str:
        """
        ็ปŸไธ€่ฐƒ็”จๆŽฅๅฃ, ๆ นๆฎtask่‡ชๅŠจ้€‰ๆ‹ฉๆจกๅž‹
        """
        model = self._select_model(task)
        
        response = await litellm.acompletion(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            stream=stream,
            metadata={"task": task},  # ็”จไบŽๆˆๆœฌ่ฟฝ่ธช
        )
        
        if stream:
            return response  # ่ฟ”ๅ›žๅผ‚ๆญฅ็”Ÿๆˆๅ™จ
        return response.choices[0].message.content
    
    def _select_model(self, task: str) -> str:
        model_map = {
            "extraction": "local/qwen2.5-14b",
            "generation": "gpt-4o-mini",
            "routing": "local/llama-3.1-8b",
            "fusion": "gpt-4o-mini",        # Graphusion่žๅˆ้œ€่ฆๅผบๆจกๅž‹
            "rewrite": "local/llama-3.1-8b", # HyDEๆŸฅ่ฏขๆ”นๅ†™
        }
        return model_map.get(task, "local/qwen2.5-14b")

9. ็ณป็ปŸ้ƒจ็ฝฒๆžถๆž„

9.1 Docker Compose ้ƒจ็ฝฒ

# docker-compose.yml
version: '3.8'

services:
  # ===== ๆ ธๅฟƒๆœๅŠก =====
  api:
    build: ./services/api
    ports: ["8080:8080"]
    environment:
      - REDIS_URL=redis://redis:6379
      - QDRANT_URL=http://qdrant:6333
      - NEO4J_URL=bolt://neo4j:7687
      - LITELLM_URL=http://litellm:4000
    depends_on: [redis, qdrant, neo4j, litellm]

  # ===== PDF่งฃๆžๆœๅŠก =====
  mineru-worker:
    build: ./services/mineru
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MINERU_MODEL_SOURCE=local
      - CELERY_BROKER_URL=redis://redis:6379
    volumes:
      - mineru-models:/models
      - pdf-storage:/pdfs

  # ===== LLMๆœๅŠก =====
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]
    volumes:
      - ./config/litellm_config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]

  ollama:
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ollama-data:/root/.ollama

  # ===== ๅญ˜ๅ‚จๆœๅŠก =====
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
    volumes:
      - qdrant-data:/qdrant/storage

  neo4j:
    image: neo4j:5-community
    ports: ["7474:7474", "7687:7687"]
    environment:
      - NEO4J_AUTH=neo4j/password
      - NEO4J_PLUGINS=["apoc", "graph-data-science"]
    volumes:
      - neo4j-data:/data

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  postgres:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=scholarmind
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data

  minio:
    image: minio/minio:latest
    ports: ["9000:9000", "9001:9001"]
    command: server /data --console-address ":9001"
    volumes:
      - minio-data:/data

volumes:
  qdrant-data:
  neo4j-data:
  postgres-data:
  minio-data:
  ollama-data:
  mineru-models:
  pdf-storage:

9.2 ็กฌไปถ้…็ฝฎๅปบ่ฎฎ

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    ็กฌไปถ้…็ฝฎๅปบ่ฎฎ                                โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ ้…็ฝฎ           โ”‚ ๅผ€ๅ‘็Žฏๅขƒ    โ”‚ ็”Ÿไบง(ๅฐ)    โ”‚ ็”Ÿไบง(ๅคง)         โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ PDF่งฃๆž GPU    โ”‚ RTX 3090   โ”‚ A100 80G   โ”‚ 2ร—A100 80G      โ”‚
โ”‚ LLMๆŽจ็† GPU    โ”‚ RTX 4090   โ”‚ A100 80G   โ”‚ 2ร—H100 80G      โ”‚
โ”‚ CPU            โ”‚ 16ๆ ธ       โ”‚ 32ๆ ธ       โ”‚ 64ๆ ธ             โ”‚
โ”‚ RAM            โ”‚ 64GB       โ”‚ 128GB      โ”‚ 256GB            โ”‚
โ”‚ SSD            โ”‚ 1TB NVMe   โ”‚ 2TB NVMe   โ”‚ 4TB NVMe        โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1000็ฏ‡่ฎบๆ–‡     โ”‚ ~3ๅฐๆ—ถ     โ”‚ ~80ๅˆ†้’Ÿ    โ”‚ ~40ๅˆ†้’Ÿ          โ”‚
โ”‚ ่งฃๆžๆ—ถ้—ด       โ”‚            โ”‚            โ”‚                  โ”‚
โ”‚ QAๅ“ๅบ”ๅปถ่ฟŸ     โ”‚ ~5s        โ”‚ ~2s        โ”‚ ~1s              โ”‚
โ”‚ ๅนถๅ‘็”จๆˆท       โ”‚ 1-5        โ”‚ 10-50      โ”‚ 50-200           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

9.3 API ่ฎพ่ฎก

from fastapi import FastAPI, UploadFile, BackgroundTasks
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI(title="ScholarMind API", version="1.0")

# ===== PDFไธŠไผ ไธŽ่งฃๆž =====
@app.post("/api/v1/papers/upload")
async def upload_papers(files: list[UploadFile], bg: BackgroundTasks):
    """ๆ‰น้‡ไธŠไผ PDF่ฎบๆ–‡, ๅผ‚ๆญฅ่งฃๆž"""
    task_ids = []
    for f in files:
        task_id = await save_and_queue(f)
        task_ids.append(task_id)
    return {"task_ids": task_ids, "status": "processing"}

@app.get("/api/v1/papers/{task_id}/status")
async def get_parse_status(task_id: str):
    """ๆŸฅ่ฏข่งฃๆž่ฟ›ๅบฆ"""
    return celery_app.AsyncResult(task_id).info

# ===== ็Ÿฅ่ฏ†ๅบ“้—ฎ็ญ” =====
class QueryRequest(BaseModel):
    query: str
    mode: str = "hybrid"           # factual | reasoning | global | hybrid
    llm_backend: str = "auto"      # auto | local | openai | deepseek
    top_k: int = 5
    stream: bool = False
    include_citations: bool = True
    include_graph: bool = False     # ๆ˜ฏๅฆ่ฟ”ๅ›ž็›ธๅ…ณๅญๅ›พ

@app.post("/api/v1/query")
async def query_knowledge_base(req: QueryRequest):
    """็Ÿฅ่ฏ†ๅบ“้—ฎ็ญ”"""
    if req.stream:
        return StreamingResponse(
            agent.astream(req), media_type="text/event-stream"
        )
    result = await agent.ainvoke(req)
    return {
        "answer": result["answer"],
        "citations": result["citations"],
        "confidence": result["confidence"],
        "graph_snippet": result.get("graph_snippet"),
    }

# ===== ็Ÿฅ่ฏ†ๅ›พ่ฐฑ =====
@app.get("/api/v1/graph/entity/{name}")
async def get_entity(name: str, depth: int = 2):
    """่Žทๅ–ๅฎžไฝ“ๅŠๅ…ถN่ทณๅญๅ›พ"""
    subgraph = await graph_service.get_subgraph(name, depth)
    return subgraph

@app.get("/api/v1/graph/path")
async def find_path(source: str, target: str, max_hops: int = 4):
    """ๆŸฅๆ‰พไธคไธชๅฎžไฝ“ไน‹้—ด็š„ๆœ€็Ÿญ่ทฏๅพ„"""
    path = await graph_service.shortest_path(source, target, max_hops)
    return path

@app.get("/api/v1/graph/stats")
async def graph_statistics():
    """็Ÿฅ่ฏ†ๅ›พ่ฐฑ็ปŸ่ฎกไฟกๆฏ"""
    return await graph_service.get_stats()

# ===== ๅ›พ่ฐฑๅฏ่ง†ๅŒ– =====
@app.get("/api/v1/graph/visualize")
async def visualize_graph(center: str, depth: int = 2, layout: str = "force"):
    """่ฟ”ๅ›žๅฏ่ง†ๅŒ–ๆ•ฐๆฎ (vis.jsๆ ผๅผ)"""
    data = await graph_service.get_vis_data(center, depth)
    return {"nodes": data["nodes"], "edges": data["edges"]}

10. ๆŠ€ๆœฏ้€‰ๅž‹ๅฏนๆฏ”

10.1 ๅฎŒๆ•ดๆŠ€ๆœฏๆ ˆ

| 层 | 组件 | 选型 | 替代方案 | 选型理由 |
|------|------|------|----------|----------|
| PDF解析 | OCR引擎 | MinerU 2.5 VLM | Marker, Nougat, Docling | 学术论文SOTA (0.047 Edit Dist), 公式 88.46 CDM |
| PDF解析 | 快速路径 | PyMuPDF | pdfplumber | 数字PDF秒级解析, 无GPU需求 |
| 知识抽取 | NER | GLiNER (440M) | spaCy, DeepKE | 零样本, 自定义标签, 本地运行 |
| 知识抽取 | RE | LLMGraphTransformer | REBEL, GLiREL, ReLiK | 支持本地+API LLM, Schema约束 |
| 知识抽取 | 融合 | Graphusion | 无 | 实体归一化+冲突消解, 比naive好9.2% |
| 知识图谱 | 图数据库 | Neo4j 5.x | ArangoDB, NebulaGraph | LangChain原生集成, Cypher生态最成熟 |
| 向量索引 | 向量库 | Qdrant | Milvus, Weaviate | Rust高性能, 原生Hybrid搜索, 简单部署 |
| 检索 | 重排器 | bge-reranker-large | Cohere, jina | 开源SOTA, 无API依赖 |
| 检索 | 查询增强 | HyDE | Query2Doc | +10 NDCG, 零成本提升 |
| 索引 | 层次索引 | RAPTOR | GraphRAG Communities | 适合层级文档, +20%准确率 |
| RAG | 图增强 | LightRAG | NodeRAG, GraphRAG | 34k⭐, 增量更新, 多后端 |
| Agent | 编排 | LangGraph | smolagents, AutoGen | 有状态图, 条件分支, 生产级 |
| LLM | 统一接入 | LiteLLM | OpenRouter | 20k⭐, 所有提供商统一接口 |
| LLM | 本地推理 | vLLM + Ollama | SGLang, llama.cpp | vLLM高吞吐, Ollama易用 |
| 后端 | Web框架 | FastAPI | Flask, Django | 异步原生, 高性能, OpenAPI自动文档 |
| 队列 | 任务队列 | Celery + Redis | RQ, Dramatiq | 成熟稳定, 分布式支持 |
| 存储 | 对象存储 | MinIO | S3 | S3兼容, 本地部署 |
| 存储 | 关系数据库 | PostgreSQL | MySQL | JSON支持, 全文搜索 |
| 前端 | 图可视化 | react-force-graph | vis.js, D3 | React生态, 3D支持 |

10.2 ๆ€ง่ƒฝ้ข„ไผฐ

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚              1000็ฏ‡่ฎบๆ–‡็ณป็ปŸๆ€ง่ƒฝ้ข„ไผฐ (A100 80G)                  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ PDF่งฃๆž              โ”‚ ~80ๅˆ†้’Ÿ (MinerU 2.5, 2.12 pg/s)        โ”‚
โ”‚ ็Ÿฅ่ฏ†ๆŠฝๅ–(GLiNER NER) โ”‚ ~15ๅˆ†้’Ÿ (GPU batch)                    โ”‚
โ”‚ ็Ÿฅ่ฏ†ๆŠฝๅ–(LLM RE)     โ”‚ ~60ๅˆ†้’Ÿ (ๆœฌๅœฐ14Bๆจกๅž‹)                   โ”‚
โ”‚                      โ”‚ ~30ๅˆ†้’Ÿ (GPT-4o-mini API)               โ”‚
โ”‚ ๅ‘้‡็ดขๅผ•ๆž„ๅปบ          โ”‚ ~10ๅˆ†้’Ÿ (text-embedding-3-small)        โ”‚
โ”‚ ็Ÿฅ่ฏ†ๅ›พ่ฐฑๆž„ๅปบ          โ”‚ ~20ๅˆ†้’Ÿ (ๅซ่žๅˆ)                        โ”‚
โ”‚ RAPTORๆ ‘ๆž„ๅปบ          โ”‚ ~30ๅˆ†้’Ÿ                                 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ ๆ€ป่ฎก(็ซฏๅˆฐ็ซฏ)         โ”‚ ~3.5ๅฐๆ—ถ (ๅ…จๆœฌๅœฐ) / ~2.5ๅฐๆ—ถ (ๆททๅˆAPI)   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ QAๅ“ๅบ”ๅปถ่ฟŸ (P50)     โ”‚ ~1.5s (ๆœฌๅœฐLLM) / ~0.8s (API)          โ”‚
โ”‚ QAๅ“ๅบ”ๅปถ่ฟŸ (P99)     โ”‚ ~4s (ๆœฌๅœฐLLM) / ~2s (API)              โ”‚
โ”‚ ๅ›พ่ฐฑๆŸฅ่ฏขๅปถ่ฟŸ          โ”‚ ~200ms (2่ทณๅญๅ›พ)                       โ”‚
โ”‚ ๅ‘้‡ๆฃ€็ดขๅปถ่ฟŸ          โ”‚ ~50ms (Qdrant, 1Mๅ‘้‡)                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

## 11. Key Papers and Open-Source Projects

### 11.1 Core Papers

| Paper | arXiv ID | Contribution | Recommendation |
|---|---|---|---|
| MinerU 2.5 | 2509.22186 | Unified VLM document-parsing SOTA | ⭐⭐⭐⭐⭐ |
| OmniDocBench | 2412.07626 | Document-parsing benchmark (CVPR 2025) | ⭐⭐⭐⭐ |
| Graphusion | 2410.17600 | Zero-shot KG construction + fusion | ⭐⭐⭐⭐⭐ |
| GLiNER | 2311.08526 | Zero-shot NER, 440M params | ⭐⭐⭐⭐⭐ |
| SciER | 2410.21155 | Scientific-paper IE dataset + benchmark | ⭐⭐⭐⭐ |
| ReLiK | 2408.00103 | Fast entity linking + relation extraction | ⭐⭐⭐⭐ |
| NodeRAG | 2504.11544 | Heterogeneous-graph RAG SOTA | ⭐⭐⭐⭐⭐ |
| LightRAG | 2410.05779 | Lightweight graph RAG, incremental updates | ⭐⭐⭐⭐⭐ |
| Microsoft GraphRAG | 2404.16130 | Community summaries + global retrieval | ⭐⭐⭐⭐ |
| RAPTOR | 2401.18059 | Recursive summary tree | ⭐⭐⭐⭐ |
| Self-RAG | 2310.11511 | Self-reflective retrieval-augmented generation | ⭐⭐⭐ |
| HyDE | 2212.10496 | Hypothetical document embeddings | ⭐⭐⭐⭐ |
| RAG vs GraphRAG | 2502.11371 | RAG + GraphRAG fusion experiments | ⭐⭐⭐⭐ |
| LLM-KGC Survey | 2510.20345 | Survey of LLM-based KG construction | ⭐⭐⭐⭐ |

11.2 ๆ ธๅฟƒๅผ€ๆบ้กน็›ฎ

| Project | GitHub | Stars | Purpose |
|---|---|---|---|
| MinerU | opendatalab/MinerU | 61k+ | Deep PDF parsing |
| LightRAG | hkuds/lightrag | 34k+ | Graph-augmented RAG |
| RAGFlow | infiniflow/ragflow | 36k+ | Full-stack RAG platform (with UI) |
| LiteLLM | BerriAI/litellm | 20k+ | Unified LLM proxy |
| Neo4j LLM Graph Builder | neo4j-labs/llm-graph-builder | 3k+ | PDF → KG → QA |
| Kotaemon | Cinnamon/kotaemon | 18k+ | Document QA (incl. GraphRAG) |
| Dify | langgenius/dify | 70k+ | AI application development platform |
| LangGraph | langchain-ai/langgraph | 10k+ | Agent state-machine orchestration |
| GLiNER | urchade/GLiNER | 2k+ | Zero-shot NER |
| Graphusion | irenezihuili/graphusion | 27 | KG fusion and deduplication |
| RAPTOR | parthsarthi03/raptor | 1.6k+ | Hierarchical summary tree |
| NodeRAG | Terry-Xu-666/NodeRAG | 412 | Heterogeneous-graph RAG |
| Qdrant | qdrant/qdrant | 22k+ | Vector database |
| vLLM | vllm-project/vllm | 45k+ | High-throughput LLM inference |

## 12. Project Structure

```
scholarmind/
├── docker-compose.yml              # One-command deployment
├── config/
│   ├── litellm_config.yaml         # LLM routing config
│   ├── mineru_config.yaml          # MinerU parsing config
│   └── settings.py                 # Global settings
│
├── services/
│   ├── api/                        # FastAPI main service
│   │   ├── main.py                 # Entry point
│   │   ├── routers/
│   │   │   ├── papers.py           # PDF upload/parsing API
│   │   │   ├── query.py            # Knowledge-base QA API
│   │   │   ├── graph.py            # Knowledge-graph API
│   │   │   └── admin.py            # Admin API
│   │   └── middleware/
│   │       ├── auth.py             # Authentication
│   │       └── rate_limit.py       # Rate limiting
│   │
│   ├── parser/                     # PDF parsing service
│   │   ├── router.py               # PDF feature-based routing
│   │   ├── mineru_worker.py        # MinerU VLM worker
│   │   ├── pymupdf_worker.py       # PyMuPDF fast parsing
│   │   ├── metadata_extractor.py   # Metadata extraction
│   │   └── tasks.py                # Celery task definitions
│   │
│   ├── extractor/                  # Knowledge-extraction service
│   │   ├── ner_engine.py           # GLiNER NER
│   │   ├── re_engine.py            # LLM relation extraction
│   │   ├── fusion_engine.py        # Graphusion fusion
│   │   └── schema.py               # Entity/relation schema
│   │
│   ├── graph/                      # Knowledge-graph service
│   │   ├── neo4j_client.py         # Neo4j connection management
│   │   ├── graph_builder.py        # Graph construction
│   │   ├── graph_query.py          # Graph queries
│   │   └── visualization.py        # Graph visualization
│   │
│   ├── indexer/                    # Indexing service
│   │   ├── chunker.py              # Academic-paper chunker
│   │   ├── vector_indexer.py       # Qdrant vector index
│   │   ├── raptor_builder.py       # RAPTOR hierarchical summary tree
│   │   └── embedder.py             # Embedding-model management
│   │
│   ├── retriever/                  # Retrieval service
│   │   ├── hybrid_retriever.py     # Three-way hybrid retrieval
│   │   ├── hyde.py                 # HyDE query augmentation
│   │   ├── reranker.py             # Cross-encoder reranking
│   │   └── rrf.py                  # RRF fusion
│   │
│   ├── agent/                      # Agent orchestration service
│   │   ├── graph_definition.py     # LangGraph state machine
│   │   ├── nodes.py                # Agent node definitions
│   │   ├── tools.py                # Agent tool set
│   │   └── prompts.py              # Prompt templates
│   │
│   └── llm/                        # Unified LLM access
│       ├── unified_llm.py          # LiteLLM wrapper
│       ├── model_router.py         # Task → model routing
│       └── cache.py                # LLM cache
│
├── models/                         # Data models
│   ├── paper.py                    # Paper model
│   ├── graph.py                    # Graph model
│   └── query.py                    # Query model
│
├── tests/
│   ├── test_parser.py
│   ├── test_extractor.py
│   ├── test_retriever.py
│   └── test_agent.py
│
├── scripts/
│   ├── setup_neo4j.cypher          # Neo4j initialization script
│   ├── batch_parse.py              # Batch parsing script
│   └── build_index.py              # Index-build script
│
├── frontend/                       # Frontend (React/Next.js)
│   ├── components/
│   │   ├── ChatInterface.tsx       # Chat interface
│   │   ├── GraphViewer.tsx         # Knowledge-graph visualization
│   │   ├── PaperUploader.tsx       # PDF upload
│   │   └── SearchResults.tsx       # Search-result display
│   └── ...
│
├── requirements.txt
├── Dockerfile
└── README.md
```

## Quick Start

```bash
# 1. Clone the project
git clone https://github.com/your-org/scholarmind.git
cd scholarmind

# 2. Configure environment variables
cp .env.example .env
# Edit .env: set OPENAI_API_KEY, DEEPSEEK_API_KEY, etc.

# 3. Download the MinerU models
pip install mineru
mineru-models-download -s huggingface -m all

# 4. Start all services
docker-compose up -d

# 5. Pull a local LLM (optional)
docker exec -it scholarmind-ollama ollama pull qwen2.5:14b-instruct

# 6. Batch-import papers
python scripts/batch_parse.py --input /path/to/pdfs/ --workers 4

# 7. Build the indexes
python scripts/build_index.py --vector --graph --raptor

# 8. Access the system
# API:   http://localhost:8080/docs
# Neo4j: http://localhost:7474
# MinIO: http://localhost:9001
```
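Before step 7 the graph schema is initialized by `scripts/setup_neo4j.cypher`. A sketch of what such a script might contain — uniqueness constraints for entity merging and a full-text index for graph retrieval; the labels and property names here are assumptions, not the project's actual schema:

```cypher
// Hypothetical schema setup (Neo4j 5.x syntax); adjust labels/properties to the real schema
CREATE CONSTRAINT entity_name_unique IF NOT EXISTS
FOR (e:Entity) REQUIRE e.name IS UNIQUE;

CREATE CONSTRAINT paper_id_unique IF NOT EXISTS
FOR (p:Paper) REQUIRE p.arxiv_id IS UNIQUE;

CREATE FULLTEXT INDEX entity_fulltext IF NOT EXISTS
FOR (e:Entity) ON EACH [e.name, e.description];
```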

่ฎธๅฏ่ฏ

MIT License


The architecture is based on 2024–2025 research results and open-source practice; all paper citations and benchmark numbers are traceable to their sources.