| # ๐๏ธ ScholarMind โ ็ไบง็บงๅญฆๆฏ็ฅ่ฏๅบ้ฎ็ญ & ็ฅ่ฏๅพ่ฐฑ็ณป็ป |
|
|
| ## ็ณป็ปๆฆ่ฟฐ |
|
|
| ScholarMind ๆฏไธไธช้ขๅ **1000+ ็ฏๅญฆๆฏ PDF ่ฎบๆ** ็็ไบง็บงๆบ่ฝ็ฅ่ฏ็ณป็ป๏ผ้ๆ๏ผ |
| - **PDF ๆทฑๅบฆ่งฃๆ**๏ผๅบไบ MinerU 2.5 VLM ็้ซ็ฒพๅบฆ OCR๏ผๅ
ฌๅผ/่กจๆ ผ/ๅพ่กจ๏ผ |
| - **็ฅ่ฏๅพ่ฐฑ่ชๅจๆๅปบ**๏ผไป่ฎบๆไธญ่ชๅจๆฝๅๅฎไฝไธๅ
ณ็ณป๏ผๆๅปบ้ขๅ็ฅ่ฏๅพ่ฐฑ |
| - **ๆททๅๆฃ็ดข้ฎ็ญ**๏ผGraphRAG + ๅ้ๆฃ็ดข + BM25 ็จ็ๆฃ็ดข็ไธ่ทฏ่ๅ |
| - **ๅคๆจกๅๆฏๆ**๏ผๅๆถๆฏๆๆฌๅฐ้จ็ฝฒ๏ผvLLM/Ollama๏ผๅๅค้จ API๏ผOpenAI/Anthropic/DeepSeek๏ผ |
| - **Agent ็ผๆ**๏ผๅบไบ LangGraph ็ๅค Agent ๅไฝ๏ผๆฏๆๅค่ทณๆจ็ |
|
|
| > **ๆ ธๅฟๆๆ **๏ผๅ A100 80G ๅฏๅจ ~80 ๅ้ๅ
ๅฎๆ 1000 ็ฏ่ฎบๆ๏ผ~10000 ้กต๏ผ็ๅ
จ้่งฃๆ |
|
|
| --- |
|
|
| ## ็ฎๅฝ |
|
|
| 1. [็ณป็ปๆถๆๆป่ง](#1-็ณป็ปๆถๆๆป่ง) |
| 2. [PDF ่งฃๆๅฑ โ MinerU Pipeline](#2-pdf-่งฃๆๅฑ) |
| 3. [็ฅ่ฏๆฝๅๅฑ โ ๅฎไฝๅ
ณ็ณปๆฝๅ](#3-็ฅ่ฏๆฝๅๅฑ) |
| 4. [็ฅ่ฏๅพ่ฐฑๅฑ โ ๅพๆๅปบไธๅญๅจ](#4-็ฅ่ฏๅพ่ฐฑๅฑ) |
| 5. [็ดขๅผๅฑ โ ๅค่ทฏ็ดขๅผๆๅปบ](#5-็ดขๅผๅฑ) |
| 6. [ๆฃ็ดขๅฑ โ ๆททๅๆฃ็ดขไธ้ๆ](#6-ๆฃ็ดขๅฑ) |
| 7. [Agent ็ผๆๅฑ โ ๆบ่ฝ้ฎ็ญ](#7-agent-็ผๆๅฑ) |
| 8. [LLM ็ปไธๆฅๅ
ฅๅฑ](#8-llm-็ปไธๆฅๅ
ฅๅฑ) |
| 9. [็ณป็ป้จ็ฝฒๆถๆ](#9-็ณป็ป้จ็ฝฒๆถๆ) |
| 10. [ๆๆฏ้ๅๅฏนๆฏ](#10-ๆๆฏ้ๅๅฏนๆฏ) |
| 11. [ๅ
ณ้ฎ่ฎบๆไธๅผๆบ้กน็ฎ](#11-ๅ
ณ้ฎ่ฎบๆไธๅผๆบ้กน็ฎ) |
| 12. [้กน็ฎ็ปๆ](#12-้กน็ฎ็ปๆ) |
|
|
| --- |
|
|
| ## 1. ็ณป็ปๆถๆๆป่ง |
|
|
| ``` |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ ScholarMind ็ณป็ปๆถๆ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ โ |
| โ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ ็จๆทๅฑ โ โ FastAPI Gateway โ โ |
| โ โ Web UI โโโโโถโ /upload /query /graph /status /chat WebSocket SSE โ โ |
| โ โ API่ฐ็จ โ โโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โโโโโโโโโโโโ โ โ โ โ |
| โ โผ โผ โผ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ Agent ็ผๆๅฑ (LangGraph) โ โ |
| โ โ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โ โ |
| โ โ โ ่ทฏ็ฑAgent โ โ ๆฃ็ดขAgent โ โ ๆจ็Agent โ โ ๅพ่ฐฑAgent โ โ ๆป็ปAgent โ โ โ |
| โ โ โ (ๅ็ฑปๆๅพ)โ โ (ๆททๅๆฃ็ดข)โ โ (ๅค่ทณๆจ็)โ โ (ๅพ่ฐฑๆฅ่ฏข)โ โ (็ญๆก็ๆ)โ โ โ |
| โ โ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โ โ |
| โ โโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโ โ |
| โ โ โ โ โ โ โ |
| โ โโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโ โ |
| โ โ LLM ็ปไธๆฅๅ
ฅๅฑ (LiteLLM Proxy) โ โ |
| โ โ โโโโโโโโโโโ โโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โ โ |
| โ โ โ vLLM โ โ Ollama โ โ OpenAI/Claudeโ โ DeepSeek โ โ Gemini โ โ โ |
| โ โ โ (ๆฌๅฐ) โ โ (ๆฌๅฐ) โ โ (ๅค้จAPI) โ โ (ๅค้จAPI)โ โ (ๅค้จAPI)โ โ โ |
| โ โ โโโโโโโโโโโ โโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ ๆฃ็ดขๅฑ (Hybrid Retrieval) โ โ |
| โ โ โ โ |
| โ โ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โ โ |
| โ โ โ Dense Vector โ โ Sparse BM25 โ โ Graph Query โ โ Cross-Encoderโ โ โ |
| โ โ โ (Qdrant) โ โ (Qdrant) โ โ (Neo4j) โ โ Reranker โ โ โ |
| โ โ โโโโโโโโฌโโโโโโโ โโโโโโโโฌโโโโโโโโ โโโโโโโโฌโโโโโโโโ โโโโโโโโฌโโโโโโโโ โ โ |
| โ โ โโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโ โ โ |
| โ โ RRF / ๅ ๆ่ๅ โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ ็ดขๅผๅฑ (Multi-Index) โ โ |
| โ โ โ โ |
| โ โ โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ |
| โ โ โ ๅ้็ดขๅผ โ โ ็ฅ่ฏๅพ่ฐฑ็ดขๅผ โ โ RAPTOR ๅฑๆฌกๆ่ฆๆ โ โ โ |
| โ โ โ Qdrant โ โ Neo4j 5.x โ โ (้ๅฝ่็ฑปโๆ่ฆโๅๅตๅ
ฅ) โ โ โ |
| โ โ โ Dense+Sparse โ โ Entity/Relation โ โ PaperโSectionโParagraph โ โ โ |
| โ โ โโโโโโโโโฌโโโโโโโโ โโโโโโโโโโฌโโโโโโโโโโ โโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโ โ โ |
| โ โโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโ โ |
| โ โ โ โ โ |
| โ โโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโ โ |
| โ โ ็ฅ่ฏๆฝๅๅฑ (Knowledge Extraction) โ โ |
| โ โ โ โ |
| โ โ โโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ |
| โ โ โ ๅฎไฝๆฝๅ (NER) โ โ ๅ
ณ็ณปๆฝๅ (RE) โ โ โ |
| โ โ โ GLiNER 440M โ โ LLMGraphTransformer โ โ โ |
| โ โ โ ้ถๆ ทๆฌ, ่ชๅฎไนๆ ็ญพ โ โ + Graphusion ่ๅๅป้ โ โ โ |
| โ โ โโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ PDF ่งฃๆๅฑ (MinerU Pipeline) โ โ |
| โ โ โ โ |
| โ โ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโ โ โ |
| โ โ โ PDF้ๅ โ โ MinerU 2.5 VLM โ โ ๆ ผๅผ่ฝฌๆข โ โ ๅ
ๆฐๆฎๆๅ โ โ โ |
| โ โ โ Celery โโโถโ vLLMๅ็ซฏ 2pg/s โโโถโ JSONโMD โโโถโ ๆ ้ข/ไฝ่
/DOI โ โ โ |
| โ โ โ +Redis โ โ ๅธๅฑ+OCR+ๅ
ฌๅผ+่กจๆ ผ โ โ +็ปๆๅ โ โ +็ซ ่/ๅผ็จ โ โ โ |
| โ โ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโ โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ ๅญๅจๅฑ (Storage) โ โ |
| โ โ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โ โ |
| โ โ โ PostgreSQLโ โ Qdrant โ โ Neo4j โ โ Redis โ โ MinIO/S3 โ โ โ |
| โ โ โ ๅ
ๆฐๆฎ โ โ ๅ้็ดขๅผ โ โ ็ฅ่ฏๅพ่ฐฑ โ โ ็ผๅญ/้ๅ โ โ PDFๅญๅจ โ โ โ |
| โ โ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| --- |
|
|
| ## 2. PDF ่งฃๆๅฑ |
|
|
| ### 2.1 ๆๆฏ้ๅ๏ผMinerU 2.5 VLM |
|
|
| | ๆๆ | MinerU 2.5 | Marker | Nougat | PyMuPDF | |
| |------|-----------|--------|--------|---------| |
| | ๅญฆๆฏ่ฎบๆๆๆฌ็ฒพๅบฆ(Edit Distanceโ) | **0.047** | 0.080 | 0.365 | N/A(ไป
ๆฐๅญPDF) | |
| | ๅ
ฌๅผ่ฏๅซ(CDMโ) | **88.46** | 17.6 | 15.1 | โ | |
| | ่กจๆ ผ่ฏๅซ(TEDSโ) | **88.22** | 67.6 | 39.9 | โ | |
| | ๅๅ(A100, pg/s) | **2.12** | ~5 | ~0.5 | ~100 | |
| | ๆซๆไปถๆฏๆ | โ
| โ ๏ธ | โ | โ | |
|
|
| > **Benchmark ๆฅๆบ**: OmniDocBench (CVPR 2025, arxiv:2412.07626) |
|
|
| ### 2.2 ๆถๆ่ฎพ่ฎก |
|
|
| ```python |
| # ๆททๅ่ทฏ็ฑ็ญ็ฅ๏ผๆฐๅญPDF่ตฐPyMuPDF(ๅฟซ), ๅคๆPDF่ตฐMinerU 2.5(็ฒพ) |
| class PDFRouter: |
| """ๆ นๆฎPDF็นๅพๆบ่ฝ้ๆฉ่งฃๆๅผๆ""" |
| |
| def route(self, pdf_path: str) -> str: |
| import fitz |
| doc = fitz.open(pdf_path) |
| avg_chars = sum(len(p.get_text()) for p in doc) / len(doc) |
| has_images = any(p.get_images() for p in doc) |
| |
| if avg_chars > 500 and not has_images: |
| return "pymupdf_fast" # ็บฏๆฐๅญPDF๏ผPyMuPDF็ง็บง่งฃๆ |
| elif avg_chars > 200: |
| return "mineru_pipeline" # ๆฐๅญPDF+ๅพ่กจ๏ผPipelineๆจกๅผ(CPU) |
| else: |
| return "mineru_vlm" # ๆซๆไปถ/ๅคๆๅธๅฑ๏ผVLMๆจกๅผ(GPU) |
| ``` |
|
|
| ### 2.3 ๆน้ๅค็ๆตๆฐด็บฟ |
|
|
| ``` |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ PDF ๆน้ๅค็ๆตๆฐด็บฟ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ โ |
| โโโโโโโโโโโโ โ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โ โโโโโโโโโโโโ |
| โ PDF ๆไปถ โโโโโโโถโ โ PDF่ทฏ็ฑๅจ โโโโถโ Celery Workerๆฑ โ โโโโถโ ็ปๆๅ โ |
| โ ไธไผ /ๆน้ โ โ โ (็นๅพๆฃๆต) โ โ โ โ โ JSON+MD โ |
| โโโโโโโโโโโโ โ โโโโโโโโโโโโ โ W1: MinerU VLM โ โ โโโโโโโโโโโโ |
| โ โ W2: MinerU VLM โ โ |
| โโโโโโโโโโโโ โ โ W3: Pipeline โ โ โโโโโโโโโโโโ |
| โ Redis โโโโโโโถโ โ W4: PyMuPDF โ โโโโถโ ๅ
ๆฐๆฎ โ |
| โ ไปปๅก้ๅ โ โ โโโโโโโโโโโโโโโโโโโโ โ โ ๆๅ โ |
| โโโโโโโโโโโโ โ โ โโโโโโโโโโโโ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ ็ๆง: ่ฟๅบฆ/ๅคฑ่ดฅ้่ฏ/ๅๅ้็ป่ฎก โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| |
| ๅ
ณ้ฎ้
็ฝฎ: |
| - MinerU VLM Worker: ๆฏGPUไธไธช่ฟ็จ, vLLMๅผๆญฅๆนๅค็ |
| - gpu_memory_utilization: 0.7 (้ข็30%็ปOOMๅฎๅ
จ่พน้
) |
| - max_num_batched_tokens: 16384 (ๆ้ซGPUๅฉ็จ็) |
| - ๅคฑ่ดฅ้่ฏ: ๆๅค3ๆฌก, ๆๆฐ้้ฟ |
| - ่ถ
ๆถ: ๅPDF 300็งไธ้ |
| ``` |
|
|
| ### 2.4 ่พๅบๆฐๆฎๆจกๅ |
|
|
| ```python |
| from pydantic import BaseModel |
| from typing import List, Optional |
| from enum import Enum |
| |
| class ContentType(str, Enum): |
| TITLE = "title" |
| TEXT = "text" |
| TABLE = "table" |
| EQUATION = "equation" |
| EQUATION_BLOCK = "equation_block" |
| IMAGE = "image" |
| CODE = "code" |
| LIST = "list" |
| REFERENCE = "reference" |
| |
| class ContentBlock(BaseModel): |
| type: ContentType |
| content: str # Markdown/LaTeX/HTML |
| page_idx: int |
| bbox: List[float] # [x0, y0, x1, y1] |
| reading_order: int |
| section_hierarchy: List[str] # ["3", "3.1", "Methods"] |
| |
| class PaperMetadata(BaseModel): |
| paper_id: str |
| title: str |
| authors: List[str] |
| abstract: str |
| doi: Optional[str] |
| year: Optional[int] |
| venue: Optional[str] |
| keywords: List[str] |
| references: List[str] # ๅผ็จ็่ฎบๆๆ ้ข |
| |
| class ParsedPaper(BaseModel): |
| metadata: PaperMetadata |
| content_blocks: List[ContentBlock] |
| markdown: str |
| page_count: int |
| parse_engine: str # "mineru_vlm" | "mineru_pipeline" | "pymupdf" |
| parse_time_seconds: float |
| ``` |
|
|
| --- |
|
|
| ## 3. ็ฅ่ฏๆฝๅๅฑ |
|
|
| ### 3.1 ไธค้ถๆฎตๆฝๅ็ญ็ฅ |
|
|
| ``` |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ ็ฅ่ฏๆฝๅๆตๆฐด็บฟ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ โ |
| โ Stage 1: ๅฟซ้ๅฎไฝๆฝๅ (GLiNER, ๆฌๅฐ, 440Mๅๆฐ) โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ ่พๅ
ฅ: ่ฎบๆๆๆฌๅ โ โ |
| โ โ ๆจกๅ: urchade/gliner_large-v2.1 (้ถๆ ทๆฌNER) โ โ |
| โ โ ๆ ็ญพ: [Author, Method, Dataset, Metric, Task, โ โ |
| โ โ Model, Concept, Venue, Score, Tool] โ โ |
| โ โ ่พๅบ: [(text, label, score, span), ...] โ โ |
| โ โ ้ๅบฆ: ~1000 chunks/min (CPU), ~5000 chunks/min (GPU) โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ โ |
| โ โผ โ |
| โ Stage 2: LLMๅ
ณ็ณปๆฝๅ (LLMGraphTransformer, ๆฌๅฐๆAPI) โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ ่พๅ
ฅ: ๆๆฌๅ + Stage1ๅฎไฝๆ็คบ โ โ |
| โ โ ๅ
ณ็ณป็ฑปๅ: [PROPOSED_BY, USED_FOR, EVALUATED_ON, โ โ |
| โ โ TRAINED_WITH, COMPARED_TO, PART_OF, ACHIEVED_SCORE, โ โ |
| โ โ HYPONYM_OF, CITED_BY, IMPROVES_ON] โ โ |
| โ โ ๆฌๅฐ: Ollama(Qwen2.5-14B) ๆ vLLM(Llama-3.1-8B) โ โ |
| โ โ API: GPT-4o-mini ๆ DeepSeek-V3 โ โ |
| โ โ ่พๅบ: [(head, relation, tail, properties), ...] โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ โ |
| โ โผ โ |
| โ Stage 3: Graphusion ่ๅ (ๅฎไฝๅฝไธๅ + ๅฒ็ชๆถ่งฃ) โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ - ๅตๅ
ฅ็ธไผผๅบฆๅๅนถ: "NMT" โ "neural machine translation" โ โ |
| โ โ - LLMๅฒ็ชๆถ่งฃ: ็ธๅๅฎไฝๅฏน็็็พๅ
ณ็ณป โ โ |
| โ โ - ๆฐไธๅ
็ปๆจๆญ: ๅบไบไธไธๆ่กฅๅ
จ็ผบๅคฑๅ
ณ็ณป โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| ### 3.2 ๅญฆๆฏ่ฎบๆๅฎไฝ-ๅ
ณ็ณป Schema |
|
|
| ``` |
| ๅฎไฝ็ฑปๅ (Node Types): |
| โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโ |
| โ ๅฎไฝ็ฑปๅ โ ๆ่ฟฐ โ ๅฑๆง โ |
| โโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโค |
| โ Paper โ ่ฎบๆ โ title, year, doi โ |
| โ Author โ ไฝ่
โ name, affiliationโ |
| โ Method โ ๆนๆณ/็ฎๆณ โ name, descriptionโ |
| โ Dataset โ ๆฐๆฎ้ โ name, size, domainโ |
| โ Task โ ไปปๅก โ name, domain โ |
| โ Metric โ ่ฏไผฐๆๆ โ name, value โ |
| โ Model โ ๅ
ทไฝๆจกๅๅฎไพ โ name, params โ |
| โ Concept โ ๅญฆๆฏๆฆๅฟต โ name, definition โ |
| โ Tool โ ๅทฅๅ
ท/ๆกๆถ โ name, version โ |
| โ Venue โ ๅ่กจๅบๆ โ name, type โ |
| โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโ |
| |
| ๅ
ณ็ณป็ฑปๅ (Edge Types): |
| โโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ ๅ
ณ็ณป็ฑปๅ โ ๆ่ฟฐ (Head โ Tail) โ |
| โโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ PROPOSED_BY โ Method โ Author (ๆนๆณ็ฑไฝ่
ๆๅบ) โ |
| โ PUBLISHED_IN โ Paper โ Venue (่ฎบๆๅ่กจๅจๆไผ่ฎฎ/ๆๅ) โ |
| โ USED_FOR โ Method โ Task (ๆนๆณ็จไบๆไปปๅก) โ |
| โ EVALUATED_ON โ Method โ Dataset (ๆนๆณๅจๆๆฐๆฎ้ไธ่ฏไผฐ) โ |
| โ ACHIEVED_SCORE โ Method โ Metric (ๆนๆณ่พพๅฐๆๆๆ ๅผ) โ |
| โ TRAINED_WITH โ Model โ Dataset (ๆจกๅๅจๆๆฐๆฎ้ไธ่ฎญ็ป) โ |
| โ COMPARED_TO โ Method โ Method (ๆนๆณไน้ด็ๅฏนๆฏ) โ |
| โ IMPROVES_ON โ Method โ Method (ๆนๆณAๆน่ฟไบๆนๆณB) โ |
| โ PART_OF โ Concept โ Concept (ๆฆๅฟตๅฑ็บงๅ
ณ็ณป) โ |
| โ CITES โ Paper โ Paper (ๅผ็จๅ
ณ็ณป) โ |
| โ AUTHORED_BY โ Paper โ Author (่ฎบๆไฝ่
) โ |
| โ HYPONYM_OF โ Concept โ Concept (ไธไธไฝๅ
ณ็ณป) โ |
| โ USES_TOOL โ Method โ Tool (ๆนๆณไฝฟ็จๆๅทฅๅ
ท) โ |
| โโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| ### 3.3 ๆ ธๅฟๆฝๅไปฃ็ |
|
|
| ```python |
| from gliner import GLiNER |
| from langchain_experimental.graph_transformers import LLMGraphTransformer |
| from langchain_core.documents import Document |
| |
| class KnowledgeExtractor: |
| """ไธค้ถๆฎต็ฅ่ฏๆฝๅๅจ""" |
| |
| def __init__(self, llm_backend: str = "local_ollama"): |
| # Stage 1: ๅฟซ้NER |
| self.ner_model = GLiNER.from_pretrained("urchade/gliner_large-v2.1") |
| self.entity_labels = [ |
| "author", "method", "dataset", "metric", |
| "task", "model", "concept", "tool", "venue", "score" |
| ] |
| |
| # Stage 2: LLMๅ
ณ็ณปๆฝๅ |
| self.llm = self._init_llm(llm_backend) |
| self.graph_transformer = LLMGraphTransformer( |
| llm=self.llm, |
| allowed_nodes=["Author","Method","Dataset","Metric","Task","Model","Concept","Tool","Venue"], |
| allowed_relationships=[ |
| "PROPOSED_BY","USED_FOR","EVALUATED_ON","ACHIEVED_SCORE", |
| "TRAINED_WITH","COMPARED_TO","IMPROVES_ON","PART_OF", |
| "CITES","AUTHORED_BY","HYPONYM_OF","USES_TOOL","PUBLISHED_IN" |
| ], |
| node_properties=["description", "year"], |
| relationship_properties=["score_value", "metric_name", "confidence"], |
| strict_mode=True, |
| ) |
| |
| def _init_llm(self, backend: str): |
| """็ปไธLLMๅๅงๅ โ ๆฏๆๆฌๅฐๅๅค้จAPI""" |
| if backend == "local_ollama": |
| from langchain_community.llms import Ollama |
| return Ollama(model="qwen2.5:14b-instruct", temperature=0) |
| elif backend == "local_vllm": |
| from langchain_openai import ChatOpenAI |
| return ChatOpenAI( |
| base_url="http://localhost:8000/v1", |
| api_key="token", |
| model="meta-llama/Llama-3.1-8B-Instruct", |
| temperature=0 |
| ) |
| elif backend == "openai": |
| from langchain_openai import ChatOpenAI |
| return ChatOpenAI(model="gpt-4o-mini", temperature=0) |
| elif backend == "deepseek": |
| from langchain_openai import ChatOpenAI |
| return ChatOpenAI( |
| base_url="https://api.deepseek.com/v1", |
| model="deepseek-chat", |
| temperature=0 |
| ) |
| |
| async def extract(self, text: str, paper_id: str) -> dict: |
| """ไธค้ถๆฎตๆฝๅ""" |
| # Stage 1: GLiNERๅฟซ้NER |
| entities = self.ner_model.predict_entities( |
| text, self.entity_labels, threshold=0.5 |
| ) |
| |
| # Stage 2: LLMๅ
ณ็ณปๆฝๅ (ไผ ๅ
ฅๅฎไฝไฝไธบๆ็คบ) |
| entity_hint = ", ".join([f"{e['text']}({e['label']})" for e in entities[:20]]) |
| doc = Document( |
| page_content=text, |
| metadata={"paper_id": paper_id, "entity_hints": entity_hint} |
| ) |
| graph_docs = await self.graph_transformer.aconvert_to_graph_documents([doc]) |
| |
| return { |
| "entities": entities, |
| "graph_documents": graph_docs, |
| "paper_id": paper_id |
| } |
| ``` |
|
|
| --- |
|
|
| ## 4. ็ฅ่ฏๅพ่ฐฑๅฑ |
|
|
| ### 4.1 ๅพๆฐๆฎๅบ้ๅ๏ผNeo4j 5.x |
|
|
| | ๅพๆฐๆฎๅบ | ่ฎธๅฏ่ฏ | ๆฅ่ฏข่ฏญ่จ | Python้ฉฑๅจ | ็ๆ้ๆ | ้็จ่งๆจก | |
| |---------|--------|---------|-----------|---------|---------| |
| | **Neo4j 5.x** | Community AGPL | Cypher | `neo4j` | LangChain/LlamaIndexๅ็ | <1ไบฟ่็น | |
| | ArangoDB | Apache 2.0 | AQL | `python-arango` | ๅคๆจกๅ(ๆๆกฃ+ๅพ) | <1ไบฟ่็น | |
| | NebulaGraph | Apache 2.0 | nGQL | `nebula3-python` | LlamaIndexๅ็ | 10ไบฟ+่็น | |
| | Kuzu | MIT | Cypher | `kuzu` | ๅตๅ
ฅๅผ, ่ฝป้ | <1000ไธ่็น | |
|
|
| > **ๆจ่ Neo4j 5.x**๏ผLangChain/LlamaIndex ๅ็้ๆๆๅฎๅ๏ผCypher ๆฅ่ฏข็ๆๆๆ็๏ผ้ๅ1000็ฏ่ฎบๆ่งๆจก |
|
|
| ### 4.2 ๅพ่ฐฑๆฐๆฎๆจกๅ |
|
|
| ```cypher |
| // ===== ่็น ===== |
| (:Paper {id, title, year, doi, venue, abstract, embedding}) |
| (:Author {id, name, affiliation, h_index}) |
| (:Method {id, name, description, year_proposed, embedding}) |
| (:Dataset {id, name, domain, size, description}) |
| (:Task {id, name, domain, description}) |
| (:Metric {id, name, description}) |
| (:Concept {id, name, definition, embedding}) |
| |
| // ===== ๅ
ณ็ณป ===== |
| (:Paper)-[:AUTHORED_BY {order}]->(:Author) |
| (:Paper)-[:PUBLISHED_IN {year}]->(:Venue) |
| (:Paper)-[:CITES]->(:Paper) |
| (:Paper)-[:PROPOSES]->(:Method) |
| (:Method)-[:USED_FOR]->(:Task) |
| (:Method)-[:EVALUATED_ON {score, metric}]->(:Dataset) |
| (:Method)-[:IMPROVES_ON {delta, metric}]->(:Method) |
| (:Method)-[:COMPARED_TO {result}]->(:Method) |
| (:Concept)-[:PART_OF]->(:Concept) |
| (:Concept)-[:HYPONYM_OF]->(:Concept) |
| |
| // ===== ็ดขๅผ ===== |
| CREATE VECTOR INDEX paper_embedding FOR (p:Paper) ON (p.embedding) |
| OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}; |
| CREATE VECTOR INDEX method_embedding FOR (m:Method) ON (m.embedding) |
| OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}; |
| CREATE FULLTEXT INDEX paper_fulltext FOR (p:Paper) ON EACH [p.title, p.abstract]; |
| ``` |
|
|
| ### 4.3 ๅพ่ฐฑๆๅปบๆตๆฐด็บฟ |
|
|
| ``` |
| ่งฃๆๅ็่ฎบๆ โโโถ ็ฅ่ฏๆฝๅ โโโถ ไธๅ
็ป่ง่ๅ โโโถ Neo4j ๅๅ
ฅ |
| โ |
| โโโโโโโโโโโโโดโโโโโโโโโโโโ |
| โ Graphusion ่ๅๅผๆ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ 1. ๅฎไฝๅฝไธๅ โ |
| โ - ๅตๅ
ฅ็ธไผผๅบฆ > 0.92 โ |
| โ - LLM็กฎ่ฎคๅๅนถ โ |
| โ "BERT" = "bert model" โ |
| โ โ |
| โ 2. ๅ
ณ็ณปๅฒ็ชๆถ่งฃ โ |
| โ - ๅไธๅฎไฝๅฏนๅคๅ
ณ็ณป โ |
| โ - ๅ็ฝฎไฟกๅบฆๆ้ซ็ โ |
| โ โ |
| โ 3. ็ผบๅคฑๅ
ณ็ณปๆจๆญ โ |
| โ - ๅบไบๅพ็ปๆๆจกๅผ โ |
| โ - LLM่กฅๅ
จ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| ### 4.4 ๅพ่ฐฑๅฏ่งๅๆนๆก |
|
|
| ```python |
| # ๆนๆก1: Neo4j Browser (ๅผๅ้ถๆฎต) |
| # ๅ
็ฝฎCypherๆฅ่ฏข + ไบคไบๅผๅพๅฏ่งๅ |
| |
| # ๆนๆก2: vis-network (ๅ็ซฏ้ๆ) |
| # pip install pyvis |
| from pyvis.network import Network |
| |
| def visualize_subgraph(nodes, edges, output_path="graph.html"): |
| net = Network(height="800px", width="100%", directed=True) |
| color_map = { |
| "Method": "#ff6b6b", "Dataset": "#4ecdc4", |
| "Task": "#45b7d1", "Author": "#96ceb4", |
| "Paper": "#ffeaa7", "Concept": "#dfe6e9" |
| } |
| for node in nodes: |
| net.add_node(node["id"], label=node["name"], |
| color=color_map.get(node["type"], "#95a5a6")) |
| for edge in edges: |
| net.add_edge(edge["from"], edge["to"], label=edge["type"]) |
| net.show(output_path) |
| |
| # ๆนๆก3: React + D3-force (็ไบงๅ็ซฏ) |
| # ๆจ่ react-force-graph ๆ neo4j-viz |
| ``` |
|
|
| --- |
|
|
| ## 5. ็ดขๅผๅฑ |
|
|
| ### 5.1 ไธ่ทฏ็ดขๅผๆถๆ |
|
|
| ``` |
| ่งฃๆๅ็่ฎบๆๅ
ๅฎน |
| โ |
| โโโโโโโโโโโโโผโโโโโโโโโโโโ |
| โผ โผ โผ |
| โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ |
| โ ๅ้็ดขๅผ โ โ ๅพ่ฐฑ็ดขๅผ โ โ RAPTORๆ โ |
| โ โ โ โ โ โ |
| โ Qdrant โ โ Neo4j โ โ ๅฑๆฌกๆ่ฆ โ |
| โ Dense + โ โ Cypher + โ โ ้ๅฝ่็ฑป โ |
| โ Sparse โ โ Vector โ โ โ ๆ่ฆ โ |
| โ โ โ Index โ โ โ ๅๅตๅ
ฅ โ |
| โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ |
| |
| ้ๅ: ้ๅ: ้ๅ: |
| ไบๅฎๆฅ่ฏข ๅค่ทณๆจ็ ๅ
จๅฑๆฆ่ง |
| ็ฒพ็กฎๆฃ็ดข ๅ
ณ็ณป่ฟฝๆบฏ ไธป้ขๆป็ป |
| ็ธไผผ่ฎบๆ ๅฏนๆฏๅๆ ่ถๅฟๅๆ |
| ``` |
|
|
| ### 5.2 ๆๆกฃๅๅ็ญ็ฅ |
|
|
| ```python |
| class AcademicChunker: |
| """ๅญฆๆฏ่ฎบๆไธ็จๅๅๅจ โ ไฟ็็ซ ่ๅฑ็บง""" |
| |
| def __init__(self, chunk_size: int = 256, overlap: int = 50): |
| self.chunk_size = chunk_size # 256 tokens (ๅฎ้ช้ช่ฏๆไฝณ, arxiv:2502.11371) |
| self.overlap = overlap |
| |
| def chunk(self, parsed_paper: ParsedPaper) -> list: |
| chunks = [] |
| |
| for block in parsed_paper.content_blocks: |
| if block.type == ContentType.TABLE: |
| # ่กจๆ ผไฝไธบๅฎๆดchunk, ้ๅ ๆ่ฟฐ |
| chunks.append({ |
| "text": f"[TABLE] {block.content}", |
| "metadata": { |
| "paper_id": parsed_paper.metadata.paper_id, |
| "type": "table", |
| "section": block.section_hierarchy, |
| "page": block.page_idx, |
| } |
| }) |
| elif block.type == ContentType.EQUATION_BLOCK: |
| # ๅ
ฌๅผๅ + ไธไธๆ |
| chunks.append({ |
| "text": f"[EQUATION] {block.content}", |
| "metadata": { |
| "paper_id": parsed_paper.metadata.paper_id, |
| "type": "equation", |
| "section": block.section_hierarchy, |
| } |
| }) |
| else: |
| # ๆฎ้ๆๆฌ: ๅบๅฎๅคงๅฐๅๅ, ๆๅฅๅญ่พน็ๅฏน้ฝ |
| text_chunks = self._split_text(block.content) |
| for tc in text_chunks: |
| chunks.append({ |
| "text": tc, |
| "metadata": { |
| "paper_id": parsed_paper.metadata.paper_id, |
| "type": block.type.value, |
| "section": block.section_hierarchy, |
| "page": block.page_idx, |
| } |
| }) |
| |
| return chunks |
| |
| def _split_text(self, text: str) -> list: |
| """ๆๅฅๅญ่พน็ๅๅ, ไฟๆ256 tokenๅคงๅฐ""" |
| import re |
| sentences = re.split(r'(?<=[.!?])\s+', text) |
| chunks, current = [], [] |
| current_len = 0 |
| |
| for sent in sentences: |
| sent_len = len(sent.split()) # ็ฎๅ็token่ฎกๆฐ |
| if current_len + sent_len > self.chunk_size and current: |
| chunks.append(" ".join(current)) |
| # ไฟ็overlap |
| overlap_sents = [] |
| overlap_len = 0 |
| for s in reversed(current): |
| if overlap_len + len(s.split()) > self.overlap: |
| break |
| overlap_sents.insert(0, s) |
| overlap_len += len(s.split()) |
| current = overlap_sents |
| current_len = overlap_len |
| current.append(sent) |
| current_len += sent_len |
| |
| if current: |
| chunks.append(" ".join(current)) |
| return chunks |
| ``` |
|
|
| ### 5.3 RAPTOR ๅฑๆฌกๆ่ฆๆ |
|
|
| ``` |
| ่ฎบๆ้ๅ (1000็ฏ) |
| โ |
| โโโ Level 0: ๅๅงๆๆฌๅ (256 tokens) |
| โ โ |
| โ โผ SBERTๅตๅ
ฅ โ GMM่็ฑป โ UMAP้็ปด |
| โ |
| โโโ Level 1: ๆฎต่ฝ็บงๆ่ฆ (~50ไธช่็ฑป) |
| โ โ LLM็ๆๆ่ฆ โ ้ๆฐๅตๅ
ฅ |
| โ โผ ๅๆฌก่็ฑป |
| โ |
| โโโ Level 2: ไธป้ข็บงๆ่ฆ (~15ไธช่็ฑป) |
| โ โ "Transformerๆถๆ็ๆน่ฟๆนๅ" |
| โ โผ "ๅคง่งๆจก้ข่ฎญ็ปๆฐๆฎ้็ปผ่ฟฐ" |
| โ |
| โโโ Level 3: ้ขๅ็บงๆ่ฆ (~5ไธช่็ฑป) |
| "NLP้ขๅ่ฟๅนดไธป่ฆ็ ็ฉถๆนๅไธ็ช็ ด" |
| |
| ๆฅ่ฏขๆถ: ไปๆๆๅฑ็บงไธญๆฃ็ดขๆ็ธๅ
ณ่็น (Collapsed Treeๆจกๅผ) |
| ไผๅฟ: ๆข่ฝๅ็ญ็ป่้ฎ้ข(Level 0), ไน่ฝๅ็ญๅ
จๅฑ้ฎ้ข(Level 2-3) |
| ``` |
|
|
| --- |
|
|
| ## 6. ๆฃ็ดขๅฑ |
|
|
| ### 6.1 ๆททๅๆฃ็ดขๆถๆ |
|
|
| ``` |
| ็จๆทๆฅ่ฏข |
| โ |
| โโโโโโโโโดโโโโโโโโ |
| โผ โผ |
| โโโโโโโโโโโโ โโโโโโโโโโโโโโโโ |
| โ HyDE โ โ ๆฅ่ฏขๅ็ฑปๅจ โ |
| โ ๅ่ฎพๆๆกฃ โ โ (Router LLM) โ |
| โ ็ๆ+ๅตๅ
ฅ โ โ โ |
| โโโโโโฌโโโโโโ โโโโโโโโฌโโโโโโโโ |
| โ โ |
| โ โโโโโโโโโโโโโผโโโโโโโโโโโโ |
| โ โผ โผ โผ |
| โ factual reasoning global |
| โ (ไบๅฎ) (ๆจ็) (ๅ
จๅฑ) |
| โ โ โ โ |
| โผ โผ โผ โผ |
| โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ |
| โ ๅ้+BM25 โ โ ๅพ่ฐฑ้ๅ โ โ RAPTOR โ |
| โ Qdrant โ โ Neo4j โ โ ๆ่ฆๆ โ |
| โ Hybrid โ โ Cypher โ โ ๅ
จๅฑๆฃ็ดข โ |
| โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ |
| โ โ โ |
| โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโ |
| โ |
| โโโโโโโโโผโโโโโโโโ |
| โ RRF ่ๅๆๅบ โ |
| โ (Reciprocal โ |
| โ Rank Fusion) โ |
| โโโโโโโโโฌโโโโโโโโ |
| โ |
| โโโโโโโโโผโโโโโโโโ |
| โ Cross-Encoder โ |
| โ Reranker โ |
| โ bge-reranker โ |
| โ -large โ |
| โโโโโโโโโฌโโโโโโโโ |
| โ |
| Top-5 ็ปๆ |
| โ |
| โโโโโโโโโผโโโโโโโโ |
| โ LLM ็ญๆก็ๆ โ |
| โ + ๅผ็จๆบฏๆบ โ |
| โโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| ### 6.2 ๆ ธๅฟๆฃ็ดขไปฃ็ |
|
|
| ```python |
| from qdrant_client import QdrantClient, models |
| from neo4j import GraphDatabase |
| |
| class HybridRetriever: |
| """ไธ่ทฏๆททๅๆฃ็ดขๅจ""" |
| |
| def __init__(self): |
| self.qdrant = QdrantClient("localhost", port=6333) |
| self.neo4j = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) |
| self.reranker = self._load_reranker() |
| self.embed_model = self._load_embedder() |
| |
| async def retrieve(self, query: str, mode: str = "hybrid", top_k: int = 20) -> list: |
| """ |
| mode: "factual" | "reasoning" | "global" | "hybrid" |
| """ |
| results = [] |
| |
| if mode in ("factual", "hybrid"): |
| # 1. Dense + Sparse ๅ้ๆฃ็ดข |
| query_vec = self.embed_model.encode(query) |
| vec_results = self.qdrant.search( |
| collection_name="papers", |
| query_vector=models.NamedVector(name="dense", vector=query_vec), |
| limit=top_k, |
| with_payload=True, |
| ) |
| results.extend([{"text": r.payload["text"], "score": r.score, |
| "source": "vector", "metadata": r.payload} for r in vec_results]) |
| |
| if mode in ("reasoning", "hybrid"): |
| # 2. ๅพ่ฐฑๆฃ็ดข โ ๅฎไฝ+ๅ
ณ็ณป่ทฏๅพ |
| graph_results = self._graph_search(query, limit=top_k // 2) |
| results.extend(graph_results) |
| |
| if mode in ("global", "hybrid"): |
| # 3. RAPTOR ๅฑๆฌกๆ่ฆๆฃ็ดข |
| raptor_results = self._raptor_search(query, limit=top_k // 3) |
| results.extend(raptor_results) |
| |
| # 4. RRF ่ๅๆๅบ |
| fused = self._rrf_fusion(results) |
| |
| # 5. Cross-Encoder ้ๆ |
| reranked = self._rerank(query, fused[:top_k]) |
| |
| return reranked[:5] |
| |
| def _graph_search(self, query: str, limit: int = 10) -> list: |
| """Neo4j ๅญๅพๆฃ็ดข""" |
| # ๅ
็จๅ้็ดขๅผๆพๅฐๆ็ธๅ
ณ็ๅฎไฝ่็น |
| # ๅ็จCypher้ๅ1-2่ทณ้ปๅฑ
|
| cypher = """ |
| CALL db.index.vector.queryNodes('method_embedding', $limit, $query_vec) |
| YIELD node, score |
| MATCH (node)-[r]-(neighbor) |
| RETURN node, r, neighbor, score |
| ORDER BY score DESC LIMIT $limit |
| """ |
| with self.neo4j.session() as session: |
| result = session.run(cypher, query_vec=self.embed_model.encode(query).tolist(), limit=limit) |
| return [{"text": self._format_graph_result(r), "score": r["score"], |
| "source": "graph"} for r in result] |
| |
| def _rrf_fusion(self, results: list, k: int = 60) -> list: |
| """Reciprocal Rank Fusion โ ๅค่ทฏ็ปๆ่ๅ""" |
| doc_scores = {} |
| for rank, r in enumerate(sorted(results, key=lambda x: x["score"], reverse=True)): |
| doc_key = r["text"][:200] # ๅป้key |
| if doc_key not in doc_scores: |
| doc_scores[doc_key] = {"result": r, "rrf_score": 0} |
| doc_scores[doc_key]["rrf_score"] += 1.0 / (k + rank + 1) |
| |
| return [v["result"] | {"score": v["rrf_score"]} |
| for v in sorted(doc_scores.values(), key=lambda x: x["rrf_score"], reverse=True)] |
| |
| def _rerank(self, query: str, results: list) -> list: |
| """BAAI/bge-reranker-large ไบคๅ็ผ็ ๅจ้ๆ""" |
| pairs = [(query, r["text"]) for r in results] |
| scores = self.reranker.predict(pairs) |
| for r, s in zip(results, scores): |
| r["rerank_score"] = float(s) |
| return sorted(results, key=lambda x: x["rerank_score"], reverse=True) |
| ``` |
|
|
| --- |
|
|
| ## 7. Agent ็ผๆๅฑ |
|
|
| ### 7.1 LangGraph ๅคAgentๆถๆ |
|
|
| ``` |
| โโโโโโโโโโโโโโโโ |
| โ ็จๆทๆฅ่ฏข โ |
| โโโโโโโโฌโโโโโโโโ |
| โ |
| โโโโโโโโผโโโโโโโโ |
| โ ่ทฏ็ฑ Agent โ |
| โ (ๆๅพๅ็ฑป) โ |
| โโโโโโโโฌโโโโโโโโ |
| โ |
| โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโ |
| โ โ โ |
| โโโโโโโโผโโโโโโโ โโโโโโโโผโโโโโโโ โโโโโโโโผโโโโโโโ |
| โ ็ฎๅ้ฎ็ญ โ โ ๅค่ทณๆจ็ โ โ ๅ
จๅฑๅๆ โ |
| โ โ โ โ โ โ |
| โ ๅ้ๆฃ็ดข โ โ ๅพ่ฐฑ้ๅ โ โ RAPTOR+KG โ |
| โ โ ็ๆ็ญๆก โ โ โ ้พๅผๆจ็ โ โ โ ็ปผๅๆป็ป โ |
| โ โ ๅผ็จๆบฏๆบ โ โ โ ่ฏๆฎๆถ้ โ โ โ ่ถๅฟๆดๅฏ โ |
| โโโโโโโโฌโโโโโโโ โโโโโโโโฌโโโโโโโ โโโโโโโโฌโโโโโโโ |
| โ โ โ |
| โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโ |
| โ |
| โโโโโโโโผโโโโโโโโ |
| โ ่ชๆฃ Agent โ |
| โ (็ญๆก้ช่ฏ) โ |
| โ ๆฏๅฆๅ
ๅ? โ |
| โ ๆฏๅฆๆๅนป่ง? โ |
| โโโโโโโโฌโโโโโโโโ |
| โ |
| โโโโโโโโโโโโโผโโโโโโโโโโโโ |
| โ ๅ
ๅ โ ไธๅ
ๅ โ |
| โผ โผ โ |
| โโโโโโโโโโโโ โโโโโโโโโโโโ โ |
| โ ่พๅบ็ญๆก โ โ ่กฅๅ
ๆฃ็ดข โโโโโโโ |
| โ + ๅผ็จ โ โ (ๆดๅคๆบ) โ (ๆๅค3่ฝฎ) |
| โ + ๅพ่ฐฑ โ โโโโโโโโโโโโ |
| โโโโโโโโโโโโ |
| ``` |
|
|
| ### 7.2 LangGraph ็ถๆๆบๅฎไน |
|
|
| ```python |
| from typing import TypedDict, Annotated, Literal |
| from langgraph.graph import StateGraph, END |
| from langgraph.graph.message import add_messages |
| |
| class AgentState(TypedDict): |
| messages: Annotated[list, add_messages] |
| query: str |
| query_type: Literal["factual", "reasoning", "global"] |
| retrieved_docs: list |
| graph_context: list |
| answer: str |
| citations: list |
| confidence: float |
| iteration: int |
| |
| def build_agent_graph(): |
| graph = StateGraph(AgentState) |
| |
| # ๆทปๅ ่็น |
| graph.add_node("router", route_query) |
| graph.add_node("retriever", hybrid_retrieve) |
| graph.add_node("graph_explorer", explore_knowledge_graph) |
| graph.add_node("generator", generate_answer) |
| graph.add_node("validator", validate_answer) |
| graph.add_node("supplementer", supplement_retrieval) |
| |
| # ๅฎไน่พน |
| graph.set_entry_point("router") |
| graph.add_edge("router", "retriever") |
| graph.add_edge("retriever", "graph_explorer") |
| graph.add_edge("graph_explorer", "generator") |
| graph.add_edge("generator", "validator") |
| |
| # ๆกไปถ่พน: ้ช่ฏ้่ฟโ็ปๆ, ไธ้่ฟโ่กฅๅ
ๆฃ็ดข(ๆๅค3่ฝฎ) |
| graph.add_conditional_edges( |
| "validator", |
| lambda state: "end" if state["confidence"] > 0.8 or state["iteration"] >= 3 else "supplement", |
| {"end": END, "supplement": "supplementer"} |
| ) |
| graph.add_edge("supplementer", "retriever") |
| |
| return graph.compile() |
| |
| async def route_query(state: AgentState) -> AgentState: |
| """LLMๆๅพๅ็ฑป""" |
| classification_prompt = f""" |
| ๅฐไปฅไธๅญฆๆฏ้ฎ้ขๅ็ฑปไธบไธ็ง็ฑปๅไนไธ: |
| - factual: ๅ
ทไฝไบๅฎๆฅ่ฏข (ๆไธชๆนๆณ็ๆๆใๆ็ฏ่ฎบๆ็ไฝ่
) |
| - reasoning: ้่ฆๅคๆญฅๆจ็ (ๆนๆณAๅB็ๅบๅซใๆๆๆฏ็ๅๅฑ่็ป) |
| - global: ๅ
จๅฑๆงๅๆ (ๆ้ขๅ็็ ็ฉถ่ถๅฟใไธป่ฆๆๆ) |
| |
| ้ฎ้ข: {state['query']} |
| ็ฑปๅ: """ |
| |
| query_type = await llm.ainvoke(classification_prompt) |
| return {"query_type": query_type.content.strip()} |
| |
| async def validate_answer(state: AgentState) -> AgentState: |
| """Self-RAG ๆจกๅผ: LLM่ชๆฃ็ญๆก่ดจ้""" |
| validation_prompt = f""" |
| ่ฏไผฐไปฅไธ็ญๆก็่ดจ้(0-1ๅ): |
| ้ฎ้ข: {state['query']} |
| ็ญๆก: {state['answer']} |
| ๆฃ็ดขไพๆฎ: {state['retrieved_docs'][:3]} |
| |
| ่ฏๅๆ ๅ: |
| - ๆฏๅฆๅฎๆดๅ็ญไบ้ฎ้ข |
| - ๆฏๅฆๆไพๆฎๆฏๆ |
| - ๆฏๅฆๅญๅจๅนป่ง |
| |
| ่ฟๅJSON: {{"confidence": 0.X, "issues": ["..."]}} |
| """ |
| result = await llm.ainvoke(validation_prompt) |
| confidence = parse_confidence(result.content) |
| return {"confidence": confidence, "iteration": state["iteration"] + 1} |
| ``` |
|
|
| ### 7.3 Agent ๅทฅๅ
ท้ |
|
|
| ```python |
| from langchain.tools import tool |
| |
| @tool |
| def vector_search(query: str, top_k: int = 5) -> str: |
| """ๅจ่ฎบๆๅ้ๅบไธญ่ฟ่ก่ฏญไนๆ็ดข""" |
| results = retriever.search_vectors(query, top_k) |
| return format_search_results(results) |
| |
| @tool |
| def graph_query(cypher: str) -> str: |
| """ๆง่กCypherๆฅ่ฏข, ๅจ็ฅ่ฏๅพ่ฐฑไธญๆฃ็ดขๅฎไฝๅๅ
ณ็ณป""" |
| with neo4j_driver.session() as session: |
| result = session.run(cypher) |
| return format_graph_results(result) |
| |
| @tool |
| def find_related_methods(method_name: str) -> str: |
| """ๆฅๆพไธๆๅฎๆนๆณ็ธๅ
ณ็ๆๆๆนๆณ(ๆน่ฟใๅฏนๆฏใไฝฟ็จ)""" |
| cypher = """ |
| MATCH (m:Method {name: $name})-[r]-(related) |
| RETURN type(r) as relation, labels(related) as type, |
| related.name as name, r.score_value as score |
| ORDER BY r.score_value DESC |
| LIMIT 20 |
| """ |
| return execute_and_format(cypher, {"name": method_name}) |
| |
| @tool |
| def get_paper_summary(paper_id: str) -> str: |
| """่ทๅ่ฎบๆ็ๆ่ฆๅๆ ธๅฟ่ดก็ฎ""" |
| return paper_store.get_summary(paper_id) |
| |
| @tool |
| def compare_methods(method_a: str, method_b: str) -> str: |
| """ๅฏนๆฏไธคไธชๆนๆณๅจ็ธๅๆฐๆฎ้ไธ็่กจ็ฐ""" |
| cypher = """ |
| MATCH (a:Method {name: $a})-[r1:EVALUATED_ON]->(d:Dataset)<-[r2:EVALUATED_ON]-(b:Method {name: $b}) |
| RETURN d.name as dataset, r1.score as score_a, r2.score as score_b, |
| r1.metric as metric |
| """ |
| return execute_and_format(cypher, {"a": method_a, "b": method_b}) |
| |
| @tool |
| def research_trend(topic: str, years: int = 5) -> str: |
| """ๅๆๆไธช็ ็ฉถไธป้ขๅจ่ฟNๅนด็ๅๅฑ่ถๅฟ""" |
| raptor_results = raptor_index.search(topic, level="high") |
| graph_stats = get_temporal_graph_stats(topic, years) |
| return synthesize_trend(raptor_results, graph_stats) |
| ``` |
|
|
| --- |
|
|
| ## 8. LLM ็ปไธๆฅๅ
ฅๅฑ |
|
|
| ### 8.1 ๆถๆ่ฎพ่ฎก |
|
|
| ``` |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ LiteLLM Proxy Server โ |
| โ (็ปไธ OpenAI ๅ
ผๅฎนๆฅๅฃ) โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ โ |
| โ model_list: โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ "local/qwen2.5-14b" โโโโโโโถ Ollama :11434 |
| โ โ "local/llama-3.1-8b" โโโโโโโถ vLLM :8000 |
| โ โ "gpt-4o-mini" โโโโโโโถ OpenAI API |
| โ โ "claude-3-5-sonnet" โโโโโโโถ Anthropic API |
| โ โ "deepseek-chat" โโโโโโโถ DeepSeek API |
| โ โ "gemini-2.0-flash" โโโโโโโถ Google API |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ |
| โ ๅ่ฝ: โ |
| โ - ็ปไธ /chat/completions ๆฅๅฃ โ |
| โ - ่ชๅจfallback (ๆฌๅฐโAPI) โ |
| โ - ่ด่ฝฝๅ่กก (ๅคvLLMๅฎไพ) โ |
| โ - ้็้ๅถ & ๆๆฌ่ฟฝ่ธช โ |
| โ - ็ผๅญ (็ธๅqueryๅค็จ) โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| ### 8.2 LiteLLM ้
็ฝฎ |
|
|
| ```yaml |
| # litellm_config.yaml |
| model_list: |
| # ===== ๆฌๅฐๆจกๅ ===== |
| - model_name: "local/qwen2.5-14b" |
| litellm_params: |
| model: "openai/Qwen2.5-14B-Instruct" |
| api_base: "http://localhost:11434/v1" # Ollama |
| api_key: "ollama" |
| model_info: |
| max_tokens: 32768 |
| input_cost_per_token: 0 # ๆฌๅฐๅ
่ดน |
| |
| - model_name: "local/llama-3.1-8b" |
| litellm_params: |
| model: "openai/meta-llama/Llama-3.1-8B-Instruct" |
| api_base: "http://localhost:8000/v1" # vLLM |
| api_key: "token" |
| model_info: |
| max_tokens: 131072 |
| |
| # ===== ๅค้จAPI ===== |
| - model_name: "gpt-4o-mini" |
| litellm_params: |
| model: "gpt-4o-mini" |
| api_key: "os.environ/OPENAI_API_KEY" |
| |
| - model_name: "deepseek-chat" |
| litellm_params: |
| model: "deepseek/deepseek-chat" |
| api_key: "os.environ/DEEPSEEK_API_KEY" |
| |
| # ่ทฏ็ฑ็ญ็ฅ |
| router_settings: |
| routing_strategy: "latency-based-routing" # ้ๆฉๅปถ่ฟๆไฝ็ |
| num_retries: 3 |
| fallbacks: |
| - "local/qwen2.5-14b": ["gpt-4o-mini"] # ๆฌๅฐๅคฑ่ดฅโAPI |
| - "gpt-4o-mini": ["deepseek-chat"] # OpenAIๅคฑ่ดฅโDeepSeek |
| |
| # ไธๅไปปๅก็จไธๅๆจกๅ |
| model_group_alias: |
| "extraction": "local/qwen2.5-14b" # ็ฅ่ฏๆฝๅ: ๆฌๅฐ(็้ฑ) |
| "generation": "gpt-4o-mini" # ็ญๆก็ๆ: API(้ซ่ดจ้) |
| "routing": "local/llama-3.1-8b" # ๆๅพๅ็ฑป: ๆฌๅฐๅฐๆจกๅ(ๅฟซ) |
| ``` |
|
|
| ### 8.3 ็ปไธ่ฐ็จๆฅๅฃ |
|
|
| ```python |
| import litellm |
| from typing import Optional |
| |
| class UnifiedLLM: |
| """็ปไธLLM่ฐ็จๅฑ โ ่ชๅจ่ทฏ็ฑๆฌๅฐ/API""" |
| |
| def __init__(self, config_path: str = "litellm_config.yaml"): |
| litellm.set_verbose = False |
| # ๅฏ็จ็ผๅญ |
| litellm.cache = litellm.Cache(type="redis", host="localhost", port=6379) |
| |
| async def complete( |
| self, |
| messages: list, |
| task: str = "generation", # extraction | generation | routing |
| temperature: float = 0, |
| max_tokens: int = 4096, |
| stream: bool = False, |
| ) -> str: |
| """ |
| ็ปไธ่ฐ็จๆฅๅฃ, ๆ นๆฎtask่ชๅจ้ๆฉๆจกๅ |
| """ |
| model = self._select_model(task) |
| |
| response = await litellm.acompletion( |
| model=model, |
| messages=messages, |
| temperature=temperature, |
| max_tokens=max_tokens, |
| stream=stream, |
| metadata={"task": task}, # ็จไบๆๆฌ่ฟฝ่ธช |
| ) |
| |
| if stream: |
| return response # ่ฟๅๅผๆญฅ็ๆๅจ |
| return response.choices[0].message.content |
| |
| def _select_model(self, task: str) -> str: |
| model_map = { |
| "extraction": "local/qwen2.5-14b", |
| "generation": "gpt-4o-mini", |
| "routing": "local/llama-3.1-8b", |
| "fusion": "gpt-4o-mini", # Graphusion่ๅ้่ฆๅผบๆจกๅ |
| "rewrite": "local/llama-3.1-8b", # HyDEๆฅ่ฏขๆนๅ |
| } |
| return model_map.get(task, "local/qwen2.5-14b") |
| ``` |
|
|
| --- |
|
|
| ## 9. ็ณป็ป้จ็ฝฒๆถๆ |
|
|
| ### 9.1 Docker Compose ้จ็ฝฒ |
|
|
| ```yaml |
| # docker-compose.yml |
| version: '3.8' |
| |
| services: |
| # ===== ๆ ธๅฟๆๅก ===== |
| api: |
| build: ./services/api |
| ports: ["8080:8080"] |
| environment: |
| - REDIS_URL=redis://redis:6379 |
| - QDRANT_URL=http://qdrant:6333 |
| - NEO4J_URL=bolt://neo4j:7687 |
| - LITELLM_URL=http://litellm:4000 |
| depends_on: [redis, qdrant, neo4j, litellm] |
| |
| # ===== PDF่งฃๆๆๅก ===== |
| mineru-worker: |
| build: ./services/mineru |
| deploy: |
| resources: |
| reservations: |
| devices: |
| - driver: nvidia |
| count: 1 |
| capabilities: [gpu] |
| environment: |
| - MINERU_MODEL_SOURCE=local |
| - CELERY_BROKER_URL=redis://redis:6379 |
| volumes: |
| - mineru-models:/models |
| - pdf-storage:/pdfs |
| |
| # ===== LLMๆๅก ===== |
| litellm: |
| image: ghcr.io/berriai/litellm:main-latest |
| ports: ["4000:4000"] |
| volumes: |
| - ./config/litellm_config.yaml:/app/config.yaml |
| command: ["--config", "/app/config.yaml"] |
| |
| ollama: |
| image: ollama/ollama:latest |
| ports: ["11434:11434"] |
| deploy: |
| resources: |
| reservations: |
| devices: |
| - driver: nvidia |
| count: 1 |
| capabilities: [gpu] |
| volumes: |
| - ollama-data:/root/.ollama |
| |
| # ===== ๅญๅจๆๅก ===== |
| qdrant: |
| image: qdrant/qdrant:latest |
| ports: ["6333:6333"] |
| volumes: |
| - qdrant-data:/qdrant/storage |
| |
| neo4j: |
| image: neo4j:5-community |
| ports: ["7474:7474", "7687:7687"] |
| environment: |
| - NEO4J_AUTH=neo4j/password |
| - NEO4J_PLUGINS=["apoc", "graph-data-science"] |
| volumes: |
| - neo4j-data:/data |
| |
| redis: |
| image: redis:7-alpine |
| ports: ["6379:6379"] |
| |
| postgres: |
| image: postgres:16-alpine |
| environment: |
| - POSTGRES_DB=scholarmind |
| - POSTGRES_PASSWORD=password |
| volumes: |
| - postgres-data:/var/lib/postgresql/data |
| |
| minio: |
| image: minio/minio:latest |
| ports: ["9000:9000", "9001:9001"] |
| command: server /data --console-address ":9001" |
| volumes: |
| - minio-data:/data |
| |
| volumes: |
| qdrant-data: |
| neo4j-data: |
| postgres-data: |
| minio-data: |
| ollama-data: |
| mineru-models: |
| pdf-storage: |
| ``` |
|
|
| ### 9.2 ็กฌไปถ้
็ฝฎๅปบ่ฎฎ |
|
|
| ``` |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ ็กฌไปถ้
็ฝฎๅปบ่ฎฎ โ |
| โโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโค |
| โ ้
็ฝฎ โ ๅผๅ็ฏๅข โ ็ไบง(ๅฐ) โ ็ไบง(ๅคง) โ |
| โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโค |
| โ PDF่งฃๆ GPU โ RTX 3090 โ A100 80G โ 2รA100 80G โ |
| โ LLMๆจ็ GPU โ RTX 4090 โ A100 80G โ 2รH100 80G โ |
| โ CPU โ 16ๆ ธ โ 32ๆ ธ โ 64ๆ ธ โ |
| โ RAM โ 64GB โ 128GB โ 256GB โ |
| โ SSD โ 1TB NVMe โ 2TB NVMe โ 4TB NVMe โ |
| โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโค |
| โ 1000็ฏ่ฎบๆ โ ~3ๅฐๆถ โ ~80ๅ้ โ ~40ๅ้ โ |
| โ ่งฃๆๆถ้ด โ โ โ โ |
| โ QAๅๅบๅปถ่ฟ โ ~5s โ ~2s โ ~1s โ |
| โ ๅนถๅ็จๆท โ 1-5 โ 10-50 โ 50-200 โ |
| โโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| ### 9.3 API ่ฎพ่ฎก |
|
|
| ```python |
| from fastapi import FastAPI, UploadFile, BackgroundTasks |
| from fastapi.responses import StreamingResponse |
| from pydantic import BaseModel |
| |
| app = FastAPI(title="ScholarMind API", version="1.0") |
| |
| # ===== PDFไธไผ ไธ่งฃๆ ===== |
| @app.post("/api/v1/papers/upload") |
| async def upload_papers(files: list[UploadFile], bg: BackgroundTasks): |
| """ๆน้ไธไผ PDF่ฎบๆ, ๅผๆญฅ่งฃๆ""" |
| task_ids = [] |
| for f in files: |
| task_id = await save_and_queue(f) |
| task_ids.append(task_id) |
| return {"task_ids": task_ids, "status": "processing"} |
| |
| @app.get("/api/v1/papers/{task_id}/status") |
| async def get_parse_status(task_id: str): |
| """ๆฅ่ฏข่งฃๆ่ฟๅบฆ""" |
| return celery_app.AsyncResult(task_id).info |
| |
| # ===== ็ฅ่ฏๅบ้ฎ็ญ ===== |
| class QueryRequest(BaseModel): |
| query: str |
| mode: str = "hybrid" # factual | reasoning | global | hybrid |
| llm_backend: str = "auto" # auto | local | openai | deepseek |
| top_k: int = 5 |
| stream: bool = False |
| include_citations: bool = True |
| include_graph: bool = False # ๆฏๅฆ่ฟๅ็ธๅ
ณๅญๅพ |
| |
| @app.post("/api/v1/query") |
| async def query_knowledge_base(req: QueryRequest): |
| """็ฅ่ฏๅบ้ฎ็ญ""" |
| if req.stream: |
| return StreamingResponse( |
| agent.astream(req), media_type="text/event-stream" |
| ) |
| result = await agent.ainvoke(req) |
| return { |
| "answer": result["answer"], |
| "citations": result["citations"], |
| "confidence": result["confidence"], |
| "graph_snippet": result.get("graph_snippet"), |
| } |
| |
| # ===== ็ฅ่ฏๅพ่ฐฑ ===== |
| @app.get("/api/v1/graph/entity/{name}") |
| async def get_entity(name: str, depth: int = 2): |
| """่ทๅๅฎไฝๅๅ
ถN่ทณๅญๅพ""" |
| subgraph = await graph_service.get_subgraph(name, depth) |
| return subgraph |
| |
| @app.get("/api/v1/graph/path") |
| async def find_path(source: str, target: str, max_hops: int = 4): |
| """ๆฅๆพไธคไธชๅฎไฝไน้ด็ๆ็ญ่ทฏๅพ""" |
| path = await graph_service.shortest_path(source, target, max_hops) |
| return path |
| |
| @app.get("/api/v1/graph/stats") |
| async def graph_statistics(): |
| """็ฅ่ฏๅพ่ฐฑ็ป่ฎกไฟกๆฏ""" |
| return await graph_service.get_stats() |
| |
| # ===== ๅพ่ฐฑๅฏ่งๅ ===== |
| @app.get("/api/v1/graph/visualize") |
| async def visualize_graph(center: str, depth: int = 2, layout: str = "force"): |
| """่ฟๅๅฏ่งๅๆฐๆฎ (vis.jsๆ ผๅผ)""" |
| data = await graph_service.get_vis_data(center, depth) |
| return {"nodes": data["nodes"], "edges": data["edges"]} |
| ``` |
|
|
| --- |
|
|
| ## 10. ๆๆฏ้ๅๅฏนๆฏ |
|
|
| ### 10.1 ๅฎๆดๆๆฏๆ |
|
|
| | ๅฑ | ็ปไปถ | ้ๅ | ๆฟไปฃๆนๆก | ้ๅ็็ฑ | |
| |----|------|------|---------|---------| |
| | PDF่งฃๆ | OCRๅผๆ | **MinerU 2.5 VLM** | Marker, Nougat, Docling | ๅญฆๆฏ่ฎบๆSOTA(0.047 Edit Dist), ๅ
ฌๅผ88.46 CDM | |
| | PDF่งฃๆ | ๅฟซ้่ทฏๅพ | **PyMuPDF** | pdfplumber | ๆฐๅญPDF็ง็บง่งฃๆ, ๆ GPU้ๆฑ | |
| | ็ฅ่ฏๆฝๅ | NER | **GLiNER** (440M) | spaCy, DeepKE | ้ถๆ ทๆฌ, ่ชๅฎไนๆ ็ญพ, ๆฌๅฐ่ฟ่ก | |
| | ็ฅ่ฏๆฝๅ | RE | **LLMGraphTransformer** | REBEL, GLiREL, ReLiK | ๆฏๆๆฌๅฐ+API LLM, Schema็บฆๆ | |
| | ็ฅ่ฏๆฝๅ | ่ๅ | **Graphusion** | ๆ | ๅฎไฝๅฝไธๅ+ๅฒ็ชๆถ่งฃ, ๆฏnaiveๅฅฝ9.2% | |
| | ็ฅ่ฏๅพ่ฐฑ | ๅพๆฐๆฎๅบ | **Neo4j 5.x** | ArangoDB, NebulaGraph | LangChainๅ็้ๆ, Cypher็ๆๆๆ็ | |
| | ๅ้็ดขๅผ | ๅ้ๅบ | **Qdrant** | Milvus, Weaviate | Rust้ซๆง่ฝ, ๅ็Hybridๆ็ดข, ็ฎๅ้จ็ฝฒ | |
| | ๆฃ็ดข | ้ๆๅจ | **bge-reranker-large** | Cohere, jina | ๅผๆบSOTA, ๆ APIไพ่ต | |
| | ๆฃ็ดข | ๆฅ่ฏขๅขๅผบ | **HyDE** | Query2Doc | +10 NDCG้ถๆๆฌๆๅ | |
| | ็ดขๅผ | ๅฑๆฌก็ดขๅผ | **RAPTOR** | GraphRAG Communities | ้ๅๅฑ็บงๆๆกฃ, +20%ๅ็กฎ็ | |
| | RAG | ๅพๅขๅผบ | **LightRAG** | NodeRAG, GraphRAG | 34kโญ, ๅข้ๆดๆฐ, ๅคๅ็ซฏ | |
| | Agent | ็ผๆ | **LangGraph** | smolagents, AutoGen | ๆ็ถๆๅพ, ๆกไปถๅๆฏ, ็ไบง็บง | |
| | LLM | ็ปไธๆฅๅ
ฅ | **LiteLLM** | OpenRouter | 20kโญ, ๆๆๆไพๅ็ปไธๆฅๅฃ | |
| | LLM | ๆฌๅฐๆจ็ | **vLLM + Ollama** | SGLang, llama.cpp | vLLM้ซๅๅ, Ollamaๆ็จ | |
| | ๅ็ซฏ | Webๆกๆถ | **FastAPI** | Flask, Django | ๅผๆญฅๅ็, ้ซๆง่ฝ, OpenAPI่ชๅจๆๆกฃ | |
| | ้ๅ | ไปปๅก้ๅ | **Celery + Redis** | RQ, Dramatiq | ๆ็็จณๅฎ, ๅๅธๅผๆฏๆ | |
| | ๅญๅจ | ๅฏน่ฑกๅญๅจ | **MinIO** | S3 | S3ๅ
ผๅฎน, ๆฌๅฐ้จ็ฝฒ | |
| | ๅญๅจ | ๅ
ณ็ณปๆฐๆฎๅบ | **PostgreSQL** | MySQL | JSONๆฏๆ, ๅ
จๆๆ็ดข | |
| | ๅ็ซฏ | ๅพๅฏ่งๅ | **react-force-graph** | vis.js, D3 | React็ๆ, 3Dๆฏๆ | |
|
|
| ### 10.2 ๆง่ฝ้ขไผฐ |
|
|
| ``` |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ 1000็ฏ่ฎบๆ็ณป็ปๆง่ฝ้ขไผฐ (A100 80G) โ |
| โโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ PDF่งฃๆ โ ~80ๅ้ (MinerU 2.5, 2.12 pg/s) โ |
| โ ็ฅ่ฏๆฝๅ(GLiNER NER) โ ~15ๅ้ (GPU batch) โ |
| โ ็ฅ่ฏๆฝๅ(LLM RE) โ ~60ๅ้ (ๆฌๅฐ14Bๆจกๅ) โ |
| โ โ ~30ๅ้ (GPT-4o-mini API) โ |
| โ ๅ้็ดขๅผๆๅปบ โ ~10ๅ้ (text-embedding-3-small) โ |
| โ ็ฅ่ฏๅพ่ฐฑๆๅปบ โ ~20ๅ้ (ๅซ่ๅ) โ |
| โ RAPTORๆ ๆๅปบ โ ~30ๅ้ โ |
| โโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ ๆป่ฎก(็ซฏๅฐ็ซฏ) โ ~3.5ๅฐๆถ (ๅ
จๆฌๅฐ) / ~2.5ๅฐๆถ (ๆททๅAPI) โ |
| โโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค |
| โ QAๅๅบๅปถ่ฟ (P50) โ ~1.5s (ๆฌๅฐLLM) / ~0.8s (API) โ |
| โ QAๅๅบๅปถ่ฟ (P99) โ ~4s (ๆฌๅฐLLM) / ~2s (API) โ |
| โ ๅพ่ฐฑๆฅ่ฏขๅปถ่ฟ โ ~200ms (2่ทณๅญๅพ) โ |
| โ ๅ้ๆฃ็ดขๅปถ่ฟ โ ~50ms (Qdrant, 1Mๅ้) โ |
| โโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| --- |
|
|
| ## 11. ๅ
ณ้ฎ่ฎบๆไธๅผๆบ้กน็ฎ |
|
|
| ### 11.1 ๆ ธๅฟ่ฎบๆ |
|
|
| | ่ฎบๆ | ArXiv ID | ่ดก็ฎ | ๆจ่ๅบฆ | |
| |------|---------|------|--------| |
| | **MinerU 2.5** | 2509.22186 | ็ปไธVLMๆๆกฃ่งฃๆSOTA | โญโญโญโญโญ | |
| | **OmniDocBench** | 2412.07626 | ๆๆกฃ่งฃๆๅบๅ (CVPR 2025) | โญโญโญโญ | |
| | **Graphusion** | 2410.17600 | ้ถๆ ทๆฌKGๆๅปบ+่ๅ | โญโญโญโญโญ | |
| | **GLiNER** | 2311.08526 | ้ถๆ ทๆฌNER, 440M | โญโญโญโญโญ | |
| | **SciER** | 2410.21155 | ๅญฆๆฏ่ฎบๆIEๆฐๆฎ้+ๅบๅ | โญโญโญโญ | |
| | **ReLiK** | 2408.00103 | ๅฟซ้ๅฎไฝ้พๆฅ+ๅ
ณ็ณปๆฝๅ | โญโญโญโญ | |
| | **NodeRAG** | 2504.11544 | ๅผๆๅพRAG SOTA | โญโญโญโญโญ | |
| | **LightRAG** | 2410.05779 | ่ฝป้ๅพRAG, ๅข้ๆดๆฐ | โญโญโญโญโญ | |
| | **Microsoft GraphRAG** | 2404.16130 | ็คพๅบๆ่ฆ+ๅ
จๅฑๆฃ็ดข | โญโญโญโญ | |
| | **RAPTOR** | 2401.18059 | ้ๅฝๆ่ฆๆ | โญโญโญโญ | |
| | **Self-RAG** | 2310.11511 | ่ชๅๆๆฃ็ดข็ๆ | โญโญโญ | |
| | **HyDE** | 2212.10496 | ๅ่ฎพๆๆกฃๅตๅ
ฅ | โญโญโญโญ | |
| | **RAG vs GraphRAG** | 2502.11371 | RAG+GraphRAG่ๅๅฎ้ช | โญโญโญโญ | |
| | **LLM-KGC Survey** | 2510.20345 | LLM็ฅ่ฏๅพ่ฐฑๆๅปบ็ปผ่ฟฐ | โญโญโญโญ | |
|
|
| ### 11.2 ๆ ธๅฟๅผๆบ้กน็ฎ |
|
|
| | ้กน็ฎ | GitHub | Stars | ็จ้ | |
| |------|--------|-------|------| |
| | **MinerU** | opendatalab/MinerU | 61k+ | PDFๆทฑๅบฆ่งฃๆ | |
| | **LightRAG** | hkuds/lightrag | 34k+ | ๅพๅขๅผบRAG | |
| | **RAGFlow** | infiniflow/ragflow | 36k+ | ๅ
จๆ RAGๅนณๅฐ(ๅซUI) | |
| | **LiteLLM** | BerriAI/litellm | 20k+ | LLM็ปไธไปฃ็ | |
| | **Neo4j LLM Graph Builder** | neo4j-labs/llm-graph-builder | 3k+ | PDFโKGโQA | |
| | **Kotaemon** | Cinnamon/kotaemon | 18k+ | ๆๆกฃQA(ๅซGraphRAG) | |
| | **Dify** | langgenius/dify | 70k+ | AIๅบ็จๅผๅๅนณๅฐ | |
| | **LangGraph** | langchain-ai/langgraph | 10k+ | Agent็ถๆๆบ็ผๆ | |
| | **GLiNER** | urchade/GLiNER | 2k+ | ้ถๆ ทๆฌNER | |
| | **Graphusion** | irenezihuili/graphusion | 27 | KG่ๅๅป้ | |
| | **RAPTOR** | parthsarthi03/raptor | 1.6k+ | ๅฑๆฌกๆ่ฆๆ | |
| | **NodeRAG** | Terry-Xu-666/NodeRAG | 412 | ๅผๆๅพRAG | |
| | **Qdrant** | qdrant/qdrant | 22k+ | ๅ้ๆฐๆฎๅบ | |
| | **vLLM** | vllm-project/vllm | 45k+ | ้ซๅๅLLMๆจ็ | |
|
|
| --- |
|
|
| ## 12. ้กน็ฎ็ปๆ |
|
|
| ``` |
| scholarmind/ |
| โโโ docker-compose.yml # ไธ้ฎ้จ็ฝฒ |
| โโโ config/ |
| โ โโโ litellm_config.yaml # LLM่ทฏ็ฑ้
็ฝฎ |
| โ โโโ mineru_config.yaml # MinerU่งฃๆ้
็ฝฎ |
| โ โโโ settings.py # ๅ
จๅฑ้
็ฝฎ |
| โ |
| โโโ services/ |
| โ โโโ api/ # FastAPI ไธปๆๅก |
| โ โ โโโ main.py # ๅ
ฅๅฃ |
| โ โ โโโ routers/ |
| โ โ โ โโโ papers.py # PDFไธไผ /่งฃๆAPI |
| โ โ โ โโโ query.py # ็ฅ่ฏๅบ้ฎ็ญAPI |
| โ โ โ โโโ graph.py # ็ฅ่ฏๅพ่ฐฑAPI |
| โ โ โ โโโ admin.py # ็ฎก็API |
| โ โ โโโ middleware/ |
| โ โ โโโ auth.py # ่ฎค่ฏ |
| โ โ โโโ rate_limit.py # ้ๆต |
| โ โ |
| โ โโโ parser/ # PDF่งฃๆๆๅก |
| โ โ โโโ router.py # PDF็นๅพ่ทฏ็ฑ |
| โ โ โโโ mineru_worker.py # MinerU VLM Worker |
| โ โ โโโ pymupdf_worker.py # PyMuPDF ๅฟซ้่งฃๆ |
| โ โ โโโ metadata_extractor.py # ๅ
ๆฐๆฎๆๅ |
| โ โ โโโ tasks.py # Celeryไปปๅกๅฎไน |
| โ โ |
| โ โโโ extractor/ # ็ฅ่ฏๆฝๅๆๅก |
| โ โ โโโ ner_engine.py # GLiNER NER |
| โ โ โโโ re_engine.py # LLM ๅ
ณ็ณปๆฝๅ |
| โ โ โโโ fusion_engine.py # Graphusion ่ๅ |
| โ โ โโโ schema.py # ๅฎไฝ/ๅ
ณ็ณปSchema |
| โ โ |
| โ โโโ graph/ # ็ฅ่ฏๅพ่ฐฑๆๅก |
| โ โ โโโ neo4j_client.py # Neo4j ่ฟๆฅ็ฎก็ |
| โ โ โโโ graph_builder.py # ๅพๆๅปบ |
| โ โ โโโ graph_query.py # ๅพๆฅ่ฏข |
| โ โ โโโ visualization.py # ๅพๅฏ่งๅ |
| โ โ |
| โ โโโ indexer/ # ็ดขๅผๆๅก |
| โ โ โโโ chunker.py # ๅญฆๆฏ่ฎบๆๅๅๅจ |
| โ โ โโโ vector_indexer.py # Qdrant ๅ้็ดขๅผ |
| โ โ โโโ raptor_builder.py # RAPTOR ๅฑๆฌกๆ่ฆๆ |
| โ โ โโโ embedder.py # ๅตๅ
ฅๆจกๅ็ฎก็ |
| โ โ |
| โ โโโ retriever/ # ๆฃ็ดขๆๅก |
| โ โ โโโ hybrid_retriever.py # ไธ่ทฏๆททๅๆฃ็ดข |
| โ โ โโโ hyde.py # HyDE ๆฅ่ฏขๅขๅผบ |
| โ โ โโโ reranker.py # ไบคๅ็ผ็ ๅจ้ๆ |
| โ โ โโโ rrf.py # RRF ่ๅ |
| โ โ |
| โ โโโ agent/ # Agent็ผๆๆๅก |
| โ โ โโโ graph_definition.py # LangGraph ็ถๆๆบ |
| โ โ โโโ nodes.py # Agent่็นๅฎไน |
| โ โ โโโ tools.py # Agentๅทฅๅ
ท้ |
| โ โ โโโ prompts.py # Promptๆจกๆฟ |
| โ โ |
| โ โโโ llm/ # LLM็ปไธๆฅๅ
ฅ |
| โ โโโ unified_llm.py # LiteLLMๅฐ่ฃ
|
| โ โโโ model_router.py # ไปปๅกโๆจกๅ่ทฏ็ฑ |
| โ โโโ cache.py # LLM็ผๅญ |
| โ |
| โโโ models/ # ๆฐๆฎๆจกๅ |
| โ โโโ paper.py # ่ฎบๆๆฐๆฎๆจกๅ |
| โ โโโ graph.py # ๅพ่ฐฑๆฐๆฎๆจกๅ |
| โ โโโ query.py # ๆฅ่ฏขๆฐๆฎๆจกๅ |
| โ |
| โโโ tests/ |
| โ โโโ test_parser.py |
| โ โโโ test_extractor.py |
| โ โโโ test_retriever.py |
| โ โโโ test_agent.py |
| โ |
| โโโ scripts/ |
| โ โโโ setup_neo4j.cypher # Neo4jๅๅงๅ่ๆฌ |
| โ โโโ batch_parse.py # ๆน้่งฃๆ่ๆฌ |
| โ โโโ build_index.py # ็ดขๅผๆๅปบ่ๆฌ |
| โ |
| โโโ frontend/ # ๅ็ซฏ (React/Next.js) |
| โ โโโ components/ |
| โ โ โโโ ChatInterface.tsx # ้ฎ็ญ็้ข |
| โ โ โโโ GraphViewer.tsx # ็ฅ่ฏๅพ่ฐฑๅฏ่งๅ |
| โ โ โโโ PaperUploader.tsx # PDFไธไผ |
| โ โ โโโ SearchResults.tsx # ๆ็ดข็ปๆๅฑ็คบ |
| โ โโโ ... |
| โ |
| โโโ requirements.txt |
| โโโ Dockerfile |
| โโโ README.md |
| ``` |
|
|
| --- |
|
|
| ## ๅฟซ้ๅผๅง |
|
|
| ```bash |
| # 1. ๅ
้้กน็ฎ |
| git clone https://github.com/your-org/scholarmind.git |
| cd scholarmind |
| |
| # 2. ้
็ฝฎ็ฏๅขๅ้ |
| cp .env.example .env |
| # ็ผ่พ .env: ่ฎพ็ฝฎ OPENAI_API_KEY, DEEPSEEK_API_KEY ็ญ |
| |
| # 3. ไธ่ฝฝMinerUๆจกๅ |
| pip install mineru |
| mineru-models-download -s huggingface -m all |
| |
| # 4. ๅฏๅจๆๆๆๅก |
| docker-compose up -d |
| |
| # 5. ไธ่ฝฝๆฌๅฐLLM (ๅฏ้) |
| docker exec -it scholarmind-ollama ollama pull qwen2.5:14b-instruct |
| |
| # 6. ๆน้ๅฏผๅ
ฅ่ฎบๆ |
| python scripts/batch_parse.py --input /path/to/pdfs/ --workers 4 |
| |
| # 7. ๆๅปบ็ดขๅผ |
| python scripts/build_index.py --vector --graph --raptor |
| |
| # 8. ่ฎฟ้ฎ็ณป็ป |
| # API: http://localhost:8080/docs |
| # Neo4j: http://localhost:7474 |
| # MinIO: http://localhost:9001 |
| ``` |
|
|
| --- |
|
|
| ## ่ฎธๅฏ่ฏ |
|
|
| MIT License |
|
|
| --- |
|
|
| *ๆถๆ่ฎพ่ฎกๅบไบ 2024-2025 ๅนดๆๆฐ็ ็ฉถๆๆๅๅผๆบๅฎ่ทต๏ผๆๆ่ฎบๆๅผ็จๅBenchmarkๆฐๆฎๅๅฏๆบฏๆบใ* |
|
|