--- tags: - ml-intern --- # ๐Ÿ—๏ธ ScholarMind โ€” ็”Ÿไบง็บงๅญฆๆœฏ็Ÿฅ่ฏ†ๅบ“้—ฎ็ญ” & ็Ÿฅ่ฏ†ๅ›พ่ฐฑ็ณป็ปŸ > ๅฎŒๆ•ดๆžถๆž„่ฎพ่ฎกๆ–‡ๆกฃ๏ผŒ่ฏทๆŸฅ็œ‹ [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) ## ๆ–‡ๆกฃ็ดขๅผ• | ๆ–‡ๆกฃ | ่ฏดๆ˜Ž | |------|------| | [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | **ๆ ธๅฟƒๆžถๆž„่ฎพ่ฎก** โ€” ็ณป็ปŸๆ€ป่งˆใ€ๅ„ๅฑ‚่ฏฆ็ป†่ฎพ่ฎกใ€ไปฃ็ ็คบไพ‹ | | [docs/DATAFLOW.md](docs/DATAFLOW.md) | **ๆ•ฐๆฎๆต่ฎพ่ฎก** โ€” ็ซฏๅˆฐ็ซฏๆต่ฝฌใ€ๅนถๅ‘ๆจกๅž‹ใ€็ผ“ๅญ˜็ญ–็•ฅใ€็›‘ๆŽง | | [docs/CACHING.md](docs/CACHING.md) | **๐Ÿ†• 7ๅฑ‚็ผ“ๅญ˜ๅŠ ้€Ÿๆ–นๆกˆ** โ€” ่ฏญไน‰็ผ“ๅญ˜ใ€Provider็ผ“ๅญ˜ใ€vLLM APCใ€KVๅŽ‹็ผฉ็ญ‰ | | [docs/ADR.md](docs/ADR.md) | **ๆŠ€ๆœฏ้€‰ๅž‹ๅ†ณ็ญ–่ฎฐๅฝ•** โ€” ๆฏไธชๆŠ€ๆœฏ้€‰ๅž‹็š„ไพๆฎๅ’Œ่ฎบๆ–‡ๆฅๆบ | | [docs/PAPERS.md](docs/PAPERS.md) | **่ฎบๆ–‡็ดขๅผ•** โ€” 14็ฏ‡ๆ ธๅฟƒ่ฎบๆ–‡ + 15ไธชๅผ€ๆบ้กน็›ฎ้€ŸๆŸฅ | | [docs/requirements.txt](docs/requirements.txt) | **ๆ ธๅฟƒไพ่ต–** โ€” PythonๅฎŒๆ•ดไพ่ต–ๅˆ—่กจ | ## ็ณป็ปŸๆฆ‚่ฟฐ ScholarMind ๆ˜ฏไธ€ไธช้ขๅ‘ **1000+ ็ฏ‡ๅญฆๆœฏ PDF ่ฎบๆ–‡** ็š„็”Ÿไบง็บงๆ™บ่ƒฝ็Ÿฅ่ฏ†็ณป็ปŸ๏ผŒ้›†ๆˆ๏ผš - **PDF ๆทฑๅบฆ่งฃๆž**๏ผšๅŸบไบŽ MinerU 2.5 VLM ็š„้ซ˜็ฒพๅบฆ OCR๏ผˆๅ…ฌๅผ/่กจๆ ผ/ๅ›พ่กจ๏ผ‰ - **็Ÿฅ่ฏ†ๅ›พ่ฐฑ่‡ชๅŠจๆž„ๅปบ**๏ผšไปŽ่ฎบๆ–‡ไธญ่‡ชๅŠจๆŠฝๅ–ๅฎžไฝ“ไธŽๅ…ณ็ณป๏ผŒๆž„ๅปบ้ข†ๅŸŸ็Ÿฅ่ฏ†ๅ›พ่ฐฑ - **ๆททๅˆๆฃ€็ดข้—ฎ็ญ”**๏ผšGraphRAG + ๅ‘้‡ๆฃ€็ดข + BM25 ็จ€็–ๆฃ€็ดข็š„ไธ‰่ทฏ่žๅˆ - **ๅคšๆจกๅž‹ๆ”ฏๆŒ**๏ผšๅŒๆ—ถๆ”ฏๆŒๆœฌๅœฐ้ƒจ็ฝฒ๏ผˆvLLM/Ollama๏ผ‰ๅ’Œๅค–้ƒจ API๏ผˆOpenAI/Anthropic/DeepSeek๏ผ‰ - **Agent ็ผ–ๆŽ’**๏ผšๅŸบไบŽ LangGraph ็š„ๅคš Agent ๅไฝœ๏ผŒๆ”ฏๆŒๅคš่ทณๆŽจ็† - **7ๅฑ‚็ผ“ๅญ˜ๅŠ ้€Ÿ**๏ผš่ฏญไน‰็ผ“ๅญ˜ + Provider็ผ“ๅญ˜ + vLLMๅ‰็ผ€็ผ“ๅญ˜ + KVๅŽ‹็ผฉ๏ผŒP50ๅปถ่ฟŸ้™่‡ณ~400ms ## ๆ ธๅฟƒๆžถๆž„ๅ›พ ``` โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ ็”จๆˆทๅฑ‚ (Web UI / API) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ Agent ็ผ–ๆŽ’ๅฑ‚ (LangGraph) โ”‚ โ”‚ ่ทฏ็”ฑAgent โ†’ ๆฃ€็ดขAgent โ†’ ๆŽจ็†Agent โ†’ ๆ€ป็ป“Agent โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ LLM ็ปŸไธ€ๆŽฅๅ…ฅๅฑ‚ (LiteLLM Proxy) โ”‚ โ”‚ vLLM | Ollama | OpenAI | Anthropic | DeepSeek โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ 7ๅฑ‚็ผ“ๅญ˜ๅŠ ้€Ÿๆ ˆ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ L1 ่ฏญไน‰็ผ“ๅญ˜(GPTCache) โ†’ L2 ๆฃ€็ดข็ผ“ๅญ˜ โ†’ L3 Provider็ผ“ๅญ˜ โ”‚ โ”‚ L4 vLLM APC โ†’ L5 CacheBlend โ†’ L6 SnapKV โ†’ L7 ๅฏน่ฏ็ผ“ๅญ˜โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ ๆฃ€็ดขๅฑ‚ (Hybrid Retrieval) โ”‚ โ”‚ Dense Vector + Sparse BM25 + Graph Query โ”‚ โ”‚ โ†’ RRF่žๅˆ โ†’ bge-reranker-large้‡ๆŽ’ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ ็ดขๅผ•ๅฑ‚ (Multi-Index) โ”‚ โ”‚ Qdrant(ๅ‘้‡) | Neo4j(ๅ›พ่ฐฑ) | RAPTOR(ๅฑ‚ๆฌกๆ‘˜่ฆๆ ‘) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ ็Ÿฅ่ฏ†ๆŠฝๅ–ๅฑ‚ (Knowledge Extraction) โ”‚ โ”‚ GLiNER(NER) โ†’ LLMGraphTransformer(RE) โ†’ Graphusion โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ PDF ่งฃๆžๅฑ‚ (MinerU Pipeline) โ”‚ โ”‚ PDF่ทฏ็”ฑ โ†’ MinerU 2.5 VLM / PyMuPDF โ†’ JSON+Markdown โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ ๅญ˜ๅ‚จๅฑ‚ (Storage) โ”‚ โ”‚ PostgreSQL | Qdrant | Neo4j | Redis | MinIO โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` ## ๆ€ง่ƒฝๆŒ‡ๆ ‡ | ๆŒ‡ๆ ‡ | ๆ— ็ผ“ๅญ˜ | 7ๅฑ‚็ผ“ๅญ˜ๅŽ | |------|--------|----------| | QAๅ“ๅบ”ๅปถ่ฟŸ (P50) | ~1.5s | **~400ms** | | QAๅ“ๅบ”ๅปถ่ฟŸ (P99) | ~4s | **~1.5s** | | ็ผ“ๅญ˜ๅ‘ฝไธญๆ—ถๅปถ่ฟŸ | โ€” | **~5ms** | | APIๆˆๆœฌ | ๅŸบๅ‡† | **้™ไฝŽ60%+** | | PDF่งฃๆž้€Ÿๅบฆ (A100) | 2.12 ้กต/็ง’ | โ€” | | 1000็ฏ‡่ฎบๆ–‡ๅ…จ้‡่งฃๆž | ~80 ๅˆ†้’Ÿ | โ€” | ## ๆ ธๅฟƒๆŠ€ๆœฏๆ ˆ | ็ป„ไปถ | ้€‰ๅž‹ | ่ฎบๆ–‡ไพๆฎ | |------|------|---------| | PDF่งฃๆž | **MinerU 2.5 VLM** | arxiv:2509.22186 (OmniDocBench SOTA) | | NER | **GLiNER** (440M) | arxiv:2311.08526 (F1=47.8, ้›ถๆ ทๆœฌ) | | KG่žๅˆ | **Graphusion** | arxiv:2410.17600 (+9.2% QAๅ‡†็กฎ็އ) | | GraphRAG | **LightRAG** | arxiv:2410.05779 (34kโญ, ๅขž้‡ๆ›ดๆ–ฐ) | | ๅฑ‚ๆฌก็ดขๅผ• | **RAPTOR** | arxiv:2401.18059 (+20%ๅ‡†็กฎ็އ) | | ๆฃ€็ดข้‡ๆŽ’ | **bge-reranker-large** | arxiv:2502.11371 (ๅ…ฑ่ฏ†ๆœ€ไผ˜) | | ่ฏญไน‰็ผ“ๅญ˜ | **GPTCache** | 7kโญ, ่ฏญไน‰็›ธไผผๅบฆๅ‘ฝไธญ | | KVๅค็”จ | **CacheBlend/LMCache** | arxiv:2405.16444 (2.2-3.3ร— TTFT) | | KVๅŽ‹็ผฉ | **SnapKV** | arxiv:2404.14469 (3.6ร—่งฃ็ ๅŠ ้€Ÿ) | | Agent | **LangGraph** | ๆœ‰็Šถๆ€ๅ›พ, ๆกไปถๅˆ†ๆ”ฏ, ็”Ÿไบง็บง | | LLM | **LiteLLM** | ็ปŸไธ€ๆœฌๅœฐ/APIๆŽฅๅฃ | | ๅ‘้‡ๅบ“ | **Qdrant** | Rust้ซ˜ๆ€ง่ƒฝ, ๅŽŸ็”ŸHybridๆœ็ดข | | ๅ›พๆ•ฐๆฎๅบ“ | **Neo4j 5.x** | LangChainๅŽŸ็”Ÿ้›†ๆˆ | ## License MIT ## Generated by ML Intern This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. - Try ML Intern: https://smolagents-ml-intern.hf.space - Source code: https://github.com/huggingface/ml-intern ## Usage ```python from transformers import AutoModelForCausalLM, AutoTokenizer model_id = 'heyingyue/scholarmind-architecture' tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id) ``` For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.