---
tags:
- ml-intern
---
# 🏗️ ScholarMind: Production-Grade Academic Knowledge Base QA & Knowledge Graph System
> For the complete architecture design document, see [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)
## Document Index
| Document | Description |
|------|------|
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | **Core architecture design**: system overview, detailed design of each layer, code examples |
| [docs/DATAFLOW.md](docs/DATAFLOW.md) | **Data flow design**: end-to-end flow, concurrency model, caching strategy, monitoring |
| [docs/CACHING.md](docs/CACHING.md) | **🆕 7-layer cache acceleration scheme**: semantic cache, provider cache, vLLM APC, KV compression, and more |
| [docs/ADR.md](docs/ADR.md) | **Technology decision records**: the rationale and paper sources behind each technology choice |
| [docs/PAPERS.md](docs/PAPERS.md) | **Paper index**: 14 core papers plus a quick reference to 15 open-source projects |
| [docs/requirements.txt](docs/requirements.txt) | **Core dependencies**: the complete Python dependency list |
## System Overview
ScholarMind is a production-grade intelligent knowledge system for **1000+ academic PDF papers**. It integrates:
- **Deep PDF parsing**: high-accuracy OCR based on MinerU 2.5 VLM (formulas, tables, figures)
- **Automatic knowledge graph construction**: entities and relations are extracted automatically from papers to build a domain knowledge graph
- **Hybrid retrieval QA**: three-way fusion of GraphRAG, vector retrieval, and BM25 sparse retrieval
- **Multi-model support**: both local deployment (vLLM/Ollama) and external APIs (OpenAI/Anthropic/DeepSeek)
- **Agent orchestration**: multi-agent collaboration built on LangGraph, with support for multi-hop reasoning
- **7-layer cache acceleration**: semantic cache + provider cache + vLLM prefix cache + KV compression, bringing P50 latency down to ~400ms
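
The semantic-cache idea in the last bullet can be sketched as below. This is a minimal illustration under stated assumptions, not GPTCache's API: the `SemanticCache` class, the `toy_embed` bag-of-words embedding, and the 0.8 similarity threshold are all hypothetical stand-ins. A real deployment would presumably use a sentence-embedding model and a vector store instead.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored answer when a new query is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # assumed value, tune per workload
        self.entries = []           # list of (embedding, answer) pairs

    def put(self, query: str, answer: str) -> None:
        self.entries.append((toy_embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        q = toy_embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only a sufficiently similar query counts as a hit.
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("what is RAPTOR indexing", "RAPTOR builds a hierarchical summary tree.")
hit = cache.get("what is RAPTOR indexing ?")             # near-duplicate query
miss = cache.get("how does SnapKV compress the KV cache")  # unrelated query
```

On a hit the LLM call is skipped entirely, which is where the low cache-hit latency would come from; on a miss the request falls through to the deeper cache layers.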
## Core Architecture Diagram
```
┌────────────────────────────────────────────────────────────────────────┐
│ User Layer (Web UI / API)                                              │
├────────────────────────────────────────────────────────────────────────┤
│ Agent Orchestration Layer (LangGraph)                                  │
│ Router Agent → Retrieval Agent → Reasoning Agent → Summary Agent       │
├────────────────────────────────────────────────────────────────────────┤
│ Unified LLM Access Layer (LiteLLM Proxy)                               │
│ vLLM | Ollama | OpenAI | Anthropic | DeepSeek                          │
├──────── 7-Layer Cache Acceleration Stack ──────────────────────────────┤
│ L1 Semantic cache (GPTCache) → L2 Retrieval cache → L3 Provider cache  │
│ L4 vLLM APC → L5 CacheBlend → L6 SnapKV → L7 Conversation cache        │
├────────────────────────────────────────────────────────────────────────┤
│ Retrieval Layer (Hybrid Retrieval)                                     │
│ Dense Vector + Sparse BM25 + Graph Query                               │
│ → RRF fusion → bge-reranker-large reranking                            │
├────────────────────────────────────────────────────────────────────────┤
│ Index Layer (Multi-Index)                                              │
│ Qdrant (vectors) | Neo4j (graph) | RAPTOR (hierarchical summary tree)  │
├────────────────────────────────────────────────────────────────────────┤
│ Knowledge Extraction Layer                                             │
│ GLiNER (NER) → LLMGraphTransformer (RE) → Graphusion                   │
├────────────────────────────────────────────────────────────────────────┤
│ PDF Parsing Layer (MinerU Pipeline)                                    │
│ PDF routing → MinerU 2.5 VLM / PyMuPDF → JSON + Markdown               │
├────────────────────────────────────────────────────────────────────────┤
│ Storage Layer                                                          │
│ PostgreSQL | Qdrant | Neo4j | Redis | MinIO                            │
└────────────────────────────────────────────────────────────────────────┘
```
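
The retrieval layer in the diagram fuses three ranked lists (dense, sparse, graph) with RRF before reranking. A minimal sketch of Reciprocal Rank Fusion follows. The function name `rrf_fuse`, the document ids, and the constant `k=60` (a commonly used default) are assumptions for illustration; the source does not say which value ScholarMind uses.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids; score(d) = sum of 1 / (k + rank) over lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from each retrieval path.
dense = ["p3", "p1", "p7"]
sparse = ["p1", "p9", "p3"]
graph = ["p1", "p3", "p4"]

fused = rrf_fuse([dense, sparse, graph])
```

Because RRF uses only ranks, the three retrievers do not need comparable score scales; documents that appear near the top of several lists (here `p1` and `p3`) rise to the front and would be the candidates handed to the reranker.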
## Performance Metrics
| Metric | Without cache | With 7-layer cache |
|------|--------|----------|
| QA response latency (P50) | ~1.5s | **~400ms** |
| QA response latency (P99) | ~4s | **~1.5s** |
| Latency on cache hit | n/a | **~5ms** |
| API cost | Baseline | **60%+ lower** |
| PDF parsing speed (A100) | 2.12 pages/s | n/a |
| Full parsing of 1000 papers | ~80 minutes | n/a |
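
As a quick consistency check on the table above, the arithmetic below derives what the quoted figures imply. It assumes the 80-minute figure refers to a single A100 running at the quoted 2.12 pages/s with no parallelism, which the source does not state explicitly; the variable names are illustrative.

```python
PAGES_PER_SECOND = 2.12   # from the table (A100)
TOTAL_MINUTES = 80        # from the table (approximate)
NUM_PAPERS = 1000

# Pages processed in 80 minutes at the quoted throughput.
total_pages = PAGES_PER_SECOND * TOTAL_MINUTES * 60
# Implied average paper length under the single-GPU assumption.
pages_per_paper = total_pages / NUM_PAPERS

# Latency improvement factors implied by the P50 and P99 rows.
p50_speedup = 1.5 / 0.4
p99_speedup = 4.0 / 1.5

print(f"total pages: {total_pages:.0f}")
print(f"pages per paper: {pages_per_paper:.1f}")
print(f"P50 speedup: {p50_speedup:.2f}x, P99 speedup: {p99_speedup:.2f}x")
```

Under that assumption the figures are mutually consistent with papers averaging about 10 pages each.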
## Core Technology Stack
| Component | Choice | Paper basis |
|------|------|---------|
| PDF parsing | **MinerU 2.5 VLM** | arxiv:2509.22186 (OmniDocBench SOTA) |
| NER | **GLiNER** (440M) | arxiv:2311.08526 (F1=47.8, zero-shot) |
| KG fusion | **Graphusion** | arxiv:2410.17600 (+9.2% QA accuracy) |
| GraphRAG | **LightRAG** | arxiv:2410.05779 (34k⭐, incremental updates) |
| Hierarchical index | **RAPTOR** | arxiv:2401.18059 (+20% accuracy) |
| Retrieval reranking | **bge-reranker-large** | arxiv:2502.11371 (consensus best choice) |
| Semantic cache | **GPTCache** | 7k⭐, hits based on semantic similarity |
| KV reuse | **CacheBlend/LMCache** | arxiv:2405.16444 (2.2-3.3× TTFT) |
| KV compression | **SnapKV** | arxiv:2404.14469 (3.6× decoding speedup) |
| Agent | **LangGraph** | Stateful graph, conditional branching, production-grade |
| LLM | **LiteLLM** | Unified interface for local models and APIs |
| Vector store | **Qdrant** | High performance (Rust), native hybrid search |
| Graph database | **Neo4j 5.x** | Native LangChain integration |
## License
MIT
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = 'heyingyue/scholarmind-architecture'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.