| --- |
| tags: |
| - ml-intern |
| --- |
| # ScholarMind: Production-Grade Academic Knowledge Base QA & Knowledge Graph System |
|
|
| > For the complete architecture design document, see [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) |
|
|
| ## Documentation Index |
|
|
| | Document | Description | |
| |------|------| |
| | [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | **Core architecture design**: system overview, layer-by-layer detailed design, code examples | |
| | [docs/DATAFLOW.md](docs/DATAFLOW.md) | **Data flow design**: end-to-end flow, concurrency model, caching strategy, monitoring | |
| | [docs/CACHING.md](docs/CACHING.md) | **7-layer cache acceleration scheme**: semantic cache, provider cache, vLLM APC, KV compression, and more | |
| | [docs/ADR.md](docs/ADR.md) | **Technology selection decision records**: the rationale and paper source behind each technology choice | |
| | [docs/PAPERS.md](docs/PAPERS.md) | **Paper index**: 14 core papers + links to 15 open-source projects | |
| | [docs/requirements.txt](docs/requirements.txt) | **Core dependencies**: the full Python dependency list | |
|
|
| ## System Overview |
|
|
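| ScholarMind is a production-grade intelligent knowledge system built for a corpus of **1000+ academic PDF papers**. It integrates: |
| - **Deep PDF parsing**: high-accuracy OCR based on MinerU 2.5 VLM (formulas, tables, figures) |
| - **Automatic knowledge graph construction**: entities and relations are extracted automatically from the papers to build a domain knowledge graph |
| - **Hybrid retrieval QA**: three-way fusion of GraphRAG, dense vector retrieval, and BM25 sparse retrieval |
| - **Multi-model support**: both local deployment (vLLM/Ollama) and external APIs (OpenAI/Anthropic/DeepSeek) |
| - **Agent orchestration**: LangGraph-based multi-agent collaboration with support for multi-hop reasoning |
| - **7-layer cache acceleration**: semantic cache + provider cache + vLLM prefix cache + KV compression, bringing P50 latency down to ~400ms |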
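The semantic cache mentioned in the last bullet answers a query from cache when it is close enough in embedding space to one seen before. The block below is a minimal sketch of that idea only. It does not use the GPTCache API; the `SemanticCache` class, the 0.9 threshold, and the toy 3-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by query embedding (illustration only)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed similarity cut-off for a hit
        self.entries = []           # list of (embedding, cached answer)

    def put(self, embedding, answer):
        self.entries.append((list(embedding), answer))

    def get(self, embedding):
        # Linear scan for the most similar stored query.
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only return the answer if it clears the threshold.
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0, 0.0], "RAPTOR builds a hierarchical summary tree.")
hit = cache.get([0.98, 0.05, 0.0])   # near-duplicate query
miss = cache.get([0.0, 1.0, 0.0])    # unrelated query
```

A real deployment would replace the linear scan with a vector index and obtain embeddings from an embedding model; the threshold trades hit rate against the risk of returning an answer to a subtly different question.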
|
|
| ## Core Architecture Diagram |
|
|
| ``` |
| ┌────────────────────────────────────────────────────────────────────────┐ |
| │  User Layer (Web UI / API)                                             │ |
| ├────────────────────────────────────────────────────────────────────────┤ |
| │  Agent Orchestration Layer (LangGraph)                                 │ |
| │  Router Agent → Retrieval Agent → Reasoning Agent → Summary Agent      │ |
| ├────────────────────────────────────────────────────────────────────────┤ |
| │  Unified LLM Access Layer (LiteLLM Proxy)                              │ |
| │  vLLM | Ollama | OpenAI | Anthropic | DeepSeek                         │ |
| ├─────── 7-Layer Cache Acceleration Stack ───────────────────────────────┤ |
| │  L1 Semantic cache (GPTCache) │ L2 Retrieval cache │ L3 Provider cache │ |
| │  L4 vLLM APC │ L5 CacheBlend │ L6 SnapKV │ L7 Conversation cache       │ |
| ├────────────────────────────────────────────────────────────────────────┤ |
| │  Retrieval Layer (Hybrid Retrieval)                                    │ |
| │  Dense Vector + Sparse BM25 + Graph Query                              │ |
| │  → RRF fusion → bge-reranker-large reranking                           │ |
| ├────────────────────────────────────────────────────────────────────────┤ |
| │  Index Layer (Multi-Index)                                             │ |
| │  Qdrant (vectors) | Neo4j (graph) | RAPTOR (hierarchical summary tree) │ |
| ├────────────────────────────────────────────────────────────────────────┤ |
| │  Knowledge Extraction Layer                                            │ |
| │  GLiNER (NER) → LLMGraphTransformer (RE) → Graphusion                  │ |
| ├────────────────────────────────────────────────────────────────────────┤ |
| │  PDF Parsing Layer (MinerU Pipeline)                                   │ |
| │  PDF router → MinerU 2.5 VLM / PyMuPDF → JSON + Markdown               │ |
| ├────────────────────────────────────────────────────────────────────────┤ |
| │  Storage Layer                                                         │ |
| │  PostgreSQL | Qdrant | Neo4j | Redis | MinIO                           │ |
| └────────────────────────────────────────────────────────────────────────┘ |
| ``` |
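The retrieval layer in the diagram merges three ranked lists (dense vector, sparse BM25, graph query) with RRF before reranking. As a rough illustration of Reciprocal Rank Fusion, the sketch below scores each document as the sum of `1 / (k + rank)` over the lists it appears in. The function name `rrf_fuse`, the constant `k=60` (a commonly used default, not confirmed by this README), and the paper IDs are assumptions, and this is not ScholarMind's actual implementation.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (assumed value; 60 is a common default).
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the three retrievers (IDs are made up).
dense_hits = ["p3", "p1", "p7"]
sparse_hits = ["p1", "p9", "p3"]
graph_hits = ["p1", "p4"]

fused = rrf_fuse([dense_hits, sparse_hits, graph_hits])
```

Because RRF uses only ranks, it needs no score normalization across retrievers whose raw scores live on different scales. The fused list would then be passed to the reranker.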
|
|
| ## Performance Metrics |
|
|
| | Metric | No cache | With 7-layer cache | |
| |------|--------|----------| |
| | QA response latency (P50) | ~1.5s | **~400ms** | |
| | QA response latency (P99) | ~4s | **~1.5s** | |
| | Latency on cache hit | n/a | **~5ms** | |
| | API cost | Baseline | **Reduced by 60%+** | |
| | PDF parsing speed (A100) | 2.12 pages/s | n/a | |
| | Full parse of 1000 papers | ~80 minutes | n/a | |
|
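The figures in the table can be cross-checked with a little arithmetic. The sketch below derives the latency speedups and the average paper length implied by the parsing numbers. It assumes the ~80 minute figure refers to sequential parsing at the quoted 2.12 pages/s, which the table does not state explicitly.

```python
# Figures copied from the performance table above.
P50_NO_CACHE_S, P50_CACHED_S = 1.5, 0.4
P99_NO_CACHE_S, P99_CACHED_S = 4.0, 1.5
PAGES_PER_SECOND = 2.12
FULL_PARSE_MINUTES = 80
NUM_PAPERS = 1000

# Speedup is the ratio of uncached to cached latency.
p50_speedup = P50_NO_CACHE_S / P50_CACHED_S
p99_speedup = P99_NO_CACHE_S / P99_CACHED_S

# Assumption: parsing runs sequentially at the quoted throughput.
total_pages = PAGES_PER_SECOND * FULL_PARSE_MINUTES * 60
pages_per_paper = total_pages / NUM_PAPERS

print(f"P50 speedup: {p50_speedup:.2f}x, P99 speedup: {p99_speedup:.2f}x")
print(f"Implied corpus: {total_pages:.0f} pages, {pages_per_paper:.1f} pages per paper")
```

Under that assumption the numbers are mutually consistent with an average of roughly ten pages per paper.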
|
| ## Core Tech Stack |
|
|
| | Component | Choice | Paper / rationale | |
| |------|------|---------| |
| | PDF parsing | **MinerU 2.5 VLM** | arxiv:2509.22186 (OmniDocBench SOTA) | |
| | NER | **GLiNER** (440M) | arxiv:2311.08526 (F1=47.8, zero-shot) | |
| | KG fusion | **Graphusion** | arxiv:2410.17600 (+9.2% QA accuracy) | |
| | GraphRAG | **LightRAG** | arxiv:2410.05779 (34k⭐, incremental updates) | |
| | Hierarchical index | **RAPTOR** | arxiv:2401.18059 (+20% accuracy) | |
| | Retrieval reranking | **bge-reranker-large** | arxiv:2502.11371 (consensus best choice) | |
| | Semantic cache | **GPTCache** | 7k⭐, hits based on semantic similarity | |
| | KV reuse | **CacheBlend/LMCache** | arxiv:2405.16444 (2.2-3.3× TTFT) | |
| | KV compression | **SnapKV** | arxiv:2404.14469 (3.6× decoding speedup) | |
| | Agent | **LangGraph** | Stateful graph, conditional branching, production-grade | |
| | LLM | **LiteLLM** | Unified interface for local models and APIs | |
| | Vector store | **Qdrant** | High-performance Rust implementation, native hybrid search | |
| | Graph database | **Neo4j 5.x** | Native LangChain integration | |
|
|
| ## License |
|
|
| MIT |
|
|
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
|
|
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
|
|
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_id = 'heyingyue/scholarmind-architecture' |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForCausalLM.from_pretrained(model_id) |
| ``` |
|
|
| For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class. |
|
|