
๐Ÿ—๏ธ ScholarMind โ€” ็”Ÿไบง็บงๅญฆๆœฏ็Ÿฅ่ฏ†ๅบ“้—ฎ็ญ” & ็Ÿฅ่ฏ†ๅ›พ่ฐฑ็ณป็ปŸ

For the complete architecture design document, see docs/ARCHITECTURE.md

Document index

| Document | Description |
|---|---|
| docs/ARCHITECTURE.md | Core architecture design: system overview, detailed design of each layer, code examples |
| docs/DATAFLOW.md | Data flow design: end-to-end flow, concurrency model, caching strategy, monitoring |
| docs/CACHING.md | 🆕 7-layer cache acceleration scheme: semantic cache, provider cache, vLLM APC, KV compression, and more |
| docs/ADR.md | Technology selection decision records: the rationale and paper sources behind each choice |
| docs/PAPERS.md | Paper index: quick reference to 14 core papers and 15 open-source projects |
| docs/requirements.txt | Core dependencies: the full Python dependency list |

System overview

ScholarMind is a production-grade intelligent knowledge system designed for collections of 1000+ academic PDF papers. It integrates:

  • Deep PDF parsing: high-accuracy OCR based on MinerU 2.5 VLM (formulas, tables, figures)
  • Automatic knowledge graph construction: entities and relations are extracted from the papers automatically to build a domain knowledge graph
  • Hybrid retrieval QA: three-way fusion of GraphRAG, vector retrieval, and BM25 sparse retrieval
  • Multi-model support: both local deployment (vLLM/Ollama) and external APIs (OpenAI/Anthropic/DeepSeek)
  • Agent orchestration: multi-agent collaboration based on LangGraph, with support for multi-hop reasoning
  • 7-layer cache acceleration: semantic cache + provider cache + vLLM prefix cache + KV compression, bringing P50 latency down to ~400ms
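
To make the semantic cache idea concrete, here is a minimal sketch of a similarity-threshold lookup. It is not the GPTCache API and not ScholarMind's implementation: the `SemanticCache` class, the `embed` helper, and the `0.8` threshold are all hypothetical, and the bag-of-words vector stands in for a real sentence-embedding model.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache: returns a stored answer when a query is similar enough."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # illustrative value, not taken from the project
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        q = embed(query)
        best_score, best_answer = 0.0, None
        # Linear scan for clarity; a vector index would replace this at scale.
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

A near-duplicate question is served from the cache without calling the LLM, while an unrelated question returns `None` and falls through to the next layer.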

Core architecture diagram

┌────────────────────────────────────────────────────────────────────────┐
│                       User layer (Web UI / API)                        │
├────────────────────────────────────────────────────────────────────────┤
│                 Agent orchestration layer (LangGraph)                  │
│  Router Agent → Retrieval Agent → Reasoning Agent → Summary Agent      │
├────────────────────────────────────────────────────────────────────────┤
│                Unified LLM access layer (LiteLLM Proxy)                │
│  vLLM | Ollama | OpenAI | Anthropic | DeepSeek                         │
├──────── 7-layer cache acceleration stack ──────────────────────────────┤
│  L1 semantic cache (GPTCache) → L2 retrieval cache → L3 provider cache │
│  L4 vLLM APC → L5 CacheBlend → L6 SnapKV → L7 conversation cache       │
├────────────────────────────────────────────────────────────────────────┤
│                   Retrieval layer (Hybrid Retrieval)                   │
│  Dense Vector + Sparse BM25 + Graph Query                              │
│  → RRF fusion → bge-reranker-large reranking                           │
├────────────────────────────────────────────────────────────────────────┤
│                       Index layer (Multi-Index)                        │
│  Qdrant (vectors) | Neo4j (graph) | RAPTOR (hierarchical summary tree) │
├────────────────────────────────────────────────────────────────────────┤
│                       Knowledge extraction layer                       │
│  GLiNER (NER) → LLMGraphTransformer (RE) → Graphusion                  │
├────────────────────────────────────────────────────────────────────────┤
│                  PDF parsing layer (MinerU Pipeline)                   │
│  PDF routing → MinerU 2.5 VLM / PyMuPDF → JSON+Markdown                │
├────────────────────────────────────────────────────────────────────────┤
│                             Storage layer                              │
│  PostgreSQL | Qdrant | Neo4j | Redis | MinIO                           │
└────────────────────────────────────────────────────────────────────────┘
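
The retrieval layer in the diagram merges three ranked lists (dense, BM25, graph) with RRF before reranking. The sketch below shows reciprocal rank fusion in its usual form, where each document scores the sum of `1 / (k + rank)` over the lists it appears in. The function name `rrf_fuse`, the paper IDs, and `k = 60` (a commonly used default) are assumptions for illustration; the project's actual parameters are not stated here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several rankings with reciprocal rank fusion.

    Each input list is ordered best-first. A document's fused score is the
    sum of 1 / (k + rank) over every list that contains it (rank starts at 1).
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the three retrievers.
dense_hits = ["p3", "p1", "p7"]
bm25_hits = ["p1", "p3", "p9"]
graph_hits = ["p1", "p4", "p3"]

fused = rrf_fuse([dense_hits, bm25_hits, graph_hits])
print(fused)  # → ['p1', 'p3', 'p4', 'p7', 'p9']
```

Because RRF uses only ranks, it needs no score normalization across retrievers whose raw scores are not comparable (cosine similarity, BM25, graph traversal). The fused list would then be passed to the reranker.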

Performance metrics

| Metric | No cache | With 7-layer cache |
|---|---|---|
| QA response latency (P50) | ~1.5s | ~400ms |
| QA response latency (P99) | ~4s | ~1.5s |
| Latency on cache hit | - | ~5ms |
| API cost | baseline | reduced by 60%+ |
| PDF parsing speed (A100) | 2.12 pages/s | - |
| Full parse of 1000 papers | ~80 minutes | - |
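
The parsing and latency figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check under an assumption the table does not state: that the 2.12 pages/s rate and the ~80 minute total describe the same single-A100 run. Variable names are illustrative.

```python
# Figures quoted in the performance table.
PAGES_PER_SECOND = 2.12   # PDF parsing speed on an A100
TOTAL_MINUTES = 80        # approximate full-parse time
PAPERS = 1000

# Pages processed in the full run, and the implied average paper length.
total_pages = PAGES_PER_SECOND * TOTAL_MINUTES * 60
pages_per_paper = total_pages / PAPERS

# Latency improvement implied by the cached vs. uncached columns.
p50_speedup = 1.5 / 0.4   # ~1.5s down to ~400ms
p99_speedup = 4.0 / 1.5   # ~4s down to ~1.5s

print(f"{total_pages:.0f} pages, {pages_per_paper:.1f} pages/paper")
print(f"P50 speedup {p50_speedup:.2f}x, P99 speedup {p99_speedup:.2f}x")
```

Under that assumption the numbers are mutually consistent: roughly 10,000 pages in total, or about 10 pages per paper, with the cache stack giving about a 3.75x improvement at P50 and about 2.7x at P99.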

Core technology stack

| Component | Choice | Paper basis |
|---|---|---|
| PDF parsing | MinerU 2.5 VLM | arxiv:2509.22186 (OmniDocBench SOTA) |
| NER | GLiNER (440M) | arxiv:2311.08526 (F1=47.8, zero-shot) |
| KG fusion | Graphusion | arxiv:2410.17600 (+9.2% QA accuracy) |
| GraphRAG | LightRAG | arxiv:2410.05779 (34k⭐, incremental updates) |
| Hierarchical index | RAPTOR | arxiv:2401.18059 (+20% accuracy) |
| Retrieval reranking | bge-reranker-large | arxiv:2502.11371 (consensus best choice) |
| Semantic cache | GPTCache | 7k⭐, hits based on semantic similarity |
| KV reuse | CacheBlend/LMCache | arxiv:2405.16444 (2.2-3.3× TTFT) |
| KV compression | SnapKV | arxiv:2404.14469 (3.6× decoding speedup) |
| Agent | LangGraph | Stateful graphs, conditional branching, production-grade |
| LLM | LiteLLM | Unified interface for local and API models |
| Vector database | Qdrant | High performance (Rust), native hybrid search |
| Graph database | Neo4j 5.x | Native LangChain integration |

License

MIT

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'heyingyue/scholarmind-architecture'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.