# ScholarMind Detailed Data Flow Design

## 1. End-to-End Data Flow

```
                           ScholarMind Data Flow Overview

 ┌─────────────────────────────────────────────────────────────────────────────────┐
 │                               Phase 1: Ingestion                                │
 │                                                                                 │
 │  PDF file ──▶ MinIO storage ──▶ Redis queue ──▶ PDF router                      │
 │                                                     │                           │
 │                                   ┌─────────────────┼─────────────────┐         │
 │                                   ▼                 ▼                 ▼         │
 │                                PyMuPDF       MinerU Pipeline   MinerU 2.5 VLM   │
 │                             (digital PDF)   (digital+figures) (scanned/complex) │
 │                                   │                 │                 │         │
 │                                   └─────────────────┼─────────────────┘         │
 │                                                     ▼                           │
 │                                             ParsedPaper JSON                    │
 │                                        (structured content blocks)              │
 │                                                     │                           │
 │                                   ┌─────────────────┼─────────────────┐         │
 │                                   ▼                 ▼                 ▼         │
 │                              PostgreSQL        MinIO (raw)     Redis (status)   │
 │                              (metadata)       (PDF + JSON)       (progress)     │
 └────────────────────────────┬────────────────────────────────────────────────────┘
                              │
 ┌────────────────────────────▼────────────────────────────────────────────────────┐
 │                          Phase 2: Knowledge Extraction                          │
 │                                                                                 │
 │  ParsedPaper ──▶ Academic chunker (256 tokens)                                  │
 │                       │                                                         │
 │                       ├──▶ GLiNER NER (entity extraction, local GPU)            │
 │                       │     │                                                   │
 │                       │     ▼                                                   │
 │                       │   Entity list: [(text, type, score), ...]               │
 │                       │     │                                                   │
 │                       │     ▼                                                   │
 │                       ├──▶ LLMGraphTransformer (relation extraction)            │
 │                       │     │ Input: text chunk + entity hints                  │
 │                       │     │ LLM: local (Qwen2.5-14B) or API (GPT-4o-mini)    │
 │                       │     ▼                                                   │
 │                       │   Triple list: [(head, rel, tail, props), ...]          │
 │                       │     │                                                   │
 │                       │     ▼                                                   │
 │                       └──▶ Graphusion fusion engine                             │
 │                             │ - Entity merge by embedding similarity (>0.92)    │
 │                             │ - LLM-based conflict resolution                  │
 │                             │ - Inference of missing relations                 │
 │                             ▼                                                   │
 │                     Normalized triples                                          │
 └────────────────────────────┬────────────────────────────────────────────────────┘
                              │
 ┌────────────────────────────▼────────────────────────────────────────────────────┐
 │                           Phase 3: Index Construction                           │
 │                                                                                 │
 │  Three index-building paths run in parallel:                                    │
 │                                                                                 │
 │  Path A: Vector index                                                           │
 │  Text chunks ──▶ Embedding Model ──▶ Dense Vector ───┐                          │
 │  Text chunks ──▶ BM42/SPLADE ──▶ Sparse Vector ──────┼──▶ Qdrant Collection     │
 │  Metadata ───────────────────────────────────────────┘    (papers, 1M+ vectors) │
 │                                                                                 │
 │  Path B: Knowledge graph                                                        │
 │  Normalized triples ──▶ Neo4j Batch Import ──▶ Neo4j Graph                      │
 │  Entity embeddings ──▶ Neo4j Vector Index ──▶ In-graph vector search            │
 │  Full-text index ──▶ Neo4j Fulltext Index ──▶ In-graph text search              │
 │                                                                                 │
 │  Path C: RAPTOR hierarchical tree                                               │
 │  Text chunks ──▶ SBERT embedding ──▶ GMM clustering ──▶ LLM summary             │
 │              ──▶ re-embedding ──▶ recurse                                       │
 │  Level 0 (raw) → Level 1 (paragraph) → Level 2 (topic) → Level 3 (domain)       │
 │  Nodes at all levels ──▶ Qdrant Collection (raptor_tree)                        │
 └────────────────────────────┬────────────────────────────────────────────────────┘
                              │
 ┌────────────────────────────▼────────────────────────────────────────────────────┐
 │                        Phase 4: Retrieval and QA (Query)                        │
 │                                                                                 │
 │  User query                                                                     │
 │     │                                                                           │
 │     ├──▶ Intent classification (Router LLM)                                     │
 │     │     │                                                                     │
 │     │     ├── factual → vector + BM25 retrieval (Qdrant)                        │
 │     │     ├── reasoning → graph traversal (Neo4j) + vector retrieval            │
 │     │     └── global → RAPTOR high-level summaries + community retrieval        │
 │     │                                                                           │
 │     ├──▶ HyDE query augmentation                                                │
 │     │     LLM generates a hypothetical answer → embed → search document space   │
 │     │                                                                           │
 │     └──▶ Multi-path results ──▶ RRF ──▶ bge-reranker-large rerank ──▶ Top-5     │
 │                                                                                 │
 │  Top-5 + query ──▶ Generator LLM ──▶ answer + citations                         │
 │                          │                                                      │
 │                          ▼                                                      │
 │  Answer ──▶ Validator LLM (self-check)                                          │
 │                 │                                                               │
 │                 ├── confidence ≥ 0.8 → return                                   │
 │                 └── confidence < 0.8 → supplementary retrieval (max 3 rounds)   │
 └─────────────────────────────────────────────────────────────────────────────────┘
```
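
The entity-merge step of the Graphusion fusion engine in Phase 2 can be sketched as follows. This is a minimal illustration, assuming entity embeddings are already available as plain float lists. Only the 0.92 threshold comes from the diagram; the helper names `cosine` and `merge_entities` and the greedy first-seen-wins strategy are assumptions, not the project's actual implementation.

```python
import math
from typing import Dict, List, Sequence

MERGE_THRESHOLD = 0.92  # similarity threshold taken from the Phase 2 diagram


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_entities(
    embeddings: Dict[str, Sequence[float]],
    threshold: float = MERGE_THRESHOLD,
) -> Dict[str, str]:
    """Map every entity name to a canonical name.

    Greedy strategy (an assumption): the first-seen entity of a group becomes
    the canonical one; later entities join the first canonical entity whose
    similarity exceeds the threshold.
    """
    canonical: List[str] = []
    mapping: Dict[str, str] = {}
    for name, vec in embeddings.items():
        match = next(
            (c for c in canonical if cosine(vec, embeddings[c]) > threshold),
            None,
        )
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping
```

A greedy pass is order-dependent and costs O(n·m) comparisons for n entities and m canonical entries; a production system would more likely use an approximate nearest-neighbor index, which this sketch leaves out.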

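Phase 4 fuses the multi-path results with RRF before reranking. The sketch below shows reciprocal rank fusion under the assumption that each retriever returns an ordered list of document IDs: a document's fused score is the sum of `1 / (k + rank)` over every list it appears in. The constant `k = 60` is the value commonly used in the literature rather than one stated in this design, and `rrf_fuse` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    k: int = 60,      # assumed smoothing constant, not specified in the design
    top_n: int = 5,   # the diagram keeps the Top-5 after reranking
) -> List[str]:
    """Fuse several ranked ID lists with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

Because RRF uses only ranks, it can combine Qdrant similarity scores, Neo4j traversal results, and RAPTOR hits without normalizing their score scales.
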
## 2. Knowledge Graph Query Flow Examples

### Example 1: "On which datasets does BERT outperform GPT-2?"

```
Query → intent classification: reasoning
     → graph query:
     
     MATCH (bert:Method {name: "BERT"})-[r1:EVALUATED_ON]->(d:Dataset)
           <-[r2:EVALUATED_ON]-(gpt2:Method {name: "GPT-2"})
     WHERE r1.score > r2.score AND r1.metric = r2.metric
     RETURN d.name as dataset, 
            r1.metric as metric,
            r1.score as bert_score, 
            r2.score as gpt2_score
     ORDER BY (r1.score - r2.score) DESC
     
     → Result:
     ┌──────────┬──────────┬────────────┬────────────┐
     │ dataset  │ metric   │ bert_score │ gpt2_score │
     ├──────────┼──────────┼────────────┼────────────┤
     │ GLUE     │ accuracy │ 82.1       │ 75.4       │
     │ SQuAD    │ F1       │ 93.2       │ 89.1       │
     │ ...      │ ...      │ ...        │ ...        │
     └──────────┴──────────┴────────────┴────────────┘
     
     + Supplementary vector retrieval: relevant paper passages as supporting evidence
     → LLM synthesizes the final answer
```
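
The graph query above could be issued from Python roughly as follows. This is a sketch, assuming the official `neo4j` Python driver, whose `Session.run` accepts query parameters as keyword arguments and whose records expose `data()`. The function name `datasets_where_a_beats_b` and the parameter names are hypothetical. Passing the method names as parameters instead of hard-coding them keeps the query reusable and avoids string interpolation.

```python
from typing import Any, Dict, List

# Same pattern as the example query, with the two method names parameterized.
COMPARE_QUERY = """
MATCH (a:Method {name: $method_a})-[r1:EVALUATED_ON]->(d:Dataset)
      <-[r2:EVALUATED_ON]-(b:Method {name: $method_b})
WHERE r1.score > r2.score AND r1.metric = r2.metric
RETURN d.name AS dataset,
       r1.metric AS metric,
       r1.score AS score_a,
       r2.score AS score_b
ORDER BY (r1.score - r2.score) DESC
"""


def datasets_where_a_beats_b(
    session: Any,  # expected to behave like neo4j.Session
    method_a: str,
    method_b: str,
) -> List[Dict[str, Any]]:
    """Return one dict per dataset on which method_a outscores method_b."""
    result = session.run(COMPARE_QUERY, method_a=method_a, method_b=method_b)
    return [record.data() for record in result]
```

With a real driver this would be called inside `with driver.session() as session:`; the 5 s query timeout listed in the concurrency section would be configured separately.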

### Example 2: "What have been the main directions of improvement for the Transformer architecture over the past 3 years?"

```
Query → intent classification: global
     → RAPTOR Level 2-3 retrieval:
       high-level summary nodes related to "Transformer improvements"
     → graph query:
     
     MATCH (t:Concept {name: "Transformer"})<-[:IMPROVES_ON]-(m:Method)
     WHERE m.year >= 2022
     RETURN m.name, m.description, m.year
     ORDER BY m.year DESC
     
     → LLM synthesis: generate a trend analysis from the summaries + graph structure
```

## 3. Concurrency Model

```
┌──────────────────────────────────────────────────────────┐
│                 Concurrency Architecture                 │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  FastAPI (asyncio) ──── Query request handling           │
│     │                   100+ concurrent requests         │
│     │                                                    │
│  Celery Workers ─────── PDF parsing tasks                │
│     │                   GPU Worker: 1 per GPU            │
│     │                   CPU Worker: 4-8 per node         │
│     │                                                    │
│  vLLM Server ────────── LLM inference                    │
│     │                   Async batching (automatic)       │
│     │                   max_num_seqs: 32                 │
│     │                                                    │
│  Qdrant ─────────────── Vector search                    │
│     │                   RPS: 3000+ (p99 < 10ms)          │
│     │                                                    │
│  Neo4j ──────────────── Graph queries                    │
│     │                   Connection pool: 50              │
│     │                   Query timeout: 5s                │
│     │                                                    │
│  Redis ──────────────── Cache + queue                    │
│                         LLM response cache               │
│                         Task queue                       │
│                         Session state                    │
└──────────────────────────────────────────────────────────┘

Key design decisions:
1. MinerU VLM Worker: one process per GPU (vLLM already batches internally)
2. GLiNER: GPU batch inference, sharing a single GPU
3. LLM calls: asynchronous (litellm.acompletion), batched automatically
4. Graph writes: batched UNWIND import (1000 triples/batch)
5. Vector writes: Qdrant batch upload (100 points/batch)
```
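
Design decision 4 (batched UNWIND import) can be sketched as below. The graph schema here is an assumption made for illustration: generic `Entity` nodes and a `RELATED` relationship that stores the relation name as a property, since Cypher cannot take a relationship type as a parameter. `chunked` and `import_triples` are hypothetical helper names; `session` is expected to behave like a `neo4j.Session`.

```python
from typing import Any, Dict, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Hypothetical schema: one generic node label, relation name kept as a property.
UNWIND_QUERY = """
UNWIND $rows AS row
MERGE (h:Entity {name: row.head})
MERGE (t:Entity {name: row.tail})
MERGE (h)-[r:RELATED {type: row.rel}]->(t)
SET r += row.props
"""


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def import_triples(
    session: Any,
    triples: Iterable[Dict[str, Any]],
    batch_size: int = 1000,  # batch size from design decision 4
) -> int:
    """Send triples to Neo4j in batches; return the number of batches sent."""
    batches = 0
    for batch in chunked(triples, batch_size):
        session.run(UNWIND_QUERY, rows=batch)
        batches += 1
    return batches
```

The same `chunked` helper would serve design decision 5 by grouping Qdrant points into batches of 100 before upload.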

## 4. Caching Strategy

```
┌──────────────────────────────────────────────────────────┐
│                Multi-Level Cache Strategy                │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  L1: LLM response cache (Redis, TTL: 24h)                │
│      Key: hash(model + messages + temperature)           │
│      Estimated hit rate: 30-40%                          │
│      (academic queries are highly repetitive)            │
│                                                          │
│  L2: Embedding cache (Redis, TTL: 7d)                    │
│      Key: hash(text + model_name)                        │
│      Avoids recomputing embedding vectors                │
│                                                          │
│  L3: Query result cache (Redis, TTL: 1h)                 │
│      Key: hash(query + mode + top_k)                     │
│      Caches complete retrieval results                   │
│                                                          │
│  L4: Graph subgraph cache (Application Memory, LRU)      │
│      Preloads 2-hop subgraphs of popular entities        │
│      Capacity: top 1000 entities                         │
│                                                          │
│  Invalidation policy:                                    │
│  - New paper ingested → clear related query cache (L3)   │
│  - Graph updated → clear graph cache (L4)                │
│  - Model changed → clear LLM + embedding caches (L1+L2)  │
└──────────────────────────────────────────────────────────┘
```
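
The L1 key `hash(model + messages + temperature)` needs a hash that is stable across processes, which Python's built-in `hash()` is not for strings (it is salted per interpreter run). A minimal sketch, assuming SHA-256 over a canonical JSON encoding; the `llm:` prefix and the function name `llm_cache_key` are illustrative choices, not part of the design.

```python
import hashlib
import json
from typing import Any, Dict, List


def llm_cache_key(
    model: str,
    messages: List[Dict[str, Any]],
    temperature: float,
) -> str:
    """Build a deterministic Redis key for an LLM request."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,         # dict key order must not change the key
        ensure_ascii=False,     # keep non-ASCII text as-is
        separators=(",", ":"),  # no whitespace differences
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return "llm:" + digest
```

The L2 and L3 keys could follow the same pattern with their own fields (`text + model_name`, `query + mode + top_k`) and prefixes.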

## 5. Error Handling and Monitoring

```
┌────────────────────────────────────────────────────────────────────┐
│                     Observability Architecture                     │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  Metrics (Prometheus + Grafana):                                   │
│  - pdf_parse_duration_seconds (histogram)                          │
│  - pdf_parse_total{status="success|failed"} (counter)              │
│  - query_latency_seconds{mode="factual|reasoning|global"}          │
│  - llm_tokens_total{model, task} (counter)                         │
│  - llm_cost_usd_total{model} (counter)                             │
│  - qdrant_search_latency_seconds (histogram)                       │
│  - neo4j_query_latency_seconds (histogram)                         │
│  - cache_hit_ratio{level="L1|L2|L3|L4"} (gauge)                    │
│                                                                    │
│  Logging (Structured JSON → ELK/Loki):                             │
│  - Full trace per query (query→retrieval→generation)               │
│  - PDF parsing exception details                                   │
│  - LLM call details (token count, latency, model)                  │
│                                                                    │
│  Alerts:                                                           │
│  - PDF parse failure rate > 5% → check PDF quality/MinerU status   │
│  - Query P99 latency > 10s → check LLM/vector store load           │
│  - Daily LLM cost exceeds $X → shift more traffic to local models  │
│  - Neo4j memory > 80% → scale up or purge old data                 │
│                                                                    │
│  Health Checks:                                                    │
│  GET /health → check the status of all dependent services          │
│  - Redis: PING                                                     │
│  - Qdrant: collection info                                         │
│  - Neo4j: RETURN 1                                                 │
│  - LiteLLM: /health                                                │
│  - MinerU Worker: Celery inspect                                   │
└────────────────────────────────────────────────────────────────────┘
```
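
The aggregation behind `GET /health` can be sketched independently of the web framework. This is a sketch under stated assumptions: each dependency probe (Redis `PING`, Neo4j `RETURN 1`, and so on) is wrapped in an async callable that raises on failure. `run_health_checks`, the `"degraded"` status label, and the 2-second timeout are illustrative choices, not values from the design.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Check = Callable[[], Awaitable[None]]  # a probe that raises on failure


async def run_health_checks(
    checks: Dict[str, Check],
    timeout_s: float = 2.0,  # assumed per-check timeout
) -> Dict[str, object]:
    """Run all dependency probes concurrently and summarize the result."""

    async def probe(name: str, check: Check) -> Tuple[str, str]:
        try:
            await asyncio.wait_for(check(), timeout=timeout_s)
            return name, "ok"
        except Exception as exc:  # includes timeouts
            return name, f"error: {type(exc).__name__}"

    pairs = await asyncio.gather(*(probe(n, c) for n, c in checks.items()))
    services = dict(pairs)
    healthy = all(state == "ok" for state in services.values())
    return {"status": "ok" if healthy else "degraded", "services": services}
```

Running the probes concurrently keeps the endpoint's latency close to the slowest single check rather than the sum of all of them; a FastAPI handler would simply `await run_health_checks(...)` and return the dictionary.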