Update README for v3.0 with all new features documented
README.md
CHANGED
@@ -1,146 +1,261 @@
tags:
- ml-intern
---
# 🏥 Insurance App User Behavior Analysis Model Training Platform

An industrial-grade insurance

🔗 **Live demo**: https://huggingface.co/spaces/Stephanwu/insurance-app-behavior

---

## 📋 Features

| Feature |
|------|------|
| 🎲
| 📁

---

| **Renewal** | `renewal_reminder`, `renewal_click`, `renewal_complete`, `policy_cancel` |
| **Other** | `login`, `logout`, `app_uninstall` |

```csv
user_id,session_id,timestamp,event_type,page_id,product_id,amount
user_001,sess_001,1704067200000,page_view,
user_001,sess_001,1704067230000,product_view,
user_001,sess_001,1704067260000,quote_request,
user_001,sess_001,1704067290000,payment_success,
```

**Required columns**:
- `timestamp`: Unix timestamp (milliseconds)
- `event_type`: event type (see the table above)
- `page_id`: page identifier

---

- **Architecture**: 6-layer Transformer, d_model=128, 8 attention heads
- **Training**: Adam, lr=1e-4, batch_size=4096
- **Evaluation**: PR-AUC, F1-score

> Reference: *Tabular Transformers for Modeling Multivariate Time Series* (arXiv 2011.01843)

---

## ⚠️ Imbalanced Data Handling

Insurance data is highly imbalanced (churn rate

| Technique | Implementation |
|------|------|
| **Class weights** | `class_weight='balanced'`
| **
| **
| **Evaluation metrics** | **Do not use Accuracy**; use F1 + PR-AUC

---

```
Serving: Triton/TensorRT + REST/gRPC
Monitoring: Trackio + business metrics dashboard
```

---

## 📚 References

| Paper | Application |
|------|------|------|
| Deep Interest Network
| SDIM
| FMLP-Rec
| TabBERT/TabFormer | Tabular time-series anomaly detection | [
| Transformer Churn | Non-contractual churn prediction | [

---

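# 🏥 Insurance App User Behavior Analysis Model Training Platform v3.0

An **industrial-grade insurance user behavior analysis platform** built on recent research papers, with **7 functional modules**: demo mode, CSV upload, product recommendation, anomaly detection, model management, survival analysis, and help documentation.

🔗 **Live demo**: https://huggingface.co/spaces/Stephanwu/insurance-app-behavior

---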

## 📋 Feature Overview

| Tab | Feature | Model/Method | Reference |
|------|------|----------|---------|
| 🎲 Demo | Training on synthetic data | GBDT + Random Forest | sklearn |
| 📁 CSV Upload | Training on real data | GBDT + RF + automatic feature engineering | sklearn |
| 🎯 Product Recommendation | DIN deep learning | PyTorch + LocalActivationUnit | [DIN KDD 2018](https://arxiv.org/abs/1706.06978) |
| 🔍 Anomaly Detection | TabBERT | PyTorch + Transformer + Focal Loss | [TabBERT](https://arxiv.org/abs/2011.01843) |
| 💾 Model Management | Save/load via HF Hub | joblib + torch.save | Hugging Face Hub |
| ⏱️ Survival Analysis | Cox-PH + DeepSurv | lifelines + PyTorch Neural Cox-PH | [DeepSurv](https://arxiv.org/abs/1606.00931) |
| ❓ Help | Full documentation | - | - |

---

## 🎯 Six Core Models

### 1. User Churn Prediction (ChurnPrediction)
**Method**: Gradient Boosting + Random Forest
- Input: 30+ dimensional user behavior statistical features
- Output: churn probability
- Evaluation: AUC-ROC, F1-Score, AP, 5-fold cross-validation

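AUC-ROC, the first metric listed above, can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it directly from that definition in plain Python; `roc_auc` and the toy scores are illustrative and not taken from the repository, which presumably relies on scikit-learn for this.

```python
def roc_auc(scores, labels):
    """AUC-ROC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy churn scores: two churners (label 1) among six users
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

The pairwise loop is quadratic in the number of users, so it only suits small samples or sanity checks.
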
### 2. Insurance Product Recommendation (DIN)
**Method**: Deep Interest Network (KDD 2018)
- **Core**: LocalActivationUnit attention mechanism
- **Input**: user behavior history sequence + candidate insurance product
- **Output**: purchase probability + attention weight visualization
- **Insight**: the user interest representation changes dynamically with the candidate product

```
User history: [event_1, product_1], [event_2, product_2], ...
        ↓ Embedding
Event embedding (D/2) + product embedding (D/2) → behavior embedding (D)
        ↓
Candidate product embedding ───┐
        ↓
[c, b, c-b, c*b] → Attention MLP → weights α
        ↓
Weighted sum → interest vector (D)
        ↓
[user, interest, candidate, interaction, statistical features] → MLP → purchase probability
```

### 3. Anomalous Behavior Detection (TabBERT)
**Method**: TabularBERT + Focal Loss
- **Core**: hierarchical Transformer + imbalanced data handling
- **Loss**: Focal Loss (α=0.25, γ=2.0), which focuses training on hard-to-classify anomalous samples
- **Input**: multi-dimensional features of claim records
- **Output**: anomaly score + recommended optimal threshold

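One plausible way to arrive at a recommended threshold is to sweep every distinct anomaly score as a candidate cut-off and keep the one with the highest F1. The sketch below assumes that criterion; the repository may use a different one, and `f1_at_threshold`, `best_f1_threshold`, and the toy data are hypothetical names introduced for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule `score >= threshold` against binary labels (1 = anomaly)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(scores, labels):
    """Try each distinct score as a threshold; return (threshold, f1) with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Toy anomaly scores for seven claims, three of which are anomalous
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1]
threshold, f1 = best_f1_threshold(scores, labels)
print(threshold, round(f1, 3))
```

In practice the threshold should be chosen on a validation split rather than on the data used for the final evaluation.
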
### 4. Survival Analysis
**Statistical method**: lifelines Cox-PH + Kaplan-Meier
- **Input**: user age, gender, income, policy type, coverage amount
- **Output**: survival function S(t) = P(T > t)
- **Visualization**: KM curves, Cox coefficients, risk stratification

**Deep method**: DeepSurv (Neural Cox-PH)
- **Architecture**: MLP that predicts the log hazard ratio
- **Loss**: Cox partial likelihood
- **Advantage**: learns nonlinear feature interactions
- **Evaluation**: Concordance Index (C-index)

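The platform uses lifelines for the Kaplan-Meier curve; to show what that estimator computes, here is a dependency-free sketch of the product-limit formula S(t) = Π (1 − dᵢ/nᵢ) over event times tᵢ ≤ t, where dᵢ is the number of events at tᵢ and nᵢ the number still at risk. The function name and the toy durations are illustrative only.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    durations: observed time per user (until the event or until censoring)
    events:    1 if the event was observed, 0 if the record is censored
    """
    pairs = sorted(zip(durations, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        observed = 0   # events at time t
        leaving = 0    # records (events + censored) leaving the risk set at t
        while i < len(pairs) and pairs[i][0] == t:
            observed += pairs[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Six policies: time in months, 1 = lapsed, 0 = still active when last seen
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored records lower the number at risk without producing a step in the curve, which is why months 8 and 15 each remove a policy that never appears as a drop.
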
### 5. Model Management (HF Hub)
**Save**: one-click upload of sklearn + PyTorch models to the Hugging Face Hub

**Load**: download models from the Hub and restore training metrics and visualizations

---

## 📊 Data Format

### CSV Upload Format

```csv
user_id,session_id,timestamp,event_type,page_id,product_id,amount
user_001,sess_001,1704067200000,page_view,home,,
user_001,sess_001,1704067230000,product_view,product,health_basic,
user_001,sess_001,1704067260000,quote_request,quote,health_basic,50000
user_001,sess_001,1704067290000,payment_success,payment,health_basic,50000
```

**Required columns**: `user_id`, `session_id`, `timestamp`, `event_type`, `page_id`

**Optional columns**: `product_id`, `amount`, `label` (churn label)

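Before uploading, a file can be checked against the column rules above with the standard library alone. This is a minimal sketch, not the validation that `app.py` performs: `load_events` is a hypothetical helper, and it assumes `timestamp` holds Unix epoch milliseconds, as the sample values and the previous README suggest.

```python
import csv
import io
from datetime import datetime, timezone

REQUIRED_COLUMNS = ["user_id", "session_id", "timestamp", "event_type", "page_id"]

SAMPLE = """user_id,session_id,timestamp,event_type,page_id,product_id,amount
user_001,sess_001,1704067200000,page_view,home,,
user_001,sess_001,1704067230000,product_view,product,health_basic,
user_001,sess_001,1704067260000,quote_request,quote,health_basic,50000
user_001,sess_001,1704067290000,payment_success,payment,health_basic,50000
"""


def load_events(text):
    """Parse CSV text into a list of dicts, checking the required columns."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for row in reader:
        # Assumption: timestamps are Unix epoch milliseconds
        ts = datetime.fromtimestamp(int(row["timestamp"]) / 1000, tz=timezone.utc)
        # Optional column: empty or absent amounts become None
        amount = float(row["amount"]) if row.get("amount") else None
        rows.append({**row, "timestamp": ts, "amount": amount})
    return rows


events = load_events(SAMPLE)
print(events[0]["timestamp"].isoformat(), events[2]["amount"])
```
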
### Event Types (30)

| Category | Events |
|------|------|
| **Browsing** | page_view, product_view, premium_calculator, article_read, faq_view, product_compare |
| **Interaction** | quote_request, form_submit, document_upload, chat_init, call_init, video_consult, quote_result_view |
| **Conversion** | policy_select, payment_init, payment_success, policy_issued |
| **Claims** | claim_init, claim_doc_upload, claim_review, claim_approved, claim_rejected |
| **Renewal** | renewal_reminder, renewal_click, renewal_complete, policy_cancel |
| **Other** | login, logout, app_uninstall |

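The category table also works as a lookup for simple count features. The sketch below assumes that per-category event counts are among the statistical features fed to the churn model; the actual feature set in `app.py` is not shown here and may differ. The category keys and the function name are hypothetical.

```python
from collections import Counter

# Event names copied from the table; category keys are illustrative
EVENT_CATEGORIES = {
    "browsing": ["page_view", "product_view", "premium_calculator",
                 "article_read", "faq_view", "product_compare"],
    "interaction": ["quote_request", "form_submit", "document_upload", "chat_init",
                    "call_init", "video_consult", "quote_result_view"],
    "conversion": ["policy_select", "payment_init", "payment_success", "policy_issued"],
    "claims": ["claim_init", "claim_doc_upload", "claim_review",
               "claim_approved", "claim_rejected"],
    "renewal": ["renewal_reminder", "renewal_click", "renewal_complete", "policy_cancel"],
    "other": ["login", "logout", "app_uninstall"],
}

# Reverse lookup: event name -> category
EVENT_TO_CATEGORY = {
    event: category
    for category, event_names in EVENT_CATEGORIES.items()
    for event in event_names
}


def category_counts(event_types):
    """Count one user's events per category; unrecognized events are ignored."""
    counts = Counter(EVENT_TO_CATEGORY.get(e, "unknown") for e in event_types)
    return {category: counts.get(category, 0) for category in EVENT_CATEGORIES}


session = ["page_view", "product_view", "quote_request", "payment_success"]
features = category_counts(session)
print(features)
```
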
---

## 🚀 Usage

### Online
Open directly: https://huggingface.co/spaces/Stephanwu/insurance-app-behavior

### Run Locally

```bash
# Clone the repository
git clone https://huggingface.co/spaces/Stephanwu/insurance-app-behavior
cd insurance-app-behavior

# Install dependencies (torch is large; this can take a few minutes)
pip install -r requirements.txt

# Launch
python app.py
```

### Run with Docker

```bash
docker run -p 7860:7860 --platform=linux/amd64 \
  -e HF_TOKEN="your_token" \
  registry.hf.space/stephanwu-insurance-app-behavior:latest
```

---

## ⚠️ Imbalanced Data Handling

Insurance data is highly imbalanced (churn rate < 5%, fraud rate < 1%):

| Technique | Implementation | Applicable scenario |
|------|------|---------|
| **Class weights** | `class_weight='balanced'` | Random Forest |
| **Focal Loss** | `α=0.25, γ=2.0` | PyTorch deep learning |
| **Oversampling** | SMOTE / resampling | Preprocessing before training |
| **Evaluation metrics** | **Do not use Accuracy**; use F1 + PR-AUC | All imbalanced scenarios |

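Two rows of the table reduce to short formulas, sketched here in plain Python so they can be checked by hand. The binary focal loss is −αₜ(1 − pₜ)^γ log pₜ, and scikit-learn documents `class_weight='balanced'` as n_samples / (n_classes × class count). The function names are illustrative; the platform itself uses PyTorch and scikit-learn for these.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample with predicted positive probability p."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def balanced_class_weights(labels):
    """Per-class weight n_samples / (n_classes * count), as in class_weight='balanced'."""
    classes = sorted(set(labels))
    n_samples = len(labels)
    return {c: n_samples / (len(classes) * labels.count(c)) for c in classes}


# A confidently correct positive contributes far less than a badly missed one
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
print(easy, hard)

# 5% positives: the rare class gets the larger weight
weights = balanced_class_weights([0] * 95 + [1] * 5)
print(weights)
```

With γ = 0 the modulating factor disappears and the loss reduces to α-weighted cross-entropy, which is a convenient sanity check when tuning γ.
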
---

## 📦 File Structure

```
Stephanwu/insurance-app-behavior/
├── app.py              (86 KB)   Main program: full functionality for all 7 tabs
├── models.py           (23 KB)   Deep learning model definitions (DIN, TabBERT, Transformer)
├── README.md           (7 KB)    Full documentation
├── requirements.txt    (188 B)   Dependency list
└── .gitattributes      (1.5 KB)  Git configuration
```

---

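## 🧠 Model Architecture Details

### DIN (Deep Interest Network)
```python
import torch
import torch.nn as nn

# Illustrative sizes and layers (assumptions; the real definitions live in models.py)
B, L, D = 4, 10, 16                        # batch size, history length, embedding dim
embed_event = nn.Embedding(30, D // 2)     # event-type embedding (D/2)
embed_product = nn.Embedding(8, D // 2)    # product embedding (D/2)
embed_candidate = nn.Embedding(8, D)       # candidate-product embedding (D)
attention_mlp = nn.Sequential(nn.Linear(4 * D, 32), nn.ReLU(), nn.Linear(32, 1))
predict_mlp = nn.Sequential(nn.Linear(4 * D + 5, 64), nn.ReLU(), nn.Linear(64, 1))
events = torch.randint(0, 30, (B, L))
products = torch.randint(0, 8, (B, L))
candidate_product = torch.randint(0, 8, (B,))
user, stats = torch.randn(B, D), torch.randn(B, 5)   # user and statistical features

# LocalActivationUnit core
candidate_emb = embed_candidate(candidate_product)                                # (B, D)
behavior_emb = torch.cat([embed_event(events), embed_product(products)], dim=-1)  # (B, L, D)

# Four-way interaction features
candidate = candidate_emb.unsqueeze(1).expand(-1, L, -1)   # broadcast to (B, L, D)
interaction = torch.cat([
    candidate,                  # candidate product
    behavior_emb,               # historical behavior
    candidate - behavior_emb,   # difference
    candidate * behavior_emb,   # element-wise product
], dim=-1)                      # (B, L, 4D)

# Attention weights
attention_weights = attention_mlp(interaction).squeeze(-1)   # (B, L)
attention_weights = torch.softmax(attention_weights, dim=1)

# Weighted interest
interest = (behavior_emb * attention_weights.unsqueeze(-1)).sum(dim=1)   # (B, D)

# Prediction (the interaction term is assumed here to be interest * candidate)
features = torch.cat([user, interest, candidate_emb, interest * candidate_emb, stats], dim=-1)
logits = predict_mlp(features)   # (B, 1)
```
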
### TabBERT (Simplified)
```
# Hierarchical Transformer
input_features = [claim_amount, claim_type, days_since_policy, ...]
        ↓
Linear Projection: d_model (128)
        ↓
┌────────────────────────┐
│ Transformer × 4        │  # emulates field-level + sequence-level encoding
│ LayerNorm + Dropout    │
└────────────────────────┘
        ↓
Global Average Pooling
        ↓
MLP: 128 → 256 → 64 → 1
        ↓
Focal Loss (handles the 1:4 class imbalance)
```

### DeepSurv (Neural Cox-PH)
```python
import torch

# Cox partial likelihood loss
def cox_ph_loss(pred, time, event):
    # Sort by time descending so each prefix is the risk set {j : t_j >= t_i}
    order = torch.argsort(time, descending=True)
    pred_sorted = pred[order]
    event_sorted = event[order]

    # logcumsumexp for numerical stability
    log_cumsum_h = torch.logcumsumexp(pred_sorted, dim=0)

    # Only event (uncensored) samples contribute
    loss = -torch.sum(event_sorted * (pred_sorted - log_cumsum_h)) / torch.sum(event_sorted)
    return loss

# Survival probability, given the baseline cumulative hazard H_0(t):
#   S(t | x) = exp(-H_0(t) * exp(pred(x)))
```

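The C-index used to evaluate DeepSurv measures how often the model ranks two users in the right order: among comparable pairs, the user whose event came first should have the higher predicted log hazard. The sketch below follows that definition in plain Python; `concordance_index` here is an illustrative helper, not the lifelines function of the same name, which expects scores where higher means longer survival.

```python
def concordance_index(risk, time, event):
    """Share of comparable pairs ranked correctly; ties in risk count as 0.5.

    risk:  predicted log hazard per user (higher = event expected sooner)
    time:  observed duration per user
    event: 1 if the event was observed, 0 if censored
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair needs an observed event on its earlier member
        for j in range(n):
            if time[i] < time[j]:  # user i had the event while j was still at risk
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


time = [5, 8, 12, 20]
event = [1, 1, 0, 1]
risk = [2.0, 1.5, 0.3, 0.1]   # perfectly ordered: earlier events have higher risk
c_index = concordance_index(risk, time, event)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering.
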
---

## 📚 References

| Paper | Application | arXiv | Venue |
|------|------|-------|------|
| Deep Interest Network | Product recommendation | [1706.06978](https://arxiv.org/abs/1706.06978) | KDD 2018 |
| SDIM | Long-term behavior modeling | [2205.10249](https://arxiv.org/abs/2205.10249) | SIGIR 2022 |
| FMLP-Rec | Denoising sequential recommendation | [2202.13556](https://arxiv.org/abs/2202.13556) | WWW 2022 |
| TabBERT/TabFormer | Tabular time-series anomaly detection | [2011.01843](https://arxiv.org/abs/2011.01843) | - |
| Transformer Churn | Non-contractual churn prediction | [2309.14390](https://arxiv.org/abs/2309.14390) | - |
| DeepSurv | Survival analysis | [1606.00931](https://arxiv.org/abs/1606.00931) | BMC Medical Research Methodology 2018 |
| RNN Survival | Purchase timing prediction | [2304.00575](https://arxiv.org/abs/2304.00575) | - |
| Focal Loss | Imbalanced classification | [1708.02002](https://arxiv.org/abs/1708.02002) | ICCV 2017 |

---

## 🔧 Tech Stack

| Component | Library | Version |
|------|-----|------|
| Web UI | Gradio | ≥4.0.0 |
| Data processing | Pandas, NumPy | ≥2.0.0, ≥1.24.0 |
| Machine learning | scikit-learn | ≥1.3.0 |
| Deep learning | PyTorch | ≥2.0.0 |
| Visualization | Matplotlib, Seaborn | ≥3.7.0, ≥0.12.0 |
| Survival analysis | lifelines | ≥0.28.0 |
| Model persistence | joblib | ≥1.3.0 |
| Model hosting | huggingface-hub | ≥0.23.0 |

---

## 📄 License

MIT License

**Author**: [Stephanwu](https://huggingface.co/Stephanwu)