Add comprehensive README with documentation
README.md
CHANGED
@@ -1,15 +1,142 @@
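# 🏥 Insurance App User Behavior Analysis Model Training Platform

An industrial-grade platform for analyzing insurance app user behavior, built on recent research papers. It supports two modes: a **synthetic data demo** and **real CSV data upload**.

🔗 **Live demo**: https://huggingface.co/spaces/Stephanwu/insurance-app-behavior

---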
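
## 📋 Features

| Feature | Description |
|------|------|
| 🎲 **Synthetic data training** | Generates simulated insurance app user behavior data and automatically labels churn/retention |
| 📁 **CSV data upload** | Upload a real user behavior CSV; feature engineering and model training run automatically |
| 🔍 **Automatic feature engineering** | Extracts 30+ behavioral features: browsing depth, conversion funnel, lifecycle, temporal patterns |
| 🤖 **Two-model comparison** | GBDT (Gradient Boosting) + Random Forest |
| 📊 **Visual analysis** | Feature importance, PR curve, confusion matrix, data samples |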
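
To make the feature engineering row more concrete, here is a minimal sketch of per-user aggregation over raw events. It is not the platform's actual pipeline: the function name `extract_user_features`, the feature names, and the choice of funnel stages are all assumptions made for illustration, and only a handful of features are shown rather than the 30+ mentioned above.

```python
from collections import defaultdict

# Hypothetical funnel stages, picked from event names used in this README.
FUNNEL_STAGES = ["product_view", "quote_request", "payment_init", "payment_success"]


def extract_user_features(events):
    """Aggregate raw event dicts into one small feature dict per user.

    Each event needs user_id, session_id, timestamp (ms) and event_type.
    The feature names below are illustrative only.
    """
    per_user = defaultdict(list)
    for event in events:
        per_user[event["user_id"]].append(event)

    features = {}
    for user_id, user_events in per_user.items():
        user_events.sort(key=lambda e: e["timestamp"])
        types = [e["event_type"] for e in user_events]
        sessions = {e["session_id"] for e in user_events}
        span_ms = user_events[-1]["timestamp"] - user_events[0]["timestamp"]
        row = {
            "n_events": len(user_events),
            "n_sessions": len(sessions),
            # Browsing depth: average number of events per session.
            "events_per_session": len(user_events) / len(sessions),
            # Lifecycle: time between first and last event, in days.
            "active_span_days": span_ms / 86_400_000,
        }
        # Conversion funnel: one binary flag per stage the user reached.
        for stage in FUNNEL_STAGES:
            row[f"reached_{stage}"] = int(stage in types)
        features[user_id] = row
    return features


sample_events = [
    {"user_id": "user_001", "session_id": "sess_001", "timestamp": 1704067200000, "event_type": "page_view"},
    {"user_id": "user_001", "session_id": "sess_001", "timestamp": 1704067230000, "event_type": "product_view"},
    {"user_id": "user_001", "session_id": "sess_001", "timestamp": 1704067260000, "event_type": "quote_request"},
    {"user_id": "user_001", "session_id": "sess_001", "timestamp": 1704067290000, "event_type": "payment_success"},
]
user_features = extract_user_features(sample_events)
```

The resulting dict has one row per user, which is the shape a tree-based model such as GBDT or Random Forest would typically consume after conversion to a table.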

---

## 📊 Data format

### Supported event types (27 listed)

| Category | Events |
|------|------|
| **Browsing** | `page_view`, `product_view`, `premium_calculator`, `article_read`, `faq_view` |
| **Interaction** | `quote_request`, `form_submit`, `document_upload`, `chat_init`, `call_init`, `video_consult` |
| **Conversion** | `policy_select`, `payment_init`, `payment_success`, `policy_issued` |
| **Claims** | `claim_init`, `claim_doc_upload`, `claim_review`, `claim_approved`, `claim_rejected` |
| **Renewal** | `renewal_reminder`, `renewal_click`, `renewal_complete`, `policy_cancel` |
| **Other** | `login`, `logout`, `app_uninstall` |
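
The table above maps naturally onto a lookup structure. The sketch below copies the event names from the table; the dict layout, the English category keys, and the `categorize` helper are assumptions for illustration, not identifiers taken from the project code.

```python
# Event taxonomy copied from the table above; the layout is an assumption.
EVENT_CATEGORIES = {
    "browse": ["page_view", "product_view", "premium_calculator", "article_read", "faq_view"],
    "interaction": ["quote_request", "form_submit", "document_upload", "chat_init", "call_init", "video_consult"],
    "conversion": ["policy_select", "payment_init", "payment_success", "policy_issued"],
    "claims": ["claim_init", "claim_doc_upload", "claim_review", "claim_approved", "claim_rejected"],
    "renewal": ["renewal_reminder", "renewal_click", "renewal_complete", "policy_cancel"],
    "other": ["login", "logout", "app_uninstall"],
}

# Reverse index: event name -> category.
EVENT_TO_CATEGORY = {
    event: category
    for category, events in EVENT_CATEGORIES.items()
    for event in events
}


def categorize(event_type):
    """Return the category of an event, or "unknown" if it is not listed."""
    return EVENT_TO_CATEGORY.get(event_type, "unknown")
```

Returning `"unknown"` instead of raising keeps ingestion tolerant of event names that are not in the table; a stricter loader could reject them instead.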

### CSV upload format

An uploaded CSV uses the following columns:

```csv
user_id,session_id,timestamp,event_type,page_id,product_id,amount
user_001,sess_001,1704067200000,page_view,home_page,,
user_001,sess_001,1704067230000,product_view,product_detail,health_basic,
user_001,sess_001,1704067260000,quote_request,quote_form,health_basic,50000
user_001,sess_001,1704067290000,payment_success,payment_page,health_basic,50000
```

**Required columns**:
- `user_id`: unique user identifier
- `session_id`: session identifier
- `timestamp`: Unix timestamp (milliseconds)
- `event_type`: event type (see the table above)
- `page_id`: page identifier

**Optional columns**:
- `product_id`: insurance product ID
- `amount`: amount / sum insured
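
A file in this format can be checked before training. The following is a minimal sketch using only the standard library; `load_events` and its error messages are hypothetical and do not describe how the app itself validates uploads.

```python
import csv
import io
from datetime import datetime, timezone

REQUIRED_COLUMNS = ["user_id", "session_id", "timestamp", "event_type", "page_id"]


def load_events(csv_text):
    """Parse CSV text into event dicts, checking required columns and values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    events = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column in REQUIRED_COLUMNS:
            if not row[column]:
                raise ValueError(f"line {line_no}: empty value for {column}")
        ts_ms = int(row["timestamp"])
        events.append({
            "user_id": row["user_id"],
            "session_id": row["session_id"],
            "timestamp": ts_ms,
            # Millisecond Unix timestamp -> timezone-aware UTC datetime.
            "time": datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc),
            "event_type": row["event_type"],
            "page_id": row["page_id"],
            # Optional columns: empty or absent cells become None.
            "product_id": row.get("product_id") or None,
            "amount": float(row["amount"]) if row.get("amount") else None,
        })
    return events


sample_csv = """user_id,session_id,timestamp,event_type,page_id,product_id,amount
user_001,sess_001,1704067200000,page_view,home_page,,
user_001,sess_001,1704067260000,quote_request,quote_form,health_basic,50000
"""
events = load_events(sample_csv)
```

Failing fast on a missing required column gives the uploader a clear message instead of a later error inside feature engineering.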

---

## 🧠 Model architecture

### 1. User churn prediction (ChurnPredictionTransformer)
> Reference: *Early Churn Prediction from Large Scale User-Product Interaction Time Series* (arXiv 2309.14390)

- **Input**: embedding sequence of the user's 100 most recent events, plus continuous statistical features
- **Architecture**: 6-layer Transformer, d_model=128, 8 attention heads
- **Training**: Adam, lr=1e-4, batch_size=4096
- **Evaluation**: PR-AUC, F1-score
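
The input described above needs a fixed-length sequence per user. Below is a minimal sketch of that preparation step only (no model code): keep the 100 most recent events, map them to integer ids, and left-pad shorter histories. The padding id, the left-padding choice, and the helper names are assumptions, not details confirmed by the project.

```python
MAX_SEQ_LEN = 100  # "100 most recent events" from the input description
PAD_ID = 0         # assumed: id 0 is reserved for padding


def build_vocab(event_types):
    """Assign integer ids starting at 1 so that 0 stays free for padding."""
    return {name: idx for idx, name in enumerate(sorted(set(event_types)), start=1)}


def encode_sequence(event_types, vocab, max_len=MAX_SEQ_LEN):
    """Keep the most recent max_len events, then left-pad to a fixed length.

    Returns (ids, mask) where mask is 1 for real events and 0 for padding.
    """
    recent = event_types[-max_len:]
    ids = [vocab[name] for name in recent]
    n_pad = max_len - len(ids)
    return [PAD_ID] * n_pad + ids, [0] * n_pad + [1] * len(ids)


history = ["login", "page_view", "product_view", "quote_request"]
vocab = build_vocab(history)
ids, mask = encode_sequence(history, vocab, max_len=6)
```

The mask lets an attention layer ignore padded positions; the ids would be looked up in an embedding table of width d_model before entering the Transformer.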

### 2. Insurance product recommendation (InsuranceProductDIN)
> Reference: *Deep Interest Network for Click-Through Rate Prediction* (KDD 2018, arXiv 1706.06978)

- **Core**: local activation attention (LocalActivationUnit)
- **Key insight**: the user's interest representation changes dynamically with the candidate product
- **Deployment**: precomputed behavior embeddings → Feature Store → real-time attention computation (<100ms)
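
The key insight above can be illustrated with a toy example: the same behavior history pools into a different interest vector depending on the candidate product. This sketch deliberately swaps in a plain dot product (clipped at zero) as the relevance score, whereas DIN learns that score with a small feed-forward network; the function name and the 2-d embeddings are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def local_activation_pool(behavior_embs, candidate_emb):
    """Candidate-aware weighted sum of behavior embeddings.

    Stand-in scoring: dot product between each behavior and the candidate,
    clipped at zero. The learned activation network of DIN is not reproduced.
    """
    dim = len(candidate_emb)
    weights = [max(0.0, dot(b, candidate_emb)) for b in behavior_embs]
    pooled = [0.0] * dim
    for w, b in zip(weights, behavior_embs):
        for i in range(dim):
            pooled[i] += w * b[i]
    return pooled, weights


# Toy 2-d embeddings: axis 0 ~ "health products", axis 1 ~ "car products".
behaviors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
health_interest, health_w = local_activation_pool(behaviors, [1.0, 0.0])
car_interest, car_w = local_activation_pool(behaviors, [0.0, 1.0])
```

Because the behavior embeddings do not depend on the candidate, they can be precomputed and cached, leaving only the weighting and sum for request time, which matches the deployment bullet above.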

### 3. Anomalous behavior detection (TabularBERT)
> Reference: *Tabular Transformers for Modeling Multivariate Time Series* (arXiv 2011.01843)

- **Architecture**: hierarchical BERT (Field Transformer → Sequence Transformer)
- **Pre-training**: MLM on historical claim records (without fraud labels)
- **Evaluation**: F1-score (fraud rate < 1%, so accuracy is misleading)
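
The MLM pre-training bullet can be sketched as a masking step over tabular rows: some field values are hidden and become the prediction targets, so no fraud label is involved. The field names, the 15% default masking rate, and the `mask_fields` helper are assumptions for illustration; the actual tokenization used by the project is not shown in this README.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_fields(rows, mask_prob=0.15, seed=0):
    """Randomly replace field values with MASK_TOKEN for MLM-style pre-training.

    rows is a list of dicts (one per claim record). Returns the masked rows
    and a list of (row_index, field_name, original_value) prediction targets.
    The input rows are left unmodified.
    """
    rng = random.Random(seed)
    masked_rows, targets = [], []
    for i, row in enumerate(rows):
        masked = dict(row)
        for field, value in row.items():
            if rng.random() < mask_prob:
                masked[field] = MASK_TOKEN
                targets.append((i, field, value))
        masked_rows.append(masked)
    return masked_rows, targets


# Hypothetical claim records with already-discretized fields.
claims = [
    {"claim_type": "outpatient", "amount_bucket": "low", "days_since_policy": "30-90"},
    {"claim_type": "inpatient", "amount_bucket": "high", "days_since_policy": "0-30"},
]
masked_claims, targets = mask_fields(claims)
```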

---

## ⚠️ Handling imbalanced data

Insurance data is highly imbalanced (churn rate is typically < 5%, fraud rate < 1%):

| Technique | Implementation |
|------|------|
| **Class weights** | `class_weight='balanced'` (Random Forest) |
| **Oversampling** | SMOTE / resampling the minority class to 50:50 |
| **Loss function** | Focal Loss (α=0.25, γ=2.0) |
| **Evaluation metrics** | **Avoid accuracy**; use F1 + PR-AUC + G-mean |
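
Three of the rows above reduce to short formulas, sketched here in plain Python under stated assumptions: the "balanced" weights follow the heuristic n_samples / (n_classes × class count), the focal loss is the binary form with α applied to the positive class, and G-mean is the square root of sensitivity × specificity. The function names are illustrative, and predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def balanced_class_weights(labels):
    """Weights n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one example; p is the predicted P(y=1), 0 < p < 1."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def accuracy_and_gmean(y_true, y_pred):
    """Accuracy plus G-mean = sqrt(sensitivity * specificity)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    accuracy = (tp + tn) / len(y_true)
    gmean = math.sqrt((tp / pos) * (tn / neg))
    return accuracy, gmean


# 1,000 claims with 1% fraud, scored by a model that always predicts "not fraud".
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000
weights = balanced_class_weights(y_true)
accuracy, gmean = accuracy_and_gmean(y_true, y_pred)
```

In this toy case the always-negative model reaches 99% accuracy while its G-mean is 0, which is why the table recommends metrics other than accuracy.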

---

## 🚀 Production deployment architecture

```
SDK event tracking → Kafka → Flink cleaning → HDFS/S3
↓
Offline: Spark batch processing → GPU training
Online: Flink window aggregation → Redis Feature Store
↓
Inference: precomputed embeddings → attention computation → <100ms
Serving: Triton/TensorRT + REST/gRPC
Monitoring: Trackio + business metrics dashboard
```
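
The "window aggregation → Feature Store" step in the diagram can be approximated in a few lines to show the idea. This is a single-process sketch, not Flink or Redis code: the 1-minute tumbling window, the key layout, and `aggregate_windows` are assumptions, and a plain dict stands in for the feature store.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # assumed 1-minute tumbling window


def aggregate_windows(events, window_ms=WINDOW_MS):
    """Count events per (user, window start, event type).

    The returned dict plays the role of the feature store, keyed by
    (user_id, window_start_ms).
    """
    store = defaultdict(lambda: defaultdict(int))
    for event in events:
        # Align the timestamp down to the start of its window.
        window_start = event["timestamp"] - event["timestamp"] % window_ms
        store[(event["user_id"], window_start)][event["event_type"]] += 1
    return {key: dict(counts) for key, counts in store.items()}


stream = [
    {"user_id": "user_001", "timestamp": 1704067200000, "event_type": "page_view"},
    {"user_id": "user_001", "timestamp": 1704067230000, "event_type": "product_view"},
    {"user_id": "user_001", "timestamp": 1704067260000, "event_type": "quote_request"},
    {"user_id": "user_001", "timestamp": 1704067290000, "event_type": "payment_success"},
]
feature_store = aggregate_windows(stream)
```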

---

## 📦 Running locally

```bash
# Clone the repository and enter it
git clone https://huggingface.co/spaces/Stephanwu/insurance-app-behavior
cd insurance-app-behavior

# Install dependencies
pip install -r requirements.txt

# Start the app
python app.py
```

---

## 📚 References

| Paper | Application | Link |
|------|------|------|
| Deep Interest Network (KDD 2018) | Product recommendation | [arXiv:1706.06978](https://arxiv.org/abs/1706.06978) |
| SDIM (SIGIR 2022) | Long-term behavior modeling | [arXiv:2205.10249](https://arxiv.org/abs/2205.10249) |
| FMLP-Rec (WWW 2022) | Denoised sequential recommendation | [arXiv:2202.13556](https://arxiv.org/abs/2202.13556) |
| TabBERT/TabFormer | Anomaly detection on tabular time series | [arXiv:2011.01843](https://arxiv.org/abs/2011.01843) |
| Transformer Churn | Non-contractual churn prediction | [arXiv:2309.14390](https://arxiv.org/abs/2309.14390) |
| FinDeepBehaviorCluster | Fraud behavior clustering | [arXiv:2101.04285](https://arxiv.org/abs/2101.04285) |

---

**Author**: [Stephanwu](https://huggingface.co/Stephanwu)
**License**: MIT