kangkangchen committed on
Commit 8171f7d · verified · 1 Parent(s): c9fc5f7

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +62 -61
  2. algo_detector.py +465 -0
  3. config.json +40 -32
  4. model.pt +2 -2
  5. model.py +60 -39
README.md CHANGED
@@ -3,104 +3,105 @@ tags:
3
  - finance
4
  - order-book
5
  - institutional-trading
 
6
  - level-2
7
  - A-share
8
  - LOB
 
 
 
9
  - pytorch
10
  license: mit
11
  ---
12
 
13
- # LOBPatternNet V3 - 主力单模式识别模型
14
 
15
- ## 模型简介 / Overview
16
 
17
- 基于A股Level-2十档委托单(LOB)数据,利用深度学习自动识别主力(机构)的单模式
 
18
 
19
- Detects institutional trading patterns from Level-2 order book data (10-level bid/ask).
 
 
20
 
21
- ## 架构 / Architecture
22
 
23
  ```
24
- Input: (batch, 100, 40) - 100 consecutive LOB snapshots
25
- Each snapshot: [ask_p₁, ask_s₁, bid_p₁, bid_s₁, ..., ask_p₁₀, ask_s₁₀, bid_p₁₀, bid_s₁₀]
26
- BilinearNorm (adaptive normalization)
27
- Spatial CNN (cross-level patterns)
28
- Temporal CNN (multi-scale time features)
29
- Transformer Attention (temporal dependencies)
30
- ↓ 3-class Classification
31
  ```
32
 
33
- Parameters: 338,395
34
 
35
- ## 输出 / Output Classes
36
 
37
- | ID | 中文 | English |
38
- |----|------|---------|
39
- | 0 | 主力买入 | Institutional Buying |
40
- | 1 | 中性/散户 | Neutral / Retail |
41
- | 2 | 主力卖出 | Institutional Selling |
 
 
42
 
43
- ## 性能 / Performance
44
 
45
  | Metric | Value |
46
  |--------|-------|
47
- | Test Accuracy | 0.3980 |
48
- | Test F1 (Macro) | 0.3795 |
49
- | Test F1 (Weighted) | 0.4203 |
50
- | 主力买入 Precision | 0.2418 |
51
- | 主力买入 Recall | 0.5401 |
52
- | 主力卖出 Precision | 0.2448 |
53
- | 主力卖出 Recall | 0.6246 |
54
 
55
- ## 使用方法 / Usage
56
 
57
  ```python
58
- import torch
59
- import numpy as np
60
- from model import LOBPatternNetV3
61
 
62
- # Load model
63
- model = LOBPatternNetV3(num_classes=3, d_model=64, nhead=4, dropout=0.4)
64
  model.load_state_dict(torch.load("model.pt", weights_only=True))
65
  model.eval()
66
 
67
- # Load normalization stats
68
  stats = np.load("norm_stats.npz")
69
- means, stds = stats["means"], stats["stds"]
70
-
71
- # Prepare input: 100 consecutive Level-2 snapshots (N, 40)
72
- # Each snapshot: [ask_price_1, ask_size_1, bid_price_1, bid_size_1, ...]
73
- # 1. Replace sentinel values (abs > 1e9) with 0
74
- # 2. Normalize prices to basis points relative to mid-price
75
- # 3. Log-transform sizes with log1p
76
- # 4. Z-score normalize using means/stds
77
- raw_data = ... # your (100, 40) LOB snapshot array
78
- normalized = (raw_data - means) / stds
79
- x = torch.from_numpy(normalized).unsqueeze(0).float()
80
 
81
  with torch.no_grad():
82
- logits = model(x)
83
- probs = torch.softmax(logits, dim=1)
84
- pred = logits.argmax(dim=1).item()
85
 
86
- labels = ["主力买入 (Institutional Buy)", "中性 (Neutral)", "主力卖出 (Institutional Sell)"]
87
- print(f"预测: {labels[pred]}, 置信度: {probs[0, pred]:.1%}")
88
  ```
89
 
90
- ## 训练细节 / Training Details
91
 
92
- - **Dataset**: [LeonardoBerti/TRADES-LOB](https://huggingface.co/datasets/LeonardoBerti/TRADES-LOB) (265K order events, 10-level LOB)
93
- - **Label Construction**: Order Flow Imbalance (OFI) + Large Order Ratio + Cancellation Rate
94
- - **Loss**: Focal Loss (γ=2.0) + Label Smoothing (0.1) + Class Weighting
95
- - **Regularization**: Dropout 0.4, Weight Decay 5e-4, Mixup Augmentation (α=0.3)
96
- - **Optimizer**: AdamW, lr=3e-4, Cosine Annealing with Warm Restarts
97
 
98
- ## 参考 / References
99
 
100
- - DeepLOB: Zhang et al., TNNLS 2019 (arxiv:1808.03668)
101
- - TLOB: Berti & Kasneci, 2025 (arxiv:2502.15757)
 
 
 
102
 
103
- ## 声明 / Disclaimer
104
 
105
  本模型仅供研究学习使用,不构成任何投资建议。股市有风险,入市需谨慎。
106
- This model is for research purposes only. Not investment advice.
 
3
  - finance
4
  - order-book
5
  - institutional-trading
6
+ - algorithmic-trading
7
  - level-2
8
  - A-share
9
  - LOB
10
+ - TWAP
11
+ - VWAP
12
+ - iceberg-orders
13
  - pytorch
14
  license: mit
15
  ---
16
 
17
+ # LOBAlgoNet 主力算法单模式识别
18
 
19
+ ## 📌 核心思路
20
 
21
+ **真实行情数据没有标签**——我们无法直接知道哪笔委托来自"主力"。但主力几乎都使用算法单(TWAP、VWAP、冰山订单等),
+ 这些算法会在委托簿中留下可识别的特征签名。
23
 
24
+ 本模型的方法:
25
+ 1. **规则引擎检测算法签名** → 生成伪标签(无需人工标注)
26
+ 2. **深度学习模型学习** → 从原始LOB数据直接识别这些模式,泛化能力超越规则(端到端流程示意见下方代码)
27
 
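An end-to-end sketch of the pseudo-label → training pipeline described above. Assumptions not stated in this commit: the TRADES-LOB column names used in algo_detector.py, a small data subset, demo z-score statistics computed on the fly (the released norm_stats.npz may additionally convert prices to basis points and log1p sizes), and a single plain cross-entropy epoch; the authors' actual training recipe is not part of this commit.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from datasets import load_dataset

from algo_detector import generate_pseudo_labels
from model import LOBAlgoNet

# Small subset purely for illustration; the full split has ~265K events.
df = load_dataset("LeonardoBerti/TRADES-LOB", split="train").to_pandas().head(20000)
labels, scores, feats = generate_pseudo_labels(df)           # rule-based pseudo-labels

# 40 LOB columns in the order the model expects: [ask_p1, ask_s1, bid_p1, bid_s1, ...]
lob_cols = [c for i in range(1, 11)
            for c in (f"ask_price_{i}", f"ask_size_{i}", f"bid_price_{i}", f"bid_size_{i}")]
lob = df[lob_cols].to_numpy(np.float32)
lob[np.abs(lob) > 1e9] = 0                                   # sentinel values → 0

T = 100                                                      # window of 100 snapshots
X = np.stack([lob[i - T:i] for i in range(T, len(lob))])     # (N-T, 100, 40)
y = labels[T - 1:-1]                                         # label of each window's last snapshot
means, stds = X.mean((0, 1)), X.std((0, 1)) + 1e-8           # demo z-score stats
X = (X - means) / stds

model = LOBAlgoNet(num_classes=5, d_model=128, nhead=4, dropout=0.25)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=5e-4)
loader = DataLoader(TensorDataset(torch.from_numpy(X), torch.from_numpy(y)),
                    batch_size=64, shuffle=True)
for xb, yb in loader:                                        # one epoch shown
    opt.zero_grad()
    F.cross_entropy(model(xb), yb).backward()
    opt.step()
```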
28
+ ## 🏗️ 架构
29
 
30
  ```
31
+ Input: 100个连续Level-2快照 (batch, 100, 40)
+ ↓ BilinearNorm (自适应归一化)
+ ↓ Spatial CNN (十档价位间的空间模式)
+ ↓ Temporal CNN (多尺度时间特征)
+ ↓ Transformer Attention × 2 (时序依赖)
+ ↓ 5-class Classification
 
37
  ```
38
 
39
+ 参数量: 338,525
40
 
41
+ ## 🎯 识别的5种模式
42
 
43
+ | ID | 类型 | 说明 | 关键特征 |
44
+ |----|------|------|----------|
45
+ | 0 | **TWAP** | 时间加权平均价 | 等量等间隔下单,被动-主动交替 |
46
+ | 1 | **VWAP** | 量加权平均价 | 跟随市场成交量节奏,参与率稳定 |
47
+ | 2 | **ICEBERG** | 冰山订单 | 一档反复补单,实际量>>显示量 |
48
+ | 3 | **SUPPORT** | 护盘/支撑位 | 关键价位持续大单,深度不对称 |
49
+ | 4 | **NORMAL** | 散户/正常 | 无明显算法特征 |
50
 
51
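For instance, the ICEBERG signature in the table corresponds to the level-1 refill check implemented in algo_detector.py; a toy illustration with a hand-made bid-1 queue (not real market data):

```python
import numpy as np
from algo_detector import compute_refill_ratio

# Toy bid-1 queue: a large displayed order is eaten and immediately refilled twice near the end.
bid1 = np.array([500] * 20 + [80, 480, 70, 490, 500], dtype=np.float32)
ask1 = np.full_like(bid1, 300)                       # quiet ask side
print(compute_refill_ratio(ask1, bid1, window=20)[-1])   # 0.1: two refill events in the last 20 ticks
```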
+ ## 📊 性能
52
 
53
  | Metric | Value |
54
  |--------|-------|
55
+ | Test Accuracy | 0.5273 |
56
+ | Test F1 (Macro) | 0.3538 |
57
+ | Test F1 (Weighted) | 0.5822 |
 
 
 
 
58
 
59
+ Per-class:
60
+ | 类型 | Precision | Recall |
61
+ |------|-----------|--------|
62
+ | TWAP | 0.0000 | 0.0000 |
63
+ | VWAP | 0.1739 | 0.4444 |
64
+ | ICEBERG | 0.9820 | 0.5573 |
65
+ | SUPPORT | 0.4684 | 0.5764 |
66
+ | NORMAL | 0.2338 | 0.3858 |
67
+
68
+ ## 🚀 使用方法
69
 
70
  ```python
71
+ import torch, numpy as np
72
+ from model import LOBAlgoNet
 
73
 
74
+ model = LOBAlgoNet(num_classes=5, d_model=128, nhead=4, dropout=0.25)
 
75
  model.load_state_dict(torch.load("model.pt", weights_only=True))
76
  model.eval()
77
 
 
78
  stats = np.load("norm_stats.npz")
79
+ # raw_lob: your (100, 40) LOB data, normalized with stats["means"] and stats["stds"]
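+ # Note (from the previous model card): preprocessing also replaced sentinel values (|x| > 1e9)
+ # with 0, expressed prices in basis points relative to the mid-price, and log1p-transformed
+ # sizes before z-scoring; verify how norm_stats.npz was produced before relying on the line below.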
80
+ x = torch.from_numpy((raw_lob - stats["means"]) / stats["stds"]).unsqueeze(0).float()
 
81
 
82
  with torch.no_grad():
83
+     probs = torch.softmax(model(x), dim=1)
+     pred = probs.argmax(1).item()
 
85
 
86
+ labels = ["TWAP(时间加权)", "VWAP(量加权)", "冰山订单", "护盘/支撑", "散户/正常"]
87
+ print(f"识别结果: {labels[pred]} (置信度: {probs[0,pred]:.1%})")
88
  ```
89
 
90
+ ## 💡 如何用于实盘
91
 
92
+ 1. **接入Level-2行情**: 通过万得、聚宽、TuShare Pro等获取十档委托数据
+ 2. **实时推理**: 每收到新快照,用最近100个快照组成输入(滑动窗口示例见下)
+ 3. **信号融合**: 结合MACD、成交量异动等传统指标,多信号确认
+ 4. **跟随策略**: 检测到TWAP/VWAP买入 → 考虑跟随;检测到SUPPORT → 关注支撑位
 
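A minimal sketch of step 2, the rolling 100-snapshot inference loop. Assumptions: `on_snapshot` is a hypothetical callback fed by your Level-2 source, snapshots arrive as (40,) arrays in the model's column order, and the same normalization caveat as in the usage snippet applies.

```python
from collections import deque

import numpy as np
import torch

from model import LOBAlgoNet

model = LOBAlgoNet(num_classes=5, d_model=128, nhead=4, dropout=0.25)
model.load_state_dict(torch.load("model.pt", weights_only=True))
model.eval()
stats = np.load("norm_stats.npz")

window = deque(maxlen=100)                        # rolling buffer of the latest 100 snapshots
labels = ["TWAP", "VWAP", "ICEBERG", "SUPPORT", "NORMAL"]

def on_snapshot(snapshot):                        # snapshot: (40,) Level-2 row from your feed
    window.append(np.asarray(snapshot, dtype=np.float32))
    if len(window) < 100:
        return None                               # not enough history yet
    x = (np.stack(window) - stats["means"]) / stats["stds"]
    x = torch.from_numpy(x).unsqueeze(0).float()  # (1, 100, 40)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    pred = int(probs.argmax())
    return labels[pred], float(probs[pred])       # e.g. ("ICEBERG", 0.83)
```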
96
 
97
+ ## 📚 参考文献
98
 
99
+ - DeepLOB (Zhang et al., TNNLS 2019, arxiv:1808.03668)
100
+ - TLOB (Berti & Kasneci, 2025, arxiv:2502.15757)
101
+ - MarS (Li et al., Microsoft, arxiv:2409.07486) — TWAP签名定义
102
+ - PULSE (Cartea et al., 2023, arxiv:2312.05827) — 多时钟特征
103
+ - CoLES (Babaev et al., 2022, arxiv:2002.08232) — 对比学习
104
 
105
+ ## ⚠️ 声明
106
 
107
  本模型仅供研究学习使用,不构成任何投资建议。股市有风险,入市需谨慎。
 
algo_detector.py ADDED
@@ -0,0 +1,465 @@
1
+ """
2
+ 算法单签名检测器 (Algorithm Order Signature Detector)
3
+
4
+ 从原始无标签的Level-2委托单数据中,通过规则引擎检测主力常用的算法单模式,
5
+ 生成伪标签用于训练深度学习模型。
6
+
7
+ 检测的5种模式:
8
+ 0: TWAP - 时间加权平均价算法 (等量等间隔下单)
9
+ 1: VWAP - 成交量加权平均价算法 (跟随市场成交量节奏)
10
+ 2: ICEBERG - 冰山订单 (显示小量,实际大量,一档反复补单)
11
+ 3: SUPPORT - 护盘/支撑 (关键价位持续大单挂单)
12
+ 4: NORMAL - 正常/散户 (无明显算法特征)
13
+
14
+ 数据输入格式 (原始Level-2行情):
15
+ - 10档买卖委托 (ask_price_1..10, ask_size_1..10, bid_price_1..10, bid_size_1..10)
16
+ - 逐笔委托 (ORDER_ID, PRICE, SIZE, BUY_SELL_FLAG, TYPE)
17
+
18
+ 参考文献:
19
+ - MarS (arxiv:2409.07486): TWAP签名定义
20
+ - PULSE (arxiv:2312.05827): 多时钟特征工程
21
+ - Hautsch & Huang (2012): 冰山订单识别
22
+ """
23
+
24
+ import numpy as np
25
+ import pandas as pd
26
+ from scipy.signal import find_peaks
27
+ from scipy.stats import pearsonr
28
+
29
+
30
+ # ============================================================
31
+ # 特征提取器 (Feature Extractors)
32
+ # ============================================================
33
+
34
+ def compute_order_size_cv(sizes, window=20):
35
+ """
36
+ 计算订单大小的变异系数 (Coefficient of Variation)
37
+ TWAP特征: CV < 0.15 表明等量下单
38
+ """
39
+ N = len(sizes)
40
+ cv = np.ones(N) * 999 # 默认高变异(非TWAP)
41
+ for i in range(window, N):
42
+ w = sizes[i-window:i]
43
+ mean_w = w.mean()
44
+ if mean_w > 0:
45
+ cv[i] = w.std() / mean_w
46
+ return cv
47
+
48
+
49
+ def compute_periodicity(timestamps, window=20, expected_lag=None):
50
+ """
51
+ 计算下单的周期性得分
52
+ TWAP特征: 等间隔下单 → 自相关函数在lag=Δt处有峰值
53
+ """
54
+ N = len(timestamps)
55
+ periodicity = np.zeros(N)
56
+
57
+ for i in range(window, N):
58
+ ts = timestamps[i-window:i]
59
+ # 计算相邻间隔
60
+ intervals = np.diff(ts)
61
+ if len(intervals) < 3 or intervals.std() == 0:
62
+ continue
63
+
64
+ mean_interval = intervals.mean()
65
+ std_interval = intervals.std()
66
+
67
+ # 间隔的规律性: 1 - CV(intervals), 越接近1越规律
68
+ if mean_interval > 0:
69
+ regularity = max(0, 1 - std_interval / mean_interval)
70
+ periodicity[i] = regularity
71
+
72
+ return periodicity
73
+
74
+
75
+ def compute_cancel_burst_ratio(types, timestamps, window=20, boundary_frac=0.2):
76
+ """
77
+ 计算撤单在时间窗口边界的集中度
78
+ TWAP特征: 在每个Δt结束时集中撤单
79
+ """
80
+ N = len(types)
81
+ cancel_burst = np.zeros(N)
82
+ is_cancel = (types == 'ORDER_CANCELLED').astype(float)
83
+
84
+ for i in range(window, N):
85
+ total_cancel = is_cancel[i-window:i].sum()
86
+ if total_cancel == 0:
87
+ continue
88
+
89
+ # 最后20%的时间窗口内的撤单比例
90
+ boundary_start = int(window * (1 - boundary_frac))
91
+ boundary_cancel = is_cancel[i-window+boundary_start:i].sum()
92
+ cancel_burst[i] = boundary_cancel / total_cancel
93
+
94
+ return cancel_burst
95
+
96
+
97
+ def compute_passive_aggressive_ratio(prices, mid_prices, buy_sell, window=20):
98
+ """
99
+ 计算被动/主动订单比例
100
+ TWAP特征: 被动-主动交替模式 (25s被动挂bid1, 5s主动扫ask)
101
+
102
+ 被动: 买单价 <= mid_price (挂在bid侧) 或 卖单价 >= mid_price (挂在ask侧)
103
+ 主动: 买单价 > mid_price (扫ask侧) 或 卖单价 < mid_price (扫bid侧)
104
+ """
105
+ N = len(prices)
106
+ pa_ratio = np.zeros(N)
107
+
108
+ is_aggressive = np.zeros(N)
109
+ for i in range(N):
110
+ if buy_sell[i]: # 买单
111
+ is_aggressive[i] = 1 if prices[i] >= mid_prices[i] else 0
112
+ else: # 卖单
113
+ is_aggressive[i] = 1 if prices[i] <= mid_prices[i] else 0
114
+
115
+ # 滚动计算被动/主动比例
116
+ cum_agg = np.cumsum(is_aggressive)
117
+ for i in range(window, N):
118
+ total_agg = cum_agg[i] - cum_agg[i - window]
119
+ pa_ratio[i] = total_agg / window # 主动比例
120
+
121
+ return pa_ratio
122
+
123
+
124
+ def compute_participation_rate(sizes, total_market_volume, window=20):
125
+ """
126
+ 计算参与率稳定性
127
+ VWAP特征: 参与率 ≈ 常数 (10-20%)
128
+ """
129
+ N = len(sizes)
130
+ participation_stability = np.ones(N) * 999
131
+
132
+ cum_sizes = np.cumsum(sizes)
133
+ cum_market = np.cumsum(total_market_volume)
134
+
135
+ for i in range(window, N):
136
+ # 每个子窗口的参与率
137
+ sub_window = max(1, window // 5)
138
+ rates = []
139
+ for j in range(5):
140
+ start = i - window + j * sub_window
141
+ end = min(start + sub_window, i)
142
+ if end <= start:
143
+ continue
144
+ vol = cum_sizes[end] - cum_sizes[start]
145
+ market_vol = cum_market[end] - cum_market[start]
146
+ if market_vol > 0:
147
+ rates.append(vol / market_vol)
148
+
149
+ if len(rates) >= 3:
150
+ rates = np.array(rates)
151
+ mean_rate = rates.mean()
152
+ if mean_rate > 0:
153
+ participation_stability[i] = rates.std() / mean_rate
154
+
155
+ return participation_stability
156
+
157
+
158
+ def compute_volume_correlation(sizes, buy_sell, total_market_volume, window=50):
159
+ """
160
+ 计算子订单量与市场成交量的相关性
161
+ VWAP特征: Pearson(child_vol, market_vol) > 0.7
162
+ """
163
+ N = len(sizes)
164
+ vol_corr = np.zeros(N)
165
+
166
+ for i in range(window, N):
167
+ child_vols = sizes[i-window:i]
168
+ market_vols = total_market_volume[i-window:i]
169
+
170
+ if child_vols.std() > 0 and market_vols.std() > 0:
171
+ corr, _ = pearsonr(child_vols, market_vols)
172
+ vol_corr[i] = max(0, corr)
173
+
174
+ return vol_corr
175
+
176
+
177
+ def compute_refill_ratio(ask_sizes_1, bid_sizes_1, window=20, refill_threshold=0.7):
178
+ """
179
+ 计算一档补单比率
180
+ 冰山订单特征: 成交后一档量瞬间恢复
181
+
182
+ 检测: V_level1(t) 大幅下降后又快速恢复到接近原值
183
+ """
184
+ N = len(ask_sizes_1)
185
+ refill_score = np.zeros(N)
186
+
187
+ for side_sizes in [ask_sizes_1, bid_sizes_1]:
188
+ for i in range(2, N):
189
+ prev = side_sizes[i-2]
190
+ curr = side_sizes[i-1]
191
+ next_v = side_sizes[i]
192
+
193
+ # 检测: 先减后增 (V大→V小→V大)
194
+ if prev > 0 and curr < prev * 0.5 and next_v > prev * refill_threshold:
195
+ refill_score[i] += 1
196
+
197
+ # 滚动窗口内的平均补单频率
198
+ cum_refill = np.cumsum(refill_score)
199
+ result = np.zeros(N)
200
+ for i in range(window, N):
201
+ result[i] = (cum_refill[i] - cum_refill[i - window]) / window
202
+
203
+ return result
204
+
205
+
206
+ def compute_hidden_volume_ratio(sizes, ask_sizes_1, bid_sizes_1, buy_sell, window=50):
207
+ """
208
+ 计算隐藏量比率
209
+ 冰山订单特征: total_executed / max_displayed > 3.0
210
+ """
211
+ N = len(sizes)
212
+ hidden_ratio = np.zeros(N)
213
+
214
+ for i in range(window, N):
215
+ # 总成交量
216
+ total_vol = sizes[i-window:i].sum()
217
+
218
+ # 最大显示量 (一档的最大值)
219
+ max_displayed = max(
220
+ ask_sizes_1[i-window:i].max(),
221
+ bid_sizes_1[i-window:i].max(),
222
+ 1 # 避免除以0
223
+ )
224
+
225
+ hidden_ratio[i] = total_vol / max_displayed
226
+
227
+ return hidden_ratio
228
+
229
+
230
+ def compute_level_persistence(lob_sizes, window=50, big_order_percentile=90):
231
+ """
232
+ 计算各价位大单持续性得分
233
+ 支撑/阻力位特征: 某价位长期保持大单
234
+
235
+ lob_sizes: (N, 20) - 10档买卖量 [ask_s_1..10, bid_s_1..10]
236
+ """
237
+ N = lob_sizes.shape[0]
238
+ threshold = np.percentile(lob_sizes[lob_sizes > 0], big_order_percentile)
239
+
240
+ persistence = np.zeros(N)
241
+ for i in range(window, N):
242
+ w = lob_sizes[i-window:i] # (window, 20)
243
+
244
+ # 每档的持续大单得分
245
+ max_persistence = 0
246
+ for level in range(20):
247
+ level_big = (w[:, level] > threshold).sum() / window
248
+ max_persistence = max(max_persistence, level_big)
249
+
250
+ persistence[i] = max_persistence
251
+
252
+ return persistence
253
+
254
+
255
+ def compute_depth_imbalance(ask_sizes, bid_sizes, top_levels=3):
256
+ """
257
+ 计算深度不平衡度
258
+ 支撑位特征: bid侧大量堆单 → imbalance > 0
259
+ 阻力位特征: ask侧大量堆单 → imbalance < 0
260
+ """
261
+ bid_depth = bid_sizes[:, :top_levels].sum(axis=1)
262
+ ask_depth = ask_sizes[:, :top_levels].sum(axis=1)
263
+
264
+ total = bid_depth + ask_depth + 1e-8
265
+ imbalance = (bid_depth - ask_depth) / total
266
+
267
+ return imbalance
268
+
269
+
270
+ def compute_ofi_multi_scale(ask_sizes_1, bid_sizes_1, windows=[5, 10, 20, 50]):
271
+ """
272
+ 多尺度订单流不平衡 (Order Flow Imbalance)
273
+ PULSE论文中的核心特征
274
+ """
275
+ N = len(ask_sizes_1)
276
+ features = {}
277
+
278
+ imb = (bid_sizes_1 - ask_sizes_1) / (bid_sizes_1 + ask_sizes_1 + 1e-8)
279
+
280
+ for w in windows:
281
+ # 滚动均值
282
+ cum = np.cumsum(imb)
283
+ roll_mean = np.zeros(N)
284
+ roll_mean[w:] = (cum[w:] - cum[:-w]) / w
285
+ features[f'ofi_{w}'] = roll_mean
286
+
287
+ # 滚动标准差
288
+ cum_sq = np.cumsum(imb ** 2)
289
+ roll_var = np.zeros(N)
290
+ roll_var[w:] = (cum_sq[w:] - cum_sq[:-w]) / w - roll_mean[w:] ** 2
291
+ features[f'ofi_vol_{w}'] = np.sqrt(np.maximum(roll_var, 0))
292
+
293
+ return features
294
+
295
+
296
+ # ============================================================
297
+ # 伪标签生成器 (Pseudo-Label Generator)
298
+ # ============================================================
299
+
300
+ def generate_pseudo_labels(df, verbose=True):
301
+ """
302
+ 从原始Level-2数据生成伪标签。
303
+
304
+ 输入: DataFrame,包含ORDER_ID, PRICE, SIZE, BUY_SELL_FLAG, TYPE,
305
+ ask_price_1..10, ask_size_1..10, bid_price_1..10, bid_size_1..10
306
+
307
+ 输出:
308
+ labels: (N,) int64, 0=TWAP, 1=VWAP, 2=ICEBERG, 3=SUPPORT, 4=NORMAL
309
+ scores: (N, 4) float32, 每种算法的置信度分数
310
+ features: (N, F) float32, 提取的全部特征
311
+ """
312
+ N = len(df)
313
+
314
+ # 基础数据
315
+ sizes = df['SIZE'].values.astype(np.float32)
316
+ buy_sell = df['BUY_SELL_FLAG'].values.astype(np.float32)
317
+ types = df['TYPE'].values
318
+ prices = df['PRICE'].values.astype(np.float32)
319
+
320
+ ask_sizes_1 = df['ask_size_1'].values.astype(np.float32)
321
+ bid_sizes_1 = df['bid_size_1'].values.astype(np.float32)
322
+ ask_price_1 = df['ask_price_1'].values.astype(np.float32)
323
+ bid_price_1 = df['bid_price_1'].values.astype(np.float32)
324
+
325
+ # 替换sentinel值
326
+ for arr in [ask_sizes_1, bid_sizes_1, ask_price_1, bid_price_1]:
327
+ arr[np.abs(arr) > 1e9] = 0
328
+
329
+ # mid price
330
+ valid = (ask_price_1 > 0) & (bid_price_1 > 0)
331
+ mid_prices = np.where(valid, (ask_price_1 + bid_price_1) / 2.0, 0.0)
332
+ for i in range(1, N):
333
+ if mid_prices[i] == 0 and mid_prices[i-1] != 0:
334
+ mid_prices[i] = mid_prices[i-1]
335
+
336
+ # 收集10档量
337
+ ask_sizes = np.zeros((N, 10), dtype=np.float32)
338
+ bid_sizes = np.zeros((N, 10), dtype=np.float32)
339
+ for i in range(10):
340
+ ask_s = df[f'ask_size_{i+1}'].values.astype(np.float32)
341
+ bid_s = df[f'bid_size_{i+1}'].values.astype(np.float32)
342
+ ask_s[np.abs(ask_s) > 1e9] = 0
343
+ bid_s[np.abs(bid_s) > 1e9] = 0
344
+ ask_sizes[:, i] = ask_s
345
+ bid_sizes[:, i] = bid_s
346
+
347
+ # 时间戳: 数据无真实时间列,用事件序号近似 (注意: 等差序号会使periodicity特征恒为0)
348
+ timestamps = np.arange(N, dtype=np.float32)
349
+
350
+ if verbose:
351
+ print("Computing algorithm signatures...")
352
+
353
+ # ============ TWAP特征 ============
354
+ order_cv = compute_order_size_cv(sizes, window=20)
355
+ periodicity = compute_periodicity(timestamps, window=20)
356
+ cancel_burst = compute_cancel_burst_ratio(types, timestamps, window=20)
357
+ pa_ratio = compute_passive_aggressive_ratio(prices, mid_prices, buy_sell, window=20)
358
+
359
+ # TWAP得分: 低变异 + 高周期性 + 边界撤单 + 被动为主
360
+ twap_score = np.zeros(N)
361
+ twap_score += np.clip(1 - order_cv / 0.3, 0, 1) * 0.35 # CV < 0.3 → 高分
362
+ twap_score += periodicity * 0.30 # 周期性
363
+ twap_score += np.clip(cancel_burst / 0.5, 0, 1) * 0.20 # 撤单集中度
364
+ twap_score += np.clip(1 - pa_ratio / 0.3, 0, 1) * 0.15 # 被动为主
365
+
366
+ # ============ VWAP特征 ============
367
+ # 用全局sizes作为market_volume的代理
368
+ market_vol = np.convolve(sizes, np.ones(10)/10, mode='same') # 滑动平均
369
+ part_stability = compute_participation_rate(sizes, market_vol, window=50)
370
+ vol_corr = compute_volume_correlation(sizes, buy_sell, market_vol, window=50)
371
+
372
+ # VWAP得分: 稳定参与率 + 高量相关性
373
+ vwap_score = np.zeros(N)
374
+ vwap_score += np.clip(1 - part_stability / 0.5, 0, 1) * 0.50 # 参与率稳定
375
+ vwap_score += vol_corr * 0.50 # 量相关
376
+
377
+ # ============ 冰山订单特征 ============
378
+ refill = compute_refill_ratio(ask_sizes_1, bid_sizes_1, window=20)
379
+ hidden_vol = compute_hidden_volume_ratio(sizes, ask_sizes_1, bid_sizes_1, buy_sell, window=50)
380
+
381
+ # 冰山得分: 高补单率 + 高隐藏量比
382
+ iceberg_score = np.zeros(N)
383
+ iceberg_score += np.clip(refill / 0.5, 0, 1) * 0.50 # 补单频率
384
+ iceberg_score += np.clip(hidden_vol / 5.0, 0, 1) * 0.50 # 隐藏量比
385
+
386
+ # ============ 支撑/阻力位特征 ============
387
+ lob_sizes = np.concatenate([ask_sizes, bid_sizes], axis=1) # (N, 20)
388
+ persistence = compute_level_persistence(lob_sizes, window=50)
389
+ depth_imb = compute_depth_imbalance(ask_sizes, bid_sizes, top_levels=3)
390
+
391
+ # 支撑得分: 高持续性 + 不平衡度大
392
+ support_score = np.zeros(N)
393
+ support_score += persistence * 0.50 # 大单持续性
394
+ support_score += np.clip(np.abs(depth_imb) / 0.5, 0, 1) * 0.50 # 深度不平衡
395
+
396
+ # ============ 多尺度OFI (通用特征) ============
397
+ ofi_features = compute_ofi_multi_scale(ask_sizes_1, bid_sizes_1, windows=[5, 10, 20, 50])
398
+
399
+ # ============ 合并所有得分和特征 ============
400
+ scores = np.stack([twap_score, vwap_score, iceberg_score, support_score], axis=1)
401
+
402
+ # 伪标签: 每种模式用各自的百分位阈值
403
+ max_scores = scores.max(axis=1)
404
+ labels = np.full(N, 4, dtype=np.int64) # 默认NORMAL
405
+
406
+ # 每种模式单独设阈值 (取前15-25%为该类)
407
+ for cls in range(4):
408
+ cls_scores = scores[:, cls]
409
+ valid_scores = cls_scores[cls_scores > 0.01]
410
+ if len(valid_scores) > 0:
411
+ thr = np.percentile(valid_scores, 80) # top 20%
412
+ labels[(cls_scores >= thr) & (cls_scores > 0.2)] = cls
413
+
414
+ # 特征矩阵
415
+ all_features = np.column_stack([
416
+ order_cv, periodicity, cancel_burst, pa_ratio,
417
+ part_stability, vol_corr,
418
+ refill, hidden_vol,
419
+ persistence, depth_imb,
420
+ *[ofi_features[k] for k in sorted(ofi_features.keys())]
421
+ ]).astype(np.float32)
422
+
423
+ # 替换NaN/Inf
424
+ all_features = np.nan_to_num(all_features, nan=0.0, posinf=0.0, neginf=0.0)
425
+
426
+ if verbose:
427
+ label_names = {0: 'TWAP', 1: 'VWAP', 2: 'ICEBERG', 3: 'SUPPORT', 4: 'NORMAL'}
428
+ unique, counts = np.unique(labels, return_counts=True)
429
+ print(f"Pseudo-label distribution:")
430
+ for u, c in zip(unique, counts):
431
+ print(f" {u} ({label_names[u]}): {c} ({c/N*100:.1f}%)")
432
+ print(f"Feature matrix shape: {all_features.shape}")
433
+
434
+ return labels, scores, all_features
435
+
436
+
437
+ # ============================================================
438
+ # 使用示例
439
+ # ============================================================
440
+
441
+ if __name__ == "__main__":
442
+ from datasets import load_dataset
443
+
444
+ print("Loading TRADES-LOB dataset...")
445
+ ds = load_dataset("LeonardoBerti/TRADES-LOB", split="train")
446
+ df = ds.to_pandas()
447
+ print(f"Dataset: {len(df)} rows")
448
+
449
+ labels, scores, features = generate_pseudo_labels(df)
450
+
451
+ print(f"\nLabel shape: {labels.shape}")
452
+ print(f"Score shape: {scores.shape}")
453
+ print(f"Feature shape: {features.shape}")
454
+
455
+ # 展示每种模式的top案例
456
+ label_names = {0: 'TWAP', 1: 'VWAP', 2: 'ICEBERG', 3: 'SUPPORT', 4: 'NORMAL'}
457
+ for cls in range(4):
458
+ cls_mask = labels == cls
459
+ if cls_mask.sum() > 0:
460
+ top_idx = np.where(cls_mask)[0]
461
+ top_scores = scores[top_idx, cls]
462
+ best = top_idx[top_scores.argmax()]
463
+ print(f"\n{label_names[cls]} 最高置信度样本 (idx={best}, score={scores[best, cls]:.3f}):")
464
+ print(f" SIZE={df.iloc[best]['SIZE']}, PRICE={df.iloc[best]['PRICE']}, "
465
+ f"BUY_SELL={'Buy' if df.iloc[best]['BUY_SELL_FLAG'] else 'Sell'}")
config.json CHANGED
@@ -1,43 +1,51 @@
1
  {
2
- "model_type": "LOBPatternNetV3",
3
- "architecture": "CNN (Spatial) + CNN (Temporal) + Transformer Attention",
4
- "num_levels": 10,
5
- "seq_len": 100,
6
- "num_classes": 3,
7
- "d_model": 64,
8
- "nhead": 4,
9
- "dropout": 0.4,
10
- "total_parameters": 338395,
11
  "class_names": [
12
- "主力买入 (Buy)",
13
- "中性 (Neutral)",
14
- "主力卖出 (Sell)"
 
 
15
  ],
16
  "class_names_zh": [
17
- "主力买入",
18
- "中性/散户",
19
- "主力卖出"
 
 
20
  ],
21
- "test_accuracy": 0.3979949874686717,
22
- "test_f1_macro": 0.37945545254276664,
23
- "test_f1_weighted": 0.4203155400216087,
 
 
 
 
 
24
  "test_precision": [
25
- 0.24180967238689546,
26
- 0.7696245733788396,
27
- 0.24479166666666666
 
 
28
  ],
29
  "test_recall": [
30
- 0.5400696864111498,
31
- 0.3205401563610519,
32
- 0.6245847176079734
 
 
33
  ],
34
  "training_dataset": "LeonardoBerti/TRADES-LOB",
35
- "normalization": "z-score (means/stds in norm_stats.npz)",
36
- "label_construction": {
37
- "method": "OFI + large_order_ratio + cancellation_rate",
38
- "window": 50,
39
- "ofi_threshold": 0.15,
40
- "large_order_percentile": 85,
41
- "score_percentile": 80
42
- }
 
43
  }
 
1
  {
2
+ "model_type": "LOBAlgoNet",
3
+ "architecture": "BilinearNorm + Spatial CNN + Temporal CNN + Transformer Attention",
4
+ "num_classes": 5,
 
 
 
 
 
 
5
  "class_names": [
6
+ "TWAP",
7
+ "VWAP",
8
+ "ICEBERG",
9
+ "SUPPORT",
10
+ "NORMAL"
11
  ],
12
  "class_names_zh": [
13
+ "TWAP(时间加权)",
14
+ "VWAP(量加权)",
15
+ "冰山订单",
16
+ "护盘/支撑",
17
+ "散户/正常"
18
  ],
19
+ "seq_len": 100,
20
+ "d_model": 128,
21
+ "nhead": 4,
22
+ "dropout": 0.25,
23
+ "total_parameters": 338525,
24
+ "test_accuracy": 0.5273182957393484,
25
+ "test_f1_macro": 0.35382369182978163,
26
+ "test_f1_weighted": 0.5822412019482407,
27
  "test_precision": [
28
+ 0.0,
29
+ 0.17391304347826086,
30
+ 0.9820359281437125,
31
+ 0.4684317718940937,
32
+ 0.23381294964028776
33
  ],
34
  "test_recall": [
35
+ 0.0,
36
+ 0.4444444444444444,
37
+ 0.5573491928632116,
38
+ 0.5764411027568922,
39
+ 0.3857566765578635
40
  ],
41
  "training_dataset": "LeonardoBerti/TRADES-LOB",
42
+ "labeling_method": "Rule-based pseudo-labels from algorithm signature detection",
43
+ "label_description": {
44
+ "0_TWAP": "Time-Weighted Average Price execution (equal-size periodic orders)",
45
+ "1_VWAP": "Volume-Weighted Average Price execution (volume-proportional orders)",
46
+ "2_ICEBERG": "Iceberg/hidden orders (level-1 refill after fill, hidden volume)",
47
+ "3_SUPPORT": "Support/resistance defense (persistent large orders at key levels)",
48
+ "4_NORMAL": "Normal retail activity (no algorithmic signature detected)"
49
+ },
50
+ "normalization": "z-score (means/stds in norm_stats.npz)"
51
  }
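These fields map directly onto the `LOBAlgoNet` constructor, so the network can be rebuilt from the config alone (a sketch, assuming config.json and model.pt sit next to model.py):

```python
import json

import torch

from model import LOBAlgoNet

with open("config.json") as f:
    cfg = json.load(f)

# Rebuild the network from the stored hyperparameters and load the LFS weights.
model = LOBAlgoNet(num_classes=cfg["num_classes"], d_model=cfg["d_model"],
                   nhead=cfg["nhead"], dropout=cfg["dropout"])
model.load_state_dict(torch.load("model.pt", weights_only=True))
model.eval()
print(cfg["class_names"])   # ['TWAP', 'VWAP', 'ICEBERG', 'SUPPORT', 'NORMAL']
```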
model.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:af087d34c5d65978edcad41f0dd607b51b6a937a55de3ac73a0f194ed0029b91
3
- size 1377888
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2f734cbdc987b34df4464f0b8f73acd93beeaa1701cc55dd12a94f2ff757b299
3
+ size 1378336
model.py CHANGED
@@ -1,60 +1,81 @@
1
- """LOBPatternNet V3 - for loading saved model weights."""
2
  import torch
3
  import torch.nn as nn
 
4
 
5
  class BilinearNorm(nn.Module):
6
- def __init__(self, num_features):
7
  super().__init__()
8
- self.gamma = nn.Parameter(torch.ones(1, 1, num_features))
9
- self.beta = nn.Parameter(torch.zeros(1, 1, num_features))
10
- self.gate = nn.Parameter(torch.ones(1, 1, num_features))
11
  def forward(self, x):
12
- mean = x.mean(dim=1, keepdim=True)
13
- std = x.std(dim=1, keepdim=True) + 1e-8
14
- x_norm = (x - mean) / std
15
- gate = torch.sigmoid(self.gate)
16
- return gate * (self.gamma * x_norm + self.beta) + (1 - gate) * x
17
 
18
- class LOBPatternNetV3(nn.Module):
19
- def __init__(self, num_classes=3, d_model=128, nhead=4, dropout=0.25):
 
 
 
 
 
 
20
  super().__init__()
21
  self.norm = BilinearNorm(40)
 
 
22
  self.spatial = nn.Sequential(
23
- nn.Conv2d(1, 32, kernel_size=(1, 2), stride=(1, 2)),
24
  nn.BatchNorm2d(32), nn.LeakyReLU(0.01),
25
- nn.Conv2d(32, 32, kernel_size=(1, 2), stride=(1, 2)),
26
  nn.BatchNorm2d(32), nn.LeakyReLU(0.01),
27
- nn.Conv2d(32, 32, kernel_size=(1, 10)),
28
  nn.BatchNorm2d(32), nn.LeakyReLU(0.01),
29
  )
 
 
30
  self.temporal = nn.Sequential(
31
- nn.Conv1d(32, 64, kernel_size=3, padding=1),
32
- nn.BatchNorm1d(64), nn.LeakyReLU(0.01), nn.Dropout(dropout),
33
- nn.Conv1d(64, 64, kernel_size=5, padding=2),
34
- nn.BatchNorm1d(64), nn.LeakyReLU(0.01), nn.Dropout(dropout),
35
- nn.Conv1d(64, d_model, kernel_size=3, padding=1),
36
- nn.BatchNorm1d(d_model), nn.LeakyReLU(0.01), nn.Dropout(dropout),
37
- )
38
- encoder_layer = nn.TransformerEncoderLayer(
39
- d_model=d_model, nhead=nhead, dim_feedforward=d_model*2,
40
- dropout=dropout, batch_first=True, activation="gelu"
41
  )
42
- self.attention = nn.TransformerEncoder(encoder_layer, num_layers=2)
43
- self.classifier = nn.Sequential(
44
- nn.LayerNorm(d_model),
45
- nn.Dropout(dropout),
46
- nn.Linear(d_model, 64),
47
- nn.GELU(),
48
- nn.Dropout(dropout),
 
 
49
  nn.Linear(64, num_classes)
50
  )
 
 
 
 
 
 
 
 
 
51
  def forward(self, x):
52
  x = self.norm(x)
53
- x = x.unsqueeze(1)
54
- x = self.spatial(x)
55
- x = x.squeeze(-1)
 
 
 
 
 
 
 
56
  x = self.temporal(x)
57
- x = x.permute(0, 2, 1)
58
- x = self.attention(x)
59
- x = x.mean(dim=1)
60
- return self.classifier(x)
 
 
1
  import torch
2
  import torch.nn as nn
3
+ import torch.nn.functional as F
4
 
5
  class BilinearNorm(nn.Module):
6
+ def __init__(self, d):
7
  super().__init__()
8
+ self.gamma = nn.Parameter(torch.ones(1,1,d))
9
+ self.beta = nn.Parameter(torch.zeros(1,1,d))
10
+ self.gate = nn.Parameter(torch.ones(1,1,d))
11
  def forward(self, x):
12
+ m = x.mean(1, keepdim=True)
13
+ s = x.std(1, keepdim=True) + 1e-8
14
+ xn = (x - m) / s
15
+ g = torch.sigmoid(self.gate)
16
+ return g * (self.gamma * xn + self.beta) + (1 - g) * x
17
 
18
+
19
+ class LOBAlgoNet(nn.Module):
20
+ """
21
+ CNN + Transformer model for algorithmic order pattern detection.
22
+ Input: (B, T=100, 40) normalized LOB snapshots
23
+ Output: (B, 5) logits for [TWAP, VWAP, ICEBERG, SUPPORT, NORMAL]
24
+ """
25
+ def __init__(self, num_classes=5, d_model=128, nhead=4, dropout=0.25):
26
  super().__init__()
27
  self.norm = BilinearNorm(40)
28
+
29
+ # Spatial CNN: cross-level patterns
30
  self.spatial = nn.Sequential(
31
+ nn.Conv2d(1, 32, (1,2), stride=(1,2)), # 40→20
32
  nn.BatchNorm2d(32), nn.LeakyReLU(0.01),
33
+ nn.Conv2d(32, 32, (1,2), stride=(1,2)), # 20→10
34
  nn.BatchNorm2d(32), nn.LeakyReLU(0.01),
35
+ nn.Conv2d(32, 32, (1,10)), # 10→1
36
  nn.BatchNorm2d(32), nn.LeakyReLU(0.01),
37
  )
38
+
39
+ # Temporal CNN: multi-scale temporal features
40
  self.temporal = nn.Sequential(
41
+ nn.Conv1d(32, 64, 3, padding=1), nn.BatchNorm1d(64), nn.LeakyReLU(0.01), nn.Dropout(dropout),
42
+ nn.Conv1d(64, 64, 5, padding=2), nn.BatchNorm1d(64), nn.LeakyReLU(0.01), nn.Dropout(dropout),
43
+ nn.Conv1d(64, d_model, 3, padding=1), nn.BatchNorm1d(d_model), nn.LeakyReLU(0.01), nn.Dropout(dropout),
 
 
 
 
 
 
 
44
  )
45
+
46
+ # Transformer attention
47
+ enc_layer = nn.TransformerEncoderLayer(d_model, nhead, d_model*2, dropout, batch_first=True, activation='gelu')
48
+ self.attention = nn.TransformerEncoder(enc_layer, num_layers=2)
49
+
50
+ # Classifier
51
+ self.head = nn.Sequential(
52
+ nn.LayerNorm(d_model), nn.Dropout(dropout),
53
+ nn.Linear(d_model, 64), nn.GELU(), nn.Dropout(dropout),
54
  nn.Linear(64, num_classes)
55
  )
56
+
57
+ self._init()
58
+
59
+ def _init(self):
60
+ for m in self.modules():
61
+ if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d)):
62
+ nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
63
+ if m.bias is not None: nn.init.zeros_(m.bias)
64
+
65
  def forward(self, x):
66
  x = self.norm(x)
67
+ x = self.spatial(x.unsqueeze(1)).squeeze(-1) # (B,32,T)
68
+ x = self.temporal(x) # (B,d_model,T)
69
+ x = self.attention(x.permute(0,2,1)) # (B,T,d_model)
70
+ x = x.mean(dim=1) # (B,d_model)
71
+ return self.head(x)
72
+
73
+ def get_embeddings(self, x):
74
+ """提取特征向量,用于聚类/可视化分析"""
75
+ x = self.norm(x)
76
+ x = self.spatial(x.unsqueeze(1)).squeeze(-1)
77
  x = self.temporal(x)
78
+ x = self.attention(x.permute(0,2,1))
79
+ return x.mean(dim=1) # (B, d_model)
80
+
81
+
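A quick shape check for the refactored module (a sketch with random input, not real LOB data):

```python
import torch

from model import LOBAlgoNet

model = LOBAlgoNet(num_classes=5, d_model=128, nhead=4, dropout=0.25).eval()
x = torch.randn(2, 100, 40)              # 2 dummy samples: 100 snapshots x 40 LOB features
with torch.no_grad():
    logits = model(x)                    # (2, 5) class logits
    emb = model.get_embeddings(x)        # (2, 128) pooled features for clustering/visualization
print(logits.shape, emb.shape)
```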