kangkangchen committed on
Commit
d730c93
·
verified ·
1 Parent(s): 6d99076

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +93 -19
  2. config.json +25 -8
  3. data_processor.py +301 -0
  4. model.pt +1 -1
  5. model.py +311 -0
README.md CHANGED
@@ -1,31 +1,105 @@
1
  # LOBPatternNet - 主力下单模式识别模型
 
2
 
3
- ## 模型简介
4
- 基于Level-2十档委托单数据的主力(机构)交易模式识别深度学习模型。
5
 
6
- ## 模型
7
- - **输入**: 100个连续的Level-2委托快照,每个快照包含40个特征(10档/价格+数量)
8
- - **特征工程**: 订单流不平衡(OFI)、价差动态、深度不平衡、大单集中度、价格压力、OFI波动率
9
- - **编码器**: CNN空间编码器 + Inception多尺度时间特征 + Transformer注意力机制
10
- - **输出**: 3分类(主力买入 / 中性 / 主力卖出)
11
 
12
- ## 性能指标
13
- - Test Accuracy: 0.8752
14
- Test F1 (weighted): 0.8169
15
 
16
- ## 使用方法
17
  ```python
18
- from model import LOBPatternNet
19
  import torch
 
20
 
21
- model = LOBPatternNet(seq_len=100, num_classes=3)
22
- model.load_state_dict(torch.load("model.pt"))
 
23
  model.eval()
24
 
25
- # Input: (batch, 100, 40) LOB snapshots
26
- # Output: (batch, 3) logits [主力买入, 中性, 主力卖出]
27
  ```
28
 
29
- ## 引用
30
- - DeepLOB: Zhang et al., TNNLS 2019
31
- - TLOB: Berti & Kasneci, arxiv:2502.15757
1
  # LOBPatternNet - 主力下单模式识别模型
2
+ # LOBPatternNet - Institutional Trading Pattern Detection from Level-2 Order Book
3
 
4
+ ## 模型简介 / Model Overview
 
5
 
6
+ 模型基于A股Level-2十档委托单数据,利用深度学习自动识别主力(机构投资者)的下单模式。
7
+ 通过分析买卖委托的价格分布、挂单量、订单流不平衡(OFI)等微观结构特征,判断当前是否存在主力买入或卖出行为。
 
 
 
8
 
9
+ This model detects institutional (主力) trading patterns from Level-2 order book data with 10 price levels.
10
+ It analyzes bid/ask price distributions, order sizes, Order Flow Imbalance (OFI), and other microstructure
11
+ features to classify market states into institutional buying, neutral, or institutional selling.
12
+
13
+ ## 架构 / Architecture
14
+
15
+ ```
16
+ Input: (batch, 100, 40) - 100 consecutive LOB snapshots × 40 features
17
+
18
+ BilinearNorm - 自适应归一化层 (adaptive normalization)
19
+
20
+ Spatial CNN (Conv2d) - 提取价位间空间特征 (cross-level patterns)
21
+
22
+ Inception Module × 2 - 多尺度时间特征提取 (multi-scale temporal)
23
+
24
+ Transformer Attention × 2 - 时序依赖建模 (temporal dependencies)
25
+
26
+ Fusion with Auxiliary Features:
27
+ - 订单流不平衡 (OFI)
28
+ - 价差动态 (Spread dynamics)
29
+ - 深度不平衡 (Depth imbalance)
30
+ - 大单集中度 (Volume concentration)
31
+ - 价格压力 (Price pressure)
32
+ - OFI波动率 (OFI volatility)
33
+
34
+ 3-class Classification Head
35
+ ```
36
+
37
+ **Total Parameters**: 259,899
38
+
39
+ ## 输出类别 / Output Classes
40
+
41
+ | Label | 中文 | English | Description |
42
+ |-------|------|---------|-------------|
43
+ | 0 | 主力买入 | Institutional Buying | 检测到机构大量买入信号 (heavy institutional buying detected) |
44
+ | 1 | 中性/散户 | Neutral/Retail | 无明显主力操盘迹象 (no clear sign of institutional activity) |
45
+ | 2 | 主力卖出 | Institutional Selling | 检测到机构大量卖出信号 (heavy institutional selling detected) |
46
+
47
+ ## 性能指标 / Performance
48
+
49
+ | Metric | Value |
50
+ |--------|-------|
51
+ | Test Accuracy | 0.4777 |
52
+ | Test F1 (Macro) | 0.4127 |
53
+ | Test F1 (Weighted) | 0.5091 |
54
+ | 主力买入 Precision | 0.2369 |
55
+ | 主力买入 Recall | 0.4251 |
56
+ | 主力卖出 Precision | 0.2679 |
57
+ | 主力卖出 Recall | 0.4983 |
58
+
59
+ ## 使用方法 / Usage
60
 
 
61
  ```python
 
62
  import torch
63
+ from model import LOBPatternNet
64
 
65
+ # Load model
66
+ model = LOBPatternNet(seq_len=100, num_classes=3, d_model=128, nhead=4, num_attn_layers=2)
67
+ model.load_state_dict(torch.load("model.pt", weights_only=True))
68
  model.eval()
69
 
70
+ # Input: 100 consecutive Level-2 snapshots
71
+ # Each snapshot: [ask_p1, ask_s1, bid_p1, bid_s1, ask_p2, ask_s2, ..., bid_p10, bid_s10]
72
+ # Features should be z-score normalized (see data_processor.py)
73
+ x = torch.randn(1, 100, 40) # example input
74
+ with torch.no_grad():
75
+ logits = model(x)
76
+ probs = torch.softmax(logits, dim=1)
77
+ pred = logits.argmax(dim=1)
78
+
79
+ # pred: 0=主力买入, 1=中性, 2=主力卖出
80
+ labels = ["主力买入", "中性/散户", "主力卖出"]
81
+ print(f"Prediction: {labels[pred.item()]}")
82
+ print(f"Confidence: {probs[0, pred.item()]:.2%}")
83
  ```
84
 
85
+ ## 数据格式 / Input Format
86
+
87
+ 每个Level-2快照包含40个特征 (10档 × 4个字段) / Each Level-2 snapshot has 40 features (10 levels × 4 fields):
88
+
89
+ | Feature | Description | 说明 |
90
+ |---------|-------------|------|
91
+ | ask_price_i | Ask price at level i | 第i档卖出价 |
92
+ | ask_size_i | Ask volume at level i | 第i档卖出量 |
93
+ | bid_price_i | Bid price at level i | 第i档买入价 |
94
+ | bid_size_i | Bid volume at level i | 第i档买入量 |
95
+
96
+ ## 参考文献 / References
97
+
98
+ - **DeepLOB**: Zhang et al., "Deep Convolutional Neural Networks for Limit Order Books", TNNLS 2019 (arxiv:1808.03668)
99
+ - **TLOB**: Berti & Kasneci, "TLOB: A Novel Transformer Model with Dual Attention for Stock Price Trend Prediction", 2025 (arxiv:2502.15757)
100
+ - **Training Dataset**: [LeonardoBerti/TRADES-LOB](https://huggingface.co/datasets/LeonardoBerti/TRADES-LOB)
101
+
102
+ ## 声明 / Disclaimer
103
+
104
+ 本模型仅供研究学习使用,不构成任何投资建议。股市有风险,入市需谨慎。
105
+ This model is for research and educational use only and does not constitute investment advice. Stock markets carry risk; invest with caution.
config.json CHANGED
@@ -1,19 +1,36 @@
1
  {
2
  "model_type": "LOBPatternNet",
 
3
  "num_levels": 10,
4
  "seq_len": 100,
5
  "num_classes": 3,
6
  "d_model": 128,
7
  "nhead": 4,
8
  "num_attn_layers": 2,
9
- "dropout": 0.3,
10
  "class_names": [
11
- "\u4e3b\u529b\u4e70\u5165 (Institutional Buy)",
12
- "\u4e2d\u6027 (Neutral)",
13
- "\u4e3b\u529b\u5356\u51fa (Institutional Sell)"
14
  ],
15
- "description": "Institutional trading pattern detection from Level-2 order book data",
16
- "architecture": "CNN + Inception + Transformer Attention",
17
- "test_f1": 0.8168909018517501,
- "test_acc": 0.8751566808723991
19
  }
 
1
  {
2
  "model_type": "LOBPatternNet",
3
+ "architecture": "CNN + Inception + Transformer Attention + Auxiliary Features",
4
  "num_levels": 10,
5
  "seq_len": 100,
6
  "num_classes": 3,
7
  "d_model": 128,
8
  "nhead": 4,
9
  "num_attn_layers": 2,
10
+ "dropout": 0.2,
11
  "class_names": [
12
+ "主力买入 (Institutional Buy)",
13
+ "中性 (Neutral)",
14
+ "主力卖出 (Institutional Sell)"
15
  ],
16
+ "class_names_zh": [
17
+ "主力买入",
18
+ "中性/散户",
19
+ "主力卖出"
20
+ ],
21
+ "total_parameters": 259899,
22
+ "training_dataset": "LeonardoBerti/TRADES-LOB",
23
+ "test_accuracy": 0.47769423558897245,
24
+ "test_f1_macro": 0.4126581408122072,
25
+ "test_f1_weighted": 0.5091308416210424,
26
+ "test_precision_per_class": [
27
+ 0.23689320388349513,
28
+ 0.7402173913043478,
29
+ 0.26785714285714285
30
+ ],
31
+ "test_recall_per_class": [
32
+ 0.4250871080139373,
33
+ 0.4840085287846482,
34
+ 0.4983388704318937
35
+ ]
36
  }
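The macro F1 stored in config.json can be cross-checked against the per-class precision and recall lists stored next to it. The sketch below is a plain-Python sanity check and not part of the commit; the helper name `f1_from_pr` is an assumption introduced here for illustration.

```python
# Per-class values copied from config.json (order: buy, neutral, sell).
precision = [0.23689320388349513, 0.7402173913043478, 0.26785714285714285]
recall = [0.4250871080139373, 0.4840085287846482, 0.4983388704318937]


def f1_from_pr(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


per_class_f1 = [f1_from_pr(p, r) for p, r in zip(precision, recall)]
macro_f1 = sum(per_class_f1) / len(per_class_f1)  # unweighted class average

print([round(f, 4) for f in per_class_f1])
print(round(macro_f1, 4))
```

The unweighted average should land close to the reported `test_f1_macro`; the neutral class carries most of the score, while both institutional classes stay well below it.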
data_processor.py ADDED
@@ -0,0 +1,301 @@
1
+ """
2
+ Data processing pipeline for LOBPatternNet v2.
3
+ Fixed: proper normalization, balanced labeling, oversampling.
4
+ """
5
+
6
+ import numpy as np
7
+ import pandas as pd
8
+ from datasets import load_dataset
9
+ from sklearn.model_selection import train_test_split
10
+ import torch
11
+ from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
12
+ import os
13
+
14
+
15
+ def load_lob_data():
16
+ """Load TRADES-LOB dataset from HF Hub."""
17
+ ds = load_dataset("LeonardoBerti/TRADES-LOB", split="train")
18
+ df = ds.to_pandas()
19
+ print(f"Loaded dataset: {len(df)} rows")
20
+ return df
21
+
22
+
23
+ def extract_and_normalize_features(df):
24
+ """
25
+ Extract and normalize LOB features properly.
26
+
27
+ Approach:
28
+ 1. Separate price and size features
29
+ 2. Prices: normalize relative to mid-price (basis points)
30
+ 3. Sizes: log-transform then z-score
31
+ 4. Replace invalid values with 0
32
+ 5. Final z-score normalization per feature
33
+
34
+ Returns: (features, means, stds), where features is an (N, 40) float32 array and means/stds are the per-feature z-score statistics
35
+ """
36
+ N = len(df)
37
+
38
+ # Collect raw features
39
+ ask_prices = np.zeros((N, 10), dtype=np.float64)
40
+ ask_sizes = np.zeros((N, 10), dtype=np.float64)
41
+ bid_prices = np.zeros((N, 10), dtype=np.float64)
42
+ bid_sizes = np.zeros((N, 10), dtype=np.float64)
43
+
44
+ for i in range(10):
45
+ ask_prices[:, i] = df[f'ask_price_{i+1}'].values.astype(np.float64)
46
+ ask_sizes[:, i] = df[f'ask_size_{i+1}'].values.astype(np.float64)
47
+ bid_prices[:, i] = df[f'bid_price_{i+1}'].values.astype(np.float64)
48
+ bid_sizes[:, i] = df[f'bid_size_{i+1}'].values.astype(np.float64)
49
+
50
+ # Mark sentinel/invalid values
51
+ SENTINEL = 1e9
52
+ ask_p_valid = np.abs(ask_prices) < SENTINEL
53
+ ask_s_valid = np.abs(ask_sizes) < SENTINEL
54
+ bid_p_valid = np.abs(bid_prices) < SENTINEL
55
+ bid_s_valid = np.abs(bid_sizes) < SENTINEL
56
+
57
+ n_invalid = (~ask_p_valid).sum() + (~bid_p_valid).sum() + (~ask_s_valid).sum() + (~bid_s_valid).sum()
58
+ print(f"Found {n_invalid} invalid/sentinel values")
59
+
60
+ # Compute mid-price from valid best bid/ask
61
+ best_ask = ask_prices[:, 0].copy()
62
+ best_bid = bid_prices[:, 0].copy()
63
+ both_valid = ask_p_valid[:, 0] & bid_p_valid[:, 0]
64
+ mid_price = np.where(both_valid, (best_ask + best_bid) / 2.0, 0.0)
65
+
66
+ # Forward-fill mid_price where it's 0
67
+ for i in range(1, N):
68
+ if mid_price[i] == 0 and mid_price[i-1] != 0:
69
+ mid_price[i] = mid_price[i-1]
70
+
71
+ # Normalize prices: (price - mid) / mid * 10000 = basis points
72
+ norm_ask_p = np.zeros_like(ask_prices)
73
+ norm_bid_p = np.zeros_like(bid_prices)
74
+
75
+ for i in range(10):
76
+ valid_a = ask_p_valid[:, i] & (mid_price > 0)
77
+ valid_b = bid_p_valid[:, i] & (mid_price > 0)
78
+ norm_ask_p[valid_a, i] = (ask_prices[valid_a, i] - mid_price[valid_a]) / mid_price[valid_a] * 10000
79
+ norm_bid_p[valid_b, i] = (bid_prices[valid_b, i] - mid_price[valid_b]) / mid_price[valid_b] * 10000
80
+
81
+ # Normalize sizes: log1p then z-score
82
+ norm_ask_s = np.zeros_like(ask_sizes)
83
+ norm_bid_s = np.zeros_like(bid_sizes)
84
+
85
+ for i in range(10):
86
+ valid_a = ask_s_valid[:, i] & (ask_sizes[:, i] > 0)
87
+ valid_b = bid_s_valid[:, i] & (bid_sizes[:, i] > 0)
88
+ norm_ask_s[valid_a, i] = np.log1p(ask_sizes[valid_a, i])
89
+ norm_bid_s[valid_b, i] = np.log1p(bid_sizes[valid_b, i])
90
+
91
+ # Assemble into (N, 40) array: [ask_p_1, ask_s_1, bid_p_1, bid_s_1, ...]
92
+ features = np.zeros((N, 40), dtype=np.float32)
93
+ for i in range(10):
94
+ features[:, i*4] = norm_ask_p[:, i]
95
+ features[:, i*4+1] = norm_ask_s[:, i]
96
+ features[:, i*4+2] = norm_bid_p[:, i]
97
+ features[:, i*4+3] = norm_bid_s[:, i]
98
+
99
+ # Final z-score normalization per feature (critical for model convergence)
100
+ means = features.mean(axis=0)
101
+ stds = features.std(axis=0)
102
+ stds[stds < 1e-8] = 1.0 # avoid division by 0
103
+ features = (features - means) / stds
104
+
105
+ # Replace any remaining NaN/inf
106
+ features = np.nan_to_num(features, nan=0.0, posinf=0.0, neginf=0.0)
107
+
108
+ print(f"Feature shape: {features.shape}")
109
+ print(f"Feature range: [{features.min():.4f}, {features.max():.4f}]")
110
+ print(f"Feature mean: {features.mean():.6f}, std: {features.std():.4f}")
111
+
112
+ return features, means, stds
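The first two normalization steps in `extract_and_normalize_features` can be illustrated on a single price level. This is a minimal stdlib sketch under stated assumptions: `normalize_level` is a hypothetical helper, the prices are made up, and the final per-feature z-score (which needs dataset-wide statistics) is left out.

```python
import math


def normalize_level(ask_price, ask_size, bid_price, bid_size, mid_price):
    """Return (ask_bps, ask_log_size, bid_bps, bid_log_size) for one level."""
    # Prices become signed distances from the mid-price in basis points.
    ask_bps = (ask_price - mid_price) / mid_price * 10000
    bid_bps = (bid_price - mid_price) / mid_price * 10000
    # Non-positive sizes stay at 0.0, mirroring the zero-initialized arrays.
    ask_log = math.log1p(ask_size) if ask_size > 0 else 0.0
    bid_log = math.log1p(bid_size) if bid_size > 0 else 0.0
    return ask_bps, ask_log, bid_bps, bid_log


best_ask, best_bid = 100.05, 99.95
mid = (best_ask + best_bid) / 2.0
level1 = normalize_level(best_ask, 300, best_bid, 0, mid)
print(level1)
```

Asks land at positive basis-point offsets and bids at negative ones, so the sign of a price feature already encodes the side of the book.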
113
+
114
+
115
+ def rolling_sum(arr, window):
+ """Rolling sum over the trailing `window` items via the cumsum trick; the first `window` entries are left at 0."""
117
+ cum = np.cumsum(arr)
118
+ result = np.zeros_like(cum)
119
+ result[window:] = cum[window:] - cum[:-window]
120
+ return result
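The cumsum trick in `rolling_sum` can be mirrored in plain Python to make the warm-up behavior visible. The function name `rolling_sum_py` is an assumption used only for this sketch.

```python
from itertools import accumulate


def rolling_sum_py(values, window):
    """Sum of the last `window` items at each index, 0 during warm-up."""
    cum = list(accumulate(values))
    result = [0] * len(cum)
    for i in range(window, len(cum)):
        # cum[i] - cum[i - window] covers values[i - window + 1 .. i].
        result[i] = cum[i] - cum[i - window]
    return result


print(rolling_sum_py([1, 2, 3, 4, 5], window=2))  # prints [0, 0, 5, 7, 9]
```

Index `window - 1` already has a full window available (1 + 2 = 3 here) but is still reported as 0, which matches the NumPy version above.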
121
+
122
+
123
+ def construct_labels_vectorized(df, window=50, ofi_threshold=0.15, percentile=85):
124
+ """
125
+ Fully vectorized label construction for institutional trading detection.
126
+ Uses rolling windows and relaxed thresholds for better class balance.
127
+ """
128
+ N = len(df)
129
+ buy_sell = df['BUY_SELL_FLAG'].values.astype(np.float32) # 1=buy, 0=sell
130
+ sizes = df['SIZE'].values.astype(np.float32)
131
+ types = df['TYPE'].values
132
+
133
+ print(f"Constructing labels from {N} events, window={window}...")
134
+
135
+ # Signed volume
136
+ signed_vol = sizes * (2 * buy_sell - 1)
137
+
138
+ # Rolling sums (vectorized)
139
+ roll_signed = rolling_sum(signed_vol, window)
140
+ roll_total = rolling_sum(sizes, window)
141
+ norm_ofi = roll_signed / (roll_total + 1e-8)
142
+
143
+ # Large orders
144
+ is_large = (sizes > np.percentile(sizes, percentile)).astype(np.float32)
145
+ roll_large_buy = rolling_sum(is_large * buy_sell, window)
146
+ roll_large_sell = rolling_sum(is_large * (1 - buy_sell), window)
147
+
148
+ # Cancellation rate
149
+ is_cancel = (types == 'ORDER_CANCELLED').astype(np.float32)
150
+ roll_cancel = rolling_sum(is_cancel, window) / window
151
+
152
+ # Combined scores
153
+ large_diff = (roll_large_buy - roll_large_sell) / (window * 0.1 + 1e-8)
154
+ buy_score = norm_ofi + 0.3 * large_diff + 0.2 * roll_cancel
155
+ sell_score = -norm_ofi - 0.3 * large_diff + 0.2 * roll_cancel
156
+
157
+ # Use percentile thresholds for ~15-20% per class
158
+ valid = np.arange(window, N)
159
+ buy_threshold = np.percentile(buy_score[valid], 80)
160
+ sell_threshold = np.percentile(sell_score[valid], 80)
161
+
162
+ print(f"Buy threshold (p80): {buy_threshold:.4f}, Sell threshold (p80): {sell_threshold:.4f}")
163
+
164
+ labels = np.ones(N, dtype=np.int64)
165
+ labels[(buy_score > buy_threshold) & (norm_ofi > ofi_threshold)] = 0
166
+ labels[(sell_score > sell_threshold) & (norm_ofi < -ofi_threshold)] = 2
167
+
168
+ unique, counts = np.unique(labels, return_counts=True)
169
+ label_names = {0: '主力买入', 1: '中性', 2: '主力卖出'}
170
+ print("Label distribution:")
171
+ for u, c in zip(unique, counts):
172
+ print(f" {u} ({label_names.get(u, '?')}): {c} ({c/N*100:.1f}%)")
173
+
174
+ return labels
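The core of the labeling rule is the normalized signed volume of a window, gated by `ofi_threshold`. The sketch below is a deliberately simplified illustration: it keeps only the OFI gate and omits the large-order and cancellation terms and the 80th-percentile score cut-off. `window_ofi` and `simple_label` are hypothetical names.

```python
def window_ofi(sizes, buy_flags):
    """Normalized signed volume in [-1, 1] for one window of events."""
    # buy flag 1 -> +size, sell flag 0 -> -size
    signed = sum(s * (2 * b - 1) for s, b in zip(sizes, buy_flags))
    total = sum(sizes)
    return signed / (total + 1e-8)


def simple_label(ofi, ofi_threshold=0.15):
    """0 = institutional buy, 2 = institutional sell, 1 = neutral."""
    if ofi > ofi_threshold:
        return 0
    if ofi < -ofi_threshold:
        return 2
    return 1


ofi = window_ofi(sizes=[100, 50, 50], buy_flags=[1, 1, 0])
print(ofi, simple_label(ofi))
```

In the full function a window must pass both this gate and the percentile threshold on the combined score, so the real labels are stricter than this sketch suggests.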
175
+
176
+
177
+ def create_sequences(features, labels, seq_len=100, stride=20):
+ """Create sliding-window sequences; each sequence takes the label of its last snapshot."""
179
+ N = len(features)
180
+ F = features.shape[1]
181
+ n_sequences = (N - seq_len) // stride
182
+
183
+ # Window start offsets; the output arrays below are pre-allocated and filled in a loop
184
+ starts = np.arange(0, N - seq_len, stride)
185
+ n_sequences = len(starts)
186
+
187
+ print(f"Creating {n_sequences} sequences of length {seq_len}, stride {stride}...")
188
+
189
+ X = np.zeros((n_sequences, seq_len, F), dtype=np.float32)
190
+ y = np.zeros(n_sequences, dtype=np.int64)
191
+
192
+ for idx, start in enumerate(starts):
193
+ X[idx] = features[start:start + seq_len]
194
+ y[idx] = labels[start + seq_len - 1]
195
+
196
+ print(f"Created {n_sequences} sequences, memory: {X.nbytes / 1e6:.1f} MB")
197
+ return X, y
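The windowing in `create_sequences` comes down to a list of start offsets plus the index of the label row for each window. A small stdlib sketch, with `window_indices` as a hypothetical helper name:

```python
def window_indices(n_rows, seq_len=100, stride=20):
    """Return (start, label_index) pairs for each sliding window."""
    return [(start, start + seq_len - 1)
            for start in range(0, n_rows - seq_len, stride)]


pairs = window_indices(n_rows=250)
print(len(pairs), pairs[0], pairs[-1])
```

Because the range stops before `n_rows - seq_len`, a final window that would start exactly at that offset (150 in this example) is not emitted. With the pipeline default `stride=5`, consecutive windows share 95 of their 100 rows.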
198
+
199
+
200
+ class LOBDataset(Dataset):
201
+ def __init__(self, X, y):
202
+ self.X = torch.from_numpy(X)
203
+ self.y = torch.from_numpy(y)
204
+
205
+ def __len__(self):
206
+ return len(self.X)
207
+
208
+ def __getitem__(self, idx):
209
+ return self.X[idx], self.y[idx]
210
+
211
+
212
+ def get_weighted_sampler(y_train):
213
+ """Create WeightedRandomSampler to oversample minority classes."""
214
+ class_counts = np.bincount(y_train)
215
+ class_weights = 1.0 / class_counts
216
+ sample_weights = class_weights[y_train]
217
+ sampler = WeightedRandomSampler(
218
+ weights=torch.from_numpy(sample_weights).double(),
219
+ num_samples=len(y_train),
220
+ replacement=True
221
+ )
222
+ return sampler
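The weighting behind `get_weighted_sampler` is plain inverse class frequency. The sketch below reproduces only the weight computation with the standard library (the sampling itself is left to PyTorch); `inverse_frequency_weights` is an assumed name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-sample weight = 1 / (number of samples in that sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = [1] * 8 + [0, 2]  # imbalanced toy set: mostly neutral
weights = inverse_frequency_weights(labels)

# Total sampling mass per class; equal values mean balanced draws in expectation.
mass = {c: sum(w for w, l in zip(weights, labels) if l == c) for c in (0, 1, 2)}
print(mass)
```

Each class ends up with the same total mass, so sampling with replacement draws the three classes equally often on average, at the cost of repeating minority-class windows.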
223
+
224
+
225
+ def prepare_data(seq_len=100, stride=5, window=50, ofi_threshold=0.2,
226
+ percentile=90, test_size=0.15, val_size=0.15,
227
+ random_state=42, batch_size=64):
228
+ """
229
+ Full data preparation pipeline.
230
+ Returns train, val, test DataLoaders with balanced sampling.
231
+ """
232
+ cache_path = f"/app/data_v2_{seq_len}_{stride}_{window}.npz"
233
+
234
+ if os.path.exists(cache_path):
235
+ print(f"Loading cached data from {cache_path}")
236
+ data = np.load(cache_path)  # plain numeric arrays, so pickle support is not needed
237
+ X_train, y_train = data['X_train'], data['y_train']
238
+ X_val, y_val = data['X_val'], data['y_val']
239
+ X_test, y_test = data['X_test'], data['y_test']
240
+ else:
241
+ # Load raw data
242
+ df = load_lob_data()
243
+
244
+ # Extract and normalize features
245
+ features, means, stds = extract_and_normalize_features(df)
246
+
247
+ # Construct labels
248
+ labels = construct_labels_vectorized(df, window=window,
249
+ ofi_threshold=ofi_threshold,
250
+ percentile=percentile)
251
+
252
+ # Create sequences
253
+ X, y = create_sequences(features, labels, seq_len=seq_len, stride=stride)
254
+
255
+ # Split (stratified)
256
+ X_train, X_temp, y_train, y_temp = train_test_split(
257
+ X, y, test_size=test_size + val_size, random_state=random_state, stratify=y)
258
+ X_val, X_test, y_val, y_test = train_test_split(
259
+ X_temp, y_temp, test_size=test_size / (test_size + val_size),
260
+ random_state=random_state, stratify=y_temp)
261
+
262
+ # Save cache
263
+ np.savez_compressed(cache_path,
264
+ X_train=X_train, y_train=y_train,
265
+ X_val=X_val, y_val=y_val,
266
+ X_test=X_test, y_test=y_test,
267
+ means=means, stds=stds)
268
+ print(f"Cached to {cache_path}")
269
+
270
+ print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
271
+
272
+ # Print label distributions
273
+ for name, ys in [("Train", y_train), ("Val", y_val), ("Test", y_test)]:
274
+ unique, counts = np.unique(ys, return_counts=True)
275
+ dist = {u: c for u, c in zip(unique, counts)}
276
+ print(f" {name}: {dist}")
277
+
278
+ # Create datasets
279
+ train_dataset = LOBDataset(X_train, y_train)
280
+ val_dataset = LOBDataset(X_val, y_val)
281
+ test_dataset = LOBDataset(X_test, y_test)
282
+
283
+ # Weighted sampler for training (oversamples minority classes)
284
+ train_sampler = get_weighted_sampler(y_train)
285
+
286
+ train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler, num_workers=0)
287
+ val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=0)
288
+ test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=0)
289
+
290
+ return train_loader, val_loader, test_loader, y_train
291
+
292
+
293
+ if __name__ == "__main__":
294
+ train_loader, val_loader, test_loader, y_train = prepare_data()
295
+
296
+ # Check a batch
297
+ for X_batch, y_batch in train_loader:
298
+ print(f"Batch X: {X_batch.shape}, y: {y_batch.shape}")
299
+ print(f"Batch labels: {y_batch[:20]}")
300
+ print(f"Batch X range: [{X_batch.min():.4f}, {X_batch.max():.4f}]")
301
+ break
model.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1b89a47feff6243a25222dcc4958860cb43dd705da131902c8fc3c4cc771bbb6
3
  size 1073163
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aa391467f5bc207ba527cda22072d606488d8e3cb07b10e60512451e7bc8733b
3
  size 1073163
model.py ADDED
@@ -0,0 +1,311 @@
1
+ """
2
+ LOBPatternNet: Deep Learning Model for Institutional Trading Pattern Detection
3
+ from Level-2 Order Book Data (10-level bid/ask)
4
+
5
+ Architecture: CNN (spatial) + Inception (multi-scale) + Transformer Attention (temporal) + MLP Head
6
+ Based on DeepLOB (Zhang et al. 2019) + TLOB (Berti & Kasneci 2025) design principles
7
+
8
+ Input: (batch, seq_len, 40) - seq_len consecutive LOB snapshots, each with 40 features:
9
+ [ask_price_1, ask_size_1, bid_price_1, bid_size_1, ..., ask_price_10, ask_size_10, bid_price_10, bid_size_10]
10
+
11
+ Output: 3-class classification
12
+ 0: 主力买入 (Institutional Buying)
13
+ 1: 中性/散户 (Neutral/Retail)
14
+ 2: 主力卖出 (Institutional Selling)
15
+ """
16
+
17
+ import torch
18
+ import torch.nn as nn
19
+ import torch.nn.functional as F
20
+ import math
21
+
22
+
23
+ class BilinearNorm(nn.Module):
24
+ """Bilinear normalization layer from TLOB - handles price/volume scale mismatch."""
25
+ def __init__(self, num_features):
26
+ super().__init__()
27
+ self.gamma = nn.Parameter(torch.ones(1, 1, num_features))
28
+ self.beta = nn.Parameter(torch.zeros(1, 1, num_features))
29
+ self.gate = nn.Parameter(torch.ones(1, 1, num_features))
30
+
31
+ def forward(self, x):
32
+ # x: (B, T, F)
33
+ mean = x.mean(dim=1, keepdim=True)
34
+ std = x.std(dim=1, keepdim=True) + 1e-8
35
+ x_norm = (x - mean) / std
36
+ gate = torch.sigmoid(self.gate)
37
+ return gate * (self.gamma * x_norm + self.beta) + (1 - gate) * x
38
+
39
+
40
+ class InceptionModule(nn.Module):
41
+ """Inception module for multi-scale temporal feature extraction."""
42
+ def __init__(self, in_channels, out_channels=32):
43
+ super().__init__()
44
+ self.branch1 = nn.Sequential(
45
+ nn.Conv1d(in_channels, out_channels, kernel_size=1),
46
+ nn.BatchNorm1d(out_channels),
47
+ nn.LeakyReLU(0.01)
48
+ )
49
+ self.branch3 = nn.Sequential(
50
+ nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),
51
+ nn.BatchNorm1d(out_channels),
52
+ nn.LeakyReLU(0.01)
53
+ )
54
+ self.branch5 = nn.Sequential(
55
+ nn.Conv1d(in_channels, out_channels, kernel_size=5, padding=2),
56
+ nn.BatchNorm1d(out_channels),
57
+ nn.LeakyReLU(0.01)
58
+ )
59
+ self.pool_branch = nn.Sequential(
60
+ nn.MaxPool1d(kernel_size=3, stride=1, padding=1),
61
+ nn.Conv1d(in_channels, out_channels, kernel_size=1),
62
+ nn.BatchNorm1d(out_channels),
63
+ nn.LeakyReLU(0.01)
64
+ )
65
+
66
+ def forward(self, x):
67
+ # x: (B, C, T)
68
+ return torch.cat([self.branch1(x), self.branch3(x),
69
+ self.branch5(x), self.pool_branch(x)], dim=1)
70
+
71
+
72
+ class TemporalAttention(nn.Module):
73
+ """Multi-head self-attention for temporal dependencies in order flow."""
74
+ def __init__(self, d_model, nhead=4, dropout=0.1):
75
+ super().__init__()
76
+ self.attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
77
+ self.norm = nn.LayerNorm(d_model)
78
+ self.dropout = nn.Dropout(dropout)
79
+
80
+ def forward(self, x):
81
+ # x: (B, T, D)
82
+ attn_out, _ = self.attn(x, x, x)
83
+ return self.norm(x + self.dropout(attn_out))
84
+
85
+
86
+ class LOBPatternNet(nn.Module):
87
+ """
88
+ Full model for institutional trading pattern detection from Level-2 LOB data.
89
+
90
+ Architecture:
91
+ 1. BilinearNorm → normalize raw LOB features
92
+ 2. CNN spatial encoder → extract cross-level order book patterns
93
+ 3. Inception → multi-scale temporal features
94
+ 4. Transformer attention → capture temporal dependencies
95
+ 5. Classification head → 3-class output
96
+ """
97
+ def __init__(self,
98
+ num_levels=10, # number of price levels (10 for Level-2)
99
+ seq_len=100, # number of consecutive LOB snapshots
100
+ num_classes=3, # 主力买入, 中性, 主力卖出
101
+ d_model=128, # internal feature dimension
102
+ nhead=4, # attention heads
103
+ num_attn_layers=2, # number of attention layers
104
+ dropout=0.2):
105
+ super().__init__()
106
+
107
+ self.num_levels = num_levels
108
+ self.seq_len = seq_len
109
+ self.num_features = num_levels * 4 # 40 features: ask_p, ask_s, bid_p, bid_s × 10 levels
110
+
111
+ # 1. Bilinear normalization
112
+ self.norm = BilinearNorm(self.num_features)
113
+
114
+ # 2. Spatial CNN encoder - processes each snapshot across price levels
115
+ # Reshape to (B, 1, T, 40) for 2D conv
116
+ self.spatial_cnn = nn.Sequential(
117
+ # Conv across features (price-volume pairing per level)
118
+ nn.Conv2d(1, 32, kernel_size=(1, 2), stride=(1, 2)), # (B, 32, T, 20)
119
+ nn.BatchNorm2d(32),
120
+ nn.LeakyReLU(0.01),
121
+
122
+ nn.Conv2d(32, 32, kernel_size=(1, 2), stride=(1, 2)), # (B, 32, T, 10)
123
+ nn.BatchNorm2d(32),
124
+ nn.LeakyReLU(0.01),
125
+
126
+ nn.Conv2d(32, 32, kernel_size=(1, 10)), # (B, 32, T, 1)
127
+ nn.BatchNorm2d(32),
128
+ nn.LeakyReLU(0.01),
129
+ )
130
+
131
+ # 3. Inception module for multi-scale temporal features
132
+ self.inception1 = InceptionModule(32, 32) # Output: 128 channels
133
+ self.inception2 = InceptionModule(128, 32) # Output: 128 channels
134
+
135
+ # 4. Projection to d_model
136
+ self.proj = nn.Sequential(
137
+ nn.Linear(128, d_model),
138
+ nn.LayerNorm(d_model),
139
+ nn.LeakyReLU(0.01),
140
+ nn.Dropout(dropout)
141
+ )
142
+
143
+ # 5. Transformer attention layers
144
+ self.attention_layers = nn.ModuleList([
145
+ TemporalAttention(d_model, nhead, dropout)
146
+ for _ in range(num_attn_layers)
147
+ ])
148
+
149
+ # 6. Classification head
150
+ self.classifier = nn.Sequential(
151
+ nn.Linear(d_model, 64),
152
+ nn.LeakyReLU(0.01),
153
+ nn.Dropout(dropout),
154
+ nn.Linear(64, num_classes)
155
+ )
156
+
157
+ # Additional feature engineering layer
158
+ # Processes derived features: OFI, spread, depth imbalance, volume concentration, price pressure, OFI volatility
159
+ self.aux_features_dim = 6 # number of derived features
160
+ self.aux_encoder = nn.Sequential(
161
+ nn.Linear(self.aux_features_dim, 32),
162
+ nn.LeakyReLU(0.01),
163
+ nn.Linear(32, d_model),
164
+ nn.LeakyReLU(0.01),
165
+ nn.Dropout(dropout)
166
+ )
167
+
168
+ # Fusion layer
169
+ self.fusion = nn.Sequential(
170
+ nn.Linear(d_model * 2, d_model),
171
+ nn.LeakyReLU(0.01),
172
+ nn.Dropout(dropout)
173
+ )
174
+
175
+ self._init_weights()
176
+
177
+ def _init_weights(self):
178
+ for m in self.modules():
179
+ if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d)):
180
+ nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='leaky_relu')
181
+ if m.bias is not None:
182
+ nn.init.constant_(m.bias, 0)
183
+ elif isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)):
184
+ nn.init.constant_(m.weight, 1)
185
+ nn.init.constant_(m.bias, 0)
186
+
187
+ def compute_aux_features(self, x):
188
+ """
189
+ Compute derived microstructure features from raw LOB data.
190
+ x: (B, T, 40) raw LOB features
191
+ Returns: (B, 6) aggregated auxiliary features
192
+ """
193
+ B, T, _ = x.shape  # avoid shadowing torch.nn.functional (imported as F)
194
+
195
+ # Parse the interleaved LOB layout produced by data_processor.py:
+ # [ask_p_1, ask_s_1, bid_p_1, bid_s_1, ..., ask_p_10, ask_s_10, bid_p_10, bid_s_10]
+ ask_prices = x[:, :, 0::4] # (B, T, 10)
+ ask_sizes = x[:, :, 1::4] # (B, T, 10)
+ bid_prices = x[:, :, 2::4] # (B, T, 10)
+ bid_sizes = x[:, :, 3::4] # (B, T, 10)
200
+
201
+ # 1. Order Flow Imbalance (OFI) - key institutional signal
202
+ total_bid = bid_sizes.sum(dim=-1) # (B, T)
+ total_ask = ask_sizes.sum(dim=-1) # (B, T)
204
+ ofi = (total_bid - total_ask) / (total_bid + total_ask + 1e-8)
205
+ ofi_mean = ofi.mean(dim=1, keepdim=True) # (B, 1)
206
+
207
+ # 2. Spread dynamics
208
+ spread = ask_prices[:, :, 0] - bid_prices[:, :, 0] # (B, T)
209
+ spread_mean = spread.mean(dim=1, keepdim=True)
210
+
211
+ # 3. Depth imbalance at top levels (1-3)
212
+ top_bid = bid_sizes[:, :, :3].sum(dim=-1) # (B, T)
213
+ top_ask = ask_sizes[:, :, :3].sum(dim=-1) # (B, T)
214
+ depth_imb = (top_bid - top_ask) / (top_bid + top_ask + 1e-8)
215
+ depth_imb_mean = depth_imb.mean(dim=1, keepdim=True)
216
+
217
+ # 4. Volume concentration (institutional = concentrated at few levels)
218
+ bid_concentration = bid_sizes[:, :, 0] / (bid_sizes.sum(dim=-1) + 1e-8)
219
+ bid_conc_mean = bid_concentration.mean(dim=1, keepdim=True)
220
+
221
+ # 5. Price pressure (weighted volume by distance from mid)
222
+ mid_price = (ask_prices[:, :, 0] + bid_prices[:, :, 0]) / 2
223
+ bid_pressure = (bid_sizes * (mid_price.unsqueeze(-1) - bid_prices).abs()).sum(dim=-1)
224
+ ask_pressure = (ask_sizes * (ask_prices - mid_price.unsqueeze(-1)).abs()).sum(dim=-1)
225
+ pressure_ratio = (bid_pressure - ask_pressure) / (bid_pressure + ask_pressure + 1e-8)
226
+ pressure_mean = pressure_ratio.mean(dim=1, keepdim=True)
227
+
228
+ # 6. Temporal volatility of OFI (sudden changes = institutional activity)
229
+ ofi_vol = ofi.std(dim=1, keepdim=True)
230
+
231
+ return torch.cat([ofi_mean, spread_mean, depth_imb_mean,
232
+ bid_conc_mean, pressure_mean, ofi_vol], dim=1) # (B, 6)
233
+
234
+ def forward(self, x):
235
+ """
236
+ x: (B, T, 40) - batch of LOB snapshot sequences
237
+ Returns: (B, num_classes) logits
238
+ """
239
+ B, T, _ = x.shape  # avoid shadowing torch.nn.functional (imported as F)
240
+
241
+ # Compute auxiliary features before normalization
242
+ aux_feats = self.compute_aux_features(x) # (B, 6)
243
+ aux_encoded = self.aux_encoder(aux_feats) # (B, d_model)
244
+
245
+ # 1. Bilinear normalization
246
+ x = self.norm(x) # (B, T, 40)
247
+
248
+ # 2. Spatial CNN
249
+ x = x.unsqueeze(1) # (B, 1, T, 40)
250
+ x = self.spatial_cnn(x) # (B, 32, T, 1)
251
+ x = x.squeeze(-1) # (B, 32, T)
252
+
253
+ # 3. Inception modules
254
+ x = self.inception1(x) # (B, 128, T)
255
+ x = self.inception2(x) # (B, 128, T)
256
+
257
+ # 4. Transpose and project for attention
258
+ x = x.permute(0, 2, 1) # (B, T, 128)
259
+ x = self.proj(x) # (B, T, d_model)
260
+
261
+ # 5. Temporal attention
262
+ for attn_layer in self.attention_layers:
263
+ x = attn_layer(x)
264
+
265
+ # Global average pooling
266
+ x = x.mean(dim=1) # (B, d_model)
267
+
268
+ # 6. Fusion with auxiliary features
269
+ x = self.fusion(torch.cat([x, aux_encoded], dim=1)) # (B, d_model)
270
+
271
+ # 7. Classification
272
+ return self.classifier(x) # (B, num_classes)
273
+
274
+ def get_attention_weights(self, x):
275
+ """Get attention weights for interpretability."""
276
+ B, T, _ = x.shape  # avoid shadowing torch.nn.functional (imported as F)
277
+ aux_feats = self.compute_aux_features(x)
278
+
279
+ x = self.norm(x)
280
+ x = x.unsqueeze(1)
281
+ x = self.spatial_cnn(x)
282
+ x = x.squeeze(-1)
283
+ x = self.inception1(x)
284
+ x = self.inception2(x)
285
+ x = x.permute(0, 2, 1)
286
+ x = self.proj(x)
287
+
288
+ weights = []
289
+ for attn_layer in self.attention_layers:
290
+ _, w = attn_layer.attn(x, x, x)
291
+ weights.append(w)
292
+ x = attn_layer(x)
293
+
294
+ return weights
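The auxiliary features depend on how the 40 values of a snapshot are split into prices and sizes. README.md and data_processor.py both describe an interleaved layout (ask price, ask size, bid price, bid size per level), so the sketch below assumes that layout and computes the spread and top-3 depth imbalance for one toy snapshot. `split_snapshot` and `depth_imbalance` are hypothetical helpers, and the numbers are invented.

```python
def split_snapshot(snapshot):
    """Split a 40-value interleaved snapshot into four 10-level lists."""
    assert len(snapshot) == 40
    return snapshot[0::4], snapshot[1::4], snapshot[2::4], snapshot[3::4]


def depth_imbalance(bid_sizes, ask_sizes, top=3):
    """(bid - ask) / (bid + ask) over the best `top` levels."""
    bid, ask = sum(bid_sizes[:top]), sum(ask_sizes[:top])
    return (bid - ask) / (bid + ask + 1e-8)


# Toy snapshot: level i has ask 100+i, bid 100-i, ask size 10, bid size 30.
snapshot = []
for level in range(1, 11):
    snapshot += [100.0 + level, 10.0, 100.0 - level, 30.0]

ask_p, ask_s, bid_p, bid_s = split_snapshot(snapshot)
spread = ask_p[0] - bid_p[0]
imbalance = depth_imbalance(bid_s, ask_s)
print(spread, imbalance)
```

A positive imbalance means more resting volume on the bid side. Reading an interleaved snapshot with contiguous blocks of ten would mix prices and sizes, so the slicing has to match the layout produced upstream.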
295
+
296
+
297
+ def count_parameters(model):
298
+ return sum(p.numel() for p in model.parameters() if p.requires_grad)
299
+
300
+
301
+ if __name__ == "__main__":
302
+ # Test model
303
+ model = LOBPatternNet(seq_len=100, num_classes=3)
304
+ print(f"Total trainable parameters: {count_parameters(model):,}")
305
+
306
+ # Test forward pass
307
+ x = torch.randn(4, 100, 40)
308
+ out = model(x)
309
+ print(f"Input shape: {x.shape}")
310
+ print(f"Output shape: {out.shape}")
311
+ print(f"Output: {out}")
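The shape comments in `spatial_cnn` and the Inception modules follow from the standard convolution output-size formula. The sketch below checks them with integer arithmetic only, assuming dilation 1; `conv_out` is a helper introduced for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


width = 40
widths = [width]
# (kernel, stride) along the feature axis for the three spatial Conv2d layers
for kernel, stride in [(2, 2), (2, 2), (10, 1)]:
    width = conv_out(width, kernel, stride)
    widths.append(width)

inception_channels = 4 * 32  # four branches of 32 channels, concatenated
temporal_len = conv_out(100, kernel=5, padding=2)  # widest Inception branch
print(widths, inception_channels, temporal_len)
```

The feature axis collapses 40 → 20 → 10 → 1 while the time axis keeps its 100 steps, which is why the tensor can be squeezed to `(B, 32, T)` before the Inception modules.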