Upload README.md
README.md
CHANGED
@@ -1,26 +1,108 @@
# Crypto 15-Minute Direction Classifier

A time-series classification model that predicts whether Bitcoin (BTC/USDT) price will move **up** or **down** over the next 15-minute interval using multivariate historical market data.

## Model Overview

| Attribute | Value |
|-----------|-------|
| **Task** | Binary time-series classification |
| **Target** | BTC price direction in next 15 minutes (up=1, down=0) |
| **Input** | 60 minutes of multivariate OHLCV + technical indicators |
| **Assets** | BTC/USDT + ETH/USDT (cross-asset features) |
| **Best Model** | Logistic Regression on flattened windows |
| **Dataset** | 300K rows of 1-minute candles from WinkingFace CryptoLM datasets |

## Performance

| Metric | Value |
|--------|-------|
| Test Accuracy | 53.1% |
| Test F1 | 0.574 |
| Test AUC | 0.540 |

**Note:** Predicting 15-minute crypto price direction is extremely hard because markets are close to efficient at short timeframes. On the test set the model scores slightly above random chance (50% accuracy, 0.5 AUC), which suggests a small signal. The pipeline is mainly useful as a complete data engineering and feature extraction baseline for further research.
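
To make the three metrics concrete, the sketch below computes accuracy, F1, and ROC AUC from labels and predicted probabilities with plain NumPy. The function name `binary_metrics`, the 0.5 decision threshold, and the toy arrays are assumptions for illustration; this is not the evaluation code behind the table. scikit-learn's `accuracy_score`, `f1_score`, and `roc_auc_score` compute the same quantities.

```python
import numpy as np


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Return accuracy, F1 (positive class = up), and ROC AUC.

    Hypothetical helper: the threshold of 0.5 is an assumption.
    """
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= threshold).astype(int)

    accuracy = float(np.mean(y_pred == y_true))

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0

    # AUC = probability that a random positive is scored above a random
    # negative (ties count as one half). The pairwise matrix is fine for
    # small arrays but uses O(n_pos * n_neg) memory.
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        auc = float("nan")
    else:
        diff = pos[:, None] - neg[None, :]
        auc = float(np.mean((diff > 0) + 0.5 * (diff == 0)))

    return {"accuracy": accuracy, "f1": f1, "auc": auc}


# Toy example with made-up values
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.7, 0.2]
metrics = binary_metrics(y_true, y_prob)
print(metrics)
```

An AUC of 0.5 corresponds to a ranking no better than chance, which is why 0.540 reads as a small edge.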

## Data Sources

- [WinkingFace/CryptoLM-Bitcoin-BTC-USDT](https://huggingface.co/datasets/WinkingFace/CryptoLM-Bitcoin-BTC-USDT) - BTC 1-min OHLCV + 15 technical indicators
- [WinkingFace/CryptoLM-Ethereum-ETH-USDT](https://huggingface.co/datasets/WinkingFace/CryptoLM-Ethereum-ETH-USDT) - ETH 1-min OHLCV + 15 technical indicators

## Features (49 per timestep)

Each asset contributes the 21 columns listed below (42 in total); the 7 cross-asset features bring the count to 49.

### BTC & ETH (separately)

- Price: `open`, `high`, `low`, `close`
- Volume: `volume`
- Moving Averages: `MA_20`, `MA_50`, `MA_200`
- Momentum: `RSI`, `%K`, `%D`, `ADX`, `ATR`
- Trend: `MACD`, `Signal`, `Histogram`, `Trendline`
- Volatility: `BL_Upper`, `BL_Lower`, `MN_Upper`, `MN_Lower`

### Cross-Asset Engineered

- `eth_btc_ratio` - ETH/BTC price ratio
- `btc_ret_1m`, `eth_ret_1m` - 1-minute returns
- `btc_vol_ma20`, `eth_vol_ma20` - 20-period volume MA
- `btc_range`, `eth_range` - Normalized price range
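
The engineered features above could be computed roughly as follows. This is a minimal NumPy sketch under stated assumptions: the helper names are hypothetical, returns are taken as simple percentage changes of `close`, the ratio uses closing prices, and the "normalized" range is assumed to be `(high - low) / close`. The README does not state the exact formulas, so treat these as plausible definitions.

```python
import numpy as np


def rolling_mean(x, window):
    """Trailing mean over `window` samples; the first window-1 entries are NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    if len(x) >= window:
        csum = np.cumsum(np.insert(x, 0, 0.0))
        out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out


def pct_return(close):
    """One-step simple return; the first entry is NaN."""
    close = np.asarray(close, dtype=float)
    out = np.full(close.shape, np.nan)
    out[1:] = close[1:] / close[:-1] - 1.0
    return out


def cross_asset_features(btc, eth, vol_window=20):
    """btc / eth: dicts of equal-length 1-D arrays keyed by
    'high', 'low', 'close', 'volume' (rows already aligned on timestamp)."""
    b = {k: np.asarray(v, dtype=float) for k, v in btc.items()}
    e = {k: np.asarray(v, dtype=float) for k, v in eth.items()}
    return {
        "eth_btc_ratio": e["close"] / b["close"],
        "btc_ret_1m": pct_return(b["close"]),
        "eth_ret_1m": pct_return(e["close"]),
        "btc_vol_ma20": rolling_mean(b["volume"], vol_window),
        "eth_vol_ma20": rolling_mean(e["volume"], vol_window),
        # Assumed normalization: bar range relative to the close.
        "btc_range": (b["high"] - b["low"]) / b["close"],
        "eth_range": (e["high"] - e["low"]) / e["close"],
    }


# Toy data (made-up values); a window of 2 keeps the example short.
btc = {"high": [101, 112, 125], "low": [99, 108, 120],
       "close": [100, 110, 121], "volume": [1, 3, 5]}
eth = {"high": [10.5, 11.5, 11.2], "low": [9.5, 10.5, 10.8],
       "close": [10, 11, 11], "volume": [2, 2, 4]}
feats = cross_asset_features(btc, eth, vol_window=2)
print(sorted(feats))
```

The leading NaN values produced by the return and rolling-mean helpers are the kind of rows the pipeline's cleaning step would drop.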

## Pipeline

1. **Load & Merge** - Join the BTC and ETH 1-minute datasets on timestamp
2. **Engineer Features** - Add returns, ratios, ranges, volume MAs
3. **Create Windows** - 60-minute lookback → predict next 15-minute direction
4. **Clean** - Drop NaN/Inf, standardize per-feature
5. **Split** - 70/15/15 temporal train/val/test split (avoids look-ahead leakage)
6. **Train** - Logistic Regression + Random Forest baselines
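
The windowing, splitting, and standardization steps above can be sketched as follows. This is an illustration under assumptions, not the original training script: the function names are hypothetical, "up" is taken to mean `close[t + 15] > close[t]` (a flat move counts as down), and the standardization statistics are fitted on the training split only, which is one common way to keep validation and test data out of the preprocessing.

```python
import numpy as np


def make_windows(features, close, lookback=60, horizon=15):
    """Build (X, y): X[i] holds `lookback` rows ending at time t and
    y[i] = 1 if close[t + horizon] > close[t], else 0."""
    features = np.asarray(features, dtype=float)
    close = np.asarray(close, dtype=float)
    last_t = len(close) - horizon  # exclusive upper bound for t
    if last_t <= lookback - 1:
        raise ValueError("not enough rows for one window")
    ts = np.arange(lookback - 1, last_t)
    # np.stack copies every window; fine for a sketch, heavy for 300K rows.
    X = np.stack([features[t - lookback + 1: t + 1] for t in ts])
    y = (close[ts + horizon] > close[ts]).astype(int)
    return X, y


def temporal_split(X, y, train=0.70, val=0.15):
    """Chronological split: earliest rows train, latest rows test."""
    n = len(X)
    i_train = int(n * train)
    i_val = int(n * (train + val))
    return ((X[:i_train], y[:i_train]),
            (X[i_train:i_val], y[i_train:i_val]),
            (X[i_val:], y[i_val:]))


def fit_standardizer(X_train_flat):
    """Per-column mean and std from training data; constant columns get std 1."""
    mean = X_train_flat.mean(axis=0)
    std = X_train_flat.std(axis=0)
    std[std == 0] = 1.0
    return mean, std


# Synthetic demo: 200 one-minute rows, 3 made-up features
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(size=200))
features = rng.normal(size=(200, 3))

X, y = make_windows(features, close)
train, val, test = temporal_split(X, y)
train_flat = train[0].reshape(len(train[0]), -1)
mean, std = fit_standardizer(train_flat)
train_norm = (train_flat - mean) / std
print(X.shape, len(train[0]), len(val[0]), len(test[0]))
```

Because consecutive windows overlap, samples near the split boundaries share rows; leaving a gap of at least `lookback + horizon` rows between splits is a stricter option.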

## Usage

```python
import pickle
import numpy as np

# Load model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Load preprocessing artifacts
mean = np.load("feature_mean.npy")
std = np.load("feature_std.npy")
valid = np.load("valid_cols.npy")

# X is your input array, shape: (samples, 60 minutes, 49 features)
X_flat = X.reshape(X.shape[0], -1)  # flatten to 60 * 49 = 2940 features
X_flat = X_flat[:, valid]           # keep valid columns
X_norm = (X_flat - mean) / std      # standardize

# Predict
preds = model.predict(X_norm)              # 0=down, 1=up
probs = model.predict_proba(X_norm)[:, 1]  # probability of up
```
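
The preprocessing in the snippet above can be wrapped in a small helper that checks shapes before predicting. The function `prepare_inputs` is hypothetical, and it assumes that `feature_mean.npy` and `feature_std.npy` were saved after the `valid_cols.npy` mask was applied, as the order of operations in the snippet implies. The guard against near-zero standard deviations is an added safety measure, not something the README documents. The arrays below are synthetic stand-ins for the saved artifacts.

```python
import numpy as np


def prepare_inputs(X, mean, std, valid, lookback=60, n_features=49, eps=1e-8):
    """Flatten, mask, and standardize windows of shape (n, lookback, n_features)."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 3 or X.shape[1:] != (lookback, n_features):
        raise ValueError(
            f"expected shape (n, {lookback}, {n_features}), got {X.shape}"
        )
    X_flat = X.reshape(X.shape[0], -1)[:, valid]
    if X_flat.shape[1] != mean.shape[0] or X_flat.shape[1] != std.shape[0]:
        raise ValueError("mean/std length does not match the masked columns")
    # Avoid division by zero for columns that were constant in training.
    safe_std = np.where(std < eps, 1.0, std)
    return (X_flat - mean) / safe_std


# Synthetic stand-ins for the saved artifacts (made-up values)
rng = np.random.default_rng(0)
valid = np.ones(60 * 49, dtype=bool)
valid[:10] = False              # pretend 10 columns were dropped
mean = np.zeros(valid.sum())
std = np.ones(valid.sum())
std[0] = 0.0                    # a constant column
X = rng.normal(size=(4, 60, 49))

X_norm = prepare_inputs(X, mean, std, valid)
print(X_norm.shape)  # (4, 2930)
```

The result can be passed directly to `model.predict` and `model.predict_proba`.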

## Files

| File | Description |
|------|-------------|
| `model.pkl` | Trained LogisticRegression classifier |
| `feature_mean.npy` | Per-feature means for standardization |
| `feature_std.npy` | Per-feature standard deviations |
| `valid_cols.npy` | Boolean mask of valid (finite) feature columns |
| `metrics.json` | Evaluation results |

## Limitations

- **Market Efficiency**: At a 15-minute horizon prices are close to a random walk, so the edge is small
- **No Costs**: Evaluation ignores fees, slippage, and spread
- **Historical Data**: Trained on 2017-2020 data; may not generalize to current market regimes
- **Simple Models**: Only Logistic Regression and Random Forest baselines were trained; deep learning (Conv-LSTM, TCN, Transformer) may improve results
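
A back-of-the-envelope calculation shows why ignoring costs matters at this accuracy level. Assume a strategy that always trades in the predicted direction, that winning and losing moves have the same average absolute size `m`, and that each trade pays a round-trip cost `c`. The expected net return per trade is then `(2p - 1) * m - c` for accuracy `p`, and the break-even accuracy is `0.5 + c / (2m)`. The move size and cost below are hypothetical placeholders, not measured values; only the 53.1% accuracy comes from the Performance table.

```python
def expected_return_per_trade(accuracy, avg_abs_move, round_trip_cost):
    """Expected net return per trade under the symmetric-move assumption."""
    return (2.0 * accuracy - 1.0) * avg_abs_move - round_trip_cost


def break_even_accuracy(avg_abs_move, round_trip_cost):
    """Accuracy at which the expected net return is exactly zero."""
    return 0.5 + round_trip_cost / (2.0 * avg_abs_move)


accuracy = 0.531         # test accuracy from the Performance table
avg_abs_move = 0.002     # hypothetical: 0.2% average 15-minute move
round_trip_cost = 0.001  # hypothetical: 0.1% fees + slippage per round trip

net = expected_return_per_trade(accuracy, avg_abs_move, round_trip_cost)
needed = break_even_accuracy(avg_abs_move, round_trip_cost)
print(f"net per trade: {net:.6f}, break-even accuracy: {needed:.2%}")
```

Under these placeholder numbers the expected net return is negative, and the accuracy required to break even is far above what the model achieves.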

## Future Work

1. **Deep Learning**: Conv-LSTM, Temporal CNN, or Transformer architectures
2. **More Data**: Order book, funding rates, on-chain metrics, sentiment
3. **Multi-Scale**: Combine 1-min, 5-min, 15-min, 1-hour features
4. **Regime Detection**: Train separate models for bull/bear/sideways markets
5. **Cost-Aware Evaluation**: Incorporate transaction costs into the evaluation metric
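
As a starting point for the multi-scale idea, 1-minute candles can be aggregated into coarser bars before computing features on each scale. The sketch below is one possible approach, not part of the current pipeline: `resample_ohlcv` is a hypothetical helper, and it assumes the input is a gap-free array with columns ordered as open, high, low, close, volume.

```python
import numpy as np


def resample_ohlcv(ohlcv, factor):
    """Aggregate 1-minute candles into `factor`-minute candles.

    ohlcv: array of shape (n, 5) with columns open, high, low, close, volume.
    Trailing rows that do not fill a complete bar are dropped.
    """
    ohlcv = np.asarray(ohlcv, dtype=float)
    n_bars = len(ohlcv) // factor
    bars = ohlcv[: n_bars * factor].reshape(n_bars, factor, 5)
    return np.column_stack([
        bars[:, 0, 0],              # open: first open in the bar
        bars[:, :, 1].max(axis=1),  # high: highest high
        bars[:, :, 2].min(axis=1),  # low: lowest low
        bars[:, -1, 3],             # close: last close in the bar
        bars[:, :, 4].sum(axis=1),  # volume: total volume
    ])


# Six made-up 1-minute candles aggregated into two 3-minute candles
candles_1m = np.array([
    [1.0, 2.0, 0.5, 1.5, 10],
    [1.5, 3.0, 1.0, 2.0, 20],
    [2.0, 2.5, 1.8, 2.2, 30],
    [2.2, 2.4, 2.0, 2.1, 5],
    [2.1, 2.3, 1.9, 2.0, 5],
    [2.0, 2.6, 2.0, 2.5, 10],
])
candles_3m = resample_ohlcv(candles_1m, factor=3)
print(candles_3m)
```

With real data, timestamps should be checked for gaps first, since a fixed-size reshape silently merges candles across missing minutes.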

## License

MIT License