---
license: cc-by-4.0
library_name: scikit-learn
tags:
- hackathon
- tabular-classification
- archetype
- alphahack
pipeline_tag: tabular-classification
---
# AlphaHack Model 1 — Event Regime Classifier
A `GradientBoostingClassifier` (with a logistic-regression fallback,
LabelEncoder, and StandardScaler bundled in the same pickle) that
predicts which **winner archetype** dominates a given hackathon event.
Companion: [Model 2 — winner predictor](https://huggingface.co/xenosaac/alphahack-models/tree/main/model2-winner-predictor)
## Model description
Given event-level features (host type, judge composition, prize pool,
duration, theme keywords, criteria text, etc.), the classifier predicts
which of 5 archetypes describes the event's prior winners:
| Archetype | Train rows |
|---|---|
| `tech_showoff` | 1,051 |
| `empathy_play` | 825 |
| `scrappy_utility` | 761 |
| `hype_surfer` | 717 |
| `narrative_master` | 658 |
Use the model to inform **idea-generation strategy** before a hackathon
— not to rank individual project submissions (use Model 2 for that).
## Training data
The model was fit on **4,012 event-level archetype labels** derived from
**23,785 winning-project rows** across **101,682 total projects** in
the [`xenosaac/alphahack-devpost`](https://huggingface.co/datasets/xenosaac/alphahack-devpost)
dataset. Source feature parquet:
`data/merged/alphahack_features_v7.parquet` (151 columns post-PII-scrub).
## Metrics
5-fold cross-validation, GroupKFold by `event_id` (no event appears in
both train and test of the same fold).
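The grouped protocol described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's training code: the feature matrix, labels, hypothetical event ids, and the `n_estimators=20` setting are all assumptions chosen to keep the example small, and several rows share an event id only to make the grouping visible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import top_k_accuracy_score
from sklearn.model_selection import GroupKFold


def grouped_cv_scores(X, y, groups, n_splits=5, seed=0):
    """Return per-fold (top-1, top-3) accuracy arrays under GroupKFold."""
    top1, top3 = [], []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups):
        # No group (event) may appear on both sides of a split.
        assert not set(groups[train_idx]) & set(groups[test_idx])
        # Hyperparameters here are illustrative, not the card's settings.
        model = GradientBoostingClassifier(n_estimators=20, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])
        # Columns of proba follow model.classes_, so pass that as labels.
        top1.append(top_k_accuracy_score(y[test_idx], proba, k=1, labels=model.classes_))
        top3.append(top_k_accuracy_score(y[test_idx], proba, k=3, labels=model.classes_))
    return np.array(top1), np.array(top3)


# Synthetic stand-in data: 300 rows, 6 features, 5 classes, 60 event ids.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = rng.integers(0, 5, size=300)
groups = rng.integers(0, 60, size=300)  # hypothetical event ids

top1, top3 = grouped_cv_scores(X, y, groups)
print(f"top-1: {top1.mean():.3f} ± {top1.std():.3f}")
print(f"top-3: {top3.mean():.3f} ± {top3.std():.3f}")
```

Reporting the mean and standard deviation across folds mirrors the `±` notation used in the metrics table.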
| Metric | GBC | LR baseline | Majority baseline |
|---|---|---|---|
| Top-1 accuracy | **0.381 ± 0.012** | 0.275 | 0.262 |
| Top-3 accuracy | **0.804 ± 0.010** | — | — |
| Train accuracy | 0.783 | — | — |
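GBC delivers a **1.45× lift** over the majority baseline on top-1
accuracy (0.381 vs. 0.262). The practically more useful metric is top-3
accuracy (0.804), which is what feeds the strategy engine's portfolio
prompting. The table reports no top-3 baseline, so the **2.92×** figure
is 0.804 divided by the LR baseline's top-1 accuracy (0.275): it compares
a top-3 score against a top-1 score and should not be read as a
like-for-like lift.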
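The ratios quoted here can be rechecked from the numbers on this card. The sketch below also computes a frequency-based top-3 baseline (always guessing the three largest archetypes). That baseline is an assumption added for comparison: it is derived from the label distribution table, not measured under the GroupKFold protocol, and it does not appear in the reported metrics.

```python
# Label counts and accuracies copied from the tables on this card.
train_rows = {
    "tech_showoff": 1051,
    "empathy_play": 825,
    "scrappy_utility": 761,
    "hype_surfer": 717,
    "narrative_master": 658,
}
total = sum(train_rows.values())

gbc_top1, gbc_top3 = 0.381, 0.804
lr_top1 = 0.275

counts = sorted(train_rows.values(), reverse=True)
majority_top1 = counts[0] / total         # always guess the largest class
frequency_top3 = sum(counts[:3]) / total  # always guess the three largest (assumed baseline)

print(f"total labels:                 {total}")
print(f"majority top-1 baseline:      {majority_top1:.3f}")
print(f"GBC top-1 / majority top-1:   {gbc_top1 / majority_top1:.2f}x")
print(f"GBC top-3 / LR top-1:         {gbc_top3 / lr_top1:.2f}x")
print(f"frequency top-3 baseline:     {frequency_top3:.3f}")
print(f"GBC top-3 / frequency top-3:  {gbc_top3 / frequency_top3:.2f}x")
```

The majority top-1 baseline reproduces the 0.262 in the table, and the 2.92 ratio corresponds to top-3 accuracy divided by the LR top-1 score.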
## Top features (GBC importance)
| Feature | Importance |
|---|---|
| `A06_total_submissions` | 0.291 |
| `A11_theme_keywords` | 0.075 |
| `A09_num_prize_categories` | 0.068 |
| `A05_prize_pool_usd` | 0.064 |
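A ranking like the one above can be read from scikit-learn's `feature_importances_` attribute. The sketch below is one way to do that, assuming the importances come from the fitted classifier and are paired with the column list in order. The helper `top_features`, the synthetic data, and the `feature_i` names are hypothetical stand-ins so the example runs without the pickle.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def top_features(model, feature_cols, k=4):
    """Return the k (name, importance) pairs with the largest importance."""
    importances = np.asarray(model.feature_importances_)
    order = np.argsort(importances)[::-1][:k]
    return [(feature_cols[i], float(importances[i])) for i in order]


# Stand-in model on synthetic data; with the real bundle you would pass
# bundle["gbc"] and bundle["feature_cols"] instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)              # only column 2 carries signal
cols = [f"feature_{i}" for i in range(5)]  # hypothetical column names
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

for name, score in top_features(model, cols, k=3):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances sum to 1 across all input columns, so the four values in the table account for roughly half of the total.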
## Loading the model
```python
import joblib
import numpy as np

# The pickle holds the models and preprocessing objects under named keys.
bundle = joblib.load("regime_classifier.pkl")
gbc = bundle["gbc"]                    # primary classifier
lr = bundle["lr"]                      # logistic-regression baseline
le = bundle["le"]                      # LabelEncoder for archetype names
scaler = bundle["scaler"]              # StandardScaler (event features)
feature_cols = bundle["feature_cols"]  # list of 32 input column names

# Score a new event
event_features_dict = {...}  # placeholder: build from your crawled event
# Order the values exactly as feature_cols; resulting shape is (1, n_features).
X_raw = np.array([[event_features_dict[c] for c in feature_cols]])
X = scaler.transform(X_raw)
proba = gbc.predict_proba(X)[0]       # one probability per entry of gbc.classes_
top3_idx = proba.argsort()[::-1][:3]  # columns of the three highest probabilities
# Map probability columns to encoded labels, then to archetype names.
top3_archetypes = le.inverse_transform(gbc.classes_[top3_idx])
print(list(zip(top3_archetypes, proba[top3_idx])))
```
## Reproducing this artifact
The full training pipeline is in the open-source companion repo:
```bash
pip install hackalpha
hackalpha train-model1 \
  --features data/merged/alphahack_features_v7.parquet \
  --model-out data/models/regime_classifier.pkl \
  --metrics-out data/research/model1_training_metrics.json \
  --report-out data/research/model1_archetype_report.json
```
The training metrics (`model1_training_metrics.json`) and label
distribution (`model1_archetype_report.json`) are included in this
HF directory.
## Known failure modes
- Top-1 accuracy of 38% is well below human-expert level. The
product-relevant metric is top-3 (80%), used to prompt a multi-idea
portfolio rather than commit to one bet.
- Only three test years (2024–2026) are available, which is too few for
tight confidence intervals on results scored against Model 1's
archetype labels.
- The label assignment is **heuristic-based** (a project is "tech_showoff"
if its rubric scores match a tech-showoff signature), not adjudicated
by humans. Label noise is real.
## Limitations
- Trained only on Devpost-hosted, English-language hackathons.
- Performance on in-person and non-Devpost events is unknown.
- Companion model 2 had a **prospective trial in April 2026 that did
not produce a prize**. Use both models as research artifacts, not
as a guaranteed winning recipe.
## License
CC BY 4.0.