---
license: cc-by-4.0
library_name: scikit-learn
tags:
- hackathon
- tabular-classification
- archetype
- alphahack
pipeline_tag: tabular-classification
---

# AlphaHack Model 1 — Event Regime Classifier

A `GradientBoostingClassifier` that predicts which **winner archetype** dominates a given hackathon event. The same pickle also bundles a logistic-regression fallback, a `LabelEncoder`, and a `StandardScaler`.

Companion: [Model 2 — winner predictor](https://huggingface.co/xenosaac/alphahack-models/tree/main/model2-winner-predictor)

## Model description

Given event-level features (host type, judge composition, prize pool, duration, theme keywords, criteria text, etc.), the classifier predicts which of 5 archetypes describes the event's prior winners:

| Archetype | Train rows |
|---|---|
| `tech_showoff` | 1,051 |
| `empathy_play` | 825 |
| `scrappy_utility` | 761 |
| `hype_surfer` | 717 |
| `narrative_master` | 658 |

Use the model to inform **idea-generation strategy** before a hackathon, not to rank individual project submissions (use Model 2 for that).

## Training data

The model was fit on **4,012 event-level archetype labels** derived from **23,785 winning-project rows** across **101,682 total projects** in the [`xenosaac/alphahack-devpost`](https://huggingface.co/datasets/xenosaac/alphahack-devpost) dataset.

Source feature parquet: `data/merged/alphahack_features_v7.parquet` (151 columns post-PII-scrub).

## Metrics

5-fold cross-validation with `GroupKFold` grouped by `event_id` (no event appears in both the train and test split of the same fold).

| Metric | GBC | LR baseline | Majority baseline |
|---|---|---|---|
| Top-1 accuracy | **0.381 ± 0.012** | 0.275 | 0.262 |
| Top-3 accuracy | **0.804 ± 0.010** | — | — |
| Train accuracy | 0.783 | — | — |

GBC delivers a **1.45× lift** over the majority baseline on top-1 accuracy (0.381 vs. 0.262). Top-3 accuracy is the practically more useful metric, because it is what feeds the strategy engine's portfolio prompting. At 0.804 it is **2.92×** the LR baseline's top-1 accuracy (0.275) and 3.07× the majority baseline's top-1 accuracy (0.262). No top-3 baseline was reported, so both ratios compare a top-3 figure against a top-1 baseline.

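The lift figures in the Metrics section can be re-derived from the numbers in this card. The sketch below uses only the class counts and accuracies from the tables above. One assumption is not in the card: the like-for-like top-3 majority baseline is taken to be a predictor that always guesses the three largest classes, and it is computed on the full label set rather than per fold.

```python
# Class counts and accuracies copied from the tables in this card.
train_rows = {
    "tech_showoff": 1051,
    "empathy_play": 825,
    "scrappy_utility": 761,
    "hype_surfer": 717,
    "narrative_master": 658,
}
gbc_top1, gbc_top3, lr_top1 = 0.381, 0.804, 0.275

total = sum(train_rows.values())
shares = sorted((n / total for n in train_rows.values()), reverse=True)

# Always guessing the largest class gives the top-1 majority baseline.
majority_top1 = shares[0]
# Assumption (not reported in the card): a top-3 majority baseline that
# always guesses the three largest classes.
majority_top3 = sum(shares[:3])

print(f"labels: {total}")
print(f"majority top-1: {majority_top1:.3f}")
print(f"majority top-3 (assumed definition): {majority_top3:.3f}")
print(f"GBC top-1 / majority top-1: {gbc_top1 / majority_top1:.2f}x")
print(f"GBC top-3 / LR top-1: {gbc_top3 / lr_top1:.2f}x")
print(f"GBC top-3 / majority top-3: {gbc_top3 / majority_top3:.2f}x")
```

Comparing top-3 accuracy against a top-3 baseline gives a smaller ratio than comparing it against a top-1 baseline, so it helps to state which baseline a lift figure refers to.
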
## Top features (GBC importance)

| Feature | Importance |
|---|---|
| `A06_total_submissions` | 0.291 |
| `A11_theme_keywords` | 0.075 |
| `A09_num_prize_categories` | 0.068 |
| `A05_prize_pool_usd` | 0.064 |

## Loading the model

```python
import joblib
import numpy as np

# The pickle is a dict that bundles the models and preprocessing objects.
bundle = joblib.load("regime_classifier.pkl")
gbc = bundle["gbc"]                    # primary classifier
lr = bundle["lr"]                      # logistic-regression baseline
le = bundle["le"]                      # LabelEncoder for archetype names
scaler = bundle["scaler"]              # StandardScaler (event features)
feature_cols = bundle["feature_cols"]  # list of 32 input column names

# Score a new event.
# Placeholder: replace {...} with a dict mapping every name in feature_cols
# to its value, built from your crawled event.
event_features_dict = {...}
X_raw = np.array([[event_features_dict[c] for c in feature_cols]])
X = scaler.transform(X_raw)
proba = gbc.predict_proba(X)[0]

# predict_proba columns follow gbc.classes_, so map indices through it
# before decoding back to archetype names.
top3_idx = proba.argsort()[::-1][:3]
top3_archetypes = le.inverse_transform(gbc.classes_[top3_idx])
print(list(zip(top3_archetypes, proba[top3_idx])))
```

## Reproducing this artifact

The full training pipeline is in the open-source companion repo:

```bash
pip install hackalpha
hackalpha train-model1 \
  --features data/merged/alphahack_features_v7.parquet \
  --model-out data/models/regime_classifier.pkl \
  --metrics-out data/research/model1_training_metrics.json \
  --report-out data/research/model1_archetype_report.json
```

The training metrics (`model1_training_metrics.json`) and label distribution (`model1_archetype_report.json`) are included in this HF directory.

## Known failure modes

- Top-1 accuracy of 38% is well below human-expert level. The product-relevant metric is top-3 accuracy (80%), which is used to prompt a multi-idea portfolio rather than commit to one bet.
- Only 3 test years (2024–2026) are available, which is not enough for tight confidence intervals on Model 1's archetype labels.
- The label assignment is **heuristic-based** (a project is `tech_showoff` if its rubric scores match a tech-showoff signature), not adjudicated by humans. Label noise is real.

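To make the top-1 versus top-3 distinction in the first failure mode concrete, here is a minimal pure-Python sketch of how top-k accuracy can be computed from per-class probabilities. The helper `top_k_accuracy` and the toy probabilities are invented for illustration. They are not outputs of this model, and this is not necessarily how the training pipeline computes its metrics.

```python
def top_k_accuracy(probabilities, true_labels, k):
    """Fraction of rows whose true label is among the k most probable classes."""
    hits = 0
    for row, truth in zip(probabilities, true_labels):
        # Rank class names from most to least probable.
        ranked = sorted(row, key=row.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy predictions for three hypothetical events (invented numbers).
toy_probabilities = [
    {"tech_showoff": 0.40, "empathy_play": 0.25, "scrappy_utility": 0.20,
     "hype_surfer": 0.10, "narrative_master": 0.05},
    {"tech_showoff": 0.30, "empathy_play": 0.35, "scrappy_utility": 0.15,
     "hype_surfer": 0.12, "narrative_master": 0.08},
    {"tech_showoff": 0.10, "empathy_play": 0.15, "scrappy_utility": 0.20,
     "hype_surfer": 0.30, "narrative_master": 0.25},
]
toy_truth = ["tech_showoff", "scrappy_utility", "empathy_play"]

top1 = top_k_accuracy(toy_probabilities, toy_truth, k=1)
top3 = top_k_accuracy(toy_probabilities, toy_truth, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

In the toy data, the second event is a top-1 miss but a top-3 hit. That is the case a multi-idea portfolio is meant to cover.
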
## Limitations

- Trained only on Devpost-hosted, English-language hackathons.
- In-person and non-Devpost events: performance unknown.
- Companion model 2 had a **prospective trial in April 2026 that did not produce a prize**. Use both models as research artifacts, not as a guaranteed winning recipe.

## License

CC BY 4.0.