---
license: cc-by-4.0
library_name: scikit-learn
tags:
- tfidf
- text-embedding
- alphahack
---

# AlphaHack Text Embedder

A `hackalpha.research.text_embeddings.TextEmbeddingFeatures` instance:
TF-IDF + `TruncatedSVD` (20 components) fit on **101,682 Devpost
project descriptions** (the `full_description` column of the
[AlphaHack dataset](https://huggingface.co/datasets/xenosaac/alphahack-devpost)).

This embedder produces the `emb_0` … `emb_19` columns in the released
parquet. **Neither Model 1 nor Model 2 uses `emb_*` features**; they
are provided for exploratory analysis on new project descriptions only.

## Loading

The `.pkl` file serializes a project-specific class (`TextEmbeddingFeatures`),
so you **must** install the companion package before loading it:

```bash
pip install hackalpha
```

```python
import joblib
import pandas as pd

embedder = joblib.load("text_embedder.pkl")

# Embed new project descriptions (transform only; do NOT call .fit_*)
texts = pd.Series([
    "An AI-powered code review tool that integrates with GitHub.",
    "A wearable that detects falls and alerts emergency contacts.",
])
emb_df = embedder.transform(texts)
# emb_df has columns emb_0, emb_1, ..., emb_19
```

**Important**: do not call `fit()` or `fit_transform()` on new data.
Doing so re-fits the vocabulary and produces embeddings that are
not comparable with the released parquet.

## Training procedure

- Vectorizer: `sklearn.feature_extraction.text.TfidfVectorizer`
- Reducer: `sklearn.decomposition.TruncatedSVD(n_components=20)`
- Corpus: 101,682 Devpost `full_description` strings (all projects in
  the released dataset)
- Vocabulary: fit once on the full corpus; held frozen at `transform()`
  time

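The procedure above can be approximated with plain scikit-learn. The following is a minimal sketch, not the actual `TextEmbeddingFeatures` implementation, whose internals are not shown here. The toy corpus, the default `TfidfVectorizer()` settings, the `random_state`, the helper name `embed`, and the reduced component count are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in for the Devpost corpus (hypothetical strings).
corpus = pd.Series([
    "An AI-powered code review tool that integrates with GitHub.",
    "A wearable that detects falls and alerts emergency contacts.",
    "A mobile app that matches volunteers with local food banks.",
    "A browser extension that summarizes long research papers.",
    "A chatbot that helps students practice coding interviews.",
])

# The released embedder uses 20 components; a five-document toy corpus needs fewer.
N_COMPONENTS = 3

# Fit once on the full corpus. This fixes the vocabulary, IDF weights, and SVD basis.
pipeline = make_pipeline(
    TfidfVectorizer(),  # default settings are an assumption
    TruncatedSVD(n_components=N_COMPONENTS, random_state=0),
)
pipeline.fit(corpus)


def embed(texts: pd.Series) -> pd.DataFrame:
    """Project texts onto the already-fitted basis (transform only, no re-fit)."""
    values = pipeline.transform(texts)
    columns = [f"emb_{i}" for i in range(values.shape[1])]
    return pd.DataFrame(values, columns=columns, index=texts.index)


new_texts = pd.Series(["A code review chatbot for students."])
emb_df = embed(new_texts)
print(emb_df.shape)  # (1, 3)
```

Because `embed` only calls `transform()`, new texts are projected onto the basis learned from the original corpus, which is what keeps their coordinates comparable with previously computed embeddings.
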
## Why ship this if it's not used by the released models?

Two reasons:

1. **Reproducibility of the released parquet.** The `emb_*` columns in
   `alphahack_features_v7.parquet` were generated by this exact
   embedder. Without it, those columns can't be regenerated.
2. **Future research.** Anyone exploring "does adding text-embedding
   features improve the prize predictor?" needs to start from the same
   embedding basis.

## Limitations

- Vocabulary is frozen at the 2026 corpus snapshot. Out-of-vocabulary
  terms in new project text are silently ignored (the default behavior of
  `TfidfVectorizer.transform`).
- 20 SVD components is a strong dimensionality cap; finer-grained text
  semantics will be lost.
- Trained only on English Devpost text.

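The out-of-vocabulary behavior is easy to see with a bare `TfidfVectorizer`. This is a small sketch on a made-up two-document corpus with default vectorizer settings (both assumptions), not the released embedder itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()  # default settings are an assumption
vectorizer.fit([
    "fall detection wearable",
    "code review tool",
])

# "blockchain" and "oracle" were never seen during fit, so they map to no column.
oov_only = vectorizer.transform(["blockchain oracle"])
mixed = vectorizer.transform(["blockchain code review"])

print(oov_only.nnz)  # 0: every term was dropped, the row is all zeros
print(mixed.nnz)     # 2: only "code" and "review" contribute
```

No warning or error is raised in either case. Since `TruncatedSVD.transform` is a linear projection without centering, an all-zero TF-IDF row also yields an all-zero embedding, so counting such rows can be a quick sanity check on new text.
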
## License

CC BY 4.0.