---
license: cc-by-4.0
library_name: scikit-learn
tags:
- tfidf
- text-embedding
- alphahack
---

# AlphaHack Text Embedder

A `hackalpha.research.text_embeddings.TextEmbeddingFeatures` instance: TF-IDF + `TruncatedSVD` (20 components) fit on **101,682 Devpost project descriptions** (the `full_description` column of the [AlphaHack dataset](https://huggingface.co/datasets/xenosaac/alphahack-devpost)).

This embedder produces the `emb_0` … `emb_19` columns in the released parquet. **Neither Model 1 nor Model 2 uses `emb_*` features**; they are provided for exploratory analysis on new project descriptions only.

## Loading

This pkl serializes a project-specific class (`TextEmbeddingFeatures`). You **must** install the companion package before loading:

```bash
pip install hackalpha
```

```python
import joblib
import pandas as pd

embedder = joblib.load("text_embedder.pkl")

# Embed new project descriptions (transform only; do NOT call .fit_*)
texts = pd.Series([
    "An AI-powered code review tool that integrates with GitHub.",
    "A wearable that detects falls and alerts emergency contacts.",
])
emb_df = embedder.transform(texts)
# emb_df has columns emb_0, emb_1, ..., emb_19
```

**Important**: do not call `fit()` or `fit_transform()` on new data. Doing so re-fits the vocabulary and produces embeddings that are incomparable with the released parquet.

## Training procedure

- Vectorizer: `sklearn.feature_extraction.text.TfidfVectorizer`
- Reducer: `sklearn.decomposition.TruncatedSVD(n_components=20)`
- Corpus: 101,682 Devpost `full_description` strings (all projects in the released dataset)
- Vocabulary: fit once on the full corpus; held frozen at `transform()` time

## Why ship this if it's not used by the released models?

Two reasons:

1. **Reproducibility of the released parquet.** The `emb_*` columns in `alphahack_features_v7.parquet` were generated by this exact embedder. Without it, those columns can't be regenerated.
2. **Future research.** Anyone exploring "does adding text-embedding features improve the prize predictor?" needs to start from the same embedding basis.

## Limitations

- Vocabulary is frozen at the 2026 corpus snapshot. Out-of-vocabulary terms in new project text are silently ignored (TF-IDF default).
- 20 SVD components is a strong dimensionality cap; finer-grained text semantics will be lost.
- Trained only on English Devpost text.

## License

CC BY 4.0.
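
## Appendix: illustrative TF-IDF + SVD sketch

The fit-once, transform-only behaviour described under "Training procedure" and "Limitations" can be illustrated with plain scikit-learn. This is a minimal sketch, not the internals of `TextEmbeddingFeatures`: the toy corpus, the `embed` helper, the default `TfidfVectorizer` settings, `n_components=3`, and `random_state=0` are all assumptions made for the example (the released embedder uses 20 components on the full corpus, and its vectorizer hyperparameters are not stated here).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the Devpost descriptions (hypothetical data).
corpus = [
    "an ai powered code review tool for github",
    "a wearable device that detects falls",
    "a mobile app for tracking carbon emissions",
    "a chatbot that answers student questions",
    "a game that teaches kids to code",
    "a dashboard for monitoring air quality sensors",
]

# Fit once: the vocabulary and the SVD basis are frozen after this point.
vectorizer = TfidfVectorizer()  # default settings, assumed for illustration
svd = TruncatedSVD(n_components=3, random_state=0)  # 20 in the released embedder
svd.fit(vectorizer.fit_transform(corpus))


def embed(texts):
    """Transform only: reuse the fitted vocabulary and SVD basis."""
    return svd.transform(vectorizer.transform(texts))


new_texts = [
    "a wearable that detects falls",  # overlaps with the fitted vocabulary
    "zzzz qqqq",                      # entirely out-of-vocabulary
]
embeddings = embed(new_texts)
columns = [f"emb_{i}" for i in range(embeddings.shape[1])]

print(columns)
print(embeddings)
```

In this sketch, the second text contains no known terms, so its TF-IDF row is empty and its embedding is the zero vector. That is the "silently ignored" behaviour noted under Limitations, and it is also why re-fitting on new data would change the basis and break comparability with previously computed `emb_*` values.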