license: cc-by-4.0
library_name: scikit-learn
tags:
  - tfidf
  - text-embedding
  - alphahack

AlphaHack Text Embedder

A serialized hackalpha.research.text_embeddings.TextEmbeddingFeatures instance: a TF-IDF vectorizer followed by TruncatedSVD (20 components), fit on 101,682 Devpost project descriptions (the full_description column of the AlphaHack dataset).

This embedder produces the emb_0 … emb_19 columns in the released parquet. Neither Model 1 nor Model 2 uses emb_* features — they are provided for exploratory analysis on new project descriptions only.

Loading

text_embedder.pkl is a pickle of a project-specific class (TextEmbeddingFeatures), so the companion package must be installed and importable before the file can be loaded:

pip install hackalpha

import joblib
import pandas as pd

embedder = joblib.load("text_embedder.pkl")

# Embed new project descriptions (transform only — do NOT call .fit_*)
texts = pd.Series([
    "An AI-powered code review tool that integrates with GitHub.",
    "A wearable that detects falls and alerts emergency contacts.",
])
emb_df = embedder.transform(texts)
# emb_df has columns emb_0, emb_1, ..., emb_19

Important: do not call fit() or fit_transform() on new data — that re-fits the vocabulary and produces embeddings that are incomparable with the released parquet.
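The reason re-fitting breaks comparability is that the columns of the TF-IDF matrix are defined by the fitted vocabulary. The sketch below illustrates this with plain scikit-learn rather than the TextEmbeddingFeatures class itself (whose internals are not shown in this card); both corpora are toy stand-ins, and default TfidfVectorizer settings are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the corpus the released embedder was fit on.
released_corpus = [
    "a wearable that detects falls",
    "a tool that reviews code",
]
# Toy stand-in for new project descriptions.
new_texts = ["a wearable drone"]

# "Frozen" vectorizer: fit once, then only transform() new text.
frozen = TfidfVectorizer().fit(released_corpus)
# What happens if fit() is called on the new data instead.
refit = TfidfVectorizer().fit(new_texts)

frozen_matrix = frozen.transform(new_texts)
refit_matrix = refit.transform(new_texts)

# The two matrices live in different feature spaces: different widths,
# and the same word maps to a different column index.
print(frozen_matrix.shape, refit_matrix.shape)
print(frozen.vocabulary_["wearable"], refit.vocabulary_["wearable"])
```

Any SVD fit on top of the re-fit matrix would inherit this mismatch, so its emb_* outputs could not be lined up with the released ones.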

Training procedure

Vectorizer: sklearn.feature_extraction.text.TfidfVectorizer
Reducer: sklearn.decomposition.TruncatedSVD(n_components=20)
Corpus: 101,682 Devpost full_description strings (all projects in the released dataset)
Vocabulary: fit once on the full corpus; held frozen at transform() time
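The procedure above can be approximated with a plain scikit-learn pipeline. This is a minimal sketch, not the actual TextEmbeddingFeatures implementation: the TfidfVectorizer hyperparameters are not listed in this card, so defaults are assumed, the corpus is a toy stand-in, and n_components is reduced to 3 because the toy corpus is tiny (the released embedder uses 20).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; the real one is 101,682 Devpost descriptions.
corpus = [
    "An AI-powered code review tool that integrates with GitHub.",
    "A wearable that detects falls and alerts emergency contacts.",
    "A mobile app that helps students find study groups on campus.",
    "A browser extension that summarizes long articles with AI.",
    "A drone that maps wildfire spread and alerts first responders.",
]

# TF-IDF followed by TruncatedSVD. Default vectorizer settings and
# n_components=3 are assumptions made for this illustration only.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
)

# Fit once on the full corpus ...
pipeline.fit(corpus)

# ... then only transform() new text, keeping the vocabulary frozen.
new_texts = ["A wearable AI tool that alerts students."]
embedding = pipeline.transform(new_texts)
print(embedding.shape)  # (1, 3)
```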

Why ship this if it's not used by the released models?

Two reasons:

Reproducibility of the released parquet. The emb_* columns in alphahack_features_v7.parquet were generated by this exact embedder. Without it, those columns can't be regenerated.
Future research. Anyone exploring "does adding text-embedding features improve the prize predictor?" needs to start from the same embedding basis.

Limitations

Vocabulary is frozen at the 2026 corpus snapshot. Out-of-vocabulary terms in new project text are silently ignored (the default TfidfVectorizer behavior), so they contribute nothing to the embedding.
20 SVD components is a strong dimensionality cap; finer-grained text semantics will be lost.
Trained only on English Devpost text.
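The out-of-vocabulary limitation is easy to see with a bare TfidfVectorizer. The sketch below uses a toy corpus and default settings (both assumptions, not the released configuration): text made up entirely of unseen terms produces an all-zero TF-IDF row without any warning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the frozen training corpus.
vectorizer = TfidfVectorizer()
vectorizer.fit([
    "a wearable that detects falls",
    "a tool that reviews code",
])

# Every term here was seen during fit.
known = vectorizer.transform(["wearable code tool"])
# None of these terms was seen during fit.
unknown = vectorizer.transform(["quantum blockchain metaverse"])

# nnz counts the non-zero TF-IDF entries in each sparse row.
print(known.nnz)    # 3
print(unknown.nnz)  # 0
```

A quick check of this kind on new descriptions (for example, counting rows with no non-zero entries before the SVD step) can flag inputs whose embeddings carry little or no signal.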

License

CC BY 4.0.