
	---
	license: cc-by-4.0
	library_name: scikit-learn
	tags:
	- tfidf
	- text-embedding
	- alphahack
	---

	# AlphaHack Text Embedder

	A `hackalpha.research.text_embeddings.TextEmbeddingFeatures` instance:
	TF-IDF + `TruncatedSVD` (20 components) fit on **101,682 Devpost
	project descriptions** (the `full_description` column of the
	[AlphaHack dataset](https://huggingface.co/datasets/xenosaac/alphahack-devpost)).

	This embedder produces the `emb_0` … `emb_19` columns in the released
	parquet. **Neither Model 1 nor Model 2 uses the `emb_*` features**; they
	are provided only for exploratory analysis of new project descriptions.

	## Loading

	`text_embedder.pkl` serializes a project-specific class
	(`TextEmbeddingFeatures`), so the companion package must be installed
	before loading:

	```bash
	pip install hackalpha
	```

	```python
	import joblib
	import pandas as pd

	embedder = joblib.load("text_embedder.pkl")

	# Embed new project descriptions (transform only — do NOT call .fit_*)
	texts = pd.Series([
	    "An AI-powered code review tool that integrates with GitHub.",
	    "A wearable that detects falls and alerts emergency contacts.",
	])
	emb_df = embedder.transform(texts)
	# emb_df has columns emb_0, emb_1, ..., emb_19
	```

	Important: do not call `fit()` or `fit_transform()` on new data —
	that re-fits the vocabulary and produces embeddings that are
	incomparable with the released parquet.
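
Because transform-only embeddings share a basis with the released parquet, one possible exploratory use is to look up the released projects closest to a new description. The sketch below is an illustration, not part of `hackalpha`: the helper `nearest_released` is hypothetical, cosine similarity is an assumed choice of metric, and synthetic arrays stand in for the real `emb_0` … `emb_19` values.

```python
import numpy as np


def nearest_released(new_emb, released_emb, k=3):
    """Return indices of the k released rows most cosine-similar to each new row."""
    new_emb = np.asarray(new_emb, dtype=float)
    released_emb = np.asarray(released_emb, dtype=float)
    # Normalise rows; the small epsilon guards against all-zero embeddings.
    new_unit = new_emb / (np.linalg.norm(new_emb, axis=1, keepdims=True) + 1e-12)
    rel_unit = released_emb / (np.linalg.norm(released_emb, axis=1, keepdims=True) + 1e-12)
    sims = new_unit @ rel_unit.T  # shape: (n_new, n_released)
    # Sort each row by descending similarity and keep the top k indices.
    return np.argsort(-sims, axis=1)[:, :k]


# Synthetic stand-ins for the 20-dimensional emb_* columns (not real data).
rng = np.random.default_rng(0)
released = rng.normal(size=(100, 20))
# Two "new" rows built as slightly perturbed copies of released rows 5 and 42.
new = released[[5, 42]] + 0.01 * rng.normal(size=(2, 20))

top = nearest_released(new, released, k=3)
print(top[:, 0])  # nearest released row for each query
```

With real data, `released` would come from the `emb_*` columns of the parquet and `new` from `embedder.transform(texts)`.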

	## Training procedure

	- Vectorizer: `sklearn.feature_extraction.text.TfidfVectorizer`
	- Reducer: `sklearn.decomposition.TruncatedSVD(n_components=20)`
	- Corpus: 101,682 Devpost `full_description` strings (all projects in
	the released dataset)
	- Vocabulary: fit once on the full corpus; held frozen at `transform()`
	time
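
The procedure above can be approximated with plain scikit-learn. This is a minimal sketch, not the `TextEmbeddingFeatures` implementation: the default `TfidfVectorizer()` settings, the `random_state`, the toy corpus, and the helper `embed` are all assumptions, and `N_COMPONENTS` is reduced from 20 to 3 so the example runs on six documents.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the 101,682 Devpost descriptions.
corpus = [
    "An app that tracks recycling habits with computer vision.",
    "A chatbot that helps students schedule study sessions.",
    "A browser extension that summarizes long articles.",
    "A wearable sensor that monitors posture and sends alerts.",
    "A game that teaches kids basic programming concepts.",
    "A dashboard that visualizes city air quality data.",
]

N_COMPONENTS = 3  # the released embedder uses 20; kept small for the toy corpus

# Fit both stages once on the training corpus.
vectorizer = TfidfVectorizer()  # assumed default settings
tfidf = vectorizer.fit_transform(corpus)
svd = TruncatedSVD(n_components=N_COMPONENTS, random_state=0)
svd.fit(tfidf)


def embed(texts):
    """Transform-only path: reuses the frozen vocabulary and SVD basis."""
    reduced = svd.transform(vectorizer.transform(texts))
    columns = [f"emb_{i}" for i in range(N_COMPONENTS)]
    return pd.DataFrame(reduced, columns=columns)


emb_df = embed(["A wearable that detects falls and alerts emergency contacts."])
print(emb_df.shape)  # (1, 3)
```

Keeping `fit` and `transform` on separate code paths, as `embed` does, is what keeps new embeddings in the same basis as the ones computed at training time.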

	## Why ship this if it's not used by the released models?

	Two reasons:

	1. Reproducibility of the released parquet. The `emb_*` columns in
	`alphahack_features_v7.parquet` were generated by this exact
	embedder. Without it, those columns can't be regenerated.
	2. Future research. Anyone exploring "does adding text-embedding
	features improve the prize predictor?" needs to start from the same
	embedding basis.
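
A reproducibility check along the lines of reason 1 could look like the sketch below. It assumes the `full_description` text is available in the same frame as the `emb_*` columns, which may require joining the dataset with the features parquet, and that `transform()` returns a DataFrame with `emb_0` … `emb_19` columns as shown in the loading example. The helper `max_embedding_drift` and the stub embedder are hypothetical and exist only so the example runs without the real pickle.

```python
import numpy as np
import pandas as pd


def max_embedding_drift(df, embedder, n_components=20):
    """Largest absolute gap between stored emb_* columns and a fresh transform."""
    columns = [f"emb_{i}" for i in range(n_components)]
    fresh = embedder.transform(df["full_description"])
    stored = df[columns].to_numpy(dtype=float)
    return float(np.max(np.abs(fresh[columns].to_numpy(dtype=float) - stored)))


class _StubEmbedder:
    """Hypothetical stand-in for the real embedder (not the hackalpha class)."""

    def transform(self, texts):
        # Deterministic fake embedding derived from text length.
        lengths = texts.str.len().to_numpy(dtype=float)
        data = {f"emb_{i}": lengths * (i + 1) for i in range(20)}
        return pd.DataFrame(data, index=texts.index)


demo = pd.DataFrame({"full_description": ["short text", "a longer description"]})
demo = pd.concat([demo, _StubEmbedder().transform(demo["full_description"])], axis=1)

drift = max_embedding_drift(demo, _StubEmbedder())
print(drift)  # 0.0 for the stub, since stored and fresh values are identical
```

With the real artifacts, `embedder` would be the object returned by `joblib.load("text_embedder.pkl")`. A small non-zero drift from floating-point differences across library versions is plausible, so comparing against a tolerance is likely more robust than testing for exact equality.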

	## Limitations

	- Vocabulary is frozen at the 2026 corpus snapshot. Out-of-vocabulary
	terms in new project text are silently ignored (the default
	`TfidfVectorizer` behavior at `transform()` time).
	- 20 SVD components is a strong dimensionality cap; finer-grained text
	semantics will be lost.
	- Trained only on English Devpost text.
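
The out-of-vocabulary behavior in the first bullet can be seen with a standalone `TfidfVectorizer`. This sketch uses a toy two-document vocabulary rather than the released embedder, and `vocab_coverage` is a hypothetical helper. Whether the pickled `TextEmbeddingFeatures` object exposes its inner vectorizer is not stated here, so that access is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy vocabulary standing in for the frozen corpus vocabulary.
vectorizer = TfidfVectorizer()
vectorizer.fit(["a wearable that detects falls", "a code review tool for github"])

# Every token here is absent from the fitted vocabulary.
unseen = vectorizer.transform(["quantum blockchain metaverse"])
print(unseen.nnz)  # 0: no non-zero TF-IDF weights, and no warning is raised

analyzer = vectorizer.build_analyzer()


def vocab_coverage(text):
    """Fraction of a text's tokens that the fitted vocabulary recognises."""
    tokens = analyzer(text)
    if not tokens:
        return 0.0
    known = sum(token in vectorizer.vocabulary_ for token in tokens)
    return known / len(tokens)


print(vocab_coverage("wearable blockchain tool"))  # 2 of 3 tokens are known
```

A text with low coverage yields an embedding based on only a few of its terms, so checking coverage before interpreting an embedding may be worthwhile.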

	## License

	CC BY 4.0.