---
license: cc-by-4.0
library_name: scikit-learn
tags:
  - tfidf
  - text-embedding
  - alphahack
---

# AlphaHack Text Embedder

A `hackalpha.research.text_embeddings.TextEmbeddingFeatures` instance:
TF-IDF + `TruncatedSVD` (20 components) fit on **101,682 Devpost
project descriptions** (the `full_description` column of the
[AlphaHack dataset](https://huggingface.co/datasets/xenosaac/alphahack-devpost)).

This embedder produces the `emb_0` through `emb_19` columns in the released
parquet. **Neither Model 1 nor Model 2 uses `emb_*` features**; they
are provided only for exploratory analysis on new project descriptions.

## Loading

The `.pkl` file serializes a project-specific class (`TextEmbeddingFeatures`).
You **must** install the companion package before loading, because
unpickling needs to import that class:

```bash
pip install hackalpha
```

```python
import joblib
import pandas as pd

embedder = joblib.load("text_embedder.pkl")

# Embed new project descriptions (transform only; do NOT call fit() or fit_transform())
texts = pd.Series([
    "An AI-powered code review tool that integrates with GitHub.",
    "A wearable that detects falls and alerts emergency contacts.",
])
emb_df = embedder.transform(texts)
# emb_df has columns emb_0, emb_1, ..., emb_19
```

**Important**: do not call `fit()` or `fit_transform()` on new data.
Doing so re-fits the vocabulary and produces embeddings that are not
comparable with those in the released parquet.

## Training procedure

- Vectorizer: `sklearn.feature_extraction.text.TfidfVectorizer`
- Reducer: `sklearn.decomposition.TruncatedSVD(n_components=20)`
- Corpus: 101,682 Devpost `full_description` strings (all projects in
  the released dataset)
- Vocabulary: fit once on the full corpus; held frozen at `transform()`
  time
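
The procedure above can be sketched with stock scikit-learn components. This is an approximation under stated assumptions, not the actual `TextEmbeddingFeatures` implementation: the vectorizer hyperparameters are not listed in this card, so defaults are used, the corpus is a hypothetical toy one, and `n_components` is reduced to 2 because the toy vocabulary is tiny (the released embedder uses 20).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); the real one has 101,682 descriptions.
corpus = [
    "an ai powered code review tool for github",
    "a wearable that detects falls and alerts contacts",
    "a mobile app that tracks carbon footprint",
    "a chrome extension that summarizes github issues",
    "a wearable sensor that tracks sleep quality",
]

# n_components=2 only because the toy vocabulary is small; the card states 20.
# random_state is an assumption added here for repeatable output.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)

# Fit once on the full corpus ...
pipeline.fit(corpus)

# ... then only transform new text, so vocabulary and SVD basis stay frozen.
new_texts = ["a github bot that reviews code"]
embeddings = pipeline.transform(new_texts)
print(embeddings.shape)
```

Each row of `embeddings` corresponds to one input text and each column to one SVD component, analogous to the `emb_*` columns.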

## Why ship this if it's not used by the released models?

Two reasons:

1. **Reproducibility of the released parquet.** The `emb_*` columns in
   `alphahack_features_v7.parquet` were generated by this exact
   embedder. Without it, those columns can't be regenerated.
2. **Future research.** Anyone exploring "does adding text-embedding
   features improve the prize predictor?" needs to start from the same
   embedding basis.

## Limitations

- Vocabulary is frozen at the 2026 corpus snapshot. Out-of-vocabulary
  terms in new project text are silently ignored (TF-IDF default).
- 20 SVD components is a strong dimensionality cap; finer-grained text
  semantics will be lost.
- Trained only on English Devpost text.
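
The out-of-vocabulary behavior can be checked with a small sketch. It uses a plain `TfidfVectorizer` fit on a hypothetical two-document corpus, not the shipped embedder. A text made up entirely of unseen terms produces an all-zero TF-IDF row, with no warning or error.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical frozen vocabulary, learned from two short documents.
vectorizer = TfidfVectorizer().fit([
    "code review tool for github",
    "wearable fall detection sensor",
])

# One text whose terms are all in the vocabulary, one whose terms are all unseen.
known = vectorizer.transform(["github code review"])
unknown = vectorizer.transform(["quantum blockchain drone"])

# nnz counts the stored non-zero TF-IDF weights in each sparse row.
print(known.nnz)
print(unknown.nnz)
```

Because `TruncatedSVD.transform` is a linear projection without centering, an all-zero TF-IDF row maps to an all-zero embedding. Under these assumptions, a row of zeros across `emb_*` is a useful signal that the input text had no overlap with the frozen vocabulary.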

## License

CC BY 4.0.