---
language:
- pt
license: apache-2.0
library_name: sentence-transformers
base_model: ibm-granite/granite-embedding-97m-multilingual-r2
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- semantic-search
- embeddings
- portuguese
- brazilian-portuguese
- pt-br
- b2b
- fine-tuned
- mteb
- granite
- autoresearch
pipeline_tag: sentence-similarity
model-index:
- name: braza-embedding-ptbr-v1
results:
- task:
type: STS
dataset:
name: ASSIN2 STS
type: mteb/assin2-sts
metrics:
- type: pearson
value: 0.8082
- task:
type: STS
dataset:
name: SICK-BR STS
type: mteb/sick-br-sts
metrics:
- type: pearson
value: 0.8513
- task:
type: PairClassification
dataset:
name: ASSIN2 RTE
type: mteb/assin2-rte
metrics:
- type: ap
value: 0.8408
---
# braza-embedding-ptbr-v1
> **The first PT-BR embedding model fine-tuned on 474,000 real Brazilian B2B companies.**
Fine-tuned from [`ibm-granite/granite-embedding-97m-multilingual-r2`](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2) using a **Karpathy-style autoresearch loop** — 36 autonomous training iterations on an RTX 5090, each proposing and self-validating its own strategy. **35 out of 36 iterations improved the model (97% acceptance rate).**
Built at [**TAMZ**](https://tamz.ai) — a Brazilian B2B sales intelligence platform that identifies, enriches, and delivers company leads ready for outreach. The training data comes directly from TAMZ's enrichment pipeline over 32M Brazilian companies from the Receita Federal.
---
## MTEB Benchmark Results (PT-BR)
| Task | Baseline (granite-97m) | **braza-embedding-ptbr-v1** | Δ |
|------|:----------------------:|:---------------------------:|:-:|
| ASSIN2 STS | 0.6655 | **0.8082** | +21.5% 🟢 |
| SICK-BR STS | 0.7062 | **0.8513** | +20.6% 🟢 |
| ASSIN2 RTE | 0.7254 | **0.8408** | +16.0% 🟢 |
| **MTEB Primary (weighted avg)** | 0.5826 | **0.6596** | **+13.2%** |
> STS (Semantic Textual Similarity) scores measure correlation with human similarity judgements, the standard yardstick for semantic embedding quality in Portuguese.
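These scores should be reproducible with the `mteb` package. The sketch below is an untested outline: the task names (`Assin2STS`, `SickBrSTS`, `Assin2RTE`) and the `MTEB(...).run(...)` entry point are assumptions that may differ across mteb versions, so verify them against your installed registry.

```python
# Hedged sketch: re-running the PT-BR benchmark tasks with mteb.
# Task names and entry point are assumptions; verify against your mteb version.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("calneymgp/braza-embedding-ptbr-v1")
tasks = mteb.get_tasks(tasks=["Assin2STS", "SickBrSTS", "Assin2RTE"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/braza-embedding-ptbr-v1")
```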
---
## Why This Model Exists
There are virtually no PT-BR embedding models trained on **real business data**. Most multilingual models learn from Wikipedia and web crawls — they don't understand the vocabulary of Brazilian B2B:
- *"distribuidora de insumos industriais no Nordeste"*
- *"SaaS de gestão condominial para construtoras"*
- *"startup de fintech para MEI e pequenas empresas"*
- *"consultorias de RH para empresas de médio porte em SP"*
This model was built to fix that.
---
## Model Details
| Property | Value |
|----------|-------|
| Base model | ibm-granite/granite-embedding-97m-multilingual-r2 |
| Architecture | ModernBERT |
| Parameters | 97M |
| Embedding dimension | 384 (Matryoshka: supports 256 / 128 / 64) |
| Max sequence length | 512 tokens |
| Language | Portuguese (PT-BR) |
| Domain | B2B, SaaS, tech, startups, commerce, Brazilian companies |
| Training hardware | NVIDIA RTX 5090 32GB |
| Training time | ~8 hours overnight |
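
Most of these properties can be sanity-checked directly on the loaded model (expected values per the table above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("calneymgp/braza-embedding-ptbr-v1")
print(model.get_sentence_embedding_dimension())  # expected: 384
print(model.max_seq_length)                      # expected: 512
```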
---
## Training: The Autoresearch Loop
Instead of static hyperparameter tuning, this model was trained using an **autonomous iterative loop** inspired by Karpathy's autoresearch approach:
```
for each of 36 iterations:
1. Generate 150 company → query pairs using Qwen3.5-35B
2. Stage 1: MatryoshkaLoss(MultipleNegativesRankingLoss, scale=30)
→ 180s on RTX 5090
→ trains 384/256/128/64 dims simultaneously
3. Stage 2: CoSENTLoss on ASSIN2 + SICK-BR real annotated scores
→ 60s calibration on human-labelled PT-BR pairs
4. Evaluate: ASSIN2-STS + SICK-BR-STS
5. KEEP if improved → checkpoint saved
DISCARD if regressed → restore previous checkpoint
```
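The two training stages map onto standard sentence-transformers loss primitives. Below is a minimal sketch of the loss setup only; the dataset plumbing, trainer configuration, and KEEP/DISCARD controller are omitted, and this is illustrative rather than the original training script.

```python
# Illustrative loss setup for the two stages (not the original training script).
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import (
    CoSENTLoss,
    MatryoshkaLoss,
    MultipleNegativesRankingLoss,
)

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

# Stage 1: in-batch contrastive loss over (query, company) pairs, wrapped in
# MatryoshkaLoss so the 384/256/128/64-dim heads are trained simultaneously.
stage1_loss = MatryoshkaLoss(
    model=model,
    loss=MultipleNegativesRankingLoss(model, scale=30),
    matryoshka_dims=[384, 256, 128, 64],
)

# Stage 2: calibration on human-annotated similarity scores (ASSIN2 + SICK-BR).
stage2_loss = CoSENTLoss(model)
```

Each stage would then be run with sentence-transformers' training API (e.g. `SentenceTransformerTrainer`) on the corresponding dataset, with the loop controller deciding after evaluation whether to keep or discard the new checkpoint.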
**Score progression over 36 iterations:**
```
iter 0: 0.7322
iter 5: 0.7589
iter 10: 0.7831
iter 20: 0.8105
iter 30: 0.8270
iter 32: 0.8297 ← best checkpoint (this model)
```
---
## Training Data: 474K Brazilian B2B Companies
The synthetic training data was generated from a proprietary dataset of **474,000 Brazilian companies** sourced from the Receita Federal (the Brazilian tax authority) and enriched with web scraping and AI analysis.
Each company record contains:
- Business description and value proposition (AI-generated from web scraping)
- CNAE sector classification, company size, location, revenue range
- Tech stack, AI adoption score, target market (B2B/B2C/B2B2C)
- LinkedIn data, founding year, company stage
For each company, **Qwen3.5-35B APEX** generated 6 diverse semantic queries — simulating real B2B buyers and sales reps searching for vendors; a hedged generation sketch follows the list. Queries vary by:
- Geographic filters (*"empresa de TI em Curitiba"*, *"fornecedor no Nordeste"*)
- Company maturity (*"startup de 1 ano"*, *"empresa estabelecida há mais de 10 anos"*)
- Decision-maker role (*"CTO buscando"*, *"gestor de compras procurando"*)
- Tech signals (*"empresa que usa Salesforce e HubSpot"*)
- Revenue/size filters (*"faturamento entre R$5M e R$20M"*)
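As a rough illustration of the generation step (not the production pipeline), the sketch below calls a llama.cpp server through its OpenAI-compatible endpoint; the URL, model alias, prompt wording, and output parsing are all assumptions.

```python
# Hedged sketch: generating synthetic buyer queries for one company record via
# a llama.cpp server's OpenAI-compatible API. Endpoint, alias, and prompt are
# illustrative assumptions, not the production setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

company = (
    "Plataforma de CRM para times de vendas corporativas B2B, "
    "sede em São Paulo, 50-100 funcionários, usa Salesforce e HubSpot"
)
prompt = (
    "Generate 6 distinct search queries in Brazilian Portuguese that a B2B "
    "buyer would use to find this company, varying geography, company "
    "maturity, decision-maker role, tech stack, and revenue range. "
    "One query per line.\n\n"
    f"Company: {company}"
)

response = client.chat.completions.create(
    model="qwen",  # whatever alias the llama.cpp server exposes
    messages=[{"role": "user", "content": prompt}],
    temperature=0.8,
)
queries = [
    line.strip()
    for line in response.choices[0].message.content.splitlines()
    if line.strip()
]
```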
---
## Usage
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("calneymgp/braza-embedding-ptbr-v1")
# B2B semantic search
query = "empresa de tecnologia SaaS B2B em São Paulo"
companies = [
"Desenvolvemos software de gestão para pequenas empresas brasileiras",
"Restaurante especializado em culinária italiana no centro de SP",
"Plataforma de CRM para times de vendas corporativas B2B",
"Distribuidora de equipamentos agrícolas no interior do Paraná",
]
query_emb = model.encode(query)
company_embs = model.encode(companies)
scores = model.similarity(query_emb, company_embs)
# Results: [0.81, 0.12, 0.79, 0.18] — retrieves tech SaaS companies correctly
```
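To turn the raw scores into a ranked shortlist, a small follow-up works (assuming `scores` is the torch tensor returned by `model.similarity` above):

```python
import torch

# Rank companies by similarity to the query; `scores` has shape [1, len(companies)].
ranking = torch.argsort(scores[0], descending=True).tolist()
for idx in ranking:
    print(f"{float(scores[0][idx]):.2f}  {companies[idx]}")
```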
### Matryoshka — flexible embedding dimensions
```python
# Full quality (384-dim) — default
embeddings = model.encode(texts)

# ~1.5x faster exact search, ~1% quality loss (256-dim)
embeddings = model.encode(texts, truncate_dim=256)

# ~3x faster exact search, ~3% quality loss (128-dim) — great for large-scale search
embeddings = model.encode(texts, truncate_dim=128)
```
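Alternatively, `truncate_dim` can be fixed once at load time (supported in sentence-transformers since v2.7) so that every `encode` call truncates consistently:

```python
from sentence_transformers import SentenceTransformer

# Fix the Matryoshka dimension at load time; all encode calls then return
# 128-dim embeddings without passing truncate_dim each time.
model_128 = SentenceTransformer("calneymgp/braza-embedding-ptbr-v1", truncate_dim=128)
embeddings = model_128.encode(["empresa de TI em Curitiba"])
print(embeddings.shape)  # (1, 128)
```

Either way, index vectors and query vectors must be truncated to the same dimension, and truncated embeddings should be re-normalized if your vector store assumes unit-length vectors.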
---
## Best For
✅ Semantic search over Brazilian business data
✅ B2B lead discovery and company matching (e.g. [TAMZ](https://tamz.ai))
✅ Company similarity, clustering, deduplication
✅ PT-BR RAG pipelines with business documents
✅ Memory systems for Portuguese AI agents
✅ Sales intelligence and market research (Brazil)
---
## Infrastructure
| Component | Details |
|-----------|---------|
| Data generation | Qwen3.5-35B APEX GGUF on RTX 3090 (llama.cpp) |
| Training | NVIDIA RTX 5090 32GB (PyTorch + sentence-transformers 5.x) |
| Evaluation | MTEB 2.x — official PT-BR benchmark tasks |
| Monitoring | Discord notifications + HTTP dashboard per iteration |
| Loop controller | Custom autoresearch script (KEEP/DISCARD per iteration) |
---
## License
Apache 2.0 — same as the base IBM Granite Embedding model.
---
## Citation
```bibtex
@misc{gerhardt2026braza,
author = {Calney Gerhardt},
title = {braza-embedding-ptbr-v1: Portuguese B2B Embedding via Autoresearch},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/calneymgp/braza-embedding-ptbr-v1},
note = {Fine-tuned from IBM Granite 97M on 474K Brazilian B2B companies
using 36-iteration autonomous training loop (RTX 5090).
Built at TAMZ (https://tamz.ai)}
}
```