# Korean Neural Sparse Encoder (klue/roberta-large)

A Korean sparse retrieval model trained with SPLADE-max on klue/roberta-large (337M parameters). It produces sparse lexical representations for Korean text, enabling efficient keyword-expanded retrieval via OpenSearch `neural_sparse` queries.
## Model Details
| Property | Value |
|---|---|
| Base Model | klue/roberta-large |
| Parameters | 337M (24 layers, 1024 hidden, 32K vocab) |
| Training Method | SPLADE-max (log(1+ReLU) + max pooling) |
| Loss | InfoNCE + FLOPS regularization |
| Training Data | 4.84M Korean triplets (46 shards) |
| Epochs | 25 |
| Effective Batch Size | 2048 (128/GPU x 2 grad_accum x 8 GPUs) |
| Hardware | 8x NVIDIA B200 (183GB VRAM) |
| Training Time | 13.4 hours |
## Benchmark Results (Recall@1)
| Benchmark | Queries | BM25 | V33-base (149M) | This Model (337M) | Delta vs V33 |
|---|---|---|---|---|---|
| Ko-StrategyQA | 592 | 53.7% | 62.2% | 64.2% | +2.0pp |
| MIRACL-ko | 213 | 44.1% | 61.0% | 70.4% | +9.4pp |
| Mr.TyDi-ko | 421 | 55.6% | 72.7% | 76.2% | +3.5pp |
## Full Metrics

| Benchmark | R@1 | R@5 | R@10 | MRR@10 | Avg. nonzero dims (query) | Avg. nonzero dims (doc) |
|---|---|---|---|---|---|---|
| Ko-StrategyQA | 64.2% | 82.4% | 85.3% | 72.1% | 37 | 71 |
| MIRACL-ko | 70.4% | 89.7% | 96.2% | 79.1% | 35 | 77 |
| Mr.TyDi-ko | 76.2% | 92.6% | 96.0% | 83.4% | 35 | 75 |
## Usage

### With Transformers (PyTorch)
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "sewoong/korean-neural-sparse-encoder-base-klue-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

text = "서울 맛집 추천해주세요"  # "Please recommend good restaurants in Seoul"
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# SPLADE-max: log(1 + ReLU) activation, mask padding, max-pool over tokens
sparse = torch.log1p(torch.relu(logits))
mask = inputs["attention_mask"].unsqueeze(-1).float()
sparse_vec = (sparse * mask).max(dim=1).values.squeeze()

# Top activated tokens
top_indices = torch.nonzero(sparse_vec).squeeze()
top_values = sparse_vec[top_indices]
sorted_idx = top_values.argsort(descending=True)
for idx in sorted_idx[:20]:
    token_id = top_indices[idx].item()
    weight = top_values[idx].item()
    token = tokenizer.decode([token_id])
    print(f"  {token}: {weight:.3f}")
```
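At retrieval time, relevance is simply the dot product of two such sparse vectors: only vocabulary dimensions active in both the query and the document contribute. A minimal sketch of the scoring step with a toy 10-dimensional "vocabulary" (the real vectors span the full 32K-token vocabulary):

```python
import torch

def score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> float:
    # Sparse relevance score: dot product over vocabulary dimensions.
    return torch.dot(query_vec, doc_vec).item()

# Toy vectors: the query and document overlap on dims 2 and 5.
q = torch.zeros(10)
q[2], q[5] = 1.2, 0.8
d = torch.zeros(10)
d[2], d[5], d[7] = 0.5, 1.0, 2.0  # dim 7 contributes nothing (query weight is 0)

print(round(score(q, d), 4))  # 1.2*0.5 + 0.8*1.0 = 1.4
```

OpenSearch's `rank_features` field computes the same kind of term-weighted match server-side, so indexed documents never need dense vectors.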
### With OpenSearch
```json
PUT /my-index
{
  "settings": {
    "index": {
      "default_pipeline": "sparse-pipeline"
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "sparse_embedding": { "type": "rank_features" }
    }
  }
}
```
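At search time, OpenSearch's `neural_sparse` query clause encodes the query text with a deployed sparse model and scores it against the `rank_features` field. A sketch, where `<your_model_id>` is a placeholder for the ID assigned when you register and deploy the model in your cluster:

```json
GET /my-index/_search
{
  "query": {
    "neural_sparse": {
      "sparse_embedding": {
        "query_text": "서울 맛집 추천해주세요",
        "model_id": "<your_model_id>"
      }
    }
  }
}
```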
## Training Configuration

```yaml
model:
  name: "klue/roberta-large"
  dropout: 0.1

loss:
  lambda_q: 0.01
  lambda_d: 0.003
  temperature: 1.0
  flops_warmup_steps: 20000
  lambda_initial_ratio: 0.1

data:
  batch_size: 128
  query_max_length: 64
  doc_max_length: 256

training:
  num_epochs: 25
  learning_rate: 5.0e-5
  weight_decay: 0.01
  warmup_ratio: 0.06
  gradient_accumulation_steps: 2
```
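The loss terms in the config map onto a short computation: InfoNCE over in-batch query-document scores, plus a FLOPS regularizer (the sum over vocabulary dimensions of the squared mean activation) weighted by `lambda_q` and `lambda_d`. A minimal sketch under those assumptions, not the actual training code:

```python
import torch
import torch.nn.functional as F

def splade_max(logits, attention_mask):
    # SPLADE-max pooling: log(1 + ReLU), padding masked out, max over tokens.
    act = torch.log1p(torch.relu(logits)) * attention_mask.unsqueeze(-1)
    return act.max(dim=1).values  # (batch, vocab_size)

def flops_reg(reps):
    # FLOPS regularizer: penalizes dense vocabulary dimensions by summing
    # the squared mean activation of each dimension over the batch.
    return (reps.mean(dim=0) ** 2).sum()

def loss_fn(q_reps, d_reps, lambda_q=0.01, lambda_d=0.003, temperature=1.0):
    # InfoNCE with in-batch negatives: each query's positive is the
    # same-index document; every other document in the batch is a negative.
    scores = q_reps @ d_reps.T / temperature  # (batch, batch)
    labels = torch.arange(q_reps.size(0))
    infonce = F.cross_entropy(scores, labels)
    return infonce + lambda_q * flops_reg(q_reps) + lambda_d * flops_reg(d_reps)
```

Per the config, the FLOPS weights start at `lambda_initial_ratio` times their final values and warm up over `flops_warmup_steps`, so sparsity pressure ramps in gradually.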
## Related Models
- sewoong/korean-neural-sparse-encoder - V33 base model (149M, skt/A.X-Encoder-base)
- opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 - OpenSearch official multilingual sparse model
## License
Apache License 2.0