# Korean Neural Sparse Encoder (klue/roberta-large)

A Korean sparse retrieval model trained with SPLADE-max on klue/roberta-large (337M params).

This model generates sparse lexical representations for Korean text, enabling efficient keyword-expanded retrieval via OpenSearch neural_sparse queries.

## Model Details

| Property | Value |
|---|---|
| Base Model | klue/roberta-large |
| Parameters | 337M (24 layers, 1024 hidden, 32K vocab) |
| Training Method | SPLADE-max (log(1 + ReLU) + max pooling) |
| Loss | InfoNCE + FLOPS regularization |
| Training Data | 4.84M Korean triplets (46 shards) |
| Epochs | 25 |
| Effective Batch Size | 2048 (128/GPU × 2 grad. accum. × 8 GPUs) |
| Hardware | 8× NVIDIA B200 (183GB VRAM) |
| Training Time | 13.4 hours |

## Benchmark Results (Recall@1)

| Benchmark | Queries | BM25 | V33-base (149M) | This Model (337M) | Δ vs V33-base |
|---|---|---|---|---|---|
| Ko-StrategyQA | 592 | 53.7% | 62.2% | 64.2% | +2.0pp |
| MIRACL-ko | 213 | 44.1% | 61.0% | 70.4% | +9.4pp |
| Mr.TyDi-ko | 421 | 55.6% | 72.7% | 76.2% | +3.6pp |

### Full Metrics

| Benchmark | R@1 | R@5 | R@10 | MRR@10 | nz_q | nz_d |
|---|---|---|---|---|---|---|
| Ko-StrategyQA | 64.2% | 82.4% | 85.3% | 72.1% | 37 | 71 |
| MIRACL-ko | 70.4% | 89.7% | 96.2% | 79.1% | 35 | 77 |
| Mr.TyDi-ko | 76.2% | 92.6% | 96.0% | 83.4% | 35 | 75 |

`nz_q` / `nz_d`: average number of non-zero dimensions in the query / document sparse vectors.

## Usage

### With Transformers (PyTorch)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "sewoong/korean-neural-sparse-encoder-base-klue-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

text = "서울 맛집 추천해주세요"  # "Please recommend restaurants in Seoul"
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits                           # (1, seq_len, vocab_size)
    sparse = torch.log1p(torch.relu(logits))                  # SPLADE activation: log(1 + ReLU)
    mask = inputs["attention_mask"].unsqueeze(-1).float()     # zero out padding positions
    sparse_vec = (sparse * mask).max(dim=1).values.squeeze()  # max-pool over tokens -> (vocab_size,)

# Top activated tokens
top_indices = torch.nonzero(sparse_vec).squeeze()
top_values = sparse_vec[top_indices]
sorted_idx = top_values.argsort(descending=True)

for idx in sorted_idx[:20]:
    token_id = top_indices[idx].item()
    weight = top_values[idx].item()
    token = tokenizer.decode([token_id])
    print(f"  {token}: {weight:.3f}")
```

### With OpenSearch

Create an index with a `rank_features` field for the sparse embedding and a default ingest pipeline:

```json
PUT /my-index
{
  "settings": {
    "index": {
      "default_pipeline": "sparse-pipeline"
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "sparse_embedding": { "type": "rank_features" }
    }
  }
}
```
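The `sparse-pipeline` referenced above must be created before indexing. Assuming the model has been registered and deployed through OpenSearch ML Commons (`<model_id>` below is a placeholder for the ID that registration returns), a sketch of the ingest pipeline and a matching `neural_sparse` search:

```json
PUT /_ingest/pipeline/sparse-pipeline
{
  "description": "Generate sparse embeddings at index time",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model_id>",
        "field_map": { "text": "sparse_embedding" }
      }
    }
  ]
}

GET /my-index/_search
{
  "query": {
    "neural_sparse": {
      "sparse_embedding": {
        "query_text": "서울 맛집 추천해주세요",
        "model_id": "<model_id>"
      }
    }
  }
}
```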

## Training Configuration

```yaml
model:
  name: "klue/roberta-large"
  dropout: 0.1
loss:
  lambda_q: 0.01
  lambda_d: 0.003
  temperature: 1.0
  flops_warmup_steps: 20000
  lambda_initial_ratio: 0.1
data:
  batch_size: 128
  query_max_length: 64
  doc_max_length: 256
training:
  num_epochs: 25
  learning_rate: 5.0e-5
  weight_decay: 0.01
  warmup_ratio: 0.06
  gradient_accumulation_steps: 2
```
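The FLOPS regularizer penalizes the squared mean activation of each vocabulary dimension across the batch, pushing the model toward sparser vectors; `lambda_q` and `lambda_d` weight it for queries and documents. A minimal sketch, where the warmup schedule (ramping from `lambda_initial_ratio × λ` to `λ` over `flops_warmup_steps`) is an assumption about how these config keys interact:

```python
import torch

def flops_loss(sparse_batch: torch.Tensor) -> torch.Tensor:
    # sparse_batch: (batch, vocab) of non-negative activations.
    # FLOPS regularizer: sum over vocab dims of (mean activation)^2.
    return (sparse_batch.mean(dim=0) ** 2).sum()

def lambda_at_step(step: int, lam: float, warmup: int = 20000, init_ratio: float = 0.1) -> float:
    # Assumed quadratic ramp from init_ratio * lam up to lam over `warmup` steps.
    t = min(step / warmup, 1.0)
    return lam * (init_ratio + (1.0 - init_ratio) * t ** 2)

batch = torch.tensor([[1.0, 0.0], [1.0, 2.0]])  # mean per dim: [1.0, 1.0]
print(flops_loss(batch).item())                  # 1.0^2 + 1.0^2 = 2.0
print(lambda_at_step(0, 0.01))                   # 0.001 (10% of lambda_q at step 0)
```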

## Related Models

## License

Apache License 2.0
