# Korean Neural Sparse Encoder (klue/roberta-large)

A Korean sparse retrieval model trained with SPLADE-max on klue/roberta-large (337M parameters). It produces sparse lexical representations for Korean text, enabling efficient keyword-expanded retrieval via OpenSearch `neural_sparse` queries.
## Model Details
| Property | Value |
|---|---|
| Base Model | klue/roberta-large |
| Parameters | 337M (24 layers, 1024 hidden, 32K vocab) |
| Training Method | SPLADE-max (log(1+ReLU) + max pooling) |
| Loss | InfoNCE + FLOPS regularization |
| Training Data | 4.84M Korean triplets (46 shards) |
| Epochs | 25 |
| Effective Batch Size | 2048 (128/GPU x 2 grad_accum x 8 GPUs) |
| Hardware | 8x NVIDIA B200 (183GB VRAM) |
| Training Time | 13.4 hours |
## Benchmark Results (Recall@1)
| Benchmark | Queries | BM25 | V33-base (149M) | This Model (337M) | Delta vs V33 |
|---|---|---|---|---|---|
| Ko-StrategyQA | 592 | 53.7% | 62.2% | 64.2% | +2.0pp |
| MIRACL-ko | 213 | 44.1% | 61.0% | 70.4% | +9.4pp |
| Mr.TyDi-ko | 421 | 55.6% | 72.7% | 76.2% | +3.5pp |
## Full Metrics

| Benchmark | R@1 | R@5 | R@10 | MRR@10 | Avg. nonzero dims (query) | Avg. nonzero dims (doc) |
|---|---|---|---|---|---|---|
| Ko-StrategyQA | 64.2% | 82.4% | 85.3% | 72.1% | 37 | 71 |
| MIRACL-ko | 70.4% | 89.7% | 96.2% | 79.1% | 35 | 77 |
| Mr.TyDi-ko | 76.2% | 92.6% | 96.0% | 83.4% | 35 | 75 |
## Usage

### With Transformers (PyTorch)
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "sewoong/korean-neural-sparse-encoder-base-klue-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

text = "서울 맛집 추천해주세요"  # "Please recommend good restaurants in Seoul"
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# SPLADE-max: log(1 + ReLU) activation, mask padding, max-pool over tokens
sparse = torch.log1p(torch.relu(logits))
mask = inputs["attention_mask"].unsqueeze(-1).float()
sparse_vec = (sparse * mask).max(dim=1).values.squeeze()

# Top activated tokens
top_indices = torch.nonzero(sparse_vec).squeeze()
top_values = sparse_vec[top_indices]
sorted_idx = top_values.argsort(descending=True)
for idx in sorted_idx[:20]:
    token_id = top_indices[idx].item()
    weight = top_values[idx].item()
    token = tokenizer.decode([token_id])
    print(f"  {token}: {weight:.3f}")
```
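At retrieval time, relevance is simply the dot product of two such sparse vectors: only vocabulary dimensions active in both the query and the document contribute. A minimal sketch of the scoring step with a toy 10-dimensional "vocabulary" (the real vectors span the full 32K-token vocabulary):

```python
import torch

def score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> float:
    # Sparse relevance score: dot product over vocabulary dimensions.
    return torch.dot(query_vec, doc_vec).item()

# Toy vectors: the query and document overlap on dims 2 and 5.
q = torch.zeros(10)
q[2], q[5] = 1.2, 0.8
d = torch.zeros(10)
d[2], d[5], d[7] = 0.5, 1.0, 2.0  # dim 7 contributes nothing (query weight is 0)

print(round(score(q, d), 4))  # 1.2*0.5 + 0.8*1.0 = 1.4
```

OpenSearch's `rank_features` field computes the same kind of term-weighted match server-side, so indexed documents never need dense vectors.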
### With OpenSearch
```json
PUT /my-index
{
  "settings": {
    "index": {
      "default_pipeline": "sparse-pipeline"
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "sparse_embedding": { "type": "rank_features" }
    }
  }
}
```
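At search time, OpenSearch's `neural_sparse` query clause encodes the query text with a deployed sparse model and scores it against the `rank_features` field. A sketch, where `<your_model_id>` is a placeholder for the ID assigned when you register and deploy the model in your cluster:

```json
GET /my-index/_search
{
  "query": {
    "neural_sparse": {
      "sparse_embedding": {
        "query_text": "서울 맛집 추천해주세요",
        "model_id": "<your_model_id>"
      }
    }
  }
}
```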
## Training Configuration

```yaml
model:
  name: "klue/roberta-large"
  dropout: 0.1

loss:
  lambda_q: 0.01
  lambda_d: 0.003
  temperature: 1.0
  flops_warmup_steps: 20000
  lambda_initial_ratio: 0.1

data:
  batch_size: 128
  query_max_length: 64
  doc_max_length: 256

training:
  num_epochs: 25
  learning_rate: 5.0e-5
  weight_decay: 0.01
  warmup_ratio: 0.06
  gradient_accumulation_steps: 2
```
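The loss terms in the config map onto a short computation: InfoNCE over in-batch query-document scores, plus a FLOPS regularizer (the sum over vocabulary dimensions of the squared mean activation) weighted by `lambda_q` and `lambda_d`. A minimal sketch under those assumptions, not the actual training code:

```python
import torch
import torch.nn.functional as F

def splade_max(logits, attention_mask):
    # SPLADE-max pooling: log(1 + ReLU), padding masked out, max over tokens.
    act = torch.log1p(torch.relu(logits)) * attention_mask.unsqueeze(-1)
    return act.max(dim=1).values  # (batch, vocab_size)

def flops_reg(reps):
    # FLOPS regularizer: penalizes dense vocabulary dimensions by summing
    # the squared mean activation of each dimension over the batch.
    return (reps.mean(dim=0) ** 2).sum()

def loss_fn(q_reps, d_reps, lambda_q=0.01, lambda_d=0.003, temperature=1.0):
    # InfoNCE with in-batch negatives: each query's positive is the
    # same-index document; every other document in the batch is a negative.
    scores = q_reps @ d_reps.T / temperature  # (batch, batch)
    labels = torch.arange(q_reps.size(0))
    infonce = F.cross_entropy(scores, labels)
    return infonce + lambda_q * flops_reg(q_reps) + lambda_d * flops_reg(d_reps)
```

Per the config, the FLOPS weights start at `lambda_initial_ratio` times their final values and warm up over `flops_warmup_steps`, so sparsity pressure ramps in gradually.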
## Related Models
- sewoong/korean-neural-sparse-encoder - V33 base model (149M, skt/A.X-Encoder-base)
- opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 - OpenSearch official multilingual sparse model
## License
Apache License 2.0