# Chunky — Bitext Chunk Alignment Model (LoRA)

A LoRA adapter fine-tuned from Qwen3-4B-Instruct-2507 for finding optimal split points in parallel bilingual text (bitext chunking). Given a source–target text pair with pre-inserted split markers, the model predicts which pairs of split indices align semantically.


## Task

Given `<src>` and `<tgt>` blocks with numbered split markers `[|1|]`, `[|2|]`, ..., predict the optimal alignment pairs as `<answer>src_idx-tgt_idx, ...</answer>`.

Example input:

```
<src>Document title[|1|]First paragraph content.[|2|]Second paragraph.</src>
<tgt>문서 제목[|1|]첫 번째 단락 내용.[|2|]두 번째 단락.</tgt>
```

Expected output:

```
<answer>1-1, 2-2</answer>
```
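For downstream use, the `<answer>` string needs to be parsed back into index pairs. A minimal sketch (the `parse_answer` helper is illustrative, not part of this repo):

```python
import re

def parse_answer(text: str) -> list[tuple[int, int]]:
    """Extract (src_idx, tgt_idx) pairs from an <answer>...</answer> string."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if m is None:
        return []  # parse error: no answer block found
    pairs = []
    for item in m.group(1).split(","):
        item = item.strip()
        if re.fullmatch(r"\d+-\d+", item):  # skip malformed entries
            s, t = item.split("-")
            pairs.append((int(s), int(t)))
    return pairs

print(parse_answer("<answer>1-1, 2-2</answer>"))  # [(1, 1), (2, 2)]
```

Malformed or missing answer blocks return an empty list, which is one way to account for the parse-error rate reported in the evaluation below.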

## Model Details

| Property | Value |
|---|---|
| Base model | unsloth/Qwen3-4B-Instruct-2507 |
| Method | SFT with LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training precision | fp32 |
| Training steps | 8000 (best checkpoint: step 7000) |
| Max sequence length | 12800 tokens |
| Training samples | ~316k augmented bitext alignment samples (Korean–English) |
| Optimizer | AdamW (lr=2e-4, warmup=5%, cosine decay) |
| Effective batch size | 8 (per-device batch 4 × gradient accumulation 2) |
| Best eval loss | 0.03788 (step 7000) |
| Final eval loss | 0.03844 (step 8000) |

## Framework Versions

- PEFT 0.18.1
- TRL 0.23.0
- Transformers 4.56.2
- PyTorch 2.9.1
- Unsloth 2026.4.4

## Usage

### With Unsloth (recommended)

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from peft import PeftModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=12800,
    load_in_4bit=False,
)
tokenizer = get_chat_template(tokenizer, chat_template="qwen3-instruct")
model = PeftModel.from_pretrained(model, "p4b/chunky-qwen3-4b-sft")
model = FastLanguageModel.for_inference(model)
```

### With standard transformers + PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = "unsloth/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, "p4b/chunky-qwen3-4b-sft")
model.eval()
```

## Inference Example

```python
SYSTEM_PROMPT = """## Task Description
You are a Linguistic Structure Analyst and Translation Alignment Expert. Your task is to analyze the provided `<src>` (source) and `<tgt>` (target) blocks to identify "Optimal Split Points.". Each split should be in closed form when translating bidirectionally. You should carefully look for pronoun relation or tense.

## Task Objective
Find the index numbers `[|n|]` where the text can be naturally divided into two parts. A split is considered "optimal" if the segments before and after the split remain independently understandable and do not break the semantic flow.

## Guidelines for Selection
1. **Structural Cues**: Prioritize indices located next to structural markers, such as hyphens (`-`), bullet points, or section dividers.
2. **Contextual Independence**: The content after the split point should start a new logical section or thought (e.g., a new heading or a different category of information).
3. Example Logic: In the text `...Information [|32|]-[|33|] Nearby...`, the index `[|33|]` is an ideal split point because it follows a hyphen and precedes a new sub-topic.
4. Alignment: Match the corresponding split point index from the `<src>` block with the equivalent split point index in the `<tgt>` block.

## Constraint
- The output must be formatted strictly as: `<answer>SourceIndex-TargetIndex, SourceIndex-TargetIndex</answer>`

## Input
"""

def insert_split_tokens(chunks: list[str]) -> str:
    # Append marker [|i|] after each chunk i (1-indexed).
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(chunk)
        parts.append(f"[|{i}|]")
    return "".join(parts)

src_chunks = ["Introduction paragraph.", "Main content section.", "Conclusion."]
tgt_chunks = ["서론 단락.", "본문 내용 섹션.", "결론."]

text = f"<src>{insert_split_tokens(src_chunks)}</src><tgt>{insert_split_tokens(tgt_chunks)}</tgt>"
messages = [{"role": "user", "content": SYSTEM_PROMPT + text}]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)  # <answer>2-2</answer>
```
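To apply a predicted alignment back to the original chunk lists, the predicted indices can be used to merge chunks into aligned segments. A minimal sketch assuming the marker convention of `insert_split_tokens` above, where marker `[|i|]` follows chunk *i* (the `split_at` helper is illustrative, not part of this repo):

```python
def split_at(chunks: list[str], indices: list[int]) -> list[str]:
    """Merge chunks into segments, cutting after each chunk whose
    trailing marker index [|i|] appears in `indices`."""
    segments, current = [], []
    for i, chunk in enumerate(chunks, start=1):
        current.append(chunk)
        if i in indices:
            segments.append("".join(current))
            current = []
    if current:  # flush the trailing segment
        segments.append("".join(current))
    return segments

# Chunk lists from the example above.
src_chunks = ["Introduction paragraph.", "Main content section.", "Conclusion."]
tgt_chunks = ["서론 단락.", "본문 내용 섹션.", "결론."]

pairs = [(2, 2)]  # parsed from the model output "<answer>2-2</answer>"
src_segments = split_at(src_chunks, [s for s, _ in pairs])
tgt_segments = split_at(tgt_chunks, [t for _, t in pairs])
print(src_segments)  # ['Introduction paragraph.Main content section.', 'Conclusion.']
```

Because both sides are cut at aligned indices, `src_segments[k]` and `tgt_segments[k]` form a translation pair.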

## Training Data

Fine-tuned on augmented bitext alignment samples generated from Korean–English parallel corpora. Augmentation was applied to increase diversity in split-point configurations. Training used the augmented subset only (~316k samples), leaving the original ~105k samples as an unseen evaluation pool.

## Evaluation Metrics

Evaluated using FN/FP reward scoring (the same as the GRPO training objective):

| Metric | Description | Weight |
|---|---|---|
| FN (false negative) | Missed ground-truth split pairs | 1.0 |
| FP (false positive) | Predicted pairs not in the ground truth | 0.2 |
| Length penalty | Predicted segments >3× longer than GT | 1.0 |

Reward = -(1.0 × FN + 0.2 × FP + length_penalty)
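The scoring above can be sketched as follows. The card does not specify how the length penalty is computed internally, so it is taken here as a precomputed input; the `reward` helper is illustrative:

```python
def reward(pred: set[tuple[int, int]], gt: set[tuple[int, int]],
           length_penalty: float = 0.0) -> float:
    """Negative weighted sum of misses, spurious pairs, and the length
    penalty; 0.0 means a perfect prediction."""
    fn = len(gt - pred)   # ground-truth pairs the model missed
    fp = len(pred - gt)   # predicted pairs not in the ground truth
    total = 1.0 * fn + 0.2 * fp + length_penalty
    return -total if total else 0.0

print(reward({(1, 1), (2, 2)}, {(1, 1), (2, 2)}))  # 0.0
print(reward({(1, 1)}, {(1, 1), (2, 2)}))          # -1.0
```

The asymmetric weights make a missed split (FN) five times more costly than a spurious one (FP), which biases the model toward over-splitting rather than under-splitting.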

## Evaluation Results

Evaluated on 300 randomly sampled held-out original (non-augmented) samples from `train_split.jsonl`:

| Metric | Value |
|---|---|
| Perfect reward (= 0) | 53.7% (161/300) |
| Reward mean / median | -1.654 / 0.000 |
| Reward stdev | 3.277 |
| FN mean | 1.393 |
| FP mean | 0.687 |
| Length penalty mean | 0.070 |
| Parse error rate | 5.3% |

Reward distribution:

| Range | Count | % |
|---|---|---|
| = 0 (perfect) | 161 | 53.7% |
| -1 ~ 0 | 3 | 1.0% |
| -2 ~ -1 | 70 | 23.3% |
| -5 ~ -2 | 43 | 14.3% |
| ≤ -5 | 23 | 7.7% |

The median reward is 0: the majority of samples are predicted perfectly. The mean is pulled down by a small number of hard samples with many chunks (100+).

## Limitations

- Primarily trained on Korean–English parallel text; other language pairs are untested.
- May underperform on documents longer than 12800 tokens.
- Trained without a thinking/reasoning mode; the model is fine-tuned to emit the answer directly.