# Chunky — Bitext Chunk Alignment Model (LoRA)

A LoRA adapter fine-tuned from Qwen3-4B-Instruct-2507 for finding optimal split points in parallel bilingual text (bitext chunking). Given a source and target text pair with pre-inserted split markers, the model predicts which pairs of split indices align semantically.
## Task

Given `<src>` and `<tgt>` blocks with numbered split markers `[|1|]`, `[|2|]`, ..., predict the optimal alignment pairs as `<answer>src_idx-tgt_idx, ...</answer>`.
Example input:

```
<src>Document title[|1|]First paragraph content.[|2|]Second paragraph.</src>
<tgt>문서 제목[|1|]첫 번째 단락 내용.[|2|]두 번째 단락.</tgt>
```

Expected output:

```
<answer>1-1, 2-2</answer>
```
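Once the model returns alignment pairs, the chosen indices can be applied back to the marked-up text to recover the merged chunks. A minimal sketch of that post-processing step — the `split_at` helper and its marker regex are illustrative, not part of the released code:

```python
import re

def split_at(text: str, indices: list[int]) -> list[str]:
    """Split marked-up text only at the chosen [|n|] indices, merging the rest."""
    # re.split with a capturing group alternates: chunk, idx, chunk, idx, ...
    parts = re.split(r"\[\|(\d+)\|\]", text)
    out, current = [], parts[0]
    for i in range(1, len(parts), 2):
        idx, chunk = int(parts[i]), parts[i + 1]
        if idx in indices:
            out.append(current)   # close the segment at a chosen split point
            current = chunk
        else:
            current += chunk      # merge across an unchosen marker
    out.append(current)
    return out

print(split_at("Document title[|1|]First paragraph content.[|2|]Second paragraph.", [1, 2]))
# ['Document title', 'First paragraph content.', 'Second paragraph.']
```

Unchosen markers simply disappear, so a subset of the inserted split points yields fewer, longer segments.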
## Model Details
| Property | Value |
|---|---|
| Base model | unsloth/Qwen3-4B-Instruct-2507 |
| Method | SFT with LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training precision | fp32 |
| Training steps | 8000 (best checkpoint: step 7000) |
| Max sequence length | 12800 tokens |
| Training samples | ~316k augmented bitext alignment samples (Korean–English) |
| Optimizer | AdamW (lr=2e-4, warmup=5%, cosine decay) |
| Effective batch size | 8 (4 × grad_accum 2) |
| Best eval loss | 0.03788 (step 7000) |
| Final eval loss | 0.03844 (step 8000) |
## Framework Versions
- PEFT 0.18.1
- TRL 0.23.0
- Transformers 4.56.2
- PyTorch 2.9.1
- Unsloth 2026.4.4
## Usage

### With Unsloth (recommended)
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from peft import PeftModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=12800,
    load_in_4bit=False,
)
tokenizer = get_chat_template(tokenizer, chat_template="qwen3-instruct")
model = PeftModel.from_pretrained(model, "p4b/chunky-qwen3-4b-sft")
model = FastLanguageModel.for_inference(model)
```
### With standard transformers + PEFT
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = "unsloth/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, "p4b/chunky-qwen3-4b-sft")
model.eval()
```
## Inference Example
```python
SYSTEM_PROMPT = """## Task Description
You are a Linguistic Structure Analyst and Translation Alignment Expert. Your task is to analyze the provided `<src>` (source) and `<tgt>` (target) blocks to identify "Optimal Split Points". Each split should be in closed form when translating bidirectionally. You should carefully look for pronoun relations or tense.
## Task Objective
Find the index numbers `[|n|]` where the text can be naturally divided into two parts. A split is considered "optimal" if the segments before and after the split remain independently understandable and do not break the semantic flow.
## Guidelines for Selection
1. **Structural Cues**: Prioritize indices located next to structural markers, such as hyphens (`-`), bullet points, or section dividers.
2. **Contextual Independence**: The content after the split point should start a new logical section or thought (e.g., a new heading or a different category of information).
3. Example Logic: In the text `...Information [|32|]-[|33|] Nearby...`, the index `[|33|]` is an ideal split point because it follows a hyphen and precedes a new sub-topic.
4. Alignment: Match the corresponding split point index from the `<src>` block with the equivalent split point index in the `<tgt>` block.
## Constraint
- The output must be formatted strictly as: `<answer>SourceIndex-TargetIndex, SourceIndex-TargetIndex</answer>`
## Input
"""

def insert_split_tokens(chunks: list[str]) -> str:
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(chunk)
        parts.append(f"[|{i}|]")
    return "".join(parts)

src_chunks = ["Introduction paragraph.", "Main content section.", "Conclusion."]
tgt_chunks = ["서론 단락.", "본문 내용 섹션.", "결론."]
text = f"<src>{insert_split_tokens(src_chunks)}</src><tgt>{insert_split_tokens(tgt_chunks)}</tgt>"

messages = [{"role": "user", "content": SYSTEM_PROMPT + text}]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)  # <answer>2-2</answer>
```
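For downstream use, the raw `<answer>…</answer>` string can be parsed into integer index pairs. A small helper sketch following the output format documented above (`parse_answer` is illustrative, not part of the adapter repo):

```python
import re

def parse_answer(text: str) -> list[tuple[int, int]]:
    """Extract 'src-tgt' index pairs from an <answer>...</answer> response."""
    m = re.search(r"<answer>(.*?)</answer>", text)
    if not m:
        return []  # malformed response: no answer tag found
    pairs = []
    for part in m.group(1).split(","):
        src, tgt = part.strip().split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

print(parse_answer("<answer>1-1, 2-2</answer>"))  # [(1, 1), (2, 2)]
```

Returning an empty list on a missing tag mirrors the parse-error case counted in the evaluation below.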
## Training Data
Fine-tuned on augmented bitext alignment samples generated from Korean–English parallel corpora. Augmentation was applied to increase diversity in split-point configurations. Training used the augmented subset only (~316k samples), leaving the original ~105k samples as an unseen evaluation pool.
## Evaluation Metrics
Evaluated using FN/FP reward scoring (same as GRPO training objective):
| Metric | Description | Weight |
|---|---|---|
| FN (false negative) | Missed ground-truth split pairs | 1.0 |
| FP (false positive) | Predicted pairs not in ground truth | 0.2 |
| Length penalty | Predicted segments >3× longer than GT | 1.0 |
| Reward | -(1.0×FN + 0.2×FP + length_penalty) | — |
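The reward above amounts to a set comparison between predicted and ground-truth pairs. A minimal reconstruction of the scoring in the table — the length-penalty term (predicted segments more than 3× longer than ground truth) is passed in precomputed here rather than derived from the texts:

```python
def reward(pred: set[tuple[int, int]], gold: set[tuple[int, int]],
           length_penalty: float = 0.0) -> float:
    """Negative weighted error: 0.0 means a perfect prediction."""
    fn = len(gold - pred)   # ground-truth pairs the model missed
    fp = len(pred - gold)   # predicted pairs not in the ground truth
    return -(1.0 * fn + 0.2 * fp + length_penalty)

print(reward({(1, 1)}, {(1, 1), (2, 2)}))  # -1.0 (one missed pair)
```

Because FN is weighted 5× heavier than FP, the objective favors over-predicting splits slightly rather than missing aligned ones.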
## Evaluation Results
Evaluated on 300 randomly sampled held-out original (non-augmented) samples from `train_split.jsonl`:
| Metric | Value |
|---|---|
| Perfect reward (=0) | 53.7% (161/300) |
| Reward mean / median | -1.654 / 0.000 |
| Reward stdev | 3.277 |
| FN mean | 1.393 |
| FP mean | 0.687 |
| Length penalty mean | 0.070 |
| Parse error rate | 5.3% |
Reward distribution:
| Range | Count | % |
|---|---|---|
| = 0 (perfect) | 161 | 53.7% |
| -1 ~ 0 | 3 | 1.0% |
| -2 ~ -1 | 70 | 23.3% |
| -5 ~ -2 | 43 | 14.3% |
| ≤ -5 | 23 | 7.7% |
The median reward is 0 — the majority of samples are predicted perfectly. The mean is pulled down by a small number of hard samples with many chunks (100+).
## Limitations
- Primarily trained on Korean–English parallel text; other language pairs are untested.
- May underperform on documents longer than 12800 tokens.
- Trained without a thinking/reasoning mode; the model emits the answer directly, with no chain-of-thought.