Question about KsponSpeech evaluation setup and reproduction
Thank you for releasing this awesome model!
I have a question regarding the evaluation setup for KsponSpeech. Could you share which reference text format was used for these benchmarks, along with the inference settings? Specifically, I'd like to know details such as the prompt format, decoding setup, normalization, and any other evaluation-related configurations.
I ran Raon-Speech-9B for STT using vllm-omni with the example prompt from the README ("Transcribe the audio into text"). I used the "spelling transcripts" obtained from the following repository as reference:
https://github.com/sooftware/ksponspeech
However, I got over 10% CER on both eval-clean and eval-other, which seems significantly worse than the reported results.
Would it be possible to share the evaluation scripts used, or provide some guidance on how to reproduce the reported results? A preprocessing script for KsponSpeech would also be immensely helpful.
In any case, Raon-Speech-9B is the first open-source model I've seen that surpasses Whisper Large v3 on various Korean ASR benchmarks. Thanks again for the great work!
Thanks for trying Raon-Speech!
The CER gap you're seeing is most likely due to normalization differences.
We apply the following normalization pipeline before computing CER.
Benchmark Dataset
- KsponSpeech: Use spelling transcripts.
- Fleurs-ko: Some incorrect ground-truth transcripts were manually corrected.
Normalization Pipeline
- Hanja → Hangul: 漢字 → 한자
- Currency prefix → suffix: $5 → 5달러, €100 → 100유로
- Special symbols → Korean reading: % → 퍼센트, ℃ → 도, etc.
- Exception cases: 6월 → 유월, 10월 → 시월, 20살 → 스무 살
- Phone numbers: 010-1234-5678 → 공일공-일이삼사-오육칠팔 (digit-by-digit)
- Range + unit: 1~3마리 → 한마리에서세마리
- Number + unit: converts digit+unit into a Korean numeral reading (native Korean: 5개 → 다섯개; Sino-Korean: 100원 → 백원). The numeral system is chosen by the counter word:
  - Native Korean (고유어) for 개, 명, 마리, 시간, 살, 병 … (1–99 only, falls back to Sino-Korean above 99)
  - Sino-Korean (한자어) for 원, 년, 월, 층, 퍼센트, 달러 …
  - English units are also mapped: kg → 킬로그램, km → 킬로미터, etc.
- Decimal numbers: 3.14 → 삼점일사
- Remaining pure numbers: 1234 → 천이백삼십사
- Remove punctuation & parenthesized content: keep only Hangul, Latin, digits, whitespace
- Lowercase Latin characters: Hello → hello
- Remove whitespace (not included in the attached evaluation script)
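For reference, the CER itself is just a character-level edit distance over the normalized strings. A minimal sketch (not the official evaluation script; the whitespace stripping here mirrors the last pipeline step):

```python
def _levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance (insert/delete/substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # strip whitespace so spacing-only differences do not count as errors
    ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
    return _levenshtein(ref, hyp) / max(len(ref), 1)

print(cer("백 원", "백원"))  # 0.0 — spacing difference is ignored
```

Skipping the whitespace-removal step (or disagreeing on any rule above) shifts CER on every utterance that contains numbers, symbols, or spacing variants, which is easily enough to explain a several-point gap.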
Comparison with Existing Korean Normalizers
- N2gk+: Our normalizer is based on N2gk+. Differences from N2gk+:
  - Added: currency prefix handling (€, ₩, £, ¥), Hanja → Hangul conversion
  - Changed: native numerals above 99 fall back to Sino-Korean; pure number conversion includes Hangul-adjacent numbers
- g2pK: Grapheme-to-phoneme converter that changes spelling (e.g., 독립 → 동닙), making it unsuitable for CER evaluation against spelling transcripts.
- KoLM: Converts all numbers to Sino-Korean only.
- KoNLPy: Limited to social-media-style text correction. No number conversion or ASR-specific normalization.
| Feature | Ours | N2gk+ | g2pK | KoLM | KoNLPy |
|---|---|---|---|---|---|
| Number → Korean (native/Sino-Korean by counter word) | O | O | O | △ (Sino only) | X |
| Phone number (digit-by-digit) | O | O | X | X | X |
| Currency prefix ($, €, ₩) | O | △ ($ only) | X | X | X |
| Special symbols (%, ℃, &) | O | O | X | X | X |
| Range expression (1~3마리) | O | O | X | X | X |
| Exception cases (유월, 시월) | O | O | X | X | X |
| English unit mapping (kg → 킬로그램) | O | O | X | X | X |
| Hanja → Hangul | O | X | X | △ | X |
| Punctuation removal | O | O | X | O | X |
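To make the first table row concrete: a Sino-only normalizer (KoLM-style) reads 3마리 as 삼마리, while KsponSpeech spelling transcripts use the native reading 세마리, so every such counter becomes a guaranteed character error. A toy illustration with hardcoded readings:

```python
ref = "세마리"            # reference transcript (native reading of "3마리")
sino_only = "삼마리"      # what a Sino-only normalizer would produce
counter_aware = "세마리"  # counter-aware native reading (ours / N2gk+)

assert counter_aware == ref
assert sino_only != ref   # 삼 vs 세: one substitution on a 3-character phrase
```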
Korean text normalizer code
"""Korean text normalizer for ASR evaluation (CER comparison).
Handles numberβKorean conversion, special symbol mapping, hanjaβhangul,
and punctuation removal for fair CER comparison.
Reference: CoreaSpeech N2gk (https://github.com/CoreaSpeech/CoreaSpeech)
"""
from __future__ import annotations
import re
# ── Number → Korean conversion ──────────────────────────────────────────────
NUM_KOR = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
UNIT_SMALL = ["", "십", "백", "천"]
UNIT_LARGE = ["", "만", "억", "조", "경"]
NEVER_SKIP_ONE = {"억", "조", "경"}
GOOYO_SIP = {
    10: "열", 20: "스물", 30: "서른", 40: "마흔",
    50: "쉰", 60: "예순", 70: "일흔", 80: "여든", 90: "아흔",
}
BASIC_NATIVE = {
    1: ("하나", "한"), 2: ("둘", "두"), 3: ("셋", "세"), 4: ("넷", "네"),
    5: ("다섯", "다섯"), 6: ("여섯", "여섯"), 7: ("일곱", "일곱"),
    8: ("여덟", "여덟"), 9: ("아홉", "아홉"),
}
GOOYO_PREFIX_TENS = {20: "스무"}
# 고유어 수사를 쓰는 단위
NATIVE_UNITS = {
    "명", "사람", "마리", "번째", "쌍", "배", "방", "가구", "게임", "건", "세트",
    "개", "가지", "개비", "잔", "벌", "장", "병", "권", "번", "곳", "시간",
    "척", "차례", "바퀴", "경기", "골", "살", "연세", "춘추",
    "달", "글자",
}
# 한자어 수사를 쓰는 단위
HANJA_UNITS = {
    "초", "분", "일", "주", "개월", "월", "년",
    "점", "포인트", "퍼센트", "레벨", "점수", "등급", "등", "개국", "볼트",
    "원", "달러", "유로", "엔", "페소", "배럴",
    "회", "차", "기", "호", "페이지",
    "도",
    # 교육
    "학년", "학기", "학점", "학번", "교시", "반",
    "급", "단계", "위", "항",
    # 군사/조직
    "사단",
    # 건물/공간
    "층",
    # 세대/시대
    "세기", "대", "세대",
    # 스포츠
    "라운드",
    # 금액 큰 단위
    "만원", "만명", "만",
    # 외래 단위
    "피트", "파운드", "마일", "인치", "헥타르",
    # 인승
    "인승",
}
# 단위명 변환 (영문 단위 → 한글)
UNIT_NAME_MAP = {
    "kg": "킬로그램", "Kg": "킬로그램", "g": "그램", "mg": "밀리그램",
    "t": "톤", "T": "톤", "l": "리터", "L": "리터", "ml": "밀리리터",
    "cm": "센티미터", "mm": "밀리미터", "m": "미터", "km": "킬로미터",
    "mi": "마일",
}
# 특수기호 → 한글
SPECIAL_SYMBOL_MAP = {
    "％": "퍼센트", "%": "퍼센트",
    "%p": "퍼센트포인트", "% p": "퍼센트포인트",
    "&": "앤", "#": "샵", "@": "앳",
    "+": "플러스", "±": "플러스마이너스",
    "㎝": "센티미터", "㎜": "밀리미터", "㎏": "킬로그램",
    "㎖": "밀리리터", "℃": "도", "㎞": "킬로미터", "㎎": "밀리그램",
    "㎡": "제곱미터", "㎥": "세제곱미터",
    "～": "~", "ｍ": "미터",
    "°C": "도", "°c": "도",
}
# 예외 케이스
EXCEPTION_CASES = {
    r"\b20\s?살\b": "스무 살",
    r"\b1\s?등\b": "일 등",
    r"(?<!\d)(0?6)\s*월": "유월",
    r"(?<!\d)(10)\s*월": "시월",
}
def _to_gooyo(num: int, prefix: bool = False) -> str:
    """고유어 수사 변환 (1~99)."""
    if num <= 0:
        return "영"
    if num <= 9:
        base = BASIC_NATIVE.get(num)
        return (base[1] if prefix else base[0]) if base else "영"
    if num == 10:
        return "열"
    if num < 100:
        tens = (num // 10) * 10
        ones = num % 10
        if prefix and ones == 0 and tens in GOOYO_PREFIX_TENS:
            return GOOYO_PREFIX_TENS[tens]
        tens_str = GOOYO_SIP.get(tens, "")
        return tens_str + (_to_gooyo(ones, prefix=prefix) if ones else "")
    # 100 이상은 한자어로 fallback
    return _to_hanja(num)
def _convert_small_unit(chunk: str) -> str:
    """4자리 이하 숫자 청크를 한자어 수사로 변환."""
    result = ""
    length = len(chunk)
    for i, ch in enumerate(chunk):
        digit = int(ch)
        if digit == 0:
            continue
        unit = UNIT_SMALL[length - i - 1]
        if digit == 1 and unit:
            result += unit
        else:
            result += NUM_KOR[digit] + unit
    return result
def _to_hanja(num, natural: bool = True) -> str:
    """한자어 수사 변환."""
    if isinstance(num, float):
        int_part = int(num)
        frac_str = str(num).split(".")[1]
        int_kor = _to_hanja(int_part, natural)
        frac_kor = "".join(
            NUM_KOR[int(ch)] if ch != "0" else "영" for ch in frac_str
        )
        return f"{int_kor}점{frac_kor}"
    if isinstance(num, str):
        try:
            num = float(num) if "." in num else int(num)
            return _to_hanja(num, natural)
        except ValueError:
            return num
    if num == 0:
        return "영"
    if num < 0:
        return "마이너스 " + _to_hanja(-num, natural)
    s = str(num)
    chunks = [s[max(i - 4, 0):i] for i in range(len(s), 0, -4)][::-1]
    if len(chunks) > 5:
        return str(num)
    result = ""
    for i, chunk in enumerate(chunks):
        if int(chunk) == 0:
            continue
        part = _convert_small_unit(chunk.zfill(4))
        unit = UNIT_LARGE[len(chunks) - i - 1]
        if part == "일" and unit:
            if natural and unit not in NEVER_SKIP_ONE:
                part = ""
        result += part + unit
    return result
def _n2gk_with_unit(num: int, unit: str) -> str:
    """숫자+단위를 인식하여 적절한 수사(고유어/한자어)로 변환."""
    # 단위명 변환 (영문 → 한글)
    display_unit = UNIT_NAME_MAP.get(unit, unit)
    if unit in NATIVE_UNITS and 1 <= num <= 99:
        return _to_gooyo(num, prefix=True) + display_unit
    else:
        return _to_hanja(num) + display_unit
# ── All unit strings sorted by length (longest first for regex matching) ──
_ALL_UNITS = sorted(
    list(NATIVE_UNITS) + list(HANJA_UNITS) + list(UNIT_NAME_MAP.keys()),
    key=len, reverse=True,
)
# 영문 단위는 뒤에 word boundary 추가 (m이 mi를 매칭하는 것 방지)
_UNITS_PATTERN = "|".join(
    re.escape(u) + r"(?![a-zA-Z])" if re.match(r"^[a-zA-Z]+$", u) else re.escape(u)
    for u in _ALL_UNITS
)
# 통화 기호 (숫자 앞에 오는 prefix 통화 → 숫자 뒤로 이동)
CURRENCY_PREFIX_MAP = {
    "$": "달러",
    "€": "유로",
    "£": "파운드",
    "¥": "엔",
    "₩": "원",
}
_CURRENCY_SYMBOLS = "|".join(re.escape(s) for s in CURRENCY_PREFIX_MAP)
def _convert_currency_prefix(text: str) -> str:
    """통화 기호+숫자 → 숫자+통화명 변환 ($5 → 5달러, €100 → 100유로)."""
    pattern = rf"({_CURRENCY_SYMBOLS})\s*(\d+(?:[.,]\d+)*)"
    def replacer(m):
        currency = CURRENCY_PREFIX_MAP[m.group(1)]
        num = m.group(2)
        return f"{num}{currency}"
    return re.sub(pattern, replacer, text)
def _convert_phone_numbers(text: str) -> str:
    """전화번호 패턴 변환 (010-1234-5678 → 공일공-일이삼사-오육칠팔)."""
    DIGIT_KOR = ["공", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
    def _digits_to_kr(s: str) -> str:
        return "".join(DIGIT_KOR[int(d)] for d in s)
    # 하이픈 포함
    text = re.sub(
        r"(?<!\d)(\d{2,3})-(\d{3,4})-(\d{4})(?!\d)",
        lambda m: f"{_digits_to_kr(m[1])}-{_digits_to_kr(m[2])}-{_digits_to_kr(m[3])}",
        text,
    )
    # 11자리 연속
    text = re.sub(
        r"(?<!\d)(\d{11})(?!\d)",
        lambda m: f"{_digits_to_kr(m[1][:3])}-{_digits_to_kr(m[1][3:7])}-{_digits_to_kr(m[1][7:])}",
        text,
    )
    return text
def _convert_range_with_units(text: str) -> str:
    """범위 패턴 변환 (1~3마리 → 한마리에서세마리, 10–60분 → 십분에서육십분)."""
    range_sep = r"[~\u2013\u2014]"  # ~, –(en dash), —(em dash)
    pattern = rf"(\d+(?:\.\d+)?)\s*{range_sep}\s*(\d+(?:\.\d+)?)\s*({_UNITS_PATTERN})"
    def replacer(m):
        try:
            left = float(m.group(1)) if "." in m.group(1) else int(m.group(1))
            right = float(m.group(2)) if "." in m.group(2) else int(m.group(2))
            unit = m.group(3)
            l_str = _n2gk_with_unit(left, unit)
            r_str = _n2gk_with_unit(right, unit)
            return f"{l_str}에서{r_str}"
        except (ValueError, OverflowError):
            return m.group(0)
    return re.sub(pattern, replacer, text)
def _convert_numbers_with_units(text: str) -> str:
    """숫자+단위 패턴 변환 (5개 → 다섯개, 100원 → 백원)."""
    pattern = rf"(\d{{1,3}}(?:,\d{{3}})*|\d+(?:\.\d+)?)\s?({_UNITS_PATTERN})"
    def replacer(m):
        raw = m.group(1).replace(",", "")
        word = m.group(2)
        try:
            num = float(raw) if "." in raw else int(raw)
            return _n2gk_with_unit(num, word)
        except (ValueError, OverflowError):
            return m.group(0)
    return re.sub(pattern, replacer, text)
def _convert_float_numbers(text: str) -> str:
    """소수점 숫자 변환 (3.14 → 삼점일사)."""
    def replacer(m):
        try:
            num = float(m.group(1))
            return _to_hanja(num)
        except (ValueError, OverflowError):
            return m.group(1)
    return re.sub(r"(\d+\.\d+)", replacer, text)
def _convert_pure_numbers(text: str) -> str:
    """남은 모든 숫자 변환 (1234 → 천이백삼십사).
    _convert_numbers_with_units 이후 호출되므로 단위 붙은 숫자는 이미 변환됨.
    한글 인접 숫자도 변환 (예: 200원 → 이백원, 1학년 → 일학년).
    """
    pattern = r"(\d{1,3}(?:,\d{3})+|\d+)"
    def replacer(m):
        try:
            num = int(m.group(1).replace(",", ""))
            return _to_hanja(num)
        except (ValueError, OverflowError):
            return m.group(0)
    return re.sub(pattern, replacer, text)
def _apply_exceptions(text: str) -> str:
    """예외 케이스 처리 (스무 살, 유월, 시월 등)."""
    for pattern, replacement in EXCEPTION_CASES.items():
        text = re.sub(pattern, replacement, text)
    return text
def _apply_special_symbols(text: str) -> str:
    """특수기호 → 한글 변환."""
    # 긴 패턴 먼저 매칭하기 위해 길이순으로 정렬
    for symbol in sorted(SPECIAL_SYMBOL_MAP, key=len, reverse=True):
        text = text.replace(symbol, SPECIAL_SYMBOL_MAP[symbol])
    return text
def _remove_punctuation(text: str) -> str:
    """구두점/괄호/기타 기호 제거."""
    # 괄호 안 내용 제거
    text = re.sub(r"\([^)]*\)", "", text)
    # 한글(완성형+자모), 라틴문자(악센트 포함), 숫자, 공백만 유지
    # \u00C0-\u024F: Latin Extended (à, ó, ñ, ü 등)
    text = re.sub(r"[^\uAC00-\uD7A3\u3131-\u3163a-zA-Z\u00C0-\u024F0-9\s]", "", text)
    return text
# ── Main entry point ─────────────────────────────────────────────────────────
def normalize_korean(text: str, kss=None) -> str:
    """한국어 ASR 평가용 텍스트 정규화.
    Pipeline:
    1. 한자 → 한글 변환 (kss)
    2. 통화 기호+숫자 → 숫자+통화명 ($5 → 5달러)
    3. 특수기호 → 한글 (%, ℃ 등)
    4. 예외 케이스 (유월, 시월, 스무 살)
    5. 전화번호 변환
    6. 범위+단위 변환 (1~3마리 → 한마리에서세마리)
    7. 숫자+단위 변환 (5개 → 다섯개, 100원 → 백원)
    8. 소수점 숫자 변환 (3.14 → 삼점일사)
    9. 남은 숫자 변환 (1234 → 천이백삼십사)
    10. 구두점 제거
    11. 소문자 변환
    (공백 제거는 CER 비교 시 별도로 수행하며, 이 스크립트에는 포함되지 않음)
    """
    if not text:
        return ""
    # 1. 한자 → 한글
    if kss:
        try:
            text = kss.hanja2hangul(text)
        except Exception:
            pass
    # 2. 통화 기호+숫자 → 숫자+통화명 ($5 → 5달러)
    text = _convert_currency_prefix(text)
    # 3. 특수기호 → 한글
    text = _apply_special_symbols(text)
    # 4. 예외 케이스
    text = _apply_exceptions(text)
    # 5. 전화번호
    text = _convert_phone_numbers(text)
    # 6. 범위+단위 (숫자+단위보다 먼저)
    text = _convert_range_with_units(text)
    # 7. 숫자+단위
    text = _convert_numbers_with_units(text)
    # 8. 소수점 숫자
    text = _convert_float_numbers(text)
    # 9. 남은 숫자
    text = _convert_pure_numbers(text)
    # 10. 구두점 제거
    text = _remove_punctuation(text)
    # 11. 소문자
    text = text.lower()
    return text
Thank you for using our model.
Please note that vllm-omni is a vLLM port of our research model, and some differences in model behavior may therefore be expected.
We have verified that, for a subset of our target evaluation sets, the metrics are consistent. However, Korean datasets were not included in the evaluation scope at the time of publication.
We appreciate your feedback and will look into this matter promptly.
Thanks for the detailed explanation! I'll check the transformers backend and the normalization method.
Also, I'm wondering: is there any timeline for releasing the Korean benchmarks (KVoiceBench, KOpenAudioBench, KMMAU)? It would be very valuable for the community.
Thank you for your interest in our models and research.
We are currently writing a paper that includes the development process of KVoiceBench, KOpenAudioBench, KMMAU, and details of each benchmark. We plan to publicly release these three benchmarks along with this paper. We anticipate this will take place around next week.
I tested the transformers backend (RaonSpeechPipeline.stt) and got 8.35 and 7.72 CER on the clean and other subsets, respectively. It seems vllm-omni's high CER originated from a different temperature (1.0 vs 0.2).
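For anyone else hitting this, it may help to pin the sampling settings explicitly instead of relying on the default temperature of 1.0. A hedged sketch (the keys mirror vLLM's SamplingParams; vllm-omni's actual plumbing may differ, and max_tokens is illustrative):

```python
# Illustrative ASR decoding settings; keys mirror vLLM's SamplingParams.
# Not taken from the official evaluation config.
asr_sampling = {
    "temperature": 0.2,  # low temperature for near-deterministic transcripts
    "top_p": 1.0,
    "max_tokens": 512,   # generous cap for transcript length (illustrative)
}
assert asr_sampling["temperature"] < 1.0  # avoid the sampling default
```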
Even though it does not exactly match the paper's results, it's already SOTA among open models, so I won't try to reproduce it further; results can differ due to environment and non-deterministic algorithms.
Thanks again for the text normalization code, and I'm looking forward to the release of the benchmarks and the paper!
Thank you for providing your reproduction report. RaonSpeechPipeline was developed as a simplified version of our research pipeline to facilitate easier public use, as our full internal environment is very complex.
Because the benchmarks in our technical report were based on the original research codebase, it appears that variations in metrics were not fully captured during the refactoring process.
We will prioritize an inspection of these differences and will keep the community updated on our findings.