Question about KsponSpeech evaluation setup and reproduction

#3
by seastar105 - opened

Thank you for releasing this awesome model!

I have a question regarding the evaluation setup for KsponSpeech. Could you share which reference text format was used for these benchmarks, along with the inference settings? Specifically, I'd like to know details such as the prompt format, decoding setup, normalization, and any other evaluation-related configurations.

I ran Raon-Speech-9B for STT using vllm-omni with the example prompt from the README ("Transcribe the audio into text"). I used the "spelling transcripts" obtained from the following repository as reference:
https://github.com/sooftware/ksponspeech

However, I got over 10% CER on both eval-clean and eval-other, which seems significantly worse than the reported results.

Would it be possible to share the evaluation scripts used, or provide some guidance on how to reproduce the reported results? A preprocessing script for KsponSpeech would also be immensely helpful.

In any case, Raon-Speech-9B is the first open-source model I've seen that surpasses Whisper Large v3 on various Korean ASR benchmarks. Thanks again for the great work!

Thanks for trying Raon-Speech!

The CER gap you’re seeing is most likely due to normalization differences.
We apply the following normalization pipeline before computing CER.

Benchmark Datasets

  1. KsponSpeech: we use the spelling transcripts.
  2. Fleurs-ko: some incorrect ground-truth transcripts were manually corrected.

Normalization Pipeline

  1. Hanja β†’ Hangul: ζΌ’ε­— β†’ ν•œμž
  2. Currency prefix β†’ suffix: $5 β†’ 5λ‹¬λŸ¬, €100 β†’ 100유둜
  3. Special symbols β†’ Korean reading: % β†’ νΌμ„ΌνŠΈ, ℃ β†’ 도, etc.
  4. Exception cases: 6μ›” β†’ μœ μ›”, 10μ›” β†’ μ‹œμ›”, 20μ‚΄ β†’ 슀무 μ‚΄
  5. Phone numbers: 010-1234-5678 β†’ 곡일곡-일이삼사-μ˜€μœ‘μΉ νŒ” (digit by digit)
  6. Range + unit: 1~3마리 β†’ ν•œλ§ˆλ¦¬μ—μ„œμ„Έλ§ˆλ¦¬
  7. Number + unit: converts digit+unit into a Korean numeral reading (native Korean: 5개 β†’ λ‹€μ„―κ°œ; Sino-Korean: 100원 β†’ 백원). The numeral system is chosen by the counter word:
  • Native Korean (κ³ μœ μ–΄) for 개, λͺ…, 마리, μ‹œκ°„, μž”, 병, … (1–99 only; falls back to Sino-Korean above 99)
  • Sino-Korean (ν•œμžμ–΄) for 원, λ…„, μ›”, μΈ΅, νΌμ„ΌνŠΈ, λ‹¬λŸ¬, …
  • English units are also mapped: kg β†’ ν‚¬λ‘œκ·Έλž¨, km β†’ ν‚¬λ‘œλ―Έν„°, etc.
  8. Decimal numbers: 3.14 β†’ 삼점일사
  9. Remaining pure numbers: 1234 β†’ μ²œμ΄λ°±μ‚Όμ‹­μ‚¬
  10. Remove punctuation & parenthesized content: keep only Hangul, Latin, digits, whitespace
  11. Lowercase Latin characters: Hello β†’ hello
  12. Remove whitespace (not included in the attached evaluation script)

Comparison with Existing Korean Normalizer

  • N2gk+: Our normalizer is based on N2gk+, with the following differences:
    • Added: currency prefix handling (€, β‚©, Β£, Β₯), Hanja β†’ Hangul conversion
    • Changed: native numerals above 99 fall back to Sino-Korean; pure number conversion includes Hangul-adjacent numbers
  • g2pK: Grapheme-to-phoneme converter that changes spelling (e.g., 독립 β†’ 동닙), making it unsuitable for CER evaluation against spelling transcripts.
  • KoLM: Converts all numbers to Sino-Korean only.
  • KoNLPy: Limited to social-media-style text correction. No number conversion or ASR-specific normalization.
| Feature | Ours | N2gk+ | g2pK | KoLM | KoNLPy |
|---|---|---|---|---|---|
| Number β†’ Korean (native/Sino-Korean by counter word) | O | O | O | β–³ (Sino only) | X |
| Phone number (digit by digit) | O | O | X | X | X |
| Currency prefix ($, €, β‚©) | O | β–³ ($ only) | X | X | X |
| Special symbols (%, ℃, &) | O | O | X | X | X |
| Range expression (1~3마리) | O | O | X | X | X |
| Exception cases (μœ μ›”, μ‹œμ›”) | O | O | X | X | X |
| English unit mapping (kg β†’ ν‚¬λ‘œκ·Έλž¨) | O | O | X | X | X |
| Hanja β†’ Hangul | O | X | X | β–³ | X |
| Punctuation removal | O | O | X | O | X |
Korean text normalizer code
"""Korean text normalizer for ASR evaluation (CER comparison).

Handles number→Korean conversion, special symbol mapping, hanja→hangul,
and punctuation removal for fair CER comparison.

Reference: CoreaSpeech N2gk (https://github.com/CoreaSpeech/CoreaSpeech)
"""
from __future__ import annotations

import re


# ── Number β†’ Korean conversion ──────────────────────────────────────────────

NUM_KOR = ["", "일", "이", "μ‚Ό", "사", "였", "윑", "μΉ ", "νŒ”", "ꡬ"]
UNIT_SMALL = ["", "μ‹­", "λ°±", "천"]
UNIT_LARGE = ["", "만", "μ–΅", "μ‘°", "κ²½"]
NEVER_SKIP_ONE = {"μ–΅", "μ‘°", "κ²½"}

GOOYO_SIP = {
    10: "μ—΄", 20: "슀물", 30: "μ„œλ₯Έ", 40: "λ§ˆν”",
    50: "μ‰°", 60: "예순", 70: "일흔", 80: "μ—¬λ“ ", 90: "아흔",
}
BASIC_NATIVE = {
    1: ("ν•˜λ‚˜", "ν•œ"), 2: ("λ‘˜", "두"), 3: ("μ…‹", "μ„Έ"), 4: ("λ„·", "λ„€"),
    5: ("λ‹€μ„―", "λ‹€μ„―"), 6: ("μ—¬μ„―", "μ—¬μ„―"), 7: ("일곱", "일곱"),
    8: ("μ—¬λŸ", "μ—¬λŸ"), 9: ("아홉", "아홉"),
}
GOOYO_PREFIX_TENS = {20: "슀무"}

# Counters that take native Korean numerals
NATIVE_UNITS = {
    "λͺ…", "μ‚¬λžŒ", "마리", "번째", "μ‹œ", "λ°°", "λ°©", "가ꡬ", "κ²Œμž„", "건", "μ„ΈνŠΈ",
    "개", "κ°€μ§€", "κ°œλΉ„", "μž”", "번", "μž₯", "병", "ꢌ", "벌", "κ³³", "μ‹œκ°„",
    "μ²™", "μ°¨λ‘€", "바퀴", "κ²½κΈ°", "골", "μ‚΄", "μ—°μ„Έ", "μΆ˜μΆ”",
    "달", "κΈ€μž",
}

# ν•œμžμ–΄ μˆ˜μ‚¬λ₯Ό μ“°λŠ” λ‹¨μœ„
HANJA_UNITS = {
    "초", "λΆ„", "일", "μ£Ό", "κ°œμ›”", "μ›”", "λ…„",
    "점", "포인트", "νΌμ„ΌνŠΈ", "레벨", "점수", "λ“±κΈ‰", "λ“±", "개ꡭ", "볼트",
    "원", "λ‹¬λŸ¬", "유둜", "μ—”", "νŽ˜μ†Œ", "배럴",
    "회", "μ°¨", "κΈ°", "호", "νŽ˜μ΄μ§€",
    "도",
    # Education / ranking
    "ν•™λ…„", "ν•™κΈ°", "학점", "ν•™λ²ˆ", "κ΅μ‹œ", "반",
    "κΈ‰", "단계", "μœ„", "ν˜•",
    # Military / organization
    "사단",
    # Buildings / spaces
    "μΈ΅",
    # Generations / eras
    "μ„ΈκΈ°", "λŒ€", "μ„ΈλŒ€",
    # Competition / sports
    "λΌμš΄λ“œ",
    # Large monetary units
    "λ§Œμ›", "만λͺ…", "만",
    # Foreign units
    "ν”ΌνŠΈ", "νŒŒμš΄λ“œ", "마일", "인치", "ν—₯타λ₯΄",
    # Person-count suffixes (인승, 인극)
    "인승", "인극",
}

# Unit name conversion (English units β†’ Hangul)
UNIT_NAME_MAP = {
    "kg": "ν‚¬λ‘œκ·Έλž¨", "Kg": "ν‚¬λ‘œκ·Έλž¨", "g": "그램", "mg": "λ°€λ¦¬κ·Έλž¨",
    "t": "톀", "T": "톀", "l": "리터", "L": "리터", "ml": "밀리리터",
    "cm": "μ„Όν‹°λ―Έν„°", "mm": "밀리미터", "m": "λ―Έν„°", "km": "ν‚¬λ‘œλ―Έν„°",
    "mi": "마일",
}

# Special symbols β†’ Hangul
SPECIAL_SYMBOL_MAP = {
    "οΌ…": "νΌμ„ΌνŠΈ", "%": "νΌμ„ΌνŠΈ",
    "%p": "νΌμ„ΌνŠΈν¬μΈνŠΈ", "% p": "νΌμ„ΌνŠΈν¬μΈνŠΈ",
    "&": "μ•€", "#": "샡", "@": "μ•³",
    "+": "ν”ŒλŸ¬μŠ€", "Β±": "ν”ŒλŸ¬μŠ€λ§ˆμ΄λ„ˆμŠ€",
    "㎝": "μ„Όν‹°λ―Έν„°", "㎜": "밀리미터", "㎏": "ν‚¬λ‘œκ·Έλž¨",
    "γŽ–": "밀리리터", "℃": "도", "㎞": "ν‚¬λ‘œλ―Έν„°", "㎎": "λ°€λ¦¬κ·Έλž¨",
    "㎑": "μ œκ³±λ―Έν„°", "γŽ₯": "μ„Έμ œκ³±λ―Έν„°",
    "~": "~",  # no-op entry: keeps "~" so range conversion can still match it
    # NOTE: a bare "m" β†’ "λ―Έν„°" entry here would rewrite every Latin "m"
    # (e.g. "km" β†’ "kλ―Έν„°"); digit-adjacent "m" is handled via UNIT_NAME_MAP.
    "Β°C": "도", "Β°c": "도",
}

# μ˜ˆμ™Έ μΌ€μ΄μŠ€
EXCEPTION_CASES = {
    r"\b20\s?μ‚΄\b": "슀무 μ‚΄",
    r"\b1\s?λ“±\b": "일 λ“±",
    r"(?<!\d)(0?6)\s*μ›”": "μœ μ›”",
    r"(?<!\d)(10)\s*μ›”": "μ‹œμ›”",
}


def _to_gooyo(num: int, prefix: bool = False) -> str:
    """κ³ μœ μ–΄ μˆ˜μ‚¬ λ³€ν™˜ (1~99)."""
    if num <= 0:
        return "영"
    if num <= 9:
        base = BASIC_NATIVE.get(num)
        return (base[1] if prefix else base[0]) if base else "영"
    if num == 10:
        return "μ—΄"
    if num < 100:
        tens = (num // 10) * 10
        ones = num % 10
        if prefix and ones == 0 and tens in GOOYO_PREFIX_TENS:
            return GOOYO_PREFIX_TENS[tens]
        tens_str = GOOYO_SIP.get(tens, "")
        return tens_str + (_to_gooyo(ones, prefix=prefix) if ones else "")
    # Fall back to Sino-Korean for 100 and above
    return _to_hanja(num)


def _convert_small_unit(chunk: str) -> str:
    """4자리 μ΄ν•˜ 숫자 청크λ₯Ό ν•œμžμ–΄ μˆ˜μ‚¬λ‘œ λ³€ν™˜."""
    result = ""
    length = len(chunk)
    for i, ch in enumerate(chunk):
        digit = int(ch)
        if digit == 0:
            continue
        unit = UNIT_SMALL[length - i - 1]
        if digit == 1 and unit:
            result += unit
        else:
            result += NUM_KOR[digit] + unit
    return result


def _to_hanja(num, natural: bool = True) -> str:
    """ν•œμžμ–΄ μˆ˜μ‚¬ λ³€ν™˜."""
    if isinstance(num, float):
        int_part = int(num)
        frac_str = str(num).split(".")[1]
        int_kor = _to_hanja(int_part, natural)
        frac_kor = "".join(
            NUM_KOR[int(ch)] if ch != "0" else "영" for ch in frac_str
        )
        return f"{int_kor}점{frac_kor}"

    if isinstance(num, str):
        try:
            num = float(num) if "." in num else int(num)
            return _to_hanja(num, natural)
        except ValueError:
            return num

    if num == 0:
        return "영"
    if num < 0:
        return "λ§ˆμ΄λ„ˆμŠ€ " + _to_hanja(-num, natural)

    s = str(num)
    chunks = [s[max(i - 4, 0):i] for i in range(len(s), 0, -4)][::-1]
    if len(chunks) > 5:
        return str(num)

    result = ""
    for i, chunk in enumerate(chunks):
        if int(chunk) == 0:
            continue
        part = _convert_small_unit(chunk.zfill(4))
        unit = UNIT_LARGE[len(chunks) - i - 1]
        if part == "일" and unit:
            if natural and unit not in NEVER_SKIP_ONE:
                part = ""
        result += part + unit
    return result


def _n2gk_with_unit(num: int, unit: str) -> str:
    """숫자+λ‹¨μœ„λ₯Ό μΈμ‹ν•˜μ—¬ μ μ ˆν•œ μˆ˜μ‚¬(κ³ μœ μ–΄/ν•œμžμ–΄)둜 λ³€ν™˜."""
    # λ‹¨μœ„λͺ… λ³€ν™˜ (영문 β†’ ν•œκΈ€)
    display_unit = UNIT_NAME_MAP.get(unit, unit)

    if unit in NATIVE_UNITS and 1 <= num <= 99:
        return _to_gooyo(num, prefix=True) + display_unit
    else:
        return _to_hanja(num) + display_unit


# ── All unit strings sorted by length (longest first for regex matching) ──
_ALL_UNITS = sorted(
    list(NATIVE_UNITS) + list(HANJA_UNITS) + list(UNIT_NAME_MAP.keys()),
    key=len, reverse=True,
)
# Add a word boundary after English units (prevents "m" from matching the start of "mi")
_UNITS_PATTERN = "|".join(
    re.escape(u) + r"(?![a-zA-Z])" if re.match(r"^[a-zA-Z]+$", u) else re.escape(u)
    for u in _ALL_UNITS
)


# Currency symbols (prefix currency before a number β†’ moved after the number)
CURRENCY_PREFIX_MAP = {
    "$": "λ‹¬λŸ¬",
    "€": "유둜",
    "Β£": "νŒŒμš΄λ“œ",
    "Β₯": "μ—”",
    "β‚©": "원",
}
_CURRENCY_SYMBOLS = "|".join(re.escape(s) for s in CURRENCY_PREFIX_MAP)


def _convert_currency_prefix(text: str) -> str:
    """톡화 기호+숫자 β†’ 숫자+톡화λͺ… λ³€ν™˜ ($5 β†’ 5λ‹¬λŸ¬, €100 β†’ 100유둜)."""
    pattern = rf"({_CURRENCY_SYMBOLS})\s*(\d+(?:[.,]\d+)*)"

    def replacer(m):
        currency = CURRENCY_PREFIX_MAP[m.group(1)]
        num = m.group(2)
        return f"{num}{currency}"

    return re.sub(pattern, replacer, text)


def _convert_phone_numbers(text: str) -> str:
    """μ „ν™”λ²ˆν˜Έ νŒ¨ν„΄ λ³€ν™˜ (010-1234-5678 β†’ 곡일곡-일이삼사-μ˜€μœ‘μΉ νŒ”)."""
    DIGIT_KOR = ["곡", "일", "이", "μ‚Ό", "사", "였", "윑", "μΉ ", "νŒ”", "ꡬ"]

    def _digits_to_kr(s: str) -> str:
        return "".join(DIGIT_KOR[int(d)] for d in s)

    # ν•˜μ΄ν”ˆ 포함
    text = re.sub(
        r"(?<!\d)(\d{2,3})-(\d{3,4})-(\d{4})(?!\d)",
        lambda m: f"{_digits_to_kr(m[1])}-{_digits_to_kr(m[2])}-{_digits_to_kr(m[3])}",
        text,
    )
    # 11 consecutive digits
    text = re.sub(
        r"(?<!\d)(\d{11})(?!\d)",
        lambda m: f"{_digits_to_kr(m[1][:3])}-{_digits_to_kr(m[1][3:7])}-{_digits_to_kr(m[1][7:])}",
        text,
    )
    return text


def _convert_range_with_units(text: str) -> str:
    """λ²”μœ„ νŒ¨ν„΄ λ³€ν™˜ (1~3마리 β†’ ν•œλ§ˆλ¦¬μ—μ„œμ„Έλ§ˆλ¦¬, 10–60λΆ„ β†’ μ‹­λΆ„μ—μ„œμœ‘μ‹­λΆ„)."""
    range_sep = r"[~\u2013\u2014]"  # ~, –(en-dash), β€”(em-dash)
    pattern = rf"(\d+(?:\.\d+)?)\s*{range_sep}\s*(\d+(?:\.\d+)?)\s*({_UNITS_PATTERN})"

    def replacer(m):
        try:
            left = float(m.group(1)) if "." in m.group(1) else int(m.group(1))
            right = float(m.group(2)) if "." in m.group(2) else int(m.group(2))
            unit = m.group(3)
            l_str = _n2gk_with_unit(left, unit)
            r_str = _n2gk_with_unit(right, unit)
            return f"{l_str}μ—μ„œ{r_str}"
        except (ValueError, OverflowError):
            return m.group(0)

    return re.sub(pattern, replacer, text)


def _convert_numbers_with_units(text: str) -> str:
    """숫자+λ‹¨μœ„ νŒ¨ν„΄ λ³€ν™˜ (5개 β†’ λ‹€μ„―κ°œ, 100원 β†’ 백원)."""
    pattern = rf"(\d{{1,3}}(?:,\d{{3}})*|\d+(?:\.\d+)?)\s?({_UNITS_PATTERN})"

    def replacer(m):
        raw = m.group(1).replace(",", "")
        word = m.group(2)
        try:
            num = float(raw) if "." in raw else int(raw)
            return _n2gk_with_unit(num, word)
        except (ValueError, OverflowError):
            return m.group(0)

    return re.sub(pattern, replacer, text)


def _convert_float_numbers(text: str) -> str:
    """μ†Œμˆ˜μ  숫자 λ³€ν™˜ (3.14 β†’ 삼점일사)."""
    def replacer(m):
        try:
            num = float(m.group(1))
            return _to_hanja(num)
        except (ValueError, OverflowError):
            return m.group(1)

    return re.sub(r"(\d+\.\d+)", replacer, text)


def _convert_pure_numbers(text: str) -> str:
    """남은 λͺ¨λ“  숫자 λ³€ν™˜ (1234 β†’ μ²œμ΄λ°±μ‚Όμ‹­μ‚¬).

    _convert_numbers_with_units 이후 ν˜ΈμΆœλ˜λ―€λ‘œ λ‹¨μœ„ 뢙은 μˆ«μžλŠ” 이미 λ³€ν™˜λ¨.
    ν•œκΈ€ 인접 μˆ«μžλ„ λ³€ν™˜ (예: 200은 β†’ 이백은, 1ν•™λ…„ β†’ 일학년).
    """
    pattern = r"(\d{1,3}(?:,\d{3})+|\d+)"

    def replacer(m):
        try:
            num = int(m.group(1).replace(",", ""))
            return _to_hanja(num)
        except (ValueError, OverflowError):
            return m.group(0)

    return re.sub(pattern, replacer, text)


def _apply_exceptions(text: str) -> str:
    """μ˜ˆμ™Έ μΌ€μ΄μŠ€ 처리 (μŠ€λ¬΄μ‚΄, μœ μ›”, μ‹œμ›” λ“±)."""
    for pattern, replacement in EXCEPTION_CASES.items():
        text = re.sub(pattern, replacement, text)
    return text


def _apply_special_symbols(text: str) -> str:
    """특수기호 β†’ ν•œκΈ€ λ³€ν™˜."""
    # κΈ΄ νŒ¨ν„΄ λ¨Όμ € λ§€μΉ­ν•˜κΈ° μœ„ν•΄ 길이 μ—­μˆœ μ •λ ¬
    for symbol in sorted(SPECIAL_SYMBOL_MAP, key=len, reverse=True):
        text = text.replace(symbol, SPECIAL_SYMBOL_MAP[symbol])
    return text


def _remove_punctuation(text: str) -> str:
    """ꡬ두점/κ΄„ν˜Έ/기타 기호 제거."""
    # κ΄„ν˜Έ μ•ˆ λ‚΄μš© 제거
    text = re.sub(r"\([^)]*\)", "", text)
    # ν•œκΈ€(μ™„μ„±ν˜•+자λͺ¨), λΌν‹΄λ¬Έμž(μ•…μ„ΌνŠΈ 포함), 숫자, 곡백만 μœ μ§€
    # \u00C0-\u024F: Latin Extended (Γ , Γ³, Γ±, ΓΌ λ“±)
    text = re.sub(r"[^\uAC00-\uD7A3\u3131-\u3163a-zA-Z\u00C0-\u024F0-9\s]", "", text)
    return text


# ── Main entry point ─────────────────────────────────────────────────────────

def normalize_korean(text: str, kss=None) -> str:
    """Korean text normalization for ASR evaluation.

    Pipeline:
    1. Hanja β†’ Hangul (kss)
    2. Currency prefix β†’ suffix ($5 β†’ 5λ‹¬λŸ¬)
    3. Special symbols β†’ Hangul (%, ℃, etc.)
    4. Exception cases (μœ μ›”, μ‹œμ›”, μŠ€λ¬΄μ‚΄)
    5. Phone numbers (010-1234-5678 β†’ 곡일곡-일이삼사-μ˜€μœ‘μΉ νŒ”)
    6. Range + unit (1~3마리 β†’ ν•œλ§ˆλ¦¬μ—μ„œμ„Έλ§ˆλ¦¬)
    7. Number + unit (5개 β†’ λ‹€μ„―κ°œ, 100원 β†’ 백원)
    8. Decimal numbers (3.14 β†’ 삼점일사)
    9. Remaining pure numbers (1234 β†’ μ²œμ΄λ°±μ‚Όμ‹­μ‚¬)
    10. Punctuation removal
    11. Lowercasing
    (Whitespace removal for CER comparison is applied separately,
    outside this function.)
    """
    if not text:
        return ""

    # 1. Hanja β†’ Hangul
    if kss:
        try:
            text = kss.hanja2hangul(text)
        except Exception:
            pass

    # 2. Currency symbol + number β†’ number + currency name ($5 β†’ 5λ‹¬λŸ¬)
    text = _convert_currency_prefix(text)

    # 3. Special symbols β†’ Hangul
    text = _apply_special_symbols(text)

    # 4. Exception cases
    text = _apply_exceptions(text)

    # 5. Phone numbers
    text = _convert_phone_numbers(text)

    # 6. Range + unit (must run before number + unit)
    text = _convert_range_with_units(text)

    # 7. Number + unit
    text = _convert_numbers_with_units(text)

    # 8. Decimal numbers
    text = _convert_float_numbers(text)

    # 9. Remaining pure numbers
    text = _convert_pure_numbers(text)

    # 10. Punctuation removal
    text = _remove_punctuation(text)

    # 11. Lowercase
    text = text.lower()

    return text
KRAFTON org

Thank you for using our model.

Please note that vllm-omni is a vLLM port of our research model, and some differences in model behavior may therefore be expected.

We have verified that, for a subset of our target evaluation sets, the metrics are consistent. However, Korean datasets were not included in the evaluation scope at the time of publication.

We appreciate your feedback and will look into this matter promptly.

Thanks for the detailed explanation! I'll check the transformers backend and the normalization method.

Also, I'm wondering whether there is any timeline for releasing the Korean benchmarks (KVoiceBench, KOpenAudioBench, KMMAU)? It would be very valuable for the community.

KRAFTON org

Thank you for your interest in our models and research.

We are currently writing a paper that includes the development process of KVoiceBench, KOpenAudioBench, KMMAU, and details of each benchmark. We plan to publicly release these three benchmarks along with this paper. We anticipate this will take place around next week.

I tested the transformers backend (RaonSpeechPipeline.stt) and got 8.35 and 7.72 CER on the clean and other subsets, respectively. It seems vllm-omni's high CER originated from a different temperature (1.0 vs 0.2).

Even though it does not match the paper's results, it's already SOTA among open models, so I won't try to reproduce further, since results can differ due to environment and non-deterministic algorithms.

Thanks again for the text normalization code, and I'm looking forward to the release of the benchmarks and the paper.

KRAFTON org

Thank you for providing your reproduction report. RaonSpeechPipeline was developed as a simplified version of our research pipeline to facilitate easier public use, as our full internal environment is very complex.

Because the benchmarks in our technical report were based on the original research codebase, it appears that variations in metrics were not fully captured during the refactoring process.

We will prioritize an inspection of these differences and will keep the community updated on our findings.
