llm-jp-4-8b-speech-asr

English (Japanese version follows below)

Overview

llm-jp-4-8b-speech-asr is a Japanese speech-language model specialized for automatic speech recognition (ASR). It takes audio as input and generates text transcriptions. This release is the ASR-specialized checkpoint of our Japanese Speech LLM project.

Provenance

This model belongs to the LLM-jp-4 8B family, but it is not a direct derivative of the publicly released llm-jp/llm-jp-4-8b-base or llm-jp/llm-jp-4-8b-thinking. Instead, it was initialized from a pre-release intermediate checkpoint of the llm-jp-4-8b development line that was distributed for a competition, or from a pseudo-base model derived from that checkpoint.

Usage

Quick start

pip install git+https://github.com/Atotti/ja-speech-llm.git

Minimal inference example

import torch
import torchaudio
from transformers import AutoProcessor, AutoTokenizer
from speech_llm_ja import LlamaForSpeechLM, LlamaForSpeechLMConfig

MODEL_ID = "Atotti/llm-jp-4-8b-speech-asr"

config = LlamaForSpeechLMConfig.from_pretrained(MODEL_ID)
model = LlamaForSpeechLM.from_pretrained(
    MODEL_ID,
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()

encoder_processor = AutoProcessor.from_pretrained(model.config.encoder_id)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# Load audio
waveform, sample_rate = torchaudio.load("path/to/your_audio_file.wav")
if waveform.size(0) > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
waveform = waveform.squeeze(0)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Build prompt
instruction = "音声を書き起こしてください。"
prompt = f"""あなたは音声を理解できるAIアシスタントです。

<|reserved_343|><|reserved_342|>### 指示:
{instruction}

### 応答:
"""

# Encode
encoder_inputs = encoder_processor(
    [waveform.numpy()],
    return_tensors="pt",
    return_attention_mask=True,
    sampling_rate=16000,
)
decoder_inputs = tokenizer(prompt, return_tensors="pt")

# Generate
with torch.no_grad():
    output_ids = model.generate(
        input_features=encoder_inputs.input_features.to(model.device),
        input_ids=decoder_inputs.input_ids.to(model.device),
        encoder_attention_mask=encoder_inputs.attention_mask.to(model.device),
        decoder_attention_mask=decoder_inputs.attention_mask.to(model.device),
        max_new_tokens=256,
        do_sample=False,
    )

generated_ids = output_ids[0, decoder_inputs.input_ids.shape[1]:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))

Recommended prompt

Task	Prompt
Speech transcription	`音声を書き起こしてください。`
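
To reuse the recommended prompt without retyping the template, the prompt string from the inference example can be wrapped in a small function. This is a minimal sketch: `build_prompt`, `SYSTEM_LINE`, and `RESERVED_TOKENS` are hypothetical names introduced here for illustration, not part of `speech_llm_ja`. The template text itself is copied unchanged from the example above.

```python
# Template pieces copied verbatim from the inference example in this card.
# The constant and function names are hypothetical helpers, not library API.
SYSTEM_LINE = "あなたは音声を理解できるAIアシスタントです。"
RESERVED_TOKENS = "<|reserved_343|><|reserved_342|>"
DEFAULT_INSTRUCTION = "音声を書き起こしてください。"


def build_prompt(instruction: str = DEFAULT_INSTRUCTION) -> str:
    """Return the decoder prompt for the given instruction."""
    return (
        f"{SYSTEM_LINE}\n\n"
        f"{RESERVED_TOKENS}### 指示:\n"
        f"{instruction}\n\n"
        "### 応答:\n"
    )


if __name__ == "__main__":
    print(build_prompt())
```

The returned string can be passed to `tokenizer(prompt, return_tensors="pt")` exactly as in the inference example.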

Notes:

This is an ASR-specialized checkpoint.
For dialogue-style interaction, use the chat checkpoint instead.
Performance may degrade on noisy, long, or highly spontaneous speech.
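
One common workaround for long recordings is to transcribe them in fixed-length pieces and join the results. The sketch below only computes the sample boundaries for such pieces. The 30-second window and the `chunk_bounds` helper are assumptions made for illustration; this card does not state the encoder's maximum input length or recommend a chunking strategy.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16000,
                 chunk_seconds: float = 30.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive chunks.

    chunk_seconds=30.0 is an assumed window length, not a value taken
    from the model card. The last chunk may be shorter than the others.
    """
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sample_rate must be at least 1 sample")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


if __name__ == "__main__":
    # A 70-second recording at 16 kHz is split into 30 s, 30 s, and 10 s pieces.
    print(chunk_bounds(16000 * 70))
```

Each slice `waveform[start:end]` could then go through the encode and generate steps of the inference example, with the transcripts concatenated in order. Cutting at fixed boundaries can split a word in two, so results near chunk edges should be checked.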

Evaluation

Model	ADU-Bench (ja) ↑	CommonVoice 8 (ja) CER ↓
Whisper-large-v3	-	8.51
SALMONN	1.37	-
Qwen-Audio-Chat	1.08	-
Voxtral Mini 3B-2507	5.181	15.65
Gemma3n E4B-it	5.143	51.23
llm-jp-4-8b-speech-asr	-	8.36
llm-jp-4-8b-speech-chat	5.335	10.25
llm-jp-4-8b-speech-chat-dpo-exp	5.165	10.42
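
For readers unfamiliar with the CER column, character error rate is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below implements that standard definition and reports it as a percentage, on the assumption that the table values are percentages. It is not the evaluation script used for this table, and any text normalization applied before scoring (punctuation, whitespace, script conversion) is not specified in this card.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(cer("こんにちは", "こんにちわ"))
```

Corpus-level CER is usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance scores.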

Limitations

Primarily optimized for Japanese.
May make transcription errors.

日本語

概要

llm-jp-4-8b-speech-asr は、自動音声認識（ASR）に特化した日本語音声言語モデルです。音声を入力として受け取り、テキスト書き起こしを生成します。この公開物は、日本語音声LLMプロジェクトにおける ASR 特化チェックポイントです。

モデルの由来

本モデルは LLM-jp-4 8B 系列に属しますが、公開されている llm-jp/llm-jp-4-8b-base や llm-jp/llm-jp-4-8b-thinking から直接派生したものではありません。 llm-jp-4-8b 開発系列のコンペ配布中間チェックポイント / そこから派生した仮モデルを初期値として使用しています。

使い方

クイックスタート

pip install git+https://github.com/Atotti/ja-speech-llm.git

最小推論例

import torch
import torchaudio
from transformers import AutoProcessor, AutoTokenizer
from speech_llm_ja import LlamaForSpeechLM, LlamaForSpeechLMConfig

MODEL_ID = "Atotti/llm-jp-4-8b-speech-asr"

config = LlamaForSpeechLMConfig.from_pretrained(MODEL_ID)
model = LlamaForSpeechLM.from_pretrained(
    MODEL_ID,
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()

encoder_processor = AutoProcessor.from_pretrained(model.config.encoder_id)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# 音声読み込み
waveform, sample_rate = torchaudio.load("path/to/your_audio_file.wav")
if waveform.size(0) > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
waveform = waveform.squeeze(0)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# プロンプト構築
instruction = "音声を書き起こしてください。"
prompt = f"""あなたは音声を理解できるAIアシスタントです。

<|reserved_343|><|reserved_342|>### 指示:
{instruction}

### 応答:
"""

# エンコード
encoder_inputs = encoder_processor(
    [waveform.numpy()],
    return_tensors="pt",
    return_attention_mask=True,
    sampling_rate=16000,
)
decoder_inputs = tokenizer(prompt, return_tensors="pt")

# 生成
with torch.no_grad():
    output_ids = model.generate(
        input_features=encoder_inputs.input_features.to(model.device),
        input_ids=decoder_inputs.input_ids.to(model.device),
        encoder_attention_mask=encoder_inputs.attention_mask.to(model.device),
        decoder_attention_mask=decoder_inputs.attention_mask.to(model.device),
        max_new_tokens=256,
        do_sample=False,
    )

generated_ids = output_ids[0, decoder_inputs.input_ids.shape[1]:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))

推奨プロンプト

タスク	プロンプト
音声書き起こし	`音声を書き起こしてください。`

注意:

これは ASR 特化チェックポイントです。
対話用途には chat チェックポイントの利用を推奨します。
雑音環境、長時間音声、自発性の高い発話では性能が低下する可能性があります。

評価

モデル	ADU-Bench (ja) ↑	CommonVoice 8 (ja) CER ↓
Whisper-large-v3	-	8.51
SALMONN	1.37	-
Qwen-Audio-Chat	1.08	-
Voxtral Mini 3B-2507	5.181	15.65
Gemma3n E4B-it	5.143	51.23
llm-jp-4-8b-speech-asr	-	8.36
llm-jp-4-8b-speech-chat	5.335	10.25
llm-jp-4-8b-speech-chat-dpo-exp	5.165	10.42

制限

主対象は日本語です。
書き起こし誤りが発生する可能性があります。

Reference

@misc{tsutsumi2026jaspeechllmasr,
  title={atotti/llm-jp-4-8b-speech-asr},
  url={https://huggingface.co/atotti/llm-jp-4-8b-speech-asr},
  author={Ayuto Tsutsumi and Haruki Oshiro},
  year={2026},
}
