Fine-tuned by: Arttu Pakarinen

A CSM TTS model fine-tuned on Finnish parliament speeches.
It produces a lot of filler sounds ("Ööö, äää, öhm...") and works better with longer sentences.
🧪 Test script
```python
#!/usr/bin/env python3
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

MODEL_ID = "ArttuPakarinen/sesame-csm-FIN-parlament-full-finetune"
BASE_ID = "sesame/csm-1b"  # processor comes from the base model

device = "cuda" if torch.cuda.is_available() else "cpu"

# Disable flash / mem-efficient SDPA if your setup has issues with them
if hasattr(torch.backends.cuda, "enable_flash_sdp"):
    torch.backends.cuda.enable_flash_sdp(False)
    torch.backends.cuda.enable_math_sdp(True)
    torch.backends.cuda.enable_mem_efficient_sdp(False)

processor = AutoProcessor.from_pretrained(BASE_ID)
model = CsmForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    low_cpu_mem_usage=True,
    attn_implementation="eager",
).to(device)
model.eval()
model.config.use_cache = True
try:
    model.generation_config.attn_implementation = "eager"
except Exception:
    pass

text = "Ihanaa, kun voi generoida ääntä!"  # "Wonderful to be able to generate audio!"
conversation = [{"role": "0", "content": [{"type": "text", "text": text}]}]

raw = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)

# attention_mask -> bool (some setups expect this)
inputs = {
    k: (v.to(device).to(torch.bool) if k == "attention_mask" else v.to(device))
    for k, v in raw.items()
}

with torch.no_grad(), torch.amp.autocast("cuda", enabled=(device == "cuda")):
    audio = model.generate(
        **inputs,
        output_audio=True,
        use_cache=True,
        max_new_tokens=600,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

processor.save_audio(audio, "tulos.wav")
print("OK: tulos.wav")
```
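To sanity-check the output without listening to it, a minimal stdlib sketch (this helper is not part of the script above; it assumes `tulos.wav` was written as a standard PCM WAV file):

```python
import wave


def wav_duration(path: str) -> float:
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


# e.g. wav_duration("tulos.wav") after running the test script above
```

A very short duration usually means generation stopped early; very long output tends to be mostly filler sounds.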
Base model: sesame/csm-1b