How to use Supertone/supertonic-3 with Supertonic:
from supertonic import TTS

tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
wav, duration = tts.synthesize(text, voice_style=style)
tts.save_audio(wav, "output.wav")
emotion tags not working
I tried adding the emotion tags described in the repo, and also added them as tags, but it still did not work. The model just ignored them and synthesized the speech normally. Does anyone know why this happens, or is it my mistake?
This problem is happening to me too! I thought it was only me at first.
Same here, brother. I tried passing the tag in several different ways (laugh, laughs, laughing inside <>), but none of them worked.
Thanks for reporting this, and sorry for the confusion.
You are right that the current expression tag support is still limited. We added tags such as <laugh>, <breath>, and <sigh> because many users asked for non-verbal and expressive sounds after the previous release, and we did include tagged examples in the training data. However, the tagged data is currently available mostly for Korean, English, and Japanese, and even within those languages the consistency of the labels is not yet as strong as we would like.
Because of that, the model may sometimes ignore the tag or simply synthesize the surrounding text normally. This is a known limitation of the current Supertonic 3 release, and we should have communicated it more clearly. Sorry about that.
A few things may help in the current version:
- Tags tend to work better in Korean, English, and Japanese.
- Tags often work better when placed at the beginning or end of the sentence.
- If a single tag is ignored, repeating it two or three times can sometimes make the expression more likely to appear.
For example, you may get better results with inputs like:
<laugh> <laugh> That was not what I expected. <laugh> <laugh>
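The workaround above (repeat the tag and keep it at the start or end of the sentence) can be sketched as a small text helper. This is only an illustration: `add_expression_tags` and its parameters are hypothetical names, not part of the Supertonic package.

```python
def add_expression_tags(text: str, tag: str, repeat: int = 2, position: str = "both") -> str:
    """Wrap text with a repeated expression tag such as <laugh>.

    Hypothetical helper: it only builds the input string and does not call the model.
    """
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    if position not in {"start", "end", "both"}:
        raise ValueError("position must be 'start', 'end', or 'both'")

    run = " ".join([f"<{tag}>"] * repeat)  # e.g. "<laugh> <laugh>"
    parts = []
    if position in {"start", "both"}:
        parts.append(run)
    parts.append(text)
    if position in {"end", "both"}:
        parts.append(run)
    return " ".join(parts)


tagged = add_expression_tags("That was not what I expected.", "laugh")
print(tagged)
```

The resulting string can then be passed to `tts.synthesize(tagged, voice_style=style)` in the same way as the plain text in the model card snippet.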
We are planning to improve this with additional synthetic data and supervised fine-tuning, so expression tags should become more reliable in future updates.
Thanks again for trying the model and pointing this out. We appreciate the feedback, and we'll share updates when this improves.
I ran several tests with the tags, sentence by sentence. My conclusions were:
<laugh> - Best as a single triple sequence at the end. "That joke was actually funny. <laugh> <laugh> <laugh>"
<breath> - Best as a single triple sequence at the beginning. "<breath> <breath> <breath> I finally made it to the top of the mountain."
<surprise> - Best as a triple combination. "<surprise> <surprise> <surprise> You bought this for me? No way! <surprise> <surprise> <surprise>"
<sigh> - Best as a double combination. "<sigh> <sigh> Another meeting added to my calendar. <sigh> <sigh>"
<scream> - Spoke the tag in all situations.
<throatclear> - Spoke the tag in all situations.
In a second test, with the tag in the middle of the sentence ("Before we begin, <throatclear> I have something to say."), the model also spoke the tag in all situations.
This was tested with one, two, and three tags:
"Before we begin, <throatclear> I have something to tell you. Due to recent events <throatclear>... sorry."
<sad> - Best as a triple combination. "<sad> <sad> <sad> I never thought it would end like this. <sad> <sad> <sad>"
<angry> - Missed tag in all situations.
<cough> - Best as a single triple sequence at the beginning. "<cough> <cough> <cough> This room is so dusty."
<yawn> - Missed tag in all situations. Also tested in the middle of the sentence.
As with <throatclear>, the model spoke the tag when <yawn> was placed in the middle of the sentence.
However, in most situations the tag was suppressed and had no audible effect on the sentence.
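For anyone who wants to repeat this kind of comparison, the placements tested here (start, end, or both, with the tag repeated one to three times, plus an untagged control) can be enumerated with a short script. This is a sketch under my own naming; `tag_variants` is a hypothetical helper and only builds the input strings.

```python
from itertools import product


def tag_variants(text, tag, max_repeat=3):
    """Return (label, tagged_text) pairs: one control plus start/end/both x 1..max_repeat."""
    variants = [("control (no tag)", text)]
    for position, n in product(("start", "end", "both"), range(1, max_repeat + 1)):
        run = " ".join([f"<{tag}>"] * n)
        if position == "start":
            tagged = f"{run} {text}"
        elif position == "end":
            tagged = f"{text} {run}"
        else:  # both
            tagged = f"{run} {text} {run}"
        variants.append((f"{position} ×{n}", tagged))
    return variants


for label, tagged in tag_variants("This room is so dusty.", "cough"):
    print(f"[{label}] {tagged}")
```

Each tag yields ten inputs, so ten tags give the 100 sentences listed in the report. Every string can be fed to `tts.synthesize` and rated by ear.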
Claude said:
Hypothesis on the "speaking the tag" mechanism
The "speaking N-1 tags" pattern (×3 speaks two, ×2 speaks one) suggests that the model consumes one tag as a prosodic signal but treats the excess tags as text. Position matters because at the beginning of the sentence the model is still establishing the context: without prior text, some tags lack semantic anchoring and are vocalized. At the end of the sentence the same problem appears in reverse: the text flow has already finished, so the tag is left suspended.
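One rough way to check this hypothesis against the ratings in this post is to compare the predicted number of spoken tags with the counts noted in the report. The sketch below copies only the start-position rows for two and three tags where the report gives an explicit count; the function names are hypothetical, and the data set is far too small to prove anything.

```python
# Spoken-tag counts copied from the start-position rows of the report:
# tag -> {number of tags supplied: number of times the tag was read aloud}
observed = {
    "scream": {2: 1, 3: 2},
    "angry": {2: 1, 3: 2},
    "yawn": {2: 1, 3: 2},
    "throatclear": {2: 2, 3: 3},
}


def predicted_spoken(n_tags):
    """N-1 hypothesis: one tag is consumed as a cue, the rest are read as text."""
    return max(n_tags - 1, 0)


def hypothesis_holds(counts):
    """True if every observed count equals the N-1 prediction."""
    return all(predicted_spoken(n) == spoken for n, spoken in counts.items())


for tag, counts in observed.items():
    verdict = "consistent" if hypothesis_holds(counts) else "not consistent"
    print(f"<{tag}>: {verdict} with N-1")
```

Single-tag rows are left out of the table. For these four tags the report notes the single tag being spoken, which the N-1 rule would not predict, so the pattern only fits part of the data.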
The test results follow:
────────────────────────────────────────────────────────────────────────
REPORT - EXPRESSION TAGS SUPERTONIC 3
Generated : 2026-05-16 11:27:20
Voice : F1
Total : 100 sentences
────────────────────────────────────────────────────────────────────────
TAG: <laugh>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : That joke was actually funny.
Duration : 2.17s
Rating : OK
[start ×1 [T] text]
Text : <laugh> That joke was actually funny.
Duration : 2.58s
Rating : INDIFFERENT
[start ×2 [TT] text]
Text : <laugh> <laugh> That joke was actually funny.
Duration : 2.90s
Rating : OK
[start ×3 [TTT] text]
Text : <laugh> <laugh> <laugh> That joke was actually funny.
Duration : 3.22s
Rating : BEST
[end ×1 text [T]]
Text : That joke was actually funny. <laugh>
Duration : 3.00s
Rating : OK
[end ×2 text [TT]]
Text : That joke was actually funny. <laugh> <laugh>
Duration : 3.31s
Rating : BEST*
[end ×3 text [TTT]]
Text : That joke was actually funny. <laugh> <laugh> <laugh>
Duration : 3.61s
Rating : BEST**
[both ×1 [T] text [T]]
Text : <laugh> That joke was actually funny. <laugh>
Duration : 3.41s
Rating : OK
[both ×2 [TT] text [TT]]
Text : <laugh> <laugh> That joke was actually funny. <laugh> <laugh>
Duration : 4.00s
Rating : BAD
[both ×3 [TTT] text [TTT]]
Text : <laugh> <laugh> <laugh> That joke was actually funny. <laugh> <laugh> <laugh>
Duration : 4.60s
Rating : BAD
TAG: <breath>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : I finally made it to the top of the mountain.
Duration : 2.93s
Rating : OK
[start ×1 [T] text]
Text : <breath> I finally made it to the top of the mountain.
Duration : 3.43s
Rating : BAD, SPOKE TAG
[start ×2 [TT] text]
Text : <breath> <breath> I finally made it to the top of the mountain.
Duration : 3.85s
Rating : BEST*
[start ×3 [TTT] text]
Text : <breath> <breath> <breath> I finally made it to the top of the mountain.
Duration : 4.25s
Rating : BEST**
[end ×1 text [T]]
Text : I finally made it to the top of the mountain. <breath>
Duration : 3.69s
Rating : BAD, SPOKE TAG
[end ×2 text [TT]]
Text : I finally made it to the top of the mountain. <breath> <breath>
Duration : 4.06s
Rating : BAD, SPOKE TAG TWICE
[end ×3 text [TTT]]
Text : I finally made it to the top of the mountain. <breath> <breath> <breath>
Duration : 4.43s
Rating : BAD, PERFORMED SIGH BUT SPOKE TAG TWICE
[both ×1 [T] text [T]]
Text : <breath> I finally made it to the top of the mountain. <breath>
Duration : 4.16s
Rating : BAD, SIGHED AT START BUT SPOKE TAG AT END
[both ×2 [TT] text [TT]]
Text : <breath> <breath> I finally made it to the top of the mountain. <breath> <breath>
Duration : 4.87s
Rating : BAD, SPOKE TAG TWICE AT START BUT PERFORMED SIGH OK AT END
[both ×3 [TTT] text [TTT]]
Text : <breath> <breath> <breath> I finally made it to the top of the mountain. <breath> <breath> <breath>
Duration : 5.65s
Rating : BAD, SIGHED WELL AS IN START-ONLY BUT SPOKE TAG AT END
TAG: <surprise>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : You bought this for me? No way!
Duration : 2.68s
Rating : OK
[start ×1 [T] text]
Text : <surprise> You bought this for me? No way!
Duration : 3.34s
Rating : BAD, SPOKE TAG. SLIGHT INTONATION CHANGE
[start ×2 [TT] text]
Text : <surprise> <surprise> You bought this for me? No way!
Duration : 3.90s
Rating : BAD, SPOKE TAG. SLIGHT INTONATION CHANGE
[start ×3 [TTT] text]
Text : <surprise> <surprise> <surprise> You bought this for me? No way!
Duration : 4.49s
Rating : BAD, SPOKE TAG TWICE. SLIGHT INTONATION CHANGE
[end ×1 text [T]]
Text : You bought this for me? No way! <surprise>
Duration : 3.81s
Rating : BAD, SPOKE TAG. ROBOTIC VOICE
[end ×2 text [TT]]
Text : You bought this for me? No way! <surprise> <surprise>
Duration : 4.41s
Rating : OK, BETTER INTONATION BUT SPOKE TAG TWICE
[end ×3 text [TTT]]
Text : You bought this for me? No way! <surprise> <surprise> <surprise>
Duration : 4.95s
Rating : BAD, EXAGGERATED INTONATION, SPOKE TAG TWICE
[both ×1 [T] text [T]]
Text : <surprise> You bought this for me? No way! <surprise>
Duration : 4.50s
Rating : BAD, SPOKE TAG TWICE
[both ×2 [TT] text [TT]]
Text : <surprise> <surprise> You bought this for me? No way! <surprise> <surprise>
Duration : 5.61s
Rating : BAD, SPOKE TAG TWICE
[both ×3 [TTT] text [TTT]]
Text : <surprise> <surprise> <surprise> You bought this for me? No way! <surprise> <surprise> <surprise>
Duration : 6.74s
Rating : OK, BETTER INTONATION. SPOKE TAG FOUR TIMES
TAG: <sigh>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : Another meeting added to my calendar.
Duration : 2.63s
Rating : OK
[start ×1 [T] text]
Text : <sigh> Another meeting added to my calendar.
Duration : 3.00s
Rating : INDIFFERENT
[start ×2 [TT] text]
Text : <sigh> <sigh> Another meeting added to my calendar.
Duration : 3.27s
Rating : INDIFFERENT, SAID "UHM" AS IF THINKING
[start ×3 [TTT] text]
Text : <sigh> <sigh> <sigh> Another meeting added to my calendar.
Duration : 3.58s
Rating : BAD, SPOKE TAG ONCE
[end ×1 text [T]]
Text : Another meeting added to my calendar. <sigh>
Duration : 3.33s
Rating : INDIFFERENT
[end ×2 text [TT]]
Text : Another meeting added to my calendar. <sigh> <sigh>
Duration : 3.56s
Rating : BAD, SPOKE TAG ONCE
[end ×3 text [TTT]]
Text : Another meeting added to my calendar. <sigh> <sigh> <sigh>
Duration : 3.81s
Rating : BAD, SPOKE TAG ONCE
[both ×1 [T] text [T]]
Text : <sigh> Another meeting added to my calendar. <sigh>
Duration : 3.66s
Rating : INDIFFERENT, STRANGE MURMUR
[both ×2 [TT] text [TT]]
Text : <sigh> <sigh> Another meeting added to my calendar. <sigh> <sigh>
Duration : 4.17s
Rating : INDIFFERENT, STRANGE MURMUR AT START AND END
[both ×3 [TTT] text [TTT]]
Text : <sigh> <sigh> <sigh> Another meeting added to my calendar. <sigh> <sigh> <sigh>
Duration : 4.70s
Rating : BAD, STRANGE MURMUR AT START, SPOKE TAG AT START AND END
TAG: <scream>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : Watch out, behind you!
Duration : 1.81s
Rating : OK
[start ×1 [T] text]
Text : <scream> Watch out, behind you!
Duration : 2.28s
Rating : BAD, SPOKE TAG
[start ×2 [TT] text]
Text : <scream> <scream> Watch out, behind you!
Duration : 2.74s
Rating : BAD, SPOKE TAG ONCE
[start ×3 [TTT] text]
Text : <scream> <scream> <scream> Watch out, behind you!
Duration : 3.19s
Rating : BAD, SPOKE TAG TWICE
[end ×1 text [T]]
Text : Watch out, behind you! <scream>
Duration : 2.64s
Rating : BAD, SLIGHT INTONATION CHANGE, SPOKE TAG
[end ×2 text [TT]]
Text : Watch out, behind you! <scream> <scream>
Duration : 3.11s
Rating : BAD, SLIGHT INTONATION CHANGE, SPOKE TAG ONCE
[end ×3 text [TTT]]
Text : Watch out, behind you! <scream> <scream> <scream>
Duration : 3.57s
Rating : BAD, OK INTONATION, SPOKE TAG TWICE
[both ×1 [T] text [T]]
Text : <scream> Watch out, behind you! <scream>
Duration : 3.19s
Rating : BAD, BAD INTONATION. SPOKE TAG AT START
[both ×2 [TT] text [TT]]
Text : <scream> <scream> Watch out, behind you! <scream> <scream>
Duration : 4.10s
Rating : BAD, BAD INTONATION. SPOKE TAG TWICE AT START AND ONCE AT END
[both ×3 [TTT] text [TTT]]
Text : <scream> <scream> <scream> Watch out, behind you! <scream> <scream> <scream>
Duration : 4.97s
Rating : BAD, BAD INTONATION, SPOKE TAG MULTIPLE TIMES, ROBOTIC VOICE
TAG: <throatclear>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : Before we begin, I have something to say.
Duration : 2.83s
Rating : OK
[start ×1 [T] text]
Text : <throatclear> Before we begin, I have something to say.
Duration : 3.74s
Rating : BAD, SPOKE TAG
[start ×2 [TT] text]
Text : <throatclear> <throatclear> Before we begin, I have something to say.
Duration : 4.48s
Rating : BAD, SPOKE TAG TWICE
[start ×3 [TTT] text]
Text : <throatclear> <throatclear> <throatclear> Before we begin, I have something to say.
Duration : 5.18s
Rating : BAD, SPOKE TAG THREE TIMES
[end ×1 text [T]]
Text : Before we begin, I have something to say. <throatclear>
Duration : 3.97s
Rating : BAD, SPOKE TAG
[end ×2 text [TT]]
Text : Before we begin, I have something to say. <throatclear> <throatclear>
Duration : 4.63s
Rating : BAD, SPOKE TAG TWICE
[end ×3 text [TTT]]
Text : Before we begin, I have something to say. <throatclear> <throatclear> <throatclear>
Duration : 5.29s
Rating : BAD, SPOKE TAG THREE TIMES
[both ×1 [T] text [T]]
Text : <throatclear> Before we begin, I have something to say. <throatclear>
Duration : 4.77s
Rating : BAD, SPOKE TAG TWICE
[both ×2 [TT] text [TT]]
Text : <throatclear> <throatclear> Before we begin, I have something to say. <throatclear> <throatclear>
Duration : 6.08s
Rating : BAD, SPOKE TAG FOUR TIMES
[both ×3 [TTT] text [TTT]]
Text : <throatclear> <throatclear> <throatclear> Before we begin, I have something to say. <throatclear> <throatclear> <throatclear>
Duration : 7.47s
Rating : BAD, SPOKE TAG SIX TIMES
TAG: <sad>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : I never thought it would end like this.
Duration : 2.63s
Rating : OK
[start ×1 [T] text]
Text : <sad> I never thought it would end like this.
Duration : 2.95s
Rating : INDIFFERENT
[start ×2 [TT] text]
Text : <sad> <sad> I never thought it would end like this.
Duration : 3.19s
Rating : OK, STRANGE MOAN
[start ×3 [TTT] text]
Text : <sad> <sad> <sad> I never thought it would end like this.
Duration : 3.46s
Rating : OK, ROBOTIC EMOTIONAL SOUND
[end ×1 text [T]]
Text : I never thought it would end like this. <sad>
Duration : 3.18s
Rating : INDIFFERENT
[end ×2 text [TT]]
Text : I never thought it would end like this. <sad> <sad>
Duration : 3.39s
Rating : OK, STRANGE MOAN
[end ×3 text [TTT]]
Text : I never thought it would end like this. <sad> <sad> <sad>
Duration : 3.62s
Rating : BAD, SPOKE TAG ONCE
[both ×1 [T] text [T]]
Text : <sad> I never thought it would end like this. <sad>
Duration : 3.47s
Rating : BEST*, SLIGHTLY ROBOTIC
[both ×2 [TT] text [TT]]
Text : <sad> <sad> I never thought it would end like this. <sad> <sad>
Duration : 3.93s
Rating : BEST**, ALMOST SPOKE TAG AT END
[both ×3 [TTT] text [TTT]]
Text : <sad> <sad> <sad> I never thought it would end like this. <sad> <sad> <sad>
Duration : 4.43s
Rating : BEST***, EMOTIONAL, NO SPOKEN TAGS, NO ROBOTIC VOICE
TAG: <angry>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : You crossed the line this time.
Duration : 2.21s
Rating : OK
[start ×1 [T] text]
Text : <angry> You crossed the line this time.
Duration : 2.62s
Rating : BAD, SPOKE TAG
[start ×2 [TT] text]
Text : <angry> <angry> You crossed the line this time.
Duration : 2.94s
Rating : BAD, SPOKE TAG ONCE
[start ×3 [TTT] text]
Text : <angry> <angry> <angry> You crossed the line this time.
Duration : 3.26s
Rating : BAD, SPOKE TAG TWICE
[end ×1 text [T]]
Text : You crossed the line this time. <angry>
Duration : 2.87s
Rating : BAD, SPOKE TAG
[end ×2 text [TT]]
Text : You crossed the line this time. <angry> <angry>
Duration : 3.16s
Rating : BAD, SPOKE TAG TWICE
[end ×3 text [TTT]]
Text : You crossed the line this time. <angry> <angry> <angry>
Duration : 3.47s
Rating : BAD, SPOKE TAG TWICE
[both ×1 [T] text [T]]
Text : <angry> You crossed the line this time. <angry>
Duration : 3.22s
Rating : BAD, SPOKE TAG TWICE
[both ×2 [TT] text [TT]]
Text : <angry> <angry> You crossed the line this time. <angry> <angry>
Duration : 3.84s
Rating : BAD, SPOKE TAG THREE TIMES
[both ×3 [TTT] text [TTT]]
Text : <angry> <angry> <angry> You crossed the line this time. <angry> <angry> <angry>
Duration : 4.46s
Rating : BAD, SPOKE TAG FIVE TIMES.
TAG: <cough>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : This room is so dusty.
Duration : 1.88s
Rating : OK
[start ×1 [T] text]
Text : <cough> This room is so dusty.
Duration : 2.23s
Rating : BAD, SPOKE TAG
[start ×2 [TT] text]
Text : <cough> <cough> This room is so dusty.
Duration : 2.57s
Rating : BAD, SPOKE TAG ONCE
[start ×3 [TTT] text]
Text : <cough> <cough> <cough> This room is so dusty.
Duration : 2.91s
Rating : BEST, NOT VERY REALISTIC
[end ×1 text [T]]
Text : This room is so dusty. <cough>
Duration : 2.67s
Rating : INDIFFERENT
[end ×2 text [TT]]
Text : This room is so dusty. <cough> <cough>
Duration : 2.99s
Rating : OK, SOUNDS LIKE A LAUGH
[end ×3 text [TTT]]
Text : This room is so dusty. <cough> <cough> <cough>
Duration : 3.28s
Rating : OK, ROBOTIC
[both ×1 [T] text [T]]
Text : <cough> This room is so dusty. <cough>
Duration : 3.05s
Rating : BAD, SPOKE TAG
[both ×2 [TT] text [TT]]
Text : <cough> <cough> This room is so dusty. <cough> <cough>
Duration : 3.64s
Rating : OK, PERFORMED COUGH AT START, STRANGE AT END
[both ×3 [TTT] text [TTT]]
Text : <cough> <cough> <cough> This room is so dusty. <cough> <cough> <cough>
Duration : 4.25s
Rating : BAD, SPOKE TAG AT START, OK COUGH AT END
TAG: <yawn>
────────────────────────────────────────────────────────────────────────
[control (no tag)]
Text : I have been awake since four in the morning.
Duration : 2.89s
Rating : OK
[start ×1 [T] text]
Text : <yawn> I have been awake since four in the morning.
Duration : 3.26s
Rating : BAD, SPOKE TAG
[start ×2 [TT] text]
Text : <yawn> <yawn> I have been awake since four in the morning.
Duration : 3.55s
Rating : BAD, SPOKE TAG ONCE
[start ×3 [TTT] text]
Text : <yawn> <yawn> <yawn> I have been awake since four in the morning.
Duration : 3.86s
Rating : BAD, SPOKE TAG TWICE
[end ×1 text [T]]
Text : I have been awake since four in the morning. <yawn>
Duration : 3.63s
Rating : BAD, SPOKE TAG
[end ×2 text [TT]]
Text : I have been awake since four in the morning. <yawn> <yawn>
Duration : 3.88s
Rating : BAD, SPOKE TAG TWICE
[end ×3 text [TTT]]
Text : I have been awake since four in the morning. <yawn> <yawn> <yawn>
Duration : 4.16s
Rating : BAD, SPOKE TAG TWICE
[both ×1 [T] text [T]]
Text : <yawn> I have been awake since four in the morning. <yawn>
Duration : 3.96s
Rating : BAD, SPOKE TAG
[both ×2 [TT] text [TT]]
Text : <yawn> <yawn> I have been awake since four in the morning. <yawn> <yawn>
Duration : 4.47s
Rating : BAD, SPOKE TAG ONCE AT START AND AT END
[both ×3 [TTT] text [TTT]]
Text : <yawn> <yawn> <yawn> I have been awake since four in the morning. <yawn> <yawn> <yawn>
Duration : 5.00s
Rating : BAD, SPOKE TAG TWICE AT START AND ONCE AT END
────────────────────────────────────────────────────────────────────────
END OF REPORT
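A report in this layout is easy to post-process, for example to see how much duration each tagged variant adds over the untagged control. The sketch below assumes each record is a bracketed case line followed by Text, Duration, and Rating lines, as in the report; `parse_report` is a hypothetical helper and the sample reuses two `<laugh>` records from above.

```python
import re

# One record: "[case]" line, then Text / Duration / Rating lines.
RECORD = re.compile(
    r"\[(?P<case>.+)\]\s*\n"
    r"Text\s*:\s*(?P<text>.*)\n"
    r"Duration\s*:\s*(?P<duration>[\d.]+)s\s*\n"
    r"Rating\s*:\s*(?P<rating>.*)"
)


def parse_report(report):
    """Return one dict per test case found in the report text."""
    return [
        {
            "case": m["case"],
            "text": m["text"].strip(),
            "duration": float(m["duration"]),
            "rating": m["rating"].strip(),
        }
        for m in RECORD.finditer(report)
    ]


sample = """\
[control (no tag)]
Text : That joke was actually funny.
Duration : 2.17s
Rating : OK
[end ×3 text [TTT]]
Text : That joke was actually funny. <laugh> <laugh> <laugh>
Duration : 3.61s
Rating : BEST**
"""

rows = parse_report(sample)
control = rows[0]["duration"]
for row in rows:
    # Extra seconds relative to the untagged control sentence.
    print(f"{row['case']}: +{row['duration'] - control:.2f}s, {row['rating']}")
```

Added duration alone does not say whether the tag was performed or read aloud, so it only complements the listening ratings.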
Wow, thank you for such a detailed analysis, @gabrielmaneschy.
Honestly, we haven't tested the tag behavior this systematically ourselves, so this is extremely helpful. The patterns you found are very practical, and I think they will also be useful for other users who want to experiment with the current version.
Your results make it clearer where the model is treating the tag as a non-verbal cue, where it ignores it, and where it simply reads the tag as text. That gives us much better test cases than just knowing that "tags sometimes do not work."
For future updates, we'll try to make this behavior more stable by cleaning up the different non-verbal token patterns in the data and collecting / generating more consistent examples for these tags.
Thanks again for taking the time to test this so carefully and share the results. This is genuinely valuable feedback for us.