emotion tags not working

#6
by Thekrishna01 - opened

I tried adding the emotion tags described in the repo, and I also added them as tags, but it still did not work: the model just ignored them and synthesized the speech normally. Does anyone know why this happens, or is it my mistake?

This problem is happening to me too! I thought it was only me at first.

Same here, brother. I tried passing the tag in several different ways (laugh, laughs, laughing inside <>), but none of them worked.

Supertone org

Thanks for reporting this, and sorry for the confusion.

You are right that the current expression tag support is still limited. We added tags such as <laugh>, <breath>, and <sigh> because many users asked for non-verbal and expressive sounds after the previous release, and we did include tagged examples in the training data. However, the tagged data is currently available mostly for Korean, English, and Japanese, and even within those languages the consistency of the labels is not yet as strong as we would like.

Because of that, the model may sometimes ignore the tag or simply synthesize the surrounding text normally. This is a known limitation of the current Supertonic 3 release, and we should have communicated it more clearly. Sorry about that.

A few things may help in the current version:

  • Tags tend to work better in Korean, English, and Japanese.
  • Tags often work better when placed at the beginning or end of the sentence.
  • If a single tag is ignored, repeating it two or three times can sometimes make the expression more likely to appear.

For example, you may get better results with inputs like:

<laugh> <laugh> That was not what I expected. <laugh> <laugh>
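
As a rough illustration of the tips above, the repeated-tag input can be built with a small helper. This is only a sketch: `with_tags` and its parameters are hypothetical names, not part of the Supertonic API, and it only prepares the text that would be passed to the model.

```python
def with_tags(text: str, tag: str, repeat: int = 2, position: str = "both") -> str:
    """Wrap text with a repeated expression tag such as <laugh>.

    position is "start", "end", or "both". All names are illustrative.
    """
    if position not in {"start", "end", "both"}:
        raise ValueError(f"unknown position: {position!r}")
    if repeat < 1:
        return text
    # Tags are separated by single spaces, as in the example input above.
    run = " ".join([f"<{tag}>"] * repeat)
    prefix = f"{run} " if position in {"start", "both"} else ""
    suffix = f" {run}" if position in {"end", "both"} else ""
    return f"{prefix}{text}{suffix}"


print(with_tags("That was not what I expected.", "laugh"))
# <laugh> <laugh> That was not what I expected. <laugh> <laugh>
```

Because tags may still be ignored, it is probably worth trying a few values of `repeat` and `position` per sentence rather than fixing one.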

We are planning to improve this with additional synthetic data and supervised fine-tuning, so expression tags should become more reliable in future updates.

Thanks again for trying the model and pointing this out. We appreciate the feedback, and we'll share updates when this improves.

I ran several tests with the tags, sentence by sentence, and summarized each one. My conclusions were:

<laugh> - Best as a single triple sequence at the end. "That joke was actually funny. <laugh> <laugh> <laugh>"
<breath> - Best as a single triple sequence at the beginning. "<breath> <breath> <breath> I finally made it to the top of the mountain."
<surprise> - Best as a triple combination. "<surprise> <surprise> <surprise> You bought this for me? No way! <surprise> <surprise> <surprise>"
<sigh> - Best as a double combination. "<sigh> <sigh> Another meeting added to my calendar. <sigh> <sigh>"
<scream> - Spoke the tag in all situations.
<throatclear> - Spoke the tag in all situations.
In the second test, tested in the middle of the sentence with "Before we begin, <throatclear> I have something to say.", the model also spoke the tag in all situations. 
Tested with one, two, and three tags:
"Before we begin, <throatclear> I have something to tell you. Due to recent events <throatclear>... sorry."
<sad> - Best as a triple combination. "<sad> <sad> <sad> I never thought it would end like this. <sad> <sad> <sad>"
<angry> - Missed tag in all situations.
<cough> - Best as a single triple sequence at the beginning. "<cough> <cough> <cough> This room is so dusty."
<yawn> - Missed tag in all situations. Test in the middle of the sentence.
As with <throatclear>, the model spoke the tag when <yawn> was placed in the middle of the sentence.
However, in most situations the tag was suppressed and had no audible effect on the sentence.
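
For anyone who wants to repeat this, the test grid (one control plus start, end, and both placements at one, two, and three tags) can be generated with a sketch like the one below. The function name `variants` and the label format are my own choices for illustration; the synthesis call itself is left out because it depends on how you run the model.

```python
from itertools import product

POSITIONS = ("start", "end", "both")
REPEATS = (1, 2, 3)


def variants(text, tag):
    """Return (label, input_text) pairs: one control plus nine tagged variants."""
    out = [("control", text)]
    for position, n in product(POSITIONS, REPEATS):
        run = " ".join([f"<{tag}>"] * n)
        if position == "start":
            tagged = f"{run} {text}"
        elif position == "end":
            tagged = f"{text} {run}"
        else:  # "both": the same run before and after the sentence
            tagged = f"{run} {text} {run}"
        out.append((f"{position} x{n}", tagged))
    return out


grid = variants("That joke was actually funny.", "laugh")
for label, line in grid:
    print(f"{label:10s} {line}")
```

Ten variants per sentence across ten tags gives the 100 sentences listed in the report.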

Claude said:
Hypothesis on the "speaking the tag" mechanism
The "speaking N-1 tags" pattern (×3 speaks two, ×2 speaks one) suggests that the model consumes one tag as a prosodic signal but treats the excess tags as text. Position matters because at the beginning of the sentence the model is still establishing the context: without prior text, some tags lack semantic anchoring and are vocalized. At the end, the same problem occurs in reverse: the text flow has already finished, so the tag is left hanging.
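
The N-1 hypothesis can be checked against the numbers with a few lines of Python. The counts below are transcribed by hand from the start-only ×2 and ×3 rows of the report in this post, for four tags that were spoken aloud; `predicted_spoken` is just a name for the hypothesis, not anything the model exposes.

```python
# Spoken-tag counts from the report (start-only placement),
# keyed by (tag, number of tags in the input).
observed = {
    ("scream", 2): 1, ("scream", 3): 2,
    ("angry", 2): 1, ("angry", 3): 2,
    ("yawn", 2): 1, ("yawn", 3): 2,
    ("throatclear", 2): 2, ("throatclear", 3): 3,
}


def predicted_spoken(n_tags: int) -> int:
    """N-1 hypothesis: one tag is consumed as a cue, the rest are read aloud."""
    return max(n_tags - 1, 0)


# Keep only the rows where the hypothesis disagrees with what was heard.
mismatches = {k: v for k, v in observed.items() if predicted_spoken(k[1]) != v}
print(mismatches)
# {('throatclear', 2): 2, ('throatclear', 3): 3}
```

So the pattern fits <scream>, <angry>, and <yawn> in these rows, while <throatclear> is read aloud every time, which suggests it is not being treated as a tag at all.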

The test results follow:

════════════════════════════════════════════════════════════════════════
REPORT - EXPRESSION TAGS SUPERTONIC 3
Generated : 2026-05-16 11:27:20
Voice     : F1
Total     : 100 sentences
════════════════════════════════════════════════════════════════════════

TAG: <laugh>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : That joke was actually funny.
  Duration  : 2.17s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <laugh> That joke was actually funny.
  Duration  : 2.58s
  Rating    : INDIFFERENT

  [start   ×2 [TT] text]
  Text      : <laugh> <laugh> That joke was actually funny.
  Duration  : 2.90s
  Rating    : OK

  [start   ×3 [TTT] text]
  Text      : <laugh> <laugh> <laugh> That joke was actually funny.
  Duration  : 3.22s
  Rating    : BEST

  [end     ×1 text [T]]
  Text      : That joke was actually funny. <laugh>
  Duration  : 3.00s
  Rating    : OK

  [end     ×2 text [TT]]
  Text      : That joke was actually funny. <laugh> <laugh>
  Duration  : 3.31s
  Rating    : BEST*

  [end     ×3 text [TTT]]
  Text      : That joke was actually funny. <laugh> <laugh> <laugh>
  Duration  : 3.61s
  Rating    : BEST**

  [both    ×1 [T] text [T]]
  Text      : <laugh> That joke was actually funny. <laugh>
  Duration  : 3.41s
  Rating    : OK

  [both    ×2 [TT] text [TT]]
  Text      : <laugh> <laugh> That joke was actually funny. <laugh> <laugh>
  Duration  : 4.00s
  Rating    : BAD

  [both    ×3 [TTT] text [TTT]]
  Text      : <laugh> <laugh> <laugh> That joke was actually funny. <laugh> <laugh> <laugh>
  Duration  : 4.60s
  Rating    : BAD


TAG: <breath>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : I finally made it to the top of the mountain.
  Duration  : 2.93s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <breath> I finally made it to the top of the mountain.
  Duration  : 3.43s
  Rating    : BAD, SPOKE TAG

  [start   ×2 [TT] text]
  Text      : <breath> <breath> I finally made it to the top of the mountain.
  Duration  : 3.85s
  Rating    : BEST*

  [start   ×3 [TTT] text]
  Text      : <breath> <breath> <breath> I finally made it to the top of the mountain.
  Duration  : 4.25s
  Rating    : BEST**

  [end     ×1 text [T]]
  Text      : I finally made it to the top of the mountain. <breath>
  Duration  : 3.69s
  Rating    : BAD, SPOKE TAG

  [end     ×2 text [TT]]
  Text      : I finally made it to the top of the mountain. <breath> <breath>
  Duration  : 4.06s
  Rating    : BAD, SPOKE TAG TWICE

  [end     ×3 text [TTT]]
  Text      : I finally made it to the top of the mountain. <breath> <breath> <breath>
  Duration  : 4.43s
  Rating    : BAD, PERFORMED SIGH BUT SPOKE TAG TWICE

  [both    ×1 [T] text [T]]
  Text      : <breath> I finally made it to the top of the mountain. <breath>
  Duration  : 4.16s
  Rating    : BAD, SIGHED AT START BUT SPOKE TAG AT END

  [both    ×2 [TT] text [TT]]
  Text      : <breath> <breath> I finally made it to the top of the mountain. <breath> <breath>
  Duration  : 4.87s
  Rating    : BAD, SPOKE TAG TWICE AT START BUT PERFORMED SIGH OK AT END

  [both    ×3 [TTT] text [TTT]]
  Text      : <breath> <breath> <breath> I finally made it to the top of the mountain. <breath> <breath> <breath>
  Duration  : 5.65s
  Rating    : BAD, SIGHED WELL AS IN START-ONLY BUT SPOKE TAG AT END


TAG: <surprise>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : You bought this for me? No way!
  Duration  : 2.68s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <surprise> You bought this for me? No way!
  Duration  : 3.34s
  Rating    : BAD, SPOKE TAG. SLIGHT INTONATION CHANGE

  [start   ×2 [TT] text]
  Text      : <surprise> <surprise> You bought this for me? No way!
  Duration  : 3.90s
  Rating    : BAD, SPOKE TAG. SLIGHT INTONATION CHANGE

  [start   ×3 [TTT] text]
  Text      : <surprise> <surprise> <surprise> You bought this for me? No way!
  Duration  : 4.49s
  Rating    : BAD, SPOKE TAG TWICE. SLIGHT INTONATION CHANGE

  [end     ×1 text [T]]
  Text      : You bought this for me? No way! <surprise>
  Duration  : 3.81s
  Rating    : BAD, SPOKE TAG. ROBOTIC VOICE

  [end     ×2 text [TT]]
  Text      : You bought this for me? No way! <surprise> <surprise>
  Duration  : 4.41s
  Rating    : OK, BETTER INTONATION BUT SPOKE TAG TWICE

  [end     ×3 text [TTT]]
  Text      : You bought this for me? No way! <surprise> <surprise> <surprise>
  Duration  : 4.95s
  Rating    : BAD, EXAGGERATED INTONATION, SPOKE TAG TWICE

  [both    ×1 [T] text [T]]
  Text      : <surprise> You bought this for me? No way! <surprise>
  Duration  : 4.50s
  Rating    : BAD, SPOKE TAG TWICE

  [both    ×2 [TT] text [TT]]
  Text      : <surprise> <surprise> You bought this for me? No way! <surprise> <surprise>
  Duration  : 5.61s
  Rating    : BAD, SPOKE TAG TWICE

  [both    ×3 [TTT] text [TTT]]
  Text      : <surprise> <surprise> <surprise> You bought this for me? No way! <surprise> <surprise> <surprise>
  Duration  : 6.74s
  Rating    : OK, BETTER INTONATION. SPOKE TAG FOUR TIMES


TAG: <sigh>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : Another meeting added to my calendar.
  Duration  : 2.63s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <sigh> Another meeting added to my calendar.
  Duration  : 3.00s
  Rating    : INDIFFERENT

  [start   ×2 [TT] text]
  Text      : <sigh> <sigh> Another meeting added to my calendar.
  Duration  : 3.27s
  Rating    : INDIFFERENT, SAID "UHM" AS IF THINKING

  [start   ×3 [TTT] text]
  Text      : <sigh> <sigh> <sigh> Another meeting added to my calendar.
  Duration  : 3.58s
  Rating    : BAD, SPOKE TAG ONCE

  [end     ×1 text [T]]
  Text      : Another meeting added to my calendar. <sigh>
  Duration  : 3.33s
  Rating    : INDIFFERENT

  [end     ×2 text [TT]]
  Text      : Another meeting added to my calendar. <sigh> <sigh>
  Duration  : 3.56s
  Rating    : BAD, SPOKE TAG ONCE

  [end     ×3 text [TTT]]
  Text      : Another meeting added to my calendar. <sigh> <sigh> <sigh>
  Duration  : 3.81s
  Rating    : BAD, SPOKE TAG ONCE

  [both    ×1 [T] text [T]]
  Text      : <sigh> Another meeting added to my calendar. <sigh>
  Duration  : 3.66s
  Rating    : INDIFFERENT, STRANGE MURMUR

  [both    ×2 [TT] text [TT]]
  Text      : <sigh> <sigh> Another meeting added to my calendar. <sigh> <sigh>
  Duration  : 4.17s
  Rating    : INDIFFERENT, STRANGE MURMUR AT START AND END

  [both    ×3 [TTT] text [TTT]]
  Text      : <sigh> <sigh> <sigh> Another meeting added to my calendar. <sigh> <sigh> <sigh>
  Duration  : 4.70s
  Rating    : BAD, STRANGE MURMUR AT START, SPOKE TAG AT START AND END


TAG: <scream>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : Watch out, behind you!
  Duration  : 1.81s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <scream> Watch out, behind you!
  Duration  : 2.28s
  Rating    : BAD, SPOKE TAG

  [start   ×2 [TT] text]
  Text      : <scream> <scream> Watch out, behind you!
  Duration  : 2.74s
  Rating    : BAD, SPOKE TAG ONCE

  [start   ×3 [TTT] text]
  Text      : <scream> <scream> <scream> Watch out, behind you!
  Duration  : 3.19s
  Rating    : BAD, SPOKE TAG TWICE

  [end     ×1 text [T]]
  Text      : Watch out, behind you! <scream>
  Duration  : 2.64s
  Rating    : BAD, SLIGHT INTONATION CHANGE, SPOKE TAG

  [end     ×2 text [TT]]
  Text      : Watch out, behind you! <scream> <scream>
  Duration  : 3.11s
  Rating    : BAD, SLIGHT INTONATION CHANGE, SPOKE TAG ONCE

  [end     ×3 text [TTT]]
  Text      : Watch out, behind you! <scream> <scream> <scream>
  Duration  : 3.57s
  Rating    : BAD, OK INTONATION, SPOKE TAG TWICE

  [both    ×1 [T] text [T]]
  Text      : <scream> Watch out, behind you! <scream>
  Duration  : 3.19s
  Rating    : BAD, BAD INTONATION. SPOKE TAG AT START

  [both    ×2 [TT] text [TT]]
  Text      : <scream> <scream> Watch out, behind you! <scream> <scream>
  Duration  : 4.10s
  Rating    : BAD, BAD INTONATION. SPOKE TAG TWICE AT START AND ONCE AT END

  [both    ×3 [TTT] text [TTT]]
  Text      : <scream> <scream> <scream> Watch out, behind you! <scream> <scream> <scream>
  Duration  : 4.97s
  Rating    : BAD, BAD INTONATION, SPOKE TAG MULTIPLE TIMES, ROBOTIC VOICE


TAG: <throatclear>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : Before we begin, I have something to say.
  Duration  : 2.83s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <throatclear> Before we begin, I have something to say.
  Duration  : 3.74s
  Rating    : BAD, SPOKE TAG

  [start   ×2 [TT] text]
  Text      : <throatclear> <throatclear> Before we begin, I have something to say.
  Duration  : 4.48s
  Rating    : BAD, SPOKE TAG TWICE

  [start   ×3 [TTT] text]
  Text      : <throatclear> <throatclear> <throatclear> Before we begin, I have something to say.
  Duration  : 5.18s
  Rating    : BAD, SPOKE TAG THREE TIMES

  [end     ×1 text [T]]
  Text      : Before we begin, I have something to say. <throatclear>
  Duration  : 3.97s
  Rating    : BAD, SPOKE TAG

  [end     ×2 text [TT]]
  Text      : Before we begin, I have something to say. <throatclear> <throatclear>
  Duration  : 4.63s
  Rating    : BAD, SPOKE TAG TWICE

  [end     ×3 text [TTT]]
  Text      : Before we begin, I have something to say. <throatclear> <throatclear> <throatclear>
  Duration  : 5.29s
  Rating    : BAD, SPOKE TAG THREE TIMES

  [both    ×1 [T] text [T]]
  Text      : <throatclear> Before we begin, I have something to say. <throatclear>
  Duration  : 4.77s
  Rating    : BAD, SPOKE TAG TWICE

  [both    ×2 [TT] text [TT]]
  Text      : <throatclear> <throatclear> Before we begin, I have something to say. <throatclear> <throatclear>
  Duration  : 6.08s
  Rating    : BAD, SPOKE TAG FOUR TIMES

  [both    ×3 [TTT] text [TTT]]
  Text      : <throatclear> <throatclear> <throatclear> Before we begin, I have something to say. <throatclear> <throatclear> <throatclear>
  Duration  : 7.47s
  Rating    : BAD, SPOKE TAG SIX TIMES


TAG: <sad>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : I never thought it would end like this.
  Duration  : 2.63s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <sad> I never thought it would end like this.
  Duration  : 2.95s
  Rating    : INDIFFERENT

  [start   ×2 [TT] text]
  Text      : <sad> <sad> I never thought it would end like this.
  Duration  : 3.19s
  Rating    : OK, STRANGE MOAN

  [start   ×3 [TTT] text]
  Text      : <sad> <sad> <sad> I never thought it would end like this.
  Duration  : 3.46s
  Rating    : OK, ROBOTIC EMOTIONAL SOUND

  [end     ×1 text [T]]
  Text      : I never thought it would end like this. <sad>
  Duration  : 3.18s
  Rating    : INDIFFERENT

  [end     ×2 text [TT]]
  Text      : I never thought it would end like this. <sad> <sad>
  Duration  : 3.39s
  Rating    : OK, STRANGE MOAN

  [end     ×3 text [TTT]]
  Text      : I never thought it would end like this. <sad> <sad> <sad>
  Duration  : 3.62s
  Rating    : BAD, SPOKE TAG ONCE

  [both    ×1 [T] text [T]]
  Text      : <sad> I never thought it would end like this. <sad>
  Duration  : 3.47s
  Rating    : BEST*, SLIGHTLY ROBOTIC

  [both    ×2 [TT] text [TT]]
  Text      : <sad> <sad> I never thought it would end like this. <sad> <sad>
  Duration  : 3.93s
  Rating    : BEST**, ALMOST SPOKE TAG AT END

  [both    ×3 [TTT] text [TTT]]
  Text      : <sad> <sad> <sad> I never thought it would end like this. <sad> <sad> <sad>
  Duration  : 4.43s
  Rating    : BEST***, EMOTIONAL, NO SPOKEN TAGS, NO ROBOTIC VOICE


TAG: <angry>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : You crossed the line this time.
  Duration  : 2.21s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <angry> You crossed the line this time.
  Duration  : 2.62s
  Rating    : BAD, SPOKE TAG

  [start   ×2 [TT] text]
  Text      : <angry> <angry> You crossed the line this time.
  Duration  : 2.94s
  Rating    : BAD, SPOKE TAG ONCE

  [start   ×3 [TTT] text]
  Text      : <angry> <angry> <angry> You crossed the line this time.
  Duration  : 3.26s
  Rating    : BAD, SPOKE TAG TWICE

  [end     ×1 text [T]]
  Text      : You crossed the line this time. <angry>
  Duration  : 2.87s
  Rating    : BAD, SPOKE TAG

  [end     ×2 text [TT]]
  Text      : You crossed the line this time. <angry> <angry>
  Duration  : 3.16s
  Rating    : BAD, SPOKE TAG TWICE

  [end     ×3 text [TTT]]
  Text      : You crossed the line this time. <angry> <angry> <angry>
  Duration  : 3.47s
  Rating    : BAD, SPOKE TAG TWICE

  [both    ×1 [T] text [T]]
  Text      : <angry> You crossed the line this time. <angry>
  Duration  : 3.22s
  Rating    : BAD, SPOKE TAG TWICE

  [both    ×2 [TT] text [TT]]
  Text      : <angry> <angry> You crossed the line this time. <angry> <angry>
  Duration  : 3.84s
  Rating    : BAD, SPOKE TAG THREE TIMES

  [both    ×3 [TTT] text [TTT]]
  Text      : <angry> <angry> <angry> You crossed the line this time. <angry> <angry> <angry>
  Duration  : 4.46s
  Rating    : BAD, SPOKE TAG FIVE TIMES.


TAG: <cough>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : This room is so dusty.
  Duration  : 1.88s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <cough> This room is so dusty.
  Duration  : 2.23s
  Rating    : BAD, SPOKE TAG

  [start   ×2 [TT] text]
  Text      : <cough> <cough> This room is so dusty.
  Duration  : 2.57s
  Rating    : BAD, SPOKE TAG ONCE

  [start   ×3 [TTT] text]
  Text      : <cough> <cough> <cough> This room is so dusty.
  Duration  : 2.91s
  Rating    : BEST, NOT VERY REALISTIC

  [end     ×1 text [T]]
  Text      : This room is so dusty. <cough>
  Duration  : 2.67s
  Rating    : INDIFFERENT

  [end     ×2 text [TT]]
  Text      : This room is so dusty. <cough> <cough>
  Duration  : 2.99s
  Rating    : OK, SOUNDS LIKE A LAUGH

  [end     ×3 text [TTT]]
  Text      : This room is so dusty. <cough> <cough> <cough>
  Duration  : 3.28s
  Rating    : OK, ROBOTIC

  [both    ×1 [T] text [T]]
  Text      : <cough> This room is so dusty. <cough>
  Duration  : 3.05s
  Rating    : BAD, SPOKE TAG

  [both    ×2 [TT] text [TT]]
  Text      : <cough> <cough> This room is so dusty. <cough> <cough>
  Duration  : 3.64s
  Rating    : OK, PERFORMED COUGH AT START, STRANGE AT END

  [both    ×3 [TTT] text [TTT]]
  Text      : <cough> <cough> <cough> This room is so dusty. <cough> <cough> <cough>
  Duration  : 4.25s
  Rating    : BAD, SPOKE TAG AT START, OK COUGH AT END


TAG: <yawn>
────────────────────────────────────────────────────────────────────────
  [control   (no tag)]
  Text      : I have been awake since four in the morning.
  Duration  : 2.89s
  Rating    : OK

  [start   ×1 [T] text]
  Text      : <yawn> I have been awake since four in the morning.
  Duration  : 3.26s
  Rating    : BAD, SPOKE TAG

  [start   ×2 [TT] text]
  Text      : <yawn> <yawn> I have been awake since four in the morning.
  Duration  : 3.55s
  Rating    : BAD, SPOKE TAG ONCE

  [start   ×3 [TTT] text]
  Text      : <yawn> <yawn> <yawn> I have been awake since four in the morning.
  Duration  : 3.86s
  Rating    : BAD, SPOKE TAG TWICE

  [end     ×1 text [T]]
  Text      : I have been awake since four in the morning. <yawn>
  Duration  : 3.63s
  Rating    : BAD, SPOKE TAG

  [end     ×2 text [TT]]
  Text      : I have been awake since four in the morning. <yawn> <yawn>
  Duration  : 3.88s
  Rating    : BAD, SPOKE TAG TWICE

  [end     ×3 text [TTT]]
  Text      : I have been awake since four in the morning. <yawn> <yawn> <yawn>
  Duration  : 4.16s
  Rating    : BAD, SPOKE TAG TWICE

  [both    ×1 [T] text [T]]
  Text      : <yawn> I have been awake since four in the morning. <yawn>
  Duration  : 3.96s
  Rating    : BAD, SPOKE TAG

  [both    ×2 [TT] text [TT]]
  Text      : <yawn> <yawn> I have been awake since four in the morning. <yawn> <yawn>
  Duration  : 4.47s
  Rating    : BAD, SPOKE TAG ONCE AT START AND AT END

  [both    ×3 [TTT] text [TTT]]
  Text      : <yawn> <yawn> <yawn> I have been awake since four in the morning. <yawn> <yawn> <yawn>
  Duration  : 5.00s
  Rating    : BAD, SPOKE TAG TWICE AT START AND ONCE AT END

════════════════════════════════════════════════════════════════════════
END OF REPORT
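
A report in this layout is straightforward to tally by machine. The sketch below is not the script that produced the report; it is a small, assumed parser that counts the leading verdict word (BEST, OK, INDIFFERENT, BAD) per tag, shown on a three-entry excerpt with the labels written as x1/x3.

```python
import re
from collections import Counter, defaultdict

SAMPLE = """\
TAG: <laugh>
  [control   (no tag)]
  Text      : That joke was actually funny.
  Duration  : 2.17s
  Rating    : OK

  [end     x3 text [TTT]]
  Text      : That joke was actually funny. <laugh> <laugh> <laugh>
  Duration  : 3.61s
  Rating    : BEST**

TAG: <angry>
  [start   x1 [T] text]
  Text      : <angry> You crossed the line this time.
  Duration  : 2.62s
  Rating    : BAD, SPOKE TAG
"""


def tally(report: str) -> dict:
    """Count the leading verdict word of each Rating line, grouped by tag."""
    counts = defaultdict(Counter)
    tag = None
    for line in report.splitlines():
        line = line.strip()
        if m := re.match(r"TAG: <(\w+)>", line):
            tag = m.group(1)  # every following Rating belongs to this tag
        elif (m := re.match(r"Rating\s*:\s*([A-Z]+)", line)) and tag:
            counts[tag][m.group(1)] += 1  # "BEST**" counts as "BEST"
    return {t: dict(c) for t, c in counts.items()}


print(tally(SAMPLE))
# {'laugh': {'OK': 1, 'BEST': 1}, 'angry': {'BAD': 1}}
```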
Supertone org

Wow, thank you for such a detailed analysis, @gabrielmaneschy.

Honestly, we haven't tested the tag behavior this systematically ourselves, so this is extremely helpful. The patterns you found are very practical, and I think they will also be useful for other users who want to experiment with the current version.

Your results make it clearer where the model is treating the tag as a non-verbal cue, where it ignores it, and where it simply reads the tag as text. That gives us much better test cases than just knowing that "tags sometimes do not work."

For future updates, we'll try to make this behavior more stable by cleaning up the different non-verbal token patterns in the data and collecting / generating more consistent examples for these tags.

Thanks again for taking the time to test this so carefully and share the results. This is genuinely valuable feedback for us.
