Spaces:
Sleeping
Subtext Arena
Try saying this out loud = "Yeah, I am really looking forward to it" Now go and read those same eight words on paper, just text. Sincere, or sarcastic? Honestly you cannot tell from the words alone, you have to actually hear the line , the pitch, going up at the end, and the pauses in between, that tiny pause right before "really", the voice that is just a bit too sweet to be real. Only then the meaning flips on its head
Machines are surprisingly bad at this GPT-4o, gets sarcasm right only about 67% of the time on the standard sarcasm benchmark. Which means one out of three times one of the most powerful models on the planet is being fooled by tone of voice itself. A paper last October from a few audio researchers actually calls this the "LISTEN gap", and the gap is still pretty much wide open as of today. So I thought, why not just build a training ground for this thing and see what happens? At least my own attempt at it
The data was already sitting there , a dataset called MUStARD, with 690 clips from Friends and Big Bang Theory, half sarcastic and half sincere, every single line hand labeled by some patient researcher. From each clip's audio I pulled out three pretty basic things: pitch (how high the voice goes), energy (how loud it gets), and pauses (where the speaker hesitates before speaking). The model sees the transcript plus those three numbers, and has to make a call is this line sincere, or sarcastic?
I took a small 3 billion parameter open source model and trained it on this setup for about Five hours. Nothing fancy at all. The training reward climbed steadily from 0.33 to 0.97 over 200 steps, which means yes, clearly the model was learning something real. But the actual test is always the held out clips, the ones the model has never seen during training. And here is where I have to be a bit honest with you
On a broad mix of 80 unseen clips, the trained model got 51% accuracy. If I am being completely fair, that is roughly what a plain text only model would also do, meaning my three audio features were still too simple to push the model much beyond just reading the transcript. That is a fair hit. But here is the part that actually got me excited. On the six clips where the audio is the only thing that flips the meaning , those clips where the literal words alone will fool you every time , the trained model got 5 out of 6 right. The text only baseline got 0 out of 6, and it was confidently, embarrassingly wrong on every single one of them
So that is the whole story really. The environment works and the reward signal works. The model learns. Now anyone with better audio features, wav2vec2, HuBERT, full audio LLMs, anything richer than my three basic numbers ,can plug in straight on top of this
Anyway this is just the start really, I am not stopping here at all. The whole point of putting in the effort to build an environment in the first place is, it should stay alive. Next on the list is to throw in proper audio embeddings instead of these three small numbers, then plug a 7B audio model on top, and after that try other kinds of subtext also , polite refusals, hidden anger, suppressed feelings, basically all the cases where the words and the actual meaning do not quite agree