| <html lang="en"> | |
| <head> | |
| <meta charset="utf-8" /> | |
| <meta name="viewport" content="width=device-width, initial-scale=1" /> | |
| <title>DialogueSidon — Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio</title> | |
| <meta name="description" content="Demo page for DialogueSidon: joint restoration and separation of degraded two-speaker dialogue audio via an SSL-VAE latent space and a diffusion-based latent predictor." /> | |
| <link rel="stylesheet" href="style.css" /> | |
| </head> | |
| <body> | |
| <header class="hero"> | |
| <div class="container"> | |
| <h1>DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio</h1> | |
| <p class="authors"> | |
| Wataru Nakata<sup>1,2</sup>, | |
| Yuki Saito<sup>1,2</sup>, | |
| Kazuki Yamauchi<sup>1</sup>, | |
| Emiru Tsunoo<sup>1</sup>, | |
| Hiroshi Saruwatari<sup>1</sup> | |
| </p> | |
| <p class="affiliation"> | |
| <sup>1</sup>The University of Tokyo, Tokyo, Japan | |
| <sup>2</sup>National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan | |
| </p> | |
| <nav class="actions"> | |
| <a class="btn" href="https://arxiv.org/abs/2604.09344" target="_blank" rel="noopener">Paper</a> | |
| <a class="btn" href="https://huggingface.co/spaces/sarulab-speech/DialogueSidon-demo" target="_blank" rel="noopener">Live Demo</a> | |
| </nav> | |
| </div> | |
| </header> | |
| <main class="container"> | |
| <section id="abstract"> | |
| <h2>Abstract</h2> | |
| <p> | |
| Full-duplex dialogue audio, in which each speaker is recorded on a separate track, | |
| is an important resource for spoken dialogue research, but is difficult to collect at | |
| scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural | |
| mixtures, which are unsuitable for systems requiring clean speaker-wise signals. | |
| We propose <em>DialogueSidon</em>, a model for joint restoration and separation of | |
| degraded two-speaker dialogue audio. DialogueSidon combines an SSL-VAE—which | |
| compresses self-supervised speech features into a compact latent space—with a | |
| diffusion-based latent predictor that recovers speaker-wise latent representations | |
| from the degraded mixture. Experiments on English, multilingual, and in-the-wild | |
| dialogue datasets show that DialogueSidon substantially improves intelligibility and | |
| separation quality over the GENESES baseline, while also achieving much faster inference. | |
| </p> | |
| </section> | |
| <section id="samples"> | |
| <h2>Audio Samples</h2> | |
| <p class="note"> | |
| Each row presents the same recording in three versions. The <strong>Noisy mixture</strong> | |
| column is the raw monaural input given to both models. <strong>GENESES</strong> is the | |
| baseline. <strong>DialogueSidon</strong> is our proposed model (D = 32). Separated outputs | |
| are encoded as stereo: speaker 1 on the left channel, speaker 2 on the right. | |
| </p> | |
| <h3>English — Switchboard</h3> | |
| <div class="sample-table-wrapper"> | |
| <table class="sample-table"> | |
| <thead> | |
| <tr> | |
| <th>Example</th> | |
| <th>Noisy mixture</th> | |
| <th>GENESES</th> | |
| <th>DialogueSidon (ours)</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>sw02007</td> | |
| <td><div class="waveform" data-src="wav/swb/noisy/sw02007.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/swb/geneses/sw02007.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02007.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>sw02093</td> | |
| <td><div class="waveform" data-src="wav/swb/noisy/sw02093.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/swb/geneses/sw02093.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02093.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>sw02157</td> | |
| <td><div class="waveform" data-src="wav/swb/noisy/sw02157.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/swb/geneses/sw02157.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02157.wav"></div></td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| <h3>Multilingual — CallFriend</h3> | |
| <div class="sample-table-wrapper"> | |
| <table class="sample-table"> | |
| <thead> | |
| <tr> | |
| <th>Language</th> | |
| <th>Noisy mixture</th> | |
| <th>GENESES</th> | |
| <th>DialogueSidon (ours)</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>German</td> | |
| <td><div class="waveform" data-src="wav/cf/noisy/deu_1082.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/geneses/deu_1082.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/dialoguesidon/deu_1082.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>English</td> | |
| <td><div class="waveform" data-src="wav/cf/noisy/eng-n_4708.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/geneses/eng-n_4708.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/dialoguesidon/eng-n_4708.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>French</td> | |
| <td><div class="waveform" data-src="wav/cf/noisy/fra-q_5110.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/geneses/fra-q_5110.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/dialoguesidon/fra-q_5110.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>Japanese</td> | |
| <td><div class="waveform" data-src="wav/cf/noisy/jpn_0921.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/geneses/jpn_0921.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/dialoguesidon/jpn_0921.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>Spanish</td> | |
| <td><div class="waveform" data-src="wav/cf/noisy/spa_1469.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/geneses/spa_1469.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/dialoguesidon/spa_1469.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>Mandarin</td> | |
| <td><div class="waveform" data-src="wav/cf/noisy/zho-m_0941.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/geneses/zho-m_0941.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/cf/dialoguesidon/zho-m_0941.wav"></div></td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| <h3>In-the-Wild — OpenDialog</h3> | |
| <p class="note"> | |
| Real dialogue recordings from the internet with unknown, naturally occurring degradations. | |
| No clean reference exists for these clips. | |
| </p> | |
| <div class="sample-table-wrapper"> | |
| <table class="sample-table"> | |
| <thead> | |
| <tr> | |
| <th>Example</th> | |
| <th>Noisy mixture</th> | |
| <th>GENESES</th> | |
| <th>DialogueSidon (ours)</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>Example 1</td> | |
| <td><div class="waveform" data-src="wav/od/noisy/example_1.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/od/geneses/example_1.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/od/dialoguesidon/example_1.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>Example 2</td> | |
| <td><div class="waveform" data-src="wav/od/noisy/example_2.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/od/geneses/example_2.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/od/dialoguesidon/example_2.wav"></div></td> | |
| </tr> | |
| <tr> | |
| <td>Example 3</td> | |
| <td><div class="waveform" data-src="wav/od/noisy/example_3.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/od/geneses/example_3.wav"></div></td> | |
| <td><div class="waveform" data-src="wav/od/dialoguesidon/example_3.wav"></div></td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| </section> | |
| <section id="bibtex"> | |
| <h2>Citation</h2> | |
| <pre class="bibtex"><code>[BibTeX entry will be provided upon publication.]</code></pre> | |
| </section> | |
| </main> | |
| <footer> | |
| <div class="container"> | |
| <p>Demo page accompanying the DialogueSidon preprint. | |
| Code will be released upon publication.</p> | |
| </div> | |
| </footer> | |
| <script type="module" src="script.js"></script> | |
| </body> | |
| </html> | |