	<!doctype html>
	<html lang="en">
	<head>
	<meta charset="utf-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<title>DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio</title>
	<meta name="description" content="Demo page for DialogueSidon: joint restoration and separation of degraded two-speaker dialogue audio via an SSL-VAE latent space and a diffusion-based latent predictor." />
	<link rel="stylesheet" href="style.css" />
	</head>
	<body>
	<header class="hero">
	<div class="container">
	<h1>DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio</h1>

	<p class="authors">
	Wataru Nakata<sup>1,2</sup>,
	Yuki Saito<sup>1,2</sup>,
	Kazuki Yamauchi<sup>1</sup>,
	Emiru Tsunoo<sup>1</sup>,
	Hiroshi Saruwatari<sup>1</sup>
	</p>
	<p class="affiliation">
	<sup>1</sup>The University of Tokyo, Tokyo, Japan
	<sup>2</sup>National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
	</p>

	<nav class="actions">
	<a class="btn" href="https://arxiv.org/abs/2604.09344" target="_blank" rel="noopener">Paper</a>
	<a class="btn" href="https://huggingface.co/spaces/sarulab-speech/DialogueSidon-demo" target="_blank" rel="noopener">Live Demo</a>
	</nav>
	</div>
	</header>

	<main class="container">

	<section id="abstract">
	<h2>Abstract</h2>
	<p>
	Full-duplex dialogue audio, in which each speaker is recorded on a separate track,
	is an important resource for spoken dialogue research, but is difficult to collect at
	scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural
	mixtures, which are unsuitable for systems requiring clean speaker-wise signals.
	We propose <em>DialogueSidon</em>, a model for joint restoration and separation of
	degraded two-speaker dialogue audio. DialogueSidon combines an SSL-VAE, which
	compresses self-supervised speech features into a compact latent space, with a
	diffusion-based latent predictor that recovers speaker-wise latent representations
	from the degraded mixture. Experiments on English, multilingual, and in-the-wild
	dialogue datasets show that DialogueSidon substantially improves intelligibility and
	separation quality over a baseline, while also achieving much faster inference.
	</p>
	</section>

	<section id="samples">
	<h2>Audio Samples</h2>
	<p class="note">
	Each row presents the same recording in three versions. The <strong>noisy</strong>
	mixture is the raw monaural input given to every model. <strong>GENESES</strong> is the
	baseline. <strong>DialogueSidon</strong> is ours (D = 32). The separated outputs of both
	models are encoded as stereo: speaker 1 on the left channel, speaker 2 on the right.
	</p>
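	<p class="note">
	To listen to one speaker at a time, the stereo layout described above can be undone
	offline. The following is a minimal sketch, not part of the DialogueSidon code: it
	assumes the sample files are uncompressed PCM WAV, and the function name
	<code>split_stereo</code> is hypothetical. It uses only the Python standard library.
	</p>
	<pre><code>
```python
import io
import wave


def split_stereo(wav_bytes):
    """Split a stereo PCM WAV (given as bytes) into two mono WAVs (left, right)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel (stereo) WAV file")
        width = src.getsampwidth()  # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # PCM frames are interleaved: L0 R0 L1 R1 ..., each sample `width` bytes long.
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for start in range(0, len(frames), frame_size):
        left += frames[start:start + width]                # speaker 1 (left)
        right += frames[start + width:start + frame_size]  # speaker 2 (right)

    def to_mono_wav(samples):
        # Re-wrap the raw samples in a one-channel WAV container.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(samples))
        return buf.getvalue()

    return to_mono_wav(left), to_mono_wav(right)
```
	</code></pre>
	<p class="note">
	For example, passing the bytes of <code>wav/swb/dialoguesidon/sw02007.wav</code> to
	<code>split_stereo</code> would return two mono WAV byte strings, the first for
	speaker 1 and the second for speaker 2, provided the file matches the assumptions above.
	</p>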

	<h3>English — Switchboard</h3>
	<div class="sample-table-wrapper">
	<table class="sample-table">
	<thead>
	<tr>
	<th>Example</th>
	<th>Noisy mixture</th>
	<th>GENESES</th>
	<th>DialogueSidon (ours)</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>sw02007</td>
	<td><div class="waveform" data-src="wav/swb/noisy/sw02007.wav"></div></td>
	<td><div class="waveform" data-src="wav/swb/geneses/sw02007.wav"></div></td>
	<td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02007.wav"></div></td>
	</tr>
	<tr>
	<td>sw02093</td>
	<td><div class="waveform" data-src="wav/swb/noisy/sw02093.wav"></div></td>
	<td><div class="waveform" data-src="wav/swb/geneses/sw02093.wav"></div></td>
	<td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02093.wav"></div></td>
	</tr>
	<tr>
	<td>sw02157</td>
	<td><div class="waveform" data-src="wav/swb/noisy/sw02157.wav"></div></td>
	<td><div class="waveform" data-src="wav/swb/geneses/sw02157.wav"></div></td>
	<td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02157.wav"></div></td>
	</tr>
	</tbody>
	</table>
	</div>

	<h3>Multilingual — CallFriend</h3>
	<div class="sample-table-wrapper">
	<table class="sample-table">
	<thead>
	<tr>
	<th>Language</th>
	<th>Noisy mixture</th>
	<th>GENESES</th>
	<th>DialogueSidon (ours)</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>German</td>
	<td><div class="waveform" data-src="wav/cf/noisy/deu_1082.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/geneses/deu_1082.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/dialoguesidon/deu_1082.wav"></div></td>
	</tr>
	<tr>
	<td>English</td>
	<td><div class="waveform" data-src="wav/cf/noisy/eng-n_4708.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/geneses/eng-n_4708.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/dialoguesidon/eng-n_4708.wav"></div></td>
	</tr>
	<tr>
	<td>French</td>
	<td><div class="waveform" data-src="wav/cf/noisy/fra-q_5110.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/geneses/fra-q_5110.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/dialoguesidon/fra-q_5110.wav"></div></td>
	</tr>
	<tr>
	<td>Japanese</td>
	<td><div class="waveform" data-src="wav/cf/noisy/jpn_0921.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/geneses/jpn_0921.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/dialoguesidon/jpn_0921.wav"></div></td>
	</tr>
	<tr>
	<td>Spanish</td>
	<td><div class="waveform" data-src="wav/cf/noisy/spa_1469.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/geneses/spa_1469.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/dialoguesidon/spa_1469.wav"></div></td>
	</tr>
	<tr>
	<td>Mandarin</td>
	<td><div class="waveform" data-src="wav/cf/noisy/zho-m_0941.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/geneses/zho-m_0941.wav"></div></td>
	<td><div class="waveform" data-src="wav/cf/dialoguesidon/zho-m_0941.wav"></div></td>
	</tr>
	</tbody>
	</table>
	</div>

	<h3>In-the-Wild — OpenDialog</h3>
	<p class="note">
	Real dialogue recordings collected from the internet, with real-world, unknown degradations.
	No clean reference exists for these clips.
	</p>
	<div class="sample-table-wrapper">
	<table class="sample-table">
	<thead>
	<tr>
	<th>Example</th>
	<th>Noisy mixture</th>
	<th>GENESES</th>
	<th>DialogueSidon (ours)</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>Example 1</td>
	<td><div class="waveform" data-src="wav/od/noisy/example_1.wav"></div></td>
	<td><div class="waveform" data-src="wav/od/geneses/example_1.wav"></div></td>
	<td><div class="waveform" data-src="wav/od/dialoguesidon/example_1.wav"></div></td>
	</tr>
	<tr>
	<td>Example 2</td>
	<td><div class="waveform" data-src="wav/od/noisy/example_2.wav"></div></td>
	<td><div class="waveform" data-src="wav/od/geneses/example_2.wav"></div></td>
	<td><div class="waveform" data-src="wav/od/dialoguesidon/example_2.wav"></div></td>
	</tr>
	<tr>
	<td>Example 3</td>
	<td><div class="waveform" data-src="wav/od/noisy/example_3.wav"></div></td>
	<td><div class="waveform" data-src="wav/od/geneses/example_3.wav"></div></td>
	<td><div class="waveform" data-src="wav/od/dialoguesidon/example_3.wav"></div></td>
	</tr>
	</tbody>
	</table>
	</div>
	</section>

	<section id="bibtex">
	<h2>Citation</h2>
	<pre class="bibtex"><code>[BibTeX entry will be provided upon publication.]</code></pre>
	</section>

	</main>

	<footer>
	<div class="container">
	<p>Demo page accompanying the DialogueSidon preprint.
	Code will be released upon publication.</p>
	</div>
	</footer>

	<script type="module" src="script.js"></script>
	</body>
	</html>