Add pipeline tag and improve model card

#1, opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +12 -670
README.md CHANGED
@@ -1,8 +1,9 @@
1
  ---
2
  license: cc-by-4.0
 
3
  ---
4
 
5
- # Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
6
 
7
  [Ziyun Zeng](https://stdkonjac.icu/), Yiqi Lin, [Guoqiang Liang](https://ethanliang99.github.io/), and [Mike Zheng Shou](https://cde.nus.edu.sg/ece/staff/shou-zheng-mike/)
8
 
@@ -11,651 +12,18 @@ license: cc-by-4.0
11
  [![Code](https://img.shields.io/badge/Code-GitHub%20Repo-blue?logo=github)](https://github.com/showlab/Sparkle)
12
  [![Dataset](https://img.shields.io/badge/🤗%20Dataset-Sparkle-orange.svg)](https://huggingface.co/datasets/stdKonjac/Sparkle)
13
  [![Benchmark](https://img.shields.io/badge/🤗%20Benchmark-Sparkle--Bench-orange.svg)](https://huggingface.co/datasets/stdKonjac/Sparkle-Bench)
14
- [![Model](https://img.shields.io/badge/🤗%20Model-Kiwi--Sparkle-orange.svg)](https://huggingface.co/stdKonjac/Kiwi-Sparkle-720P-81F)
15
 
 
16
 
17
- ## 📦 Dataset
18
 
19
- **Sparkle** is a large-scale video background replacement dataset comprising ~140K high-quality source–edited video pairs. It is fully open-sourced at [🤗stdKonjac/Sparkle](https://huggingface.co/datasets/stdKonjac/Sparkle). For full methodology and dataset details, please refer to [our paper](https://arxiv.org/abs/2605.06535).
20
 
21
- The dataset is organized into **five themes** along different background-change axes:
22
 
23
- | Theme | Description |
24
- | ---------- |----------------------------------------------------------------------------------------------------------------------------------|
25
- | `location` | Background replaced with a different physical environment (rural, nature, landmark, ...). |
26
- | `season` | Background changed across seasons (spring, summer, autumn, winter). |
27
- | `time` | Background changed across times of day (dawn, dusk, night, ...). |
28
- | `style` | Background restyled (era, mood, cinematic, ...). |
29
- | `openve3m` | A re-creation of the OpenVE-3M background-replacement subset using our pipeline, retained for direct comparison with prior work. |
30
 
31
- ### 🗂️ Repository Structure
32
-
33
- ```
- Sparkle/
- ├── README.md
- ├── prompts/                              # training annotations + dataset-viewer source
- │   ├── location_train.csv                # 4 columns: prompt, src_video, tgt_video, task
- │   ├── location_train_metadata.jsonl     # per-task metadata (edit_type, subtheme, original scene)
- │   ├── season_train.csv
- │   ├── season_train_metadata.jsonl
- │   ├── time_train.csv
- │   ├── time_train_metadata.jsonl
- │   ├── style_train.csv
- │   ├── style_train_metadata.jsonl
- │   ├── openve3m_train.csv
- │   └── openve3m_train_metadata.jsonl
-
- ├── location/                             # online preview: first 100 samples
- │   ├── source_video/
- │   │   ├── Sparkle_location_000000.mp4
- │   │   └── ... (100 files)
- │   └── edited_video/
- │       ├── Sparkle_location_000000.mp4
- │       └── ... (100 files)
- ├── season/                               # same structure as location/
- ├── time/
- ├── style/
- ├── openve3m/
-
- ├── location_source_video_part00.tar      # full corpus, sharded into ~5GB tars
- ├── location_source_video_part01.tar
- ├── location_edited_video_part00.tar
- ├── ...
- ├── season_*_partXX.tar
- ├── time_*_partXX.tar
- ├── style_*_partXX.tar
- ├── openve3m_*_partXX.tar
-
- └── intermediate_data/                    # pipeline intermediates (described below)
-     └── ...
- ```
72
-
73
- ### 🧾 Training Data Format
74
-
75
- We follow the training data format of [Kiwi-Edit](https://github.com/showlab/Kiwi-Edit) for direct compatibility with downstream training pipelines.
76
-
77
- Each theme's annotations live in `prompts/{edit_type}_train.csv`, a four-column table:
78
-
79
- | Column | Description |
80
- | ----------- | ----------- |
81
- | `prompt` | The natural-language editing instruction. |
82
- | `src_video` | Path to the source video, e.g. `location/source_video/Sparkle_location_000000.mp4`. |
83
- | `tgt_video` | Path to the edited video, e.g. `location/edited_video/Sparkle_location_000000.mp4`. |
84
- | `task` | The unique sample id, e.g. `Sparkle_location_000000`. Joins to the `id` field in the JSONL metadata. |
85
-
86
- Per-task auxiliary metadata is stored alongside in `prompts/{edit_type}_train_metadata.jsonl`. Each line is one sample:
87
-
88
- ```json
89
- {
90
- "id": "Sparkle_location_000000",
91
- "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
92
- "metadata": {
93
- "edit_type": "location",
94
- "chosen_keyword": "urban: rooftop overlooking skyline",
95
- "original_scene": "A cobblestone street in a historical European city, ..."
96
- }
97
- }
98
- ```
99
-
100
- | Field | Description |
101
- | -------------------------- |--------------------------------------------------------------------------------------------------------------------------|
102
- | `id` | Sample id, matches the `task` column in the CSV. |
103
- | `prompt` | Same as the `prompt` column in the CSV. |
104
- | `metadata.edit_type` | One of the five themes: `location` / `season` / `time` / `style` / `openve3m` (denoted as `openve3m_background_change`). |
105
- | `metadata.chosen_keyword` | The `subtheme: scene` label (e.g. `"urban: rooftop overlooking skyline"`). Not available for the `openve3m` theme. |
106
- | `metadata.original_scene` | A description of the source video's first frame. |
107
-
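- To make the join concrete, here is a minimal sketch (assuming `pandas` is available; the paths are illustrative) that attaches each CSV row to its JSONL metadata record via `task` ↔ `id`:
-
- ```python
- import json
- import pandas as pd
-
- # Load the four-column training table for one theme.
- df = pd.read_csv("Sparkle/prompts/location_train.csv")
-
- # Index the per-task metadata by sample id.
- with open("Sparkle/prompts/location_train_metadata.jsonl") as f:
-     meta = {rec["id"]: rec["metadata"] for rec in (json.loads(l) for l in f if l.strip())}
-
- # Join via the CSV `task` column <-> JSONL `id` field.
- # `chosen_keyword` is absent for the openve3m theme, hence .get().
- df["subtheme_scene"] = df["task"].map(lambda t: meta[t].get("chosen_keyword"))
- df["original_scene"] = df["task"].map(lambda t: meta[t]["original_scene"])
- print(df.head())
- ```
-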
108
- ### 👀 Online Preview
109
-
110
- The first 100 samples of every theme are stored as individual `.mp4` files (outside the tar shards) under `{edit_type}/source_video/` and `{edit_type}/edited_video/`, and can be played directly in the browser without downloading the full corpus.
111
-
112
- For example, for the task `Sparkle_location_000000` (the first row in the **location** theme of the dataset viewer), you can directly browse its [Source Video](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/location/source_video/Sparkle_location_000000.mp4) and [Edited Video](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/location/edited_video/Sparkle_location_000000.mp4).
113
-
114
- The dataset viewer at the top of the HF page lets you scroll through all five themes and read the corresponding prompts inline.
115
-
116
- ### ⬇️ Downloading the Full Corpus
117
-
118
- The full ~140K-sample corpus is sharded into ~5GB `.tar` archives at the repository root, named `{edit_type}_{source_video|edited_video}_partXX.tar`.
119
-
120
- **Step 1 — Download the tar shards.** Download everything (recommended for full reproduction):
121
-
122
- ```bash
123
- hf download stdKonjac/Sparkle --repo-type=dataset --local-dir ./Sparkle
124
- ```
125
-
126
- Or only a single theme (e.g. `location`):
127
-
128
- ```bash
129
- hf download stdKonjac/Sparkle \
-     --repo-type=dataset \
-     --local-dir ./Sparkle \
-     --include "location_*.tar" "prompts/location_*"
133
- ```
134
-
135
- Or only the source videos of a theme:
136
-
137
- ```bash
138
- hf download stdKonjac/Sparkle \
-     --repo-type=dataset \
-     --local-dir ./Sparkle \
-     --include "location_source_video_*.tar"
142
- ```
143
-
144
- **Step 2 — Extract the tars.** Each tar is **self-contained**: its internal paths are `{edit_type}/{source_video|edited_video}/{task}.mp4`, so extracting any subset of shards in place will populate the corresponding folders correctly. There is **no need to concatenate the parts** before extraction.
145
-
146
- ```bash
147
- cd ./Sparkle
148
- for f in *.tar; do tar -xf "$f"; done
149
- ```
150
-
151
- After extraction, the directory layout matches the online preview structure, and the relative paths in `prompts/{edit_type}_train.csv` (e.g. `location/source_video/Sparkle_location_000000.mp4`) will resolve directly.
152
-
153
- <details>
154
- <summary><h3 style="display: inline">🧪 Pipeline Intermediates</h3></summary>
155
-
156
- To support **full reproducibility, transparency, and downstream research**, we additionally release every intermediate artifact produced by the 5-stage Sparkle data pipeline (see *Figure 2: Data Pipeline* in [our paper](https://arxiv.org/abs/2605.06535)) under `intermediate_data/`. **The first 100 samples of every theme are stored as individual files and previewable directly in the browser**, mirroring the layout of the `{edit_type}/` preview folders described above.
157
-
158
- Taking `Sparkle_location_000000` as a running example, the artifact layout looks like:
159
-
160
- ```
- Sparkle/
- └── intermediate_data/
-     └── location/
-         ├── source_frame0/                      # Stage 2 input: 0-th frame of the source video
-         │   └── Sparkle_location_000000.png
-         ├── edited_frame0/                      # Stage 2 output: first frame after preliminary background replacement
-         │   └── Sparkle_location_000000.png
-         ├── edited_frame0_foreground_removed/   # Stage 3 intermediate: foreground-removed clean background image
-         │   └── Sparkle_location_000000.png
-         ├── edited_background_video/            # Stage 3 output: 81-frame pure background video (no foreground)
-         │   └── Sparkle_location_000000.mp4
-         ├── source_video_mask/                  # Stage 4 output: BAIT-tracked foreground mask (packed bits)
-         │   └── Sparkle_location_000000.npz
-         └── edited_video_canny/                 # Stage 5 intermediate: decoupled foreground + background Canny edges
-             └── Sparkle_location_000000.mp4
- ```
177
-
178
- For the same task `Sparkle_location_000000`, every artifact is browsable online:
179
-
180
- | Pipeline stage | Artifact | Preview |
181
- |----------------|--------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
182
- | Stage 2 (in) | Source first frame | [`source_frame0/Sparkle_location_000000.png`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/source_frame0/Sparkle_location_000000.png) |
183
- | Stage 2 (out) | Preliminarily edited first frame | [`edited_frame0/Sparkle_location_000000.png`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/edited_frame0/Sparkle_location_000000.png) |
184
- | Stage 3 (mid) | Foreground-removed clean background image | [`edited_frame0_foreground_removed/Sparkle_location_000000.png`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/edited_frame0_foreground_removed/Sparkle_location_000000.png) |
185
- | Stage 3 (out) | Pure background video (81 frames, no foreground) | [`edited_background_video/Sparkle_location_000000.mp4`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/edited_background_video/Sparkle_location_000000.mp4) |
186
- | Stage 4 | BAIT-tracked foreground mask | [`source_video_mask/Sparkle_location_000000.npz`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/source_video_mask/Sparkle_location_000000.npz) |
187
- | Stage 5 (mid) | Decoupled foreground + background Canny edges | [`edited_video_canny/Sparkle_location_000000.mp4`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/edited_video_canny/Sparkle_location_000000.mp4) |
188
-
189
- **Loading the foreground mask.** The masks in `source_video_mask/` are bit-packed for storage efficiency. Each `.npz` file contains two arrays: `mask` (a `np.uint8` array of bits) and `shape` (the original `(T, H, W)` mask shape, where `T ≤ 81`). Unpack with:
190
-
191
- ```python
- import numpy as np
-
- def load_mask(mask_path: str) -> np.ndarray:
-     data = np.load(mask_path)
-     packed_mask = data["mask"]                    # bit-packed uint8 array
-     shape = tuple(int(s) for s in data["shape"])  # original (T, H, W)
-     total = shape[0] * shape[1] * shape[2]
-     video_mask = np.unpackbits(packed_mask)[:total].reshape(shape).astype(bool)
-     return video_mask  # boolean array of shape (T, H, W)
- ```
202
-
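- For example, on one sample (the path follows the repository layout above; the exact shape depends on the source video):
-
- ```python
- mask = load_mask("intermediate_data/location/source_video_mask/Sparkle_location_000000.npz")
- print(mask.shape)   # (T, H, W) boolean mask, T ≤ 81
- print(mask.mean())  # fraction of pixels tracked as foreground
- ```
-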
203
- **Downloading the full intermediates.** Like the main corpus, the full intermediates for every theme are sharded into ~5GB `.tar` archives, stored under `intermediate_data/` and named `{edit_type}_{subdir}_partXX.tar` where `{subdir}` is one of the six folder names above. Download and extract them as follows:
204
-
205
- ```bash
206
- # Download all intermediates for a single theme (e.g. location)
207
- hf download stdKonjac/Sparkle \
-     --repo-type=dataset \
-     --local-dir ./Sparkle \
-     --include "intermediate_data/location_*_part*.tar"
211
-
212
- # Extract in place; tar-internal paths are {edit_type}/{subdir}/{file},
213
- # so the working directory must be intermediate_data/ for the layout to align.
214
- cd ./Sparkle/intermediate_data
215
- for f in location_*_part*.tar; do tar -xf "$f"; done
216
- ```
217
-
218
- After extraction, the layout matches the online preview structure exactly, populating `intermediate_data/location/{source_frame0, edited_frame0, ...}/`.
219
-
220
- #### 📋 Per-task Pipeline Metadata
221
-
222
- In addition to the per-task artifacts, each theme's `intermediate_data/{edit_type}/` folder also contains five `.jsonl` files recording metadata produced at various stages of the pipeline (e.g., quality scores, foreground grounding labels). These records are useful for **reproducing our quality filtering**, **inspecting per-stage rejection statistics**, or **building stricter / looser variants of Sparkle for downstream research**.
223
-
224
- **`edited_frame0_score.jsonl`** records per-sample [EditScore](https://github.com/VectorSpaceLab/EditScore) evaluation of the Stage 2 output (`edited_frame0/{task}.png`). One JSON object per line:
225
-
226
- ```json
227
- {
228
- "id": "Sparkle_location_000000",
229
- "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
230
- "editscore": {
231
- "prompt_following": 9.7,
232
- "consistency": 8.8,
233
- "perceptual_quality": 8.5,
234
- "overall": 8.62887857991077,
235
- "SC_reasoning": "The edited image perfectly follows the instruction: ...",
236
- "PQ_reasoning": "The image displays a realistic cityscape with convincing lighting ..."
237
- }
238
- }
239
- ```
240
-
241
- | Field | Description |
242
- |----------------------------------|------------------------------------------------------------------------------|
243
- | `id` | Sample id, matches the `task` column in the CSV. |
244
- | `prompt` | The editing instruction. |
245
- | `editscore.prompt_following` | Sub-score (0–10): how well the edit follows the instruction. |
246
- | `editscore.consistency` | Sub-score (0–10): subject and identity consistency with the source frame. |
247
- | `editscore.perceptual_quality` | Sub-score (0–10): perceptual quality of the edited image. |
248
- | `editscore.overall` | Aggregated overall score. **We filter out samples with `overall < 8`.** |
249
- | `editscore.SC_reasoning` | Free-text rationale for the consistency / instruction-following sub-scores. |
250
- | `editscore.PQ_reasoning` | Free-text rationale for the perceptual-quality sub-score. |
251
-
252
- **`edited_frame0_foreground_removed_score.jsonl`** records per-sample [EditScore](https://github.com/VectorSpaceLab/EditScore) evaluation of the Stage 3 intermediate output (`edited_frame0_foreground_removed/{task}.png`), measuring the foreground-removal quality. The schema is identical to `edited_frame0_score.jsonl`:
253
-
254
- ```json
255
- {
256
- "id": "Sparkle_location_000000",
257
- "prompt": "...",
258
- "editscore": {
259
- "prompt_following": ...,
260
- "consistency": ...,
261
- "perceptual_quality": ...,
262
- "overall": ...,
263
- "SC_reasoning": "...",
264
- "PQ_reasoning": "..."
265
- }
266
- }
267
- ```
268
-
269
- At this stage we apply a stricter threshold and **filter out samples with `overall < 8.5`** to guarantee a perfectly clean background before the I2V generation that follows.
270
-
271
- **`foreground_grounding_r1.jsonl`** records the **first-round VLM grounding** result that compares the source first frame and the Stage 2 edited first frame to identify foreground objects to preserve. This is the labeling step described in Stage 3 of the pipeline. One JSON object per line:
272
-
273
- ```json
274
- {
275
- "id": "Sparkle_location_000000",
276
- "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
277
- "edit_type": "location",
278
- "round1_labels": [
279
- "woman in brown hat and coat",
280
- "clasped hands with ring",
281
- "striped shirt under coat",
282
- "brown wide-brimmed hat"
283
- ],
284
- "round1_objects": [
285
- {"bbox_2d": [447, 27, 765, 998], "label": "woman in brown hat and coat"},
286
- {"bbox_2d": [515, 800, 615, 980], "label": "clasped hands with ring"},
287
- {"bbox_2d": [490, 398, 615, 800], "label": "striped shirt under coat"},
288
- {"bbox_2d": [505, 27, 710, 258], "label": "brown wide-brimmed hat"}
289
- ]
290
- }
291
- ```
292
-
293
- | Field | Description |
294
- |------------------|------------------------------------------------------------------------------------------------------|
295
- | `id` | Sample id, matches the `task` column in the CSV. |
296
- | `prompt` | The editing instruction. |
297
- | `edit_type` | The theme this sample belongs to (`location` / `season` / `time` / `style` / `openve3m`). |
298
- | `round1_labels` | List of foreground-object labels detected by the VLM. |
299
- | `round1_objects` | Per-object detection records; each item has a `bbox_2d` and a `label`. |
300
-
301
- The bounding boxes are detected on the **source first frame** (`source_frame0/{task}.png`). Since our pipeline preserves the foreground identity and pose during background replacement, these boxes apply equally to the corresponding edited first frame (`edited_frame0/{task}.png`).
302
-
303
- <a id="normalize-bbox"></a>
304
-
305
- The `bbox_2d` field follows Qwen3-VL's **normalized coordinate format** with values in the range `[0, 1000]`, representing `[x1, y1, x2, y2]` (top-left and bottom-right corners). Convert them to absolute pixel coordinates of the real frame as follows:
306
-
307
- ```python
- def normalize_bbox(bbox, video_width: int, video_height: int):
-     """Convert a Qwen3-VL [0, 1000]-normalized bbox to absolute pixel coordinates."""
-     x1 = int(bbox[0] / 1000.0 * video_width)
-     y1 = int(bbox[1] / 1000.0 * video_height)
-     x2 = int(bbox[2] / 1000.0 * video_width)
-     y2 = int(bbox[3] / 1000.0 * video_height)
-
-     # Order the corners first, then clamp to frame bounds; clamping before
-     # ordering would collapse a flipped box (x1 > x2) onto a single corner.
-     x1, x2 = sorted((x1, x2))
-     y1, y2 = sorted((y1, y2))
-     x1 = max(0, min(x1, video_width - 1))
-     y1 = max(0, min(y1, video_height - 1))
-     x2 = max(0, min(x2, video_width - 1))
-     y2 = max(0, min(y2, video_height - 1))
-     return x1, y1, x2, y2
- ```
322
-
323
- **`foreground_grounding_r2.jsonl`** records the **second-round VLM grounding** result that produces the temporal anchors for Stage 4 (BAIT Foreground Tracking). Building on the labels from `foreground_grounding_r1.jsonl`, Qwen3-VL is asked to re-locate every Round 1 label on frames sampled at 2 FPS from the source video, yielding per-frame bounding boxes that anchor the subsequent SAM3 multi-pass tracking. One JSON object per line:
324
-
325
- ```json
326
- {
327
- "id": "Sparkle_location_000000",
328
- "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
329
- "edit_type": "location",
330
- "round1_labels": [...],
331
- "round1_objects": [...],
332
- "frame_objects": [
333
- [
334
- {"bbox_2d": [448, 26, 765, 998], "label": "woman in brown hat and coat"},
335
- {"bbox_2d": [521, 795, 618, 968], "label": "clasped hands with ring"},
336
- {"bbox_2d": [545, 420, 625, 805], "label": "striped shirt under coat"},
337
- {"bbox_2d": [507, 26, 712, 270], "label": "brown wide-brimmed hat"}
338
- ],
339
- [
340
- {"bbox_2d": [452, 34, 764, 998], "label": "woman in brown hat and coat"},
341
- {"bbox_2d": [505, 784, 600, 955], "label": "clasped hands with ring"},
342
- ...
343
- ],
344
- ...
345
- ]
346
- }
347
- ```
348
-
349
- The schema extends `foreground_grounding_r1.jsonl` with a single new field:
350
-
351
- | Field | Description |
352
- |-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
353
- | `frame_objects` | A 2D list of grounding results, one inner list per 2 FPS-sampled frame. Each inner list mirrors the `round1_objects` schema (a list of `{"bbox_2d": [...], "label": "..."}` items), giving the per-frame bbox of every Round 1 label on that frame. |
354
-
355
- The other fields (`id`, `prompt`, `edit_type`, `round1_labels`, `round1_objects`) are inherited unchanged from `foreground_grounding_r1.jsonl`. Use the same [`normalize_bbox`](#normalize-bbox) helper to convert `bbox_2d` values to absolute pixel coordinates.
356
-
357
- > **Note.** Some entries in `frame_objects` may have an empty `bbox_2d` (e.g. `{"bbox_2d": [], "label": "..."}`), indicating that the VLM failed to localize that particular label on that frame. Our BAIT algorithm handles these gracefully by relying on the remaining frames' anchors and a pixel-wise majority vote across SAM3 tracking passes.
358
-
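- A minimal sketch of consuming these anchors, reusing the [`normalize_bbox`](#normalize-bbox) helper above and skipping the empty detections just mentioned (the file path and frame size are up to you):
-
- ```python
- import json
-
- def iter_frame_anchors(jsonl_path: str, width: int, height: int):
-     """Yield (sample_id, frame_idx, label, pixel_bbox) for every localized per-frame anchor."""
-     with open(jsonl_path) as f:
-         for line in f:
-             if not line.strip():
-                 continue
-             rec = json.loads(line)
-             for frame_idx, objects in enumerate(rec["frame_objects"]):
-                 for obj in objects:
-                     if not obj["bbox_2d"]:  # VLM failed to localize this label on this frame
-                         continue
-                     yield rec["id"], frame_idx, obj["label"], normalize_bbox(obj["bbox_2d"], width, height)
- ```
-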
359
- **`edited_video_score.jsonl`** records per-sample [EditScore](https://github.com/VectorSpaceLab/EditScore) evaluation of the **Stage 5 final synthesized video**. Following the protocol in our paper, we uniformly sample four non-first frames from each video and score them independently. One JSON object per line:
360
-
361
- ```json
362
- {
363
- "id": "Sparkle_location_000000",
364
- "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
365
- "frame_indices": [1, 26, 51, 76],
366
- "editscore": [
367
- {
368
- "SC_score": 9.0,
369
- "PQ_score": 8.5,
370
- "O_score": 8.719958110896453,
371
- "SC_score_reasoning": "The editing successfully changed the background to a rooftop overlooking a modern city skyline at dusk, ...",
372
- "PQ_score_reasoning": "The image has a mostly natural cityscape and lighting, but the person's hands appear slightly distorted ...",
373
- "SC_raw_output": "...",
374
- "PQ_raw_output": "..."
375
- },
376
- { "SC_score": 8.3, "PQ_score": 8.5, "O_score": 8.388302424289282, "...": "..." },
377
- { "SC_score": 9.1, "PQ_score": 7.4, "O_score": 8.143194240945185, "...": "..." },
378
- { "SC_score": 8.9, "PQ_score": 7.8, "O_score": 8.318623075017307, "...": "..." }
379
- ]
380
- }
381
- ```
382
-
383
- | Field | Description |
384
- |--------------------------------|-----------------------------------------------------------------------------------------------------------------------|
385
- | `id` | Sample id, matches the `task` column in the CSV. |
386
- | `prompt` | The editing instruction. |
387
- | `frame_indices` | The 4 frame indices (0-based) sampled from the synthesized video for evaluation, e.g. `[1, 26, 51, 76]`. |
388
- | `editscore` | A length-4 list, one entry per sampled frame, in the same order as `frame_indices`. |
389
- | `editscore[i].SC_score` | Sub-score (0–10) for instruction-following / consistency on frame `i`. |
390
- | `editscore[i].PQ_score` | Sub-score (0–10) for perceptual quality on frame `i`. |
391
- | `editscore[i].O_score` | Aggregated overall score on frame `i`. |
392
- | `editscore[i].SC_score_reasoning` | Free-text rationale behind `SC_score`. |
393
- | `editscore[i].PQ_score_reasoning` | Free-text rationale behind `PQ_score`. |
394
- | `editscore[i].SC_raw_output` | Raw JSON string returned by the EditScore SC head (contains `reasoning` and per-criterion `score` array). |
395
- | `editscore[i].PQ_raw_output` | Raw JSON string returned by the EditScore PQ head. |
396
-
397
- The final filtering rule is: **average `O_score` across all four sampled frames; discard the sample if the mean is below `8`.**
398
-
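- A minimal sketch of reproducing this final filter (the file path is illustrative):
-
- ```python
- import json
-
- def passes_final_filter(record: dict, threshold: float = 8.0) -> bool:
-     """Keep a sample iff the mean O_score over its four sampled frames reaches the threshold."""
-     scores = [frame["O_score"] for frame in record["editscore"]]
-     return sum(scores) / len(scores) >= threshold
-
- with open("intermediate_data/location/edited_video_score.jsonl") as f:
-     kept = [rec["id"] for rec in (json.loads(l) for l in f if l.strip()) if passes_final_filter(rec)]
- ```
-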
399
- </details>
400
-
401
- ### 📜 Dataset License
402
-
403
- The Sparkle dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.
404
-
405
- Source videos in the `openve3m` theme are derived from [OpenVE-3M](https://arxiv.org/abs/2512.07826) and retain their original licenses; please consult the upstream source before redistribution.
406
-
407
- ## 🎯 Benchmark
408
-
409
- **Sparkle-Bench** is the largest evaluation benchmark tailored for instruction-guided video background replacement, comprising **458 carefully curated videos across 4 themes, 21 subthemes, and 97 distinct scenes**. It is fully open-sourced at [🤗stdKonjac/Sparkle-Bench](https://huggingface.co/datasets/stdKonjac/Sparkle-Bench). For evaluation methodology and our six-dimensional scoring protocol, please refer to [our paper](https://arxiv.org/abs/2605.06535).
410
-
411
- **All source videos in the benchmark are stored as individual files and previewable directly in the browser**, so users can inspect any sample without downloading anything.
412
-
413
- The benchmark is organized into **four themes**:
414
-
415
- | Theme | Description |
416
- | ---------- |------------------------------------------------------------------------------------------|
417
- | `location` | Background replaced with a different physical environment (rural, nature, landmark, ...).|
418
- | `season` | Background changed across seasons (spring, summer, autumn, winter). |
419
- | `time` | Background changed across times of day (dawn, dusk, night, ...). |
420
- | `style` | Background restyled (era, mood, cinematic, ...). |
421
-
422
- ### 🗂️ Repository Structure
423
-
424
- ```
- Sparkle-Bench/
- ├── README.md
- ├── location_bench.csv        # 3 columns: edited_type, prompt, original_video
- ├── location_metadata.jsonl   # per-task metadata (edit_type, subtheme, original scene)
- ├── season_bench.csv
- ├── season_metadata.jsonl
- ├── time_bench.csv
- ├── time_metadata.jsonl
- ├── style_bench.csv
- ├── style_metadata.jsonl
- ├── source_videos/            # all 458 source videos, browsable online
- │   ├── location/
- │   │   ├── Sparkle_location_000011.mp4
- │   │   └── ...
- │   ├── season/
- │   ├── time/
- │   └── style/
- └── ref_images/               # optional reference background images (see below)
-     ├── location/
-     ├── season/
-     ├── time/
-     └── style/
- ```
448
-
449
- ### 🧾 Benchmark Format
450
-
451
- We follow the format of [OpenVE-Bench](https://huggingface.co/datasets/Lewandofski/OpenVE-Bench) for direct compatibility with existing evaluation pipelines.
452
-
453
- Each theme's evaluation prompts live in `{edit_type}_bench.csv`, a three-column table:
454
-
455
- | Column | Description |
456
- |------------------|---------------------------------------------------------------------------------------------------|
457
- | `edited_type` | The theme of this sample, one of `location` / `season` / `time` / `style`. |
458
- | `prompt` | The natural-language editing instruction. |
459
- | `original_video` | Path to the source video, e.g. `source_videos/location/Sparkle_location_010913.mp4`. |
460
-
461
- Per-task auxiliary metadata is stored alongside in `{edit_type}_metadata.jsonl`. Each line is one sample:
462
-
463
- ```json
464
- {
465
- "id": "Sparkle_location_004302",
466
- "prompt": "Put the subject against ancient stone ruins overgrown with wind-swept grass, ...",
467
- "metadata": {
468
- "edit_type": "location",
469
- "chosen_keyword": "landmark: ancient stone ruins with wind-swept grass",
470
- "original_scene": "A dimly lit indoor bar or restaurant with brick walls, framed artwork, and warm overhead lighting."
471
- }
472
- }
473
- ```
474
-
475
- | Field | Description |
476
- |----------------------------|------------------------------------------------------------------------------------------------------------|
477
- | `id` | Sample id, e.g. `Sparkle_location_004302`. Matches the basename of the corresponding `original_video` path. |
478
- | `prompt` | Same as the `prompt` column in the CSV. |
479
- | `metadata.edit_type` | The theme this sample belongs to (`location` / `season` / `time` / `style`). |
480
- | `metadata.chosen_keyword` | The `subtheme: scene` label (e.g. `"landmark: ancient stone ruins with wind-swept grass"`). |
481
- | `metadata.original_scene` | A description of the source video's first frame. |
482
-
483
- ### 👀 Online Preview
484
-
485
- All 458 source videos are stored as individual `.mp4` files (not packed into archives) under `source_videos/{edit_type}/`, and can be played directly in the browser without any download.
486
-
487
- For example, the source video of task `Sparkle_location_000011` (the first row in the **location** theme of the dataset viewer) is browsable at: [Sparkle_location_000011](https://huggingface.co/datasets/stdKonjac/Sparkle-Bench/blob/main/source_videos/location/Sparkle_location_000011.mp4).
488
-
489
- The dataset viewer at the top of the HF page lets you scroll through all four themes and read the corresponding prompts inline.
490
-
491
- ### ⬇️ Downloading the Benchmark
492
-
493
- Sparkle-Bench is small enough to download in one command. Pull the entire repo:
494
-
495
- ```bash
496
- hf download stdKonjac/Sparkle-Bench --repo-type=dataset --local-dir ./Sparkle-Bench
497
- ```
498
-
499
- Or download only a single theme (e.g. `location`):
500
-
501
- ```bash
502
- hf download stdKonjac/Sparkle-Bench \
-     --repo-type=dataset \
-     --local-dir ./Sparkle-Bench \
-     --include "location_*" "source_videos/location/*"
506
- ```
507
-
508
- After downloading, the relative paths in `{edit_type}_bench.csv` (e.g. `source_videos/location/Sparkle_location_010913.mp4`) will resolve directly.
509
-
510
- ### 📊 Evaluation
511
-
512
- We provide an end-to-end evaluation script, [`eval_sparkle_bench_gemini.py`](https://github.com/showlab/Sparkle/blob/main/eval_sparkle_bench_gemini.py), that scores edited videos using Gemini-2.5-Pro under our six-dimensional rubric (see *Section 3.7* in [our paper](https://arxiv.org/abs/2605.06535)). The six dimensions are: **Instruction Compliance**, **Overall Visual Quality**, **Foreground Integrity**, **Foreground Motion Consistency**, **Background Dynamics**, and **Background Visual Quality**, each scored on a 1–5 scale.
513
-
514
- #### 1. Prepare your inference outputs
515
-
516
- The script expects edited videos to be organized in a specific directory tree. For every sample in Sparkle-Bench, the inference output should be saved as:
517
-
518
- ```
519
- {save_dir}/{edit_type}/{subtheme}---{scene_key}/{id}_edited.mp4
520
- ```
521
-
522
- where:
523
-
524
- - `{save_dir}` is your inference root (free to choose).
525
- - `{edit_type}` is one of `location` / `season` / `time` / `style`.
526
- - `{subtheme}---{scene_key}` is derived from the sample's `chosen_keyword` field in `{edit_type}_metadata.jsonl`: splitting `chosen_keyword` on `": "` yields `subtheme` and `scene`, then `scene_key = scene.replace(" ", "_")`. The triple dash `---` separates the two parts (see the sketch after the example tree below).
527
- - `{id}` is the sample id, e.g. `Sparkle_location_000172`.
528
-
529
- For example, the inference outputs across the four themes should look like:
530
-
531
- ```
- {save_dir}/
- ├── location/
- │   └── landmark---ancient_stone_ruins_with_wind-swept_grass/
- │       └── Sparkle_location_000172_edited.mp4
- ├── season/
- │   └── {subtheme}---{scene_key}/
- │       └── Sparkle_season_xxxxxx_edited.mp4
- ├── time/
- │   └── {subtheme}---{scene_key}/
- │       └── Sparkle_time_xxxxxx_edited.mp4
- └── style/
-     └── {subtheme}---{scene_key}/
-         └── Sparkle_style_xxxxxx_edited.mp4
- ```
546
-
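- A minimal sketch of deriving this path from one metadata record (the helper name is ours; `record` is one parsed line of `{edit_type}_metadata.jsonl`):
-
- ```python
- import os
-
- def output_path(save_dir: str, record: dict) -> str:
-     """Build {save_dir}/{edit_type}/{subtheme}---{scene_key}/{id}_edited.mp4 from one metadata record."""
-     subtheme, scene = record["metadata"]["chosen_keyword"].split(": ", 1)
-     scene_key = scene.replace(" ", "_")
-     return os.path.join(save_dir, record["metadata"]["edit_type"],
-                         f"{subtheme}---{scene_key}", f"{record['id']}_edited.mp4")
- ```
-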
547
- #### 2. Configure the Gemini API
548
-
549
- By default the script uses **Azure-hosted Gemini via the OpenAI-compatible API** for convenient concurrency. Export two environment variables before running:
550
-
551
- ```bash
552
- export AZURE_ENDPOINT="https://your-azure-endpoint"
553
- export GEMINI_API_KEY="your-api-key"
554
- ```
555
-
556
- If you have direct access to the Gemini API, you can swap the `GEMINI_API` client at the top of the script for the native [`google-genai`](https://github.com/googleapis/python-genai) SDK. The request payload only needs `(system prompt, source video, edited video)`, so the adaptation is straightforward. Just keep the `temperature=0` / `seed=42` settings for reproducibility.
557
-
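- For reference, a minimal, untested sketch of such an adaptation with the native `google-genai` SDK (exact upload/config signatures may differ across SDK versions, and `SYSTEM_PROMPT` stands in for the rubric prompt from the script):
-
- ```python
- import os
-
- from google import genai
- from google.genai import types
-
- SYSTEM_PROMPT = "..."  # the six-dimension rubric prompt from eval_sparkle_bench_gemini.py
-
- client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
-
- # Upload both videos; large files may need polling until they become ACTIVE.
- src = client.files.upload(file="source.mp4")
- edited = client.files.upload(file="edited.mp4")
-
- response = client.models.generate_content(
-     model="gemini-2.5-pro",
-     contents=[SYSTEM_PROMPT, src, edited],  # (system prompt, source video, edited video)
-     config=types.GenerateContentConfig(temperature=0, seed=42),  # keep for reproducibility
- )
- print(response.text)
- ```
-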
558
- #### 3. Run the evaluation
559
-
560
- Assuming Sparkle-Bench has been downloaded to `data/Sparkle-Bench/` (the default `--bench_root`):
561
-
562
- ```bash
563
- python3 eval_sparkle_bench_gemini.py \
-     --video_paths /path/to/sparkle_bench_results/
565
- ```
566
-
567
- For multiple checkpoints in one run:
568
-
569
- ```bash
570
- python3 eval_sparkle_bench_gemini.py \
-     --video_paths /path/to/ckpt_a/sparkle_bench/ \
-                   /path/to/ckpt_b/sparkle_bench/ \
-                   /path/to/ckpt_c/sparkle_bench/
574
- ```
575
-
576
- By default the script evaluates all four themes (`location`, `season`, `time`, `style`); pass `--edit_types` to restrict to a subset. Concurrency is controlled inside the script (default 20 workers).
577
-
578
- #### 4. Read the output
579
-
580
- For each `(save_dir, edit_type)` pair, the script writes:
581
-
582
- ```
583
- {save_dir}/{edit_type}_gemini-2.5-pro_sparkle_score.jsonl
584
- ```
585
-
586
- Each line is a per-sample record containing the six-dim scores plus the original Gemini reasoning:
587
-
588
- ```json
589
- {
590
- "id": "Sparkle_location_000172",
591
- "prompt": "Put the subject against ancient stone ruins overgrown with wind-swept grass, ...",
592
- "edit_type": "location",
593
- "subtheme": "landmark",
594
- "scene": "ancient stone ruins with wind-swept grass",
595
- "scores": [5, 5, 5, 5, 5, 5],
596
- "result": "Brief reasoning: The edited background perfectly matches every detail of the prompt, ...\nInstruction Compliance: 5\nOverall Visual Quality: 5\nForeground Integrity: 5\nForeground Motion Consistency: 5\nBackground Dynamics: 5\nBackground Visual Quality: 5"
597
- }
598
- ```
599
-
600
- The `scores` array follows this fixed order: `[Instruction Compliance, Overall Visual Quality, Foreground Integrity, Foreground Motion Consistency, Background Dynamics, Background Visual Quality]`. Following the OpenVE-Bench protocol, the script automatically caps dimensions 2–6 at the Instruction Compliance score to prevent score hacking.
601
-
602
- After scoring, the script aggregates per-theme and macro averages and prints a summary table to stdout. The evaluation is **deterministic** by design (`temperature=0`, fixed `seed=42`) for reproducibility.
603
-
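- A minimal sketch of this aggregation over one output file (assuming the stored `scores` already reflect the capping described above; the filename is illustrative):
-
- ```python
- import json
-
- DIMS = ["Instruction Compliance", "Overall Visual Quality", "Foreground Integrity",
-         "Foreground Motion Consistency", "Background Dynamics", "Background Visual Quality"]
-
- with open("location_gemini-2.5-pro_sparkle_score.jsonl") as f:
-     records = [json.loads(l) for l in f if l.strip()]
-
- # Per-dimension mean over all samples in this theme.
- for i, dim in enumerate(DIMS):
-     print(f"{dim}: {sum(rec['scores'][i] for rec in records) / len(records):.3f}")
- ```
-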
604
- ### 🖼️ Reference Images (Optional, Use with Caution)
605
-
606
- By construction, every Sparkle-Bench sample is a video that **passed the first four stages of our pipeline but failed the final synthesis quality check in Stage 5** (see Section 3.7 of [our paper](https://arxiv.org/abs/2605.06535)). As a free byproduct, this means each sample comes with a **pure background image** generated by Stage 3 (Individual Background Generation), where the foreground has been removed from the preliminarily edited first frame.
607
-
608
- We release these images under `ref_images/{edit_type}/{id}.png`, alongside the CSV/JSONL annotations. These images may be useful for **reference-based** background-replacement experiments (e.g., feeding the clean background as an extra visual condition to the editing model).
609
-
610
- > **⚠️ Disclaimer.** Our paper neither trains any reference-based model nor includes any reference-image-based evaluation. We release `ref_images/` purely to **facilitate future research** in this direction. The images are **not curated** and may contain noise such as low-quality edits or imperfect foreground removal. Please **use them with caution**. We make no quality guarantees about this auxiliary asset.
611
-
612
- ### 📜 Benchmark License
613
-
614
- Sparkle-Bench is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.
615
-
616
- Source videos are derived from [OpenVE-3M](https://arxiv.org/abs/2512.07826) and retain their original licenses; please consult the upstream source before redistribution.
617
-
618
- ## 🧠 Model
619
-
620
- We release **Kiwi-Sparkle**, a video background-replacement model fine-tuned on the **Sparkle** dataset for **10K steps** with a batch size of 128, starting from a [Kiwi-Edit](https://github.com/showlab/Kiwi-Edit) base. Since we apply no architectural modifications to Kiwi-Edit, **Kiwi-Sparkle's weights are fully compatible with the Kiwi-Edit weights structure**. Any inference, training, or deployment pipeline that runs Kiwi-Edit can run Kiwi-Sparkle as a drop-in replacement.
621
-
622
- The model is open-sourced at [🤗stdKonjac/Kiwi-Sparkle-720P-81F](https://huggingface.co/stdKonjac/Kiwi-Sparkle-720P-81F) and supports **720P resolution** with up to **81-frame outputs**.
623
-
624
- | Setting | Value |
625
- |-----------------------|-------------------------------------------------------------------------------------------------------------------------|
626
- | Foundation model | [Kiwi-Edit-Stage2 (Image + Video)](https://huggingface.co/linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage2_img_vid_720x1280_81f) |
627
- | Resolution | 720 × 1280 |
628
- | Max output frames | 81 |
629
- | Fine-tuning steps | 10,000 |
630
- | Batch size | 128 |
631
- | Architectural changes | None. Drop-in compatible with Kiwi-Edit. |
632
-
633
- ### 🚀 Training
634
-
635
- Kiwi-Sparkle is trained using the official Kiwi-Edit recipe in [this script](https://github.com/showlab/Kiwi-Edit/blob/main/scripts/run_wan2.2_ti2v_5b_qwen25vl_3b_stage2_img_vid_720x1280_81f.sh) with no modifications. Two common entry points are supported:
636
-
637
- **Train from the Kiwi-Edit base on a Sparkle theme.** Point `--vid_dataset_metadata_path` to the corresponding Sparkle training CSV, and load the foundation [Kiwi-Edit-Stage2](https://huggingface.co/linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage2_img_vid_720x1280_81f) checkpoint:
638
-
639
- ```bash
640
- --vid_dataset_metadata_path /path/to/Sparkle/prompts/{edit_type}_train.csv
641
- --checkpoint /path/to/Kiwi-Edit-Stage2/model.safetensors
642
- ```
643
-
644
- where `{edit_type}` is one of `location` / `season` / `time` / `style` / `openve3m`. The five training CSVs are hosted [here](https://huggingface.co/datasets/stdKonjac/Sparkle/tree/main/prompts).
645
-
646
- **Continue training from our Kiwi-Sparkle checkpoint.** Replace the `--checkpoint` argument:
647
-
648
- ```bash
649
- --checkpoint /path/to/Kiwi-Sparkle-720P-81F/model.safetensors
650
- ```
651
-
652
- The rest of the script stays exactly as in the official Kiwi-Edit setup.
653
-
654
- ### 🎬 Inference
655
-
656
- #### OpenVE-Bench
657
-
658
- Since Kiwi-Sparkle is architecturally identical to Kiwi-Edit, you can simply follow the official OpenVE-Bench evaluation pipeline of Kiwi-Edit and swap the checkpoint to Kiwi-Sparkle. For example:
659
 
660
  ```bash
661
  python3 test_benchmark.py \
@@ -666,41 +34,15 @@ python3 test_benchmark.py \
666
  --save_dir ./infer_results/
667
  ```
668
 
669
- #### Sparkle-Bench
670
 
671
- We provide a dedicated launch pair, [`test_benchmark_sparkle_bench.py`](https://github.com/showlab/Sparkle/blob/main/test_benchmark_sparkle_bench.py) and [`test_benchmark_sparkle_bench.sh`](https://github.com/showlab/Sparkle/blob/main/test_benchmark_sparkle_bench.sh), that mirror Kiwi-Edit's existing benchmarking layout.
672
 
673
- **Step 1.** Clone the [Kiwi-Edit repository](https://github.com/showlab/Kiwi-Edit) and copy our two scripts into the Kiwi-Edit repo root, alongside the official `test_benchmark.py`.
674
-
675
- **Step 2.** Edit the shell script to point at your Kiwi-Sparkle checkpoint, then launch (defaults to 8 GPUs):
676
-
677
- ```bash
678
- bash test_benchmark_sparkle_bench.sh
679
- ```
680
-
681
- The script writes inference outputs to `infer_results/Kiwi-Sparkle-720P-81F/sparkle_bench/{edit_type}/{subtheme}---{scene_key}/{id}_edited.mp4`. Re-run it with a different `EDIT_TYPE` to cover all four themes.
682
-
683
- **Step 3.** Score the outputs with our [Gemini-based evaluator](#-evaluation):
684
-
685
- ```bash
686
- python3 eval_sparkle_bench_gemini.py \
687
- --video_paths infer_results/Kiwi-Sparkle-720P-81F/sparkle_bench/
688
- ```
689
-
690
- See the [Evaluation section](#-evaluation) above for details on environment setup, output format, and the six-dimensional scoring rubric.
691
-
692
- ### 📜 Model License
693
-
694
- Kiwi-Sparkle is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.
695
-
696
- ## 🙏 Acknowledgements
697
-
698
- This project is built on top of a number of excellent open-source projects. We thank the authors of [Kiwi-Edit](https://github.com/showlab/Kiwi-Edit), [FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B), [Qwen3-VL-32B](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct), [Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B), [LightX2V](https://github.com/ModelTC/lightx2v), and [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) for releasing the infrastructure that made this work possible.
699
 
700
  ## 📝 Citation
701
 
702
- If you find Sparkle useful for your research, please consider citing our paper:
703
-
704
  ```bibtex
705
  @misc{zeng2026sparkle,
706
  title = {Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance},
 
1
  ---
2
  license: cc-by-4.0
3
+ pipeline_tag: image-to-video
4
  ---
5
 
6
+ # Kiwi-Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
7
 
8
  [Ziyun Zeng](https://stdkonjac.icu/), Yiqi Lin, [Guoqiang Liang](https://ethanliang99.github.io/), and [Mike Zheng Shou](https://cde.nus.edu.sg/ece/staff/shou-zheng-mike/)
9
 
 
12
  [![Code](https://img.shields.io/badge/Code-GitHub%20Repo-blue?logo=github)](https://github.com/showlab/Sparkle)
13
  [![Dataset](https://img.shields.io/badge/🤗%20Dataset-Sparkle-orange.svg)](https://huggingface.co/datasets/stdKonjac/Sparkle)
14
  [![Benchmark](https://img.shields.io/badge/🤗%20Benchmark-Sparkle--Bench-orange.svg)](https://huggingface.co/datasets/stdKonjac/Sparkle-Bench)
 
15
 
16
+ ## 🧠 Model Details
17
 
18
+ **Kiwi-Sparkle** is a video background-replacement model fine-tuned on the **Sparkle** dataset for 10K steps, starting from a [Kiwi-Edit](https://github.com/showlab/Kiwi-Edit) base. It specializes in instruction-guided background replacement, allowing for the synthesis of new, temporally consistent scenes while maintaining accurate foreground-background interactions.
19
 
20
+ The model supports **720P resolution** (720 × 1280) and can generate outputs of up to **81 frames**. Since it applies no architectural modifications to Kiwi-Edit, its weights are fully compatible with the Kiwi-Edit weights structure, and it can be used as a drop-in replacement in any pipeline that runs Kiwi-Edit.
21
 
22
+ ## 🚀 Usage
23
 
24
+ Since Kiwi-Sparkle is architecturally identical to Kiwi-Edit, you can follow the official evaluation pipeline of Kiwi-Edit and swap the checkpoint to Kiwi-Sparkle.
25
 
26
+ ### Inference with OpenVE-Bench
27
 
28
  ```bash
29
  python3 test_benchmark.py \
 
34
  --save_dir ./infer_results/
35
  ```
36
 
37
+ For detailed instructions on evaluating with **Sparkle-Bench**, please refer to the [GitHub repository](https://github.com/showlab/Sparkle).
38
 
39
+ ## 📊 Dataset & Benchmark
40
 
41
+ - **Sparkle Dataset**: A large-scale dataset of ~140K high-quality source–edited video pairs across five themes (location, season, time, style, and OpenVE-3M).
42
+ - **Sparkle-Bench**: The largest evaluation benchmark tailored for video background replacement, consisting of 458 curated videos.
 
43
 
44
  ## 📝 Citation
45
 
46
  ```bibtex
47
  @misc{zeng2026sparkle,
48
  title = {Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance},