Add pipeline tag and improve model card
#1
by nielsr HF Staff - opened

README.md
CHANGED
@@ -1,8 +1,9 @@
---
license: cc-by-4.0
---

# Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

[Ziyun Zeng](https://stdkonjac.icu/), Yiqi Lin, [Guoqiang Liang](https://ethanliang99.github.io/), and [Mike Zheng Shou](https://cde.nus.edu.sg/ece/staff/shou-zheng-mike/)
@@ -11,651 +12,18 @@ license: cc-by-4.0

[](https://github.com/showlab/Sparkle)
[](https://huggingface.co/datasets/stdKonjac/Sparkle)
[](https://huggingface.co/datasets/stdKonjac/Sparkle-Bench)
[](https://huggingface.co/stdKonjac/Kiwi-Sparkle-720P-81F)

The dataset is organized into **five themes**:

| Theme      | Description |
| ---------- |----------------------------------------------------------------------------------------------------------------------------------|
| `location` | Background replaced with a different physical environment (rural, nature, landmark, ...). |
| `season`   | Background changed across seasons (spring, summer, autumn, winter). |
| `time`     | Background changed across times of day (dawn, dusk, night, ...). |
| `style`    | Background restyled (era, mood, cinematic, ...). |
| `openve3m` | A re-creation of the OpenVE-3M background-replacement subset using our pipeline, retained for direct comparison with prior work. |
### 🗂️ Repository Structure

```
Sparkle/
├── README.md
├── prompts/                               # training annotations + dataset-viewer source
│   ├── location_train.csv                 # 4 columns: prompt, src_video, tgt_video, task
│   ├── location_train_metadata.jsonl      # per-task metadata (edit_type, subtheme, original scene)
│   ├── season_train.csv
│   ├── season_train_metadata.jsonl
│   ├── time_train.csv
│   ├── time_train_metadata.jsonl
│   ├── style_train.csv
│   ├── style_train_metadata.jsonl
│   ├── openve3m_train.csv
│   └── openve3m_train_metadata.jsonl
│
├── location/                              # online preview: first 100 samples
│   ├── source_video/
│   │   ├── Sparkle_location_000000.mp4
│   │   └── ... (100 files)
│   └── edited_video/
│       ├── Sparkle_location_000000.mp4
│       └── ... (100 files)
├── season/                                # same structure as location/
├── time/
├── style/
├── openve3m/
│
├── location_source_video_part00.tar       # full corpus, sharded into ~5GB tars
├── location_source_video_part01.tar
├── location_edited_video_part00.tar
├── ...
├── season_*_partXX.tar
├── time_*_partXX.tar
├── style_*_partXX.tar
├── openve3m_*_partXX.tar
│
└── intermediate_data/                     # pipeline intermediates (described below)
    └── ...
```

### 🧾 Training Data Format

We follow the training data format of [Kiwi-Edit](https://github.com/showlab/Kiwi-Edit) for direct compatibility with downstream training pipelines.

Each theme's annotations live in `prompts/{edit_type}_train.csv`, a four-column table:

| Column      | Description |
| ----------- | ----------- |
| `prompt`    | The natural-language editing instruction. |
| `src_video` | Path to the source video, e.g. `location/source_video/Sparkle_location_000000.mp4`. |
| `tgt_video` | Path to the edited video, e.g. `location/edited_video/Sparkle_location_000000.mp4`. |
| `task`      | The unique sample id, e.g. `Sparkle_location_000000`. Joins to the `id` field in the JSONL metadata. |

Per-task auxiliary metadata is stored alongside in `prompts/{edit_type}_train_metadata.jsonl`. Each line is one sample:

```json
{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "metadata": {
    "edit_type": "location",
    "chosen_keyword": "urban: rooftop overlooking skyline",
    "original_scene": "A cobblestone street in a historical European city, ..."
  }
}
```

| Field                     | Description |
| ------------------------- |--------------------------------------------------------------------------------------------------------------------------|
| `id`                      | Sample id, matches the `task` column in the CSV. |
| `prompt`                  | Same as the `prompt` column in the CSV. |
| `metadata.edit_type`      | One of the five themes: `location` / `season` / `time` / `style` / `openve3m` (denoted as `openve3m_background_change`). |
| `metadata.chosen_keyword` | The `subtheme: scene` label (e.g. `"urban: rooftop overlooking skyline"`). Not available for the `openve3m` theme. |
| `metadata.original_scene` | A description of the source video's first frame. |
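As a sketch, a theme's CSV can be joined with its metadata JSONL on `task == id` as follows. The records below are toy stand-ins for the real files, trimmed to the fields described above:

```python
import csv
import io
import json

# Toy stand-ins for prompts/location_train.csv and its metadata JSONL.
csv_text = (
    "prompt,src_video,tgt_video,task\n"
    "Shift the background ...,location/source_video/Sparkle_location_000000.mp4,"
    "location/edited_video/Sparkle_location_000000.mp4,Sparkle_location_000000\n"
)
jsonl_text = json.dumps({
    "id": "Sparkle_location_000000",
    "prompt": "Shift the background ...",
    "metadata": {"edit_type": "location",
                 "chosen_keyword": "urban: rooftop overlooking skyline",
                 "original_scene": "A cobblestone street ..."},
})

# Build an id -> metadata index, then merge it into each CSV row.
meta = {rec["id"]: rec["metadata"]
        for rec in (json.loads(l) for l in jsonl_text.splitlines() if l.strip())}
samples = [{**row, **meta[row["task"]]} for row in csv.DictReader(io.StringIO(csv_text))]

print(samples[0]["edit_type"])  # location
```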

### 👀 Online Preview

The first 100 samples of every theme are stored as uncompressed `.mp4` files under `{edit_type}/source_video/` and `{edit_type}/edited_video/`, and can be played directly in the browser without downloading the full corpus.

For example, for the task `Sparkle_location_000000` (the first row in the **location** theme of the dataset viewer), you can directly browse its [Source Video](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/location/source_video/Sparkle_location_000000.mp4) and [Edited Video](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/location/edited_video/Sparkle_location_000000.mp4).

The dataset viewer at the top of the HF page lets you scroll through all five themes and read the corresponding prompts inline.
### ⬇️ Downloading the Full Corpus

The full ~140K-sample corpus is sharded into ~5GB `.tar` archives at the repository root, named `{edit_type}_{source_video|edited_video}_partXX.tar`.

**Step 1 — Download the tar shards.** Download everything (recommended for full reproduction):

```bash
hf download stdKonjac/Sparkle --repo-type=dataset --local-dir ./Sparkle
```

Or only a single theme (e.g. `location`):

```bash
hf download stdKonjac/Sparkle \
  --repo-type=dataset \
  --local-dir ./Sparkle \
  --include "location_*.tar" "prompts/location_*"
```

Or only the source videos of a theme:

```bash
hf download stdKonjac/Sparkle \
  --repo-type=dataset \
  --local-dir ./Sparkle \
  --include "location_source_video_*.tar"
```

**Step 2 — Extract the tars.** Each tar is **self-contained**: its internal paths are `{edit_type}/{source_video|edited_video}/{task}.mp4`, so extracting any subset of shards in place will populate the corresponding folders correctly. There is **no need to concatenate the parts** before extraction.

```bash
cd ./Sparkle
for f in *.tar; do tar -xf "$f"; done
```

After extraction, the directory layout matches the online preview structure, and the relative paths in `prompts/{edit_type}_train.csv` (e.g. `location/source_video/Sparkle_location_000000.mp4`) will resolve directly.
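As a quick sanity check after extraction, one can verify that every path referenced by a CSV actually exists on disk. The helper below is a hypothetical utility, not part of the release; the toy row mirrors the CSV schema above:

```python
import os

def missing_files(rows, root="."):
    """Return the src/tgt paths from CSV rows that do not exist under root."""
    return [p for row in rows
            for p in (row["src_video"], row["tgt_video"])
            if not os.path.exists(os.path.join(root, p))]

# With nothing extracted yet, both paths of this toy row are reported missing.
rows = [{"src_video": "location/source_video/Sparkle_location_000000.mp4",
         "tgt_video": "location/edited_video/Sparkle_location_000000.mp4"}]
print(len(missing_files(rows)))  # 2 when run outside an extracted Sparkle/ tree
```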

<details>
<summary><h3 style="display: inline">🧪 Pipeline Intermediates</h3></summary>

To support **full reproducibility, transparency, and downstream research**, we additionally release every intermediate artifact produced by the 5-stage Sparkle data pipeline (see *Figure 2: Data Pipeline* in [our paper](https://arxiv.org/abs/2605.06535)) under `intermediate_data/`. **The first 100 samples of every theme are uncompressed and previewable directly in the browser**, mirroring the layout of the `{edit_type}/` preview folders described above.

Taking `Sparkle_location_000000` as a running example, the artifact layout looks like:

```
Sparkle/
└── intermediate_data/
    └── location/
        ├── source_frame0/                     # Stage 2 input: 0-th frame of the source video
        │   └── Sparkle_location_000000.png
        ├── edited_frame0/                     # Stage 2 output: first frame after preliminary background replacement
        │   └── Sparkle_location_000000.png
        ├── edited_frame0_foreground_removed/  # Stage 3 intermediate: foreground-removed clean background image
        │   └── Sparkle_location_000000.png
        ├── edited_background_video/           # Stage 3 output: 81-frame pure background video (no foreground)
        │   └── Sparkle_location_000000.mp4
        ├── source_video_mask/                 # Stage 4 output: BAIT-tracked foreground mask (packed bits)
        │   └── Sparkle_location_000000.npz
        └── edited_video_canny/                # Stage 5 intermediate: decoupled foreground + background Canny edges
            └── Sparkle_location_000000.mp4
```

For the same task `Sparkle_location_000000`, every artifact is browsable online:

| Pipeline stage | Artifact | Preview |
|----------------|--------------------------------------------------|---------|
| Stage 2 (in)   | Source first frame | [`source_frame0/Sparkle_location_000000.png`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/source_frame0/Sparkle_location_000000.png) |
| Stage 2 (out)  | Preliminarily edited first frame | [`edited_frame0/Sparkle_location_000000.png`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/edited_frame0/Sparkle_location_000000.png) |
| Stage 3 (mid)  | Foreground-removed clean background image | [`edited_frame0_foreground_removed/Sparkle_location_000000.png`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/edited_frame0_foreground_removed/Sparkle_location_000000.png) |
| Stage 3 (out)  | Pure background video (81 frames, no foreground) | [`edited_background_video/Sparkle_location_000000.mp4`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/edited_background_video/Sparkle_location_000000.mp4) |
| Stage 4        | BAIT-tracked foreground mask | [`source_video_mask/Sparkle_location_000000.npz`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/source_video_mask/Sparkle_location_000000.npz) |
| Stage 5 (mid)  | Decoupled foreground + background Canny edges | [`edited_video_canny/Sparkle_location_000000.mp4`](https://huggingface.co/datasets/stdKonjac/Sparkle/blob/main/intermediate_data/location/edited_video_canny/Sparkle_location_000000.mp4) |
**Loading the foreground mask.** The masks in `source_video_mask/` are bit-packed for storage efficiency. Each `.npz` file contains two arrays: `mask` (a `np.uint8` array of bits) and `shape` (the original `(T, H, W)` mask shape, where `T ≤ 81`). Unpack with:

```python
import numpy as np

def load_mask(mask_path: str) -> np.ndarray:
    data = np.load(mask_path)
    packed_mask = data["mask"]
    shape = tuple(int(s) for s in data["shape"])
    total = shape[0] * shape[1] * shape[2]
    video_mask = np.unpackbits(packed_mask)[:total].reshape(shape).astype(bool)
    return video_mask  # boolean array of shape (T, H, W)
```
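A quick round-trip sanity check of this format. The packing side is our assumption of how the `.npz` files were produced (`np.packbits` is the natural inverse of the `np.unpackbits` call above):

```python
import numpy as np

# Pack a toy (T, H, W) boolean mask, then unpack it exactly as load_mask does.
mask = np.zeros((3, 4, 5), dtype=bool)
mask[0, 1, 2] = mask[2, 3, 4] = True
packed = np.packbits(mask.astype(np.uint8))  # bit-packed, padded to a byte boundary

shape = mask.shape
total = shape[0] * shape[1] * shape[2]
restored = np.unpackbits(packed)[:total].reshape(shape).astype(bool)

assert np.array_equal(mask, restored)
print("round-trip OK")
```

The `[:total]` slice matters: `np.unpackbits` returns a multiple of 8 bits, so the padding added by `np.packbits` must be trimmed before reshaping.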

**Downloading the full intermediates.** Like the main corpus, the full intermediates for every theme are sharded into ~5GB `.tar` archives, stored under `intermediate_data/` and named `{edit_type}_{subdir}_partXX.tar` where `{subdir}` is one of the six folder names above. Download and extract them as follows:

```bash
# Download all intermediates for a single theme (e.g. location)
hf download stdKonjac/Sparkle \
  --repo-type=dataset \
  --local-dir ./Sparkle \
  --include "intermediate_data/location_*_part*.tar"

# Extract in place; tar-internal paths are {edit_type}/{subdir}/{file},
# so the working directory must be intermediate_data/ for the layout to align.
cd ./Sparkle/intermediate_data
for f in location_*_part*.tar; do tar -xf "$f"; done
```

After extraction, the layout matches the online preview structure exactly, populating `intermediate_data/location/{source_frame0, edited_frame0, ...}/`.

#### 📋 Per-task Pipeline Metadata

In addition to the per-task artifacts, each theme's `intermediate_data/{edit_type}/` folder also contains five `.jsonl` files recording metadata produced at various stages of the pipeline (e.g., quality scores, foreground grounding labels). These records are useful for **reproducing our quality filtering**, **inspecting per-stage rejection statistics**, or **building stricter / looser variants of Sparkle for downstream research**.

**`edited_frame0_score.jsonl`** records per-sample [EditScore](https://github.com/VectorSpaceLab/EditScore) evaluation of the Stage 2 output (`edited_frame0/{task}.png`). One JSON object per line:

```json
{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "editscore": {
    "prompt_following": 9.7,
    "consistency": 8.8,
    "perceptual_quality": 8.5,
    "overall": 8.62887857991077,
    "SC_reasoning": "The edited image perfectly follows the instruction: ...",
    "PQ_reasoning": "The image displays a realistic cityscape with convincing lighting ..."
  }
}
```

| Field                          | Description |
|--------------------------------|------------------------------------------------------------------------------|
| `id`                           | Sample id, matches the `task` column in the CSV. |
| `prompt`                       | The editing instruction. |
| `editscore.prompt_following`   | Sub-score (0–10): how well the edit follows the instruction. |
| `editscore.consistency`        | Sub-score (0–10): subject and identity consistency with the source frame. |
| `editscore.perceptual_quality` | Sub-score (0–10): perceptual quality of the edited image. |
| `editscore.overall`            | Aggregated overall score. **We filter out samples with `overall < 8`.** |
| `editscore.SC_reasoning`       | Free-text rationale for the consistency / instruction-following sub-scores. |
| `editscore.PQ_reasoning`       | Free-text rationale for the perceptual-quality sub-score. |
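The `overall < 8` filter can be reproduced in a few lines. The records below are toy stand-ins, trimmed to the fields involved in filtering:

```python
import json

# One JSON object per line, as in edited_frame0_score.jsonl.
records = [
    '{"id": "Sparkle_location_000000", "editscore": {"overall": 8.63}}',
    '{"id": "Sparkle_location_000001", "editscore": {"overall": 7.95}}',
]
kept = [json.loads(line)["id"] for line in records
        if json.loads(line)["editscore"]["overall"] >= 8]
print(kept)  # ['Sparkle_location_000000']
```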
| 252 |
-
**`edited_frame0_foreground_removed_score.jsonl`** records per-sample [EditScore](https://github.com/VectorSpaceLab/EditScore) evaluation of the Stage 3 intermediate output (`edited_frame0_foreground_removed/{task}.png`), measuring the foreground-removal quality. The schema is identical to `edited_frame0_score.jsonl`:
|
| 253 |
-
|
| 254 |
-
```json
|
| 255 |
-
{
|
| 256 |
-
"id": "Sparkle_location_000000",
|
| 257 |
-
"prompt": "...",
|
| 258 |
-
"editscore": {
|
| 259 |
-
"prompt_following": ...,
|
| 260 |
-
"consistency": ...,
|
| 261 |
-
"perceptual_quality": ...,
|
| 262 |
-
"overall": ...,
|
| 263 |
-
"SC_reasoning": "...",
|
| 264 |
-
"PQ_reasoning": "..."
|
| 265 |
-
}
|
| 266 |
-
}
|
| 267 |
-
```
|
| 268 |
-
|
| 269 |
-
At this stage we apply a stricter threshold and **filter out samples with `overall < 8.5`** to guarantee a perfectly clean background before the I2V generation that follows.
|
| 270 |
-
|
| 271 |
-
**`foreground_grounding_r1.jsonl`** records the **first-round VLM grounding** result that compares the source first frame and the Stage 2 edited first frame to identify foreground objects to preserve. This is the labeling step described in Stage 3 of the pipeline. One JSON object per line:

```json
{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "edit_type": "location",
  "round1_labels": [
    "woman in brown hat and coat",
    "clasped hands with ring",
    "striped shirt under coat",
    "brown wide-brimmed hat"
  ],
  "round1_objects": [
    {"bbox_2d": [447, 27, 765, 998], "label": "woman in brown hat and coat"},
    {"bbox_2d": [515, 800, 615, 980], "label": "clasped hands with ring"},
    {"bbox_2d": [490, 398, 615, 800], "label": "striped shirt under coat"},
    {"bbox_2d": [505, 27, 710, 258], "label": "brown wide-brimmed hat"}
  ]
}
```

| Field            | Description |
|------------------|------------------------------------------------------------------------------------------|
| `id`             | Sample id, matches the `task` column in the CSV. |
| `prompt`         | The editing instruction. |
| `edit_type`      | The theme this sample belongs to (`location` / `season` / `time` / `style` / `openve3m`). |
| `round1_labels`  | List of foreground-object labels detected by the VLM. |
| `round1_objects` | Per-object detection records; each item has a `bbox_2d` and a `label`. |

The bounding boxes are detected on the **source first frame** (`source_frame0/{task}.png`). Since our pipeline preserves the foreground identity and pose during background replacement, these boxes apply equally to the corresponding edited first frame (`edited_frame0/{task}.png`).

<a id="normalize-bbox"></a>

The `bbox_2d` field follows Qwen3-VL's **normalized coordinate format** with values in the range `[0, 1000]`, representing `[x1, y1, x2, y2]` (top-left and bottom-right corners). Convert them to absolute pixel coordinates of the real frame as follows:

```python
def normalize_bbox(bbox, video_width: int, video_height: int):
    """Convert a Qwen3-VL [0, 1000]-normalized bbox to absolute pixel coordinates."""
    x1 = int(bbox[0] / 1000.0 * video_width)
    y1 = int(bbox[1] / 1000.0 * video_height)
    x2 = int(bbox[2] / 1000.0 * video_width)
    y2 = int(bbox[3] / 1000.0 * video_height)

    # Clamp to frame bounds and ensure x1 <= x2, y1 <= y2.
    x1 = max(0, min(min(x1, x2), video_width - 1))
    y1 = max(0, min(min(y1, y2), video_height - 1))
    x2 = max(0, min(max(x1, x2), video_width - 1))
    y2 = max(0, min(max(y1, y2), video_height - 1))
    return x1, y1, x2, y2
```
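A worked example of the conversion for a hypothetical 1280×720 frame, using the first bbox from the Round 1 sample above (the helper is redefined here so the snippet runs standalone):

```python
def normalize_bbox(bbox, video_width, video_height):
    """Convert a [0, 1000]-normalized bbox to absolute pixel coordinates."""
    x1 = int(bbox[0] / 1000.0 * video_width)
    y1 = int(bbox[1] / 1000.0 * video_height)
    x2 = int(bbox[2] / 1000.0 * video_width)
    y2 = int(bbox[3] / 1000.0 * video_height)
    # Clamp to frame bounds and ensure x1 <= x2, y1 <= y2.
    x1 = max(0, min(min(x1, x2), video_width - 1))
    y1 = max(0, min(min(y1, y2), video_height - 1))
    x2 = max(0, min(max(x1, x2), video_width - 1))
    y2 = max(0, min(max(y1, y2), video_height - 1))
    return x1, y1, x2, y2

# [447, 27, 765, 998] on a 1280x720 frame (frame size is illustrative).
print(normalize_bbox([447, 27, 765, 998], 1280, 720))  # (572, 19, 979, 718)
```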

**`foreground_grounding_r2.jsonl`** records the **second-round VLM grounding** result that produces the temporal anchors for Stage 4 (BAIT Foreground Tracking). Building on the labels from `foreground_grounding_r1.jsonl`, Qwen3-VL is asked to re-locate every Round 1 label on frames sampled at 2 FPS from the source video, yielding per-frame bounding boxes that anchor the subsequent SAM3 multi-pass tracking. One JSON object per line:

```json
{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "edit_type": "location",
  "round1_labels": [...],
  "round1_objects": [...],
  "frame_objects": [
    [
      {"bbox_2d": [448, 26, 765, 998], "label": "woman in brown hat and coat"},
      {"bbox_2d": [521, 795, 618, 968], "label": "clasped hands with ring"},
      {"bbox_2d": [545, 420, 625, 805], "label": "striped shirt under coat"},
      {"bbox_2d": [507, 26, 712, 270], "label": "brown wide-brimmed hat"}
    ],
    [
      {"bbox_2d": [452, 34, 764, 998], "label": "woman in brown hat and coat"},
      {"bbox_2d": [505, 784, 600, 955], "label": "clasped hands with ring"},
      ...
    ],
    ...
  ]
}
```

The schema extends `foreground_grounding_r1.jsonl` with a single new field:

| Field           | Description |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `frame_objects` | A 2D list of grounding results, one inner list per 2 FPS-sampled frame. Each inner list mirrors the `round1_objects` schema (a list of `{"bbox_2d": [...], "label": "..."}` items), giving the per-frame bbox of every Round 1 label on that frame. |

The other fields (`id`, `prompt`, `edit_type`, `round1_labels`, `round1_objects`) are inherited unchanged from `foreground_grounding_r1.jsonl`. Use the same [`normalize_bbox`](#normalize-bbox) helper to convert `bbox_2d` values to absolute pixel coordinates.

> **Note.** Some entries in `frame_objects` may have an empty `bbox_2d` (e.g. `{"bbox_2d": [], "label": "..."}`), indicating that the VLM failed to localize that particular label on that frame. Our BAIT algorithm handles these gracefully by relying on the remaining frames' anchors and a pixel-wise majority vote across SAM3 tracking passes.
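When consuming `frame_objects` yourself, a minimal sketch of dropping the unlocalized labels before using a frame's boxes as anchors (the frame below is a toy excerpt following the schema above):

```python
# One inner list of frame_objects; an empty bbox_2d marks a failed localization.
frame = [
    {"bbox_2d": [448, 26, 765, 998], "label": "woman in brown hat and coat"},
    {"bbox_2d": [], "label": "clasped hands with ring"},  # VLM failed on this frame
]
anchors = [obj for obj in frame if obj["bbox_2d"]]  # keep only usable boxes
print([obj["label"] for obj in anchors])  # ['woman in brown hat and coat']
```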

**`edited_video_score.jsonl`** records per-sample [EditScore](https://github.com/VectorSpaceLab/EditScore) evaluation of the **Stage 5 final synthesized video**. Following the protocol in our paper, we uniformly sample four non-first frames from each video and score them independently. One JSON object per line:

```json
{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "frame_indices": [1, 26, 51, 76],
  "editscore": [
    {
      "SC_score": 9.0,
      "PQ_score": 8.5,
      "O_score": 8.719958110896453,
      "SC_score_reasoning": "The editing successfully changed the background to a rooftop overlooking a modern city skyline at dusk, ...",
      "PQ_score_reasoning": "The image has a mostly natural cityscape and lighting, but the person's hands appear slightly distorted ...",
      "SC_raw_output": "...",
      "PQ_raw_output": "..."
    },
    { "SC_score": 8.3, "PQ_score": 8.5, "O_score": 8.388302424289282, "...": "..." },
    { "SC_score": 9.1, "PQ_score": 7.4, "O_score": 8.143194240945185, "...": "..." },
    { "SC_score": 8.9, "PQ_score": 7.8, "O_score": 8.318623075017307, "...": "..." }
  ]
}
```

| Field                             | Description |
|-----------------------------------|-----------------------------------------------------------------------------------------------------------|
| `id`                              | Sample id, matches the `task` column in the CSV. |
| `prompt`                          | The editing instruction. |
| `frame_indices`                   | The 4 frame indices (0-based) sampled from the synthesized video for evaluation, e.g. `[1, 26, 51, 76]`. |
| `editscore`                       | A length-4 list, one entry per sampled frame, in the same order as `frame_indices`. |
| `editscore[i].SC_score`           | Sub-score (0–10) for instruction-following / consistency on frame `i`. |
| `editscore[i].PQ_score`           | Sub-score (0–10) for perceptual quality on frame `i`. |
| `editscore[i].O_score`            | Aggregated overall score on frame `i`. |
| `editscore[i].SC_score_reasoning` | Free-text rationale behind `SC_score`. |
| `editscore[i].PQ_score_reasoning` | Free-text rationale behind `PQ_score`. |
| `editscore[i].SC_raw_output`      | Raw JSON string returned by the EditScore SC head (contains `reasoning` and per-criterion `score` array). |
| `editscore[i].PQ_raw_output`      | Raw JSON string returned by the EditScore PQ head. |

The final filtering rule is: **average `O_score` across all four sampled frames; discard the sample if the mean is below `8`.**
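Applied to the example record above, the video-level filter is simply:

```python
from statistics import mean

# The four per-frame O_scores from the example record above.
o_scores = [8.719958110896453, 8.388302424289282,
            8.143194240945185, 8.318623075017307]
keep = mean(o_scores) >= 8  # discard the sample if the mean is below 8
print(round(mean(o_scores), 3), keep)  # 8.393 True
```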

</details>

### 📜 Dataset License

The Sparkle dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.

Source videos in the `openve3m` theme are derived from [OpenVE-3M](https://arxiv.org/abs/2512.07826) and retain their original licenses; please consult the upstream source before redistribution.
## 🎯 Benchmark

**Sparkle-Bench** is the largest evaluation benchmark tailored for instruction-guided video background replacement, comprising **458 carefully curated videos across 4 themes, 21 subthemes, and 97 distinct scenes**. It is fully open-sourced at [🤗stdKonjac/Sparkle-Bench](https://huggingface.co/datasets/stdKonjac/Sparkle-Bench). For evaluation methodology and our six-dimensional scoring protocol, please refer to [our paper](https://arxiv.org/abs/2605.06535).

**All source videos in the benchmark are uncompressed and previewable directly in the browser**, so users can inspect any sample without downloading anything.

The benchmark is organized into **four themes**:

| Theme      | Description |
| ---------- |-------------------------------------------------------------------------------------------|
| `location` | Background replaced with a different physical environment (rural, nature, landmark, ...). |
| `season`   | Background changed across seasons (spring, summer, autumn, winter). |
| `time`     | Background changed across times of day (dawn, dusk, night, ...). |
| `style`    | Background restyled (era, mood, cinematic, ...). |
### 🗂️ Repository Structure

```
Sparkle-Bench/
├── README.md
├── location_bench.csv        # 3 columns: edited_type, prompt, original_video
├── location_metadata.jsonl   # per-task metadata (edit_type, subtheme, original scene)
├── season_bench.csv
├── season_metadata.jsonl
├── time_bench.csv
├── time_metadata.jsonl
├── style_bench.csv
├── style_metadata.jsonl
├── source_videos/            # all 458 source videos, browsable online
│   ├── location/
│   │   ├── Sparkle_location_000011.mp4
│   │   └── ...
│   ├── season/
│   ├── time/
│   └── style/
└── ref_images/               # optional reference background images (see below)
    ├── location/
    ├── season/
    ├── time/
    └── style/
```
### 🧾 Benchmark Format

We follow the format of [OpenVE-Bench](https://huggingface.co/datasets/Lewandofski/OpenVE-Bench) for direct compatibility with existing evaluation pipelines.

Each theme's evaluation prompts live in `{edit_type}_bench.csv`, a three-column table:

| Column           | Description |
|------------------|---------------------------------------------------------------------------------------------------|
| `edited_type`    | The theme of this sample, one of `location` / `season` / `time` / `style`. |
| `prompt`         | The natural-language editing instruction. |
| `original_video` | Path to the source video, e.g. `source_videos/location/Sparkle_location_010913.mp4`. |

Per-task auxiliary metadata is stored alongside in `{edit_type}_metadata.jsonl`. Each line is one sample:

```json
{
  "id": "Sparkle_location_004302",
  "prompt": "Put the subject against ancient stone ruins overgrown with wind-swept grass, ...",
  "metadata": {
    "edit_type": "location",
    "chosen_keyword": "landmark: ancient stone ruins with wind-swept grass",
    "original_scene": "A dimly lit indoor bar or restaurant with brick walls, framed artwork, and warm overhead lighting."
  }
}
```

| Field                     | Description |
|---------------------------|-------------------------------------------------------------------------------------------------------------|
| `id`                      | Sample id, e.g. `Sparkle_location_004302`. Matches the basename of the corresponding `original_video` path. |
| `prompt`                  | Same as the `prompt` column in the CSV. |
| `metadata.edit_type`      | The theme this sample belongs to (`location` / `season` / `time` / `style`). |
| `metadata.chosen_keyword` | The `subtheme: scene` label (e.g. `"landmark: ancient stone ruins with wind-swept grass"`). |
| `metadata.original_scene` | A description of the source video's first frame. |
### 👀 Online Preview

All 458 source videos are stored as uncompressed `.mp4` files under `source_videos/{edit_type}/`, and can be played directly in the browser without any download.

For example, the source video of task `Sparkle_location_000011` (the first row in the **location** theme of the dataset viewer) is browsable at: [Sparkle_location_000011](https://huggingface.co/datasets/stdKonjac/Sparkle-Bench/blob/main/source_videos/location/Sparkle_location_000011.mp4).

The dataset viewer at the top of the HF page lets you scroll through all four themes and read the corresponding prompts inline.
### ⬇️ Downloading the Benchmark

Sparkle-Bench is small enough to download in one command. Pull the entire repo:

```bash
hf download stdKonjac/Sparkle-Bench --repo-type=dataset --local-dir ./Sparkle-Bench
```

Or download only a single theme (e.g. `location`):

```bash
hf download stdKonjac/Sparkle-Bench \
    --repo-type=dataset \
    --local-dir ./Sparkle-Bench \
    --include "location_*" "source_videos/location/*"
```

After downloading, the relative paths in `{edit_type}_bench.csv` (e.g. `source_videos/location/Sparkle_location_010913.mp4`) will resolve directly.

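As a quick sanity check after downloading, the relative paths can be resolved against the local benchmark root with plain `pathlib`. This is a minimal sketch; the `rel` value below is the example path from the text above, hard-coded rather than read from the CSV:

```python
from pathlib import Path

# Local root where Sparkle-Bench was downloaded.
bench_root = Path("./Sparkle-Bench")

# Example relative path as it appears in {edit_type}_bench.csv.
rel = "source_videos/location/Sparkle_location_010913.mp4"

# Join against the root; this is the file the evaluation script will read.
abs_path = bench_root / rel
print(abs_path.as_posix())
```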
### 📊 Evaluation

We provide an end-to-end evaluation script, [`eval_sparkle_bench_gemini.py`](https://github.com/showlab/Sparkle/blob/main/eval_sparkle_bench_gemini.py), that scores edited videos using Gemini-2.5-Pro under our six-dimensional rubric (see *Section 3.7* in [our paper](https://arxiv.org/abs/2605.06535)). The six dimensions are: **Instruction Compliance**, **Overall Visual Quality**, **Foreground Integrity**, **Foreground Motion Consistency**, **Background Dynamics**, and **Background Visual Quality**, each scored on a 1–5 scale.

#### 1. Prepare your inference outputs

The script expects edited videos to be organized in a specific directory tree. For every sample in Sparkle-Bench, the inference output should be saved as:

```
{save_dir}/{edit_type}/{subtheme}---{scene_key}/{id}_edited.mp4
```

where:

- `{save_dir}` is your inference root (free to choose).
- `{edit_type}` is one of `location` / `season` / `time` / `style`.
- `{subtheme}---{scene_key}` is derived from the sample's `chosen_keyword` field in `{edit_type}_metadata.jsonl`: splitting `chosen_keyword` on `": "` yields `subtheme: scene`, then `scene_key = scene.replace(" ", "_")`. The triple dash `---` is the separator between the two parts.
- `{id}` is the sample id, e.g. `Sparkle_location_000172`.

For example, the inference outputs across the four themes should look like:

```
{save_dir}/
├── location/
│   └── landmark---ancient_stone_ruins_with_wind-swept_grass/
│       └── Sparkle_location_000172_edited.mp4
├── season/
│   └── {subtheme}---{scene_key}/
│       └── Sparkle_season_xxxxxx_edited.mp4
├── time/
│   └── {subtheme}---{scene_key}/
│       └── Sparkle_time_xxxxxx_edited.mp4
└── style/
    └── {subtheme}---{scene_key}/
        └── Sparkle_style_xxxxxx_edited.mp4
```
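The naming rule can be sketched in a few lines of Python. This is an illustrative helper of ours, not the project's actual code; `save_dir` is whatever inference root you chose:

```python
def edited_video_path(save_dir: str, edit_type: str,
                      chosen_keyword: str, sample_id: str) -> str:
    # chosen_keyword looks like "landmark: ancient stone ruins with wind-swept grass".
    subtheme, scene = chosen_keyword.split(": ", 1)
    # Spaces in the scene description become underscores.
    scene_key = scene.replace(" ", "_")
    return f"{save_dir}/{edit_type}/{subtheme}---{scene_key}/{sample_id}_edited.mp4"

print(edited_video_path(
    "infer_results", "location",
    "landmark: ancient stone ruins with wind-swept grass",
    "Sparkle_location_000172",
))
# → infer_results/location/landmark---ancient_stone_ruins_with_wind-swept_grass/Sparkle_location_000172_edited.mp4
```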
#### 2. Configure the Gemini API

By default the script uses **Azure-hosted Gemini via the OpenAI-compatible API** for convenient concurrency. Export two environment variables before running:

```bash
export AZURE_ENDPOINT="https://your-azure-endpoint"
export GEMINI_API_KEY="your-api-key"
```

If you have direct access to the Gemini API, you can swap the `GEMINI_API` client at the top of the script for the native [`google-genai`](https://github.com/googleapis/python-genai) SDK. The request payload only needs `(system prompt, source video, edited video)`, so the adaptation is straightforward; just keep the `temperature=0` / `seed=42` settings for reproducibility.

#### 3. Run the evaluation

Assuming Sparkle-Bench has been downloaded to `data/Sparkle-Bench/` (the default `--bench_root`):

```bash
python3 eval_sparkle_bench_gemini.py \
    --video_paths /path/to/sparkle_bench_results/
```

For multiple checkpoints in one run:

```bash
python3 eval_sparkle_bench_gemini.py \
    --video_paths /path/to/ckpt_a/sparkle_bench/ \
                  /path/to/ckpt_b/sparkle_bench/ \
                  /path/to/ckpt_c/sparkle_bench/
```

By default the script evaluates all four themes (`location`, `season`, `time`, `style`); pass `--edit_types` to restrict to a subset. Concurrency is controlled inside the script (default: 20 workers).

#### 4. Read the output

For each `(save_dir, edit_type)` pair, the script writes:

```
{save_dir}/{edit_type}_gemini-2.5-pro_sparkle_score.jsonl
```

Each line is a per-sample record containing the six-dimensional scores plus the original Gemini reasoning:

```json
{
  "id": "Sparkle_location_000172",
  "prompt": "Put the subject against ancient stone ruins overgrown with wind-swept grass, ...",
  "edit_type": "location",
  "subtheme": "landmark",
  "scene": "ancient stone ruins with wind-swept grass",
  "scores": [5, 5, 5, 5, 5, 5],
  "result": "Brief reasoning: The edited background perfectly matches every detail of the prompt, ...\nInstruction Compliance: 5\nOverall Visual Quality: 5\nForeground Integrity: 5\nForeground Motion Consistency: 5\nBackground Dynamics: 5\nBackground Visual Quality: 5"
}
```

The `scores` array follows this fixed order: `[Instruction Compliance, Overall Visual Quality, Foreground Integrity, Foreground Motion Consistency, Background Dynamics, Background Visual Quality]`. Following the OpenVE-Bench protocol, the script automatically caps dimensions 2–6 at the Instruction Compliance score to prevent score hacking.

After scoring, the script aggregates per-theme and macro averages and prints a summary table to stdout. The evaluation is **deterministic** by design (`temperature=0`, fixed `seed=42`) for reproducibility.
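The capping and averaging rules are simple to state in code. The sketch below is our own minimal illustration of the protocol described above, not the script's exact implementation:

```python
def cap_scores(scores: list[int]) -> list[int]:
    # Dimension 1 is Instruction Compliance; dimensions 2-6 are capped at it
    # (following the OpenVE-Bench protocol) to prevent score hacking.
    ic = scores[0]
    return [ic] + [min(s, ic) for s in scores[1:]]

def per_dim_average(all_scores: list[list[int]]) -> list[float]:
    # Apply the cap per sample, then average each dimension across samples.
    capped = [cap_scores(s) for s in all_scores]
    n = len(capped)
    return [sum(sample[d] for sample in capped) / n for d in range(6)]

print(cap_scores([3, 5, 4, 2, 5, 5]))
# → [3, 3, 3, 2, 3, 3]
print(per_dim_average([[5, 5, 5, 5, 5, 5], [3, 5, 4, 2, 5, 5]]))
# → [4.0, 4.0, 4.0, 3.5, 4.0, 4.0]
```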
### 🖼️ Reference Images (Optional, Use with Caution)

By construction, every Sparkle-Bench sample is a video that **passed the first four stages of our pipeline but failed the final synthesis quality check in Stage 5** (see Section 3.7 of [our paper](https://arxiv.org/abs/2605.06535)). As a free byproduct, each sample comes with a **pure background image** generated by Stage 3 (Individual Background Generation), in which the foreground has been removed from the preliminarily edited first frame.

We release these images under `ref_images/{edit_type}/{id}.png`, alongside the CSV/JSONL annotations. They may be useful for **reference-based** background-replacement experiments (e.g., feeding the clean background as an extra visual condition to the editing model).

> **⚠️ Disclaimer.** Our paper neither trains any reference-based model nor includes any reference-image-based evaluation. We release `ref_images/` purely to **facilitate future research** in this direction. The images are **not curated** and may contain noise such as low-quality edits or imperfect foreground removal. Please **use them with caution**; we make no quality guarantees about this auxiliary asset.

### 📜 Benchmark License

Sparkle-Bench is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.

Source videos are derived from [OpenVE-3M](https://arxiv.org/abs/2512.07826) and retain their original licenses; please consult the upstream source before redistribution.

## 🧠 Model

We release **Kiwi-Sparkle**, a video background-replacement model fine-tuned on the **Sparkle** dataset for **10K steps** with a batch size of 128, starting from a [Kiwi-Edit](https://github.com/showlab/Kiwi-Edit) base. Since we apply no architectural modifications to Kiwi-Edit, **Kiwi-Sparkle's weights are fully compatible with the Kiwi-Edit weight structure**: any inference, training, or deployment pipeline that runs Kiwi-Edit can run Kiwi-Sparkle as a drop-in replacement.

The model is open-sourced at [🤗 stdKonjac/Kiwi-Sparkle-720P-81F](https://huggingface.co/stdKonjac/Kiwi-Sparkle-720P-81F) and supports **720P resolution** with up to **81-frame outputs**.

| Setting | Value |
|---------|-------|
| Foundation model | [Kiwi-Edit-Stage2 (Image + Video)](https://huggingface.co/linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage2_img_vid_720x1280_81f) |
| Resolution | 720 × 1280 |
| Max output frames | 81 |
| Fine-tuning steps | 10,000 |
| Batch size | 128 |
| Architectural changes | None; drop-in compatible with Kiwi-Edit. |

### 🚀 Training

Kiwi-Sparkle is trained using the official Kiwi-Edit recipe in [this script](https://github.com/showlab/Kiwi-Edit/blob/main/scripts/run_wan2.2_ti2v_5b_qwen25vl_3b_stage2_img_vid_720x1280_81f.sh) with no modifications. Two common entry points are supported.

**Train from the Kiwi-Edit base on a Sparkle theme.** Point `--vid_dataset_metadata_path` to the corresponding Sparkle training CSV, and load the foundation [Kiwi-Edit-Stage2](https://huggingface.co/linyq/wan2.2_ti2v_5b_qwen25vl_3b_stage2_img_vid_720x1280_81f) checkpoint:

```bash
--vid_dataset_metadata_path /path/to/Sparkle/prompts/{edit_type}_train.csv
--checkpoint /path/to/Kiwi-Edit-Stage2/model.safetensors
```

where `{edit_type}` is one of `location` / `season` / `time` / `style` / `openve3m`. The five training CSVs are hosted [here](https://huggingface.co/datasets/stdKonjac/Sparkle/tree/main/prompts).

**Continue training from our Kiwi-Sparkle checkpoint.** Replace the `--checkpoint` argument:

```bash
--checkpoint /path/to/Kiwi-Sparkle-720P-81F/model.safetensors
```

The rest of the script stays exactly as in the official Kiwi-Edit setup.

### 🎬 Inference
#### OpenVE-Bench

Since Kiwi-Sparkle is architecturally identical to Kiwi-Edit, you can simply follow the official OpenVE-Bench evaluation pipeline of Kiwi-Edit and swap the checkpoint to Kiwi-Sparkle. For example:

```bash
python3 test_benchmark.py \
    ... \
    --save_dir ./infer_results/
```
|
| 668 |
|
| 669 |
-
|
| 670 |
|
| 671 |
-
|
| 672 |
|
| 673 |
-
**
|
| 674 |
-
|
| 675 |
-
**Step 2.** Edit the shell script to point at your Kiwi-Sparkle checkpoint, then launch (defaults to 8 GPUs):

```bash
bash test_benchmark_sparkle_bench.sh
```

The script writes inference outputs to `infer_results/Kiwi-Sparkle-720P-81F/sparkle_bench/{edit_type}/{subtheme}---{scene_key}/{id}_edited.mp4`. Re-run it with a different `EDIT_TYPE` to cover all four themes.

**Step 3.** Score the outputs with our [Gemini-based evaluator](#-evaluation):

```bash
python3 eval_sparkle_bench_gemini.py \
    --video_paths infer_results/Kiwi-Sparkle-720P-81F/sparkle_bench/
```

See the [Evaluation section](#-evaluation) above for details on environment setup, output format, and the six-dimensional scoring rubric.

### 📜 Model License

Kiwi-Sparkle is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.

## 🙏 Acknowledgements

This project is built on top of a number of excellent open-source projects. We thank the authors of [Kiwi-Edit](https://github.com/showlab/Kiwi-Edit), [FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B), [Qwen3-VL-32B](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct), [Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B), [LightX2V](https://github.com/ModelTC/lightx2v), and [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) for releasing the infrastructure that made this work possible.

## 📝 Citation

If you find Sparkle useful for your research, please consider citing our paper:

```bibtex
@misc{zeng2026sparkle,
  title = {Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance},
---
license: cc-by-4.0
pipeline_tag: image-to-video
---

# Kiwi-Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

[Ziyun Zeng](https://stdkonjac.icu/), Yiqi Lin, [Guoqiang Liang](https://ethanliang99.github.io/), and [Mike Zheng Shou](https://cde.nus.edu.sg/ece/staff/shou-zheng-mike/)

[](https://github.com/showlab/Sparkle)
|
| 13 |
[](https://huggingface.co/datasets/stdKonjac/Sparkle)
|
| 14 |
[](https://huggingface.co/datasets/stdKonjac/Sparkle-Bench)
|
|
|
|
| 15 |
|
| 16 |
+
## 🧠 Model Details

**Kiwi-Sparkle** is a video background-replacement model fine-tuned on the **Sparkle** dataset for 10K steps, starting from a [Kiwi-Edit](https://github.com/showlab/Kiwi-Edit) base. It specializes in instruction-guided background replacement, synthesizing new, temporally consistent scenes while maintaining accurate foreground-background interactions.

The model supports **720P resolution** (720 × 1280) and can generate outputs of up to **81 frames**. It is architecturally compatible with the Kiwi-Edit weight structure and can be used as a drop-in replacement in compatible pipelines.

## 🚀 Usage

Since Kiwi-Sparkle is architecturally identical to Kiwi-Edit, you can follow the official evaluation pipeline of Kiwi-Edit and swap the checkpoint to Kiwi-Sparkle.

### Inference with OpenVE-Bench

```bash
python3 test_benchmark.py \
    ... \
    --save_dir ./infer_results/
```

For detailed instructions on evaluating with **Sparkle-Bench**, please refer to the [GitHub repository](https://github.com/showlab/Sparkle).

## 📊 Dataset & Benchmark

- **Sparkle Dataset**: A large-scale dataset of ~140K high-quality source–edited video pairs across five themes (location, season, time, style, and OpenVE-3M).
- **Sparkle-Bench**: The largest evaluation benchmark tailored for video background replacement, consisting of 458 curated videos.
## 📝 Citation

```bibtex
@misc{zeng2026sparkle,
  title = {Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance},