# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project goal

Build a Hugging Face **Space** (ZeroGPU, Gradio SDK) that lets users upload / pick images and get
**pairwise or one-vs-many style-similarity scores** produced by **MegaStyle-Encoder** from the paper
*MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping*
([arXiv:2604.08364](https://arxiv.org/abs/2604.08364), [project page](https://jeoyal.github.io/MegaStyle/),
[upstream code](https://github.com/Tencent/MegaStyle)).

The Space is a **comparison/analysis tool**, not a style-transfer demo — it should surface a
similarity matrix / ranked grid / heatmap, not generate images. The style-transfer sibling model
(`megastyle_flux.safetensors`) is explicitly **out of scope** unless the user asks for it — a ZeroGPU
slot can load SigLIP but cannot sanely run a 12B FLUX model.

## The model contract (critical — easy to get wrong)

MegaStyle-Encoder is **not** a standalone architecture. It is **SigLIP** with fine-tuned weights:

- **Backbone**: `google/siglip-so400m-patch14-384` loaded via `transformers.SiglipVisionModel`
- **Processor**: upstream uses `transformers.AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")`;
  this Space loads `SiglipImageProcessor` directly instead (see Conventions below)
- **Weights**: `megastyle_encoder.pth` from [Gaojunyao/MegaStyle](https://huggingface.co/Gaojunyao/MegaStyle)
  (~857 MB). Loaded with `torch.load(..., map_location="cpu")`, then:
  - The checkpoint may be either a raw `state_dict` or `{"model": state_dict}` — handle both.
  - Apply with `model.load_state_dict(state, strict=False)` (non-strict is correct, upstream does this).
- **Embedding**: `model(pixel_values=...).pooler_output`, then **L2-normalize** (`emb / emb.norm(p=2, dim=-1, keepdim=True)`).
- **Similarity**: plain dot product of two normalized embeddings (= cosine similarity, range [-1, 1]).

Reference implementation is `style_score.py` in the upstream repo — treat it as the authoritative
contract for any inference code in this Space. Do not "improve" the preprocessing, pooling choice, or
normalization without strong reason; the checkpoint was trained to produce a useful metric only under
exactly this pipeline.
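A minimal sketch of that contract (illustrative helper names, not the Space's actual `app.py`; assumes `transformers` is installed and `weights_path` points at a local copy of `megastyle_encoder.pth`, e.g. from `hf_hub_download`):

```python
import torch


def unwrap_state_dict(ckpt):
    """Handle both checkpoint shapes: raw state_dict or {"model": state_dict}."""
    return ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt


def l2_normalize(emb: torch.Tensor) -> torch.Tensor:
    """L2-normalize embeddings along the feature dimension."""
    return emb / emb.norm(p=2, dim=-1, keepdim=True)


def style_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Dot product of normalized embeddings = cosine similarity."""
    return float((l2_normalize(a) * l2_normalize(b)).sum(dim=-1))


def load_encoder(weights_path: str):
    """SigLIP backbone + MegaStyle weights, applied non-strictly as upstream does."""
    from transformers import SiglipVisionModel

    model = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
    state = torch.load(weights_path, map_location="cpu")
    model.load_state_dict(unwrap_state_dict(state), strict=False)
    return model.eval()
```

Embeddings come from `model(pixel_values=...).pooler_output`; everything downstream is just `l2_normalize` + dot product.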

## Deployment target: ZeroGPU (but the workload is GPU-light)

This Space targets **ZeroGPU** (Hugging Face's serverless GPU pool, currently backed by H200)
because it's the free GPU tier. The actual workload — one SigLIP-so400m forward pass over up
to 9 images — is small enough to run on CPU in 15–30s, or on any CUDA device in under a second.
We're not picking H200 for capability reasons; we take whatever ZeroGPU allocates.

**Viable alternatives** if ZeroGPU isn't available (no HF Pro, etc.):
- **CPU basic Space** (free, no Pro): drop `spaces` import and `@spaces.GPU` decorator; everything
  else stays the same. Comparison takes 15–30s instead of <1s.
- **Dedicated T4/L4/A10G paid tier**: no code change; just pick hardware in Space Settings.

ZeroGPU-specific constraints that shape the code today:

- GPU-using functions **must** be decorated with `@spaces.GPU(duration=...)` and the `spaces`
  package must be imported **before** `torch`. GPU tensors/models cannot live at module scope —
  move `.to("cuda")` and the forward pass **inside** the decorated function. Guard `MODEL.to(device)`
  with a `next(MODEL.parameters()).device != device` check so repeat calls don't re-migrate.
- The model is ~857 MB (`megastyle_encoder.pth`) + SigLIP ~1.8 GB. Load on CPU at module scope
  via `hf_hub_download`; `.to(device)` inside the decorated function on first invocation.
- **`duration`** on `@spaces.GPU` is currently 30s (sized for 1 test image + up to 8 references).
  Over-estimating wastes the user's ZeroGPU daily quota. Bump only if batch grows.
- **ZeroGPU hardware is NOT set in frontmatter** — it's configured in the Space's
  *Settings → Hardware* panel. The `hardware:` frontmatter key that exists for regular Spaces
  does not apply here. Don't reintroduce it.
- **Python**: ZeroGPU supports only Python **3.10.13** and **3.12.12**. Pin `python_version: "3.10"`
  in frontmatter.
- **PyTorch**: ZeroGPU's supported wheel list is `>=2.1.0`. Keep the pin compatible.
- `sdk: gradio`, `sdk_version: 5.50.0` or newer (see Conventions: 5.9.1, HF's creation-time
  default, crashes on boolean JSON schemas). Gradio 4+ is the only SDK currently supported by ZeroGPU.
- Requirements live in `requirements.txt` at repo root: `gradio`, `torch`, `transformers`,
  `Pillow`, `spaces`, `huggingface_hub`. Pin `transformers>=4.45` so `SiglipVisionModel` exists.
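The import-order, decorator, and device-guard rules above combine into a pattern like this (a sketch, not the Space's `app.py`; `score_images` is an illustrative name, and the `ImportError` fallback is an assumption that lets the same file run on a CPU-basic Space):

```python
# `spaces` must be imported before torch on ZeroGPU.
try:
    import spaces

    gpu = spaces.GPU
except ImportError:
    # CPU-basic fallback: a no-op decorator with the same call shape.
    def gpu(duration=60):
        def wrap(fn):
            return fn
        return wrap

import torch


@gpu(duration=30)  # sized for 1 test image + up to 8 references
def score_images(model, pixel_values):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Guard the migration so repeat calls don't re-copy ~2.7 GB of weights.
    if next(model.parameters()).device != torch.device(device):
        model.to(device)
    with torch.no_grad():
        emb = model(pixel_values=pixel_values.to(device)).pooler_output
    return emb / emb.norm(p=2, dim=-1, keepdim=True)
```

The model itself is still loaded on CPU at module scope; only the `.to(device)` and the forward pass live inside the decorated function.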

## Repo shape

Current layout (single-file app is appropriate at this size):

- `app.py` — Gradio Blocks UI + `@spaces.GPU`-decorated inference. Single entry point.
- `requirements.txt` — dependency pins.
- `README.md` — HF Space frontmatter (sdk, python_version, license) + user-facing description.
  Doubles as the Space's landing page.
- `LICENSE` — MIT. Matches the `license: mit` claim in README frontmatter and the upstream
  model's license.
- `.gitignore` — standard Python + HF cache + Gradio `flagged/` directory.
- `CLAUDE.md` — this file.

Split `app.py` into a separate `megastyle.py` module only if inference logic grows past ~150 lines
or needs unit tests without Gradio. Not needed today.

## UI & scoring conventions (locked by review)

- **Up to 8 reference images.** Extras are clipped at compute time without a hard error; a note in
  the result markdown says so.
- **Headline score is mean cosine similarity** across per-reference scores. Per-reference
  breakdown is surfaced in a table so users can spot outliers dragging down the mean.
- **Label bands** (heuristic, calibrated for SigLIP-family cosine ranges where unrelated images
  typically sit 0.4–0.6):

  | Cosine | Label | Emoji |
  |--------|-------|-------|
  | ≥ 0.75 | Strong style match | 🟢 |
  | 0.65–0.75 | Good style match | 🟢 |
  | 0.55–0.65 | Moderate style match | 🟡 |
  | 0.45–0.55 | Weak style match | 🟠 |
  | < 0.45 | Minimal style match | 🔴 |

  If you ever have calibration data (e.g., per-style-pair cosine distributions from the MegaStyle
  dataset), retighten these. Until then, do not represent the label as ground truth β€” the raw
  cosine is the source of truth.
- **Do not display a pseudo-percentage.** An earlier iteration mapped cosine → `(x+1)/2 * 100`
  which compresses useful signal and misleads users. The raw three-decimal cosine + the label is
  the display contract.
- **Verdict styling uses emoji prefix, not inline HTML `<span>`.** `gr.Markdown`'s HTML handling
  varies across Gradio versions; emoji is reliably rendered.
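The band table and display contract might be expressed as (a sketch; `BANDS` and `verdict` are illustrative names, thresholds copied from the table above):

```python
# Lower bound of each band, checked top-down; below 0.45 falls through to red.
BANDS = [
    (0.75, "🟢 Strong style match"),
    (0.65, "🟢 Good style match"),
    (0.55, "🟡 Moderate style match"),
    (0.45, "🟠 Weak style match"),
]


def verdict(cosine: float) -> str:
    """Emoji-prefixed label plus the raw three-decimal cosine (the display contract)."""
    for lower, label in BANDS:
        if cosine >= lower:
            return f"{label} ({cosine:.3f})"
    return f"🔴 Minimal style match ({cosine:.3f})"
```

Note the emoji lives in the label string itself, per the no-inline-HTML rule.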

## Deploying and observing the Space

The live Space is at **`olfronar/megastyle-comparison`**
(https://huggingface.co/spaces/olfronar/megastyle-comparison). Local working copy is git-backed
with `origin` pointing at the Space's git URL — normal `git push origin main` triggers a rebuild.

### Publish a change

```bash
git add <files> && git commit -m "..." && git push origin main
```

For one-off blob pushes without a git clone:

```bash
hf upload olfronar/megastyle-comparison . --repo-type space --commit-message "..."
```

### Set hardware (ZeroGPU, etc.)

Hardware is *not* set via README frontmatter. Use `HfApi.request_space_hardware`:

```python
from huggingface_hub import HfApi
HfApi().request_space_hardware("olfronar/megastyle-comparison", hardware="zero-a10g")
```

Valid flavors include `cpu-basic`, `cpu-upgrade`, `t4-small`, `a10g-small`, `a10g-large`,
`a100-large`, and `zero-a10g` (the legacy name still used for the ZeroGPU pool, which is
currently backed by H200 hardware). You can also toggle via the Space's *Settings → Hardware*
page; both paths write to the same field.

### Observe build and run logs

HF exposes Server-Sent-Events streams for both phases. They require the user's HF token (read
it from `~/.cache/huggingface/token` after `hf auth login`). Capped-timeout curl works for
point-in-time snapshots:

```bash
# Build phase (docker layers, pip install)
curl -s --max-time 20 \
  -H "Authorization: Bearer $(cat ~/.cache/huggingface/token)" \
  "https://huggingface.co/api/spaces/olfronar/megastyle-comparison/logs/build" | tail -80

# Run phase (Python stdout/stderr, Gradio startup)
curl -s --max-time 20 \
  -H "Authorization: Bearer $(cat ~/.cache/huggingface/token)" \
  "https://huggingface.co/api/spaces/olfronar/megastyle-comparison/logs/run" | tail -80
```

Each event line is JSON with `data` and `timestamp` fields. Because it's SSE, the stream is
long-lived — always use `--max-time` to bound it.
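For scripting against those streams, a captured line can be parsed like this (a sketch; it assumes each event arrives with the standard SSE `data:` prefix in front of the JSON payload described above):

```python
import json


def parse_log_event(line: str):
    """Return (timestamp, data) from one SSE log line, or None for keep-alives/blanks."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = json.loads(line[len("data:"):].strip())
    return payload.get("timestamp"), payload.get("data")
```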

### Check high-level state

```bash
curl -s "https://huggingface.co/api/spaces/olfronar/megastyle-comparison" | \
  python3 -c "import sys, json; d=json.load(sys.stdin); r=d.get('runtime',{}); \
  print('stage:', r.get('stage'), 'hardware:', r.get('hardware'), 'sha:', d.get('sha','?')[:7])"
```

`stage` values: `BUILDING` → `RUNNING` (healthy) or `RUNTIME_ERROR` / `BUILD_FAILED` (check logs).

### Common fix cycles

- **Runtime import / missing dep**: edit `requirements.txt` or the affected import, commit, push,
  stream the run logs until `RUNNING`. Most fixes don't need a full rebuild — layer caching makes
  re-pushes with unchanged deps very fast.
- **Hardware change**: one `HfApi().request_space_hardware(...)` call — no push, no rebuild.
- **Stuck build**: use the *Settings → Factory Reboot* UI as a last resort; our Dockerfile is the
  HF default so factory-reboot is safe.

## Working with the Hugging Face Hub

- Model weights (`megastyle_encoder.pth`) should be pulled at runtime via `huggingface_hub.hf_hub_download`,
  not committed. The Space's build cache keeps it warm between restarts.
- License of the checkpoint is MIT (per `Gaojunyao/MegaStyle`); SigLIP is Apache-2.0. The Space's own
  code should ship under a compatible license.
- Paper citation (if surfaced in UI): `gao2026megastyle` bibtex is in the upstream repo README.

## Conventions

- **Use `SiglipImageProcessor`, not `AutoProcessor`.** Upstream's `style_score.py` uses
  `AutoProcessor.from_pretrained(SIGLIP_ID)` which loads both the image processor *and* the SigLIP
  tokenizer; the tokenizer requires `sentencepiece`, which isn't in the ZeroGPU base image and
  crashes at import. Vision-only inference has no use for the tokenizer. Load
  `SiglipImageProcessor.from_pretrained(SIGLIP_ID)` directly — the `pixel_values` output is
  identical.
- **Control the gradio version via `sdk_version` in README frontmatter, not `requirements.txt`.**
  HF's Dockerfile injects `gradio[oauth]==<sdk_version>` into the pip install line alongside our
  `requirements.txt`. Putting a conflicting `gradio` pin in `requirements.txt` (e.g.
  `gradio>=5.25.0`) crashes the build with `Cannot install gradio==<sdk_version> and gradio>=...
  because these package versions have conflicting dependencies`. Bump `sdk_version` instead;
  `requirements.txt` should not name gradio at all.
- **`sdk_version: 5.50.0` or newer is required.** Gradio 5.9.1 (the version HF defaults to at
  Space creation) ships a `gradio_client.utils.get_type` that crashes on boolean JSON schemas
  (valid JSON-Schema shorthand for accept-anything). `gr.Dataframe` triggers this in `/api/info`,
  flooding run logs with `TypeError: argument of type 'bool' is not iterable`. Fixed in later
  5.x. Also launch with `show_api=False` as a belt-and-suspenders so the endpoint isn't exposed
  at all — we don't need programmatic API access for a visual-only demo.
- **Match upstream preprocessing exactly.** If you find yourself tempted to resize, center-crop, or
  change color-space conversion manually instead of going through the SigLIP image processor, stop
  — the metric will silently degrade.
- **Don't cast weights to fp16/bf16 on load unless explicitly needed.** ZeroGPU hardware handles
  fp32 SigLIP fine; precision changes affect similarity scores measurably.
- **Device string**: the upstream `style_score.py` probes for `cuda` → `npu` → `cpu`. On ZeroGPU it
  will always be `cuda` inside the decorated function; outside (e.g. cold start), keep it on CPU.
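That probe order can be sketched as follows (a sketch mirroring upstream's order; the `hasattr` guard is an assumption, since `torch.npu` only exists on Ascend builds of PyTorch):

```python
import torch


def pick_device() -> str:
    """Probe cuda -> npu -> cpu, matching upstream style_score.py."""
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch, "npu") and torch.npu.is_available():
        return "npu"
    return "cpu"
```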