\section{Experiments}
\label{sec:experiments}

We address three research questions under the fixed-contract framework defined in Section~\ref{sec:eval}. \textbf{RQ1:} Under fixed outputs, does the matching rule determine whether a wildfire model appears usable?
\textbf{RQ2:} Under fixed features, does ranking-based head selection lose decision performance?
\textbf{RQ3:} Under fixed task contracts, do model comparisons remain consistent across task forms?
\vspace{-0.5em}
\subsection{Experimental Setup}
\paragraph{Task instances.}
We instantiate the six task-form contracts defined in Section~\ref{sec:taskforms}.
Occupancy and fire spread serve as primary tasks because they evaluate spatial fire outputs under matching or overlap rules and align with the decision structure of early warning systems~\cite{goldammer1999early, farahmand2020fdeo}.
The four supporting tasks, \textit{final burned area, analog retrieval, smoke PM$_{2.5}$, and extreme heat}, use different prediction units and metric families; their results bound rather than replace primary decision evidence.

\paragraph{Compared backbones.}
The frozen Earth-FM comparator set includes Prithvi-WxC~\cite{schmude2024prithviwxc}, Aurora~\cite{bodnar2025aurora}, ClimaX~\cite{nguyen2023climax}, StormCast~\cite{pathak2024stormcast}, DLWP~\cite{weyn2020dlwp}, FCN~\cite{pathak2022fourcastnet}, FengWu~\cite{chen2023fengwu}, FuXi~\cite{chen2023fuxi}, Pangu-Weather~\cite{bi2023panguweather}, and AlphaEarth~\cite{brown2025alphaearth}.
\ourfm\ serves as the wildfire-specialized reference backbone.

\paragraph{Protocol.}
For each comparison, the contract $\mathcal{C} = (\mathcal{T}, M, \Lambda, \Omega, \mathcal{A})$ is fixed before reporting test scores. 
Thresholds and morphology parameters are selected on validation data and held fixed at test time.
Stochastic components are evaluated over five seeds and reported as mean $\pm$ standard deviation; deterministic fixed-output checks have zero seed variance by construction.
Entries outside a fixed contract are omitted from main tables and documented in the appendix.
For error metrics, lower is better ($\downarrow$); for $F_1$, AP, nDCG, and correlation metrics, higher is better ($\uparrow$).
Appendix Table~\ref{tab:app_seed_robustness} summarizes the seed-level checks behind the reported mean-with-std convention.

\subsection{Matching-Rule Sensitivity Under Fixed Output (RQ1)}
\label{sec:rq1}

To answer RQ1, we conduct a fixed-output check on occupancy and fire spread tasks, holding the score field $S$, label field $Y$, threshold, and all other operating choices fixed while varying only the matching rule $\Lambda$ across exact, tolerated, and union settings. Occupancy results are reported in Figure~\ref{fig:fireprone_contract_progression} under both global and fire-prone scopes. The same progression is applied to fire spread outputs. Complete occupancy sweeps and predicted-positive rates are reported in Appendix Tables~\ref{tab:fireprone_contract_progression} and~\ref{tab:app_occupancy_ppr_scope}.
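
For orientation only, one common way to write such a progression is sketched below; the binding definitions of $\Lambda$ are those in Section~\ref{sec:taskforms}, and the threshold symbol $\tau$, the tolerance radius $r$, and the distance $d$ are assumptions introduced for this sketch. With predicted positives $P = \{x : S(x) \ge \tau\}$ and labeled positives $G = \{x : Y(x) = 1\}$, a tolerance-aware precision and recall can be written as
\[
\mathrm{Prec}_r = \frac{|\{p \in P : \min_{g \in G} d(p,g) \le r\}|}{|P|}, \qquad
\mathrm{Rec}_r = \frac{|\{g \in G : \min_{p \in P} d(p,g) \le r\}|}{|G|},
\]
with $F_{1,r} = 2\,\mathrm{Prec}_r\,\mathrm{Rec}_r / (\mathrm{Prec}_r + \mathrm{Rec}_r)$. Setting $r = 0$ recovers exact same-cell, same-time matching, while $r > 0$ credits predictions displaced by a few grid cells. Since $P$ and $G$ are unchanged across settings, any change in $F_{1,r}$ is attributable to the matching rule alone.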

\begin{wrapfigure}[21]{r}{0.50\textwidth}
\centering
\vspace{-3mm}
\includegraphics[width=\linewidth]{figures/fig_primary_rank_change_map.pdf}
\caption{\textbf{Primary-task rank changes (RQ1).}
Cells show rank before\(\rightarrow\)after. Green/red/gray mark moving up/down/no change; darker green or red marks a larger move. Following Section~\ref{sec:taskforms}, Ex/Tol/Un are occupancy exact, tolerated, and union matching; Sp is spread spatial-overlap $F_1$.}
\label{fig:primary_ranking}
\vspace{-0.8em}
\end{wrapfigure}
Because both tasks involve spatially sparse targets (fire-active cells for occupancy, burned raster patches for spread), the operational assumptions encoded in $\Lambda$ directly govern what the model is asked to get right. Matching-rule choice is therefore a substantive experimental setting rather than a post hoc evaluation detail.
The fixed-output results reveal a pattern that goes beyond score differences: matching-rule choice determines whether a model appears viable for wildfire decision tasks at all. Under exact matching, which requires same-cell, same-time agreement, the majority of frozen Earth-FM backbones produce occupancy $F_1$ scores near zero (Table~\ref{tab:primary_results}), which makes them indistinguishable from an uninformative baseline and suggests no practical utility for the task. As the matching rule relaxes to tolerated and then union matching, the same frozen representations recover substantial decision performance, with several backbones moving from near-zero to practically meaningful $F_1$ levels. Both relaxed rules reflect operationally realistic assumptions for early warning systems, where a prediction displaced by a few grid cells still triggers the correct response. This transition is not a marginal score improvement: it is a qualitative change in whether a model can be considered usable. The same pattern holds for fire spread under region-level matching relaxation, where strict raster-cell agreement again suppresses performance for most backbones while spatial tolerance restores it. The implications for prior wildfire transfer claims are significant. Papers that report model performance under a single implicit matching rule, which is common practice given that sparse decision targets almost always require some form of tolerance~\cite{ebert2009neighborhood, gilleland2009intercomparison}, may be drawing viability conclusions that depend entirely on an undisclosed protocol choice. A model claimed to perform well under one tolerance assumption may be unusable under a stricter one, and vice versa. The matching rule therefore cannot be treated as an evaluation detail; it is an experimental setting that must be fixed, reported, and justified as part of any wildfire transfer claim.
Additional spread AP values under fixed scopes are reported in Appendix Table~\ref{tab:app_spread_ap_by_scope}.


\begin{table}[t]
\centering
\small
\setlength{\tabcolsep}{4pt}
\renewcommand{\arraystretch}{1.20}
\caption{%
  \textbf{Primary fixed-contract transfer results (RQ1).}
  Occupancy metrics: exact, tolerated, union $F_1$ (\%).
  Fire spread metrics: exact $F_1$ and spatial $F_1$ (\%).
  Each block fixes $\mathcal{T}$, $\Lambda$, $\Omega$, and $\mathcal{A}$.
  Upward arrows indicate that larger values are better.
  \textbf{Bold} marks the best value per metric. \textbf{Tol.} = Tolerated.
}
\label{tab:primary_results}
\setlength{\arrayrulewidth}{0.4pt}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lccccc}
\toprule
& \multicolumn{3}{c}{\textbf{Occupancy}} 
& \multicolumn{2}{c}{\textbf{Fire spread}} \\
\cmidrule(lr){2-4}\cmidrule(lr){5-6}
\textbf{Comparator} 
& \textbf{Exact $F_1\uparrow$} & \textbf{Tol.\ $F_1\uparrow$} & \textbf{Union $F_1\uparrow$}
& \textbf{Exact $F_1\uparrow$} & \textbf{Spatial $F_1\uparrow$} \\
\midrule
\ourfm\
& \ms{0.4546}{0.1412}
& \ms{29.7484}{1.2868}
& \ms{59.0656}{2.7372}
& \ensuremath{\mathbf{37.6700}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.9800}}}
& \ensuremath{\mathbf{80.9700}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{2.0200}}} \\
\midrule
Prithvi-WxC
& \ms{0.0552}{0.0039} & \ms{7.1649}{0.6557} & \ms{20.1853}{1.8299}
& \ms{22.3500}{3.4500} & \ms{65.2600}{1.0700} \\
Aurora
& \ms{0.0656}{0.0094} & \ms{8.5009}{1.9594} & \ms{23.1037}{4.9418}
& \ms{30.8757}{0.1343} & \ms{71.7329}{0.0141} \\
ClimaX
& \ms{0.3480}{0.0754}
& \ensuremath{\mathbf{29.7535}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{3.6073}}}
& \ensuremath{\mathbf{60.1506}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{7.5865}}}
& \ms{27.9853}{2.0532} & \ms{69.0634}{2.3832} \\
StormCast
& \ms{0.0626}{0.0119} & \ms{8.1951}{2.1895} & \ms{22.3817}{5.4294}
& \ms{14.8387}{7.5791} & \ms{55.7568}{21.3003} \\
DLWP
& \ms{0.1693}{0.0419} & \ms{14.9148}{3.2446} & \ms{28.1901}{6.9658}
& \ms{5.9335}{10.0712} & \ms{22.8587}{22.3750} \\
FCN
& \ms{0.2829}{0.0839} & \ms{19.5061}{3.3412} & \ms{40.0604}{9.3701}
& \ms{3.1798}{2.6598} & \ms{15.6203}{12.4531} \\
FengWu
& \ms{0.2613}{0.0757} & \ms{12.0050}{6.0239} & \ms{24.1022}{13.6293}
& \ms{5.5189}{9.0883} & \ms{18.4774}{22.4703} \\
FuXi
& \ms{0.3774}{0.1212} & \ms{21.0323}{4.8211} & \ms{37.2888}{9.4470}
& \ms{19.9909}{2.1364} & \ms{56.1826}{3.0412} \\
Pangu-Weather
& \ms{0.2755}{0.1089} & \ms{17.0909}{4.0477} & \ms{35.6386}{9.0327}
& \ms{11.2583}{11.0719} & \ms{32.5081}{25.4969} \\
AlphaEarth
& \ensuremath{\mathbf{2.0606}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.4404}}}
& \ms{29.4476}{6.0064} & \ms{37.4286}{9.9458}
& \ms{11.0995}{3.6088} & \ms{32.8316}{7.4634} \\
\bottomrule
\end{tabular}
}
\end{table}


% \begin{figure}[H]
%     \centering
%     \includegraphics[width=\textwidth]{figures/fig_fireprone_contract_progression_compact.pdf}
%     \caption{
% \textbf{Matching-rule sensitivity in fire-prone occupancy (RQ1).}
% Each row holds the score field \(S\), label field \(Y\), threshold, and \(\Omega\) fixed, and changes only \(\Lambda\).
% Legend: \textcolor[HTML]{17375E}{$\blacksquare$} strict \(F_1\),
% \textcolor[HTML]{4F8DCC}{$\blacksquare$} added \(F_1\) from spatial tolerance,
% \textcolor[HTML]{BFD7F0}{$\blacksquare$} added \(F_1\) from union matching,
% red outline \ourfm, and dashed line original weather FMs vs.\ added baselines.
% The horizontal axis is \(F_1\) in percent.
% }
%     \label{fig:fireprone_contract_progression}
% \end{figure}
\begin{wrapfigure}[14]{r}{0.50\textwidth}
    \centering
    \vspace{-1em} \includegraphics[width=0.50\textwidth]{figures/fig_selection_regret_scatter.pdf}
\caption{\textbf{Head-selection regret under fixed features (RQ2).}
Each point is one backbone; selection regret \(\delta\) follows Section~\ref{sec:checks} under global-scope union-\(F_1\).}
    
    \label{fig:selection_regret_diagnostic}
    \vspace{-1.2em}
\end{wrapfigure}

% \begin{wrapfigure}[17]{r}{0.46\textwidth}
% \vspace{-0.4em}
% \centering
% \resizebox{\linewidth}{!}{%
% \begin{tikzpicture}[x=1cm,y=1cm]
% \footnotesize
% \draw[black!12, line width=0.35pt] (2.450,-0.350) -- (2.450,4.530);
% \node[anchor=north, font=\scriptsize, text=black!70] at (2.450,-0.410) {-20};
% \draw[black!12, line width=0.35pt] (3.243,-0.350) -- (3.243,4.530);
% \node[anchor=north, font=\scriptsize, text=black!70] at (3.243,-0.410) {-10};
% \draw[wfgray, line width=0.55pt] (4.036,-0.350) -- (4.036,4.530);
% \node[anchor=north, font=\scriptsize, text=black!70] at (4.036,-0.410) {0};
% \draw[black!12, line width=0.35pt] (4.829,-0.350) -- (4.829,4.530);
% \node[anchor=north, font=\scriptsize, text=black!70] at (4.829,-0.410) {10};
% \draw[black!12, line width=0.35pt] (5.621,-0.350) -- (5.621,4.530);
% \node[anchor=north, font=\scriptsize, text=black!70] at (5.621,-0.410) {20};
% \draw[black!12, line width=0.35pt] (6.414,-0.350) -- (6.414,4.530);
% \node[anchor=north, font=\scriptsize, text=black!70] at (6.414,-0.410) {30};
% \draw[black!12, line width=0.35pt] (7.207,-0.350) -- (7.207,4.530);
% \node[anchor=north, font=\scriptsize, text=black!70] at (7.207,-0.410) {40};
% \draw[black!12, line width=0.35pt] (8.000,-0.350) -- (8.000,4.530);
% \node[anchor=north, font=\scriptsize, text=black!70] at (8.000,-0.410) {50};
% \draw[black!45, line width=0.4pt] (2.450,-0.350) -- (8.000,-0.350);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,4.350) {\textcolor{wfblue}{\textbf{\ourfm}}};
% \draw[wfslate, line width=0.72pt] (4.030,4.220) -- (5.212,4.220);
% \draw[wfslate, line width=0.72pt] (4.030,4.185) -- (4.030,4.255);
% \draw[wfslate, line width=0.72pt] (5.212,4.185) -- (5.212,4.255);
% \filldraw[wfslate] (4.621,4.220) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (4.051,4.480) -- (4.487,4.480);
% \draw[wforange, line width=0.72pt] (4.051,4.445) -- (4.051,4.515);
% \draw[wforange, line width=0.72pt] (4.487,4.445) -- (4.487,4.515);
% \filldraw[wforange] (4.224,4.435) rectangle (4.314,4.525);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,3.940) {Prithvi-WxC};
% \draw[wfslate, line width=0.72pt] (4.036,3.810) -- (4.036,3.810);
% \draw[wfslate, line width=0.72pt] (4.036,3.775) -- (4.036,3.845);
% \draw[wfslate, line width=0.72pt] (4.036,3.775) -- (4.036,3.845);
% \filldraw[wfslate] (4.036,3.810) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (4.036,4.070) -- (4.036,4.070);
% \draw[wforange, line width=0.72pt] (4.036,4.035) -- (4.036,4.105);
% \draw[wforange, line width=0.72pt] (4.036,4.035) -- (4.036,4.105);
% \filldraw[wforange] (3.991,4.025) rectangle (4.081,4.115);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,3.530) {Aurora};
% \draw[wfslate, line width=0.72pt] (3.580,3.400) -- (5.276,3.400);
% \draw[wfslate, line width=0.72pt] (3.580,3.365) -- (3.580,3.435);
% \draw[wfslate, line width=0.72pt] (5.276,3.365) -- (5.276,3.435);
% \filldraw[wfslate] (4.428,3.400) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (2.627,3.660) -- (7.723,3.660);
% \draw[wforange, line width=0.72pt] (2.627,3.625) -- (2.627,3.695);
% \draw[wforange, line width=0.72pt] (7.723,3.625) -- (7.723,3.695);
% \filldraw[wforange] (5.130,3.615) rectangle (5.220,3.705);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,3.120) {ClimaX};
% \draw[wfslate, line width=0.72pt] (4.032,2.990) -- (4.060,2.990);
% \draw[wfslate, line width=0.72pt] (4.032,2.955) -- (4.032,3.025);
% \draw[wfslate, line width=0.72pt] (4.060,2.955) -- (4.060,3.025);
% \filldraw[wfslate] (4.046,2.990) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (4.036,3.250) -- (4.036,3.250);
% \draw[wforange, line width=0.72pt] (4.036,3.215) -- (4.036,3.285);
% \draw[wforange, line width=0.72pt] (4.036,3.215) -- (4.036,3.285);
% \filldraw[wforange] (3.991,3.205) rectangle (4.081,3.295);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,2.710) {StormCast};
% \draw[wfslate, line width=0.72pt] (4.036,2.580) -- (4.036,2.580);
% \draw[wfslate, line width=0.72pt] (4.036,2.545) -- (4.036,2.615);
% \draw[wfslate, line width=0.72pt] (4.036,2.545) -- (4.036,2.615);
% \filldraw[wfslate] (4.036,2.580) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (4.036,2.840) -- (4.036,2.840);
% \draw[wforange, line width=0.72pt] (4.036,2.805) -- (4.036,2.875);
% \draw[wforange, line width=0.72pt] (4.036,2.805) -- (4.036,2.875);
% \filldraw[wforange] (3.991,2.795) rectangle (4.081,2.885);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,2.300) {DLWP};
% \draw[wfslate, line width=0.72pt] (4.036,2.170) -- (4.036,2.170);
% \draw[wfslate, line width=0.72pt] (4.036,2.135) -- (4.036,2.205);
% \draw[wfslate, line width=0.72pt] (4.036,2.135) -- (4.036,2.205);
% \filldraw[wfslate] (4.036,2.170) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (4.044,2.430) -- (4.735,2.430);
% \draw[wforange, line width=0.72pt] (4.044,2.395) -- (4.044,2.465);
% \draw[wforange, line width=0.72pt] (4.735,2.395) -- (4.735,2.465);
% \filldraw[wforange] (4.345,2.385) rectangle (4.435,2.475);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,1.890) {FCN};
% \draw[wfslate, line width=0.72pt] (4.036,1.760) -- (4.036,1.760);
% \draw[wfslate, line width=0.72pt] (4.036,1.725) -- (4.036,1.795);
% \draw[wfslate, line width=0.72pt] (4.036,1.725) -- (4.036,1.795);
% \filldraw[wfslate] (4.036,1.760) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (3.971,2.020) -- (4.286,2.020);
% \draw[wforange, line width=0.72pt] (3.971,1.985) -- (3.971,2.055);
% \draw[wforange, line width=0.72pt] (4.286,1.985) -- (4.286,2.055);
% \filldraw[wforange] (4.083,1.975) rectangle (4.173,2.065);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,1.480) {FengWu};
% \draw[wfslate, line width=0.72pt] (4.036,1.350) -- (4.036,1.350);
% \draw[wfslate, line width=0.72pt] (4.036,1.315) -- (4.036,1.385);
% \draw[wfslate, line width=0.72pt] (4.036,1.315) -- (4.036,1.385);
% \filldraw[wfslate] (4.036,1.350) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (4.028,1.610) -- (4.127,1.610);
% \draw[wforange, line width=0.72pt] (4.028,1.575) -- (4.028,1.645);
% \draw[wforange, line width=0.72pt] (4.127,1.575) -- (4.127,1.645);
% \filldraw[wforange] (4.032,1.565) rectangle (4.122,1.655);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,1.070) {FuXi};
% \draw[wfslate, line width=0.72pt] (4.036,0.940) -- (4.036,0.940);
% \draw[wfslate, line width=0.72pt] (4.036,0.905) -- (4.036,0.975);
% \draw[wfslate, line width=0.72pt] (4.036,0.905) -- (4.036,0.975);
% \filldraw[wfslate] (4.036,0.940) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (4.029,1.200) -- (4.087,1.200);
% \draw[wforange, line width=0.72pt] (4.029,1.165) -- (4.029,1.235);
% \draw[wforange, line width=0.72pt] (4.087,1.165) -- (4.087,1.235);
% \filldraw[wforange] (4.013,1.155) rectangle (4.103,1.245);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,0.660) {Pangu-Weather};
% \draw[wfslate, line width=0.72pt] (4.036,0.530) -- (4.036,0.530);
% \draw[wfslate, line width=0.72pt] (4.036,0.495) -- (4.036,0.565);
% \draw[wfslate, line width=0.72pt] (4.036,0.495) -- (4.036,0.565);
% \filldraw[wfslate] (4.036,0.530) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (4.025,0.790) -- (4.076,0.790);
% \draw[wforange, line width=0.72pt] (4.025,0.755) -- (4.025,0.825);
% \draw[wforange, line width=0.72pt] (4.076,0.755) -- (4.076,0.825);
% \filldraw[wforange] (4.006,0.745) rectangle (4.096,0.835);
% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,0.250) {AlphaEarth};
% \draw[wfslate, line width=0.72pt] (4.700,0.120) -- (6.103,0.120);
% \draw[wfslate, line width=0.72pt] (4.700,0.085) -- (4.700,0.155);
% \draw[wfslate, line width=0.72pt] (6.103,0.085) -- (6.103,0.155);
% \filldraw[wfslate] (5.401,0.120) circle[radius=0.045];
% \draw[wforange, line width=0.72pt] (3.872,0.380) -- (4.815,0.380);
% \draw[wforange, line width=0.72pt] (3.872,0.345) -- (3.872,0.415);
% \draw[wforange, line width=0.72pt] (4.815,0.345) -- (4.815,0.415);
% \filldraw[wforange] (4.298,0.335) rectangle (4.388,0.425);
% \end{tikzpicture}%
% }
% \caption{\textbf{Fixed-feature selection-regret check (RQ2).} Fixed-feature selection regret \(\delta = D(h_D)-D(h_R)\) under union-\(F_1\). \textcolor{wfslate}{$\bullet$} uses the full grid; \textcolor{wforange}{$\blacksquare$} uses the top \(20\%\) fire-prone cells from training fire frequency. Horizontal intervals use the same colors and show mean \(\pm\) std over five seeds, in percentage points.}
% \vspace{+0.4em}
% \label{fig:selection_regret_diagnostic}
% \end{wrapfigure}

\subsection{Head-Selection Sensitivity Under Fixed Features (RQ2)}
\label{sec:rq2}

To answer RQ2, we conduct a fixed-feature check on occupancy and fire spread tasks, holding the frozen feature source, $\mathcal{T}$, $\Omega$, $\Lambda$, and candidate head family $\mathcal{H} \subseteq \mathcal{A}$ fixed while varying only the selection metric between PR-AUC-based and decision-F1-based selection. The resulting selection regret $\delta = D(h_D) - D(h_R)$ measures the decision-score loss induced by metric misalignment. Occupancy results are reported in Figure~\ref{fig:selection_regret_diagnostic} under both global and fire-prone scopes. Full per-seed and per-head details are reported in Appendix~\ref{sec:app_seeded_audits}, and the exact, tolerated, and union regret breakdown is provided in Appendix Table~\ref{tab:appendix_selection_regret_tolerance}.
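
As an illustrative reading of this definition (the formal statement is the one in Section~\ref{sec:checks}), assume that $h_R$ and $h_D$ denote the heads picked from $\mathcal{H}$ by the ranking metric and by the decision metric, respectively, on validation data; the symbols $R_{\mathrm{val}}$ and $D_{\mathrm{val}}$ are introduced here only for this sketch:
\[
h_R = \arg\max_{h \in \mathcal{H}} R_{\mathrm{val}}(h), \qquad
h_D = \arg\max_{h \in \mathcal{H}} D_{\mathrm{val}}(h), \qquad
\delta = D(h_D) - D(h_R).
\]
Under this reading, $\delta = 0$ whenever both metrics pick the same head, and $\delta$ need not be non-negative when $D$ is evaluated on test data, because the head that maximizes $D_{\mathrm{val}}$ is not guaranteed to maximize the test-time decision score.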

The fixed-feature results show that head-selection metrics introduce substantial backbone-dependent variation that is not explained by representation quality alone. Some backbones exhibit near-zero regret, indicating agreement between PR-AUC and decision-F1 selection, while others show large regret concentrated in specific scope-matching settings. Regret is generally larger under the global scope, where severe fire imbalance amplifies misalignment between ranking and decision metrics~\cite{mcdermott2024aurocauprc}. Restricting evaluation to fire-prone scopes typically reduces regret by concentrating evaluation on fire-relevant regions. A similar pattern appears for fire spread, where ranking and decision metrics can favor different heads under the same frozen representation. These results show that selection metrics must be aligned with the evaluation objective as part of the evaluation contract~\cite{traub2024selectiveclassification}.

\subsection{Supporting Task Checks (RQ3)}
\label{sec:rq3}

To answer RQ3, we evaluate all backbones across the four supporting task contracts, \textit{burned area, analog retrieval, smoke PM$_{2.5}$, and extreme heat}, and examine whether the reference-versus-frozen ordering established under primary tasks generalizes across task forms. A rank overview across all six contracts is provided in Figure~\ref{fig:task_comparator_normalized_map}, which maps backbone-by-task rank positions and makes cross-task ordering shifts directly visible. Native metric values are reported in Table~\ref{tab:supporting_results}. Additional supporting-task diagnostics are reported in Appendix Tables~\ref{tab:app_burned_area_median_acre}, \ref{tab:app_analog_rank_depth}, \ref{tab:app_smoke_high_event}, and~\ref{tab:app_heat_event_pr}.


The supporting task results produce three qualitatively distinct patterns relative to the primary findings. Burned area largely preserves the reference-versus-frozen ordering seen under occupancy and spread: \ourfm\ leads frozen entries on log-RMSE and Spearman $\rho$, suggesting that the representational advantage of wildfire-specific pretraining generalizes to event-scale regression under a different metric family and providing convergent evidence for the primary claim. Analog retrieval and smoke PM$_{2.5}$ show a different pattern: AlphaEarth matches \ourfm\ closely on both tasks, while atmospheric FMs show near-zero correlation on smoke PM$_{2.5}$. This indicates that retrieval and air-quality signals are captured comparably by a general remote-sensing backbone, and that the primary occupancy advantage does not extend uniformly to these task forms. Extreme heat exhibits the largest variance across the comparator set, with atmospheric FMs ranging from near-reference performance to near-complete failure depending on backbone pretraining domain, while AlphaEarth again matches \ourfm\ closely. The scale of this variance is itself informative: aggregating scores across task forms without respecting contract boundaries would produce rankings dominated by scale artifacts in the extreme heat block rather than by transfer quality. Taken together, these results establish that supporting tasks bound rather than extend the primary claim. They provide useful evidence about where backbone families generalize and where they do not, but they cannot substitute for primary decision task evaluation, and their results must be interpreted within their own task-form contracts.

\begin{figure}[t]
    \centering
\vspace{-5mm}

    \includegraphics[width=\textwidth]{figures/fig_rank_heatmap1.pdf}
    \vspace{-2mm}
    \caption{{\textbf{Rank map for supporting task comparison (RQ3).} Each row fixes one task contract $\mathcal{C}$ and ranks the eligible backbones within that contract. The figure shows rank changes across task forms; native metric values are reported in Table~\ref{tab:supporting_results}.}}
\vspace{-6mm}
    \label{fig:task_comparator_normalized_map}
\end{figure}

\begin{table}[t]
\centering
\small
\setlength{\tabcolsep}{3.5pt}
\renewcommand{\arraystretch}{1.18}
\caption{%
  \textbf{Supporting task-metric matrix (RQ3).}
  Top: final burned area and analog retrieval.
  Bottom: smoke PM$_{2.5}$ and extreme heat.
  Each block fixes $\mathcal{T}$, $\Lambda$, and $\Omega$; backbone
  column is shared across paired tasks. \ourfm\ row is
  separated by a rule as the empirical anchor. \textbf{Bold} marks
  the best value per metric. For error metrics,
  lower is better ($\downarrow$); for $F_1$, nDCG, Spearman $\rho$, and Pearson $r$, higher
  is better ($\uparrow$).
}
\label{tab:supporting_results}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lcccccc}
\toprule
& \multicolumn{3}{c}{\textbf{Burned area}}
& \multicolumn{3}{c}{\textbf{Analog retrieval}} \\
\cmidrule(lr){2-4}\cmidrule(lr){5-7}
\textbf{Backbone}
& \textbf{log-RMSE$\downarrow$} & \textbf{log-MAE$\downarrow$}
& \textbf{Spearman$\uparrow$}
& \textbf{nDCG@10$\uparrow$} & \textbf{log-RMSE$\downarrow$}
& \textbf{log-MAE$\downarrow$} \\
\midrule
\ourfm\
& \ensuremath{\mathbf{1.1657}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0126}}}
& \ensuremath{\mathbf{1.0423}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0081}}}
& \ensuremath{\mathbf{0.6298}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0338}}}
& \ensuremath{\mathbf{0.5099}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0336}}}
& \ensuremath{\mathbf{1.1977}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.1029}}}
& \ensuremath{\mathbf{1.0043}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0759}}} \\
\midrule
Prithvi-WxC
& \ms{1.3630}{0.0681} & \ms{1.2435}{0.0668} & \ms{0.1799}{0.3002}
& \ms{0.3857}{0.0189} & \ms{1.3908}{0.0938} & \ms{1.2585}{0.0865} \\
Aurora
& \ms{1.8658}{0.2009} & \ms{1.6717}{0.1245} & \ms{-0.1156}{0.2982}
& \ms{0.4046}{0.0144} & \ms{1.3659}{0.0792} & \ms{1.2596}{0.0968} \\
ClimaX
& \ms{2.0300}{0.2103} & \ms{1.8443}{0.1528} & \ms{-0.2515}{0.2688}
& \ms{0.4143}{0.0191} & \ms{1.4526}{0.0926} & \ms{1.2441}{0.1446} \\
StormCast
& \ms{1.6679}{0.1438} & \ms{1.4745}{0.1134} & \ms{0.1830}{0.1969}
& \ms{0.4076}{0.0094} & \ms{1.3663}{0.0781} & \ms{1.2371}{0.1078} \\
DLWP
& \ms{1.3070}{0.0980} & \ms{1.1769}{0.0834} & \ms{0.4888}{0.1368}
& \ms{0.3972}{0.0146} & \ms{1.5351}{0.0802} & \ms{1.3196}{0.0781} \\
FCN
& \ms{1.3693}{0.0885} & \ms{1.2599}{0.0723} & \ms{0.3484}{0.1662}
& \ms{0.4316}{0.0134} & \ms{1.4604}{0.1035} & \ms{1.2351}{0.0586} \\
FengWu
& \ms{1.3715}{0.1011} & \ms{1.2604}{0.0820} & \ms{0.3221}{0.2004}
& \ms{0.4246}{0.0237} & \ms{1.4179}{0.0986} & \ms{1.2233}{0.0915} \\
FuXi
& \ms{1.4068}{0.1011} & \ms{1.3023}{0.0789} & \ms{0.2663}{0.2561}
& \ms{0.4279}{0.0212} & \ms{1.4290}{0.0929} & \ms{1.2236}{0.0961} \\
Pangu-Weather
& \ms{1.3280}{0.0735} & \ms{1.2081}{0.0607} & \ms{0.4141}{0.1573}
& \ms{0.4017}{0.0245} & \ms{1.4235}{0.0731} & \ms{1.2225}{0.0847} \\
AlphaEarth
& \ms{2.4068}{0.2841} & \ms{2.0822}{0.2371} & \ms{-0.3428}{0.1716}
& \ms{0.5086}{0.0440} & \ms{1.2158}{0.1310} & \ms{1.0350}{0.1018} \\
\bottomrule
\end{tabular}
}

\vspace{4pt}

\resizebox{\textwidth}{!}{%
\begin{tabular}{lcccccc}
\toprule
& \multicolumn{3}{c}{\textbf{Smoke PM$_{2.5}$}}
& \multicolumn{3}{c}{\textbf{Extreme heat}} \\
\cmidrule(lr){2-4}\cmidrule(lr){5-7}
\textbf{Backbone}
& \textbf{RMSE$\downarrow$} & \textbf{MAE$\downarrow$}
& \textbf{Pearson $r\uparrow$}
& \textbf{RMSE-C$\downarrow$} & \textbf{MAE-C$\downarrow$}
& \textbf{Exceed.\ $F_1\uparrow$} \\
\midrule
\ourfm\
& \ms{4.4646}{0.0060}
& \ms{2.4108}{0.0016}
& \ensuremath{\mathbf{0.6368}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0013}}}
& \ensuremath{\mathbf{0.2179}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0043}}}
& \ensuremath{\mathbf{0.1787}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0018}}}
& \ms{0.9541}{0.0164} \\
\midrule
Prithvi-WxC
& \ms{6.0382}{0.0828} & \ms{3.7301}{0.0055} & \ms{0.0243}{0.0045}
& \ms{4.6225}{0.0192} & \ms{2.6315}{0.0128} & \ms{0.8693}{0.0023} \\
Aurora
& \ms{6.0384}{0.0828} & \ms{3.7265}{0.0055} & \ms{0.0193}{0.0043}
& \ms{18.0474}{0.0708} & \ms{15.3747}{0.0594} & \ms{0.0951}{0.0038} \\
ClimaX
& \ms{6.0402}{0.0828} & \ms{3.7290}{0.0055} & \ms{0.0004}{0.0029}
& \ms{17.6492}{0.0347} & \ms{14.4938}{0.0319} & \ms{0.7684}{0.0068} \\
StormCast
& \ms{6.1230}{0.0830} & \ms{3.8182}{0.0073} & \ms{0.0183}{0.0041}
& \ms{1.7671}{0.2145} & \ms{1.3507}{0.1576} & \ms{0.9073}{0.0189} \\
DLWP
& \ms{5.9289}{0.1031} & \ms{3.7331}{0.0088} & \ms{0.0303}{0.0060}
& \ms{2.2662}{0.1106} & \ms{1.7153}{0.0748} & \ms{0.9156}{0.0112} \\
FCN
& \ms{5.9277}{0.1033} & \ms{3.7345}{0.0088} & \ms{0.0312}{0.0062}
& \ms{2.1657}{0.1800} & \ms{1.6033}{0.1039} & \ms{0.9257}{0.0096} \\
FengWu
& \ms{5.9297}{0.1032} & \ms{3.7395}{0.0088} & \ms{0.0304}{0.0063}
& \ms{2.1266}{0.1589} & \ms{1.5801}{0.1004} & \ms{0.0481}{0.0459} \\
FuXi
& \ms{5.9319}{0.1029} & \ms{3.7398}{0.0088} & \ms{0.0299}{0.0061}
& \ms{2.1282}{0.0969} & \ms{1.5759}{0.0719} & \ms{0.2268}{0.0623} \\
Pangu-Weather
& \ms{5.9270}{0.1036} & \ms{3.7320}{0.0088} & \ms{0.0301}{0.0060}
& \ms{2.2045}{0.1483} & \ms{1.6307}{0.0889} & \ms{0.0199}{0.0062} \\
AlphaEarth
& \ensuremath{\mathbf{4.4403}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0488}}}
& \ensuremath{\mathbf{2.3992}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0056}}}
& \ms{0.6347}{0.0066}
& \ms{0.2194}{0.0039}
& \ms{0.1800}{0.0014}
& \ensuremath{\mathbf{0.9542}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0107}}} \\
\bottomrule
\end{tabular}
}
\end{table}


\paragraph{Pattern 1: primary pattern preserved (burned area).}
\ourfm\ leads all frozen entries on log-RMSE and Spearman $\rho$. The ordering observed under occupancy and spread is preserved under burned-area regression despite the different prediction unit and metric family.

\paragraph{Pattern 2: primary pattern bounded (analog retrieval and smoke PM$_{2.5}$).}
For analog retrieval, AlphaEarth matches \ourfm\ (nDCG@10 $= 0.51 \pm 0.04$ vs.\ $0.51 \pm 0.03$). For smoke PM$_{2.5}$, AlphaEarth also matches \ourfm\ on MAE and Pearson $r$, while atmospheric Earth FMs show near-zero correlation. These results show that the occupancy-and-spread ordering does not fully extend to all supporting tasks once AlphaEarth is included.

\paragraph{Pattern 3: primary pattern bounded with large variance (extreme heat).} AlphaEarth matches \ourfm\ on RMSE-C and remains close on exceedance $F_1$, while atmospheric FMs range from RMSE-C $= 1.77$ (StormCast) to $18.05$ (Aurora). This large spread indicates that aggregated scores across task forms would be dominated by scale artifacts rather than transfer quality, reinforcing the need for per-contract reporting established in Section~\ref{sec:eval}.
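
A back-of-the-envelope check on Table~\ref{tab:supporting_results} illustrates the scale issue (this arithmetic is an illustration, not an additional experiment). The spread of mean RMSE-C across backbones on extreme heat and the spread of mean log-RMSE on burned area are
\[
18.0474 - 0.2179 = 17.8295, \qquad 2.4068 - 1.1657 = 1.2411,
\]
a ratio of roughly $14$. An unnormalized sum of the two error columns would therefore be driven largely by the extreme-heat block.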

\textit{Answer to RQ3:} Figure~\ref{fig:task_comparator_normalized_map} and Table~\ref{tab:supporting_results} show that burned area preserves the primary reference-versus-frozen pattern under a different metric family. Analog retrieval, smoke PM$_{2.5}$, and extreme heat bound this pattern: AlphaEarth matches or approaches \ourfm\ on these tasks, indicating that the primary occupancy and spread claims do not extend uniformly across all task forms.