	\section{Experiments}
	\label{sec:experiments}

	We address three research questions under the fixed-contract framework defined in Section~\ref{sec:eval}. \textbf{RQ1:} Under fixed outputs, does the matching rule determine whether a wildfire model appears usable?
	\textbf{RQ2:} Under fixed features, does ranking-based head selection lose decision performance?
	\textbf{RQ3:} Under fixed task contracts, do model comparisons remain consistent across task forms?
	\vspace{-0.5em}
	\subsection{Experimental Setup}
	\paragraph{Task instances.}
	We instantiate the six task-form contracts defined in Section~\ref{sec:taskforms}.
	Occupancy and fire spread serve as primary tasks because they evaluate spatial fire outputs under matching or overlap rules and align with the decision structure of early warning systems~\cite{goldammer1999early, farahmand2020fdeo}.
	The four supporting tasks, \textit{final burned area, analog retrieval, smoke PM$_{2.5}$, and extreme heat}, use different prediction units and metric families; their results bound rather than replace primary decision evidence.

	\paragraph{Compared backbones.}
	The frozen Earth-FM comparator set includes Prithvi-WxC~\cite{schmude2024prithviwxc}, Aurora~\cite{bodnar2025aurora}, ClimaX~\cite{nguyen2023climax}, StormCast~\cite{pathak2024stormcast}, DLWP~\cite{weyn2020dlwp}, FCN~\cite{pathak2022fourcastnet}, FengWu~\cite{chen2023fengwu}, FuXi~\cite{chen2023fuxi}, Pangu-Weather~\cite{bi2023panguweather}, and AlphaEarth~\cite{brown2025alphaearth}.
	\ourfm\ serves as the wildfire-specialized reference backbone.

	\paragraph{Protocol.}
	For each comparison, the contract $\mathcal{C} = (\mathcal{T}, M, \Lambda, \Omega, \mathcal{A})$ is fixed before reporting test scores.
	Thresholds and morphology parameters are selected on validation data and held fixed at test time.
	Stochastic components are evaluated over five seeds and reported as mean $\pm$ standard deviation; deterministic fixed-output checks have zero seed variance by construction.
	Entries outside a fixed contract are omitted from main tables and documented in the appendix.
	For error metrics, lower is better ($\downarrow$); for $F_1$, AP, nDCG, and correlation metrics, higher is better ($\uparrow$).
	Appendix Table~\ref{tab:app_seed_robustness} summarizes the seed-level checks behind the reported mean-with-std convention.

	\subsection{Matching-Rule Sensitivity Under Fixed Output (RQ1)}
	\label{sec:rq1}

	To answer RQ1, we conduct a fixed-output check on occupancy and fire spread tasks, holding the score field $S$, label field $Y$, threshold, and all other operating choices fixed while varying only the matching rule $\Lambda$ across exact, tolerated, and union settings. Occupancy results are reported in Figure~\ref{fig:fireprone_contract_progression} under both global and fire-prone scopes. The same progression is applied to fire spread outputs. Complete occupancy sweeps and predicted-positive rates are reported in Appendix Tables~\ref{tab:fireprone_contract_progression} and~\ref{tab:app_occupancy_ppr_scope}.

	\begin{wrapfigure}[21]{r}{0.50\textwidth}
	\centering
	\vspace{-3mm}
	\includegraphics[width=\linewidth]{figures/fig_primary_rank_change_map.pdf}
	\caption{\textbf{Primary-task rank changes (RQ1).}
	Cells show rank before$\rightarrow$after. Green/red/gray mark moving up/down/no change; darker green or red marks a larger move. Following Section~\ref{sec:taskforms}, Ex/Tol/Un are occupancy exact, tolerated, and union matching; Sp is spread spatial-overlap $F_1$.}
	\label{fig:primary_ranking}
	\vspace{-0.8em}
	\end{wrapfigure}
	Both tasks involve spatially sparse targets (fire-active cells for occupancy, burned raster patches for spread), so the operational assumptions encoded in $\Lambda$ directly govern what the model is asked to get right. Matching-rule choice is therefore a substantive experimental setting rather than a post hoc evaluation detail.

	The fixed-output results go beyond score differences: matching-rule choice determines whether a model appears viable for wildfire decision tasks at all. Under exact matching, which requires same-cell same-time agreement, occupancy $F_1$ in Table~\ref{tab:primary_results} stays below $2.1\%$ for every backbone and below $0.5\%$ for all but AlphaEarth, leaving most frozen Earth-FM backbones indistinguishable from an uninformative baseline. Tolerated and union matching both reflect operationally realistic assumptions for early warning systems, where a prediction displaced by a few grid cells still triggers the correct response. As the matching rule relaxes to these settings, the same frozen representations recover substantial decision performance; ClimaX, for example, moves from $0.35\%$ exact to $60.15\%$ union $F_1$. This transition is a qualitative change in whether a model can be considered usable, not a marginal score improvement. The same pattern holds for fire spread under region-level matching relaxation, where strict raster-cell agreement again suppresses performance for most backbones while spatial tolerance restores it.

	These results have direct implications for prior wildfire transfer claims. Reporting model performance under a single implicit matching rule is common practice, given that sparse decision targets almost always require some form of tolerance~\cite{ebert2009neighborhood, gilleland2009intercomparison}, so viability conclusions drawn this way may depend entirely on an undisclosed protocol choice. A model claimed to perform well under one tolerance assumption may be unusable under a stricter one, and vice versa. The matching rule cannot be treated as an evaluation detail; it is an experimental setting that must be fixed, reported, and justified as part of any wildfire transfer claim.
Additional spread AP values under fixed scopes are reported in Appendix Table~\ref{tab:app_spread_ap_by_scope}.
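
	As a worked reading of Table~\ref{tab:primary_results}, the following illustrative calculation uses the reported means only and ignores seed variance; the superscripts on $F_1$ are shorthand introduced here, not notation from Section~\ref{sec:taskforms}. Recall the standard identity
	\[
	F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},
	\]
	in which, under a fixed output, changing $\Lambda$ alters only how predicted and labeled cells are matched, and hence the counts $\mathrm{TP}$, $\mathrm{FP}$, and $\mathrm{FN}$, while $S$ and $Y$ stay the same.
	For \ourfm, the occupancy means give
	\[
	\frac{F_1^{\mathrm{tol}}}{F_1^{\mathrm{exact}}} = \frac{29.7484}{0.4546} \approx 65,
	\qquad
	\frac{F_1^{\mathrm{union}}}{F_1^{\mathrm{exact}}} = \frac{59.0656}{0.4546} \approx 130 .
	\]
	The ordering moves as well: by mean occupancy $F_1$, AlphaEarth ranks first under exact matching ($2.06\%$) but fourth under union matching ($37.43\%$), while ClimaX moves from fourth ($0.35\%$) to first ($60.15\%$).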


	\begin{table}[t]
	\centering
	\small
	\setlength{\tabcolsep}{4pt}
	\renewcommand{\arraystretch}{1.20}
	\caption{%
	\textbf{Primary fixed-contract transfer results (RQ1).}
	Occupancy metrics: exact, tolerated, union $F_1$ (\%).
	Fire spread metrics: exact $F_1$ and spatial $F_1$ (\%).
	Each block fixes $\mathcal{T}$, $\Lambda$, $\Omega$, and $\mathcal{A}$.
	Upward arrows indicate that larger values are better.
	\textbf{Bold} marks the best value per metric. \textbf{Tol.} = Tolerated.
	}
	\label{tab:primary_results}
	\setlength{\arrayrulewidth}{0.4pt}
	\resizebox{\textwidth}{!}{%
	\begin{tabular}{lccccc}
	\toprule
	& \multicolumn{3}{c}{\textbf{Occupancy}}
	& \multicolumn{2}{c}{\textbf{Fire spread}} \\
	\cmidrule(lr){2-4}\cmidrule(lr){5-6}
	\textbf{Comparator}
	& \textbf{Exact $F_1\uparrow$} & \textbf{Tol.\ $F_1\uparrow$} & \textbf{Union $F_1\uparrow$}
	& \textbf{Exact $F_1\uparrow$} & \textbf{Spatial $F_1\uparrow$} \\
	\midrule
	\ourfm\
	& \ms{0.4546}{0.1412}
	& \ms{29.7484}{1.2868}
	& \ms{59.0656}{2.7372}
	& \ensuremath{\mathbf{37.6700}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.9800}}}
	& \ensuremath{\mathbf{80.9700}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{2.0200}}} \\
	\midrule
	Prithvi-WxC
	& \ms{0.0552}{0.0039} & \ms{7.1649}{0.6557} & \ms{20.1853}{1.8299}
	& \ms{22.3500}{3.4500} & \ms{65.2600}{1.0700} \\
	Aurora
	& \ms{0.0656}{0.0094} & \ms{8.5009}{1.9594} & \ms{23.1037}{4.9418}
	& \ms{30.8757}{0.1343} & \ms{71.7329}{0.0141} \\
	ClimaX
	& \ms{0.3480}{0.0754}
	& \ensuremath{\mathbf{29.7535}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{3.6073}}}
	& \ensuremath{\mathbf{60.1506}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{7.5865}}}
	& \ms{27.9853}{2.0532} & \ms{69.0634}{2.3832} \\
	StormCast
	& \ms{0.0626}{0.0119} & \ms{8.1951}{2.1895} & \ms{22.3817}{5.4294}
	& \ms{14.8387}{7.5791} & \ms{55.7568}{21.3003} \\
	DLWP
	& \ms{0.1693}{0.0419} & \ms{14.9148}{3.2446} & \ms{28.1901}{6.9658}
	& \ms{5.9335}{10.0712} & \ms{22.8587}{22.3750} \\
	FCN
	& \ms{0.2829}{0.0839} & \ms{19.5061}{3.3412} & \ms{40.0604}{9.3701}
	& \ms{3.1798}{2.6598} & \ms{15.6203}{12.4531} \\
	FengWu
	& \ms{0.2613}{0.0757} & \ms{12.0050}{6.0239} & \ms{24.1022}{13.6293}
	& \ms{5.5189}{9.0883} & \ms{18.4774}{22.4703} \\
	FuXi
	& \ms{0.3774}{0.1212} & \ms{21.0323}{4.8211} & \ms{37.2888}{9.4470}
	& \ms{19.9909}{2.1364} & \ms{56.1826}{3.0412} \\
	Pangu-Weather
	& \ms{0.2755}{0.1089} & \ms{17.0909}{4.0477} & \ms{35.6386}{9.0327}
	& \ms{11.2583}{11.0719} & \ms{32.5081}{25.4969} \\
	AlphaEarth
	& \ensuremath{\mathbf{2.0606}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.4404}}}
	& \ms{29.4476}{6.0064} & \ms{37.4286}{9.9458}
	& \ms{11.0995}{3.6088} & \ms{32.8316}{7.4634} \\
	\bottomrule
	\end{tabular}
	}
	\end{table}


	% \begin{figure}[H]
	% \centering
	% \includegraphics[width=\textwidth]{figures/fig_fireprone_contract_progression_compact.pdf}
	% \caption{
	% \textbf{Matching-rule sensitivity in fire-prone occupancy (RQ1).}
	% Each row holds the score field $S$, label field $Y$, threshold, and $\Omega$ fixed, and changes only $\Lambda$.
	% Legend: \textcolor[HTML]{17375E}{$\blacksquare$} strict $F_1$,
	% \textcolor[HTML]{4F8DCC}{$\blacksquare$} added $F_1$ from spatial tolerance,
	% \textcolor[HTML]{BFD7F0}{$\blacksquare$} added $F_1$ from union matching,
	% red outline \ourfm, and dashed line original weather FMs vs.\ added baselines.
	% The horizontal axis is $F_1$ in percent.
	% }
	% \label{fig:fireprone_contract_progression}
	% \end{figure}
	\begin{wrapfigure}[14]{r}{0.50\textwidth}
	\centering
	\vspace{-1em} \includegraphics[width=0.50\textwidth]{figures/fig_selection_regret_scatter.pdf}
	\caption{\textbf{Head-selection regret under fixed features (RQ2).}
	Each point is one backbone; selection regret $\delta$ follows Section~\ref{sec:checks} under global-scope union-$F_1$.}

	\label{fig:selection_regret_diagnostic}
	\vspace{-1.2em}
	\end{wrapfigure}

	% \begin{wrapfigure}[17]{r}{0.46\textwidth}
	% \vspace{-0.4em}
	% \centering
	% \resizebox{\linewidth}{!}{%
	% \begin{tikzpicture}[x=1cm,y=1cm]
	% \footnotesize
	% \draw[black!12, line width=0.35pt] (2.450,-0.350) -- (2.450,4.530);
	% \node[anchor=north, font=\scriptsize, text=black!70] at (2.450,-0.410) {-20};
	% \draw[black!12, line width=0.35pt] (3.243,-0.350) -- (3.243,4.530);
	% \node[anchor=north, font=\scriptsize, text=black!70] at (3.243,-0.410) {-10};
	% \draw[wfgray, line width=0.55pt] (4.036,-0.350) -- (4.036,4.530);
	% \node[anchor=north, font=\scriptsize, text=black!70] at (4.036,-0.410) {0};
	% \draw[black!12, line width=0.35pt] (4.829,-0.350) -- (4.829,4.530);
	% \node[anchor=north, font=\scriptsize, text=black!70] at (4.829,-0.410) {10};
	% \draw[black!12, line width=0.35pt] (5.621,-0.350) -- (5.621,4.530);
	% \node[anchor=north, font=\scriptsize, text=black!70] at (5.621,-0.410) {20};
	% \draw[black!12, line width=0.35pt] (6.414,-0.350) -- (6.414,4.530);
	% \node[anchor=north, font=\scriptsize, text=black!70] at (6.414,-0.410) {30};
	% \draw[black!12, line width=0.35pt] (7.207,-0.350) -- (7.207,4.530);
	% \node[anchor=north, font=\scriptsize, text=black!70] at (7.207,-0.410) {40};
	% \draw[black!12, line width=0.35pt] (8.000,-0.350) -- (8.000,4.530);
	% \node[anchor=north, font=\scriptsize, text=black!70] at (8.000,-0.410) {50};
	% \draw[black!45, line width=0.4pt] (2.450,-0.350) -- (8.000,-0.350);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,4.350) {\textcolor{wfblue}{\textbf{\ourfm}}};
	% \draw[wfslate, line width=0.72pt] (4.030,4.220) -- (5.212,4.220);
	% \draw[wfslate, line width=0.72pt] (4.030,4.185) -- (4.030,4.255);
	% \draw[wfslate, line width=0.72pt] (5.212,4.185) -- (5.212,4.255);
	% \filldraw[wfslate] (4.621,4.220) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (4.051,4.480) -- (4.487,4.480);
	% \draw[wforange, line width=0.72pt] (4.051,4.445) -- (4.051,4.515);
	% \draw[wforange, line width=0.72pt] (4.487,4.445) -- (4.487,4.515);
	% \filldraw[wforange] (4.224,4.435) rectangle (4.314,4.525);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,3.940) {Prithvi-WxC};
	% \draw[wfslate, line width=0.72pt] (4.036,3.810) -- (4.036,3.810);
	% \draw[wfslate, line width=0.72pt] (4.036,3.775) -- (4.036,3.845);
	% \draw[wfslate, line width=0.72pt] (4.036,3.775) -- (4.036,3.845);
	% \filldraw[wfslate] (4.036,3.810) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (4.036,4.070) -- (4.036,4.070);
	% \draw[wforange, line width=0.72pt] (4.036,4.035) -- (4.036,4.105);
	% \draw[wforange, line width=0.72pt] (4.036,4.035) -- (4.036,4.105);
	% \filldraw[wforange] (3.991,4.025) rectangle (4.081,4.115);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,3.530) {Aurora};
	% \draw[wfslate, line width=0.72pt] (3.580,3.400) -- (5.276,3.400);
	% \draw[wfslate, line width=0.72pt] (3.580,3.365) -- (3.580,3.435);
	% \draw[wfslate, line width=0.72pt] (5.276,3.365) -- (5.276,3.435);
	% \filldraw[wfslate] (4.428,3.400) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (2.627,3.660) -- (7.723,3.660);
	% \draw[wforange, line width=0.72pt] (2.627,3.625) -- (2.627,3.695);
	% \draw[wforange, line width=0.72pt] (7.723,3.625) -- (7.723,3.695);
	% \filldraw[wforange] (5.130,3.615) rectangle (5.220,3.705);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,3.120) {ClimaX};
	% \draw[wfslate, line width=0.72pt] (4.032,2.990) -- (4.060,2.990);
	% \draw[wfslate, line width=0.72pt] (4.032,2.955) -- (4.032,3.025);
	% \draw[wfslate, line width=0.72pt] (4.060,2.955) -- (4.060,3.025);
	% \filldraw[wfslate] (4.046,2.990) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (4.036,3.250) -- (4.036,3.250);
	% \draw[wforange, line width=0.72pt] (4.036,3.215) -- (4.036,3.285);
	% \draw[wforange, line width=0.72pt] (4.036,3.215) -- (4.036,3.285);
	% \filldraw[wforange] (3.991,3.205) rectangle (4.081,3.295);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,2.710) {StormCast};
	% \draw[wfslate, line width=0.72pt] (4.036,2.580) -- (4.036,2.580);
	% \draw[wfslate, line width=0.72pt] (4.036,2.545) -- (4.036,2.615);
	% \draw[wfslate, line width=0.72pt] (4.036,2.545) -- (4.036,2.615);
	% \filldraw[wfslate] (4.036,2.580) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (4.036,2.840) -- (4.036,2.840);
	% \draw[wforange, line width=0.72pt] (4.036,2.805) -- (4.036,2.875);
	% \draw[wforange, line width=0.72pt] (4.036,2.805) -- (4.036,2.875);
	% \filldraw[wforange] (3.991,2.795) rectangle (4.081,2.885);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,2.300) {DLWP};
	% \draw[wfslate, line width=0.72pt] (4.036,2.170) -- (4.036,2.170);
	% \draw[wfslate, line width=0.72pt] (4.036,2.135) -- (4.036,2.205);
	% \draw[wfslate, line width=0.72pt] (4.036,2.135) -- (4.036,2.205);
	% \filldraw[wfslate] (4.036,2.170) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (4.044,2.430) -- (4.735,2.430);
	% \draw[wforange, line width=0.72pt] (4.044,2.395) -- (4.044,2.465);
	% \draw[wforange, line width=0.72pt] (4.735,2.395) -- (4.735,2.465);
	% \filldraw[wforange] (4.345,2.385) rectangle (4.435,2.475);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,1.890) {FCN};
	% \draw[wfslate, line width=0.72pt] (4.036,1.760) -- (4.036,1.760);
	% \draw[wfslate, line width=0.72pt] (4.036,1.725) -- (4.036,1.795);
	% \draw[wfslate, line width=0.72pt] (4.036,1.725) -- (4.036,1.795);
	% \filldraw[wfslate] (4.036,1.760) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (3.971,2.020) -- (4.286,2.020);
	% \draw[wforange, line width=0.72pt] (3.971,1.985) -- (3.971,2.055);
	% \draw[wforange, line width=0.72pt] (4.286,1.985) -- (4.286,2.055);
	% \filldraw[wforange] (4.083,1.975) rectangle (4.173,2.065);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,1.480) {FengWu};
	% \draw[wfslate, line width=0.72pt] (4.036,1.350) -- (4.036,1.350);
	% \draw[wfslate, line width=0.72pt] (4.036,1.315) -- (4.036,1.385);
	% \draw[wfslate, line width=0.72pt] (4.036,1.315) -- (4.036,1.385);
	% \filldraw[wfslate] (4.036,1.350) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (4.028,1.610) -- (4.127,1.610);
	% \draw[wforange, line width=0.72pt] (4.028,1.575) -- (4.028,1.645);
	% \draw[wforange, line width=0.72pt] (4.127,1.575) -- (4.127,1.645);
	% \filldraw[wforange] (4.032,1.565) rectangle (4.122,1.655);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,1.070) {FuXi};
	% \draw[wfslate, line width=0.72pt] (4.036,0.940) -- (4.036,0.940);
	% \draw[wfslate, line width=0.72pt] (4.036,0.905) -- (4.036,0.975);
	% \draw[wfslate, line width=0.72pt] (4.036,0.905) -- (4.036,0.975);
	% \filldraw[wfslate] (4.036,0.940) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (4.029,1.200) -- (4.087,1.200);
	% \draw[wforange, line width=0.72pt] (4.029,1.165) -- (4.029,1.235);
	% \draw[wforange, line width=0.72pt] (4.087,1.165) -- (4.087,1.235);
	% \filldraw[wforange] (4.013,1.155) rectangle (4.103,1.245);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,0.660) {Pangu-Weather};
	% \draw[wfslate, line width=0.72pt] (4.036,0.530) -- (4.036,0.530);
	% \draw[wfslate, line width=0.72pt] (4.036,0.495) -- (4.036,0.565);
	% \draw[wfslate, line width=0.72pt] (4.036,0.495) -- (4.036,0.565);
	% \filldraw[wfslate] (4.036,0.530) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (4.025,0.790) -- (4.076,0.790);
	% \draw[wforange, line width=0.72pt] (4.025,0.755) -- (4.025,0.825);
	% \draw[wforange, line width=0.72pt] (4.076,0.755) -- (4.076,0.825);
	% \filldraw[wforange] (4.006,0.745) rectangle (4.096,0.835);
	% \node[anchor=east, font=\scriptsize, text=black!82] at (2.320,0.250) {AlphaEarth};
	% \draw[wfslate, line width=0.72pt] (4.700,0.120) -- (6.103,0.120);
	% \draw[wfslate, line width=0.72pt] (4.700,0.085) -- (4.700,0.155);
	% \draw[wfslate, line width=0.72pt] (6.103,0.085) -- (6.103,0.155);
	% \filldraw[wfslate] (5.401,0.120) circle[radius=0.045];
	% \draw[wforange, line width=0.72pt] (3.872,0.380) -- (4.815,0.380);
	% \draw[wforange, line width=0.72pt] (3.872,0.345) -- (3.872,0.415);
	% \draw[wforange, line width=0.72pt] (4.815,0.345) -- (4.815,0.415);
	% \filldraw[wforange] (4.298,0.335) rectangle (4.388,0.425);
	% \end{tikzpicture}%
	% }
	% \caption{\textbf{Fixed-feature selection-regret check (RQ2).} Fixed-feature selection regret $\delta = D(h_D)-D(h_R)$ under union-$F_1$. \textcolor{wfslate}{$\bullet$} uses the full grid; \textcolor{wforange}{$\blacksquare$} uses the top $20\%$ fire-prone cells from training fire frequency. Horizontal intervals use the same colors and show mean $\pm$ std over five seeds, in percentage points.}
	% \vspace{+0.4em}
	% \label{fig:selection_regret_diagnostic}
	% \end{wrapfigure}

	\subsection{Head-Selection Sensitivity Under Fixed Features (RQ2)}
	\label{sec:rq2}

	To answer RQ2, we conduct a fixed-feature check on occupancy and fire spread tasks, holding the frozen feature source, $\mathcal{T}$, $\Omega$, $\Lambda$, and candidate head family $\mathcal{H} \subseteq \mathcal{A}$ fixed while varying only the selection metric between PR-AUC-based and decision-F1-based selection. The resulting selection regret $\delta = D(h_D) - D(h_R)$ measures the decision-score loss induced by metric misalignment. Occupancy results are reported in Figure~\ref{fig:selection_regret_diagnostic} under both global and fire-prone scopes. Full per-seed and per-head details are reported in Appendix~\ref{sec:app_seeded_audits}, and the exact, tolerated, and union regret breakdown is provided in Appendix Table~\ref{tab:appendix_selection_regret_tolerance}.

	The fixed-feature results show that head-selection metrics introduce substantial backbone-dependent variation that is not explained by representation quality alone. Some backbones exhibit near-zero regret, indicating agreement between PR-AUC and decision-F1 selection, while others show large regret concentrated in specific scope-matching settings. Regret is generally larger under the global scope, where severe fire imbalance amplifies misalignment between ranking and decision metrics~\cite{mcdermott2024aurocauprc}. Restricting evaluation to fire-prone scopes typically reduces regret by concentrating evaluation on fire-relevant regions. A similar pattern appears for fire spread, where ranking and decision metrics can favor different heads under the same frozen representation. These results show that selection metrics must be aligned with the evaluation objective as part of the evaluation contract~\cite{traub2024selectiveclassification}.
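
	To make the regret concrete, the following sketch spells out one reading of it. The symbol $R$ for the ranking metric and the explicit $\arg\max$ form are notational assumptions made here; the authoritative definition is the one in Section~\ref{sec:checks}.
	\[
	h_R = \arg\max_{h \in \mathcal{H}} R(h),
	\qquad
	h_D = \arg\max_{h \in \mathcal{H}} D(h),
	\qquad
	\delta = D(h_D) - D(h_R),
	\]
	where $R$ denotes PR-AUC and $D$ the decision $F_1$ under the fixed contract.
	If both heads are selected and scored on the same split, then $\delta \ge 0$ by construction of $h_D$, with $\delta = 0$ exactly when the ranking-selected head also attains the maximal decision score. If selection uses validation data and $\delta$ is reported on test data, negative values are possible.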

	\subsection{Supporting Task Checks (RQ3)}
	\label{sec:rq3}

	To answer RQ3, we evaluate all backbones across the four supporting task contracts, \textit{burned area, analog retrieval, smoke PM$_{2.5}$, and extreme heat}, and examine whether the reference-versus-frozen ordering established under primary tasks generalizes across task forms. A rank overview across all six contracts is provided in Figure~\ref{fig:task_comparator_normalized_map}, which maps backbone-by-task rank positions and makes cross-task ordering shifts directly visible. Native metric values are reported in Table~\ref{tab:supporting_results}. Additional supporting-task diagnostics are reported in Appendix Tables~\ref{tab:app_burned_area_median_acre}, \ref{tab:app_analog_rank_depth}, \ref{tab:app_smoke_high_event}, and~\ref{tab:app_heat_event_pr}.


	The supporting task results produce three qualitatively distinct patterns relative to the primary findings. Burned area largely preserves the reference-versus-frozen ordering seen under occupancy and spread: \ourfm\ leads frozen entries on log-RMSE and Spearman $\rho$, suggesting that the representational advantage of wildfire-specific pretraining generalizes to event-scale regression under a different metric family and providing convergent evidence for the primary claim. Analog retrieval and smoke PM$_{2.5}$ show a different pattern: AlphaEarth matches \ourfm\ closely on both tasks, while atmospheric FMs show near-zero correlation on smoke PM$_{2.5}$. Retrieval and air-quality signals are thus captured comparably by a general remote-sensing backbone, and the primary occupancy advantage does not extend uniformly to these task forms. Extreme heat exhibits the largest variance across the comparator set, with atmospheric FMs ranging from near-reference performance to near-complete failure depending on backbone pretraining domain, while AlphaEarth again matches \ourfm\ closely. The scale of this variance is itself informative: aggregating scores across task forms without respecting contract boundaries would produce rankings dominated by scale artifacts in the extreme heat block rather than by transfer quality. Taken together, these results establish that supporting tasks bound rather than extend the primary claim. They provide useful evidence about where backbone families generalize and where they do not, but they cannot substitute for primary decision task evaluation, and their results must be interpreted within their own task-form contracts.

	\begin{figure}[t]
	\centering
	\vspace{-5mm}

	\includegraphics[width=\textwidth]{figures/fig_rank_heatmap1.pdf}
	\vspace{-2mm}
	\caption{{\textbf{Rank map for supporting task comparison (RQ3).} Each row fixes one task contract $\mathcal{C}$ and ranks the eligible backbones within that contract. The figure shows rank changes across task forms; native metric values are reported in Table~\ref{tab:supporting_results}.}}
	\vspace{-6mm}
	\label{fig:task_comparator_normalized_map}
	\end{figure}

	\begin{table}[t]
	\centering
	\small
	\setlength{\tabcolsep}{3.5pt}
	\renewcommand{\arraystretch}{1.18}
	\caption{%
	\textbf{Supporting task-metric matrix (RQ3).}
	Top: final burned area and analog retrieval.
	Bottom: smoke PM$_{2.5}$ and extreme heat.
	Each block fixes $\mathcal{T}$, $\Lambda$, and $\Omega$; the backbone
	column is shared across paired tasks. The \ourfm\ row is
	separated by a rule as the empirical anchor. \textbf{Bold} marks
	the best value per metric. For error metrics
	lower is better ($\downarrow$); for $F_1$, nDCG, and $r$ higher
	is better ($\uparrow$).
	}
	\label{tab:supporting_results}
	\resizebox{\textwidth}{!}{%
	\begin{tabular}{lcccccc}
	\toprule
	& \multicolumn{3}{c}{\textbf{Burned area}}
	& \multicolumn{3}{c}{\textbf{Analog retrieval}} \\
	\cmidrule(lr){2-4}\cmidrule(lr){5-7}
	\textbf{Backbone}
	& \textbf{log-RMSE$\downarrow$} & \textbf{log-MAE$\downarrow$}
	& \textbf{Spearman$\uparrow$}
	& \textbf{nDCG@10$\uparrow$} & \textbf{log-RMSE$\downarrow$}
	& \textbf{log-MAE$\downarrow$} \\
	\midrule
	\ourfm\
	& \ensuremath{\mathbf{1.1657}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0126}}}
	& \ensuremath{\mathbf{1.0423}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0081}}}
	& \ensuremath{\mathbf{0.6298}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0338}}}
	& \ensuremath{\mathbf{0.5099}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0336}}}
	& \ensuremath{\mathbf{1.1977}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.1029}}}
	& \ensuremath{\mathbf{1.0043}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0759}}} \\
	\midrule
	Prithvi-WxC
	& \ms{1.3630}{0.0681} & \ms{1.2435}{0.0668} & \ms{0.1799}{0.3002}
	& \ms{0.3857}{0.0189} & \ms{1.3908}{0.0938} & \ms{1.2585}{0.0865} \\
	Aurora
	& \ms{1.8658}{0.2009} & \ms{1.6717}{0.1245} & \ms{-0.1156}{0.2982}
	& \ms{0.4046}{0.0144} & \ms{1.3659}{0.0792} & \ms{1.2596}{0.0968} \\
	ClimaX
	& \ms{2.0300}{0.2103} & \ms{1.8443}{0.1528} & \ms{-0.2515}{0.2688}
	& \ms{0.4143}{0.0191} & \ms{1.4526}{0.0926} & \ms{1.2441}{0.1446} \\
	StormCast
	& \ms{1.6679}{0.1438} & \ms{1.4745}{0.1134} & \ms{0.1830}{0.1969}
	& \ms{0.4076}{0.0094} & \ms{1.3663}{0.0781} & \ms{1.2371}{0.1078} \\
	DLWP
	& \ms{1.3070}{0.0980} & \ms{1.1769}{0.0834} & \ms{0.4888}{0.1368}
	& \ms{0.3972}{0.0146} & \ms{1.5351}{0.0802} & \ms{1.3196}{0.0781} \\
	FCN
	& \ms{1.3693}{0.0885} & \ms{1.2599}{0.0723} & \ms{0.3484}{0.1662}
	& \ms{0.4316}{0.0134} & \ms{1.4604}{0.1035} & \ms{1.2351}{0.0586} \\
	FengWu
	& \ms{1.3715}{0.1011} & \ms{1.2604}{0.0820} & \ms{0.3221}{0.2004}
	& \ms{0.4246}{0.0237} & \ms{1.4179}{0.0986} & \ms{1.2233}{0.0915} \\
	FuXi
	& \ms{1.4068}{0.1011} & \ms{1.3023}{0.0789} & \ms{0.2663}{0.2561}
	& \ms{0.4279}{0.0212} & \ms{1.4290}{0.0929} & \ms{1.2236}{0.0961} \\
	Pangu-Weather
	& \ms{1.3280}{0.0735} & \ms{1.2081}{0.0607} & \ms{0.4141}{0.1573}
	& \ms{0.4017}{0.0245} & \ms{1.4235}{0.0731} & \ms{1.2225}{0.0847} \\
	AlphaEarth
	& \ms{2.4068}{0.2841} & \ms{2.0822}{0.2371} & \ms{-0.3428}{0.1716}
	& \ms{0.5086}{0.0440} & \ms{1.2158}{0.1310} & \ms{1.0350}{0.1018} \\
	\bottomrule
	\end{tabular}
	}

	\vspace{4pt}

	\resizebox{\textwidth}{!}{%
	\begin{tabular}{lcccccc}
	\toprule
	& \multicolumn{3}{c}{\textbf{Smoke PM$_{2.5}$}}
	& \multicolumn{3}{c}{\textbf{Extreme heat}} \\
	\cmidrule(lr){2-4}\cmidrule(lr){5-7}
	\textbf{Backbone}
	& \textbf{RMSE$\downarrow$} & \textbf{MAE$\downarrow$}
	& \textbf{Pearson $r\uparrow$}
	& \textbf{RMSE-C$\downarrow$} & \textbf{MAE-C$\downarrow$}
	& \textbf{Exceed.\ $F_1\uparrow$} \\
	\midrule
	\ourfm\
	& \ms{4.4646}{0.0060}
	& \ms{2.4108}{0.0016}
	& \ensuremath{\mathbf{0.6368}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0013}}}
	& \ensuremath{\mathbf{0.2179}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0043}}}
	& \ensuremath{\mathbf{0.1787}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0018}}}
	& \ms{0.9541}{0.0164} \\
	\midrule
	Prithvi-WxC
	& \ms{6.0382}{0.0828} & \ms{3.7301}{0.0055} & \ms{0.0243}{0.0045}
	& \ms{4.6225}{0.0192} & \ms{2.6315}{0.0128} & \ms{0.8693}{0.0023} \\
	Aurora
	& \ms{6.0384}{0.0828} & \ms{3.7265}{0.0055} & \ms{0.0193}{0.0043}
	& \ms{18.0474}{0.0708} & \ms{15.3747}{0.0594} & \ms{0.0951}{0.0038} \\
	ClimaX
	& \ms{6.0402}{0.0828} & \ms{3.7290}{0.0055} & \ms{0.0004}{0.0029}
	& \ms{17.6492}{0.0347} & \ms{14.4938}{0.0319} & \ms{0.7684}{0.0068} \\
	StormCast
	& \ms{6.1230}{0.0830} & \ms{3.8182}{0.0073} & \ms{0.0183}{0.0041}
	& \ms{1.7671}{0.2145} & \ms{1.3507}{0.1576} & \ms{0.9073}{0.0189} \\
	DLWP
	& \ms{5.9289}{0.1031} & \ms{3.7331}{0.0088} & \ms{0.0303}{0.0060}
	& \ms{2.2662}{0.1106} & \ms{1.7153}{0.0748} & \ms{0.9156}{0.0112} \\
	FCN
	& \ms{5.9277}{0.1033} & \ms{3.7345}{0.0088} & \ms{0.0312}{0.0062}
	& \ms{2.1657}{0.1800} & \ms{1.6033}{0.1039} & \ms{0.9257}{0.0096} \\
	FengWu
	& \ms{5.9297}{0.1032} & \ms{3.7395}{0.0088} & \ms{0.0304}{0.0063}
	& \ms{2.1266}{0.1589} & \ms{1.5801}{0.1004} & \ms{0.0481}{0.0459} \\
	FuXi
	& \ms{5.9319}{0.1029} & \ms{3.7398}{0.0088} & \ms{0.0299}{0.0061}
	& \ms{2.1282}{0.0969} & \ms{1.5759}{0.0719} & \ms{0.2268}{0.0623} \\
	Pangu-Weather
	& \ms{5.9270}{0.1036} & \ms{3.7320}{0.0088} & \ms{0.0301}{0.0060}
	& \ms{2.2045}{0.1483} & \ms{1.6307}{0.0889} & \ms{0.0199}{0.0062} \\
	AlphaEarth
	& \ensuremath{\mathbf{4.4403}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0488}}}
	& \ensuremath{\mathbf{2.3992}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0056}}}
	& \ms{0.6347}{0.0066}
	& \ms{0.2194}{0.0039}
	& \ms{0.1800}{0.0014}
	& \ensuremath{\mathbf{0.9542}{\mkern1mu}_{\scriptscriptstyle \boldsymbol{\pm}\mathbf{0.0107}}} \\
	\bottomrule
	\end{tabular}
	}
	\end{table}


	\paragraph{Pattern 1: primary pattern preserved (burned area).}
	\ourfm\ leads all frozen entries on log-RMSE and Spearman $\rho$. The ordering observed under occupancy and spread is preserved under burned-area regression despite the different prediction unit and metric family.

	\paragraph{Pattern 2: primary pattern bounded (analog retrieval and smoke PM$_{2.5}$).}
	For analog retrieval, AlphaEarth matches \ourfm\ (nDCG@10 $= 0.51 \pm 0.04$ vs.\ $0.51 \pm 0.03$). For smoke PM$_{2.5}$, AlphaEarth also matches \ourfm\ on MAE and Pearson $r$, while atmospheric Earth FMs show near-zero correlation. These results show that the occupancy-and-spread ordering does not fully extend to all supporting tasks once AlphaEarth is included.

	\paragraph{Pattern 3: primary pattern bounded with large variance (extreme heat).} AlphaEarth matches \ourfm\ on RMSE-C and remains close on exceedance $F_1$, while atmospheric FMs range from RMSE-C $= 1.77$ (StormCast) to $18.05$ (Aurora). This large spread indicates that aggregated scores across task forms would be dominated by scale artifacts rather than transfer quality, reinforcing the need for per-contract reporting established in Section~\ref{sec:eval}.
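
	To illustrate the scale effect with the means in Table~\ref{tab:supporting_results}, consider the across-backbone range (maximum minus minimum) of the first error metric in each block. Summing these ranges is a hypothetical aggregation used here only for illustration; it is not a score reported in this paper, and the four metrics are in different units.
	\[
	\begin{array}{lrcl}
	\mbox{burned area (log-RMSE):} & 2.4068 - 1.1657 & = & 1.2411, \\
	\mbox{analog retrieval (log-RMSE):} & 1.5351 - 1.1977 & = & 0.3374, \\
	\mbox{smoke PM$_{2.5}$ (RMSE):} & 6.1230 - 4.4403 & = & 1.6827, \\
	\mbox{extreme heat (RMSE-C):} & 18.0474 - 0.2179 & = & 17.8295.
	\end{array}
	\]
	The extreme-heat block accounts for $17.8295 / 21.0907 \approx 85\%$ of the summed range, so an unnormalized sum of these errors would order backbones mainly by their RMSE-C values.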

	\textit{Answer to RQ3:} Figure~\ref{fig:task_comparator_normalized_map} and Table~\ref{tab:supporting_results} show that burned area preserves the primary reference-versus-frozen pattern under a different metric family. Analog retrieval, smoke PM$_{2.5}$, and extreme heat bound this pattern: AlphaEarth matches or approaches \ourfm\ on these tasks, indicating that the primary occupancy and spread claims do not extend uniformly across all task forms.