sdd-distiller-v1

Semantic DOM Distiller — An ONNX model that scores the importance of each DOM node from 0.0 to 1.0.

Overview

Modern websites have extremely complex and obfuscated DOM structures. Feeding raw DOM trees into LLMs leads to inflated token costs and degraded reasoning accuracy due to noise.

sdd-distiller-v1 is the core scoring model of the Semantic DOM Distiller (SDD) pipeline — a DOM-to-Specification preprocessing engine optimized for multimodal AI agents such as Amazon Nova Act.

Given a 41-dimensional feature vector representing a DOM node, the model outputs an importance score between 0.0 and 1.0. Nodes that score below a configurable threshold are pruned, leaving a lean "functional DOM tree" that contains only what the AI needs to understand the page.
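As a rough illustration of the pruning step, the sketch below drops every node whose score falls under the threshold and re-attaches surviving descendants to the nearest kept ancestor. The `Node` class, the `prune` function, and the decision to promote children of pruned nodes are assumptions made for this example; the actual SDD pipeline may rebuild the tree differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical scored DOM node; not part of the SDD API."""
    tag: str
    score: float
    children: list[Node] = field(default_factory=list)


def prune(node: Node, threshold: float = 0.3) -> list[Node]:
    """Return the kept subtree(s) rooted at `node`.

    A node scoring below `threshold` is dropped, and its surviving
    descendants are promoted to the nearest kept ancestor (an assumption).
    """
    kept_children = [kept for child in node.children for kept in prune(child, threshold)]
    if node.score >= threshold:
        return [Node(node.tag, node.score, kept_children)]
    return kept_children


# Made-up scores for a small page fragment.
page = Node("div", 0.10, [
    Node("form", 0.80, [
        Node("div", 0.05, [Node("input", 0.90)]),
        Node("button", 0.95),
    ]),
    Node("span", 0.02),
])

for root in prune(page):
    print(root.tag, [child.tag for child in root.children])
# prints: form ['input', 'button']
```

Returning a list rather than a single node lets a pruned root hand its kept children upward without special-casing the top of the tree.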

Model Details

Item	Value
Task	DOM Node Importance Regression (0.0–1.0)
Architecture	GradientBoostingRegressor → ONNX
Input shape	`(1, 41)` float32
Output shape	`(1, 1)` float32
Model size	~449 KB
RMSE	0.0347
R²	0.9762
Threshold accuracy	97.01% (t=0.3)
Training samples	50,000 (synthetic)
ONNX opset	17
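For readers who want to reproduce the metrics in the table, the sketch below shows one way to compute them. RMSE and R² follow their standard definitions. The reading of "threshold accuracy" as the share of nodes where the prediction and the target fall on the same side of t is an assumption, and the sample values are made up rather than taken from the model's evaluation.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def threshold_accuracy(y_true, y_pred, t=0.3):
    """Share of nodes where target and prediction agree on keep (>= t) vs. prune."""
    agree = sum((yt >= t) == (yp >= t) for yt, yp in zip(y_true, y_pred))
    return agree / len(y_true)


# Made-up teacher scores and model predictions, for illustration only.
teacher = [0.90, 0.10, 0.35, 0.28, 0.60]
student = [0.85, 0.15, 0.29, 0.25, 0.65]

print(f"RMSE: {rmse(teacher, student):.4f}")
print(f"R2:   {r2(teacher, student):.4f}")
print(f"Threshold accuracy (t=0.3): {threshold_accuracy(teacher, student):.2%}")
```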

Input Features (41 dimensions)

#	Feature	Description
0	`isHighValueTag`	1 if tag is button / input / a / form / nav / h1–h3 etc.
1	`isMediumValueTag`	1 if tag is h4–h6 / label / section / article etc.
2	`isContainerTag`	1 if tag is div / span / p / section etc.
3	`isInteractive`	1 if element is interactable (click, type, select)
4	`isClickable`	1 if element has click handlers or is a link/button
5	`hasTabIndex`	1 if tabIndex >= 0
6	`hasRole`	1 if explicit or implicit ARIA role exists
7	`roleBaseScore`	Base importance score derived from WAI-ARIA role (0.0–1.0)
8	`hasAriaLabel`	1 if `aria-label` attribute is present
9	`hasAriaLabelledBy`	1 if `aria-labelledby` is present
10	`hasAriaDescribedBy`	1 if `aria-describedby` is present
11	`hasAriaRequired`	1 if `aria-required="true"`
12	`hasAriaExpanded`	1 if `aria-expanded` is present
13	`hasAriaLive`	1 if `aria-live` is present
14	`hasTestId`	1 if `data-testid` / `data-cy` / `data-test` is present
15	`hasText`	1 if direct text content exists
16	`textLength`	Normalized text length (0.0–1.0, capped at 200 chars)
17	`isLabelText`	1 if text is short and label-like (< 50 chars)
18	`isActionText`	1 if text contains action words (submit, save, login, 送信 ("send"), etc.)
19	`childCount`	Normalized child element count (0.0–1.0, capped at 10)
20	`hasChildren`	1 if element has child elements
21	`isLeaf`	1 if element has no children
22	`depth`	Normalized nesting depth (0.0–1.0, capped at 20)
23	`depthPenalty`	Multiplier penalizing deeply nested nodes (1.0 → 0.25)
24	`fontSizeNorm`	Normalized font size (0.0–1.0, capped at 72px)
25	`isBold`	1 if font-weight >= 600
26	`areaRatio`	Element area / viewport area (0.0–1.0)
27	`isAboveFold`	1 if element is within the initial viewport (y < 800px)
28	`isLargeElement`	1 if width > 200px and height > 30px
29	`hasHref`	1 if `href` attribute is present
30	`hasAlt`	1 if `alt` attribute is present
31	`hasPlaceholder`	1 if `placeholder` attribute is present
32	`isRequired`	1 if `required` attribute is present
33	`isDisabled`	1 if `disabled` attribute is present
34	`inputType`	Importance score derived from `input[type]` (0.0–1.0)
35	`headingLevel`	Importance derived from heading level (h1 → 1.0, h6 → 0.17)
36	`parentIsForm`	1 if an ancestor is a `<form>` element
37	`parentIsNav`	1 if an ancestor is a `<nav>` element
38	`parentIsTable`	1 if an ancestor is a `<table>` element
39	`parentIsInteractive`	1 if an ancestor is interactive
40	`ancestorScore`	Decayed importance score propagated from ancestors
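Several features in the table above share a "clip at a cap, then divide" normalization. The sketch below applies that pattern to build a partial vector for a hypothetical Submit button. The helper names (`capped`, `heading_score`, `button_vector`) are illustrative, the `(7 - level) / 6` heading formula is an assumption that happens to match both endpoints in the table, and `ACTION_WORDS` contains only the examples the table lists.

```python
N_FEATURES = 41

# Indices copied from the feature table (only the ones used here).
IDX = {
    "isHighValueTag": 0, "isInteractive": 3, "isClickable": 4,
    "hasText": 15, "textLength": 16, "isLabelText": 17, "isActionText": 18,
    "isLeaf": 21, "depth": 22, "fontSizeNorm": 24, "parentIsForm": 36,
}

# Only the examples given in the table; the real list is longer ("etc.").
ACTION_WORDS = ("submit", "save", "login", "送信")


def capped(value: float, cap: float) -> float:
    """Normalize to the 0.0-1.0 range by clipping at `cap`."""
    return min(max(value, 0.0), cap) / cap


def heading_score(level: int) -> float:
    """Assumed formula; gives h1 -> 1.0 and h6 -> about 0.17."""
    return (7 - level) / 6


def button_vector(text: str, depth: int, font_px: float) -> list[float]:
    """Partial feature vector for a hypothetical leaf <button> inside a <form>."""
    vec = [0.0] * N_FEATURES
    for name in ("isHighValueTag", "isInteractive", "isClickable", "isLeaf", "parentIsForm"):
        vec[IDX[name]] = 1.0
    vec[IDX["hasText"]] = 1.0 if text else 0.0
    vec[IDX["textLength"]] = capped(len(text), 200)
    vec[IDX["isLabelText"]] = 1.0 if 0 < len(text) < 50 else 0.0
    vec[IDX["isActionText"]] = 1.0 if any(w in text.lower() for w in ACTION_WORDS) else 0.0
    vec[IDX["depth"]] = capped(depth, 20)
    vec[IDX["fontSizeNorm"]] = capped(font_px, 72)
    # Remaining features (depthPenalty, ancestorScore, ...) stay at 0.0 for
    # brevity; a real extractor would fill all 41 slots.
    return vec


vec = button_vector("Submit", depth=5, font_px=18)
print(len(vec), vec[IDX["textLength"]], vec[IDX["depth"]], vec[IDX["fontSizeNorm"]])
```

To feed such a list to the model, it can be wrapped as `np.asarray([vec], dtype=np.float32)`, which yields the `(1, 41)` input shape.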

Usage

Node.js (`onnxruntime-node`)

import * as ort from 'onnxruntime-node';

const session = await ort.InferenceSession.create('./sdd-distiller-v1.onnx');

// Build a 41-dim feature vector (Float32Array) for a DOM node
const featureVector = new Float32Array(41);
// featureVector[0] = isHighValueTag, featureVector[3] = isInteractive, ...

const input = new ort.Tensor('float32', featureVector, [1, 41]);
const result = await session.run({ input });
const score = result.output.data[0]; // 0.0 ~ 1.0

Browser (`onnxruntime-web`)

import * as ort from 'onnxruntime-web';

ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.3/dist/';
const session = await ort.InferenceSession.create('./sdd-distiller-v1.onnx');

const featureVector = new Float32Array(41);
const input = new ort.Tensor('float32', featureVector, [1, 41]);
const result = await session.run({ input });
const score = result.output.data[0]; // 0.0 ~ 1.0

Python (`onnxruntime`)

import onnxruntime as rt
import numpy as np

sess = rt.InferenceSession("sdd-distiller-v1.onnx")

# Build a (1, 41) float32 feature matrix
feature_vector = np.zeros((1, 41), dtype=np.float32)
# feature_vector[0, 0] = 1.0  # isHighValueTag
# feature_vector[0, 3] = 1.0  # isInteractive
# ...

# The output has shape (1, 1); index both axes to get a plain float for formatting
score = float(sess.run(["output"], {"input": feature_vector})[0][0, 0])
print(f"Importance score: {score:.3f}")  # 0.0 ~ 1.0
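A page usually has many nodes to score. Model Details gives the input shape as `(1, 41)`, and it is not stated whether the exported model accepts larger batches, so the conservative sketch below scores one row per call. `score_nodes` is a hypothetical helper, and `ConstantSession` is a stand-in that lets the example run without the model file. The tensor names `input` and `output` are taken from the usage snippets above.

```python
def score_nodes(session, features, input_name="input", output_name="output"):
    """Score each row of `features` (shape (N, 41)) with one (1, 41) call per row.

    `session` is any object with an onnxruntime-style `run(output_names, feeds)`.
    """
    scores = []
    for i in range(len(features)):
        row = features[i:i + 1]  # slicing keeps the batch axis: shape (1, 41)
        (out,) = session.run([output_name], {input_name: row})
        scores.append(float(out[0][0]))  # each output has shape (1, 1)
    return scores


class ConstantSession:
    """Stand-in for onnxruntime.InferenceSession; always returns 0.5."""

    def run(self, output_names, feeds):
        (batch,) = feeds.values()
        return [[[0.5] for _ in batch]]


rows = [[0.0] * 41 for _ in range(3)]
print(score_nodes(ConstantSession(), rows))  # prints: [0.5, 0.5, 0.5]
```

With the real model, pass the `sess` object created above and an `(N, 41)` float32 NumPy array in place of `rows`.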

Training

Training data: 50,000 synthetic samples generated by using the heuristic scorer as a teacher
Base learner: sklearn.ensemble.GradientBoostingRegressor (n_estimators=200, max_depth=5, learning_rate=0.05)
Export tool: skl2onnx with ONNX opset 17
Training script: scripts/train_model.py

Top Feature Importances

Feature	Importance
`parentIsForm`	0.2961
`isInteractive`	0.1637
`isHighValueTag`	0.1191
`depthPenalty`	0.1008
`depth`	0.0942
`roleBaseScore`	0.0896
`isDisabled`	0.0334
`parentIsNav`	0.0211
`hasHref`	0.0198
`isContainerTag`	0.0169
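Assuming these values come from scikit-learn's `feature_importances_`, which are normalized to sum to 1.0 across all features, a running total shows how concentrated the model is. The sketch below accumulates the ten values from the table; the variable names are illustrative.

```python
from itertools import accumulate

# Values copied from the table above, in the same order.
top_importances = {
    "parentIsForm": 0.2961,
    "isInteractive": 0.1637,
    "isHighValueTag": 0.1191,
    "depthPenalty": 0.1008,
    "depth": 0.0942,
    "roleBaseScore": 0.0896,
    "isDisabled": 0.0334,
    "parentIsNav": 0.0211,
    "hasHref": 0.0198,
    "isContainerTag": 0.0169,
}

cumulative = list(accumulate(top_importances.values()))
for name, total in zip(top_importances, cumulative):
    print(f"{name:<16}{total:.4f}")
```

Under that assumption, the first six features carry about 0.86 of the total importance and all ten about 0.95, leaving roughly 0.05 spread across the other 31 features.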
