sdd-distiller-v1
Semantic DOM Distiller: an ONNX model that scores the importance of each DOM node from 0.0 to 1.0.
GitHub: watilde/sdd | Live Demo
Overview
Modern websites have extremely complex and obfuscated DOM structures. Feeding raw DOM trees into LLMs leads to inflated token costs and degraded reasoning accuracy due to noise.
sdd-distiller-v1 is the core scoring model of the Semantic DOM Distiller (SDD) pipeline, a DOM-to-Specification preprocessing engine optimized for multimodal AI agents such as Amazon Nova Act.
Given a 41-dimensional feature vector representing a DOM node, the model outputs an importance score between 0.0 and 1.0. Nodes below a configurable threshold are pruned, reconstructing a lean "functional DOM tree" that contains only what the AI needs to understand the page.
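The pruning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the SDD implementation: the `Node` class and `prune` function are hypothetical names, and the rule that surviving descendants of a pruned node are hoisted to the nearest kept ancestor is an assumption, since the text does not say how orphaned children are handled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node that already has a score."""
    tag: str
    score: float
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, threshold: float = 0.3) -> List[Node]:
    """Return the kept subtree(s) rooted at `node`.

    A node scoring below `threshold` is dropped. Its surviving
    descendants are hoisted to the nearest kept ancestor (assumption).
    """
    kept_children: List[Node] = []
    for child in node.children:
        kept_children.extend(prune(child, threshold))
    if node.score >= threshold:
        return [Node(node.tag, node.score, kept_children)]
    return kept_children


page = Node("body", 0.9, [
    Node("div", 0.1, [Node("button", 0.95), Node("span", 0.05)]),
    Node("nav", 0.8, [Node("a", 0.7)]),
])
lean = prune(page)[0]
print([child.tag for child in lean.children])  # ['button', 'nav']
```

The low-scoring `div` wrapper disappears while the `button` inside it survives, which is the effect a "functional DOM tree" is meant to have.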
Model Details
| Item | Value |
|---|---|
| Task | DOM Node Importance Regression (0.0–1.0) |
| Architecture | GradientBoostingRegressor → ONNX |
| Input shape | (1, 41) float32 |
| Output shape | (1, 1) float32 |
| Model size | ~449 KB |
| RMSE | 0.0347 |
| R² | 0.9762 |
| Threshold accuracy | 97.01% (t=0.3) |
| Training samples | 50,000 (synthetic) |
| ONNX opset | 17 |
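For readers who want to reproduce the three quality metrics in the table, the sketch below shows one way to compute them. The RMSE and R² formulas are the standard ones. The definition of "threshold accuracy" is an assumption: it is taken here to mean the fraction of samples where the prediction and the target fall on the same side of the threshold `t`. The sample values are made up for illustration.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def threshold_accuracy(y_true: Sequence[float], y_pred: Sequence[float], t: float = 0.3) -> float:
    """Share of samples whose keep/prune decision matches (assumed definition)."""
    agree = sum((a >= t) == (b >= t) for a, b in zip(y_true, y_pred))
    return agree / len(y_true)


# Illustrative values only, not taken from the real evaluation set.
y_true = [0.10, 0.40, 0.90, 0.25]
y_pred = [0.15, 0.35, 0.85, 0.32]
print(rmse(y_true, y_pred), r2(y_true, y_pred), threshold_accuracy(y_true, y_pred))
```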
Input Features (41 dimensions)
| # | Feature | Description |
|---|---|---|
| 0 | `isHighValueTag` | 1 if tag is button / input / a / form / nav / h1–h3 etc. |
| 1 | `isMediumValueTag` | 1 if tag is h4–h6 / label / section / article etc. |
| 2 | `isContainerTag` | 1 if tag is div / span / p / section etc. |
| 3 | `isInteractive` | 1 if element is interactable (click, type, select) |
| 4 | `isClickable` | 1 if element has click handlers or is a link/button |
| 5 | `hasTabIndex` | 1 if tabIndex >= 0 |
| 6 | `hasRole` | 1 if explicit or implicit ARIA role exists |
| 7 | `roleBaseScore` | Base importance score derived from WAI-ARIA role (0.0–1.0) |
| 8 | `hasAriaLabel` | 1 if aria-label attribute is present |
| 9 | `hasAriaLabelledBy` | 1 if aria-labelledby is present |
| 10 | `hasAriaDescribedBy` | 1 if aria-describedby is present |
| 11 | `hasAriaRequired` | 1 if aria-required="true" |
| 12 | `hasAriaExpanded` | 1 if aria-expanded is present |
| 13 | `hasAriaLive` | 1 if aria-live is present |
| 14 | `hasTestId` | 1 if data-testid / data-cy / data-test is present |
| 15 | `hasText` | 1 if direct text content exists |
| 16 | `textLength` | Normalized text length (0.0–1.0, capped at 200 chars) |
| 17 | `isLabelText` | 1 if text is short and label-like (< 50 chars) |
| 18 | `isActionText` | 1 if text contains action words (submit, save, login, 送信, etc.) |
| 19 | `childCount` | Normalized child element count (0.0–1.0, capped at 10) |
| 20 | `hasChildren` | 1 if element has child elements |
| 21 | `isLeaf` | 1 if element has no children |
| 22 | `depth` | Normalized nesting depth (0.0–1.0, capped at 20) |
| 23 | `depthPenalty` | Multiplier penalizing deeply nested nodes (1.0 → 0.25) |
| 24 | `fontSizeNorm` | Normalized font size (0.0–1.0, capped at 72px) |
| 25 | `isBold` | 1 if font-weight >= 600 |
| 26 | `areaRatio` | Element area / viewport area (0.0–1.0) |
| 27 | `isAboveFold` | 1 if element is within the initial viewport (y < 800px) |
| 28 | `isLargeElement` | 1 if width > 200px and height > 30px |
| 29 | `hasHref` | 1 if href attribute is present |
| 30 | `hasAlt` | 1 if alt attribute is present |
| 31 | `hasPlaceholder` | 1 if placeholder attribute is present |
| 32 | `isRequired` | 1 if required attribute is present |
| 33 | `isDisabled` | 1 if disabled attribute is present |
| 34 | `inputType` | Importance score derived from input[type] (0.0–1.0) |
| 35 | `headingLevel` | Importance derived from heading level (h1 → 1.0, h6 → 0.17) |
| 36 | `parentIsForm` | 1 if an ancestor is a `<form>` element |
| 37 | `parentIsNav` | 1 if an ancestor is a `<nav>` element |
| 38 | `parentIsTable` | 1 if an ancestor is a `<table>` element |
| 39 | `parentIsInteractive` | 1 if an ancestor is interactive |
| 40 | `ancestorScore` | Decayed importance score propagated from ancestors |
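Filling the vector by index is error-prone, so a name-to-index mapping can help. The sketch below builds a 41-element vector from a dictionary keyed by the feature names in the table. The helpers `capped`, `heading_importance` and `build_vector` are hypothetical, and two details are assumptions: capped features are scaled linearly (value / cap), and the heading mapping is linear in the level, which happens to reproduce the documented endpoints h1 → 1.0 and h6 → 0.17.

```python
from typing import Dict, List

# Order matches the "#" column of the feature table.
FEATURE_NAMES: List[str] = [
    "isHighValueTag", "isMediumValueTag", "isContainerTag", "isInteractive",
    "isClickable", "hasTabIndex", "hasRole", "roleBaseScore",
    "hasAriaLabel", "hasAriaLabelledBy", "hasAriaDescribedBy", "hasAriaRequired",
    "hasAriaExpanded", "hasAriaLive", "hasTestId", "hasText",
    "textLength", "isLabelText", "isActionText", "childCount",
    "hasChildren", "isLeaf", "depth", "depthPenalty",
    "fontSizeNorm", "isBold", "areaRatio", "isAboveFold",
    "isLargeElement", "hasHref", "hasAlt", "hasPlaceholder",
    "isRequired", "isDisabled", "inputType", "headingLevel",
    "parentIsForm", "parentIsNav", "parentIsTable", "parentIsInteractive",
    "ancestorScore",
]


def capped(value: float, cap: float) -> float:
    """Scale `value` into [0.0, 1.0], saturating at `cap` (assumed linear)."""
    return max(0.0, min(value / cap, 1.0))


def heading_importance(level: int) -> float:
    """Assumed linear mapping from heading level 1..6 to 1.0..~0.17."""
    return (7 - level) / 6


def build_vector(features: Dict[str, float]) -> List[float]:
    """Place named features at their table index; unset features stay 0.0."""
    unknown = set(features) - set(FEATURE_NAMES)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    return [float(features.get(name, 0.0)) for name in FEATURE_NAMES]


# A "Sign in" button nested six levels deep inside a form.
vec = build_vector({
    "isHighValueTag": 1, "isInteractive": 1, "isClickable": 1,
    "hasText": 1, "textLength": capped(len("Sign in"), 200),
    "isLabelText": 1, "isLeaf": 1,
    "depth": capped(6, 20), "fontSizeNorm": capped(16, 72),
    "parentIsForm": 1,
})
print(len(vec))  # 41
```

For inference the list would still need to be converted to a float32 array of shape (1, 41), for example with `np.asarray(vec, dtype=np.float32).reshape(1, 41)`.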
Usage
Node.js (onnxruntime-node)
```javascript
import * as ort from 'onnxruntime-node';

const session = await ort.InferenceSession.create('./sdd-distiller-v1.onnx');

// Build a 41-dim feature vector (Float32Array) for a DOM node
const featureVector = new Float32Array(41);
// featureVector[0] = isHighValueTag, featureVector[3] = isInteractive, ...

const input = new ort.Tensor('float32', featureVector, [1, 41]);
const result = await session.run({ input });
const score = result.output.data[0]; // 0.0 ~ 1.0
```
Browser (onnxruntime-web)
```javascript
import * as ort from 'onnxruntime-web';

// Load the WASM runtime files from the CDN
ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.3/dist/';

const session = await ort.InferenceSession.create('./sdd-distiller-v1.onnx');

const featureVector = new Float32Array(41);
const input = new ort.Tensor('float32', featureVector, [1, 41]);
const result = await session.run({ input });
const score = result.output.data[0]; // 0.0 ~ 1.0
```
Python (onnxruntime)
```python
import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession("sdd-distiller-v1.onnx")

# Build a (1, 41) float32 feature matrix
feature_vector = np.zeros((1, 41), dtype=np.float32)
# feature_vector[0, 0] = 1.0  # isHighValueTag
# feature_vector[0, 3] = 1.0  # isInteractive
# ...

# The output has shape (1, 1); extract the single element as a Python float
# so that it can be formatted with ":.3f"
score = float(sess.run(["output"], {"input": feature_vector})[0][0, 0])
print(f"Importance score: {score:.3f}")  # 0.0 ~ 1.0
```
Training
- Training data: 50,000 synthetic samples generated by using the heuristic scorer as a teacher
- Base learner: `sklearn.ensemble.GradientBoostingRegressor` (200 estimators, max_depth=5, lr=0.05)
- Export tool: `skl2onnx` with ONNX opset 17
- Training script: scripts/train_model.py
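The teacher-student setup in the list above can be illustrated with a small, dependency-free sketch of the data generation step. Everything here is hypothetical: `toy_teacher` uses invented weights and is not the project's heuristic scorer, and `make_dataset` samples features uniformly at random, which may differ from what scripts/train_model.py does.

```python
import random
from typing import Callable, List, Sequence, Tuple

N_FEATURES = 41
# Indices from the feature table: isHighValueTag, isInteractive, parentIsForm.
BINARY_FLAGS = (0, 3, 36)


def toy_teacher(x: Sequence[float]) -> float:
    """Invented heuristic used only for illustration (NOT the real scorer)."""
    raw = 0.3 * x[0] + 0.4 * x[3] + 0.3 * x[36]
    return max(0.0, min(raw, 1.0))


def make_dataset(
    n: int, teacher: Callable[[Sequence[float]], float], seed: int = 0
) -> Tuple[List[List[float]], List[float]]:
    """Sample `n` random feature vectors and label each with the teacher."""
    rng = random.Random(seed)
    X: List[List[float]] = []
    y: List[float] = []
    for _ in range(n):
        x = [rng.random() for _ in range(N_FEATURES)]
        for i in BINARY_FLAGS:
            x[i] = float(rng.random() < 0.5)  # binary features are 0.0 or 1.0
        X.append(x)
        y.append(teacher(x))
    return X, y


X, y = make_dataset(1000, toy_teacher)
print(len(X), len(X[0]), min(y), max(y))
```

With scikit-learn installed, a student matching the listed hyperparameters could then be fit with `GradientBoostingRegressor(n_estimators=200, max_depth=5, learning_rate=0.05).fit(X, y)`. The real data generation and ONNX export live in scripts/train_model.py.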
Top Feature Importances
| Feature | Importance |
|---|---|
| `parentIsForm` | 0.2961 |
| `isInteractive` | 0.1637 |
| `isHighValueTag` | 0.1191 |
| `depthPenalty` | 0.1008 |
| `depth` | 0.0942 |
| `roleBaseScore` | 0.0896 |
| `isDisabled` | 0.0334 |
| `parentIsNav` | 0.0211 |
| `hasHref` | 0.0198 |
| `isContainerTag` | 0.0169 |
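As a quick sanity check on the table, the listed importances can be summed to see how much of the model's total importance the top ten features carry. scikit-learn normalizes `feature_importances_` to sum to 1.0, so the sum can be read as a share of the whole; the snippet assumes the values above come from that attribute.

```python
top_importances = {
    "parentIsForm": 0.2961,
    "isInteractive": 0.1637,
    "isHighValueTag": 0.1191,
    "depthPenalty": 0.1008,
    "depth": 0.0942,
    "roleBaseScore": 0.0896,
    "isDisabled": 0.0334,
    "parentIsNav": 0.0211,
    "hasHref": 0.0198,
    "isContainerTag": 0.0169,
}

# Share of total importance covered by the ten listed features.
total = sum(top_importances.values())
print(f"{total:.4f}")  # 0.9547

# Number of top features needed to pass 80% cumulative importance.
running, needed = 0.0, 0
for value in top_importances.values():
    running += value
    needed += 1
    if running >= 0.8:
        break
print(needed)  # 6
```

Under that assumption, the remaining 31 features together contribute less than 5% of the total importance.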
Links
- GitHub: https://github.com/watilde/sdd
- Live Demo: https://watilde.github.io/sdd/
- npm: `semantic-dom-distiller` (coming soon)