sdd-distiller-v1

Semantic DOM Distiller: an ONNX model that scores the importance of each DOM node from 0.0 to 1.0.

GitHub: watilde/sdd | Live Demo

Overview

Modern websites have extremely complex and obfuscated DOM structures. Feeding raw DOM trees into LLMs leads to inflated token costs and degraded reasoning accuracy due to noise.

sdd-distiller-v1 is the core scoring model of the Semantic DOM Distiller (SDD) pipeline, a DOM-to-Specification preprocessing engine optimized for multimodal AI agents such as Amazon Nova Act.

Given a 41-dimensional feature vector representing a DOM node, the model outputs an importance score between 0.0 and 1.0. Nodes below a configurable threshold are pruned, reconstructing a lean "functional DOM tree" that contains only what the AI needs to understand the page.
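The pruning step described above can be sketched as follows. This is a minimal illustration, not the SDD implementation: the `Node` class, the `prune` function, and the rule that surviving descendants of a pruned node are hoisted to its parent are all assumptions. The source only states that nodes below a configurable threshold are removed. The default of 0.3 mirrors the threshold quoted in the model details.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node that already has a model score."""
    tag: str
    score: float  # importance score from the model, 0.0 to 1.0
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, threshold: float = 0.3) -> List[Node]:
    """Return the nodes that replace `node` after pruning.

    A node scoring at or above the threshold is kept, with its children
    pruned recursively. A node below the threshold is dropped and its
    surviving descendants are hoisted to its parent (an assumed policy).
    """
    kept_children: List[Node] = []
    for child in node.children:
        kept_children.extend(prune(child, threshold))
    if node.score >= threshold:
        return [Node(node.tag, node.score, kept_children)]
    return kept_children


page = Node("body", 0.9, [
    Node("div", 0.05, [Node("button", 0.95), Node("span", 0.1)]),
    Node("nav", 0.7, [Node("a", 0.8)]),
])
lean = prune(page)[0]
print([child.tag for child in lean.children])  # ['button', 'nav']
```

Hoisting keeps an important button reachable even when its wrapper div scores low. A pipeline that drops whole subtrees instead would be a different, equally plausible policy.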

Model Details

| Item | Value |
| --- | --- |
| Task | DOM Node Importance Regression (0.0–1.0) |
| Architecture | GradientBoostingRegressor → ONNX |
| Input shape | (1, 41) float32 |
| Output shape | (1, 1) float32 |
| Model size | ~449 KB |
| RMSE | 0.0347 |
| R² | 0.9762 |
| Threshold accuracy | 97.01% (t=0.3) |
| Training samples | 50,000 (synthetic) |
| ONNX opset | 17 |
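For readers who want to compute the same kinds of metrics on their own data, the sketch below calculates RMSE, R², and a threshold accuracy. The definition of threshold accuracy used here (the share of samples whose prediction and target fall on the same side of t) is an assumption, since the source does not define it. The function name and the sample values are illustrative only.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    y_true: Sequence[float], y_pred: Sequence[float], t: float = 0.3
) -> Dict[str, float]:
    """Compute RMSE, R^2, and (assumed) threshold accuracy at cutoff t."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares (ss_tot must be non-zero)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    # Count samples where keep/prune decisions agree
    agree = sum((a >= t) == (b >= t) for a, b in zip(y_true, y_pred))
    return {
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
        "threshold_accuracy": agree / n,
    }


targets = [0.10, 0.25, 0.40, 0.90]
predictions = [0.12, 0.32, 0.38, 0.85]
metrics = regression_metrics(targets, predictions)
print(metrics)
```

In this toy data the second sample is the only disagreement: its target (0.25) would be pruned at t=0.3 while its prediction (0.32) would be kept.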

Input Features (41 dimensions)

| # | Feature | Description |
| --- | --- | --- |
| 0 | isHighValueTag | 1 if tag is button / input / a / form / nav / h1–h3 etc. |
| 1 | isMediumValueTag | 1 if tag is h4–h6 / label / section / article etc. |
| 2 | isContainerTag | 1 if tag is div / span / p / section etc. |
| 3 | isInteractive | 1 if element is interactable (click, type, select) |
| 4 | isClickable | 1 if element has click handlers or is a link/button |
| 5 | hasTabIndex | 1 if tabIndex >= 0 |
| 6 | hasRole | 1 if explicit or implicit ARIA role exists |
| 7 | roleBaseScore | Base importance score derived from WAI-ARIA role (0.0–1.0) |
| 8 | hasAriaLabel | 1 if aria-label attribute is present |
| 9 | hasAriaLabelledBy | 1 if aria-labelledby is present |
| 10 | hasAriaDescribedBy | 1 if aria-describedby is present |
| 11 | hasAriaRequired | 1 if aria-required="true" |
| 12 | hasAriaExpanded | 1 if aria-expanded is present |
| 13 | hasAriaLive | 1 if aria-live is present |
| 14 | hasTestId | 1 if data-testid / data-cy / data-test is present |
| 15 | hasText | 1 if direct text content exists |
| 16 | textLength | Normalized text length (0.0–1.0, capped at 200 chars) |
| 17 | isLabelText | 1 if text is short and label-like (< 50 chars) |
| 18 | isActionText | 1 if text contains action words (submit, save, login, 送信, etc.) |
| 19 | childCount | Normalized child element count (0.0–1.0, capped at 10) |
| 20 | hasChildren | 1 if element has child elements |
| 21 | isLeaf | 1 if element has no children |
| 22 | depth | Normalized nesting depth (0.0–1.0, capped at 20) |
| 23 | depthPenalty | Multiplier penalizing deeply nested nodes (1.0 → 0.25) |
| 24 | fontSizeNorm | Normalized font size (0.0–1.0, capped at 72px) |
| 25 | isBold | 1 if font-weight >= 600 |
| 26 | areaRatio | Element area / viewport area (0.0–1.0) |
| 27 | isAboveFold | 1 if element is within the initial viewport (y < 800px) |
| 28 | isLargeElement | 1 if width > 200px and height > 30px |
| 29 | hasHref | 1 if href attribute is present |
| 30 | hasAlt | 1 if alt attribute is present |
| 31 | hasPlaceholder | 1 if placeholder attribute is present |
| 32 | isRequired | 1 if required attribute is present |
| 33 | isDisabled | 1 if disabled attribute is present |
| 34 | inputType | Importance score derived from input[type] (0.0–1.0) |
| 35 | headingLevel | Importance derived from heading level (h1 → 1.0, h6 → 0.17) |
| 36 | parentIsForm | 1 if an ancestor is a <form> element |
| 37 | parentIsNav | 1 if an ancestor is a <nav> element |
| 38 | parentIsTable | 1 if an ancestor is a <table> element |
| 39 | parentIsInteractive | 1 if an ancestor is interactive |
| 40 | ancestorScore | Decayed importance score propagated from ancestors |
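Several of the numeric features above follow a cap-and-scale pattern, which the sketch below illustrates for a handful of indices. It is an illustration under stated assumptions rather than the SDD extractor: the `RawNode` fields and the `build_features` and `capped` helpers are hypothetical, the heading formula (7 - level) / 6 is inferred from the h1 → 1.0 and h6 → 0.17 endpoints, and every feature not shown is simply left at zero.

```python
from dataclasses import dataclass
from typing import List

NUM_FEATURES = 41


@dataclass
class RawNode:
    """Hypothetical raw measurements for one DOM element."""
    tag: str
    text: str = ""
    child_count: int = 0
    depth: int = 0
    font_size_px: float = 16.0
    font_weight: int = 400


def capped(value: float, cap: float) -> float:
    """Scale value into the range 0.0 to 1.0, saturating at cap."""
    return min(max(value, 0.0), cap) / cap


def build_features(node: RawNode) -> List[float]:
    """Fill a few of the 41 slots; all other slots stay 0.0."""
    f = [0.0] * NUM_FEATURES
    f[15] = 1.0 if node.text else 0.0                # hasText
    f[16] = capped(len(node.text), 200)              # textLength
    f[19] = capped(node.child_count, 10)             # childCount
    f[20] = 1.0 if node.child_count > 0 else 0.0     # hasChildren
    f[21] = 1.0 if node.child_count == 0 else 0.0    # isLeaf
    f[22] = capped(node.depth, 20)                   # depth
    f[24] = capped(node.font_size_px, 72)            # fontSizeNorm
    f[25] = 1.0 if node.font_weight >= 600 else 0.0  # isBold
    tag = node.tag.lower()
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        f[35] = (7 - int(tag[1])) / 6                # headingLevel (inferred formula)
    return f


vec = build_features(
    RawNode(tag="h2", text="Sign in", depth=5, font_size_px=36, font_weight=700)
)
print(len(vec), vec[16], vec[22], vec[24], vec[35])
```

Before inference, a list like `vec` would need to be converted to a float32 array of shape (1, 41), for example with NumPy.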

Usage

Node.js (onnxruntime-node)

import * as ort from 'onnxruntime-node';

const session = await ort.InferenceSession.create('./sdd-distiller-v1.onnx');

// Build a 41-dim feature vector (Float32Array) for a DOM node
const featureVector = new Float32Array(41);
// featureVector[0] = isHighValueTag, featureVector[3] = isInteractive, ...

const input = new ort.Tensor('float32', featureVector, [1, 41]);
const result = await session.run({ input });
const score = result.output.data[0]; // 0.0 ~ 1.0

Browser (onnxruntime-web)

import * as ort from 'onnxruntime-web';

ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.3/dist/';
const session = await ort.InferenceSession.create('./sdd-distiller-v1.onnx');

const featureVector = new Float32Array(41);
const input = new ort.Tensor('float32', featureVector, [1, 41]);
const result = await session.run({ input });
const score = result.output.data[0]; // 0.0 ~ 1.0

Python (onnxruntime)

import onnxruntime as rt
import numpy as np

sess = rt.InferenceSession("sdd-distiller-v1.onnx")

# Build a (1, 41) float32 feature matrix
feature_vector = np.zeros((1, 41), dtype=np.float32)
# feature_vector[0, 0] = 1.0  # isHighValueTag
# feature_vector[0, 3] = 1.0  # isInteractive
# ...

# The output has shape (1, 1); extract the single element as a Python float
score = float(sess.run(["output"], {"input": feature_vector})[0][0, 0])
print(f"Importance score: {score:.3f}")  # 0.0 ~ 1.0

Training

  • Training data: 50,000 synthetic samples generated by using the heuristic scorer as a teacher
  • Base learner: sklearn.ensemble.GradientBoostingRegressor (n_estimators=200, max_depth=5, learning_rate=0.05)
  • Export tool: skl2onnx with ONNX opset 17
  • Training script: scripts/train_model.py
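The training setup listed above can be approximated with the sketch below. It is not the contents of scripts/train_model.py: the `teacher_score` function is a placeholder standing in for the heuristic scorer (which the source does not describe), the uniformly random features are an assumption about how synthetic samples might be drawn, and the function names are hypothetical. Only the hyperparameters, the 41-column input, and opset 17 come from the source. It assumes numpy and scikit-learn are installed, plus skl2onnx for the export step.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def teacher_score(features: np.ndarray) -> np.ndarray:
    """Placeholder teacher. NOT the SDD heuristic scorer, only a stand-in."""
    return np.clip(features[:, :4].mean(axis=1), 0.0, 1.0)


def train_student(
    n_samples: int = 50_000, n_estimators: int = 200, seed: int = 0
) -> GradientBoostingRegressor:
    """Fit a gradient-boosted student on teacher-labelled synthetic vectors."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, 41)).astype(np.float32)  # synthetic features
    y = teacher_score(X)                                # teacher labels
    model = GradientBoostingRegressor(
        n_estimators=n_estimators,
        max_depth=5,
        learning_rate=0.05,
        random_state=seed,
    )
    model.fit(X, y)
    return model


def export_onnx(model: GradientBoostingRegressor, path: str) -> None:
    """Convert the fitted model to ONNX (opset 17) and write it to path."""
    # Imported lazily so training works without skl2onnx installed
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    onnx_model = convert_sklearn(
        model,
        initial_types=[("input", FloatTensorType([None, 41]))],
        target_opset=17,
    )
    with open(path, "wb") as fh:
        fh.write(onnx_model.SerializeToString())
```

Calling `export_onnx(train_student(), "student.onnx")` would run the full-size configuration, which can take a while. The output tensor name produced by the converter may differ from the `output` name used in the Usage examples, so it is worth checking the exported model's input and output names before running inference.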

Top Feature Importances

| Feature | Importance |
| --- | --- |
| parentIsForm | 0.2961 |
| isInteractive | 0.1637 |
| isHighValueTag | 0.1191 |
| depthPenalty | 0.1008 |
| depth | 0.0942 |
| roleBaseScore | 0.0896 |
| isDisabled | 0.0334 |
| parentIsNav | 0.0211 |
| hasHref | 0.0198 |
| isContainerTag | 0.0169 |
