# Pnyx Erscheinung - Authenticity Detection Model (v0.7)
Named after Hannah Arendt's *Erscheinungsraum* (space of appearance), this model detects whether genuine human presence exists behind a text, that is, whether there is a "who" to listen to.
## Model

- Base: `microsoft/deberta-v3-small` (141M params)
- Format: ONNX, FP16, pruned vocabulary (70K of 128K tokens), 126 MB
- Inference: ONNX Runtime Web (WASM) for in-browser use
## Architecture

```
DeBERTa CLS embedding (768-dim) + hand-crafted features (8-dim)
  -> LayerNorm(776)
  -> Linear(776, 256) -> GELU -> Dropout(0.3)
  -> Linear(256, 128) -> GELU -> Dropout(0.2)
  -> Linear(128, 2)
```
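As a rough sanity check on the head's size, its parameter count can be tallied from the layer shapes above (the 141M-parameter backbone dominates either way):

```js
// Parameter count of the classification head described above.
// LayerNorm(776): gamma + beta = 2 * 776
// Linear(in, out): in * out weights + out biases
const linear = (inDim, outDim) => inDim * outDim + outDim;

const headParams =
  2 * 776 +            // LayerNorm(776)
  linear(776, 256) +   // Linear(776, 256)
  linear(256, 128) +   // Linear(256, 128)
  linear(128, 2);      // Linear(128, 2)

console.log(headParams); // ~234K parameters, negligible next to the backbone
```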
## Inputs
| Name | Shape | Description |
|---|---|---|
| input_ids | [1, 128] | Tokenized text (max 128 tokens) |
| attention_mask | [1, 128] | Attention mask (1 = real token, 0 = padding) |
| features | [1, 8] | Hand-crafted features (TTR, hapax rate, sentence variance, etc.) |
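Because both text inputs have a fixed `[1, 128]` shape, shorter sequences must be padded and longer ones truncated before building the tensors. A minimal sketch (a pad ID of 0 is an assumption here; verify against the tokenizer config):

```js
// Pad/truncate token IDs to the fixed length and build the matching
// attention mask (1 = real token, 0 = padding).
// PAD_ID = 0 is an assumption; check the tokenizer config.
const MAX_LEN = 128;
const PAD_ID = 0;

function prepareInputs(ids) {
  const trimmed = ids.slice(0, MAX_LEN);
  const padCount = MAX_LEN - trimmed.length;
  const inputIds = trimmed.concat(Array(padCount).fill(PAD_ID));
  const attentionMask = trimmed.map(() => 1).concat(Array(padCount).fill(0));
  return { inputIds, attentionMask }; // each length 128; feed as [1, 128] tensors
}
```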
## Token remapping

This model uses a pruned vocabulary. After tokenizing with the standard DeBERTa tokenizer, remap token IDs using `token_remap.json` (JSON keys are strings, so numeric IDs are coerced automatically; IDs outside the pruned vocabulary fall back to 0):

```js
const remap = await fetch('token_remap.json').then(r => r.json());
const remapped = ids.map(id => remap[id] ?? 0);
```
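To see the remap behavior concretely, here is a toy mapping (illustrative values only, not the contents of the real `token_remap.json`):

```js
// Toy remap table: original DeBERTa token IDs -> pruned-vocab IDs.
// Real values live in token_remap.json; these are made up for illustration.
const remap = { "1": 1, "2": 2, "9042": 3971 };

const ids = [1, 9042, 120000]; // 120000 is outside the pruned vocab
const remapped = ids.map(id => remap[id] ?? 0); // unknown IDs fall back to 0
console.log(remapped); // [1, 3971, 0]
```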
## Feature order (indices 0-7)
| Index | Feature | Range |
|---|---|---|
| 0 | Type-Token Ratio | 0-1 |
| 1 | Hapax rate | 0-1 |
| 2 | Sentence length variance | >= 0 |
| 3 | Average sentence length | >= 0 |
| 4 | Bigram uniqueness | 0-1 |
| 5 | Stop word density | 0-1 |
| 6 | Contraction presence | 0/1 |
| 7 | Lowercase ratio | 0-1 |
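A sketch of how the first two features might be computed. The tokenization and the hapax-rate denominator used at training time are not specified here, so both are assumptions (simple lowercase word splitting, hapaxes over types):

```js
// Illustrative computation of features 0 (TTR) and 1 (hapax rate).
// Whitespace/lowercase word splitting and the hapax denominator (types)
// are assumptions; match them to the training pipeline before use.
function lexicalFeatures(text) {
  const tokens = text.toLowerCase().match(/[a-z']+/g) ?? [];
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  const types = counts.size;
  const hapaxes = [...counts.values()].filter(c => c === 1).length;
  return {
    ttr: types / Math.max(tokens.length, 1),   // feature 0: Type-Token Ratio
    hapaxRate: hapaxes / Math.max(types, 1),   // feature 1: hapax rate
  };
}
```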
## Output

Softmax probabilities `[human_prob, ai_prob]`. An `ai_prob` >= 0.5 indicates likely AI-generated text.
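If a given export emits raw logits rather than probabilities (worth verifying against the ONNX output node), the softmax can be applied client-side; a numerically stable sketch:

```js
// Numerically stable softmax; only needed if the exported graph returns
// raw logits instead of probabilities (check the ONNX output node).
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const [humanProb, aiProb] = softmax([0.3, 1.2]); // hypothetical logits
const verdict = aiProb >= 0.5 ? "ai" : "human";
```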
## Dual-tier architecture

In practice, this model runs alongside an 85-signal heuristic tier. When the heuristic tier's confidence is high enough (score >= 4.0), ML inference is skipped entirely, saving roughly 500 ms per request.
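The gating described above can be sketched as follows. `runHeuristics`, `runMlInference`, and `scoreToProb` are hypothetical names for the two tiers and a calibration step; only the 4.0 threshold comes from this card:

```js
// Dual-tier gating sketch: skip the ~500ms ML pass when the heuristic
// tier is already confident. The 4.0 cutoff matches the card; the
// injected functions are hypothetical stand-ins for the real tiers.
const HEURISTIC_THRESHOLD = 4.0;

async function detect(text, { runHeuristics, runMlInference, scoreToProb }) {
  const heuristic = runHeuristics(text); // fast: 85 hand-written signals
  if (heuristic.score >= HEURISTIC_THRESHOLD) {
    return { aiProb: scoreToProb(heuristic.score), tier: "heuristic" };
  }
  const ml = await runMlInference(text); // slower: DeBERTa + features
  return { aiProb: ml.aiProb, tier: "ml" };
}
```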
## Part of Pnyx
This model powers the SEE layer (authenticity detection) of Pnyx, a listening infrastructure for public discourse built for the Agora Hackathon x TUM.ai E-Lab (April 2026).