# Pnyx Erscheinung - Authenticity Detection Model (v0.7)
Named after Hannah Arendt's *Erscheinungsraum* (space of appearance), this model detects whether genuine human presence exists behind a text, that is, whether there is a "who" to listen to.
## Model

- Base: `microsoft/deberta-v3-small` (141M params)
- Format: ONNX, FP16, pruned vocabulary (70K of 128K tokens), 126 MB
- Inference: ONNX Runtime Web (WASM) for in-browser use
## Architecture

```
DeBERTa CLS embedding (768-dim) + hand-crafted features (8-dim)
  -> LayerNorm(776)
  -> Linear(776, 256) -> GELU -> Dropout(0.3)
  -> Linear(256, 128) -> GELU -> Dropout(0.2)
  -> Linear(128, 2)
```
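As a rough sanity check on the head's size, its parameter count can be tallied from the layer shapes above (the 141M-parameter backbone dominates either way):

```js
// Parameter count of the classification head described above.
// LayerNorm(776): gamma + beta = 2 * 776
// Linear(in, out): in * out weights + out biases
const linear = (inDim, outDim) => inDim * outDim + outDim;

const headParams =
  2 * 776 +            // LayerNorm(776)
  linear(776, 256) +   // Linear(776, 256)
  linear(256, 128) +   // Linear(256, 128)
  linear(128, 2);      // Linear(128, 2)

console.log(headParams); // ~234K parameters, negligible next to the backbone
```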
## Inputs
| Name | Shape | Description |
|---|---|---|
| input_ids | [1, 128] | Tokenized text (max 128 tokens) |
| attention_mask | [1, 128] | Attention mask (1 = real token, 0 = padding) |
| features | [1, 8] | Hand-crafted features (TTR, hapax rate, sentence variance, etc.) |
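Because both text inputs have a fixed `[1, 128]` shape, shorter sequences must be padded and longer ones truncated before building the tensors. A minimal sketch (a pad ID of 0 is an assumption here; verify against the tokenizer config):

```js
// Pad/truncate token IDs to the fixed length and build the matching
// attention mask (1 = real token, 0 = padding).
// PAD_ID = 0 is an assumption; check the tokenizer config.
const MAX_LEN = 128;
const PAD_ID = 0;

function prepareInputs(ids) {
  const trimmed = ids.slice(0, MAX_LEN);
  const padCount = MAX_LEN - trimmed.length;
  const inputIds = trimmed.concat(Array(padCount).fill(PAD_ID));
  const attentionMask = trimmed.map(() => 1).concat(Array(padCount).fill(0));
  return { inputIds, attentionMask }; // each length 128; feed as [1, 128] tensors
}
```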
## Token remapping

This model uses a pruned vocabulary. After tokenizing with the standard DeBERTa tokenizer, remap token IDs using `token_remap.json` (JSON keys are strings, so numeric IDs are coerced automatically; IDs outside the pruned vocabulary fall back to 0):

```js
const remap = await fetch('token_remap.json').then(r => r.json());
const remapped = ids.map(id => remap[id] ?? 0);
```
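To see the remap behavior concretely, here is a toy mapping (illustrative values only, not the contents of the real `token_remap.json`):

```js
// Toy remap table: original DeBERTa token IDs -> pruned-vocab IDs.
// Real values live in token_remap.json; these are made up for illustration.
const remap = { "1": 1, "2": 2, "9042": 3971 };

const ids = [1, 9042, 120000]; // 120000 is outside the pruned vocab
const remapped = ids.map(id => remap[id] ?? 0); // unknown IDs fall back to 0
console.log(remapped); // [1, 3971, 0]
```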
## Feature order (indices 0-7)
| Index | Feature | Range |
|---|---|---|
| 0 | Type-Token Ratio | 0-1 |
| 1 | Hapax rate | 0-1 |
| 2 | Sentence length variance | >= 0 |
| 3 | Average sentence length | >= 0 |
| 4 | Bigram uniqueness | 0-1 |
| 5 | Stop word density | 0-1 |
| 6 | Contraction presence | 0/1 |
| 7 | Lowercase ratio | 0-1 |
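A sketch of how the first two features might be computed. The tokenization and the hapax-rate denominator used at training time are not specified here, so both are assumptions (simple lowercase word splitting, hapaxes over types):

```js
// Illustrative computation of features 0 (TTR) and 1 (hapax rate).
// Whitespace/lowercase word splitting and the hapax denominator (types)
// are assumptions; match them to the training pipeline before use.
function lexicalFeatures(text) {
  const tokens = text.toLowerCase().match(/[a-z']+/g) ?? [];
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  const types = counts.size;
  const hapaxes = [...counts.values()].filter(c => c === 1).length;
  return {
    ttr: types / Math.max(tokens.length, 1),   // feature 0: Type-Token Ratio
    hapaxRate: hapaxes / Math.max(types, 1),   // feature 1: hapax rate
  };
}
```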
## Output

Softmax probabilities `[human_prob, ai_prob]`. An `ai_prob` >= 0.5 indicates likely AI-generated text.
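If a given export emits raw logits rather than probabilities (worth verifying against the ONNX output node), the softmax can be applied client-side; a numerically stable sketch:

```js
// Numerically stable softmax; only needed if the exported graph returns
// raw logits instead of probabilities (check the ONNX output node).
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const [humanProb, aiProb] = softmax([0.3, 1.2]); // hypothetical logits
const verdict = aiProb >= 0.5 ? "ai" : "human";
```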
## Dual-tier architecture

In practice, this model runs alongside an 85-signal heuristic tier. When the heuristic tier's confidence is high enough (score >= 4.0), ML inference is skipped entirely, saving roughly 500 ms per request.
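The gating described above can be sketched as follows. `runHeuristics`, `runMlInference`, and `scoreToProb` are hypothetical names for the two tiers and a calibration step; only the 4.0 threshold comes from this card:

```js
// Dual-tier gating sketch: skip the ~500ms ML pass when the heuristic
// tier is already confident. The 4.0 cutoff matches the card; the
// injected functions are hypothetical stand-ins for the real tiers.
const HEURISTIC_THRESHOLD = 4.0;

async function detect(text, { runHeuristics, runMlInference, scoreToProb }) {
  const heuristic = runHeuristics(text); // fast: 85 hand-written signals
  if (heuristic.score >= HEURISTIC_THRESHOLD) {
    return { aiProb: scoreToProb(heuristic.score), tier: "heuristic" };
  }
  const ml = await runMlInference(text); // slower: DeBERTa + features
  return { aiProb: ml.aiProb, tier: "ml" };
}
```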
## Part of Pnyx
This model powers the SEE layer (authenticity detection) of Pnyx, a listening infrastructure for public discourse built for the Agora Hackathon x TUM.ai E-Lab (April 2026).