Tags: PEFT · qlora · sft · trl · qwen3 · tmf921 · intent-based-networking · network-slicing · rtx-6000-ada · ml-intern
nraptisss committed e902bb8 · verified · 1 parent: 300682c

Upload paper/outline.md

# Paper Outline

Working title:

> **A Research-Grade Benchmark and QLoRA Baseline for Multi-Standard Telecom Intent-to-Configuration Translation**

Alternative titles:

1. **From Natural Language Intent to Telecom Configuration: Dataset, OOD Evaluation, and QLoRA Baselines Across TMF921, 3GPP, CAMARA, O-RAN, and ETSI ZSM**
2. **Multi-Standard Telecom Intent-to-Configuration Translation: An Open Dataset, OOD Benchmark, and Qwen3 QLoRA Baseline**
3. **Benchmarking LLMs for Intent-Based Network Configuration Across Telecom Standards**

---

## One-sentence thesis

We introduce an open research dataset, OOD benchmark, normalized evaluator, and Qwen3-8B QLoRA baseline for multi-standard telecom intent-to-configuration translation, showing strong JSON validity and structural correctness while identifying O1 NRM and A1 policy value fidelity as the main remaining bottlenecks.

---

## Core contributions

1. **Research-grade dataset and benchmark**
   - Derived from `nraptisss/TMF921-intent-to-config-augmented`.
   - Adds benchmark splits for in-distribution, template-OOD, use-case-OOD, sector-OOD, and adversarial evaluation.
   - Adds provenance, template/scenario IDs, a token-length audit, validation flags, and a rare-class-aware training split.

2. **Reproducible training pipeline**
   - Single-GPU (RTX 6000 Ada) QLoRA recipe for Qwen3-8B.
   - Includes nohup-based resumability, GPU preflight checks, adapter merging, and Hub-ready artifacts.

3. **Evaluation methodology**
   - Raw JSON exact-match and field-level metrics.
   - Normalized evaluator that removes volatile IDs, hrefs, timestamps, schema links, and generated identifiers.
   - Stratified reporting by split, target layer, slice type, and lifecycle operation.

4. **Baseline model and empirical results**
   - Qwen3-8B QLoRA achieves near-perfect JSON validity.
   - Normalized key F1 is about 98% across non-adversarial ID/OOD splits.
   - Normalized field F1 is about 77–80% across non-adversarial ID/OOD splits.
   - Adversarial rejection reaches 96.97% normalized exact match.

5. **Weak-layer diagnostic experiment**
   - A stage-2 weak-layer continuation experiment did not materially improve O1/A1 value fidelity.
   - This negative result suggests that better semantic supervision and evaluation are needed, not merely more weak-layer exposure.

---

## Proposed abstract draft

Intent-Based Networking aims to let operators express high-level network goals in natural language and automatically translate them into executable configurations. However, open research resources for training and evaluating such systems across multiple telecom standards remain limited. We present a research-grade benchmark and reproducible QLoRA baseline for multi-standard telecom intent-to-configuration translation. Starting from a 41,815-example TMF921 intent-to-configuration corpus, we construct a benchmark with in-distribution, template-OOD, use-case-OOD, sector-OOD, and adversarial splits, plus provenance, validation, and token-length metadata. We fine-tune Qwen3-8B using QLoRA on a single RTX 6000 Ada GPU and evaluate with both raw JSON metrics and a normalized evaluator that removes volatile identifiers, hrefs, timestamps, and schema links. The model achieves near-perfect JSON validity across all splits, normalized structural key F1 around 98%, normalized field F1 of 79.6% in-distribution, 78.7% on template-OOD, 79.1% on use-case-OOD, 77.0% on sector-OOD, and 96.97% normalized exact match on adversarial rejection. A targeted weak-layer continuation experiment shows that simple oversampling does not solve O1 NRM and A1 policy value-fidelity errors, motivating layer-specific semantic evaluators and canonical scenario generation. We release the dataset, training code, evaluation scripts, metrics, and qualitative examples to support reproducible research in telecom LLMs and intent-based network management.

---

## Paper structure

### 1. Introduction

Purpose:

- Motivate intent-based networking and natural-language-to-configuration translation.
- Explain the need for open data and reproducible baselines.
- Introduce the multi-standard challenge: TMF921, 3GPP, ETSI ZSM, CAMARA, O-RAN A1, and O1 NRM.
- State contributions.

Key points:

- Frontier LLMs may perform well, but telecom systems need open, auditable, deployable models.
- Raw JSON exact match is not enough due to volatile fields.
- OOD evaluation is necessary because synthetic template datasets can inflate in-distribution performance.

### 2. Related Work

Subsections:

- LLMs for telecommunications.
- Telecom benchmarks: TeleQnA, SPEC5G, ORANBench, TSpec-LLM, and TelecomGPT-related resources.
- Intent-based networking and network-slicing automation.
- Structured-output / JSON-constrained generation.
- QLoRA and parameter-efficient adaptation.

### 3. Dataset and Benchmark Construction

Subsections:

- Source dataset summary.
- Data audit and motivation for research splits.
- Split construction:
  - `train_base`
  - `train_sota`
  - `validation`
  - `test_in_distribution`
  - `test_template_ood`
  - `test_use_case_ood`
  - `test_sector_ood`
  - `test_adversarial`
- Metadata and validation columns.
- Token-length audit.
- Limitations of synthetic data.
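
The token-length audit above can be sketched as a small script. The snippet below is a minimal illustration only: the record fields (`intent`, `config`), the example rows, and the 2048-token context budget are hypothetical, and the default whitespace split stands in for the real model tokenizer (an actual audit would pass in the Qwen3 tokenizer).

```python
import statistics

def token_length_audit(examples, tokenize=str.split):
    """Summarize combined intent+config token lengths per example.

    `tokenize` defaults to whitespace splitting as a rough proxy;
    a real audit would plug in the model's tokenizer instead.
    """
    lengths = sorted(
        len(tokenize(ex["intent"])) + len(tokenize(ex["config"]))
        for ex in examples
    )
    return {
        "n": len(lengths),
        "max": lengths[-1],
        "median": statistics.median(lengths),
        # Count examples that would overflow a hypothetical context budget.
        "over_2048": sum(1 for n in lengths if n > 2048),
    }

# Hypothetical rows standing in for dataset records.
examples = [
    {"intent": "Create an eMBB slice with 100 Mbps downlink",
     "config": '{"slice": "eMBB"}'},
    {"intent": "Monitor URLLC latency KPI",
     "config": '{"kpi": "latency"}'},
]
print(token_length_audit(examples))
```

In practice the audit output would feed the validation-flag columns, e.g. marking overlong examples for exclusion from the training splits.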

### 4. Training Method

Subsections:

- Base model: Qwen3-8B.
- QLoRA setup:
  - NF4 4-bit quantization.
  - Double quantization.
  - LoRA rank 64.
  - All-linear target modules.
  - Assistant-only SFT loss.
- Hardware: RTX 6000 Ada.
- Reproducible training scripts and run management.
- Stage-2 continuation setup.
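
The recipe above can be captured in a single configuration sketch. Only the values stated in the outline (Qwen3-8B, NF4 4-bit quantization, double quantization, rank 64, all-linear targets, assistant-only loss) come from the source; `lora_alpha` and `lora_dropout` are illustrative placeholders, and in a real run these fields would be handed to `BitsAndBytesConfig`, `LoraConfig`, and TRL's `SFTTrainer` rather than kept as a plain dict.

```python
# QLoRA recipe sketch for Qwen3-8B on a single RTX 6000 Ada.
# Entries marked "assumed" are placeholders, not values from the paper.
qlora_recipe = {
    "base_model": "Qwen/Qwen3-8B",
    "quantization": {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",       # NF4 4-bit quantization
        "bnb_4bit_use_double_quant": True,  # double quantization
    },
    "lora": {
        "r": 64,                            # LoRA rank 64
        "lora_alpha": 128,                  # assumed (a common 2*r choice)
        "lora_dropout": 0.05,               # assumed
        "target_modules": "all-linear",
    },
    "sft": {
        "assistant_only_loss": True,        # loss computed on assistant turns only
    },
}

def validate_recipe(recipe):
    """Sanity-check the fields the outline pins down."""
    assert recipe["quantization"]["bnb_4bit_quant_type"] == "nf4"
    assert recipe["quantization"]["bnb_4bit_use_double_quant"] is True
    assert recipe["lora"]["r"] == 64
    assert recipe["lora"]["target_modules"] == "all-linear"
    return True

print(validate_recipe(qlora_recipe))
```

Keeping the recipe in one declarative structure like this makes the stage-2 continuation run easy to express as a small diff against the stage-1 configuration.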

### 5. Evaluation Methodology

Subsections:

- Raw metrics:
  - JSON parse rate.
  - Canonical exact match.
  - Flattened field precision/recall/F1.
  - KPI text-presence diagnostics.
- Normalized metrics:
  - Remove volatile IDs, hrefs, timestamps, descriptions, and schema links.
  - Sort lists deterministically.
  - Normalized field F1.
  - Normalized key F1.
- OOD evaluation protocol.
- Adversarial evaluation.
- Known limitations of the current evaluator.
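
The normalize-then-score pipeline above can be sketched as follows. This is a minimal illustration, not the project's actual evaluator: the volatile-key list and the toy prediction/reference pair are hypothetical, and the real normalizer also strips generated identifiers.

```python
import json

VOLATILE_KEYS = {"id", "href", "lastUpdate", "@schemaLocation"}  # hypothetical list

def normalize(obj):
    """Drop volatile fields and sort lists so comparison is deterministic."""
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in sorted(obj.items())
                if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return sorted((normalize(v) for v in obj),
                      key=lambda v: json.dumps(v, sort_keys=True))
    return obj

def flatten(obj, prefix=""):
    """Flatten nested JSON into (path, value) pairs for field-level scoring."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}{i}.")
    else:
        yield (prefix.rstrip("."), obj)

def field_f1(pred, ref):
    """Normalized field F1 over flattened (path, value) pairs."""
    p = set(flatten(normalize(pred)))
    r = set(flatten(normalize(ref)))
    if not p or not r:
        return 0.0
    prec, rec = len(p & r) / len(p), len(p & r) / len(r)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy pair: IDs differ (ignored after normalization), one KPI value is wrong.
pred = {"id": "intent-123", "sliceType": "eMBB",
        "kpi": [{"name": "dl", "value": 100}]}
ref  = {"id": "intent-999", "sliceType": "eMBB",
        "kpi": [{"name": "dl", "value": 150}]}
print(field_f1(pred, ref))
```

Normalized key F1 follows the same shape but compares only the paths, which is why structural scores can sit near 98% while value-level field F1 stays in the high 70s.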

### 6. Results

Subsections:

- Main stage-1 metrics.
- OOD generalization.
- Per-target-layer performance.
- Raw vs. normalized evaluation.
- Adversarial rejection.

### 7. Weak-Layer and Error Analysis

Subsections:

- O1 NRM: high structural correctness but low value fidelity.
- A1 policy: moderate value fidelity.
- Lifecycle report/monitor: measurement-mismatch issues.
- Qualitative examples.
- Stage-2 weak-layer continuation negative result.

### 8. Limitations

Must state clearly:

- Synthetic benchmark, not real operator logs.
- No official standards validators yet.
- Normalized JSON metrics are a proxy for semantic correctness.
- O1/A1 need layer-specific semantic evaluators.
- No human expert validation yet.
- Low-level values may require canonical scenario generation.

### 9. Conclusion

Reiterate:

- Open dataset + OOD benchmark + reproducible QLoRA baseline.
- Strong JSON validity and structural correctness.
- Honest weak-layer findings.
- Future work: semantic validators and Gen4 canonical data.

---

## Main conclusion to defend

This is not yet a production-certified telecom configuration system. It is a strong open research benchmark and baseline that advances reproducible study of multi-standard telecom intent-to-configuration translation.