# A Research-Grade Benchmark and QLoRA Baseline for Multi-Standard Telecom Intent-to-Configuration Translation
**Draft version:** 0.1
**Status:** first manuscript draft
**Primary artifacts:**
- Dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
- Training/evaluation code: https://huggingface.co/nraptisss/tmf921-intent-training
- Primary model: https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-QLoRA-qwen3-8b-qlora-20260501-083834
---
## Abstract
Intent-Based Networking aims to let operators express high-level network goals in natural language and automatically translate them into executable configurations. However, open research resources for training and evaluating such systems across multiple telecom standards remain limited. We present a research-grade benchmark and reproducible QLoRA baseline for multi-standard telecom intent-to-configuration translation. Starting from a 41,815-example TMF921 intent-to-configuration corpus, we construct a benchmark with in-distribution, template-OOD, use-case-OOD, sector-OOD, and adversarial splits, plus provenance, validation, and token-length metadata. We fine-tune Qwen3-8B using QLoRA on a single RTX 6000 Ada GPU and evaluate with both raw JSON metrics and a normalized evaluator that removes volatile identifiers, hrefs, timestamps, and schema links. The model achieves near-perfect JSON validity across all splits, normalized structural key F1 around 98%, normalized field F1 of 79.6% in-distribution, 78.7% on template-OOD, 79.1% on use-case-OOD, 77.0% on sector-OOD, and 96.97% normalized exact match on adversarial rejection. A zero-shot Qwen3-8B baseline largely fails the task, producing valid JSON on only about one-third of non-adversarial examples and near-zero normalized field F1. A targeted weak-layer continuation experiment shows that simple oversampling does not solve O1 NRM and A1 policy value-fidelity errors, motivating layer-specific semantic evaluators and canonical scenario generation. We release the dataset, training code, evaluation scripts, metrics, and qualitative examples to support reproducible research in telecom LLMs and intent-based network management.
---
## 1. Introduction
Intent-Based Networking (IBN) is a central objective of autonomous network management: instead of manually configuring low-level network resources, operators express desired outcomes, constraints, and service-level objectives in high-level language. For 5G and emerging 6G networks, such intents must be translated into structured configuration artifacts spanning multiple standards and operational layers, including TM Forum TMF921, 3GPP intent management concepts, ETSI ZSM, CAMARA network slicing APIs, O-RAN A1 policies, and O1/NRM-style resource models.
Large language models (LLMs) are promising for this translation task because they can map flexible natural language into structured outputs. However, telecom intent-to-configuration translation is not simply generic JSON generation. It requires:
1. understanding domain-specific service types such as eMBB, URLLC, mMTC, V2X, MPS, and HMTC;
2. preserving quantitative SLOs such as latency, throughput, reliability, and UE count;
3. producing target-layer-specific JSON structures;
4. handling lifecycle operations and adversarial/invalid requests;
5. generalizing beyond prompt templates and seen sectors/use cases.
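To make the task concrete, the sketch below shows a hypothetical intent/configuration pair together with a KPI text-presence check of the kind described later in the evaluation. The field names (`sliceProfile`, `latencyMs`, `reliabilityPct`) are illustrative assumptions, not the dataset's actual schemas.

```python
import json

# Hypothetical intent text carrying quantitative SLOs that the
# structured output must preserve. Field names are illustrative.
intent_text = (
    "Create a URLLC slice for the automotive sector with "
    "end-to-end latency under 5 ms and 99.999% reliability."
)

config = {
    "operation": "create",
    "sliceProfile": {
        "sst": 2,                 # URLLC service type (illustrative encoding)
        "sector": "automotive",
        "latencyMs": 5,
        "reliabilityPct": 99.999,
    },
}

# Simple KPI-presence diagnostic: every quantitative SLO mentioned in the
# intent should reappear somewhere in the serialized configuration.
serialized = json.dumps(config)
assert all(tok in serialized for tok in ("5", "99.999"))
```
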
Despite recent progress in telecom LLM benchmarks and domain adaptation, open resources for supervised training and rigorous evaluation of natural-language-to-configuration generation remain limited. Many existing telecom NLP resources focus on multiple-choice question answering, document understanding, or general telecom instruction following rather than structured configuration generation.
This work addresses that gap by releasing and evaluating a research-grade dataset and baseline model for multi-standard telecom intent-to-configuration translation. Our emphasis is not only on training a model, but also on building a reproducible benchmark with explicit out-of-distribution (OOD) splits, normalized metrics, and transparent failure analysis.
### Contributions
We make five contributions:
1. **Research-grade dataset and benchmark.** We construct `TMF921-intent-to-config-research-sota`, a benchmark derived from a 41,815-example TMF921 intent-to-configuration corpus. The benchmark includes in-distribution, template-OOD, use-case-OOD, sector-OOD, and adversarial splits.
2. **Reproducible QLoRA training pipeline.** We provide scripts and configurations for training Qwen3-8B with QLoRA on a single RTX 6000 Ada GPU, including GPU preflight checks, resumable nohup training, adapter merging, and evaluation scripts.
3. **Normalized structured-output evaluation.** We introduce a normalized evaluator that removes volatile fields such as IDs, hrefs, timestamps, descriptions, schema links, and generated identifiers before computing field/key metrics.
4. **Empirical baseline results.** The fine-tuned Qwen3-8B QLoRA model achieves near-perfect JSON validity, approximately 98% normalized key F1, and 77–80% normalized field F1 across non-adversarial ID/OOD splits. On adversarial rejection, it achieves 96.97% normalized exact match.
5. **Weak-layer and negative-result analysis.** We show that O1 NRM and A1 policy value fidelity remain difficult, and that a second-stage weak-layer continuation experiment does not materially improve these layers. This motivates semantic validators and better canonical data generation rather than blind oversampling.
---
## 2. Related Work
### 2.1 Telecom LLM benchmarks
Recent telecom NLP benchmarks such as TeleQnA, ORANBench, TSpec-LLM, SPEC5G, and TelecomGPT-related resources have advanced the evaluation of LLMs on telecom knowledge, standards understanding, and domain-specific question answering. These resources demonstrate that telecom reasoning remains challenging for general-purpose models, especially when tasks require standards-specific knowledge.
However, most existing benchmarks focus on multiple-choice or natural-language answers rather than supervised generation of structured telecom configuration objects. Our work complements these resources by focusing on intent-to-configuration translation, where the target is machine-readable JSON rather than free-form text.
### 2.2 Intent-based networking and network automation
Intent-Based Networking has long been proposed as a path toward autonomous network management. In this paradigm, an operator specifies goals and constraints, while an intent handler translates, validates, enforces, and monitors the resulting configuration. For 5G/6G network slicing, intents must map to multiple abstraction layers: high-level service intent, slice booking, policy control, and low-level network resource models.
Systems such as ORION and related LLM-based network automation work demonstrate the promise of LLMs for intent translation and orchestration. However, many such systems rely on proprietary frontier models, small evaluation sets, or tool-calling frameworks rather than open supervised fine-tuning datasets and reproducible baselines.
### 2.3 Structured-output generation and synthetic data
The task studied here belongs to structured-output generation: models must produce valid JSON with correct schema and values. Prior work on synthetic structured data generation, including approaches such as SynthIE, has shown that synthetic data can be effective when structured targets are generated deterministically and natural-language inputs are diversified. However, synthetic data also risks template leakage, surface-form memorization, and overoptimistic in-distribution evaluation.
Our benchmark explicitly addresses these risks by adding template-OOD, use-case-OOD, and sector-OOD splits, and by reporting both raw and normalized metrics.
### 2.4 Parameter-efficient fine-tuning
QLoRA enables efficient adaptation of large language models using 4-bit quantization and LoRA adapters. This makes it possible to fine-tune an 8B model on a single RTX 6000 Ada GPU while preserving a practical effective batch size and 2048-token context. We use QLoRA as a reproducible baseline rather than as a claim of final optimality.
---
## 3. Dataset and Benchmark Construction
### 3.1 Source dataset
The source dataset contains 41,815 examples of natural-language telecom intents paired with structured JSON configuration outputs. Each example is represented in ChatML-style format with `system`, `user`, and `assistant` messages, and includes metadata such as target layer, slice type, S-NSSAI fields, sector, region, KPI targets, and lifecycle operation.
A scientific audit of the source dataset found:
- 41,815 total rows;
- 0 missing values;
- 0 duplicate IDs;
- 100% assistant JSON parse validity;
- 0 exact train/test full-message overlaps;
- high near-duplicate prompt similarity in the original split;
- strong lifecycle imbalance, with `create` operations dominating;
- only 31 unique JSON structure signatures.
These findings indicate that the source dataset is technically clean and useful for SFT, but its original split is mainly in-distribution and template-like.
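The audit checks above can be reproduced with a short script. The sketch below assumes a simplified row shape (`{"id": ..., "messages": [...]}` with the assistant reply last); this shape is an assumption for illustration, not the dataset's exact layout.

```python
import json
from collections import Counter

def audit(rows):
    """Minimal dataset audit mirroring the checks above: duplicate IDs,
    assistant-JSON parse validity, and JSON structure signatures."""
    ids = [r["id"] for r in rows]
    duplicate_ids = len(ids) - len(set(ids))

    parse_ok = 0
    signatures = Counter()
    for r in rows:
        reply = r["messages"][-1]["content"]  # assistant turn assumed last
        try:
            obj = json.loads(reply)
            parse_ok += 1
            # Structure signature: sorted top-level keys of the JSON object.
            sig = tuple(sorted(obj)) if isinstance(obj, dict) else (type(obj).__name__,)
            signatures[sig] += 1
        except json.JSONDecodeError:
            pass

    return {
        "rows": len(rows),
        "duplicate_ids": duplicate_ids,
        "json_parse_rate": parse_ok / len(rows),
        "unique_structures": len(signatures),
    }
```

On the source corpus, such a script would report the figures listed above (0 duplicate IDs, 100% parse validity, 31 unique structure signatures).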
### 3.2 Research splits
We construct a research-grade derivative dataset with the following splits.
| Split | Rows | Purpose |
|---|---:|---|
| `train_base` | 26,357 | Unaugmented training after OOD holdouts |
| `train_sota` | 32,357 | Training split with lifecycle/adversarial upsampling and multi-turn wrappers |
| `validation` | 1,547 | Validation during training |
| `test_in_distribution` | 1,455 | In-distribution test |
| `test_template_ood` | 3,503 | Held-out prompt-template family |
| `test_use_case_ood` | 4,341 | Held-out use cases |
| `test_sector_ood` | 4,579 | Held-out sectors |
| `test_adversarial` | 33 | Held-out adversarial rejection examples |
The OOD splits are designed to separate interpolation performance from generalization across held-out prompt forms, use cases, and sectors. They remain synthetic OOD splits, not real deployment distributions.
### 3.3 Added metadata
The derivative dataset adds columns for reproducibility and analysis, including:
- `prompt_template_id`,
- `scenario_id`,
- `json_structure_id`,
- `json_root_family`,
- `messages_format_valid`,
- `assistant_is_valid_json`,
- `slice_sst_valid`,
- `kpi_profile_valid`,
- `semantic_rule_valid_v1`,
- `qwen3_chat_template_tokens`,
- `fits_2048_qwen3`,
- `fits_4096_qwen3`,
- `is_augmented`,
- `augmentation_type`,
- `source_id`,
- `conversation_type`.
### 3.4 Token-length audit
Using the Qwen3-8B chat template, all source examples fit within 2048 tokens.
| Statistic | Tokens |
|---|---:|
| Mean | 754.1 |
| p50 | 705 |
| p95 | 1293 |
| p99 | 1300 |
| Max | 1316 |
| Fit under 2048 | 100% |
This justifies `max_length=2048` for Qwen3-family fine-tuning on this dataset.
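The statistics in the table can be computed from per-example token counts; in practice the counts come from the tokenizer's chat template (e.g. `tokenizer.apply_chat_template(...)`), which is assumed and not shown here. A stdlib-only summary sketch:

```python
import statistics

def length_stats(token_counts, max_len=2048):
    """Summarize chat-template token lengths as in the audit table above.

    token_counts: per-example token counts (assumed precomputed with the
    model's chat template).
    """
    qs = statistics.quantiles(token_counts, n=100, method="inclusive")
    return {
        "mean": statistics.mean(token_counts),
        "p50": statistics.median(token_counts),
        "p95": qs[94],   # 95th percentile
        "p99": qs[98],   # 99th percentile
        "max": max(token_counts),
        "fit_rate": sum(c <= max_len for c in token_counts) / len(token_counts),
    }
```
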
### 3.5 Training split design
The `train_sota` split includes training-only upsampling of rare lifecycle and adversarial examples, explicitly marked via provenance fields (`is_augmented`, `augmentation_type`, and `source_id`). Evaluation splits remain unaugmented.
---
## 4. Training Method
### 4.1 Base model and adapter method
We use `Qwen/Qwen3-8B` as the base model and train LoRA adapters with QLoRA. The main configuration is shown below.
| Item | Value |
|---|---|
| Base model | `Qwen/Qwen3-8B` |
| Training method | QLoRA SFT |
| Quantization | 4-bit NF4 + double quantization |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| Target modules | `all-linear` |
| Max length | 2048 |
| Loss | Assistant-only SFT loss |
| Learning rate | 2e-4 |
| Scheduler | constant |
| Optimizer | paged AdamW 32-bit |
| Gradient checkpointing | enabled |
| Hardware | RTX 6000 Ada 48/50GB |
| Train split | `train_sota` |
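The assistant-only SFT loss in the table means that only assistant-reply tokens contribute to the cross-entropy; prompt and system tokens are masked out with the ignore index. Libraries such as TRL provide this masking in practice; the sketch below is a simplified illustration that assumes the token spans covering the assistant reply are already known.

```python
IGNORE_INDEX = -100  # tokens with this label are skipped by cross-entropy

def mask_non_assistant(input_ids, assistant_spans):
    """Build labels for assistant-only SFT loss.

    assistant_spans: list of (start, end) token index ranges (end exclusive)
    covering the assistant reply. Everything outside those spans is masked
    so that only the reply contributes to the loss.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```
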
### 4.2 Reproducible execution
Training was performed using a repository that includes:
- CUDA/GPU preflight checks;
- installation script for RTX 6000 Ada;
- resumable nohup training;
- checkpoint saving;
- adapter merge script;
- raw and normalized evaluation scripts;
- results packaging tools.
The primary run directory was:
```text
runs/qwen3-8b-qlora-20260501-083834
```
Training converged smoothly with no observed OOM, NaNs, or gradient instability.
---
## 5. Evaluation Methodology
### 5.1 Raw metrics
We report:
- JSON parse rate,
- canonical JSON exact match,
- flattened field precision/recall/F1,
- slice/SST diagnostic pass,
- KPI text-presence diagnostic pass,
- adversarial status pass.
Raw exact match is intentionally reported, but it is not the primary metric for this task because valid configurations may contain volatile identifiers and metadata.
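Canonical JSON exact match can be implemented by serializing both sides with sorted keys and fixed separators, so that key order and whitespace do not affect the comparison. A minimal sketch (not the evaluator's exact code):

```python
import json

def canonical(obj):
    """Serialize with sorted keys and fixed separators so that two
    semantically identical JSON objects compare equal as strings."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def exact_match(pred_text, gold_text):
    """Raw canonical exact match; returns False when the prediction
    is not valid JSON (this also feeds the JSON parse rate metric)."""
    try:
        return canonical(json.loads(pred_text)) == canonical(json.loads(gold_text))
    except json.JSONDecodeError:
        return False
```
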
### 5.2 Normalized metrics
We introduce normalized metrics that remove or mask volatile/generated fields before scoring. Normalization removes:
- IDs,
- hrefs,
- names/descriptions,
- timestamps,
- schema links,
- UUID/hash-like strings,
- generated request/policy/booking/intent IDs.
It then computes:
- normalized exact match,
- normalized field precision/recall/F1,
- normalized key precision/recall/F1.
Normalized key F1 measures structural agreement, while normalized field F1 estimates value-level agreement after volatile-field removal.
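A minimal sketch of the normalization and scoring follows. The volatile-key list and UUID pattern are illustrative assumptions rather than the evaluator's exact configuration; the point is the mechanism: flatten both JSON objects into (path, value) pairs, drop volatile material, then score sets of pairs (field F1) or paths alone (key F1).

```python
import re

# Illustrative volatile keys; the actual evaluator's list is broader.
VOLATILE = {"id", "href", "name", "description", "@schemaLocation",
            "creationDate", "lastUpdate"}
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)

def flatten(obj, path=""):
    """Yield (dotted_path, value) pairs, dropping volatile keys and
    UUID-like string values."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k in VOLATILE:
                continue
            yield from flatten(v, f"{path}.{k}" if path else k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{path}[{i}]")
    elif not (isinstance(obj, str) and UUID_RE.match(obj)):
        yield (path, obj)

def f1(pred, gold, keys_only=False):
    """Field F1 over (path, value) pairs; key F1 over paths only."""
    p, g = set(flatten(pred)), set(flatten(gold))
    if keys_only:
        p, g = {k for k, _ in p}, {k for k, _ in g}
    if not p or not g:
        return 0.0
    tp = len(p & g)
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Note how a prediction with the right structure but a wrong value scores perfect key F1 yet reduced field F1; this is exactly the signature observed later for the weak layers.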
### 5.3 OOD and adversarial evaluation
All results are reported separately for each split. We do not merge the OOD splits into a single aggregate score because each split tests a distinct generalization mode.
---
## 6. Results
### 6.1 Raw stage-1 results
| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| `test_template_ood` | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| `test_use_case_ood` | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| `test_sector_ood` | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| `test_adversarial` | 1.0000 | 0.9697 | 0.9697 | 1.0000 |
Raw exact match is low for primary configuration layers, but this metric is overly strict because many correct or acceptable generations differ in volatile fields.
### 6.2 Normalized stage-1 results
| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---:|---:|---:|---:|
| `test_in_distribution` | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| `test_template_ood` | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| `test_use_case_ood` | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| `test_sector_ood` | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| `test_adversarial` | 1.0000 | 0.9697 | 1.0000 | 0.9697 |
The fine-tuned model achieves near-perfect parseability and high structural agreement across ID and OOD splits. Normalized field F1 is stable across OOD splits, indicating that performance is not purely template memorization.
### 6.3 Zero-shot baseline
We evaluate zero-shot `Qwen/Qwen3-8B` on 200 examples per split. It performs poorly.
| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---:|---:|---:|---:|---:|---:|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |
This establishes that domain-specific QLoRA fine-tuning is essential for the task.
### 6.4 Layer-level results
The model performs best on high-level intent and API-like targets.
| Target layer | Normalized field F1 range | Interpretation |
|---|---:|---|
| `tmf921` | 0.93–0.94 | Strong high-level intent object generation |
| `camara` | 0.81–0.87 | Strong after volatile-field normalization |
| `intent_3gpp` | 0.80–0.82 | Strong/moderate |
| `etsi_zsm` | 0.75–0.79 | Moderate/strong |
| `a1_policy` | 0.67–0.68 | Moderate, value fidelity remains limited |
| `o1_nrm` | 0.39–0.40 | Weak value fidelity despite correct structure |
| `tmf921_lifecycle_report` | 0.15–0.18 | Weak, likely measurement/simulation mismatch |
| `tmf921_lifecycle_monitor` | 0.39–0.52 | Weak/mixed |
The O1 NRM result is especially informative: normalized key F1 is high, but normalized field F1 is low. This means the model learns the structural schema but fails to reliably assign correct low-level values.
---
## 7. Stage-2 Weak-Layer Continuation
To test whether weak-layer exposure alone would fix value fidelity, we continued training the stage-1 adapter on a weak-layer-focused dataset. The stage-2 dataset included all weak-layer rows, duplicated rare weak layers, and a replay buffer from strong layers.
Weak layers targeted:
- `o1_nrm`,
- `a1_policy`,
- `tmf921_lifecycle_report`,
- `tmf921_lifecycle_monitor`,
- `tmf921_lifecycle_scale`.
Stage-2 global comparison:
| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `test_in_distribution` | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| `test_template_ood` | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| `test_use_case_ood` | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| `test_sector_ood` | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| `test_adversarial` | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |
Stage 2 did not meaningfully improve O1 NRM or A1 policy, and it slightly reduced adversarial robustness; the stage-2 adapter is therefore not promoted, and stage 1 remains the primary model.
This negative result suggests that weak-layer errors are not solved by exposure alone. They likely require better canonical labels, layer-specific semantic evaluation, and improved generation rules.
---
## 8. Qualitative Failure Analysis
Qualitative examples are available in:
```text
analysis/stage1_examples/failure_examples.md
analysis/stage1_examples/failure_examples.json
analysis/stage2_examples/failure_examples.md
analysis/stage2_examples/failure_examples.json
```
Common error types include:
- correct JSON structure but wrong low-level values,
- O1 NRM value fidelity errors,
- A1 policy value errors,
- lifecycle report measurement mismatch,
- lifecycle monitor measurement mismatch.
These examples support the quantitative conclusion that the model is structurally strong but still struggles with certain low-level semantic/value assignments.
---
## 9. Limitations
| Limitation | Impact | Mitigation / future work |
|---|---|---|
| Synthetic data | May not reflect real operator language | Add expert/human-authored validation subset |
| No official standard validators | Cannot claim production compliance | Add TMF921/CAMARA/OpenAPI/YANG validators |
| O1 NRM weak value fidelity | Low-level RAN configuration unreliable | Add semantic evaluator and canonical labels |
| A1 policy moderate fidelity | Policy values may be wrong | Add policy-specific extractor/scorer |
| Lifecycle report/monitor weak | Measurement fields may be hard to reproduce | Use tolerance/semantic scoring |
| Exact match low | Raw exact match over-penalizes volatile fields | Report normalized metrics alongside raw |
This work should be interpreted as a research benchmark and baseline, not a production-certified telecom configuration system.
---
## 10. Conclusion
We release a research-grade open dataset, OOD benchmark, normalized evaluator, training pipeline, and Qwen3-8B QLoRA baseline for multi-standard telecom intent-to-configuration translation. The fine-tuned model achieves near-perfect JSON validity, around 98% normalized structural key F1, and 77–80% normalized field F1 across non-adversarial ID/OOD splits, with strong adversarial rejection. A zero-shot baseline shows that the base model largely fails the task without domain-specific fine-tuning. At the same time, weak-layer analysis reveals that O1 NRM and A1 policy value fidelity remain open problems. Future work should focus on layer-specific semantic evaluators, official/derived schema validators, expert validation, and Gen4 canonical scenario generation.
---
## Citation placeholders
Add formal citations for:
- QLoRA,
- LoRA,
- TRL,
- Qwen3,
- TeleQnA,
- ORANBench,
- SPEC5G,
- TSpec-LLM,
- TelecomGPT,
- ORION,
- SynthIE,
- TMF921,
- 3GPP TS 28.312,
- ETSI ZSM,
- CAMARA,
- O-RAN A1,
- 3GPP TS 28.541.
|