
A Research-Grade Benchmark and QLoRA Baseline for Multi-Standard Telecom Intent-to-Configuration Translation

Draft version: 0.1
Status: first manuscript draft
Primary artifacts:


Abstract

Intent-Based Networking aims to let operators express high-level network goals in natural language and automatically translate them into executable configurations. However, open research resources for training and evaluating such systems across multiple telecom standards remain limited. We present a research-grade benchmark and reproducible QLoRA baseline for multi-standard telecom intent-to-configuration translation. Starting from a 41,815-example TMF921 intent-to-configuration corpus, we construct a benchmark with in-distribution, template-OOD, use-case-OOD, sector-OOD, and adversarial splits, plus provenance, validation, and token-length metadata. We fine-tune Qwen3-8B using QLoRA on a single RTX 6000 Ada GPU and evaluate with both raw JSON metrics and a normalized evaluator that removes volatile identifiers, hrefs, timestamps, and schema links. The model achieves near-perfect JSON validity across all splits, normalized structural key F1 around 98%, normalized field F1 of 79.6% in-distribution, 78.7% on template-OOD, 79.1% on use-case-OOD, 77.0% on sector-OOD, and 96.97% normalized exact match on adversarial rejection. A zero-shot Qwen3-8B baseline largely fails the task, producing valid JSON on only about one-third of non-adversarial examples and near-zero normalized field F1. A targeted weak-layer continuation experiment shows that simple oversampling does not solve O1 NRM and A1 policy value-fidelity errors, motivating layer-specific semantic evaluators and canonical scenario generation. We release the dataset, training code, evaluation scripts, metrics, and qualitative examples to support reproducible research in telecom LLMs and intent-based network management.


1. Introduction

Intent-Based Networking (IBN) is a central objective of autonomous network management: instead of manually configuring low-level network resources, operators express desired outcomes, constraints, and service-level objectives in high-level language. For 5G and emerging 6G networks, such intents must be translated into structured configuration artifacts spanning multiple standards and operational layers, including TM Forum TMF921, 3GPP intent management concepts, ETSI ZSM, CAMARA network slicing APIs, O-RAN A1 policies, and O1/NRM-style resource models.

Large language models (LLMs) are promising for this translation task because they can map flexible natural language into structured outputs. However, telecom intent-to-configuration translation is not simply generic JSON generation. It requires:

  1. understanding domain-specific service types such as eMBB, URLLC, mMTC, V2X, MPS, and HMTC;
  2. preserving quantitative SLOs such as latency, throughput, reliability, and UE count;
  3. producing target-layer-specific JSON structures;
  4. handling lifecycle operations and adversarial/invalid requests;
  5. generalizing beyond prompt templates and seen sectors/use cases.

Despite recent progress in telecom LLM benchmarks and domain adaptation, open resources for supervised training and rigorous evaluation of natural-language-to-configuration generation remain limited. Many existing telecom NLP resources focus on multiple-choice question answering, document understanding, or general telecom instruction following rather than structured configuration generation.

This work addresses that gap by releasing and evaluating a research-grade dataset and baseline model for multi-standard telecom intent-to-configuration translation. Our emphasis is not only on training a model, but also on building a reproducible benchmark with explicit out-of-distribution (OOD) splits, normalized metrics, and transparent failure analysis.

Contributions

We make five contributions:

  1. Research-grade dataset and benchmark. We construct TMF921-intent-to-config-research-sota, a benchmark derived from a 41,815-example TMF921 intent-to-configuration corpus. The benchmark includes in-distribution, template-OOD, use-case-OOD, sector-OOD, and adversarial splits.

  2. Reproducible QLoRA training pipeline. We provide scripts and configurations for training Qwen3-8B with QLoRA on a single RTX 6000 Ada GPU, including GPU preflight checks, resumable nohup training, adapter merging, and evaluation scripts.

  3. Normalized structured-output evaluation. We introduce a normalized evaluator that removes volatile fields such as IDs, hrefs, timestamps, descriptions, schema links, and generated identifiers before computing field/key metrics.

  4. Empirical baseline results. The fine-tuned Qwen3-8B QLoRA model achieves near-perfect JSON validity, approximately 98% normalized key F1, and 77–80% normalized field F1 across non-adversarial ID/OOD splits. On adversarial rejection, it achieves 96.97% normalized exact match.

  5. Weak-layer and negative-result analysis. We show that O1 NRM and A1 policy value fidelity remain difficult, and that a second-stage weak-layer continuation experiment does not materially improve these layers. This motivates semantic validators and better canonical data generation rather than blind oversampling.


2. Related Work

2.1 Telecom LLM benchmarks

Recent telecom NLP benchmarks such as TeleQnA, ORANBench, TSpec-LLM, SPEC5G, and TelecomGPT-related resources have advanced the evaluation of LLMs on telecom knowledge, standards understanding, and domain-specific question answering. These resources demonstrate that telecom reasoning remains challenging for general-purpose models, especially when tasks require standards-specific knowledge.

However, most existing benchmarks focus on multiple-choice or natural-language answers rather than supervised generation of structured telecom configuration objects. Our work complements these resources by focusing on intent-to-configuration translation, where the target is machine-readable JSON rather than free-form text.

2.2 Intent-based networking and network automation

Intent-Based Networking has long been proposed as a path toward autonomous network management. In this paradigm, an operator specifies goals and constraints, while an intent handler translates, validates, enforces, and monitors the resulting configuration. For 5G/6G network slicing, intents must map to multiple abstraction layers: high-level service intent, slice booking, policy control, and low-level network resource models.

Systems such as ORION and related LLM-based network automation work demonstrate the promise of LLMs for intent translation and orchestration. However, many such systems rely on proprietary frontier models, small evaluation sets, or tool-calling frameworks rather than open supervised fine-tuning datasets and reproducible baselines.

2.3 Structured-output generation and synthetic data

The task studied here belongs to structured-output generation: models must produce valid JSON with correct schema and values. Prior work on synthetic structured data generation, including approaches such as SynthIE, has shown that synthetic data can be effective when structured targets are generated deterministically and natural-language inputs are diversified. However, synthetic data also risks template leakage, surface-form memorization, and overoptimistic in-distribution evaluation.

Our benchmark explicitly addresses these risks by adding template-OOD, use-case-OOD, and sector-OOD splits, and by reporting both raw and normalized metrics.

2.4 Parameter-efficient fine-tuning

QLoRA enables efficient adaptation of large language models using 4-bit quantization and LoRA adapters. This makes it possible to fine-tune an 8B model on a single RTX 6000 Ada GPU while preserving a practical effective batch size and 2048-token context. We use QLoRA as a reproducible baseline rather than as a claim of final optimality.


3. Dataset and Benchmark Construction

3.1 Source dataset

The source dataset contains 41,815 examples of natural-language telecom intents paired with structured JSON configuration outputs. Each example is represented in ChatML-style format with system, user, and assistant messages, and includes metadata such as target layer, slice type, S-NSSAI fields, sector, region, KPI targets, and lifecycle operation.
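For concreteness, the sketch below shows what such a record looks like. All field names and values here are illustrative stand-ins, not an actual corpus row.

```python
# Illustrative record only; field names and values are hypothetical.
example = {
    "messages": [
        {"role": "system",
         "content": "Translate the operator intent into the target-layer JSON."},
        {"role": "user",
         "content": "Create a URLLC slice for automotive V2X in region EU-West "
                    "with 5 ms latency and 99.999% reliability."},
        {"role": "assistant",
         "content": "{\"intent\": {\"...\": \"structured TMF921 output\"}}"},
    ],
    "target_layer": "tmf921",
    "slice_type": "URLLC",
    "snssai": {"sst": 2, "sd": "000001"},   # URLLC uses standardized SST 2
    "sector": "automotive",
    "region": "EU-West",
    "kpi_targets": {"latency_ms": 5, "reliability_pct": 99.999},
    "lifecycle_operation": "create",
}
```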

A scientific audit of the source dataset found:

  • 41,815 total rows;
  • 0 missing values;
  • 0 duplicate IDs;
  • 100% assistant JSON parse validity;
  • 0 exact train/test full-message overlaps;
  • high near-duplicate prompt similarity in the original split;
  • strong lifecycle imbalance, with create operations dominating;
  • only 31 unique JSON structure signatures.

These findings indicate that the source dataset is technically clean and useful for SFT, but its original split is mainly in-distribution and template-like.
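The basic audit checks are straightforward to re-run. A minimal sketch follows, with the file path and column names assumed for illustration rather than taken from the released scripts.

```python
import json
import pandas as pd

# Path and column names are assumptions for illustration.
df = pd.read_json("tmf921_source.jsonl", lines=True)

def parses(text):
    """True if the given assistant message body is valid JSON."""
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False

print("rows:", len(df))
print("missing values:", int(df.isna().sum().sum()))
print("duplicate ids:", int(df["id"].duplicated().sum()))
# Assumes the assistant message is the last turn in each conversation.
print("assistant JSON validity:",
      df["messages"].map(lambda m: parses(m[-1]["content"])).mean())
```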

3.2 Research splits

We construct a research-grade derivative dataset with the following splits.

| Split | Rows | Purpose |
|---|---|---|
| train_base | 26,357 | Unaugmented training after OOD holdouts |
| train_sota | 32,357 | Training split with lifecycle/adversarial upsampling and multi-turn wrappers |
| validation | 1,547 | Validation during training |
| test_in_distribution | 1,455 | In-distribution test |
| test_template_ood | 3,503 | Held-out prompt-template family |
| test_use_case_ood | 4,341 | Held-out use cases |
| test_sector_ood | 4,579 | Held-out sectors |
| test_adversarial | 33 | Held-out adversarial rejection examples |

The OOD splits are designed to separate interpolation performance from generalization across held-out prompt forms, use cases, and sectors. They remain synthetic OOD splits, not real deployment distributions.

3.3 Added metadata

The derivative dataset adds columns for reproducibility and analysis, including:

  • prompt_template_id,
  • scenario_id,
  • json_structure_id,
  • json_root_family,
  • messages_format_valid,
  • assistant_is_valid_json,
  • slice_sst_valid,
  • kpi_profile_valid,
  • semantic_rule_valid_v1,
  • qwen3_chat_template_tokens,
  • fits_2048_qwen3,
  • fits_4096_qwen3,
  • is_augmented,
  • augmentation_type,
  • source_id,
  • conversation_type.

3.4 Token-length audit

Using the Qwen3-8B chat template, all source examples fit within 2048 tokens.

| Statistic | Tokens |
|---|---|
| Mean | 754.1 |
| p50 | 705 |
| p95 | 1293 |
| p99 | 1300 |
| Max | 1316 |
| Fit under 2048 | 100% |

This justifies max_length=2048 for Qwen3-family fine-tuning on this dataset.
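The audit can be reproduced with a short script. A sketch under the assumption that each record exposes a `messages` list as in Section 3.1:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Placeholder data; replace with the real dataset rows.
rows = [{"messages": [
    {"role": "user", "content": "Create an eMBB slice for region EU-West."},
    {"role": "assistant", "content": "{}"},
]}]

def chat_template_tokens(messages):
    """Token count of a full conversation under the Qwen3 chat template."""
    return len(tokenizer.apply_chat_template(messages, tokenize=True,
                                             add_generation_prompt=False))

lengths = sorted(chat_template_tokens(r["messages"]) for r in rows)
print("mean:", sum(lengths) / len(lengths))
print("p95:", lengths[int(0.95 * (len(lengths) - 1))])
print("fit under 2048:", sum(n <= 2048 for n in lengths) / len(lengths))
```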

3.5 Training split design

The train_sota split applies training-only upsampling to rare lifecycle and adversarial examples. This augmentation is explicitly marked via the provenance fields (is_augmented, augmentation_type, and source_id); evaluation splits remain unaugmented.
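Because upsampling is marked rather than silent, consumers can recover the unaugmented view with a one-line filter. A sketch assuming the dataset is loaded via the `datasets` library (the repository id is a placeholder):

```python
from datasets import load_dataset

# Repository id is a placeholder; substitute the released dataset path.
ds = load_dataset("your-org/TMF921-intent-to-config-research-sota")

# Drop training-only upsampled rows using the provenance columns above.
unaugmented = ds["train_sota"].filter(lambda row: not row["is_augmented"])
print(len(ds["train_sota"]), "->", len(unaugmented))
```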


4. Training Method

4.1 Base model and adapter method

We use Qwen/Qwen3-8B as the base model and train LoRA adapters with QLoRA. The main configuration is shown below.

| Item | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Training method | QLoRA SFT |
| Quantization | 4-bit NF4 + double quantization |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| Target modules | all-linear |
| Max length | 2048 |
| Loss | Assistant-only SFT loss |
| Learning rate | 2e-4 |
| Scheduler | constant |
| Optimizer | paged AdamW 32-bit |
| Gradient checkpointing | enabled |
| Hardware | RTX 6000 Ada 48/50GB |
| Train split | train_sota |
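The sketch below shows how this table maps onto a transformers/peft/trl setup. It is an approximation of the released training script, not a copy of it; the dataset path, output directory, and exact trl parameter names (which vary across versions) are assumptions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization with double quantization (QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B",
                                             quantization_config=bnb)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# LoRA adapter per the table above: rank 64, alpha 16, dropout 0.05, all-linear.
lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules="all-linear", task_type="CAUSAL_LM")

train_sota = load_dataset("your-org/TMF921-intent-to-config-research-sota",
                          split="train_sota")  # placeholder repository id

args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    max_length=2048,                 # justified by the token-length audit
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
    gradient_checkpointing=True,
    assistant_only_loss=True,        # loss on assistant tokens only
)
trainer = SFTTrainer(model=model, args=args, train_dataset=train_sota,
                     processing_class=tokenizer, peft_config=lora)
trainer.train()
```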

4.2 Reproducible execution

Training was performed using a repository that includes:

  • CUDA/GPU preflight checks;
  • installation script for RTX 6000 Ada;
  • resumable nohup training;
  • checkpoint saving;
  • adapter merge script;
  • raw and normalized evaluation scripts;
  • results packaging tools.

The primary run directory was:

runs/qwen3-8b-qlora-20260501-083834

Training converged smoothly, with no observed out-of-memory errors, NaN losses, or gradient instability.
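For downstream evaluation, the LoRA adapter is merged into full-precision base weights. A minimal sketch follows, assuming the adapter was saved in the run directory above:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in higher precision before merging adapters that
# were trained against 4-bit quantized weights.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B",
                                            torch_dtype="bfloat16")
merged = PeftModel.from_pretrained(
    base, "runs/qwen3-8b-qlora-20260501-083834"  # adapter location assumed
).merge_and_unload()

merged.save_pretrained("qwen3-8b-tmf921-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained(
    "qwen3-8b-tmf921-merged")
```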


5. Evaluation Methodology

5.1 Raw metrics

We report:

  • JSON parse rate,
  • canonical JSON exact match,
  • flattened field precision/recall/F1,
  • slice/SST diagnostic pass,
  • KPI text-presence diagnostic pass,
  • adversarial status pass.

Raw exact match is intentionally reported, but it is not the primary metric for this task because valid configurations may contain volatile identifiers and metadata.
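As one concrete reading of canonical JSON exact match, a minimal sketch is given below, assuming canonicalization means key-sorted, whitespace-normalized serialization:

```python
import json

def canonical(text: str) -> str:
    """Re-serialize JSON with sorted keys and fixed separators so that
    semantically identical strings compare equal."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

def exact_match(pred: str, gold: str) -> bool:
    # Unparseable predictions score as non-matches.
    try:
        return canonical(pred) == canonical(gold)
    except (TypeError, ValueError):
        return False
```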

5.2 Normalized metrics

We introduce normalized metrics that remove or mask volatile/generated fields before scoring. Normalization removes:

  • IDs,
  • hrefs,
  • names/descriptions,
  • timestamps,
  • schema links,
  • UUID/hash-like strings,
  • generated request/policy/booking/intent IDs.

It then computes:

  • normalized exact match,
  • normalized field precision/recall/F1,
  • normalized key precision/recall/F1.

Normalized key F1 measures structural agreement, while normalized field F1 estimates value-level agreement after volatile-field removal.
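To make this concrete, a minimal sketch of a normalized field F1 follows. The volatile-key list is an assumption derived from the categories above; the released evaluator may differ in detail (e.g., regex matching for UUID/hash-like values).

```python
import json

# Assumed volatile keys; the released evaluator may use a different list.
VOLATILE_KEYS = {"id", "href", "name", "description", "@schemaLocation",
                 "creationDate", "lastUpdate", "timestamp"}

def strip_volatile(obj):
    """Recursively drop volatile keys before scoring."""
    if isinstance(obj, dict):
        return {k: strip_volatile(v) for k, v in obj.items()
                if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [strip_volatile(v) for v in obj]
    return obj

def flatten(obj, prefix=""):
    """Yield (path, value) pairs for every leaf in a nested JSON object."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}.{k}" if prefix else k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}[{i}]")
    else:
        yield (prefix, json.dumps(obj))

def normalized_field_f1(pred_text, gold_text):
    pred = set(flatten(strip_volatile(json.loads(pred_text))))
    gold = set(flatten(strip_volatile(json.loads(gold_text))))
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Normalized key F1 follows the same pattern over the flattened paths alone, ignoring leaf values.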

5.3 OOD and adversarial evaluation

All results are reported separately for each split. We do not merge the OOD splits into a single aggregate score because each split tests a distinct generalization mode.


6. Results

6.1 Raw stage-1 results

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---|---|---|---|
| test_in_distribution | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| test_template_ood | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| test_use_case_ood | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| test_sector_ood | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| test_adversarial | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

Raw exact match is low for primary configuration layers, but this metric is overly strict because many correct or acceptable generations differ in volatile fields.

6.2 Normalized stage-1 results

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---|---|---|---|
| test_in_distribution | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| test_template_ood | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| test_use_case_ood | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| test_sector_ood | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| test_adversarial | 1.0000 | 0.9697 | 1.0000 | 0.9697 |

The fine-tuned model achieves near-perfect parseability and high structural agreement across ID and OOD splits. Normalized field F1 is stable across OOD splits, indicating that performance is not purely template memorization.

6.3 Zero-shot baseline

We evaluate zero-shot Qwen/Qwen3-8B on 200 examples per split. It performs poorly.

| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---|---|---|---|---|---|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |

This establishes that domain-specific QLoRA fine-tuning is essential for the task.

6.4 Layer-level results

The model performs best on high-level intent and API-like targets.

| Target layer | Normalized field F1 range | Interpretation |
|---|---|---|
| tmf921 | 0.93–0.94 | Strong high-level intent object generation |
| camara | 0.81–0.87 | Strong after volatile-field normalization |
| intent_3gpp | 0.80–0.82 | Strong/moderate |
| etsi_zsm | 0.75–0.79 | Moderate/strong |
| a1_policy | 0.67–0.68 | Moderate, value fidelity remains limited |
| o1_nrm | 0.39–0.40 | Weak value fidelity despite correct structure |
| tmf921_lifecycle_report | 0.15–0.18 | Weak, likely measurement/simulation mismatch |
| tmf921_lifecycle_monitor | 0.39–0.52 | Weak/mixed |

The O1 NRM result is especially informative: normalized key F1 is high, but normalized field F1 is low. This means the model learns the structural schema but fails to reliably assign correct low-level values.


7. Stage-2 Weak-Layer Continuation

To test whether weak-layer exposure alone would fix value fidelity, we continued training the stage-1 adapter on a weak-layer-focused dataset. The stage-2 dataset included all weak-layer rows, duplicated rare weak layers, and a replay buffer from strong layers.

Weak layers targeted:

  • o1_nrm,
  • a1_policy,
  • tmf921_lifecycle_report,
  • tmf921_lifecycle_monitor,
  • tmf921_lifecycle_scale.

Stage-2 global comparison:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta field | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta key |
|---|---|---|---|---|---|---|
| test_in_distribution | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| test_template_ood | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| test_use_case_ood | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| test_sector_ood | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| test_adversarial | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

Stage 2 did not meaningfully improve O1 NRM or A1 policy, and it slightly reduced adversarial robustness. The stage-2 adapter is therefore not promoted; stage 1 remains the primary model.

This negative result suggests that weak-layer errors are not solved by exposure alone. They likely require better canonical labels, layer-specific semantic evaluation, and improved generation rules.


8. Qualitative Failure Analysis

Qualitative examples are available in:

analysis/stage1_examples/failure_examples.md
analysis/stage1_examples/failure_examples.json
analysis/stage2_examples/failure_examples.md
analysis/stage2_examples/failure_examples.json

Common error types include:

  • correct JSON structure but wrong low-level values,
  • O1 NRM value fidelity errors,
  • A1 policy value errors,
  • lifecycle report measurement mismatch,
  • lifecycle monitor measurement mismatch.

These examples support the quantitative conclusion that the model is structurally strong but still struggles with certain low-level semantic/value assignments.


9. Limitations

| Limitation | Impact | Mitigation / future work |
|---|---|---|
| Synthetic data | May not reflect real operator language | Add expert/human-authored validation subset |
| No official standard validators | Cannot claim production compliance | Add TMF921/CAMARA/OpenAPI/YANG validators |
| O1 NRM weak value fidelity | Low-level RAN configuration unreliable | Add semantic evaluator and canonical labels |
| A1 policy moderate fidelity | Policy values may be wrong | Add policy-specific extractor/scorer |
| Lifecycle report/monitor weak | Measurement fields may be hard to reproduce | Use tolerance/semantic scoring |
| Exact match low | Raw exact match over-penalizes volatile fields | Report normalized metrics alongside raw |

This work should be interpreted as a research benchmark and baseline, not a production-certified telecom configuration system.


10. Conclusion

We release a research-grade open dataset, OOD benchmark, normalized evaluator, training pipeline, and Qwen3-8B QLoRA baseline for multi-standard telecom intent-to-configuration translation. The fine-tuned model achieves near-perfect JSON validity, around 98% normalized structural key F1, and 77–80% normalized field F1 across non-adversarial ID/OOD splits, with strong adversarial rejection. A zero-shot baseline shows that the base model largely fails the task without domain-specific fine-tuning. At the same time, weak-layer analysis reveals that O1 NRM and A1 policy value fidelity remain open problems. Future work should focus on layer-specific semantic evaluators, official/derived schema validators, expert validation, and Gen4 canonical scenario generation.


Citation placeholders

Add formal citations for:

  • QLoRA,
  • LoRA,
  • TRL,
  • Qwen3,
  • TeleQnA,
  • ORANBench,
  • SPEC5G,
  • TSpec-LLM,
  • TelecomGPT,
  • ORION,
  • SynthIE,
  • TMF921,
  • 3GPP TS 28.312,
  • ETSI ZSM,
  • CAMARA,
  • O-RAN A1,
  • 3GPP TS 28.541.