
A Research-Grade Benchmark and QLoRA Baseline for Multi-Standard Telecom Intent-to-Configuration Translation

Draft version: 0.1
Status: first manuscript draft
Primary artifacts:


Abstract

Intent-Based Networking aims to let operators express high-level network goals in natural language and automatically translate them into executable configurations. However, open research resources for training and evaluating such systems across multiple telecom standards remain limited. We present a research-grade benchmark and reproducible QLoRA baseline for multi-standard telecom intent-to-configuration translation. Starting from a 41,815-example TMF921 intent-to-configuration corpus, we construct a benchmark with in-distribution, template-OOD, use-case-OOD, sector-OOD, and adversarial splits, plus provenance, validation, and token-length metadata. We fine-tune Qwen3-8B using QLoRA on a single RTX 6000 Ada GPU and evaluate with both raw JSON metrics and a normalized evaluator that removes volatile identifiers, hrefs, timestamps, and schema links. The model achieves near-perfect JSON validity across all splits, normalized structural key F1 around 98%, normalized field F1 of 79.6% in-distribution, 78.7% on template-OOD, 79.1% on use-case-OOD, 77.0% on sector-OOD, and 96.97% normalized exact match on adversarial rejection. A zero-shot Qwen3-8B baseline largely fails the task, producing valid JSON on only about one-third of non-adversarial examples and near-zero normalized field F1. A targeted weak-layer continuation experiment shows that simple oversampling does not solve O1 NRM and A1 policy value-fidelity errors, motivating layer-specific semantic evaluators and canonical scenario generation. We release the dataset, training code, evaluation scripts, metrics, and qualitative examples to support reproducible research in telecom LLMs and intent-based network management.


1. Introduction

Intent-Based Networking (IBN) is a central objective of autonomous network management: instead of manually configuring low-level network resources, operators express desired outcomes, constraints, and service-level objectives in high-level language. For 5G and emerging 6G networks, such intents must be translated into structured configuration artifacts spanning multiple standards and operational layers, including TM Forum TMF921, 3GPP intent management concepts, ETSI ZSM, CAMARA network slicing APIs, O-RAN A1 policies, and O1/NRM-style resource models.

Large language models (LLMs) are promising for this translation task because they can map flexible natural language into structured outputs. However, telecom intent-to-configuration translation is not simply generic JSON generation. It requires:

  1. understanding domain-specific service types such as eMBB, URLLC, mMTC, V2X, MPS, and HMTC;
  2. preserving quantitative SLOs such as latency, throughput, reliability, and UE count;
  3. producing target-layer-specific JSON structures;
  4. handling lifecycle operations and adversarial/invalid requests;
  5. generalizing beyond prompt templates and seen sectors/use cases.

Despite recent progress in telecom LLM benchmarks and domain adaptation, open resources for supervised training and rigorous evaluation of natural-language-to-configuration generation remain limited. Many existing telecom NLP resources focus on multiple-choice question answering, document understanding, or general telecom instruction following rather than structured configuration generation.

This work addresses that gap by releasing and evaluating a research-grade dataset and baseline model for multi-standard telecom intent-to-configuration translation. Our emphasis is not only on training a model, but also on building a reproducible benchmark with explicit out-of-distribution (OOD) splits, normalized metrics, and transparent failure analysis.

Contributions

We make five contributions:

  1. Research-grade dataset and benchmark. We construct TMF921-intent-to-config-research-sota, a benchmark derived from a 41,815-example TMF921 intent-to-configuration corpus. The benchmark includes in-distribution, template-OOD, use-case-OOD, sector-OOD, and adversarial splits.

  2. Reproducible QLoRA training pipeline. We provide scripts and configurations for training Qwen3-8B with QLoRA on a single RTX 6000 Ada GPU, including GPU preflight checks, resumable nohup training, adapter merging, and evaluation scripts.

  3. Normalized structured-output evaluation. We introduce a normalized evaluator that removes volatile fields such as IDs, hrefs, timestamps, descriptions, schema links, and generated identifiers before computing field/key metrics.

  4. Empirical baseline results. The fine-tuned Qwen3-8B QLoRA model achieves near-perfect JSON validity, approximately 98% normalized key F1, and 77–80% normalized field F1 across non-adversarial ID/OOD splits. On adversarial rejection, it achieves 96.97% normalized exact match.

  5. Weak-layer and negative-result analysis. We show that O1 NRM and A1 policy value fidelity remain difficult, and that a second-stage weak-layer continuation experiment does not materially improve these layers. This motivates semantic validators and better canonical data generation rather than blind oversampling.


2. Related Work

2.1 Telecom LLM benchmarks

Recent telecom NLP benchmarks such as TeleQnA, ORANBench, TSpec-LLM, SPEC5G, and TelecomGPT-related resources have advanced the evaluation of LLMs on telecom knowledge, standards understanding, and domain-specific question answering. These resources demonstrate that telecom reasoning remains challenging for general-purpose models, especially when tasks require standards-specific knowledge.

However, most existing benchmarks focus on multiple-choice or natural-language answers rather than supervised generation of structured telecom configuration objects. Our work complements these resources by focusing on intent-to-configuration translation, where the target is machine-readable JSON rather than free-form text.

2.2 Intent-based networking and network automation

Intent-Based Networking has long been proposed as a path toward autonomous network management. In this paradigm, an operator specifies goals and constraints, while an intent handler translates, validates, enforces, and monitors the resulting configuration. For 5G/6G network slicing, intents must map to multiple abstraction layers: high-level service intent, slice booking, policy control, and low-level network resource models.

Systems such as ORION and related LLM-based network automation work demonstrate the promise of LLMs for intent translation and orchestration. However, many such systems rely on proprietary frontier models, small evaluation sets, or tool-calling frameworks rather than open supervised fine-tuning datasets and reproducible baselines.

2.3 Structured-output generation and synthetic data

The task studied here belongs to structured-output generation: models must produce valid JSON with correct schema and values. Prior work on synthetic structured data generation, including approaches such as SynthIE, has shown that synthetic data can be effective when structured targets are generated deterministically and natural-language inputs are diversified. However, synthetic data also risks template leakage, surface-form memorization, and overoptimistic in-distribution evaluation.

Our benchmark explicitly addresses these risks by adding template-OOD, use-case-OOD, and sector-OOD splits, and by reporting both raw and normalized metrics.

2.4 Parameter-efficient fine-tuning

QLoRA enables efficient adaptation of large language models using 4-bit quantization and LoRA adapters. This makes it possible to fine-tune an 8B model on a single RTX 6000 Ada GPU while preserving a practical effective batch size and 2048-token context. We use QLoRA as a reproducible baseline rather than as a claim of final optimality.


3. Dataset and Benchmark Construction

3.1 Source dataset

The source dataset contains 41,815 examples of natural-language telecom intents paired with structured JSON configuration outputs. Each example is represented in ChatML-style format with system, user, and assistant messages, and includes metadata such as target layer, slice type, S-NSSAI fields, sector, region, KPI targets, and lifecycle operation.
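For concreteness, the sketch below shows what such a record looks like. All field names and values here are illustrative stand-ins, not an actual corpus row.

```python
# Illustrative record only; field names and values are hypothetical.
example = {
    "messages": [
        {"role": "system",
         "content": "Translate the operator intent into the target-layer JSON."},
        {"role": "user",
         "content": "Create a URLLC slice for automotive V2X in region EU-West "
                    "with 5 ms latency and 99.999% reliability."},
        {"role": "assistant",
         "content": "{\"intent\": {\"...\": \"structured TMF921 output\"}}"},
    ],
    "target_layer": "tmf921",
    "slice_type": "URLLC",
    "snssai": {"sst": 2, "sd": "000001"},   # URLLC uses standardized SST 2
    "sector": "automotive",
    "region": "EU-West",
    "kpi_targets": {"latency_ms": 5, "reliability_pct": 99.999},
    "lifecycle_operation": "create",
}
```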

A scientific audit of the source dataset found:

  • 41,815 total rows;
  • 0 missing values;
  • 0 duplicate IDs;
  • 100% assistant JSON parse validity;
  • 0 exact train/test full-message overlaps;
  • high near-duplicate prompt similarity in the original split;
  • strong lifecycle imbalance, with create operations dominating;
  • only 31 unique JSON structure signatures.

These findings indicate that the source dataset is technically clean and useful for SFT, but its original split is mainly in-distribution and template-like.
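The basic audit checks are straightforward to re-run. A minimal sketch follows, with the file path and column names assumed for illustration rather than taken from the released scripts.

```python
import json
import pandas as pd

# Path and column names are assumptions for illustration.
df = pd.read_json("tmf921_source.jsonl", lines=True)

def parses(text):
    """True if the given assistant message body is valid JSON."""
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False

print("rows:", len(df))
print("missing values:", int(df.isna().sum().sum()))
print("duplicate ids:", int(df["id"].duplicated().sum()))
# Assumes the assistant message is the last turn in each conversation.
print("assistant JSON validity:",
      df["messages"].map(lambda m: parses(m[-1]["content"])).mean())
```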

3.2 Research splits

We construct a research-grade derivative dataset with the following splits.

| Split | Rows | Purpose |
|---|---|---|
| train_base | 26,357 | Unaugmented training after OOD holdouts |
| train_sota | 32,357 | Training split with lifecycle/adversarial upsampling and multi-turn wrappers |
| validation | 1,547 | Validation during training |
| test_in_distribution | 1,455 | In-distribution test |
| test_template_ood | 3,503 | Held-out prompt-template family |
| test_use_case_ood | 4,341 | Held-out use cases |
| test_sector_ood | 4,579 | Held-out sectors |
| test_adversarial | 33 | Held-out adversarial rejection examples |

The OOD splits are designed to separate interpolation performance from generalization across held-out prompt forms, use cases, and sectors. They remain synthetic OOD splits, not real deployment distributions.

3.3 Added metadata

The derivative dataset adds columns for reproducibility and analysis, including:

  • prompt_template_id,
  • scenario_id,
  • json_structure_id,
  • json_root_family,
  • messages_format_valid,
  • assistant_is_valid_json,
  • slice_sst_valid,
  • kpi_profile_valid,
  • semantic_rule_valid_v1,
  • qwen3_chat_template_tokens,
  • fits_2048_qwen3,
  • fits_4096_qwen3,
  • is_augmented,
  • augmentation_type,
  • source_id,
  • conversation_type.

3.4 Token-length audit

Using the Qwen3-8B chat template, all source examples fit within 2048 tokens.

| Statistic | Tokens |
|---|---|
| Mean | 754.1 |
| p50 | 705 |
| p95 | 1293 |
| p99 | 1300 |
| Max | 1316 |
| Fit under 2048 | 100% |

This justifies max_length=2048 for Qwen3-family fine-tuning on this dataset.
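The audit can be reproduced with a short script. A sketch under the assumption that each record exposes a `messages` list as in Section 3.1:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Placeholder data; replace with the real dataset rows.
rows = [{"messages": [
    {"role": "user", "content": "Create an eMBB slice for region EU-West."},
    {"role": "assistant", "content": "{}"},
]}]

def chat_template_tokens(messages):
    """Token count of a full conversation under the Qwen3 chat template."""
    return len(tokenizer.apply_chat_template(messages, tokenize=True,
                                             add_generation_prompt=False))

lengths = sorted(chat_template_tokens(r["messages"]) for r in rows)
print("mean:", sum(lengths) / len(lengths))
print("p95:", lengths[int(0.95 * (len(lengths) - 1))])
print("fit under 2048:", sum(n <= 2048 for n in lengths) / len(lengths))
```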

3.5 Training split design

The train_sota split applies training-only upsampling to rare lifecycle and adversarial examples. This augmentation is explicitly marked via the provenance fields (is_augmented, augmentation_type, and source_id); evaluation splits remain unaugmented.
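Because upsampling is marked rather than silent, consumers can recover the unaugmented view with a one-line filter. A sketch assuming the dataset is loaded via the `datasets` library (the repository id is a placeholder):

```python
from datasets import load_dataset

# Repository id is a placeholder; substitute the released dataset path.
ds = load_dataset("your-org/TMF921-intent-to-config-research-sota")

# Drop training-only upsampled rows using the provenance columns above.
unaugmented = ds["train_sota"].filter(lambda row: not row["is_augmented"])
print(len(ds["train_sota"]), "->", len(unaugmented))
```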


4. Training Method

4.1 Base model and adapter method

We use Qwen/Qwen3-8B as the base model and train LoRA adapters with QLoRA. The main configuration is shown below.

| Item | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Training method | QLoRA SFT |
| Quantization | 4-bit NF4 + double quantization |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| Target modules | all-linear |
| Max length | 2048 |
| Loss | Assistant-only SFT loss |
| Learning rate | 2e-4 |
| Scheduler | constant |
| Optimizer | paged AdamW 32-bit |
| Gradient checkpointing | enabled |
| Hardware | RTX 6000 Ada 48/50GB |
| Train split | train_sota |
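The sketch below shows how this table maps onto a transformers/peft/trl setup. It is an approximation of the released training script, not a copy of it; the dataset path, output directory, and exact trl parameter names (which vary across versions) are assumptions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization with double quantization (QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B",
                                             quantization_config=bnb)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# LoRA adapter per the table above: rank 64, alpha 16, dropout 0.05, all-linear.
lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules="all-linear", task_type="CAUSAL_LM")

train_sota = load_dataset("your-org/TMF921-intent-to-config-research-sota",
                          split="train_sota")  # placeholder repository id

args = SFTConfig(
    output_dir="runs/qwen3-8b-qlora",
    max_length=2048,                 # justified by the token-length audit
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
    gradient_checkpointing=True,
    assistant_only_loss=True,        # loss on assistant tokens only
)
trainer = SFTTrainer(model=model, args=args, train_dataset=train_sota,
                     processing_class=tokenizer, peft_config=lora)
trainer.train()
```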

4.2 Reproducible execution

Training was performed using a repository that includes:

  • CUDA/GPU preflight checks;
  • installation script for RTX 6000 Ada;
  • resumable nohup training;
  • checkpoint saving;
  • adapter merge script;
  • raw and normalized evaluation scripts;
  • results packaging tools.

The primary run directory was:

runs/qwen3-8b-qlora-20260501-083834

Training converged smoothly, with no observed out-of-memory errors, NaN losses, or gradient instability.
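For downstream evaluation, the LoRA adapter is merged into full-precision base weights. A minimal sketch follows, assuming the adapter was saved in the run directory above:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in higher precision before merging adapters that
# were trained against 4-bit quantized weights.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B",
                                            torch_dtype="bfloat16")
merged = PeftModel.from_pretrained(
    base, "runs/qwen3-8b-qlora-20260501-083834"  # adapter location assumed
).merge_and_unload()

merged.save_pretrained("qwen3-8b-tmf921-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained(
    "qwen3-8b-tmf921-merged")
```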


5. Evaluation Methodology

5.1 Raw metrics

We report:

  • JSON parse rate,
  • canonical JSON exact match,
  • flattened field precision/recall/F1,
  • slice/SST diagnostic pass,
  • KPI text-presence diagnostic pass,
  • adversarial status pass.

Raw exact match is intentionally reported, but it is not the primary metric for this task because valid configurations may contain volatile identifiers and metadata.
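As one concrete reading of canonical JSON exact match, a minimal sketch is given below, assuming canonicalization means key-sorted, whitespace-normalized serialization:

```python
import json

def canonical(text: str) -> str:
    """Re-serialize JSON with sorted keys and fixed separators so that
    semantically identical strings compare equal."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

def exact_match(pred: str, gold: str) -> bool:
    # Unparseable predictions score as non-matches.
    try:
        return canonical(pred) == canonical(gold)
    except (TypeError, ValueError):
        return False
```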

5.2 Normalized metrics

We introduce normalized metrics that remove or mask volatile/generated fields before scoring. Normalization removes:

  • IDs,
  • hrefs,
  • names/descriptions,
  • timestamps,
  • schema links,
  • UUID/hash-like strings,
  • generated request/policy/booking/intent IDs.

It then computes:

  • normalized exact match,
  • normalized field precision/recall/F1,
  • normalized key precision/recall/F1.

Normalized key F1 measures structural agreement, while normalized field F1 estimates value-level agreement after volatile-field removal.
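To make this concrete, a minimal sketch of a normalized field F1 follows. The volatile-key list is an assumption derived from the categories above; the released evaluator may differ in detail (e.g., regex matching for UUID/hash-like values).

```python
import json

# Assumed volatile keys; the released evaluator may use a different list.
VOLATILE_KEYS = {"id", "href", "name", "description", "@schemaLocation",
                 "creationDate", "lastUpdate", "timestamp"}

def strip_volatile(obj):
    """Recursively drop volatile keys before scoring."""
    if isinstance(obj, dict):
        return {k: strip_volatile(v) for k, v in obj.items()
                if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [strip_volatile(v) for v in obj]
    return obj

def flatten(obj, prefix=""):
    """Yield (path, value) pairs for every leaf in a nested JSON object."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}.{k}" if prefix else k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}[{i}]")
    else:
        yield (prefix, json.dumps(obj))

def normalized_field_f1(pred_text, gold_text):
    pred = set(flatten(strip_volatile(json.loads(pred_text))))
    gold = set(flatten(strip_volatile(json.loads(gold_text))))
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Normalized key F1 follows the same pattern over the flattened paths alone, ignoring leaf values.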

5.3 OOD and adversarial evaluation

All results are reported separately for each split. We do not merge the OOD splits into a single aggregate score because each split tests a distinct generalization mode.


6. Results

6.1 Raw stage-1 results

| Split | JSON parse | Exact match | Field F1 | KPI presence |
|---|---|---|---|---|
| test_in_distribution | 1.0000 | 0.0227 | 0.6868 | 0.7973 |
| test_template_ood | 1.0000 | 0.0014 | 0.6790 | 0.8062 |
| test_use_case_ood | 0.9998 | 0.0122 | 0.6825 | 0.7883 |
| test_sector_ood | 1.0000 | 0.0166 | 0.6610 | 0.7733 |
| test_adversarial | 1.0000 | 0.9697 | 0.9697 | 1.0000 |

Raw exact match is low for primary configuration layers, but this metric is overly strict because many correct or acceptable generations differ in volatile fields.

6.2 Normalized stage-1 results

| Split | JSON parse | Normalized field F1 | Normalized key F1 | Normalized exact |
|---|---|---|---|---|
| test_in_distribution | 1.0000 | 0.7956 | 0.9811 | 0.0351 |
| test_template_ood | 1.0000 | 0.7865 | 0.9801 | 0.0177 |
| test_use_case_ood | 0.9998 | 0.7907 | 0.9805 | 0.0253 |
| test_sector_ood | 1.0000 | 0.7697 | 0.9818 | 0.0293 |
| test_adversarial | 1.0000 | 0.9697 | 1.0000 | 0.9697 |

The fine-tuned model achieves near-perfect parseability and high structural agreement across ID and OOD splits. Normalized field F1 is stable across OOD splits, indicating that performance is not purely template memorization.

6.3 Zero-shot baseline

We evaluate zero-shot Qwen/Qwen3-8B on 200 examples per split. It performs poorly.

| Split | Zero-shot parse | Fine-tuned parse | Zero-shot norm field F1 | Fine-tuned norm field F1 | Zero-shot norm key F1 | Fine-tuned norm key F1 |
|---|---|---|---|---|---|---|
| ID | 0.335 | 1.000 | 0.0009 | 0.7956 | 0.0169 | 0.9811 |
| Template OOD | 0.340 | 1.000 | 0.0014 | 0.7865 | 0.0172 | 0.9801 |
| Use-case OOD | 0.325 | 0.9998 | 0.0012 | 0.7907 | 0.0198 | 0.9805 |
| Sector OOD | 0.345 | 1.000 | 0.0008 | 0.7697 | 0.0171 | 0.9818 |
| Adversarial | 0.000 | 1.000 | 0.0000 | 0.9697 | 0.0000 | 1.0000 |

This establishes that domain-specific QLoRA fine-tuning is essential for the task.

6.4 Layer-level results

The model performs best on high-level intent and API-like targets.

| Target layer | Normalized field F1 range | Interpretation |
|---|---|---|
| tmf921 | 0.93–0.94 | Strong high-level intent object generation |
| camara | 0.81–0.87 | Strong after volatile-field normalization |
| intent_3gpp | 0.80–0.82 | Strong/moderate |
| etsi_zsm | 0.75–0.79 | Moderate/strong |
| a1_policy | 0.67–0.68 | Moderate, value fidelity remains limited |
| o1_nrm | 0.39–0.40 | Weak value fidelity despite correct structure |
| tmf921_lifecycle_report | 0.15–0.18 | Weak, likely measurement/simulation mismatch |
| tmf921_lifecycle_monitor | 0.39–0.52 | Weak/mixed |

The O1 NRM result is especially informative: normalized key F1 is high, but normalized field F1 is low. This means the model learns the structural schema but fails to reliably assign correct low-level values.


7. Stage-2 Weak-Layer Continuation

To test whether weak-layer exposure alone would fix value fidelity, we continued training the stage-1 adapter on a weak-layer-focused dataset. The stage-2 dataset included all weak-layer rows, duplicated rare weak layers, and a replay buffer from strong layers.

Weak layers targeted:

  • o1_nrm,
  • a1_policy,
  • tmf921_lifecycle_report,
  • tmf921_lifecycle_monitor,
  • tmf921_lifecycle_scale.

Stage-2 global comparison:

| Split | Stage 1 norm field F1 | Stage 2 norm field F1 | Delta field | Stage 1 norm key F1 | Stage 2 norm key F1 | Delta key |
|---|---|---|---|---|---|---|
| test_in_distribution | 0.7956 | 0.7952 | -0.0003 | 0.9811 | 0.9796 | -0.0014 |
| test_template_ood | 0.7865 | 0.7855 | -0.0010 | 0.9801 | 0.9786 | -0.0015 |
| test_use_case_ood | 0.7907 | 0.7895 | -0.0012 | 0.9805 | 0.9787 | -0.0018 |
| test_sector_ood | 0.7697 | 0.7694 | -0.0002 | 0.9818 | 0.9809 | -0.0009 |
| test_adversarial | 0.9697 | 0.9596 | -0.0101 | 1.0000 | 0.9697 | -0.0303 |

Stage 2 did not meaningfully improve O1 NRM or A1 policy, and it slightly reduced adversarial robustness. The stage-2 adapter is therefore not promoted; stage 1 remains the primary model.

This negative result suggests that weak-layer errors are not solved by exposure alone. They likely require better canonical labels, layer-specific semantic evaluation, and improved generation rules.


8. Qualitative Failure Analysis

Qualitative examples are available in:

analysis/stage1_examples/failure_examples.md
analysis/stage1_examples/failure_examples.json
analysis/stage2_examples/failure_examples.md
analysis/stage2_examples/failure_examples.json

Common error types include:

  • correct JSON structure but wrong low-level values,
  • O1 NRM value fidelity errors,
  • A1 policy value errors,
  • lifecycle report measurement mismatch,
  • lifecycle monitor measurement mismatch.

These examples support the quantitative conclusion that the model is structurally strong but still struggles with certain low-level semantic/value assignments.


9. Limitations

| Limitation | Impact | Mitigation / future work |
|---|---|---|
| Synthetic data | May not reflect real operator language | Add expert/human-authored validation subset |
| No official standard validators | Cannot claim production compliance | Add TMF921/CAMARA/OpenAPI/YANG validators |
| O1 NRM weak value fidelity | Low-level RAN configuration unreliable | Add semantic evaluator and canonical labels |
| A1 policy moderate fidelity | Policy values may be wrong | Add policy-specific extractor/scorer |
| Lifecycle report/monitor weak | Measurement fields may be hard to reproduce | Use tolerance/semantic scoring |
| Exact match low | Raw exact match over-penalizes volatile fields | Report normalized metrics alongside raw |

This work should be interpreted as a research benchmark and baseline, not a production-certified telecom configuration system.


10. Conclusion

We release a research-grade open dataset, OOD benchmark, normalized evaluator, training pipeline, and Qwen3-8B QLoRA baseline for multi-standard telecom intent-to-configuration translation. The fine-tuned model achieves near-perfect JSON validity, around 98% normalized structural key F1, and 77–80% normalized field F1 across non-adversarial ID/OOD splits, with strong adversarial rejection. A zero-shot baseline shows that the base model largely fails the task without domain-specific fine-tuning. At the same time, weak-layer analysis reveals that O1 NRM and A1 policy value fidelity remain open problems. Future work should focus on layer-specific semantic evaluators, official/derived schema validators, expert validation, and Gen4 canonical scenario generation.


Citation placeholders

Add formal citations for:

  • QLoRA,
  • LoRA,
  • TRL,
  • Qwen3,
  • TeleQnA,
  • ORANBench,
  • SPEC5G,
  • TSpec-LLM,
  • TelecomGPT,
  • ORION,
  • SynthIE,
  • TMF921,
  • 3GPP TS 28.312,
  • ETSI ZSM,
  • CAMARA,
  • O-RAN A1,
  • 3GPP TS 28.541.