TMF921 Intent-to-Config Training + Evaluation

Training and evaluation repo for nraptisss/TMF921-intent-to-config-research-sota on a single RTX 6000 Ada 48/50GB server.

The default recipe is Qwen3-8B + QLoRA NF4 + TRL SFTTrainer + PEFT LoRA.

Why this recipe

  • Dataset rows were audited with Qwen/Qwen3-8B chat-template tokenization.
  • Source max length: 1,316 tokens, p99: 1,300, so max_length=2048 is safe.
  • QLoRA NF4 + double quant follows the QLoRA recipe for fitting large models on one 48GB-class GPU.
  • LoRA uses target_modules="all-linear", recommended for QLoRA-style training.
  • assistant_only_loss=True trains only the JSON/config response tokens.
  • Evaluation is reported separately for the in-distribution and OOD splits; do not report only a single merged score.
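The effect of assistant_only_loss can be illustrated with a simplified sketch. The function below is hypothetical (it is not the TRL implementation, which derives assistant spans from the chat template), but it shows the same idea using PyTorch's standard -100 ignore index:

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips labels set to -100

def mask_non_assistant_labels(token_ids, roles):
    """Return labels where only assistant-turn tokens contribute to the loss.

    token_ids: per-token ids of the tokenized chat text
    roles:     per-token role tag ("system", "user", or "assistant")
    """
    return [tid if role == "assistant" else IGNORE_INDEX
            for tid, role in zip(token_ids, roles)]
```

With assistant_only_loss=True, only the JSON/config response tokens receive gradient; prompt and system tokens are ignored.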

Hardware target

Recommended server:

  • GPU: NVIDIA RTX 6000 Ada, 48GB/50GB VRAM
  • RAM: 64GB+
  • Disk: 200GB+ free
  • CUDA-compatible PyTorch

Default effective batch size:

per_device_train_batch_size = 2
gradient_accumulation_steps = 8
effective batch size = 16
max_length = 2048

If you hit out-of-memory (OOM) errors, preserve the effective batch size of 16 by changing:

per_device_train_batch_size = 1
gradient_accumulation_steps = 16

Do not reduce max_length unless you intentionally want a different training task.
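Both settings above produce equivalent gradient updates; a quick sanity check in plain Python (no training dependencies, single-GPU assumption):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Effective batch size = per-device batch * accumulation steps * GPU count."""
    return per_device * grad_accum * num_gpus

default_bs = effective_batch_size(per_device=2, grad_accum=8)    # recommended recipe
fallback_bs = effective_batch_size(per_device=1, grad_accum=16)  # OOM fallback
assert default_bs == fallback_bs == 16
```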

Quick start with nohup, unique run dirs, and resumable checkpoints

git clone https://huggingface.co/nraptisss/tmf921-intent-training
cd tmf921-intent-training

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
bash scripts/install_rtx6000ada.sh
python scripts/check_gpu.py

export HF_TOKEN=hf_...
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH="$PWD/src"
export TOKENIZERS_PARALLELISM=false

bash scripts/nohup_new_run.sh

Monitor:

RUN_DIR=runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
bash scripts/status_run.sh "$RUN_DIR"
tail -f "$RUN_DIR/logs/train.log"
watch -n 2 nvidia-smi

Resume:

bash scripts/nohup_resume.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS

Evaluate:

bash scripts/nohup_eval.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS

Configs

  • configs/rtx6000ada_qwen3_8b_qlora.yaml — recommended stage-1 config
  • configs/rtx6000ada_qwen3_14b_qlora_experimental.yaml — experimental 14B config
  • configs/stage2_weak_layer_qwen3_8b.yaml — diagnostic weak-layer continuation config
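To show the kind of settings the stage-1 config encodes, here is an illustrative fragment. The key names are assumptions, not the repo's exact schema (see configs/rtx6000ada_qwen3_8b_qlora.yaml for the real file); the values follow this README:

```yaml
# Illustrative only -- consult configs/rtx6000ada_qwen3_8b_qlora.yaml for the actual schema
model_name: Qwen/Qwen3-8B
load_in_4bit: true               # QLoRA
bnb_4bit_quant_type: nf4         # NF4 quantization
bnb_4bit_use_double_quant: true  # double quantization
max_length: 2048
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
lora_target_modules: all-linear
assistant_only_loss: true
```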

Evaluation

Raw evaluator:

python scripts/evaluate_model.py \
  --model Qwen/Qwen3-8B \
  --adapter outputs/qwen3-8b-tmf921-qlora \
  --dataset nraptisss/TMF921-intent-to-config-research-sota \
  --output_dir outputs/qwen3-8b-tmf921-qlora/eval \
  --load_in_4bit

Normalize existing predictions:

python scripts/normalize_eval_metrics.py \
  --eval_dir outputs/qwen3-8b-tmf921-qlora/eval

Metrics:

  • JSON parse rate
  • canonical JSON exact match
  • field precision / recall / F1
  • normalized field precision / recall / F1
  • normalized key precision / recall / F1
  • slice/SST diagnostic pass
  • KPI text-presence diagnostic pass
  • adversarial status pass
  • stratified metrics by target_layer, slice_type, and lifecycle_operation
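Simplified sketches of the first two metrics, assuming predictions and references are flat key-value JSON objects with hashable values (the actual logic lives in scripts/evaluate_model.py and scripts/normalize_eval_metrics.py and may differ):

```python
import json

def json_parse_rate(predictions):
    """Fraction of model outputs that parse as valid JSON."""
    ok = 0
    for text in predictions:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(predictions)

def field_prf1(pred: dict, ref: dict):
    """Field-level precision/recall/F1 over exact (key, value) pairs."""
    pred_fields = set(pred.items())
    ref_fields = set(ref.items())
    tp = len(pred_fields & ref_fields)
    precision = tp / len(pred_fields) if pred_fields else 0.0
    recall = tp / len(ref_fields) if ref_fields else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

The normalized variants apply the same computation after canonicalizing values, which is why raw and normalized scores are reported side by side.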

Merge adapter for deployment/evaluation

python scripts/merge_adapter.py \
  --base_model Qwen/Qwen3-8B \
  --adapter outputs/qwen3-8b-tmf921-qlora \
  --output_dir outputs/qwen3-8b-tmf921-merged

Stage 2 weak-layer continuation

Stage 2 was implemented and tested as a diagnostic experiment. It is not promoted as the main model because it did not materially improve O1/A1 and slightly regressed adversarial performance.

Run if needed:

bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS

Results packaging and qualitative failure analysis

After completing stage-1 and stage-2 evaluation plus normalization, package publication artifacts with:

export PYTHONPATH="$PWD/src"

python scripts/package_results.py \
  --stage1_eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
  --stage2_eval_dir runs/stage2-weak-20260505-080040/eval \
  --output_dir results

This writes:

results/stage1_raw_metrics.json
results/stage1_normalized_metrics.json
results/stage2_raw_metrics.json
results/stage2_normalized_metrics.json
results/metrics_summary.json
results/stage1_vs_stage2_comparison.md

Generate qualitative success/failure examples for the paper with:

python scripts/sample_failure_examples.py \
  --eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
  --output_dir analysis/stage1_examples

Optionally also sample stage-2 examples:

python scripts/sample_failure_examples.py \
  --eval_dir runs/stage2-weak-20260505-080040/eval \
  --output_dir analysis/stage2_examples

The example sampler writes:

analysis/*/failure_examples.md
analysis/*/failure_examples.json

These artifacts are intended for paper tables, qualitative error analysis, and reproducibility appendices.

Scientific reporting protocol

For research papers/reports, report at least:

  1. validation loss,
  2. test_in_distribution metrics,
  3. test_template_ood metrics,
  4. test_use_case_ood metrics,
  5. test_sector_ood metrics,
  6. test_adversarial metrics,
  7. per-target-layer field F1,
  8. normalized field/key F1,
  9. JSON parse rate,
  10. rare-class metrics for lifecycle operations and adversarial categories.

Do not claim production standards compliance from JSON validity alone. Official TMF921/3GPP/ETSI/CAMARA/O-RAN validators are still needed for schema-level certification.
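The per-split reporting rule above can be enforced mechanically. A minimal sketch, assuming each split's metrics are already available as a dict (the split names follow the list above; the function is hypothetical, not part of the repo):

```python
REQUIRED_SPLITS = [
    "test_in_distribution",
    "test_template_ood",
    "test_use_case_ood",
    "test_sector_ood",
    "test_adversarial",
]

def build_report(split_metrics: dict) -> dict:
    """Assemble a per-split report, refusing to emit a single merged score."""
    missing = [s for s in REQUIRED_SPLITS if s not in split_metrics]
    if missing:
        raise ValueError(f"missing evaluation splits: {missing}")
    return {split: split_metrics[split] for split in REQUIRED_SPLITS}
```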

Files

configs/
scripts/
src/tmf921_train/
PROJECT_JOURNAL.md
requirements.txt
