---
license: apache-2.0
tags:
- qlora
- sft
- trl
- peft
- qwen3
- tmf921
- intent-based-networking
- network-slicing
- rtx-6000-ada
- ml-intern
base_model:
- Qwen/Qwen3-8B
datasets:
- nraptisss/TMF921-intent-to-config-research-sota
---
# TMF921 Intent-to-Config Training + Evaluation
Training and evaluation repo for [`nraptisss/TMF921-intent-to-config-research-sota`](https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota) on a single **RTX 6000 Ada (48/50 GB VRAM)** server.
The default recipe is **Qwen3-8B + QLoRA NF4 + TRL SFTTrainer + PEFT LoRA**.
## Why this recipe
- Dataset rows were audited with `Qwen/Qwen3-8B` chat-template tokenization.
- Source max length: **1,316 tokens**, p99: **1,300**, so `max_length=2048` is safe.
- QLoRA NF4 + double quant follows the QLoRA recipe for fitting large models on one 48GB-class GPU.
- LoRA uses `target_modules="all-linear"`, recommended for QLoRA-style training.
- `assistant_only_loss=True` trains only the JSON/config response tokens.
- Evaluation is reported separately for in-distribution and OOD splits; do not report only a single merged score. A minimal configuration sketch of this recipe follows this list.
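Putting the bullets above together, here is a hedged sketch of the recipe in recent TRL/PEFT/bitsandbytes terms. It mirrors the defaults described in this README, but the LoRA rank/alpha/dropout values are illustrative assumptions; the canonical settings live in `configs/rtx6000ada_qwen3_8b_qlora.yaml`.
```python
# Sketch only: mirrors the recipe above; LoRA hyperparameters are
# illustrative, the canonical values live in configs/.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 quantization (QLoRA paper)
    bnb_4bit_use_double_quant=True,       # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules="all-linear",          # LoRA on every linear layer
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative, not canonical
)
args = SFTConfig(
    output_dir="outputs/qwen3-8b-tmf921-qlora",
    max_length=2048,                      # covers the 1,316-token source max
    assistant_only_loss=True,             # loss on response tokens only
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,        # effective batch size 16
)
train_ds = load_dataset("nraptisss/TMF921-intent-to-config-research-sota", split="train")
trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds,
                     peft_config=peft_config, processing_class=tokenizer)
trainer.train()
```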
## Hardware target
Recommended server:
- GPU: NVIDIA RTX 6000 Ada (48/50 GB VRAM)
- RAM: 64GB+
- Disk: 200GB+ free
- CUDA-compatible PyTorch
Default effective batch size:
```text
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
effective batch size = 16
max_length = 2048
```
If OOM occurs, preserve the effective batch size by changing:
```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
```
Do **not** reduce `max_length` unless you intentionally want a different training task.
## Quick start with nohup, unique run dirs, and resumable checkpoints
```bash
git clone https://huggingface.co/nraptisss/tmf921-intent-training
cd tmf921-intent-training
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
bash scripts/install_rtx6000ada.sh
python scripts/check_gpu.py
export HF_TOKEN=hf_...
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH="$PWD/src"
export TOKENIZERS_PARALLELISM=false
bash scripts/nohup_new_run.sh
```
Monitor:
```bash
RUN_DIR=runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
bash scripts/status_run.sh "$RUN_DIR"
tail -f "$RUN_DIR/logs/train.log"
watch -n 2 nvidia-smi
```
Resume:
```bash
bash scripts/nohup_resume.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
```
Evaluate:
```bash
bash scripts/nohup_eval.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
```
## Configs
- `configs/rtx6000ada_qwen3_8b_qlora.yaml` — recommended stage-1 config
- `configs/rtx6000ada_qwen3_14b_qlora_experimental.yaml` — experimental 14B config
- `configs/stage2_weak_layer_qwen3_8b.yaml` — diagnostic weak-layer continuation config
## Evaluation
Raw evaluator:
```bash
python scripts/evaluate_model.py \
--model Qwen/Qwen3-8B \
--adapter outputs/qwen3-8b-tmf921-qlora \
--dataset nraptisss/TMF921-intent-to-config-research-sota \
--output_dir outputs/qwen3-8b-tmf921-qlora/eval \
--load_in_4bit
```
Normalize existing predictions:
```bash
python scripts/normalize_eval_metrics.py \
--eval_dir outputs/qwen3-8b-tmf921-qlora/eval
```
Metrics:
- JSON parse rate
- canonical JSON exact match
- field precision / recall / F1 (a computation sketch follows this list)
- normalized field precision / recall / F1
- normalized key precision / recall / F1
- slice/SST diagnostic pass
- KPI text-presence diagnostic pass
- adversarial status pass
- stratified metrics by `target_layer`, `slice_type`, and `lifecycle_operation`
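For reference, a minimal sketch of how field-level precision/recall/F1 can be computed from predicted vs. reference JSON. The repo's evaluator applies additional normalization; `flatten_fields` and `field_f1` are hypothetical helper names, not the script's actual API.
```python
# Hedged sketch of field-level P/R/F1 over flattened JSON configs.
# The real evaluator in scripts/evaluate_model.py may differ in detail.
import json

def flatten_fields(obj, prefix=""):
    """Flatten nested JSON into a set of 'path=value' field strings."""
    fields = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            fields |= flatten_fields(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            fields |= flatten_fields(v, f"{prefix}{i}.")
    else:
        fields.add(f"{prefix.rstrip('.')}={obj}")
    return fields

def field_f1(pred_json: str, ref_json: str):
    """Return (precision, recall, f1); unparsable predictions score 0."""
    try:
        pred = flatten_fields(json.loads(pred_json))
    except json.JSONDecodeError:
        return 0.0, 0.0, 0.0
    ref = flatten_fields(json.loads(ref_json))
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```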
## Merge adapter for deployment/evaluation
```bash
python scripts/merge_adapter.py \
--base_model Qwen/Qwen3-8B \
--adapter outputs/qwen3-8b-tmf921-qlora \
--output_dir outputs/qwen3-8b-tmf921-merged
```
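Under the hood this amounts to PEFT's `merge_and_unload`; a sketch follows (the actual `scripts/merge_adapter.py` may differ in details such as dtype handling):
```python
# Sketch of adapter merging via PEFT; scripts/merge_adapter.py may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "outputs/qwen3-8b-tmf921-qlora").merge_and_unload()
merged.save_pretrained("outputs/qwen3-8b-tmf921-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("outputs/qwen3-8b-tmf921-merged")
```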
## Stage 2 weak-layer continuation
Stage 2 was implemented and tested as a diagnostic experiment. It is **not promoted** as the main model: it did not materially improve the weak O1/A1 target layers and slightly regressed adversarial performance.
Run if needed:
```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
```
## Results packaging and qualitative failure analysis
After completing stage-1 and stage-2 evaluation plus normalization, package publication artifacts with:
```bash
export PYTHONPATH="$PWD/src"
python scripts/package_results.py \
--stage1_eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
--stage2_eval_dir runs/stage2-weak-20260505-080040/eval \
--output_dir results
```
This writes:
```text
results/stage1_raw_metrics.json
results/stage1_normalized_metrics.json
results/stage2_raw_metrics.json
results/stage2_normalized_metrics.json
results/metrics_summary.json
results/stage1_vs_stage2_comparison.md
```
Generate qualitative success/failure examples for the paper with:
```bash
python scripts/sample_failure_examples.py \
--eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
--output_dir analysis/stage1_examples
```
Optionally also sample stage-2 examples:
```bash
python scripts/sample_failure_examples.py \
--eval_dir runs/stage2-weak-20260505-080040/eval \
--output_dir analysis/stage2_examples
```
The example sampler writes:
```text
analysis/*/failure_examples.md
analysis/*/failure_examples.json
```
These artifacts are intended for paper tables, qualitative error analysis, and reproducibility appendices.
## Scientific reporting protocol
For research papers/reports, report at least:
1. validation loss,
2. `test_in_distribution` metrics,
3. `test_template_ood` metrics,
4. `test_use_case_ood` metrics,
5. `test_sector_ood` metrics,
6. `test_adversarial` metrics,
7. per-target-layer field F1,
8. normalized field/key F1,
9. JSON parse rate,
10. rare-class metrics for lifecycle operations and adversarial categories.
Do **not** claim production standards compliance from JSON validity alone. Official TMF921/3GPP/ETSI/CAMARA/O-RAN validators are still needed for schema-level certification.
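As a convenience, here is a hedged sketch that prints per-split headline metrics from `results/metrics_summary.json`. The assumed structure (split name mapped to a metric dict, with keys like `json_parse_rate` and `field_f1`) is hypothetical; check it against the actual file before use.
```python
# Hypothetical reader for results/metrics_summary.json; the real schema
# may differ -- adjust the key names after inspecting the file.
import json

with open("results/metrics_summary.json") as f:
    summary = json.load(f)

splits = ["test_in_distribution", "test_template_ood", "test_use_case_ood",
          "test_sector_ood", "test_adversarial"]
for split in splits:
    metrics = summary.get(split, {})
    print(split,
          "parse_rate:", metrics.get("json_parse_rate"),
          "field_f1:", metrics.get("field_f1"),
          "normalized_field_f1:", metrics.get("normalized_field_f1"))
```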
## Files
```text
configs/
scripts/
src/tmf921_train/
PROJECT_JOURNAL.md
requirements.txt
```
## References
- QLoRA: https://huggingface.co/papers/2305.14314
- LoRA: https://huggingface.co/papers/2106.09685
- TRL SFTTrainer docs: https://huggingface.co/docs/trl/sft_trainer
- TRL PEFT integration: https://huggingface.co/docs/trl/peft_integration
- Source dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nraptisss/tmf921-intent-training"

# Load the tokenizer and model weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
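For actual intent-to-config generation, a hedged end-to-end sketch against the merged checkpoint from the merge step above (the prompt, decoding settings, and output path are illustrative):
```python
# Illustrative generation against the merged checkpoint; the prompt and
# decoding settings are assumptions, not the repo's evaluation protocol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "outputs/qwen3-8b-tmf921-merged"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "Create an eMBB network slice intent for stadium video streaming."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,   # Qwen3 template kwarg: direct JSON, no thinking trace
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```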