---
license: apache-2.0
tags:
- qlora
- sft
- trl
- peft
- qwen3
- tmf921
- intent-based-networking
- network-slicing
- rtx-6000-ada
- ml-intern
base_model:
- Qwen/Qwen3-8B
datasets:
- nraptisss/TMF921-intent-to-config-research-sota
---
# TMF921 Intent-to-Config Training + Evaluation
Training and evaluation repo for [`nraptisss/TMF921-intent-to-config-research-sota`](https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota) on a single **RTX 6000 Ada (48/50 GB VRAM)** server.
The default recipe is **Qwen3-8B + QLoRA NF4 + TRL SFTTrainer + PEFT LoRA**.
## Why this recipe
- Dataset rows were audited with `Qwen/Qwen3-8B` chat-template tokenization.
- Source max length: **1,316 tokens**, p99: **1,300**, so `max_length=2048` is safe.
- QLoRA NF4 + double quant follows the QLoRA recipe for fitting large models on one 48GB-class GPU.
- LoRA uses `target_modules="all-linear"`, recommended for QLoRA-style training.
- `assistant_only_loss=True` trains only the JSON/config response tokens.
- Evaluation is reported separately for in-distribution and OOD splits; do not report only a single merged score. A minimal configuration sketch of this recipe follows this list.
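Putting the bullets above together, here is a hedged sketch of the recipe in recent TRL/PEFT/bitsandbytes terms. It mirrors the defaults described in this README, but the LoRA rank/alpha/dropout values are illustrative assumptions; the canonical settings live in `configs/rtx6000ada_qwen3_8b_qlora.yaml`.
```python
# Sketch only: mirrors the recipe above; LoRA hyperparameters are
# illustrative, the canonical values live in configs/.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 quantization (QLoRA paper)
    bnb_4bit_use_double_quant=True,       # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules="all-linear",          # LoRA on every linear layer
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative, not canonical
)
args = SFTConfig(
    output_dir="outputs/qwen3-8b-tmf921-qlora",
    max_length=2048,                      # covers the 1,316-token source max
    assistant_only_loss=True,             # loss on response tokens only
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,        # effective batch size 16
)
train_ds = load_dataset("nraptisss/TMF921-intent-to-config-research-sota", split="train")
trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds,
                     peft_config=peft_config, processing_class=tokenizer)
trainer.train()
```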
## Hardware target
Recommended server:
- GPU: NVIDIA RTX 6000 Ada (48/50 GB VRAM)
- RAM: 64GB+
- Disk: 200GB+ free
- CUDA-compatible PyTorch
Default effective batch size:
```text
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
effective batch size = 16
max_length = 2048
```
If OOM occurs, preserve the effective batch size by changing:
```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
```
Do **not** reduce `max_length` unless you intentionally want a different training task.
## Quick start with nohup, unique run dirs, and resumable checkpoints
```bash
git clone https://huggingface.co/nraptisss/tmf921-intent-training
cd tmf921-intent-training
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
bash scripts/install_rtx6000ada.sh
python scripts/check_gpu.py
export HF_TOKEN=hf_...
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH="$PWD/src"
export TOKENIZERS_PARALLELISM=false
bash scripts/nohup_new_run.sh
```
Monitor:
```bash
RUN_DIR=runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
bash scripts/status_run.sh "$RUN_DIR"
tail -f "$RUN_DIR/logs/train.log"
watch -n 2 nvidia-smi
```
Resume:
```bash
bash scripts/nohup_resume.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
```
Evaluate:
```bash
bash scripts/nohup_eval.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
```
## Configs
- `configs/rtx6000ada_qwen3_8b_qlora.yaml` — recommended stage-1 config
- `configs/rtx6000ada_qwen3_14b_qlora_experimental.yaml` — experimental 14B config
- `configs/stage2_weak_layer_qwen3_8b.yaml` — diagnostic weak-layer continuation config
## Evaluation
Raw evaluator:
```bash
python scripts/evaluate_model.py \
--model Qwen/Qwen3-8B \
--adapter outputs/qwen3-8b-tmf921-qlora \
--dataset nraptisss/TMF921-intent-to-config-research-sota \
--output_dir outputs/qwen3-8b-tmf921-qlora/eval \
--load_in_4bit
```
Normalize existing predictions:
```bash
python scripts/normalize_eval_metrics.py \
--eval_dir outputs/qwen3-8b-tmf921-qlora/eval
```
Metrics:
- JSON parse rate
- canonical JSON exact match
- field precision / recall / F1 (a computation sketch follows this list)
- normalized field precision / recall / F1
- normalized key precision / recall / F1
- slice/SST diagnostic pass
- KPI text-presence diagnostic pass
- adversarial status pass
- stratified metrics by `target_layer`, `slice_type`, and `lifecycle_operation`
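For reference, a minimal sketch of how field-level precision/recall/F1 can be computed from predicted vs. reference JSON. The repo's evaluator applies additional normalization; `flatten_fields` and `field_f1` are hypothetical helper names, not the script's actual API.
```python
# Hedged sketch of field-level P/R/F1 over flattened JSON configs.
# The real evaluator in scripts/evaluate_model.py may differ in detail.
import json

def flatten_fields(obj, prefix=""):
    """Flatten nested JSON into a set of 'path=value' field strings."""
    fields = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            fields |= flatten_fields(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            fields |= flatten_fields(v, f"{prefix}{i}.")
    else:
        fields.add(f"{prefix.rstrip('.')}={obj}")
    return fields

def field_f1(pred_json: str, ref_json: str):
    """Return (precision, recall, f1); unparsable predictions score 0."""
    try:
        pred = flatten_fields(json.loads(pred_json))
    except json.JSONDecodeError:
        return 0.0, 0.0, 0.0
    ref = flatten_fields(json.loads(ref_json))
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```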
## Merge adapter for deployment/evaluation
```bash
python scripts/merge_adapter.py \
--base_model Qwen/Qwen3-8B \
--adapter outputs/qwen3-8b-tmf921-qlora \
--output_dir outputs/qwen3-8b-tmf921-merged
```
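Under the hood this amounts to PEFT's `merge_and_unload`; a sketch follows (the actual `scripts/merge_adapter.py` may differ in details such as dtype handling):
```python
# Sketch of adapter merging via PEFT; scripts/merge_adapter.py may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "outputs/qwen3-8b-tmf921-qlora").merge_and_unload()
merged.save_pretrained("outputs/qwen3-8b-tmf921-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("outputs/qwen3-8b-tmf921-merged")
```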
## Stage 2 weak-layer continuation
Stage 2 was implemented and tested as a diagnostic experiment. It is **not promoted** as the main model: it did not materially improve the weak O1/A1 target layers and slightly regressed adversarial performance.
Run if needed:
```bash
bash scripts/nohup_stage2_weak.sh runs/qwen3-8b-qlora-YYYYMMDD-HHMMSS
```
## Results packaging and qualitative failure analysis
After completing stage-1 and stage-2 evaluation plus normalization, package publication artifacts with:
```bash
export PYTHONPATH="$PWD/src"
python scripts/package_results.py \
--stage1_eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
--stage2_eval_dir runs/stage2-weak-20260505-080040/eval \
--output_dir results
```
This writes:
```text
results/stage1_raw_metrics.json
results/stage1_normalized_metrics.json
results/stage2_raw_metrics.json
results/stage2_normalized_metrics.json
results/metrics_summary.json
results/stage1_vs_stage2_comparison.md
```
Generate qualitative success/failure examples for the paper with:
```bash
python scripts/sample_failure_examples.py \
--eval_dir runs/qwen3-8b-qlora-20260501-083834/eval_merged \
--output_dir analysis/stage1_examples
```
Optionally also sample stage-2 examples:
```bash
python scripts/sample_failure_examples.py \
--eval_dir runs/stage2-weak-20260505-080040/eval \
--output_dir analysis/stage2_examples
```
The example sampler writes:
```text
analysis/*/failure_examples.md
analysis/*/failure_examples.json
```
These artifacts are intended for paper tables, qualitative error analysis, and reproducibility appendices.
## Scientific reporting protocol
For research papers/reports, report at least:
1. validation loss,
2. `test_in_distribution` metrics,
3. `test_template_ood` metrics,
4. `test_use_case_ood` metrics,
5. `test_sector_ood` metrics,
6. `test_adversarial` metrics,
7. per-target-layer field F1,
8. normalized field/key F1,
9. JSON parse rate,
10. rare-class metrics for lifecycle operations and adversarial categories.
Do **not** claim production standards compliance from JSON validity alone. Official TMF921/3GPP/ETSI/CAMARA/O-RAN validators are still needed for schema-level certification.
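As a convenience, here is a hedged sketch that prints per-split headline metrics from `results/metrics_summary.json`. The assumed structure (split name mapped to a metric dict, with keys like `json_parse_rate` and `field_f1`) is hypothetical; check it against the actual file before use.
```python
# Hypothetical reader for results/metrics_summary.json; the real schema
# may differ -- adjust the key names after inspecting the file.
import json

with open("results/metrics_summary.json") as f:
    summary = json.load(f)

splits = ["test_in_distribution", "test_template_ood", "test_use_case_ood",
          "test_sector_ood", "test_adversarial"]
for split in splits:
    metrics = summary.get(split, {})
    print(split,
          "parse_rate:", metrics.get("json_parse_rate"),
          "field_f1:", metrics.get("field_f1"),
          "normalized_field_f1:", metrics.get("normalized_field_f1"))
```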
## Files
```text
configs/
scripts/
src/tmf921_train/
PROJECT_JOURNAL.md
requirements.txt
```
## References
- QLoRA: https://huggingface.co/papers/2305.14314
- LoRA: https://huggingface.co/papers/2106.09685
- TRL SFTTrainer docs: https://huggingface.co/docs/trl/sft_trainer
- TRL PEFT integration: https://huggingface.co/docs/trl/peft_integration
- Source dataset: https://huggingface.co/datasets/nraptisss/TMF921-intent-to-config-research-sota
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nraptisss/tmf921-intent-training"

# Load the tokenizer and model weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
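For actual intent-to-config generation, a hedged end-to-end sketch against the merged checkpoint from the merge step above (the prompt, decoding settings, and output path are illustrative):
```python
# Illustrative generation against the merged checkpoint; the prompt and
# decoding settings are assumptions, not the repo's evaluation protocol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "outputs/qwen3-8b-tmf921-merged"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "Create an eMBB network slice intent for stadium video streaming."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,   # Qwen3 template kwarg: direct JSON, no thinking trace
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```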