# LLM auto-annotation for the HICO-DET dataset (Pose from [Halpe](https://github.com/Fang-Haoshu/Halpe-FullBody), Part State from [HAKE](https://github.com/DirtyHarryLYL/HAKE))
|
|
## Environment
The code is developed with Python 3.11.11 on Ubuntu 21.xx, using torch==2.6.0+cu124 and
transformers==4.57.3 (with Qwen3-series support).
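To confirm your environment matches before running anything, a quick check like the following (it only prints the installed versions and CUDA availability) can help:
```
import torch
import transformers

# The repo was developed against torch 2.6.0+cu124 and transformers 4.57.3.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```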
|
|
## Annotating HICO-DET
### A. Installation
1. Install the required packages and dependencies.
2. Clone this repo; we'll call the cloned directory ${ROOT}.
3. Create the necessary directories:
```
mkdir outputs
mkdir model_weights
```
4. Download the LLM weights from Hugging Face into model_weights.
| |
### B. Prepare Dataset
5. Install the COCO API:
```
pip install pycocotools
```
6. Download the [dataset](https://huggingface.co/datasets/ayh015/HICO-Det_Halpe_HAKE).
7. Organize the dataset. Your directory tree should look like this (there may be extra files); a quick sanity check is sketched after the tree:
```
{DATA_ROOT}
|-- Annotation
|   |--hico-det-instance-level
|   |   |--hico-det-training-set-instance-level.json
|   `--hico-fullbody-pose
|       |--halpe_train_v1.json
|       `--halpe_val_v1.json
|-- Configs
|   |--hico_hoi_list.txt
|   `--Part_State_76.txt
|-- Images
|   |--images
|   |--test2015
|   |   |--HICO_test2015_00000001.jpg
|   |   |--HICO_test2015_00000002.jpg
|   |   ...
|   `--train2015
|       |--HICO_train2015_00000001.jpg
|       |--HICO_train2015_00000002.jpg
|       ...
`-- Logic_Rules
    |--gather_rule.pkl
    `--read_rules.py
```
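If you want to sanity-check the layout before launching the annotation jobs, a minimal sketch along these lines works; `DATA_ROOT` below is a placeholder you must point at your own copy, and the listed paths simply mirror the tree above:
```
import os

DATA_ROOT = "/path/to/HICO-Det"  # placeholder: set to your {DATA_ROOT}

# Key files/directories from the tree above.
expected = [
    "Annotation/hico-det-instance-level/hico-det-training-set-instance-level.json",
    "Annotation/hico-fullbody-pose/halpe_train_v1.json",
    "Annotation/hico-fullbody-pose/halpe_val_v1.json",
    "Configs/hico_hoi_list.txt",
    "Configs/Part_State_76.txt",
    "Images",
    "Logic_Rules/gather_rule.pkl",
    "Logic_Rules/read_rules.py",
]

for rel in expected:
    path = os.path.join(DATA_ROOT, rel)
    print(("OK      " if os.path.exists(path) else "MISSING ") + path)
```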
### C. Start annotation
#### Modify `data_path`, `model_path`, and `output_dir` in "{ROOT}/scripts/annotate_hico.sh" according to your configuration.
```
IDX={YOUR_GPU_IDS}
export PYTHONPATH=$PYTHONPATH:./

data_path={DATA_ROOT}
model_path={ROOT}/model_weights/{YOUR_MODEL_NAME}
output_dir={ROOT}/outputs

if [ -d ${output_dir} ]; then
    echo "dir already exists"
else
    mkdir ${output_dir}
fi

CUDA_VISIBLE_DEVICES=$IDX OMP_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node={NUM_YOUR_GPUs} --master_port=25005 \
    tools/annotate_hico.py \
    --model-path ${model_path} \
    --data-path ${data_path} \
    --output-dir ${output_dir}
```
#### Start auto-annotation
```
bash scripts/annotate_hico.sh
```

### D. Multi-stage HICO pipeline
The repository now supports a 3-stage HICO workflow:
|
|
1. Long description generation
2. Description refinement
3. Description examination / checking

Each stage writes per-rank JSON files first, then merges them into one JSON file for the next stage.
|
|
#### Stage 1. Generate long descriptions
This is the original HICO annotation stage. It uses `Conversation` in `data/convsersation.py`.

Run:
```
bash scripts/annotate_hico.sh
```

This creates per-rank files such as:
```
outputs/labels_0.json
outputs/labels_1.json
```

Merge them with:
```
python3 tools/merge_json_outputs.py \
    --input-dir outputs \
    --pattern "labels_*.json" \
    --output-path outputs/merged_labels.json
```
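Conceptually, the merge step just concatenates the per-rank lists into a single list. The sketch below is a minimal stand-in for illustration; the actual `tools/merge_json_outputs.py` may differ in its details:
```
import glob
import json
import os

input_dir = "outputs"
pattern = "labels_*.json"
output_path = "outputs/merged_labels.json"

merged = []
# Each per-rank file is assumed to hold a list of annotation dicts.
for path in sorted(glob.glob(os.path.join(input_dir, pattern))):
    with open(path) as f:
        merged.extend(json.load(f))

with open(output_path, "w") as f:
    json.dump(merged, f)
print(f"merged {len(merged)} entries into {output_path}")
```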
|
|
#### Stage 2. Refine generated descriptions
This stage reads a merged JSON from Stage 1 and adds a `refined_description` field. It uses `Conversation_For_Clean_Descrption` in `data/convsersation.py`.

Modify `data_path`, `model_path`, `annotation_path`, and `output_dir` in `scripts/refine_hico.sh`, then run:
```
bash scripts/refine_hico.sh
```

This creates files such as:
```
outputs/refine/refine_labels_0.json
```

Merge them with:
```
python3 tools/merge_json_outputs.py \
    --input-dir outputs/refine \
    --pattern "refine_labels_*.json" \
    --output-path outputs/merged_refine.json
```
|
|
#### Stage 3. Examine / check generated descriptions
This stage reads a merged JSON from Stage 2 and adds an `examiner_result` field. It uses `Conversation_examiner` in `data/convsersation.py`.

Modify `data_path`, `model_path`, `annotation_path`, and `output_dir` in `scripts/examine_hico.sh`, then run:
```
bash scripts/examine_hico.sh
```

This creates files such as:
```
outputs/examiner/examiner_labels_0.json
```

Merge them with:
```
python3 tools/merge_json_outputs.py \
    --input-dir outputs/examiner \
    --pattern "examiner_labels_*.json" \
    --output-path outputs/merged_examine.json
```
|
|
#### One-shot pipeline
If you want to run all 3 stages end-to-end, use:
```
bash scripts/pipeline_hico.sh
```
|
|
Before running it, edit the following variables in `scripts/pipeline_hico.sh`:
|
|
- `DATA_PATH`
- `LONG_MODEL_PATH`
- `REFINE_MODEL_PATH`
- `EXAMINE_MODEL_PATH`
- `LONG_GPU_IDS`
- `REFINE_GPU_IDS`
- `EXAMINE_GPU_IDS`
- `LONG_NPROC`
- `REFINE_NPROC`
- `EXAMINE_NPROC`
|
|
The pipeline will produce (a quick consistency check is sketched after the list):
|
|
- `outputs/pipeline/merged_long.json`
- `outputs/pipeline/merged_refine.json`
- `outputs/pipeline/merged_examine.json`
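Once the pipeline finishes, a quick consistency check is to confirm the three merged files exist and report how many entries each holds (a minimal sketch, assuming each file is a JSON list as produced by the merge step):
```
import json

for name in ("merged_long", "merged_refine", "merged_examine"):
    path = f"outputs/pipeline/{name}.json"
    with open(path) as f:
        entries = json.load(f)
    print(f"{path}: {len(entries)} entries")
```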
|
|
### E. Using different VLM backends
The HICO scripts are no longer hardcoded to Qwen only. The model loading logic is centralized in `tools/vlm_backend.py`, so you can use different VLM families for long-description generation, refinement, and examination.
|
|
The following scripts support backend selection:

- `tools/annotate_hico.py`
- `tools/refine_hico.py`
- `tools/examine_hico.py`
- `tools/clean_initial_annotation.py`
|
|
Each of them accepts:

- `--model-path`
- `--model-backend`
- `--torch-dtype`
|
|
Example:
```
torchrun --nnodes=1 --nproc_per_node=1 tools/annotate_hico.py \
    --model-path /path/to/model \
    --model-backend auto \
    --torch-dtype bfloat16 \
    --data-path ../datasets/HICO-Det \
    --output-dir outputs/test \
    --max-samples 5
```
|
|
You may also force a backend explicitly, for example:
```
--model-backend qwen3_vl
--model-backend qwen3_vl_moe
--model-backend llava
--model-backend deepseek_vl
--model-backend hf_vision2seq
--model-backend hf_causal_vlm
```
|
|
#### Where to customize for a new model
If you want to adapt the repository to a new model family, the main file to edit is:

- `tools/vlm_backend.py`
|
|
This file controls:

- backend detection: `infer_model_backend(...)`
- model/processor loading: `load_model_and_processor(...)`
- prompt/image packaging: `build_batch_tensors(...)`
- output decoding: `decode_generated_text(...)`
|
|
In most cases, you do not need to change the HICO task scripts themselves; the sketch below illustrates how a task script is expected to go through these helpers.
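For orientation only, here is a rough usage sketch of the four helpers listed above; the argument names and return values are assumptions for illustration and may differ from the actual code in `tools/vlm_backend.py`:
```
# Illustrative sketch only; signatures are assumptions, not the repo's exact API.
from PIL import Image
from tools.vlm_backend import (
    infer_model_backend,
    load_model_and_processor,
    build_batch_tensors,
    decode_generated_text,
)

model_path = "/path/to/model"
backend = infer_model_backend(model_path)  # e.g. "qwen3_vl", "llava", ...
model, processor = load_model_and_processor(model_path, backend)

image = Image.open("example.jpg")
# Package the prompt and image into model-ready inputs for this backend.
inputs = build_batch_tensors(processor, backend, prompt="Describe the interaction.", images=[image])
output_ids = model.generate(**inputs, max_new_tokens=256)

# Turn generated token ids back into plain text.
print(decode_generated_text(processor, backend, output_ids))
```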
|
|
#### How to add a new model backend
There are three common situations.

1. The model already works with Hugging Face `AutoProcessor` and `AutoModelForVision2Seq` or `AutoModelForCausalLM`.
In that case, you may only need to run with:
```
--model-backend auto
```
or explicitly:
```
--model-backend hf_vision2seq
```
or:
```
--model-backend hf_causal_vlm
```
|
|
2. The model needs custom backend detection.
Add a rule inside `infer_model_backend(...)` in `tools/vlm_backend.py`.

3. The model needs a custom class or a custom multimodal input format.
Add a new branch inside the following functions (a sketch follows this list):
- `load_model_and_processor(...)`
- `build_batch_tensors(...)`
- `decode_generated_text(...)` if needed
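As a purely illustrative example (the real function bodies in `tools/vlm_backend.py` will look different, and `my_new_vlm` is a hypothetical backend name), a new loading branch for a Hugging Face vision2seq-style model could follow this pattern:
```
# Hypothetical helper for a new "my_new_vlm" branch in load_model_and_processor(...).
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

def load_my_new_vlm(model_path, torch_dtype=torch.bfloat16):
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForVision2Seq.from_pretrained(
        model_path,
        torch_dtype=torch_dtype,
        trust_remote_code=True,
    ).eval()
    return model, processor
```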
|
|
#### Rule of thumb

- If you want to change task behavior or prompting, edit `data/convsersation.py`.
- If you want to support a new model family, edit `tools/vlm_backend.py`.
- If you want to add a new stage, add a new script under `tools/`.
|
|
### F. Annotation format
A list of dicts, each containing the following keys:
```
{
    'file_name': 'HICO_train2015_00009511.jpg',
    'image_id': 0,
    'keypoints': a 51-element list (17x3 keypoints with x, y, v),
    'vis': a 51-element list (17 keypoints, each with 3 visibility flags),
    'instance_id': 0,
    'action_labels': [{'human_part': part_id, 'partstate': state_id}, ...],
    'height': 640,
    'width': 480,
    'human_bbox': [126, 258, 150, 305],
    'object_bbox': [128, 276, 144, 313],
    'description': "The person is riding a bicycle, supported by visible evidence of their body interacting with the bike.\n\n- The right hand is holding the right handlebar.\n- The left hand is holding the left handlebar.\n- The right hip is positioned over the seat, indicating the person is sitting on the bicycle.\n- The right foot is on the right pedal.\n- The left foot is on the left pedal."
}
```
|
|
After refinement and examination, extra fields may appear in the JSON:
```
{
    'refined_description': "A refined 2-3 sentence version aligned with the target HOI label.",
    'examiner_result': "Verdict: PASS or FAIL ..."
}
```
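As a usage sketch, the merged output can be loaded with the standard json module. The snippet below unpacks the flat keypoint list into (x, y, v) triplets and counts entries whose examiner verdict starts with PASS (the exact verdict string is an assumption based on the field format above):
```
import json

with open("outputs/merged_examine.json") as f:
    annotations = json.load(f)

def unpack_keypoints(flat):
    # 51 values -> 17 keypoints, each as an (x, y, v) triplet.
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

passed = [a for a in annotations
          if a.get("examiner_result", "").startswith("Verdict: PASS")]
print(f"{len(passed)} / {len(annotations)} instances passed the examiner")

first = annotations[0]
print(first["file_name"], unpack_keypoints(first["keypoints"])[:3])
```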
|
|
|
|
## Annotate COCO
1. Download the COCO dataset.
2. Organize the dataset. Your directory tree should look like this (the files inside Configs are copied from HICO-Det):
```
{DATA_ROOT}
|-- annotations
|   |--person_keypoints_train2017.json
|   `--person_keypoints_val2017.json
|-- Configs
|   |--hico_hoi_list.txt
|   `--Part_State_76.txt
|-- train2017
|   |--000000000009.jpg
|   |--000000000025.jpg
|   ...
`-- val2017
    |--000000000139.jpg
    |--000000000285.jpg
    ...
```
|
|
### Start annotation
#### Modify `data_path`, `model_path`, and `output_dir` in "{ROOT}/scripts/annotate_coco.sh" according to your configuration.
```
IDX={YOUR_GPU_IDS}
export PYTHONPATH=$PYTHONPATH:./

data_path={DATA_ROOT}
model_path={ROOT}/model_weights/{YOUR_MODEL_NAME}
output_dir={ROOT}/outputs

if [ -d ${output_dir} ]; then
    echo "dir already exists"
else
    mkdir ${output_dir}
fi

CUDA_VISIBLE_DEVICES=$IDX OMP_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node={NUM_YOUR_GPUs} --master_port=25005 \
    tools/annotate_coco.py \
    --model-path ${model_path} \
    --data-path ${data_path} \
    --output-dir ${output_dir}
```
#### Start auto-annotation
```
bash scripts/annotate_coco.sh
```
By default, the annotation script only annotates the COCO train2017 set. To annotate val2017, find the following two lines (Line 167 and Line 168) in tools/annotate_coco.py and replace 'train2017' with 'val2017':
```
dataset = PoseCOCODataset(
    data_path=os.path.join(args.data_path, 'annotations', 'person_keypoints_train2017.json'), # <- Line 167
    multimodal_cfg=dict(image_folder=os.path.join(args.data_path, 'train2017'), # <- Line 168
        data_augmentation=False,
        image_size=336,),)
```
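For example, after that replacement the block would read (only the two strings change):
```
dataset = PoseCOCODataset(
    data_path=os.path.join(args.data_path, 'annotations', 'person_keypoints_val2017.json'), # <- Line 167
    multimodal_cfg=dict(image_folder=os.path.join(args.data_path, 'val2017'), # <- Line 168
        data_augmentation=False,
        image_size=336,),)
```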
|
|
## Annotation format
A list of dicts, each containing the following keys:
```
{
    'file_name': '000000000009.jpg',
    'image_id': 9,
    'keypoints': a 51-element list (17x3 keypoints with x, y, v),
    'vis': a 51-element list (17 keypoints, each with 3 visibility flags),
    'height': 640,
    'width': 480,
    'human_bbox': [126, 258, 150, 305],
    'description': "The person is riding a bicycle, supported by visible evidence of their body interacting with the bike.\n\n- The right hand is holding the right handlebar.\n- The left hand is holding the left handlebar.\n- The right hip is positioned over the seat, indicating the person is sitting on the bicycle.\n- The right foot is on the right pedal.\n- The left foot is on the left pedal."
}
```
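As a usage sketch (the output file name below is an assumption; point it at whichever per-rank or merged COCO output you produced), the annotations can be inspected directly, for example to see how many of the 17 keypoints are labeled per instance:
```
import json
from collections import Counter

# Assumed path; adjust to your actual COCO output file.
with open("outputs/merged_coco_labels.json") as f:
    annotations = json.load(f)

visible_counts = Counter()
for ann in annotations:
    kpts = ann["keypoints"]  # 17 (x, y, v) triplets flattened into 51 values
    visible = sum(1 for i in range(2, len(kpts), 3) if kpts[i] > 0)
    visible_counts[visible] += 1

for n_visible, n_instances in sorted(visible_counts.items()):
    print(f"{n_instances} instances with {n_visible} labeled keypoints")
```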
|
|