# LLM auto annotation for the HICO-DET dataset (Pose from [Halpe](https://github.com/Fang-Haoshu/Halpe-FullBody), Part State from [HAKE](https://github.com/DirtyHarryLYL/HAKE))

## Environment

The code is developed with Python 3.11.11 on Ubuntu 21.xx, with torch==2.6.0+cu124 and transformers==4.57.3 (supporting the Qwen3 series).

## Annotating HICO-DET

### A. Installation

1. Install the required packages and dependencies.
2. Clone this repo; we'll refer to the cloned directory as ${ROOT}.
3. Create the necessary directories:
   ```
   mkdir outputs
   mkdir model_weights
   ```
4. Download the LLM weights from Hugging Face into `model_weights`.

### B. Prepare Dataset

5. Install the COCO API:
   ```
   pip install pycocotools
   ```
6. Download the [dataset](https://huggingface.co/datasets/ayh015/HICO-Det_Halpe_HAKE).
7. Organize the dataset so that your directory tree looks like this (there may be extra files):
   ```
   {DATA_ROOT}
   |-- Annotation
   |   |-- hico-det-instance-level
   |   |   `-- hico-det-training-set-instance-level.json
   |   `-- hico-fullbody-pose
   |       |-- halpe_train_v1.json
   |       `-- halpe_val_v1.json
   |-- Configs
   |   |-- hico_hoi_list.txt
   |   `-- Part_State_76.txt
   |-- Images
   |   `-- images
   |       |-- test2015
   |       |   |-- HICO_test2015_00000001.jpg
   |       |   |-- HICO_test2015_00000002.jpg
   |       |   ...
   |       `-- train2015
   |           |-- HICO_train2015_00000001.jpg
   |           |-- HICO_train2015_00000002.jpg
   |           ...
   `-- Logic_Rules
       |-- gather_rule.pkl
       `-- read_rules.py
   ```

### C. Start annotation

#### Modify `data_path`, `model_path`, and `output_dir` in `{ROOT}/scripts/annotate_hico.sh` to match your configuration

```
IDX={YOUR_GPU_IDS}
export PYTHONPATH=$PYTHONPATH:./

data_path={DATA_ROOT}
model_path={ROOT}/model_weights/{YOUR_MODEL_NAME}
output_dir={ROOT}/outputs

if [ -d ${output_dir} ]; then
    echo "dir already exists"
else
    mkdir ${output_dir}
fi

CUDA_VISIBLE_DEVICES=$IDX OMP_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node={NUM_YOUR_GPUs} --master_port=25005 \
    tools/annotate_hico.py \
    --model-path ${model_path} \
    --data-path ${data_path} \
    --output-dir ${output_dir}
```

#### Start auto-annotation

```
bash scripts/annotate_hico.sh
```

### D. Multi-stage HICO pipeline

The repository supports a 3-stage HICO workflow:

1. Long description generation
2. Description refinement
3. Description examination / checking

Each stage first writes per-rank JSON files, then merges them into a single JSON file for the next stage.

#### Stage 1. Generate long descriptions

This is the original HICO annotation stage. It uses `Conversation` in `data/convsersation.py`. Run:

```
bash scripts/annotate_hico.sh
```

This creates per-rank files such as:

```
outputs/labels_0.json
outputs/labels_1.json
```

Merge them with:

```
python3 tools/merge_json_outputs.py \
    --input-dir outputs \
    --pattern "labels_*.json" \
    --output-path outputs/merged_labels.json
```

#### Stage 2. Refine generated descriptions

This stage reads the merged JSON from Stage 1 and adds a `refined_description` field. It uses `Conversation_For_Clean_Descrption` in `data/convsersation.py`.

Modify `data_path`, `model_path`, `annotation_path`, and `output_dir` in `scripts/refine_hico.sh`, then run:

```
bash scripts/refine_hico.sh
```

This creates files such as:

```
outputs/refine/refine_labels_0.json
```

Merge them with:

```
python3 tools/merge_json_outputs.py \
    --input-dir outputs/refine \
    --pattern "refine_labels_*.json" \
    --output-path outputs/merged_refine.json
```

#### Stage 3. Examine / check generated descriptions

This stage reads the merged JSON from Stage 2 and adds an `examiner_result` field.
It uses `Conversation_examiner` in `data/convsersation.py`.

Modify `data_path`, `model_path`, `annotation_path`, and `output_dir` in `scripts/examine_hico.sh`, then run:

```
bash scripts/examine_hico.sh
```

This creates files such as:

```
outputs/examiner/examiner_labels_0.json
```

Merge them with:

```
python3 tools/merge_json_outputs.py \
    --input-dir outputs/examiner \
    --pattern "examiner_labels_*.json" \
    --output-path outputs/merged_examine.json
```

#### One-shot pipeline

To run all 3 stages end-to-end, use:

```
bash scripts/pipeline_hico.sh
```

Before running it, edit the following variables in `scripts/pipeline_hico.sh`:

- `DATA_PATH`
- `LONG_MODEL_PATH`
- `REFINE_MODEL_PATH`
- `EXAMINE_MODEL_PATH`
- `LONG_GPU_IDS`
- `REFINE_GPU_IDS`
- `EXAMINE_GPU_IDS`
- `LONG_NPROC`
- `REFINE_NPROC`
- `EXAMINE_NPROC`

The pipeline produces:

- `outputs/pipeline/merged_long.json`
- `outputs/pipeline/merged_refine.json`
- `outputs/pipeline/merged_examine.json`

### E. Using different VLM backends

The HICO scripts are no longer hardcoded to Qwen. Model loading is centralized in `tools/vlm_backend.py`, so you can use different VLM families for long-description generation, refinement, and examination.
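For intuition, backend auto-detection can be thought of as keyword-based dispatch over the model path. The sketch below is illustrative only, under the assumption of simple name matching; the actual `infer_model_backend(...)` in `tools/vlm_backend.py` may instead inspect the model's config:

```python
# Illustrative sketch only: the real infer_model_backend(...) in
# tools/vlm_backend.py may inspect the model config rather than the path.
def infer_model_backend(model_path: str) -> str:
    """Guess a backend name from the model path; fall back to 'auto'."""
    name = model_path.lower()
    # Order matters: check the more specific names first.
    rules = [
        ("qwen3_vl_moe", ["qwen3-vl", "moe"]),
        ("qwen3_vl", ["qwen3-vl"]),
        ("llava", ["llava"]),
        ("deepseek_vl", ["deepseek-vl"]),
    ]
    for backend, keywords in rules:
        if all(k in name for k in keywords):
            return backend
    return "auto"  # let the Hugging Face Auto* classes decide
```

The `--model-backend` flag overrides any such detection, which is why forcing a backend explicitly (see below) is always available as an escape hatch.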
The following scripts support backend selection:

- `tools/annotate_hico.py`
- `tools/refine_hico.py`
- `tools/examine_hico.py`
- `tools/clean_initial_annotation.py`

Each of them accepts:

- `--model-path`
- `--model-backend`
- `--torch-dtype`

Example:

```
torchrun --nnodes=1 --nproc_per_node=1 tools/annotate_hico.py \
    --model-path /path/to/model \
    --model-backend auto \
    --torch-dtype bfloat16 \
    --data-path ../datasets/HICO-Det \
    --output-dir outputs/test \
    --max-samples 5
```

You may also force a backend explicitly, for example:

```
--model-backend qwen3_vl
--model-backend qwen3_vl_moe
--model-backend llava
--model-backend deepseek_vl
--model-backend hf_vision2seq
--model-backend hf_causal_vlm
```

#### Where to customize for a new model

To adapt the repository to a new model family, the main file to edit is:

- `tools/vlm_backend.py`

This file controls:

- backend detection: `infer_model_backend(...)`
- model/processor loading: `load_model_and_processor(...)`
- prompt/image packaging: `build_batch_tensors(...)`
- output decoding: `decode_generated_text(...)`

In most cases, you do not need to change the HICO task scripts themselves.

#### How to add a new model backend

There are three common situations.

1. The model already works with Hugging Face `AutoProcessor` and `AutoModelForVision2Seq` or `AutoModelForCausalLM`. In that case, you may only need to run with:

   ```
   --model-backend auto
   ```

   or explicitly:

   ```
   --model-backend hf_vision2seq
   ```

   or:

   ```
   --model-backend hf_causal_vlm
   ```

2. The model needs custom backend detection. Add a rule inside `infer_model_backend(...)` in `tools/vlm_backend.py`.

3. The model needs a custom class or a custom multimodal input format. Add a new branch inside:

   - `load_model_and_processor(...)`
   - `build_batch_tensors(...)`
   - `decode_generated_text(...)`, if needed

#### Rule of thumb

- If you want to change task behavior or prompting, edit `data/convsersation.py`.
- If you want to support a new model family, edit `tools/vlm_backend.py`.
- If you want to add a new stage, add a new script under `tools/`.

### F. Annotation format

A list of dicts, each containing the following keys:

```
{
  'file_name': 'HICO_train2015_00009511.jpg',
  'image_id': 0,
  'keypoints': a 51-element list (17x3 keypoints with x, y, v),
  'vis': a 51-element list (17 keypoints, each with 3 visibility flags),
  'instance_id': 0,
  'action_labels': [{'human_part': part_id, 'partstate': state_id}, ...],
  'height': 640,
  'width': 480,
  'human_bbox': [126, 258, 150, 305],
  'object_bbox': [128, 276, 144, 313],
  'description': "The person is riding a bicycle, supported by visible evidence of their body interacting with the bike.\n\n- The right hand is holding the right handlebar.\n- The left hand is holding the left handlebar.\n- The right hip is positioned over the seat, indicating the person is sitting on the bicycle.\n- The right foot is on the right pedal.\n- The left foot is on the left pedal."
}
```

After refinement and examination, extra fields may appear in the JSON:

```
{
  'refined_description': "A refined 2-3 sentence version aligned with the target HOI label.",
  'examiner_result': "Verdict: PASS or FAIL ..."
}
```

## Annotating COCO

1. Download the COCO dataset.
2. Organize the dataset so that your directory tree looks like this (the files inside `Configs` are copied from HICO-Det):

   ```
   {DATA_ROOT}
   |-- annotations
   |   |-- person_keypoints_train2017.json
   |   `-- person_keypoints_val2017.json
   |-- Configs
   |   |-- hico_hoi_list.txt
   |   `-- Part_State_76.txt
   |-- train2017
   |   |-- 000000000009.jpg
   |   |-- 000000000025.jpg
   |   ...
   `-- val2017
       |-- 000000000139.jpg
       |-- 000000000285.jpg
       ...
   ```

### Start annotation

#### Modify `data_path`, `model_path`, and `output_dir` in `{ROOT}/scripts/annotate_coco.sh` to match your configuration

```
IDX={YOUR_GPU_IDS}
export PYTHONPATH=$PYTHONPATH:./

data_path={DATA_ROOT}
model_path={ROOT}/model_weights/{YOUR_MODEL_NAME}
output_dir={ROOT}/outputs

if [ -d ${output_dir} ]; then
    echo "dir already exists"
else
    mkdir ${output_dir}
fi

CUDA_VISIBLE_DEVICES=$IDX OMP_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node={NUM_YOUR_GPUs} --master_port=25005 \
    tools/annotate_coco.py \
    --model-path ${model_path} \
    --data-path ${data_path} \
    --output-dir ${output_dir}
```

#### Start auto-annotation

```
bash scripts/annotate_coco.sh
```

By default, the annotation script only annotates the COCO train2017 set. To annotate val2017, find the following two lines (lines 167-168) in `tools/annotate_coco.py` and replace 'train2017' with 'val2017':

```
dataset = PoseCOCODataset(
    data_path=os.path.join(args.data_path, 'annotations', 'person_keypoints_train2017.json'),  # <- Line 167
    multimodal_cfg=dict(image_folder=os.path.join(args.data_path, 'train2017'),  # <- Line 168
                        data_augmentation=False,
                        image_size=336,),)
```

### Annotation format

A list of dicts, each containing the following keys:

```
{
  'file_name': '000000000009.jpg',
  'image_id': 9,
  'keypoints': a 51-element list (17x3 keypoints with x, y, v),
  'vis': a 51-element list (17 keypoints, each with 3 visibility flags),
  'height': 640,
  'width': 480,
  'human_bbox': [126, 258, 150, 305],
  'description': "The person is riding a bicycle, supported by visible evidence of their body interacting with the bike.\n\n- The right hand is holding the right handlebar.\n- The left hand is holding the left handlebar.\n- The right hip is positioned over the seat, indicating the person is sitting on the bicycle.\n- The right foot is on the right pedal.\n- The left foot is on the left pedal."
}
```
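To sanity-check a merged output file against the format above, the annotations can be loaded as plain JSON and the flat 51-element `keypoints` list reshaped into 17 `(x, y, v)` triples. A minimal sketch, assuming a merged file such as `outputs/merged_labels.json` (the path and helper names are illustrative, not part of the repository):

```python
import json

def load_annotations(path):
    """Load a merged annotation file: a JSON list of dicts."""
    with open(path) as f:
        return json.load(f)

def keypoint_triples(ann):
    """Reshape the flat 51-element 'keypoints' list into 17 (x, y, v) triples."""
    kps = ann["keypoints"]
    assert len(kps) == 51, "expected 17 keypoints x 3 values (x, y, v)"
    return [tuple(kps[i:i + 3]) for i in range(0, 51, 3)]

# Example: inspect the first annotation of a merged file.
# anns = load_annotations("outputs/merged_labels.json")
# print(anns[0]["file_name"], keypoint_triples(anns[0])[:2])
```

The same reshaping applies to the `vis` field, which is also a flat 51-element list grouped in threes per keypoint.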