INFO 03-09 13:52:56 [__init__.py:239] Automatically detected platform cuda.
2026-03-09:13:55:18 INFO [__main__:429] Passed `--trust_remote_code`, setting environment variable `HF_DATASETS_TRUST_REMOTE_CODE=true`
2026-03-09:13:55:18 INFO [__main__:446] Selected Tasks: ['arc_challenge', 'arc_easy', 'boolq', 'hellaswag', 'lambada', 'lambada_standard', 'piqa', 'social_iqa', 'wikitext', 'winogrande']
2026-03-09:13:55:19 INFO [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-09:13:55:19 INFO [evaluator:240] Initializing hf model, with arguments: {'pretrained': 'results/hf_ckpts/blockffn_12b_mul1001_withmean_d64_s128_lr654e4_b512/', 'dtype': 'bfloat16', 'trust_remote_code': True}
2026-03-09:13:55:19 WARNING [accelerate.utils.other:512] Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2026-03-09:13:55:19 INFO [models.huggingface:147] Using device 'cuda:0'
2026-03-09:13:55:19 INFO [models.huggingface:535] Model type cannot be determined. Using default model type 'causal'
2026-03-09:13:55:20 INFO [models.huggingface:414] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
2026-03-09:13:56:10 WARNING [api.task:846] [Task: boolq] metric acc is defined, but aggregation is not. using default aggregation=mean
2026-03-09:13:56:10 WARNING [api.task:858] [Task: boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True
/home/test1267/test-6/miniconda3/envs/lmeval/lib/python3.10/site-packages/datasets/load.py:1298: FutureWarning: The repository for social_i_qa contains custom code which must be executed to correctly load the dataset.
You can inspect the repository content at https://hf.co/datasets/social_i_qa
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
2026-03-09:13:58:06 WARNING [api.task:846] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2026-03-09:13:58:06 WARNING [api.task:858] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2026-03-09:13:58:06 WARNING [api.task:846] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2026-03-09:13:58:06 WARNING [api.task:858] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2026-03-09:13:58:06 WARNING [api.task:846] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
2026-03-09:13:58:06 WARNING [api.task:858] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
2026-03-09:13:59:52 INFO [api.task:434] Building contexts for winogrande on rank 0...
0%| | 0/1267 [00:00