diff --git "a/evaluation.log" "b/evaluation.log"
new file mode 100644
--- /dev/null
+++ "b/evaluation.log"
@@ -0,0 +1,87 @@
+INFO 03-30 20:48:58 [__init__.py:239] Automatically detected platform cuda.
+2026-03-30:20:49:09 INFO [__main__:429] Passed `--trust_remote_code`, setting environment variable `HF_DATASETS_TRUST_REMOTE_CODE=true`
+2026-03-30:20:49:09 INFO [__main__:446] Selected Tasks: ['arc_challenge', 'arc_easy', 'boolq', 'hellaswag', 'lambada', 'lambada_standard', 'piqa', 'social_iqa', 'wikitext', 'winogrande']
+2026-03-30:20:49:09 INFO [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
+2026-03-30:20:49:09 INFO [evaluator:240] Initializing hf model, with arguments: {'pretrained': 'results/hf_ckpts/blockffn_02b_mul1002_withmean_d64_s128_lr93e4_b128/', 'dtype':
+ 'bfloat16', 'trust_remote_code': True}
+2026-03-30:20:49:09 WARNING [accelerate.utils.other:512] Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
+2026-03-30:20:49:09 INFO [models.huggingface:147] Using device 'cuda:0'
+2026-03-30:20:49:09 INFO [models.huggingface:535] Model type cannot be determined. Using default model type 'causal'
+2026-03-30:20:49:10 INFO [models.huggingface:414] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
+2026-03-30:20:49:28 WARNING [api.task:846] [Task: boolq] metric acc is defined, but aggregation is not. using default aggregation=mean
+2026-03-30:20:49:28 WARNING [api.task:858] [Task: boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True
+/home/test1267/test-6/miniconda3/envs/lmeval/lib/python3.10/site-packages/datasets/load.py:1298: FutureWarning: The repository for social_i_qa contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/social_i_qa
+You can avoid this message in future by passing the argument `trust_remote_code=True`.
+Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
+ warnings.warn(
+2026-03-30:20:50:24 WARNING [api.task:846] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
+2026-03-30:20:50:24 WARNING [api.task:858] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
+2026-03-30:20:50:24 WARNING [api.task:846] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
+2026-03-30:20:50:24 WARNING [api.task:858] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
+2026-03-30:20:50:24 WARNING [api.task:846] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
+2026-03-30:20:50:24 WARNING [api.task:858] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
+2026-03-30:20:50:41 INFO [api.task:434] Building contexts for winogrande on rank 0...
+ 0%| | 0/1267 [00:00