diff --git "a/evaluation.log" "b/evaluation.log"
new file mode 100644
--- /dev/null
+++ "b/evaluation.log"
@@ -0,0 +1,86 @@
+INFO 01-25 01:47:25 [__init__.py:239] Automatically detected platform cuda.
+2026-01-25:01:49:22 INFO [__main__:429] Passed `--trust_remote_code`, setting environment variable `HF_DATASETS_TRUST_REMOTE_CODE=true`
+2026-01-25:01:49:22 INFO [__main__:446] Selected Tasks: ['arc_challenge', 'arc_easy', 'boolq', 'hellaswag', 'lambada', 'lambada_standard', 'piqa', 'social_iqa', 'wikitext', 'winogrande']
+2026-01-25:01:49:22 INFO [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
+2026-01-25:01:49:22 INFO [evaluator:240] Initializing hf model, with arguments: {'pretrained': 'results/hf_ckpts/blockffn_05b_mul1001_withmean_d64_s128_lr78e4_b256/', 'dtype':
+ 'bfloat16', 'trust_remote_code': True}
+2026-01-25:01:49:22 INFO [models.huggingface:147] Using device 'cuda:0'
+2026-01-25:01:49:23 INFO [models.huggingface:535] Model type cannot be determined. Using default model type 'causal'
+2026-01-25:01:49:24 INFO [models.huggingface:414] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
+2026-01-25:01:49:43 WARNING [api.task:846] [Task: boolq] metric acc is defined, but aggregation is not. using default aggregation=mean
+2026-01-25:01:49:43 WARNING [api.task:858] [Task: boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True
+/home/test1267/test-6/miniconda3/envs/lmeval/lib/python3.10/site-packages/datasets/load.py:1298: FutureWarning: The repository for social_i_qa contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/social_i_qa
+You can avoid this message in future by passing the argument `trust_remote_code=True`.
+Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
+  warnings.warn(
+2026-01-25:01:50:34 WARNING [api.task:846] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
+2026-01-25:01:50:34 WARNING [api.task:858] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
+2026-01-25:01:50:34 WARNING [api.task:846] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
+2026-01-25:01:50:34 WARNING [api.task:858] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
+2026-01-25:01:50:34 WARNING [api.task:846] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
+2026-01-25:01:50:34 WARNING [api.task:858] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
+2026-01-25:01:50:48 INFO [api.task:434] Building contexts for winogrande on rank 0...
+ 0%| | 0/1267 [00:00