Why is the tokenizer extremely slow?
from datasets import load_dataset

def formatting_prompts_func2(examples):
    outputs = []
    # print(examples)
    for line in examples["messages"]:
        text = line[0]["content"]
        outputs.append(text + tokenizer.pad_token)
    return {"text": outputs}

dataset = load_dataset("data", data_files="train_cls_2.json", split="train")
dataset = dataset.map(formatting_prompts_func2, batched=True)
dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)
dataset = dataset.remove_columns(["messages"])

evalset = load_dataset("data", data_files="test_cls_2.json", split="train")
evalset = evalset.map(formatting_prompts_func2, batched=True)
evalset = evalset.map(lambda x: tokenizer(x["text"]), batched=True)
evalset = evalset.remove_columns(["messages"])
The code is above and the timings are below. Am I using it the wrong way? This is far too slow: with the same code but different models, the tokenizer only took a few seconds before, yet here it takes about 30 minutes.
Repo card metadata block was not found. Setting CardData to empty.
Map: 100%|██████████| 30000/30000 [00:00<00:00, 34348.20 examples/s]
Map: 0%| | 0/30000 [00:00<?, ? examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (627 > 512). Running this sequence through the model will result in indexing errors
Map: 3%|▎ | 1000/30000 [01:16<23:13, 20.81 examples/s]IOStream.flush timed out
Map: 27%|██▋ | 8000/30000 [06:31<16:16, 22.53 examples/s]IOStream.flush timed out
Map: 33%|███▎ | 10000/30000 [08:02<14:58, 22.27 examples/s]IOStream.flush timed out
Map: 47%|████▋ | 14000/30000 [10:51<12:15, 21.75 examples/s]
When using the tokenizer only for preprocessing, it is recommended to call tokenizer.tokenize(text) directly to get token strings; if you then need ids, use tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)).
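The suggestion above can be sketched as a batched `map` function. `tokenizer` in the question is an already-loaded Hugging Face tokenizer; it is replaced below with a toy whitespace tokenizer (an assumption, purely for self-containment) that exposes the same two methods, so the snippet runs on its own.

```python
class ToyTokenizer:
    """Stand-in for the real tokenizer; only the two methods the
    advice above relies on are modeled (tokenize / convert_tokens_to_ids)."""
    vocab = {"hello": 0, "world": 1, "[UNK]": 2}

    def tokenize(self, text):
        # Real tokenizers split into subword tokens; a whitespace
        # split is enough to illustrate the call pattern.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.vocab["[UNK]"]) for t in tokens]

tokenizer = ToyTokenizer()

def tokenize_only(batch):
    # tokenize() returns plain token strings, skipping the padding,
    # truncation, and attention-mask bookkeeping that tokenizer(...)
    # performs; convert_tokens_to_ids() then maps them to ids.
    token_lists = [tokenizer.tokenize(t) for t in batch["text"]]
    return {"input_ids": [tokenizer.convert_tokens_to_ids(ts) for ts in token_lists]}

print(tokenize_only({"text": ["hello world"]}))  # {'input_ids': [[0, 1]]}
```

With a real tokenizer, this function would replace the `lambda x: tokenizer(x["text"])` step in the question's `dataset.map(...)` calls; whether it is actually faster depends on the tokenizer backend, so it is worth timing both variants on a small slice first.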