Why is the tokenizer extremely slow?
from datasets import load_dataset

def formatting_prompts_func2(examples):
    outputs = []
    # print(examples)
    for line in examples["messages"]:
        text = line[0]["content"]
        outputs.append(text + tokenizer.pad_token)
    return {"text": outputs}

dataset = load_dataset("data", data_files="train_cls_2.json", split="train")
dataset = dataset.map(formatting_prompts_func2, batched=True)
dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)
dataset = dataset.remove_columns(["messages"])

evalset = load_dataset("data", data_files="test_cls_2.json", split="train")
evalset = evalset.map(formatting_prompts_func2, batched=True)
evalset = evalset.map(lambda x: tokenizer(x["text"]), batched=True)
evalset = evalset.remove_columns(["messages"])
The code is above and the timings are below. Am I using it the wrong way? This is far too slow: with the same code but different models, the tokenizer only took a few seconds before, yet here it takes about 30 minutes.
Repo card metadata block was not found. Setting CardData to empty.
Map: 100%|██████████| 30000/30000 [00:00<00:00, 34348.20 examples/s]
Map: 0%| | 0/30000 [00:00<?, ? examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (627 > 512). Running this sequence through the model will result in indexing errors
Map: 3%|▎ | 1000/30000 [01:16<23:13, 20.81 examples/s]IOStream.flush timed out
Map: 27%|██▋ | 8000/30000 [06:31<16:16, 22.53 examples/s]IOStream.flush timed out
Map: 33%|███▎ | 10000/30000 [08:02<14:58, 22.27 examples/s]IOStream.flush timed out
Map: 47%|████▋ | 14000/30000 [10:51<12:15, 21.75 examples/s]
When using the tokenizer only for preprocessing, it is recommended to call tokenizer.tokenize(text) directly to get token strings; if you then need ids, use tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)).
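The suggestion above can be sketched as a batched `map` function. `tokenizer` in the question is an already-loaded Hugging Face tokenizer; it is replaced below with a toy whitespace tokenizer (an assumption, purely for self-containment) that exposes the same two methods, so the snippet runs on its own.

```python
class ToyTokenizer:
    """Stand-in for the real tokenizer; only the two methods the
    advice above relies on are modeled (tokenize / convert_tokens_to_ids)."""
    vocab = {"hello": 0, "world": 1, "[UNK]": 2}

    def tokenize(self, text):
        # Real tokenizers split into subword tokens; a whitespace
        # split is enough to illustrate the call pattern.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.vocab["[UNK]"]) for t in tokens]

tokenizer = ToyTokenizer()

def tokenize_only(batch):
    # tokenize() returns plain token strings, skipping the padding,
    # truncation, and attention-mask bookkeeping that tokenizer(...)
    # performs; convert_tokens_to_ids() then maps them to ids.
    token_lists = [tokenizer.tokenize(t) for t in batch["text"]]
    return {"input_ids": [tokenizer.convert_tokens_to_ids(ts) for ts in token_lists]}

print(tokenize_only({"text": ["hello world"]}))  # {'input_ids': [[0, 1]]}
```

With a real tokenizer, this function would replace the `lambda x: tokenizer(x["text"])` step in the question's `dataset.map(...)` calls; whether it is actually faster depends on the tokenizer backend, so it is worth timing both variants on a small slice first.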