Inquiry regarding performance alignment for Florence-2 on COCO dataset
#34
by hongsik91 - opened
Dear Team Florence-2,
I am a graduate student currently using Florence-2 as a backbone for my vision-language research. I am writing to seek your guidance regarding a performance discrepancy I encountered while reproducing the COCO captioning results.
According to the paper, the zero-shot CIDEr for Florence-2-base is 133.0. However, my local evaluation on the Karpathy test split yields 103.48, and even the fine-tuned version (Florence-2-base-ft) only reaches 111.45.
I have attached my evaluation script (eval_florence2_baseline_hf_datasets.py) for your reference. To summarize my setup:
from typing import Dict, List, Set

import torch
from PIL import Image
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoProcessor

# (Dataset loading and the image_path / cider_bleu helpers are omitted here for brevity.)

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", torch_dtype=torch.float16, trust_remote_code=True
).to(dev).eval()
proc = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

task = "<CAPTION>"
preds: List[Dict] = []
done: Set[int] = set()

for s in tqdm(range(0, len(ids), args.batch_size), desc=task):
    bid = ids[s : s + args.batch_size]
    imgs, wh = [], []
    for i in bid:
        inf = meta[i]
        ip = image_path(coco, i, inf, train_ids)
        imgs.append(Image.open(ip).convert("RGB"))
        wh.append((int(inf["width"]), int(inf["height"])))

    batch = proc(text=[task] * len(imgs), images=imgs, return_tensors="pt", padding=True).to(dev)
    batch["pixel_values"] = batch["pixel_values"].to(torch.float16)

    with torch.no_grad():
        out = model.generate(
            input_ids=batch["input_ids"],
            pixel_values=batch["pixel_values"],
            attention_mask=batch.get("attention_mask"),
            max_new_tokens=1024,
            num_beams=3,
            do_sample=False,
            early_stopping=True,
        )

    texts = proc.tokenizer.batch_decode(out, skip_special_tokens=True)
    for i, (w, h), raw in zip(bid, wh, texts):
        cap = proc.post_process_generation(raw, task=task, image_size=(w, h)).get(task, "").strip()
        cap = cap.replace("<s>", "").replace("</s>", "").replace("<pad>", "").strip()
        if i not in done:  # COCO evaluation expects one caption per image_id
            preds.append({"image_id": i, "caption": cap})
            done.add(i)

cider, bleu = cider_bleu(preds, gts)
print(f"CIDEr: {cider:.2f}")
print(f"BLEU: {bleu[0]:.4f} / {bleu[1]:.4f} / {bleu[2]:.4f} / {bleu[3]:.4f}")
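For context, my `cider_bleu` helper (not shown above) first regroups the flat prediction and reference records into the `{image_id: [caption, ...]}` dicts that pycocoevalcap-style scorers expect before computing the metrics. A minimal sketch of that regrouping step (the helper name `to_eval_format` is illustrative, not from the official evaluation):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def to_eval_format(
    preds: List[Dict], refs: List[Dict]
) -> Tuple[Dict[int, List[str]], Dict[int, List[str]]]:
    """Regroup flat caption records into {image_id: [captions]} dicts
    (the shape used by CIDEr/BLEU scorers in pycocoevalcap)."""
    # One predicted caption per image.
    res = {p["image_id"]: [p["caption"]] for p in preds}
    # Multiple reference captions per image.
    gts: Dict[int, List[str]] = defaultdict(list)
    for r in refs:
        gts[r["image_id"]].append(r["caption"])
    # Score only images that have both a prediction and references.
    return {i: gts[i] for i in res}, res

gts, res = to_eval_format(
    [{"image_id": 1, "caption": "a cat on a mat"}],
    [{"image_id": 1, "caption": "a cat sits on a mat"},
     {"image_id": 1, "caption": "a cat resting on a rug"}],
)
```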
- Task prompt: <CAPTION>
- Environment: transformers==4.46.3, torch.float16, latest model revision
- Generation config: num_beams=3, do_sample=False, early_stopping=True, max_new_tokens=1024
- Post-processing: processor.post_process_generation followed by manual cleaning of special tokens (e.g., <s>, </s>)
Despite following the standard evaluation pipeline, a significant gap (~30 CIDEr points) remains relative to the reported baseline. Could you kindly share the specific generation configuration (e.g., beam size, length penalty) or any data preprocessing/prompting details used for the official benchmark?
Thank you for your time and for sharing this impressive model with the community. I look forward to your insights.
Best regards,