Inquiry regarding performance alignment for Florence-2 on COCO dataset

#34
by hongsik91 - opened

Dear Team Florence-2,

I am a graduate student currently using Florence-2 as a backbone for my vision-language research. I am writing to seek your guidance regarding a performance discrepancy I encountered while reproducing the COCO captioning results.

According to the paper, the zero-shot CIDEr for Florence-2-base is 133.0. However, my local evaluation on the Karpathy test split yields 103.48, and even the fine-tuned version (Florence-2-base-ft) only reaches 111.45.

I have attached my evaluation script (eval_florence2_baseline_hf_datasets.py) for your reference. To summarize my setup:

    import torch
    from typing import Dict, List, Set
    from PIL import Image
    from tqdm import tqdm
    from transformers import AutoModelForCausalLM, AutoProcessor

    dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-base", torch_dtype=torch.float16, trust_remote_code=True
    ).to(dev).eval()
    proc = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

    task = "<CAPTION>"
    preds: List[Dict] = []
    done: Set[int] = set()

    for s in tqdm(range(0, len(ids), args.batch_size), desc=task):
        bid = ids[s : s + args.batch_size]
        imgs, wh = [], []
        for i in bid:
            inf = meta[i]
            ip = image_path(coco, i, inf, train_ids)
            imgs.append(Image.open(ip).convert("RGB"))
            wh.append((int(inf["width"]), int(inf["height"])))

        batch = proc(text=[task] * len(imgs), images=imgs, return_tensors="pt", padding=True).to(dev)
        batch["pixel_values"] = batch["pixel_values"].to(torch.float16)
        with torch.no_grad():
            out = model.generate(
                input_ids=batch["input_ids"],
                pixel_values=batch["pixel_values"],
                attention_mask=batch.get("attention_mask"),
                max_new_tokens=1024,
                num_beams=3,
                do_sample=False,
                early_stopping=True
            )
        texts = proc.tokenizer.batch_decode(out, skip_special_tokens=True)
        for i, (w, h), raw in zip(bid, wh, texts):
            cap = proc.post_process_generation(raw, task=task, image_size=(w, h)).get(task, "").strip()
            cap = cap.replace("<s>", "").replace("</s>", "").replace("<pad>", "").strip()
            if i not in done:
                preds.append({"image_id": i, "caption": cap})
                done.add(i)

    cider, bleu = cider_bleu(preds, gts)
    print(f"CIDEr: {cider:.2f}")
    print(f"BLEU:  {bleu[0]:.4f} / {bleu[1]:.4f} / {bleu[2]:.4f} / {bleu[3]:.4f}")
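Before scoring, I also run a small sanity check (a hypothetical helper, not shown in the attached script): CIDEr is distorted if any Karpathy test image is missing, duplicated, or ends up with an empty caption, so I verify coverage first:

```python
from typing import Dict, List, Set

def check_coverage(preds: List[Dict], expected_ids: Set[int]) -> None:
    # every image should be scored exactly once
    pred_ids = [p["image_id"] for p in preds]
    assert len(pred_ids) == len(set(pred_ids)), "duplicate image_id in predictions"
    missing = expected_ids - set(pred_ids)
    assert not missing, f"{len(missing)} images have no prediction"
    # no empty captions slipping through post-processing
    assert all(p["caption"] for p in preds), "empty caption found"
```

This rules out silent dataset-coverage issues as the cause of the gap.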

Task Prompt: <CAPTION>
Environment: transformers==4.46.3, torch.float16, latest model revision.
Generation Config: num_beams=3, do_sample=False, early_stopping=True, max_new_tokens=1024.
Post-processing: I am using processor.post_process_generation followed by manual cleaning of special tokens (e.g., <s>, </s>).
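For reference, the manual cleaning step is just the chained `.replace` calls in the script; written as a single regex it is equivalent to:

```python
import re

# strip any special tokens left in the decoded string (<s>, </s>, <pad>)
_SPECIALS = re.compile(r"</?s>|<pad>")

def clean_caption(raw: str) -> str:
    return _SPECIALS.sub("", raw).strip()
```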

Despite following the standard evaluation pipeline, there remains a significant gap (~30 CIDEr points) from the reported baseline. Could you kindly share the specific generation configurations (e.g., beam size, length penalty) or any data preprocessing/prompting details used for the official benchmark?
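For concreteness, these are the generation knobs I would sweep on my side if you can confirm the official values; every value here other than my current `num_beams=3` setup is a guess, not taken from the paper:

```python
# candidate generation configs to sweep (only the first entry reflects
# my current setup; the rest are guesses, not official benchmark settings)
candidate_configs = [
    dict(num_beams=3, length_penalty=1.0, max_new_tokens=1024),  # current setup
    dict(num_beams=5, length_penalty=1.0, max_new_tokens=1024),
    dict(num_beams=3, length_penalty=0.8, max_new_tokens=50),
    dict(num_beams=5, length_penalty=1.2, max_new_tokens=50),
]
```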

Thank you for your time and for sharing this impressive model with the community. I look forward to your insights.

Best regards,
