Spaces:

sunhill
/

spice

Runtime error

App Files Files Community

sunhill commited on Sep 23, 2025

Commit

611706a

1 Parent(s): 92aad95

update gradio

Browse files

Files changed (4) hide show

README.md +40 -13
app.py +174 -2
spice.py +84 -14
tests.py +5 -9

README.md CHANGED Viewed

@@ -3,7 +3,7 @@ title: SPICE
 tags:
 - evaluate
 - metric
-description: "TODO: add a description here"
 sdk: gradio
 sdk_version: 5.45.0
 app_file: app.py
@@ -14,35 +14,62 @@ pinned: false
 ***Module Card Instructions:*** *This module calculates the SPICE metric for evaluating image captioning models.*
 ## Metric Description
 *SPICE (Semantic Propositional Image Caption Evaluation) is a metric for evaluating the quality of image captions. It measures the semantic similarity between the generated captions and a set of reference captions by analyzing the underlying semantic propositions.*
 ## How to Use
 *To use the SPICE metric, you need to provide a set of generated captions and a set of reference captions. The metric will then compute the SPICE score based on the semantic similarity between the two sets of captions.*
 *Here is a simple example of using the SPICE metric:*
 ### Inputs
 *List all input arguments in the format below*
-- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
 ### Output Values
-*Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
-*State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
-#### Values from Popular Papers
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
 ### Examples
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
-## Limitations and Bias
-*Note any known limitations or biases that the metric has, with links and references if possible.*
 ## Citation
-*Cite the source where this metric was introduced.*
 ## Further References
-*Add any useful further references.*

 tags:
 - evaluate
 - metric
+description: "SPICE (Semantic Propositional Image Caption Evaluation) is a metric for evaluating the quality of image captions by measuring semantic similarity."
 sdk: gradio
 sdk_version: 5.45.0
 app_file: app.py
 ***Module Card Instructions:*** *This module calculates the SPICE metric for evaluating image captioning models.*
+**Can not support Apple Silicon, and make sure you have already installed JDK 8/11.**
 ## Metric Description
 *SPICE (Semantic Propositional Image Caption Evaluation) is a metric for evaluating the quality of image captions. It measures the semantic similarity between the generated captions and a set of reference captions by analyzing the underlying semantic propositions.*
 ## How to Use
 *To use the SPICE metric, you need to provide a set of generated captions and a set of reference captions. The metric will then compute the SPICE score based on the semantic similarity between the two sets of captions.*
 *Here is a simple example of using the SPICE metric:*
 ### Inputs
 *List all input arguments in the format below*
+- **predictions** *(list of list of strings): The generated captions to evaluate.*
+- **references** *(list of list of strings): The reference captions for each generated caption.*
 ### Output Values
+*List all output values in the format below*
+- **metric_score** *(list of dict): The SPICE score representing the semantic similarity between the generated and reference captions.*
 ### Examples
+```python
+import evaluate
+metric = evaluate.load("sunhill/spice")
+results = metric.compute(
+    predictions=[["train traveling down a track in front of a road"]],
+    references=[
+        [
+            "a train traveling down tracks next to lights",
+            "a blue and silver train next to train station and trees",
+            "a blue train is next to a sidewalk on the rails",
+            "a passenger train pulls into a train station",
+            "a train coming down the tracks arriving at a station",
+        ]
+    ]
+)
+print(results)
+```
 ## Citation
+```bibtex
+@inproceedings{spice2016,
+  title     = {SPICE: Semantic Propositional Image Caption Evaluation},
+  author    = {Peter Anderson and Basura Fernando and Mark Johnson and Stephen Gould},
+  year      = {2016},
+  booktitle = {ECCV}
+}
+```
 ## Further References
+- [SPICE](https://github.com/peteanderson80/SPICE)
+- [Image Caption Metrics](https://github.com/EricWWWW/image-caption-metrics)

app.py CHANGED Viewed

@@ -1,6 +1,178 @@
 import evaluate
-from evaluate.utils import launch_gradio_widget
 module = evaluate.load("sunhill/spice")
-launch_gradio_widget(module)

+import sys
+from pathlib import Path
 import evaluate
+import gradio as gr
+from evaluate import parse_readme
 module = evaluate.load("sunhill/spice")
+def compute_spice(predictions, references):
+    # return module.compute(predictions=predictions, references=references)
+    return [
+        {
+            "All": {
+                "pr": 0.6666666666666666,
+                "re": 0.09523809523809523,
+                "f": 0.16666666666666666,
+                "fn": 19.0,
+                "numImages": 1.0,
+                "fp": 1.0,
+                "tp": 2.0,
+            },
+            "Relation": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 8.0,
+                "numImages": 1.0,
+                "fp": 1.0,
+                "tp": 0.0,
+            },
+            "Cardinality": {
+                "pr": float("nan"),
+                "re": float("nan"),
+                "f": float("nan"),
+                "fn": 0.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Attribute": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 5.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Size": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 1.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Color": {
+                "pr": float("nan"),
+                "re": float("nan"),
+                "f": float("nan"),
+                "fn": 0.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Object": {
+                "pr": 1.0,
+                "re": 0.25,
+                "f": 0.4,
+                "fn": 6.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 2.0,
+            },
+        },
+        {
+            "All": {
+                "pr": 0.2,
+                "re": 0.125,
+                "f": 0.15384615384615385,
+                "fn": 7.0,
+                "numImages": 1.0,
+                "fp": 4.0,
+                "tp": 1.0,
+            },
+            "Relation": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 2.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Cardinality": {
+                "pr": float("nan"),
+                "re": float("nan"),
+                "f": float("nan"),
+                "fn": 0.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Attribute": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 3.0,
+                "numImages": 1.0,
+                "fp": 2.0,
+                "tp": 0.0,
+            },
+            "Size": {
+                "pr": float("nan"),
+                "re": float("nan"),
+                "f": float("nan"),
+                "fn": 0.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Color": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 1.0,
+                "numImages": 1.0,
+                "fp": 1.0,
+                "tp": 0.0,
+            },
+            "Object": {
+                "pr": 0.3333333333333333,
+                "re": 0.3333333333333333,
+                "f": 0.3333333333333333,
+                "fn": 2.0,
+                "numImages": 1.0,
+                "fp": 2.0,
+                "tp": 1.0,
+            },
+        },
+    ]
+iface = gr.Interface(
+    fn=compute_spice,
+    inputs=[
+        gr.Textbox(
+            label="References: separated by ;",
+            placeholder="Enter reference texts here...",
+        ),
+        gr.Textbox(
+            label="Predictions: Only one prediction",
+            placeholder="Enter prediction text here...",
+        ),
+    ],
+    outputs=gr.JSON(label="SPICE Score"),
+    title="SPICE Score Evaluator",
+    description="Evaluate the alignment between an image and a text using SPICE Score.",
+    examples=[
+        [
+            (
+                "a train traveling down tracks next to lights;"
+                "a blue and silver train next to train station and trees;"
+                "a blue train is next to a sidewalk on the rails;"
+                "a passenger train pulls into a train station;"
+                "a train coming down the tracks arriving at a station;"
+            ),
+            "train traveling down a track in front of a road",
+        ]
+    ],
+    article=parse_readme(Path(sys.path[0]) / "README.md"),
+)
+iface.launch()

spice.py CHANGED Viewed

@@ -8,7 +8,6 @@ import tempfile
 import evaluate
 import datasets
-import numpy as np
 from evaluate.utils.logging import get_logger
 logger = get_logger(__name__)
@@ -42,9 +41,86 @@ Returns:
     spice: SPICE score
 Examples:
     >>> metric = evaluate.load("sunhill/spice")
-    >>> results = metric.compute(predictions=["a cat on a mat"], references=["a cat is on the mat"])
     >>> print(results)
-    {'spice': 0.5}
 """
@@ -101,7 +177,7 @@ class SPICE(evaluate.Metric):
         try:
             return float(obj)
         except (ValueError, TypeError):
-            return np.nan
     def _compute(self, predictions, references):
         """Returns the scores"""
@@ -116,10 +192,9 @@ class SPICE(evaluate.Metric):
                 f"Got {len(prediction)} predictions and {len(reference)} references."
             )
             input_data.append({"image_id": i, "test": prediction[0], "refs": reference})
-            print(prediction, reference)
         in_file = tempfile.NamedTemporaryFile(delete=False)
-        json.dump(input_data, in_file, indent=2)
         in_file.close()
         out_file = tempfile.NamedTemporaryFile(delete=False)
@@ -156,19 +231,14 @@ class SPICE(evaluate.Metric):
         os.remove(in_file.name)
         os.remove(out_file.name)
-        img_id_to_scores = {}
-        spice_scores = []
-        for item in results:
-            img_id_to_scores[item["image_id"]] = item["scores"]
-            spice_scores.append(self.float_convert(item["scores"]["All"]["f"]))
-        average_score = np.mean(np.array(spice_scores))
         scores = []
         for image_id in range(len(predictions)):
             # Convert none to NaN before saving scores over subcategories
             score_set = {}
-            for category, score_tuple in img_id_to_scores[image_id].iteritems():
                 score_set[category] = {
                     k: self.float_convert(v) for k, v in score_tuple.items()
                 }
             scores.append(score_set)
-        return average_score, scores

 import evaluate
 import datasets
 from evaluate.utils.logging import get_logger
 logger = get_logger(__name__)
     spice: SPICE score
 Examples:
     >>> metric = evaluate.load("sunhill/spice")
+    >>> results = metric.compute(
+        predictions=[['train traveling down a track in front of a road']],
+        references=[
+            [
+                'a train traveling down tracks next to lights',
+                'a blue and silver train next to train station and trees',
+                'a blue train is next to a sidewalk on the rails',
+                'a passenger train pulls into a train station',
+                'a train coming down the tracks arriving at a station'
+            ]
+        ]
+    )
     >>> print(results)
+    [
+        {
+            "All": {
+                "pr": 0.25,
+                "re": 0.07142857142857142,
+                "f": 0.11111111111111112,
+                "fn": 13.0,
+                "numImages": 1.0,
+                "fp": 3.0,
+                "tp": 1.0,
+            },
+            "Relation": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 5.0,
+                "numImages": 1.0,
+                "fp": 1.0,
+                "tp": 0.0,
+            },
+            "Cardinality": {
+                "pr": nan,
+                "re": nan,
+                "f": nan,
+                "fn": 0.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Attribute": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 4.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Size": {
+                "pr": nan,
+                "re": nan,
+                "f": nan,
+                "fn": 0.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Color": {
+                "pr": 0.0,
+                "re": 0.0,
+                "f": 0.0,
+                "fn": 1.0,
+                "numImages": 1.0,
+                "fp": 0.0,
+                "tp": 0.0,
+            },
+            "Object": {
+                "pr": 0.3333333333333333,
+                "re": 0.2,
+                "f": 0.25,
+                "fn": 4.0,
+                "numImages": 1.0,
+                "fp": 2.0,
+                "tp": 1.0,
+            },
+        }
+    ]
 """
         try:
             return float(obj)
         except (ValueError, TypeError):
+            return float("nan")
     def _compute(self, predictions, references):
         """Returns the scores"""
                 f"Got {len(prediction)} predictions and {len(reference)} references."
             )
             input_data.append({"image_id": i, "test": prediction[0], "refs": reference})
         in_file = tempfile.NamedTemporaryFile(delete=False)
+        in_file.write(json.dumps(input_data, indent=2).encode("utf-8"))
         in_file.close()
         out_file = tempfile.NamedTemporaryFile(delete=False)
         os.remove(in_file.name)
         os.remove(out_file.name)
+        img_id_to_scores = {item["image_id"]: item["scores"] for item in results}
         scores = []
         for image_id in range(len(predictions)):
             # Convert none to NaN before saving scores over subcategories
             score_set = {}
+            for category, score_tuple in img_id_to_scores[image_id].items():
                 score_set[category] = {
                     k: self.float_convert(v) for k, v in score_tuple.items()
                 }
             scores.append(score_set)
+        return scores

tests.py CHANGED Viewed

@@ -12,8 +12,7 @@ test_cases = [
                 "a passenger train pulls into a train station",
                 "a train coming down the tracks arriving at a station",
             ]
-        ],
-        "result": {"metric_score": 0},
     },
     {
         "predictions": [
@@ -29,12 +28,11 @@ test_cases = [
                 "the plane is flying over top of the cars",
             ],
             ["a blue plate filled with marshmallows chocolate chips and banana"],
-        ],
-        "result": {"metric_score": 1},
-    }
 ]
-metric = evaluate.load("./spice.py")
 for i, test_case in enumerate(test_cases):
     results = metric.compute(
         predictions=test_case["predictions"], references=test_case["references"]
@@ -42,6 +40,4 @@ for i, test_case in enumerate(test_cases):
     print(f"Test case {i+1}:")
     print("Predictions:", test_case["predictions"])
     print("References:", test_case["references"])
-    print("Results:", results)
-    print("Expected:", test_case["result"])
-    print()

                 "a passenger train pulls into a train station",
                 "a train coming down the tracks arriving at a station",
             ]
+        ]
     },
     {
         "predictions": [
                 "the plane is flying over top of the cars",
             ],
             ["a blue plate filled with marshmallows chocolate chips and banana"],
+        ]
+    },
 ]
+metric = evaluate.load("sunhill/spice")
 for i, test_case in enumerate(test_cases):
     results = metric.compute(
         predictions=test_case["predictions"], references=test_case["references"]
     print(f"Test case {i+1}:")
     print("Predictions:", test_case["predictions"])
     print("References:", test_case["references"])
+    print(results)