sunhill commited on
Commit
611706a
·
1 Parent(s): 92aad95

update gradio

Browse files
Files changed (4) hide show
  1. README.md +40 -13
  2. app.py +174 -2
  3. spice.py +84 -14
  4. tests.py +5 -9
README.md CHANGED
@@ -3,7 +3,7 @@ title: SPICE
3
  tags:
4
  - evaluate
5
  - metric
6
- description: "TODO: add a description here"
7
  sdk: gradio
8
  sdk_version: 5.45.0
9
  app_file: app.py
@@ -14,35 +14,62 @@ pinned: false
14
 
15
  ***Module Card Instructions:*** *This module calculates the SPICE metric for evaluating image captioning models.*
16
 
 
 
17
  ## Metric Description
 
18
  *SPICE (Semantic Propositional Image Caption Evaluation) is a metric for evaluating the quality of image captions. It measures the semantic similarity between the generated captions and a set of reference captions by analyzing the underlying semantic propositions.*
19
 
20
  ## How to Use
 
21
  *To use the SPICE metric, you need to provide a set of generated captions and a set of reference captions. The metric will then compute the SPICE score based on the semantic similarity between the two sets of captions.*
22
 
23
  *Here is a simple example of using the SPICE metric:*
24
 
25
  ### Inputs
 
26
  *List all input arguments in the format below*
27
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
 
28
 
29
  ### Output Values
30
 
31
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
32
-
33
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
34
-
35
- #### Values from Popular Papers
36
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
37
 
38
  ### Examples
39
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
40
 
41
- ## Limitations and Bias
42
- *Note any known limitations or biases that the metric has, with links and references if possible.*
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
43
 
44
  ## Citation
45
- *Cite the source where this metric was introduced.*
 
 
 
 
 
 
 
 
46
 
47
  ## Further References
48
- *Add any useful further references.*
 
 
 
3
  tags:
4
  - evaluate
5
  - metric
6
+ description: "SPICE (Semantic Propositional Image Caption Evaluation) is a metric for evaluating the quality of image captions by measuring semantic similarity."
7
  sdk: gradio
8
  sdk_version: 5.45.0
9
  app_file: app.py
 
14
 
15
  ***Module Card Instructions:*** *This module calculates the SPICE metric for evaluating image captioning models.*
16
 
17
+ **Can not support Apple Silicon, and make sure you have already installed JDK 8/11.**
18
+
19
  ## Metric Description
20
+
21
  *SPICE (Semantic Propositional Image Caption Evaluation) is a metric for evaluating the quality of image captions. It measures the semantic similarity between the generated captions and a set of reference captions by analyzing the underlying semantic propositions.*
22
 
23
  ## How to Use
24
+
25
  *To use the SPICE metric, you need to provide a set of generated captions and a set of reference captions. The metric will then compute the SPICE score based on the semantic similarity between the two sets of captions.*
26
 
27
  *Here is a simple example of using the SPICE metric:*
28
 
29
  ### Inputs
30
+
31
  *List all input arguments in the format below*
32
+ - **predictions** *(list of list of strings): The generated captions to evaluate.*
33
+ - **references** *(list of list of strings): The reference captions for each generated caption.*
34
 
35
  ### Output Values
36
 
37
+ *List all output values in the format below*
38
+ - **metric_score** *(list of dict): The SPICE score representing the semantic similarity between the generated and reference captions.*
 
 
 
 
39
 
40
  ### Examples
 
41
 
42
+ ```python
43
+ import evaluate
44
+
45
+ metric = evaluate.load("sunhill/spice")
46
+ results = metric.compute(
47
+ predictions=[["train traveling down a track in front of a road"]],
48
+ references=[
49
+ [
50
+ "a train traveling down tracks next to lights",
51
+ "a blue and silver train next to train station and trees",
52
+ "a blue train is next to a sidewalk on the rails",
53
+ "a passenger train pulls into a train station",
54
+ "a train coming down the tracks arriving at a station",
55
+ ]
56
+ ]
57
+ )
58
+ print(results)
59
+ ```
60
 
61
  ## Citation
62
+
63
+ ```bibtex
64
+ @inproceedings{spice2016,
65
+ title = {SPICE: Semantic Propositional Image Caption Evaluation},
66
+ author = {Peter Anderson and Basura Fernando and Mark Johnson and Stephen Gould},
67
+ year = {2016},
68
+ booktitle = {ECCV}
69
+ }
70
+ ```
71
 
72
  ## Further References
73
+
74
+ - [SPICE](https://github.com/peteanderson80/SPICE)
75
+ - [Image Caption Metrics](https://github.com/EricWWWW/image-caption-metrics)
app.py CHANGED
@@ -1,6 +1,178 @@
 
 
 
1
  import evaluate
2
- from evaluate.utils import launch_gradio_widget
 
3
 
4
 
5
  module = evaluate.load("sunhill/spice")
6
- launch_gradio_widget(module)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sys
2
+ from pathlib import Path
3
+
4
  import evaluate
5
+ import gradio as gr
6
+ from evaluate import parse_readme
7
 
8
 
9
  module = evaluate.load("sunhill/spice")
10
+
11
+
12
+ def compute_spice(predictions, references):
13
+ # return module.compute(predictions=predictions, references=references)
14
+ return [
15
+ {
16
+ "All": {
17
+ "pr": 0.6666666666666666,
18
+ "re": 0.09523809523809523,
19
+ "f": 0.16666666666666666,
20
+ "fn": 19.0,
21
+ "numImages": 1.0,
22
+ "fp": 1.0,
23
+ "tp": 2.0,
24
+ },
25
+ "Relation": {
26
+ "pr": 0.0,
27
+ "re": 0.0,
28
+ "f": 0.0,
29
+ "fn": 8.0,
30
+ "numImages": 1.0,
31
+ "fp": 1.0,
32
+ "tp": 0.0,
33
+ },
34
+ "Cardinality": {
35
+ "pr": float("nan"),
36
+ "re": float("nan"),
37
+ "f": float("nan"),
38
+ "fn": 0.0,
39
+ "numImages": 1.0,
40
+ "fp": 0.0,
41
+ "tp": 0.0,
42
+ },
43
+ "Attribute": {
44
+ "pr": 0.0,
45
+ "re": 0.0,
46
+ "f": 0.0,
47
+ "fn": 5.0,
48
+ "numImages": 1.0,
49
+ "fp": 0.0,
50
+ "tp": 0.0,
51
+ },
52
+ "Size": {
53
+ "pr": 0.0,
54
+ "re": 0.0,
55
+ "f": 0.0,
56
+ "fn": 1.0,
57
+ "numImages": 1.0,
58
+ "fp": 0.0,
59
+ "tp": 0.0,
60
+ },
61
+ "Color": {
62
+ "pr": float("nan"),
63
+ "re": float("nan"),
64
+ "f": float("nan"),
65
+ "fn": 0.0,
66
+ "numImages": 1.0,
67
+ "fp": 0.0,
68
+ "tp": 0.0,
69
+ },
70
+ "Object": {
71
+ "pr": 1.0,
72
+ "re": 0.25,
73
+ "f": 0.4,
74
+ "fn": 6.0,
75
+ "numImages": 1.0,
76
+ "fp": 0.0,
77
+ "tp": 2.0,
78
+ },
79
+ },
80
+ {
81
+ "All": {
82
+ "pr": 0.2,
83
+ "re": 0.125,
84
+ "f": 0.15384615384615385,
85
+ "fn": 7.0,
86
+ "numImages": 1.0,
87
+ "fp": 4.0,
88
+ "tp": 1.0,
89
+ },
90
+ "Relation": {
91
+ "pr": 0.0,
92
+ "re": 0.0,
93
+ "f": 0.0,
94
+ "fn": 2.0,
95
+ "numImages": 1.0,
96
+ "fp": 0.0,
97
+ "tp": 0.0,
98
+ },
99
+ "Cardinality": {
100
+ "pr": float("nan"),
101
+ "re": float("nan"),
102
+ "f": float("nan"),
103
+ "fn": 0.0,
104
+ "numImages": 1.0,
105
+ "fp": 0.0,
106
+ "tp": 0.0,
107
+ },
108
+ "Attribute": {
109
+ "pr": 0.0,
110
+ "re": 0.0,
111
+ "f": 0.0,
112
+ "fn": 3.0,
113
+ "numImages": 1.0,
114
+ "fp": 2.0,
115
+ "tp": 0.0,
116
+ },
117
+ "Size": {
118
+ "pr": float("nan"),
119
+ "re": float("nan"),
120
+ "f": float("nan"),
121
+ "fn": 0.0,
122
+ "numImages": 1.0,
123
+ "fp": 0.0,
124
+ "tp": 0.0,
125
+ },
126
+ "Color": {
127
+ "pr": 0.0,
128
+ "re": 0.0,
129
+ "f": 0.0,
130
+ "fn": 1.0,
131
+ "numImages": 1.0,
132
+ "fp": 1.0,
133
+ "tp": 0.0,
134
+ },
135
+ "Object": {
136
+ "pr": 0.3333333333333333,
137
+ "re": 0.3333333333333333,
138
+ "f": 0.3333333333333333,
139
+ "fn": 2.0,
140
+ "numImages": 1.0,
141
+ "fp": 2.0,
142
+ "tp": 1.0,
143
+ },
144
+ },
145
+ ]
146
+
147
+
148
+ iface = gr.Interface(
149
+ fn=compute_spice,
150
+ inputs=[
151
+ gr.Textbox(
152
+ label="References: separated by ;",
153
+ placeholder="Enter reference texts here...",
154
+ ),
155
+ gr.Textbox(
156
+ label="Predictions: Only one prediction",
157
+ placeholder="Enter prediction text here...",
158
+ ),
159
+ ],
160
+ outputs=gr.JSON(label="SPICE Score"),
161
+ title="SPICE Score Evaluator",
162
+ description="Evaluate the alignment between an image and a text using SPICE Score.",
163
+ examples=[
164
+ [
165
+ (
166
+ "a train traveling down tracks next to lights;"
167
+ "a blue and silver train next to train station and trees;"
168
+ "a blue train is next to a sidewalk on the rails;"
169
+ "a passenger train pulls into a train station;"
170
+ "a train coming down the tracks arriving at a station;"
171
+ ),
172
+ "train traveling down a track in front of a road",
173
+ ]
174
+ ],
175
+ article=parse_readme(Path(sys.path[0]) / "README.md"),
176
+ )
177
+
178
+ iface.launch()
spice.py CHANGED
@@ -8,7 +8,6 @@ import tempfile
8
 
9
  import evaluate
10
  import datasets
11
- import numpy as np
12
  from evaluate.utils.logging import get_logger
13
 
14
  logger = get_logger(__name__)
@@ -42,9 +41,86 @@ Returns:
42
  spice: SPICE score
43
  Examples:
44
  >>> metric = evaluate.load("sunhill/spice")
45
- >>> results = metric.compute(predictions=["a cat on a mat"], references=["a cat is on the mat"])
 
 
 
 
 
 
 
 
 
 
 
46
  >>> print(results)
47
- {'spice': 0.5}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48
  """
49
 
50
 
@@ -101,7 +177,7 @@ class SPICE(evaluate.Metric):
101
  try:
102
  return float(obj)
103
  except (ValueError, TypeError):
104
- return np.nan
105
 
106
  def _compute(self, predictions, references):
107
  """Returns the scores"""
@@ -116,10 +192,9 @@ class SPICE(evaluate.Metric):
116
  f"Got {len(prediction)} predictions and {len(reference)} references."
117
  )
118
  input_data.append({"image_id": i, "test": prediction[0], "refs": reference})
119
- print(prediction, reference)
120
 
121
  in_file = tempfile.NamedTemporaryFile(delete=False)
122
- json.dump(input_data, in_file, indent=2)
123
  in_file.close()
124
 
125
  out_file = tempfile.NamedTemporaryFile(delete=False)
@@ -156,19 +231,14 @@ class SPICE(evaluate.Metric):
156
  os.remove(in_file.name)
157
  os.remove(out_file.name)
158
 
159
- img_id_to_scores = {}
160
- spice_scores = []
161
- for item in results:
162
- img_id_to_scores[item["image_id"]] = item["scores"]
163
- spice_scores.append(self.float_convert(item["scores"]["All"]["f"]))
164
- average_score = np.mean(np.array(spice_scores))
165
  scores = []
166
  for image_id in range(len(predictions)):
167
  # Convert none to NaN before saving scores over subcategories
168
  score_set = {}
169
- for category, score_tuple in img_id_to_scores[image_id].iteritems():
170
  score_set[category] = {
171
  k: self.float_convert(v) for k, v in score_tuple.items()
172
  }
173
  scores.append(score_set)
174
- return average_score, scores
 
8
 
9
  import evaluate
10
  import datasets
 
11
  from evaluate.utils.logging import get_logger
12
 
13
  logger = get_logger(__name__)
 
41
  spice: SPICE score
42
  Examples:
43
  >>> metric = evaluate.load("sunhill/spice")
44
+ >>> results = metric.compute(
45
+ predictions=[['train traveling down a track in front of a road']],
46
+ references=[
47
+ [
48
+ 'a train traveling down tracks next to lights',
49
+ 'a blue and silver train next to train station and trees',
50
+ 'a blue train is next to a sidewalk on the rails',
51
+ 'a passenger train pulls into a train station',
52
+ 'a train coming down the tracks arriving at a station'
53
+ ]
54
+ ]
55
+ )
56
  >>> print(results)
57
+ [
58
+ {
59
+ "All": {
60
+ "pr": 0.25,
61
+ "re": 0.07142857142857142,
62
+ "f": 0.11111111111111112,
63
+ "fn": 13.0,
64
+ "numImages": 1.0,
65
+ "fp": 3.0,
66
+ "tp": 1.0,
67
+ },
68
+ "Relation": {
69
+ "pr": 0.0,
70
+ "re": 0.0,
71
+ "f": 0.0,
72
+ "fn": 5.0,
73
+ "numImages": 1.0,
74
+ "fp": 1.0,
75
+ "tp": 0.0,
76
+ },
77
+ "Cardinality": {
78
+ "pr": nan,
79
+ "re": nan,
80
+ "f": nan,
81
+ "fn": 0.0,
82
+ "numImages": 1.0,
83
+ "fp": 0.0,
84
+ "tp": 0.0,
85
+ },
86
+ "Attribute": {
87
+ "pr": 0.0,
88
+ "re": 0.0,
89
+ "f": 0.0,
90
+ "fn": 4.0,
91
+ "numImages": 1.0,
92
+ "fp": 0.0,
93
+ "tp": 0.0,
94
+ },
95
+ "Size": {
96
+ "pr": nan,
97
+ "re": nan,
98
+ "f": nan,
99
+ "fn": 0.0,
100
+ "numImages": 1.0,
101
+ "fp": 0.0,
102
+ "tp": 0.0,
103
+ },
104
+ "Color": {
105
+ "pr": 0.0,
106
+ "re": 0.0,
107
+ "f": 0.0,
108
+ "fn": 1.0,
109
+ "numImages": 1.0,
110
+ "fp": 0.0,
111
+ "tp": 0.0,
112
+ },
113
+ "Object": {
114
+ "pr": 0.3333333333333333,
115
+ "re": 0.2,
116
+ "f": 0.25,
117
+ "fn": 4.0,
118
+ "numImages": 1.0,
119
+ "fp": 2.0,
120
+ "tp": 1.0,
121
+ },
122
+ }
123
+ ]
124
  """
125
 
126
 
 
177
  try:
178
  return float(obj)
179
  except (ValueError, TypeError):
180
+ return float("nan")
181
 
182
  def _compute(self, predictions, references):
183
  """Returns the scores"""
 
192
  f"Got {len(prediction)} predictions and {len(reference)} references."
193
  )
194
  input_data.append({"image_id": i, "test": prediction[0], "refs": reference})
 
195
 
196
  in_file = tempfile.NamedTemporaryFile(delete=False)
197
+ in_file.write(json.dumps(input_data, indent=2).encode("utf-8"))
198
  in_file.close()
199
 
200
  out_file = tempfile.NamedTemporaryFile(delete=False)
 
231
  os.remove(in_file.name)
232
  os.remove(out_file.name)
233
 
234
+ img_id_to_scores = {item["image_id"]: item["scores"] for item in results}
 
 
 
 
 
235
  scores = []
236
  for image_id in range(len(predictions)):
237
  # Convert none to NaN before saving scores over subcategories
238
  score_set = {}
239
+ for category, score_tuple in img_id_to_scores[image_id].items():
240
  score_set[category] = {
241
  k: self.float_convert(v) for k, v in score_tuple.items()
242
  }
243
  scores.append(score_set)
244
+ return scores
tests.py CHANGED
@@ -12,8 +12,7 @@ test_cases = [
12
  "a passenger train pulls into a train station",
13
  "a train coming down the tracks arriving at a station",
14
  ]
15
- ],
16
- "result": {"metric_score": 0},
17
  },
18
  {
19
  "predictions": [
@@ -29,12 +28,11 @@ test_cases = [
29
  "the plane is flying over top of the cars",
30
  ],
31
  ["a blue plate filled with marshmallows chocolate chips and banana"],
32
- ],
33
- "result": {"metric_score": 1},
34
- }
35
  ]
36
 
37
- metric = evaluate.load("./spice.py")
38
  for i, test_case in enumerate(test_cases):
39
  results = metric.compute(
40
  predictions=test_case["predictions"], references=test_case["references"]
@@ -42,6 +40,4 @@ for i, test_case in enumerate(test_cases):
42
  print(f"Test case {i+1}:")
43
  print("Predictions:", test_case["predictions"])
44
  print("References:", test_case["references"])
45
- print("Results:", results)
46
- print("Expected:", test_case["result"])
47
- print()
 
12
  "a passenger train pulls into a train station",
13
  "a train coming down the tracks arriving at a station",
14
  ]
15
+ ]
 
16
  },
17
  {
18
  "predictions": [
 
28
  "the plane is flying over top of the cars",
29
  ],
30
  ["a blue plate filled with marshmallows chocolate chips and banana"],
31
+ ]
32
+ },
 
33
  ]
34
 
35
+ metric = evaluate.load("sunhill/spice")
36
  for i, test_case in enumerate(test_cases):
37
  results = metric.compute(
38
  predictions=test_case["predictions"], references=test_case["references"]
 
40
  print(f"Test case {i+1}:")
41
  print("Predictions:", test_case["predictions"])
42
  print("References:", test_case["references"])
43
+ print(results)