Update benchmarks for v2 (joint cross-modal training)
README.md
All benchmarks run on a single NVIDIA L4 GPU with 5K samples where applicable.

| Direction | TEG-421M (421M) | LCO-3B (4.7B) | Nemotron-3B (4.7B) | ImageBind (1.2B) | EBind |
|---|---|---|---|---|---|
| Text → Image R@1 | 0.672 | 0.660 | 0.529 | 0.712 | **0.779** |
| Image → Text R@1 | 0.620 | 0.564 | 0.299 | 0.736 | **0.783** |
| Text → Audio R@1 | **0.113** | 0.042 | 0.018 | 0.038 | 0.047 |
| Audio → Text R@1 | **0.115** | 0.032 | 0.010 | 0.039 | 0.035 |
| Audio → Image R@1 | **0.081** | 0.027 | 0.016 | 0.023 | 0.027 |
| Image → Audio R@1 | **0.083** | 0.034 | 0.018 | 0.025 | 0.032 |

TEG leads all audio cross-modal directions by 2-10x over models that are 3-11x larger. Image↔Audio improved ~40% over v1 via joint cross-modal training. Vision-text trails EBind/ImageBind but uses encoders small enough for edge deployment.
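R@1 (recall at 1) is the fraction of queries whose paired item ranks first by cosine similarity. A minimal sketch of how such a number is computed (NumPy; illustrative toy data, not the actual benchmark harness):

```python
import numpy as np

def recall_at_1(queries: np.ndarray, gallery: np.ndarray) -> float:
    """Fraction of queries whose paired gallery item (same row index)
    is the nearest neighbor under cosine similarity."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sim = q @ g.T                 # (n_queries, n_gallery) similarity matrix
    top1 = sim.argmax(axis=1)     # index of best match per query
    return float((top1 == np.arange(len(q))).mean())

# toy paired embeddings: both queries retrieve their own pair, so R@1 = 1.0
text = np.array([[1.0, 0.1], [0.1, 1.0]])
image = np.array([[0.9, 0.2], [0.2, 0.9]])
print(recall_at_1(text, image))
```

The same function covers every direction in the table: swap which modality's embeddings play the role of queries versus gallery.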

### Audio retrieval — AudioCaps & Clotho

| Benchmark | Direction | TEG-421M | LCO-3B | Nemotron-3B | CLAP-Small | CLAP-Large | ImageBind | EBind |
|---|---|---|---|---|---|---|---|---|
| AudioCaps | A→T R@1 | 0.159 | 0.250 | 0.050 | **0.425** | 0.420 | 0.116 | 0.225 |
| AudioCaps | T→A R@1 | 0.149 | 0.215 | 0.075 | **0.315** | 0.280 | 0.080 | 0.219 |
| Clotho | A→T R@1 | 0.168 | 0.178 | 0.038 | 0.166 | **0.195** | 0.061 | 0.088 |
| Clotho | T→A R@1 | 0.123 | **0.187** | 0.070 | 0.159 | 0.167 | 0.074 | 0.118 |

CLAP models lead on the audio-only benchmarks, as expected for audio specialists with no image support. Among trimodal models, TEG is competitive with LCO-3B while being 11x smaller.

### Image-text retrieval — MSCOCO & Flickr30k

| Benchmark | Direction | TEG-421M (421M) | EBind (1.78B*) | ImageBind (1.2B) | LCO-3B (4.7B) | Nemotron-3B (4.7B) |
|---|---|---|---|---|---|---|
| MSCOCO 5K | I→T R@1 | 0.248 | **0.743** | 0.658 | 0.533 | 0.225 |
| MSCOCO 5K | T→I R@1 | 0.180 | **0.559** | 0.490 | 0.469 | 0.334 |
| MSCOCO 5K | I→T R@10 | 0.622 | **0.948** | 0.918 | 0.784 | 0.630 |
| Flickr30k | I→T R@1 | 0.498 | — | — | **0.840** | 0.419 |
| Flickr30k | T→I R@1 | 0.358 | — | — | **0.765** | 0.563 |

TEG's image-text retrieval trades accuracy for edge deployability — MobileNetV4-Medium is ~100x smaller than the ViT-H/ViT-L encoders used by competitors. On MSCOCO, TEG outperforms Nemotron-3B on I→T despite being 11x smaller.

| Model | Params | Score |
|---|---|---|
| CLAP-Large | 67.8M | **0.905** |
| LCO-3B | 4.7B | 0.853 |
| TEG-421M | 421M | 0.820 |
| EBind | 1.78B* | 0.770 |
| CLAP-Small | 27.5M | 0.751 |
| Nemotron-3B | 4.7B | 0.727 |
| ImageBind | 1.2B | 0.664 |


`teg-421m.safetensors`: all components in one file (~1 GB)

- **Matryoshka weighting**: Higher weight on smaller dimensions (4x at 128-dim) ensures quality at aggressive truncation levels
- **Edge-first**: Source encoders chosen for edge deployment — MobileNetV4-Medium and EfficientAT mn20 can run on devices like Raspberry Pi 5
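
The Matryoshka weighting above can be sketched as a loss computed at several truncation widths, with smaller widths weighted more heavily. The dimension ladder, temperature, and all weights except the 4x at 128-dim are illustrative assumptions, not the actual training configuration:

```python
import numpy as np

# Assumed truncation ladder; only the 4x weight at 128-dim is stated in the text.
DIMS_AND_WEIGHTS = [(128, 4.0), (256, 2.0), (512, 1.0), (1024, 1.0)]

def info_nce(logits: np.ndarray) -> float:
    """Cross-entropy where the matched pair sits on the diagonal."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    idx = np.arange(len(logits))
    return float(-log_probs[idx, idx].mean())

def matryoshka_loss(a: np.ndarray, b: np.ndarray, temp: float = 0.07) -> float:
    """Contrastive loss averaged over truncation widths; smaller widths
    get higher weight so short embedding prefixes stay usable."""
    total, weight_sum = 0.0, 0.0
    for dim, w in DIMS_AND_WEIGHTS:
        a_d = a[:, :dim] / np.linalg.norm(a[:, :dim], axis=1, keepdims=True)
        b_d = b[:, :dim] / np.linalg.norm(b[:, :dim], axis=1, keepdims=True)
        logits = (a_d @ b_d.T) / temp
        total += w * (info_nce(logits) + info_nce(logits.T)) / 2  # symmetric
        weight_sum += w
    return total / weight_sum
```

Because each truncated prefix is re-normalized before the contrastive term, embeddings can be cut to 128 dimensions at inference time and still retrieve well.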
*\*EBind's [HuggingFace checkpoint](https://huggingface.co/encord-team/ebind-full) is 8.93M parameters (bridge heads only), but inference requires frozen backbones (SigLIP ViT-L, CLAP HTSAT, text encoder) totaling 1.78B loaded parameters as measured by our benchmark harness.*
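
The "loaded parameters" figure in the footnote corresponds to summing sizes over every tensor resident at inference time, frozen backbones included. A hypothetical sketch (the shapes dict stands in for a real checkpoint's state_dict; this is not the actual harness):

```python
from math import prod

def loaded_param_count(param_shapes: dict) -> int:
    """Total parameters that must be in memory for inference:
    trainable bridge heads plus frozen backbones."""
    return sum(prod(shape) for shape in param_shapes.values())

# illustrative: a small bridge head alone vs. bridge + frozen backbone
bridge = {"bridge.proj.weight": (1024, 1024), "bridge.proj.bias": (1024,)}
backbone = {"vision.encoder.flat": (300_000_000,)}  # stand-in for a frozen ViT
print(loaded_param_count(bridge))                   # bridge heads only
print(loaded_param_count({**bridge, **backbone}))   # everything loaded
```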
## Limitations
- Audio retrieval lags behind specialist models like CLAP on audio-only benchmarks