ytewel committed
Commit 953c697 · verified · 1 Parent(s): 8e50265

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,180 @@
---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
---

# Heb-CLIP Overview

### Description:
Heb-CLIP is a Hebrew text encoder fine-tuned from `xlm-roberta-base` and trained to map Hebrew text into the latent space of `openai/clip-vit-large-patch14`.
Given a Hebrew text input, the model produces an embedding vector that is aligned with the CLIP visual embedding space, allowing direct comparison with image embeddings generated by `openai/clip-vit-large-patch14`.
This enables image–text matching and supports applications such as text-based image retrieval, zero-shot classification, and more.<br>

This model is ready for commercial use. <br>

### License/Terms of Use:

[NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
<br>

### Deployment Geography:
Global <br>

### Use Case: <br>
The model is intended for researchers and developers requiring Hebrew vision-language capabilities, such as matching Hebrew text with images in a shared embedding space. Typical use cases include text-based image retrieval, zero-shot image classification, and multimodal search over Hebrew content. <br>
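
Because both modalities live in one embedding space, text-based retrieval reduces to a nearest-neighbor search: normalize the gallery's image embeddings, then rank them by cosine similarity against the query's text embedding. A minimal NumPy sketch with random stand-in vectors (the function name and toy data are ours; real vectors would come from Heb-CLIP and CLIP ViT-Large/14):

```python
import numpy as np

def retrieve_top_k(text_vec, image_vecs, k=5):
    """Rank a gallery of image embeddings by cosine similarity to one text embedding."""
    t = text_vec / (np.linalg.norm(text_vec) + 1e-8)
    g = image_vecs / (np.linalg.norm(image_vecs, axis=1, keepdims=True) + 1e-8)
    scores = g @ t                   # cosine similarity per gallery image
    order = np.argsort(-scores)[:k]  # indices of the k best matches, best first
    return order, scores[order]

# Toy example with random stand-in embeddings
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 768))                # 100 fake image embeddings
query = gallery[42] + 0.1 * rng.normal(size=768)     # make image 42 the obvious match
top, scores = retrieve_top_k(query, gallery, k=3)
```

In practice the gallery embeddings would be precomputed once and cached, since only the query side changes per search.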

### Release Date: <br>
Hugging Face 03/28/2026 via https://huggingface.co/nvidia/hebclip-model <br>

## Reference(s):
[1] [XLM-RoBERTa Base](https://huggingface.co/FacebookAI/xlm-roberta-base) <br>
[2] [CLIP ViT-Large/14](https://huggingface.co/openai/clip-vit-large-patch14) <br>

## Model Architecture:
**Architecture Type:** Transformer <br>

**Network Architecture:** XLM-RoBERTa (12 layers, hidden size 768). <br>

**This model was developed based on [XLM-RoBERTa Base](https://huggingface.co/FacebookAI/xlm-roberta-base).** <br>

**Number of model parameters:** 2.78 × 10^8 (278M) <br>

## Input(s): <br>
**Input Type(s):** Text <br>

**Input Format(s):** String (Hebrew text) <br>

**Input Parameters:** One-Dimensional (1D) <br>

**Other Properties Related to Input:** Maximum sequence length of 512 tokens <br>

## Output(s)

**Output Type(s):** Embedding vector <br>

**Output Format(s):** Float tensor <br>

**Output Parameters:** One-Dimensional (1D); a 768-dimensional embedding vector <br>

**Other Properties Related to Output:** The output is a normalized 768-dimensional float vector aligned with the CLIP ViT-Large/14 visual embedding space, enabling direct cosine-similarity comparison with image embeddings. <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:
**Runtime Engine(s):**
* N/A

**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Hopper <br>

**Preferred/Supported Operating System(s):**
* Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s):
v1.0 <br>

## Training, Testing, and Evaluation Datasets:

### Training Dataset:

**Link**: https://huggingface.co/datasets/nvidia/heb-clip

**Data Modality:** <br>
* Image (embeddings) and Text

**Image Training Data Size:** <br>
* 7.78 million image embeddings [1 Million to 1 Billion Images] <br>

**Text Training Data Size:** <br>
* 7.78 million text captions [Less than a Billion Tokens] <br>

**Data Collection Method by dataset:** <br>
* Automated

**Labeling Method by dataset:** <br>
* Not Applicable <br>

**Properties:** The Hebrew-CLIP dataset is a collection of Hebrew image captions designed to facilitate training of vision-language models like CLIP (Contrastive Language-Image Pre-training) for the Hebrew language. The dataset provides captions without the actual images, instead offering references to pre-computed image embeddings. <br>

### Testing & Evaluation Dataset:

**Link**: https://google.github.io/crossmodal-3600/

**Benchmark Score:** MRR: 0.554, Recall@1: 0.436, Recall@5: 0.692, Recall@10: 0.776 <br>

**Data Collection Method by dataset:** <br>
* Human <br>

**Labeling Method by dataset:** <br>
* Human <br>

**Properties:** Crossmodal-3600 is a multilingual evaluation benchmark of 3,600 images from Open Images, each paired with human-written captions in 36 typologically diverse languages (including Hebrew), spanning diverse scripts and language families. All captions were manually annotated, the images are publicly available, and the dataset contains no personal data or machine-generated content. We use the Hebrew caption subset to evaluate Heb-CLIP. <br>
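
The benchmark numbers above follow the standard definitions: for each caption, the gallery is ranked by similarity and the rank of the ground-truth image is recorded; MRR averages the reciprocal ranks, and Recall@K is the fraction of captions whose correct image lands in the top K. A small helper (the function name is ours, not part of the release) makes this concrete:

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """MRR and Recall@K from the 1-based rank of the correct image per caption."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"Recall@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Toy example: ranks of the ground-truth image for four captions
m = retrieval_metrics([1, 3, 2, 20])
```

With these four toy ranks, Recall@1 is 1/4 and Recall@5 is 3/4; the reported scores come from running this procedure over the Hebrew Crossmodal-3600 subset.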

## Usage

### Inference

```python
import numpy as np
from PIL import Image
import torch
from transformers import AutoTokenizer, AutoModel, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text encoder (Heb-CLIP) - loads directly from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("nvidia/hebclip-model")
text_model = AutoModel.from_pretrained("nvidia/hebclip-model")
text_model.to(device).eval()

# Image encoder (frozen CLIP ViT-Large/14)
image_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
image_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def encode_text(text):
    """Mean-pool the last hidden states into one 768-d text embedding."""
    inputs = tokenizer([text], padding=True, return_tensors="pt",
                       truncation=True, max_length=77).to(device)
    with torch.no_grad():
        outputs = text_model(**inputs)
    emb = outputs.last_hidden_state.mean(dim=1)
    return emb.cpu().numpy()[0]

def encode_image(image_path):
    """Embed one image with the CLIP vision tower."""
    img = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(images=img, return_tensors="pt")["pixel_values"].to(device)
    with torch.no_grad():
        emb = image_model.get_image_features(pixel_values=pixel_values)
    return emb.cpu().numpy()[0]

def score_similarity(text_vec, image_vec):
    """Cosine similarity between a text embedding and an image embedding."""
    tv = text_vec / (np.linalg.norm(text_vec) + 1e-8)
    iv = image_vec / (np.linalg.norm(image_vec) + 1e-8)
    return float(np.dot(tv, iv))

# Example usage
text_vec = encode_text("כלב רץ בפארק")  # "A dog running in the park" in Hebrew
image_vec = encode_image("your_image.jpg")
similarity = score_similarity(text_vec, image_vec)
print(f"Similarity: {similarity}")
```
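
Zero-shot classification follows the same pattern: encode one Hebrew prompt per candidate label and pick the label whose embedding is most similar to the image embedding. A self-contained sketch over precomputed vectors (the helper name and toy data are ours; in practice the label vectors would come from `encode_text` and the image vector from `encode_image` above):

```python
import numpy as np

def zero_shot_classify(image_vec, label_vecs, labels):
    """Return the label whose text embedding is closest (cosine) to the image embedding."""
    iv = image_vec / (np.linalg.norm(image_vec) + 1e-8)
    lv = label_vecs / (np.linalg.norm(label_vecs, axis=1, keepdims=True) + 1e-8)
    scores = lv @ iv               # one cosine score per candidate label
    best = int(np.argmax(scores))
    return labels[best], float(scores[best])

# Toy example: random stand-ins for encode_text outputs
rng = np.random.default_rng(1)
label_vecs = rng.normal(size=(3, 768))
labels = ["חתול", "כלב", "ציפור"]  # cat, dog, bird
image_vec = label_vecs[1]  # pretend the image matches the second label exactly
label, score = zero_shot_classify(image_vec, label_vecs, labels)
```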

# Inference:
**Acceleration Engine:** PyTorch <br>
**Test Hardware:** <br>
* NVIDIA A100 <br>

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>

Please make sure you have the proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the model will not blur or otherwise protect the subjects that appear in it. <br>

For more detailed information on ethical considerations for this model, please see the [Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards](https://huggingface.co/nvidia/hebclip-model/tree/main/model-card).
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail). <br>
config.json ADDED
@@ -0,0 +1,28 @@
{
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.46.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
model-card/bias.md ADDED
@@ -0,0 +1,4 @@
Field | Response
:---|:---
Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None
Measures taken to mitigate against unwanted bias: | None
model-card/explainability.md ADDED
@@ -0,0 +1,13 @@
Field | Response
:---|:---
Intended Task/Domain: | Text-Image Matching
Model Type: | Transformer
Intended Users: | The model is intended for researchers and developers requiring Hebrew vision-language capabilities, such as matching Hebrew text with images in a shared embedding space.
Output: | 768-dimensional embedding vector aligned with CLIP ViT-Large/14
Describe how the model works: | Hebrew text input is encoded into an embedding vector and compared with an embedding vector from an image to output a matching score.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable
Technical Limitations & Mitigation: | Currently, this model only supports the Hebrew language. It may struggle with transliterated or code-mixed text and out-of-domain terminology.
Verified to have met prescribed NVIDIA quality standards: | Yes
Performance Metrics: | Retrieval Recall (R@1/5/10), Zero-Shot Classification Accuracy
Potential Known Risks: | The model may reproduce biases present in the training data and/or the underlying base model. Since the model supports image retrieval based on text captions, it may return mismatched or irrelevant images for captions that fall outside the distribution of the training data.
Licensing: | [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
model-card/privacy.md ADDED
@@ -0,0 +1,10 @@
Field | Response
:---|:---
Generatable or reverse engineerable personal data? | No
Personal data used to create this model? | No
How often is dataset reviewed? | Before Release
Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No
Is there provenance for all datasets used in training? | Yes
Does data labeling (annotation, metadata) comply with privacy laws? | Not Applicable
Is data compliant with data subject requests for data correction or removal, if such a request was made? | Not Applicable
Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/
model-card/safety.md ADDED
@@ -0,0 +1,6 @@
Field | Response
:---|:---
Model Application Field(s): | Hebrew text encoding, text-image matching
Describe the life critical impact (if present). | Not Applicable
Use Case Restrictions: | Abide by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to.
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b4234c546b961afcfff506756ccede2e8ff76453738c4cf24ab37a615ef3028f
size 1112197096
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3a56def25aa40facc030ea8b0b87f3688e4b3c39eb8b45d5702b3a1300fe2a20
size 17082734
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}