ytewel committed
Commit 953c697 · verified · 1 Parent(s): 8e50265

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,180 @@
---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
---

# Heb-CLIP Overview

### Description:
Heb-CLIP is a Hebrew text encoder fine-tuned from `xlm-roberta-base` and trained to map Hebrew text into the latent space of `openai/clip-vit-large-patch14`.
Given a Hebrew text input, the model produces an embedding vector that is aligned with the CLIP visual embedding space, allowing direct comparison with image embeddings generated by `openai/clip-vit-large-patch14`.
This enables image–text matching and supports applications such as text-based image retrieval, zero-shot classification, and more.<br>

This model is ready for commercial use. <br>

### License/Terms of Use:

[NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
<br>

### Deployment Geography:
Global <br>

### Use Case: <br>
The model is intended for researchers and developers requiring Hebrew vision-language capabilities, such as matching Hebrew text with images in a shared embedding space. Typical use cases include text-based image retrieval, zero-shot image classification, and multimodal search over Hebrew content. <br>
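
Because both modalities live in one embedding space, text-based retrieval reduces to a nearest-neighbor search: normalize the gallery's image embeddings, then rank them by cosine similarity against the query's text embedding. A minimal NumPy sketch with random stand-in vectors (the function name and toy data are ours; real vectors would come from Heb-CLIP and CLIP ViT-Large/14):

```python
import numpy as np

def retrieve_top_k(text_vec, image_vecs, k=5):
    """Rank a gallery of image embeddings by cosine similarity to one text embedding."""
    t = text_vec / (np.linalg.norm(text_vec) + 1e-8)
    g = image_vecs / (np.linalg.norm(image_vecs, axis=1, keepdims=True) + 1e-8)
    scores = g @ t                   # cosine similarity per gallery image
    order = np.argsort(-scores)[:k]  # indices of the k best matches, best first
    return order, scores[order]

# Toy example with random stand-in embeddings
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 768))                # 100 fake image embeddings
query = gallery[42] + 0.1 * rng.normal(size=768)     # make image 42 the obvious match
top, scores = retrieve_top_k(query, gallery, k=3)
```

In practice the gallery embeddings would be precomputed once and cached, since only the query side changes per search.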

### Release Date: <br>
Hugging Face 03/28/2026 via https://huggingface.co/nvidia/hebclip-model <br>

## Reference(s):
[1] [XLM-RoBERTa Base](https://huggingface.co/FacebookAI/xlm-roberta-base) <br>
[2] [CLIP ViT-Large/14](https://huggingface.co/openai/clip-vit-large-patch14) <br>

## Model Architecture:
**Architecture Type:** Transformer <br>

**Network Architecture:** XLM-RoBERTa (12 layers, hidden size 768). <br>

**This model was developed based on [XLM-RoBERTa Base](https://huggingface.co/FacebookAI/xlm-roberta-base).** <br>

**Number of model parameters:** 2.78 × 10^8 (278M) <br>

## Input(s): <br>
**Input Type(s):** Text <br>

**Input Format(s):** String (Hebrew text) <br>

**Input Parameters:** One-Dimensional (1D) <br>

**Other Properties Related to Input:** Maximum sequence length of 512 tokens <br>

## Output(s)

**Output Type(s):** Embedding vector <br>

**Output Format(s):** Float tensor <br>

**Output Parameters:** One-Dimensional (1D); a 768-dimensional embedding vector <br>

**Other Properties Related to Output:** The output is a normalized 768-dimensional float vector aligned with the CLIP ViT-Large/14 visual embedding space, enabling direct cosine-similarity comparison with image embeddings. <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:
**Runtime Engine(s):**
* N/A

**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Hopper <br>

**Preferred/Supported Operating System(s):**
* Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s):
v1.0 <br>

## Training, Testing, and Evaluation Datasets:

### Training Dataset:

**Link**: https://huggingface.co/datasets/nvidia/heb-clip

**Data Modality:** <br>
* Image (embeddings) and Text

**Image Training Data Size:** <br>
* 7.78 million image embeddings [1 Million to 1 Billion Images] <br>

**Text Training Data Size:** <br>
* 7.78 million text captions [Less than a Billion Tokens] <br>

**Data Collection Method by dataset:** <br>
* Automated

**Labeling Method by dataset:** <br>
* Not Applicable <br>

**Properties:** The Hebrew-CLIP dataset is a collection of Hebrew image captions designed to facilitate training of vision-language models like CLIP (Contrastive Language-Image Pre-training) for the Hebrew language. The dataset provides captions without the actual images, instead offering references to pre-computed image embeddings. <br>

### Testing & Evaluation Dataset:

**Link**: https://google.github.io/crossmodal-3600/

**Benchmark Score:** MRR: 0.554, Recall@1: 0.436, Recall@5: 0.692, Recall@10: 0.776 <br>

**Data Collection Method by dataset:** <br>
* Human <br>

**Labeling Method by dataset:** <br>
* Human <br>

**Properties:** Crossmodal-3600 is a multilingual evaluation benchmark of 3,600 images from Open Images, each paired with human-written captions in 36 typologically diverse languages (including Hebrew), spanning diverse scripts and language families. All captions were manually annotated, the images are publicly available, and the dataset contains no personal data or machine-generated content. We use the Hebrew caption subset to evaluate Heb-CLIP. <br>
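
The benchmark numbers above follow the standard definitions: for each caption, the gallery is ranked by similarity and the rank of the ground-truth image is recorded; MRR averages the reciprocal ranks, and Recall@K is the fraction of captions whose correct image lands in the top K. A small helper (the function name is ours, not part of the release) makes this concrete:

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """MRR and Recall@K from the 1-based rank of the correct image per caption."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"Recall@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Toy example: ranks of the ground-truth image for four captions
m = retrieval_metrics([1, 3, 2, 20])
```

With these four toy ranks, Recall@1 is 1/4 and Recall@5 is 3/4; the reported scores come from running this procedure over the Hebrew Crossmodal-3600 subset.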

## Usage

### Inference

```python
import numpy as np
from PIL import Image
import torch
from transformers import AutoTokenizer, AutoModel, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text encoder (Heb-CLIP) - loads directly from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("nvidia/hebclip-model")
text_model = AutoModel.from_pretrained("nvidia/hebclip-model")
text_model.to(device).eval()

# Image encoder (frozen CLIP ViT-Large/14)
image_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
image_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def encode_text(text):
    """Mean-pool the last hidden states into one 768-d text embedding."""
    inputs = tokenizer([text], padding=True, return_tensors="pt",
                       truncation=True, max_length=77).to(device)
    with torch.no_grad():
        outputs = text_model(**inputs)
    emb = outputs.last_hidden_state.mean(dim=1)
    return emb.cpu().numpy()[0]

def encode_image(image_path):
    """Embed one image with the CLIP vision tower."""
    img = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(images=img, return_tensors="pt")["pixel_values"].to(device)
    with torch.no_grad():
        emb = image_model.get_image_features(pixel_values=pixel_values)
    return emb.cpu().numpy()[0]

def score_similarity(text_vec, image_vec):
    """Cosine similarity between a text embedding and an image embedding."""
    tv = text_vec / (np.linalg.norm(text_vec) + 1e-8)
    iv = image_vec / (np.linalg.norm(image_vec) + 1e-8)
    return float(np.dot(tv, iv))

# Example usage
text_vec = encode_text("כלב רץ בפארק")  # "A dog running in the park" in Hebrew
image_vec = encode_image("your_image.jpg")
similarity = score_similarity(text_vec, image_vec)
print(f"Similarity: {similarity}")
```
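
Zero-shot classification follows the same pattern: encode one Hebrew prompt per candidate label and pick the label whose embedding is most similar to the image embedding. A self-contained sketch over precomputed vectors (the helper name and toy data are ours; in practice the label vectors would come from `encode_text` and the image vector from `encode_image` above):

```python
import numpy as np

def zero_shot_classify(image_vec, label_vecs, labels):
    """Return the label whose text embedding is closest (cosine) to the image embedding."""
    iv = image_vec / (np.linalg.norm(image_vec) + 1e-8)
    lv = label_vecs / (np.linalg.norm(label_vecs, axis=1, keepdims=True) + 1e-8)
    scores = lv @ iv               # one cosine score per candidate label
    best = int(np.argmax(scores))
    return labels[best], float(scores[best])

# Toy example: random stand-ins for encode_text outputs
rng = np.random.default_rng(1)
label_vecs = rng.normal(size=(3, 768))
labels = ["חתול", "כלב", "ציפור"]  # cat, dog, bird
image_vec = label_vecs[1]  # pretend the image matches the second label exactly
label, score = zero_shot_classify(image_vec, label_vecs, labels)
```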

# Inference:
**Acceleration Engine:** PyTorch <br>
**Test Hardware:** <br>
* NVIDIA A100 <br>

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>

Please make sure you have the proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the model will not blur or otherwise protect the subjects that appear in it. <br>

For more detailed information on ethical considerations for this model, please see the [Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards](https://huggingface.co/nvidia/hebclip-model/tree/main/model-card).
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail). <br>
config.json ADDED
@@ -0,0 +1,28 @@
{
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.46.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
model-card/bias.md ADDED
@@ -0,0 +1,4 @@
Field | Response
:---|:---
Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None
Measures taken to mitigate against unwanted bias: | None
model-card/explainability.md ADDED
@@ -0,0 +1,13 @@
Field | Response
:---|:---
Intended Task/Domain: | Text-Image Matching
Model Type: | Transformer
Intended Users: | The model is intended for researchers and developers requiring Hebrew vision-language capabilities, such as matching Hebrew text with images in a shared embedding space.
Output: | 768-dimensional embedding vector aligned with CLIP ViT-Large/14
Describe how the model works: | Hebrew text input is encoded into an embedding vector and compared with an embedding vector from an image to output a matching score.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable
Technical Limitations & Mitigation: | Currently, this model only supports the Hebrew language. It may struggle with transliterated or code-mixed text and out-of-domain terminology.
Verified to have met prescribed NVIDIA quality standards: | Yes
Performance Metrics: | Retrieval Recall (R@1/5/10), Zero-Shot Classification Accuracy
Potential Known Risks: | The model may reproduce biases present in the training data and/or the underlying base model. Since the model supports image retrieval based on text captions, it may return mismatched or irrelevant images for captions that fall outside the distribution of the training data.
Licensing: | [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
model-card/privacy.md ADDED
@@ -0,0 +1,10 @@
Field | Response
:---|:---
Generatable or reverse engineerable personal data? | No
Personal data used to create this model? | No
How often is dataset reviewed? | Before Release
Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No
Is there provenance for all datasets used in training? | Yes
Does data labeling (annotation, metadata) comply with privacy laws? | Not Applicable
Is data compliant with data subject requests for data correction or removal, if such a request was made? | Not Applicable
Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/
model-card/safety.md ADDED
@@ -0,0 +1,6 @@
Field | Response
:---|:---
Model Application Field(s): | Hebrew text encoding, text-image matching
Describe the life critical impact (if present). | Not Applicable
Use Case Restrictions: | Abide by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to.
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b4234c546b961afcfff506756ccede2e8ff76453738c4cf24ab37a615ef3028f
size 1112197096
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3a56def25aa40facc030ea8b0b87f3688e4b3c39eb8b45d5702b3a1300fe2a20
size 17082734
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}