n0w0f/mattext-aligned-embeddings

n0w0f committed 14 days ago

Commit 5bc74d1 (verified) · 1 parent: 6e805ad

Add comprehensive README with architecture details and usage

README.md +231 -0

	@@ -0,0 +1,231 @@

+# MatText Aligned Embeddings: Multi-Modal Material Retrieval
+**A CLIP-style multi-modal embedding model that aligns 10 different material text representations into a shared 128-d vector space for cross-modal retrieval.**
+Query with *any* modality (composition, CIF, SLICES, natural language, z-matrix...) → retrieve materials with similar properties across *all* modalities.
+## 🏗️ Architecture
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    MatTextEncoder                           │
+│                                                             │
+│  ┌───────────────────────────────────────────────────────┐ │
+│  │  Shared Backbone: ModernBERT-base (150M params)       │ │
+│  │  - 8192 token context window (handles long CIFs)      │ │
+│  │  - Mean pooling → 768-d representation                │ │
+│  └───────────────────────────────────────────────────────┘ │
+│                          │                                  │
+│          ┌───────────────┼───────────────┐                 │
+│          ▼               ▼               ▼                 │
+│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐       │
+│  │ Projection   │ │ Projection   │ │ Projection   │  ...  │
+│  │ composition  │ │ cif_sym      │ │ slices       │       │
+│  │ 768→768→128  │ │ 768→768→128  │ │ 768→768→128  │       │
+│  └──────┬───────┘ └──────┬───────┘ └──────┬───────┘       │
+│         ▼                ▼                ▼                │
+│     128-d L2-norm    128-d L2-norm    128-d L2-norm        │
+│                                                             │
+│              ──── Shared Embedding Space ────                │
+└─────────────────────────────────────────────────────────────┘
+```
+### Key Design Decisions
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Backbone | ModernBERT-base | 8192-token context can accommodate long CIFs; rotary position embeddings (RoPE) with efficient attention |
+| Projection | 2-layer MLP per modality | MultiMat recipe: modality-specific heads preserve specialization |
+| Embedding dim | 128 | Common projection size in contrastive learning (e.g. SimCLR); compact for FAISS indices |
+| Loss | AllPairsCLIP + Property-MSE | Aligns all N(N-1)/2 modality pairs; property regularization |
+| Temperature | Learnable (init 0.07) | CLIP standard; learned τ improves convergence |
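The pooling and normalization steps in the diagram can be illustrated with a small standard-library sketch. It is not the repository's code: `mean_pool` and `l2_normalize` are hypothetical helper names, the per-modality MLP between the two steps is omitted, and the vectors are toy 2-d values rather than 768-d or 128-d.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / max(count, 1) for t in total]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row stands in for padding
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)   # [2.0, 3.0]
unit = l2_normalize(pooled)        # unit-length version of the pooled vector
```

Because every modality ends in a unit-length vector, a plain inner-product FAISS index behaves as a cosine-similarity search.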
+## 📊 Modalities Supported
+| Modality | Column | Example | Query Type |
+|----------|--------|---------|------------|
+| Composition | `composition` | `Fe2O3` | "Find iron oxides" |
+| Atom Sequence | `atom_sequences` | `Fe Fe Fe O O O` | Element lists |
+| CIF (symmetrized) | `cif_symmetrized` | Full CIF text | Paste CIF data |
+| CIF (P1) | `cif_p1` | Full CIF in P1 | Paste CIF data |
+| Z-matrix | `zmatrix` | `Fe\nO 1 2.0\nO 1 2.0 2 90` | Internal coords |
+| Atom Seq++ | `atom_sequences_plusplus` | `Fe O 3.57 3.57 90 90` | Elements + lattice |
+| SLICES | `slices` | `Fe O 0 1 o o o` | SLICES encoding |
+| Crystal Text (LLM) | `crystal_text_llm` | `3.6 3.6 3.6\n90 90 90\nFe...` | Gruver format |
+| Local Environment | `local_env` | SMILES-like env | Local bonding |
+| Natural Language | `robocrys_rep` | "FeO crystallizes in..." | Plain English |
+| **Property Query** | property text | "bandgap: 1.5 eV" | Property search |
+## 🧪 Training Recipe
+Based on four key papers:
+1. **MultiMat** (AllPairsCLIP, [arxiv:2312.00111](https://arxiv.org/abs/2312.00111)): Sum of symmetric InfoNCE over all modality pairs
+2. **MatExpert** ([arxiv:2410.21317](https://arxiv.org/abs/2410.21317)): Property↔structure contrastive alignment
+3. **CrystalCLR** ([arxiv:2211.13408](https://arxiv.org/abs/2211.13408)): Composition similarity loss
+4. **SupReMix** ([arxiv:2309.16633](https://arxiv.org/abs/2309.16633)): Property-label-aware soft contrastive
+### Two-Phase Training
+**Phase 1 — Multi-modal alignment** (pretrain100k_v2, 50k samples):
+- AllPairsCLIP loss across all 10 modalities
+- Random modality sampling (4/10 per step) for VRAM efficiency
+- Each step aligns C(4,2)=6 modality pairs
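The Phase 1 step above (sample 4 of 10 modalities, then sum a symmetric InfoNCE loss over the resulting 6 pairs) might look roughly like the following sketch. It uses only the standard library and toy 2-d embeddings; `info_nce` and `all_pairs_clip` are illustrative names, not functions from the training script, and the 0.5 weighting of the two directions follows the usual CLIP convention as an assumption.

```python
import itertools
import math
import random

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two aligned batches of unit vectors."""
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # The matching item for row i sits on the diagonal (column i).
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            loss += log_z - row[i]
        return loss / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

def all_pairs_clip(embeddings, temperature=0.07):
    """Sum symmetric InfoNCE over every unordered pair of modalities."""
    pairs = list(itertools.combinations(sorted(embeddings), 2))
    loss = sum(info_nce(embeddings[m1], embeddings[m2], temperature)
               for m1, m2 in pairs)
    return loss, pairs

MODALITIES = ["composition", "atom_sequences", "cif_symmetrized", "cif_p1",
              "zmatrix", "atom_sequences_plusplus", "slices",
              "crystal_text_llm", "local_env", "robocrys_rep"]

rng = random.Random(0)
sampled = rng.sample(MODALITIES, 4)  # 4 of 10 modalities for this step
# Toy batch of 2 materials: every modality yields identical unit vectors,
# i.e. the embedding space is already perfectly aligned.
toy = {m: [[1.0, 0.0], [0.0, 1.0]] for m in sampled}
loss, pairs = all_pairs_clip(toy)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In the toy case the loss is close to zero because each material's embeddings coincide across modalities; swapping the rows of one batch, as in `swapped`, makes the loss large.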
+**Phase 2 — Property-conditioned alignment** (bandgap + form_energy, 50k samples):
+- Same CLIP loss + property similarity MSE loss
+- Property text "composition: Fe2O3 | bandgap: 2.1000" aligned with structure representations
+- Materials with similar property values cluster in embedding space
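The README does not spell out the property similarity loss, so the sketch below is only one plausible reading. The property string follows the example format shown above. The target similarity `exp(-|p_i - p_j| / scale)`, the `scale` parameter, and both function names are assumptions made for illustration.

```python
import math

def property_text(composition, name, value):
    """Format a property query string in the style of the README example."""
    return f"composition: {composition} | {name}: {value:.4f}"

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def property_similarity_mse(embeddings, values, scale=1.0):
    """MSE between embedding cosine similarity and a property-based target.

    Assumed target: exp(-|p_i - p_j| / scale), which is 1 for equal
    property values and decays as the values move apart.
    """
    total, count = 0.0, 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            target = math.exp(-abs(values[i] - values[j]) / scale)
            total += (cosine(embeddings[i], embeddings[j]) - target) ** 2
            count += 1
    return total / max(count, 1)

text = property_text("Fe2O3", "bandgap", 2.1)
# Same bandgap, identical embeddings: nothing to penalize.
loss_close = property_similarity_mse([[1.0, 0.0], [1.0, 0.0]], [1.0, 1.0])
# Same bandgap, orthogonal embeddings: the pair is pulled together.
loss_far = property_similarity_mse([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

Under this reading, materials with close property values are pushed toward high cosine similarity, which matches the clustering behavior described in the last bullet.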
+### Hyperparameters
+```
+encoder: answerdotai/ModernBERT-base
+embed_dim: 128
+max_length: 512 tokens  # well below the backbone's 8192-token limit; longer inputs are presumably truncated
+batch_size: 32 × 8 grad_accum = 256 effective
+learning_rate: 2e-5 (cosine decay, 10% warmup)
+temperature: learnable (init 0.07)
+epochs: 3 per phase
+optimizer: AdamW (weight_decay=0.01)
+fp16: True
+gradient_checkpointing: True
+```
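The arithmetic behind these settings can be checked with a short sketch. It assumes 50k samples per phase (as stated in the training recipe), a partial final batch that is kept, and a linear warmup; none of these details are confirmed by the training script, and `lr_at` is a hypothetical helper.

```python
import math

BATCH_SIZE = 32
GRAD_ACCUM = 8
SAMPLES_PER_PHASE = 50_000
EPOCHS = 3

effective_batch = BATCH_SIZE * GRAD_ACCUM                         # 256
steps_per_epoch = math.ceil(SAMPLES_PER_PHASE / effective_batch)  # 196
total_steps = steps_per_epoch * EPOCHS                            # 588

def lr_at(step, total, base_lr=2e-5, warmup_frac=0.10):
    """Linear warmup over the first 10% of steps, then cosine decay to zero."""
    warmup_steps = int(total * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

warmup_steps = int(total_steps * 0.10)  # 58
peak_lr = lr_at(warmup_steps, total_steps)
final_lr = lr_at(total_steps, total_steps)
```

With these assumptions each phase runs for 588 optimizer steps, reaches the peak rate of 2e-5 at step 58, and decays to zero by the last step.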
+## 🚀 Quick Start
+### Training
+```bash
+pip install torch transformers datasets faiss-cpu huggingface_hub trackio
+# Local GPU
+python train_mattext_embeddings.py
+# HF Jobs (recommended: a10g-large, 24GB VRAM)
+# Set timeout to 6h
+```
+### Inference & Search
+```python
+import json
+import faiss
+import torch
+from transformers import AutoTokenizer
+# MatTextEncoder, Config and search_vector_db come from this repo's training script
+from train_mattext_embeddings import MatTextEncoder, Config, search_vector_db
+# Load model
+config = Config()
+config.device = "cuda" if torch.cuda.is_available() else "cpu"
+model = MatTextEncoder(config)
+model.load_state_dict(torch.load("mattext-embeddings/model.pt", map_location=config.device))
+model = model.to(config.device)
+model.eval()
+tokenizer = AutoTokenizer.from_pretrained(config.encoder_name)
+# Load FAISS indices
+indices = {}
+for mod in ["composition", "crystal_text_llm", "slices", "cif_symmetrized"]:
+    index = faiss.read_index(f"mattext-embeddings/faiss/{mod}.index")
+    with open(f"mattext-embeddings/faiss/{mod}_metadata.json") as f:
+        metadata = json.load(f)
+    indices[mod] = {"index": index, "metadata": metadata}
+# Search!
+results = search_vector_db("Fe2O3", "composition", model, tokenizer, indices, config, k=5)
+for score, meta in results:
+    print(f"Score: {score:.4f} | {meta['composition']}")
+```
+### Cross-Modal Query Examples
+```python
+# Query by composition → find across all modalities
+search_vector_db("SiO2", "composition", model, tokenizer, indices, config)
+# Query by natural language → find materials
+search_vector_db("perovskite with high bandgap", "robocrys_rep", model, tokenizer, indices, config)
+# Query by SLICES representation
+search_vector_db("Si O 0 1 o o o", "slices", model, tokenizer, indices, config)
+# Query by CIF data
+search_vector_db("data_SiO2\n_symmetry P1\n...", "cif_symmetrized", model, tokenizer, indices, config)
+# Property-conditioned query
+search_vector_db("composition: Si | bandgap: 1.1200", "property", model, tokenizer, indices, config)
+```
+## 🔬 Evaluation Metrics
+Cross-modal Recall@k: for each material, embed it in modality A, retrieve its nearest neighbors in modality B, and check whether the correct match appears in the top k.
+| Pair | R@1 | R@5 | R@10 |
+|------|-----|-----|------|
+| composition → crystal_text_llm | TBD | TBD | TBD |
+| composition → cif_symmetrized | TBD | TBD | TBD |
+| slices → crystal_text_llm | TBD | TBD | TBD |
+| robocrys_rep → composition | TBD | TBD | TBD |
+*Results will be populated after training.*
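As a rough illustration of the metric, the sketch below computes Recall@k on toy data with only the standard library. It assumes that row `i` of the query set and row `i` of the gallery describe the same material and that similarity is a dot product; `recall_at_k` is an illustrative name, and the toy vectors are left unnormalized to keep the arithmetic easy to follow.

```python
def recall_at_k(query_embs, gallery_embs, k):
    """Fraction of queries whose same-index gallery item ranks in the top k."""
    hits = 0
    for i, q in enumerate(query_embs):
        scores = [sum(a * b for a, b in zip(q, g)) for g in gallery_embs]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Modality A queries and modality B gallery for three materials.
queries = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
gallery = [[0.9, 0.1], [0.8, 0.6], [0.0, 1.0]]

r_at_1 = recall_at_k(queries, gallery, 1)  # only material 0 is ranked first
r_at_2 = recall_at_k(queries, gallery, 2)  # all three appear in the top 2
```

At scale the ranking step would come from the FAISS index rather than a full sort, but the definition of a hit stays the same.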
+## 🧩 Extending: Graph Embeddings
+The architecture supports adding graph neural network (GNN) embeddings:
+```python
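+# Add a GNN projection head
+import torch.nn as nn
+from torch_geometric.nn import SchNet  # DimeNet or CGCNN could be swapped in
+# Assumption: ModalityProjection is defined next to MatTextEncoder in the training script
+from train_mattext_embeddings import ModalityProjection
+class GraphEncoder(nn.Module):
+    def __init__(self, embed_dim=128, hidden=256):
+        super().__init__()
+        self.gnn = SchNet(hidden_channels=hidden, num_filters=128, num_interactions=6)
+        # PyG's SchNet ends in a scalar head (lin2: hidden//2 → 1); widen it so the
+        # pooled per-crystal output is a hidden-d vector instead of a single number.
+        self.gnn.lin2 = nn.Linear(hidden // 2, hidden)
+        self.proj = ModalityProjection(hidden, embed_dim)
+    def forward(self, data):
+        # data: PyG Data with pos, z (atomic numbers), batch
+        h = self.gnn(data.z, data.pos, data.batch)  # [num_graphs, hidden]
+        return self.proj(h)
+# Add to MatTextEncoder:
+model.graph_encoder = GraphEncoder(config.embed_dim)
+model.projections["graph"] = model.graph_encoder.proj
+# Training: treat graph embeddings as another modality in AllPairsCLIP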
+```
+For graph embeddings, convert CIF → PyG Data (using `pymatgen` + `torch_geometric`):
+```python
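+import numpy as np
+import torch
+from pymatgen.core import Structure
+from torch_geometric.data import Data
+def cif_to_graph(cif_string, cutoff=5.0):
+    struct = Structure.from_str(cif_string, fmt="cif")
+    z = torch.tensor(struct.atomic_numbers, dtype=torch.long)
+    pos = torch.tensor(struct.cart_coords, dtype=torch.float)
+    # Neighbor pairs within cutoff (Å), periodic images included;
+    # get_neighbor_list returns (center idx, neighbor idx, image offsets, distances)
+    src, dst, _, _ = struct.get_neighbor_list(cutoff)
+    edge_index = torch.tensor(np.stack([src, dst]), dtype=torch.long)
+    return Data(z=z, pos=pos, edge_index=edge_index)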
+```
+## 📚 References
+- **MatText**: [arxiv:2406.17295](https://arxiv.org/abs/2406.17295) — Dataset and text representations
+- **MultiMat**: [arxiv:2312.00111](https://arxiv.org/abs/2312.00111) — AllPairsCLIP for materials
+- **MatExpert**: [arxiv:2410.21317](https://arxiv.org/abs/2410.21317) — Property↔structure alignment
+- **CrystalCLR**: [arxiv:2211.13408](https://arxiv.org/abs/2211.13408) — Contrastive learning for crystals
+- **SupReMix**: [arxiv:2309.16633](https://arxiv.org/abs/2309.16633) — Property-aware hard negatives
+- **Symile**: [arxiv:2411.01053](https://arxiv.org/abs/2411.01053) — Total-correlation loss for M modalities
+## 📄 License
+MIT
+## 🔗 Dataset
+[n0w0f/MatText](https://huggingface.co/datasets/n0w0f/MatText) — 100k+ crystal structures in 10 text representations