# MatText Aligned Embeddings v2: Multi-Modal Material Retrieval with Natural Language Queries

**A CLIP-style multi-modal embedding model that aligns 10+ material text representations into a shared 128-d vector space. Query with natural language ("oxide with high bandgap"), composition, CIF, SLICES, or any modality → retrieve matching materials.**

## v2 Key Features

| Feature | v1 | v2 |
|---------|----|----|
| Context length | 512 tokens | **1024 tokens** (captures long CIFs) |
| Natural language queries | ❌ | ✅ **"oxide with high bandgap"** |
| Property-aware retrieval | Basic | **LaCLIP-style diverse NL descriptions** |
| GPU optimization | fp16 / 24GB | **bf16 / 80GB A100 optimized** |
| Effective batch size | 256 | **288** |
| Modalities per step | 4 | **5** |
| Flash Attention 2 | ❌ | ✅ **(auto-detect)** |

## Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                      MatTextEncoder (157M params)                      │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Shared Backbone: ModernBERT-base (150M params, 8192 ctx)        │  │
│  │  Mean pooling → 768-d representation                             │  │
│  │  Gradient checkpointing + bf16                                   │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│      ▼             ▼                  ▼                 ▼              │
│ ┌──────────┐  ┌──────────┐  ┌───────────────────┐  ┌──────────┐        │
│ │ comp     │  │ cif_sym  │  │ nl_property_desc  │  │ property │ ...×12 │
│ │ 768→768  │  │ 768→768  │  │ 768→768→128       │  │ 768→768  │        │
│ │ →128     │  │ →128     │  │ "oxide with high  │  │ →128     │        │
│ │          │  │          │  │  bandgap" queries │  │          │        │
│ └────┬─────┘  └────┬─────┘  └─────────┬─────────┘  └────┬─────┘        │
│      ▼             ▼                  ▼                 ▼              │
│   128-d L2      128-d L2          128-d L2           128-d L2          │
│                                                                        │
│                 ──── Shared 128-d Embedding Space ────                 │
│            (FAISS IndexFlatIP for cosine similarity search)            │
└────────────────────────────────────────────────────────────────────────┘
```
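
The backbone's mean pooling step (token embeddings reduced to a single 768-d vector, with padding positions masked out) can be sketched as follows; this is an illustration of the operation, not the repo's exact code:

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()   # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # (B, 1), avoid divide-by-zero
    return summed / counts

# Toy check: two tokens, the second one padded out
h = torch.tensor([[[1.0, 3.0], [9.0, 9.0]]])      # (1, 2, 2)
m = torch.tensor([[1, 0]])
pooled = mean_pool(h, m)                          # -> tensor([[1., 3.]])
```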

### 12 Projection Heads

| # | Head | Input | Purpose |
|---|------|-------|---------|
| 1 | `composition` | "Fe2O3" | Formula queries |
| 2 | `atom_sequences` | "Fe Fe O O O" | Element list queries |
| 3 | `cif_symmetrized` | Full CIF | Paste CIF data |
| 4 | `cif_p1` | CIF in P1 | P1 space group CIF |
| 5 | `zmatrix` | Z-matrix coords | Internal coordinates |
| 6 | `atom_sequences_plusplus` | Elements + lattice | Atom sequence + cell |
| 7 | `slices` | SLICES encoding | Compact structure encoding |
| 8 | `crystal_text_llm` | Gruver format | Lattice + coords text |
| 9 | `local_env` | SMILES-like env | Local bonding environment |
| 10 | `robocrys_rep` | NL description | "FeO crystallizes in..." |
| 11 | **`nl_property_description`** | **Free-form NL** | **"oxide with high bandgap"** |
| 12 | `property` | Structured props | "bandgap: 2.1 eV" |
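
Each head follows the 768→768→128 pattern shown in the diagram. A minimal sketch of one such head (the hidden activation and exact layer layout are assumptions; the repo's own projection module may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """768 -> 768 -> 128 MLP whose outputs are L2-normalized."""
    def __init__(self, in_dim: int = 768, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.GELU(),
            nn.Linear(in_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm outputs: inner product == cosine similarity
        return F.normalize(self.net(x), dim=-1)

head = ProjectionHead()
z = head(torch.randn(4, 768))   # z.shape == (4, 128), each row unit-norm
```

L2-normalizing the output is what lets FAISS `IndexFlatIP` (inner product) act as cosine-similarity search over the shared space.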

## How NL Queries Work

The key innovation is a **LaCLIP-style** training approach ([arxiv:2305.20088](https://arxiv.org/abs/2305.20088)):

1. **During Phase 2 training**, for each material with known properties (bandgap, formation energy), we generate **diverse natural language descriptions** from templates:
   - `"A wide bandgap oxide suitable for UV applications, bandgap 3.20 eV"`
   - `"TiO2: oxide semiconductor with wide band gap of 3.20 electron volts"`
   - `"This binary oxide (TiO2) exhibits a wide bandgap of approximately 3.20 eV"`

2. These NL descriptions are passed through a **dedicated `nl_property_description` projection head** and aligned with ALL structure modalities via InfoNCE.

3. **At inference**, when you query `"oxide with high bandgap"`, the model maps it through the same NL head into the shared embedding space, and FAISS finds the nearest materials: those that were trained to sit close to similar descriptions.

This is distinct from `robocrys_rep`, which describes crystal *structure* ("FeO crystallizes in the rock salt structure..."); the NL query head describes *properties* ("wide bandgap oxide").
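
The template-based generation in step 1 can be sketched like this (the template wording and the 3.0 eV wide/narrow cutoff are illustrative assumptions, not the repo's actual templates):

```python
import random

TEMPLATES = [
    "A {width} bandgap {family} with bandgap {bg:.2f} eV",
    "{formula}: {family} with a {width} band gap of {bg:.2f} electron volts",
    "This {family} ({formula}) exhibits a {width} bandgap of approximately {bg:.2f} eV",
]

def describe(formula: str, family: str, bandgap_ev: float, rng: random.Random) -> str:
    """Sample one diverse NL description for a material with a known bandgap."""
    width = "wide" if bandgap_ev >= 3.0 else "narrow"   # illustrative cutoff
    tpl = rng.choice(TEMPLATES)
    return tpl.format(formula=formula, family=family, width=width, bg=bandgap_ev)

rng = random.Random(0)
print(describe("TiO2", "oxide", 3.20, rng))
```

Sampling a different template per epoch is what gives the NL head the paraphrase diversity that makes free-form queries work at inference time.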

## Training Recipe

### Two-Phase Training

**Phase 1 – Multi-modal alignment** (pretrain100k_v2, 60k samples, 3 epochs):
- AllPairsCLIP loss across 10 modalities
- Random modality sampling (5/10 per step); composition + crystal_text_llm are always included
- Effective batch size 288

**Phase 2 – Property-conditioned + NL query alignment** (bandgap + formation_energy, 60k samples, 3 epochs):
- AllPairsCLIP loss (structure modalities)
- **NL description ↔ structure InfoNCE** (the key NL query loss)
- Property ↔ composition/crystal_text_llm InfoNCE ([MatExpert](https://arxiv.org/abs/2410.21317))
- SupReMix-style property similarity MSE ([arxiv:2309.16633](https://arxiv.org/abs/2309.16633))
- Loss weights: `L = L_clip + 0.3 * L_property + 0.5 * L_nl`
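
The Phase 2 objective can be sketched with symmetric InfoNCE as the building block. How the terms are wired together below is an assumption based on the weights above; the repo's actual `AllPairsCLIP` implementation may differ in detail:

```python
import torch
import torch.nn.functional as F

def info_nce(za, zb, temperature: float = 0.07):
    """Symmetric InfoNCE between two L2-normalized embedding batches."""
    logits = za @ zb.t() / temperature
    labels = torch.arange(za.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def all_pairs_clip(embeds: dict):
    """Average InfoNCE over every unordered pair of modalities (MultiMat-style)."""
    mods = list(embeds)
    losses = [info_nce(embeds[a], embeds[b])
              for i, a in enumerate(mods) for b in mods[i + 1:]]
    return torch.stack(losses).mean()

def phase2_loss(struct: dict, prop, nl):
    """L = L_clip + 0.3 * L_property + 0.5 * L_nl."""
    l_clip = all_pairs_clip(struct)
    l_prop = info_nce(prop, struct["composition"])                       # property <-> structure
    l_nl = torch.stack([info_nce(nl, z) for z in struct.values()]).mean()  # NL <-> all structure
    return l_clip + 0.3 * l_prop + 0.5 * l_nl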

### Based On

| Paper | Contribution | ArXiv |
|-------|-------------|-------|
| **MultiMat** | AllPairsCLIP loss | [2312.00111](https://arxiv.org/abs/2312.00111) |
| **MatExpert** | Property ↔ structure InfoNCE | [2410.21317](https://arxiv.org/abs/2410.21317) |
| **LaCLIP** | LLM text augmentation for CLIP | [2305.20088](https://arxiv.org/abs/2305.20088) |
| **SupReMix** | Property-label-aware soft contrastive | [2309.16633](https://arxiv.org/abs/2309.16633) |
| **CrystalCLR** | Composition similarity | [2211.13408](https://arxiv.org/abs/2211.13408) |

### Hyperparameters

```yaml
encoder: answerdotai/ModernBERT-base
embed_dim: 128
max_length: 1024 tokens
batch_size: 48 × 6 grad_accum = 288 effective
learning_rate: 2e-5 (phase 1), 1e-5 (phase 2)
temperature: learnable (init 0.07)
epochs: 3 per phase
optimizer: AdamW (weight_decay=0.01)
precision: bf16 (A100) / fp16 (T4/V100)
gradient_checkpointing: True
max_modalities_per_step: 5
```
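
The `temperature: learnable (init 0.07)` line follows the usual CLIP parameterization: optimize the log of the inverse temperature so it stays positive during training. A sketch (variable names are assumptions):

```python
import math
import torch
import torch.nn as nn

# Store log(1/T); exponentiating keeps the effective scale positive
logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

def scaled_logits(za: torch.Tensor, zb: torch.Tensor) -> torch.Tensor:
    # Clamp the scale at 100 (i.e. T >= 0.01), as in the original CLIP recipe
    return logit_scale.clamp(max=math.log(100)).exp() * (za @ zb.t())
```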

## Quick Start

### Training (your GPU)

```bash
pip install torch transformers datasets faiss-cpu huggingface_hub trackio accelerate

# Optional but recommended for A100/H100:
pip install flash-attn --no-build-isolation

python train_mattext_embeddings.py
```

The script auto-detects:
- GPU capability (bf16 for Ampere+, fp16 otherwise)
- Flash Attention 2 availability
- CUDA vs CPU
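
Those checks boil down to a few capability probes; a sketch of the detection logic (the script's actual function names may differ):

```python
import importlib.util
import torch

def detect_settings() -> dict:
    cuda = torch.cuda.is_available()
    # bf16 needs compute capability >= 8.0 (Ampere or newer)
    bf16 = cuda and torch.cuda.get_device_capability()[0] >= 8
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return {
        "device": "cuda" if cuda else "cpu",
        "dtype": torch.bfloat16 if bf16 else (torch.float16 if cuda else torch.float32),
        "attn_implementation": "flash_attention_2" if (cuda and has_flash) else "sdpa",
    }

print(detect_settings())
```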

### Inference & Search

```python
import torch
import faiss
import json
import numpy as np
from transformers import AutoTokenizer
from train_mattext_embeddings import MatTextEncoder, Config, search_vector_db

# Load model + tokenizer
config = Config()
config.device = "cuda" if torch.cuda.is_available() else "cpu"
model = MatTextEncoder(config)
model.load_state_dict(torch.load("mattext-embeddings/model.pt", map_location=config.device))
model = model.to(config.device).eval()
tokenizer = AutoTokenizer.from_pretrained(config.encoder_name)

# Load the per-modality FAISS indices and their metadata
indices = {}
for mod in ["composition", "crystal_text_llm", "slices", "cif_symmetrized", "robocrys_rep"]:
    index = faiss.read_index(f"mattext-embeddings/faiss/{mod}.index")
    with open(f"mattext-embeddings/faiss/{mod}_metadata.json") as f:
        metadata = json.load(f)
    indices[mod] = {"index": index, "metadata": metadata}
```

### Query Examples

```python
# Natural language property queries (THE KEY FEATURE)
search_vector_db("oxide with high bandgap", "nl_property_description", model, tokenizer, indices, config)
search_vector_db("stable ternary nitride", "nl_property_description", model, tokenizer, indices, config)
search_vector_db("narrow bandgap semiconductor for IR", "nl_property_description", model, tokenizer, indices, config)
search_vector_db("metallic binary compound", "nl_property_description", model, tokenizer, indices, config)

# Composition queries
search_vector_db("Fe2O3", "composition", model, tokenizer, indices, config)
search_vector_db("BaTiO3", "composition", model, tokenizer, indices, config)

# Structure description queries
search_vector_db("perovskite with octahedral coordination", "robocrys_rep", model, tokenizer, indices, config)

# Structured property queries
search_vector_db("composition: TiO2 | bandgap: 3.2000", "property", model, tokenizer, indices, config)

# CIF queries (paste your CIF)
search_vector_db("data_TiO2\n_symmetry P1\n_cell 4.59 4.59 2.96 90 90 90", "cif_symmetrized", ...)

# SLICES queries
search_vector_db("Ti O 0 1 o o o", "slices", model, tokenizer, indices, config)
```
## Evaluation Metrics

Cross-modal Recall@k on the test set:

| Pair | R@1 | R@5 | R@10 | R@20 |
|------|-----|-----|------|------|
| composition → crystal_text_llm | TBD | TBD | TBD | TBD |
| composition → cif_symmetrized | TBD | TBD | TBD | TBD |
| composition → slices | TBD | TBD | TBD | TBD |
| slices → crystal_text_llm | TBD | TBD | TBD | TBD |
| robocrys_rep → composition | TBD | TBD | TBD | TBD |
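
Recall@k here means: encode the same test materials under two modalities and check how often material i's modality-B embedding ranks in the top k when querying with its modality-A embedding. A numpy sketch:

```python
import numpy as np

def recall_at_k(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int) -> float:
    """Fraction of queries whose true match (same row index) is in the top-k."""
    sims = query_emb @ gallery_emb.T                 # cosine if rows are unit-norm
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(query_emb))[:, None]).any(axis=1)
    return float(hits.mean())

# Perfectly aligned embeddings give R@1 = 1.0
z = np.eye(4, dtype=np.float32)
print(recall_at_k(z, z, k=1))   # -> 1.0
```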

NL Query Results:

| Query | Top-1 Match | Score |
|-------|------------|-------|
| "oxide with high bandgap" | TBD | TBD |
| "narrow bandgap semiconductor" | TBD | TBD |
| "stable binary oxide" | TBD | TBD |

*Results populated after training.*

## Extending: Graph Embeddings

The architecture is plug-and-play for new modalities:

```python
# Add a GNN modality
import torch.nn as nn
from torch_geometric.nn import SchNet
from train_mattext_embeddings import ModalityProjection

class GraphEncoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.gnn = SchNet(hidden_channels=256)
        self.proj = ModalityProjection(256, embed_dim)

    def forward(self, data):
        # SchNet consumes atomic numbers, positions, and the batch assignment vector
        h = self.gnn(data.z, data.pos, data.batch)
        return self.proj(h)

# Register as a new modality; it gets aligned automatically through AllPairsCLIP
graph_encoder = GraphEncoder(embed_dim=128)
model.projections["graph"] = graph_encoder.proj
```

## Dataset

[n0w0f/MatText](https://huggingface.co/datasets/n0w0f/MatText): 100k+ crystal structures in 10+ text representations

## References

- **MatText**: [arxiv:2406.17295](https://arxiv.org/abs/2406.17295)
- **MultiMat**: [arxiv:2312.00111](https://arxiv.org/abs/2312.00111)
- **MatExpert**: [arxiv:2410.21317](https://arxiv.org/abs/2410.21317)
- **LaCLIP**: [arxiv:2305.20088](https://arxiv.org/abs/2305.20088)
- **SupReMix**: [arxiv:2309.16633](https://arxiv.org/abs/2309.16633)
- **CrystalCLR**: [arxiv:2211.13408](https://arxiv.org/abs/2211.13408)
- **Symile**: [arxiv:2411.01053](https://arxiv.org/abs/2411.01053)

## License

MIT