---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen3-Embedding-8B
tags:
- embedding
- retriever
- RAG
pipeline_tag: feature-extraction
library_name: transformers
---

# SFT-Emb-8B

[![Paper](https://img.shields.io/badge/Paper-arXiv%3A2512.17220-red)](https://arxiv.org/pdf/2512.17220) [![Model](https://img.shields.io/badge/HuggingFace-SFT--Emb--8B-yellow)](https://huggingface.co/MindscapeRAG/SFT-Emb-8B)

This repository provides the inference implementation for **SFT-Emb**, a supervised fine-tuned embedding model serving as a baseline retriever in the **MiA-RAG** framework.

Unlike [**MiA-Emb**](https://huggingface.co/MindscapeRAG/MiA-Emb-8B), which conditions on both the query and a global summary (Mindscape), **SFT-Emb** operates on the **query alone** — without any global summary or residual connection. This makes it a standard retrieval baseline that does not leverage document-level semantic scaffolding.

---

## ✨ Key Features

- **Standard Query-Only Retrieval**
  Encodes queries without any global summary, serving as a strong SFT baseline for comparison with Mindscape-aware models.

- **Dual-Granularity Retrieval**
  - **Chunk Retrieval** for narrative passages (standard RAG)
  - **Node Retrieval** for knowledge graph entities (GraphRAG-style)

- **Same Architecture, Simpler Input**
  Built on the same Qwen3-Embedding-8B backbone and LoRA fine-tuning as MiA-Emb, but without the Mindscape summary injection or residual embedding mechanism.

---

## 🚀 Usage

### Installation

```bash
pip install torch "transformers>=4.53.0"
```

---

### 1) Initialization

> SFT-Emb-8B is initialized from **`Qwen3-Embedding-8B`**.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Configuration
device = "cuda" if torch.cuda.is_available() else "cpu"

# Inference Parameters
node_delimiter = "<|repo_name|>"  # Special token for Node tasks

# Load Tokenizer (base)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-Embedding-8B",
    trust_remote_code=True,
    padding_side="left"
)

# Load Model
model = AutoModel.from_pretrained(
    "MindscapeRAG/SFT-Emb-8B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map={"": 0}
)
```

---

### 2) Chunk Retrieval

Use this mode to retrieve narrative text chunks. The query is encoded **without** any global summary.

```python
def get_query_prompt(query):
    """Construct input prompt (query-only, no summary)."""
    task_desc = "Given a search query, retrieve relevant chunks or helpful entities summaries from the given context that answer the query"
    return (
        f"Instruct: {task_desc}\n"
        f"Query: {query}{node_delimiter}"
    )

def last_token_pool(last_hidden_states, attention_mask):
    """Extract the last non-padding token embedding."""
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        return last_hidden_states[:, -1]
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def encode_chunk(texts):
    batch = tokenizer(
        texts,
        max_length=4096,
        padding=True,
        truncation=True,
        return_tensors="pt"
    ).to(model.device)

    outputs = model(**batch)

    # Embedding (Last Token)
    emb = last_token_pool(outputs.last_hidden_state, batch["attention_mask"])
    emb = F.normalize(emb, p=2, dim=-1)
    return emb

# --- Example ---
query = "Who is the protagonist?"
chunk = "Harry looked at the scar on his forehead."

# Encode
q_emb = encode_chunk([get_query_prompt(query)])
c_emb = encode_chunk([chunk])

# Score
score = q_emb @ c_emb.T
print(f"Chunk Similarity: {score.item():.4f}")
```
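
The same helpers extend naturally to more than one candidate: encode a pool of chunks in a single batch and rank them by similarity. The sketch below is illustrative and not part of the original card; the candidate chunks are made-up examples, and it assumes the `encode_chunk` and `get_query_prompt` definitions above have already been run.

```python
# Illustrative sketch (not from the original card): rank a small pool of
# hypothetical chunks against one query using the helpers defined above.
corpus = [
    "Harry looked at the scar on his forehead.",
    "Hermione raised her hand before anyone else.",
    "The train left from platform nine and three-quarters.",
]

q_emb = encode_chunk([get_query_prompt("Who is the protagonist?")])  # shape: [1, dim]
c_embs = encode_chunk(corpus)                                        # shape: [len(corpus), dim]

# Embeddings are L2-normalized, so the dot product is cosine similarity
scores = (q_emb @ c_embs.T).squeeze(0)
top_k = torch.topk(scores, k=2)
for s, i in zip(top_k.values.tolist(), top_k.indices.tolist()):
    print(f"{s:.4f}  {corpus[i]}")
```

Because `encode_chunk` already L2-normalizes its outputs, no extra normalization is needed before ranking.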

---

### 3) Node Retrieval

SFT-Emb can retrieve knowledge graph entities (**Nodes**). This mode extracts embeddings from the `<|repo_name|>` token position.

**Candidate format:** `Entity Name : Entity Description`

Example: `Mary Campbell Smith : Mary Campbell Smith is mentioned as the translator...`

```python
def extract_specific_token(outputs, batch, token_id):
    """Extract embedding at the position of a specific token."""
    input_ids = batch["input_ids"]
    hidden = outputs.last_hidden_state

    mask = (input_ids == token_id)
    # Take the last occurrence of the token for each sample
    positions = mask.long().cumsum(dim=1).eq(mask.long().sum(dim=1, keepdim=True)) & mask
    return hidden[positions]

def encode_node_query(texts, node_delimiter="<|repo_name|>"):
    batch = tokenizer(texts, padding=True, return_tensors="pt").to(model.device)
    outputs = model(**batch)

    # Node Main Embedding: extract from <|repo_name|> position
    node_id = tokenizer.encode(node_delimiter, add_special_tokens=False)[0]
    q_emb_node = extract_specific_token(outputs, batch, node_id)
    q_emb_node = F.normalize(q_emb_node, p=2, dim=-1)
    return q_emb_node

# --- Example ---
query = "Who is the protagonist?"

# 1) Encode Query (Node Token)
q_emb_node = encode_node_query([get_query_prompt(query)])

# 2) Encode Entity Candidate
entity_text = "Harry Potter : The main protagonist of the series..."
n_emb = encode_chunk([entity_text])

# 3) Score
score = q_emb_node @ n_emb.T
print(f"Node Similarity: {score.item():.4f}")
```

---

## 📜 Citation

If you find this work useful, please cite:

```bibtex
@misc{li2025mindscapeawareretrievalaugmentedgeneration,
      title={Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding},
      author={Yuqing Li and Jiangnan Li and Zheng Lin and Ziyan Zhou and Junjie Wu and Weiping Wang and Jie Zhou and Mo Yu},
      year={2025},
      eprint={2512.17220},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.17220},
}
```

---