Paper: VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization (2602.09934)
POINTS-Seeker-8B is a state-of-the-art multimodal agentic search model, built from scratch to overcome the epistemic limits of static parametric knowledge in large multimodal models (LMMs). Rather than bolting search tools onto an existing LMM, POINTS-Seeker is natively trained with Agentic Seeding, a dedicated phase that instills the foundational precursors of agentic behavior, and is equipped with V-Fold, an adaptive, history-aware compression scheme that resolves the performance bottleneck of long-horizon interactions. As a result, POINTS-Seeker-8B achieves superior performance on long-horizon, knowledge-intensive visual reasoning tasks.
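To give an intuition for what history-aware compression buys in long-horizon agent loops, here is a minimal, purely illustrative sketch: keep the most recent turns verbatim and fold older turns into truncated stubs so the context stays bounded. This is a toy stand-in, not the actual V-Fold algorithm; the function name, parameters, and message format are all assumptions made for illustration.

```python
# Toy sketch of history-aware compression (NOT the real V-Fold):
# keep the last `keep_last` turns verbatim and fold earlier turns
# into short truncated stubs so long interaction histories stay bounded.

def fold_history(messages, keep_last=4, max_summary_chars=80):
    """Compress all but the last `keep_last` messages into truncated stubs."""
    if len(messages) <= keep_last:
        return list(messages)
    folded = []
    for msg in messages[:-keep_last]:
        text = msg['content']
        stub = text[:max_summary_chars] + ('...' if len(text) > max_summary_chars else '')
        folded.append({'role': msg['role'], 'content': f'[folded] {stub}'})
    return folded + messages[-keep_last:]

# Ten long "search" turns; after folding, the oldest six become short stubs
history = [{'role': 'user', 'content': f'search step {i}: ' + 'x' * 200}
           for i in range(10)]
compressed = fold_history(history)
print(len(compressed))  # 10 turns remain, but only the last 4 are verbatim
```

The real scheme is adaptive rather than a fixed-length truncation, but the shape of the trade is the same: the token cost of old turns shrinks while recent, decision-relevant context is preserved.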
Please first install WePOINTS using the following commands:

```shell
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

user_prompt = "explain the image"  # replace with your instruction
image_path = 'your image path'     # replace with the path to your image
model_path = 'tencent/POINTS-Seeker'

# Load the model, tokenizer, and image processor
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map='cuda'
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)

# Build a single-turn message with one image and one text instruction
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'user',
        'content': content
    }
]

# Greedy decoding, up to 2048 new tokens
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False
}
response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```
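For multi-turn use, a natural extension of the quickstart is to append the model's reply and a follow-up question to `messages` before calling `model.chat(...)` again. The snippet below only shows the message-list structure; the assistant-turn content format and the follow-up text are assumptions, so check the repo for the exact convention.

```python
# Sketch of a multi-turn message list, mirroring the quickstart format.
# The assistant-turn content structure here is an assumption for illustration.

messages = [
    {'role': 'user', 'content': [
        dict(type='image', image='your image path'),
        dict(type='text', text='explain the image'),
    ]}
]

response = '...'  # placeholder for the string returned by model.chat(...)
messages.append({'role': 'assistant',
                 'content': [dict(type='text', text=response)]})
messages.append({'role': 'user',
                 'content': [dict(type='text', text='a follow-up question')]})

print(len(messages))  # 3 turns: user / assistant / user
```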
For more details, please refer to our GitHub repo.
Base model: Qwen/Qwen3-8B-Base