Model Card for InternVL3Fangwusha2B

InternVL3Fangwusha2B is a 2B-parameter vision-language model (VLM) fine-tuned from InternVL3-2B, optimized for Chinese multimodal understanding, image-text interaction, document analysis, and visual content reasoning in practical application scenarios.

Model Details

Model Description

This model is a lightweight multimodal large model based on the InternVL3-2B architecture, focusing on improving Chinese image-text alignment, visual question answering, document OCR + understanding, and daily scene multimodal interaction. It provides efficient inference while maintaining strong multimodal capabilities.

  • Developed by: Yougen Yuan
  • Funded by [optional]: Personal Research Project
  • Shared by [optional]: Yougen Yuan
  • Model type: Vision-Language Model (VLM), Multimodal Model
  • Language(s) (NLP): Chinese (Simplified)
  • License: Apache-2.0
  • Finetuned from model [optional]: InternVL3-2B

Model Sources [optional]

[More Information Needed]

Uses

Direct Use

This model can be directly used for:

  • Chinese visual question answering (VQA)
  • Image captioning and description generation
  • Document image analysis (tables, forms, scanned files)
  • OCR-assisted text understanding from images
  • Daily scene multimodal dialogue and information extraction

Downstream Use [optional]

The model can be further fine-tuned for:

  • Enterprise document intelligent processing systems
  • Educational image-text question answering
  • E-commerce product image understanding
  • Mobile-side lightweight multimodal assistants
  • RAG systems combined with visual information

Out-of-Scope Use

  • Not designed for high-risk visual decision-making (medical imaging, autonomous driving, industrial detection without professional review)
  • Not suitable for generating harmful, illegal, pornographic, or privacy-violating multimodal content
  • Not optimized for non-Chinese languages
  • Not intended for ultra-high-resolution image analysis beyond its input constraints

Bias, Risks, and Limitations

  • The model may inherit social, cultural, and visual biases present in the pre-trained InternVL3 and public image-text datasets.
  • It may produce visual hallucinations, misrecognize objects, or generate inconsistent descriptions for blurry, low-light, or highly abstract images.
  • Performance on specialized professional images (medical, remote sensing, microscopic) is limited without domain fine-tuning.
  • The model does not have autonomous fact-checking and may generate factually incorrect multimodal responses.

Recommendations

All outputs in professional or production scenarios should be verified by humans. Visual inputs containing sensitive or private information require appropriate filtering and protection. For high-precision visual tasks, it is recommended to integrate dedicated specialized modules rather than relying on the model alone. Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Yougen/InternVL3Fangwusha2B"
# InternVL models ship custom modeling code, so trust_remote_code is required
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example image-text interaction (pixel_values must be prepared with the
# repository's image preprocessing utilities; see the InternVL3 documentation)
# pixel_values = load_image("your_image.jpg").to(torch.bfloat16).cuda()
# generation_config = dict(max_new_tokens=256)
# response = model.chat(tokenizer, pixel_values, "请描述这张图片的内容", generation_config)
# print(response)

Training Details

Training Data

Training data includes high-quality Chinese image-text pairs, document images, daily scene photos, and instruction-based multimodal dialogue data. All data is processed with deduplication, noise filtering, and quality screening.

Training Procedure

Preprocessing [optional]

  • Image resizing and normalization
  • Text cleaning and instruction template construction
  • Multimodal alignment tokenization
  • Noise removal for low-quality image-text pairs
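The image-side preprocessing steps above can be sketched as follows. This is a minimal illustration, not the released pipeline: the 448-pixel tile size and ImageNet normalization statistics are assumptions based on common InternVL-style configurations, and the instruction template is hypothetical.

```python
# Sketch of image normalization and prompt templating (assumed values,
# not the actual InternVL3 image processor configuration).
IMAGE_SIZE = 448                      # assumed tile size after resizing
MEAN = (0.485, 0.456, 0.406)          # ImageNet normalization means
STD = (0.229, 0.224, 0.225)           # ImageNet normalization stds

def normalize_pixel(rgb):
    """Scale an 8-bit RGB triple to [0, 1], then apply mean/std normalization."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))

def build_prompt(question):
    """Wrap a user question in a simple instruction template with an image slot."""
    return f"<image>\n{question}"

print(normalize_pixel((124, 116, 103)))
print(build_prompt("请描述这张图片的内容"))
```

In a real pipeline these per-pixel operations are vectorized over whole image tensors by the model's bundled image processor.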

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Learning rate: 2e-5
  • Batch size: 8
  • Optimizer: AdamW
  • Weight decay: 0.01
  • Epochs: 3
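For reference, the hyperparameters above can be collected into a single configuration, with a helper for the resulting number of optimizer steps. This is a sketch only; the actual training script is not released, and the example dataset size is hypothetical.

```python
import math

# Hyperparameters from the model card, gathered into one config dict
config = {
    "precision": "bf16",          # mixed-precision training regime
    "learning_rate": 2e-5,
    "per_device_batch_size": 8,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "num_epochs": 3,
}

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Steps per epoch (ceil of examples / batch size) times number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs

# Hypothetical dataset of 1,000 image-text pairs
print(total_optimizer_steps(1000, config["per_device_batch_size"], config["num_epochs"]))
```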

Speeds, Sizes, Times [optional]

  • Model size: 2B parameters
  • Training hardware: NVIDIA RTX 4090 / A100 GPU
  • Training time: Several hours

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal Chinese multimodal test set including VQA, image captioning, and document understanding tasks.

Factors

Image quality, text complexity, scene type, document layout complexity.

Metrics

  • BLEU, CIDEr (captioning)
  • Accuracy (VQA)
  • Character recognition accuracy (document OCR)
  • Human evaluation of fluency and rationality
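Two of the metrics above can be computed with short reference implementations: VQA accuracy as exact-match rate, and character recognition accuracy as 1 minus the character error rate (edit distance divided by reference length). This is a generic sketch of these standard formulas, not the evaluation code used for this model.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def vqa_accuracy(preds, golds):
    """Fraction of predictions that exactly match the reference answers."""
    return sum(p.strip() == g.strip() for p, g in zip(preds, golds)) / len(golds)

def char_accuracy(pred, gold):
    """1 - CER, clipped at 0, where CER = edit_distance / reference length."""
    return max(0.0, 1.0 - edit_distance(pred, gold) / len(gold))

print(vqa_accuracy(["猫", "狗"], ["猫", "鸟"]))   # one of two answers correct
print(char_accuracy("北京大学", "北京大字"))      # one substituted character out of four
```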

Results

[More Information Needed]

Summary

The model achieves stable performance on Chinese multimodal understanding and interaction tasks, and its efficiency makes it suitable for edge and mid-range deployment scenarios.

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA GPU
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]
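Once the missing values above are known, the standard estimate is energy consumed (kWh) times the grid's carbon intensity, optionally scaled by the data center's power usage effectiveness (PUE). The sketch below uses placeholder values for GPU power draw, PUE, and carbon intensity; none of them are measured figures for this model.

```python
def carbon_kg(gpu_power_w, hours, pue=1.1, intensity_kg_per_kwh=0.4):
    """Estimated CO2-equivalent emissions in kg.

    energy_kwh = (power in kW) * hours * PUE
    emissions  = energy_kwh * grid carbon intensity (kg CO2e per kWh)
    All default values are illustrative assumptions.
    """
    energy_kwh = gpu_power_w / 1000.0 * hours * pue
    return energy_kwh * intensity_kg_per_kwh

# Hypothetical example: a 400 W GPU running for 10 hours
print(carbon_kg(400, 10))
```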

Technical Specifications [optional]

Model Architecture and Objective

Vision-language architecture with visual encoder + large language decoder, based on InternVL3. Optimized for Chinese multimodal understanding, alignment, generation, and document analysis.

Compute Infrastructure

Hardware

NVIDIA GPU with CUDA support

Software

  • PyTorch
  • Hugging Face Transformers
  • Accelerate
  • TorchVision
  • Pillow

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

  • VLM (Vision-Language Model): A model that unifies visual and language understanding for cross-modal tasks.
  • InternVL3: A series of open-source vision-language models developed by OpenGVLab (Shanghai AI Laboratory).
  • Multimodal Alignment: The ability to correctly associate visual content with natural language descriptions.

More Information [optional]

For updates and issues, please visit the model repository on Hugging Face Hub.

Model Card Authors [optional]

Yougen Yuan

Model Card Contact

[More Information Needed]
