Model Card: Liquid-Web (LFM2.5-VL-450M-WebSight)

Model Details

  • Model Type: Vision-Language Model (VLM)
  • Base Model: LiquidAI/LFM2.5-VL-450M
  • Architecture: Liquid Foundation Model (LFM) with SigLIP2 Vision Encoder.
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Task: Image-to-HTML/Tailwind CSS Generation (UI-to-Code)
  • Language: English
  • License: Apache 2.0 (Inherited from base)

Intended Use

Primary Use Case

This model is designed to take screenshots of web pages as input and generate the corresponding HTML code using Tailwind CSS utilities. It is intended for developers looking to automate the conversion of UI designs or existing web pages into clean, functional code.

Out-of-Scope Use

  • General-purpose image captioning (it is highly specialized for code).
  • Generating scripts for malicious web automation.
  • Production-ready code without human review (the model may hallucinate specific color shades or precise pixel alignments).

Training Data

The model was fine-tuned on the HuggingFaceM4/WebSight (v0.2) dataset.

  • Format: Paired images (screenshots) and text (HTML + Tailwind CSS).
  • Volume: 1,000 high-quality samples (the subset used for fine-tuning).
  • Content: Diverse layouts including landing pages, dashboards, and portfolio sites.

Training Procedure

Hyperparameters

Parameter               Value
---------------------   ---------------------------------
LoRA Rank (r)           64
LoRA Alpha              64
Optimizer               AdamW (8-bit)
Learning Rate           2e-4
Batch Size              1 (per device)
Gradient Accumulation   8
Max Steps               100
Precision               4-bit quantization (NormalFloat4)

Frameworks Used:

  • Unsloth: For memory-efficient training and fast kernels.
  • TRL & PEFT: For the Supervised Fine-Tuning (SFT) loop.
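The table above can be summarized as a small configuration sketch. The exact training script is not published, so the field names below are assumptions following common TRL/PEFT conventions (SFTConfig / LoraConfig), expressed as plain dicts:

```python
# Hypothetical reconstruction of the training configuration from the
# hyperparameter table above (field names follow TRL/PEFT conventions).
lora_config = {
    "r": 64,           # LoRA rank
    "lora_alpha": 64,  # scaling factor: alpha / r = 1.0
}
training_args = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "max_steps": 100,
    "learning_rate": 2e-4,
    "optim": "adamw_8bit",  # AdamW (8-bit), as listed above
}

# Effective batch size seen by the optimizer per update step:
effective_batch = (training_args["per_device_train_batch_size"]
                   * training_args["gradient_accumulation_steps"])
print(effective_batch)  # 1 x 8 = 8 samples per optimizer step
```

With alpha equal to the rank, the LoRA scaling factor is 1.0, so adapter updates are applied at their raw magnitude.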

Performance & Limitations

Strengths

  • Extreme Efficiency: At only 450M parameters, this model provides high-speed inference on edge devices (e.g., Jetson Orin, mobile).
  • Modern CSS: Specifically trained on Tailwind CSS (v2.2+) output, so it emits utility-class markup rather than bloated bespoke stylesheets.

Limitations

  • Hallucinations: The model may occasionally invent Tailwind classes that do not exist (e.g., text-custom-500).
  • Complexity: Very deep or complex nested layouts might result in truncated HTML due to the max_seq_length limit.
  • Resolution: Fine details in very small text within screenshots may be missed by the vision encoder.
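Given the truncation risk above, a lightweight post-generation check can flag incomplete output before it reaches a human reviewer. The helper below is a minimal sketch (hypothetical, not part of the model release) that compares counts of opening and closing tags, ignoring void elements:

```python
import re

def looks_truncated(html: str) -> bool:
    """Heuristic: a surplus of unclosed tags suggests the generation
    was cut off by the max_seq_length limit. (Hypothetical helper.)"""
    void_tags = {"img", "br", "hr", "input", "meta", "link"}
    # Opening tags, excluding self-closing forms like <br/>.
    opens = [t.lower() for t in
             re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>", html)]
    closes = [t.lower() for t in re.findall(r"</([a-zA-Z][a-zA-Z0-9]*)>", html)]
    opens = [t for t in opens if t not in void_tags]
    return len(opens) > len(closes)

print(looks_truncated('<div class="p-4"><p>hi</p>'))         # True: <div> never closed
print(looks_truncated('<div class="p-4"><p>hi</p></div>'))   # False: balanced
```

This is a rough tag count, not a parser; for production use, a real HTML parser (e.g., html5lib) would be more robust.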

How to Get Started

The snippet below sketches a typical inference flow. Note that for vision-language checkpoints, Unsloth's entry point is FastVisionModel (not the text-only FastLanguageModel), which returns the multimodal processor alongside the model; exact call signatures may vary across Unsloth and Transformers versions, so verify against their current documentation.

from unsloth import FastVisionModel
from PIL import Image

# Load the fine-tuned adapter in 4-bit.
model, processor = FastVisionModel.from_pretrained("saadxsalman/LFM-WebSight-Tailwind", load_in_4bit=True)
FastVisionModel.for_inference(model)

# Build a multimodal chat prompt: one screenshot plus an instruction.
image = Image.open("screenshot.png")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Generate the HTML/Tailwind code for this page."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(output[0], skip_special_tokens=True))

Citation

If you use this model, please cite the original WebSight technical report:

@misc{laurençon2024unlocking,
      title={Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset}, 
      author={Hugo Laurençon and Léo Tronchon and Victor Sanh},
      year={2024},
      eprint={2403.09029},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}

