Lizzy 7B

Lizzy 7B header figure (light theme)

Model Name And Summary

Lizzy 7B is an open-weight Flower Labs assistant model in the Lizzy family.

Architecture And Configuration

Lizzy 7B is a 7B-class decoder-only transformer with long-context support, sliding/local attention behaviour, custom chat/control tokens, and deployment-specific serving configurations.

Representative configuration points:

  • 7B-class parameter scale with a 32-layer stack;
  • long-context configuration up to 65k tokens with runtime caps adjusted by deployment profile;
  • 32 attention heads with long-context/sliding-attention behaviour;
  • custom tokenizer and chat markers for instruction-style prompting;
  • deployment variants may include quantised revisions, runtime patches, and serving-time configuration changes.

Training Approach

Lizzy 7B follows a multi-stage training approach that combines:

  • pre-training on large-scale public text, document, code, math, and encyclopedic corpora;
  • supervised fine-tuning on instruction-following, dialogue, reasoning, and tool-use examples;
  • direct preference optimisation on preference pairs for helpfulness, style, and answer quality;
  • reinforcement learning with verifiable rewards for targeted behavioural refinement.
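
The direct preference optimisation stage can be illustrated with a minimal pure-Python sketch of the standard DPO objective on a single preference pair. The log-probability inputs below are toy numbers, and this is the generic published loss, not Flower Labs' actual training code:

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the trained policy and the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): small when the policy prefers the
    # chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers: the policy has learned to favour the chosen response,
# so the loss is below the untrained baseline of -log(0.5) ≈ 0.693.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```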

Across these stages, the training data mix draws on:

  • broad public text and knowledge sources;
  • synthetic instruction and preference data;
  • private synthetic data used to favour British behaviour and knowledge;
  • UK-specific examples and preference signals used to strengthen local knowledge and style.

Evaluation Against European Baselines

Britishness comparisons against the European baselines included in the latest local artifact set:

Benchmark              Lizzy 7B   EuroLLM 9B   Apertus 8B
Britishness MCQ            71.0         77.6         80.8
Britishness CoT            80.1         72.1         31.7
Britishness Domains        89.9         69.0         32.6

Broader benchmark comparisons against the same European baselines:

Benchmark              Lizzy 7B   EuroLLM 9B   Apertus 8B
MATH                       77.9         31.3         22.4
OMEGA                      29.0          4.7          5.0
BigBenchHard               69.0         38.9         42.4
AGI Eval English           65.6         50.2         50.4
MMLU                       67.9         57.4         63.4
GPQA                       34.6         26.8         28.1
HumanEvalPlus              70.2         28.2         33.4
MBPP+                      52.5         41.7         42.3
LiveCodeBench v3           39.1          6.3          8.5
IFEval                     63.8         55.8         65.1
AIME                       35.8          0.2          0.6
GSM8K                      91.8         64.7         64.7
IFBench                    22.7         18.0         15.3
POPQA                      22.2         25.6         25.1
ZebraLogic                 12.4          4.4          5.9
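
The "leads on most rows" claim below can be sanity-checked directly from the table. The sketch transcribes the scores above and counts the rows where Lizzy 7B is the best of the three models:

```python
# Rows transcribed from the broader benchmark table above:
# benchmark -> (Lizzy 7B, EuroLLM 9B, Apertus 8B)
scores = {
    "MATH": (77.9, 31.3, 22.4),
    "OMEGA": (29.0, 4.7, 5.0),
    "BigBenchHard": (69.0, 38.9, 42.4),
    "AGI Eval English": (65.6, 50.2, 50.4),
    "MMLU": (67.9, 57.4, 63.4),
    "GPQA": (34.6, 26.8, 28.1),
    "HumanEvalPlus": (70.2, 28.2, 33.4),
    "MBPP+": (52.5, 41.7, 42.3),
    "LiveCodeBench v3": (39.1, 6.3, 8.5),
    "IFEval": (63.8, 55.8, 65.1),
    "AIME": (35.8, 0.2, 0.6),
    "GSM8K": (91.8, 64.7, 64.7),
    "IFBench": (22.7, 18.0, 15.3),
    "POPQA": (22.2, 25.6, 25.1),
    "ZebraLogic": (12.4, 4.4, 5.9),
}

# Rows where Lizzy 7B strictly beats both European baselines.
lizzy_wins = [name for name, (lizzy, euro, apertus) in scores.items()
              if lizzy > euro and lizzy > apertus]
print(f"Lizzy 7B leads on {len(lizzy_wins)} of {len(scores)} rows")
# Lizzy 7B leads on 13 of 15 rows
```

The two exceptions are IFEval (Apertus 8B leads) and POPQA (EuroLLM 9B leads).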

Summary:

  • Lizzy 7B trails the European baselines on Britishness MCQ (a private Flower Labs benchmark of recall-style probing).
  • Lizzy 7B leads the reported European baselines on Britishness CoT and Britishness domain reasoning (private Flower Labs benchmarks) where comparable metrics are available.
  • Lizzy 7B also leads the latest local European baseline set on most knowledge, reasoning, math, and coding rows represented in the table above.

Intended Uses And Limitations

Intended uses:

  • UK-oriented assistant experiences;
  • general reasoning and coding assistance;
  • managed deployment through private Hugging Face or vLLM serving stacks.

Safety And Bias Considerations

The latest safety evaluation reports the following task-level primary scores:

Safety benchmark          Metric                       Score
Overall safety average    overall_safety_average       66.7%
WildGuardTest             inverted_micro_harm_lower    91.9%
HarmBench                 inverted_micro_asr_lower     57.5%
ToxiGen (tiny)            safe_overall                 90.2%
XSTest                    overall_accuracy             85.6%
StrongReject (logprobs)   inverted_asr                 78.8%
BBQ                       accuracy                     66.5%
WMDP                      inverted_accuracy            47.5%

Lizzy 7B can still produce incorrect, outdated, or over-confident responses and should be used with human oversight for higher-risk workflows. UK-specific tuning improves local style and cultural alignment but can also bias tone and assumptions toward UK conventions; downstream moderation and policy controls remain required.

License And Citation

  • Model licence: Apache-2.0
  • Training sources combine open-licensed public data with private synthetic and UK-specific data that are not redistributed.
  • Citation and legal text should still be confirmed by owner review before any external publication.

Python Example (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_id = "flwrlabs/Lizzy-7B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are Lizzy 7B."},
    {"role": "user", "content": "Summarise why queue etiquette matters in the UK."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=256,  # without this, generate() stops after a handful of tokens
    do_sample=True,      # required for temperature/top_p to take effect
    temperature=0.2,
    top_p=0.9,
)
response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Multi-GPU vLLM Tensor Parallel Patch

For reproducible multi-GPU vLLM support with Lizzy-family checkpoints, this deliverable bundles:

  • bundled draft artifact: vllm_patches/transformers_lizzy_tp.py

Apply this patch when all of the following are true:

  • runtime uses vLLM via the generic Transformers backend (model_type=vllm)
  • tensor parallelism is enabled (tensor_parallel_size > 1)
  • checkpoint is Lizzy-family (including RLVR variants)
  • runtime is not guaranteed to include an equivalent upstream fix

You can skip patch bundling only for strict HF-only runs or single-rank vLLM (TP=1).
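
The apply/skip conditions above can be expressed as a small predicate. The helper below is hypothetical and illustrative; its name and arguments are not part of the bundled vllm_patches/transformers_lizzy_tp.py artifact:

```python
# Hypothetical helper mirroring the patch-application conditions above;
# the backend/family string values are assumptions for the example.
def should_apply_lizzy_tp_patch(backend: str,
                                tensor_parallel_size: int,
                                model_family: str,
                                upstream_fix_present: bool) -> bool:
    """Return True only when every patch-application condition holds."""
    return (backend == "vllm-transformers"        # generic Transformers backend
            and tensor_parallel_size > 1          # tensor parallelism enabled
            and model_family.startswith("lizzy")  # incl. RLVR variants
            and not upstream_fix_present)         # no equivalent upstream fix

# Single-rank vLLM (TP=1) skips the patch; TP=4 on an RLVR variant needs it.
print(should_apply_lizzy_tp_patch("vllm-transformers", 1, "lizzy-7b", False))
print(should_apply_lizzy_tp_patch("vllm-transformers", 4, "lizzy-7b-rlvr", False))
```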

Why this is included:

  • it mitigates known Lizzy TP failure modes in generic vLLM Transformers loading
  • it fixes rank-local head partitioning and q_norm/k_norm slicing behaviour
  • it prevents the known tensor-shape crash class seen without this patch
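
The rank-local head partitioning the patch addresses can be sketched in isolation. The head count matches Lizzy 7B's 32 heads, but the helper names, the head_dim value, and the per-head norm-weight layout are assumptions for this example, not the patch's actual implementation:

```python
# Illustrative sketch of rank-local head partitioning under tensor parallelism.
def rank_head_slice(num_heads: int, tp_size: int, rank: int) -> tuple[int, int]:
    """Half-open [start, end) range of attention heads owned by `rank`."""
    assert num_heads % tp_size == 0, "heads must divide evenly across ranks"
    per_rank = num_heads // tp_size
    return rank * per_rank, (rank + 1) * per_rank

def slice_per_head_norm(weights: list[float], head_dim: int,
                        num_heads: int, tp_size: int, rank: int) -> list[float]:
    """Slice a per-head q_norm/k_norm weight vector down to this rank's heads
    (assuming a flat [num_heads * head_dim] layout)."""
    start, end = rank_head_slice(num_heads, tp_size, rank)
    return weights[start * head_dim:end * head_dim]

print(rank_head_slice(32, 4, 1))  # rank 1 owns heads 8..16 -> (8, 16)
```

Loading the full norm vector on every rank instead of the rank-local slice is the kind of mismatch that produces the tensor-shape crash class mentioned above.
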