# Lizzy 7B
## Model Name And Summary
Lizzy 7B is an open-weight Flower Labs assistant model in the Lizzy family.
## Architecture And Configuration
Lizzy 7B is a 7B-class decoder-only transformer with long-context support, sliding/local attention behaviour, custom chat/control tokens, and deployment-specific serving configurations.
Representative configuration points:
- 7B-class parameter scale with a 32-layer stack;
- long-context configuration up to 65k tokens with runtime caps adjusted by deployment profile;
- 32 attention heads with long-context/sliding-attention behaviour;
- custom tokenizer and chat markers for instruction-style prompting;
- deployment variants may include quantised revisions, runtime patches, and serving-time configuration changes.
## Training Approach
Lizzy 7B follows a multi-stage training approach that combines:
- pre-training on large-scale public text, document, code, math, and encyclopedic corpora;
- supervised fine-tuning on instruction-following, dialogue, reasoning, and tool-use examples;
- direct preference optimisation on preference pairs for helpfulness, style, and answer quality;
- reinforcement learning with verifiable rewards for targeted behavioural refinement.
Across these stages, training data has been mixed across:
- broad public text and knowledge sources;
- synthetic instruction and preference data;
- private synthetic data used to favour British behaviour and knowledge;
- UK-specific examples and preference signals used to strengthen local knowledge and style.
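Of these stages, direct preference optimisation has a particularly compact objective. A minimal sketch of the standard per-pair DPO loss, using illustrative log-probabilities rather than anything from the actual training runs:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * policy log-ratio margin over the reference)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy prefers the chosen answer more strongly
# than the reference model does.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5) < dpo_loss(-2.0, -1.0, -1.5, -1.5))  # True
```

The preference pairs described above (helpfulness, style, answer quality) supply the chosen/rejected completions this loss is computed over.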
## Evaluation Against European Baselines
Britishness comparisons against the European baselines included in the latest local artifact set:
| Benchmark | Lizzy 7B | EuroLLM 9B | Apertus 8B |
|---|---|---|---|
| Britishness MCQ | 71.0 | 77.6 | 80.8 |
| Britishness CoT | 80.1 | 72.1 | 31.7 |
| Britishness Domains | 89.9 | 69.0 | 32.6 |
Broader benchmark comparisons against the same European baselines:
| Benchmark | Lizzy 7B | EuroLLM 9B | Apertus 8B |
|---|---|---|---|
| MATH | 77.9 | 31.3 | 22.4 |
| OMEGA | 29.0 | 4.7 | 5.0 |
| BigBenchHard | 69.0 | 38.9 | 42.4 |
| AGI Eval English | 65.6 | 50.2 | 50.4 |
| MMLU | 67.9 | 57.4 | 63.4 |
| GPQA | 34.6 | 26.8 | 28.1 |
| HumanEvalPlus | 70.2 | 28.2 | 33.4 |
| MBPP+ | 52.5 | 41.7 | 42.3 |
| LiveCodeBench v3 | 39.1 | 6.3 | 8.5 |
| IFEval | 63.8 | 55.8 | 65.1 |
| AIME | 35.8 | 0.2 | 0.6 |
| GSM8K | 91.8 | 64.7 | 64.7 |
| IFBench | 22.7 | 18.0 | 15.3 |
| POPQA | 22.2 | 25.6 | 25.1 |
| ZebraLogic | 12.4 | 4.4 | 5.9 |
Summary:
- Lizzy 7B trails the European baselines on Britishness MCQ (a private Flower Labs benchmark), a recall-style probe.
- Lizzy 7B leads the reported European baselines on Britishness CoT and Britishness Domains (private Flower Labs benchmarks) where comparable metrics are available.
- Lizzy 7B also leads this European baseline set on most of the knowledge, reasoning, math, and coding rows in the table above.
## Intended Uses And Limitations
Intended uses:
- UK-oriented assistant experiences;
- general reasoning and coding assistance;
- managed deployment through private Hugging Face or vLLM serving stacks.
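For the vLLM serving route, a minimal launch command might look like the following. The flags are standard vLLM options; the context length and parallelism values are illustrative deployment choices, not recommendations from this card:

```shell
# Illustrative only: serve Lizzy 7B behind vLLM's OpenAI-compatible endpoint.
# --max-model-len and --tensor-parallel-size are deployment-profile choices.
vllm serve flwrlabs/Lizzy-7B \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --trust-remote-code
```

Note the tensor-parallel caveats in the patch section below apply when `tensor_parallel_size > 1`.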
## Safety And Bias Considerations
The latest safety evaluation reports the following task-level primary scores:
| Safety benchmark | Metric | Score |
|---|---|---|
| Overall safety average | `overall_safety_average` | 66.7% |
| WildGuardTest | `inverted_micro_harm_lower` | 91.9% |
| HarmBench | `inverted_micro_asr_lower` | 57.5% |
| ToxiGen (tiny) | `safe_overall` | 90.2% |
| XSTest | `overall_accuracy` | 85.6% |
| StrongReject (logprobs) | `inverted_asr` | 78.8% |
| BBQ | `accuracy` | 66.5% |
| WMDP | `inverted_accuracy` | 47.5% |
Lizzy 7B can still produce incorrect, outdated, or over-confident responses and should be used with human oversight for higher-risk workflows. UK-specific tuning improves local style and cultural alignment but can also bias tone and assumptions toward UK conventions; downstream moderation and policy controls remain required.
## License And Citation
- Model licence: Apache-2.0.
- Public and synthetic training sources include open-licensed public data plus private synthetic and UK-specific data that are not redistributed.
- Citation and legal text should still be confirmed by owner review before any external publication.
## Python Example (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_id = "flwrlabs/Lizzy-7B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are Lizzy 7B."},
    {"role": "user", "content": "Summarise why queue etiquette matters in the UK."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=0.2,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the echoed prompt.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)
```
## Multi-GPU vLLM Tensor Parallel Patch
For reproducible multi-GPU vLLM support with Lizzy-family checkpoints, this deliverable bundles a draft artifact: `vllm_patches/transformers_lizzy_tp.py`
Apply this patch when all of the following are true:
- runtime uses vLLM via the generic Transformers backend (`model_type=vllm`);
- tensor parallelism is enabled (`tensor_parallel_size > 1`);
- checkpoint is Lizzy-family (including RLVR variants);
- runtime is not guaranteed to include an equivalent upstream fix.
You can skip patch bundling only for strict HF-only runs or single-rank vLLM (TP=1).
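The checklist above can be mirrored as a small gating helper. This is a hypothetical illustration for deployment tooling; the function name and arguments are not part of the bundled patch:

```python
def needs_lizzy_tp_patch(
    backend: str,
    tensor_parallel_size: int,
    is_lizzy_family: bool,
    has_upstream_fix: bool,
) -> bool:
    """Return True only when every condition in the checklist above holds."""
    return (
        backend == "vllm-transformers"  # generic Transformers backend (model_type=vllm)
        and tensor_parallel_size > 1    # tensor parallelism actually enabled
        and is_lizzy_family             # includes RLVR variants
        and not has_upstream_fix        # no equivalent upstream fix guaranteed
    )

print(needs_lizzy_tp_patch("vllm-transformers", 2, True, False))  # True
print(needs_lizzy_tp_patch("vllm-transformers", 1, True, False))  # False (TP=1)
```

Any single failing condition (HF-only run, TP=1, non-Lizzy checkpoint, or a guaranteed upstream fix) means the patch can be skipped.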
Why this is included:
- it mitigates known Lizzy TP failure modes in generic vLLM Transformers loading;
- it fixes rank-local head partitioning and `q_norm`/`k_norm` slicing behaviour;
- it prevents the known tensor-shape crash class seen without this patch.