Lizzy 7B

Lizzy 7B header figure (light theme)

Model Name And Summary

Lizzy 7B is an open-weight Flower Labs assistant model in the Lizzy family.

Architecture And Configuration

Lizzy 7B is a 7B-class decoder-only transformer with long-context support, sliding/local attention behaviour, custom chat/control tokens, and deployment-specific serving configurations.

Representative configuration points:

  • 7B-class parameter scale with a 32-layer stack;
  • long-context configuration up to 65k tokens with runtime caps adjusted by deployment profile;
  • 32 attention heads with long-context/sliding-attention behaviour;
  • custom tokenizer and chat markers for instruction-style prompting;
  • deployment variants may include quantised revisions, runtime patches, and serving-time configuration changes.

Training Approach

Lizzy 7B follows a multi-stage training approach that combines:

  • pre-training on large-scale public text, document, code, math, and encyclopedic corpora;
  • supervised fine-tuning on instruction-following, dialogue, reasoning, and tool-use examples;
  • direct preference optimisation on preference pairs for helpfulness, style, and answer quality;
  • reinforcement learning with verifiable rewards for targeted behavioural refinement.
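
The direct preference optimisation stage can be illustrated with a minimal pure-Python sketch of the standard DPO objective on a single preference pair. The log-probability inputs below are toy numbers, and this is the generic published loss, not Flower Labs' actual training code:

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the trained policy and the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): small when the policy prefers the
    # chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers: the policy has learned to favour the chosen response,
# so the loss is below the untrained baseline of -log(0.5) ≈ 0.693.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```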

Across these stages, the training data mix draws on:

  • broad public text and knowledge sources;
  • synthetic instruction and preference data;
  • private synthetic data used to favour British behaviour and knowledge;
  • UK-specific examples and preference signals used to strengthen local knowledge and style.

Evaluation Against European Baselines

Britishness comparisons against the European baselines included in the latest local artifact set:

Benchmark              Lizzy 7B   EuroLLM 9B   Apertus 8B
Britishness MCQ            71.0         77.6         80.8
Britishness CoT            80.1         72.1         31.7
Britishness Domains        89.9         69.0         32.6

Broader benchmark comparisons against the same European baselines:

Benchmark              Lizzy 7B   EuroLLM 9B   Apertus 8B
MATH                       77.9         31.3         22.4
OMEGA                      29.0          4.7          5.0
BigBenchHard               69.0         38.9         42.4
AGI Eval English           65.6         50.2         50.4
MMLU                       67.9         57.4         63.4
GPQA                       34.6         26.8         28.1
HumanEvalPlus              70.2         28.2         33.4
MBPP+                      52.5         41.7         42.3
LiveCodeBench v3           39.1          6.3          8.5
IFEval                     63.8         55.8         65.1
AIME                       35.8          0.2          0.6
GSM8K                      91.8         64.7         64.7
IFBench                    22.7         18.0         15.3
POPQA                      22.2         25.6         25.1
ZebraLogic                 12.4          4.4          5.9
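
The "leads on most rows" claim below can be sanity-checked directly from the table. The sketch transcribes the scores above and counts the rows where Lizzy 7B is the best of the three models:

```python
# Rows transcribed from the broader benchmark table above:
# benchmark -> (Lizzy 7B, EuroLLM 9B, Apertus 8B)
scores = {
    "MATH": (77.9, 31.3, 22.4),
    "OMEGA": (29.0, 4.7, 5.0),
    "BigBenchHard": (69.0, 38.9, 42.4),
    "AGI Eval English": (65.6, 50.2, 50.4),
    "MMLU": (67.9, 57.4, 63.4),
    "GPQA": (34.6, 26.8, 28.1),
    "HumanEvalPlus": (70.2, 28.2, 33.4),
    "MBPP+": (52.5, 41.7, 42.3),
    "LiveCodeBench v3": (39.1, 6.3, 8.5),
    "IFEval": (63.8, 55.8, 65.1),
    "AIME": (35.8, 0.2, 0.6),
    "GSM8K": (91.8, 64.7, 64.7),
    "IFBench": (22.7, 18.0, 15.3),
    "POPQA": (22.2, 25.6, 25.1),
    "ZebraLogic": (12.4, 4.4, 5.9),
}

# Rows where Lizzy 7B strictly beats both European baselines.
lizzy_wins = [name for name, (lizzy, euro, apertus) in scores.items()
              if lizzy > euro and lizzy > apertus]
print(f"Lizzy 7B leads on {len(lizzy_wins)} of {len(scores)} rows")
# Lizzy 7B leads on 13 of 15 rows
```

The two exceptions are IFEval (Apertus 8B leads) and POPQA (EuroLLM 9B leads).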

Summary:

  • Lizzy 7B trails the European baselines on Britishness MCQ (a private Flower Labs benchmark of recall-style probing).
  • Lizzy 7B leads the reported European baselines on Britishness CoT and Britishness domain reasoning (private Flower Labs benchmarks) where comparable metrics are available.
  • Lizzy 7B also leads the latest local European baseline set on most knowledge, reasoning, math, and coding rows represented in the table above.

Intended Uses And Limitations

Intended uses:

  • UK-oriented assistant experiences;
  • general reasoning and coding assistance;
  • managed deployment through private Hugging Face or vLLM serving stacks.

Safety And Bias Considerations

The latest safety evaluation reports the following task-level primary scores:

Safety benchmark          Metric                       Score
Overall safety average    overall_safety_average       66.7%
WildGuardTest             inverted_micro_harm_lower    91.9%
HarmBench                 inverted_micro_asr_lower     57.5%
ToxiGen (tiny)            safe_overall                 90.2%
XSTest                    overall_accuracy             85.6%
StrongReject (logprobs)   inverted_asr                 78.8%
BBQ                       accuracy                     66.5%
WMDP                      inverted_accuracy            47.5%

Lizzy 7B can still produce incorrect, outdated, or over-confident responses and should be used with human oversight for higher-risk workflows. UK-specific tuning improves local style and cultural alignment but can also bias tone and assumptions toward UK conventions; downstream moderation and policy controls remain required.

License And Citation

  • Model licence: Apache-2.0
  • Training sources combine open-licensed public data with private synthetic and UK-specific data that are not redistributed.
  • Citation and legal text should still be confirmed by owner review before any external publication.

Python Example (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_id = "flwrlabs/Lizzy-7B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are Lizzy 7B."},
    {"role": "user", "content": "Summarise why queue etiquette matters in the UK."},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=256,  # without this, generate() stops after a handful of tokens
    do_sample=True,      # required for temperature/top_p to take effect
    temperature=0.2,
    top_p=0.9,
)
response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Multi-GPU vLLM Tensor Parallel Patch

For reproducible multi-GPU vLLM support with Lizzy-family checkpoints, this deliverable bundles:

  • bundled draft artifact: vllm_patches/transformers_lizzy_tp.py

Apply this patch when all of the following are true:

  • runtime uses vLLM via the generic Transformers backend (model_type=vllm)
  • tensor parallelism is enabled (tensor_parallel_size > 1)
  • checkpoint is Lizzy-family (including RLVR variants)
  • runtime is not guaranteed to include an equivalent upstream fix

You can skip patch bundling only for strict HF-only runs or single-rank vLLM (TP=1).
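
The apply/skip conditions above can be expressed as a small predicate. The helper below is hypothetical and illustrative; its name and arguments are not part of the bundled vllm_patches/transformers_lizzy_tp.py artifact:

```python
# Hypothetical helper mirroring the patch-application conditions above;
# the backend/family string values are assumptions for the example.
def should_apply_lizzy_tp_patch(backend: str,
                                tensor_parallel_size: int,
                                model_family: str,
                                upstream_fix_present: bool) -> bool:
    """Return True only when every patch-application condition holds."""
    return (backend == "vllm-transformers"        # generic Transformers backend
            and tensor_parallel_size > 1          # tensor parallelism enabled
            and model_family.startswith("lizzy")  # incl. RLVR variants
            and not upstream_fix_present)         # no equivalent upstream fix

# Single-rank vLLM (TP=1) skips the patch; TP=4 on an RLVR variant needs it.
print(should_apply_lizzy_tp_patch("vllm-transformers", 1, "lizzy-7b", False))
print(should_apply_lizzy_tp_patch("vllm-transformers", 4, "lizzy-7b-rlvr", False))
```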

Why this is included:

  • it mitigates known Lizzy TP failure modes in generic vLLM Transformers loading
  • it fixes rank-local head partitioning and q_norm/k_norm slicing behaviour
  • it prevents the known tensor-shape crash class seen without this patch
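
The rank-local head partitioning the patch addresses can be sketched in isolation. The head count matches Lizzy 7B's 32 heads, but the helper names, the head_dim value, and the per-head norm-weight layout are assumptions for this example, not the patch's actual implementation:

```python
# Illustrative sketch of rank-local head partitioning under tensor parallelism.
def rank_head_slice(num_heads: int, tp_size: int, rank: int) -> tuple[int, int]:
    """Half-open [start, end) range of attention heads owned by `rank`."""
    assert num_heads % tp_size == 0, "heads must divide evenly across ranks"
    per_rank = num_heads // tp_size
    return rank * per_rank, (rank + 1) * per_rank

def slice_per_head_norm(weights: list[float], head_dim: int,
                        num_heads: int, tp_size: int, rank: int) -> list[float]:
    """Slice a per-head q_norm/k_norm weight vector down to this rank's heads
    (assuming a flat [num_heads * head_dim] layout)."""
    start, end = rank_head_slice(num_heads, tp_size, rank)
    return weights[start * head_dim:end * head_dim]

print(rank_head_slice(32, 4, 1))  # rank 1 owns heads 8..16 -> (8, 16)
```

Loading the full norm vector on every rank instead of the rank-local slice is the kind of mismatch that produces the tensor-shape crash class mentioned above.
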