Qwen3-1.77B-g023 (Full Precision)

Overview

This is an optimized variant of Qwen/Qwen3-1.7B created by duplicating layer 21 to produce a 29-layer model (up from the original 28). The optimal duplication point was found through 5 rounds of iterative testing across layers 9–25, evaluating factual accuracy, perplexity, repetition, and thinking mode functionality.

Using this model:

  • xinf: https://github.com/g023/xinf
  • turboquant: https://github.com/g023/turboquant

Key Result

Metric              Baseline (28 layers)   This Model (29 layers)
Overall Score       85.9 / 100             93.6 / 100 (+7.7)
Factual Accuracy    7 / 9                  9 / 9
Avg Perplexity      17.71                  19.50
Thinking Mode       Working                Working
Non-Thinking Mode   Working                Working

Architecture

Parameter                Value
Layers                   29 (28 original + 1 duplicated)
Hidden Size              2048
Intermediate Size        6144
Attention Heads          16 (query) / 8 (KV)
Head Dimension           128
Vocab Size               151,936
Max Position Embeddings  40,960
Total Parameters         ~1.77B
Dtype                    bfloat16
Tied Embeddings          Yes
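The ~1.77B figure can be cross-checked from the architecture values alone. A back-of-the-envelope tally (Qwen3 uses gated MLPs, grouped-query attention, and tied embeddings; small norm weights are ignored here):

```python
# Cross-check the ~1.77B parameter count from the architecture table.
hidden, inter, layers = 2048, 6144, 29
q_heads, kv_heads, head_dim = 16, 8, 128
vocab = 151_936

embed = vocab * hidden                    # tied input/output embeddings
attn = hidden * q_heads * head_dim * 2    # q_proj + o_proj
attn += hidden * kv_heads * head_dim * 2  # k_proj + v_proj (GQA: 8 KV heads)
mlp = 3 * hidden * inter                  # gate, up, down projections
total = embed + layers * (attn + mlp)
print(f"{total / 1e9:.2f}B parameters")   # ≈ 1.77B
```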

Layer Mapping

Source Layer  →  Output Layer
0–20         →  0–20   (unchanged)
21           →  21, 22 (duplicated with noise std=0.001 + depth scaling)
22–27        →  23–28  (shifted +1)

Duplication Method

  • Noise injection: Gaussian noise (std=0.001) added to the duplicated layer to break symmetry
  • Depth scaling: Factor of √(28/29) ≈ 0.983 applied to prevent activation explosion
  • Anchors preserved: First layer (0) and last layer (27→28) remain unmodified
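The recipe above can be sketched as a state-dict transformation. This is an illustrative version operating on plain numpy arrays, not the actual script used (the function name, and the choice to apply depth scaling only to the duplicated copy, are assumptions):

```python
import numpy as np

def duplicate_layer(state_dict, src_layer=21, n_layers=28,
                    noise_std=1e-3, seed=0):
    """Expand a model by one layer, matching the mapping above:
    layers 0..src stay put, src is copied to src+1 with small Gaussian
    noise and depth scaling, and layers src+1..n_layers-1 shift up one."""
    rng = np.random.default_rng(seed)
    scale = np.sqrt(n_layers / (n_layers + 1))  # depth scaling, sqrt(28/29) ≈ 0.983
    out = {}
    for name, w in state_dict.items():
        if not name.startswith("model.layers."):
            out[name] = w  # embeddings, final norm etc. pass through
            continue
        parts = name.split(".")
        idx = int(parts[2])
        rest = ".".join(parts[3:])
        if idx <= src_layer:
            out[f"model.layers.{idx}.{rest}"] = w
        else:
            out[f"model.layers.{idx + 1}.{rest}"] = w  # shift +1
        if idx == src_layer:
            # noisy, depth-scaled copy breaks symmetry with the original
            noisy = (w + rng.normal(0.0, noise_std, w.shape)) * scale
            out[f"model.layers.{idx + 1}.{rest}"] = noisy.astype(w.dtype)
    return out
```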

Files

File                              Size    Description
model-00001-of-00001.safetensors  3.3 GB  Model weights (bfloat16)
config.json                       <1 KB   Model configuration
tokenizer.json                    11 MB   Tokenizer
tokenizer_config.json             10 KB   Tokenizer configuration
vocab.json                        2.7 MB  Vocabulary
merges.txt                        1.6 MB  BPE merges
generation_config.json            <1 KB   Generation defaults
eval_results.json                 1 KB    Full evaluation metrics
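After downloading, the expanded depth can be sanity-checked straight from config.json. A minimal helper (hypothetical name, assuming the stock Qwen3 config keys):

```python
import json

def check_layer_count(config_path, expected_layers=29):
    """Return True if config.json reports the expanded depth
    (num_hidden_layers == 29 for this model)."""
    with open(config_path) as f:
        cfg = json.load(f)
    return cfg.get("num_hidden_layers") == expected_layers
```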

Usage

# Usage example (requires transformers and torch; device_map="auto" needs accelerate)
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

# Tweakable parameters
# MODEL_PATH = "./Qwen3-BEST"  # local run
MODEL_PATH = "g023/Qwen3-1.77B-g023"
MAX_NEW_TOKENS = 8192
TEMPERATURE = 0.7
DO_SAMPLE = True
TOP_P = 0.9
TOP_K = 50
REPETITION_PENALTY = 1.1
STREAMING = True  # toggle streaming inference
INPUT_MESSAGE = "You are completing the next step in a task to create an arcade game in javascript. Your available tools are rationalize, red_green_tdd, and create_plan. Synthesize their output when reasoning."

def load_model():
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    print("Model loaded.")
    return model, tokenizer

def inference_non_streaming(model, tokenizer, messages):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
    )
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print("Response:", response)
    return response

def inference_streaming(model, tokenizer, messages):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        streamer=streamer,
    )
    # decode the completion (everything after the prompt) so the caller
    # gets the final string as well as the streamed output
    final_response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return final_response

def llm_stream(model, tokenizer, conversation):
    import time
    start_time = time.time()
    text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    from io import StringIO
    buffer = StringIO()
    class CapturingTextStreamer(TextStreamer):
        def __init__(self, tokenizer, buffer):
            super().__init__(tokenizer, skip_prompt=True, skip_special_tokens=True)
            self.buffer = buffer
        def on_finalized_text(self, text, stream_end=False):
            self.buffer.write(text)
            print(text, end="", flush=True)
    streamer = CapturingTextStreamer(tokenizer, buffer)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        streamer=streamer,
    )
    response = buffer.getvalue()

    if "</think>" in response:
        parts = response.rsplit("</think>", 1)
        reasoning = parts[0].strip()
        content = parts[1].strip()
    else:
        reasoning = ""
        content = response.strip()
    # rough estimate: ~3.245 characters per token for this tokenizer
    char_per_token = 3.245
    reasoning_tokens = round(len(reasoning) / char_per_token)
    content_tokens = round(len(content) / char_per_token)
    total_tokens = reasoning_tokens + content_tokens
    time_taken = time.time() - start_time
    ret_dict = {
        "reasoning": reasoning,
        "content": content,
        "usage": {
            "reasoning_tokens": reasoning_tokens,
            "content_tokens": content_tokens,
            "total_tokens": total_tokens,
        },
        "time_taken": time_taken,
    }
    return ret_dict

if __name__ == "__main__":
    model, tokenizer = load_model()
    messages = [{"role": "user", "content": INPUT_MESSAGE}]
    ret = llm_stream(model, tokenizer, messages)
    print("Result dict:", ret)

    # estimate tokens per second from total_tokens and time_taken
    if ret["usage"]["total_tokens"] > 0 and ret["time_taken"] > 0:
        tps = ret["usage"]["total_tokens"] / ret["time_taken"]
        print(f"Tokens per second: {tps:.2f}")
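The `</think>` handling in `llm_stream` can be pulled out into a small standalone helper for reuse (hypothetical name; it mirrors the `rsplit` logic above, and thinking-disabled outputs with no tag map to empty reasoning):

```python
def split_thinking(response: str):
    """Split a raw completion into (reasoning, content) at the last
    </think> tag; if no tag is present, everything is content."""
    if "</think>" in response:
        reasoning, content = response.rsplit("</think>", 1)
        return reasoning.strip(), content.strip()
    return "", response.strip()
```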

Base Model

  • Model: Qwen/Qwen3-1.7B
  • Architecture: Qwen3ForCausalLM (decoder-only transformer with GQA)
  • License: Apache 2.0