Qwen3-1.77B-g023 (Full Precision)

Overview

This is an optimized variant of Qwen/Qwen3-1.7B created by duplicating layer 21 to produce a 29-layer model (up from the original 28). The optimal duplication point was found through 5 rounds of iterative testing across layers 9–25, evaluating factual accuracy, perplexity, repetition, and thinking mode functionality.

Using this model:

  • xinf: https://github.com/g023/xinf
  • turboquant: https://github.com/g023/turboquant

Key Result

Metric              Baseline (28 layers)   This Model (29 layers)
Overall Score       85.9 / 100             93.6 / 100 (+7.7)
Factual Accuracy    7 / 9                  9 / 9
Avg Perplexity      17.71                  19.50
Thinking Mode       Working                Working
Non-Thinking Mode   Working                Working

Architecture

Parameter                Value
Layers                   29 (28 original + 1 duplicated)
Hidden Size              2048
Intermediate Size        6144
Attention Heads          16 (query) / 8 (KV)
Head Dimension           128
Vocab Size               151,936
Max Position Embeddings  40,960
Total Parameters         ~1.77B
Dtype                    bfloat16
Tied Embeddings          Yes
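The ~1.77B figure can be cross-checked from the architecture values alone. A back-of-the-envelope tally (Qwen3 uses gated MLPs, grouped-query attention, and tied embeddings; small norm weights are ignored here):

```python
# Cross-check the ~1.77B parameter count from the architecture table.
hidden, inter, layers = 2048, 6144, 29
q_heads, kv_heads, head_dim = 16, 8, 128
vocab = 151_936

embed = vocab * hidden                    # tied input/output embeddings
attn = hidden * q_heads * head_dim * 2    # q_proj + o_proj
attn += hidden * kv_heads * head_dim * 2  # k_proj + v_proj (GQA: 8 KV heads)
mlp = 3 * hidden * inter                  # gate, up, down projections
total = embed + layers * (attn + mlp)
print(f"{total / 1e9:.2f}B parameters")   # ≈ 1.77B
```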

Layer Mapping

Source Layer  →  Output Layer
0–20         →  0–20   (unchanged)
21           →  21, 22 (duplicated with noise std=0.001 + depth scaling)
22–27        →  23–28  (shifted +1)

Duplication Method

  • Noise injection: Gaussian noise (std=0.001) added to the duplicated layer to break symmetry
  • Depth scaling: Factor of √(28/29) ≈ 0.983 applied to prevent activation explosion
  • Anchors preserved: First layer (0) and last layer (27→28) remain unmodified
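The recipe above can be sketched as a state-dict transformation. This is an illustrative version operating on plain numpy arrays, not the actual script used (the function name, and the choice to apply depth scaling only to the duplicated copy, are assumptions):

```python
import numpy as np

def duplicate_layer(state_dict, src_layer=21, n_layers=28,
                    noise_std=1e-3, seed=0):
    """Expand a model by one layer, matching the mapping above:
    layers 0..src stay put, src is copied to src+1 with small Gaussian
    noise and depth scaling, and layers src+1..n_layers-1 shift up one."""
    rng = np.random.default_rng(seed)
    scale = np.sqrt(n_layers / (n_layers + 1))  # depth scaling, sqrt(28/29) ≈ 0.983
    out = {}
    for name, w in state_dict.items():
        if not name.startswith("model.layers."):
            out[name] = w  # embeddings, final norm etc. pass through
            continue
        parts = name.split(".")
        idx = int(parts[2])
        rest = ".".join(parts[3:])
        if idx <= src_layer:
            out[f"model.layers.{idx}.{rest}"] = w
        else:
            out[f"model.layers.{idx + 1}.{rest}"] = w  # shift +1
        if idx == src_layer:
            # noisy, depth-scaled copy breaks symmetry with the original
            noisy = (w + rng.normal(0.0, noise_std, w.shape)) * scale
            out[f"model.layers.{idx + 1}.{rest}"] = noisy.astype(w.dtype)
    return out
```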

Files

File                              Size    Description
model-00001-of-00001.safetensors  3.3 GB  Model weights (bfloat16)
config.json                       <1 KB   Model configuration
tokenizer.json                    11 MB   Tokenizer
tokenizer_config.json             10 KB   Tokenizer configuration
vocab.json                        2.7 MB  Vocabulary
merges.txt                        1.6 MB  BPE merges
generation_config.json            <1 KB   Generation defaults
eval_results.json                 1 KB    Full evaluation metrics
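After downloading, the expanded depth can be sanity-checked straight from config.json. A minimal helper (hypothetical name, assuming the stock Qwen3 config keys):

```python
import json

def check_layer_count(config_path, expected_layers=29):
    """Return True if config.json reports the expanded depth
    (num_hidden_layers == 29 for this model)."""
    with open(config_path) as f:
        cfg = json.load(f)
    return cfg.get("num_hidden_layers") == expected_layers
```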

Usage

# Usage example (requires transformers and torch; device_map="auto" needs accelerate)
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

# Tweakable parameters
# MODEL_PATH = "./Qwen3-BEST"  # local run
MODEL_PATH = "g023/Qwen3-1.77B-g023"
MAX_NEW_TOKENS = 8192
TEMPERATURE = 0.7
DO_SAMPLE = True
TOP_P = 0.9
TOP_K = 50
REPETITION_PENALTY = 1.1
STREAMING = True  # toggle streaming inference
INPUT_MESSAGE = "You are completing the next step in a task to create an arcade game in javascript. Your available tools are rationalize, red_green_tdd, and create_plan. Synthesize their output when reasoning."

def load_model():
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    print("Model loaded.")
    return model, tokenizer

def inference_non_streaming(model, tokenizer, messages):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
    )
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print("Response:", response)
    return response

def inference_streaming(model, tokenizer, messages):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        streamer=streamer,
    )
    # decode the completion (everything after the prompt) so the caller
    # gets the final string as well as the streamed output
    final_response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return final_response

def llm_stream(model, tokenizer, conversation):
    import time
    start_time = time.time()
    text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    from io import StringIO
    buffer = StringIO()
    class CapturingTextStreamer(TextStreamer):
        def __init__(self, tokenizer, buffer):
            super().__init__(tokenizer, skip_prompt=True, skip_special_tokens=True)
            self.buffer = buffer
        def on_finalized_text(self, text, stream_end=False):
            self.buffer.write(text)
            print(text, end="", flush=True)
    streamer = CapturingTextStreamer(tokenizer, buffer)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        streamer=streamer,
    )
    response = buffer.getvalue()

    if "</think>" in response:
        parts = response.rsplit("</think>", 1)
        reasoning = parts[0].strip()
        content = parts[1].strip()
    else:
        reasoning = ""
        content = response.strip()
    # rough estimate: ~3.245 characters per token for this tokenizer
    char_per_token = 3.245
    reasoning_tokens = round(len(reasoning) / char_per_token)
    content_tokens = round(len(content) / char_per_token)
    total_tokens = reasoning_tokens + content_tokens
    time_taken = time.time() - start_time
    ret_dict = {
        "reasoning": reasoning,
        "content": content,
        "usage": {
            "reasoning_tokens": reasoning_tokens,
            "content_tokens": content_tokens,
            "total_tokens": total_tokens,
        },
        "time_taken": time_taken,
    }
    return ret_dict

if __name__ == "__main__":
    model, tokenizer = load_model()
    messages = [{"role": "user", "content": INPUT_MESSAGE}]
    ret = llm_stream(model, tokenizer, messages)
    print("Result dict:", ret)

    # estimate tokens per second from total_tokens and time_taken
    if ret["usage"]["total_tokens"] > 0 and ret["time_taken"] > 0:
        tps = ret["usage"]["total_tokens"] / ret["time_taken"]
        print(f"Tokens per second: {tps:.2f}")
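The `</think>` handling in `llm_stream` can be pulled out into a small standalone helper for reuse (hypothetical name; it mirrors the `rsplit` logic above, and thinking-disabled outputs with no tag map to empty reasoning):

```python
def split_thinking(response: str):
    """Split a raw completion into (reasoning, content) at the last
    </think> tag; if no tag is present, everything is content."""
    if "</think>" in response:
        reasoning, content = response.rsplit("</think>", 1)
        return reasoning.strip(), content.strip()
    return "", response.strip()
```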

Base Model

  • Model: Qwen/Qwen3-1.7B
  • Architecture: Qwen3ForCausalLM (decoder-only transformer with GQA)
  • License: Apache 2.0