Apps using this model:

g023's Agentic Chat (https://github.com/g023/g023_agentic_chat/)

Agentic ProHarness β€” Self-Improving LLM Programming Harness (https://github.com/g023/agentica/)

HarnessHarvester: self-learning, self-correcting, LLM-powered harness creation (https://github.com/g023/harnessharvest)

Local Model Router: Ollama/OpenAI-compatible Bridge to llama.cpp (https://github.com/g023/localmodelrouter)

RUN in Ollama quickly

ollama run hf.co/g023/Qwen3-1.77B-g023-GGUF:Q8_0

Note: I just use the Q8_0, but if the others work for your needs, more power to you.

EXAMPLE RUNNING IN PYTHON:

# OLLAMA PULL AND THEN ROCK AND ROLL
# ollama pull hf.co/g023/Qwen3-1.77B-g023-GGUF:Q8_0

# Author: g023 [ https://x.com/g023dev ]
# https://huggingface.co/g023/

# 2 main functions:
# llm_nonstream(conv=[], thinking=True, options=G_OPTIONS)
# llm_stream(conv=[], thinking=True, options=G_OPTIONS)

G_TURBO_TIME = True # for hf.co/g023/Qwen3-1.77B-g023-GGUF:Q8_0 to run without thinking at full clip (very very very fast; still smart)

# Thinking output is detected and returned as a separate 'reasoning' field, with the answer in 'content'

import requests
import json
from typing import List, Dict, Optional, Generator, Union
import os
import time

G_APPEND_PROMPT = "" # can set to "no_think" to disable thinking for models that don't support it, or "think:" to use a custom think prefix for models that do support it.

G_HOST = "http://localhost:11434"  # Default host for Ollama server

# # if you want to use Qwen3.5 use these settings:
# G_MODEL = "qwen3.5:2b"
# G_THINKING = True # modern LLMs want this switch to enable the "thinking" process. Even when disabled, if there is a </think> tag in the output it will still split reasoning and content
# G_APPEND_PROMPT = "" # can set to "no_think" to disable thinking for models that don't support it, or "think:" to use a custom think prefix for models that do support it.

# OLLAMA PULL AND THEN ROCK AND ROLL
# ollama pull hf.co/g023/Qwen3-1.77B-g023-GGUF:Q8_0
# so this is a model I created by duplicating a layer in the orig qwen3 1.7B . Its fast. 
if G_TURBO_TIME:
    G_MODEL = "hf.co/g023/Qwen3-1.77B-g023-GGUF:Q8_0" 
    # does not support thinking param so have to set thinking=False but you can disable thinking with 'no_think' in the prompt you send
    G_THINKING = False # old school LLMs like Qwen3 original (can deactivate think with no_think in prompt)
    G_APPEND_PROMPT = "<ignore:no_think>" # can set to "no_think" to disable thinking for models that don't support it, or "think:" to use a custom think prefix for models that do support it.
else:
    G_MODEL = "hf.co/g023/Qwen3-1.77B-g023-GGUF:Q8_0" 
    # does not support thinking param so have to set thinking=False but you can disable thinking with 'no_think' in the prompt you send
    G_THINKING = False # old school LLMs like Qwen3 original (can deactivate think with no_think in prompt)
    G_APPEND_PROMPT = "" # can set to "no_think" to disable thinking for models that don't support it, or "think:" to use a custom think prefix for models that do support it.

G_CONTEXT_WINDOW = 45000 # num_ctx in ollama api
G_MAX_OUTPUT_TOKENS = 16384  # -1 for no limit, otherwise set to desired max output tokens (e.g., 2048) # controls how many tokens are output
G_TEMP = 0.85


G_OPTIONS = {
    # "num_keep": 5, # Keep last 5 messages for context
    # "seed": 42,
    "num_predict": G_MAX_OUTPUT_TOKENS,
    "top_k": 90,
    "top_p": 0.9,
    "min_p": 0.3,
    "typical_p": 0.25,
    "repeat_last_n": 32767,
    "temperature": G_TEMP,
    "repeat_penalty": 15.2,
    "presence_penalty": 0.5,
    "frequency_penalty": 0.1,
    "mirostat": 2,
    "mirostat_tau": 0.3,
    "mirostat_eta": 0.2,
    "penalize_newline": True,
    # "stop": ["\n", "user:"],
    "numa": False,
    "num_ctx": G_CONTEXT_WINDOW,
    "num_batch": 1,
    # "num_gpu": 1,
    # "main_gpu": 0,
    "low_vram": False,
    "vocab_only": False,
    "use_mmap": True,
    "use_mlock": True,
    "num_thread": 1        
}

_STREAM_DONE = object()


def _resolve_host(host: Optional[str]) -> str:
    """Resolve the Ollama host with explicit override, environment fallback, then default."""
    return (host or os.getenv("OLLAMA_HOST") or G_HOST).rstrip('/')


def _parse_stream_line(raw_line: Union[bytes, str]) -> Optional[Union[Dict, object]]:
    """Parse one SSE line from an Ollama streaming response."""
    if isinstance(raw_line, bytes):
        line = raw_line.decode('utf-8', errors='replace')
    else:
        line = raw_line

    line = line.strip()
    if not line or line.startswith(':'):
        return None

    if line.startswith('data:'):
        payload = line[5:].strip()
        if not payload or payload == '[DONE]':
            return _STREAM_DONE
    else:
        payload = line

    try:
        return json.loads(payload)
    except json.JSONDecodeError as exc:
        print(f"Failed to decode chunk: {raw_line!r}, error: {exc}")
        return None


def _is_reasoning_effort_unsupported_error(status_code: int, body: str) -> bool:
    if status_code != 400 or not isinstance(body, str):
        return False

    lowered = body.lower()
    return (
        ("think value" in lowered and "not supported" in lowered)
        or "does not support thinking" in lowered
        or "does not support think" in lowered
        or "does not support 'think'" in lowered
    )

def chat_with_ollama(
    messages: List[Dict[str, str]],
    model: str = G_MODEL,
    host: str = None,
    stream: bool = False,
    reasoning_effort: Optional[str] = None,
    options: Dict = G_OPTIONS,  # default options with num_ctx set to max
    thinking: bool = G_THINKING,  #
    **kwargs
) -> Union[Dict, Generator[Dict, None, None]]:
    """
    Sends a conversation to Ollama and returns the model's response.
    This function is designed to be beautiful, robust, and leverage the latest Ollama features.

    Args:
        messages (List[Dict[str, str]]): A list of message dictionaries.
                                         Each dict should have 'role' (e.g., 'user', 'assistant', 'system')
                                         and 'content' (the message text).
        model (str): The name of the model to use (e.g., "gemma3", "llama3.2", "nemotron-3-nano").
                     Defaults to G_MODEL. You can also use the ':cloud' suffix for cloud models.
        host (str, optional): The base URL of the Ollama server. If None, it tries to get it from the
                              OLLAMA_HOST environment variable, otherwise defaults to "http://localhost:11434".
        stream (bool): If True, returns a generator yielding response chunks. If False, returns the full response.
                       Defaults to False.
        reasoning_effort (str, optional): For reasoning models, controls the effort ("low", "medium", "high").
        options (Dict, optional): Additional options for the model (e.g., {"num_ctx": 32767}).
        thinking (bool, optional): Whether to include the "thinking" process in the response. Defaults to G_THINKING.
        **kwargs: Additional parameters to pass to the API (e.g., temperature, max_tokens, top_p).

    Returns:
        Union[Dict, Generator[Dict, None, None]]: If stream=False, returns a dictionary with the full response.
                                                   If stream=True, returns a generator yielding chunks.

    Raises:
        ConnectionError: If the Ollama server is not reachable.
        ValueError: If the input messages are invalid.
        requests.exceptions.RequestException: For other API-related errors.
    """

    # append G_APPEND_PROMPT to the last user message if set
    if G_APPEND_PROMPT and messages and messages[-1].get("role") == "user":
        messages[-1]["content"] += f"\n{G_APPEND_PROMPT}"


    payload = {
        "model": model,
        "messages": messages,
        "stream": stream,
        "think": thinking,
        "options": options,
    }

    # force num_ctx and output token sizes from the globals if not set on request
    if "num_ctx" not in payload["options"]:
        payload["options"]["num_ctx"] = G_CONTEXT_WINDOW
    if "num_predict" not in payload["options"]:
        payload["options"]["num_predict"] = G_MAX_OUTPUT_TOKENS
    # force temp
    if "temperature" not in payload["options"]:
        payload["options"]["temperature"] = G_TEMP

    headers = {"Content-Type": "application/json"}

    # --- Configuration & Validation ---
    if not messages:
        raise ValueError("The 'messages' list cannot be empty.")

    # Determine host with environment variable fallback
    effective_host = _resolve_host(host)
    
    # --- Select Endpoint ---
    endpoint = f"{effective_host}/api/chat"
    
    if reasoning_effort:
        payload["reasoning_effort"] = reasoning_effort

    # Add any additional parameters from kwargs (e.g., temperature, max_tokens)
    payload.update(kwargs)
    
    # --- Make the Request ---
    try:
        if stream:
            # For streaming, we return a generator
            return _stream_response(endpoint, headers, payload)
        else:
            # For non-streaming, make a single request
            response = requests.post(endpoint, headers=headers, json=payload, timeout=120)
            try:
                response.raise_for_status()
            except requests.exceptions.HTTPError as e:
                body = getattr(getattr(e, "response", None), "text", "")
                if reasoning_effort and _is_reasoning_effort_unsupported_error(response.status_code, body):
                    retry_payload = payload.copy()
                    retry_payload.pop("reasoning_effort", None)
                    response = requests.post(
                        endpoint, headers=headers, json=retry_payload, timeout=600)
                    response.raise_for_status()
                    return response.json()
                raise
            return response.json()
        
    except requests.exceptions.ConnectionError as e:
        raise ConnectionError(f"Could not connect to Ollama server at {effective_host}. "
                              f"Please ensure it's running (ollama serve).") from e
    except requests.exceptions.Timeout as e:
        raise requests.exceptions.Timeout("Request to Ollama timed out. Consider increasing the timeout.") from e
    except requests.exceptions.RequestException as e:
        # Re-raise other request exceptions
        raise e


def _stream_response(endpoint: str, headers: Dict, payload: Dict) -> Generator[Dict, None, None]:
    """
    Internal generator to handle streaming responses.
    Yields parsed JSON chunks from the server-sent events stream.
    """
    payload["stream"] = True  # Ensure stream is enabled
    with requests.post(endpoint, headers=headers, json=payload, stream=True, timeout=120) as response:
        response.raise_for_status()
        for line in response.iter_lines(decode_unicode=True):
            chunk = _parse_stream_line(line)
            if chunk is None:
                continue
            if chunk is _STREAM_DONE:
                break
            yield chunk

            if isinstance(chunk, dict):
                if chunk.get("done"):
                    break
                choices = chunk.get("choices") or []
                if choices:
                    finish_reason = choices[0].get("finish_reason")
                    if finish_reason in {"stop", "length", "content_filter"}:
                        break


def llm_nonstream(conv=[], thinking=True, options=G_OPTIONS):

    ret_dict = {
        "reasoning": "",
        "content": "",
        "usage": {},
        "time_taken": 0,
    }

    print("\n--- (Non-Streaming) ---") # 98.14 tokens/second # ~ 25-30% faster than streaming ;)
    try:
        time_start = time.time()    

        response = chat_with_ollama(
            messages=conv,
            model=G_MODEL, 
            temperature=G_TEMP,
            reasoning_effort="medium",  # New parameter for reasoning models, # don't really care as the models that use it don't really work for me *wink* *wink* *nudge* *nudge*
            thinking=thinking,
            options=options
        )

        message = response['message']

        ret_dict["time_taken"] = time.time() - time_start
        ret_dict["reasoning"] = message.get('thinking') or ""
        ret_dict["content"] = message.get('content') or ""

        if "</think>" in ret_dict["content"]:
            # break it in two at the LAST occurrence of </think>
            parts = ret_dict["content"].rsplit("</think>", 1)
            r = parts[0].strip()
            c = parts[1].strip()

            ret_dict["reasoning"] = (ret_dict["reasoning"] or "") + r
            ret_dict["content"] = c


        # tokens are roughly 3.245 characters so calculate estimated reason/output/total tokens based on that.
        reasoning_tokens = len(ret_dict.get('reasoning', '')) / 3.245
        content_tokens = len(ret_dict.get('content', '')) / 3.245
        total_tokens = reasoning_tokens + content_tokens
        # round to nearest whole number
        reasoning_tokens = round(reasoning_tokens)
        content_tokens = round(content_tokens)
        total_tokens = round(total_tokens)

        ret_dict["usage"] = {
            "reasoning_tokens": reasoning_tokens,
            "content_tokens": content_tokens,
            "total_tokens": total_tokens,
        }


    except Exception as e:
        print(f"Error: {e}")

    return ret_dict

def llm_stream(conv=[], thinking=True, options=G_OPTIONS):
    print("\n--- Streaming Example ---") # 80.61 tokens/second

    ret_dict = {
        "reasoning": "",
        "content": "",
        "usage": {},
        "time_taken": 0,
    }
    
    try:
        stream = chat_with_ollama(
            messages=conv,
            model=G_MODEL,
            stream=True,
            thinking=thinking,
            options=options
        )

        print("Streaming response: ", end="")
        time_start = time.time()
        token_count = 0
        token_count_reasoning = 0
        token_count_content = 0
        reason_str = ""
        response_str = ""
        in_reasoning = True
        for chunk in stream:
            # print(f"\n--\nChunk received: {chunk}\n--\n")  # Debug print for each chunk

            token_count += 1

            if chunk.get('choices'):

                delta = chunk['choices'][0].get('delta', {})

                if delta.get('reasoning'):
                    print(delta['reasoning'], end="", flush=True)
                    reason_str += delta['reasoning']
                if delta.get('content'):
                    if in_reasoning and delta.get('content'):
                        print("\n--- End of Reasoning, Start of Content ---")
                        in_reasoning = False
                    print(delta['content'], end="", flush=True)
                    response_str += delta['content']

            # handle Ollama style
            elif chunk.get('message'):
                message = chunk['message']
                if message.get('thinking'):
                    print(message['thinking'], end="", flush=True)
                    reason_str += message['thinking']
                if message.get('content'):
                    if in_reasoning and message.get('content'):
                        print("\n--- End of Reasoning, Start of Content ---")
                        in_reasoning = False

                    print(message['content'], end="", flush=True)
                    response_str += message['content']
                # update token counts based on whether we're in reasoning or content
                if in_reasoning:
                    token_count_reasoning += 1
                else:
                    token_count_content += 1

        print()  # Newline after stream

        # handle a </think> tag left in the content (split reasoning from answer)
        if "</think>" in response_str:
            parts = response_str.rsplit("</think>", 1)
            reason_str += parts[0].strip()
            response_str = parts[1].strip()

        # estimate token counts from character counts (tokens are roughly 3.245 characters)
        token_count_reasoning = round(len(reason_str) / 3.245)
        token_count_content = round(len(response_str) / 3.245)
        token_count = token_count_reasoning + token_count_content

        ret_dict["usage"] = {
            "reasoning_tokens": token_count_reasoning,
            "content_tokens": token_count_content,
            "total_tokens": token_count,
        }


        # update ret_dict with final values
        ret_dict["reasoning"] = reason_str
        ret_dict["content"] = response_str
        ret_dict["time_taken"] = time.time() - time_start
        ret_dict["generation_speed"] = token_count / (time.time() - time_start) if time.time() - time_start > 0 else 0

    except Exception as e:
        print(f"Error: {e}")

    return ret_dict


# --- Example Usage ---
if __name__ == "__main__":
    # Basic example
    conversation = [
        {"role": "system", "content": "You are a helpful, concise assistant."},
        {"role": "user", "content": "Say hello in an alien language:"}
    ]
    
    # ret_dict = llm_nonstream(conversation, thinking=G_THINKING) # NON-STREAMING
    ret_dict = llm_stream(conversation, thinking=G_THINKING) # STREAMING

    # output reasoning and content separately for clarity
    print(f"\n--- Reasoning ---\n")
    print(ret_dict["reasoning"])
    print(f"\n--- Content ---\n")
    print(ret_dict["content"])

    # output token counts and timing info
    print(f"\n--- Token Counts and Timing Info ---\n")
    print(f"Estimated Reasoning Tokens: {ret_dict['usage'].get('reasoning_tokens', 'N/A')}")
    print(f"Estimated Content Tokens: {ret_dict['usage'].get('content_tokens', 'N/A')}")
    print(f"Estimated Total Tokens: {ret_dict['usage'].get('total_tokens', 'N/A')}")
    print(f"Total Time: {ret_dict.get('time_taken', 'N/A'):.2f} seconds")
    if ret_dict.get('time_taken', 0) > 0:
        print(f"Average Speed: {ret_dict['usage'].get('total_tokens', 0) / ret_dict['time_taken']:.2f} tokens/second")

    print("\n\n\n")

Qwen3-1.77B-g023-GGUF-Q8_0 β€” GGUF Q8_0 (8 Bit Quantized)

Qwen3-1.77B-g023-GGUF-BF16 β€” GGUF BF16 (16 Bit Full Precision, No Quantization Loss)

NOTE:

Fixed the think issue in the chat template so Ollama users aren't left in the dark (sorry :( ). Redownload to get the correct version. This is a Qwen3 model, so if you really want to deactivate the "think" part you also have to add no_think to your prompts. This thing is boss mode. Load up the context. VRAM is cheap. It's fast.
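As a small illustration of the note above, here is one way to append the marker before sending a prompt (the helper name is made up, and the exact marker your chat template expects may differ):

```python
def with_no_think(prompt: str) -> str:
    # Hypothetical helper: append the no_think marker described above
    # so a Qwen3-style template skips the <think> block.
    return prompt + "\nno_think"

print(with_no_think("Say hello in an alien language:"))
```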

Overview

GGUF versions of Qwen3-1.77B-g023, an optimized 29-layer variant of Qwen3-1.7B created by duplicating layer 21. Full precision (no quantization loss) and Q8_0 (minimal loss). I have an NF4 available if you want to go even lower (https://huggingface.co/g023/Qwen3-1.77B-g023-NF4). Converted using llama.cpp convert_hf_to_gguf.py.
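The layer-duplication idea can be sketched abstractly; this only illustrates the 28-to-29 layer count described above, not the actual conversion code:

```python
# Qwen3-1.7B has 28 decoder layers; duplicating layer 21 (0-indexed here
# purely for illustration) yields the 29-layer variant described above.
layers = list(range(28))       # stand-ins for the original decoder layers
layers.insert(22, layers[21])  # insert a copy right after layer 21
print(len(layers))             # 29
```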

Some Ollama settings that seem to work well for me for this model:

        data["options"] = {
            # "num_predict": 4000, # if you want to restrict output
            "top_k": 10, 
            "top_p": 0.34, 
            "min_p": 0.1,
            "typical_p": 0.2,
            "repeat_last_n": 16384,
            "temperature": 1.0,
            "repeat_penalty": 5.2,
            "presence_penalty": 0.5,
            "frequency_penalty": 1.0,
            "mirostat": 2,
            "mirostat_tau": 0.8,
            "mirostat_eta": 0.6,
            "penalize_newline": True,
            "num_ctx": 16384, # 
            "num_thread": 1        
        }
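For reference, settings like these go into the "options" field of an Ollama /api/chat request body. A minimal sketch using only the standard library (the model name and prompt are just examples; the request line is left commented since it needs a running server):

```python
import json
import urllib.request

# Build an /api/chat payload carrying a subset of the options above.
payload = {
    "model": "hf.co/g023/Qwen3-1.77B-g023-GGUF:Q8_0",
    "messages": [{"role": "user", "content": "Say hello.\nno_think"}],
    "stream": False,
    "options": {"top_k": 10, "top_p": 0.34, "temperature": 1.0, "num_ctx": 16384},
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))  # requires a running Ollama server
```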

LM Studio works well too. Just search for g023 in the model list.

Q8_0 Quantization Details

  • Format: GGUF
  • Type: Q8_0 (8-bit, round-to-nearest)
  • Size on Disk: 1.8 GB (vs 3.4 GB BF16)
  • Compression Ratio: ~1.9x vs BF16
  • Quantization Tool: llama.cpp convert_hf_to_gguf.py
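The compression ratio above is just the quotient of the two disk sizes:

```python
# Disk sizes from the table above, in GB.
bf16_gb, q8_gb = 3.4, 1.8
print(round(bf16_gb / q8_gb, 2))  # roughly 1.9x
```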

BF16 Quantization Details

  • Format: GGUF
  • Type: BF16 (BFloat16, lossless)
  • Size on Disk: 3.4 GB
  • Compression: None (full precision)
  • Quantization Tool: llama.cpp convert_hf_to_gguf.py

Why Q8_0?

Q8_0 is one of the highest-quality quantized GGUF formats. It uses symmetric round-to-nearest quantization with 8 bits per weight, providing nearly identical output quality to full precision while roughly halving memory usage. It is a strong choice when you want quantization savings without measurable quality degradation.

Why BF16?

BF16 is the native training dtype for this model. Using BF16 GGUF preserves the exact weight values with zero quantization loss, making it the highest-quality GGUF option. BF16 has the same dynamic range as FP32 with reduced mantissa bits, which is ideal for inference.
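BF16 keeps the float32 sign and 8 exponent bits and truncates the mantissa to 7 bits, which is why it shares FP32's dynamic range. A stdlib sketch of the bit-level truncation:

```python
import struct

def bf16_truncate(x: float) -> float:
    # Keep the top 16 bits of the float32 pattern (sign + 8 exp + 7 mantissa).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(bf16_truncate(1.0))      # exactly representable, unchanged
print(bf16_truncate(3.14159))  # nearby bf16 value with a shorter mantissa
```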

Features

  • Thinking mode: Full <think> / </think> support
  • Non-thinking mode: Direct responses without chain-of-thought
  • Developer role: Supports developer message role (rendered as system)
  • Tool calling: Full tool/function calling support
  • System prompts: Standard system message support
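For the tool-calling feature, Ollama's /api/chat accepts a "tools" list of JSON-schema function descriptions alongside "messages". A minimal sketch of the shape (the function name and fields here are made-up examples):

```python
# Hypothetical tool definition in the shape Ollama's /api/chat expects;
# pass it as {"model": ..., "messages": ..., "tools": tools}.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # example function, not a real API
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
print(tools[0]["function"]["name"])
```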

Source Model

  • Layers: 29 (28 original + layer 21 duplicated)
  • Hidden Size: 2048
  • Vocab Size: 151,936
  • Total Parameters: ~1.77B
  • Base Model: Qwen/Qwen3-1.7B
  • Overall Score: 93.6 / 100
  • Factual Accuracy: 9 / 9

System Requirements

  • RAM: >= 4 GB (CPU inference)
  • VRAM: >= 4 GB (GPU inference)

License

Apache 2.0
