vLLM or Llama.cpp output

by johnlockejrr - opened Jan 24

Jan 24

Should I also set a system prompt alongside the user prompt? The model's output from vLLM or llama.cpp is never like the output from https://chat.dicta.org.il/chat. Thanks!

Shaltiel

DICTA: The Israel Center for Text Analysis org Jan 24

edited Jan 24

The chat interface uses a slightly more elaborate system prompt to enhance the experience (e.g., encouraging more concise responses), but the output shouldn't vary that much. You can set the system prompt however you see fit for your use case.

With llama.cpp, make sure you run with "--jinja" so that the model's chat template is applied; otherwise the output won't be good.
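
To make the flag hard to forget, the server command can be assembled in a small script. This is only a minimal sketch: it assumes the llama-server binary sits in the current directory, the helper name build_server_command is hypothetical, and the repo and file names are placeholders to be replaced with the GGUF you actually use.

```python
import shlex
from typing import List


def build_server_command(
    hf_repo: str,
    hf_file: str,
    ctx_size: int = 2048,
    use_jinja: bool = True,
) -> List[str]:
    """Assemble a llama-server argument list (hypothetical helper)."""
    cmd = [
        "./llama-server",
        "--hf-repo", hf_repo,
        "--hf-file", hf_file,
        "-c", str(ctx_size),
    ]
    if use_jinja:
        # Ask the server to use the model's Jinja chat template.
        cmd.append("--jinja")
    return cmd


# Placeholder names (assumptions): substitute your own repo and file.
cmd = build_server_command("<org>/<gguf-repo>", "<model-file>.gguf")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.Popen to start the server from Python.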

johnlockejrr

Jan 24

edited Jan 24

Not so sure. I ran the server like this:
./llama-server --hf-repo VRDate/DictaLM-3.0-1.7B-Thinking-Q4_K_M-GGUF --hf-file dictalm-3.0-1.7b-thinking-q4_k_m.gguf -c 2048
I even tried it with a script like this:

import argparse
from openai import OpenAI

# -----------------------------
# 1. CLI argument
# -----------------------------
parser = argparse.ArgumentParser()
parser.add_argument("--text", type=str, required=True)
args = parser.parse_args()

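# -----------------------------
# 2. System template (wrapper)
# -----------------------------
# English gloss of the Hebrew prompt below:
#   You are an expert language editor. Always follow the user's instructions exactly.
#   Correct Hebrew texts according to these rules:
#   - Always keep the original matres lectionis as they are.
#   - Do not change rabbinic plurals ending in -in to the -im form.
#   - Do not add vowel points (niqqud).
#   - Do not add words, remove words, rewrite, or complete verses.
#   - Only fix spelling errors, broken letters, or obvious mistakes.
#   - If no correction is needed, return the text exactly as it is.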
SYSTEM_TEMPLATE = """
אתה עורך לשוני מומחה. תמיד בצע את הוראות המשתמש במדויק.

תקן טקסטים עבריים על פי הכללים הבאים:

- שמור תמיד על אם‑הקריאה המקורית כפי שהיא.
- אל תשנה ריבוי רבני מסוג י״ן לצורת י״ם.
- אל תוסיף ניקוד.
- אל תוסיף מילים, אל תסיר מילים, אל תשכתב, ואל תשלים פסוקים.
- בצע רק תיקוני שגיאות כתיב, אותיות שבורות, או טעויות ברורות.
- אם אין צורך בתיקון — החזר את הטקסט בדיוק כפי שהוא.
"""

# -----------------------------
# 3. Build the client
# -----------------------------
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"
)

# -----------------------------
# 4. Build messages
#    (the user prompt means "Correct the following text: ...")
# -----------------------------
messages = [
    {"role": "system", "content": SYSTEM_TEMPLATE},
    {"role": "user", "content": f"תקן את הטקסט הבא: {args.text}"}
]

# -----------------------------
# 5. Call the model
# -----------------------------
response = client.chat.completions.create(
    model="dicta-il/DictaLM-3.0-1.7B-Thinking",
    messages=messages,
)

# -----------------------------
# 6. Print result
# -----------------------------
print(response.choices[0].message.content)

And the output:

python dicta.py --text "תהיירך לעזרני כי פקוריך בחרתי"
תהיירך לעזרני כי פקוריך בחרתי

But the result in the chat is better:

תהי יראתך לעזרני כי פקודיך בחרתי .

Yes, with --jinja it is a little better:

תהיי לך לעזרני כי פקודיך בחרתי

Shaltiel

DICTA: The Israel Center for Text Analysis org Jan 24

Can you please try it with the official GGUFs from our organization?
The VRDate GGUFs weren't converted well.

Shaltiel

DICTA: The Israel Center for Text Analysis org Jan 24

Also, you are testing here with the 1.7B model, while the chat serves the 24B model.
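
One way to catch this kind of mismatch early is to check which model the local server has actually loaded before comparing outputs. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible /v1/models endpoint at the base URL used in the script above; the helper names size_tag and served_model_ids are hypothetical.

```python
import re
from typing import List, Optional


def size_tag(model_id: str) -> Optional[str]:
    """Return the parameter-count tag (e.g. '24B') found in a model name, or None."""
    # Match a number followed by B/b that is not part of a longer token,
    # so version strings like "3.0" and quant tags like "Q8_0" are skipped.
    match = re.search(r"(?<![\w.])(\d+(?:\.\d+)?)[Bb](?![A-Za-z0-9])", model_id)
    return match.group(1) + "B" if match else None


def served_model_ids(base_url: str = "http://localhost:8080/v1") -> List[str]:
    """List the model ids reported by the server (requires a running server)."""
    # Imported lazily so size_tag can be used without the openai package.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="sk-no-key-required")
    return [model.id for model in client.models.list()]


print(size_tag("dicta-il/DictaLM-3.0-1.7B-Thinking"))  # 1.7B
print(size_tag("DictaLM-3.0-24B-Thinking-Q8_0.gguf"))  # 24B
```

With a server running, [size_tag(m) for m in served_model_ids()] shows at a glance whether the 1.7B or the 24B model is the one answering.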

johnlockejrr

Jan 24

edited Jan 24

Oh! My bad! Let me do that! Which one should I try? I have 96 GB of VRAM available.
I think DictaLM-3.0-24B-Thinking-Q8_0.gguf should perform well?
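
A rough back-of-the-envelope check that this fits. It is only a sketch: it assumes a nominal 24e9 parameters, assumes every tensor is stored as Q8_0 (32 int8 weights plus one 2-byte scale per block), and ignores the KV cache and compute buffers, which need extra memory on top.

```python
def q8_0_weight_bytes(n_params: float) -> float:
    """Approximate size of the weights when every tensor is stored as Q8_0."""
    block_weights = 32        # weights per Q8_0 block
    block_bytes = 32 + 2      # 32 int8 values + one fp16 scale
    return n_params / block_weights * block_bytes


n_params = 24e9  # nominal parameter count (assumption based on the "24B" name)
weights_gb = q8_0_weight_bytes(n_params) / 1e9
print(f"{weights_gb:.1f} GB")  # 25.5 GB
```

So the weights alone should take roughly a quarter of the 96 GB, leaving plenty of room for context.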

Shaltiel

DICTA: The Israel Center for Text Analysis org Jan 24

Yes, that should work! Make sure to use the one from dicta-il.

johnlockejrr

Jan 24

edited Jan 24

Yes! Thanks! Different story now 🤤
