"""vLLM chat-model builder — AMD MI300X served, OpenAI-compatible API.

vLLM serves Qwen 2.5 14B Instruct (or any other compatible model) on an AMD
Instinct MI300X via the OpenAI-compatible REST API. We use the ``ChatOpenAI``
adapter from ``langchain-openai`` with a custom ``base_url`` pointing at the
vLLM endpoint, not the OpenAI cloud.

Why ChatOpenAI:
  * vLLM exposes ``/v1/chat/completions`` in the OpenAI format
  * Tool calling works natively (Qwen 2.5 supports function calling)
  * ``with_structured_output()`` works via tool-binding
  * Streaming works via SSE

Required env vars (see ``.env.example``):
  * ``VLLM_BASE_URL`` — e.g. ``http://<mi300x-public-ip>:8000/v1``
  * ``VLLM_MODEL``    — e.g. ``Qwen/Qwen2.5-14B-Instruct``
  * ``VLLM_API_KEY``  — optional. When empty, the literal ``"EMPTY"`` is sent
                        as the key (vLLM running without auth). In production,
                        set a real key and start vLLM with ``--api-key``.
"""

from __future__ import annotations

from langchain_openai import ChatOpenAI

from config import settings


def build_vllm_chat() -> ChatOpenAI:
    """Default ChatOpenAI instance pointed at the AMD MI300X vLLM endpoint.

    The underlying HTTP client only connects on the first invocation. If the
    endpoint is unreachable, that call fails fast with a connection error
    rather than failing here at construction time, so dummy/Ollama profiles
    need not have ``VLLM_BASE_URL`` set.
    """
    return ChatOpenAI(
        model=settings.vllm_model,
        base_url=settings.vllm_base_url,  # the MI300X vLLM endpoint, not api.openai.com
        api_key=settings.vllm_api_key or "EMPTY",  # vLLM ignores the key when started without --api-key
        temperature=settings.vllm_temperature,
        max_tokens=settings.vllm_max_tokens,
        timeout=120,  # seconds per request; headroom for long completions
    )
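

# Usage sketch, not part of the builder: a minimal smoke test assuming a
# reachable vLLM endpoint and the env vars documented above. ``Weather`` is a
# made-up schema, used only to show ``with_structured_output()`` riding on
# Qwen 2.5's tool calling.
if __name__ == "__main__":
    from pydantic import BaseModel

    class Weather(BaseModel):
        """Toy schema for the structured-output demo."""

        city: str
        forecast: str

    llm = build_vllm_chat()

    # Plain completion.
    print(llm.invoke("Say hi in five words.").content)

    # Token streaming over SSE.
    for chunk in llm.stream("Count to three."):
        print(chunk.content, end="", flush=True)
    print()

    # Structured output via tool-binding; returns a Weather instance.
    print(llm.with_structured_output(Weather).invoke("Invent a weather report for Berlin."))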