Chat Completions API

Create a chat completion

Generates a model response for the given chat conversation. This is a standard chat completion API that supports both streaming and non-streaming modes; conversation persistence is off by default and opt-in via the storage options below.

**Streaming Mode (stream=true):**
- Returns Server-Sent Events (SSE) with real-time streaming
- Streams completion chunks directly from the inference model
- The final event contains a "[DONE]" marker

**Non-Streaming Mode (stream=false or omitted):**
- Returns a single JSON response with the complete completion
- Standard OpenAI ChatCompletionResponse format

**Storage Options:**
- `store=true`: Persist the latest input message and assistant response to the active conversation
- `store_reasoning=true`: Additionally persist reasoning content provided by the model
- When `store` is omitted or false, the conversation remains read-only

**Features:**
- Supports all OpenAI ChatCompletionRequest parameters
- Optional conversation context for conversation persistence
- User authentication required
- Direct inference model integration
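As an illustrative sketch (exact chunk fields depend on the model and deployment), a streaming response arrives as SSE events in the standard OpenAI chunk format, terminated by the "[DONE]" marker:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"}}]}

data: [DONE]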

POST
/v1/chat/completions
Authorization: <token>

Type "Bearer" followed by a space and JWT token.

In: header

Chat completion request with streaming options and optional conversation

chat_template_kwargs?

ChatTemplateKwargs provides a way to add non-standard parameters to the request body. These additional kwargs are passed to the template renderer and are accessible from the chat template, such as the thinking mode toggle for Qwen3: "chat_template_kwargs": {"enable_thinking": false}. See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes
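For example (a sketch; the model name is a placeholder), a request that disables Qwen3's thinking mode through the template renderer:

{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Hello"}],
  "chat_template_kwargs": {"enable_thinking": false}
}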

conversation?object
deep_research?boolean

DeepResearch enables the Deep Research mode which uses a specialized prompt for conducting in-depth investigations with tool usage. Requires a model with supports_reasoning: true capability.

enable_thinking?boolean

EnableThinking controls whether reasoning/thinking capabilities should be used. Defaults to true. When set to false for a model with supports_reasoning: true and an instruct model configured, the instruct model will be used instead.
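A minimal sketch (model name hypothetical) that skips reasoning and routes to the configured instruct model:

{
  "model": "my-reasoning-model",
  "messages": [{"role": "user", "content": "Summarize this in one sentence."}],
  "enable_thinking": false
}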

frequency_penalty?number
function_call?unknown

Deprecated: use ToolChoice instead.

functions?

Deprecated: use Tools instead.

guided_choice?array<string>

GuidedChoice is a vLLM-specific extension that restricts the model's output to one of the predefined string choices provided in this field. This feature is used to constrain the model's responses to a controlled set of options, ensuring predictable and consistent outputs in scenarios where specific choices are required.
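For instance (a sketch; values are illustrative), to force the model to answer with exactly one of a fixed set of labels:

{
  "model": "my-model",
  "messages": [{"role": "user", "content": "Is this review positive or negative? The food was great."}],
  "guided_choice": ["positive", "negative"]
}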

logit_bias?

LogitBias keys must be token ID strings (the token's ID in the tokenizer), not word strings. Incorrect: "logit_bias": {"You": 6}; correct: "logit_bias": {"1639": 6}. Refs: https://platform.openai.com/docs/api-reference/chat/create#chat/create-logit_bias

logprobs?boolean

LogProbs indicates whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message. This option is currently not available on the gpt-4-vision-preview model.

max_completion_tokens?integer

MaxCompletionTokens is an upper bound for the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens. https://platform.openai.com/docs/guides/reasoning

max_tokens?integer

MaxTokens is the maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via the API. Deprecated: use MaxCompletionTokens; not compatible with o1-series models. Refs: https://platform.openai.com/docs/api-reference/chat/create#chat-create-max_tokens

messages?
metadata?

Metadata to store with the completion.

model?string
n?integer
parallel_tool_calls?unknown

Disable the default parallel tool calling behavior by setting this to false.

prediction?
presence_penalty?number
reasoning_effort?string

Controls effort on reasoning for reasoning models. It can be set to "low", "medium", or "high".

repetition_penalty?number
response_format?
safety_identifier?string

A stable identifier used to help detect users of your application that may be violating OpenAI's usage policies. The ID should be a string that uniquely identifies each user. We recommend hashing their username or email address in order to avoid sending us any identifying information. https://platform.openai.com/docs/api-reference/chat/create#chat_create-safety_identifier

seed?integer
service_tier?string
Value in"auto" | "default" | "flex" | "priority"
stop?array<string>
store?boolean

Store controls whether the latest input and generated response should be persisted.

store_reasoning?boolean

StoreReasoning controls whether reasoning content (if present) should also be persisted.
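A sketch of a persisting request (the model name is a placeholder; the shape of the conversation object is not specified here):

{
  "model": "my-model",
  "messages": [{"role": "user", "content": "Remember this preference."}],
  "store": true,
  "store_reasoning": true
}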

stream?boolean
stream_options?
temperature?number
tool_choice?unknown

This can be either a string or a ToolChoice object.
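Following the OpenAI convention (a sketch; the function name is hypothetical), tool_choice accepts either a mode string or an object pinning a specific tool:

"tool_choice": "auto"

"tool_choice": {"type": "function", "function": {"name": "get_weather"}}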

tools?
top_k?integer
top_logprobs?integer

TopLogProbs is an integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
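For example (a sketch), to return the two most likely tokens at each position; logprobs must also be enabled:

{
  "model": "my-model",
  "messages": [{"role": "user", "content": "Hello"}],
  "logprobs": true,
  "top_logprobs": 2
}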

top_p?number
user?string
verbosity?string

Verbosity determines how many output tokens are generated. Lowering the number of tokens reduces overall latency. It can be set to "low", "medium", or "high". Note: this field is only confirmed to work with gpt-5, gpt-5-mini, and gpt-5-nano. It is not yet listed in the chat completion API reference at the time of writing, though the API supports it.

Response Body

Example request (the host, token, and model are placeholders):

curl -X POST "https://<your-host>/v1/chat/completions" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
"string"
{
  "code": "string",
  "error": "string",
  "message": "string",
  "request_id": "string"
}