Create a chat completion
Generates a model response for the given chat conversation. This is a standard chat completion API that supports both streaming and non-streaming modes; by default, no conversation state is persisted.

**Streaming Mode (stream=true):**

- Returns Server-Sent Events (SSE) with real-time streaming
- Streams completion chunks directly from the inference model
- The final event contains a "[DONE]" marker

**Non-Streaming Mode (stream=false or omitted):**

- Returns a single JSON response with the complete completion
- Standard OpenAI ChatCompletionResponse format

**Storage Options:**

- `store=true`: Persist the latest input message and assistant response to the active conversation
- `store_reasoning=true`: Additionally persist reasoning content provided by the model
- When `store` is omitted or false, the conversation remains read-only

**Features:**

- Supports all OpenAI ChatCompletionRequest parameters
- Optional conversation context for conversation persistence
- User authentication required
- Direct inference model integration
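For example, a minimal request body for streaming mode might look like the following sketch (the model name is a placeholder; use one configured in your deployment):

```json
{
  "model": "qwen3",
  "messages": [
    {"role": "user", "content": "Summarize the trade-offs of streaming responses."}
  ],
  "stream": true
}
```

With `stream: true`, chunks arrive as SSE events and the stream ends with a "[DONE]" marker; omit the field (or set it to false) to receive a single ChatCompletionResponse instead.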
Type "Bearer" followed by a space and JWT token.
In: header
Chat completion request with streaming options and optional conversation
ChatTemplateKwargs provides a way to add non-standard parameters to the request body: additional kwargs passed to the template renderer and accessible from the chat template, such as the thinking mode switch for Qwen3. Example: "chat_template_kwargs": {"enable_thinking": false}. See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes
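In a full request, disabling Qwen3's thinking mode via this field might look like the following sketch (model name is a placeholder):

```json
{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "What is 2 + 2?"}],
  "chat_template_kwargs": {"enable_thinking": false}
}
```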
DeepResearch enables the Deep Research mode, which uses a specialized prompt for conducting in-depth investigations with tool usage. Requires a model with the `supports_reasoning: true` capability.
EnableThinking controls whether reasoning/thinking capabilities should be used. Defaults to true. When set to false for a model with supports_reasoning: true and an instruct model configured, the instruct model will be used instead.
Deprecated: use ToolChoice instead.
Deprecated: use Tools instead.
GuidedChoice is a vLLM-specific extension that restricts the model's output to one of the predefined string choices provided in this field. This feature is used to constrain the model's responses to a controlled set of options, ensuring predictable and consistent outputs in scenarios where specific choices are required.
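A sketch of constraining output with this extension, assuming the field is exposed as `guided_choice` as in vLLM (model name is a placeholder):

```json
{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Is this review positive or negative? \"Great product, fast shipping.\""}],
  "guided_choice": ["positive", "negative"]
}
```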
LogitBias keys must be token ID strings (tokens specified by their token ID in the tokenizer), not word strings.
Incorrect: "logit_bias": {"You": 6}; correct: "logit_bias": {"1639": 6}.
Refs: https://platform.openai.com/docs/api-reference/chat/create#chat/create-logit_bias
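In context, biasing a token in a full request might look like this sketch (model name and token ID are illustrative; look up the real token ID in your model's tokenizer):

```json
{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Continue the story."}],
  "logit_bias": {"1639": 6}
}
```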
LogProbs indicates whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message. This option is currently not available on the gpt-4-vision-preview model.
MaxCompletionTokens is an upper bound for the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens. https://platform.openai.com/docs/guides/reasoning
MaxTokens is the maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via the API. Deprecated: use MaxCompletionTokens instead. Not compatible with o1-series models. Refs: https://platform.openai.com/docs/api-reference/chat/create#chat-create-max_tokens
Metadata to store with the completion.
Disable the default behavior of parallel tool calls by setting it to false.
Controls effort on reasoning for reasoning models. It can be set to "low", "medium", or "high".
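For instance, assuming the field is exposed as `reasoning_effort` as in the OpenAI API (model name is a placeholder):

```json
{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Prove that 97 is prime."}],
  "reasoning_effort": "high"
}
```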
A stable identifier used to help detect users of your application that may be violating OpenAI's usage policies. The IDs should be a string that uniquely identifies each user. We recommend hashing their username or email address, in order to avoid sending us any identifying information. https://platform.openai.com/docs/api-reference/chat/create#chat_create-safety_identifier
"auto" | "default" | "flex" | "priority"Store controls whether the latest input and generated response should be persisted
StoreReasoning controls whether reasoning content (if present) should also be persisted
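A sketch of a request that persists both the exchange and any reasoning content (model name is a placeholder; persistence targets the active conversation as described above):

```json
{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Draft a three-step rollout plan."}],
  "store": true,
  "store_reasoning": true
}
```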
This can be either a string or a ToolChoice object.
TopLogProbs is an integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
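For example, requesting the three most likely alternatives at each token position (note that `logprobs` must be true; model name is a placeholder):

```json
{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Pick a color."}],
  "logprobs": true,
  "top_logprobs": 3
}
```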
Verbosity determines how many output tokens are generated. Lowering the number of tokens reduces overall latency. It can be set to "low", "medium", or "high". Note: This field is only confirmed to work with gpt-5, gpt-5-mini and gpt-5-nano. Also, it is not in the API reference of chat completion at the time of writing, though it is supported by the API.
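Assuming the field is exposed as `verbosity`, a low-latency request might look like this sketch (per the note above, only the gpt-5 family is confirmed to support it):

```json
{
  "model": "gpt-5-mini",
  "messages": [{"role": "user", "content": "Explain DNS in one paragraph."}],
  "verbosity": "low"
}
```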
Example request (base URL is a placeholder):

```bash
curl -X POST "https://{base_url}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{}'
```

Response Body

A successful request returns a standard ChatCompletionResponse (or an SSE text stream when `stream=true`). Error responses share the following schema:

```json
{
  "code": "string",
  "error": "string",
  "message": "string",
  "request_id": "string"
}
```
"code": "string",
"error": "string",
"message": "string",
"request_id": "string"
}{
"code": "string",
"error": "string",
"message": "string",
"request_id": "string"
}