Chat Completions

Creates a chat completion. Compatible with the OpenAI Chat Completions API.

POST/v1/chat/completions

Request body#

Parameter	Type	Required	Description
model	string	Required	Model ID. See the full list at `GET /v1/models`.
messages	array	Required	Array of `{role, content}`. role is system \| user \| assistant.
temperature	number	Optional	0.0 to 2.0. Default 0.7.
max_tokens	integer	Optional	1 to 128,000. Default 4096. Used to estimate the credit pre-hold amount, so set only as much as you need.
stream	boolean	Optional	If true, returns an SSE streaming response. Default false.
top_p	number	Optional	0.0 to 1.0. Nucleus sampling.
stop	string \| string[]	Optional	Sequences that stop generation.
tools	array	Optional	Function-calling tool definitions (OpenAI format). See Tool Calling.
tool_choice	string \| object	Optional	`"auto"` · `"none"`, or an object forcing a specific tool.
response_format	object	Optional	E.g. `{"type": "json_object"}`. See Structured Outputs.
seed	integer	Optional	Deterministic sampling seed. Support varies by model.
frequency_penalty	number	Optional	-2.0 to 2.0.
presence_penalty	number	Optional	-2.0 to 2.0.
parallel_tool_calls	boolean	Optional	Whether parallel tool calls are allowed.
logit_bias	object	Optional	Map of token ID → bias value.
reasoning_effort	string	Optional	`minimal` \| `none` \| `low` \| `medium` \| `high` \| `xhigh` \| `max`. Automatically translated to each provider's format. See Reasoning Models.
thinking	boolean	Optional	Convenience alias for `reasoning_effort` — `true`→medium, `false`→none. If both are sent, `reasoning_effort` wins.
service_tier	string	Optional	`auto` \| `default` \| `flex` \| `background` \| `priority`. Passed through to OpenAI-family providers (latency/cost tradeoff).
reasoning_mode	string	Optional	`standard` \| `pro`. GPT-5.6 Sol/Terra/Luna quality-first (pro) mode.
plugins	array	Optional	`[{"id": "web", ...}]` web search plugin. See Web Search.
provider	object	Optional	`{order, only, ignore, sort, max_price}` provider routing preferences. See Routing Policies.
trace_id	string	Optional	Max 128 chars. Sticky seed for the weighted routing policy — the same `trace_id` always routes to the same model.

The standard sampling / tool parameters (top_p through reasoning_effort) are passed through to the routed provider as-is, without validation — a parameter the model does not support may be ignored by the provider or cause it to return an error. n (multiple choices) is not supported — a single response is always returned.

request body

{
  "model": "gpt-4.1",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
  ],
  "temperature": 0.7,
  "max_tokens": 4096,
  "stream": false
}

Response#

In addition to OpenAI-style choices / usage, PleumRouter also returns cost (KRW cost, FX rate, markup) and provider (the actual routing result).

200 OK

{
  "id": "chatcmpl-gpt-4.1-841ms",
  "object": "chat.completion",
  "model": "gpt-4.1",
  "provider": "openai",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 12,
    "total_tokens": 36
  },
  "cost": {
    "usd": 0.000144,
    "krw": 1,
    "fx_rate": 1525.0,
    "markup_rate": 0.0
  }
}

Streaming#

With stream: true, text chunks are sent as text/event-stream, and cost information is included in the final event.

SSE stream

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! "}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"How can I help you?"}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":8,"total_tokens":20},"cost":{"usd":0.000144,"krw":1,"fx_rate":1525.0,"markup_rate":0.0}}

data: [DONE]

Prompt caching#

PleumRouter supports prompt caching. OpenAI, Google Gemini, and DeepSeek cache automatically with no extra configuration — repeated input prefixes (long system prompts, reference documents, etc.) are cached on the provider side, and cache hits are billed at a discounted rate below the standard input price, lowering your cost.

Anthropic (Claude) uses explicit caching. Add cache_control: {"type": "ephemeral"} to a content part to mark a cache breakpoint; the prefix up to that point is cached.

request body (explicit caching)

{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "<large reusable context: docs, schema, instructions...>",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {"role": "user", "content": "Answer based on the context above."}
  ]
}

On a cache hit, the response usage includes prompt_tokens_details.cached_tokens (hit tokens). Anthropic also returns cache-write tokens as cache_creation_input_tokens. For other providers the cache token counts are exposed but no discount is applied.

usage (cache hit)

"usage": {
  "prompt_tokens": 10240,
  "completion_tokens": 120,
  "total_tokens": 10360,
  "prompt_tokens_details": {"cached_tokens": 10000},
  "cache_creation_input_tokens": 0
}

Billing#

At request time, credits equal to the estimated cost are pre-held (frozen), and once the call finishes they are settled against actual token usage. If the call fails, the hold is fully released, and a successful call is charged a minimum of ₩1.

On a temporary provider outage, automatic retries (twice) are attempted. If it still fails, a 502 is returned and no credits are charged.