EPP HTTP Headers Reference

This document describes the HTTP headers that the Endpoint Picker (EPP) inspects on incoming inference requests for request classification, flow control, and performance management, and the headers it sets on responses.

Request Classification and Flow Control

These headers allow the EPP to identify a request's objective, group requests for fair resource allocation, and handle model-specific targeting.

Header	Description
`x-gateway-inference-objective`	Specifies the name of the `InferenceObjective` resource associated with the request. The EPP uses this to look up the corresponding objective resource in the same namespace as the InferencePool to apply the defined priority and performance goals.
`x-gateway-inference-fairness-id`	Provides a unique identifier for grouping requests for fairness-based flow control. Requests with the same ID share capacity according to the fairness policy. If omitted, the EPP defaults to `default-flow`.
`x-gateway-model-name-rewrite`	Specifies the target model name to be used for the request. This is an alternative approach to model name rewriting; while the `InferenceModelRewrite` API provides rule-based rewriting on the server side, this header allows for an explicit, per-request override. When present, the EPP uses this value to override the model name in the request body and for recording model-specific metrics.
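
As an illustration, a client might assemble these headers as shown in the sketch below. The helper name `build_classification_headers` and the example values (`latency-sensitive`, `tenant-a`, `my-model-v2`) are hypothetical and not part of the EPP. Only the header names come from the table above.

```python
from typing import Dict, Optional

# Header names taken from the table above.
OBJECTIVE_HEADER = "x-gateway-inference-objective"
FAIRNESS_ID_HEADER = "x-gateway-inference-fairness-id"
MODEL_REWRITE_HEADER = "x-gateway-model-name-rewrite"


def build_classification_headers(
    objective: Optional[str] = None,
    fairness_id: Optional[str] = None,
    model_name_rewrite: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical client-side helper: return only the headers that were set.

    Omitting fairness_id leaves the header out, in which case the EPP
    falls back to its documented default of `default-flow`.
    """
    headers: Dict[str, str] = {}
    if objective:
        headers[OBJECTIVE_HEADER] = objective
    if fairness_id:
        headers[FAIRNESS_ID_HEADER] = fairness_id
    if model_name_rewrite:
        headers[MODEL_REWRITE_HEADER] = model_name_rewrite
    return headers


# Example values are placeholders, not real resource names.
headers = build_classification_headers(
    objective="latency-sensitive",
    fairness_id="tenant-a",
    model_name_rewrite="my-model-v2",
)
```

The resulting dictionary can be passed as request headers to any HTTP client.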

Service Level Objectives (SLOs)

These headers are used by admission control and load balancing plugins to make decisions based on latency targets.

Header	Description
`x-slo-ttft-ms`	Specifies the target Time To First Token (TTFT) in milliseconds. Used by plugins to determine if a request can be admitted while meeting the latency goal.
`x-slo-tpot-ms`	Specifies the target Time Per Output Token (TPOT) in milliseconds. Used for admission control based on predicted or observed token generation latency.
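
A minimal client-side sketch of setting the latency targets follows. It assumes the values are sent as whole, positive millisecond counts in decimal form; the table above states only that the unit is milliseconds, so the exact accepted format is an assumption. The helper name `build_slo_headers` is hypothetical.

```python
from typing import Dict, Optional


def build_slo_headers(
    ttft_ms: Optional[int] = None,
    tpot_ms: Optional[int] = None,
) -> Dict[str, str]:
    """Hypothetical helper: format latency targets as SLO request headers.

    Assumption: targets are positive whole milliseconds sent as decimal strings.
    """
    headers: Dict[str, str] = {}
    for name, value in (("x-slo-ttft-ms", ttft_ms), ("x-slo-tpot-ms", tpot_ms)):
        if value is None:
            continue  # Target not requested; leave the header out.
        if value <= 0:
            raise ValueError(f"{name} must be a positive number of milliseconds")
        headers[name] = str(value)
    return headers


# Example: ask for a first token within 500 ms and 50 ms per output token.
slo_headers = build_slo_headers(ttft_ms=500, tpot_ms=50)
```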

Response Headers

These headers are set by the EPP on responses sent back to the client.

Header	Description
`x-llm-d-request-dropped-reason`	Indicates why a request was dropped by flow control. Only present on `429` responses generated by the EPP (not forwarded from a model server). The value uses a two-prefix scheme: `rejected-` means the request never reached an inference server, and `evicted-` means it was dispatched and then terminated. See the table below for possible values.

Dropped Reason Values

Value	Meaning
`rejected-saturated`	System at capacity, request rejected before queueing.
`rejected-ttl-expired`	Request entered the queue but its TTL expired before dispatch.
`rejected-context-cancelled`	Client disconnected while the request was queued.
`evicted`	Generic post-dispatch eviction.
`evicted-queue-pressure`	Evicted after dispatch due to backpressure from new arrivals.
`evicted-priority`	Evicted after dispatch, preempted by a higher-priority request.

The two-prefix convention helps consumers choose a retry strategy: `rejected-*` means no inference work was done (cheap to retry), while `evicted-*` means GPU cycles were consumed (factor this into backoff).
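
The following sketch shows one way a client might act on this convention. The function names, the base delay, and the 4x multiplier for evictions are illustrative assumptions, not EPP behavior or a recommended policy. The sketch also treats the bare value `evicted` as an eviction, since that value carries no suffix.

```python
from typing import Mapping, Optional

DROPPED_REASON_HEADER = "x-llm-d-request-dropped-reason"


def drop_category(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return 'rejected', 'evicted', or None if the EPP did not report a drop."""
    if status != 429:
        return None  # The header is only documented for 429 responses.
    # Match the header name case-insensitively.
    lowered = {name.lower(): value for name, value in headers.items()}
    reason = lowered.get(DROPPED_REASON_HEADER, "").strip()
    if reason.startswith("rejected-"):
        return "rejected"
    if reason == "evicted" or reason.startswith("evicted-"):
        return "evicted"
    return None  # A 429 without a recognized reason, e.g. from a model server.


def retry_delay_seconds(category: str, attempt: int, base: float = 0.5) -> float:
    """Hypothetical policy: exponential backoff, longer after an eviction."""
    multiplier = 1.0 if category == "rejected" else 4.0
    return base * multiplier * (2 ** attempt)


category = drop_category(429, {"X-LLM-D-Request-Dropped-Reason": "evicted-priority"})
delay = retry_delay_seconds(category, attempt=1)
```

A `429` without the header falls through to `None`, which lets the caller apply its ordinary rate-limit handling.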

Implementation Notes

Case Sensitivity: All header lookups are case-insensitive.
Source: The request header values are typically provided as standard HTTP headers in the incoming request.
