Unified Inference Platform
Single CRD to manage inference workloads with automatic LeaderWorkerSet, PodGroup, InferencePool, and HTTPRoute generation.
Intelligent Routing
Built-in support for prefix-cache, KV-cache utilization, queue-size, LoRA affinity, and P/D disaggregation routing strategies.
Gang Scheduling
Automatic Volcano PodGroup integration ensures all pods for multi-node inference start together or not at all.
Multi-Node Inference
Native support for distributed inference with Ray integration and per-replica LeaderWorkerSet management.
P/D Disaggregation
First-class support for prefill/decode disaggregated architectures with automatic component separation and routing.
Gateway API Native
Seamless integration with Kubernetes Gateway API and Gateway API Inference Extension for production-grade traffic management.
Architecture Overview
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β InferenceService CR β
β roles: β
β - router (HTTPRoute, InferencePool, EPP) β
β - prefiller/decoder/worker (LeaderWorkerSet, PodGroup) β
βββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββ΄ββββββββββββββββ
β InferenceService Controller β
βββββββββββββββββ¬ββββββββββββββββ
β
βββββββββββββββββ¬ββββββββββββΌββββββββββββββββ¬ββββββββββββββββ
βΌ βΌ βΌ βΌ βΌ
βββββββββββββββββ βββββββββββββ βββββββββββββ βββββββββββββ βββββββββββββββββ
β HTTPRoute β β Inference β β EPP β β Leader β β PodGroup β
β (Gateway API) β β Pool β β Deploymentβ β WorkerSet β β (Volcano) β
βββββββββββββββββ βββββββββββββ βββββββββββββ βββββββββββββ βββββββββββββββββ