
Router Controller Design

Summary

The FusionInfer Router Controller automatically creates and manages Gateway API Inference Extension resources based on InferenceService CRDs. The integration leverages the Endpoint Picker (EPP) to give FusionInfer-managed inference workloads intelligent request routing, prefix-cache-aware scheduling, and load balancing.

When an InferenceService is created with a role whose componentType is router, the FusionInfer Router Controller automatically generates the corresponding InferencePool, EndpointPickerConfig, HTTPRoute, Gateway, and EPP deployment resources. This provides production-grade traffic management, including prefix-cache optimization, KV-cache-utilization-based routing, queue-aware scheduling, LoRA adapter affinity, and support for disaggregated prefill/decode architectures.

Motivation

Self-hosted LLM inference workloads on Kubernetes face significant challenges around request routing, load balancing, and resource utilization. Traditional load balancers are not aware of model-specific characteristics such as KV-cache state, prefix caching opportunities, or the distinction between prefill and decode phases. This leads to suboptimal performance, increased latency, and inefficient GPU utilization.

The Gateway API Inference Extension provides sophisticated routing capabilities specifically designed for inference workloads, but requires manual configuration of multiple CRDs and deep understanding of the plugin ecosystem. FusionInfer simplifies this by providing a higher-level abstraction with predefined routing strategies while still allowing advanced users to customize the full EndpointPickerConfig when needed.

Goals

  • Automatic Router Resource Generation: Automatically create and manage InferencePool, EndpointPickerConfig, HTTPRoute, Gateway, and EPP deployment based on InferenceService specifications
  • Predefined Routing Strategies: Provide simple, declarative routing strategies (prefix-cache, kv-cache-utilization, queue-size, lora-affinity, pd-disaggregation) for common use cases
  • Advanced Customization: Allow power users to provide custom EndpointPickerConfig for fine-grained control

Non-Goals

  • Workload Management: This design focuses only on router/gateway integration; actual pod/deployment management is handled by other FusionInfer components
  • Gateway Implementation: This does not implement a new gateway, but integrates with existing Gateway API compliant gateways via Envoy ext-proc

User Stories

Story 1: Prefix Cache Aware Routing

As a platform engineer, I want to deploy a Qwen model with prefix cache aware routing so that requests sharing the longest prefixes are routed to the same server to maximize KV-cache reuse.

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen-service
spec:
  roles:
  - name: router
    componentType: router
    strategy: "prefix-cache"
    httproute:
      parentRefs:
      - name: my-gateway
        namespace: gateway-system
      hostnames:
      - "qwen.example.com"
  - name: inference
    componentType: worker
    replicas: 3
    template:
      spec:
        containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
          - --model=Qwen/Qwen2.5-7B-Instruct

This would automatically create an InferencePool pointing to the inference pods and configure the EPP with prefix-cache-aware scheduling.

Story 2: KV-Cache Utilization Based Load Balancing

As an ML engineer, I want to distribute inference requests based on available KV-cache memory to prevent OOM errors and ensure even resource utilization across all inference servers.

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: balanced-llm-service
spec:
  roles:
  - name: router
    componentType: router
    strategy: "kv-cache-utilization"
    httproute:
      parentRefs:
      - name: my-gateway
        namespace: gateway-system
      hostnames:
      - "api.balanced.example.com"
  - name: inference
    componentType: worker
    replicas: 3
    template:
      spec:
        containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
          - --model=/models/llama-70b

This configuration routes requests to servers with the most available KV-cache memory, preventing memory exhaustion and ensuring balanced utilization.

Story 3: Disaggregated Prefill/Decode Architecture

As an infrastructure engineer, I want to deploy a disaggregated serving architecture where prefill (prompt processing) and decode (token generation) are handled by specialized server pools for optimal hardware utilization and latency.

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: disaggregated-llm-service
spec:
  roles:
  - name: router
    componentType: router
    strategy: "pd-disaggregation"
    httproute:
      parentRefs:
      - name: my-gateway
        namespace: gateway-system
      hostnames:
      - "llm.disaggregated.example.com"

  - name: prefill-servers
    componentType: prefiller
    replicas: 2
    template:
      spec:
        containers:
        - name: prefill
          image: vllm/vllm-openai:latest
          args:
          - --model=meta-llama/Llama-3-70B-Instruct
          - --kv-transfer-config
          - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

  - name: decode-servers
    componentType: decoder
    replicas: 3
    template:
      spec:
        containers:
        - name: decode
          image: vllm/vllm-openai:latest
          args:
          - --model=meta-llama/Llama-3-70B-Instruct
          - --kv-transfer-config
          - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

Proposal

The FusionInfer Router Controller watches for InferenceService resources that include roles with componentType: router. When detected, it automatically generates and manages Gateway API Inference Extension resources to enable intelligent request routing and load balancing for LLM inference workloads.

The router role supports two configuration approaches:

  1. Simple strategy-based: Use predefined routing strategies (prefix-cache, kv-cache-utilization, queue-size, lora-affinity, pd-disaggregation) for common use cases
  2. Advanced custom configuration: Provide a complete endpointPickerConfig for fine-grained control over scheduling plugins, profiles, and scoring algorithms

These configurations are translated into InferencePool, EndpointPickerConfig, HTTPRoute, Gateway, and EPP deployment resources that work together to optimize inference request routing based on factors like prefix cache hits, memory utilization, and workload characteristics.
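
The reconciliation flow can be sketched in Go roughly as follows, using the types defined in the next section. This is a minimal controller-runtime sketch: the builder and apply helpers (buildInferencePool, applyWithOwnership, and so on) are elided, and their names are illustrative assumptions rather than an actual API.

// Sketch only: builder helpers are elided and their names are illustrative.
package controller

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// RouterReconciler reconciles InferenceService objects that declare a role
// with componentType: router.
type RouterReconciler struct {
    client.Client
}

func (r *RouterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var svc InferenceService
    if err := r.Get(ctx, req.NamespacedName, &svc); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    for i := range svc.Spec.Roles {
        role := &svc.Spec.Roles[i]
        if role.ComponentType != ComponentTypeRouter {
            continue // worker/prefiller/decoder pods are managed by other FusionInfer components
        }
        // Render each dependent resource, then create or update it with the
        // InferenceService as owner so deletion cascades via garbage collection.
        for _, obj := range []client.Object{
            buildInferencePool(&svc, role),
            buildEPPConfigMap(&svc, role), // from strategy or custom endpointPickerConfig
            buildEPPDeployment(&svc, role),
            buildEPPService(&svc, role),
            buildHTTPRoute(&svc, role),
        } {
            if err := r.applyWithOwnership(ctx, &svc, obj); err != nil {
                return ctrl.Result{}, err
            }
        }
    }
    return ctrl.Result{}, nil
}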

Go Types

// InferenceServiceSpec defines the desired state of InferenceService
type InferenceServiceSpec struct {
    // Roles defines the components of the inference service
    Roles []RoleSpec `json:"roles"`
}

// RoleSpec defines a component in the inference pipeline
type RoleSpec struct {
    // Name is the identifier for this role
    Name string `json:"name"`

    // ComponentType specifies the type of component
    // +kubebuilder:validation:Enum=router;prefiller;decoder;worker
    ComponentType ComponentType `json:"componentType"`

    // Router-specific fields (only for componentType: router)
    Strategy             RoutingStrategy       `json:"strategy,omitempty"`
    HTTPRoute            *runtime.RawExtension `json:"httproute,omitempty"`            // Gateway API HTTPRouteSpec
    Gateway              *runtime.RawExtension `json:"gateway,omitempty"`              // Gateway API GatewaySpec
    EndpointPickerConfig string                `json:"endpointPickerConfig,omitempty"` // Raw YAML for advanced users

    // Worker-specific fields (for prefiller/decoder/worker)
    Replicas *int32                `json:"replicas,omitempty"`
    Template *runtime.RawExtension `json:"template,omitempty"` // corev1.PodTemplateSpec
}

// ComponentType defines the type of component
type ComponentType string

const (
    ComponentTypeRouter    ComponentType = "router"
    ComponentTypePrefiller ComponentType = "prefiller"
    ComponentTypeDecoder   ComponentType = "decoder"
    ComponentTypeWorker    ComponentType = "worker"
)

// RoutingStrategy defines the inference routing strategy
type RoutingStrategy string

const (
    StrategyPrefixCache      RoutingStrategy = "prefix-cache"
    StrategyKVCacheUtil      RoutingStrategy = "kv-cache-utilization"
    StrategyQueueSize        RoutingStrategy = "queue-size"
    StrategyLoRAAffinity     RoutingStrategy = "lora-affinity"
    StrategyPDDisaggregation RoutingStrategy = "pd-disaggregation"
)
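
Because router-specific and worker-specific fields share one RoleSpec, the controller needs cross-field validation. A minimal sketch in the same package (assumed behavior; the real controller might enforce this through a validating webhook or CEL rules instead, and treating strategy and endpointPickerConfig as mutually exclusive is an assumption consistent with the two configuration approaches above):

// Sketch of cross-field validation for RoleSpec; the rules are assumptions
// consistent with the configuration approaches described in this design.
// Assumes the same controller package plus a standard "fmt" import.
func validateRole(role *RoleSpec) error {
    isRouter := role.ComponentType == ComponentTypeRouter
    switch {
    case !isRouter && (role.Strategy != "" || role.HTTPRoute != nil ||
        role.Gateway != nil || role.EndpointPickerConfig != ""):
        return fmt.Errorf("role %q: router fields require componentType: router", role.Name)
    case isRouter && role.Strategy != "" && role.EndpointPickerConfig != "":
        return fmt.Errorf("role %q: strategy and endpointPickerConfig are mutually exclusive", role.Name)
    case isRouter && role.Strategy == "" && role.EndpointPickerConfig == "":
        return fmt.Errorf("role %q: a router role needs either strategy or endpointPickerConfig", role.Name)
    }
    return nil
}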

Configuration Examples

Example 1: Simple Strategy-based Configuration

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: my-service
spec:
  roles:
  - name: gateway
    componentType: router
    strategy: "prefix-cache"
    httproute:
      parentRefs:
      - name: my-gateway
        namespace: gateway-system
      hostnames:
      - "api.example.com"
  - name: inference
    componentType: worker
    replicas: 3
    template:
      spec:
        containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
          - --model=meta-llama/Llama-3-8B-Instruct

FusionInfer automatically generates the following resources:

# 1. HTTPRoute - Routes traffic to the InferencePool
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-service-httproute
spec:
  parentRefs:
  - name: my-gateway
    namespace: gateway-system
  hostnames:
  - "api.example.com"
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: my-service-pool

---
# 2. InferencePool - Manages the inference backend pods
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: my-service-pool
spec:
  selector:
    matchLabels:
      fusioninfer.io/service: my-service
      fusioninfer.io/component-type: worker
  targetPorts:
  - number: 8000
  endpointPickerRef:
    name: my-service-epp
    port:
      number: 9002

---
# 3. EndpointPickerConfig (ConfigMap) - EPP scheduling configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-service-epp-config
data:
  config.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: prefix-cache-scorer
      parameters:
        hashBlockSize: 5
        maxPrefixBlocksToMatch: 256
        lruCapacityPerServer: 31250
    - type: max-score-picker
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: max-score-picker
      - pluginRef: prefix-cache-scorer
        weight: 100

---
# 4. EPP Deployment - Endpoint Picker deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service-epp
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: my-service-epp
  template:
    metadata:
      labels:
        app: my-service-epp
    spec:
      serviceAccountName: my-service-epp
      containers:
      - name: epp
        image: registry.k8s.io/gateway-api-inference-extension/epp:v1.2.1
        args:
        - --pool-name=my-service-pool
        - --pool-namespace=default
        - --config-file=/config/config.yaml
        ports:
        - name: grpc
          containerPort: 9002
        - name: grpc-health
          containerPort: 9003
        - name: metrics
          containerPort: 9090
        livenessProbe:
          grpc:
            port: 9003
            service: inference-extension
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          grpc:
            port: 9003
            service: inference-extension
          periodSeconds: 2
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        volumeMounts:
        - name: plugins-config-volume
          mountPath: /config
      volumes:
      - name: plugins-config-volume
        configMap:
          name: my-service-epp-config

---
# 5. EPP Service - Exposes the EPP for Envoy ext-proc
apiVersion: v1
kind: Service
metadata:
  name: my-service-epp
spec:
  selector:
    app: my-service-epp
  type: ClusterIP
  ports:
  - name: grpc-ext-proc
    port: 9002
    protocol: TCP
  - name: grpc-health
    port: 9003
    protocol: TCP
  - name: http-metrics
    port: 9090
    protocol: TCP
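
The generated names above follow a simple convention derived from the InferenceService name. A small Go helper makes that convention explicit (a sketch of the assumed convention, matching the examples in this section):

// Deterministic names for generated resources, derived from the
// InferenceService name; matches the examples in this section.
func poolName(service string) string      { return service + "-pool" }
func eppName(service string) string       { return service + "-epp" }
func eppConfigName(service string) string { return service + "-epp-config" }
func httpRouteName(service string) string { return service + "-httproute" }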

Example 2: Advanced Configuration with Custom EndpointPickerConfig

For advanced users with specific tuning requirements, FusionInfer allows direct configuration of the EndpointPickerConfig. This provides full control over scheduling behavior, plugin parameters, and scoring weights.

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: advanced-service
spec:
  roles:
  - name: gateway
    componentType: router
    # Direct control over EPP configuration for advanced tuning
    endpointPickerConfig: |
      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: EndpointPickerConfig
      plugins:
      - type: prefix-cache-scorer
        parameters:
          hashBlockSize: 10
          maxPrefixBlocksToMatch: 512
          lruCapacityPerServer: 50000
      - type: kv-cache-utilization-scorer
      - type: max-score-picker
      schedulingProfiles:
      - name: default
        plugins:
        - pluginRef: max-score-picker
        - pluginRef: prefix-cache-scorer
          weight: 70
        - pluginRef: kv-cache-utilization-scorer
          weight: 30
    httproute:
      parentRefs:
      - name: my-gateway
        namespace: gateway-system
      hostnames:
      - "advanced.example.com"
  - name: inference
    componentType: worker
    replicas: 5
    template:
      spec:
        containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
          - --model=meta-llama/Llama-3-8B-Instruct
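
Because endpointPickerConfig is accepted as raw YAML, the controller should validate it before embedding it in the generated ConfigMap. Below is a minimal sketch using sigs.k8s.io/yaml; the validation depth is an assumption, and a real implementation could additionally decode against the Gateway API Inference Extension's EndpointPickerConfig Go types.

package controller

import (
    "fmt"

    "sigs.k8s.io/yaml"
)

// validateEndpointPickerConfig checks that the user-supplied raw YAML parses
// and declares the expected kind before it is written into the ConfigMap.
func validateEndpointPickerConfig(raw string) error {
    var header struct {
        APIVersion string `json:"apiVersion"`
        Kind       string `json:"kind"`
    }
    if err := yaml.Unmarshal([]byte(raw), &header); err != nil {
        return fmt.Errorf("endpointPickerConfig is not valid YAML: %w", err)
    }
    if header.Kind != "EndpointPickerConfig" {
        return fmt.Errorf("endpointPickerConfig must declare kind EndpointPickerConfig, got %q", header.Kind)
    }
    return nil
}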

Example 3: Disaggregated Prefill/Decode Architecture

For large-scale LLM deployments, a disaggregated prefill/decode architecture separates the compute-intensive prompt processing (prefill) phase from the memory-intensive token generation (decode) phase, allowing each phase to be scaled and optimized independently:

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: disaggregated-llm-service
spec:
  roles:
  - name: router
    componentType: router
    strategy: "pd-disaggregation"
    httproute:
      parentRefs:
      - name: my-gateway
        namespace: gateway-system
      hostnames:
      - "llm.disaggregated.example.com"

  - name: prefill-servers
    componentType: prefiller
    replicas: 2
    template:
      spec:
        containers:
        - name: prefill
          image: vllm/vllm-openai:latest
          args:
          - --model=meta-llama/Llama-3-70B-Instruct
          - --kv-transfer-config
          - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

  - name: decode-servers
    componentType: decoder
    replicas: 2
    template:
      spec:
        containers:
        - name: decode
          image: vllm/vllm-openai:latest
          args:
          - --model=meta-llama/Llama-3-70B-Instruct
          - --kv-transfer-config
          - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

Generated Resources:

The FusionInfer Router Controller automatically creates the following Gateway API and routing resources for the disaggregated prefill/decode architecture:

Note: In a complete P/D deployment, decode pods typically include a sidecar for orchestrating communication with prefill pods. This sidecar deployment is managed by the workload component, not the gateway component.

# 1. InferencePool - Single pool managing both prefill and decode pods
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: disaggregated-llm-service-pool
spec:
  selector:
    matchLabels:
      fusioninfer.io/service: disaggregated-llm-service
  endpointPickerRef:
    name: disaggregated-llm-service-epp
    port:
      number: 9002
  targetPorts:
  - number: 8000

---
# 2. EndpointPickerConfig (ConfigMap) - P/D-aware scheduling configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: disaggregated-llm-service-epp-config
data:
  epp-config.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    # Pre-request handler to set P/D headers
    - type: prefill-header-handler

    # Filters for role-based pod selection
    - type: by-label
      name: prefill-pods
      parameters:
        label: "fusioninfer.io/component-type"
        validValues: ["prefiller"]

    - type: by-label
      name: decode-pods
      parameters:
        label: "fusioninfer.io/component-type"
        validValues: ["decoder"]

    # Prefix cache optimization
    - type: prefix-cache-scorer
      parameters:
        hashBlockSize: 5
        maxPrefixBlocksToMatch: 256
        lruCapacityPerServer: 31250

    # Picker
    - type: max-score-picker

    # P/D profile handler for request routing
    - type: pd-profile-handler
      parameters:
        threshold: 0
        hashBlockSize: 5
        primaryPort: 8000
        decodeProfile: "decode"   # Optional, defaults to "decode"
        prefillProfile: "prefill" # Optional, defaults to "prefill"

    schedulingProfiles:
    - name: prefill
      plugins:
      - pluginRef: prefill-pods
      - pluginRef: max-score-picker
      - pluginRef: prefix-cache-scorer
        weight: 50

    - name: decode
      plugins:
      - pluginRef: decode-pods
      - pluginRef: max-score-picker
      - pluginRef: prefix-cache-scorer
        weight: 50

---
# 3. HTTPRoute - Routes traffic to the InferencePool
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: disaggregated-llm-service-httproute
spec:
  parentRefs:
  - name: my-gateway
    namespace: gateway-system
  hostnames:
  - "llm.disaggregated.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: disaggregated-llm-service-pool

---
# 4. EPP Deployment and Service (automatically configured)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: disaggregated-llm-service-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: disaggregated-llm-service-epp
  template:
    metadata:
      labels:
        app: disaggregated-llm-service-epp
    spec:
      containers:
      - name: epp
        image: registry.k8s.io/gateway-api-inference-extension/epp:v1.2.1
        args:
        - --pool-name=disaggregated-llm-service-pool
        - --pool-namespace=default
        - --config-file=/config/epp-config.yaml
        ports:
        - containerPort: 9002
        - containerPort: 9003
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: disaggregated-llm-service-epp-config

---
apiVersion: v1
kind: Service
metadata:
  name: disaggregated-llm-service-epp
spec:
  selector:
    app: disaggregated-llm-service-epp
  ports:
  - name: grpc-ext-proc
    port: 9002
  - name: grpc-health
    port: 9003
  - name: http-metrics
    port: 9090
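
The by-label filters above only work if prefill and decode pods carry the expected labels. A short Go sketch of the assumed labeling convention applied by FusionInfer's workload components, matching the selectors used throughout this section:

// Labels assumed to be applied to every pod by FusionInfer's workload
// components; the by-label filters and InferencePool selectors match on them.
func podLabels(serviceName string, componentType ComponentType) map[string]string {
    return map[string]string{
        "fusioninfer.io/service":        serviceName,
        "fusioninfer.io/component-type": string(componentType),
    }
}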

Strategy to EndpointPickerConfig Mapping

FusionInfer automatically generates the appropriate EndpointPickerConfig based on the routing strategy:

| Routing Strategy | Generated Plugins | Use Case |
| --- | --- | --- |
| prefix-cache | prefix-cache-scorer (with optimized parameters), max-score-picker; single default scheduling profile | Route requests sharing the longest prefixes to the same server to maximize KV-cache reuse |
| kv-cache-utilization | kv-cache-utilization-scorer, max-score-picker; single default scheduling profile | Balance load based on memory usage across inference servers |
| queue-size | queue-scorer, max-score-picker; single default scheduling profile | Minimize request waiting time by routing to the least loaded servers |
| lora-affinity | lora-affinity-scorer, max-score-picker; single default scheduling profile | Route requests to servers with matching LoRA adapters for multi-adapter serving |
| pd-disaggregation | pd-profile-handler, prefill-header-handler, by-label filters (prefiller/decoder), prefix-cache-scorer, max-score-picker; two profiles: prefill and decode | Separate compute-intensive prefill from memory-intensive decode phases |

prefix-cache:

plugins:
- type: prefix-cache-scorer
  parameters:
    hashBlockSize: 5
    maxPrefixBlocksToMatch: 256
    lruCapacityPerServer: 31250
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: max-score-picker
  - pluginRef: prefix-cache-scorer
    weight: 100

kv-cache-utilization:

plugins:
- type: kv-cache-utilization-scorer
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: max-score-picker
  - pluginRef: kv-cache-utilization-scorer
    weight: 100

queue-size:

plugins:
- type: queue-scorer
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: max-score-picker
  - pluginRef: queue-scorer
    weight: 100

lora-affinity:

plugins:
- type: lora-affinity-scorer
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: max-score-picker
  - pluginRef: lora-affinity-scorer
    weight: 100

pd-disaggregation:

plugins:
- type: pd-profile-handler
  parameters:
    threshold: 0
    hashBlockSize: 5
    primaryPort: 8000
- type: prefill-header-handler
- type: by-label
  name: prefill-pods
  parameters:
    label: "fusioninfer.io/component-type"
    validValues: ["prefiller"]
- type: by-label
  name: decode-pods
  parameters:
    label: "fusioninfer.io/component-type"
    validValues: ["decoder"]
- type: prefix-cache-scorer
  parameters:
    hashBlockSize: 5
    maxPrefixBlocksToMatch: 256
    lruCapacityPerServer: 31250
- type: max-score-picker
schedulingProfiles:
- name: prefill
  plugins:
  - pluginRef: prefill-pods
  - pluginRef: max-score-picker
  - pluginRef: prefix-cache-scorer
    weight: 50
- name: decode
  plugins:
  - pluginRef: decode-pods
  - pluginRef: max-score-picker
  - pluginRef: prefix-cache-scorer
    weight: 50
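
In the controller, this mapping reduces to a simple dispatch. A Go sketch follows; it is illustrative only, the *ConfigYAML constants are assumed to hold the YAML documents listed above, and a real implementation might instead render them from templates with per-service parameters.

// endpointPickerConfigFor returns the generated EPP configuration for a
// predefined strategy; the *ConfigYAML constants hold the documents above.
// Assumes the same controller package plus a standard "fmt" import.
func endpointPickerConfigFor(strategy RoutingStrategy) (string, error) {
    switch strategy {
    case StrategyPrefixCache:
        return prefixCacheConfigYAML, nil
    case StrategyKVCacheUtil:
        return kvCacheUtilConfigYAML, nil
    case StrategyQueueSize:
        return queueSizeConfigYAML, nil
    case StrategyLoRAAffinity:
        return loraAffinityConfigYAML, nil
    case StrategyPDDisaggregation:
        return pdDisaggregationConfigYAML, nil
    default:
        return "", fmt.Errorf("unknown routing strategy %q", strategy)
    }
}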

Implementation Phases

Phase 1: Core Router Integration

The first phase focuses on establishing the fundamental router capabilities for standard inference workloads:

Deliverables:

  • Automatic generation of InferencePool, EndpointPickerConfig, HTTPRoute, Gateway, and EPP deployment resources
  • Support for basic routing strategies:
    • prefix-cache: Route requests with shared prefixes to optimize cache utilization
    • kv-cache-utilization: Balance load based on memory usage
    • queue-size: Minimize request waiting time
    • lora-affinity: Route to servers with matching LoRA adapters
  • Custom endpointPickerConfig support for advanced users

Phase 2: Disaggregated Prefill/Decode Support

The second phase adds support for advanced Disaggregated Prefill/Decode architectures that separate compute-intensive prefill from memory-intensive decode operations.