FusionInfer
A Kubernetes controller for unified LLM inference orchestration, supporting both monolithic and prefill/decode (PD) disaggregated serving topologies.
Description
FusionInfer provides a single InferenceService CRD that enables:
- Monolithic deployment: Single-pod inference handling the full request lifecycle
- PD disaggregated deployment: Separate prefill and decode roles for better GPU utilization
- Multi-node deployment: Distributed inference across multiple nodes using tensor parallelism
- Gang scheduling: Atomic scheduling via Volcano PodGroup integration
- Intelligent routing: Gateway API integration with EPP (Endpoint Picker) for request scheduling
Demo
Prefix cache aware routing example:
https://github.com/user-attachments/assets/1743bf67-2abd-42cd-a0f3-d7b65281f8cb
Architecture
┌──────────────────────────────────────────────────────────────┐
│                     InferenceService CRD                     │
│    (roles: worker/prefiller/decoder, replicas, multinode)    │
└──────────────────────────────┬───────────────────────────────┘
                               │
               ┌───────────────────────────────┐
               │  InferenceService Controller  │
               └───────────────┬───────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
         ▼                     ▼                     ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│    PodGroup     │   │ LeaderWorkerSet │   │  Router (EPP)   │
│    (Volcano)    │   │      (LWS)      │   │  InferencePool  │
│                 │   │                 │   │    HTTPRoute    │
└─────────────────┘   └─────────────────┘   └─────────────────┘
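The gang-scheduling branch of this diagram builds on Volcano's PodGroup API. For orientation, a minimal standalone PodGroup looks like the sketch below; the name and member count are illustrative, not objects FusionInfer is guaranteed to emit verbatim.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: qwen-inference-pd    # illustrative name
spec:
  minMember: 2               # e.g. one prefill pod + one decode pod admitted atomically
  queue: default
In practice the controller owns objects like this; the example only shows what "atomic scheduling" means at the API level.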
Getting Started
Install Dependencies
FusionInfer requires the following components:
1. LeaderWorkerSet (LWS) - For multi-node workload management
kubectl create -f https://github.com/kubernetes-sigs/lws/releases/download/v0.7.0/manifests.yaml
Reference: LWS Installation Guide | Releases
2. Volcano - For gang scheduling
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.13.1/installer/volcano-development.yaml
Reference: Volcano Installation Guide | Releases
3. Gateway API - For service routing
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml
Reference: Gateway API Installation Guide | Releases
4. Gateway API Inference Extension - For intelligent inference request routing
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.2.1/manifests.yaml
Reference: Inference Extension Docs | Releases
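Before continuing, it can help to confirm the dependencies came up. The namespaces below are the defaults used by the upstream manifests; adjust them if your installation differs.
kubectl get pods -n lws-system        # LWS controller manager
kubectl get pods -n volcano-system    # Volcano scheduler, controllers, admission webhook
kubectl get crd | grep -E 'gateway|inference'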
Install the Gateway
Set the Kgateway version and install the Kgateway CRDs:
KGTW_VERSION=v2.1.0
helm upgrade -i --create-namespace --namespace kgateway-system \
  --version $KGTW_VERSION \
  kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
Install Kgateway:
helm upgrade -i --namespace kgateway-system \
  --version $KGTW_VERSION \
  kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
  --set inferenceExtension.enabled=true
Deploy the Inference Gateway:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml
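You can confirm the Gateway created by that manifest (named inference-gateway, as referenced in the examples below) has been programmed before moving on:
kubectl wait gateway/inference-gateway --for=condition=Programmed --timeout=120s
kubectl get gateway inference-gateway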
Quick Start (Local Development)
# 1. Create a kind cluster (optional)
kind create cluster --name fusioninfer
# 2. Install FusionInfer CRDs
make install
# 3. Run the controller locally
make run
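The Makefile targets above follow the standard Kubebuilder layout. If you would rather run the controller in-cluster, the usual Kubebuilder flow should apply, assuming the project kept the default targets; the image name below is a placeholder:
make docker-build docker-push IMG=<registry>/fusioninfer:dev
make deploy IMG=<registry>/fusioninfer:dev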
Usage Examples
Monolithic LLM Service
apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen-inference
spec:
  roles:
    - name: router
      componentType: router
      strategy: prefix-cache
      httproute:
        parentRefs:
          - name: inference-gateway
    - name: inference
      componentType: worker
      replicas: 1
      template:
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.11.0
              args: ["--model", "Qwen/Qwen3-8B"]
              resources:
                limits:
                  nvidia.com/gpu: "1"
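To try the example, save it to a file and apply it. The filename and the fully qualified resource name below are assumptions based on standard CRD naming; adjust if the project registers a different plural or short name.
kubectl apply -f qwen-inference.yaml
kubectl get inferenceservices.fusioninfer.io qwen-inference
kubectl get pods    # the vllm worker pod should reach Running once the image is pulled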
Send Request
# On minikube, run 'minikube tunnel' to assign an IP address to the LoadBalancer-type gateway Service
GATEWAY_IP=$(kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl -X POST "http://${GATEWAY_IP}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
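PD Disaggregated LLM Service
The spec below is a sketch of how the prefiller and decoder roles from the Description might be declared. Only the fields already shown in the monolithic example (roles, componentType, replicas, template, strategy, httproute) are reused; names and replica counts are illustrative rather than a confirmed schema.
apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen-pd-inference
spec:
  roles:
    - name: router
      componentType: router
      strategy: prefix-cache
      httproute:
        parentRefs:
          - name: inference-gateway
    - name: prefill
      componentType: prefiller
      replicas: 1
      template:
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.11.0
              args: ["--model", "Qwen/Qwen3-8B"]
              resources:
                limits:
                  nvidia.com/gpu: "1"
    - name: decode
      componentType: decoder
      replicas: 1
      template:
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.11.0
              args: ["--model", "Qwen/Qwen3-8B"]
              resources:
                limits:
                  nvidia.com/gpu: "1"
Requests enter through the same gateway as in the monolithic case; endpoint selection between the roles is handled by the EPP router described in the Architecture section.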