FusionInfer

A Kubernetes controller for unified LLM inference orchestration, supporting both monolithic and prefill/decode (PD) disaggregated serving topologies.

Description

FusionInfer provides a single InferenceService CRD that enables:

  • Monolithic deployment: A single pod handles the full request lifecycle
  • PD disaggregated deployment: Separate prefill and decode roles for better GPU utilization (see the sketch after this list)
  • Multi-node deployment: Distributed inference across multiple nodes using tensor parallelism
  • Gang scheduling: Atomic scheduling via Volcano PodGroup integration
  • Intelligent routing: Gateway API integration with EPP (Endpoint Picker) for request scheduling
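
For illustration, a PD disaggregated service might look like the sketch below. The componentType values prefiller and decoder follow the role names shown in the Architecture section, and the remaining fields mirror the monolithic example under Usage Examples; exact field names and any KV-cache transfer arguments depend on your FusionInfer and vLLM versions, so treat this as a sketch rather than a reference manifest.

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen-pd-inference
spec:
  roles:
  - name: router
    componentType: router
    strategy: prefix-cache
    httproute:
      parentRefs:
      - name: inference-gateway
  - name: prefill
    componentType: prefiller
    replicas: 1
    template:
      spec:
        containers:
        - name: vllm
          image: vllm/vllm-openai:v0.11.0
          args: ["--model", "Qwen/Qwen3-8B"]
          resources:
            limits:
              nvidia.com/gpu: "1"
  - name: decode
    componentType: decoder
    replicas: 1
    template:
      spec:
        containers:
        - name: vllm
          image: vllm/vllm-openai:v0.11.0
          args: ["--model", "Qwen/Qwen3-8B"]
          resources:
            limits:
              nvidia.com/gpu: "1"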

Demo

Prefix cache aware routing example:

https://github.com/user-attachments/assets/1743bf67-2abd-42cd-a0f3-d7b65281f8cb

Architecture

┌──────────────────────────────────────────────────────────────┐
│                     InferenceService CRD                      │
│    (roles: worker/prefiller/decoder, replicas, multinode)     │
└───────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
                ┌───────────────────────────────┐
                │  InferenceService Controller  │
                └───────────────┬───────────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    PodGroup     │    │ LeaderWorkerSet │    │  Router (EPP)   │
│    (Volcano)    │    │      (LWS)      │    │  InferencePool  │
│                 │    │                 │    │    HTTPRoute    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
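
Once an InferenceService is created, the objects the controller manages can be inspected directly. The fully-qualified resource names below correspond to the dependencies installed in Getting Started; adjust them if your cluster registers these CRDs under different API groups:

kubectl get podgroups.scheduling.volcano.sh
kubectl get leaderworkersets.leaderworkerset.x-k8s.io
kubectl get inferencepools
kubectl get httproutes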

Getting Started

Install Dependencies

FusionInfer requires the following components:

1. LeaderWorkerSet (LWS) - For multi-node workload management

kubectl create -f https://github.com/kubernetes-sigs/lws/releases/download/v0.7.0/manifests.yaml

Reference: LWS Installation Guide | Releases
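
To confirm the install, the LWS controller pod should be Running (lws-system is the namespace created by the upstream manifest; adjust if you install it elsewhere):

kubectl get pods -n lws-system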

2. Volcano - For gang scheduling

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.13.1/installer/volcano-development.yaml

Reference: Volcano Installation Guide | Releases
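
Similarly, the Volcano scheduler, controllers, and admission pods should come up in volcano-system (the default namespace used by the installer above):

kubectl get pods -n volcano-system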

3. Gateway API - For service routing

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml

Reference: Gateway API Installation Guide | Releases
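
The standard install registers the core Gateway API CRDs, which you can verify with:

kubectl get crd gatewayclasses.gateway.networking.k8s.io gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io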

4. Gateway API Inference Extension - For intelligent inference request routing

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.2.1/manifests.yaml

Reference: Inference Extension Docs | Releases
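
The extension adds its own CRDs (InferencePool and related kinds); a quick check that they registered:

kubectl get crd | grep inference.networking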

Install the Gateway

Set the Kgateway version and install the Kgateway CRDs:

KGTW_VERSION=v2.1.0
helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION \
  kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds

Install Kgateway:

helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION \
  kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
  --set inferenceExtension.enabled=true

Deploy the Inference Gateway:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml
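
The gateway.yaml manifest above creates a Gateway named inference-gateway, which the examples below assume. Before sending traffic, it can help to check that the kgateway pods are running and the Gateway reports Programmed:

kubectl get pods -n kgateway-system
kubectl get gateway inference-gateway
kubectl wait gateway/inference-gateway --for=condition=Programmed --timeout=120s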

Quick Start (Local Development)

# 1. Create a kind cluster (optional)
kind create cluster --name fusioninfer

# 2. Install FusionInfer CRDs
make install

# 3. Run the controller locally
make run
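
Once make install and make run have completed, you can confirm the CRD is registered and served by the API server (the fusioninfer.io group comes from the apiVersion used in the examples below):

kubectl get crd | grep fusioninfer.io
kubectl api-resources --api-group=fusioninfer.io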

Usage Examples

Monolithic LLM Service

apiVersion: fusioninfer.io/v1alpha1
kind: InferenceService
metadata:
  name: qwen-inference
spec:
  roles:
  - name: router
    componentType: router
    strategy: prefix-cache
    httproute:
      parentRefs:
      - name: inference-gateway
  - name: inference
    componentType: worker
    replicas: 1
    template:
      spec:
        containers:
        - name: vllm
          image: vllm/vllm-openai:v0.11.0
          args: ["--model", "Qwen/Qwen3-8B"]
          resources:
            limits:
              nvidia.com/gpu: "1"
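
Assuming the manifest above is saved as qwen-inference.yaml (any filename works), apply it and wait for the worker pod to become Ready; the inferenceservice resource name below is the conventional lower-case form of the CRD kind:

kubectl apply -f qwen-inference.yaml
kubectl get inferenceservice qwen-inference
kubectl get pods -w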

Send Request

# You can use minikube tunnel to assign an IP address to a LoadBalancer-type Service
GATEWAY_IP=$(kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}')

curl -X POST "http://${GATEWAY_IP}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-8B",
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ]
      }'