Kubernetes WG Serving concludes following successful advancement of AI inference support
Posted on February 26, 2026 by Yuan Tang, on behalf of Kubernetes WG Serving Co-Chairs

The Kubernetes Working Group (WG) Serving was created to support development of the AI inference stack on Kubernetes. The goal of this working group was to ensure that Kubernetes is an orchestration platform of choice for inference workloads. This goal has been accomplished, and the working group is now being disbanded.

WG Serving formed workstreams to collect requirements from various model servers, hardware providers, and inference vendors. This work resulted in a common understanding of inference workload specifics and trends, and it laid the foundation for improvements across many SIGs in Kubernetes.

The working group oversaw several key evolutions related to load balancing and workloads. The inference gateway was adopted as a request scheduler. Multiple groups have worked to standardize AI gateway functionality, and early inference gateway participants went on to seed agent networking work in SIG Network.

The use cases and problem statements gathered by the working group informed the design of AIBrix. Many of the unresolved problems in distributed inference, especially benchmarking and recommended best practices, have been picked up by the llm-d project, which bridges the infrastructure and ML ecosystems and is better positioned to steer model server co-evolution. In particular, llm-d and AIBrix now represent more appropriate forums than this working group for driving requirements to Kubernetes SIGs. llm-d's goal is to provide well-lit paths for achieving state-of-the-art inference, with recommendations that can compose into existing inference platforms.