Docs

Curated Kubernetes content from AKS, EKS, GKE, OpenShift, Rancher/K3s and more—auto‑aggregated daily.

2026-03-20
Kubernetes Blog
Running Agents on Kubernetes with Agent Sandbox
The landscape of artificial intelligence is undergoing a massive architectural shift. In the early days of generative AI, interacting with a model was often treated as a transient, stateless function call: a request that spun up, executed for perhaps 50 milliseconds, and terminated.
#kubernetes
2026-03-20
Kubeflow Blog
Kubeflow Trainer v2.2: JAX & XGBoost Runtimes, Flux for HPC Support, and TrainJob progress and metrics observability
Just a little over one week ahead of KubeCon + CloudNativeCon EU 2026, the Kubeflow team is excited to ship Trainer v2.2. The v2.2 release reinforces our commitment to expanding the Kubeflow Trainer ecosystem – meeting developers where they are by adding native support for JAX, XGBoost, and Flux, while also delivering deeper observability into training jobs.
#kubeflow #kubernetes
2026-03-19
DigitalOcean
Meet the New Standard for High-Performance, Low-Cost Inference: NVIDIA Dynamo 1.0 is now available to DigitalOcean Customers
By Waverly Swinton. Published: March 19, 2026. NVIDIA Dynamo 1.0, which was released on Monday at NVIDIA GTC, is now available to DigitalOcean customers to help drive performance enhancements and cost efficiency. NVIDIA Dynamo 1.0 offers a 7x inference performance increase on NVIDIA GB200 NVL systems, and by pairing it with DigitalOcean’s Agentic Inference Cloud, customers can achieve higher performance at lower costs while benefiting from seamless deployment.
#kubernetes
2026-03-19
Tigera
AI Assistant for Calico: Troubleshooting at the Speed of Thought
Despite the wealth of data available, distilling a coherent narrative from a Kubernetes cluster remains a challenge for modern infrastructure teams. Even with powerful visualization tools like the Policy Board, Service Graph, and specialized dashboards, users often find themselves spending significant time piecing together context across different screens.
#tigera
2026-03-19
Kubeflow Blog
Kubeflow SDK v0.4.0: Model Registry, SparkConnect, and Enhanced Developer Experience
Topics covered: unified model management with the Model Registry client, distributed AI data at scale with SparkClient and SparkConnect, a new home for documentation, and infrastructure and breaking changes (better isolation with namespaced TrainingRuntimes, furthering parity between local and remote execution, and a required upgrade to Python 3.10+). Explore the full documentation at sdk. kubeflow.
#kubeflow #kubernetes
2026-03-18
AWS Containers Blog (EKS)
Deploy production generative AI at the edge using Amazon EKS Hybrid Nodes with NVIDIA DGX
Modern generative AI applications require deployment closer to where data is generated and business decisions are made, but this creates new infrastructure challenges. Organizations in manufacturing, healthcare, finance, and telecommunications need to deliver low-latency, energy-efficient AI workloads at the edge while maintaining data locality and regulatory compliance.
#eks #aws
2026-03-18
Kubernetes Blog
Securing Production Debugging in Kubernetes
During production debugging, the fastest route is often broad access such as cluster-admin (a ClusterRole that grants administrator-level access), shared bastions/jump boxes, or long-lived SSH keys.
#kubernetes
2026-03-18
Tigera
What Your EKS Flow Logs Aren’t Telling You
If you’re running workloads on Amazon EKS, there’s a good chance you already have some form of network observability in place. VPC Flow Logs have been a staple of AWS networking for years, and AWS has since introduced Container Network Observability, a newer set of capabilities built on Amazon CloudWatch Network Flow Monitor, that adds pod-level visibility and a service map directly in the EKS console.
#tigera
2026-03-17
DigitalOcean
Prompt Caching for Anthropic and OpenAI Models: Building Cost-Efficient AI Systems
Covers what prompt caching is, how it works, and its advantages, beginning with major cost reduction.
#kubernetes
2026-03-17
VMware Cloud Foundation Blog
Identity Security for VMware Cloud Foundation – IAM, PAM, and Zero Trust Access
Identity is now the primary security perimeter. In the latest episode of the Virtually Speaking Podcast, we sat down with Lee Howard, Head of IAM Product Management at Broadcom, to explore how Identity Security for VMware Cloud Foundation (VCF) enables secure, scalable, zero trust access across modern private cloud environments.
#vmware #cloud-foundation #kubernetes