HolmesGPT: Agentic troubleshooting built for the cloud native era
Posted on January 7, 2026 by Aritra Ghosh (Senior PM, Microsoft) and Natan Yellin (CEO & Co-Founder, Robusta.dev)

If you’ve ever debugged a production incident, you know that the hardest part often isn’t the fix, it’s finding where to begin. Most on-call engineers end up spending hours piecing together clues, fighting time pressure, and trying to make sense of scattered data. You’ve probably run into one or more of these challenges:

Unwritten knowledge and missing context: You’re pulled into an outage for a service you barely know. The original owners have changed teams, the documentation is half-written, and the “runbook” is either stale or missing altogether. You spend the first 30 minutes trying to find someone who’s seen this issue before, and if you’re unlucky, this incident is a new one.

Tool overload and context switching: Your screen looks like an air traffic control dashboard. You’re running monitoring queries, flipping between Grafana and Application Insights, checking container logs, and scrolling through traces, all while someone’s asking for an ETA in the incident channel. Correlating data across tools is manual, slow, and mentally exhausting.

Overwhelming complexity and knowledge gaps: Modern cloud-native systems like Kubernetes are powerful, but they’ve made troubleshooting far more complex. Every layer (nodes, pods, controllers, APIs, networking, autoscalers) introduces its own failure modes. To diagnose effectively, you need deep expertise across multiple domains, a breadth that even seasoned engineers can’t always keep up with.