Autonomous Testing of etcd’s Robustness

Link

2025-09-25 ~1 min read

Jump to TL;DR Jump to Summary Open Original ↗

⚡ TL;DR

📝 Summary

Enhancing etcd’s Robustness Testing How We Tested What We Found Issues in the Main Development Branch Known Issues Conclusion Posted on September 25, 2025 by Marek Siarkowicz (Google | Kubernetes Maintainer) CNCF projects highlighted in this post As a critical component of many production systems, including Kubernetes, the etcd project’s first priority is reliability. Ensuring consistency and data safety requires our project contributors to continuously improve testing methodologies. In this article, we describe how we use advanced simulation testing to uncover subtle bugs, validate the robustness of our releases, and increase our confidence in etcd’s stability. We’ll share our key findings and how they have improved etcd. Many critical software systems depend on etcd to be correct and consistent, most notably as the primary datastore for Kubernetes. After some issues with the v3.5 release, the etcd maintainers developed a new robustness testing framework to better test for correctness under various failure scenarios. To further enhance our testing capabilities, we integrated a deterministic simulation testing platform from Antithesis into our workflow. The platform works by running the entire etcd cluster inside a deterministic hypervisor. This specialized environment gives the testing software complete control over every source of non-determinism, such as network behavior, thread scheduling, and system clocks. This means any bug it discovers can be perfectly and reliably reproduced. Within this simulated environment, the testing methodology shifts away from traditional, scenario-based tests. Instead of writing tests imperatively with strict assertions for one specific outcome, this approach uses declarative, property-based assertions about system behavior.

Open the original post ↗ https://www.cncf.io/blog/2025/09/25/autonomous-testing-of-etcds-robustness/