Red Hat Performance and Scale Engineering

2025-10-24

📝 Summary

Krkn-AI: A feedback-driven approach to chaos engineering
Network performance in distributed training: Maximizing GPU utilization on OpenShift
How Red Hat has redefined continuous performance testing
Extending the Chaos: A Guide to Building Custom Scenarios for Krkn
vLLM or llama.cpp: Choosing the right LLM inference engine for your use case
LFX Mentorship: CNCF - Krkn: Chaos scenario rollback feature (2025 Term 2)
BGP dynamic routing with Fast Data Path on RHOSO 18
Unleash controlled chaos with krknctl
Enhancing system resilience with Krkn chaos dashboard
Ollama vs. vLLM: A deep dive into performance benchmarking
Scaling OpenShift Network Policies: Results and Takeaways
Scaling OpenShift Network Policies: Our Journey in Developing a Robust Workload Testing Tool
Improving performance of multiple I/O threads for OpenShift Virtualization
OpenShift LACP bonding performance expectations
Feature Introduction: Multiple IOthreads for OpenShift Virtualization
How we improved AI inference on macOS Podman containers
Dynamic VM CPU Workload Rebalancing with Load Aware Descheduler
Red Hat Enterprise Linux Performance Results on Intel® Xeon® 6 processors
How to run performance and scale validation for OpenShift AI
Performance boosts in vLLM 0.8.1: Switching to the V1 engine
Evaluating memory overcommitment in OpenShift Virtualization
Boost OpenShift database VM density with memory overcommit
Supercharge Your AI with OpenShift AI and Redis: Unleash speed and scalability
RHEL for Real Time: CPU throttling and risks
Scalable Database Performance with OpenShift Virtualization, Out-of-the-Box
Unlocking the Effective Context Length: Benchmarking the Granite-3.1-8b Model
Monitoring Red Hat Ansible Automation Platform using Performance Co-Pilot
RoCE multi-node AI training on Red Hat OpenShift
Performance of Terraform provider to manage Openshift fleets
A step by step guide to setting up OCP Virtualization on hyperconverged ODF and deploy 10K VMs
Virtualized database I/O performance improvements in RHEL 9.4
Use kube-burner to measure Red Hat OpenShift VM and storage deployment at scale
Scaling virtio-blk disk I/O with IOThread Virtqueue Mapping
Generative AI fine-tuning of LLMs: Red Hat and Supermicro showcase outstanding results for efficient Llama-2-70b fine tuning using LoRA in MLPerf Training v4.0
Unleashing 100GbE network efficiency: SR-IOV in Red Hat OpenShift on OpenStack
Scaling Red Hat OpenStack Platform 17.1 to more than 1000+ virtual nodes
Sharing is caring: How to make the most of your GPUs (part 1 - time-slicing)
Scale testing image-based upgrades for single node OpenShift
How to create and scale 6,000 virtual machines in 7 hours with Red Hat OpenShift Virtualization
Egress IP Scale Testing in OpenShift Container Platform
IPsec Performance on Red Hat Enterprise Linux 9: A Performance Analysis of AES-GCM
Ensure a scalable and performant environment for ROSA with hosted control planes
Accelerating generative AI adoption: Red Hat OpenShift AI achieves impressive results in MLPerf inference benchmarks with vLLM runtime
Red Hat Enterprise Linux Performance Results on 5th Gen Intel® Xeon® Scalable Processors
Optimizing Quay/Clair: Database profiling results
Optimizing Quay/Clair: Profiling, performance, and efficiency
Save memory with OpenShift Virtualization using Free Page Reporting
Test Kubernetes performance and scale with kube-burner
5 ways we work to optimize Red Hat Satellite
Best practices for OpenShift Data Foundation disaster recovery resource planning
DPDK latency in OpenShift - Part II
Correlating QPS rate with resource utilization in self-managed Red Hat OpenShift with Hosted Control Planes
Continuous performance and scale validation of Red Hat OpenShift AI model-serving stack
Kube-burner: Fanning the flames of innovation in the CNCF Sandbox
Evaluating LLM inference performance on Red Hat OpenShift AI
Operating Tekton at scale: 10 lessons learned from Red Hat Trusted Application Pipeline
Behind the scenes: Introducing OpenShift Virtualization Performance and Scale
KrknChaos is joining CNCF Sandbox
Supercharging chaos testing using AI
Quantifying performance of Red Hat OpenShift for Machine Learning (ML) Training on Supermicro A+ Servers with MLPerf Training v3.1
OpenShift Cluster Manager API: Load-testing, breaking, and improving it
Data Plane Development Kit (DPDK) latency in Red Hat OpenShift - Part I
Running 2500 pods per node on OCP 4.13
Bulk API in Automation Controller
Red Hat Enterprise Linux achieves significant performance gains with Intel's 4th Generation Xeon Scalable Processors
OpenShift/Kubernetes Chaos Stories
Enhancing/Maximizing your Scaling capability with Automation Controller 2.3
Red Hat new Benchmark results on AMD EPYC4 (Genoa) processors
A Guide to Scaling OpenShift Data Science to Hundreds of Users and Notebooks
Run Windows workloads on OpenShift Container Platform
A Guide to Functional and Performance Testing of the NVIDIA DGX A100
Scaling Automation Controller for API Driven Workloads
Performance Improvements in Automation Controller 4.1
The Curious Case of the CPU Eating Gunicorn
Entitlement-Free Deployment of the NVIDIA GPU Operator on OpenShift
Red Hat collaborates with NVIDIA to deliver record-breaking STAC-A2 Market Risk benchmark
Red Hat Satellite 6.9 with Puma Web Server
Using NVIDIA A100’s Multi-Instance GPU to Run Multiple Workloads in Parallel on a Single GPU
Multi-Instance GPU Support with the GPU Operator v1.7.0
Making Chaos Part of Kubernetes/OpenShift Performance and Scalability Tests
Demonstrating Performance Capabilities of Red Hat OpenShift for Running Scientific HPC Workloads
A Complete Guide for Running Specfem Scientific HPC Workload on Red Hat OpenShift
Running HPC workloads with Red Hat OpenShift Using MPI and Lustre Filesystem
Introduction to Kraken, a Chaos Tool for OpenShift/Kubernetes

About the author: Red Hat Performance Team

Red Hat's most recent posts about Performance, Scale, Chaos and more.

Chaos engineering is the practice of deliberately introducing controlled failures into a system to uncover weaknesses before they affect end users. By continuously running chaos experiments, teams can build greater confidence in their systems and identify real performance bottlenecks. However, applying chaos in real-world environments can be challenging due to the complex, dynamic nature of applications and infrastructure, especially in environments like Kubernetes.

We compared two IBM Cloud GPU clusters, one with NVIDIA L40S GPUs and one with H100 GPUs, head-to-head to see what really drives distributed training performance. The key finding is that for distributed training, the choice of network architecture is the most significant factor in performance, far outweighing the capabilities of the default container networking. Our tests conclusively show that using the standard Red Hat OpenShift pod network for internode communication creates a severe performance bottleneck that prevents expensive GPU resources from being fully utilized.

Continuous performance testing (CPT) is a critical aspect of modern software development, especially considering the mission-critical applications and diverse infrastructure on which Red Hat OpenShift can run. In this article, we will discuss the importance of continuous performance testing, the challenges the OpenShift Performance and Scale Team discovered, and how shifting-left has increased our team velocity.

Your system is resilient… until it’s not.

Open the original post ↗ https://www.redhat.com/en/blog/red-hat-performance-and-scale-engineering