How We Boosted GPU Utilization by 40% with Redis & Lua
Real-time AI evaluations demand millisecond latency because any slowdown directly impacts user-facing response time. We built our Luna-2 small language models to enable exactly that: guardrails that evaluate safety without slowing down your application. However, getting these models to perform in production required us to rethink standard load balancing. By building a load-aware, client-side balancer backed by Redis and its atomic Lua scripting, we achieved a ~40% increase in average GPU utilization and reduced tail latency...
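To make the idea concrete, here is a minimal sketch of load-aware selection. This is not Galileo's implementation: the key name, the Lua snippet, and the in-process `LeastLoadedBalancer` stand-in are all illustrative assumptions. The core policy is the same either way: track in-flight requests per GPU worker and atomically route each new request to the least-loaded one.

```python
# Hypothetical Lua script for the Redis-backed version. Scripts run atomically
# inside Redis, so pick-and-increment is race-free across many clients.
# KEYS[1] would be a sorted set whose score is each worker's in-flight count.
PICK_LEAST_LOADED = """
local worker = redis.call('ZRANGE', KEYS[1], 0, 0)[1]
if worker then redis.call('ZINCRBY', KEYS[1], 1, worker) end
return worker
"""

class LeastLoadedBalancer:
    """In-process stand-in for the Redis sorted set, for illustration only."""

    def __init__(self, workers):
        self.load = {w: 0 for w in workers}  # worker -> in-flight requests

    def acquire(self):
        # Same policy as the Lua script: pick the minimum-load worker,
        # then bump its in-flight count before returning it.
        worker = min(self.load, key=self.load.get)
        self.load[worker] += 1
        return worker

    def release(self, worker):
        # Called when the request completes, freeing capacity on that worker.
        self.load[worker] -= 1

balancer = LeastLoadedBalancer(["gpu-0", "gpu-1"])
first = balancer.acquire()   # both idle, so the first worker is chosen
second = balancer.acquire()  # gpu-0 is now busier, so gpu-1 is chosen
balancer.release(first)
```

Because the selection and the count increment happen in one atomic step (one Lua call in the Redis version), two clients can never both see the same worker as "least loaded" and overload it, which is the failure mode of naive round-robin under uneven request costs.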
Read more at galileo.ai