F5 and NVIDIA Boost LLM Performance With New Integration

F5 has announced new enhancements to its BIG-IP Next for Kubernetes, now accelerated with NVIDIA BlueField-3 DPUs and the NVIDIA DOCA software framework, designed to supercharge traffic management, security, and performance for large-scale AI applications. The announcement follows a successful validation by Sesterce, a European AI infrastructure operator, which saw up to a 20% improvement in GPU utilization in its deployment.

This collaboration aims to meet the rising infrastructure demands of generative AI and large language models (LLMs) by offering advanced routing, data caching, and model protection capabilities. Among the key features is intelligent LLM routing, which lets enterprises send simple tasks to lightweight models while reserving complex workloads for more advanced systems, as sketched below. This approach improves GPU efficiency, reduces latency, and sharpens overall model responsiveness.
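To make the idea concrete, here is a minimal Python sketch of complexity-based routing. The model names, scoring heuristic, and threshold are illustrative assumptions, not details of F5's implementation, which runs as traffic-management logic on the platform itself.

```python
# Minimal sketch of complexity-based LLM routing; the model names and the
# scoring heuristic below are illustrative assumptions, not F5's actual logic.

LIGHTWEIGHT_MODEL = "small-llm-8b"   # hypothetical cheap model for simple prompts
ADVANCED_MODEL = "large-llm-70b"     # hypothetical model reserved for complex work

def estimate_complexity(prompt: str) -> float:
    """Crude proxy for task difficulty: longer, multi-question prompts score higher."""
    words = len(prompt.split())
    questions = prompt.count("?")
    return words / 100.0 + 0.5 * questions

def route_request(prompt: str, threshold: float = 1.0) -> str:
    """Send simple tasks to the lightweight model; reserve the rest for the advanced one."""
    if estimate_complexity(prompt) >= threshold:
        return ADVANCED_MODEL
    return LIGHTWEIGHT_MODEL

print(route_request("What is the capital of France?"))  # -> small-llm-8b
```

The payoff of any scheme like this is that expensive GPU capacity is spent only where a lightweight model would not suffice, which is how routing translates into the GPU-efficiency gains the companies describe.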

Accelerated inference and enhanced multi-tenancy for GenAI

One standout integration is NVIDIA's Dynamo KV Cache Manager, which enables distributed AI inference by routing requests based on cache availability and optimizing memory use across GPUs. F5's dynamic traffic management complements this by reducing time to first token in AI responses, ensuring faster, more cost-efficient deployment of GenAI services at enterprise scale. A simplified sketch of cache-aware scheduling follows.
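The Python sketch below illustrates the general idea: prefer a GPU worker that already holds the key-value (KV) cache for a request's prompt prefix, so attention keys and values need not be recomputed, and fall back to the least-loaded worker otherwise. The worker structure and selection rule are assumptions for illustration, not the Dynamo KV Cache Manager API.

```python
# Illustrative sketch of cache-aware request scheduling, in the spirit of
# KV-cache-based routing; these data structures are assumptions, not Dynamo's API.

from dataclasses import dataclass, field

@dataclass
class GpuWorker:
    name: str
    cached_prefixes: set[str] = field(default_factory=set)  # prefixes with KV cache resident
    queue_depth: int = 0                                    # pending requests

def pick_worker(workers: list[GpuWorker], prompt_prefix: str) -> GpuWorker:
    """Prefer a worker with a warm KV cache for this prefix (cuts time to
    first token by skipping prefill recompute); else pick the least loaded."""
    warm = [w for w in workers if prompt_prefix in w.cached_prefixes]
    pool = warm or workers
    return min(pool, key=lambda w: w.queue_depth)

workers = [GpuWorker("gpu-0", {"system-prompt-v1"}, queue_depth=3),
           GpuWorker("gpu-1", set(), queue_depth=1)]
print(pick_worker(workers, "system-prompt-v1").name)  # gpu-0: cache hit beats shorter queue
```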

Additionally, the platform supports the Model Context Protocol (MCP), developed by Anthropic, adding a reverse proxy layer and customizable security policies for context-based AI requests. With built-in programmability via F5 iRules, customers can tailor security and performance responses to match their AI application needs in real time. The sketch below illustrates the kind of policy gating such a proxy layer can apply.
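As a rough illustration, this Python sketch shows a context-based policy check a reverse proxy might apply to MCP-style tool requests before forwarding them. The header names, allowlist, and checks are hypothetical, and real iRules are written in TCL on BIG-IP, not Python.

```python
# Hedged sketch of a reverse-proxy policy gate for MCP-style requests.
# Header names and rules are illustrative assumptions, not iRules syntax.

ALLOWED_TOOLS = {"search_docs", "summarize"}   # hypothetical per-tenant allowlist

def inspect_mcp_request(headers: dict[str, str], tool_name: str) -> tuple[bool, str]:
    """Apply simple context-based checks before forwarding to the model server."""
    if headers.get("x-tenant-id") is None:
        return False, "reject: missing tenant identity"
    if tool_name not in ALLOWED_TOOLS:
        return False, f"reject: tool '{tool_name}' not permitted for this tenant"
    return True, "forward to upstream MCP server"

print(inspect_mcp_request({"x-tenant-id": "acme"}, "search_docs"))  # allowed
print(inspect_mcp_request({}, "delete_all"))                       # rejected
```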


AI-native infrastructure for enterprise-grade performance

By running routing logic directly on BlueField-3 DPUs, F5 offloads compute-heavy tasks from CPUs, streamlining operations and reducing latency. This approach supports complex, multi-tenant Kubernetes environments and is especially relevant for organizations running multiple LLMs or fine-tuned models simultaneously.

Sesterce, which deployed the solution across its sovereign AI infrastructure, praised the system for its ability to dynamically balance high-volume traffic and deliver GPU optimization with minimal overhead. The joint capabilities of F5 and NVIDIA provide enterprises with a single control point to manage AI-driven traffic, protect critical workloads, and ensure scalable, compliant deployments.

As organizations worldwide race to expand AI capabilities, F5 and NVIDIA’s co-innovation positions them as key players in delivering reliable, high-performance environments built to support evolving AI workloads—from training to real-time inference and agentic applications.
