New Memory Tier for AI Inference enabled by NVIDIA BlueField®-4 DPU and Supermicro Systems
Executive Summary
As enterprises adopt AI Factory infrastructure and AI inference workloads scale, inference efficiency becomes a key factor in delivering cost-effective and power-efficient AI. Central to this challenge is the KV (key-value) cache, which stores prior inference results so they can be reused in longer-context AI queries. NVIDIA is introducing a new tier of memory focused on the KV cache, enabled by the new NVIDIA BlueField-4 DPU (Data Processing Unit), and Supermicro AI compute and storage systems will support this new technology at launch.
The Challenge
Supermicro, together with NVIDIA, is at the forefront of this new domain and is developing new platforms to address large-scale AI inference. NVIDIA is announcing the BlueField-4 data processor for the Inference Context Memory Storage Platform, a new class of storage infrastructure designed for fast, efficient inference at giga-scale. As power demands in data centers increase, improving power efficiency in networking and storage frees more power for the brains of AI: the GPUs.
The transformer model, with its self-attention mechanism, fundamentally changed generative large language models by enabling parallel processing of a user's query input and by weighing the importance of different words in the query based on word matches, word position, and other factors. Processing the entire sequence of input words in parallel, rather than one at a time from left to right, dramatically increased model performance.
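To make that contrast concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention (no masking, no multi-head logic); the attention weights for every pair of tokens are computed in one matrix operation rather than a left-to-right loop. All dimensions and weights are illustrative, not tied to any particular model.

```python
# Minimal single-head self-attention: every token attends to every
# other token in one set of matrix operations (no per-token loop).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) embeddings for the whole input query."""
    q = x @ w_q                      # queries for all tokens at once
    k = x @ w_k                      # keys
    v = x @ w_v                      # values
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # pairwise token-importance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ v               # context-weighted values, in parallel

seq_len, d_model = 8, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # (8, 16): one pass, no token loop
```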
The next change in inference efficiency is the implementation of the KV cache. As inference becomes more sophisticated, it has evolved from a one-shot question-and-answer paradigm to a conversational, multi-turn process in which, as in human conversation, the meaning of the current query depends on the queries that came before it. This conversational approach requires storing the sequence of previous queries, extending the context window from the current query to prior queries, even those from days earlier.
Instead of recomputing the attention data for each word (token) every time it is encountered, the KV cache stores the key and value tensors computed for each token so that this data can be reused, without recomputation, the next time it is needed. Storing the key-value pairs in a disaggregated inference infrastructure requires a new type of storage infrastructure, which NVIDIA calls CME (the Inference Context Memory Storage Platform described above). While the conversation context could be stored in the GPU's High Bandwidth Memory (HBM), HBM is far too expensive and not large enough to hold the context for all of the queries being processed. CME solves this by introducing a new tier of storage that enables scaling the KV caches for large inference deployments.
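The NumPy sketch below illustrates the reuse pattern under simplifying assumptions (a single head, an identity query projection, toy dimensions): each decode step projects keys and values only for the newest token and appends them to a growing cache, instead of reprojecting the entire history at every step.

```python
# KV-cache reuse during autoregressive decoding: per-token key/value
# projections are computed once, cached, and reused at later steps.
import numpy as np

d_model = 16
rng = np.random.default_rng(1)
w_k = rng.normal(size=(d_model, d_model))
w_v = rng.normal(size=(d_model, d_model))

kv_cache = {"k": np.empty((0, d_model)), "v": np.empty((0, d_model))}

def decode_step(token_embedding, cache):
    """Attend from the newest token to everything seen so far."""
    # Project K/V for the *new* token only -- O(1) work per step,
    # instead of reprojecting the whole history.
    k_new = token_embedding @ w_k
    v_new = token_embedding @ w_v
    cache["k"] = np.vstack([cache["k"], k_new])
    cache["v"] = np.vstack([cache["v"], v_new])
    q = token_embedding                       # identity query projection
    scores = q @ cache["k"].T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["v"]

for _ in range(5):                            # five decode steps
    token = rng.normal(size=(d_model,))
    out = decode_step(token, kv_cache)        # reuses all cached K/V rows
```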
CME is implemented in the networking infrastructure using the newly announced NVIDIA BlueField-4 DPU, which provides the high performance required to look up keys and values during real-time inference workloads.
The Solution
As part of the NVIDIA Vera Rubin platform, NVIDIA is also announcing a new generation of data processing units. A critical component is the Inference Context Memory Storage Platform, which stores critical data efficiently and accelerates AI-native key-value (KV) cache access. This enables very fast data sharing across nodes and delivers significantly improved power efficiency for this task. Supermicro and NVIDIA are working together to expand the set of high-performance enterprise AI solutions that incorporate these new technologies.
With BlueField-4, KV cache data can be stored directly on the network interface card, accelerating its distribution to other GPUs. As part of a software-defined infrastructure, the NVIDIA BlueField-4 increases performance and security while freeing GPUs to perform their intended tasks. Supermicro's future GPU and storage systems will include this new capability.
However, the KV cache does not require the durability, redundancy, or data protection features applied to long-lived enterprise data: in the context of AI inference, lost cache entries can simply be recomputed.
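As an illustration of why that is, the hypothetical sketch below shows a tiered lookup in the style described above: check fast GPU-local memory first, then a larger context-memory tier, and fall back to recomputation on a full miss. The Tier class, fetch_kv function, and recompute_kv placeholder are invented for illustration and are not an NVIDIA or Supermicro API.

```python
# Toy tiered KV-cache lookup with a recompute fallback. Because the
# cache is derived data, a miss is a performance event, not data loss.
from typing import Optional

class Tier:
    """A toy KV-block store; real tiers would be GPU HBM, host DRAM,
    or network-attached context memory."""
    def __init__(self, name: str):
        self.name = name
        self._blocks: dict[str, bytes] = {}
    def get(self, block_id: str) -> Optional[bytes]:
        return self._blocks.get(block_id)
    def put(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = data

def recompute_kv(block_id: str) -> bytes:
    # Placeholder for a prefill pass that regenerates the KV block
    # from the original tokens.
    return f"kv-for-{block_id}".encode()

def fetch_kv(block_id: str, hbm: Tier, context_memory: Tier) -> bytes:
    data = hbm.get(block_id)
    if data is not None:
        return data                        # hit in GPU-local memory
    data = context_memory.get(block_id)
    if data is not None:
        hbm.put(block_id, data)            # promote to the fast tier
        return data
    data = recompute_kv(block_id)          # lost cache is recomputed,
    context_memory.put(block_id, data)     # so no durability guarantees
    hbm.put(block_id, data)                # are needed for this data
    return data

hbm, cme = Tier("hbm"), Tier("context-memory")
print(fetch_kv("session-42/turn-3", hbm, cme))  # miss -> recompute path
```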
The Result
Future Supermicro systems that incorporate the NVIDIA Vera Rubin platform, including the new BlueField-4 data processor, will deliver a dramatic increase in inference and agentic AI performance across multiple nodes. Data centers deploying the NVIDIA BlueField-4 data processor can expect the following results:
- Lower power usage for data transfer, allowing more GPUs per data center
- Massive KV cache capacity for long-context reasoning
- High-speed, efficient pod-wide data access
- Maximized GPU utilization
- Accelerated agentic AI serving with reduced time-to-first-token
For more information, visit: https://www.supermicro.com/en/accelerators/nvidia/vera-rubin
