Explore the new Azure ND H200 v5 series VMs, designed for advanced AI workloads with enhanced performance and scalability. Learn about their features, integration options, and performance benchmarks for efficient AI solutions.
As the AI landscape rapidly evolves, the demand for scalable, high-performance infrastructure continues to grow. In response, Microsoft has introduced new cloud-based AI supercomputing clusters powered by the Azure ND H200 v5 series virtual machines (VMs), which are now generally available. These VMs are specifically designed to handle the increasing complexity of advanced AI workloads, such as foundation model training and generative inference. With enhanced scale, efficiency, and performance, the ND H200 v5 VMs have already seen customer adoption and are used by Microsoft AI services, including Azure Machine Learning and Azure OpenAI Service.
The Azure ND H200 v5 VMs are built with Microsoft's system-optimized architecture and are equipped with eight NVIDIA H200 Tensor Core GPUs. These VMs address the challenge that GPU compute has advanced faster than GPU memory capacity and bandwidth: compared to the previous-generation Azure ND H100 v5 VMs, they offer a 76% increase in High Bandwidth Memory (HBM), to 141 GB per GPU, and a 43% improvement in HBM bandwidth, reaching 4.8 TB/s. This faster access to model parameters reduces application latency, a critical factor for real-time applications such as interactive agents. With this expanded memory capacity, the ND H200 v5 VMs can accommodate complex Large Language Models (LLMs) within a single VM, minimizing the need for distributed jobs and boosting overall performance.
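To make the single-VM claim concrete, here is a rough back-of-envelope sketch of whether a model's weights alone fit in the VM's aggregate 8 x 141 GB of HBM. The parameter counts and datatype sizes below are illustrative assumptions for common model configurations, not figures from Microsoft's benchmarks, and real deployments also need headroom for KV cache, activations, and runtime overhead.

```python
# Back-of-envelope check: do a model's weights fit in one
# ND H200 v5 VM (8 GPUs x 141 GB HBM)? Parameter counts and
# dtype sizes below are illustrative assumptions.

HBM_PER_GPU_GB = 141
GPUS_PER_VM = 8
AGGREGATE_HBM_GB = HBM_PER_GPU_GB * GPUS_PER_VM  # 1,128 GB per VM

def weights_gb(num_params_billions: float, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GB."""
    return num_params_billions * 1e9 * bytes_per_param / 1e9

for name, params_b, dtype_bytes in [
    ("405B model, FP16", 405, 2),
    ("405B model, FP8", 405, 1),
    ("70B model, FP16", 70, 2),
]:
    need = weights_gb(params_b, dtype_bytes)
    verdict = "fits" if need < AGGREGATE_HBM_GB else "does not fit"
    print(f"{name}: {need:,.0f} GB of weights -> {verdict} in {AGGREGATE_HBM_GB:,} GB")
```

Under these assumptions, even a 405B-parameter model in FP16 (roughly 810 GB of weights) fits within a single VM's aggregate HBM, which is consistent with the single-VM, world-size-8 setup described below.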
The architecture of the H200 clusters is designed to optimize how GPU memory is shared among model weights, key-value (KV) cache, and batch size, significantly improving throughput, latency, and cost-efficiency in LLM-based generative AI inference workloads. Thanks to its larger HBM capacity, the ND H200 v5 VM supports higher batch sizes, resulting in better GPU utilization and higher throughput than the ND H100 v5 series for both small and large language model inference tasks. In early tests, Microsoft recorded up to a 35% increase in inference throughput with the Llama 3.1 405B model on the ND H200 v5 VMs versus the ND H100 v5 series (world size 8, input length 128, output length 8, and batch sizes of 32 for the H100 and 96 for the H200).
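As an illustration of why extra HBM headroom translates into larger batches, the sketch below estimates how KV-cache memory grows linearly with batch size. The layer count, KV-head count, and head dimension are assumptions loosely modeled on a Llama-3.1-405B-style transformer with grouped-query attention, not the exact configuration Microsoft tested.

```python
# Illustrative KV-cache sizing: each in-flight sequence consumes
# cache memory, so batch size is bounded by whatever HBM remains
# after the weights. Architecture numbers are assumptions loosely
# modeled on a Llama-3.1-405B-style transformer.

LAYERS = 126          # transformer layers (assumed)
KV_HEADS = 8          # grouped-query attention KV heads (assumed)
HEAD_DIM = 128        # per-head dimension (assumed)
BYTES_PER_ELEM = 2    # FP16/BF16 cache entries

def kv_cache_gb(batch_size: int, seq_len: int) -> float:
    """Aggregate KV-cache memory across all layers, in GB.

    Each token stores one key and one value vector per layer:
    2 * layers * kv_heads * head_dim elements.
    """
    per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return batch_size * seq_len * per_token_bytes / 1e9

SEQ_LEN = 128 + 8  # input length 128 plus output length 8, as in the test
for batch in (32, 96):
    print(f"batch {batch:>2}: ~{kv_cache_gb(batch, SEQ_LEN):.1f} GB of KV cache")
```

Tripling the batch size triples the KV-cache footprint under this model, so a GPU with more HBM left over after the weights can serve more sequences concurrently, which is the mechanism behind the throughput gain reported above.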
For detailed information on Azure's high-performance computing benchmarks, Microsoft provides the AI Benchmarking Guide in its GitHub repository. For further technical details and documentation on the Azure ND H200 v5 VMs, see the Azure Virtual Machines documentation.