What Microsoft’s custom silicon means for Azure

The history of modern software development has been a dance between what hardware can give and what software demands. Over the decades, the steps in this dance have moved us from the original Intel 8086, whose functionality we now consider very basic, to today’s multi-faceted processors, which provide virtualization support, end-to-end access to encrypted memory and data, and extended instruction sets that power the most demanding application stacks.

This dance swings from side to side. Sometimes our software has to stretch to meet the capabilities of a new generation of silicon, and sometimes it has to squeeze out every last ounce of available performance. Now, we’re finally seeing the arrival of a new generation of hardware that mixes familiar CPUs with new system-level accelerators that provide the ability to run complex AI models on both client hardware and servers, both on premises and in the public cloud.

You’ll find AI accelerators not only in the familiar Intel and AMD processors but also in Arm’s latest generation of Neoverse server-grade designs, which mix those features with low power demands (as do Qualcomm’s mobile and laptop offerings). It’s an attractive combination of features for hyperscale clouds like Azure, where low power and high density can help keep costs down while allowing growth to continue.

At the same time, system-level accelerators promise an interesting future for Windows, allowing us to use on-board AI assistants as an alternative to the cloud as Microsoft continues to improve the performance of its Phi series of small language models.

Azure Boost: Silicon for virtualization offload

Ignite 2023 saw Microsoft announce its own custom silicon for Azure, hardware that should start rolling out to customers in 2024. Microsoft has been using custom silicon and FPGAs in its own services for some time now. Zipline hardware compression and the Project Brainwave FPGA-based AI accelerators are good examples. The most recent arrival is Azure Boost, which offloads virtualization processes from the hypervisor and host OS to accelerate storage and networking for Azure VMs. Azure Boost also includes the Cerberus on-board supply chain security chipset.

Azure Boost is intended to give your virtual machine workloads access to as much of the available CPU as possible. Instead of using CPU to compress data or manage security, dedicated hardware takes over, allowing Azure to run more customer workloads on the same hardware. Running systems at high utilization is key to the economics of the public cloud, and any investment in hardware will quickly be paid off.

Maia 100: Silicon for large language models

Large language models (and generative AI generally) show the importance of dense compute, with OpenAI using Microsoft’s GPU-based supercomputer to train its GPT models. Even on a system like Microsoft’s, big foundation models like GPT-4, with more than a trillion parameters, require months of training. The next generation of LLMs will need even more compute, both for training and for operation. If we’re building grounded applications around those LLMs, using Retrieval Augmented Generation, we’ll need additional capacity to create embeddings for our source content and to provide the underlying vector-based search.
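To make the embedding and vector search step concrete, here is a minimal sketch in plain Python. It assumes the embeddings have already been computed; the three-dimensional vectors and document names below are hypothetical stand-ins (real embedding models produce hundreds or thousands of dimensions), and a production service would use an approximate nearest-neighbor index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical, hand-written embeddings for three source documents.
index = {
    "azure-boost": [0.9, 0.1, 0.0],
    "maia-100": [0.1, 0.9, 0.2],
    "cobalt-100": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a user's question.
query = [0.2, 0.8, 0.1]

print(top_k(query, index, k=2))  # ['maia-100', 'azure-boost']
```

Every query has to be compared against many stored vectors, and every document has to be embedded before it can be indexed, which is where the extra compute demand comes from.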

GPU-based supercomputers are a significant investment, even when Microsoft can recoup some of the capital costs from subscribers. Operational costs are also large, with hefty cooling requirements on top of power, bandwidth, and storage. So, we might expect those resources to be limited to very few data centers, where there’s sufficient space, power, and cooling.

But if large-scale AI is to be a successful differentiator for Azure, versus competitors such as AWS and Google Cloud, it will need to be available everywhere and it will need to be affordable. That will require new silicon (for both training and inferencing) that can be run at higher densities and at lower power than today’s GPUs.

Looking back at Azure’s Project Brainwave FPGAs, these used programmable silicon to implement key algorithms. While they worked well, they were single-purpose devices that acted as accelerators for specific machine learning models. You could develop a variant that supported the complex neural networks of an LLM, but it would need to implement a massive array of simple processors to support the multi-dimensional vector arithmetic that drives these semantic models. That’s beyond the capabilities of most FPGA technologies.

Vector processing is something that modern GPUs are very good at (not surprisingly, as many of the original architects began their careers developing vector processing hardware for early supercomputers). A GPU is basically an array of simple processors that work with matrices and vectors, using technologies like Nvidia’s CUDA to provide access to linear algebra functions that aren’t commonly part of a CPU’s instruction set. The resulting acceleration lets us build and use modern AI models like LLMs.
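The "array of simple processors" idea can be illustrated with a matrix-vector product, the basic operation inside a neural network layer. The sketch below is plain Python, not CUDA, and the weights and activations are made-up values. The point it shows is structural: each output element depends on only one row of the matrix, so a GPU can hand every row to a separate simple processor and compute them all at once, where this code walks through them one at a time.

```python
def dot(row, vec):
    """Multiply-accumulate over one row: the work a single simple processor would do."""
    return sum(r * v for r, v in zip(row, vec))


def matvec(matrix, vec):
    """Matrix-vector product as a set of independent row-wise dot products.

    No row needs a result from any other row, which is what makes the
    operation easy to spread across many processors in parallel.
    """
    return [dot(row, vec) for row in matrix]


# Hypothetical layer weights (3 outputs, 2 inputs) and input activations.
weights = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
activations = [0.5, -1.0]

print(matvec(weights, activations))  # [-1.5, -2.5, -3.5]
```

An LLM repeats this kind of operation across matrices with billions of entries, so hardware that runs thousands of these row computations simultaneously makes a large practical difference.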

Microsoft’s new custom AI accelerator chip, Maia 100, is designed for both training and inference. Building on lessons learned running OpenAI workloads, Maia is intended to fit into existing Azure infrastructure, as part of a new accelerator rack unit that sits alongside existing compute racks. With over 100 billion transistors delivered by a five-nanometer process, the Maia 100 is certainly a very large and very dense chip, with much more compute capability than a GPU.

Maia was refined alongside OpenAI’s models, and it uses a new rack design that includes custom liquid-based cooling elements. That last part is key to delivering AI workloads to more than the largest Azure data centers. Adding liquid cooling infrastructure to a data center is expensive, so building it into the Maia 100 racks means they can be dropped into any data center, anywhere in the world.

Installing Maia 100 racks does require readjusting rack spacing, as the cooling system makes them larger than Azure’s typical 21-inch racks, which are sized for Open Compute Project servers. In addition to the liquid cooling hardware, the extra space is used for 4.8 Tb high-bandwidth interconnects, essential for pushing large amounts of data between CPUs and accelerators.

There are still questions about how applications will get to use the new chips. Absent additional details, it’s likely that they’ll run Microsoft-provided AI models, like OpenAI’s and Hugging Face’s, as well as Microsoft’s own Cognitive Services and the Phi small language models. If they become available to train your own models, expect to see a new class of virtual machines alongside the current range of GPU options in Azure AI Studio.

Cobalt 100: Azure’s own Arm processor

Alongside the unveiling of Maia, Microsoft announced its own Arm server processor, the Cobalt 100. This is a 128-core 64-bit processor, designed to support high-density, low-power applications, based on Arm’s Neoverse reference design. Azure is already using Arm processors for some of its platform services, and Cobalt 100 is likely to support these and more services, rather than being used for infrastructure as a service.

There’s no need to know if your Azure App Service code is running on Intel, AMD, or Arm, as long as it performs well and your users get the results they expect. We can expect to see Cobalt processors running internet-facing services, where density and power efficiency are important requirements, as well as hosting elements of Azure’s content delivery network outside of its main data centers.

Microsoft describes its silicon engineering as a way of delivering a “systems approach” to its Azure data centers, with end-to-end support from its initial storage and networking offerings to its own compute services. And it’s not only Azure. Better silicon is coming to Windows too, as NPU-enabled processors from Intel and Qualcomm start to arrive in 2024’s desktops and laptops. After many years of software leading hardware, it will be interesting to see how we can push these new platforms to their limits with code.
