3 secrets to deploying LLMs on cloud platforms

In the past two years, I’ve worked on generative AI projects using large language models (LLMs) more than on traditional systems, to the point that I’ve become nostalgic for serverless cloud computing. LLM applications range from enhancing conversational AI to providing complex analytical solutions across industries, and many functions beyond that. Many enterprises deploy these models on cloud platforms because public cloud providers offer a ready-made ecosystem and it’s the path of least resistance. However, it’s not cheap.

Clouds also offer other benefits such as scalability, efficiency, and advanced computational capabilities (GPUs on demand). But deploying LLMs on public cloud platforms involves lesser-known secrets that can make the difference between success and failure. Perhaps because few AI experts have hands-on experience with LLMs, and because we have not been doing this for long, there are a lot of gaps in our knowledge.

Let’s explore three lesser-known “tips” for deploying LLMs on clouds that perhaps even your AI engineers may not know. Considering that many of those guys and gals earn north of $300,000, maybe it’s time to quiz them on the details of doing this stuff right. I see more mistakes than ever as everyone runs to generative AI like their hair is on fire.

Managing cost efficiency and scalability

One of the primary appeals of using cloud platforms for deploying LLMs is the ability to scale resources as needed. We don’t have to be good capacity planners because the cloud platforms have resources we can allocate with a mouse click or automatically.

But wait, we’re about to make the same mistakes we made when first using cloud computing. Managing cost while scaling is a skill that many teams still need help with. Remember, cloud services often charge based on the compute resources consumed; they function as a utility. The more you process, the more you pay. Because GPUs cost more (and burn more power), this is a core concern with LLMs on public cloud providers.

Make sure you use cost management tools, both those provided by cloud platforms and those offered by solid third-party cost governance and monitoring players (FinOps). Practical steps include implementing auto-scaling and scheduling, choosing suitable instance types, and using preemptible instances to optimize costs. Also, remember to continuously monitor the deployment so you can adjust resources based on actual usage rather than only on the forecasted load. This means avoiding overprovisioning at all costs (see what I did there?).
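To make the overprovisioning point concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is a hypothetical placeholder (the hourly rates, the per-instance throughput, the load profile, and the forecast), not a quote from any real provider. It compares a fixed fleet sized to a forecasted peak with a fleet that scales hour by hour to observed usage, on both on-demand and preemptible pricing.

```python
import math
from typing import Sequence

# All prices and throughput figures are hypothetical placeholders,
# not quotes from any real cloud provider.
ON_DEMAND_RATE = 4.00               # USD per GPU instance-hour (assumed)
PREEMPTIBLE_RATE = 1.20             # USD per GPU instance-hour (assumed)
REQUESTS_PER_INSTANCE_HOUR = 1_000  # sustained capacity of one instance (assumed)


def instances_needed(requests_per_hour: float) -> int:
    """Smallest instance count that covers the load, never below one."""
    return max(1, math.ceil(requests_per_hour / REQUESTS_PER_INSTANCE_HOUR))


def scaled_daily_cost(hourly_load: Sequence[float], rate: float) -> float:
    """Cost of resizing the fleet every hour to match the load."""
    return sum(instances_needed(load) * rate for load in hourly_load)


def fixed_daily_cost(instances: int, rate: float, hours: int = 24) -> float:
    """Cost of running a fixed-size fleet around the clock."""
    return instances * rate * hours


# Observed usage: quiet overnight, busy during a 10-hour working day.
observed_load = [200.0] * 7 + [3_500.0] * 10 + [200.0] * 7
forecast_peak = 6_000.0  # the planning number, never reached in practice

overprovisioned = fixed_daily_cost(instances_needed(forecast_peak), ON_DEMAND_RATE)
autoscaled = scaled_daily_cost(observed_load, ON_DEMAND_RATE)
autoscaled_preemptible = scaled_daily_cost(observed_load, PREEMPTIBLE_RATE)

print(f"Fixed fleet sized to forecast: ${overprovisioned:.2f}/day")
print(f"Scaled to observed usage:      ${autoscaled:.2f}/day")
print(f"Scaled, preemptible pricing:   ${autoscaled_preemptible:.2f}/day")
```

With these made-up numbers, the fixed fleet comes to $576 a day against $216 when scaling to observed usage. The sketch ignores interruptions: preemptible instances can be reclaimed by the provider, so they fit workloads that tolerate being stopped and restarted.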

Data privacy in multitenant environments

Deploying LLMs often involves processing vast amounts of data and trained knowledge models that might contain sensitive or proprietary data. The risk in using public clouds is that you have neighbors: other tenants’ processing instances operating on the same physical hardware. As your data is stored and processed, there is a chance that it is somehow accessed by another virtual machine running on that same hardware in the public cloud data center.

Ask a public cloud provider about this, and they will run to get their updated PowerPoint presentations, which will show that this is not possible. While that is mainly true, it’s not entirely accurate. All multitenant systems come with this risk; you need to mitigate it. I’ve found that the smaller the cloud provider, such as the many that operate in just a single country, the more likely this will be an issue. This applies to data storage as well as to LLMs.

The secret is to select cloud providers that comply with stringent security standards and can prove it: at-rest and in-transit encryption, identity and access management (IAM), and isolation policies. Of course, it’s a much better idea to also implement your own security strategy and security technology stack to keep the risk of multitenant LLM use on clouds low.
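One way to keep the “that they can prove” part honest is to track evidence per control rather than ticking boxes. The following is a minimal sketch, not a real compliance tool: the four control names come from the paragraph above, while the `ProviderAssessment` data model, the provider name, and the evidence strings are hypothetical illustrations.

```python
from dataclasses import dataclass, field

# Control names follow the list in the text; everything else here is
# a hypothetical illustration.
REQUIRED_CONTROLS = (
    "encryption_at_rest",
    "encryption_in_transit",
    "iam",
    "tenant_isolation",
)


@dataclass
class ProviderAssessment:
    name: str
    # Maps a control name to the evidence the provider supplied for it.
    # An empty string means a claim with nothing behind it.
    evidence: dict[str, str] = field(default_factory=dict)


def unproven_controls(assessment: ProviderAssessment) -> list[str]:
    """Return the required controls not backed by any evidence."""
    return [
        control
        for control in REQUIRED_CONTROLS
        if not assessment.evidence.get(control, "").strip()
    ]


candidate = ProviderAssessment(
    name="example-regional-cloud",  # made-up provider
    evidence={
        "encryption_at_rest": "third-party audit report, section 4",
        "encryption_in_transit": "third-party audit report, section 5",
        "iam": "",  # claimed in the sales deck, no proof supplied
    },
)

gaps = unproven_controls(candidate)
print(f"{candidate.name}: unproven controls = {gaps}")
```

Here the made-up provider fails on IAM (claimed but unproven) and tenant isolation (never mentioned), which are exactly the gaps worth raising before any sensitive data leaves your own environment.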

Handling stateful model deployment

LLM deployments are mostly stateful, which means they maintain information from one interaction to the next. This old trick provides a new benefit: the ability to enhance efficiency in continuous learning scenarios. However, managing the statefulness of these models in cloud environments, where instances might be ephemeral or stateless by design, is tricky.

Orchestration tools such as Kubernetes that support stateful deployments are helpful. They can use persistent storage for the LLMs and be configured to maintain state across sessions. You’ll need this to support the LLM’s continuity and performance.
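As a rough illustration of what “maintaining state across sessions” can mean in practice, here is a minimal Python sketch that keeps conversation history on durable storage so a replacement instance can pick up where a recycled one left off. The `SessionStore` class and the session ID are hypothetical, and a temporary directory stands in for what, under Kubernetes, would be the mount path of a persistent volume attached to a stateful workload. It also assumes session IDs are simple, trusted identifiers.

```python
import json
import tempfile
from pathlib import Path


class SessionStore:
    """Keeps per-session conversation history on durable storage."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, session_id: str) -> Path:
        return self.root / f"{session_id}.json"

    def load(self, session_id: str) -> list[dict]:
        """Return the stored history, or an empty list for a new session."""
        path = self._path(session_id)
        if not path.exists():
            return []
        return json.loads(path.read_text(encoding="utf-8"))

    def append(self, session_id: str, role: str, content: str) -> None:
        """Add one turn and write the full history back to disk."""
        history = self.load(session_id)
        history.append({"role": role, "content": content})
        # Write to a temporary file, then rename it over the old one, so a
        # reader never picks up a half-written history.
        tmp = self._path(session_id).with_suffix(".tmp")
        tmp.write_text(json.dumps(history), encoding="utf-8")
        tmp.replace(self._path(session_id))


# Stand-in for a mounted persistent volume (assumption for this sketch).
volume = Path(tempfile.mkdtemp())

first_instance = SessionStore(volume)
first_instance.append("session-42", "user", "Summarize our Q3 cloud bill.")
first_instance.append("session-42", "assistant", "GPU hours drove most of it.")
del first_instance  # the ephemeral instance goes away

replacement = SessionStore(volume)  # a fresh instance, same storage
restored = replacement.load("session-42")
print(len(restored), "turns restored")
```

The design choice is that the serving instance itself holds nothing it cannot afford to lose; all continuity lives on the storage that outlasts it. A production setup would more likely use a database or cache service than flat files, but the principle is the same.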

With the explosion of generative AI, deploying LLMs on cloud platforms is a foregone conclusion. For most enterprises, it’s just too convenient not to use the cloud. My fear with this next mad rush is that we’ll miss things that are easy to address and we’ll make huge, costly mistakes that, at the end of the day, were mostly avoidable.
