Most cloud-based genAI performance stinks

I’ve been asked if generative AI systems are always slow. Of course, I reply, “Slow, as compared to what?” The response I always get is funny. “Slower than we thought it would be.” And the circle continues.

Performance is often an afterthought in generative AI development and deployment. Most teams deploying generative AI systems, on the cloud or off it, have yet to learn what performance their systems should deliver. They take no steps to determine performance and end up complaining about it after deployment. Or, more often, the users complain, and then generative AI designers and developers complain to me.

Challenges of generative AI performance

At their essence, generative AI systems are complex, distributed, data-oriented systems that are challenging to build, deploy, and operate. Each one is different, with its own moving parts, and most of those parts are distributed everywhere: the source databases for the training data, the output data, and the core inference engines that often run on cloud providers.

Here is my list of the most common difficulties:

Complex deployment landscapes. Generative AI systems often comprise various components, including data ingestion services, storage, computing, and networking. Architecting these components to work synergistically often leads to overcomplexity, where performance issues, which are determined by the poorest-performing components, are difficult to isolate. I’ve seen poorly performing networks and saturated databases. Those things are not directly related to generative AI, but they can cause performance problems nonetheless.
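One practical way to isolate the poorest-performing component is to time each stage of a request separately instead of only the end-to-end response. The following is a minimal Python sketch of that idea, not a prescribed method; the stage names and the sleep calls that stand in for real database, network, and inference work are hypothetical.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time spent in each named stage of a request."""

    def __init__(self):
        self.totals = {}  # stage name -> total seconds

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def slowest(self):
        """Return the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


def handle_request(timer, prompt):
    # Hypothetical stages: the sleeps stand in for real database,
    # network, and inference calls in your own pipeline.
    with timer.stage("retrieve_context"):
        time.sleep(0.02)
    with timer.stage("model_inference"):
        time.sleep(0.05)
    with timer.stage("postprocess"):
        time.sleep(0.005)
    return f"answer to {prompt!r}"


timer = StageTimer()
for prompt in ["q1", "q2", "q3"]:
    handle_request(timer, prompt)

# Report stages from slowest to fastest.
for name, secs in sorted(timer.totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {secs * 1000:7.1f} ms")
print("bottleneck:", timer.slowest())
```

In production you would more likely get this breakdown from distributed tracing than from hand-rolled timers, but the principle is the same: measure each component so a saturated database or slow network hop shows up as its own line item.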

AI model tuning. Many conclude that performance is solely a function of infrastructure, but it is not. The AI models themselves must be tuned and optimized, which requires deep technical expertise that few have.

Vendors could have done a better job establishing best practices for performance tuning. Many enterprises worry that tuning could make things worse or introduce issues that cause erroneous outcomes. This can’t be ignored. Depending on the type of generative AI system you’re running in the cloud, you need to figure this out by working with your generative AI service providers.

Security concerns. Protecting AI models and their data against unauthorized access and breaches goes without saying, especially in cloud environments where multitenancy is common. Too many performance issues raise security risks.

In many instances, security mechanisms such as encryption introduce performance overhead that, if not addressed, will worsen as the data grows. Architecture and testing are your friends here. Take some time to understand how security affects generative AI performance.

Regulatory compliance. Closely related to security is adherence to data governance and compliance standards, which can add further layers of complexity to performance management.

Much like security, we need to figure out how to work with these requirements. Most of the time, we can find a happy medium to provide the compliance we need. As with optimized performance, it just takes some trial and error.

Generative AI best practices

Keep in mind that any best practices I list here are holistic. They don’t account for the specific type of generative AI system you’re running, and those systems have very different components and platform considerations. You’ll have to check with your generative AI provider about how to carry these out for your particular use cases. With that warning, here are a few to consider:

Implement automation for scaling and resource optimization (autoscaling), which cloud providers offer. This includes using machine learning operations (MLOps) techniques and approaches for operating AI models.

Utilize serverless computing, which abstracts away infrastructure management. You no longer need to allocate the resources your generative AI will need; that happens automatically. Although I’m not always comfortable turning the keys over to an automated process that allocates resources we have to pay for, given everything else you need to be concerned with, it’s one less thing to worry about.

Conduct regular load testing and performance evaluations. Ensure that your generative AI systems can handle peak demands. Most skip this and guess how much the load will be at the top of the curve. Can you say “outage”?

Employ a continuous learning approach. AI models should be regularly updated with new data and refined to maintain performance and relevance.

Tap into the expertise and support of cloud service providers. Also, make sure to monitor online communities supporting your specific technology stack. You’ll find many answers there that $700-an-hour consultants won’t be able to provide.
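The autoscaling practice in the list above usually comes down to a target-tracking rule: compare an observed load metric to a per-replica target and resize in proportion. Managed autoscalers implement this for you (the Kubernetes Horizontal Pod Autoscaler, for example, documents a similar ratio-based formula), so this Python sketch is only meant to show the logic. The metric, target, bounds, and tolerance values are hypothetical assumptions, not recommendations.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=20, tolerance=0.1):
    """Target-tracking scaling rule: scale replicas in proportion to load.

    current_metric / target_metric is the load ratio, e.g. observed
    requests in flight per replica divided by a hypothetical per-replica target.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band, hold steady to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to configured bounds so a spike can't grow cost without limit.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60))    # overloaded, scale out → 6
print(desired_replicas(4, 30, 60))    # underused, scale in → 2
print(desired_replicas(4, 62, 60))    # near target, no change → 4
print(desired_replicas(10, 300, 60))  # large spike, capped → 20
```

The upper bound is the piece that addresses the cost concern with automated allocation: without it, a traffic spike translates directly into a bigger bill.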
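For the load-testing practice in the list above, even a small script that steps concurrency up toward the expected peak and checks tail latency beats guessing. The Python sketch below is a minimal, assumption-laden example: fake_generate is a hypothetical stand-in for a real call to your generative AI endpoint, and the 0.5-second p95 target is an arbitrary placeholder.

```python
import asyncio
import math
import random
import time

P95_TARGET_S = 0.5  # hypothetical latency target; set yours from real requirements


async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to your generative AI endpoint."""
    await asyncio.sleep(random.uniform(0.05, 0.15))  # simulated inference time
    return prompt.upper()


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


async def run_load_test(concurrency, total_requests):
    """Send total_requests calls with at most `concurrency` in flight; return latencies."""
    latencies = []
    sem = asyncio.Semaphore(concurrency)

    async def one_call(i):
        async with sem:
            # Time only the call itself, not the wait for a semaphore slot.
            start = time.perf_counter()
            await fake_generate(f"prompt {i}")
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one_call(i) for i in range(total_requests)))
    return latencies


results = {}
for concurrency in (10, 50):  # step the load up toward the expected peak
    lat = asyncio.run(run_load_test(concurrency, total_requests=100))
    results[concurrency] = lat
    p50, p95 = percentile(lat, 50), percentile(lat, 95)
    status = "OK" if p95 <= P95_TARGET_S else "MISSES TARGET"
    print(f"concurrency={concurrency:3d} p50={p50 * 1000:6.1f} ms "
          f"p95={p95 * 1000:6.1f} ms {status}")
```

Because the stand-in endpoint does not slow down under load, both runs here should pass; against a real service, the point is to watch where p95 starts climbing as concurrency rises. Dedicated load-testing tools do the same measurement at larger scale.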

I suspect generative AI performance will become more of a focus than it is today. Perhaps it should, given the amount of resources and cash we’re pouring into this exploding space.
