Meta trials Purple Llama project for AI developers to test safety risks in models

Meta has launched Purple Llama – a project aimed at building open source tools to help developers assess and improve trust and safety in their generative AI models before deployment.

The project was announced by the company’s president of global affairs (and former UK deputy prime minister) Nick Clegg on Thursday.

“Collaboration on safety will build trust in the developers driving this new wave of innovation, and requires additional research and contributions on responsible AI,” Meta explained. “The people building AI systems can’t address the challenges of AI in a vacuum, which is why we want to level the playing field and create a center of mass for open trust and safety.”

Under Purple Llama, Meta is collaborating with other AI application developers – including cloud platforms like AWS and Google Cloud, chip designers like Intel, AMD and Nvidia, and software businesses like Microsoft – to release tools to test models’ capabilities and check for safety risks. The software released under the Purple Llama project is licensed for both research and commercial use.

The first package unveiled includes tools to test cyber security issues in software-generating models, and a language model that classifies text that is inappropriate or discusses violent or illegal activities. The former, dubbed CyberSec Eval, allows developers to run benchmark tests that check how likely an AI model is to generate insecure code or assist users in carrying out cyber attacks.

They could, for example, try to instruct their models to create malware and see how often they comply with the request, and then block these requests. Or they could ask their models to execute what seems like a benign task, see if they generate insecure code, and try to figure out how the models have gone awry.

Initial tests showed that on average, large language models suggested vulnerable code 30 percent of the time, researchers at Meta revealed in a paper [PDF] detailing the system. These cyber security benchmark assessments can be run repeatedly, to check if adjustments to a model are actually making it more secure.
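To give a rough idea of what that kind of repeatable measurement looks like, here is a minimal Python sketch. It is not CyberSec Eval's actual method or API: the pattern list, the `looks_insecure` and `insecure_rate` helpers, and the two toy "models" are all hypothetical stand-ins, and a real benchmark would use a far more thorough insecure-code detector.

```python
import re
from typing import Callable, Iterable

# Hypothetical, deliberately tiny set of "insecure" patterns (assumption:
# a real benchmark would rely on a much richer detector than three regexes).
INSECURE_PATTERNS = [
    re.compile(r"\bstrcpy\s*\("),          # unbounded C string copy
    re.compile(r"\beval\s*\("),            # evaluating arbitrary input
    re.compile(r"\bmd5\b", re.IGNORECASE), # weak hash function
]


def looks_insecure(code: str) -> bool:
    """Return True if the generated code matches any known-bad pattern."""
    return any(p.search(code) for p in INSECURE_PATTERNS)


def insecure_rate(model: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts for which the model's output is flagged."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    flagged = sum(looks_insecure(model(p)) for p in prompts)
    return flagged / len(prompts)


# Toy stand-ins for a model before and after a safety adjustment.
def toy_model_before(prompt: str) -> str:
    return "strcpy(dst, src);" if "copy" in prompt else "return a + b;"


def toy_model_after(prompt: str) -> str:
    return "strncpy(dst, src, n);" if "copy" in prompt else "return a + b;"


PROMPTS = ["copy a string in C", "add two numbers", "copy a buffer", "sum a list"]

# Re-running the same prompt set lets the two versions be compared directly.
rate_before = insecure_rate(toy_model_before, PROMPTS)
rate_after = insecure_rate(toy_model_after, PROMPTS)
print(f"before: {rate_before:.0%}, after: {rate_after:.0%}")
```

The point of keeping the prompt set fixed is that any change in the flagged rate can be attributed to the model adjustment rather than to different test inputs.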

Meanwhile, Llama Guard is a large language model trained to classify text. It looks out for language that is sexually explicit, offensive, harmful or discusses unlawful activities. 

Developers can test whether their own models accept or generate unsafe text by running input prompts and output responses through Llama Guard. They could then filter out specific inputs that might incite the model to produce inappropriate content.
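A minimal Python sketch of that input-and-output filtering pattern might look like the following. It does not call Llama Guard: `toy_classifier` is a hypothetical keyword matcher standing in for a trained classifier, and `guarded_generate`, `toy_model` and the `REFUSAL` string are likewise invented for illustration.

```python
from typing import Callable

# Hypothetical keyword list standing in for a trained safety classifier.
UNSAFE_KEYWORDS = ("build a bomb", "steal credentials")

REFUSAL = "[blocked by safety filter]"


def toy_classifier(text: str) -> str:
    """Label text 'unsafe' if it contains a flagged phrase, else 'safe'."""
    lowered = text.lower()
    return "unsafe" if any(k in lowered for k in UNSAFE_KEYWORDS) else "safe"


def guarded_generate(
    model: Callable[[str], str],
    classifier: Callable[[str], str],
    prompt: str,
) -> str:
    """Check the prompt, call the model, then check the response."""
    if classifier(prompt) != "safe":
        return REFUSAL  # unsafe input: never reaches the model
    response = model(prompt)
    if classifier(response) != "safe":
        return REFUSAL  # unsafe output: never reaches the user
    return response


def toy_model(prompt: str) -> str:
    """Stand-in model that misbehaves on one innocuous-looking topic."""
    if "phishing" in prompt.lower():
        return "Step 1: steal credentials by spoofing a login page."
    return "Here is a summary: " + prompt


blocked_input = guarded_generate(toy_model, toy_classifier, "How do I build a bomb?")
blocked_output = guarded_generate(toy_model, toy_classifier, "Write a phishing email")
allowed = guarded_generate(toy_model, toy_classifier, "Summarise this article")
```

Checking both sides matters because, as the second call shows, a prompt can pass the input check and still lead the model to produce text that the output check has to catch.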

Meta positioned Purple Llama as a two-pronged approach to security and safety, looking at both the inputs and the outputs of AI. “We believe that to truly mitigate the challenges that generative AI presents we need to take both attack (red team) and defensive (blue team) postures. Purple teaming, composed of both red and blue team responsibilities, is a collaborative approach to evaluating and mitigating potential risks.” ®