AI models, the subject of ongoing safety concerns about harmful and biased output, pose a risk beyond content emission. When wedded to tools that enable automated interaction with other systems, they can act on their own as malicious agents.
Computer scientists affiliated with the University of Illinois Urbana-Champaign (UIUC) have demonstrated this by weaponizing several large language models (LLMs) to compromise vulnerable websites without human guidance. Prior research suggests LLMs can be used, despite safety controls, to assist with the creation of malware.
Researchers Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang went a step further and showed that LLM-powered agents – LLMs provisioned with tools for accessing APIs, automated web browsing, and feedback-based planning – can wander the web on their own and break into buggy web apps without oversight.
They describe their findings in a paper titled, “LLM Agents can Autonomously Hack Websites.”
“In this work, we show that LLM agents can autonomously hack websites, performing complex tasks without prior knowledge of the vulnerability,” the UIUC academics explain in their paper.
“For example, these agents can perform complex SQL union attacks, which involve a multi-step process (38 actions) of extracting a database schema, extracting information from the database based on this schema, and performing the final hack.”
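For readers unfamiliar with the technique, the following is a minimal sketch of a SQL union attack against a toy in-memory SQLite database, along with the parameterized query that defeats it. The table names, the `search_vulnerable` and `search_safe` functions, and the payloads are all hypothetical and are not taken from the paper; the agents described there worked through a web interface over many more steps.

```python
import sqlite3

# Toy database; table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (name TEXT, price TEXT);
CREATE TABLE users (username TEXT, password TEXT);
INSERT INTO products VALUES ('widget', '9.99');
INSERT INTO users VALUES ('admin', 'hunter2');
""")

def search_vulnerable(term):
    # BUG: user input is pasted straight into the SQL string.
    query = f"SELECT name, price FROM products WHERE name = '{term}'"
    return conn.execute(query).fetchall()

def search_safe(term):
    # Fix: a bound parameter is always treated as data, never as SQL.
    query = "SELECT name, price FROM products WHERE name = ?"
    return conn.execute(query, (term,)).fetchall()

# Step 1: use UNION to read the schema from SQLite's catalog table.
schema_payload = "x' UNION SELECT name, sql FROM sqlite_master --"
# Step 2: use what the schema revealed to read another table.
data_payload = "x' UNION SELECT username, password FROM users --"

print(search_vulnerable(schema_payload))  # leaks table definitions
print(search_vulnerable(data_payload))    # leaks the users table
print(search_safe(data_payload))          # no match, nothing leaked
```

The two payloads mirror the order described in the quote: learn the schema first, then use it to pull data. The same input passed to `search_safe` is simply compared against product names and matches nothing.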
In an interview with The Register, Daniel Kang, assistant professor at UIUC, emphasized that he and his co-authors did not actually let their malicious LLM agents loose on the world. The tests, he said, were done on real websites in a sandboxed environment to ensure no harm would be done and no personal information would be compromised.
“We used three major tools,” said Kang. “We used the OpenAI Assistants API, LangChain, and the Playwright browser testing framework.
“The OpenAI Assistants API is basically used to have context, to do the function calling, and many of the other things like document retrieval that are really important for high performance. LangChain was basically used to wrap it all up. And the Playwright web browser testing framework was used to actually interact with websites.”
The researchers created agents using 10 different LLMs: GPT-4, GPT-3.5, OpenHermes-2.5-Mistral-7B, LLaMA-2 Chat (70B), LLaMA-2 Chat (13B), LLaMA-2 Chat (7B), Mixtral-8x7B Instruct, Mistral (7B) Instruct v0.2, Nous Hermes-2 Yi (34B), and OpenChat 3.5.
The first two, GPT-4 and GPT-3.5, are proprietary models operated by OpenAI while the remaining eight are open source. Google’s Gemini model, said to be at least as capable as GPT-4 in its latest iteration, was not available at the time.
The researchers had their LLM agents probe test websites for 15 vulnerabilities, including SQL injection, cross-site scripting, and cross-site request forgery, among others. The open source models that were tested all failed.
But OpenAI’s GPT-4 had an overall success rate of 73.3 percent with five passes and 42.7 percent with one pass. The second place contender, OpenAI’s GPT-3.5, eked out a success rate of only 6.7 percent with five passes and 2.7 percent with one pass.
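The two figures measure different things. The sketch below shows one way such numbers could be computed, assuming that "five passes" means a vulnerability counts as exploited if any of five attempts succeeds, and that "one pass" is the fraction of all individual attempts that succeed. The `pass_rates` function and the trial data are hypothetical and are not the researchers' results.

```python
def pass_rates(trials):
    """Return (single-attempt rate, any-attempt rate).

    trials holds one list of booleans per vulnerability,
    with one boolean per attempt against that vulnerability.
    """
    total_attempts = sum(len(attempts) for attempts in trials)
    total_successes = sum(sum(attempts) for attempts in trials)
    single = total_successes / total_attempts
    any_of_k = sum(any(attempts) for attempts in trials) / len(trials)
    return single, any_of_k

# Made-up outcomes: three vulnerabilities, five attempts each.
toy_trials = [
    [True, True, False, True, False],
    [False, False, False, False, False],
    [False, True, False, False, False],
]

single, any_of_five = pass_rates(toy_trials)
print(f"one pass: {single:.1%}, five passes: {any_of_five:.1%}")
```

Under those assumptions, the reported 73.3 percent and 42.7 percent would be consistent with 11 of 15 vulnerabilities and 32 of 75 attempts, though that breakdown is an inference from the percentages rather than something stated here.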
“That’s one of the things we find very surprising,” said Kang. “So depending on who you talk to, this might be called scaling law or an emergent capability. What we found is that GPT-4 is highly capable of these tasks. Every open source model failed, and GPT-3.5 is only marginally better than the open source models.”
One explanation cited in the paper is that GPT-4 was better able to change its actions based on the response it got from the target website than the open source models.
Kang said it’s difficult to be certain why that’s the case. “Qualitatively speaking, we found that the open source models are not nearly as good at function calling as the OpenAI models.”
He also cited the need to process large contexts (prompts). “GPT-4 needs to take up to 50 actions, if you include backtracking, to accomplish some of these hacks and this requires a lot of context to actually perform,” he explained. “We found that the open source models were not nearly as good as GPT-4 for long contexts.”
Backtracking refers to having a model revert to its previous state to try another approach when confronted with an error.
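As a rough illustration of that idea, the sketch below saves a checkpoint before each approach and restores it when the approach fails. It is a generic pattern under simplifying assumptions, not the researchers' agent code; `StepFailed`, `run_with_backtracking`, and the two toy approaches are hypothetical names.

```python
import copy

class StepFailed(Exception):
    """Raised by an approach when the target responds with an error."""

def run_with_backtracking(state, approaches):
    """Try each approach in turn, reverting state after a failure."""
    for approach in approaches:
        checkpoint = copy.deepcopy(state)  # remember where we were
        try:
            return approach(state)         # success: stop here
        except StepFailed:
            state = checkpoint             # backtrack and try the next one
    return None                            # every approach failed

def first_try(state):
    state["log"].append("first_try")
    raise StepFailed("request rejected")

def second_try(state):
    state["log"].append("second_try")
    return state

result = run_with_backtracking({"log": []}, [first_try, second_try])
print(result)  # {'log': ['second_try']}
```

The failed first attempt leaves no trace in the final state, which is the point of reverting. Keeping that history in the prompt instead is one reason long-context handling matters for these agents.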
The researchers conducted a cost analysis of attacking websites with LLM agents and found the software agent is far more affordable than hiring a penetration tester.
“To estimate the cost of GPT-4, we performed five runs using the most capable agent (document reading and detailed prompt) and measured the total cost of the input and output tokens,” the paper says. “Across these 5 runs, the average cost was $4.189. With an overall success rate of 42.7 percent, this would total $9.81 per website.”
Assuming that a human security analyst paid $100,000 annually, or about $50 an hour, would take about 20 minutes per attempt to check a website manually and would need an estimated five attempts, the researchers say a live pen tester would cost about $80, or eight times the cost of an LLM agent. Kang said that while these numbers are highly speculative, he expects LLMs will be incorporated into penetration testing regimes in the coming years.
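The arithmetic behind those figures is short enough to check. The sketch below divides the average cost per run by the success rate to get an expected cost per compromised website, which assumes each run is an independent attempt with the same odds. The `cost_per_success` helper is a hypothetical name; the dollar amounts and the success rate are the ones quoted above.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected spend per success if attempts are independent."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

llm_cost = cost_per_success(4.189, 0.427)  # figures quoted from the paper
human_cost = 80.0                          # the researchers' estimate

print(f"LLM agent: ${llm_cost:.2f} per website")          # $9.81
print(f"Human costs {human_cost / llm_cost:.1f}x as much")  # 8.2x
```

The ratio comes out a little over eight, matching the "eight times" figure, and it moves quickly if either the success rate or the token price changes.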
Asked whether cost might be a gating factor to prevent the widespread use of LLM agents for automated attacks, Kang said that may be somewhat true today but he expects costs will fall.
Kang said that while traditional safety concerns related to biased and harmful training data and model output are obviously very important, the risk expands when LLMs get turned into agents.
“Agents are what really scares me in terms of future safety concerns,” he said. “Some of the vulnerabilities that we tested on, you can actually find today using automatic scanners. You can find that they exist, but you can’t autonomously exploit them using the automated scanner, at least as far as I’m aware of. You aren’t able to actually autonomously leverage that information.
“What really worries me about future highly capable models is the ability to do autonomous hacks and self-reflection to try multiple different strategies at scale.”
Asked whether he has any advice for developers, industry, and policy makers, Kang said, “The first thing is just think very carefully about what these models could potentially be used for.” He also argued for safe harbor guarantees to allow security researchers to continue this kind of research, along with responsible disclosure agreements.
Midjourney, he said, had banned some researchers and journalists who pointed out their models appeared to be using copyrighted material. OpenAI, he said, has been generous by not banning his account.
The Register asked OpenAI to comment on the researchers’ findings. “We take the safety of our products seriously and are continually improving our safety measures based on how people use our products,” a spokesperson told us.
“We don’t want our tools to be used for malicious purposes, and we are always working on how we can make our systems more robust against this type of abuse. We thank the researchers for sharing their work with us.”
OpenAI earlier downplayed GPT-4’s abilities in aiding cyberattacks, saying the model “offers only limited, incremental capabilities for malicious cybersecurity tasks beyond what is already achievable with publicly available, non-AI powered tools.” ®