Many real-world problems require complex coordination among multiple agents, whether people or algorithms. A machine learning technique called multi-agent reinforcement learning (MARL) has shown success here, mainly in two-player and two-team competitive games like Go, Dota 2, StarCraft, hide-and-seek, and capture the flag. But the human world is far messier than games: humans face social dilemmas at multiple scales, from the interpersonal to the international, and they must decide not only how to cooperate but when to cooperate.
To address this challenge, researchers at OpenAI propose training AI agents with what they call randomized uncertain social preferences (RUSP), an augmentation that expands the distribution of environments in which reinforcement learning agents train. During training, agents share varying amounts of reward with each other; however, each agent has an independent degree of uncertainty over their relationships, creating “asymmetry” that the researchers hypothesize pressures agents to learn socially reactive behaviors.
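To make the idea concrete, here is a minimal sketch of the RUSP setup in NumPy. The function name, the Dirichlet sampling of the sharing matrix, and the Gaussian observation noise are all illustrative assumptions, not the paper's actual code; the sketch only captures the two ingredients the article describes, shared rewards plus independent, per-agent uncertainty about the sharing relationships.

```python
import numpy as np

def rusp_rewards(base_rewards, rng, noise_scale=1.0):
    """Illustrative sketch of randomized uncertain social preferences.

    Each agent's training reward is a random convex combination of all
    agents' base rewards, and each agent sees only its own noisy,
    independent estimate of that sharing matrix.
    """
    n = len(base_rewards)
    # Random reward-sharing matrix: row i mixes agent i's own reward
    # with every other agent's reward.
    sharing = rng.dirichlet(np.ones(n), size=n)          # shape (n, n)
    shared_rewards = sharing @ np.asarray(base_rewards)  # what agents optimize
    # Each agent observes the matrix through independent noise, so no
    # two agents agree exactly on who shares reward with whom -- the
    # "asymmetry" the researchers hypothesize drives social behavior.
    observations = [sharing + rng.normal(0, noise_scale, size=(n, n))
                    for _ in range(n)]
    return shared_rewards, observations

rng = np.random.default_rng(0)
rewards, obs = rusp_rewards([1.0, 0.0, -1.0], rng)
```

Because each row of the sharing matrix is a convex combination, every agent's training reward stays inside the range of the original rewards; only the agents' beliefs about the relationships differ.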
To demonstrate RUSP’s potential, the coauthors had agents play Prisoner’s Buddy, a grid-based game where agents receive a reward by “finding a buddy.” On each timestep, an agent either chooses another agent or chooses no one and sits out the round. If two agents mutually choose each other, each receives a reward of “2.” If an agent Alice chooses Bob but Bob doesn’t reciprocate, Alice receives “-2” and Bob receives “1.” An agent that chooses no one receives “0.”
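The payoff rules above can be written as a short reward function. This is a sketch of the payoffs as the article states them; the function name and the encoding of choices (an index, or `None` for sitting out) are illustrative.

```python
def prisoners_buddy_rewards(choices):
    """Per-timestep rewards for the Prisoner's Buddy payoffs.

    choices[i] is the index of the agent that agent i chose,
    or None if agent i sat out the round.
    """
    rewards = [0] * len(choices)
    for i, choice in enumerate(choices):
        if choice is None:
            continue  # sitting out: reward stays 0
        if choices[choice] == i:
            rewards[i] += 2       # mutual choice: both agents get +2
        else:
            rewards[i] -= 2       # an unreciprocated choice costs the chooser
            rewards[choice] += 1  # ...and gives +1 to the agent chosen
    return rewards

# Agents 0 and 1 pick each other; agent 2 sits out.
prisoners_buddy_rewards([1, 0, None])   # -> [2, 2, 0]
# Agent 0 picks agent 1, who doesn't reciprocate.
prisoners_buddy_rewards([1, None, None])  # -> [-2, 1, 0]
```

Note the asymmetry in the one-sided case: an ignored overture costs the chooser more than it benefits the chosen agent, which is what makes mutual choice the only stable way to accumulate reward.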
The coauthors also explored preliminary team dynamics in a much more complex environment called Oasis. It’s physics-based and tasks agents with survival: an agent’s reward is “+1” for every timestep it remains alive, plus a large negative reward when it dies. Health decreases with each step, but agents can regain health by eating food pellets and can attack one another to reduce each other’s health. If an agent’s health drops below “0,” it dies and respawns at the edge of the play area after 100 timesteps.
There’s only enough food to support two of the three agents in Oasis, creating a social dilemma. Agents must break symmetry and gang up on the third to secure the food source to stay alive.
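The per-timestep survival rules can be sketched as a single update function. The decay, healing, and death-penalty magnitudes here are assumptions for illustration; the article only specifies +1 per timestep alive, a large negative reward on death, and death below 0 health.

```python
def oasis_step(health, ate_food, damage_taken,
               decay=1, food_heal=5, death_penalty=-100):
    """One timestep of the Oasis survival rules (illustrative constants).

    Returns the agent's new health and its reward for this timestep.
    """
    # Health decays every step; food pellets restore it, attacks reduce it.
    health = health - decay + (food_heal if ate_food else 0) - damage_taken
    if health < 0:
        # Death: large negative reward; the agent then respawns at the
        # edge of the play area after 100 timesteps (not modeled here).
        return 0, death_penalty
    return health, 1  # +1 reward for every timestep the agent stays alive

oasis_step(10, ate_food=False, damage_taken=0)  # -> (9, 1): decay, stay alive
oasis_step(0, ate_food=False, damage_taken=0)   # -> (0, -100): death
```

With food scarce enough for only two of the three agents, keeping this per-step reward positive forces exactly the coalition behavior the researchers observed: two agents exclude the third from the food source.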
RUSP agents in Oasis performed much better than a “selfish” baseline in that they achieved higher reward and died less frequently, the researchers report. (For agents trained with high uncertainty levels, up to 90% of the deaths in an episode were attributable to a single agent, indicating that two agents learned to form a coalition and mostly exclude the third from the food source.) And in Prisoner’s Buddy, RUSP agents successfully partitioned into teams that tended to be stable and were maintained throughout an episode.
The researchers note that RUSP is inefficient — with the training setup in Oasis, 1,000 iterations corresponded to roughly 3.8 million episodes of experience. Even so, they argue that RUSP and techniques like it warrant further exploration. “Reciprocity and team formation are hallmark behaviors of sustained cooperation in both animals and humans,” they wrote in a paper submitted to the 2020 NeurIPS conference. “The foundations of many of our social structures are rooted in these basic behaviors and are even explicitly written into them — almost 4,000 years ago, reciprocal punishment was at the core of Hammurabi’s code of laws. If we are to see the emergence of more complex social structures and norms, it seems a prudent first step to understanding how simple forms of reciprocity may develop in artificial agents.”