If you’ve ever wished Minecraft had an Alexa-like assistant that could perform any task asked of it, you’re in luck. Facebook researchers recently argued for an interactive, collaborative Minecraft bot for natural language understanding (NLU) research. They posit that the constraints of Minecraft make it well-suited to experiments in various NLU subfields, and to this end, they’ve made baseline data, code, labeling tools, and infrastructure freely available on GitHub.
Their work builds to an extent on LIGHT, an open source research environment in the form of a large-scale, crowdsourced text adventure within which AI systems and humans interact as player characters. Scientists at Facebook AI Research, the Lorraine Research Laboratory in Computer Science and its Applications, and University College London detailed LIGHT in a paper published earlier this year.
“Despite the numerous important research directions related to virtual assistants, they themselves are not ideal platforms for the research community. They have a broad scope and need a large amount of world knowledge, and they have complex codebases maintained by hundreds, if not thousands of engineers,” wrote the coauthors in a preprint paper published on arXiv.org. “Furthermore, their proprietary nature and their commercial importance makes experimentation with them difficult. Instead of a ‘real world’ assistant, we propose working in the sandbox construction game of Minecraft.”
For those unfamiliar, Minecraft is a voxel-based building and crafting game with procedurally created worlds containing block-based trees, mountains, fields, animals, non-player characters (NPCs), and so on. Blocks are placed on a 3D voxel grid, and each voxel in the grid contains one material. Players can move, place, or remove blocks of different types, and attack or fend off attacks from NPCs or other players.
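To make the grid model concrete, here is a minimal Python sketch of a voxel world in which each occupied cell holds exactly one material. The class name `VoxelWorld`, its methods, and the material strings are hypothetical and chosen for illustration; they are not taken from Minecraft or from the researchers' code.

```python
from typing import Dict, Tuple

Position = Tuple[int, int, int]  # (x, y, z) integer voxel coordinates


class VoxelWorld:
    """Sparse voxel grid: each occupied voxel holds exactly one material."""

    def __init__(self) -> None:
        self._blocks: Dict[Position, str] = {}

    def place(self, pos: Position, material: str) -> None:
        # Placing a block overwrites whatever material was there before.
        self._blocks[pos] = material

    def remove(self, pos: Position) -> None:
        # Removing a block from an empty voxel is a no-op.
        self._blocks.pop(pos, None)

    def material_at(self, pos: Position) -> str:
        # Unoccupied voxels are reported as "air" (an assumption of this sketch).
        return self._blocks.get(pos, "air")


# Stack 15 stone blocks into a column, then remove the top one.
world = VoxelWorld()
for y in range(15):
    world.place((0, y, 0), "stone")
world.remove((0, 14, 0))
```

A dictionary keyed by coordinates keeps the sketch short; a real game would more likely store blocks in dense chunks.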
The researchers, then, describe a Minecraft bot that understands natural language commands (e.g., “build a tower 15 blocks tall and then put a giant smiley on top”) fed to it via the in-game chat window. They concede that implementing this is easier said than done, mainly because of the complexity of the tasks players might ask the bot to perform. To carry out the example command, the assistant would need to understand the meaning of “tower” and “smiley” and how to build them; know that “15 blocks tall” measures the height of the tower; recognize the significance of “15”; and resolve the relative position “top.”
Still, the paper’s coauthors assert that Minecraft’s task space and environment have “regularities” that could be used to simplify task execution. For instance, sets of language/action templates for generating example task commands could be used to build training data and inform the structure of the bot’s underlying NLU models. Moreover, Minecraft’s structure could function as a knowledge resource shared between AI and player. For example, if a user asks the assistant to “build a smiley,” the agent could infer that “a smiley” is a kind of block object because “build” is a common task the bot would already understand.
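The template idea can be sketched in a few lines of Python: each text template is filled with slot values and paired with the structured action it stands for, yielding labeled training examples. The templates, slot names, and action format below are illustrative assumptions, not the templates or schema the researchers released.

```python
from itertools import product

# Hypothetical paraphrase templates that all map to the same "build" action.
TEMPLATES = [
    "build a {shape} {height} blocks tall",
    "make a {height}-block-high {shape}",
]


def generate_examples(shapes, heights):
    """Return (command text, structured action) pairs for every combination."""
    examples = []
    for template, shape, height in product(TEMPLATES, shapes, heights):
        text = template.format(shape=shape, height=height)
        action = {"action": "build", "shape": shape, "height": height}
        examples.append((text, action))
    return examples


examples = generate_examples(["tower", "wall"], [5, 15])
for text, action in examples:
    print(text, "->", action)
```

Because every example is generated from a known template, the correct action label comes for free, which is what makes this kind of data cheap to produce.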
The researchers make a case for a modular approach to streamline a hypothetical assistant’s design and subsequent research. They propose that the actions necessary to complete basic Minecraft tasks (like path-planning and building) could be scripted by accessing the game’s internal world state. Furthermore, they note that it’d be relatively easy to collect or generate data for actions by recording players’ interactions with the assistant.
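As a rough illustration of a scripted action that reads the world state directly, the Python sketch below plans a walking path with breadth-first search over a small, flat grid of cells. The function name `find_path`, the 2D simplification, and the choice of breadth-first search are assumptions made here for brevity; the paper does not say which planner the bot uses.

```python
from collections import deque


def find_path(start, goal, blocked, size):
    """Breadth-first search on a size x size grid.

    Returns the list of cells from start to goal, or None if unreachable.
    `blocked` is the set of cells occupied by blocks, read from world state.
    """
    queue = deque([start])
    came_from = {start: None}  # also serves as the visited set
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        x, z = current
        for nxt in ((x + 1, z), (x - 1, z), (x, z + 1), (x, z - 1)):
            nx, nz = nxt
            in_bounds = 0 <= nx < size and 0 <= nz < size
            if in_bounds and nxt not in blocked and nxt not in came_from:
                came_from[nxt] = current
                queue.append(nxt)
    return None


# A two-block wall forces the agent to walk around it.
blocked = {(1, 0), (1, 1)}
path = find_path((0, 0), (2, 0), blocked, size=3)
print(path)
```

No learning is involved here: because the obstacle positions are known exactly, the route can be computed outright, which is the kind of shortcut the modular design relies on.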
Formidable challenges stand in the way of a Minecraft bot that’s “engaging” and “fun,” the team points out. It’d need to be immediately responsive to feedback, as latency often has a large effect on players’ impression of performance, and it’d have to “optimally” interact with players by seeking clarification without bombarding them with annoying questions. But despite the blockers, the team firmly believes that Minecraft is ideal for studying learning from interaction, and especially learning from language interaction.
“[I]nstead of [exploring] ML methods [that can] learn representations of the environment that allow an agent to act effectively … we are interested in the problem of what approaches allow an agent to understand player intent and improve itself via interaction, given the most favorable representations … of the environment we can engineer,” wrote the team. “While we are sympathetic to arguments suggesting that we will be unable to effectively attack the NLU problems without fundamental advances in methods for representation learning, we think it is time to try anyway.”