In a paper published this week on the preprint server arXiv.org, scientists at DeepMind introduce the idea of simple sensor intentions (SSIs), a way to reduce the knowledge needed to define rewards (functions describing how AI ought to behave) in reinforcement learning systems. They claim that SSIs can help to solve a range of complex robotic tasks (for example, grasping, lifting, and placing a ball into a cup) with only raw sensor data.
Training AI in the robotics domain typically requires a human expert and prior information. The system must be tailored to the task at hand, which entails defining a reward that both indicates success and facilitates meaningful exploration. SSIs ostensibly provide a generic means of encouraging agents to explore their environments, as well as guidance for collecting data to solve a main task. If ever commercialized or deployed into a production system, like a warehouse robot, SSIs could reduce the need for manual fine-tuning and computationally expensive state estimation (i.e., estimating the state of a system from measurements of its inputs and outputs).
As the researchers explain, in the absence of reward signals, AI systems can form exploration strategies by learning policies that cause effects on robots’ sensors (e.g., touch sensors, joint angle sensors, and position sensors). These policies explore environments to find fruitful regions, enabling the agent to collect high-quality data for its main learning task. Concretely, SSIs are sets of auxiliary tasks defined by obtaining a sensor response and calculating a reward according to one of two schemes: (1) rewarding an agent for reaching a specific target response or (2) rewarding an agent for incurring a specific change in response.
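The two reward schemes can be illustrated with a minimal sketch. The article does not give the paper's exact formulas, so the function names, the tolerance band, and the signed-difference form below are assumptions made purely for illustration.

```python
def target_reward(response: float, target: float, tolerance: float) -> float:
    """Scheme 1 (assumed form): pay 1.0 when the sensor response is
    within `tolerance` of the desired target value, otherwise 0.0."""
    return 1.0 if abs(response - target) <= tolerance else 0.0


def change_reward(previous: float, current: float, direction: int = 1) -> float:
    """Scheme 2 (assumed form): pay the signed change in the sensor
    response between two time steps. `direction` is +1 to reward an
    increase and -1 to reward a decrease."""
    return direction * (current - previous)


if __name__ == "__main__":
    # A hypothetical touch sensor reading that should reach 1.0.
    print(target_reward(response=0.95, target=1.0, tolerance=0.1))
    # A hypothetical joint angle that should increase over time.
    print(change_reward(previous=0.2, current=0.5, direction=1))
```

Neither function needs to know anything about the task itself; each depends only on a single scalar sensor reading, which is the property the researchers emphasize.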
In experiments, the paper’s coauthors transformed raw images from a camera-equipped robot (a Rethink Sawyer) into a small number of SSIs. They aggregated the statistics of the images’ spatial color distributions, defining color ranges and corresponding sensor values from estimates of the color of the objects in a scene. In total, they used six SSIs based on the robot’s touch sensor as well as two cameras around a basket containing a colored block. An AI system controlling the robot received the maximum reward only if it moved the color distribution’s average in both cameras in the desired direction.
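As a rough illustration of how an image might be reduced to a sensor value of this kind, the sketch below computes the mean position of pixels inside a color range and combines the result across two cameras. The function names, the RGB range test, the normalization, and the averaging of per-camera indicators are all assumptions, not the paper's implementation.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels


def color_mean_position(
    image: Image, lower: Pixel, upper: Pixel
) -> Optional[Tuple[float, float]]:
    """Return the normalized (x, y) mean position of pixels whose RGB
    values fall inside [lower, upper], or None if no pixel matches."""
    height, width = len(image), len(image[0])
    sum_x = sum_y = count = 0
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if all(lo <= c <= hi for c, lo, hi in zip(pixel, lower, upper)):
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    # Normalize to [0, 1] so the value does not depend on resolution.
    return (sum_x / count / max(width - 1, 1), sum_y / count / max(height - 1, 1))


def moved_in_direction(
    previous: Optional[Tuple[float, float]],
    current: Optional[Tuple[float, float]],
    axis: int,
    direction: int,
) -> bool:
    """True if the mean moved along `axis` (0 = x, 1 = y) in `direction`."""
    if previous is None or current is None:
        return False
    return direction * (current[axis] - previous[axis]) > 0


def multi_camera_reward(moves: Sequence[bool]) -> float:
    """Assumed combination: the fraction of cameras that saw the desired
    movement, so the maximum of 1.0 requires agreement in every camera."""
    return sum(1.0 for moved in moves if moved) / len(moves)
```

With two cameras, `multi_camera_reward([True, True])` yields the maximum of 1.0, while movement seen in only one camera yields 0.5 under this assumed combination rule.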
The researchers report that the AI successfully learned to lift the block after 9,000 episodes (six days) of training. Even after they replaced the SSIs for a single color channel with SSIs that aggregated rewards over multiple color channels, the AI managed to learn to lift a “wide variety” of different objects from the raw sensor information. And after 4,000 episodes (three days) of training in a separate environment, it learned to play cup-and-ball.
In future work, the coauthors intend to concentrate on extending SSIs to automatically generate rewards and reward combinations. “We argue that our approach requires less prior knowledge than the broadly used shaping reward formulation, that typically rely on task insight for their definition and state estimation for their computation,” they wrote. “The definition of the SSIs was straight-forward with no or minor adaptation between domains.”