A new paper published by researchers affiliated with Facebook and Tel-Aviv University investigates whether machine learning language models can understand basic sets of instructions. They propose a test — the Turking Test — designed to examine a model’s ability to follow natural language instructions, and despite what the researchers characterize as a lenient evaluation methodology, they observe that a pretrained language model performs poorly across all tasks.
One of the fundamental problems in AI is how to build a model that can generalize to new, previously unseen tasks. Recent work proposes a few-shot inference approach, in which a language model is conditioned on a few examples of a new task followed by the input the model is meant to process. This approach works well on a range of tasks, but the coauthors of this paper sought to determine whether language models could instead perform new tasks by conditioning on instructions alone.
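To make the distinction concrete, here is a minimal Python sketch of the two ways of conditioning a model described above. The prompt layout (the "Input:" and "Output:" labels) and the function names are illustrative assumptions, not the format used in the paper.

```python
def few_shot_prompt(examples, query):
    """Few-shot conditioning: show worked input/output pairs, then the new input.

    The "Input:"/"Output:" labels are a hypothetical layout chosen for illustration.
    """
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


def instruction_prompt(instruction, query):
    """Instruction conditioning: describe the task in plain language, give no examples."""
    return f"{instruction}\n\nInput: {query}\nOutput:"


if __name__ == "__main__":
    # The model would be asked to continue each prompt after the final "Output:".
    print(few_shot_prompt([("cat", "cats"), ("box", "boxes")], "dog"))
    print("---")
    print(instruction_prompt("Write the plural form of the word.", "dog"))
```

In both cases the model's continuation is taken as its answer; the only difference is whether the task is conveyed by examples or by a description.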
The Turking Test consists of instruction-following benchmarks of varying syntactic complexity, beginning with “turking” tasks where a model must create valid examples of popular natural language processing datasets. (This is meant to simulate tasks commonly done by laypeople on crowdsourcing platforms like Amazon Mechanical Turk.) Another portion of the test tasks the model with listing all the nouns of a given sentence that satisfy a simple condition. Finally, to pass the Turking Test, the model must write the Nth word or character in a given sentence (Section 5).
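The last of these tasks is simple enough to sketch in a few lines of Python. The snippet below computes the expected answer for an "Nth word" item and scores model outputs by exact match. The instruction wording, the whitespace-based definition of a word, and the scoring rule are all assumptions made for illustration; the paper's own prompts and its more lenient evaluation may differ.

```python
def nth_word(sentence, n):
    """Return the n-th word (1-indexed), treating whitespace-separated tokens as words."""
    words = sentence.split()
    if not 1 <= n <= len(words):
        raise ValueError(f"sentence has {len(words)} words, cannot take word {n}")
    return words[n - 1]


def build_item(sentence, n):
    """Pair a hypothetical instruction prompt with its reference answer."""
    prompt = f'Write word number {n} of the following sentence.\nSentence: "{sentence}"\nAnswer:'
    return prompt, nth_word(sentence, n)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference, ignoring case and outer whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    prompt, answer = build_item("The quick brown fox jumps", 3)
    print(prompt)
    print("expected:", answer)
```

A model's generated text would be passed to `exact_match_accuracy` alongside the reference answers to obtain a single accuracy figure for the task.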
The researchers applied the Turking Test to OpenAI’s GPT-2, a model with 1.5 billion parameters (variables internal to the model that shape its predictions). Overall, the results were disappointing. GPT-2 achieved only 2% accuracy on the task of writing the Nth word, something the authors note an elementary school child can easily do. The model also ignored explicit restrictions and conditions that appear in the instructions, achieving only slightly higher accuracy on open-ended tasks than on those with specific answers.
“Analyzing the model’s error patterns reveals that the model tends to ignore explicit instructions and often generates outputs that cannot be construed as an attempt to solve the task,” the researchers wrote. “The fact that such a large percentage of outputs is comprised of senseless repetitions indicates that the model fails to understand these trivial instructions. Even though these tasks are similar and have almost identical instructions, we find that their repetition patterns significantly differ, suggesting the model is hyper-sensitive to small changes in the instructions.”
Language models have much to learn if they’re to one day converse like respectful humans. Beyond an apparent inability to follow instructions, the jury is out on the potential for bias in language models and their grasp of general knowledge. Some research suggests benchmarks such as XTREME don’t measure models’ knowledge well and that models like T-ULRv2 can exhibit toxicity and prejudice against demographic groups.
New techniques and approaches will likely be required to bridge the gaps. As Sam Altman, CEO of OpenAI, the firm behind GPT-2 and GPT-3 (its successor), recently said: “The …hype is way too much. It’s impressive, but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but [cutting-edge language models] are just a very early glimpse. We have a lot still to figure out.”