A machine learning model’s performance is only as good as the quality of the data set on which it’s trained, and in the domain of self-driving vehicles, it’s critical that performance isn’t undermined by labeling errors. A troubling report from computer vision startup Roboflow alleges that exactly this has happened: according to founder Brad Dwyer, crucial bits of data were omitted from a corpus used to train self-driving car models.
Dwyer writes that Udacity Dataset 2, which contains 15,000 images captured while driving in Mountain View and neighboring cities during daylight, has serious omissions. Thousands of unlabeled vehicles, hundreds of unlabeled pedestrians, and dozens of unlabeled cyclists appear in roughly 5,000 of the samples, or 33% of the total, and 217 images lack annotations entirely despite containing cars, trucks, street lights, or pedestrians. Worse are the instances of phantom annotations and duplicated bounding boxes (a “bounding box” being the rectangle drawn around an object of interest), along with “drastically” oversized bounding boxes.
This is a problem because labels are what allow an AI system to learn the implications of patterns (like a person stepping in front of a car) and to anticipate similar events. Mislabeled or unlabeled objects can lead in turn to low accuracy and poor decision-making, which in a self-driving car could be a recipe for disaster.
“Open source datasets are great, but if the public is going to trust our community with their safety we need to do a better job of ensuring the data we’re sharing is complete and accurate,” wrote Dwyer, who noted that thousands of students in Udacity’s self-driving engineering course use Udacity Dataset 2 in conjunction with an open-source self-driving car project. “If you’re using public datasets in your projects, please do your due diligence and check their integrity before using them in the wild.”
It’s well understood that AI is prone to bias problems stemming from incomplete or skewed data sets. For instance, word embedding, a common training technique that maps words to numeric vectors, unavoidably picks up, and at worst amplifies, prejudices implicit in its source text and dialogue. Many facial recognition systems misidentify people of color more often than white people. Google Photos once infamously labeled pictures of darker-skinned people as “gorillas.”
But biased or poorly performing AI systems could inflict far more harm when they’re at the wheel of a vehicle, so to speak. Though documented collisions caused by driverless cars have so far been rare, such cars are already driving on public roads in small numbers, and their ranks are set to grow: as many as 8 million driverless cars will be added to the road in 2025, according to market research firm ABI, and Research and Markets anticipates some 20 million autonomous cars in operation in the U.S. by 2030. If those millions of cars run flawed AI models, the impact could be devastating.
Flawed models would also deepen the skepticism of a public already wary of autonomous vehicles. Two studies, one published by the Brookings Institution and another by the Advocates for Highway and Auto Safety (AHAS), found that a majority of Americans aren’t convinced of driverless cars’ safety. More than 60% of respondents to the Brookings poll said they weren’t inclined to ride in self-driving cars, and almost 70% of those surveyed by the AHAS expressed concerns about sharing the road with them. Elsewhere, a study conducted by think tank HNTB found that 59% of people expect self-driving cars will be “no safer” than cars driven by humans.
A solution might lie in better data labeling. According to Udacity Dataset 2’s GitHub page, crowdsourced annotation firm CrowdAI handled the labeling, using a combination of machine learning and human annotators. It’s unclear whether this approach contributed to the errors (we’ve reached out to CrowdAI for comment), but an improved validation step could spotlight them going forward.
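The kind of validation pass Dwyer describes can be surprisingly simple. The sketch below is a hypothetical integrity audit over a bounding-box annotation file; the CSV column names, image dimensions, and oversize threshold are illustrative assumptions, not the actual CrowdAI schema. It flags the three error classes the report mentions: frames with no labels at all, exactly duplicated boxes, and drastically oversized boxes.

```python
# Hypothetical dataset-integrity audit. The CSV layout (frame, xmin, ymin,
# xmax, ymax, label), image size, and oversize threshold are illustrative
# assumptions, not the real Udacity/CrowdAI schema.
import csv
import io
from collections import defaultdict

SAMPLE_CSV = """frame,xmin,ymin,xmax,ymax,label
a.jpg,10,10,50,60,car
a.jpg,10,10,50,60,car
b.jpg,0,0,1920,1200,truck
c.jpg,,,,,
"""

IMG_W, IMG_H = 1920, 1200
OVERSIZE_FRACTION = 0.9  # flag boxes covering more than 90% of the frame

def audit(csv_text):
    """Return (unlabeled frames, duplicated boxes, oversized boxes)."""
    seen = defaultdict(list)
    unlabeled, duplicates, oversized = [], [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        frame = row["frame"]
        if not row["xmin"]:  # row exists but carries no annotation
            unlabeled.append(frame)
            continue
        box = tuple(float(row[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
        if box in seen[frame]:  # exact duplicate bounding box
            duplicates.append((frame, box))
        seen[frame].append(box)
        area = (box[2] - box[0]) * (box[3] - box[1])
        if area >= OVERSIZE_FRACTION * IMG_W * IMG_H:
            oversized.append((frame, box))
    return unlabeled, duplicates, oversized

unlabeled, duplicates, oversized = audit(SAMPLE_CSV)
print("no labels:", unlabeled)
print("duplicated boxes:", duplicates)
print("oversized boxes:", oversized)
```

A check like this catches only mechanical defects, of course; the missing-label problem Roboflow found (objects present in the image but absent from the CSV) still requires human spot-checks or a second model's predictions to surface.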
For its part, Roboflow tells Sophos’ Naked Security that it plans to run experiments with both the original data set and its fixed version, which it has released as open source, to see how much of a problem the errors would have posed when training various model architectures.
“Of the datasets I’ve looked at in other domains (e.g. medicine, animals, games), this one stood out as being of particularly poor quality,” Dwyer told Naked Security. “I would hope that the big companies who are actually putting cars on the road are being much more rigorous with their data labeling, cleaning, and verification processes.”