Video understanding is an AI subfield that not only underpins systems capable of automatically extracting semantics from footage, like web-video classifiers and sports activity recognizers, but is also a cornerstone of robot perception and navigation systems. Unfortunately, devising machine learning models that take advantage of videos’ spatiotemporal information isn’t easy, nor is identifying models that aren’t overly computation-intensive.
That’s why researchers at Google conducted a series of studies into automatically searching for optimal computer vision architectures, which they detailed in a blog post today. The team reports that the best-performing architectures identified with their three approaches — EvaNet, AssembleNet, and TinyVideoNets — demonstrated a 10 to 100 times improvement in runtime speed over existing hand-crafted systems on multiple public data sets.
“To our knowledge, this is the very first work on neural architecture search for video understanding,” wrote contributing researchers Michael S. Ryoo and AJ Piergiovanni in the blog post. “The video architectures we generate with our new evolutionary algorithms outperform the best known hand-designed CNN [convolutional neural network] architectures on public datasets, by a significant margin.”
Ryoo, Piergiovanni, and colleagues describe the first of the models, EvaNet, as a module-level architecture searcher that finds optimal configurations. An evolutionary algorithm iteratively updates a collection of candidate AI models, while EvaNet modifies the modules within each model to generate entirely new architectures. Those that rank highest in a validation step are granted more “offspring” — modified copies of themselves or combinations of themselves with others — in the next generation, while those that score poorly are removed from the population.
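The generate-score-select loop the researchers describe can be sketched in a few lines. This is a toy illustration of module-level evolutionary search, not EvaNet itself: the module choices, the mutation rule, and the scoring function are all hypothetical stand-ins (a real search would train each candidate and measure validation accuracy).

```python
import random

# A candidate "architecture" is a list of module choices, one per layer.
# These module names are illustrative placeholders, not EvaNet's modules.
MODULE_CHOICES = ["conv_3x3x3", "conv_1x3x3", "pool", "identity"]

def random_candidate(num_layers=4):
    return [random.choice(MODULE_CHOICES) for _ in range(num_layers)]

def mutate(candidate):
    """Modify one module to produce a new architecture (an 'offspring')."""
    child = list(candidate)
    i = random.randrange(len(child))
    child[i] = random.choice(MODULE_CHOICES)
    return child

def score(candidate):
    """Stand-in for validation accuracy; a real search trains the model."""
    return sum(1.0 for m in candidate if m.startswith("conv")) + random.random()

def evolve(population_size=8, generations=5):
    population = [random_candidate() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        survivors = ranked[: population_size // 2]        # drop low scorers
        offspring = [mutate(random.choice(survivors))     # top models reproduce
                     for _ in range(population_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=score)

best = evolve()
print(best)
```

Each generation, high-scoring candidates survive and spawn mutated copies while low scorers are culled, mirroring the offspring-and-removal process described above.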
According to the researchers, the approach excels at identifying “non-trivial” modules that are both faster and superior in performance compared with conventionally designed modules. Additionally, they say the resulting architectures are sufficiently diverse that even ensembles of them remain computationally efficient.
As for AssembleNet, it’s a method of fusing different sub-models with different input modalities (e.g., RGB and optical flow) and temporal resolutions, where a family of architectures learns relationships among feature representations across the modalities through evolution. Google says that an AssembleNet architecture trained for between 50 and 150 rounds achieved state-of-the-art results on the popular video recognition data sets Charades and Moments-in-Time (MiT).
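The fusion idea can be illustrated with a minimal sketch: two feature streams at different temporal resolutions are aligned and combined with connection weights. The shapes, names, and weighted-sum fusion here are assumptions for illustration, not AssembleNet’s actual connectivity (which is itself learned through evolution).

```python
import numpy as np

rng = np.random.default_rng(0)
rgb_feats = rng.standard_normal((16, 64))   # RGB stream: 16 time steps, 64 channels
flow_feats = rng.standard_normal((8, 64))   # optical-flow stream: coarser temporal resolution

# Upsample the flow stream in time so the two streams align.
flow_upsampled = np.repeat(flow_feats, 2, axis=0)

# Hypothetical connection weights deciding each stream's contribution;
# in AssembleNet, evolution searches over such cross-stream connections.
w_rgb, w_flow = 0.6, 0.4
fused = w_rgb * rgb_feats + w_flow * flow_upsampled
print(fused.shape)  # (16, 64)
```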
Lastly, TinyVideoNets automatically designs networks that provide cutting-edge performance at a fraction of the computational cost of most video understanding systems. The bulk of the gains are achieved by considering the model’s runtime during architecture evolution and by forcing the algorithm to explore the search space in a fashion that reduces the number of computations.
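One simple way to fold runtime into an evolutionary search, in the spirit of the idea above, is to penalize candidates that exceed a runtime budget so that fast architectures outrank slow ones. The budget, the penalty form, and the numbers below are illustrative assumptions, not TinyVideoNets’ actual objective.

```python
# Assumed budget: milliseconds of compute per second of video.
TARGET_RUNTIME_MS = 100.0

def fitness(accuracy, runtime_ms):
    """Discount accuracy when a candidate exceeds the runtime budget."""
    penalty = max(0.0, runtime_ms - TARGET_RUNTIME_MS) / TARGET_RUNTIME_MS
    return accuracy - penalty

# A fast, slightly less accurate model can outrank a slow, accurate one.
print(fitness(accuracy=0.80, runtime_ms=90))    # within budget: no penalty
print(fitness(accuracy=0.85, runtime_ms=400))   # heavily penalized
```

Under such a fitness function, the search is steered toward the cheap-but-accurate region of the space rather than toward accuracy alone.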
Ryoo, Piergiovanni, and colleagues say that TinyVideoNets’ models achieve competitive accuracy and run efficiently, at real-time or better speeds: processing roughly one second of video takes 37 to 100 milliseconds on a processor and 10 milliseconds on a graphics chip. That’s “hundreds of times” faster than contemporary human-designed models on average, they claim.
“This research opens new directions and demonstrates the promise of machine-evolved CNNs for video understanding,” said Ryoo and Piergiovanni.