Because of their ability to detect, track, and follow objects of interest while maintaining safe distances, drones have become an important tool for professional and amateur filmmakers alike. Even so, quadcopters’ camera controls remain difficult to master. Drones might take different paths for the same scene even if their positions, velocities, and angles are carefully tuned, potentially ruining the consistency of a shot.
In search of a solution, Carnegie Mellon, University of São Paulo, and Facebook researchers developed a framework that enables users to define drone camera shots working from labels like “exciting,” “enjoyable,” and “establishing.” Using a software simulator, they generated a database of video clips with a diverse set of shot types and then leveraged crowdsourcing and AI to learn the relationship between the labels and certain semantic descriptors.
Videography can be a costly endeavor. Filming a short commercial runs $1,500 to $3,500 on the low end, a hefty expense for small-to-medium-size businesses. This leads some companies to pursue in-house solutions, but not all have the expertise required to execute on a vision. AI like Facebook’s, as well as Disney’s and Pixar’s, could lighten the load in a meaningful way.
The coauthors of this new framework began by conducting a series of experiments to determine the “minimal perceptually valid step sizes” (i.e., the smallest change in a shot parameter that viewers could reliably perceive) for various shot parameters. Next, they built a dataset of 200 videos using these steps and tasked people recruited from Amazon Mechanical Turk with assigning scores to semantic descriptors. The scores informed a machine learning model that mapped the descriptors to parameters that could guide the drone through shots. Lastly, the team deployed the framework to a real-world Parrot Bebop 2 drone, which they claim managed to generalize well to different actors, activities, and settings.
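To make the descriptor-to-parameter mapping concrete, here is a minimal sketch in Python. It is not the researchers' model: it swaps in a simple nearest-neighbor lookup over crowd-rated clips, and every clip score, parameter name (`distance_m`, `tilt_deg`, `angular_speed_dps`), and function name is a hypothetical chosen for illustration.

```python
import math

# Hypothetical rated clips: mean crowd scores per descriptor (0 to 1), paired
# with the shot parameters that produced each clip. All values are invented.
CLIPS = [
    ({"exciting": 0.9, "enjoyable": 0.6, "establishing": 0.1},
     {"distance_m": 3.0, "tilt_deg": 10.0, "angular_speed_dps": 30.0}),
    ({"exciting": 0.2, "enjoyable": 0.7, "establishing": 0.9},
     {"distance_m": 12.0, "tilt_deg": 45.0, "angular_speed_dps": 5.0}),
    ({"exciting": 0.5, "enjoyable": 0.8, "establishing": 0.4},
     {"distance_m": 6.0, "tilt_deg": 20.0, "angular_speed_dps": 15.0}),
]


def descriptor_distance(a, b):
    """Euclidean distance between two descriptor score dictionaries."""
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in a))


def suggest_shot(target, clips=CLIPS, k=2):
    """Average the shot parameters of the k clips rated closest to `target`.

    A stand-in for a learned model, assumed here only to show the direction
    of the mapping: descriptor scores in, shot parameters out.
    """
    nearest = sorted(clips, key=lambda clip: descriptor_distance(target, clip[0]))[:k]
    names = nearest[0][1].keys()
    return {name: sum(params[name] for _, params in nearest) / len(nearest)
            for name in names}


if __name__ == "__main__":
    wanted = {"exciting": 0.9, "enjoyable": 0.6, "establishing": 0.1}
    print(suggest_shot(wanted))
```

A user would supply only the descriptor scores; the lookup returns a parameter set a flight controller could then follow. A learned regression or generative model would replace `suggest_shot` in a real system.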
The researchers assert that while the framework targets nontechnical users, experts could adapt it to gain more control over the model’s outcome. For example, they could learn separate generative models for individual shot types and exert more direction over the model’s inputs and outputs.
“Our … model is able to successfully generate shots that are rated by participants as having the expected degrees of expression for each descriptor,” the researchers wrote. “Furthermore, the model generalizes well to other simulated scenes and to real-world footages, which strongly suggests that our semantic control space is not overly attached to specific features of the training environment nor to a single set of actor motions.”
In the future, the researchers hope to explore a larger set of parameters to control each shot, including lens zoom and potentially even soundtracks. They would also like to extend the framework to take into account features like terrain and scenery.