By akademiotoelektronik, 19/04/2022
Technique improves AI's ability to understand 3D space using 2D images
“We live in a 3D world, but when you take a picture, it records that world in a 2D image,” says Tianfu Wu, corresponding author of a paper on the work and assistant professor of electrical and computer engineering at North Carolina State University.
“AI programs receive visual information from cameras. So if we want AI to interact with the world, we need to make sure it's able to interpret what 2D images can tell it about 3D space. In this research, we focus on part of that challenge: how to get AI to accurately recognize 3D objects, such as people or cars, in 2D images, and place those objects in space.”
While the work could be important for autonomous vehicles, it also has applications for manufacturing and robotics.
In the context of autonomous vehicles, most existing systems rely on lidar – which uses lasers to measure distance – to navigate 3D space. However, lidar technology is expensive. And because lidar is expensive, autonomous systems don't include much redundancy. For example, it would be too expensive to put dozens of lidar sensors on a mass-produced driverless car.
"But if an autonomous vehicle could use visual inputs to navigate through space, you could create redundancy," Wu said. additional cameras, which would create redundancy in the system and make it both more secure and more robust.
“It's a practical application. However, we are also excited about the fundamental breakthrough of this work: that it is possible to obtain 3D data from 2D images.”
Specifically, the new technique, called MonoCon, is able to identify 3D objects in 2D images and place them in a "bounding box," which effectively tells the AI the outermost edges of the object in question.
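To make the geometry concrete, here is a minimal sketch (not code from the MonoCon paper) of how such a 3D bounding box is commonly parameterized: a center point, three dimensions, and a heading angle together determine all eight corners. The function name and axis conventions are illustrative.

```python
import numpy as np

def box3d_corners(center, dims, yaw):
    """Illustrative sketch: compute the 8 corners of a 3D bounding box
    from its center (x, y, z), dimensions (height, width, length),
    and heading angle around the vertical axis."""
    h, w, l = dims
    # Corner offsets in the box's local frame (origin at the box center).
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    # Rotate around the vertical (y) axis, then translate to the center.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    corners = rot @ np.vstack([x, y, z])          # shape (3, 8)
    return corners + np.asarray(center).reshape(3, 1)
```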
MonoCon builds on a substantial amount of existing work aimed at helping AI programs extract 3D data from 2D images. Many of these efforts train the AI by "showing" it 2D images and placing 3D bounding boxes around objects in the image. These boxes are cuboids, which have eight points – think of the corners of a shoebox. During training, the AI receives 3D coordinates for each of the eight corners of the box, so the AI "understands" the height, width, and length of the "bounding box," as well as the distance between each of these corners and the camera. The training technique uses this to teach the AI how to estimate the dimensions of each bounding box and asks it to predict the distance between the camera and the car. After each prediction, the trainers "correct" the AI by giving it the correct answers. Over time, this allows the AI to get better and better at identifying objects, placing them in a bounding box, and estimating the dimensions of the objects.
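The "correction" step described here is ordinary supervised regression: penalize the gap between prediction and ground truth, then update the network. A minimal PyTorch sketch, with made-up numbers standing in for one training example, might look like this:

```python
import torch
import torch.nn.functional as F

# Hypothetical ground-truth targets for one object, in the spirit of the
# training described above: box dimensions (h, w, l) and camera-to-object depth.
gt_dims  = torch.tensor([1.5, 1.6, 3.9])   # metres, roughly a car
gt_depth = torch.tensor([22.0])            # metres from the camera

# Hypothetical network outputs for the same object.
pred_dims  = torch.tensor([1.4, 1.7, 4.1], requires_grad=True)
pred_depth = torch.tensor([20.5], requires_grad=True)

# "Correcting" the AI: measure the error against the correct answers
# and backpropagate it so the network improves over time.
loss = F.l1_loss(pred_dims, gt_dims) + F.l1_loss(pred_depth, gt_depth)
loss.backward()
```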
"What sets our work apart is the way we train the AI, which builds on previous training techniques," Wu said. when training the AI. However, in addition to asking the AI to predict the camera-to-object distance and the dimensions of the bounding boxes, we also ask the AI to predict the locations of each of the box's eight points and its distance from the center. of the bounding box in two dimensions. We call this “auxiliary context,” and we've found that it helps AI more accurately identify and predict 3D objects based on 2D images.
“The proposed method is motivated by a well-known theorem from measure theory, the Cramér-Wold theorem. It is also potentially applicable to other structured-output prediction tasks in computer vision.”
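For reference, the Cramér-Wold theorem states that a probability distribution on R^k is uniquely determined by the distributions of its one-dimensional projections. In its usual convergence form:

```latex
% Cramér-Wold theorem: for random vectors X_n, X in R^k,
% convergence in distribution is equivalent to convergence of
% every one-dimensional projection.
X_n \xrightarrow{d} X
\quad\Longleftrightarrow\quad
t^{\top} X_n \xrightarrow{d} t^{\top} X
\quad \text{for every } t \in \mathbb{R}^{k}.
```

Loosely, the analogy is that a higher-dimensional quantity can be pinned down by supervising enough of its lower-dimensional projections, which is what the 2D auxiliary targets do.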
The researchers tested MonoCon using a widely used benchmark data set called KITTI.
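KITTI publishes its 3D object annotations as plain-text label files, one object per line. A minimal sketch of reading the fields most relevant here (2D box, 3D dimensions, 3D location, and yaw) might look like this; the function name and the example values are illustrative, but the field layout is the benchmark's published format:

```python
def parse_kitti_label(line):
    """Parse one line of a KITTI 3D-object label file."""
    f = line.split()
    return {
        "type": f[0],                              # e.g. "Car"
        "bbox_2d": [float(v) for v in f[4:8]],     # left, top, right, bottom (px)
        "dims": [float(v) for v in f[8:11]],       # height, width, length (m)
        "location": [float(v) for v in f[11:14]],  # x, y, z in camera coords (m)
        "rotation_y": float(f[14]),                # yaw around the camera's y-axis
    }

# Made-up example values in the KITTI field order.
example = "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
print(parse_kitti_label(example))
```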
“At the time we submitted this article, MonoCon performed better than any of the dozens of other AI programs aimed at extracting 3D automotive data from 2D images,” Wu said. “MonoCon did well at identifying pedestrians and bicycles, but was not the best AI program for those identification tasks.”
“Moving forward, we are expanding this and working with larger datasets to evaluate and refine MonoCon for use in autonomous driving,” Wu said. “We also want to explore applications in manufacturing, to see if we can improve the performance of tasks such as using robotic arms.”
The work was carried out with support from the National Science Foundation, under grants 1909644, 1822477, 2024688, and 2013451; the Army Research Office, under grant W911NF1810295; and the U.S. Department of Health and Human Services, Administration for Community Living, under grant 90IFDV0017-01-00.