A new machine learning model being developed by MIT engineers will help robots perform multi-step tasks.

Thanks to the new deep learning model, AI perceives non-obvious connections between objects in a specific situation. The model first understands each individual relationship, and then builds a big picture out of them. In particular, it helps AI generate more accurate images from text descriptions.

This approach will help in situations where robots are entrusted with complex multistage tasks: placing items in storage areas or when assembling equipment. Without understanding the connections, the instruction “lift the box to the right of the cabinet and place it on the shelf on top” would make the robot get confused. And this approach also brings the future closer, where robots can learn by interacting with the environment. Just like people.

“When I look at the table, I cannot determine the coordinates of the object. Our brain works differently: we understand the situation based on the connections between objects. We think that by teaching this AI, we will enable it to operate more efficiently with the environment, ”says Yulin Du, a graduate student at the MIT Artificial Intelligence Laboratory.

The model works the other way around: it learns to create a textual description of objects in the image. And also – edit the image to arrange things the way it says in the changed description.

The researchers compared their model with other deep learning methods, where AIs were tasked with generating an image from a description. In all cases, their option exceeded expectations. To confirm the observations, the work of the model was also evaluated by people: they were asked how the created image corresponded to the description. In the most difficult examples, where there were three relationships between objects, 91% of the participants confirmed that the new model performed better. “It’s interesting that we can increase the complexity by increasing the number of connections, but our model can still handle it. Others don’t, ”adds Du.

Engineers tried other approaches, showing models of the scene that she had not seen before, offering different descriptions of the same model, and she still coped with testing. But engineers are careful in their predictions: while they plan to test their model on objects in the real world, where there is a lot of visual noise, and objects obscure each other.