The aim is to recognise all possible interactions between objects and people, as well as between people themselves. This technology enables fine-grained analysis of a scene and can be applied in video surveillance, for example to detect fights or abandoned luggage.
All the objects in a scene must first be detected, which is a research topic in its own right. Once these objects are detected, they must be associated with the right type of interaction. An interaction such as “holding” can involve many different types of objects, so the algorithm must detect the interaction even if it has never seen a person holding that type of object during training: it must be able to generalise the interaction. Moreover, interacting objects are sometimes occluded or not visible in the scene; in these cases, the algorithm must still recognise the interaction from the person's appearance alone. Finally, some semantically different interactions are visually close, such as eating and drinking, or holding and lifting.
The proposed solution
The vast majority of state-of-the-art methods first detect all the objects in the scene and then compute an interaction probability for every possible pair. The processing time per image is therefore quadratic in the number of objects in the scene.
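To make the quadratic cost concrete, here is a minimal sketch of the pairwise approach described above. The function names and the toy scorer are hypothetical, not from any specific method: the point is only that every ordered pair of detections must be scored, so the work grows as n².

```python
from itertools import permutations

def score_pairs(detections, interaction_scorer):
    """Score every ordered (subject, object) pair of detections.

    With n detections there are n*(n-1) ordered pairs, so the cost of
    this loop is quadratic in the number of objects in the scene.
    """
    scores = {}
    for a, b in permutations(range(len(detections)), 2):
        scores[(a, b)] = interaction_scorer(detections[a], detections[b])
    return scores

# Toy usage: 4 detections -> 4 * 3 = 12 ordered pairs to score.
detections = ["person", "cup", "chair", "person"]
pair_scores = score_pairs(detections, lambda subj, obj: 0.5)
print(len(pair_scores))  # 12
```

Doubling the number of detected objects roughly quadruples the number of pairs to evaluate, which is exactly the bottleneck single-shot methods try to avoid.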
The CEA proposes the Calipso (Classifying all interacting pairs) solution, a single-shot method: it estimates the interactions in a single pass of the image through the network. To do so, the interactions are estimated on a dense grid of anchors. Calipso's key advantage is that it is fast and its runtime is independent of the number of objects in the image.
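The idea of predicting interactions on a dense anchor grid in one pass can be sketched as follows. This is an illustrative toy, not the actual Calipso architecture: the random-projection "head" stands in for a learned prediction layer, and all shapes and names are assumptions.

```python
import numpy as np

def interaction_grid(feature_map, num_verbs, rng):
    """Hypothetical single-shot sketch: from a backbone feature map of
    shape (H, W, C), one forward pass yields a per-anchor probability
    for each interaction verb. The cost depends only on the grid size
    H x W, not on how many objects appear in the image.
    """
    H, W, C = feature_map.shape
    # Stand-in for a learned 1x1 convolutional head: linear projection.
    weights = rng.standard_normal((C, num_verbs))
    logits = feature_map @ weights            # shape (H, W, num_verbs)
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid -> probabilities

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 16))   # toy backbone output
probs = interaction_grid(features, num_verbs=30, rng=rng)
print(probs.shape)  # (8, 8, 30)
```

Whether the scene contains two objects or twenty, the same 8×8 grid of anchors is evaluated once, which is what makes the method's runtime independent of the object count.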
The dataset used
We use the V-COCO dataset, which consists of around 10,000 images annotated with about 30 interaction verbs.
Calipso achieves a score of 45% correct recognition of ⟨person, verb, object⟩ triplets on this dataset.