Multimedia Indexing and Search

In the area of “Multimedia Indexing and Search”, the MIDI team works on machine learning for computer vision, including large-scale indexing, visual matching with high-order occurrence pooling, and fine-grained image classification [1]. The goal of this research (2 PhD theses: Negrel (2014), Jacob (ongoing); Project: ALICE) is to learn image representations that encode visual similarity. The focus is on high-order statistics that capture fine-grained details in images while keeping the computational complexity of the embedding methods under control. This work also covers similarity metrics for comparing media such as images, videos and 3D meshes. More recently, these approaches have been integrated into a deep learning framework.
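As a rough illustration of what a high-order statistic of local descriptors can look like, the sketch below pools a set of descriptors into a second-order signature (the average of their outer products) and compares two signatures with a dot product. It is a generic toy example under stated assumptions, not the compact tensor aggregation of [1]; the function names `second_order_pool` and `similarity` and the two-dimensional descriptors are hypothetical.

```python
def second_order_pool(descriptors):
    """Average the outer products x x^T of local descriptors, then L2-normalize.

    `descriptors` is a non-empty list of equal-length lists of floats.
    The signature has d * d entries for d-dimensional descriptors.
    """
    d = len(descriptors[0])
    pooled = [[0.0] * d for _ in range(d)]
    for x in descriptors:
        for a in range(d):
            for b in range(d):
                pooled[a][b] += x[a] * x[b]
    n = len(descriptors)
    flat = [pooled[a][b] / n for a in range(d) for b in range(d)]
    norm = sum(v * v for v in flat) ** 0.5
    return [v / norm for v in flat] if norm > 0 else flat


def similarity(sig_a, sig_b):
    """Dot product between two normalized signatures."""
    return sum(u * v for u, v in zip(sig_a, sig_b))


# Hypothetical local descriptors for two images (illustration only).
image_a = [[1.0, 0.0], [0.0, 1.0]]
image_b = [[1.0, 1.0]]
sig_a = second_order_pool(image_a)
sig_b = second_order_pool(image_b)
score_same = similarity(sig_a, sig_a)
score_diff = similarity(sig_a, sig_b)
```

The signature size grows as d² with the descriptor dimension d (and faster for higher orders), which is why keeping the cost of such embeddings under control matters.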

Team members also work on texture classification and human action recognition. They have proposed several feature extraction techniques, ranging from traditional self-similarity-based features to the more recent deep normalized convolution network, as well as a unified framework that combines their advantages (PhD thesis: L. Nguyen, 2018).

Finally, members of the team have worked on pure machine learning problems, notably asynchronous decentralized learning using gossip protocols [3]. The aim of this research (PhD thesis: J. Fellus, 2017) is to transpose well-known machine learning algorithms to a decentralized setting in which there is no central authority and nodes can join and leave the computation network at will. We propose an asynchronous framework based on gossip communication protocols, within which we derive several algorithms that are provably equivalent to their centralized counterparts, such as "gossiped" versions of K-Means, PCA and SVM. The performance of all methods is validated on standard benchmarks. A recent collaboration (PhD M. Blot, LIP6/UPMC) studies the extension of the framework to deep neural networks.
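To give an intuition for gossip-based computation, the sketch below implements plain randomized pairwise averaging: at each step two randomly chosen nodes replace their local estimates with their mean, and every node drifts toward the global average without any central coordinator. This is a textbook toy example under stated assumptions, not the specific protocol of [3]; the function name `gossip_average`, the node values and the number of rounds are hypothetical.

```python
import random


def gossip_average(values, rounds, seed=0):
    """Randomized pairwise gossip averaging.

    Each round, two distinct nodes picked at random exchange their
    estimates and both keep the mean. The sum of all estimates is
    preserved, so the estimates converge to the global mean.
    """
    rng = random.Random(seed)
    estimates = list(values)
    n = len(estimates)
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)
        mean = (estimates[i] + estimates[j]) / 2.0
        estimates[i] = mean
        estimates[j] = mean
    return estimates


# Hypothetical local values held by six nodes (illustration only).
local_values = [1.0, 5.0, 9.0, 13.0, 2.0, 6.0]
estimates = gossip_average(local_values, rounds=500)
global_mean = sum(local_values) / len(local_values)
max_error = max(abs(e - global_mean) for e in estimates)
```

Because only pairs of nodes communicate, no node ever needs a global view of the network. Handling nodes that join or leave during the computation, as described above, requires more than this simple scheme.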

Video analysis (PhD: D. Luvizon (ongoing); Postdoc: Kihl (2013-2015); Project: Terrarush) is also an active area of research: we analyze video streams in order to recognize events such as actions performed by humans. This research focuses on learning efficient representations, first from the raw images and then from structural information such as pose data. More recently, these approaches have been integrated into a unified deep learning framework.

Selected references

[1] Romain Negrel, David Picard, and Philippe-Henri Gosselin. Web scale image retrieval using compact tensor aggregation of visual descriptors. IEEE MultiMedia, 20(3):24–33, March 2013.

[2] Hedi Tabia and Hamid Laga. Modeling and Exploring Co-variations in the Geometry and Configuration of Man-made 3D Shape Families. Computer Graphics Forum, 36(5):13-25, August 2017. doi:10.1111/cgf.13241.

[3] Jérôme Fellus, David Picard, and Philippe-Henri Gosselin. Dimensionality reduction in decentralized networks by Gossip aggregation of principal components analyzers. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 171-176, Bruges, Belgium, April 2014.

Working with 3D data

The team also works on describing, recognizing, retrieving and classifying three-dimensional (3D) data. These are fundamental problems and building blocks for many applications in computer vision, computer graphics, medical imaging and archaeology, and they cover both 3D shapes and 3D actions.

Our research focuses on the analysis of 3D shapes, including 3D shape recognition, retrieval and classification. It mainly involves techniques that assign a high-level description to an object based on the sensed data that represent it. Applications include generic 3D object recognition as well as specific cases such as facial recognition [2].

Additionally, we have worked on efficient 3D search tools, including search over diverse 3D data collections containing cross-domain objects, with 3D shape retrieval from queries given as images, sketches, shapes or combinations of these. Finally, the emergence of low-cost sensors has allowed us to start working on 3D action recognition: videos can now carry per-pixel depth information, which gives a 3D model of the scene and allows more reliable and robust action recognition than a conventional RGB camera. Such applications are becoming popular in research and industrial development, for example in human-machine interaction (HMI), behavior monitoring and sign language recognition. We have proposed new methods for 3D action recognition, mostly based on feature extraction and classification. This work has been supported since 2016 by the ARAV3D platform.

Special mention should be made of the team's work on applying machine vision and analysis techniques to cultural heritage problems. This research covers several aspects. For example, it addresses (PhD thesis: M. Paumard (ongoing); Postdoc: Y. Ren (2015)) the indexing of cultural heritage image collections, the recognition of ancient photographic papers, the recognition of ancient paper watermarks, and the automatic reassembly of ancient archaeological fragments (Project: ARCHEPUZ'3D). Members of the team have also worked on the creation of 3D models and the 3D reconstruction of the Château de Versailles from ancient architectural sketches, an effort that was very well received by the experts and partners in the cultural domain (Centre de recherche du château de Versailles, Archives Nationales, BNF). This work was funded under the VERSPERA project [4] and produced reusable software for reconstructing other ancient buildings.


VERSPERA project: original floor plan and 3D model.


Selected references

[2] Hedi Tabia and Hamid Laga. Modeling and Exploring Co-variations in the Geometry and Configuration of Man-made 3D Shape Families. Computer Graphics Forum, 36(5):13-25, August 2017. doi:10.1111/cgf.13241.

[4] Hedi Tabia, Christophe Riedinger, and Michel Jordan. Automatic reconstruction of heritage monuments from old architecture documents. Journal of Electronic Imaging, SPIE and IS&T, 26(1), Special Section on Image Processing for Cultural Heritage, 2016. doi:10.1117/1.JEI.26.1.011006.