AI across Multiple Modalities
The way we sense and interact with the world around us is inherently multimodal: we see, we hear, we talk. As AI systems are embedded in the real world, be they robots, autonomous vehicles or medical diagnosis systems, they too need the capability both to understand different types of signals and to generate different types of information, including images, audio and language.
Within the centre, we have world-class researchers in Computer Vision, Audio and Natural Language Processing working at the forefront of research in single modalities and, increasingly, at the crossroads of multiple modalities.
Some of the central themes within the Centre for Multimodal AI are:
- Computational models for the analysis and understanding of images, video and text, including Detection, Recognition and Segmentation of objects and scenes in images and videos; Large Language Model (LLM)-driven Video Understanding and Search; Vision-Language Joint Learning of Compositional Representations; Vision-Language model-driven understanding of Human Behaviour, Affect and Actions; Vision-Language and LLM-driven Question Answering; and Video Summarisation.
- Computational models for 3D Vision, including 3D reconstruction, 3D animation, and 3D search and retrieval.
- Methods and systems for the Analysis of Environmental Audio, Sound Production, Semantic Audio, Music Informatics, Computational Musicology, Computational & Virtual Acoustics, Sound Synthesis, Augmented Instruments, and Generative AI for Music.
- Development of computational models that can understand and generate language and visual information in widely varying contexts, including social media, human-human interaction and human-computer/robot interaction, and that can extract key information from vast amounts of text.