Grand Challenges:

Main Contact:

Dong Chen: superminih1998@gs.zzu.edu.cn

Challenge Description

While general vision-language pre-trained models (VLMs) have made significant advancements, they primarily focus on modality-specific instructions that involve only a single image as visual context with limited expression, and they overlook the need for compelling image-text generation. This restricts the broader applicability of such versatile multi-modal assistants. To facilitate research on interleaved vision-language instruction-following capability, we build Inova, a comprehensive challenge of 36 tasks across 7 categories covering 20 diverse scenarios. Inova has three important properties: (1) Mixed forms of comprehension and generation instructions, ranging from visual reference customization requiring pixel-level generation to multi-image discrimination requiring fine-grained understanding. (2) Interleaved vision-language contexts: all instructions involve sequences of interconnected images and texts that fully express comprehension and generation tasks. (3) A diverse range of instruction-following scenarios: the benchmark covers various real-world scenarios. This challenge consists of two sub-tracks: Interleaved Image-Text Comprehension and Interleaved Image-Text Generation. Specifically, we extensively gather a wide variety of multi-modal datasets from various tasks.
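The exact Inova data schema is not specified here; the following is a hypothetical Python sketch of what an interleaved vision-language instruction sample might look like, illustrating how comprehension and generation tasks can share the same interleaved context while differing only in the target modality. All field names and file names are illustrative placeholders.

```python
# Hypothetical illustration only: the actual Inova data format is defined by the
# challenge organizers and may differ. Image placeholders are woven into the text
# sequence to form an interleaved instruction.

comprehension_sample = {
    "task": "multi-image discrimination",          # fine-grained understanding
    "instruction": [
        {"type": "text",  "value": "Which candidate shows the same dog as the reference?"},
        {"type": "image", "value": "reference.jpg"},
        {"type": "image", "value": "candidate_a.jpg"},
        {"type": "image", "value": "candidate_b.jpg"},
    ],
    "target": {"type": "text", "value": "candidate_a"},
}

generation_sample = {
    "task": "visual reference customization",      # pixel-level generation
    "instruction": [
        {"type": "text",  "value": "Render the cat from the first image in the style of the second."},
        {"type": "image", "value": "cat.jpg"},
        {"type": "image", "value": "style.jpg"},
    ],
    "target": {"type": "image", "value": "expected_output.png"},
}
```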

Challenge Website

https://inovachallenge.github.io/ICME2025

Main Contact:

Liu Kai: kai.lk@u.nus.edu

Challenge Description

The Responsible Multimodal AI Challenge aims to foster advancements in the development of reliable and trustworthy multimodal AI systems by addressing two crucial tasks: i) multimodal hallucination detection and ii) multimodal factuality detection. These tasks are designed to highlight the challenges and encourage innovative solutions for mitigating critical risks associated with generative multimodal AI. Task A, Multimodal Hallucination Detection, focuses on identifying hallucinated content in AI-generated captions for images. Participants will analyze captions to detect objects, attributes, or relationships that are fabricated or unsupported by the visual input. Task B, Multimodal Factuality Detection, emphasizes verifying the factuality of textual claims using both visual and contextual textual information. Participants will assess the factuality of claims in real-world scenarios. By addressing these tasks, the challenge seeks to promote the development of robust evaluation methodologies and algorithms that mitigate risks such as misinformation, bias, and errors in multimodal systems.
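To make Task A concrete, below is a naive, illustrative sketch (not the challenge's official baseline or data format) that scores image-caption agreement with the Hugging Face CLIP model and flags low-similarity captions as potentially hallucinated. The image path, captions, and threshold are placeholders.

```python
# Naive hallucination signal for Task A: rank candidate captions by CLIP
# image-text similarity and flag captions that agree poorly with the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                      # placeholder path
captions = [
    "A man riding a bicycle on a city street.",
    "A man riding a horse through a shopping mall.",   # plausibly hallucinated
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_captions)

scores = logits_per_image.squeeze(0).tolist()
threshold = 25.0                                        # arbitrary; would need tuning
for caption, score in zip(captions, scores):
    flag = "possible hallucination" if score < threshold else "supported"
    print(f"{score:6.2f}  {flag}: {caption}")
```

Note that such a global similarity score cannot localize which object, attribute, or relationship is hallucinated, which is precisely what Task A asks for, so finer-grained methods are needed.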

Challenge Website

https://mm-hall-fact.github.io/ICME2025

Main Contact:

Babak Naderi: babaknaderi@microsoft.com

Challenge Description

Super-Resolution (SR) is a critical task in computer vision, focusing on reconstructing high-resolution (HR) images from low-resolution (LR) inputs. The field has seen significant progress through various challenges, particularly in single image SR. Video Super-Resolution (VSR) extends this to the temporal domain, aiming to enhance video quality using methods like local, uni/bi-directional propagation, or traditional upscaling followed by restoration.

Recent challenges, such as NTIRE and AIM, have explored VSR under different conditions, including clean LR, motion blur, and frame drops, as well as quality enhancement for videos encoded with H.265 and AV1. Performance is typically measured using metrics like PSNR, SSIM, and LPIPS, though these do not always align with subjective opinions. Models trained on synthetic data often face issues with error propagation in real-world scenarios.

This ICME Grand Challenge addresses VSR for conferencing, where LR videos are encoded with H.265 at fixed QPs. The goal is to upscale videos by a fixed factor, providing HR outputs with enhanced perceptual quality under a low-delay scenario, i.e., the VSR models must be causal and must not use any future frames when rendering the current frame. Outputs will be evaluated through subjective tests following the crowdsourcing implementation of ITU-T Rec. P.910. The challenge includes three tracks:
– Track 1: General-purpose videos, x4 upscaling
– Track 2: Talking head videos, x4 upscaling
– Track 3: Screen sharing videos, x3 upscaling
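To make the low-delay constraint concrete, the following is a minimal PyTorch sketch (not a challenge baseline) of causal VSR inference: frame t is reconstructed from the current LR frame and a hidden state carrying information from past frames only. The CausalVSRNet module, its layer sizes, and the frame resolutions are hypothetical placeholders.

```python
# Minimal sketch of causal (low-delay) VSR inference with a hypothetical
# unidirectional recurrent model; no future frames are ever accessed.
import torch
import torch.nn as nn

class CausalVSRNet(nn.Module):
    """Placeholder unidirectional VSR model (not a challenge baseline)."""
    def __init__(self, scale: int = 4, channels: int = 32):
        super().__init__()
        self.scale = scale
        self.fuse = nn.Conv2d(3 + channels, channels, 3, padding=1)
        self.state = nn.Conv2d(channels, channels, 3, padding=1)
        self.head = nn.Conv2d(channels, 3 * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, lr_frame, hidden):
        feat = torch.relu(self.fuse(torch.cat([lr_frame, hidden], dim=1)))
        hidden = torch.relu(self.state(feat))          # information flows forward only
        residual = self.shuffle(self.head(feat))
        base = nn.functional.interpolate(
            lr_frame, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return base + residual, hidden

@torch.no_grad()
def upscale_stream(model, lr_frames, channels: int = 32):
    """Process frames strictly in order; frame t never sees frames t+1, t+2, ..."""
    b, _, h, w = lr_frames[0].shape
    hidden = torch.zeros(b, channels, h, w)            # must match the model's width
    outputs = []
    for lr in lr_frames:                               # causal: one frame at a time
        sr, hidden = model(lr, hidden)
        outputs.append(sr)
    return outputs

# Example: ten 180x320 LR frames upscaled x4 to 720x1280.
frames = [torch.rand(1, 3, 180, 320) for _ in range(10)]
sr_frames = upscale_stream(CausalVSRNet(scale=4), frames)
print(sr_frames[0].shape)  # torch.Size([1, 3, 720, 1280])
```

A bidirectional propagation model, by contrast, would aggregate features from both past and future frames and therefore would not satisfy the low-delay requirement of this challenge.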

Challenge Website

https://www.microsoft.com/en-us/research/video-super-resolution-challenge-icme-2025

Main Contact:

Hai Wei: haiwei@amazon.com

Challenge Description

Objective Video Quality Models (VQM) have been an active research area for decades, aiming to automate video quality assessment tasks and processes. Even with the well-known deficiencies of existing methods (PSNR, SSIM, VMAF, etc.), most OTT (over-the-top) streaming service providers leverage objective VQM to improve their video encoding efficiency and to monitor and control video quality along the streaming workflow. Therefore, continuous improvement of objective VQM models in terms of accuracy (i.e., correlation with human perceptual quality) and runtime performance will enable OTT service providers to drive down encoding costs while maintaining a high-quality experience for streaming customers.

Over the years, HDR video content has seen increasing adoption by various streaming and video hosting services (such as Amazon Prime Video, Netflix, and YouTube). HDR is also increasingly available as part of live broadcast and streaming workflows. Streaming HDR content has introduced unique challenges related to the quality of user experience and the performance of video compression algorithms. The increases in bit depth and the use of nonlinear transfer functions in HDR can change the visibility and severity of compression distortions. Being able to objectively measure and control HDR perceptual quality has become a critical capability for premium video streaming services. However, the lack of generalizable VQM models that work on both HDR and SDR has become a bottleneck for OTT service providers seeking to scale up their offerings and to improve compression efficiency and perceptual quality for HDR content.

In this grand challenge, we invite the research community to participate and submit novel or improved VQM models for objectively predicting HDR and SDR video quality in both full-reference and no-reference use cases. An HDR and SDR video dataset with human subjective quality scores (as ground truth) will be shared to facilitate VQM model training and testing. The new dataset is collected using a pairwise comparison (PC) protocol instead of Absolute Category Rating (ACR) to reduce uncertainty. The challenge will focus on predicting the quality of both HDR and SDR videos with various degrees of compression and scaling artifacts.
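Since accuracy here means correlation with human perceptual quality, a minimal sketch of the standard evaluation is shown below: Pearson (PLCC) and Spearman rank-order (SROCC) correlation between a model's predictions and subjective ground-truth scores, computed with SciPy. The score values are made-up placeholders, not challenge data.

```python
# Sketch of the correlation-based accuracy check described above.
from scipy.stats import pearsonr, spearmanr

subjective = [72.1, 65.4, 80.3, 45.0, 58.7, 90.2]   # ground-truth quality scores
predicted  = [70.5, 60.2, 83.1, 50.3, 55.9, 88.0]   # a model's predictions

plcc, _ = pearsonr(predicted, subjective)    # linear correlation (PLCC)
srocc, _ = spearmanr(predicted, subjective)  # rank-order correlation (SROCC)
print(f"PLCC={plcc:.3f}  SROCC={srocc:.3f}")
```

In practice, a monotonic nonlinear (e.g., logistic) mapping is commonly fitted between predicted and subjective scores before computing PLCC; SROCC is insensitive to such monotonic mappings.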

Challenge Website

https://sites.google.com/view/icme25-vqm-gc/home?authuser=0 

Main Contact:

Hang Chen: ch199703@mail.ustc.edu.cn

Challenge Description

Lip reading aims to recognize spoken content based solely on visual information derived from the speaker’s lip movements. This emerging and challenging field lies at the intersection of computer vision and natural language processing and plays a key role in various applications in different domains.

Meetings represent one of the most valuable yet challenging contexts for lipreading due to the rich information exchange and decision-making processes involved. The meeting-scenario Chinese Lipreading (MeetCLR) challenge centers on the multi-speaker lipreading task, where both the training and evaluation datasets involve different groups of speakers. The specific tasks considered in the challenge are: 1. Visual Speaker Diarization and 2. Visual Speech Recognition.
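The description above names the two tasks but not the scoring details, which are defined on the challenge website. As a hedged illustration, the sketch below computes the metrics most commonly used for these tasks: diarization error rate (DER) via pyannote.metrics for speaker diarization, and character error rate (CER) via jiwer for Mandarin speech recognition. The reference and hypothesis data are toy placeholders.

```python
# Illustrative only: the official MeetCLR scoring may differ from these common metrics.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate
from jiwer import cer

# Task 1: Visual Speaker Diarization -- who is speaking when (toy example).
reference, hypothesis = Annotation(), Annotation()
reference[Segment(0.0, 4.0)] = "spk_A"
reference[Segment(4.0, 9.0)] = "spk_B"
hypothesis[Segment(0.0, 3.5)] = "spk_1"
hypothesis[Segment(3.5, 9.0)] = "spk_2"
der = DiarizationErrorRate()(reference, hypothesis)

# Task 2: Visual Speech Recognition -- transcript accuracy in characters (toy example).
ref_text = "今天的会议讨论预算问题"
hyp_text = "今天的会议讨论预算难题"
char_error_rate = cer(ref_text, hyp_text)

print(f"DER={der:.3f}  CER={char_error_rate:.3f}")
```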

Challenge Website

https://mispchallenge.github.io/MeetCLR/index.html

Main Contact:

Junbo Zhang: zhangjunbo1@xiaomi.com

Challenge Description

The ICME 2025 Audio Encoder Capability Challenge aims to evaluate the capabilities of audio encoders, especially in the context of multi-task learning and real-world applications. Participants are invited to submit pre-trained audio encoders that map raw waveforms to continuous embeddings. These encoders will be tested across diverse tasks spanning speech, environmental sounds, and music, with a focus on real-world usability. The challenge features two tracks: Track A for parameterized evaluation and Track B for parameter-free evaluation. This challenge provides a platform for evaluating and advancing the state of the art in audio encoder design.
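To illustrate the two evaluation styles, below is a minimal sketch built on a hypothetical encoder interface: a parameterized evaluation trains a lightweight probe on frozen embeddings, while a parameter-free evaluation uses, for example, nearest-neighbour classification with no trained head. The encoder, waveforms, and labels are random placeholders; the challenge's actual submission API and downstream tasks are defined on its website and may differ.

```python
# Sketch of parameterized (Track A style) vs. parameter-free (Track B style)
# evaluation on top of frozen audio embeddings. All data below is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def encode(waveforms: np.ndarray) -> np.ndarray:
    """Placeholder for a submitted encoder: raw waveform -> fixed-size embedding."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(waveforms), 256))   # e.g., 256-d clip embeddings

# Toy downstream task: 200 one-second 16 kHz clips, 5 classes.
rng = np.random.default_rng(1)
waves = rng.standard_normal((200, 16000))
labels = rng.integers(0, 5, size=200)
emb = encode(waves)
X_train, X_test = emb[:150], emb[150:]
y_train, y_test = labels[:150], labels[150:]

# Parameterized evaluation: fit a lightweight probe on the frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))

# Parameter-free evaluation: no trained head, e.g. nearest-neighbour lookup.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("kNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```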

Challenge Website

https://dataoceanai.github.io/ICME2025-Audio-Encoder-Challenge/