Computer Vision and Pattern Recognition
★ SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation CVPR 2025
Creating high-quality animatable 3D human avatars from a single image remains
a significant challenge in computer vision due to the inherent difficulty of
reconstructing complete 3D information from a single viewpoint. Current
approaches face a clear limitation: 3D Gaussian Splatting (3DGS) methods
produce high-quality results but require multiple views or video sequences,
while video diffusion models can generate animations from single images but
struggle with consistency and identity preservation. We present SVAD, a novel
approach that addresses these limitations by leveraging complementary strengths
of existing techniques. Our method generates synthetic training data through
video diffusion, enhances it with identity preservation and image restoration
modules, and utilizes this refined data to train 3DGS avatars. Comprehensive
evaluations demonstrate that SVAD outperforms state-of-the-art (SOTA)
single-image methods in maintaining identity consistency and fine details
across novel poses and viewpoints, while enabling real-time rendering
capabilities. Through our data augmentation pipeline, we overcome the
dependency on dense monocular or multi-view training data typically required by
traditional 3DGS approaches. Extensive quantitative and qualitative comparisons
show that our method achieves superior performance across multiple metrics against
baseline models. By effectively combining the generative power of diffusion
models with both the high-quality results and rendering efficiency of 3DGS, our
work establishes a new approach for high-fidelity avatar generation from a
single image input.
comment: Accepted by CVPR 2025 SyntaGen Workshop, Project Page:
https://yc4ny.github.io/SVAD/
★ 3D Scene Generation: A Survey
3D scene generation seeks to synthesize spatially structured, semantically
meaningful, and photorealistic environments for applications such as immersive
media, robotics, autonomous driving, and embodied AI. Early methods based on
procedural rules offered scalability but limited diversity. Recent advances in
deep generative models (e.g., GANs, diffusion models) and 3D representations
(e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene
distributions, improving fidelity, diversity, and view consistency. Approaches
such as diffusion models further bridge 3D scene synthesis and photorealism by
reframing generation as image or video synthesis problems. This survey provides
a systematic overview of state-of-the-art approaches, organizing them into four
paradigms: procedural generation, neural 3D-based generation, image-based
generation, and video-based generation. We analyze their technical foundations,
trade-offs, and representative results, and review commonly used datasets,
evaluation protocols, and downstream applications. We conclude by discussing
key challenges in generation capacity, 3D representation, data and annotations,
and evaluation, and outline promising directions including higher fidelity,
physics-aware and interactive generation, and unified perception-generation
models. This review organizes recent advances in 3D scene generation and
highlights promising directions at the intersection of generative AI, 3D
vision, and embodied intelligence. To track ongoing developments, we maintain
an up-to-date project page:
https://github.com/hzxie/Awesome-3D-Scene-Generation.
comment: Project Page: https://github.com/hzxie/Awesome-3D-Scene-Generation
★ DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion CVPR 2025
Current Structure-from-Motion (SfM) methods typically follow a two-stage
pipeline, combining learned or geometric pairwise reasoning with a subsequent
global optimization step. In contrast, we propose a data-driven multi-view
reasoning approach that directly infers 3D scene geometry and camera poses from
multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry
and cameras as pixel-wise ray origins and endpoints in a global frame and
employs a transformer-based denoising diffusion model to predict them from
multi-view inputs. To address practical challenges in training diffusion models
with missing data and unbounded scene coordinates, we introduce specialized
mechanisms that ensure robust learning. We empirically validate DiffusionSfM on
both synthetic and real datasets, demonstrating that it outperforms classical
and learning-based approaches while naturally modeling uncertainty.
comment: CVPR 2025. Project website: https://qitaozhao.github.io/DiffusionSfM
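As a concrete illustration of the ray-based parameterization described in the DiffusionSfM abstract, the sketch below derives pixel-wise ray origins and endpoints from standard pinhole geometry. It is an assumption-laden reconstruction (camera-to-world pose and z-depth conventions are assumed), not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): pixel-wise ray origins and
# endpoints from a pinhole camera, one way to parameterize geometry and pose
# jointly. Camera-to-world pose and z-depth conventions are assumptions.
import numpy as np

def rays_from_camera(K, cam_to_world, depth):
    """K: (3,3) intrinsics; cam_to_world: (4,4) pose; depth: (H,W) z-depth."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))           # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)
    dirs_cam = pix @ np.linalg.inv(K).T                       # camera-frame directions (z = 1)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    origins = np.broadcast_to(t, (H * W, 3))                  # ray origins = camera center
    endpoints = (dirs_cam * depth.reshape(-1, 1)) @ R.T + t   # back-projected 3D surface points
    return origins.reshape(H, W, 3), endpoints.reshape(H, W, 3)

# Example: identity pose, 4x4 image, unit depth.
K = np.array([[2.0, 0, 2.0], [0, 2.0, 2.0], [0, 0, 1.0]])
o, e = rays_from_camera(K, np.eye(4), np.ones((4, 4)))
print(o.shape, e.shape)  # (4, 4, 3) (4, 4, 3)
```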
★ Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, Weilin Huang
Recent progress in unified models for image understanding and generation has
been impressive, yet most approaches remain limited to single-modal generation
conditioned on multiple modalities. In this paper, we present Mogao, a unified
framework that advances this paradigm by enabling interleaved multi-modal
generation through a causal approach. Mogao integrates a set of key technical
improvements in architecture design, including a deep-fusion design, dual
vision encoders, interleaved rotary position embeddings, and multi-modal
classifier-free guidance, which allow it to harness the strengths of both
autoregressive models for text generation and diffusion models for high-quality
image synthesis. These practical improvements also make Mogao particularly
effective at processing arbitrarily interleaved sequences of text and images. To
further unlock the potential of unified models, we introduce an efficient
training strategy on a large-scale, in-house dataset specifically curated for
joint text and image generation. Extensive experiments show that Mogao not only
achieves state-of-the-art performance in multi-modal understanding and
text-to-image generation, but also excels in producing high-quality, coherent
interleaved outputs. Its emergent capabilities in zero-shot image editing and
compositional generation highlight Mogao as a practical omni-modal foundation
model, paving the way for the future development and scaling of unified
multi-modal systems.
comment: Mogao Technical Report
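Among the components listed for Mogao, multi-modal classifier-free guidance can be illustrated generically: conditional and unconditional predictions are combined with separate guidance weights per condition stream. The sketch below shows only that general recipe; Mogao's exact formulation, scales, and model interface are not specified in the abstract.

```python
# Generic classifier-free guidance with two condition streams (text, image).
# A schematic sketch of the general technique; Mogao's exact formulation and
# guidance scales are assumptions here.
import numpy as np

def multimodal_cfg(eps_uncond, eps_text, eps_full, w_text=5.0, w_image=1.5):
    """Compose a prediction from three forward passes: unconditional,
    text-conditioned, and text+image-conditioned."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)     # push toward the text condition
            + w_image * (eps_full - eps_text))     # then toward the added image condition

# Toy usage with random arrays standing in for model outputs.
rng = np.random.default_rng(0)
e_u, e_t, e_f = (rng.normal(size=(4, 8)) for _ in range(3))
print(multimodal_cfg(e_u, e_t, e_f).shape)  # (4, 8)
```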
★ Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
We propose Flow-GRPO, the first method integrating online reinforcement
learning (RL) into flow matching models. Our approach uses two key strategies:
(1) an ODE-to-SDE conversion that transforms a deterministic Ordinary
Differential Equation (ODE) into an equivalent Stochastic Differential Equation
(SDE) that matches the original model's marginal distribution at all timesteps,
enabling statistical sampling for RL exploration; and (2) a Denoising Reduction
strategy that reduces training denoising steps while retaining the original
inference timestep number, significantly improving sampling efficiency without
performance degradation. Empirically, Flow-GRPO is effective across multiple
text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly
perfect object counts, spatial relations, and fine-grained attributes, boosting
GenEval accuracy from $63\%$ to $95\%$. In visual text rendering, its accuracy
improves from $59\%$ to $92\%$, significantly enhancing text generation.
Flow-GRPO also achieves substantial gains in human preference alignment.
Notably, little to no reward hacking occurred, meaning rewards did not increase
at the cost of image quality or diversity, and both remained stable in our
experiments.
comment: Code: https://github.com/yifan123/flow_grpo
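GRPO-style online RL typically scores a group of samples drawn for the same prompt and normalizes rewards within that group. The sketch below isolates that group-relative advantage step as a hedged illustration; it omits Flow-GRPO's ODE-to-SDE sampler, denoising reduction, and policy-update terms.

```python
# Group-relative advantage computation in the GRPO style: sample a group of
# images per prompt, score them with a reward model, and normalize rewards
# within the group. A generic sketch, not Flow-GRPO's full objective.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards per sampled image."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # advantage of each sample within its group

rewards = torch.tensor([[0.2, 0.9, 0.5, 0.4],    # prompt 1: 4 sampled images
                        [0.7, 0.7, 0.1, 0.3]])   # prompt 2
print(group_relative_advantages(rewards))
```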
★ Generating Physically Stable and Buildable LEGO Designs from Text
We introduce LegoGPT, the first approach for generating physically stable
LEGO brick models from text prompts. To achieve this, we construct a
large-scale, physically stable dataset of LEGO designs, along with their
associated captions, and train an autoregressive large language model to
predict the next brick to add via next-token prediction. To improve the
stability of the resulting designs, we employ an efficient validity check and
physics-aware rollback during autoregressive inference, which prunes infeasible
token predictions using physics laws and assembly constraints. Our experiments
show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO
designs that align closely with the input text prompts. We also develop a
text-based LEGO texturing method to generate colored and textured designs. We
show that our designs can be assembled manually by humans and automatically by
robotic arms. We also release our new dataset, StableText2Lego, containing over
47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed
captions, along with our code and models at the project website:
https://avalovelace1.github.io/LegoGPT/.
comment: Project page: https://avalovelace1.github.io/LegoGPT/
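The validity check and physics-aware rollback described above can be pictured as a decoding loop that resamples or backtracks whenever a candidate brick fails a feasibility test. The sketch below is a toy version of that control flow; the sampler, validity predicate, and token vocabulary are hypothetical stand-ins, not the LegoGPT interface.

```python
# Schematic autoregressive decoding with a validity check and rollback, in the
# spirit of pruning infeasible brick tokens during inference. `sample_next` and
# `is_buildable` are hypothetical stand-ins for the model and physics check.
import random

def decode_with_rollback(sample_next, is_buildable, max_len=50, max_retries=8):
    seq = []
    while len(seq) < max_len:
        banned = set()                       # tokens already rejected at this step
        for _ in range(max_retries):
            tok = sample_next(seq, banned)   # sample a candidate next brick
            if tok is None:                  # model (or toy sampler) chose to stop
                return seq
            if is_buildable(seq + [tok]):    # physics / assembly validity check
                seq.append(tok)
                break
            banned.add(tok)                  # prune the infeasible prediction and resample
        else:
            if not seq:
                break
            seq.pop()                        # rollback: drop the last brick and continue
    return seq

# Toy demo: tokens are integers; "buildable" means the running sum stays <= 10.
next_tok = lambda seq, banned: random.choice([t for t in range(1, 5) if t not in banned] or [None])
print(decode_with_rollback(next_tok, lambda s: sum(s) <= 10, max_len=6))
```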
★ StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang
We present StreamBridge, a simple yet effective framework that seamlessly
transforms offline Video-LLMs into streaming-capable models. It addresses two
fundamental challenges in adapting existing models to online scenarios: (1)
limited capability for multi-turn real-time understanding, and (2) lack of
proactive response mechanisms. Specifically, StreamBridge incorporates (1) a
memory buffer combined with a round-decayed compression strategy, supporting
long-context multi-turn interactions, and (2) a decoupled, lightweight
activation model that can be effortlessly integrated into existing Video-LLMs,
enabling continuous proactive responses. To further support StreamBridge, we
construct Stream-IT, a large-scale dataset tailored for streaming video
understanding, featuring interleaved video-text sequences and diverse
instruction formats. Extensive experiments show that StreamBridge significantly
improves the streaming understanding capabilities of offline Video-LLMs across
various tasks, outperforming even proprietary models such as GPT-4o and Gemini
1.5 Pro. Simultaneously, it achieves competitive or superior performance on
standard video understanding benchmarks.
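A round-decayed compression strategy presumably keeps recent dialogue rounds at higher token resolution and compresses older rounds more aggressively. The sketch below illustrates that idea with average pooling over per-round feature tokens; the pooling operator and decay schedule are assumptions, not StreamBridge's design.

```python
# Illustrative round-decayed memory buffer: older dialogue rounds keep fewer
# tokens (stronger compression), recent rounds keep more. The pooling operator
# and decay schedule are assumptions for illustration only.
import torch
import torch.nn.functional as F

def compress_rounds(rounds, base_tokens=64, decay=0.5, min_tokens=4):
    """rounds: list of (num_tokens, dim) tensors, oldest first."""
    compressed = []
    for age, feats in enumerate(reversed(rounds)):           # age 0 = most recent round
        budget = max(min_tokens, int(base_tokens * (decay ** age)))
        if feats.shape[0] > budget:                          # average-pool down to the budget
            feats = F.adaptive_avg_pool1d(feats.t().unsqueeze(0), budget).squeeze(0).t()
        compressed.append(feats)
    return list(reversed(compressed))                        # restore chronological order

rounds = [torch.randn(128, 32) for _ in range(4)]            # 4 rounds of visual/text tokens
print([r.shape[0] for r in compress_rounds(rounds)])         # [8, 16, 32, 64]
```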
★ SITE: towards Spatial Intelligence Thorough Evaluation
Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, Boqing Gong
Spatial intelligence (SI) represents a cognitive ability encompassing the
visualization, manipulation, and reasoning about spatial relationships,
underpinning disciplines from neuroscience to robotics. We introduce SITE, a
benchmark dataset towards SI Thorough Evaluation in a standardized format of
multi-choice visual question-answering, designed to assess large
vision-language models' spatial intelligence across diverse visual modalities
(single-image, multi-image, and video) and SI factors (figural to environmental
scales, spatial visualization and orientation, intrinsic and extrinsic, static
and dynamic). Our approach to curating the benchmark combines a bottom-up
survey of 31 existing datasets and a top-down strategy drawing upon three
classification systems in cognitive science, which prompted us to design two
novel types of tasks about view-taking and dynamic scenes. Extensive
experiments reveal that leading models fall behind human experts, especially in
spatial orientation, a fundamental SI factor. Moreover, we demonstrate a
positive correlation between a model's spatial reasoning proficiency and its
performance on an embodied AI task.
★ Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding CVPR2025
Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Visual Document Understanding has become essential with the increase of
text-rich visual content. This field poses significant challenges due to the
need for effective integration of visual perception and textual comprehension,
particularly across diverse document types with complex layouts. Moreover,
existing fine-tuning datasets for this domain often fall short in providing the
detailed contextual information needed for robust understanding, leading to
hallucinations and limited comprehension of spatial relationships among visual
elements. To address these challenges, we propose an innovative pipeline that
utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML,
and TiKZ, to build highly structured document representations and deliver
contextually-grounded responses. We introduce two fine-grained structured
datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs
for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data
annotations for grounded instruction following. Extensive experiments
demonstrate that our proposed model significantly outperforms existing
state-of-the-art MLLMs across a range of visual document understanding
benchmarks, facilitating advanced reasoning and comprehension capabilities in
complex visual scenarios. Our code and models are released at
https://github.com/Euphoria16/DocMark.
comment: CVPR2025
★ TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Pioneering token-based works such as Chameleon and Emu3 have established a
foundation for multimodal unification but face challenges of high training
computational overhead and limited comprehension performance due to a lack of
high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer
that enhances comprehension by semanticizing vector-quantized (VQ) tokens and
incorporating CLIP-level semantics while enabling end-to-end multimodal
autoregressive training with standard VQ tokens. TokLIP integrates a low-level
discrete VQ tokenizer with a ViT-based token encoder to capture high-level
continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize
high-level features, TokLIP disentangles training objectives for comprehension
and generation, allowing the direct application of advanced VQ tokenizers
without the need for tailored quantization operations. Our empirical results
demonstrate that TokLIP achieves exceptional data efficiency, empowering visual
tokens with high-level semantic understanding while enhancing low-level
generative capacity, making it well-suited for autoregressive Transformers in
both comprehension and generation tasks. The code and models are available at
https://github.com/TencentARC/TokLIP.
comment: Technical Report
★ PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space Model
Serving the Intelligent Transport System (ITS) and Vehicle-to-Everything
(V2X) tasks, roadside perception has received increasing attention in recent
years, as it can extend the perception range of connected vehicles and improve
traffic safety. However, roadside point cloud oriented 3D object detection has
not been effectively explored. To some extent, the key to the performance of a
point cloud detector lies in the receptive field of the network and the ability
to effectively utilize the scene context. The recent emergence of Mamba, based
on State Space Model (SSM), has shaken up the traditional convolution and
transformers that have long been the foundational building blocks, due to its
efficient global receptive field. In this work, we introduce Mamba to
pillar-based roadside point cloud perception and propose a framework based on
Cross-stage State-space Group (CSG), called PillarMamba. It enhances the
expressiveness of the network and achieves efficient computation through
cross-stage feature fusion. However, due to the limitations of scan directions,
state space models suffer from disrupted local connections and forgotten
historical relationships. To address this, we propose the Hybrid State-space Block (HSB) to
obtain the local-global context of roadside point cloud. Specifically, it
enhances neighborhood connections through local convolution and preserves
historical memory through residual attention. The proposed method outperforms
the state-of-the-art methods on the popular large-scale roadside benchmark:
DAIR-V2X-I. The code will be released soon.
★ EDmamba: A Simple yet Effective Event Denoising Method with State Space Model
Event cameras excel in high-speed vision due to their high temporal
resolution, high dynamic range, and low power consumption. However, as dynamic
vision sensors, their output is inherently noisy, making efficient denoising
essential to preserve their ultra-low latency and real-time processing
capabilities. Existing event denoising methods struggle with a critical
dilemma: computationally intensive approaches compromise the sensor's
high-speed advantage, while lightweight methods often lack robustness across
varying noise levels. To address this, we propose a novel event denoising
framework based on State Space Models (SSMs). Our approach represents events as
4D event clouds and includes a Coarse Feature Extraction (CFE) module that
extracts embedding features from both geometric and polarity-aware subspaces.
The model is further composed of two essential components: A Spatial Mamba
(S-SSM) that models local geometric structures and a Temporal Mamba (T-SSM)
that captures global temporal dynamics, efficiently propagating spatiotemporal
features across events. Experiments demonstrate that our method achieves
state-of-the-art accuracy and efficiency, with 88.89K parameters, 0.0685s per
100K events inference time, and a 0.982 accuracy score, outperforming
Transformer-based methods by 2.08% in denoising accuracy while running 36x faster.
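The 4D event-cloud representation treats each event as a point (x, y, t, p). A minimal sketch of that conversion, with per-sample normalization choices that are assumptions rather than the paper's exact preprocessing:

```python
# Minimal sketch of a 4D event-cloud representation: each event becomes a point
# (x, y, t, p) with coordinates normalized per sample. Normalization choices
# are assumptions for illustration.
import numpy as np

def to_event_cloud(events, width, height):
    """events: (N, 4) array of (x, y, t_microseconds, polarity in {0, 1})."""
    ev = np.asarray(events, dtype=np.float64)
    x = ev[:, 0] / max(width - 1, 1)
    y = ev[:, 1] / max(height - 1, 1)
    t = ev[:, 2] - ev[:, 2].min()
    t = t / max(t.max(), 1e-9)                    # scale the time window to [0, 1]
    p = ev[:, 3] * 2.0 - 1.0                      # polarity to {-1, +1}
    return np.stack([x, y, t, p], axis=1)         # (N, 4) point cloud

demo = [(10, 20, 1000, 1), (11, 20, 1500, 0), (300, 150, 4000, 1)]
print(to_event_cloud(demo, width=346, height=260).round(3))
```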
★ GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans
We propose a novel method that reconstructs hair strands directly from
colorless 3D scans by leveraging multi-modal hair orientation extraction. Hair
strand reconstruction is a fundamental problem in computer vision and graphics
that can be used for high-fidelity digital avatar synthesis, animation, and
AR/VR applications. However, accurately recovering hair strands from raw scan
data remains challenging due to human hair's complex and fine-grained
structure. Existing methods typically rely on RGB captures, which can be
sensitive to the environment and can be a challenging domain for extracting the
orientation of guiding strands, especially for complex
hairstyles. To reconstruct the hair purely from the observed geometry, our
method finds sharp surface features directly on the scan and estimates strand
orientation through a neural 2D line detector applied to the renderings of scan
shading. Additionally, we incorporate a diffusion prior trained on a diverse
set of synthetic hair scans, refined with an improved noise schedule, and
adapted to the reconstructed contents via a scan-specific text prompt. We
demonstrate that this combination of supervision signals enables accurate
reconstruction of both simple and intricate hairstyles without relying on color
information. To facilitate further research, we introduce Strands400, the
largest publicly available dataset of hair strands with detailed surface
geometry extracted from real-world data, which contains reconstructed hair
strands from the scans of 400 subjects.
comment: 15 pages, 9 figures, 1 table
★ Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks IJCNN 2025
Spiking neural networks (SNNs), deployed on neuromorphic chips, provide
highly efficient solutions on edge devices across different scenarios.
However, their ability to adapt to distribution shifts after deployment has
become a crucial challenge. Online test-time adaptation (OTTA) offers a
promising solution by enabling models to dynamically adjust to new data
distributions without requiring source data or labeled target samples.
Nevertheless, existing OTTA methods are largely designed for traditional
artificial neural networks and are not well-suited for SNNs. To address this
gap, we propose a low-power, neuromorphic chip-friendly online test-time
adaptation framework, aiming to enhance model generalization under distribution
shifts. The proposed approach is called Threshold Modulation (TM), which
dynamically adjusts the firing threshold through neuronal dynamics-inspired
normalization, being more compatible with neuromorphic hardware. Experimental
results on benchmark datasets demonstrate the effectiveness of this method in
improving the robustness of SNNs against distribution shifts while maintaining
low computational cost. The proposed method offers a practical solution for
online test-time adaptation of SNNs, providing inspiration for the design of
future neuromorphic chips. The demo code is available at
github.com/NneurotransmitterR/TM-OTTA-SNN.
comment: Accepted by IJCNN 2025. © 2025 IEEE. Personal use of this
material is permitted. Permission from IEEE must be obtained for all other
uses, including reprinting/republishing this material for advertising or
promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works.
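One plausible reading of threshold modulation, offered purely as an illustration rather than the paper's rule, is a leaky integrate-and-fire update whose firing threshold is rescaled online by statistics of the membrane potential:

```python
# A plausible reading of threshold modulation (an assumption, not the paper's
# exact formula): a leaky integrate-and-fire step whose firing threshold is
# scaled online by batch statistics of the membrane potential, so firing rates
# stay stable under distribution shift.
import torch

def lif_step_with_modulated_threshold(v, x, tau=2.0, base_theta=1.0, momentum=0.1, stats=None):
    """v: membrane potential (B, C); x: input current (B, C); stats: running scale."""
    v = v + (x - v) / tau                               # leaky integration
    batch_std = v.detach().std().clamp(min=1e-3)
    stats = batch_std if stats is None else (1 - momentum) * stats + momentum * batch_std
    theta = base_theta * stats                          # modulated firing threshold
    spikes = (v >= theta).float()
    v = v * (1.0 - spikes)                              # hard reset after spiking
    return spikes, v, stats

v, stats = torch.zeros(8, 16), None
for _ in range(5):
    spikes, v, stats = lif_step_with_modulated_threshold(v, torch.randn(8, 16), stats=stats)
print(spikes.mean().item(), float(stats))
```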
★ OcularAge: A Comparative Study of Iris and Periocular Images for Pediatric Age Estimation
Estimating a child's age from ocular biometric images is challenging due to
subtle physiological changes and the limited availability of longitudinal
datasets. Although most biometric age estimation studies have focused on facial
features and adult subjects, pediatric-specific analysis, particularly of the
iris and periocular regions, remains relatively unexplored. This study presents
a comparative evaluation of iris and periocular images for estimating the ages
of children aged between 4 and 16 years. We utilized a longitudinal dataset
comprising more than 21,000 near-infrared (NIR) images, collected from 288
pediatric subjects over eight years using two different imaging sensors. A
multi-task deep learning framework was employed to jointly perform age
prediction and age-group classification, enabling a systematic exploration of
how different convolutional neural network (CNN) architectures, particularly
those adapted for non-square ocular inputs, capture the complex variability
inherent in pediatric eye images. The results show that periocular models
consistently outperform iris-based models, achieving a mean absolute error
(MAE) of 1.33 years and an age-group classification accuracy of 83.82%. These
results mark the first demonstration that reliable age estimation is feasible
from children's ocular images, enabling privacy-preserving age checks in
child-centric applications. This work establishes the first longitudinal
benchmark for pediatric ocular age estimation, providing a foundation for
designing robust, child-focused biometric systems. The developed models proved
resilient across different imaging sensors, confirming their potential for
real-world deployment. They also achieved inference speeds of less than 10
milliseconds per image on resource-constrained VR headsets, demonstrating their
suitability for real-time applications.
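The multi-task setup jointly regresses a continuous age and classifies an age group from shared features. A minimal sketch of such a two-head design, with the backbone, loss weighting, and group binning treated as assumptions:

```python
# Minimal multi-task head for joint age regression and age-group classification
# on top of shared backbone features. Backbone, loss weights, and group binning
# are assumptions for illustration.
import torch
import torch.nn as nn

class AgeMultiTaskHead(nn.Module):
    def __init__(self, feat_dim=512, num_groups=4):
        super().__init__()
        self.regressor = nn.Linear(feat_dim, 1)            # continuous age in years
        self.classifier = nn.Linear(feat_dim, num_groups)  # coarse age-group logits

    def forward(self, feats):
        return self.regressor(feats).squeeze(-1), self.classifier(feats)

head = AgeMultiTaskHead()
feats = torch.randn(16, 512)                               # pooled backbone features
age_true = torch.empty(16).uniform_(4, 16)
group_true = torch.randint(0, 4, (16,))
age_pred, group_logits = head(feats)
loss = nn.functional.l1_loss(age_pred, age_true) + 0.5 * nn.functional.cross_entropy(group_logits, group_true)
print(loss.item())
```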
★ Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt
We propose JointSeg, a novel joint framework that integrates super-resolution
and segmentation to generate 1-meter impervious surface area (ISA) maps
directly from freely available Sentinel-2 imagery. JointSeg was trained on
multimodal cross-resolution inputs, offering a scalable and affordable
alternative to traditional approaches. This synergistic design enables gradual
resolution enhancement from 10m to 1m while preserving fine-grained spatial
textures, and ensures high classification fidelity through effective
cross-scale feature fusion. This method has been successfully applied to the
Yangtze River Economic Belt (YREB), a region characterized by complex
urban-rural patterns and diverse topography. As a result, a comprehensive ISA
mapping product for 2021, referred to as ISA-1, was generated, covering an area
of over 2.2 million square kilometers. Quantitative comparisons against the 10m
ESA WorldCover and other benchmark products reveal that ISA-1 achieves an
F1-score of 85.71%, outperforming bilinear-interpolation-based segmentation by
9.5%, and surpassing other ISA datasets by 21.43%-61.07%. In densely urbanized
areas (e.g., Suzhou, Nanjing), ISA-1 reduces ISA overestimation through
improved discrimination of green spaces and water bodies. Conversely, in
mountainous regions (e.g., Ganzi, Zhaotong), it identifies significantly more
ISA due to its enhanced ability to detect fragmented anthropogenic features
such as rural roads and sparse settlements, demonstrating its robustness across
diverse landscapes. Moreover, we present biennial ISA maps from 2017 to 2023,
capturing spatiotemporal urbanization dynamics across representative cities.
The results highlight distinct regional growth patterns: rapid expansion in
upstream cities, moderate growth in midstream regions, and saturation in
downstream metropolitan areas.
★ Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields
Runfeng Li, Mikhail Okunev, Zixuan Guo, Anh Ha Duong, Christian Richardt, Matthew O'Toole, James Tompkin
We present a method to reconstruct dynamic scenes from monocular
continuous-wave time-of-flight (C-ToF) cameras using raw sensor samples that
achieves similar or better accuracy than neural volumetric approaches and is
100x faster. Quickly achieving high-fidelity dynamic 3D reconstruction from a
single viewpoint is a significant challenge in computer vision. In C-ToF
radiance field reconstruction, the property of interest, depth, is not directly
measured, which poses an additional challenge. This problem has a large and
underappreciated impact on the optimization when using a fast primitive-based
scene representation like 3D Gaussian splatting, which produces satisfactory
results with multi-view data but is brittle to optimize otherwise.
We incorporate two heuristics into the optimization to
improve the accuracy of scene geometry represented by Gaussians. Experimental
results show that our approach produces accurate reconstructions under
constrained C-ToF sensing conditions, including for fast motions like swinging
baseball bats. https://visual.cs.brown.edu/gftorf
★ Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization WACV 2024
Large-scale vision-language models demonstrate strong multimodal alignment
and generalization across diverse tasks. Among them, CLIP stands out as one of
the most successful approaches. In this work, we extend the application of CLIP
to sound source localization, proposing a self-supervised method that operates
without explicit text input. We introduce a framework that maps audio into
tokens compatible with CLIP's text encoder, producing audio-driven embeddings.
These embeddings are used to generate sounding region masks, from which visual
features are extracted and aligned with the audio embeddings through a
contrastive audio-visual correspondence objective. Our findings show that the
alignment knowledge of a pre-trained multimodal foundation model enables our
method to generate more complete and compact localization of sounding objects.
We further propose an LLM-guided extension that distills object-aware
audio-visual scene understanding into the model during training to enhance
alignment. Extensive experiments across five diverse tasks demonstrate that our
method, in all variants, outperforms state-of-the-art approaches and achieves
strong generalization in zero-shot settings.
comment: Journal Extension of WACV 2024 paper (arXiv:2311.04066). Code is
available at https://github.com/swimmiing/ACL-SSL
★ Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors
Motion capture systems that support full-body virtual representation are
of key significance for virtual reality. Compared to vision-based systems,
full-body pose estimation from sparse tracking signals is not limited by
environmental conditions or recording range. However, previous works either
require additional sensors worn on the pelvis and lower body
or rely on external visual sensors to obtain the global positions of key joints. To
improve the practicality of the technology for virtual reality applications, we
estimate full-body poses using only inertial data obtained from three Inertial
Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing
the complexity of the hardware system. In this work, we propose a method called
Progressive Inertial Poser (ProgIP) for human pose estimation, which combines
neural network estimation with a human dynamics model, considers the
hierarchical structure of the kinematic chain, and employs a multi-stage
progressive network estimation with increased depth to reconstruct full-body
motion in real time. The encoder combines Transformer Encoder and bidirectional
LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial
sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms
high-dimensional features and accurately projects them onto Skinned
Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative
experimental results on multiple public datasets show that our method
outperforms state-of-the-art methods with the same inputs, and is comparable to
recent works using six IMU sensors.
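The described TE-biLSTM encoder stacks a Transformer encoder and a bidirectional LSTM over the inertial sequence. A minimal PyTorch sketch of that combination, with layer sizes and stacking order assumed rather than taken from the paper:

```python
# Minimal sketch of a Transformer-encoder + bidirectional-LSTM (TE-biLSTM)
# temporal encoder for IMU sequences. Layer sizes and stacking order are
# assumptions; the paper's exact architecture may differ.
import torch
import torch.nn as nn

class TEbiLSTM(nn.Module):
    def __init__(self, in_dim=3 * 12, d_model=128, nhead=4, lstm_hidden=128):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=256, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.bilstm = nn.LSTM(d_model, lstm_hidden, batch_first=True, bidirectional=True)

    def forward(self, imu_seq):                 # (B, T, in_dim) IMU features
        h = self.transformer(self.proj(imu_seq))
        out, _ = self.bilstm(h)                 # (B, T, 2 * lstm_hidden)
        return out

x = torch.randn(2, 60, 36)                      # 2 sequences, 60 frames, 3 IMUs x 12 features
print(TEbiLSTM()(x).shape)                      # torch.Size([2, 60, 256])
```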
★ Aesthetics Without Semantics
While it is easy for human observers to judge an image as beautiful or ugly,
aesthetic decisions result from a combination of entangled perceptual and
cognitive (semantic) factors, making the understanding of aesthetic judgements
particularly challenging from a scientific point of view. Furthermore, our
research shows a prevailing bias in current databases, which include mostly
beautiful images, further complicating the study and prediction of aesthetic
responses. We address these limitations by creating a database of images with
minimal semantic content, and by devising, and then exploiting, a method to
generate images on the ugly side of aesthetic valuations. The resulting Minimum
Semantic Content (MSC) database consists of a large and balanced collection of
10,426 images, each evaluated by 100 observers. We next use established image
metrics to demonstrate how augmenting an image set biased towards beautiful
images with ugly images can modify, or even invert, an observed relationship
between image features and aesthetics valuation. Taken together, our study
reveals that works in empirical aesthetics attempting to link image content and
aesthetic judgements may magnify, underestimate, or simply miss interesting
effects due to a limitation of the range of aesthetic values they consider.
comment: Parts of this work were presented in abstract format at the Vision
Science of Art Conference (VSAC2016), the Iberian Conference on Perception
(CIP2022), and the European Conference on Visual Perception (ECVP2022). See
Perception 51, No. 1 (Suppl.), pp. 139, 2022.
★ Feature-Augmented Deep Networks for Multiscale Building Segmentation in High-Resolution UAV and Satellite Imagery
Accurate building segmentation from high-resolution RGB imagery remains
challenging due to spectral similarity with non-building features, shadows, and
irregular building geometries. In this study, we present a comprehensive deep
learning framework for multiscale building segmentation using RGB aerial and
satellite imagery with spatial resolutions ranging from 0.4m to 2.7m. We curate
a diverse, multi-sensor dataset and introduce feature-augmented inputs by
deriving secondary representations including Principal Component Analysis
(PCA), Visible Difference Vegetation Index (VDVI), Morphological Building Index
(MBI), and Sobel edge filters from RGB channels. These features guide a
Res-U-Net architecture in learning complex spatial patterns more effectively.
We also propose training policies incorporating layer freezing, cyclical
learning rates, and SuperConvergence to reduce training time and resource
usage. Evaluated on a held-out WorldView-3 image, our model achieves an overall
accuracy of 96.5%, an F1-score of 0.86, and an Intersection over Union (IoU) of
0.80, outperforming existing RGB-based benchmarks. This study demonstrates the
effectiveness of combining multi-resolution imagery, feature augmentation, and
optimized training strategies for robust building segmentation in remote
sensing applications.
comment: in preparation for journal submission, 25 pages, 11 figures
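Two of the derived input bands can be computed directly from the RGB channels. The sketch below adds a VDVI band (using the commonly cited visible-band formula, assumed here) and a Sobel edge-magnitude band; the PCA and MBI bands from the paper are omitted.

```python
# Sketch of deriving extra input bands from an RGB image: the visible-band
# difference vegetation index (VDVI, common definition assumed here) and a
# Sobel edge magnitude. The paper's PCA and MBI bands are omitted.
import numpy as np
from scipy import ndimage

def augment_rgb(rgb):
    """rgb: (H, W, 3) float array in [0, 1]; returns (H, W, 5) stacked bands."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    vdvi = (2 * g - r - b) / np.clip(2 * g + r + b, 1e-6, None)      # vegetation index
    gray = rgb.mean(axis=-1)
    sobel = np.hypot(ndimage.sobel(gray, axis=0), ndimage.sobel(gray, axis=1))
    return np.dstack([rgb, vdvi, sobel])

img = np.random.rand(64, 64, 3)
print(augment_rgb(img).shape)   # (64, 64, 5)
```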
★ Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects
The rapid adoption of Vision Language Models (VLMs), pre-trained on large
image-text and video-text datasets, calls for protecting and informing users
about when to trust these systems. This survey reviews studies on trust
dynamics in user-VLM interactions, through a multi-disciplinary taxonomy
encompassing different cognitive science capabilities, collaboration modes, and
agent behaviours. Literature insights and findings from a workshop with
prospective VLM users inform preliminary requirements for future VLM trust
studies.
★ Augmented Deep Contexts for Spatially Embedded Video Coding CVPR
Most Neural Video Codecs (NVCs) only employ temporal references to generate
temporal-only contexts and latent prior. These temporal-only NVCs fail to
handle large motions or emerging objects due to limited contexts and misaligned
latent prior. To relieve these limitations, we propose a Spatially Embedded Video
Codec (SEVC), in which the low-resolution video is compressed for spatial
references. Firstly, our SEVC leverages both spatial and temporal references to
generate augmented motion vectors and hybrid spatial-temporal contexts.
Secondly, to address the misalignment issue in latent prior and enrich the
prior information, we introduce a spatial-guided latent prior augmented by
multiple temporal latent representations. At last, we design a joint
spatial-temporal optimization to learn quality-adaptive bit allocation for
spatial references, further boosting rate-distortion performance. Experimental
results show that our SEVC effectively alleviates the limitations in handling
large motions or emerging objects, and also reduces 11.9% more bitrate than the
previous state-of-the-art NVC while providing an additional low-resolution
bitstream. Our code and model are available at https://github.com/EsakaK/SEVC.
comment: 15 pages, CVPR
★ PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
Event cameras excel in high temporal resolution and dynamic range but suffer
from dense noise in rainy conditions. Existing event deraining methods face
trade-offs between temporal precision, deraining effectiveness, and
computational efficiency. In this paper, we propose PRE-Mamba, a novel
point-based event camera deraining framework that fully exploits the
spatiotemporal characteristics of raw events and rain. Our framework introduces
a 4D event cloud representation that integrates dual temporal scales to
preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion
module (STDF) that enhances deraining capability by enabling shallow decoupling
and interaction of temporal and spatial information, and a Multi-Scale State
Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and
multi-spatial scales with linear computational complexity. Enhanced by
frequency-domain regularization, PRE-Mamba achieves superior performance (0.95
SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a
comprehensive dataset with labeled synthetic and real-world sequences.
Moreover, our method generalizes well across varying rain intensities,
viewpoints, and even snowy conditions.
★ Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection
Benjamin A. Cohen, Jonathan Fhima, Meishar Meisel, Baskin Meital, Luis Filipe Nakayama, Eran Berkowitz, Joachim A. Behar
Self-supervised learning (SSL) has enabled Vision Transformers (ViTs) to
learn robust representations from large-scale natural image datasets, enhancing
their generalization across domains. In retinal imaging, foundation models
pretrained on either natural or ophthalmic data have shown promise, but the
benefits of in-domain pretraining remain uncertain. To investigate this, we
benchmark six SSL-pretrained ViTs on seven digital fundus image (DFI) datasets
totaling 70,000 expert-annotated images for the task of moderate-to-late
age-related macular degeneration (AMD) identification. Our results show that
iBOT pretrained on natural images achieves the highest out-of-distribution
generalization, with AUROCs of 0.80-0.97, outperforming domain-specific models,
which achieved AUROCs of 0.78-0.96 and a baseline ViT-L with no pretraining,
which achieved AUROCs of 0.68-0.91. These findings highlight the value of
foundation models in improving AMD identification and challenge the assumption
that in-domain pretraining is necessary. Furthermore, we release BRAMD, an
open-access dataset (n=587) of DFIs with AMD labels from Brazil.
comment: 10 pages, 3 figures
★ PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Abdelrahman Eldesokey, Peter Wonka, Gabriel Brostow, Sara Vicente, Guillermo Garcia-Hernando
We introduce the novel task of Language-Guided Object Placement in Real 3D
Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual
prompt broadly describing where the 3D asset should be placed. The task here is
to find a valid placement for the 3D asset that respects the prompt. Compared
with other language-guided localization tasks in 3D scenes such as grounding,
this task has specific challenges: it is ambiguous because it has multiple
valid solutions, and it requires reasoning about 3D geometric relationships and
free space. We inaugurate this task by proposing a new benchmark and evaluation
protocol. We also introduce a new dataset for training 3D LLMs on this task, as
well as the first method to serve as a non-trivial baseline. We believe that
this challenging task and our new benchmark could become part of the suite of
benchmarks used to evaluate and compare generalist 3D LLM models.
comment: Tech report. Project page: https://nianticlabs.github.io/placeit3d/
★ MTL-UE: Learning to Learn Nothing for Multi-Task Learning ICML 2025
Most existing unlearnable strategies focus on preventing unauthorized users
from training single-task learning (STL) models with personal data.
Nevertheless, the paradigm has recently shifted towards multi-task data and
multi-task learning (MTL), targeting generalist and foundation models that can
handle multiple tasks simultaneously. Despite their growing importance, MTL
data and models have been largely neglected while pursuing unlearnable
strategies. This paper presents MTL-UE, the first unified framework for
generating unlearnable examples for multi-task data and MTL models. Instead of
optimizing perturbations for each sample, we design a generator-based structure
that introduces label priors and class-wise feature embeddings, leading to
much better attacking performance. In addition, MTL-UE incorporates intra-task
and inter-task embedding regularization to increase inter-class separation and
suppress intra-class variance which enhances the attack robustness greatly.
Furthermore, MTL-UE is versatile, with good support for dense prediction tasks
in MTL. It is also plug-and-play, allowing existing
surrogate-dependent unlearnable methods to be integrated with little adaptation. Extensive
experiments show that MTL-UE achieves superior attacking performance
consistently across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5
MTL task-weighting strategies.
comment: Accepted by ICML 2025
★ White Light Specular Reflection Data Augmentation for Deep Learning Polyp Detection
Colorectal cancer is one of the deadliest cancers today, but it can be
prevented through early detection of malignant polyps in the colon, primarily
via colonoscopies. While this method has saved many lives, human error remains
a significant challenge, as missing a polyp could have fatal consequences for
the patient. Deep learning (DL) polyp detectors offer a promising solution.
However, existing DL polyp detectors often mistake white light reflections from
the endoscope for polyps, which can lead to false positives. To address this
challenge, in this paper, we propose a novel data augmentation approach that
artificially adds more white light reflections to create harder training
scenarios. Specifically, we first generate a bank of artificial lights using
the training dataset. Then we find the regions of the training images that we
should not add these artificial lights on. Finally, we propose a sliding window
method to add the artificial light to suitable areas of the training
images, resulting in augmented images. By providing the model with more
opportunities to make mistakes, we hypothesize that it will also have more
chances to learn from those mistakes, ultimately improving its performance in
polyp detection. Experimental results demonstrate the effectiveness of our new
data augmentation method.
comment: 5 pages, 4 Figures, paper accepted by the ISBI (International
Symposium on Biomedical Imaging) 2025 Conference
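The augmentation idea, pasting artificial white-light reflections into regions where they are allowed, can be pictured with a toy version: blend a synthetic bright spot at a location outside a forbidden mask. The Gaussian spot and random placement below are stand-ins for the paper's light bank and sliding-window placement, not its actual pipeline.

```python
# Toy version of the augmentation idea: blend a synthetic white-light spot into
# an image at a location outside a forbidden mask (e.g., annotated polyp boxes).
# The Gaussian spot and random placement are illustrative assumptions.
import numpy as np

def add_specular_spot(img, forbidden_mask, radius=12, strength=0.9, rng=np.random.default_rng(0)):
    """img: (H, W, 3) float in [0, 1]; forbidden_mask: (H, W) bool (True = keep clear)."""
    H, W = forbidden_mask.shape
    ys, xs = np.where(~forbidden_mask)
    if len(ys) == 0:
        return img                                            # nowhere valid to place a spot
    i = rng.integers(len(ys))
    cy, cx = ys[i], xs[i]
    yy, xx = np.mgrid[0:H, 0:W]
    spot = strength * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * radius ** 2))
    return np.clip(img + spot[..., None], 0.0, 1.0)           # brighten toward white

img = np.random.rand(128, 128, 3) * 0.5
mask = np.zeros((128, 128), dtype=bool)
mask[40:90, 40:90] = True                                     # pretend this is a polyp region
print(add_specular_spot(img, mask).max())
```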
★ PADriver: Towards Personalized Autonomous Driving
Genghua Kou, Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Ziheng Zhang, Osamu Yoshie, Tiancai Wang, Ying Li, Xiangyu Zhang
In this paper, we propose PADriver, a novel closed-loop framework for
personalized autonomous driving (PAD). Built upon Multi-modal Large Language
Model (MLLM), PADriver takes streaming frames and personalized textual prompts
as inputs. It autoregressively performs scene understanding, danger level
estimation and action decision. The predicted danger level reflects the risk of
the potential action and provides an explicit reference for the final action,
which corresponds to the preset personalized prompt. Moreover, we construct a
closed-loop benchmark named PAD-Highway based on Highway-Env simulator to
comprehensively evaluate the decision performance under traffic rules. The
dataset contains 250 hours of video with high-quality annotations to facilitate
the development of PAD behavior analysis. Experimental results on the
constructed benchmark show that PADriver outperforms state-of-the-art
approaches on different evaluation metrics, and enables various driving modes.
★ Does CLIP perceive art the same way we do?
CLIP has emerged as a powerful multimodal model capable of connecting images
and text through joint embeddings, but to what extent does it "see" the same
way humans do - especially when interpreting artworks? In this paper, we
investigate CLIP's ability to extract high-level semantic and stylistic
information from paintings, including both human-created and AI-generated
imagery. We evaluate its perception across multiple dimensions: content, scene
understanding, artistic style, historical period, and the presence of visual
deformations or artifacts. By designing targeted probing tasks and comparing
CLIP's responses to human annotations and expert benchmarks, we explore its
alignment with human perceptual and contextual understanding. Our findings
reveal both strengths and limitations in CLIP's visual representations,
particularly in relation to aesthetic cues and artistic intent. We further
discuss the implications of these insights for using CLIP as a guidance
mechanism during generative processes, such as style transfer or prompt-based
image synthesis. Our work highlights the need for deeper interpretability in
multimodal systems, especially when applied to creative domains where nuance
and subjectivity play a central role.
★ Multi-Objective Reinforcement Learning for Adaptive Personalized Autonomous Driving
Human drivers exhibit individual preferences regarding driving style.
Adapting autonomous vehicles to these preferences is essential for user trust
and satisfaction. However, existing end-to-end driving approaches often rely on
predefined driving styles or require continuous user feedback for adaptation,
limiting their ability to support dynamic, context-dependent preferences. We
propose a novel approach using multi-objective reinforcement learning (MORL)
with preference-driven optimization for end-to-end autonomous driving that
enables runtime adaptation to driving style preferences. Preferences are
encoded as continuous weight vectors to modulate behavior along interpretable
style objectives, including efficiency, comfort, speed, and
aggressiveness, without requiring policy retraining. Our
single-policy agent integrates vision-based perception in complex mixed-traffic
scenarios and is evaluated in diverse urban environments using the CARLA
simulator. Experimental results demonstrate that the agent dynamically adapts
its driving behavior according to changing preferences while maintaining
performance in terms of collision avoidance and route completion.
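Preference-driven MORL commonly scalarizes a vector reward with a runtime weight vector over the style objectives (which is also fed to the policy as conditioning). The sketch below shows only that scalarization step, with placeholder objective values and weights rather than the paper's reward design or CARLA interface.

```python
# Sketch of preference-weighted scalarization for multi-objective RL: a vector
# reward over interpretable style objectives is combined with a runtime weight
# vector. Objective values and weights are placeholders.
import numpy as np

OBJECTIVES = ["efficiency", "comfort", "speed", "aggressiveness"]

def scalarize(reward_vec, preference):
    """reward_vec, preference: arrays aligned with OBJECTIVES."""
    w = np.asarray(preference, dtype=np.float64)
    w = w / w.sum()                                  # normalize the preference weights
    return float(np.dot(np.asarray(reward_vec, dtype=np.float64), w))

step_reward = np.array([0.8, 0.4, 0.6, -0.2])        # per-objective reward at one step
comfort_first = [0.1, 0.6, 0.2, 0.1]                 # runtime preference, no retraining needed
sporty = [0.2, 0.1, 0.4, 0.3]
print(scalarize(step_reward, comfort_first), scalarize(step_reward, sporty))
```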
★ Diffusion Model Quantization: A Review
Recent success of large text-to-image models has empirically underscored the
exceptional performance of diffusion models in generative tasks. To facilitate
their efficient deployment on resource-constrained edge devices, model
quantization has emerged as a pivotal technique for both compression and
acceleration. This survey offers a thorough review of the latest advancements
in diffusion model quantization, encapsulating and analyzing the current state
of the art in this rapidly advancing domain. First, we provide an overview of
the key challenges encountered in the quantization of diffusion models,
including those based on U-Net architectures and Diffusion Transformers (DiT).
We then present a comprehensive taxonomy of prevalent quantization techniques,
engaging in an in-depth discussion of their underlying principles.
Subsequently, we perform a meticulous analysis of representative diffusion
model quantization schemes from both qualitative and quantitative perspectives.
From a quantitative standpoint, we rigorously benchmark a variety of methods
using widely recognized datasets, delivering an extensive evaluation of the
most recent and impactful research in the field. From a qualitative standpoint,
we categorize and synthesize the effects of quantization errors, elucidating
these impacts through both visual analysis and trajectory examination. In
conclusion, we outline prospective avenues for future research, proposing novel
directions for the quantization of generative models in practical applications.
The list of related papers, corresponding codes, pre-trained models and
comparison results are publicly available at the survey project homepage
https://github.com/TaylorJocelyn/Diffusion-Model-Quantization.
comment: 40 pages, 8 figures
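As background for the taxonomy discussed above, the basic primitive most post-training schemes build on is a uniform affine quantizer. A minimal sketch with min/max calibration, the simplest possible choice and only an illustration:

```python
# Minimal uniform affine quantizer (quantize + dequantize) of the kind most
# post-training schemes for diffusion models build on. Min/max calibration is
# used here only for illustration.
import numpy as np

def quantize_dequantize(x, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = round(qmin - x_min / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)  # integer grid
    return (q - zero_point) * scale                            # dequantized approximation

w = np.random.randn(4, 4).astype(np.float32)
w_hat = quantize_dequantize(w, num_bits=8)
print(float(np.abs(w - w_hat).max()))                          # worst-case quantization error
```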
★ HQC-NBV: A Hybrid Quantum-Classical View Planning Approach
Efficient view planning is a fundamental challenge in computer vision and
robotic perception, critical for tasks ranging from search and rescue
operations to autonomous navigation. While classical approaches, including
sampling-based and deterministic methods, have shown promise in planning camera
viewpoints for scene exploration, they often struggle with computational
scalability and solution optimality in complex settings. This study introduces
HQC-NBV, a hybrid quantum-classical framework for view planning that leverages
quantum properties to efficiently explore the parameter space while maintaining
robustness and scalability. We propose a specific Hamiltonian formulation with
multi-component cost terms and a parameter-centric variational ansatz with
bidirectional alternating entanglement patterns that capture the hierarchical
dependencies between viewpoint parameters. Comprehensive experiments
demonstrate that quantum-specific components provide measurable performance
advantages. Compared to the classical methods, our approach achieves up to
49.2% higher exploration efficiency across diverse environments. Our analysis
of entanglement architecture and coherence-preserving terms provides insights
into the mechanisms of quantum advantage in robotic exploration tasks. This
work represents a significant advancement in integrating quantum computing into
robotic perception systems, offering a paradigm-shifting solution for various
robot vision tasks.
★ EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution
Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind
Super-Resolution (BSR) has become a predominant approach in the field. While
T2I models have traditionally relied on U-Net architectures, recent
advancements have demonstrated that Diffusion Transformers (DiT) achieve
significantly higher performance in this domain. In this work, we introduce
Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and
outperforms previous U-Net-based approaches. We introduce a novel block,
$\Psi$-DiT, which effectively guides the DiT to enhance image restoration. This
block employs a low-resolution latent as a separable flow injection control,
forming a triple-flow architecture that effectively leverages the prior
knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance
capabilities of T2I models and enhance their generalization in BSR, we
introduce a progressive Masked Image Modeling strategy, which also reduces
training costs. Additionally, we propose a subject-aware prompt generation
strategy that employs a robust multi-modal model in an in-context learning
framework. This strategy automatically identifies key image areas, provides
detailed descriptions, and optimizes the utilization of T2I diffusion priors.
Our experiments demonstrate that EAM achieves state-of-the-art results across
multiple datasets, outperforming existing methods in both quantitative metrics
and visual quality.
★ Improved Brain Tumor Detection in MRI: Fuzzy Sigmoid Convolution in Deep Learning IJCNN 2025
Early detection and accurate diagnosis are essential to improving patient
outcomes. The use of convolutional neural networks (CNNs) for tumor detection
has shown promise, but existing models often suffer from overparameterization,
which limits their performance gains. In this study, fuzzy sigmoid convolution
(FSC) is introduced along with two additional modules: top-of-the-funnel and
middle-of-the-funnel. The proposed methodology significantly reduces the number
of trainable parameters without compromising classification accuracy. A novel
convolutional operator is central to this approach, effectively dilating the
receptive field while preserving input data integrity. This enables efficient
feature map reduction and enhances the model's tumor detection capability. In
the FSC-based model, fuzzy sigmoid activation functions are incorporated within
convolutional layers to improve feature extraction and classification. The
inclusion of fuzzy logic into the architecture improves its adaptability and
robustness. Extensive experiments on three benchmark datasets demonstrate the
superior performance and efficiency of the proposed model. The FSC-based
architecture achieved classification accuracies of 99.17%, 99.75%, and 99.89%
on three different datasets. The model employs 100 times fewer parameters than
large-scale transfer learning architectures, highlighting its computational
efficiency and suitability for detecting brain tumors early. This research
offers lightweight, high-performance deep-learning models for medical imaging
applications.
comment: Accepted by IEEE IJCNN 2025
★ Concept-Based Unsupervised Domain Adaptation ICML 2025
Concept Bottleneck Models (CBMs) enhance interpretability by explaining
predictions through human-understandable concepts but typically assume that
training and test data share the same distribution. This assumption often fails
under domain shifts, leading to degraded performance and poor generalization.
To address these limitations and improve the robustness of CBMs, we propose the
Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed
to: (1) align concept representations across domains using adversarial
training, (2) introduce a relaxation threshold to allow minor domain-specific
differences in concept distributions, thereby preventing performance drop due
to over-constraints of these distributions, (3) infer concepts directly in the
target domain without requiring labeled concept data, enabling CBMs to adapt to
diverse domains, and (4) integrate concept learning into conventional domain
adaptation (DA) with theoretical guarantees, improving interpretability and
establishing new benchmarks for DA. Experiments demonstrate that our approach
significantly outperforms the state-of-the-art CBM and DA methods on real-world
datasets.
comment: Accepted by ICML 2025
★ Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models
Prompt learning is one of the most effective paradigms for adapting
pre-trained vision-language models (VLMs) to biomedical image
classification tasks in few-shot scenarios. However, most current prompt
learning methods use only text prompts and ignore the particular
structures (such as the complex anatomical structures and subtle pathological
features) in biomedical images. In this work, we propose Biomed-DPT, a
knowledge-enhanced dual modality prompt tuning technique. In designing the text
prompt, Biomed-DPT constructs a dual prompt including the template-driven
clinical prompts and the large language model (LLM)-driven domain-adapted
prompts, then extracts the clinical knowledge from the domain-adapted prompts
through the knowledge distillation technique. In designing the vision prompt,
Biomed-DPT introduces the zero vector as a soft prompt to leverage attention
re-weighting so that the focus on non-diagnostic regions and the recognition of
non-critical pathological features are avoided. Biomed-DPT achieves an average
classification accuracy of 66.14% across 11 biomedical image datasets covering
9 modalities and 10 organs, with performance reaching 78.06% in base classes
and 75.97% in novel classes, surpassing the Context Optimization (CoOp) method
by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at
https://github.com/Kanyooo/Biomed-DPT.
★ PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting
Elad Feldman, Jacob Shams, Dudi Biton, Alfred Chen, Shaoyuan Xie, Satoru Koda, Yisroel Mirsky, Asaf Shabtai, Yuval Elovici, Ben Nassi
The safety of autonomous cars has come under scrutiny in recent years,
especially after 16 documented incidents involving Teslas (with autopilot
engaged) crashing into parked emergency vehicles (police cars, ambulances, and
firetrucks). While previous studies have revealed that strong light sources
often introduce flare artifacts in the captured image, which degrade the image
quality, the impact of flare on object detection performance remains unclear.
In this research, we unveil PaniCar, a digital phenomenon that causes an object
detector's confidence score to fluctuate below detection thresholds when
exposed to activated emergency vehicle lighting. This vulnerability poses a
significant safety risk, and can cause autonomous vehicles to fail to detect
objects near emergency vehicles. In addition, this vulnerability could be
exploited by adversaries to compromise the security of advanced driving
assistance systems (ADASs). We assess seven commercial ADASs (Tesla Model 3,
"manufacturer C", HP, Pelsee, AZDOME, Imagebon, Rexing), four object detectors
(YOLO, SSD, RetinaNet, Faster R-CNN), and 14 patterns of emergency vehicle
lighting to understand the influence of various technical and environmental
factors. We also evaluate four SOTA flare removal methods and show that their
performance and latency are insufficient for real-time driving constraints. To
mitigate this risk, we propose Caracetamol, a robust framework designed to
enhance the resilience of object detectors against the effects of activated
emergency vehicle lighting. Our evaluation shows that on YOLOv3 and Faster
R-CNN, Caracetamol improves the models' average confidence of car detection by
0.20 and the lower confidence bound by 0.33, and reduces the fluctuation range
by 0.33. In addition, Caracetamol is capable of processing frames at 30-50 FPS,
enabling real-time ADAS car detection.
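The abstract quantifies Caracetamol by three per-sequence statistics: average detection confidence, the lower confidence bound, and the fluctuation range. A small helper that computes these from per-frame confidences of a tracked car could look like the sketch below; the exact definitions used in the paper may differ.

```python
import numpy as np

def confidence_statistics(confidences):
    """Summarize per-frame detection confidences for one video sequence.
    Returns mean confidence, lower bound (minimum), and fluctuation range."""
    c = np.asarray(confidences, dtype=float)
    return {
        "mean_confidence": float(c.mean()),
        "lower_bound": float(c.min()),
        "fluctuation_range": float(c.max() - c.min()),
    }

# Example: confidence dips while emergency lighting flashes
print(confidence_statistics([0.91, 0.42, 0.88, 0.39, 0.90]))
```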
★ Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models UAI 2025
Vision-Language Models (VLMs) learn joint representations by mapping images
and text into a shared latent space. However, recent research highlights that
deterministic embeddings from standard VLMs often struggle to capture the
uncertainties arising from the ambiguities in visual and textual descriptions
and the multiple possible correspondences between images and texts. Existing
approaches tackle this by learning probabilistic embeddings during VLM
training, which demands large datasets and does not leverage the powerful
representations already learned by large-scale VLMs like CLIP. In this paper,
we propose GroVE, a post-hoc approach to obtaining probabilistic embeddings
from frozen VLMs. GroVE builds on the Gaussian Process Latent Variable Model
(GPLVM) to learn a shared low-dimensional latent space where image and text
inputs are mapped to a unified representation, optimized through single-modal
embedding reconstruction and cross-modal alignment objectives. Once trained,
the Gaussian Process model generates uncertainty-aware probabilistic
embeddings. Evaluation shows that GroVE achieves state-of-the-art uncertainty
calibration across multiple downstream tasks, including cross-modal retrieval,
visual question answering, and active learning.
comment: UAI 2025, 22 pages
★ Research on Anomaly Detection Methods Based on Diffusion Models
Anomaly detection is a fundamental task in machine learning and data mining,
with significant applications in cybersecurity, industrial fault diagnosis, and
clinical disease monitoring. Traditional methods, such as statistical modeling
and machine learning-based approaches, often face challenges in handling
complex, high-dimensional data distributions. In this study, we explore the
potential of diffusion models for anomaly detection, proposing a novel
framework that leverages the strengths of diffusion probabilistic models (DPMs)
to effectively identify anomalies in both image and audio data. The proposed
method models the distribution of normal data through a diffusion process and
reconstructs input data via reverse diffusion, using a combination of
reconstruction errors and semantic discrepancies as anomaly indicators. To
enhance the framework's performance, we introduce multi-scale feature
extraction, attention mechanisms, and wavelet-domain representations, enabling
the model to capture fine-grained structures and global dependencies in the
data. Extensive experiments on benchmark datasets, including MVTec AD and
UrbanSound8K, demonstrate that our method outperforms state-of-the-art anomaly
detection techniques, achieving superior accuracy and robustness across diverse
data modalities. This research highlights the effectiveness of diffusion models
in anomaly detection and provides a robust and efficient solution for
real-world applications.
comment: 6 pages, 3 tables
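The anomaly indicator described above combines reconstruction error with a semantic discrepancy between the input and its reverse-diffusion reconstruction. A minimal scoring sketch is given below; the weighting, feature extractor, and distance choices are assumptions.

```python
import torch
import torch.nn.functional as F

def anomaly_score(x, x_rec, feat_extractor, alpha=0.5):
    """Combine pixel-level reconstruction error with a semantic discrepancy
    between features of the input and its reverse-diffusion reconstruction.
    x, x_rec: [B, C, H, W]; alpha weights the two terms (illustrative choice)."""
    rec_err = F.mse_loss(x_rec, x, reduction="none").flatten(1).mean(dim=1)
    f_x = feat_extractor(x).flatten(1)
    f_rec = feat_extractor(x_rec).flatten(1)
    sem_disc = 1.0 - F.cosine_similarity(f_x, f_rec, dim=1)
    return alpha * rec_err + (1.0 - alpha) * sem_disc
```

Higher scores flag inputs whose reconstructions drift from the learned normal distribution either in pixel space or in feature space.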
★ Automated vision-based assistance tools in bronchoscopy: stenosis severity estimation
Clara Tomasini, Javier Rodriguez-Puigvert, Dinora Polanco, Manuel Viñuales, Luis Riazuelo, Ana Cristina Murillo
Purpose: Subglottic stenosis refers to the narrowing of the subglottis, the
airway between the vocal cords and the trachea. Its severity is typically
evaluated by estimating the percentage of obstructed airway. This estimation
can be obtained from CT data or through visual inspection by experts exploring
the region. However, visual inspections are inherently subjective, leading to
less consistent and robust diagnoses. No public methods or datasets are
currently available for automated evaluation of this condition from
bronchoscopy video.
Methods: We propose a pipeline for automated subglottic stenosis severity
estimation during the bronchoscopy exploration, without requiring the physician
to traverse the stenosed region. Our approach exploits the physical effect of
illumination decline in endoscopy to segment and track the lumen and obtain a
3D model of the airway. This 3D model is obtained from a single frame and is
used to measure the airway narrowing.
Results: Our pipeline is the first to enable automated and robust subglottic
stenosis severity measurement using bronchoscopy images. The results show
consistency with ground-truth estimations from CT scans and expert estimations,
and reliable repeatability across multiple estimations on the same patient. Our
evaluation is performed on our new Subglottic Stenosis Dataset of real
bronchoscopy procedures data.
Conclusion: We demonstrate how to automate evaluation of subglottic stenosis
severity using only bronchoscopy. Our approach can assist with and shorten
diagnosis and monitoring procedures through automated and repeatable estimations
and reduced exploration time, and spares patients radiation exposure since no CT
is required. Additionally, we release the first public benchmark for subglottic
stenosis severity assessment.
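Stenosis severity is reported as the percentage of obstructed airway. Assuming the stenotic and healthy-reference lumen cross-sections are measured on the reconstructed 3D airway model, the severity computation reduces to a simple ratio (the exact reference definition is an assumption):

```python
def stenosis_severity(stenotic_area_mm2: float, healthy_area_mm2: float) -> float:
    """Percentage of airway obstruction from lumen cross-sectional areas
    measured on the reconstructed 3D airway model."""
    if healthy_area_mm2 <= 0:
        raise ValueError("healthy reference area must be positive")
    return 100.0 * (1.0 - stenotic_area_mm2 / healthy_area_mm2)

# e.g. a 28 mm^2 lumen against a 70 mm^2 healthy reference -> 60% obstruction
print(stenosis_severity(28.0, 70.0))
```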
★ An Active Contour Model for Silhouette Vectorization using Bézier Curves
In this paper, we propose an active contour model for silhouette
vectorization using cubic Bézier curves. Among the end points of the Bézier
curves, we distinguish between corner points and regular points, where the
orientation of the tangent vector is prescribed. By minimizing the distance of
the Bézier curves to the silhouette boundary, the active contour model optimizes
the location of the Bézier curves' end points, the orientation of the tangent
vectors at the regular points, and the estimation of the Bézier curve
parameters. This active contour model can use the silhouette vectorization
obtained by any method as an initial guess. The proposed method significantly
reduces the average distance between the silhouette boundary and its
vectorization obtained by the widely used graphics software Inkscape and Adobe
Illustrator, and by a curvature-based vectorization method that we introduce for
comparison. Our method also allows us to impose additional regularity on the
Bézier curves by reducing their lengths.
comment: 14 pages, 5 figures and 1 table
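As a simplified illustration of one sub-step of such a fit, the sketch below evaluates a cubic Bézier curve and solves for the two inner control points by linear least squares against sampled boundary points, keeping the end points fixed. The full active contour additionally optimizes end-point locations and tangent orientations, which is not shown here.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter values t (shape [n])."""
    t = t[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def fit_inner_control_points(p0, p3, boundary_pts):
    """Least-squares fit of the two inner control points to boundary samples,
    assuming a chord-length parameterization and fixed end points p0, p3."""
    d = np.linalg.norm(np.diff(boundary_pts, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(d)]) / d.sum()
    b1 = 3 * (1 - t) ** 2 * t              # basis weight of p1
    b2 = 3 * (1 - t) * t ** 2              # basis weight of p2
    rhs = boundary_pts - ((1 - t) ** 3)[:, None] * p0 - (t ** 3)[:, None] * p3
    A = np.stack([b1, b2], axis=1)         # [n, 2]
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return sol[0], sol[1]                  # fitted p1, p2
```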
★ MDAA-Diff: CT-Guided Multi-Dose Adaptive Attention Diffusion Model for PET Denoising
Acquiring high-quality Positron Emission Tomography (PET) images requires
administering high-dose radiotracers, which increases radiation exposure risks.
Generating standard-dose PET (SPET) from low-dose PET (LPET) has become a
potential solution. However, previous studies have primarily focused on single
low-dose PET denoising, neglecting two critical factors: discrepancies in dose
response caused by inter-patient variability, and complementary anatomical
constraints derived from CT images. In this work, we propose a novel CT-Guided
Multi-dose Adaptive Attention Denoising Diffusion Model (MDAA-Diff) for
multi-dose PET denoising. Our approach integrates anatomical guidance and
dose-level adaptation to achieve superior denoising performance under low-dose
conditions. Specifically, this approach incorporates a CT-Guided High-frequency
Wavelet Attention (HWA) module, which uses wavelet transforms to separate
high-frequency anatomical boundary features from CT images. These extracted
features are then incorporated into PET imaging through an adaptive weighted
fusion mechanism to enhance edge details. Additionally, we propose the
Dose-Adaptive Attention (DAA) module, a dose-conditioned enhancement mechanism
that dynamically integrates dose levels into channel-spatial attention weight
calculation. Extensive experiments on 18F-FDG and 68Ga-FAPI datasets
demonstrate that MDAA-Diff outperforms state-of-the-art approaches in
preserving diagnostic quality under reduced-dose conditions. Our code is
publicly available.
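The HWA module is described as separating high-frequency anatomical boundary features from CT via wavelet transforms. A minimal extraction sketch with PyWavelets is shown below; the wavelet choice, single-level decomposition, and nearest-neighbour upsampling are assumptions for illustration, and the adaptive weighted fusion into the PET branch is omitted.

```python
import numpy as np
import pywt

def ct_high_frequency_map(ct_slice: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """Extract a high-frequency boundary map from a CT slice with a single-level
    2D wavelet decomposition; the detail sub-bands carry anatomical edges."""
    _, (cH, cV, cD) = pywt.dwt2(ct_slice, wavelet)
    detail = np.sqrt(cH ** 2 + cV ** 2 + cD ** 2)
    # upsample back to the input resolution so it can guide the PET branch
    return np.kron(detail, np.ones((2, 2)))[: ct_slice.shape[0], : ct_slice.shape[1]]
```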
★ MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models
Multi-object editing aims to modify multiple objects or regions in complex
scenes while preserving structural coherence. This task faces significant
challenges in scenarios involving overlapping or interacting objects: (1)
Inaccurate localization of target objects due to attention misalignment,
leading to incomplete or misplaced edits; (2) Attribute-object mismatch, where
color or texture changes fail to align with intended regions due to
cross-attention leakage, creating semantic conflicts (e.g., color
bleeding into non-target areas). Existing methods struggle with these
challenges: approaches relying on global cross-attention mechanisms suffer from
attention dilution and spatial interference between objects, while mask-based
methods fail to bind attributes to geometrically accurate regions due to
feature entanglement in multi-object scenarios. To address these limitations,
we propose a training-free, inference-stage optimization approach that enables
precise localized image manipulation in complex multi-object scenes, named
MDE-Edit. MDE-Edit optimizes the noise latent feature in diffusion models via
two key losses: Object Alignment Loss (OAL) aligns multi-layer cross-attention
with segmentation masks for precise object positioning, and Color Consistency
Loss (CCL) amplifies target attribute attention within masks while suppressing
leakage to adjacent regions. This dual-loss design ensures localized and
coherent multi-object edits. Extensive experiments demonstrate that MDE-Edit
outperforms state-of-the-art methods in editing accuracy and visual quality,
offering a robust solution for complex multi-object image manipulation tasks.
comment: 9 pages, 7 figures
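The two losses are described only at a high level; the sketch below illustrates one plausible reading, with OAL penalizing cross-attention mass that falls outside the object mask and CCL encouraging attribute attention inside the mask while suppressing leakage. Shapes, normalization, and the margin are assumptions.

```python
import torch

def object_alignment_loss(attn_maps, mask):
    """Encourage cross-attention maps (list of [H, W] tensors from several
    layers) to concentrate inside the object's segmentation mask."""
    loss = 0.0
    for a in attn_maps:
        a = a / (a.sum() + 1e-8)
        loss = loss + (1.0 - (a * mask).sum())   # attention mass escaping the mask
    return loss / len(attn_maps)

def color_consistency_loss(attr_attn, mask, margin=0.1):
    """Amplify attribute attention inside the mask while suppressing
    leakage to the complement region."""
    inside = (attr_attn * mask).sum() / (mask.sum() + 1e-8)
    outside = (attr_attn * (1 - mask)).sum() / ((1 - mask).sum() + 1e-8)
    return torch.relu(outside - inside + margin)
```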
★ X-Driver: Explainable Autonomous Driving with Vision-Language Models
End-to-end autonomous driving has advanced significantly, offering benefits
such as system simplicity and stronger driving performance in both open-loop
and closed-loop settings than conventional pipelines. However, existing
frameworks still suffer from low success rates in closed-loop evaluations,
highlighting their limitations in real-world deployment. In this paper, we
introduce X-Driver, a unified multi-modal large language model (MLLM)
framework designed for closed-loop autonomous driving, leveraging
Chain-of-Thought (CoT) reasoning and autoregressive modeling to enhance
perception and decision-making. We validate X-Driver across multiple autonomous
driving tasks using public benchmarks in the CARLA simulation environment,
including Bench2Drive[6]. Our experimental results demonstrate superior
closed-loop performance, surpassing the current state-of-the-art (SOTA) while improving the
interpretability of driving decisions. These findings underscore the importance
of structured reasoning in end-to-end driving and establish X-Driver as a
strong baseline for future research in closed-loop autonomous driving.
★ DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions CVPR 2025
Deep learning (DL) has surpassed human performance on standard benchmarks,
driving its widespread adoption in computer vision tasks. One such task is
disparity estimation, estimating the disparity between matching pixels in
stereo image pairs, which is crucial for safety-critical applications like
medical surgeries and autonomous navigation. However, DL-based disparity
estimation methods are highly susceptible to distribution shifts and
adversarial attacks, raising concerns about their reliability and
generalization. Despite these concerns, a standardized benchmark for evaluating
the robustness of disparity estimation methods remains absent, hindering
progress in the field.
To address this gap, we introduce DispBench, a comprehensive benchmarking
tool for systematically assessing the reliability of disparity estimation
methods. DispBench evaluates robustness against synthetic image corruptions
such as adversarial attacks and out-of-distribution shifts caused by 2D Common
Corruptions across multiple datasets and diverse corruption scenarios. We
conduct the most extensive performance and robustness analysis of disparity
estimation methods to date, uncovering key correlations between accuracy,
reliability, and generalization. Open-source code for DispBench:
https://github.com/shashankskagnihotri/benchmarking_robustness/tree/disparity_estimation/final/disparity_estimation
comment: Accepted at CVPR 2025 Workshop on Synthetic Data for Computer Vision
★ Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow ICRA 2025
Event cameras have the potential to capture continuous motion information
over time and space, making them well-suited for optical flow estimation.
However, most existing learning-based methods for event-based optical flow
adopt frame-based techniques, ignoring the spatio-temporal characteristics of
events. Additionally, these methods assume linear motion between consecutive
events within the loss time window, which increases optical flow errors in
long-time sequences. In this work, we observe that rich spatio-temporal
information and accurate nonlinear motion between events are crucial for
event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel
unsupervised event-based optical flow network focusing on long-time sequences.
We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an
Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich
spatio-temporal information to learn spatio-temporal data associations.
Meanwhile, we propose a nonlinear motion compensation loss that utilizes the
accurate nonlinear motion between events to improve the unsupervised learning
of our network. Extensive experiments demonstrate the effectiveness and
superiority of our method. Remarkably, our method ranks first among
unsupervised learning methods on the MVSEC and DSEC-Flow datasets. Our project
page is available at https://wynelio.github.io/E-NMSTFlow.
comment: Accepted to ICRA 2025. Project Page:
https://wynelio.github.io/E-NMSTFlow
★ SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal
Visible watermark removal is challenging due to its inherent complexities and
the noise carried within images. Existing methods primarily rely on supervised
learning approaches that require paired datasets of watermarked and
watermark-free images, which are often impractical to obtain in real-world
scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and
Hybrid Network specifically designed for noisy image watermark removal. SSH-Net
synthesizes reference watermark-free images using the watermark distribution in
a self-supervised manner and adopts a dual-network design to address the task.
The upper network, focused on the simpler task of noise removal, employs a
lightweight CNN-based architecture, while the lower network, designed to handle
the more complex task of simultaneously removing watermarks and noise,
incorporates Transformer blocks to model long-range dependencies and capture
intricate image features. To enhance the model's effectiveness, a shared
CNN-based feature encoder is introduced before the dual networks to extract
common features that both networks can leverage. Our code will be available at
https://github.com/wenyang001/SSH-Net.
comment: Under Review in JVCI
★ PIDiff: Image Customization for Personalized Identities with Diffusion Models
Text-to-image generation for personalized identities aims at incorporating
the specific identity into images using a text prompt and an identity image.
Based on the powerful generative capabilities of DDPMs, many previous works
adopt additional prompts, such as text embeddings and CLIP image embeddings, to
represent the identity information, while they fail to disentangle the identity
information and background information. As a result, the generated images not
only lose key identity characteristics but also suffer from significantly
reduced diversity. To address this issue, previous works have combined the W+
space from StyleGAN with diffusion models, leveraging this space to provide a
more accurate and comprehensive representation of identity features through
multi-level feature extraction. However, the entanglement of identity and
background information in in-the-wild images during training prevents accurate
identity localization, resulting in severe semantic interference between
identity and background. In this paper, we propose a novel fine-tuning-based
diffusion model for personalized identity text-to-image generation, named
PIDiff, which leverages the W+ space and an identity-tailored fine-tuning
strategy to avoid semantic entanglement and achieve accurate feature
extraction and localization. Style editing can also be achieved by PIDiff
through preserving the characteristics of identity features in the W+ space,
which vary from coarse to fine. Through the combination of the proposed
cross-attention block and parameter optimization strategy, PIDiff preserves the
identity information and maintains the generation capability for in-the-wild
images of the pre-trained model during inference. Our experimental results
validate the effectiveness of our method in this task.
comment: 9 pages, 11 figures
★ The City that Never Settles: Simulation-based LiDAR Dataset for Long-Term Place Recognition Under Extreme Structural Changes
Large-scale construction and demolition significantly challenge long-term
place recognition (PR) by drastically reshaping urban and suburban
environments. Existing datasets predominantly reflect limited or indoor-focused
changes, failing to adequately represent extensive outdoor transformations. To
bridge this gap, we introduce the City that Never Settles (CNS) dataset, a
simulation-based dataset created using the CARLA simulator, capturing major
structural changes, such as building construction and demolition, across diverse
maps and sequences. Additionally, we propose TCR_sym, a symmetric version of
the original TCR metric, enabling consistent measurement of structural changes
irrespective of source-target ordering. Quantitative comparisons demonstrate
that CNS encompasses more extensive transformations than current real-world
benchmarks. Evaluations of state-of-the-art LiDAR-based PR methods on CNS
reveal substantial performance degradation, underscoring the need for robust
algorithms capable of handling significant environmental changes. Our dataset
is available at https://github.com/Hyunho111/CNS_dataset.
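The abstract states only that TCR_sym removes the dependence on source-target ordering; the obvious symmetrization, averaging the directional metric over both orderings, is sketched below as an assumption about the exact definition.

```python
def tcr_sym(tcr, source, target):
    """Symmetrized structural-change metric: average the directional TCR over
    both orderings so the score no longer depends on which scan is treated as
    source. `tcr` is any callable implementing the original directional metric."""
    return 0.5 * (tcr(source, target) + tcr(target, source))
```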
★ Visual Affordances: Enabling Robots to Understand Object Functionality
Human-robot interaction for assistive technologies relies on the prediction
of affordances, which are the potential actions a robot can perform on objects.
Predicting object affordances from visual perception is formulated differently
for tasks such as grasping detection, affordance classification, affordance
segmentation, and hand-object interaction synthesis. In this work, we highlight
the reproducibility issue in these redefinitions, making comparative benchmarks
unfair and unreliable. To address this problem, we propose a unified
formulation for visual affordance prediction, provide a comprehensive and
systematic review of previous works highlighting strengths and limitations of
methods and datasets, and analyse what challenges reproducibility. To favour
transparency, we introduce the Affordance Sheet, a document to detail the
proposed solution, the datasets, and the validation. As the physical properties
of an object influence the interaction with the robot, we present a generic
framework that links visual affordance prediction to the physical world. Using
the weight of an object as an example for this framework, we discuss how
estimating object mass can affect the affordance prediction. Our approach
bridges the gap between affordance perception and robot actuation, and accounts
for the complete information about objects of interest and how the robot
interacts with them to accomplish its task.
comment: 24 pages, 12 figures, 10 tables. Project website at
https://apicis.github.io/aff-survey/
★ RepSNet: A Nucleus Instance Segmentation model based on Boundary Regression and Structural Re-parameterization
Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus
instance segmentation is a key step in digital pathology analysis and
pathological diagnosis. However, the computational efficiency of the model and
the treatment of overlapping targets are the major challenges in the studies of
this problem. To this end, a neural network model RepSNet was designed based on
a nucleus boundary regression and a structural re-parameterization scheme for
segmenting and classifying the nuclei in H&E-stained histopathological images.
First, RepSNet estimates the boundary position information (BPI) of the parent
nucleus for each pixel. The BPI estimation incorporates the local information
of the pixel and the contextual information of the parent nucleus. Then, the
nucleus boundary is estimated by aggregating the BPIs from a series of pixels
using a proposed boundary voting mechanism (BVM), and the instance segmentation
results are computed from the estimated nucleus boundary using a connected
component analysis procedure. The BVM intrinsically achieves a kind of
synergistic belief enhancement among the BPIs from various pixels. Therefore,
unlike the methods available in the literature that obtain nucleus
boundaries through a direct pixel recognition scheme, RepSNet computes its
boundary decisions with guidance from macroscopic information through
an integration mechanism. In addition, RepSNet employs a re-parameterizable
encoder-decoder structure. The model can not only aggregate features from
receptive fields at various scales, which improves segmentation accuracy,
but also reduce the parameter count and computational burden in
the inference phase through the structural re-parameterization technique.
Extensive experiments demonstrate the superiority of RepSNet compared to
several typical benchmark models.
comment: 25 pages, 7 figures, 5 tables
★ FG-CLIP: Fine-Grained Visual and Textual Alignment ICML 2025
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks
such as image-text retrieval and zero-shot classification but struggles with
fine-grained understanding due to its focus on coarse-grained short captions.
To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances
fine-grained understanding through three key innovations. First, we leverage
large multimodal models to generate 1.6 billion long caption-image pairs for
capturing global-level semantic details. Second, a high-quality dataset is
constructed with 12 million images and 40 million region-specific bounding
boxes aligned with detailed captions to ensure precise, context-rich
representations. Third, 10 million hard fine-grained negative samples are
incorporated to improve the model's ability to distinguish subtle semantic
differences. Corresponding training methods are meticulously designed for these
data. Extensive experiments demonstrate that FG-CLIP outperforms the original
CLIP and other state-of-the-art methods across various downstream tasks,
including fine-grained understanding, open-vocabulary object detection,
image-text retrieval, and general multimodal benchmarks. These results
highlight FG-CLIP's effectiveness in capturing fine-grained image details and
improving overall model performance. The related data, code, and models are
available at https://github.com/360CVGroup/FG-CLIP.
comment: Accepted at ICML 2025
★ ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning
Based on the success of large-scale visual foundation models like CLIP in
various downstream tasks, this paper initially attempts to explore their impact
on Long-Tailed Semi-Supervised Learning (LTSSL) by employing the foundation
model with three strategies: Linear Probing (LP), Lightweight Fine-Tuning
(LFT), and Full Fine-Tuning (FFT). Our analysis presents the following
insights: i) Compared to LTSSL algorithms trained from scratch, FFT results in
a decline in model performance, whereas LP and LFT, although boosting overall
model performance, exhibit negligible benefits to tail classes. ii) LP produces
numerous false pseudo-labels due to underlearned training data, while
LFT can reduce the number of these false labels but becomes overconfident about
them owing to biased fitting of the training data. This exacerbates the
pseudo-labeled and classifier biases inherent in LTSSL, limiting performance
improvement in the tail classes. With these insights, we propose an Unbiased
Lightweight Fine-tuning strategy, ULFine, which mitigates the
overconfidence via confidence-aware adaptive fitting of textual prototypes and
counteracts the pseudo-labeled and classifier biases via complementary fusion
of dual logits. Extensive experiments demonstrate that ULFine markedly
decreases training costs by over ten times and substantially increases
prediction accuracies compared to state-of-the-art methods.
★ Direct Image Classification from Fourier Ptychographic Microscopy Measurements without Reconstruction ISCS 2025
Navya Sonal Agarwal, Jan Philipp Schneider, Kanchana Vaishnavi Gandikota, Syed Muhammad Kazim, John Meshreki, Ivo Ihrke, Michael Moeller
The computational imaging technique of Fourier Ptychographic Microscopy (FPM)
enables high-resolution imaging with a wide field of view and can serve as an
extremely valuable tool, e.g. in the classification of cells in medical
applications. However, reconstructing a high-resolution image from tens or even
hundreds of measurements is computationally expensive, particularly for a wide
field of view. Therefore, in this paper, we investigate the idea of classifying
the image content in the FPM measurements directly without performing a
reconstruction step first. We show that Convolutional Neural Networks (CNN) can
extract meaningful information from measurement sequences, significantly
outperforming the classification on a single band-limited image (by up to 12%)
while being significantly more efficient than a reconstruction of a
high-resolution image. Furthermore, we demonstrate that a learned multiplexing
of several raw measurements allows maintaining the classification accuracy
while reducing the amount of data (and consequently also the acquisition time)
significantly.
comment: ISCS 2025
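The learned multiplexing of raw measurements can be read as a learnable linear combination across the LED/measurement dimension feeding a standard CNN classifier. The sketch below illustrates that reading; channel counts, layer sizes, and class count are assumptions.

```python
import torch
import torch.nn as nn

class LearnedMultiplexClassifier(nn.Module):
    """Classify FPM measurement stacks directly: a 1x1 convolution learns to
    multiplex the raw LED measurements into a few composite images, which a
    small CNN then classifies (all layer sizes are illustrative)."""
    def __init__(self, num_measurements=49, num_multiplexed=4, num_classes=3):
        super().__init__()
        self.multiplex = nn.Conv2d(num_measurements, num_multiplexed,
                                   kernel_size=1, bias=False)
        self.classifier = nn.Sequential(
            nn.Conv2d(num_multiplexed, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):            # x: [B, num_measurements, H, W]
        return self.classifier(self.multiplex(x))
```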
★ UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model ICML'25
The introduction of the Segment Anything Model (SAM) has paved the way for
numerous semantic segmentation applications. For several tasks, quantifying the
uncertainty of SAM is of particular interest. However, the ambiguous nature of
the class-agnostic foundation model SAM challenges current uncertainty
quantification (UQ) approaches. This paper presents a theoretically motivated
uncertainty quantification model based on a Bayesian entropy formulation
jointly respecting aleatoric, epistemic, and the newly introduced task
uncertainty. We use this formulation to train USAM, a lightweight post-hoc UQ
method. Our model traces the root of uncertainty back to under-parameterised
models, insufficient prompts or image ambiguities. Our proposed deterministic
USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k,
DAVIS, and COCO datasets, offering a computationally cheap and easy-to-use UQ
alternative that can support user-prompting, enhance semi-supervised pipelines,
or balance the tradeoff between accuracy and cost efficiency.
comment: Accepted to ICML'25
★ xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition
Mani Kumar Tellamekala, Shashank Jaiswal, Thomas Smith, Timur Alamev, Gary McKeown, Anthony Brown, Michel Valstar
Recognising expressive behaviours in face videos is a long-standing challenge
in Affective Computing. Despite significant advancements in recent years, it
still remains a challenge to build a robust and reliable system for
naturalistic and in-the-wild facial expressive behaviour analysis in real time.
This paper addresses two key challenges in building such a system: (1) the
paucity of large-scale labelled facial affect video datasets with extensive
coverage of the 2D emotion space, and (2) the difficulty of extracting facial
video features that are discriminative, interpretable, robust, and
computationally efficient. Toward addressing these challenges, we introduce
xTrace, a robust tool for facial expressive behaviour analysis and predicting
continuous values of dimensional emotions, namely valence and arousal, from
in-the-wild face videos.
To address challenge (1), our affect recognition model is trained on the
largest facial affect video data set, containing ~450k videos that cover most
emotion zones in the dimensional emotion space, making xTrace highly versatile
in analysing a wide spectrum of naturalistic expressive behaviours. To address
challenge (2), xTrace uses facial affect descriptors that are not only
explainable, but can also achieve a high degree of accuracy and robustness with
low computational complexity. The key components of xTrace are benchmarked
against three existing tools: MediaPipe, OpenFace, and Augsburg Affect Toolbox.
On an in-the-wild validation set composed of 50k videos, xTrace achieves 0.86
mean CCC and 0.13 mean absolute error values. We present a detailed error
analysis of affect predictions from xTrace, illustrating (a) its ability to
recognise emotions with high accuracy across most bins in the 2D emotion space,
(b) its robustness to non-frontal head pose angles, and (c) a strong
correlation between its uncertainty estimates and its accuracy.
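The validation metric reported above is the mean concordance correlation coefficient (CCC) between predicted and annotated valence/arousal traces; its standard definition can be computed as follows.

```python
import numpy as np

def concordance_cc(pred, target):
    """Concordance correlation coefficient between two 1D signals,
    e.g. predicted and annotated valence traces."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mp, mt = pred.mean(), target.mean()
    vp, vt = pred.var(), target.var()
    cov = ((pred - mp) * (target - mt)).mean()
    return 2 * cov / (vp + vt + (mp - mt) ** 2)
```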
★ ADNP-15: An Open-Source Histopathological Dataset for Neuritic Plaque Segmentation in Human Brain Whole Slide Images with Frequency Domain Image Enhancement for Stain Normalization
Chenxi Zhao, Jianqiang Li, Qing Zhao, Jing Bai, Susana Boluda, Benoit Delatour, Lev Stimmer, Daniel Racoceanu, Gabriel Jimenez, Guanghui Fu
Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by
amyloid-beta plaques and tau neurofibrillary tangles, which serve as key
histopathological features. The identification and segmentation of these
lesions are crucial for understanding AD progression but remain challenging due
to the lack of large-scale annotated datasets and the impact of staining
variations on automated image analysis. Deep learning has emerged as a powerful
tool for pathology image segmentation; however, model performance is
significantly influenced by variations in staining characteristics,
necessitating effective stain normalization and enhancement techniques. In this
study, we address these challenges by introducing an open-source dataset
(ADNP-15) of neuritic plaques (i.e., amyloid deposits combined with a crown of
dystrophic tau-positive neurites) in human brain whole slide images. We
establish a comprehensive benchmark by evaluating five widely adopted deep
learning models across four stain normalization techniques, providing deeper
insights into their influence on neuritic plaque segmentation. Additionally, we
propose a novel image enhancement method that improves segmentation accuracy,
particularly in complex tissue structures, by enhancing structural details and
mitigating staining inconsistencies. Our experimental results demonstrate that
this enhancement strategy significantly boosts model generalization and
segmentation accuracy. All datasets and code are open-source, ensuring
transparency and reproducibility while enabling further advancements in the
field.
★ Image-Text Relation Prediction for Multilingual Tweets
Various social networks have allowed media uploads for over a decade
now. Still, it has not always been clear how uploaded media relate to the posted
text, or whether there is any relation at all. In this work, we explore how
multilingual vision-language models tackle the task of image-text relation
prediction in different languages, and construct a dedicated balanced benchmark
dataset from Twitter posts in Latvian along with their manual translations into English. We
compare our results to previous work and show that the more recently released
vision-language model checkpoints are becoming increasingly capable at this
task, but there is still much room for further improvement.
★ Split Matching for Inductive Zero-shot Semantic Segmentation
Jialei Chen, Xu Zheng, Dongyue Li, Chong Yi, Seigo Ito, Danda Pani Paudel, Luc Van Gool, Hiroshi Murase, Daisuke Deguchi
Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not
annotated during training. While fine-tuning vision-language models has
achieved promising results, these models often overfit to seen categories due
to the lack of supervision for unseen classes. As an alternative to fully
supervised approaches, query-based segmentation has shown great potential in ZSS,
as it enables object localization without relying on explicit labels. However,
conventional Hungarian matching, a core component of query-based frameworks,
requires full supervision and often misclassifies unseen categories as background
in the ZSS setting. To address this issue, we propose Split Matching (SM), a
novel assignment strategy that decouples Hungarian matching into two
components: one for seen classes in annotated regions and another for latent
classes in unannotated regions (referred to as unseen candidates).
Specifically, we partition the queries into seen and candidate groups, enabling
each to be optimized independently according to its available supervision. To
discover unseen candidates, we cluster CLIP dense features to generate pseudo
masks and extract region-level embeddings using CLS tokens. Matching is then
conducted separately for the two groups based on both class-level similarity
and mask-level consistency. Additionally, we introduce a Multi-scale Feature
Enhancement (MFE) module that refines decoder features through residual
multi-scale aggregation, improving the model's ability to capture spatial
details across resolutions. SM is the first to introduce decoupled Hungarian
matching under the inductive ZSS setting, and achieves state-of-the-art
performance on two standard benchmarks.
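Decoupled matching can be realized by running the standard Hungarian assignment independently on the two query groups; the sketch below uses SciPy's solver, with the cost matrices (class-level similarity plus mask-level consistency) left as inputs since their exact construction is not specified above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def split_matching(cost_seen, cost_candidate):
    """Run Hungarian matching independently for the two query groups:
    seen queries vs. annotated targets, and candidate queries vs. pseudo
    targets discovered from clustered CLIP features. Each cost matrix is
    [num_queries_in_group, num_targets_in_group]."""
    seen_rows, seen_cols = linear_sum_assignment(cost_seen)
    cand_rows, cand_cols = linear_sum_assignment(cost_candidate)
    return list(zip(seen_rows, seen_cols)), list(zip(cand_rows, cand_cols))

# Example with random costs: 5 seen queries / 3 GT masks, 4 candidates / 2 pseudo masks
matches = split_matching(np.random.rand(5, 3), np.random.rand(4, 2))
```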
★ SOAP: Style-Omniscient Animatable Portraits
Creating animatable 3D avatars from a single image remains challenging due to
style limitations (realistic, cartoon, anime) and difficulties in handling
accessories or hairstyles. While 3D diffusion models advance single-view
reconstruction for general objects, outputs often lack animation controls or
suffer from artifacts because of the domain gap. We propose SOAP, a
style-omniscient framework to generate rigged, topology-consistent avatars from
any portrait. Our method leverages a multiview diffusion model trained on 24K
3D heads with multiple styles and an adaptive optimization pipeline to deform
the FLAME mesh while maintaining topology and rigging via differentiable
rendering. The resulting textured avatars support FACS-based animation,
integrate with eyeballs and teeth, and preserve details like braided hair or
accessories. Extensive experiments demonstrate the superiority of our method
over state-of-the-art techniques for both single-view head modeling and
diffusion-based image-to-3D generation. Our code and data are publicly
available for research purposes at https://github.com/TingtingLiao/soap.
★ Adaptive Contextual Embedding for Robust Far-View Borehole Detection
In controlled blasting operations, accurately detecting densely distributed
tiny boreholes from far-view imagery is critical for operational safety and
efficiency. However, existing detection methods often struggle due to small
object scales, highly dense arrangements, and limited distinctive visual
features of boreholes. To address these challenges, we propose an adaptive
detection approach that builds upon existing architectures (e.g., YOLO) by
explicitly leveraging consistent embedding representations derived through
exponential moving average (EMA)-based statistical updates.
Our method introduces three synergistic components: (1) adaptive augmentation
utilizing dynamically updated image statistics to robustly handle illumination
and texture variations; (2) embedding stabilization to ensure consistent and
reliable feature extraction; and (3) contextual refinement leveraging spatial
context for improved detection accuracy. The pervasive use of EMA in our method
is particularly advantageous given the limited visual complexity and small
scale of boreholes, allowing stable and robust representation learning even
under challenging visual conditions. Experiments on a challenging proprietary
quarry-site dataset demonstrate substantial improvements over baseline
YOLO-based architectures, highlighting our method's effectiveness in realistic
and complex industrial scenarios.
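The EMA-based statistical updates underpinning the adaptive augmentation can be illustrated with a small running-statistics tracker; the decay value and the choice of mean/std statistics are assumptions.

```python
import numpy as np

class EmaStats:
    """Exponential-moving-average image statistics used to adapt augmentation
    (e.g. brightness/contrast normalization) to slowly drifting site conditions."""
    def __init__(self, decay=0.99):
        self.decay, self.mean, self.std = decay, None, None

    def update(self, image: np.ndarray):
        m, s = float(image.mean()), float(image.std())
        if self.mean is None:
            self.mean, self.std = m, s
        else:
            self.mean = self.decay * self.mean + (1 - self.decay) * m
            self.std = self.decay * self.std + (1 - self.decay) * s
        return self.mean, self.std
```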
★ Driving with Context: Online Map Matching for Complex Roads Using Lane Markings and Scenario Recognition
Accurate online map matching is fundamental to vehicle navigation and the
activation of intelligent driving functions. Current online map matching
methods are prone to errors in complex road networks, especially in multilevel
road areas. To address this challenge, we propose an online Standard Definition
(SD) map matching method that constructs a Hidden Markov Model (HMM) with
multiple probability factors. Our proposed method can achieve accurate map
matching even in complex road networks by carefully leveraging lane markings
and scenario recognition in the design of the probability factors. First,
the lane markings are generated by a multi-lane tracking method and associated
with the SD map using HMM to build an enriched SD map. In areas covered by the
enriched SD map, the vehicle can re-localize itself by performing Iterative
Closest Point (ICP) registration for the lane markings. Then, the probability
factor accounting for the lane marking detection can be obtained using the
association probability between adjacent lanes and roads. Second, the driving
scenario recognition model is applied to generate the emission probability
factor of scenario recognition, which improves the performance of map matching
on elevated roads and ordinary urban roads underneath them. We validate our
method through extensive road tests in Europe and China, and the experimental
results show that our proposed method effectively improves the online map
matching accuracy compared to other existing methods, especially in
multilevel road areas. Specifically, the experiments show that our proposed
method achieves $F_1$ scores of 98.04% and 94.60% on the Zenseact Open Dataset
and test data of multilevel road areas in Shanghai, respectively, significantly
outperforming benchmark methods. The implementation is available at
https://github.com/TRV-Lab/LMSR-OMM.
comment: 9 pages and 12 figures. Under review at IEEE RA-L
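The HMM formulation combines several probability factors (e.g., GNSS distance, lane-marking association, scenario recognition) into per-road emission probabilities and then decodes the road sequence. The sketch below shows a log-space factor combination and a standard Viterbi decoder; the transition model and exact factor set are assumptions, not the paper's implementation.

```python
import numpy as np

def combined_log_emission(log_p_gnss, log_p_lane, log_p_scenario):
    """Multiply the per-road probability factors (sum in log space).
    Each input: [T, S] log-probabilities over S candidate road segments."""
    return log_p_gnss + log_p_lane + log_p_scenario

def viterbi(log_emission, log_transition, log_prior):
    """Decode the most likely road-segment sequence.
    log_emission: [T, S], log_transition: [S, S], log_prior: [S]."""
    T, S = log_emission.shape
    dp = log_prior + log_emission[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + log_transition        # [S, S]: prev -> current
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + log_emission[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```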
★ Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort
Hendrik Möller, Hanna Schön, Alina Dima, Benjamin Keinert-Weth, Robert Graf, Matan Atad, Johannes Paetzold, Friederike Jungmann, Rickmer Braren, Florian Kofler, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke
Thoracolumbar stump ribs are one of the essential indicators of thoracolumbar
transitional vertebrae or enumeration anomalies. While some studies manually
assess these anomalies and describe the ribs qualitatively, this study aims to
automate thoracolumbar stump rib detection and analyze their morphology
quantitatively. To this end, we train a high-resolution deep-learning model for
rib segmentation and show significant improvements compared to existing models
(Dice score 0.997 vs. 0.779, p-value < 0.01). In addition, we use an iterative
algorithm and piece-wise linear interpolation to assess the length of the ribs,
showing a success rate of 98.2%. When analyzing morphological features, we show
that stump ribs articulate more posteriorly at the vertebrae (-19.2 ± 3.8 vs.
-13.8 ± 2.5, p-value < 0.01), are thinner (260.6 ± 103.4 vs. 563.6 ± 127.1,
p-value < 0.01), and are oriented more downwards and sideways within the first
centimeters in contrast to full-length ribs. We show that with partially
visible ribs, these features can achieve an F1-score of 0.84 in differentiating
stump ribs from regular ones. We publish the model weights and masks for public
use.
★ StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps
We retarget video stitching to an emerging issue, named warping shake, which
describes the temporal content shakes induced by sequentially unsmooth warps when
extending image stitching to video stitching. Even if the input videos are
stable, the stitched video can still exhibit undesired warping shakes that
degrade the visual experience. To address this issue, we propose StabStitch++, a
novel video stitching framework to realize spatial stitching and temporal
stabilization with unsupervised learning simultaneously. First, different from
existing learning-based image stitching solutions that typically warp one image
to align with another, we suppose a virtual midplane between original image
planes and project them onto it. Concretely, we design a differentiable
bidirectional decomposition module to disentangle the homography transformation
and incorporate it into our spatial warp, evenly spreading alignment burdens
and projective distortions across two views. Then, inspired by camera paths in
video stabilization, we derive the mathematical expression of stitching
trajectories in video stitching by elaborately integrating spatial and temporal
warps. Finally, a warp smoothing model is presented to produce stable stitched
videos with a hybrid loss to simultaneously encourage content alignment,
trajectory smoothness, and online collaboration. Compared with StabStitch that
sacrifices alignment for stabilization, StabStitch++ makes no compromise and
optimizes both of them simultaneously, especially in the online mode. To
establish an evaluation benchmark and train the learning framework, we build a
video stitching dataset with a rich diversity in camera motions and scenes.
Experiments exhibit that StabStitch++ surpasses current solutions in stitching
performance, robustness, and efficiency, offering compelling advancements in
this field by building a real-time online video stitching system.
comment: TPAMI2025; https://github.com/nie-lang/StabStitch2. arXiv admin note:
text overlap with arXiv:2403.06378
★ Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication ICMR 2025
Full-body gestures play a pivotal role in natural interactions and are
crucial for achieving effective communication. Nevertheless, most existing
studies primarily focus on the gesture generation of speakers, overlooking the
vital role of listeners in the interaction process and failing to fully explore
the dynamic interaction between them. This paper innovatively proposes an
Inter-Diffusion Generation Model of Speakers and Listeners for Effective
Communication. For the first time, we integrate the full-body gestures of
listeners into the generation framework. By devising a novel inter-diffusion
mechanism, this model can accurately capture the complex interaction patterns
between speakers and listeners during communication. In the model construction
process, based on the advanced diffusion model architecture, we innovatively
introduce interaction conditions and the GAN model to increase the denoising
step size. As a result, when generating gesture sequences, the model can not
only dynamically generate based on the speaker's speech information but also
respond in real time to the listener's feedback, enabling synergistic
interaction between the two. Extensive experimental results demonstrate that
compared with the current state-of-the-art gesture generation methods, the
model we proposed has achieved remarkable improvements in the naturalness,
coherence, and speech-gesture synchronization of the generated gestures. In the
subjective evaluation experiments, users highly praised the generated
interaction scenarios, finding them closer to real-life human
communication situations. Objective metric evaluations also show that our model
outperforms the baseline methods in multiple key indicators, providing more
powerful support for effective communication.
comment: accepted by ICMR 2025
★ Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization IJCAI-25
Attribute bias in federated learning (FL) typically leads local models to
optimize inconsistently due to the learning of non-causal associations,
resulting in degraded performance. Existing methods either use data augmentation
to increase sample diversity or knowledge distillation to learn
invariant representations to address this problem. However, they lack a
comprehensive analysis of the inference paths, and the interference from
confounding factors limits their performance. To address these limitations, we
propose the Federated Deconfounding and
Debiasing Learning (FedDDL) method. It constructs a
structured causal graph to analyze the model inference process, and performs
backdoor adjustment to eliminate confounding paths. Specifically, we design an
intra-client deconfounding learning module for computer vision tasks to
decouple background and objects, generating counterfactual samples that
establish a connection between the background and any label, which stops the
model from using the background to infer the label. Moreover, we design an
inter-client debiasing learning module to construct causal prototypes to reduce
the proportion of the background in prototype components. Notably, it bridges
the gap between heterogeneous representations via causal prototypical
regularization. Extensive experiments on two benchmark datasets demonstrate
that FedDDL significantly enhances the model's capability to focus on main
objects in unseen data, leading to 4.5% higher Top-1 accuracy on average over
9 state-of-the-art existing methods.
comment: IJCAI-25 Accepted
★ ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment
Bilingual text-to-motion generation, which synthesizes 3D human motions from
bilingual text inputs, holds immense potential for cross-linguistic
applications in gaming, film, and robotics. However, this task faces critical
challenges: the absence of bilingual motion-language datasets and the
misalignment between text and motion distributions in diffusion models, leading
to semantically inconsistent or low-quality motions. To address these
challenges, we propose BiHumanML3D, a novel bilingual human motion dataset,
which establishes a crucial benchmark for bilingual text-to-motion generation
models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD),
which leverages cross-lingual aligned representations to capture semantics,
thereby achieving a unified bilingual model. Building upon this, we propose
Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware
reward model to assess alignment quality during sampling and a reward-guided
strategy that directs the diffusion process toward an optimally aligned
distribution. This reward model integrates step-aware tokens and combines a
text-aligned module for semantic consistency and a motion-aligned module for
realism, refining noisy motions at each timestep to balance probability density
and alignment. Experiments demonstrate that our approach significantly improves
text-motion alignment and motion quality compared to existing state-of-the-art
methods. Project page: https://wengwanjiang.github.io/ReAlign-page/.
comment: 17 pages, 9 figures
★ AI and Vision based Autonomous Navigation of Nano-Drones in Partially-Known Environments
The miniaturisation of sensors and processors, the advancements in connected
edge intelligence, and the exponential interest in Artificial Intelligence are
boosting the affirmation of autonomous nano-size drones in the Internet of
Robotic Things ecosystem. However, achieving safe autonomous navigation and
high-level tasks such as exploration and surveillance with these tiny platforms
is extremely challenging due to their limited resources. This work focuses on
enabling the safe and autonomous flight of a pocket-size, 30-gram platform
called Crazyflie 2.1 in a partially known environment. We propose a novel
AI-aided, vision-based reactive planning method for obstacle avoidance under
the ambit of Integrated Sensing, Computing and Communication paradigm. We deal
with the constraints of the nano-drone by splitting the navigation task into
two parts: a deep learning-based object detector runs on the edge (external
hardware) while the planning algorithm is executed onboard. The results show
the ability to command the drone at $\sim8$ frames-per-second and a model
performance reaching a COCO mean-average-precision of $60.8$. Field experiments
demonstrate the feasibility of the solution with the drone flying at a top
speed of $1$ m/s while steering away from an obstacle placed in an unknown
position and reaching the target destination. The outcome highlights the
compatibility of the communication delay and the model performance with the
requirements of the real-time navigation task. We provide a feasible
alternative to a fully onboard implementation that can be extended to
autonomous exploration with nano-drones.
comment: in DCOSS-IoT 2025, Wi-DroIT 2025
★ General Transform: A Unified Framework for Adaptive Transform to Enhance Representations
Discrete transforms, such as the discrete Fourier transform, are widely used
in machine learning to improve model performance by extracting meaningful
features. However, with numerous transforms available, selecting an appropriate
one often depends on understanding the dataset's properties, making the
approach less effective when such knowledge is unavailable. In this work, we
propose General Transform (GT), an adaptive transform-based representation
designed for machine learning applications. Unlike conventional transforms, GT
learns data-driven mapping tailored to the dataset and task of interest. Here,
we demonstrate that models incorporating GT outperform conventional
transform-based approaches across computer vision and natural language
processing tasks, highlighting its effectiveness in diverse learning scenarios.
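One simple way to obtain such an adaptive transform, offered purely as an illustrative reading rather than the paper's construction, is a softmax-weighted mixture of fixed transforms whose weights are learned jointly with the downstream model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTransform(nn.Module):
    """Softmax-weighted mixture of fixed 1D transforms (identity, magnitude of
    the real FFT, cumulative sum); the mixture weights are learned end-to-end."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))

    def forward(self, x):                       # x: [batch, length]
        w = torch.softmax(self.logits, dim=0)
        fft_mag = torch.fft.rfft(x, dim=-1).abs()
        fft_mag = F.pad(fft_mag, (0, x.shape[-1] - fft_mag.shape[-1]))
        feats = [x, fft_mag, torch.cumsum(x, dim=-1)]
        return sum(wi * f for wi, f in zip(w, feats))

# Illustrative usage
out = AdaptiveTransform()(torch.randn(4, 16))
```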
★ DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding ICLR 2025
Enabling intelligent agents to comprehend and interact with 3D environments
through natural language is crucial for advancing robotics and human-computer
interaction. A fundamental task in this field is ego-centric 3D visual
grounding, where agents locate target objects in real-world 3D spaces based on
verbal descriptions. However, this task faces two significant challenges: (1)
loss of fine-grained visual semantics due to sparse fusion of point clouds with
ego-centric multi-view images, (2) limited textual semantic context due to
arbitrary language descriptions. We propose DenseGrounding, a novel approach
designed to address these issues by enhancing both visual and textual
semantics. For visual features, we introduce the Hierarchical Scene Semantic
Enhancer, which retains dense semantics by capturing fine-grained global scene
features and facilitating cross-modal alignment. For text descriptions, we
propose a Language Semantic Enhancer that leverages large language models to
provide rich context and diverse language descriptions with additional context
during model training. Extensive experiments show that DenseGrounding
significantly outperforms existing methods in overall accuracy, with
improvements of 5.81% and 7.56% when trained on the comprehensive full dataset
and smaller mini subset, respectively, further advancing the SOTA in egocentric
3D visual grounding. Our method also achieves 1st place and receives the
Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D
Visual Grounding Track, validating its effectiveness and robustness.
comment: Accepted by ICLR 2025
★ CAG-VLM: Fine-Tuning of a Large-Scale Model to Recognize Angiographic Images for Next-Generation Diagnostic Systems
Yuto Nakamura, Satoshi Kodera, Haruki Settai, Hiroki Shinohara, Masatsugu Tamura, Tomohiro Noguchi, Tatsuki Furusawa, Ryo Takizawa, Tempei Kabayama, Norihiko Takeda
Coronary angiography (CAG) is the gold-standard imaging modality for
evaluating coronary artery disease, but its interpretation and subsequent
treatment planning rely heavily on expert cardiologists. To enable AI-based
decision support, we introduce a two-stage, physician-curated pipeline and a
bilingual (Japanese/English) CAG image-report dataset. First, we sample 14,686
frames from 539 exams and annotate them for key-frame detection and left/right
laterality; a ConvNeXt-Base CNN trained on this data achieves 0.96 F1 on
laterality classification, even on low-contrast frames. Second, we apply the
CNN to 243 independent exams, extract 1,114 key frames, and pair each with its
pre-procedure report and expert-validated diagnostic and treatment summary,
yielding a parallel corpus. We then fine-tune three open-source VLMs
(PaliGemma2, Gemma3, and ConceptCLIP-enhanced Gemma3) via LoRA and evaluate
them using VLScore and cardiologist review. Although PaliGemma2 w/LoRA attains
the highest VLScore, Gemma3 w/LoRA achieves the top clinician rating (mean
7.20/10); we designate this best-performing model as CAG-VLM. These results
demonstrate that specialized, fine-tuned VLMs can effectively assist
cardiologists in generating clinical reports and treatment recommendations from
CAG images.
★ ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis
Synthesizing medical images remains challenging due to limited annotated
pathological data, modality domain gaps, and the complexity of representing
diffuse pathologies such as liver cirrhosis. Existing methods often struggle to
maintain anatomical fidelity while accurately modeling pathological features,
frequently relying on priors derived from natural images or inefficient
multi-step sampling. In this work, we introduce ViCTr (Vital Consistency
Transfer), a novel two-stage framework that combines a rectified flow
trajectory with a Tweedie-corrected diffusion process to achieve high-fidelity,
pathology-aware image synthesis. First, we pretrain ViCTr on the ATLAS-8k
dataset using Elastic Weight Consolidation (EWC) to preserve critical
anatomical structures. We then fine-tune the model adversarially with Low-Rank
Adaptation (LoRA) modules for precise control over pathology severity. By
reformulating Tweedie's formula within a linear trajectory framework, ViCTr
supports one-step sampling, reducing inference from 50 steps to just 4, without
sacrificing anatomical realism. We evaluate ViCTr on BTCV (CT), AMOS (MRI), and
CirrMRI600+ (cirrhosis) datasets. Results demonstrate state-of-the-art
performance, achieving a Medical Fréchet Inception Distance (MFID) of 17.01 for
cirrhosis synthesis, 28% lower than existing approaches, and improving nnUNet
segmentation by +3.8% mDSC when used for data augmentation. Radiologist reviews
indicate that ViCTr-generated liver cirrhosis MRIs are clinically
indistinguishable from real scans. To our knowledge, ViCTr is the first method
to provide fine-grained, pathology-aware MRI synthesis with graded severity
control, closing a critical gap in AI-driven medical imaging research.
★ An Efficient Method for Accurate Pose Estimation and Error Correction of Cuboidal Objects IROS 2022
The proposed system outlined in this paper is a solution to a use case that
requires the autonomous picking of cuboidal objects from an organized or
unorganized pile with high precision. This paper presents an efficient method
for precise pose estimation of cuboid-shaped objects, which aims to reduce
errors in target pose in a time-efficient manner. Typical pose estimation
methods like global point cloud registrations are prone to minor pose errors
for which local registration algorithms are generally used to improve pose
accuracy. However, due to the execution time overhead and uncertainty in the
error of the final achieved pose, an alternate, linear time approach is
proposed for pose error estimation and correction. This paper presents an
overview of the solution followed by a detailed description of individual
modules of the proposed algorithm.
comment: Accepted in IEEE/RSJ IROS 2022 Workshop on Mobile Manipulation and
Embodied Intelligence (MOMA)
★ ADD: Physics-Based Motion Imitation with Adversarial Differential Discriminators
Multi-objective optimization problems, which require the simultaneous
optimization of multiple terms, are prevalent across numerous applications.
Existing multi-objective optimization methods often rely on manually tuned
aggregation functions to formulate a joint optimization target. The performance
of such hand-tuned methods is heavily dependent on careful weight selection, a
time-consuming and laborious process. These limitations also arise in the
setting of reinforcement-learning-based motion tracking for physically
simulated characters, where intricately crafted reward functions are typically
used to achieve high-fidelity results. Such solutions not only require domain
expertise and significant manual adjustment, but also limit the applicability
of the resulting reward function across diverse skills. To bridge this gap, we
present a novel adversarial multi-objective optimization technique that is
broadly applicable to a range of multi-objective optimization problems,
including motion tracking. The proposed adversarial differential discriminator
receives a single positive sample, yet is still effective at guiding the
optimization process. We demonstrate that our technique can enable characters
to closely replicate a variety of acrobatic and agile behaviors, achieving
comparable quality to state-of-the-art motion-tracking methods, without relying
on manually tuned reward functions. Results are best visualized through
https://youtu.be/rz8BYCE9E2w.
comment: 19 pages, 15 figures
★ MoRe-3DGSMR: Motion-resolved reconstruction framework for free-breathing pulmonary MRI based on 3D Gaussian representation
This study presents an unsupervised, motion-resolved reconstruction framework
for high-resolution, free-breathing pulmonary magnetic resonance imaging (MRI),
utilizing a three-dimensional Gaussian representation (3DGS). The proposed
method leverages 3DGS to address the challenges of motion-resolved 3D isotropic
pulmonary MRI reconstruction by enabling data smoothing between voxels for
continuous spatial representation. Pulmonary MRI data acquisition is performed
using a golden-angle radial sampling trajectory, with respiratory motion
signals extracted from the center of k-space in each radial spoke. Based on the
estimated motion signal, the k-space data is sorted into multiple respiratory
phases. A 3DGS framework is then applied to reconstruct a reference image
volume from the first motion state. Subsequently, a patient-specific
convolutional neural network is trained to estimate the deformation vector
fields (DVFs), which are used to generate the remaining motion states through
spatial transformation of the reference volume. The proposed reconstruction
pipeline is evaluated on six datasets from six subjects and benchmarked
against three state-of-the-art reconstruction methods. The experimental
findings demonstrate that the proposed reconstruction framework effectively
reconstructs high-resolution, motion-resolved pulmonary MR images. Compared
with existing approaches, it achieves superior image quality, reflected by
higher signal-to-noise ratio and contrast-to-noise ratio. The proposed
unsupervised 3DGS-based reconstruction method enables accurate motion-resolved
pulmonary MRI with isotropic spatial resolution. Its superior performance in
image quality metrics over state-of-the-art methods highlights its potential as
a robust solution for clinical pulmonary MR imaging.
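As a minimal sketch of the respiratory sorting step described above, the snippet below bins radial spokes into phases from a one-dimensional motion signal; the equal-count binning strategy and variable names are assumptions rather than the paper's implementation.
    # Minimal sketch: sort golden-angle radial spokes into respiratory phases
    # using a 1-D motion signal (e.g., magnitude of the k-space center per spoke).
    # The equal-count binning strategy and variable names are assumptions.
    import numpy as np

    def sort_spokes_into_phases(motion_signal: np.ndarray, n_phases: int = 6):
        """Return a list of spoke-index arrays, one per respiratory phase."""
        order = np.argsort(motion_signal)        # sort spokes by respiratory amplitude
        bins = np.array_split(order, n_phases)   # roughly equal number of spokes per phase
        return [np.sort(b) for b in bins]        # keep chronological order within a phase

    # Example: 2000 spokes with a synthetic breathing signal
    signal = np.sin(np.linspace(0, 40 * np.pi, 2000)) + 0.05 * np.random.randn(2000)
    phases = sort_spokes_into_phases(signal, n_phases=6)
    print([len(p) for p in phases])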
★ T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models
Thanks to recent advancements in scalable deep architectures and large-scale
pretraining, text-to-video generation has achieved unprecedented capabilities
in producing high-fidelity, instruction-following content across a wide range
of styles, enabling applications in advertising, entertainment, and education.
However, these models' ability to render precise on-screen text, such as
captions or mathematical formulas, remains largely untested, posing significant
challenges for applications requiring exact textual accuracy. In this work, we
introduce T2VTextBench, the first human-evaluation benchmark dedicated to
evaluating on-screen text fidelity and temporal consistency in text-to-video
models. Our suite of prompts integrates complex text strings with dynamic scene
changes, testing each model's ability to maintain detailed instructions across
frames. We evaluate ten state-of-the-art systems, ranging from open-source
solutions to commercial offerings, and find that most struggle to generate
legible, consistent text. These results highlight a critical gap in current
video generators and provide a clear direction for future research aimed at
enhancing textual manipulation in video synthesis.
★ Building-Guided Pseudo-Label Learning for Cross-Modal Building Damage Mapping
Accurate building damage assessment using bi-temporal multi-modal remote
sensing images is essential for effective disaster response and recovery
planning. This study proposes a novel Building-Guided Pseudo-Label Learning
Framework to address the challenges of mapping building damage from
pre-disaster optical and post-disaster SAR images. First, we train a series of
building extraction models using pre-disaster optical images and building
labels. To enhance building segmentation, we employ multi-model fusion and
test-time augmentation strategies to generate pseudo-probabilities, followed by
a low-uncertainty pseudo-label training method for further refinement. Next, a
change detection model is trained on bi-temporal cross-modal images and damaged
building labels. To improve damage classification accuracy, we introduce a
building-guided low-uncertainty pseudo-label refinement strategy, which
leverages building priors from the previous step to guide pseudo-label
generation for damaged buildings, reducing uncertainty and enhancing
reliability. Experimental results on the 2025 IEEE GRSS Data Fusion Contest
dataset demonstrate the effectiveness of our approach, which achieved the
highest mIoU score (54.28%) and secured first place in the competition.
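A minimal sketch of low-uncertainty pseudo-label selection from a fused building-probability map is given below; the confidence thresholds and ignore-index convention are assumptions, not the contest pipeline's settings.
    # Illustrative low-uncertainty pseudo-label selection from fused probabilities.
    # Thresholds and the ignore-index convention are assumptions.
    import numpy as np

    def make_pseudo_labels(prob: np.ndarray, lo: float = 0.2, hi: float = 0.8, ignore: int = 255):
        """prob: (H, W) fused building probability from multi-model fusion / TTA."""
        label = np.full(prob.shape, ignore, dtype=np.uint8)  # uncertain pixels ignored in the loss
        label[prob >= hi] = 1                                # confident building
        label[prob <= lo] = 0                                # confident background
        return label

    prob_map = np.random.rand(256, 256)        # stand-in for a fused probability map
    pseudo = make_pseudo_labels(prob_map)
    print((pseudo == 255).mean())              # fraction of pixels left uncertain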
★ FF-PNet: A Pyramid Network Based on Feature and Field for Brain Image Registration
In recent years, deformable medical image registration techniques have made
significant progress. However, existing models still lack efficiency in
parallel extraction of coarse and fine-grained features. To address this, we
construct a new pyramid registration network based on feature and deformation
field (FF-PNet). For coarse-grained feature extraction, we design a Residual
Feature Fusion Module (RFFM); for fine-grained image deformation, we propose a
Residual Deformation Field Fusion Module (RDFFM). Through the parallel
operation of these two modules, the model can effectively handle complex image
deformations. It is worth emphasizing that the encoding stage of FF-PNet only
employs traditional convolutional neural networks without any attention
mechanisms or multilayer perceptrons, yet it still achieves remarkable
improvements in registration accuracy, fully demonstrating the superior feature
decoding capabilities of RFFM and RDFFM. We conducted extensive experiments on
the LPBA and OASIS datasets. The results show our network consistently
outperforms popular methods in metrics like the Dice Similarity Coefficient.
★ Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training
Palmprint recognition is a secure and privacy-friendly method of biometric
identification. One of the major challenges to improve palmprint recognition
accuracy is the scarcity of palmprint data. Recently, a popular line of
research revolves around the synthesis of virtual palmprints for large-scale
pre-training purposes. In this paper, we propose a novel synthesis method named
Canny2Palm that extracts palm textures with a Canny edge detector and uses them
to condition a Pix2Pix network for realistic palmprint generation. By
re-assembling palmprint textures from different identities, we are able to
create new identities by seeding the generator with new assemblies. Canny2Palm
not only synthesizes realistic data following the distribution of real
palmprints but also enables controllable diversity to generate large-scale new
identities. On open-set palmprint recognition benchmarks, models pre-trained
with Canny2Palm synthetic data outperform the state-of-the-art with up to 7.2%
higher identification accuracy. Moreover, the performance of models pre-trained
with Canny2Palm continues to improve as the number of synthetic IDs grows to
10,000, whereas models pre-trained with existing methods saturate earlier,
demonstrating the potential of our method for large-scale pre-training.
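The texture-extraction step can be illustrated with OpenCV's Canny detector, which produces the kind of edge map that could condition a Pix2Pix generator; the thresholds, preprocessing, and file paths below are assumptions, not the Canny2Palm settings.
    # Illustrative extraction of palm edges with OpenCV's Canny detector,
    # yielding the kind of edge map that could condition a Pix2Pix generator.
    # Thresholds, preprocessing, and paths are assumptions.
    import cv2

    palm = cv2.imread("palm.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
    palm = cv2.GaussianBlur(palm, (5, 5), 0)              # suppress sensor noise before edges
    edges = cv2.Canny(palm, threshold1=50, threshold2=150)
    cv2.imwrite("palm_edges.png", edges)                  # edge map used as generator condition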
★ Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Reasoning lies at the heart of intelligence, shaping the ability to make
decisions, draw conclusions, and generalize across domains. In artificial
intelligence, as systems increasingly operate in open, uncertain, and
multimodal environments, reasoning becomes essential for enabling robust and
adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a
promising paradigm, integrating modalities such as text, images, audio, and
video to support complex reasoning capabilities and aiming to achieve
comprehensive perception, precise understanding, and deep reasoning. As
research advances, multimodal reasoning has rapidly evolved from modular,
perception-driven pipelines to unified, language-centric frameworks that offer
more coherent cross-modal understanding. While instruction tuning and
reinforcement learning have improved model reasoning, significant challenges
remain in omni-modal generalization, reasoning depth, and agentic behavior. To
address these issues, we present a comprehensive and structured survey of
multimodal reasoning research, organized around a four-stage developmental
roadmap that reflects the field's shifting design philosophies and emerging
capabilities. First, we review early efforts based on task-specific modules,
where reasoning was implicitly embedded across stages of representation,
alignment, and fusion. Next, we examine recent approaches that unify reasoning
into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT)
and multimodal reinforcement learning enabling richer and more structured
reasoning chains. Finally, drawing on empirical insights from challenging
benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the
conceptual direction of native large multimodal reasoning models (N-LMRMs),
which aim to support scalable, agentic, and adaptive reasoning and planning in
complex, real-world environments.
comment: 75 Pages,10 figures; Project:
https://github.com/HITsz-TMG/Awesome-Large-Multimodal-Reasoning-Models
★ A Simple Detector with Frame Dynamics is a Strong Tracker CVPR
Chenxu Peng, Chenxu Wang, Minrui Zou, Danyang Li, Zhengpeng Yang, Yimian Dai, Ming-Ming Cheng, Xiang Li
Infrared object tracking plays a crucial role in Anti-Unmanned Aerial Vehicle
(Anti-UAV) applications. Existing trackers often depend on cropped template
regions and have limited motion modeling capabilities, which pose challenges
when dealing with tiny targets. To address this, we propose a simple yet
effective infrared tiny-object tracker that enhances tracking performance by
integrating global detection and motion-aware learning with temporal priors.
Our method is based on object detection and achieves significant improvements
through two key innovations. First, we introduce frame dynamics, leveraging
frame difference and optical flow to encode both prior target features and
motion characteristics at the input level, enabling the model to better
distinguish the target from background clutter. Second, we propose a trajectory
constraint filtering strategy in the post-processing stage, utilizing
spatio-temporal priors to suppress false positives and enhance tracking
robustness. Extensive experiments show that our method consistently outperforms
existing approaches across multiple metrics in challenging infrared UAV
tracking scenarios. Notably, we achieve state-of-the-art performance in the 4th
Anti-UAV Challenge, securing 1st place in Track 1 and 2nd place in Track 2.
comment: 2025 CVPR Anti-UAV Workshop
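A hedged sketch of the frame-dynamics idea (frame difference plus dense optical flow stacked as extra input channels) is shown below; the channel layout, flow parameters, and normalization are assumptions, not the authors' design.
    # Illustrative "frame dynamics" cues: a frame-difference map and dense optical
    # flow stacked as extra input channels for a detector. Channel layout,
    # flow parameters, and normalization are assumptions.
    import cv2
    import numpy as np

    prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # hypothetical infrared frames
    curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

    diff = cv2.absdiff(curr, prev).astype(np.float32) / 255.0
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    dynamics = np.dstack([curr.astype(np.float32) / 255.0, diff, flow])  # (H, W, 4) input
    print(dynamics.shape)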
★ GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing CVPR 2025
Scene text editing, a subfield of image editing, requires modifying texts in
images while preserving style consistency and visual coherence with the
surrounding environment. While diffusion-based methods have shown promise in
text generation, they still struggle to produce high-quality results. These
methods often generate distorted or unrecognizable characters, particularly
when dealing with complex characters like Chinese. In such systems, characters
are composed of intricate stroke patterns and spatial relationships that must
be precisely maintained. We present GlyphMastero, a specialized glyph encoder
designed to guide the latent diffusion model for generating texts with
stroke-level precision. Our key insight is that existing methods, despite using
pretrained OCR models for feature extraction, fail to capture the hierarchical
nature of text structures - from individual strokes to stroke-level
interactions to overall character-level structure. To address this, our glyph
encoder explicitly models and captures the cross-level interactions between
local-level individual characters and global-level text lines through our novel
glyph attention module. Meanwhile, our model implements a feature pyramid
network to fuse the multi-scale OCR backbone features at the global-level.
Through these cross-level and multi-scale fusions, we obtain more detailed
glyph-aware guidance, enabling precise control over the scene text generation
process. Our method achieves an 18.02% improvement in sentence accuracy over
the state-of-the-art multi-lingual scene text editing baseline, while
simultaneously reducing the text-region Fréchet inception distance by 53.28%.
comment: CVPR 2025
★ Advanced 3D Imaging Approach to TSV/TGV Metrology and Inspection Using Only Optical Microscopy
This paper introduces an innovative approach to silicon and glass via
inspection, which combines hybrid field microscopy with photometric stereo.
Conventional optical microscopy techniques are generally limited to superficial
inspections and struggle to effectively visualize the internal structures of
silicon and glass vias. By utilizing various lighting conditions for 3D
reconstruction, the proposed method surpasses these limitations. By integrating
photometric stereo into traditional optical microscopy, the proposed method
not only enhances the capability to detect micro-scale defects but also
provides detailed visualization of depth and edge abnormalities, which are
typically not visible with conventional optical microscopy inspection. The
experimental results demonstrated that the proposed method effectively captures
intricate surface details and internal structures. Quantitative comparisons
between the reconstructed models and actual measurements demonstrate the
capability of the proposed method to significantly improve the silicon and
glass via inspection process. As a result, the proposed method achieves
enhanced cost-effectiveness while maintaining high accuracy and repeatability,
suggesting substantial advancements in silicon and glass via inspection
techniques.
comment: 6 pages, 6 figures, Submitted to arXiv for preprint
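For background, classic Lambertian photometric stereo recovers per-pixel surface normals from images under known lighting directions by least squares; the sketch below shows that textbook formulation, not the paper's hybrid-field variant.
    # Textbook Lambertian photometric stereo: recover surface normals (scaled by
    # albedo) from images under known lighting directions via least squares.
    # This is the classic formulation the abstract builds on, not the paper's method.
    import numpy as np

    def photometric_stereo(images: np.ndarray, lights: np.ndarray) -> np.ndarray:
        """images: (K, H, W) intensities; lights: (K, 3) unit lighting directions."""
        K, H, W = images.shape
        I = images.reshape(K, -1)                        # (K, H*W) stacked measurements
        G, *_ = np.linalg.lstsq(lights, I, rcond=None)   # solve L @ G = I, G = albedo * normal
        G = G.T.reshape(H, W, 3)
        norm = np.linalg.norm(G, axis=-1, keepdims=True)
        return G / np.clip(norm, 1e-8, None)             # unit surface normals

    imgs = np.random.rand(4, 64, 64)                     # stand-in for four differently lit captures
    L = np.array([[0, 0, 1], [0.5, 0, 0.87], [0, 0.5, 0.87], [-0.5, 0, 0.87]], dtype=float)
    normals = photometric_stereo(imgs, L)
    print(normals.shape)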
★ SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models
This study introduces SpatialPrompting, a novel framework that harnesses the
emergent reasoning capabilities of off-the-shelf multimodal large language
models to achieve zero-shot spatial reasoning in three-dimensional (3D)
environments. Unlike existing methods that rely on expensive 3D-specific
fine-tuning with specialized 3D inputs such as point clouds or voxel-based
features, SpatialPrompting employs a keyframe-driven prompt generation
strategy. This framework uses metrics such as vision-language similarity,
Mahalanobis distance, field of view, and image sharpness to select a diverse
and informative set of keyframes from image sequences and then integrates them
with corresponding camera pose data to effectively abstract spatial
relationships and infer complex 3D structures. The proposed framework not only
establishes a new paradigm for flexible spatial reasoning that utilizes
intuitive visual and positional cues but also achieves state-of-the-art
zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across
several metrics. The proposed method effectively eliminates the need for
specialized 3D inputs and fine-tuning, offering a simpler and more scalable
alternative to conventional approaches.
comment: 18 pages, 11 figures
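One of the keyframe-selection cues mentioned above, image sharpness, is commonly scored as the variance of the Laplacian; the sketch below shows that generic scoring plus a top-k pick, where the selection rule is an assumption rather than SpatialPrompting's full criterion.
    # Generic sharpness scoring for keyframe selection (variance of the Laplacian),
    # one of the cues the abstract lists. The top-k selection rule is an assumption.
    import cv2
    import numpy as np

    def sharpness(gray: np.ndarray) -> float:
        return float(cv2.Laplacian(gray, cv2.CV_64F).var())

    def pick_sharpest(frames: list, k: int = 8) -> list:
        """frames: list of grayscale images; returns indices of the k sharpest frames."""
        scores = np.array([sharpness(f) for f in frames])
        return np.argsort(scores)[::-1][:k].tolist()

    frames = [np.random.randint(0, 255, (120, 160), dtype=np.uint8) for _ in range(50)]
    print(pick_sharpest(frames, k=8))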
★ Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization ECCV 2024
Weakly Supervised Object Localization (WSOL), which aims to localize objects
by only using image-level labels, has attracted much attention because of its
low annotation cost in real applications. Current studies focus on the Class
Activation Map (CAM) of CNN and the self-attention map of transformer to
identify the regions of objects. However, neither CAM nor self-attention maps
can learn pixel-level fine-grained information on the foreground objects, which
hinders the further advance of WSOL. To address this problem, we leverage the
zero-shot generalization and fine-grained segmentation capabilities of the
Segment Anything Model (SAM) to boost the activation of
integral object regions. Further, to alleviate the semantic ambiguity issue
accrued in single point prompt-based SAM, we propose an innovative mask prompt
to SAM (Pro2SAM) network with grid points for WSOL task. First, we devise a
Global Token Transformer (GTFormer) to generate a coarse-grained foreground map
as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and
novel global tokens to learn foreground semantics. Secondly, we deliver grid
points as dense prompts into SAM to maximize the probability of foreground
mask, which avoids the lack of objects caused by a single point/box prompt.
Finally, we propose a pixel-level similarity metric to perform the mask
matching from mask prompt to SAM, where the mask with the highest score is
viewed as the final localization map. Experiments show that the proposed
Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC,
with 84.03% and 66.85% Top-1 Loc, respectively.
comment: Accepted by ECCV 2024
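The grid-point prompting mechanism can be illustrated with the public segment-anything API as below; this shows only how dense point prompts are fed to SAM, not Pro2SAM's GTFormer mask prompt or mask-matching metric, and the checkpoint path and grid size are assumptions.
    # Dense grid-point prompting with the public segment-anything API. This shows
    # the prompting mechanism only; the GTFormer mask prompt and mask matching are
    # not reproduced. Checkpoint path and grid size are assumptions.
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # assumed checkpoint path
    predictor = SamPredictor(sam)

    image = np.zeros((512, 512, 3), dtype=np.uint8)      # stand-in RGB image
    predictor.set_image(image)

    ys, xs = np.meshgrid(np.linspace(32, 480, 8), np.linspace(32, 480, 8))
    points = np.stack([xs.ravel(), ys.ravel()], axis=1)  # 8x8 grid of (x, y) prompts
    labels = np.ones(len(points), dtype=int)             # all treated as foreground hints
    masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels,
                                         multimask_output=True)
    print(masks.shape, scores)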
★ OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging
Sifan Song, Siyeop Yoon, Pengfei Jin, Sekeun Kim, Matthew Tivnan, Yujin Oh, Runqi Meng, Ling Chen, Zhiliang Lyu, Dufan Wu, Ning Guo, Xiang Li, Quanzheng Li
Recent advances in representation learning often rely on holistic, black-box
embeddings that entangle multiple semantic components, limiting
interpretability and generalization. These issues are especially critical in
medical imaging. To address these limitations, we propose an Organ-Wise
Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR)
training paradigm. Unlike conventional approaches that produce holistic
features, OWT explicitly disentangles an image into separable token groups,
each corresponding to a distinct organ or semantic entity. Our design ensures
each token group encapsulates organ-specific information, boosting
interpretability, generalization, and efficiency while allowing fine-grained
control in downstream tasks. Experiments on CT and MRI datasets demonstrate the
effectiveness of OWT in not only achieving strong image reconstruction and
segmentation performance, but also enabling novel semantic-level generation and
retrieval applications that are out of reach for standard holistic embedding
methods. These findings underscore the potential of OWT as a foundational
framework for semantically disentangled representation learning, offering broad
scalability and applicability to real-world medical imaging scenarios and
beyond.
★ Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection
Remarkable advancements in generative AI technology have given rise to a
spectrum of novel deepfake categories with unprecedented leaps in their
realism, and deepfakes are increasingly becoming a nuisance to law enforcement
authorities and the general public. In particular, we observe alarming levels
of confusion, deception, and loss of faith regarding multimedia content within
society caused by face deepfakes, and existing deepfake detectors are
struggling to keep up with the pace of improvements in deepfake generation.
This is primarily due to their reliance on specific forgery artifacts, which
limits their ability to generalise and detect novel deepfake types. To combat
the spread of malicious face deepfakes, this paper proposes a new strategy that
leverages coarse-to-fine spatial information, semantic information, and their
interactions while ensuring feature distinctiveness and reducing the redundancy
of the modelled features. A novel feature orthogonality-based disentanglement
strategy is introduced to ensure branch-level and cross-branch feature
disentanglement, which allows us to integrate multiple feature vectors without
adding complexity to the feature space or compromising generalisation.
Comprehensive experiments on three public benchmarks: FaceForensics++,
Celeb-DF, and the Deepfake Detection Challenge (DFDC) show that these design
choices enable the proposed approach to outperform current state-of-the-art
methods by 5% on the Celeb-DF dataset and 7% on the DFDC dataset in a
cross-dataset evaluation setting.
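One common way to encourage cross-branch feature orthogonality is a squared cosine-similarity penalty between branch embeddings; the sketch below shows that generic idea only and is not the paper's exact disentanglement strategy.
    # Generic cross-branch orthogonality penalty: push feature vectors from two
    # branches toward zero cosine similarity. Illustrates the general idea only;
    # not the paper's branch-level / cross-branch formulation.
    import torch
    import torch.nn.functional as F

    def orthogonality_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        """feat_a, feat_b: (batch, dim) features from two branches."""
        a = F.normalize(feat_a, dim=1)
        b = F.normalize(feat_b, dim=1)
        cos = (a * b).sum(dim=1)          # per-sample cosine similarity
        return (cos ** 2).mean()          # zero when the branches are orthogonal

    spatial = torch.randn(16, 256, requires_grad=True)
    semantic = torch.randn(16, 256, requires_grad=True)
    loss = orthogonality_loss(spatial, semantic)
    loss.backward()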
★ Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning
Mixed Precision Quantization (MPQ) has become an essential technique for
optimizing neural networks by determining the optimal bitwidth per layer.
Existing MPQ methods, however, face a major hurdle: they require a
computationally expensive search for quantization policies on large-scale
datasets. To resolve this issue, we introduce a novel approach that first
searches for quantization policies on small datasets and then generalizes them
to large-scale datasets. This approach simplifies the process, eliminating the
need for large-scale quantization fine-tuning and only necessitating model
weight adjustment. Our method is characterized by three key techniques:
sharpness-aware minimization for enhanced quantization generalization, implicit
gradient direction alignment to handle gradient conflicts among different
optimization objectives, and an adaptive perturbation radius to accelerate
optimization. Both theoretical analysis and experimental results validate our
approach. Using the CIFAR10 dataset (just 0.5% the size of ImageNet training
data) for MPQ policy search, we achieved equivalent accuracy on ImageNet with a
significantly lower computational cost, while improving efficiency by up to
150% over the baselines.
★ Auto-regressive transformation for image alignment
Existing methods for image alignment struggle in cases involving
feature-sparse regions, extreme scale and field-of-view differences, and large
deformations, often resulting in suboptimal accuracy. Robustness to these
challenges improves through iterative refinement of the transformation field
while focusing on critical regions in multi-scale image representations. We
thus propose Auto-Regressive Transformation (ART), a novel method that
iteratively estimates the coarse-to-fine transformations within an
auto-regressive framework. Leveraging hierarchical multi-scale features, our
network refines the transformations using randomly sampled points at each
scale. By incorporating guidance from the cross-attention layer, the model
focuses on critical regions, ensuring accurate alignment even in challenging,
feature-limited conditions. Extensive experiments across diverse datasets
demonstrate that ART significantly outperforms state-of-the-art methods,
establishing it as a powerful new method for precise image alignment with broad
applicability.
★ Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model
The Segment Anything Model (SAM) is a popular vision foundation model;
however, its high computational and memory demands make deployment on
resource-constrained devices challenging. While Post-Training Quantization
(PTQ) is a practical approach for reducing computational overhead, existing PTQ
methods rely on fixed bit-width quantization, leading to suboptimal accuracy
and efficiency. To address this limitation, we propose Mix-QSAM, a
mixed-precision PTQ framework for SAM. First, we introduce a layer-wise
importance score, derived using Kullback-Leibler (KL) divergence, to quantify
each layer's contribution to the model's output. Second, we introduce
cross-layer synergy, a novel metric based on causal mutual information, to
capture dependencies between adjacent layers. This ensures that highly
interdependent layers maintain similar bit-widths, preventing abrupt precision
mismatches that degrade feature propagation and numerical stability. Using
these metrics, we formulate an Integer Quadratic Programming (IQP) problem to
determine optimal bit-width allocation under model size and bit-operation
constraints, assigning higher precision to critical layers while minimizing
bit-width in less influential layers. Experimental results demonstrate that
Mix-QSAM consistently outperforms existing PTQ methods on instance segmentation
and object detection tasks, achieving up to 20% higher average precision under
6-bit and 4-bit mixed-precision settings, while maintaining computational
efficiency.
comment: 12 pages, 2 Figures
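A rough sketch of a KL-divergence-based layer-importance score is given below: it measures the divergence between the output distribution of the full-precision model and the output obtained when a single layer is perturbed by simulated quantization. The perturbation scheme and KL direction are assumptions, not the Mix-QSAM procedure.
    # Rough sketch of a KL-based layer importance score: divergence between the
    # perturbed-layer and full-precision output distributions. The crude 8-bit
    # perturbation and KL direction are assumptions, not the Mix-QSAM procedure.
    import copy
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def layer_importance(model, layer_name: str, x: torch.Tensor) -> float:
        p_full = F.log_softmax(model(x), dim=-1)
        noisy = copy.deepcopy(model)
        for name, param in noisy.named_parameters():
            if name.startswith(layer_name):
                step = (param.abs().max() / 127.0).clamp(min=1e-8)   # crude 8-bit step size
                param.copy_(torch.round(param / step) * step)        # simulate quantization
        p_pert = F.softmax(noisy(x), dim=-1)
        return F.kl_div(p_full, p_pert, reduction="batchmean").item()

    model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
    x = torch.randn(8, 32)
    print(layer_importance(model, "0", x))   # importance of the first linear layer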
★ D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation
Learning bimanual manipulation is challenging due to its high dimensionality
and tight coordination required between two arms. Eye-in-hand imitation
learning, which uses wrist-mounted cameras, simplifies perception by focusing
on task-relevant views. However, collecting diverse demonstrations remains
costly, motivating the need for scalable data augmentation. While prior work
has explored visual augmentation in single-arm settings, extending these
approaches to bimanual manipulation requires generating viewpoint-consistent
observations across both arms and producing corresponding action labels that
are both valid and feasible. In this work, we propose Diffusion for COordinated
Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation
tailored to eye-in-hand bimanual imitation learning that trains a diffusion
model to synthesize novel, viewpoint-consistent wrist-camera images for both
arms while simultaneously generating joint-space action labels. It employs
constrained optimization to ensure that augmented states involving
gripper-to-object contacts adhere to constraints suitable for bimanual
coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our
results across 2250 simulation trials and 300 real-world trials demonstrate
that it outperforms baselines and ablations, showing its potential for scalable
data augmentation in eye-in-hand bimanual manipulation. Our project website is
at: https://dcodaaug.github.io/D-CODA/.
♻ ★ MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection
Accurately predicting 3D attributes is crucial for monocular 3D object
detection (Mono3D), with depth estimation posing the greatest challenge due to
the inherent ambiguity in mapping 2D images to 3D space. While existing methods
leverage multiple depth cues (e.g., estimating depth uncertainty, modeling
depth error) to improve depth accuracy, they overlook that accurate depth
prediction requires conditioning on other 3D attributes, as these attributes
are intrinsically inter-correlated through the 3D to 2D projection, which
ultimately limits overall accuracy and stability. Inspired by Chain-of-Thought
(CoT) in large language models (LLMs), this paper proposes MonoCoP, which
leverages a Chain-of-Prediction (CoP) to predict attributes sequentially and
conditionally via three key designs. First, it employs a lightweight
AttributeNet (AN) for each 3D attribute to learn attribute-specific features.
Next, MonoCoP constructs an explicit chain to propagate these learned features
from one attribute to the next. Finally, MonoCoP uses a residual connection to
aggregate features for each attribute along the chain, ensuring that later
attribute predictions are conditioned on all previously processed attributes
without forgetting the features of earlier ones. Experimental results show that
our MonoCoP achieves state-of-the-art (SoTA) performance on the KITTI
leaderboard without requiring additional data and further surpasses existing
methods on the Waymo and nuScenes frontal datasets.
♻ ★ TetWeave: Isosurface Extraction using On-The-Fly Delaunay Tetrahedral Grids for Gradient-Based Mesh Optimization SIGGRAPH 2025
We introduce TetWeave, a novel isosurface representation for gradient-based
mesh optimization that jointly optimizes the placement of a tetrahedral grid
used for Marching Tetrahedra and a novel directional signed distance at each
point. TetWeave constructs tetrahedral grids on-the-fly via Delaunay
triangulation, enabling increased flexibility compared to predefined grids. The
extracted meshes are guaranteed to be watertight, two-manifold and
intersection-free. The flexibility of TetWeave enables a resampling strategy
that places new points where reconstruction error is high and makes it possible
to encourage mesh fairness without compromising on reconstruction error. This
leads to high-quality, adaptive meshes that require minimal memory usage and
few parameters to optimize. Consequently, TetWeave exhibits near-linear memory
scaling relative to the vertex count of the output mesh - a substantial
improvement over predefined grids. We demonstrate the applicability of TetWeave
to a broad range of challenging tasks in computer graphics and vision, such as
multi-view 3D reconstruction, mesh compression and geometric texture
generation.
comment: ACM Trans. Graph. 44, 4. SIGGRAPH 2025. 19 pages, 21 figures
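The on-the-fly Delaunay construction that TetWeave relies on can be reproduced with SciPy as below; this shows only the tetrahedral-grid step, without the directional signed distances or Marching Tetrahedra extraction, and the point counts are arbitrary.
    # On-the-fly Delaunay tetrahedralization of a 3D point set with SciPy, the
    # basic operation the abstract describes (signed distances and Marching
    # Tetrahedra extraction are not shown). Point counts are arbitrary.
    import numpy as np
    from scipy.spatial import Delaunay

    points = np.random.rand(500, 3)            # candidate grid points (random here)
    tets = Delaunay(points)                    # Delaunay tetrahedral grid
    print(tets.simplices.shape)                # (n_tetrahedra, 4) vertex indices

    # Adding points and re-triangulating mimics the resampling step where new
    # points are inserted in regions of high reconstruction error.
    new_points = np.vstack([points, np.random.rand(50, 3)])
    tets = Delaunay(new_points)
    print(tets.simplices.shape)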
♻ ★ HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
Customized video generation aims to produce videos featuring specific
subjects under flexible user-defined conditions, yet existing methods often
struggle with identity consistency and limited input modalities. In this paper,
we propose HunyuanCustom, a multi-modal customized video generation framework
that emphasizes subject consistency while supporting image, audio, video, and
text conditions. Built upon HunyuanVideo, our model first addresses the
image-text conditioned generation task by introducing a text-image fusion
module based on LLaVA for enhanced multi-modal understanding, along with an
image ID enhancement module that leverages temporal concatenation to reinforce
identity features across frames. To enable audio- and video-conditioned
generation, we further propose modality-specific condition injection
mechanisms: an AudioNet module that achieves hierarchical alignment via spatial
cross-attention, and a video-driven injection module that integrates
latent-compressed conditional video through a patchify-based feature-alignment
network. Extensive experiments on single- and multi-subject scenarios
demonstrate that HunyuanCustom significantly outperforms state-of-the-art open-
and closed-source methods in terms of ID consistency, realism, and text-video
alignment. Moreover, we validate its robustness across downstream tasks,
including audio and video-driven customized video generation. Our results
highlight the effectiveness of multi-modal conditioning and identity-preserving
strategies in advancing controllable video generation. All the code and models
are available at https://hunyuancustom.github.io.
♻ ★ Defining and Quantifying Creative Behavior in Popular Image Generators
Creativity of generative AI models has been a subject of scientific debate in
the last years, without a conclusive answer. In this paper, we study creativity
from a practical perspective and introduce quantitative measures that help the
user to choose a suitable AI model for a given task. We evaluated our measures
on a number of popular image-to-image generation models, and the results of
this suggest that our measures conform to human intuition.
♻ ★ FA-KPConv: Introducing Euclidean Symmetries to KPConv via Frame Averaging IJCNN 2025
We present Frame-Averaging Kernel-Point Convolution (FA-KPConv), a neural
network architecture built on top of the well-known KPConv, a widely adopted
backbone for 3D point cloud analysis. Even though invariance and/or
equivariance to Euclidean transformations are required for many common tasks,
KPConv-based networks can only approximately achieve such properties when
training on large datasets or with significant data augmentations. Using Frame
Averaging, we allow to flexibly customize point cloud neural networks built
with KPConv layers, by making them exactly invariant and/or equivariant to
translations, rotations and/or reflections of the input point clouds. By simply
wrapping around an existing KPConv-based network, FA-KPConv embeds geometrical
prior knowledge into it while preserving the number of learnable parameters and
not compromising any input information. We showcase the benefit of such an
introduced bias for point cloud classification and point cloud registration,
especially in challenging cases such as scarce training data or randomly
rotated test data.
comment: 8 pages, 2 figures, accepted at IJCNN 2025
♻ ★ Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge
Vladyslav Zalevskyi, Thomas Sanchez, Misha Kaandorp, Margaux Roulet, Diego Fajardo-Rojas, Liu Li, Jana Hutter, Hongwei Bran Li, Matthew Barkovich, Hui Ji, Luca Wilhelmi, Aline Dändliker, Céline Steger, Mériam Koob, Yvan Gomez, Anton Jakovčić, Melita Klaić, Ana Adžić, Pavel Marković, Gracia Grabarić, Milan Rados, Jordina Aviles Verdera, Gregor Kasprian, Gregor Dovjak, Raphael Gaubert-Rachmühl, Maurice Aschwanden, Qi Zeng, Davood Karimi, Denis Peruzzo, Tommaso Ciceri, Giorgio Longari, Rachika E. Hamadache, Amina Bouzid, Xavier Lladó, Simone Chiarella, Gerard Martí-Juan, Miguel Ángel González Ballester, Marco Castellaro, Marco Pinamonti, Valentina Visani, Robin Cremese, Keïn Sam, Fleur Gaudfernau, Param Ahir, Mehul Parikh, Maximilian Zenk, Michael Baumgartner, Klaus Maier-Hein, Li Tianhong, Yang Hong, Zhao Longfei, Domen Preloznik, Žiga Špiclin, Jae Won Choi, Muyang Li, Jia Fu, Guotai Wang, Jingwen Jiang, Lyuyang Tong, Bo Du, Andrea Gondova, Sungmin You, Kiho Im, Abdul Qayyum, Moona Mazher, Steven A Niederer, Andras Jakab, Roxane Licandro, Kelly Payette, Meritxell Bach Cuadra
Accurate fetal brain tissue segmentation and biometric analysis are essential
for studying brain development in utero. The FeTA Challenge 2024 advanced
automated fetal brain MRI analysis by introducing biometry prediction as a new
task alongside tissue segmentation. For the first time, our diverse
multi-centric test set included data from a new low-field (0.55T) MRI dataset.
Evaluation metrics were also expanded to include the topology-specific Euler
characteristic difference (ED). Sixteen teams submitted segmentation methods,
most of which performed consistently across both high- and low-field scans.
However, longitudinal trends indicate that segmentation accuracy may be
reaching a plateau, with results now approaching inter-rater variability. The
ED metric uncovered topological differences that were missed by conventional
metrics, while the low-field dataset achieved the highest segmentation scores,
highlighting the potential of affordable imaging systems when paired with
high-quality reconstruction. Seven teams participated in the biometry task, but
most methods failed to outperform a simple baseline that predicted measurements
based solely on gestational age, underscoring the challenge of extracting
reliable biometric estimates from image data alone. Domain shift analysis
identified image quality as the most significant factor affecting model
generalization, with super-resolution pipelines also playing a substantial
role. Other factors, such as gestational age, pathology, and acquisition site,
had smaller, though still measurable, effects. Overall, FeTA 2024 offers a
comprehensive benchmark for multi-class segmentation and biometry estimation in
fetal brain MRI, underscoring the need for data-centric approaches, improved
topological evaluation, and greater dataset diversity to enable clinically
robust and generalizable AI tools.
♻ ★ Learning from Similarity Proportion Loss for Classifying Skeletal Muscle Recovery Stages MICCAI2024
Evaluating the regeneration process of damaged muscle tissue is a fundamental
analysis in muscle research to measure experimental effect sizes and uncover
mechanisms behind muscle weakness due to aging and disease. The conventional
approach to assessing muscle tissue regeneration involves whole-slide imaging
and expert visual inspection of the recovery stages based on the morphological
information of cells and fibers. There is a need to replace these tasks with
automated methods incorporating machine learning techniques to ensure a
quantitative and objective analysis. Given the limited availability of fully
labeled data, a possible approach is Learning from Label Proportions (LLP), a
weakly supervised learning method using class label proportions. However,
current LLP methods have two limitations: (1) they cannot adapt the feature
extractor for muscle tissues, and (2) they treat the classes representing
recovery stages and cell morphological changes as nominal, resulting in the
loss of ordinal information. To address these issues, we propose Ordinal Scale
Learning from Similarity Proportion (OSLSP), which uses a similarity proportion
loss derived from two bag combinations. OSLSP can update the feature extractor
by using class proportion attention to the ordinal scale of the class. Our
model with OSLSP outperforms large-scale pre-trained and fine-tuned models in
classification tasks of skeletal muscle recovery stages.
comment: MICCAI2024 workshop ADSMI in Morocco (oral) [Peer-reviewed]
♻ ★ MAISY: Motion-Aware Image SYnthesis for Medical Image Motion Correction
Patient motion during medical image acquisition causes blurring, ghosting,
and distorts organs, which makes image interpretation challenging. Current
state-of-the-art algorithms using Generative Adversarial Network (GAN)-based
methods with their ability to learn the mappings between corrupted images and
their ground truth via Structural Similarity Index Measure (SSIM) loss
effectively generate motion-free images. However, we identified the following
limitations: (i) they mainly focus on global structural characteristics and
therefore overlook localized features that often carry critical pathological
information, and (ii) the SSIM loss function struggles to handle images with
varying pixel intensities, luminance factors, and variance. In this study, we
propose Motion-Aware Image SYnthesis (MAISY), which first characterizes
motion and then uses it for correction by: (a) leveraging the Segment Anything
Model (SAM) foundation model to dynamically learn spatial patterns along
anatomical boundaries where motion artifacts are most pronounced, and (b)
introducing the Variance-Selective SSIM (VS-SSIM) loss which adaptively
emphasizes spatial regions with high pixel variance to preserve essential
anatomical details during artifact correction. Experiments on chest and head CT
datasets demonstrate that our model outperformed the state-of-the-art
counterparts, with Peak Signal-to-Noise Ratio (PSNR) increasing by 40%, SSIM by
10%, and Dice by 16%.
♻ ★ Automated detection of underdiagnosed medical conditions via opportunistic imaging
Asad Aali, Andrew Johnston, Louis Blankemeier, Dave Van Veen, Laura T Derry, David Svec, Jason Hom, Robert D. Boutin, Akshay S. Chaudhari
Abdominal computed tomography (CT) scans are frequently performed in clinical
settings. Opportunistic CT involves repurposing routine CT images to extract
diagnostic information and is an emerging tool for detecting underdiagnosed
conditions such as sarcopenia, hepatic steatosis, and ascites. This study
utilizes deep learning methods to promote accurate diagnosis and clinical
documentation. We analyze 2,674 inpatient CT scans to identify discrepancies
between imaging phenotypes (characteristics derived from opportunistic CT
scans) and their corresponding documentation in radiology reports and ICD
coding. Through our analysis, we find that only 0.5%, 3.2%, and 30.7% of scans
diagnosed with sarcopenia, hepatic steatosis, and ascites (respectively)
through either opportunistic imaging or radiology reports were ICD-coded. Our
findings demonstrate opportunistic CT's potential to enhance diagnostic
precision and accuracy of risk adjustment models, offering advancements in
precision medicine.
♻ ★ Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment
In this work, we rethink the approach to video super-resolution by
introducing a method based on the Diffusion Posterior Sampling framework,
combined with an unconditional video diffusion transformer operating in latent
space. The video generation model, a diffusion transformer, functions as a
space-time model. We argue that a powerful model, which learns the physics of
the real world, can easily handle various kinds of motion patterns as prior
knowledge, thus eliminating the need for explicit estimation of optical flows
or motion parameters for pixel alignment. Furthermore, a single instance of the
proposed video diffusion transformer model can adapt to different sampling
conditions without re-training. Empirical results on synthetic and real-world
datasets illustrate the feasibility of diffusion-based, alignment-free video
super-resolution.
♻ ★ Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns
Sharp, multidimensional changepoints (abrupt shifts in a regression surface
whose locations and magnitudes are unknown) arise in settings as varied as
gene-expression profiling, financial covariance breaks, climate-regime
detection, and urban socioeconomic mapping. Despite their prevalence, there are
no current approaches that jointly estimate the location and size of the
discontinuity set in a one-shot approach with statistical guarantees. We
therefore introduce Free Discontinuity Regression (FDR), a fully nonparametric
estimator that simultaneously (i) smooths a regression surface, (ii) segments
it into contiguous regions, and (iii) provably recovers the precise locations
and sizes of its jumps. By extending a convex relaxation of the Mumford-Shah
functional to random spatial sampling and correlated noise, FDR overcomes the
fixed-grid and i.i.d. noise assumptions of classical image-segmentation
approaches, thus enabling its application to real-world data of any dimension.
This yields the first identification and uniform consistency results for
multivariate jump surfaces: under mild SBV regularity, the estimated function,
its discontinuity set, and all jump sizes converge to their true population
counterparts. Hyperparameters are selected automatically from the data using
Stein's Unbiased Risk Estimate, and large-scale simulations up to three
dimensions validate the theoretical results and demonstrate good finite-sample
performance. Applying FDR to an internet shutdown in India reveals a 25-35%
reduction in economic activity around the estimated shutdown boundaries, much
larger than previous estimates. By unifying smoothing, segmentation, and
effect-size recovery in a general statistical setting, FDR turns
free-discontinuity ideas into a practical tool with formal guarantees for
modern multivariate data.
comment: 24 pages, 3 figures, 2 tables; authors listed alphabetically; code
available at https://github.com/Davidvandijcke/fdr
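For reference, the Mumford-Shah functional whose convex relaxation the method extends can be written, in generic notation not taken from the paper, as
    \[
    E(u, K) = \lambda \int_{\Omega} (u - g)^{2}\,dx
            + \int_{\Omega \setminus K} \lVert \nabla u \rVert^{2}\,dx
            + \nu\,\mathcal{H}^{d-1}(K),
    \]
where \(g\) is the observed data on the domain \(\Omega\), \(u\) is the estimated piecewise-smooth regression surface, \(K\) is its discontinuity (jump) set, and \(\mathcal{H}^{d-1}\) is the \((d-1)\)-dimensional Hausdorff measure penalizing the size of the jump set.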
♻ ★ USTEP: Spatio-Temporal Predictive Learning under A Unified View
Spatio-temporal predictive learning plays a crucial role in self-supervised
learning, with wide-ranging applications across a diverse range of fields.
Previous approaches for temporal modeling fall into two categories:
recurrent-based and recurrent-free methods. The former, while meticulously
processing frames one by one, neglect short-term spatio-temporal information
redundancies, leading to inefficiencies. The latter naively stack frames
sequentially, overlooking the inherent temporal dependencies. In this paper, we
re-examine the two dominant temporal modeling approaches within the realm of
spatio-temporal predictive learning, offering a unified perspective. Building
upon this analysis, we introduce USTEP (Unified Spatio-TEmporal Predictive
learning), an innovative framework that reconciles the recurrent-based and
recurrent-free methods by integrating both micro-temporal and macro-temporal
scales. Extensive experiments on a wide range of spatio-temporal predictive
learning demonstrate that USTEP achieves significant improvements over existing
temporal modeling approaches, thereby establishing it as a robust solution for
a wide range of spatio-temporal applications.
comment: Accepted by TPAMI
♻ ★ AirMorph: Topology-Preserving Deep Learning for Pulmonary Airway Analysis
Minghui Zhang, Chenyu Li, Fangfang Xie, Yaoyu Liu, Hanxiao Zhang, Junyang Wu, Chunxi Zhang, Jie Yang, Jiayuan Sun, Guang-Zhong Yang, Yun Gu
Accurate anatomical labeling and analysis of the pulmonary structure and its
surrounding anatomy from thoracic CT are becoming increasingly important for
understanding the etiology of abnormalities or supporting targeted therapy and
early interventions. Whilst lung and airway cell atlases have been attempted,
there is a lack of fine-grained morphological atlases that are clinically
deployable. In this work, we introduce AirMorph, a robust, end-to-end deep
learning pipeline enabling fully automatic and comprehensive airway anatomical
labeling at lobar, segmental, and subsegmental resolutions that can be used to
create digital atlases of the lung. Evaluated across large-scale multi-center
datasets comprising diverse pulmonary conditions, the AirMorph consistently
outperformed existing segmentation and labeling methods in terms of accuracy,
topological consistency, and completeness. To simplify clinical interpretation,
we further introduce a compact anatomical signature quantifying critical
morphological airway features, including stenosis, ectasia, tortuosity,
divergence, length, and complexity. When applied to various pulmonary diseases
such as pulmonary fibrosis, emphysema, atelectasis, consolidation, and
reticular opacities, it demonstrates strong discriminative power, revealing
disease-specific morphological patterns with high interpretability and
explainability. Additionally, AirMorph supports efficient automated branching
pattern analysis, potentially enhancing bronchoscopic navigation planning and
procedural safety, offering a valuable clinical tool for improved diagnosis,
targeted treatment, and personalized patient care.
comment: Under Review
♻ ★ Label-Efficient Deep Learning in Medical Image Analysis: Challenges and Future Directions
Deep learning has significantly advanced medical imaging analysis (MIA),
achieving state-of-the-art performance across diverse clinical tasks. However,
its success largely depends on large-scale, high-quality labeled datasets,
which are costly and time-consuming to obtain due to the need for expert
annotation. To mitigate this limitation, label-efficient deep learning methods
have emerged to improve model performance under limited supervision by
leveraging labeled, unlabeled, and weakly labeled data. In this survey, we
systematically review over 350 peer-reviewed studies and present a
comprehensive taxonomy of label-efficient learning methods in MIA. These
methods are categorized into four labeling paradigms: no label, insufficient
label, inexact label, and label refinement. For each category, we analyze
representative techniques across imaging modalities and clinical applications,
highlighting shared methodological principles and task-specific adaptations. We
also examine the growing role of health foundation models (HFMs) in enabling
label-efficient learning through large-scale pre-training and transfer
learning, enhancing the use of limited annotations in downstream tasks.
Finally, we identify current challenges and future directions to facilitate the
translation of label-efficient learning from research promise to everyday
clinical care.
comment: Under Review
♻ ★ Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency
Christian Keilstrup Ingwersen, Rasmus Tirsgaard, Rasmus Nylander, Janus Nørtoft Jensen, Anders Bjorholm Dahl, Morten Rieger Hannemose
Deducing a 3D human pose from a single 2D image is inherently challenging
because multiple 3D poses can correspond to the same 2D representation. 3D data
can resolve this pose ambiguity, but it is expensive to record and requires an
intricate setup that is often restricted to controlled lab environments. We
propose a method that improves the performance of deep learning-based monocular
3D human pose estimation models by using multiview data only during training,
but not during inference. We introduce a novel loss function, consistency loss,
which operates on two synchronized views. This approach is simpler than
previous models that require 3D ground truth or intrinsic and extrinsic camera
parameters. Our consistency loss penalizes differences in two pose sequences
after rigid alignment. We also demonstrate that our consistency loss
substantially improves performance for fine-tuning without requiring 3D data.
Furthermore, we show that using our consistency loss can yield state-of-the-art
performance when training models from scratch in a semi-supervised manner. Our
findings provide a simple way to capture new data, e.g., in a new domain. This
data can be added using off-the-shelf cameras with no calibration requirements.
We make all our code and data publicly available.
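A generic sketch of a consistency penalty after rigid alignment is shown below: the 3D pose predicted from one view is Procrustes-aligned (Kabsch) onto the pose from the synchronized view, and the mean per-joint error is taken. Scale handling and weighting are assumptions, not the authors' exact loss.
    # Generic consistency penalty between two synchronized views: rigidly align
    # (Kabsch / Procrustes) the pose predicted from view B onto the pose from
    # view A, then take the mean per-joint error. Not the authors' exact loss.
    import torch

    def rigid_align(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
        """src, dst: (J, 3) joint positions; returns src rigidly aligned onto dst."""
        mu_s, mu_d = src.mean(0), dst.mean(0)
        s, d = src - mu_s, dst - mu_d
        U, _, Vt = torch.linalg.svd(s.T @ d)                 # SVD of the covariance s^T d
        D = torch.eye(3)
        D[2, 2] = torch.sign(torch.linalg.det(Vt.T @ U.T))   # guard against reflections
        R = Vt.T @ D @ U.T                                   # optimal rotation
        return (R @ s.T).T + mu_d

    def consistency_loss(pose_a: torch.Tensor, pose_b: torch.Tensor) -> torch.Tensor:
        aligned_b = rigid_align(pose_b, pose_a)
        return torch.linalg.norm(pose_a - aligned_b, dim=-1).mean()

    pose_a = torch.randn(17, 3)   # 3D pose predicted from camera view A
    pose_b = torch.randn(17, 3)   # 3D pose predicted from synchronized view B
    print(consistency_loss(pose_a, pose_b))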
♻ ★ Evaluating Deep Learning Models for Breast Cancer Classification: A Comparative Study
This study evaluates the effectiveness of deep learning models in classifying
histopathological images for early and accurate detection of breast cancer.
Eight advanced models, including ResNet-50, DenseNet-121, ResNeXt-50, Vision
Transformer (ViT), GoogLeNet (Inception v3), EfficientNet, MobileNet, and
SqueezeNet, were compared using a dataset of 277,524 image patches. The Vision
Transformer (ViT) model, with its attention-based mechanisms, achieved the
highest validation accuracy of 94%, outperforming conventional CNNs. The study
demonstrates the potential of advanced machine learning methods to enhance
precision and efficiency in breast cancer diagnosis in clinical settings.
comment: 4 pages, 2 figures, 2 tables
♻ ★ Transformer-based assignment decision network for multiple object tracking
Data association is a crucial component for any multiple object tracking
(MOT) method that follows the tracking-by-detection paradigm. To generate
complete trajectories such methods employ a data association process to
establish assignments between detections and existing targets during each
timestep. Recent data association approaches try to solve either a
multi-dimensional linear assignment task or a network flow minimization problem
or tackle it via multiple hypotheses tracking. However, during inference an
optimization step that computes optimal assignments is required for every
sequence frame inducing additional complexity to any given solution. To this
end, in the context of this work we introduce Transformer-based Assignment
Decision Network (TADN) that tackles data association without the need for any
explicit optimization during inference. In particular, TADN can directly infer
assignment pairs between detections and active targets in a single forward pass
of the network. We have integrated TADN in a rather simple MOT framework,
designed a novel training strategy for efficient end-to-end training and
demonstrated the high potential of our approach for online visual
tracking-by-detection MOT on several popular benchmarks, i.e. MOT17, MOT20 and
UA-DETRAC. Our proposed approach demonstrates strong performance in most
evaluation metrics despite its simple nature as a tracker lacking significant
auxiliary components such as occlusion handling or re-identification. The
implementation of our method is publicly available at
https://github.com/psaltaath/tadn-mot.
comment: Preprint version. Under consideration at Computer Vision and Image
Understanding
♻ ★ Interact with me: Joint Egocentric Forecasting of Intent to Interact, Attitude and Social Actions ICME
For efficient human-agent interaction, an agent should proactively recognize
their target user and prepare for upcoming interactions. We formulate this
challenging problem as the novel task of jointly forecasting a person's intent
to interact with the agent, their attitude towards the agent and the action
they will perform, from the agent's (egocentric) perspective. So we propose
SocialEgoNet, a graph-based spatiotemporal framework that exploits task
dependencies through a hierarchical multitask learning approach. SocialEgoNet
uses whole-body skeletons (keypoints from face, hands and body) extracted from
only 1 second of video input for high inference speed. For evaluation, we
augment an existing egocentric human-agent interaction dataset with new class
labels and bounding box annotations. Extensive experiments on this augmented
dataset, named JPL-Social, demonstrate real-time inference and superior
performance (average accuracy across all tasks: 83.15%) of our model
outperforming several competitive baselines. The additional annotations and
code will be available upon acceptance.
comment: Accepted to ICME, 2025. Camera-ready Version
♻ ★ Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections
Max Kirchner, Alexander C. Jenke, Sebastian Bodenstedt, Fiona R. Kolbinger, Oliver L. Saldanha, Jakob N. Kather, Martin Wagner, Stefanie Speidel
Purpose: In this study, we investigate the training of foundation models
using federated learning to address data-sharing limitations and enable
collaborative model training without data transfer for minimally invasive
surgery. Methods: Inspired by the EndoViT study, we adapt the Masked
Autoencoder for federated learning, enhancing it with adaptive Sharpness-Aware
Minimization (FedSAM) and Stochastic Weight Averaging (SWA). Our model is
pretrained on the Endo700k dataset collection and later fine-tuned and
evaluated for tasks such as Semantic Segmentation, Action Triplet Recognition,
and Surgical Phase Recognition. Results: Our findings demonstrate that
integrating adaptive FedSAM into the federated MAE approach improves
pretraining, leading to a reduction in reconstruction loss per patch. The
application of FL-EndoViT in surgical downstream tasks results in performance
comparable to CEN-EndoViT. Furthermore, FL-EndoViT exhibits advantages over
CEN-EndoViT in surgical scene segmentation when data is limited and in action
triplet recognition when large datasets are used. Conclusion: These findings
highlight the potential of federated learning for privacy-preserving training
of surgical foundation models, offering a robust and generalizable solution for
surgical data science. Effective collaboration requires adapting federated
learning methods, such as the integration of FedSAM, which can accommodate the
inherent data heterogeneity across institutions. In future, exploring FL in
video-based models may enhance these capabilities by incorporating
spatiotemporal dynamics crucial for real-world surgical environments.
comment: Preprint submitted to IEEE TMI
♻ ★ CloudTrack: Scalable UAV Tracking with Cloud Semantics
Nowadays, unmanned aerial vehicles (UAVs) are commonly used in search and
rescue scenarios to gather information in the search area. The automatic
identification of the person searched for in aerial footage could increase the
autonomy of such systems, reduce the search time, and thus increase the missed
person's chances of survival. In this paper, we present a novel approach to
perform semantically conditioned open vocabulary object tracking that is
specifically designed to cope with the limitations of UAV hardware. Our
approach has several advantages: it can operate on a verbal description of the
missing person, e.g., the color of their shirt; it does not require dedicated
training to execute the mission; and it can efficiently track a potentially
moving person. Our experimental results demonstrate the versatility and efficacy of
our approach.
comment: 7 pages, 3 figures
♻ ★ How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
The rapid advancement of Multimodal Large Language Models (MLLMs) has spurred
interest in complex multimodal reasoning tasks in real-world and virtual
environments, which require coordinating multiple abilities, including visual
perception, visual reasoning, spatial awareness, and target deduction. However,
existing evaluations primarily assess final task completion, often reducing
assessment to isolated abilities such as visual grounding and visual
question answering. Less attention is given to comprehensively and
quantitatively analyzing the reasoning process in multimodal environments, which is
crucial for understanding model behaviors and underlying reasoning mechanisms
beyond merely task success. To address this, we introduce MM-Escape, an
extensible benchmark for investigating multimodal reasoning, inspired by
real-world escape games. MM-Escape emphasizes intermediate model behaviors
alongside final task completion. To achieve this, we develop EscapeCraft, a
customizable and open environment that enables models to engage in free-form
exploration for assessing multimodal reasoning. Extensive experiments show that
MLLMs, regardless of scale, can successfully complete the simplest room escape
tasks, with some exhibiting human-like exploration strategies. Yet, performance
dramatically drops as task difficulty increases. Moreover, we observe that
performance bottlenecks vary across models, revealing distinct failure modes
and limitations in their multimodal reasoning abilities, such as repetitive
trajectories without adaptive exploration, getting stuck in corners due to poor
visual spatial awareness, and ineffective use of acquired props, such as the
key. We hope our work sheds light on new challenges in multimodal reasoning,
and uncovers potential avenues for improving MLLM capabilities.
♻ ★ Vision Transformers for Efficient Indoor Pathloss Radio Map Prediction
Indoor pathloss prediction is a fundamental task in wireless network
planning, yet it remains challenging due to environmental complexity and data
scarcity. In this work, we propose a deep learning-based approach utilizing a
vision transformer (ViT) architecture with DINO-v2 pretrained weights to model
indoor radio propagation. Our method processes a floor map with additional
features of the walls to generate indoor pathloss maps. We systematically
evaluate the effects of architectural choices, data augmentation strategies,
and feature engineering techniques. Our findings indicate that extensive
augmentation significantly improves generalization, while feature engineering
is crucial in low-data regimes. Through comprehensive experiments, we
demonstrate the robustness of our model across different generalization
scenarios.
comment: Work partly supported by the RA Science Committee grant No. 22rl-052
(DISTAL) and the EU under Italian National Recovery and Resilience Plan of
NextGenerationEU on "Telecommunications of the Future" (PE00000001 - program
"RESTART")
♻ ★ Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
Most existing video anomaly detectors rely solely on RGB frames, which lack
the temporal resolution needed to capture abrupt or transient motion cues, key
indicators of anomalous events. To address this limitation, we propose
Image-Event Fusion for Video Anomaly Detection (IEF-VAD), a framework that
synthesizes event representations directly from RGB videos and fuses them with
image features through a principled, uncertainty-aware process. The system (i)
models heavy-tailed sensor noise with a Student's-t likelihood, deriving
value-level inverse-variance weights via a Laplace approximation; (ii) applies
Kalman-style frame-wise updates to balance modalities over time; and (iii)
iteratively refines the fused latent state to erase residual cross-modal noise.
Without any dedicated event sensor or frame-level labels, IEF-VAD sets a new
state of the art across multiple real-world anomaly detection benchmarks. These
findings highlight the utility of synthetic event representations in
emphasizing motion cues that are often underrepresented in RGB frames, enabling
accurate and robust video understanding across diverse applications without
requiring dedicated event sensors. Code and models are available at
https://github.com/EavnJeong/IEF-VAD.
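A simplified sketch of the uncertainty-aware fusion and frame-wise update described above, with plain inverse-variance (Gaussian) weights standing in for the paper's Student's-t/Laplace derivation; shapes and variances are illustrative:
```python
# Value-level inverse-variance fusion of image/event features plus a
# Kalman-style temporal update (simplified relative to IEF-VAD).
import torch

def fuse(img_feat, evt_feat, img_logvar, evt_logvar):
    """Inverse-variance weighting of two modality features."""
    w_img, w_evt = torch.exp(-img_logvar), torch.exp(-evt_logvar)
    fused = (w_img * img_feat + w_evt * evt_feat) / (w_img + w_evt + 1e-8)
    fused_var = 1.0 / (w_img + w_evt + 1e-8)
    return fused, fused_var

def kalman_update(state, state_var, obs, obs_var):
    """Frame-wise Kalman-style update of the fused latent state."""
    gain = state_var / (state_var + obs_var + 1e-8)
    return state + gain * (obs - state), (1.0 - gain) * state_var

T, D = 16, 512
img, evt = torch.randn(T, D), torch.randn(T, D)
img_lv, evt_lv = torch.zeros(T, D), torch.zeros(T, D)   # predicted log-variances
state, var = fuse(img[0], evt[0], img_lv[0], evt_lv[0])
for t in range(1, T):
    obs, obs_var = fuse(img[t], evt[t], img_lv[t], evt_lv[t])
    state, var = kalman_update(state, var, obs, obs_var)
```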
♻ ★ Expanding Event Modality Applications through a Robust CLIP-Based Encoder
This paper introduces a powerful encoder that transfers CLIP's capabilities
to event-based data, enhancing its utility and expanding its applicability
across diverse domains. While large-scale datasets have significantly advanced
image-based models, the scarcity of comprehensive event datasets has limited
performance potential in event modality. To address this challenge, we adapt
CLIP's architecture to align event embeddings with image embeddings, supporting
zero-shot learning and preserving text alignment while mitigating catastrophic
forgetting. Our encoder achieves strong performance in object recognition, with
competitive results in zero-shot and few-shot learning tasks. Notably, it
generalizes effectively to events extracted from video data without requiring
additional training, highlighting its versatility. Additionally, we integrate
this encoder within a cross-modality framework that facilitates interaction
across five modalities (Image, Event, Text, Sound, and Depth), expanding the
possibilities for cross-modal applications. Overall, this work underscores the
transformative potential of a robust event encoder, broadening the scope and
utility of event-based data across various fields.
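A hedged sketch of the kind of alignment objective such an encoder could be trained with: a trainable event encoder is pulled toward a frozen CLIP image tower on paired data via a symmetric contrastive loss. The encoders below are stand-ins, not the paper's models:
```python
# Schematic event-to-CLIP alignment loss (symmetric InfoNCE over a batch).
import torch
import torch.nn.functional as F

def alignment_loss(event_encoder, clip_image_encoder, events, images, temperature=0.07):
    with torch.no_grad():                               # CLIP image tower stays frozen
        img_emb = F.normalize(clip_image_encoder(images), dim=-1)
    evt_emb = F.normalize(event_encoder(events), dim=-1)
    logits = evt_emb @ img_emb.t() / temperature        # paired items sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

evt_enc = torch.nn.Linear(2048, 512)                    # stand-in event encoder
img_enc = torch.nn.Linear(1024, 512)                    # stand-in for CLIP's image tower
loss = alignment_loss(evt_enc, img_enc, torch.randn(8, 2048), torch.randn(8, 1024))
```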
♻ ★ Search is All You Need for Few-shot Anomaly Detection
Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging
task in industrial inspection, where normal distribution modeling must be
accomplished with only a few normal images. While existing approaches typically
employ multi-modal foundation models combining language and vision modalities
for prompt-guided anomaly detection, these methods often demand sophisticated
prompt engineering and extensive manual tuning. In this paper, we demonstrate
that a straightforward nearest-neighbor search framework can surpass
state-of-the-art performance in both single-class and multi-class FSAD
scenarios. Our proposed method, VisionAD, consists of four simple yet essential
components: (1) scalable vision foundation models that extract universal and
discriminative features; (2) dual augmentation strategies, namely support
augmentation to enhance feature matching adaptability and query augmentation to
address the blind spots of single-view prediction; (3) multi-layer feature
integration that captures both low-frequency global context and high-frequency
local details with minimal computational overhead; and (4) a class-aware visual
memory bank enabling efficient one-for-all multi-class detection. Extensive
evaluations across MVTec-AD, VisA, and Real-IAD benchmarks demonstrate
VisionAD's exceptional performance. Using only 1 normal image as support, our
method achieves remarkable image-level AUROC scores of 97.4%, 94.8%, and 70.8%
respectively, outperforming current state-of-the-art approaches by significant
margins (+1.6%, +3.2%, and +1.4%). The training-free nature and superior
few-shot capabilities of VisionAD make it particularly appealing for real-world
applications where samples are scarce or expensive to obtain. Code is available
at https://github.com/Qiqigeww/VisionAD.
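The core of a nearest-neighbour scheme like the one described above fits in a few lines: support features populate a class-aware memory bank, and a query patch's anomaly score is its distance to its nearest support patches. The feature source and k are placeholders, not the paper's settings:
```python
# Minimal class-aware kNN anomaly scoring over a visual memory bank.
import torch

class VisualMemoryBank:
    def __init__(self):
        self.bank = {}                                   # class name -> (N, D) support features

    def add(self, cls, feats):
        self.bank[cls] = torch.cat([self.bank.get(cls, feats[:0]), feats], dim=0)

    def score(self, cls, query_feats, k=1):
        """Image-level anomaly score: worst patch's mean distance to its k nearest supports."""
        d = torch.cdist(query_feats, self.bank[cls])     # (Q, N) pairwise distances
        knn = d.topk(k, dim=1, largest=False).values     # (Q, k)
        return knn.mean(dim=1).max()

bank = VisualMemoryBank()
bank.add("screw", torch.randn(784, 768))                 # patch features from 1 normal image
print(bank.score("screw", torch.randn(784, 768)))
```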
♻ ★ REHEARSE-3D: A Multi-modal Emulated Rain Dataset for 3D Point Cloud De-raining
Abu Mohammed Raisuddin, Jesper Holmblad, Hamed Haghighi, Yuri Poledna, Maikol Funk Drechsler, Valentina Donzella, Eren Erdal Aksoy
Sensor degradation poses a significant challenge in autonomous driving.
During heavy rainfall, the interference from raindrops can adversely affect the
quality of LiDAR point clouds, resulting in, for instance, inaccurate point
measurements. This, in turn, can potentially lead to safety concerns if
autonomous driving systems are not weather-aware, i.e., if they are unable to
discern such changes. In this study, we release a new, large-scale, multi-modal
emulated rain dataset, REHEARSE-3D, to promote research advancements in 3D
point cloud de-raining. Distinct from the most relevant competitors, our
dataset is unique in several respects. First, it is the largest point-wise
annotated dataset, and second, it is the only one with high-resolution LiDAR
data (LiDAR-256) enriched with 4D Radar point clouds logged in both daytime and
nighttime conditions in a controlled weather environment. Furthermore,
REHEARSE-3D involves rain-characteristic information, which is of significant
value not only for sensor noise modeling but also for analyzing the impact of
weather at a point level. Leveraging REHEARSE-3D, we benchmark raindrop
detection and removal in fused LiDAR and 4D Radar point clouds. Our
comprehensive study further evaluates the performance of various statistical
and deep-learning models. Upon publication, the dataset and benchmark models
will be made publicly available at: https://sporsho.github.io/REHEARSE3D.
♻ ★ SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding CVPR 2025
Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
With the rapid development of Multi-modal Large Language Models (MLLMs), an
increasing number of benchmarks have been established to evaluate the video
understanding capabilities of these models. However, these benchmarks focus on
standalone videos and mainly assess "visual elements" like human actions and
object states. In reality, contemporary videos often encompass complex and
continuous narratives, typically presented as a series. To address this
challenge, we propose SeriesBench, a benchmark consisting of 105 carefully
curated narrative-driven series, covering 28 specialized tasks that require
deep narrative understanding. Specifically, we first select a diverse set of
drama series spanning various genres. Then, we introduce a novel long-span
narrative annotation method, combined with a full-information transformation
approach to convert manual annotations into diverse task formats. To further
enhance model capacity for detailed analysis of plot structures and character
relationships within series, we propose a novel narrative reasoning framework,
PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still
face significant challenges in understanding narrative-driven series, while
PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our
SeriesBench and PC-DCoT highlight the critical necessity of advancing model
capabilities to understand narrative-driven series, guiding the future
development of MLLMs. SeriesBench is publicly available at
https://github.com/zackhxn/SeriesBench-CVPR2025.
comment: 29 pages, 15 figures, CVPR 2025
♻ ★ 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
3D spatial reasoning is the ability to analyze and interpret the positions,
orientations, and spatial relationships of objects within the 3D space. This
allows models to develop a comprehensive understanding of the 3D scene,
enabling their applicability to a broader range of areas, such as autonomous
navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have
achieved remarkable progress in a wide range of image and video understanding
tasks, their capabilities to perform 3D spatial reasoning on diverse natural
images are less studied. In this work we present the first comprehensive 3D
spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual
question-answer pairs across 12 question types. We conduct robust and thorough
evaluation of 3D spatial reasoning capabilities by balancing the data
distribution and adopting a novel FlipEval strategy. To further study the
robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench
includes two subsets with 3D spatial reasoning questions on paired images with
common and uncommon viewpoints. We benchmark a wide range of open-sourced and
proprietary LMMs, uncovering their limitations in various aspects of 3D
awareness, such as height, orientation, location, and multi-object reasoning,
as well as their degraded performance on images with uncommon camera
viewpoints. Our 3DSRBench provides valuable findings and insights about the
future development of LMMs with strong 3D reasoning capabilities. Our project
page and dataset are available at https://3dsrbench.github.io.
comment: Project page: https://3dsrbench.github.io
♻ ★ Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing
Hong Zhang, Zhongjie Duan, Xingjun Wang, Yuze Zhao, Weiyi Lu, Zhipeng Di, Yixuan Xu, Yingda Chen, Yu Zhang
Unified multimodal large language models (MLLMs) aim to integrate multimodal
understanding and generation abilities through a single framework. Despite
their versatility, existing open-source unified models exhibit performance gaps
against domain-specific architectures. To bridge this gap, we present
Nexus-Gen, a unified model that synergizes the language reasoning capabilities
of LLMs with the image synthesis power of diffusion models. To align the
embedding space of the LLM and diffusion model, we conduct a dual-phase
alignment training process. (1) The autoregressive LLM learns to predict image
embeddings conditioned on multimodal inputs, while (2) the vision decoder is
trained to reconstruct high-fidelity images from these embeddings. While
training the LLM, we identified a critical discrepancy between the
autoregressive paradigm's training and inference phases, where error
accumulation in continuous embedding space severely degrades generation
quality. To avoid this issue, we introduce a prefilled autoregression strategy
that prefills the input sequence with position-embedded special tokens instead of
continuous embeddings. Through dual-phase training, Nexus-Gen has developed the
integrated capability to comprehensively address the image understanding,
generation and editing tasks. All models, datasets, and codes are published at
https://github.com/modelscope/Nexus-Gen.git to facilitate further advancements
across the field.
♻ ★ FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger
Geometrically accurate and semantically expressive map representations have
proven invaluable to facilitate robust and safe mobile robot navigation and
task planning. Nevertheless, real-time, open-vocabulary semantic understanding
of large-scale unknown environments is still an open problem. In this paper we
present FindAnything, an open-world mapping and exploration framework that
incorporates vision-language information into dense volumetric submaps. Thanks
to the use of vision-language features, FindAnything bridges the gap between
pure geometric and open-vocabulary semantic information for a higher level of
understanding while allowing exploration of any environment without the help of any
external source of ground-truth pose information. We represent the environment
as a series of volumetric occupancy submaps, resulting in a robust and accurate
map representation that deforms upon pose updates when the underlying SLAM
system corrects its drift, allowing for a locally consistent representation
between submaps. Pixel-wise vision-language features are aggregated from
efficient SAM (eSAM)-generated segments, which are in turn integrated into
object-centric volumetric submaps, providing a mapping from open-vocabulary
queries to 3D geometry that is also scalable in terms of memory usage. The
open-vocabulary map representation of FindAnything achieves state-of-the-art
semantic accuracy in closed-set evaluations on the Replica dataset. This level
of scene understanding allows a robot to explore environments based on objects
or areas of interest selected via natural language queries. Our system is the
first of its kind to be deployed on resource-constrained devices, such as MAVs,
leveraging vision-language information for real-world robotic tasks.
comment: 11 pages, 5 figures
♻ ★ WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines NAACL 2025
Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia, Jan Christian Blaise Cruz, Jan Wira Gotama Putra, Junho Myung, Lucky Susanto, Maria Angelica Riera Machin, Marina Zhukova, Michael Anugraha, Muhammad Farid Adilazuarda, Natasha Santosa, Peerat Limkonchotiwat, Raj Dabre, Rio Alexander Audino, Samuel Cahyawijaya, Shi-Xiong Zhang, Stephanie Yulia Salim, Yi Zhou, Yinxuan Gui, David Ifeoluwa Adelani, En-Shiun Annie Lee, Shogo Okada, Ayu Purwarianti, Alham Fikri Aji, Taro Watanabe, Derry Tanti Wijaya, Alice Oh, Chong-Wah Ngo
Vision Language Models (VLMs) often struggle with culture-specific knowledge,
particularly in languages other than English and in underrepresented cultural
contexts. To evaluate their understanding of such knowledge, we introduce
WorldCuisines, a massive-scale benchmark for multilingual and multicultural,
visually grounded language understanding. This benchmark includes a visual
question answering (VQA) dataset with text-image pairs across 30 languages and
dialects, spanning 9 language families and featuring over 1 million data
points, making it the largest multicultural VQA benchmark to date. It includes
tasks for identifying dish names and their origins. We provide evaluation
datasets in two sizes (12k and 60k instances) alongside a training dataset (1
million instances). Our findings show that while VLMs perform better with
correct location context, they struggle with adversarial contexts and
predicting specific regional cuisines and languages. To support future
research, we release a knowledge base with annotated food entries and images
along with the VQA data.
comment: Best Theme Paper at NAACL 2025
♻ ★ Balanced 3DGS: Gaussian-wise Parallelism Rendering with Fine-Grained Tiling
Hao Gui, Lin Hu, Rui Chen, Mingxiao Huang, Yuxin Yin, Jin Yang, Yong Wu, Chen Liu, Zhongxu Sun, Xueyang Zhang, Kun Zhan
3D Gaussian Splatting (3DGS) is increasingly attracting attention in both
academia and industry owing to its superior visual quality and rendering speed.
However, training a 3DGS model remains a time-intensive task, especially in
load imbalance scenarios where workload diversity among pixels and Gaussian
spheres causes poor render CUDA kernel performance. We introduce Balanced 3DGS,
a Gaussian-wise parallel rendering approach with fine-grained tiling for the
3DGS training process that addresses these load-imbalance issues. First, we
innovatively introduce the inter-block dynamic workload distribution technique
to map workloads to Streaming Multiprocessor (SM) resources within a single GPU
dynamically, which constitutes the foundation of load balancing. Second, we are
the first to propose the Gaussian-wise parallel rendering technique to
significantly reduce workload divergence inside a warp, which serves as a
critical component in addressing load imbalance. Based on the above two
methods, we further creatively put forward the fine-grained combined load
balancing technique to uniformly distribute workload across all SMs, which
boosts the forward render CUDA kernel performance by up to 7.52x. In addition, we
present a self-adaptive render kernel selection strategy during the 3DGS
training process based on different load-balance situations, which effectively
improves training efficiency.
♻ ★ MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion ICLR 25
Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, Ming-Hsuan Yang
Estimating geometry from dynamic scenes, where objects move and deform over
time, remains a core challenge in computer vision. Current approaches often
rely on multi-stage pipelines or global optimizations that decompose the
problem into subtasks, like depth and flow, leading to complex systems prone to
errors. In this paper, we present Motion DUSt3R (MonST3R), a novel
geometry-first approach that directly estimates per-timestep geometry from
dynamic scenes. Our key insight is that by simply estimating a pointmap for
each timestep, we can effectively adapt DUSt3R's representation, previously
only used for static scenes, to dynamic scenes. However, this approach presents
a significant challenge: the scarcity of suitable training data, namely
dynamic, posed videos with depth labels. Despite this, we show that by posing
the problem as a fine-tuning task, identifying several suitable datasets, and
strategically training the model on this limited data, we can surprisingly
enable the model to handle dynamics, even without an explicit motion
representation. Based on this, we introduce new optimizations for several
downstream video-specific tasks and demonstrate strong performance on video
depth and camera pose estimation, outperforming prior work in terms of
robustness and efficiency. Moreover, MonST3R shows promising results for
primarily feed-forward 4D reconstruction.
comment: Accepted by ICLR 25, Project page: https://monst3r-project.github.io/
♻ ★ PhysFlow: Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation CVPR 2025
Realistic simulation of dynamic scenes requires accurately capturing diverse
material properties and modeling complex object interactions grounded in
physical principles. However, existing methods are constrained to basic
material types with limited predictable parameters, making them insufficient to
represent the complexity of real-world materials. We introduce PhysFlow, a
novel approach that leverages multi-modal foundation models and video diffusion
to achieve enhanced 4D dynamic scene simulation. Our method utilizes
multi-modal models to identify material types and initialize material
parameters through image queries, while simultaneously inferring 3D Gaussian
splats for detailed scene representation. We further refine these material
parameters using video diffusion with a differentiable Material Point Method
(MPM) and optical flow guidance rather than render loss or Score Distillation
Sampling (SDS) loss. This integrated framework enables accurate prediction and
realistic simulation of dynamic interactions in real-world scenarios, advancing
both accuracy and flexibility in physics-based simulations.
comment: CVPR 2025. Homepage: https://zhuomanliu.github.io/PhysFlow/
♻ ★ Adaptive Rate Control for Deep Video Compression with Rate-Distortion Prediction
Deep video compression has made significant progress in recent years,
achieving rate-distortion performance that surpasses that of traditional video
compression methods. However, rate control schemes tailored for deep video
compression have not been well studied. In this paper, we propose a neural
network-based $\lambda$-domain rate control scheme for deep video compression,
which determines the coding parameter $\lambda$ for each to-be-coded frame
based on the rate-distortion-$\lambda$ (R-D-$\lambda$) relationships directly
learned from uncompressed frames, achieving high rate control accuracy
efficiently without the need for pre-encoding. Moreover, this content-aware
scheme is able to mitigate inter-frame quality fluctuations and adapt to abrupt
changes in video content. Specifically, we introduce two neural network-based
predictors to estimate the relationship between bitrate and $\lambda$, as well
as the relationship between distortion and $\lambda$ for each frame. Then we
determine the coding parameter $\lambda$ for each frame to achieve the target
bitrate. Experimental results demonstrate that our approach achieves high rate
control accuracy at the mini-GOP level with low time overhead and mitigates
inter-frame quality fluctuations across video content of varying resolutions.
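A rough sketch of $\lambda$-domain rate control with learned predictors, assuming (as is standard) that rate decreases monotonically in $\lambda$; the network shapes and the bisection search below are illustrative, not the paper's design:
```python
# Hedged sketch: MLP predictors for R(lambda) and D(lambda), and a bisection
# search for the lambda that meets a target bitrate without pre-encoding.
import torch
import torch.nn as nn

class RDPredictor(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + 1, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feat, lam):                        # feat: (B, feat_dim), lam: (B, 1)
        return self.net(torch.cat([feat, lam.log()], dim=-1))

rate_net, dist_net = RDPredictor(), RDPredictor()        # bitrate and distortion predictors

def pick_lambda(feat, target_bits, lo=1e-2, hi=1e3, iters=30):
    """Bisection in log-lambda space; assumes predicted rate decreases with lambda."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        r = rate_net(feat, torch.full((feat.size(0), 1), mid)).mean().item()
        lo, hi = (mid, hi) if r > target_bits else (lo, mid)
    return (lo * hi) ** 0.5

lam = pick_lambda(torch.randn(1, 128), target_bits=0.05)  # feature of the to-be-coded frame
```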
♻ ★ MambaNUT: Nighttime UAV Tracking via Mamba-based Adaptive Curriculum Learning
Harnessing low-light enhancement and domain adaptation, nighttime UAV
tracking has made substantial strides. However, over-reliance on image
enhancement, limited high-quality nighttime data, and a lack of integration
between daytime and nighttime trackers hinder the development of an end-to-end
trainable framework. Additionally, current ViT-based trackers demand heavy
computational resources due to their reliance on the self-attention mechanism.
In this paper, we propose a novel pure Mamba-based tracking framework
(MambaNUT) that employs a state space model with linear complexity as its
backbone, incorporating a single-stream architecture that integrates feature
learning and template-search coupling within Vision Mamba. We introduce an
adaptive curriculum learning (ACL) approach that dynamically adjusts sampling
strategies and loss weights, thereby improving the model's generalization
ability. Our ACL is composed of two levels of curriculum schedulers: (1) a
sampling scheduler that transforms the data distribution from imbalanced to
balanced, as well as from easier (daytime) to harder (nighttime) samples; and
(2) a loss scheduler that dynamically assigns weights based on the size of the
training data and the IoU of individual instances. Exhaustive experiments on
multiple nighttime UAV tracking benchmarks demonstrate that the proposed
MambaNUT achieves state-of-the-art performance while requiring lower
computational costs. The code will be available at
https://github.com/wuyou3474/MambaNUT.
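A hedged sketch of the two curriculum schedulers described above; the interpolation and weighting formulas are illustrative stand-ins, not the paper's exact rules:
```python
# Illustrative adaptive-curriculum schedulers: sampling weights interpolate
# from the imbalanced data distribution to uniform, and loss weights depend on
# instance IoU, dataset size, and training progress.
import numpy as np

def sampling_weights(class_counts, progress):
    """progress in [0, 1]: 0 keeps the empirical (imbalanced) distribution, 1 is uniform."""
    counts = np.asarray(class_counts, dtype=float)
    empirical = counts / counts.sum()
    uniform = np.ones_like(empirical) / len(empirical)
    return (1 - progress) * empirical + progress * uniform

def loss_weight(iou, n_samples, progress, easy_bias=2.0):
    """Down-weight hard (low-IoU) instances early, relax later; smaller datasets
    get slightly larger weights so they are not drowned out."""
    hardness = 1.0 - iou
    curriculum = np.exp(-easy_bias * (1 - progress) * hardness)
    scale = 1.0 / np.log(10.0 + n_samples)
    return curriculum * scale

print(sampling_weights([9000, 1000], progress=0.5))   # daytime-heavy -> more balanced
print(loss_weight(iou=0.3, n_samples=1000, progress=0.1))
```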
♻ ★ Generalizable Human Gaussians from Single-View Image ICLR 2025
In this work, we tackle the task of learning 3D human Gaussians from a single
image, focusing on recovering detailed appearance and geometry including
unobserved regions. We introduce a single-view generalizable Human Gaussian
Model (HGM), which employs a novel generate-then-refine pipeline with the
guidance from human body prior and diffusion prior. Our approach uses a
ControlNet to refine rendered back-view images from coarse predicted human
Gaussians, then uses the refined image along with the input image to
reconstruct refined human Gaussians. To mitigate the potential generation of
unrealistic human poses and shapes, we incorporate human priors from the SMPL-X
model as a dual branch, propagating image features from the SMPL-X volume to
the image Gaussians using sparse convolution and attention mechanisms. Given
that the initial SMPL-X estimation might be inaccurate, we gradually refine it
with our HGM model. We validate our approach on several publicly available
datasets. Our method surpasses previous methods in both novel view synthesis
and surface reconstruction. Our approach also exhibits strong generalization
for cross-dataset evaluation and in-the-wild images.
comment: ICLR 2025: https://jinnan-chen.github.io/projects/HGM/
♻ ★ A nonlinear elasticity model in computer vision
The purpose of this paper is to analyze a nonlinear elasticity model
introduced by the authors for comparing two images, regarded as bounded open
subsets of $\mathbb{R}^n$ together with associated vector-valued intensity maps.
Optimal transformations between the images are sought as minimisers of an
integral functional among orientation-preserving homeomorphisms. The existence
of minimisers is proved under natural coercivity and polyconvexity conditions,
assuming only that the intensity functions are bounded measurable. Variants of
the existence theorem are also proved, first under the constraint that finite
sets of landmark points in the two images are mapped one to the other, and
second when one image is to be compared to an unknown part of another.
The question is studied of whether, for images related by an affine mapping,
the unique minimiser is given by that affine mapping. For a natural class of
functional integrands an example is given guaranteeing that this property holds
for pairs of images in which the second is a scaling of the first by a constant
factor. However for the property to hold for arbitrary pairs of affinely
related images it is shown that the integrand has to depend on the gradient of
the transformation as a convex function of its determinant alone. This suggests
a new model in which the integrand depends also on second derivatives of the
transformation, and an example is given for which both existence of minimisers
is assured and the above property holds for all pairs of affinely related
images.
comment: The paper has been substantially revised. In particular the section
on metrics has been rewritten to correct an error, and a new result added on
the existence of discrete morphing sequences in the mass-conserving case. In
the mass-conserving case there is a new formulation of the question
concerning whether the minimizing deformation for affinely related images is
the corresponding affine map
♻ ★ Semantic Shift Estimation via Dual-Projection and Classifier Reconstruction for Exemplar-Free Class-Incremental Learning ICML 2025
Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn
from distinct categories without retaining exemplars but easily suffers from
catastrophic forgetting of learned knowledge. While existing EFCIL methods
leverage knowledge distillation to alleviate forgetting, they still face two
critical challenges: semantic shift and decision bias. Specifically, the
embeddings of old tasks shift in the embedding space after learning new tasks,
and the classifier becomes biased towards new tasks due to training solely with
new data, thereby hindering the balance between old and new knowledge. To
address these issues, we propose the Dual-Projection Shift Estimation and
Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates
semantic shift through a dual-projection, which combines a learnable
transformation with a row-space projection to capture both task-wise and
category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs
ridge regression to reformulate classifier training as a reconstruction
process. This reconstruction exploits previous information encoded in
covariance and prototype of each class after calibration with estimated shift,
thereby reducing decision bias. Extensive experiments demonstrate that, across
various datasets, DPCR effectively balances old and new tasks, outperforming
state-of-the-art EFCIL methods.
comment: Accepted by ICML 2025
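The classifier-reconstruction step can be illustrated with closed-form ridge regression over accumulated feature statistics, as sketched below; the dual-projection shift calibration of those statistics is omitted and the dimensions are assumptions:
```python
# Ridge-regression classifier reconstruction from running feature statistics.
import torch

class RidgeClassifier:
    def __init__(self, dim, n_classes, lam=1e2):
        self.G = torch.zeros(dim, dim)        # running feature autocorrelation  X^T X
        self.C = torch.zeros(dim, n_classes)  # running feature/label correlation X^T Y
        self.lam = lam

    def accumulate(self, feats, labels):      # feats: (N, dim), labels: (N,) int
        y = torch.nn.functional.one_hot(labels, self.C.size(1)).float()
        self.G += feats.t() @ feats
        self.C += feats.t() @ y

    def reconstruct(self):                    # closed-form ridge solution
        dim = self.G.size(0)
        return torch.linalg.solve(self.G + self.lam * torch.eye(dim), self.C)

clf = RidgeClassifier(dim=512, n_classes=10)
clf.accumulate(torch.randn(256, 512), torch.randint(0, 10, (256,)))
W = clf.reconstruct()                         # logits = feats @ W
```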
♻ ★ DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration
Diffusion models have achieved remarkable progress in universal image
restoration. While existing methods speed up inference by reducing sampling
steps, substantial step intervals often introduce cumulative errors. Moreover,
they struggle to balance the commonality of degradation representations and
restoration quality. To address these challenges, we introduce
\textbf{DGSolver}, a diffusion generalist solver with universal posterior
sampling. We first derive the exact ordinary differential equations for
generalist diffusion models and tailor high-order solvers with a queue-based
accelerated sampling strategy to improve both accuracy and efficiency. We then
integrate universal posterior sampling to better approximate
manifold-constrained gradients, yielding a more accurate noise estimation and
correcting errors in inverse inference. Extensive experiments show that
DGSolver outperforms state-of-the-art methods in restoration accuracy,
stability, and scalability, both qualitatively and quantitatively. Code and
models will be available at https://github.com/MiliLab/DGSolver.
♻ ★ Transforming faces into video stories -- VideoFace2.0
Face detection and face recognition have been a focus of the vision
community since its very beginnings. Inspired by the success of the original
Videoface digitizer, a pioneering device that allowed users to capture video
signals from any source, we have designed an advanced video analytics tool to
efficiently create structured video stories, i.e. identity-based information
catalogs. VideoFace2.0 is the name of the developed system for spatial and
temporal localization of each unique face in the input video, i.e. face
re-identification (ReID), which also allows their cataloging, characterization
and creation of structured video outputs for later downstream tasks. The
developed near real-time solution is primarily designed to be used in application
scenarios involving TV production, media analysis, and as an efficient tool for
creating large video datasets necessary for training machine learning (ML)
models in challenging vision tasks such as lip reading and multimodal speech
recognition. The conducted experiments confirm the applicability of the proposed
face ReID algorithm, which combines the concepts of face detection, face
recognition and passive tracking-by-detection in order to achieve robust and
efficient face ReID. The system is envisioned as a compact and modular
extension of existing video production equipment. The presented results are
based on a test implementation that achieves 18-25 fps on a consumer-grade
notebook. Ablation experiments also confirmed that the proposed algorithm
yields a relative reduction in the number of false identities in the range of
73%-93%. We hope that the presented work and the shared code implementation
will stimulate further interest in the development of similar,
application-specific video analysis tools, and lower the entry barrier for
production of high-quality multi-modal datasets in the future.
comment: 4 Pages, 2 Figures, 1 Table, 1 Algorithm; Associated VideoFace2.0
code, test videos and results visualizations are available at
https://github.com/brkljac/VideoFace2.0 ; Preprint accepted for publication
at the 14th Mediterranean Conference on Embedded Computing (MECO), 10-14 June
2025, Budva, Montenegro
♻ ★ LUDO: Low-Latency Understanding of Deformable Objects using Point Cloud Occupancy Functions
Accurately determining the shape of objects and the location of their
internal structures within deformable objects is crucial for medical tasks that
require precise targeting, such as robotic biopsies. We introduce LUDO, a
method for accurate low-latency understanding of deformable objects. LUDO
reconstructs objects in their deformed state, including their internal
structures, from a single-view point cloud observation in under 30 ms using
occupancy networks. LUDO provides uncertainty estimates for its predictions.
Additionally, it provides explainability by highlighting key features in its
input observations. Both uncertainty and explainability are important for
safety-critical applications such as surgical interventions. We demonstrate
LUDO's abilities for autonomous targeting of internal regions of interest
(ROIs) in deformable objects. We evaluate LUDO in real-world robotic
experiments, achieving a success rate of 98.9% for puncturing various ROIs
inside deformable objects. LUDO demonstrates the potential to interact with
deformable objects without the need for deformable registration methods.
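A minimal occupancy-network sketch (not LUDO itself) showing how a latent code from a partial point cloud can be decoded into occupancy probabilities at arbitrary query points, which is the mechanism that enables reconstructing internal structures from a single view; sizes are assumptions:
```python
# Point-cloud encoder -> global latent code -> per-query occupancy probability.
import torch
import torch.nn as nn

class OccupancyNet(nn.Module):
    def __init__(self, latent=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent + 3, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, cloud, queries):            # cloud: (B, N, 3), queries: (B, Q, 3)
        z = self.enc(cloud).max(dim=1).values     # permutation-invariant global code
        z = z.unsqueeze(1).expand(-1, queries.size(1), -1)
        return torch.sigmoid(self.dec(torch.cat([z, queries], dim=-1)))  # (B, Q, 1)

net = OccupancyNet()
occ = net(torch.randn(2, 2048, 3), torch.rand(2, 4096, 3))  # occupancy at 3D query points
```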
♻ ★ Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization
Dense-localization Audio-Visual Events (DAVE) aims to identify time
boundaries and corresponding categories for events that are both audible and
visible in a long video, where events may co-occur and exhibit varying
durations. However, complex audio-visual scenes often involve asynchrony
between modalities, making accurate localization challenging. Existing DAVE
solutions extract audio and visual features through unimodal encoders, and fuse
them via dense cross-modal interaction. However, independent unimodal encoding
struggles to emphasize shared semantics between modalities without cross-modal
guidance, while dense cross-modal attention may over-attend to semantically
unrelated audio-visual features. To address these problems, we present LoCo, a
Locality-aware cross-modal Correspondence learning framework for DAVE. LoCo
leverages the local temporal continuity of audio-visual events as important
guidance to filter irrelevant cross-modal signals and enhance cross-modal
alignment throughout both unimodal and cross-modal encoding stages. i)
Specifically, LoCo applies Local Correspondence Feature (LCF) Modulation to
enforce unimodal encoders to focus on modality-shared semantics by modulating
agreement between audio and visual features based on local cross-modal
coherence. ii) To better aggregate cross-modal relevant features, we further
customize Local Adaptive Cross-modal (LAC) Interaction, which dynamically
adjusts attention regions in a data-driven manner. This adaptive mechanism
focuses attention on local event boundaries and accommodates varying event
durations. By incorporating LCF and LAC, LoCo provides solid performance gains
and outperforms existing DAVE methods.
♻ ★ Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models ICML 2025
Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Kening Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, Xuming Hu
Despite their impressive capabilities, multimodal large language models
(MLLMs) are prone to hallucinations, i.e., the generated content that is
nonsensical or unfaithful to input sources. Unlike in LLMs, hallucinations in
MLLMs often stem from the sensitivity of text decoder to visual tokens, leading
to a phenomenon akin to "amnesia" about visual information. To address this
issue, we propose MemVR, a novel decoding paradigm inspired by common
cognition: when the memory of an image seen a moment before fades, people
look at it again to answer factually. Following this principle, we
treat visual tokens as supplementary evidence, re-injecting them into the MLLM
through the Feed Forward Network (FFN) as "key-value memory" at the middle trigger
layer. This "look-twice" mechanism occurs when the model exhibits high
uncertainty during inference, effectively enhancing factual alignment.
Comprehensive experimental evaluations demonstrate that MemVR significantly
mitigates hallucination across various MLLMs and excels in general benchmarks
without incurring additional time overhead. The implementation is available
from https://github.com/1zhou-Wang/MemVR
comment: Accepted by ICML 2025
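A hedged sketch of the "look-twice" mechanism: when the next-token distribution is too uncertain, projected visual tokens are re-injected as extra key-value memory alongside an FFN's output. Projection shapes, the trigger threshold, and the FFN stand-in are assumptions, not the released implementation:
```python
# Entropy-triggered re-injection of visual tokens as key-value memory.
import torch
import torch.nn.functional as F

def entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-9).log()).sum(-1)

def ffn_with_visual_memory(h, ffn, vis_tokens, w_k, w_v, logits, tau=3.0):
    """h: (B, T, D) hidden states; vis_tokens: (B, V, Dv); w_k/w_v: (Dv, D) projections."""
    out = ffn(h)
    uncertain = entropy(logits[:, -1]) > tau              # only "look twice" when unsure
    if uncertain.any():
        keys, values = vis_tokens @ w_k, vis_tokens @ w_v  # (B, V, D)
        attn = F.softmax(h @ keys.transpose(1, 2) / h.size(-1) ** 0.5, dim=-1)
        out = out + uncertain.float().view(-1, 1, 1) * (attn @ values)
    return out

B, T, D, V, Dv, vocab = 2, 5, 512, 9, 1024, 32000
out = ffn_with_visual_memory(torch.randn(B, T, D), torch.nn.Identity(),
                             torch.randn(B, V, Dv), torch.randn(Dv, D),
                             torch.randn(Dv, D), torch.randn(B, T, vocab))
```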
♻ ★ SceneCraft: Layout-Guided 3D Scene Generation NeurIPS 2024
The creation of complex 3D scenes tailored to user specifications has been a
tedious and challenging task with traditional 3D modeling tools. Although some
pioneering methods have achieved automatic text-to-3D generation, they are
generally limited to small-scale scenes with restricted control over the shape
and texture. We introduce SceneCraft, a novel method for generating detailed
indoor scenes that adhere to textual descriptions and spatial layout
preferences provided by users. Central to our method is a rendering-based
technique, which converts 3D semantic layouts into multi-view 2D proxy maps.
Furthermore, we design a semantic and depth conditioned diffusion model to
generate multi-view images, which are used to learn a neural radiance field
(NeRF) as the final scene representation. Without the constraints of panorama
image generation, we surpass previous methods in supporting complicated indoor
space generation beyond a single room, even as complicated as a whole
multi-bedroom apartment with irregular shapes and layouts. Through experimental
analysis, we demonstrate that our method significantly outperforms existing
approaches in complex indoor scene generation with diverse textures, consistent
geometry, and realistic visual quality. Code and more results are available at:
https://orangesodahub.github.io/SceneCraft
comment: NeurIPS 2024. Code: https://github.com/OrangeSodahub/SceneCraft
Project Page: https://orangesodahub.github.io/SceneCraft
♻ ★ Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding NeurIPS 2024
Complex 3D scene understanding has gained increasing attention, with scene
encoding strategies playing a crucial role in this success. However, the
optimal scene encoding strategies for various scenarios remain unclear,
particularly compared to their image-based counterparts. To address this issue,
we present a comprehensive study that probes various visual encoding models for
3D scene understanding, identifying the strengths and limitations of each model
across different scenarios. Our evaluation spans seven vision foundation
encoders, including image-based, video-based, and 3D foundation models. We
evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual
Grounding, Segmentation, and Registration, each focusing on different aspects
of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates
superior performance, video models excel in object-level tasks, diffusion
models benefit geometric tasks, and language-pretrained models show unexpected
limitations in language-related tasks. These insights challenge some
conventional understandings, provide novel perspectives on leveraging visual
foundation models, and highlight the need for more flexible encoder selection
in future vision-language and scene-understanding tasks. Code:
https://github.com/YunzeMan/Lexicon3D
comment: NeurIPS 2024. Project page: https://yunzeman.github.io/lexicon3d
Github: https://github.com/YunzeMan/Lexicon3D
♻ ★ Leveraging Depth Maps and Attention Mechanisms for Enhanced Image Inpainting
Existing deep learning-based image inpainting methods typically rely on
convolutional networks with RGB images to reconstruct images. However, relying
exclusively on RGB images may neglect important depth information, which plays
a critical role in understanding the spatial and structural context of a scene.
Just as human vision leverages stereo cues to perceive depth, incorporating
depth maps into the inpainting process can enhance the model's ability to
reconstruct images with greater accuracy and contextual awareness. In this
paper, we propose a novel approach that incorporates both RGB and depth images
for enhanced image inpainting. Our models employ a dual encoder architecture,
where one encoder processes the RGB image and the other handles the depth
image. The encoded features from both encoders are then fused in the decoder
using an attention mechanism, effectively integrating the RGB and depth
representations. We use two different masking strategies, line and square, to
test the robustness of the model under different types of occlusions. To
further analyze the effectiveness of our approach, we use Gradient-weighted
Class Activation Mapping (Grad-CAM) visualizations to examine the regions of
interest the model focuses on during inpainting. We show that incorporating
depth information alongside the RGB image significantly improves the
reconstruction quality. Through both qualitative and quantitative comparisons,
we demonstrate that the depth-integrated model outperforms the baseline, with
attention mechanisms further enhancing inpainting performance, as evidenced by
multiple evaluation metrics and visualization.
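A compact sketch of the dual-encoder design with attention-based fusion described above; layer sizes and the exact fusion placement are illustrative assumptions, not the paper's architecture:
```python
# RGB (+mask) and depth encoders fused with cross-attention before decoding.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU())

class DualEncoderInpainter(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_enc = nn.Sequential(conv_block(4, 64), conv_block(64, 128))   # RGB + mask
        self.depth_enc = nn.Sequential(conv_block(1, 64), conv_block(64, 128))
        self.fuse = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, rgb, mask, depth):
        fr = self.rgb_enc(torch.cat([rgb * mask, mask], dim=1))   # masked RGB features
        fd = self.depth_enc(depth)
        b, c, h, w = fr.shape
        q = fr.flatten(2).transpose(1, 2)                          # RGB tokens attend to depth
        kv = fd.flatten(2).transpose(1, 2)
        fused, _ = self.fuse(q, kv, kv)
        return self.dec(fused.transpose(1, 2).reshape(b, c, h, w))

net = DualEncoderInpainter()
out = net(torch.rand(1, 3, 64, 64), torch.ones(1, 1, 64, 64), torch.rand(1, 1, 64, 64))
```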
♻ ★ DejAIvu: Identifying and Explaining AI Art on the Web in Real-Time with Saliency Maps IJCAI 2025
The recent surge in advanced generative models, such as diffusion models and
generative adversarial networks (GANs), has led to an alarming rise in
AI-generated images across various domains on the web. While such technologies
offer benefits such as democratizing artistic creation, they also pose
challenges in misinformation, digital forgery, and authenticity verification.
Additionally, the uncredited use of AI-generated images in media and marketing
has sparked significant backlash from online communities. In response to this,
we introduce DejAIvu, a Chrome Web extension that combines real-time
AI-generated image detection with saliency-based explainability while users
browse the web. Using an ONNX-optimized deep learning model, DejAIvu
automatically analyzes images on websites such as Google Images, identifies
AI-generated content using model inference, and overlays a saliency heatmap to
highlight AI-related artifacts. Our approach integrates efficient in-browser
inference, gradient-based saliency analysis, and a seamless user experience,
ensuring that AI detection is both transparent and interpretable. We also
evaluate DejAIvu across multiple pretrained architectures and benchmark
datasets, demonstrating high accuracy and low latency, making it a practical
and deployable tool for enhancing AI image accountability. The code for this
system can be found at https://github.com/Noodulz/dejAIvu.
comment: 5 pages, 3 figures. Accepted to IJCAI 2025 Demo Track. Revised
version will be uploaded soon
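A Python analogue of the detect-and-explain loop (the extension itself runs ONNX in-browser); the model path is hypothetical, and a simple occlusion map stands in here for the gradient-based saliency used by the system:
```python
# ONNX inference plus a coarse occlusion-based importance heatmap.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("ai_image_detector.onnx")      # hypothetical exported detector
inp_name = sess.get_inputs()[0].name

def ai_score(img):                                          # img: (1, 3, 224, 224) float32
    return float(sess.run(None, {inp_name: img})[0].ravel()[0])

def occlusion_saliency(img, patch=28):
    base = ai_score(img)
    heat = np.zeros((img.shape[2] // patch, img.shape[3] // patch), dtype=np.float32)
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = img.copy()
            occluded[:, :, i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
            heat[i, j] = base - ai_score(occluded)          # score drop = region importance
    return base, heat
```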
♻ ★ LIVS: A Pluralistic Alignment Dataset for Inclusive Public Spaces ICML 2025
We introduce the Local Intersectional Visual Spaces (LIVS) dataset, a
benchmark for multi-criteria alignment, developed through a two-year
participatory process with 30 community organizations to support the
pluralistic alignment of text-to-image (T2I) models in inclusive urban
planning. The dataset encodes 37,710 pairwise comparisons across 13,462 images,
structured along six criteria (Accessibility, Safety, Comfort, Invitingness,
Inclusivity, and Diversity) derived from 634 community-defined concepts. Using
Direct Preference Optimization (DPO), we fine-tune Stable Diffusion XL to
reflect multi-criteria spatial preferences and evaluate the LIVS dataset and
the fine-tuned model through four case studies: (1) DPO increases alignment
with annotated preferences, particularly when annotation volume is high; (2)
preference patterns vary across participant identities, underscoring the need
for intersectional data; (3) human-authored prompts generate more distinctive
visual outputs than LLM-generated ones, influencing annotation decisiveness;
and (4) intersectional groups assign systematically different ratings across
criteria, revealing the limitations of single-objective alignment. While DPO
improves alignment under specific conditions, the prevalence of neutral ratings
indicates that community values are heterogeneous and often ambiguous. LIVS
provides a benchmark for developing T2I models that incorporate local,
stakeholder-driven preferences, offering a foundation for context-aware
alignment in spatial design.
comment: ICML 2025
♻ ★ Quaternionic Reweighted Amplitude Flow for Phase Retrieval in Image Reconstruction
Quaternionic signal processing provides powerful tools for efficiently
managing color signals by preserving the intrinsic correlations among signal
dimensions through quaternion algebra. In this paper, we address the
quaternionic phase retrieval problem by systematically developing novel
algorithms based on an amplitude-based model. Specifically, we propose the
Quaternionic Reweighted Amplitude Flow (QRAF) algorithm, which is further
enhanced by three of its variants: incremental, accelerated, and adapted QRAF
algorithms. In addition, we introduce the Quaternionic Perturbed Amplitude Flow
(QPAF) algorithm, which has linear convergence. Extensive numerical experiments
on both synthetic data and real images demonstrate that our proposed methods
significantly improve recovery performance and computational efficiency
compared to state-of-the-art approaches.
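For orientation, the amplitude-based model referred to above follows the generic pattern below, written schematically; in QRAF the signal $x$ and sampling vectors $a_i$ are quaternion-valued (with quaternion conjugation in the inner product), and the data-adaptive weights $w_i$ follow the paper's reweighting rule, which is not reproduced here:
```latex
% Schematic amplitude-based objective (generic notation, not copied from the paper).
\min_{x}\; f(x) \;=\; \frac{1}{2m}\sum_{i=1}^{m} w_i \,\bigl(\,\lvert \langle a_i, x\rangle \rvert \;-\; b_i \,\bigr)^{2}
```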
♻ ★ Boosting Adverse Weather Crowd Counting via Multi-queue Contrastive Learning
Currently, most crowd counting methods have outstanding performance under
normal weather conditions. However, our experimental validation reveals two key
obstacles limiting the accuracy improvement of crowd counting models: 1) the
domain gap between the adverse weather and the normal weather images; 2) the
weather class imbalance in the training set. To address the problems, we
propose a two-stage crowd counting method named Multi-queue Contrastive
Learning (MQCL). Specifically, in the first stage, our target is to equip the
backbone network with weather-awareness capabilities. In this process, a
contrastive learning method we designed, named multi-queue MoCo, is employed
to enable representation learning under weather class imbalance. After the
first stage is completed, the backbone model is "mature" enough to extract
weather-related representations. On this basis, we proceed to the second stage,
in which we propose to refine the representations under the guidance of
contrastive learning, enabling the conversion of the weather-aware
representations to the normal weather domain. Through such representation and
conversion, the model achieves robust counting performance under both normal
and adverse weather conditions. Extensive experimental results show that,
compared to the baseline, MQCL reduces the counting error under adverse weather
conditions by 22%, while introducing only about 13% increase in computational
burden, which achieves state-of-the-art performance.
comment: 8 pages, 5 figures
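A hedged sketch of a multi-queue MoCo-style contrastive step: one negative queue per weather class so that minority weather conditions retain enough negatives; queue size, feature dimension, and temperature are illustrative, not the paper's settings:
```python
# Per-weather-class negative queues with a standard MoCo-style InfoNCE loss.
import torch
import torch.nn.functional as F

class MultiQueue:
    def __init__(self, n_classes, dim=128, size=4096):
        self.queues = [F.normalize(torch.randn(size, dim), dim=1) for _ in range(n_classes)]

    def negatives(self):                          # concatenate all per-class queues
        return torch.cat(self.queues, dim=0)

    def enqueue(self, cls, keys):
        q = torch.cat([keys.detach(), self.queues[cls]], dim=0)
        self.queues[cls] = q[: self.queues[cls].size(0)]

def moco_loss(q, k, queue, temperature=0.07):
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    pos = (q * k).sum(dim=1, keepdim=True)                      # (B, 1) positive logits
    neg = q @ queue.negatives().t()                             # (B, K_total) negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)           # positives sit at index 0
    return F.cross_entropy(logits, labels)

queue = MultiQueue(n_classes=4)                                 # e.g. normal, rain, haze, snow
loss = moco_loss(torch.randn(32, 128), torch.randn(32, 128), queue)
queue.enqueue(1, F.normalize(torch.randn(32, 128), dim=1))
```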
♻ ★ Semi-supervised Underwater Image Enhancement Using A Physics-Aware Triple-Stream Network
Underwater images normally suffer from degradation due to the transmission
medium of water bodies. Both traditional prior-based approaches and deep
learning-based methods have been used to address this problem. However, the
inflexible assumption of the former often impairs their effectiveness in
handling diverse underwater scenes, while the generalization of the latter to
unseen images is usually weakened by insufficient data. In this study, we
leverage both the physics-based Image Formation Model (IFM) and deep learning
techniques for Underwater Image Enhancement (UIE). To this end, we propose a
novel Physics-Aware Triple-Stream Underwater Image Enhancement Network, i.e.,
PATS-UIENet, which comprises a Direct Signal Transmission Estimation Stream
(D-Stream), a Backscatter Signal Transmission Estimation Stream (B-Stream) and
an Ambient Light Estimation Stream (A-Stream). This network fulfills the UIE
task by explicitly estimating the degradation parameters of a revised IFM. We
also adopt an IFM-inspired semi-supervised learning framework, which exploits
both the labeled and unlabeled images, to address the issue of insufficient
data. To our knowledge, such a physics-aware deep network and the IFM-inspired
semi-supervised learning framework have not been used for the UIE task before.
Our method performs better than, or at least comparably to, sixteen baselines
across six testing sets in the degradation estimation and UIE tasks. These
promising results are likely due to the fact that the proposed method can not
only model the degradation but also learn the characteristics of diverse
underwater scenes.
comment: 13 pages, 10 figures
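For context, the revised image formation model that the three streams estimate can be written schematically per colour channel $c$ as below (the notation is generic, not necessarily the paper's): the D-Stream estimates the direct transmission, the B-Stream the backscatter transmission, and the A-Stream the ambient light:
```latex
% Revised underwater image formation model (schematic):
% J_c is the restored scene radiance, T^D_c and T^B_c are the direct and
% backscatter transmissions, and A_c is the ambient light.
I_c(x) \;=\; J_c(x)\, T^{D}_{c}(x) \;+\; A_c \bigl(1 - T^{B}_{c}(x)\bigr)
```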
♻ ★ IntelliCardiac: An Intelligent Platform for Cardiac Image Segmentation and Classification
Ting Yu Tsai, An Yu, Meghana Spurthi Maadugundu, Ishrat Jahan Mohima, Umme Habiba Barsha, Mei-Hwa F. Chen, Balakrishnan Prabhakaran, Ming-Ching Chang
Precise and effective processing of cardiac imaging data is critical for the
identification and management of the cardiovascular diseases. We introduce
IntelliCardiac, a comprehensive, web-based medical image processing platform
for the automatic segmentation of 4D cardiac images and disease classification,
utilizing an AI model trained on the publicly accessible ACDC dataset. The
system, intended for patients, cardiologists, and healthcare professionals,
offers an intuitive interface and uses deep learning models to identify
essential heart structures and categorize cardiac diseases. The system supports
analysis of both the right and left ventricles as well as the myocardium, and
then classifies a patient's cardiac images into five diagnostic categories: dilated
cardiomyopathy, myocardial infarction, hypertrophic cardiomyopathy, right
ventricular abnormality, and no disease. IntelliCardiac combines a deep
learning-based segmentation model with a two-step classification pipeline. The
segmentation module achieves an overall accuracy of 92.6%. The classification
module, trained on characteristics taken from segmented heart structures,
achieves 98% accuracy in five categories. These results exceed the performance
of the existing state-of-the-art methods that integrate both segmentation and
classification models. IntelliCardiac, which supports real-time visualization,
workflow integration, and AI-assisted diagnostics, has great potential as a
scalable, accurate tool for clinical decision assistance in cardiac imaging and
diagnosis.
♻ ★ HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning
Tianyi Chen, Xiaoyi Qu, David Aponte, Colby Banbury, Jongwoo Ko, Tianyu Ding, Yong Ma, Vladimir Lyapunov, Ilya Zharkov, Luming Liang
Structured pruning is one of the most popular approaches to effectively
compress the heavy deep neural networks (DNNs) into compact sub-networks while
retaining performance. The existing methods suffer from multi-stage procedures
along with significant engineering efforts and human expertise. The
Only-Train-Once (OTO) series has recently been proposed to resolve many of these
pain points by streamlining the workflow, automatically conducting (i) search
space generation, (ii) structured sparse optimization, and (iii) sub-network
construction. However, the built-in sparse optimizers in the OTO series, i.e.,
the Half-Space Projected Gradient (HSPG) family, have limitations: they require
hyper-parameter tuning and offer only implicit control of the sparsity
exploration, and consequently require intervention by human experts. To address such
limitations, we propose a Hybrid Efficient Structured Sparse Optimizer (HESSO).
HESSO could automatically and efficiently train a DNN to produce a
high-performing subnetwork. Meanwhile, it is almost tuning-free and enjoys
user-friendly integration for generic training applications. To address another
common issue of irreversible performance collapse observed in pruning DNNs, we
further propose a Corrective Redundant Identification Cycle (CRIC) for reliably
identifying indispensable structures. We numerically demonstrate the efficacy
of HESSO and its enhanced version HESSO-CRIC on a variety of applications
ranging from computer vision to natural language processing, including large
language models. The numerical results show that HESSO achieves competitive or
even superior performance compared with various state-of-the-art methods and supports
most DNN architectures. Meanwhile, CRIC can effectively prevent the
irreversible performance collapse and further enhance the performance of HESSO
on certain applications.
comment: 19 pages, 6 figures
♻ ★ FieldNet: Efficient Real-Time Shadow Removal for Enhanced Vision in Field Robotics
Shadows significantly hinder computer vision tasks in outdoor environments,
particularly in field robotics, where varying lighting conditions complicate
object detection and localisation. We present FieldNet, a novel deep learning
framework for real-time shadow removal, optimised for resource-constrained
hardware. FieldNet introduces a probabilistic enhancement module and a novel
loss function to address challenges of inconsistent shadow boundary supervision
and artefact generation, achieving enhanced accuracy and simplicity without
requiring shadow masks during inference. Trained on a dataset of 10,000 natural
images augmented with synthetic shadows, FieldNet outperforms state-of-the-art
methods on benchmark datasets (ISTD, ISTD+, SRD), with up to 9x speed
improvements (66 FPS on Nvidia 2080Ti) and superior shadow removal quality
(PSNR: 38.67, SSIM: 0.991). Real-world case studies in precision agriculture
robotics demonstrate the practical impact of FieldNet in enhancing weed
detection accuracy. These advancements establish FieldNet as a robust,
efficient solution for real-time vision tasks in field robotics and beyond.
comment: 22 pages, 9 figures, 8 tables. Published at Expert Systems with
Applications