Computer Vision and Pattern Recognition
★ DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos
Chieh Hubert Lin, Zhaoyang Lv, Songyin Wu, Zhen Xu, Thu Nguyen-Phuoc, Hung-Yu Tseng, Julian Straub, Numair Khan, Lei Xiao, Ming-Hsuan Yang, Yuheng Ren, Richard Newcombe, Zhao Dong, Zhengqin Li
We introduce the Deformable Gaussian Splats Large Reconstruction Model
(DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian
splats from a monocular posed video of any dynamic scene. Feed-forward scene
reconstruction has gained significant attention for its ability to rapidly
create digital replicas of real-world environments. However, most existing
models are limited to static scenes and fail to reconstruct the motion of
moving objects. Developing a feed-forward model for dynamic scene
reconstruction poses significant challenges, including the scarcity of training
data and the need for appropriate 3D representations and training paradigms. To
address these challenges, we introduce several key technical contributions: an
enhanced large-scale synthetic dataset with ground-truth multi-view videos and
dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian
representation that is easy to learn, supports high-quality dynamic view
synthesis, and enables long-range 3D tracking; and a large transformer network
that achieves real-time, generalizable dynamic scene reconstruction. Extensive
qualitative and quantitative experiments demonstrate that DGS-LRM achieves
dynamic scene reconstruction quality comparable to optimization-based methods,
while significantly outperforming the state-of-the-art predictive dynamic
reconstruction method on real-world examples. Its predicted physically grounded
3D deformation is accurate and can readily adapt for long-range 3D tracking
tasks, achieving performance on par with state-of-the-art monocular video 3D
tracking methods.
comment: Project page: https://hubert0527.github.io/dgslrm/
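The per-pixel deformable Gaussian representation described above can be pictured as one 3D Gaussian per input pixel plus a dense per-timestep 3D scene-flow offset that moves its center. The sketch below is an illustrative reconstruction under assumed tensor shapes, not the authors' code; reading out a Gaussian's center across timesteps is what makes long-range 3D tracking a direct byproduct.

```python
# Minimal sketch (not the authors' code) of a per-pixel deformable 3D Gaussian
# representation: each pixel of an input frame predicts one Gaussian plus a
# per-timestep 3D scene-flow offset that deforms its center over time.
import torch

def deform_gaussians(centers, flows, t):
    """Translate Gaussian centers to timestep t using predicted scene flow.

    centers: (N, 3) canonical Gaussian centers (one per pixel)
    flows:   (N, T, 3) per-timestep 3D scene-flow offsets
    t:       integer timestep index
    """
    return centers + flows[:, t]          # per-Gaussian translation

# Toy example: H x W pixels, T timesteps; shapes are assumptions.
H, W, T = 4, 4, 8
N = H * W
gaussians = {
    "center":   torch.randn(N, 3),        # 3D position per pixel
    "rotation": torch.randn(N, 4),        # quaternion
    "scale":    torch.rand(N, 3),         # anisotropic scale
    "opacity":  torch.rand(N, 1),
    "color":    torch.rand(N, 3),
    "flow":     torch.randn(N, T, 3),     # dense 3D scene flow (deformation)
}

centers_t = deform_gaussians(gaussians["center"], gaussians["flow"], t=3)
print(centers_t.shape)                    # torch.Size([16, 3])
```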
★ PlayerOne: Egocentric World Simulator
We introduce PlayerOne, the first egocentric realistic world simulator,
facilitating immersive and unrestricted exploration within vividly dynamic
environments. Given an egocentric scene image from the user, PlayerOne can
accurately construct the corresponding world and generate egocentric videos
that are strictly aligned with the user's real-world motion as captured by an
exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that
first performs pretraining on large-scale egocentric text-video pairs for
coarse-level egocentric understanding, followed by finetuning on synchronous
motion-video data extracted from egocentric-exocentric video datasets with our
automatic construction pipeline. Moreover, considering the varying importance of
different components, we design a part-disentangled motion injection scheme,
enabling precise control of part-level movements. In addition, we devise a
joint reconstruction framework that progressively models both the 4D scene and
video frames, ensuring scene consistency in the long-form video generation.
Experimental results demonstrate its strong generalization ability in precise
control of varying human movements and world-consistent modeling of diverse
scenarios. It marks the first endeavor into egocentric real-world simulation
and can pave the way for the community to delve into fresh frontiers of world
modeling and its diverse applications.
comment: Project page: https://playerone-hku.github.io/
★ Text-Aware Image Restoration with Diffusion Models
Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim
Image restoration aims to recover high-quality images from degraded inputs.
However, existing diffusion-based restoration methods, despite great success in
natural image restoration, often struggle to faithfully reconstruct textual
regions in degraded images. These methods frequently generate plausible but incorrect
text-like patterns, a phenomenon we refer to as text-image hallucination. In
this paper, we introduce Text-Aware Image Restoration (TAIR), a novel
restoration task that requires the simultaneous recovery of visual contents and
textual fidelity. To tackle this task, we present SA-Text, a large-scale
benchmark of 100K high-quality scene images densely annotated with diverse and
complex text instances. Furthermore, we propose a multi-task diffusion
framework, called TeReDiff, that integrates internal features from diffusion
models into a text-spotting module, enabling both components to benefit from
joint training. This allows for the extraction of rich text representations,
which are utilized as prompts in subsequent denoising steps. Extensive
experiments demonstrate that our approach consistently outperforms
state-of-the-art restoration methods, achieving significant gains in text
recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/
comment: Project page: https://cvlab-kaist.github.io/TAIR/
★ Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
Wenbo Zhang, Tianrun Hu, Yanyuan Qiao, Hanbo Zhang, Yuchu Qin, Yang Li, Jiajun Liu, Tao Kong, Lingqiao Liu, Xiao Ma
We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built
upon Trajectory Autoregressive Modeling. Unlike conventional approaches that
predict the next action(s) forward in time, CoA generates an entire trajectory by
explicit backward reasoning with task-specific goals through an action-level
Chain-of-Thought (CoT) process. This process is unified within a single
autoregressive structure: (1) the first token corresponds to a stable keyframe
action that encodes the task-specific goals; and (2) subsequent action tokens
are generated autoregressively, conditioned on the initial keyframe and
previously predicted actions. This backward action reasoning enforces a
global-to-local structure, allowing each local action to be tightly constrained
by the final goal. To further realize the action reasoning structure, CoA
incorporates four complementary designs: continuous action token
representation; dynamic stopping for variable-length trajectory generation;
reverse temporal ensemble; and multi-token prediction to balance action chunk
modeling with global structure. As a result, CoA exhibits strong spatial
generalization capabilities while preserving the flexibility and simplicity of
a visuo-motor policy. Empirically, we observe that CoA achieves state-of-the-art
performance across 60 RLBench tasks and 8 real-world manipulation tasks.
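The goal-first decoding described above can be sketched as follows: a stand-in policy first emits a keyframe (goal) action token, then generates the remaining actions conditioned on the keyframe and previously predicted actions, with a learned stop score providing dynamic stopping. All module names, sizes, and the stopping rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of goal-first autoregressive trajectory decoding (not the
# authors' code). The first token is the keyframe/goal action; later tokens are
# decoded conditioned on it, with a stop score giving variable-length output.
import torch
import torch.nn as nn

class ToyTrajectoryDecoder(nn.Module):
    def __init__(self, obs_dim=32, act_dim=7, hidden=64):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden)
        self.step = nn.GRUCell(act_dim, hidden)
        self.act_head = nn.Linear(hidden, act_dim)
        self.stop_head = nn.Linear(hidden, 1)

    def forward(self, obs, max_len=16, stop_thresh=0.5):
        h = torch.tanh(self.encode(obs))            # encode the observation
        keyframe = self.act_head(h)                 # first token: keyframe/goal action
        actions, a = [keyframe], keyframe
        for _ in range(max_len - 1):                # later tokens condition on the
            h = self.step(a, h)                     # keyframe and previous actions
            a = self.act_head(h)
            actions.append(a)
            if torch.sigmoid(self.stop_head(h)).item() > stop_thresh:
                break                               # dynamic stopping
        return torch.stack(actions, dim=0)          # (L, 1, act_dim) trajectory

traj = ToyTrajectoryDecoder()(torch.randn(1, 32))
print(traj.shape)
```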
★ Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes CVPR 2025
We study the problem of making 3D scene reconstructions interactive by asking
the following question: can we predict the sounds of human hands physically
interacting with a scene? First, we record a video of a human manipulating
objects within a 3D scene using their hands. We then use these action-sound
pairs to train a rectified flow model to map 3D hand trajectories to their
corresponding audio. At test time, a user can query the model for other
actions, parameterized as sequences of hand poses, to estimate their
corresponding sounds. In our experiments, we find that our generated sounds
accurately convey material properties and actions, and that they are often
indistinguishable to human observers from real sounds. Project page:
https://www.yimingdou.com/hearing_hands/
comment: CVPR 2025, Project page: https://www.yimingdou.com/hearing_hands/ ,
Code: https://github.com/Dou-Yiming/hearing_hands/
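The rectified-flow objective named above has a standard form: interpolate linearly between noise and the target audio representation and regress the constant straight-line velocity, conditioned on hand poses. The sketch below uses assumed shapes and a toy conditioning network; it is not the paper's model.

```python
# Hedged sketch of a conditional rectified-flow training step (shapes and the
# network are assumptions): regress the straight-line velocity from noise to an
# audio spectrogram, conditioned on hand-pose features.
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    def __init__(self, audio_dim=128, pose_dim=48, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + pose_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, audio_dim),
        )

    def forward(self, x_t, t, pose):
        return self.net(torch.cat([x_t, pose, t], dim=-1))

def rectified_flow_loss(model, audio, pose):
    noise = torch.randn_like(audio)                 # x_0 ~ N(0, I)
    t = torch.rand(audio.shape[0], 1)               # uniform time in [0, 1]
    x_t = (1.0 - t) * noise + t * audio             # straight-line interpolation
    target_v = audio - noise                        # constant target velocity
    pred_v = model(x_t, t, pose)
    return ((pred_v - target_v) ** 2).mean()

model = ToyVelocityNet()
loss = rectified_flow_loss(model, torch.randn(8, 128), torch.randn(8, 48))
loss.backward()
print(float(loss))
```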
★ EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits
Text-guided image editing, fueled by recent advancements in generative AI, is
becoming increasingly widespread. This trend highlights the need for a
comprehensive framework to verify text-guided edits and assess their quality.
To address this need, we introduce EditInspector, a novel benchmark for
evaluation of text-guided image edits, based on human annotations collected
using an extensive template for edit verification. We leverage EditInspector to
evaluate the performance of state-of-the-art (SoTA) vision and language models
in assessing edits across various dimensions, including accuracy, artifact
detection, visual quality, seamless integration with the image scene, adherence
to common sense, and the ability to describe edit-induced changes. Our findings
indicate that current models struggle to evaluate edits comprehensively and
frequently hallucinate when describing the changes. To address these
challenges, we propose two novel methods that outperform SoTA models in both
artifact detection and difference caption generation.
★ A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs
Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, Mahmoud Assran
Existing benchmarks for assessing the spatio-temporal understanding and
reasoning abilities of video language models are susceptible to score inflation
due to the presence of shortcut solutions based on superficial visual or
textual cues. This paper mitigates the challenges in accurately assessing model
performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple
shortcut-aware video QA benchmark for assessing the physical understanding of
video language models. The benchmark comprises 55K high-quality
multiple-choice video QA examples focusing on physical world understanding.
Examples are curated from nine video data sources, spanning first-person
egocentric and exocentric videos, robotic interaction data, and cognitive
science intuitive physics benchmarks. To mitigate shortcut solutions that rely
on superficial visual or textual cues and biases, each sample in MVP has a
minimal-change pair -- a visually similar video accompanied by an identical
question but an opposing answer. To answer a question correctly, a model must
provide correct answers for both examples in the minimal-change pair; as such,
models that solely rely on visual or textual biases would achieve below random
performance. Human performance on MVP is 92.9\%, while the best open-source
state-of-the-art video-language model achieves 40.2\% compared to random
performance at 25\%.
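The paired scoring rule described above (credit only when both examples of a minimal-change pair are answered correctly) can be computed as below; the field names are illustrative, not the benchmark's official schema.

```python
# Paired accuracy for minimal-change pairs: a pair counts as correct only if
# both of its examples are answered correctly. Field names are illustrative.
def paired_accuracy(predictions, answers, pair_ids):
    """predictions/answers: dicts example_id -> choice; pair_ids: dict
    example_id -> pair_id grouping the two examples of each minimal pair."""
    pairs = {}
    for ex_id, pid in pair_ids.items():
        pairs.setdefault(pid, []).append(predictions[ex_id] == answers[ex_id])
    correct_pairs = sum(all(flags) for flags in pairs.values())
    return correct_pairs / len(pairs)

# Toy example: one pair fully correct, one only half correct.
preds = {"a1": "B", "a2": "C", "b1": "A", "b2": "D"}
gold  = {"a1": "B", "a2": "C", "b1": "A", "b2": "B"}
pairs = {"a1": "p1", "a2": "p1", "b1": "p2", "b2": "p2"}
print(paired_accuracy(preds, gold, pairs))   # 0.5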
★ InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Dahua Lin
End-to-end human animation with rich multi-modal conditions, e.g., text,
image, and audio, has achieved remarkable advancements in recent years. However,
most existing methods can only animate a single subject and inject conditions
in a global manner, ignoring scenarios in which multiple concepts appear in
the same video with rich human-human and human-object interactions. Such a
global assumption prevents precise, per-identity control of multiple concepts,
including humans and objects, and therefore hinders applications. In this work,
we discard the single-entity assumption and
introduce a novel framework that enforces strong, region-specific binding of
conditions from modalities to each identity's spatiotemporal footprint. Given
reference images of multiple concepts, our method can automatically infer
layout information by leveraging a mask predictor to match appearance cues
between the denoised video and each reference appearance. Furthermore, we
inject the local audio condition into its corresponding region to ensure
layout-aligned modality matching in an iterative manner. This design enables the
high-quality generation of controllable multi-concept human-centric videos.
Empirical results and ablation studies validate the effectiveness of our
explicit layout control for multi-modal conditions compared to implicit
counterparts and other existing methods.
comment: TL;DR: The first multi-person dialogue video generation method from
pairs of reference image and audio via explicit layout-aligned condition
injection. See project page https://zhenzhiwang.github.io/interacthuman/ for
more details
★ V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas
A major challenge for modern AI is to learn to understand the world and learn
to act largely by observation. This paper explores a self-supervised approach
that combines internet-scale video data with a small amount of interaction data
(robot trajectories), to develop models capable of understanding, predicting,
and planning in the physical world. We first pre-train an action-free
joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset
comprising over 1 million hours of internet video. V-JEPA 2 achieves strong
performance on motion understanding (77.3 top-1 accuracy on Something-Something
v2) and state-of-the-art performance on human action anticipation (39.7
recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models.
Additionally, after aligning V-JEPA 2 with a large language model, we
demonstrate state-of-the-art performance on multiple video question-answering
tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on
TempCompass). Finally, we show how self-supervised learning can be applied to
robotic planning tasks by post-training a latent action-conditioned world
model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the
Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different
labs and enable picking and placing of objects using planning with image goals.
Notably, this is achieved without collecting any data from the robots in these
environments, and without any task-specific training or reward. This work
demonstrates how self-supervised learning from web-scale data and a small
amount of robot interaction data can yield a world model capable of planning in
the physical world.
comment: 48 pages, 19 figures
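The planning-with-image-goals step described above can be pictured as search over action sequences in the latent space of the world model. The sketch below uses the cross-entropy method and stand-in encoder/dynamics functions purely for illustration; the actual V-JEPA 2-AC planner may differ, and every name here is an assumption.

```python
# Hedged sketch of goal-image planning with a latent action-conditioned world
# model (all modules are stand-ins, not V-JEPA 2-AC). The cross-entropy method
# searches for actions whose predicted latent rollout ends closest to the goal.
import torch

def cem_plan(encoder, world_model, obs_img, goal_img, horizon=5, act_dim=7,
             pop=64, elites=8, iters=4):
    z0, z_goal = encoder(obs_img), encoder(goal_img)
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        actions = mean + std * torch.randn(pop, horizon, act_dim)   # sample plans
        z = z0.expand(pop, -1)
        for t in range(horizon):                                    # latent rollout
            z = world_model(z, actions[:, t])
        cost = ((z - z_goal) ** 2).sum(dim=-1)                      # distance to goal
        elite = actions[cost.topk(elites, largest=False).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4      # refit distribution
    return mean[0]                                                  # execute first action

# Toy stand-ins for the encoder and latent dynamics.
enc = lambda img: img.flatten()[:16]
dyn = lambda z, a: z + 0.1 * a[..., :16].sum(dim=-1, keepdim=True)
action = cem_plan(enc, dyn, torch.randn(3, 8, 8), torch.randn(3, 8, 8))
print(action.shape)   # torch.Size([7])
```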
★ AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
Recent advances in 4D content generation have attracted increasing attention,
yet creating high-quality animated 3D models remains challenging due to the
complexity of modeling spatio-temporal distributions and the scarcity of 4D
training data. In this paper, we present AnimateAnyMesh, the first feed-forward
framework that enables efficient text-driven animation of arbitrary 3D meshes.
Our approach leverages a novel DyMeshVAE architecture that effectively
compresses and reconstructs dynamic mesh sequences by disentangling spatial and
temporal features while preserving local topological structures. To enable
high-quality text-conditional generation, we employ a Rectified Flow-based
training strategy in the compressed latent space. Additionally, we contribute
the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text
annotations. Experimental results demonstrate that our method generates
semantically accurate and temporally coherent mesh animations in a few seconds,
significantly outperforming existing approaches in both quality and efficiency.
Our work marks a substantial step forward in making 4D content creation more
accessible and practical. All data, code, and models will be publicly released.
comment: Project Page: https://animateanymesh.github.io/AnimateAnyMesh/
★ ReSim: Reliable World Simulation for Autonomous Driving
Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, Li Chen
How can we reliably simulate future driving scenarios under a wide range of
ego driving behaviors? Recent driving world models, developed exclusively on
real-world driving data composed mainly of safe expert trajectories, struggle
to follow hazardous or non-expert behaviors, which are rare in such data. This
limitation restricts their applicability to tasks such as policy evaluation. In
this work, we address this challenge by enriching real-world human
demonstrations with diverse non-expert data collected from a driving simulator
(e.g., CARLA), and building a controllable world model trained on this
heterogeneous corpus. Starting with a video generator featuring a diffusion
transformer architecture, we devise several strategies to effectively integrate
conditioning signals and improve prediction controllability and fidelity. The
resulting model, ReSim, enables Reliable Simulation of diverse open-world
driving scenarios under various actions, including hazardous non-expert ones.
To close the gap between high-fidelity simulation and applications that require
reward signals to judge different actions, we introduce a Video2Reward module
that estimates a reward from ReSim's simulated future. Our ReSim paradigm
achieves up to 44% higher visual fidelity, improves controllability for both
expert and non-expert actions by over 50%, and boosts planning and policy
selection performance on NAVSIM by 2% and 25%, respectively.
comment: Project page: https://opendrivelab.com/ReSim
★ Efficient Part-level 3D Object Generation via Dual Volume Packing
Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li, Fangyin Wei, Shuran Song, Gang Zeng, Ming-Yu Liu, Tsung-Yi Lin
Recent progress in 3D object generation has greatly improved both the quality
and efficiency. However, most existing methods generate a single mesh with all
parts fused together, which limits the ability to edit or manipulate individual
parts. A key challenge is that different objects may have a varying number of
parts. To address this, we propose a new end-to-end framework for part-level 3D
object generation. Given a single input image, our method generates
high-quality 3D objects with an arbitrary number of complete and semantically
meaningful parts. We introduce a dual volume packing strategy that organizes
all parts into two complementary volumes, allowing for the creation of complete
and interleaved parts that assemble into the final object. Experiments show
that our model achieves better quality, diversity, and generalization than
previous image-based part-level generation methods.
comment: Code: https://github.com/NVlabs/PartPacker Project Page:
https://research.nvidia.com/labs/dir/partpacker/
★ Vectorized Region Based Brush Strokes for Artistic Rendering
Presenting a visual artwork as a stroke-by-stroke evolution helps bridge the
emotional and educational gap between the finished static artwork and its
creation process. Recent stroke-based painting systems focus on
capturing stroke details by predicting and iteratively refining stroke
parameters to maximize the similarity between the input image and the rendered
output. However, these methods often struggle to produce stroke compositions
that align with artistic principles and intent. To address this, we explore an
image-to-painting method that (i) facilitates semantic guidance for brush
strokes in targeted regions, (ii) computes the brush stroke parameters, and
(iii) establishes a sequence among segments and strokes to sequentially render
the final painting. Experimental results on various input image types, such as
face images, paintings, and photographic images, show that our method aligns
with a region-based painting strategy while rendering a painting with high
fidelity and superior stroke quality.
★ Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
As textual reasoning with large language models (LLMs) has advanced
significantly, there has been growing interest in enhancing the multimodal
reasoning capabilities of large vision-language models (LVLMs). However,
existing methods primarily approach multimodal reasoning in a straightforward,
text-centric manner, where both reasoning and answer derivation are conducted
purely through text, with the only difference being the presence of multimodal
input. As a result, these methods often encounter fundamental limitations in
spatial reasoning tasks that demand precise geometric understanding and
continuous spatial tracking, capabilities that humans achieve through mental
visualization and manipulation. To address these limitations, we propose drawing
to reason in space, a novel paradigm that enables LVLMs to reason through
elementary drawing operations in the visual space. By equipping models with
basic drawing operations, including annotating bounding boxes and drawing
auxiliary lines, we empower them to express and analyze spatial relationships
through direct visual manipulation, while avoiding the performance ceiling
imposed by specialized perception tools in previous tool-integrated reasoning
approaches. To cultivate this capability, we develop a three-stage training
framework: cold-start training with synthetic data to establish basic drawing
abilities, reflective rejection sampling to enhance self-reflection behaviors,
and reinforcement learning to directly optimize for target rewards. Extensive
experiments demonstrate that our model, named VILASR, consistently outperforms
existing methods across diverse spatial reasoning benchmarks, involving maze
navigation, static spatial reasoning, video-based reasoning, and
multi-view-based reasoning tasks, with an average improvement of 18.4%.
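The elementary drawing operations named above (annotating bounding boxes and drawing auxiliary lines) amount to simple image-editing primitives. A hedged sketch with Pillow follows; the function names and crop conventions are illustrative, not VILASR's actual tool interface.

```python
# Minimal sketch of the two elementary drawing operations described above,
# implemented with Pillow; function names and signatures are illustrative.
from PIL import Image, ImageDraw

def draw_bbox(img, box, label=None, color="red", width=3):
    """Annotate a bounding box (x0, y0, x1, y1) on a copy of the image."""
    out = img.copy()
    d = ImageDraw.Draw(out)
    d.rectangle(box, outline=color, width=width)
    if label:
        d.text((box[0] + 2, box[1] + 2), label, fill=color)
    return out

def draw_auxiliary_line(img, p0, p1, color="blue", width=2):
    """Draw an auxiliary line between two points, e.g., to compare positions."""
    out = img.copy()
    ImageDraw.Draw(out).line([p0, p1], fill=color, width=width)
    return out

canvas = Image.new("RGB", (256, 256), "white")
canvas = draw_bbox(canvas, (40, 40, 120, 160), label="chair")
canvas = draw_auxiliary_line(canvas, (80, 100), (200, 220))
canvas.save("reasoning_step.png")   # the edited image is fed back to the model
```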
★ Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy
Medical Visual Question Answering (MedVQA) is a promising field for
developing clinical decision support systems, yet progress is often limited by
the available datasets, which can lack clinical complexity and visual
diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new,
large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly
expands upon the original Kvasir-VQA by incorporating 159,549 new
question-answer pairs that are designed to test deeper clinical reasoning. We
developed a systematic method using large language models to generate these
questions, which are stratified by complexity to better assess a model's
inference capabilities. To ensure our dataset prepares models for real-world
clinical scenarios, we have also introduced a variety of visual augmentations
that mimic common imaging artifacts. The dataset is structured to support two
main evaluation tracks: one for standard VQA performance and another to test
model robustness against these visual perturbations. By providing a more
challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate
the development of more reliable and effective multimodal AI systems for use in
clinical settings. The dataset is fully accessible and adheres to FAIR data
principles, making it a valuable resource for the wider research community.
Code and data: https://github.com/Simula/Kvasir-VQA-x1 and
https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1
★ Canonical Latent Representations in Conditional Diffusion Models
Conditional diffusion models (CDMs) have shown impressive performance across
a range of generative tasks. Their ability to model the full data distribution
has opened new avenues for analysis-by-synthesis in downstream discriminative
learning. However, this same modeling capacity causes CDMs to entangle the
class-defining features with irrelevant context, posing challenges to
extracting robust and interpretable representations. To this end, we identify
Canonical LAtent Representations (CLAReps), latent codes whose internal CDM
features preserve essential categorical information while discarding
non-discriminative signals. When decoded, CLAReps produce representative
samples for each class, offering an interpretable and compact summary of the
core class semantics with minimal irrelevant details. Exploiting CLAReps, we
develop a novel diffusion-based feature-distillation paradigm, CaDistill. While
the student has full access to the training set, the CDM as teacher transfers
core class knowledge only via CLAReps, which amount to merely 10% of the
training data in size. After training, the student achieves strong adversarial
robustness and generalization ability, focusing more on the class signals
instead of spurious background cues. Our findings suggest that CDMs can serve
not just as image generators but also as compact, interpretable teachers that
can drive robust representation learning.
comment: 45 pages, 41 figures
★ Vision Generalist Model: A Survey
Ziyi Wang, Yongming Rao, Shuofeng Sun, Xinrun Liu, Yi Wei, Xumin Yu, Zuyan Liu, Yanbo Wang, Hongmin Liu, Jie Zhou, Jiwen Lu
Recently, we have witnessed the great success of the generalist model in
natural language processing. The generalist model is a general framework
trained with massive data and is able to process various downstream tasks
simultaneously. Encouraged by their impressive performance, an increasing
number of researchers are venturing into the realm of applying these models to
computer vision tasks. However, the inputs and outputs of vision tasks are more
diverse, and it is difficult to summarize them as a unified representation. In
this paper, we provide a comprehensive overview of the vision generalist
models, delving into their characteristics and capabilities within the field.
First, we review the background, including the datasets, tasks, and benchmarks.
Then, we dig into the design of frameworks that have been proposed in existing
research, while also introducing the techniques employed to enhance their
performance. To better help the researchers comprehend the area, we take a
brief excursion into related domains, shedding light on their interconnections
and potential synergies. To conclude, we provide some real-world application
scenarios, undertake a thorough examination of the persistent challenges, and
offer insights into possible directions for future research endeavors.
comment: Accepted by International Journal of Computer Vision (IJCV)
★ Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
In outside knowledge visual question answering (OK-VQA), the model must
identify relevant visual information within an image and incorporate external
knowledge to accurately respond to a question. Extending this task to a
visually grounded dialogue setting based on videos, a conversational model must
both recognize pertinent visual details over time and answer questions where
the required information is not necessarily present in the visual information.
Moreover, the context of the overall conversation must be considered for the
subsequent dialogue. To explore this task, we introduce a dataset comprising
$2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$
interleaved dialogue turns. While the dialogue context is visually grounded in
specific video segments, the questions further require external knowledge that
is not visually present. Thus, the model not only has to identify relevant
video parts but also leverage external knowledge to converse within the
dialogue. We further provide several baselines evaluated on our dataset and
show future challenges associated with this task. The dataset is made publicly
available here: https://github.com/c-patsch/OKCV.
★ UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting CVPR 2025
The scale diversity of point cloud data presents significant challenges in
developing unified representation learning techniques for 3D vision. Currently,
there are few unified 3D models, and no existing pre-training method is equally
effective for both object- and scene-level point clouds. In this paper, we
introduce UniPre3D, the first unified pre-training method that can be
seamlessly applied to point clouds of any scale and 3D models of any
architecture. Our approach predicts Gaussian primitives as the pre-training
task and employs differentiable Gaussian splatting to render images, enabling
precise pixel-level supervision and end-to-end optimization. To further
regulate the complexity of the pre-training task and direct the model's focus
toward geometric structures, we integrate 2D features from pre-trained image
models to incorporate well-established texture knowledge. We validate the
universal effectiveness of our proposed method through extensive experiments
across a variety of object- and scene-level tasks, using diverse point cloud
models as backbones. Code is available at https://github.com/wangzy22/UniPre3D.
comment: Accepted to CVPR 2025
★ Sampling Theory for Super-Resolution with Implicit Neural Representations
Implicit neural representations (INRs) have emerged as a powerful tool for
solving inverse problems in computer vision and computational imaging. INRs
represent images as continuous domain functions realized by a neural network
taking spatial coordinates as inputs. However, unlike traditional pixel
representations, little is known about the sample complexity of estimating
images using INRs in the context of linear inverse problems. Towards this end,
we study the sampling requirements for recovery of a continuous domain image
from its low-pass Fourier samples by fitting a single hidden-layer INR with
ReLU activation and a Fourier features layer using a generalized form of weight
decay regularization. Our key insight is to relate minimizers of this
non-convex parameter space optimization problem to minimizers of a convex
penalty defined over an infinite-dimensional space of measures. We identify a
sufficient number of Fourier samples for which an image realized by an INR is
exactly recoverable by solving the INR training problem. To validate our
theory, we empirically assess the probability of achieving exact recovery of
images realized by low-width single hidden-layer INRs, and illustrate the
performance of INRs on super-resolution recovery of continuous domain phantom
images.
comment: arXiv admin note: text overlap with arXiv:2405.18410
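The model family analyzed above, a fixed Fourier features layer followed by a single hidden ReLU layer, can be written compactly. The sketch below uses assumed dimensions, fits a toy target in the spatial domain rather than from low-pass Fourier samples, and uses plain optimizer weight decay as a stand-in for the generalized weight-decay regularizer studied in the paper.

```python
# Sketch of the analyzed model family: Fourier features + one hidden ReLU layer.
# Dimensions are assumptions; plain weight decay stands in for the paper's
# generalized regularizer, and the toy fit is in the spatial domain for brevity.
import torch
import torch.nn as nn

class FourierFeatureINR(nn.Module):
    def __init__(self, in_dim=2, n_freqs=64, hidden=256):
        super().__init__()
        # Fixed random Fourier feature frequencies (one common construction).
        self.register_buffer("B", torch.randn(n_freqs, in_dim) * 10.0)
        self.hidden = nn.Linear(2 * n_freqs, hidden)   # single hidden layer
        self.out = nn.Linear(hidden, 1)

    def forward(self, coords):                          # coords: (N, 2) in [0, 1]^2
        proj = 2 * torch.pi * coords @ self.B.T
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.out(torch.relu(self.hidden(feats)))

model = FourierFeatureINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
coords = torch.rand(1024, 2)
target = torch.sin(2 * torch.pi * coords[:, :1] * 3)    # toy band-limited image
for _ in range(10):
    opt.zero_grad()
    loss = ((model(coords) - target) ** 2).mean()
    loss.backward()
    opt.step()
print(float(loss))
```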
★ CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models NeurIPS2025
We introduce CausalVQA, a benchmark dataset for video question answering
(VQA) composed of question-answer pairs that probe models' understanding of
causality in the physical world. Existing VQA benchmarks either tend to focus
on surface perceptual understanding of real-world videos, or on narrow physical
reasoning questions created using simulation environments. CausalVQA fills an
important gap by presenting challenging questions that are grounded in
real-world scenarios, while focusing on models' ability to predict the likely
outcomes of different actions and events through five question types:
counterfactual, hypothetical, anticipation, planning and descriptive. We
designed quality control mechanisms that prevent models from exploiting trivial
shortcuts, requiring models to base their answers on deep visual understanding
instead of linguistic cues. We find that current frontier multimodal models
fall substantially below human performance on the benchmark, especially on
anticipation and hypothetical questions. This highlights a challenge for
current systems to leverage spatial-temporal reasoning, understanding of
physical principles, and comprehension of possible alternatives to make
accurate predictions in real-world settings.
comment: 35 pages, 3 figures, Submitted to NeurIPS2025 benchmark track
★ LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation
Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
Developing 3D-VL generalists capable of understanding 3D scenes and following
natural language instructions to perform a wide range of tasks has been a
long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL
models still lag behind their 2D counterparts in capability and robustness,
falling short of the generalist standard. A key obstacle to developing 3D-VL
generalists lies in data scalability, hindered by the lack of an efficient
scene representation. We propose LEO-VL, a 3D-VL model built upon condensed
feature grid (CFG), an efficient scene representation that bridges 2D
perception and 3D spatial structure while significantly reducing token
overhead. This efficiency unlocks large-scale training towards 3D-VL
generalist, for which we curate over 700k high-quality 3D-VL data spanning four
domains of real-world indoor scenes and five tasks such as captioning and
dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA
benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the
efficiency of our representation, the importance of task and scene diversity,
and the validity of our data curation principle. Furthermore, we introduce
SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL
models. We hope our findings contribute to the advancement of scalable and
robust 3D-VL generalists.
comment: Project page: https://leo-vl.github.io
★ Fluoroscopic Shape and Pose Tracking of Catheters with Custom Radiopaque Markers
Safe navigation of steerable and robotic catheters in the cerebral
vasculature requires awareness of the catheter's shape and pose. Currently, a
significant perception burden is placed on interventionalists to mentally
reconstruct and predict catheter motions from biplane fluoroscopy images.
Efforts to track these catheters are limited to planar segmentation or bulky
sensing instrumentation, which are incompatible with microcatheters used in
neurointervention. In this work, a catheter is equipped with custom radiopaque
markers arranged to enable simultaneous shape and pose estimation under biplane
fluoroscopy. A design measure is proposed to guide the arrangement of these
markers to minimize sensitivity to marker tracking uncertainty. This approach
was deployed for microcatheters smaller than 2mm OD navigating phantom
vasculature with shape tracking errors less than 1mm and catheter roll errors
below 40 degrees. This work can enable steerable catheters to autonomously
navigate under biplane imaging.
comment: 8 pages, 5 figures, accepted in Robotics and Automation Letters
★ HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations
Diffusion models represent the cutting edge in image generation, but their
high memory and computational demands hinder deployment on resource-constrained
devices. Post-Training Quantization (PTQ) offers a promising solution by
reducing the bitwidth of matrix operations. However, standard PTQ methods
struggle with outliers, and achieving higher compression often requires
transforming model weights and activations before quantization. In this work,
we propose HadaNorm, a novel linear transformation that extends existing
approaches and effectively mitigates outliers by normalizing activation
feature channels before applying Hadamard transformations, enabling more
aggressive activation quantization. We demonstrate that HadaNorm consistently
reduces quantization error across the various components of transformer blocks,
achieving superior efficiency-performance trade-offs when compared to
state-of-the-art methods.
comment: 4 Pages, 5 Figures
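The transformation described above can be reconstructed, under assumptions, as: mean-center each activation feature channel, rotate with a normalized Hadamard matrix to spread outliers across channels, then quantize on a uniform symmetric grid. The sketch below is illustrative, not the authors' code; the exact normalization and quantizer may differ.

```python
# Illustrative reconstruction (not the authors' code): mean-center activation
# channels, apply a normalized Hadamard transform to spread out outliers, then
# quantize with a uniform symmetric grid and invert for a reconstruction check.
import numpy as np
from scipy.linalg import hadamard

def hadanorm_quantize(acts, bits=8):
    """acts: (tokens, channels) with channels a power of two."""
    n = acts.shape[1]
    mu = acts.mean(axis=0, keepdims=True)
    centered = acts - mu                         # mean-center each feature channel
    H = hadamard(n) / np.sqrt(n)                 # orthonormal Hadamard matrix
    rotated = centered @ H                       # rotation damps per-channel outliers
    scale = np.abs(rotated).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(rotated / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    dequant = (q * scale) @ H.T + mu             # invert transform after dequantizing
    return q.astype(np.int8), dequant

acts = np.random.randn(16, 64) * np.array([1.0] * 63 + [25.0])  # one outlier channel
q, recon = hadanorm_quantize(acts)
print(np.abs(acts - recon).mean())               # small reconstruction error
```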
★ From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
One promise that Vision-Language-Action (VLA) models hold over traditional
imitation learning for robotics is to leverage the broad generalization
capabilities of large Vision-Language Models (VLMs) to produce versatile,
"generalist" robot policies. However, current evaluations of VLAs remain
insufficient. Traditional imitation learning benchmarks are unsuitable due to
the lack of language instructions. Emerging benchmarks for VLAs that
incorporate language often come with limited evaluation tasks and do not intend
to investigate how much VLM pretraining truly contributes to the generalization
capabilities of the downstream robotic policy. Meanwhile, much research relies
on real-world robot setups designed in isolation by different institutions,
which creates a barrier for reproducibility and accessibility. To address this
gap, we introduce a unified probing suite of 50 simulation-based tasks across
10 subcategories spanning language instruction, vision, and objects. We
systematically evaluate several state-of-the-art VLA architectures on this
suite to understand their generalization capability. Our results show that
while VLM backbones endow VLAs with robust perceptual understanding and
high-level planning, which we refer to as good intentions, this does not reliably
translate into precise motor execution: when faced with out-of-distribution
observations, policies often exhibit coherent intentions, but falter in action
execution. Moreover, finetuning on action data can erode the original VLM's
generalist reasoning abilities. We release our task suite and evaluation code
to serve as a standardized benchmark for future VLAs and to drive research on
closing the perception-to-action gap. More information, including the source
code, can be found at https://ai4ce.github.io/INT-ACT/
comment: Under review
★ Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering
Hyperspectral image (HSI) clustering assigns similar pixels to the same class
without any annotations, which is an important yet challenging task. For
large-scale HSIs, most methods rely on superpixel segmentation and perform
superpixel-level clustering based on graph neural networks (GNNs). However,
existing GNNs cannot fully exploit the spectral information of the input HSI,
and the inaccurate superpixel topological graph may lead to the confusion of
different class semantics during information aggregation. To address these
challenges, we first propose a structural-spectral graph convolutional operator
(SSGCO) tailored for graph-structured HSI superpixels to improve their
representation quality through the co-extraction of spatial and spectral
features. Second, we propose an evidence-guided adaptive edge learning (EGAEL)
module that adaptively predicts and refines edge weights in the superpixel
topological graph. We integrate the proposed method into a contrastive learning
framework to achieve clustering, where representation learning and clustering
are simultaneously conducted. Experiments demonstrate that the proposed method
improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best
compared methods on four HSI datasets. Our code is available at
https://github.com/jhqi/SSGCO-EGAEL.
★ MetricHMR: Metric Human Mesh Recovery from Monocular Images
We introduce MetricHMR (Metric Human Mesh Recovery), an approach for metric
human mesh recovery with accurate global translation from monocular images. In
contrast to existing HMR methods that suffer from severe scale and depth
ambiguity, MetricHMR is able to produce geometrically reasonable body shape and
global translation in the reconstruction results. To this end, we first
systematically analyze the camera models used in previous HMR methods to emphasize the
critical role of the standard perspective projection model in enabling
metric-scale HMR. We then validate the acceptable ambiguity range of metric HMR
under the standard perspective projection model. Finally, we contribute a novel
approach that introduces a ray map based on the standard perspective projection
to jointly encode bounding-box information, camera parameters, and geometric
cues for end-to-end metric HMR without any additional metric-regularization
modules. Extensive experiments demonstrate that our method achieves
state-of-the-art performance, even compared with sequential HMR methods, in
metric pose, shape, and global translation estimation across both indoor and
in-the-wild scenarios.
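The ray map mentioned above can be built, under the standard perspective projection model, by back-projecting each pixel inside the bounding box through the camera intrinsics. The sketch below is an assumption about one reasonable construction, not the paper's exact encoding.

```python
# Sketch of a ray map under the standard perspective projection model: each
# pixel inside a bounding box maps to its normalized viewing ray via the camera
# intrinsics. Shapes and the crop convention are assumptions.
import numpy as np

def ray_map(K, box, step=1):
    """K: 3x3 intrinsics; box: (x0, y0, x1, y1) bounding box in pixels."""
    x0, y0, x1, y1 = box
    xs, ys = np.meshgrid(np.arange(x0, x1, step), np.arange(y0, y1, step))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix                # back-project to camera space
    rays = (rays / np.linalg.norm(rays, axis=0)).T
    return rays.reshape(ys.shape[0], xs.shape[1], 3)

K = np.array([[1000.0, 0, 320], [0, 1000.0, 240], [0, 0, 1]])
rm = ray_map(K, (100, 80, 228, 208))
print(rm.shape)   # (128, 128, 3) rays jointly encoding box and camera geometry
```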
★ Only-Style: Stylistic Consistency in Image Generation without Content Leakage
Generating images in a consistent reference visual style remains a
challenging computer vision task. State-of-the-art methods aiming for
style-consistent generation struggle to effectively separate semantic content
from stylistic elements, leading to content leakage from the image provided as
a reference to the targets. To address this challenge, we propose Only-Style: a
method designed to mitigate content leakage in a semantically coherent manner
while preserving stylistic consistency. Only-Style works by localizing content
leakage during inference, allowing the adaptive tuning of a parameter that
controls the style alignment process, specifically within the image patches
containing the subject in the reference image. This adaptive process best
balances stylistic consistency with leakage elimination. Moreover, the
localization of content leakage can function as a standalone component, given a
reference-target image pair, allowing the adaptive tuning of any
method-specific parameter that provides control over the impact of the
stylistic reference. In addition, we propose a novel evaluation framework to
quantify the success of style-consistent generations in avoiding undesired
content leakage. Our approach demonstrates a significant improvement over
state-of-the-art methods through extensive evaluation across diverse instances,
consistently achieving robust stylistic consistency without undesired content
leakage.
★ CEM-FBGTinyDet: Context-Enhanced Foreground Balance with Gradient Tuning for tiny Objects
Tiny object detection (TOD) reveals a fundamental flaw in feature pyramid
networks: high-level features (P5-P6) frequently receive zero positive anchors
under standard label assignment protocols, leaving their semantic
representations untrained due to exclusion from loss computation. This creates
dual deficiencies: (1) Stranded high-level features become semantic dead-ends
without gradient updates, while (2) low-level features lack essential semantic
context for robust classification. To address these issues, we propose
E-FPN-BS, a novel architecture that systematically converts wasted high-level
semantics into low-level feature enhancements by integrating
multi-scale feature enhancement and adaptive optimization. First, our Context
Enhancement Module(CEM) employs dual-branch processing to align and compress
high-level features for effective global-local fusion. Second, the
Foreground-Background Separation Module (FBSM) generates spatial gating masks
that dynamically amplify discriminative regions. To address gradient imbalance
across object scales, we further propose a Dynamic Gradient-Balanced Loss
(DCLoss) that automatically modulates loss contributions via scale-aware
gradient equilibrium. Extensive experiments across multiple benchmark datasets
demonstrate the outstanding performance and generalization ability of our
approach.
★ EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks
Learning self-supervised representations that are invariant and equivariant
to transformations is crucial for advancing beyond traditional visual
classification tasks. However, many methods rely on predictor architectures to
encode equivariance, despite evidence that architectural choices, such as
capsule networks, inherently excel at learning interpretable pose-aware
representations. To explore this, we introduce EquiCaps (Equivariant Capsule
Network), a capsule-based approach to pose-aware self-supervision that
eliminates the need for a specialised predictor for enforcing equivariance.
Instead, we leverage the intrinsic pose-awareness capabilities of capsules to
improve performance in pose estimation tasks. To further challenge our
assumptions, we increase task complexity via multi-geometric transformations to
enable a more thorough evaluation of invariance and equivariance by introducing
3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical
results demonstrate that EquiCaps outperforms prior state-of-the-art
equivariant methods on rotation prediction, achieving a supervised-level $R^2$
of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE
and CapsIE by 0.05 and 0.04 $R^2$, respectively. Moreover, in contrast to
non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant
performance under combined geometric transformations, underscoring its
generalisation capabilities and the promise of predictor-free capsule
architectures.
comment: 19 pages, 11 Figures, 13 Tables
★ The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge
We consider the problem of generalizable novel view synthesis (NVS), which
aims to generate photorealistic novel views from sparse or even unposed 2D
images without per-scene optimization. This task remains fundamentally
challenging, as it requires inferring 3D structure from incomplete and
ambiguous 2D observations. Early approaches typically rely on strong 3D
knowledge, including architectural 3D inductive biases (e.g., embedding
explicit 3D representations, such as NeRF or 3DGS, into network design) and
ground-truth camera poses for both input and target views. While recent efforts
have sought to reduce the 3D inductive bias or the dependence on known camera
poses of input views, critical questions regarding the role of 3D knowledge and
the necessity of circumventing its use remain under-explored. In this work, we
conduct a systematic analysis of 3D knowledge and uncover a critical trend:
the performance of methods that require less 3D knowledge improves faster as
data scales, eventually matching that of their 3D
knowledge-driven counterparts, which highlights the increasing importance of
reducing dependence on 3D knowledge in the era of large-scale data. Motivated
by and following this trend, we propose a novel NVS framework that minimizes 3D
inductive bias and pose dependence for both input and target views. By
eliminating this 3D knowledge, our method fully leverages data scaling and
learns implicit 3D awareness directly from sparse 2D images, without any 3D
inductive bias or pose annotation during training. Extensive experiments
demonstrate that our model generates photorealistic and 3D-consistent novel
views, achieving even comparable performance with methods that rely on posed
inputs, thereby validating the feasibility and effectiveness of our
data-centric paradigm. Project page:
https://pku-vcl-geometry.github.io/Less3Depend/ .
★ 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation
Vision-Language Models (VLMs) have shown remarkable performance on diverse
visual and linguistic tasks, yet they remain fundamentally limited in their
understanding of 3D spatial structures. We propose Geometric Distillation, a
lightweight, annotation-free fine-tuning framework that injects human-inspired
geometric cues into pretrained VLMs without modifying their architecture. By
distilling (1) sparse correspondences, (2) relative depth relations, and (3)
dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R,
VGGT), our method shapes representations to be geometry-aware while remaining
compatible with natural image-text inputs. Through extensive evaluations on 3D
vision-language reasoning and 3D perception benchmarks, our method consistently
outperforms prior approaches, achieving improved 3D spatial reasoning with
significantly lower computational cost. Our work demonstrates a scalable and
efficient path to bridge 2D-trained VLMs with 3D understanding, opening up
wider use in spatially grounded multimodal tasks.
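One of the three distilled cues listed above, relative depth relations, can be phrased as a pairwise ranking loss that makes the student reproduce the teacher's depth ordering. The sketch below is a hedged illustration with assumed tensors and margin, not the paper's exact objective.

```python
# Hedged sketch of distilling relative depth relations as a pairwise ranking
# loss; tensor names and the margin are assumptions, not the paper's formulation.
import torch
import torch.nn.functional as F

def relative_depth_loss(student_depth, teacher_depth, n_pairs=512, margin=0.0):
    """student_depth, teacher_depth: (N,) per-pixel/patch depth estimates."""
    n = student_depth.shape[0]
    i = torch.randint(0, n, (n_pairs,))
    j = torch.randint(0, n, (n_pairs,))
    # The teacher decides which element of each pair is closer to the camera.
    sign = torch.sign(teacher_depth[i] - teacher_depth[j])
    diff = student_depth[i] - student_depth[j]
    # Hinge on the signed difference so the student reproduces the ordering.
    return F.relu(margin - sign * diff)[sign != 0].mean()

teacher = torch.rand(1024)                       # e.g., depth from a 3D foundation model
student = teacher + 0.1 * torch.randn(1024)      # noisy student predictions
student.requires_grad_(True)
loss = relative_depth_loss(student, teacher, margin=0.05)
loss.backward()
print(float(loss))
```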
★ Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation
Open-Vocabulary semantic segmentation (OVSS) and domain generalization in
semantic segmentation (DGSS) highlight a subtle complementarity that motivates
Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS
aims to generate pixel-level masks for unseen categories while maintaining
robustness across unseen domains, a critical capability for real-world
scenarios such as autonomous driving in adverse conditions. We introduce Vireo,
a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS
and DGSS for the first time. Vireo builds upon the frozen Visual Foundation
Models (VFMs) and incorporates scene geometry via Depth VFMs to extract
domain-invariant structural features. To bridge the gap between visual and
textual modalities under domain shift, we propose three key components: (1)
GeoText Prompts, which align geometric features with language cues and
progressively refine VFM encoder representations; (2) Coarse Mask Prior
Embedding (CMPE) for enhancing gradient flow for faster convergence and
stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding
Head (DOV-VEH), which fuses refined structural and semantic features for robust
prediction. Comprehensive evaluation on these components demonstrates the
effectiveness of our designs. Our proposed Vireo achieves the state-of-the-art
performance and surpasses existing methods by a large margin in both domain
generalization and open-vocabulary recognition, offering a unified and scalable
solution for robust visual understanding in diverse and dynamic environments.
Code is available at https://github.com/anonymouse-9c53tp182bvz/Vireo.
★ IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments
We present IntPhys 2, a video benchmark designed to evaluate the intuitive
physics understanding of deep learning models. Building on the original IntPhys
benchmark, IntPhys 2 focuses on four core principles related to macroscopic
objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity.
These conditions are inspired by research into intuitive physical understanding
emerging during early childhood. IntPhys 2 offers a comprehensive suite of
tests, based on the violation of expectation framework, that challenge models
to differentiate between possible and impossible events within controlled and
diverse virtual environments. Alongside the benchmark, we provide performance
evaluations of several state-of-the-art models. Our findings indicate that
while these models demonstrate basic visual understanding, they face
significant challenges in grasping intuitive physics across the four principles
in complex scenes, with most models performing at chance levels (50%), in stark
contrast to human performance, which achieves near-perfect accuracy. This
underscores the gap between current models and human-like intuitive physics
understanding, highlighting the need for advancements in model architectures
and training methodologies.
★ Dataset of News Articles with Provenance Metadata for Media Relevance Assessment
Out-of-context and misattributed imagery is the leading form of media
manipulation in today's misinformation and disinformation landscape. The
existing methods attempting to detect this practice often only consider whether
the semantics of the imagery corresponds to the text narrative, missing
manipulation so long as the depicted objects or scenes somewhat correspond to
the narrative at hand. To tackle this, we introduce News Media Provenance
Dataset, a dataset of news articles with provenance-tagged images. We formulate
two tasks on this dataset, location of origin relevance (LOR) and date and time
of origin relevance (DTOR), and present baseline results on six large language
models (LLMs). We find that, while zero-shot performance on LOR is
promising, performance on DTOR lags behind, leaving room for specialized
architectures and future work.
★ Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition
Handwritten text recognition aims to convert visual input into
machine-readable text, and it remains challenging due to the evolving and
context-dependent nature of handwriting. Character sets change over time, and
character frequency distributions shift across historical periods or regions,
often causing models trained on broad, heterogeneous corpora to underperform on
specific subsets. To tackle this, we propose a novel loss function that
incorporates the Wasserstein distance between the character frequency
distribution of the predicted text and a target distribution empirically
derived from training data. By penalizing divergence from expected
distributions, our approach enhances both accuracy and robustness under
temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that
character distribution alignment can also improve existing models at inference
time without requiring retraining by integrating it as a scoring function in a
guided decoding scheme. Experimental results across multiple datasets and
architectures confirm the effectiveness of our method in boosting
generalization and performance. We open source our code at
https://github.com/pkaliosis/fada.
comment: 17 pages, 10 figures, Under Review
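The loss term described above compares two character frequency distributions. For a fixed, ordered alphabet the discrete 1D Wasserstein distance reduces to the L1 distance between cumulative distributions, as sketched below; the alphabet, its ordering, and the loss weight are assumptions, not the paper's exact choices.

```python
# Sketch of the character-frequency alignment term described above, using the
# closed-form 1D Wasserstein distance over an assumed ordered alphabet.
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz ")

def char_distribution(text, alphabet=ALPHABET):
    counts = np.array([text.count(c) for c in alphabet], dtype=float)
    return counts / max(counts.sum(), 1.0)

def wasserstein_1d(p, q):
    return np.abs(np.cumsum(p - q)).sum()

target = char_distribution("the quick brown fox jumps over the lazy dog")
pred = char_distribution("the quick brown fox jumps ouer the lazy bog")
penalty = wasserstein_1d(pred, target)
print(penalty)   # grows as the predicted character usage drifts from the target

# In training, this penalty would be added to the recognition loss, e.g.:
# loss = recognition_loss + lambda_w * wasserstein_1d(pred_dist, target_dist)
```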
★ OctoNav: Towards Generalist Embodied Navigation
Embodied navigation stands as a foundation pillar within the broader pursuit
of embodied AI. However, previous navigation research is divided into different
tasks/capabilities, e.g., ObjNav, ImgNav, and VLN, which differ in task
objectives and modalities, so datasets and methods are designed
individually. In this work, we take steps toward generalist navigation agents,
which can follow free-form instructions that combine arbitrary modalities and
capabilities. To achieve this, we propose a large-scale
benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1.
Specifically, OctoNav-Bench features continuous environments and is constructed
via a designed annotation pipeline. We thoroughly craft instruction-trajectory
pairs, where instructions are diverse, free-form, and span arbitrary modalities
and capabilities. Also, we construct a Think-Before-Action (TBA-CoT) dataset within
OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1,
we build it upon MLLMs and adapt it to a VLA-type model, which can produce
low-level actions solely based on 2D visual observations. Moreover, we design a
Hybrid Training Paradigm (HTP) that consists of three stages, i.e.,
Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains
specifically designed learning policies and rewards. Importantly, for TBA-SFT
and Nav-GRPO designs, we are inspired by the OpenAI-o1 and DeepSeek-R1, which
show impressive reasoning ability via thinking-before-answer. Thus, we aim to
investigate how to achieve thinking-before-action in the embodied navigation
field, to improve model's reasoning ability toward generalists. Specifically,
we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a
cold-start phase and then leverage Nav-GRPO to improve its thinking ability.
Finally, OctoNav-R1 shows superior performance compared with previous methods.
comment: 31 pages, 25 figures
★ DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction
Reconstructing intricate, ever-changing environments remains a central
ambition in computer vision, yet existing solutions often crumble before the
complexity of real-world dynamics. We present DynaSplat, an approach that
extends Gaussian Splatting to dynamic scenes by integrating dynamic-static
separation and hierarchical motion modeling. First, we classify scene elements
as static or dynamic through a novel fusion of deformation offset statistics
and 2D motion flow consistency, refining our spatial representation to focus
precisely where motion matters. We then introduce a hierarchical motion
modeling strategy that captures both coarse global transformations and
fine-grained local movements, enabling accurate handling of intricate,
non-rigid motions. Finally, we integrate physically-based opacity estimation to
ensure visually coherent reconstructions, even under challenging occlusions and
perspective shifts. Extensive experiments on challenging datasets reveal that
DynaSplat not only surpasses state-of-the-art alternatives in accuracy and
realism but also provides a more intuitive, compact, and efficient route to
dynamic scene reconstruction.
★ MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion
Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an
individual's genuine emotional state. Their analysis has attracted considerable
interest due to its promising applications in fields such as healthcare,
criminal investigation, and human-computer interaction. However, existing ME
research is limited to the single visual modality, overlooking the rich
emotional information conveyed by other physiological modalities, leaving ME
recognition and spotting performance far below practical needs.
Therefore, exploring the cross-modal association mechanism between ME visual
features and physiological signals (PS), and developing a multimodal fusion
framework, represents a pivotal step toward advancing ME analysis. This study
introduces a novel ME dataset, MMME, which, for the first time, enables
synchronized collection of facial action signals (MEs), central nervous system
signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). By overcoming
the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841
macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS,
establishing a robust foundation for investigating ME neural mechanisms and
conducting multimodal fusion-based analyses. Extensive experiments validate the
dataset's reliability and provide benchmarks for ME analysis, demonstrating
that integrating MEs with PS significantly enhances recognition and spotting
performance. To the best of our knowledge, MMME is the most comprehensive ME
dataset to date in terms of modality diversity. It provides critical data
support for exploring the neural mechanisms of MEs and uncovering the
visual-physiological synergistic effects, driving a paradigm shift in ME
research from single-modality visual analysis to multimodal fusion. The dataset
will be publicly available upon acceptance of this paper.
★ DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision
While text-to-3D generation has attracted growing interest, existing methods
often struggle to produce 3D assets that align well with human preferences.
Current preference alignment techniques for 3D content typically rely on
laboriously collected preference-paired multi-view 2D images to train 2D reward
models, which then guide 3D generation -- leading to geometric artifacts due to
their inherent 2D bias. To address these limitations, we construct 3D-MeshPref,
the first large-scale unpaired 3D preference dataset, featuring diverse 3D
meshes annotated by a large language model and refined by human evaluators. We
then develop RewardCS, the first reward model trained directly on unpaired
3D-MeshPref data using a novel Cauchy-Schwarz divergence objective, enabling
effective learning of human-aligned 3D geometric preferences without requiring
paired comparisons. Building on this, we propose DreamCS, a unified framework
that integrates RewardCS into text-to-3D pipelines -- enhancing both implicit
and explicit 3D generation with human preference feedback. Extensive
experiments show DreamCS outperforms prior methods, producing 3D assets that
are both geometrically faithful and human-preferred. Code and models will be
released publicly.
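For context, the Cauchy-Schwarz divergence underlying RewardCS is typically
estimated with kernel density estimates. The sketch below shows a generic
empirical estimator between two sets of mesh features; the feature extractor,
bandwidth, and the paper's exact objective are not given in the abstract, so
treat this as an illustrative form rather than the authors' implementation.

    import torch

    def cs_divergence(x, y, sigma=1.0):
        """Empirical Cauchy-Schwarz divergence between two feature sets.

        x: (n, d) features of generated meshes, y: (m, d) features of preferred
        meshes. Uses a Gaussian-kernel density estimate; sigma is a bandwidth.
        """
        def gram(a, b):
            d2 = torch.cdist(a, b).pow(2)
            return torch.exp(-d2 / (2 * sigma ** 2))

        kxy = gram(x, y).mean()
        kxx = gram(x, x).mean()
        kyy = gram(y, y).mean()
        # D_CS(p, q) = -log( <p, q>^2 / (<p, p> <q, q>) ) >= 0
        return -torch.log(kxy.pow(2) / (kxx * kyy))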
★ ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
AI-generated content has evolved from monolithic models to modular workflows,
particularly on platforms like ComfyUI, enabling customization in creative
pipelines. However, crafting effective workflows requires great expertise to
orchestrate numerous specialized components, presenting a steep learning curve
for users. To address this challenge, we introduce ComfyUI-R1, the first large
reasoning model for automated workflow generation. Starting with our curated
dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning
data, including node selection, workflow planning, and code-level workflow
representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT
fine-tuning for cold start, adapting models to the ComfyUI domain; (2)
reinforcement learning for incentivizing reasoning capability, guided by a
fine-grained rule-metric hybrid reward, ensuring format validity, structural
integrity, and node-level fidelity. Experiments show that our 7B-parameter
model achieves a 97% format validity rate, along with high pass rates and
node-level and graph-level F1 scores, significantly surpassing prior
state-of-the-art methods that employ leading closed-source models such as
GPT-4o and Claude series. Further analysis highlights the critical role of the
reasoning process and the advantage of transforming workflows into code.
Qualitative comparison reveals our strength in synthesizing intricate workflows
with diverse nodes, underscoring the potential of long CoT reasoning in AI art
creation.
comment: Work in progress. Try it out in ComfyUI-Copilot
https://github.com/AIDC-AI/ComfyUI-Copilot
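As a rough illustration of the fine-grained rule-metric hybrid reward described
above, a function combining format validity, structural integrity, and
node-level fidelity might look like the sketch below; the weights, hard gating,
and input decomposition are assumptions, not the paper's exact reward.

    def hybrid_reward(pred_nodes, pred_edges_valid, parse_ok, gt_nodes,
                      w_format=0.2, w_struct=0.3, w_node=0.5):
        """Rule-metric hybrid reward (weights and decomposition are illustrative).

        parse_ok:         True if the generated code parses into a workflow graph.
        pred_edges_valid: fraction of predicted edges wired to existing,
                          compatible node sockets (structural integrity).
        pred_nodes/gt_nodes: sets of node types in the predicted/reference graph.
        """
        if not parse_ok:                       # format validity as a hard gate
            return 0.0
        # Node-level fidelity: F1 between predicted and reference node sets.
        tp = len(pred_nodes & gt_nodes)
        denom = len(pred_nodes) + len(gt_nodes)
        node_f1 = 2 * tp / denom if denom else 1.0
        return w_format * 1.0 + w_struct * pred_edges_valid + w_node * node_f1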
★ Accurate and efficient zero-shot 6D pose estimation with frozen foundation models
Estimating the 6D pose of objects from RGBD data is a fundamental problem in
computer vision, with applications in robotics and augmented reality. A key
challenge is achieving generalization to novel objects that were not seen
during training. Most existing approaches address this by scaling up training
on synthetic data tailored to the task, a process that demands substantial
computational resources. But is task-specific training really necessary for
accurate and efficient 6D pose estimation of novel objects? We argue that it is
not and introduce FreeZeV2, the second generation of FreeZe: a training-free method
that achieves strong generalization to unseen objects by leveraging geometric
and vision foundation models pre-trained on unrelated data. FreeZeV2 improves
both accuracy and efficiency over FreeZe through three key contributions: (i) a
sparse feature extraction strategy that reduces inference-time computation
without sacrificing accuracy; (ii) a feature-aware scoring mechanism that
improves both pose selection during RANSAC-based 3D registration and the final
ranking of pose candidates; and (iii) a modular design that supports ensembles
of instance segmentation models, increasing robustness to segmentation mask
errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark,
where it establishes a new state-of-the-art in 6D pose estimation of unseen
objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable
8x speedup over FreeZe while also improving accuracy by 5%. When using
ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy
while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall
Method at the BOP Challenge 2024.
comment: Technical report
★ Q-SAM2: Accurate Quantization for Segment Anything Model 2
Nicola Farronato, Florian Scheidegger, Mattia Rigotti, Cristiano Malossi, Michele Magno, Haotong Qin
The Segment Anything Model 2 (SAM2) has gained significant attention as a
foundational approach for promptable image and video segmentation. However, its
high computational and memory cost poses a severe challenge for its
application in resource-constrained scenarios. In this paper, we propose an
accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To
address the performance degradation caused by the singularities in weight and
activation distributions during quantization, Q-SAM2 introduces two novel
technical contributions. We first introduce a linear layer calibration method
for low-bit initialization of SAM2, which minimizes the Frobenius norm over a
small image batch to reposition weight distributions for improved quantization.
We then propose a Quantization-Aware Training (QAT) pipeline that applies
clipping to suppress outliers and allows the network to adapt to quantization
thresholds during training. Our comprehensive experiments demonstrate that
Q-SAM2 allows for highly accurate inference while substantially improving
efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses
existing state-of-the-art general quantization schemes, especially for
ultra-low 2-bit quantization. While designed for quantization-aware training,
our proposed calibration technique also proves effective in post-training
quantization, achieving up to a 66% mIoU accuracy improvement over
non-calibrated models.
comment: 20 pages
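The abstract only states that the calibration minimizes a Frobenius norm over a
small image batch; as a hedged sketch of that idea, one could search a clipping
scale per linear layer that minimizes the output error after uniform low-bit
quantization. The scale grid, per-tensor granularity, and function name below
are assumptions, not the paper's procedure.

    import torch

    def calibrate_linear(weight, x_batch, n_bits=2,
                         candidates=torch.linspace(0.5, 1.2, 15)):
        """Scale search for low-bit initialization of a linear layer (illustrative).

        weight: (out, in) linear weight; x_batch: (B, in) activations from a small
        calibration batch. Picks the clipping scale minimizing the Frobenius norm
        of the output error after uniform quantization.
        """
        qmax = 2 ** (n_bits - 1) - 1
        ref = x_batch @ weight.t()
        best_err, best_w = float("inf"), weight
        for c in candidates:
            step = c * weight.abs().max() / qmax      # clipped quantization step
            w_q = torch.clamp((weight / step).round(), -qmax - 1, qmax) * step
            err = torch.norm(x_batch @ w_q.t() - ref, p="fro")
            if err < best_err:
                best_err, best_w = err, w_q
        return best_w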
★ Inverting Black-Box Face Recognition Systems via Zero-Order Optimization in Eigenface Space
Anton Razzhigaev, Matvey Mikhalchuk, Klim Kireev, Igor Udovichenko, Andrey Kuznetsov, Aleksandr Petiushko
Reconstructing facial images from black-box recognition models poses a
significant privacy threat. While many methods require access to embeddings, we
address the more challenging scenario of model inversion using only similarity
scores. This paper introduces DarkerBB, a novel approach that reconstructs
color faces by performing zero-order optimization within a PCA-derived
eigenface space. Despite this highly limited information, experiments on LFW,
AgeDB-30, and CFP-FP benchmarks demonstrate that DarkerBB achieves
state-of-the-art verification accuracies in the similarity-only setting, with
competitive query efficiency.
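A minimal sketch of similarity-only inversion in an eigenface space, using a
NES-style zero-order gradient estimate; the population size, step sizes, and
pixel range are illustrative assumptions rather than the paper's exact
optimizer.

    import numpy as np

    def invert_similarity_only(score_fn, eigenfaces, mean_face,
                               n_iters=2000, pop=20, sigma=0.05, lr=0.5):
        """Zero-order (NES-style) search over PCA coefficients.

        score_fn(img) -> similarity score returned by the black-box system.
        eigenfaces: (k, H*W*3) PCA basis; mean_face: (H*W*3,) mean image in [0, 1].
        Returns the reconstructed face as a flat array.
        """
        k = eigenfaces.shape[0]
        z = np.zeros(k)                                   # eigenface coefficients
        render = lambda z: np.clip(mean_face + z @ eigenfaces, 0, 1)
        for _ in range(n_iters):
            eps = np.random.randn(pop, k)
            scores = np.array([score_fn(render(z + sigma * e)) for e in eps])
            scores = (scores - scores.mean()) / (scores.std() + 1e-8)
            z += lr * (scores @ eps) / (pop * sigma)      # estimated ascent direction
        return render(z)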
★ Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints
Absolute localization, aiming to determine an agent's location with respect
to a global reference, is crucial for unmanned aerial vehicles (UAVs) in
various applications, but it becomes challenging when global navigation
satellite system (GNSS) signals are unavailable. Vision-based absolute
localization methods, which locate the current view of the UAV in a reference
satellite map to estimate its position, have become popular in GNSS-denied
scenarios. However, existing methods mostly rely on traditional and low-level
image matching, and thus struggle with the significant appearance differences
introduced by cross-source discrepancies and temporal variations. To overcome
these limitations, in this paper, we introduce a hierarchical cross-source
image matching method designed for UAV absolute localization, which integrates
a semantic-aware and structure-constrained coarse matching module with a
lightweight fine-grained matching module. Specifically, in the coarse matching
module, semantic features derived from a vision foundation model first
establish region-level correspondences under semantic and structural
constraints. Then, the fine-grained matching module is applied to extract fine
features and establish pixel-level correspondences. Building upon this, a UAV
absolute visual localization pipeline is constructed without any reliance on
relative localization techniques, mainly by employing an image retrieval module
before the proposed hierarchical image matching modules. Experimental
evaluations on public benchmark datasets and a newly introduced CS-UAV dataset
demonstrate superior accuracy and robustness of the proposed method under
various challenging conditions, confirming its effectiveness.
comment: 8 pages, 6 figures
★ Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets
Existing multimodal methods typically assume that different modalities share
the same category set. However, in real-world applications, the category
distributions in multimodal data exhibit inconsistencies, which can hinder the
model's ability to effectively utilize cross-modal information for recognizing
all categories. In this work, we propose the practical setting termed
Multi-Modal Heterogeneous Category-set Learning (MMHCL), where models are
trained on multi-modal data with heterogeneous category sets and aim to
recognize the complete class set across all modalities at test time. To
effectively address this
task, we propose a Class Similarity-based Cross-modal Fusion model (CSCF).
Specifically, CSCF aligns modality-specific features to a shared semantic space
to enable knowledge transfer between seen and unseen classes. It then selects
the most discriminative modality for decision fusion through uncertainty
estimation. Finally, it integrates cross-modal information based on class
similarity, where the auxiliary modality refines the prediction of the dominant
one. Experimental results show that our method significantly outperforms
existing state-of-the-art (SOTA) approaches on multiple benchmark datasets,
effectively addressing the MMHCL task.
★ ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models
Diffusion models excel at image generation. Recent studies have shown that
these models not only generate high-quality images but also encode text-image
alignment information through attention maps or loss functions. This
information is valuable for various downstream tasks, including segmentation,
text-guided image editing, and compositional image generation. However, current
methods heavily rely on the assumption of perfect text-image alignment in
diffusion models, which is not the case. In this paper, we propose using
zero-shot referring image segmentation as a proxy task to evaluate the
pixel-level image and class-level text alignment of popular diffusion models.
We conduct an in-depth analysis of pixel-text misalignment in diffusion models
from the perspective of training data bias. We find that misalignment occurs in
images with small, occluded, or rare object classes. Therefore, we
propose ELBO-T2IAlign, a simple yet effective method to calibrate pixel-text
alignment in diffusion models based on the evidence lower bound (ELBO) of
likelihood. Our method is training-free and generic, eliminating the need to
identify the specific cause of misalignment, and it works well across various
diffusion model architectures. Extensive experiments on commonly used benchmark
datasets on image segmentation and generation have verified the effectiveness
of our proposed calibration approach.
★ Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning
Despite the rapid progress of multimodal large language models (MLLMs), they
have largely overlooked the importance of visual processing. In a simple yet
revealing experiment, we interestingly find that language-only models, when
provided with image captions, can achieve comparable or even better performance
than MLLMs that consume raw visual inputs. This suggests that current MLLMs may
generate accurate visual descriptions but fail to effectively integrate them
during reasoning. Motivated by this, we propose a simple visual perturbation
framework that enhances perceptual robustness without requiring algorithmic
modifications or additional training data. Our approach introduces three
targeted perturbations (distractor concatenation, dominance-preserving mixup,
and random rotation) that can be easily integrated into existing post-training
pipelines including SFT, DPO, and GRPO. Through extensive experiments across
multiple datasets, we demonstrate consistent improvements in mathematical
reasoning performance, with gains comparable to those achieved through
algorithmic changes. Additionally, we achieve competitive performance among
open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual
perturbation. Through comprehensive ablation studies, we analyze the
effectiveness of different perturbation strategies, revealing that each
perturbation type contributes uniquely to different aspects of visual
reasoning. Our findings highlight the critical role of visual perturbation in
multimodal mathematical reasoning: better reasoning begins with better seeing.
Our code is available at https://github.com/YutingLi0606/Vision-Matters.
comment: Technical Report
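A hedged sketch of the three named perturbations for RGB PIL images; the
concatenation layout, mixup weight, and rotation range are assumptions, since
the abstract does not give exact parameters.

    import random
    from PIL import Image

    def distractor_concat(img, distractor):
        """Concatenate an unrelated distractor image beside the input."""
        canvas = Image.new("RGB", (img.width + distractor.width,
                                   max(img.height, distractor.height)))
        canvas.paste(img, (0, 0))
        canvas.paste(distractor, (img.width, 0))
        return canvas

    def dominance_preserving_mixup(img, other, alpha=0.8):
        """Blend with another image while keeping the original dominant."""
        other = other.convert(img.mode).resize(img.size)
        return Image.blend(other, img, alpha)   # alpha near 1 keeps img dominant

    def random_rotation(img, max_deg=30):
        """Rotate by a small random angle."""
        return img.rotate(random.uniform(-max_deg, max_deg), expand=True)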
★ MPFNet: A Multi-Prior Fusion Network with a Progressive Training Strategy for Micro-Expression Recognition
Micro-expression recognition (MER), a critical subfield of affective
computing, presents greater challenges than macro-expression recognition due to
its brief duration and low intensity. While incorporating prior knowledge has
been shown to enhance MER performance, existing methods predominantly rely on
simplistic, singular sources of prior knowledge, failing to fully exploit
multi-source information. This paper introduces the Multi-Prior Fusion Network
(MPFNet), leveraging a progressive training strategy to optimize MER tasks. We
propose two complementary encoders: the Generic Feature Encoder (GFE) and the
Advanced Feature Encoder (AFE), both based on Inflated 3D ConvNets (I3D) with
Coordinate Attention (CA) mechanisms, to improve the model's ability to capture
spatiotemporal and channel-specific features. Inspired by developmental
psychology, we present two variants of MPFNet--MPFNet-P and
MPFNet-C--corresponding to two fundamental modes of infant cognitive
development: parallel and hierarchical processing. These variants enable the
evaluation of different strategies for integrating prior knowledge. Extensive
experiments demonstrate that MPFNet significantly improves MER accuracy while
maintaining balanced performance across categories, achieving accuracies of
0.811, 0.924, and 0.857 on the SMIC, CASME II, and SAMM datasets, respectively.
To the best of our knowledge, our approach achieves state-of-the-art
performance on the SMIC and SAMM datasets.
★ AtmosMJ: Revisiting Gating Mechanism for AI Weather Forecasting Beyond the Year Scale
The advent of Large Weather Models (LWMs) has marked a turning point in
data-driven forecasting, with many models now outperforming traditional
numerical systems in the medium range. However, achieving stable, long-range
autoregressive forecasts beyond a few weeks remains a significant challenge.
Prevailing state-of-the-art models that achieve year-long stability, such as
SFNO and DLWP-HPX, have relied on transforming input data onto non-standard
spatial domains like spherical harmonics or HEALPix meshes. This has led to the
prevailing assumption that such representations are necessary to enforce
physical consistency and long-term stability. This paper challenges that
assumption by investigating whether comparable long-range performance can be
achieved on the standard latitude-longitude grid. We introduce AtmosMJ, a deep
convolutional network that operates directly on ERA5 data without any spherical
remapping. The model's stability is enabled by a novel Gated Residual Fusion
(GRF) mechanism, which adaptively moderates feature updates to prevent error
accumulation over long recursive simulations. Our results demonstrate that
AtmosMJ produces stable and physically plausible forecasts for about 500 days.
In quantitative evaluations, it achieves competitive 10-day forecast accuracy
against models like Pangu-Weather and GraphCast, all while requiring a
remarkably low training budget of 5.7 days on a V100 GPU. Our findings suggest
that efficient architectural design, rather than non-standard data
representation, can be the key to unlocking stable and computationally
efficient long-range weather prediction.
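The abstract names the Gated Residual Fusion (GRF) mechanism without giving its
form; a plausible gated-residual structure that moderates how much of each
feature update is written back could look like the sketch below. The gate
parameterization is an assumption, not the paper's architecture.

    import torch
    import torch.nn as nn

    class GatedResidualFusion(nn.Module):
        """Gated residual update: the gate damps the candidate update to limit
        error accumulation over long autoregressive rollouts (illustrative)."""
        def __init__(self, channels):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, x, update):
            # x: current features, update: candidate update from the conv block
            g = self.gate(torch.cat([x, update], dim=1))  # per-pixel, per-channel gate
            return x + g * update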
★ The Four Color Theorem for Cell Instance Segmentation ICML 2025
Cell instance segmentation is critical to analyzing biomedical images, yet
accurately distinguishing tightly touching cells remains a persistent
challenge. Existing instance segmentation frameworks, including
detection-based, contour-based, and distance mapping-based approaches, have
made significant progress, but balancing model performance with computational
efficiency remains an open problem. In this paper, we propose a novel cell
instance segmentation method inspired by the four-color theorem. By
conceptualizing cells as countries and tissues as oceans, we introduce a
four-color encoding scheme that ensures adjacent instances receive distinct
labels. This reformulation transforms instance segmentation into a constrained
semantic segmentation problem with only four predicted classes, substantially
simplifying the instance differentiation process. To solve the training
instability caused by the non-uniqueness of four-color encoding, we design an
asymptotic training strategy and an encoding transformation method. Extensive
experiments on various modalities demonstrate that our approach achieves
state-of-the-art performance. The code is available at
https://github.com/zhangye-zoe/FCIS.
comment: Accepted at ICML 2025
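To illustrate why a four-color encoding suffices, the sketch below recovers
instances from a predicted four-class map by taking connected components within
each color; since adjacent cells receive different colors, touching cells are
separated. The connectivity and any post-processing are assumptions, not the
paper's decoding step.

    import numpy as np
    from scipy.ndimage import label

    def decode_instances(color_map):
        """Recover cell instances from a four-color semantic prediction.

        color_map: (H, W) int array with 0 = background (tissue/"ocean") and
        1..4 = the four color classes.
        """
        instances = np.zeros_like(color_map, dtype=np.int32)
        next_id = 1
        for color in (1, 2, 3, 4):
            comp, n = label(color_map == color)   # connected components per color
            for c in range(1, n + 1):
                instances[comp == c] = next_id
                next_id += 1
        return instances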
★ Non-Contact Health Monitoring During Daily Personal Care Routines
Xulin Ma, Jiankai Tang, Zhang Jiang, Songqin Cheng, Yuanchun Shi, Dong LI, Xin Liu, Daniel McDuff, Xiaojing Liu, Yuntao Wang
Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring
of physiological signals and offers a practical alternative to traditional
health sensing methods. Although rPPG is promising for daily health monitoring,
its application in long-term personal care scenarios, such as mirror-facing
routines in high-altitude environments, remains challenging due to ambient
lighting variations, frequent occlusions from hand movements, and dynamic
facial postures. To address these challenges, we present LADH (Long-term
Altitude Daily Health), the first long-term rPPG dataset containing 240
synchronized RGB and infrared (IR) facial videos from 21 participants across
five common personal care scenarios, along with ground-truth PPG, respiration,
and blood oxygen signals. Our experiments demonstrate that combining RGB and IR
video inputs improves the accuracy and robustness of non-contact physiological
monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate
estimation. Furthermore, we find that multi-task learning enhances performance
across multiple physiological indicators simultaneously. The dataset and code
are available at https://github.com/McJackTang/FusionVitals.
★ Training-Free Voice Conversion with Factorized Optimal Transport
This paper introduces Factorized MKL-VC, a training-free modification for
the kNN-VC pipeline. In contrast with the original pipeline, our algorithm
performs high-quality any-to-any cross-lingual voice conversion with only 5
seconds of reference audio. MKL-VC replaces kNN regression with a factorized
optimal transport map in WavLM embedding subspaces, derived from the
Monge-Kantorovich linear solution. Factorization addresses non-uniform variance across
dimensions, ensuring effective feature transformation. Experiments on
LibriSpeech and FLEURS datasets show MKL-VC significantly improves content
preservation and robustness with short reference audio, outperforming kNN-VC.
MKL-VC achieves performance comparable to FACodec, especially in the
cross-lingual voice conversion setting.
comment: Interspeech 2025
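For reference, the Monge-Kantorovich linear map between two Gaussian feature
distributions has a closed form, sketched below; the paper's contribution is to
factorize such a map over WavLM embedding subspaces, which this generic version
does not include.

    import numpy as np
    from scipy.linalg import sqrtm

    def mkl_map(src_feats, tgt_feats):
        """Monge-Kantorovich linear (Gaussian optimal transport) map.

        Fits T(x) = mu_t + A (x - mu_s) mapping the source feature distribution
        onto the target one, assuming both are Gaussian.
        src_feats, tgt_feats: (n, d) WavLM-style embeddings.
        """
        d = src_feats.shape[1]
        mu_s, mu_t = src_feats.mean(0), tgt_feats.mean(0)
        cov_s = np.cov(src_feats, rowvar=False) + 1e-6 * np.eye(d)
        cov_t = np.cov(tgt_feats, rowvar=False) + 1e-6 * np.eye(d)
        cs_half = sqrtm(cov_s).real
        cs_inv = np.linalg.inv(cs_half)
        A = cs_inv @ sqrtm(cs_half @ cov_t @ cs_half).real @ cs_inv
        return lambda x: mu_t + (x - mu_s) @ A.T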
★ CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings
Mattia Nardon, Mikel Mujika Agirre, Ander González Tomé, Daniel Sedano Algarabel, Josep Rueda Collell, Ana Paola Caro, Andrea Caraffa, Fabio Poiesi, Paul Ian Chippendale, Davide Boscaini
Accurate 6D pose estimation of complex objects in 3D environments is
essential for effective robotic manipulation. Yet, existing benchmarks fall
short in evaluating 6D pose estimation methods under realistic industrial
conditions, as most datasets focus on household objects in domestic settings,
while the few available industrial datasets are limited to artificial setups
with objects placed on tables. To bridge this gap, we introduce CHIP, the first
dataset designed for 6D pose estimation of chairs manipulated by a robotic arm
in a real-world industrial environment. CHIP includes seven distinct chairs
captured using three different RGBD sensing technologies and presents unique
challenges, such as distractor objects with fine-grained differences and severe
occlusions caused by the robotic arm and human operators. CHIP comprises 77,811
RGBD images annotated with ground-truth 6D poses automatically derived from the
robot's kinematics, averaging 11,115 annotations per chair. We benchmark CHIP
using three zero-shot 6D pose estimation methods, assessing performance across
different sensor types, localization priors, and occlusion levels. Results show
substantial room for improvement, highlighting the unique challenges posed by
the dataset. CHIP will be publicly released.
comment: Technical report
★ Towards Practical Alzheimer's Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model
Changwei Wu, Yifei Chen, Yuxin Du, Jinying Zong, Jie Dong, Mingxuan Liu, Yong Peng, Jin Fan, Feiwei Qin, Changmiao Wang
Early diagnosis of Alzheimer's Disease (AD), especially at the mild cognitive
impairment (MCI) stage, is vital yet hindered by subjective assessments and the
high cost of multimodal imaging modalities. Although deep learning methods
offer automated alternatives, their energy inefficiency and computational
demands limit real-world deployment, particularly in resource-constrained
settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are
inherently well-suited for modeling the sparse, event-driven patterns of neural
degeneration in AD, offering a promising foundation for interpretable and
low-power medical diagnostics. However, existing SNNs often suffer from weak
expressiveness and unstable training, which restrict their effectiveness in
complex medical tasks. To address these limitations, we propose FasterSNN, a
hybrid neural architecture that integrates biologically inspired LIF neurons
with region-adaptive convolution and multi-scale spiking attention. This design
enables sparse, efficient processing of 3D MRI while preserving diagnostic
accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves
competitive performance with substantially improved efficiency and stability,
supporting its potential for practical AD screening. Our source code is
available at https://github.com/wuchangw/FasterSNN.
comment: 11 pages, 5 figures
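As background for the hybrid architecture, a plain leaky integrate-and-fire
(LIF) neuron unrolled over time is sketched below; the region-adaptive
convolution, multi-scale spiking attention, and surrogate-gradient training
that FasterSNN adds are omitted, and all constants are illustrative.

    import torch

    def lif_forward(inputs, tau=2.0, v_th=1.0, v_reset=0.0):
        """Leaky integrate-and-fire dynamics over time (illustrative).

        inputs: (T, B, C) input currents over T time steps. Returns binary
        spikes of the same shape.
        """
        v = torch.zeros_like(inputs[0])
        spikes = []
        for t in range(inputs.shape[0]):
            v = v + (inputs[t] - v) / tau          # leaky integration
            s = (v >= v_th).float()                # fire when threshold crossed
            v = torch.where(s.bool(), torch.full_like(v, v_reset), v)  # hard reset
            spikes.append(s)
        return torch.stack(spikes)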
★ Adding simple structure at inference improves Vision-Language Compositionality
Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for
image-text retrieval tasks. However, those models struggle with
compositionality, showing a bag-of-words-like behavior that limits their
retrieval performance. Many different training approaches have been proposed to
improve the vision-language compositionality capabilities of those models. In
comparison, inference-time techniques have received little attention. In this
paper, we propose to add simple structure at inference, where, given an image
and a caption: i) we divide the image into different smaller crops, ii) we
extract text segments capturing objects, attributes, and relations, iii) using
a VLM, we find the image crops that best align with each text segment,
obtaining matches, and iv) we compute the final image-text similarity by aggregating the
individual similarities of the matches. Based on various popular dual encoder
VLMs, we evaluate our approach in controlled and natural datasets for VL
compositionality. We find that our approach consistently improves the
performance of evaluated VLMs without any training, which shows the potential
of inference-time techniques. The results are especially good for
attribute-object binding as shown in the controlled dataset. As a result of an
extensive analysis: i) we show that processing image crops is actually
essential for the observed gains in performance, and ii) we identify specific
areas to further improve inference-time approaches.
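A compact sketch of the four-step inference procedure above; the cropping,
caption-splitting, and scoring functions are passed in as placeholders, and the
aggregation (mean of per-segment maxima) is an assumption rather than the
paper's exact rule.

    def structured_similarity(image, caption, vlm, make_crops, split_caption):
        """Inference-time structured matching (illustrative helper names).

        vlm.score(img, txt) -> similarity from a dual-encoder VLM such as CLIP.
        make_crops(image)   -> list of image crops.
        split_caption(txt)  -> list of text segments (objects, attributes, relations).
        """
        crops = [image] + make_crops(image)    # keep the full image as a candidate
        segments = split_caption(caption)
        # Match each text segment to its best-aligned crop, then aggregate.
        per_segment = [max(vlm.score(c, s) for c in crops) for s in segments]
        return sum(per_segment) / len(per_segment)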
★ Reasoning Models Are More Easily Gaslighted Than You Think
Recent advances in reasoning-centric models promise improved robustness
through mechanisms such as chain-of-thought prompting and test-time scaling.
However, their ability to withstand misleading user input remains
underexplored. In this paper, we conduct a systematic evaluation of three
state-of-the-art reasoning models, i.e., OpenAI's o4-mini, Claude-3.7-Sonnet
and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and
CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average)
following gaslighting negation prompts, indicating that even top-tier reasoning
models struggle to preserve correct answers under manipulative user feedback.
Built upon the insights of the evaluation and to further probe this
vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark
specifically designed to evaluate reasoning models' ability to maintain their
beliefs under gaslighting negation prompts. Constructed by filtering and
curating 1,025 challenging samples from the existing benchmarks,
GaslightingBench-R induces even more dramatic failures, with accuracy drops
exceeding 53% on average. Our findings reveal fundamental limitations in the
robustness of reasoning models, highlighting the gap between step-by-step
reasoning and belief persistence.
★ CINeMA: Conditional Implicit Neural Multi-Modal Atlas for a Spatio-Temporal Representation of the Perinatal Brain
Maik Dannecker, Vasiliki Sideri-Lampretsa, Sophie Starck, Angeline Mihailov, Mathieu Milh, Nadine Girard, Guillaume Auzias, Daniel Rueckert
Magnetic resonance imaging of fetal and neonatal brains reveals rapid
neurodevelopment marked by substantial anatomical changes unfolding within
days. Studying this critical stage of the developing human brain, therefore,
requires accurate brain models, referred to as atlases, of high spatial and
temporal resolution. To meet these demands, established traditional atlases and
recently proposed deep learning-based methods rely on large and comprehensive
datasets. This poses a major challenge for studying brains in the presence of
pathologies for which data remains scarce. We address this limitation with
CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for
creating high-resolution, spatio-temporal, multimodal brain atlases, suitable
for low-data settings. Unlike established methods, CINeMA operates in latent
space, avoiding compute-intensive image registration and reducing atlas
construction times from days to minutes. Furthermore, it enables flexible
conditioning on anatomical features including gestational age (GA), birth age, and pathologies
like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA
supports downstream tasks such as tissue segmentation and age prediction
whereas its generative properties enable synthetic data creation and
anatomically informed data augmentation. Surpassing state-of-the-art methods in
accuracy, efficiency, and versatility, CINeMA represents a powerful tool for
advancing brain research. We release the code and atlases at
https://github.com/m-dannecker/CINeMA.
comment: Work currently under revision for IEEE TMI
★ VideoMat: Extracting PBR Materials from Video Diffusion Models
We leverage finetuned video diffusion models, intrinsic decomposition of
videos, and physically-based differentiable rendering to generate high quality
materials for 3D models given a text prompt or a single image. First, we
condition a video diffusion model to respect the input geometry and lighting
condition, producing multiple views of a given 3D model with coherent material
properties. Second, we use a recent intrinsic decomposition model to extract
intrinsics (base color, roughness, metallic) from the generated video. Finally, we use the intrinsics
alongside the generated video in a differentiable path tracer to robustly
extract PBR materials directly compatible with common content creation tools.
★ Self-Supervised Multi-Part Articulated Objects Modeling via Deformable Gaussian Splatting and Progressive Primitive Segmentation
Haowen Wang, Xiaoping Yuan, Zhao Jin, Zhen Zhao, Zhengping Che, Yousong Xue, Jin Tian, Yakun Huang, Jian Tang
Articulated objects are ubiquitous in everyday life, and accurate 3D
representations of their geometry and motion are critical for numerous
applications. However, in the absence of human annotation, existing approaches
still struggle to build a unified representation for objects that contain
multiple movable parts. We introduce DeGSS, a unified framework that encodes
articulated objects as deformable 3D Gaussian fields, embedding geometry,
appearance, and motion in one compact representation. Each interaction state is
modeled as a smooth deformation of a shared field, and the resulting
deformation trajectories guide a progressive coarse-to-fine part segmentation
that identifies distinct rigid components, all in an unsupervised manner. The
refined field provides a spatially continuous, fully decoupled description of
every part, supporting part-level reconstruction and precise modeling of their
kinematic relationships. To evaluate generalization and realism, we enlarge the
synthetic PartNet-Mobility benchmark and release RS-Art, a real-to-sim dataset
that pairs RGB captures with accurately reverse-engineered 3D models. Extensive
experiments demonstrate that our method outperforms existing methods in both
accuracy and stability.
★ A Cytology Dataset for Early Detection of Oral Squamous Cell Carcinoma
Garima Jain, Sanghamitra Pati, Mona Duggal, Amit Sethi, Abhijeet Patil, Gururaj Malekar, Nilesh Kowe, Jitender Kumar, Jatin Kashyap, Divyajeet Rout, Deepali, Hitesh, Nishi Halduniya, Sharat Kumar, Heena Tabassum, Rupinder Singh Dhaliwal, Sucheta Devi Khuraijam, Sushma Khuraijam, Sharmila Laishram, Simmi Kharb, Sunita Singh, K. Swaminadtan, Ranjana Solanki, Deepika Hemranjani, Shashank Nath Singh, Uma Handa, Manveen Kaur, Surinder Singhal, Shivani Kalhan, Rakesh Kumar Gupta, Ravi. S, D. Pavithra, Sunil Kumar Mahto, Arvind Kumar, Deepali Tirkey, Saurav Banerjee, L. Sreelakshmi
Oral squamous cell carcinoma (OSCC) is a major global health burden,
particularly in several regions across Asia, Africa, and South America, where
it accounts for a significant proportion of cancer cases. Early detection
dramatically improves outcomes, with stage I cancers achieving up to 90 percent
survival. However, traditional diagnosis based on histopathology has limited
accessibility in low-resource settings because it is invasive,
resource-intensive, and reliant on expert pathologists. On the other hand, oral
brush-biopsy cytology offers a minimally invasive and lower-cost alternative,
provided that the remaining challenges, namely inter-observer variability and
the unavailability of expert pathologists, can be addressed using artificial
intelligence. Development and validation of robust AI solutions requires access
to large, labeled, and multi-source datasets to train high capacity models that
generalize across domain shifts. We introduce the first large and multicenter
oral cytology dataset, comprising annotated slides stained with Papanicolaou
(PAP) and May-Grunwald-Giemsa (MGG) protocols, collected from ten tertiary
medical centers in India. The dataset is labeled and annotated by expert
pathologists for cellular anomaly classification and detection, and is designed
to advance AI-driven diagnostic methods. By filling the gap in
publicly available oral cytology datasets, this resource aims to enhance
automated detection, reduce diagnostic errors, and improve early OSCC diagnosis
in resource-constrained settings, ultimately contributing to reduced mortality
and better patient outcomes worldwide.
comment: 7 pages, 2 figures
★ HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios
Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen
Action segmentation is a core challenge in high-level video understanding,
aiming to partition untrimmed videos into segments and assign each a label from
a predefined action set. Existing methods primarily address single-person
activities with fixed action sequences, overlooking multi-person scenarios. In
this work, we pioneer textual reference-guided human action segmentation in
multi-person settings, where a textual description specifies the target person
for segmentation. We introduce the first dataset for Referring Human Action
Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137
fine-grained actions across 33 hours of video data, together with textual descriptions
for this new task. Benchmarking existing action recognition methods on RHAS133
using VLM-based feature extractors reveals limited performance and poor
aggregation of visual cues for the target person. To address this, we propose a
holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF,
leveraging a novel cross-input gate attentional xLSTM to enhance
holistic-partial long-range reasoning and a novel Fourier condition to
introduce more fine-grained control and improve action segmentation quality.
HopaDIFF achieves state-of-the-art results on RHAS133 in diverse
evaluation settings. The code is available at
https://github.com/KPeng9510/HopaDIFF.git.
comment: The code is available at https://github.com/KPeng9510/HopaDIFF.git
★ DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning
Autoencoders empower state-of-the-art image and video generative models by
compressing pixels into a latent space through visual tokenization. Although
recent advances have alleviated the performance degradation of autoencoders
under high compression ratios, addressing the training instability caused by
GAN objectives remains an open challenge. While improving spatial compression, we also aim
to minimize the latent space dimensionality, enabling more efficient and
compact representations. To tackle these challenges, we focus on improving the
decoder's expressiveness. Concretely, we propose DGAE, which employs a
diffusion model to guide the decoder in recovering informative signals that are
not fully decoded from the latent representation. With this design, DGAE
effectively mitigates the performance degradation under high spatial
compression rates. At the same time, DGAE achieves state-of-the-art performance
with a 2x smaller latent space. When integrated with Diffusion Models, DGAE
demonstrates competitive performance on image generation for ImageNet-1K and
shows that this compact latent representation facilitates faster convergence of
the diffusion model.
★ Using Sign Language Production as Data Augmentation to enhance Sign Language Translation
Machine learning models fundamentally rely on large quantities of
high-quality data. Collecting the necessary data for these models can be
challenging due to cost, scarcity, and privacy restrictions. Signed languages
are visual languages used by the deaf community and are considered low-resource
languages. Sign language datasets are often orders of magnitude smaller than
their spoken language counterparts. Sign Language Production is the task of
generating sign language videos from spoken language sentences, while Sign
Language Translation is the reverse translation task. Here, we propose
leveraging recent advancements in Sign Language Production to augment existing
sign language datasets and enhance the performance of Sign Language Translation
models. For this, we utilize three techniques: a skeleton-based approach to
production, sign stitching, and two photo-realistic generative models, SignGAN
and SignSplat. We evaluate the effectiveness of these techniques in enhancing
the performance of Sign Language Translation models by generating variation in
the signer's appearance and the motion of the skeletal data. Our results
demonstrate that the proposed methods can effectively augment existing datasets
and enhance the performance of Sign Language Translation models by up to 19%,
paving the way for more robust and accurate Sign Language Translation systems,
even in resource-constrained environments.
★ FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in
cross-modal understanding and generation by integrating visual and textual
information. While instruction tuning and parameter-efficient fine-tuning
methods have substantially improved the generalization of VLMs, most existing
approaches rely on centralized training, posing challenges for deployment in
domains with strict privacy requirements like healthcare. Recent efforts have
introduced Federated Learning (FL) into VLM fine-tuning to address these
privacy concerns, yet comprehensive benchmarks for evaluating federated
fine-tuning strategies, model architectures, and task generalization remain
lacking. In this work, we present FedVLMBench, the first systematic
benchmark for federated fine-tuning of VLMs. FedVLMBench integrates two
mainstream VLM architectures (encoder-based and encoder-free), four fine-tuning
strategies, five FL algorithms, six multimodal datasets spanning four
cross-domain single-task scenarios and two cross-domain multitask settings,
covering four distinct downstream task categories. Through extensive
experiments, we uncover key insights into the interplay between VLM
architectures, fine-tuning strategies, data heterogeneity, and multi-task
federated optimization. Notably, we find that a 2-layer multilayer perceptron
(MLP) connector with concurrent connector and LLM tuning emerges as the optimal
configuration for encoder-based VLMs in FL. Furthermore, current FL methods
exhibit significantly higher sensitivity to data heterogeneity in
vision-centric tasks than text-centric ones, across both encoder-free and
encoder-based VLM architectures. Our benchmark provides essential tools,
datasets, and empirical guidance for the research community, offering a
standardized platform to advance privacy-preserving, federated training of
multimodal foundation models.
★ HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding
Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based
decisions by enhancing diagnostic accuracy and workflow efficiency. While
multimodal large language models (MLLMs) exhibit promising performance in
visual-language understanding, existing methods mainly focus on 2D medical
images, which fundamentally limits their ability to capture complex 3D
anatomical structures. This limitation often leads to misinterpretation of
subtle pathologies and causes diagnostic hallucinations. In this paper, we
present Hybrid Spatial Encoding Network (HSENet), a framework that exploits
enriched 3D medical visual cues by effective visual perception and projection
for accurate and robust vision-language understanding. Specifically, HSENet
employs dual-3D vision encoders to perceive both global volumetric contexts and
fine-grained anatomical details, which are pre-trained by dual-stage alignment
with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient
multimodal projector that condenses high-resolution 3D spatial regions into a
compact set of informative visual tokens via centroid-based compression. By
assigning spatial packers with dual-3D vision encoders, HSENet can seamlessly
perceive and transfer hybrid visual representations to LLM's semantic space,
facilitating accurate diagnostic text generation. Experimental results
demonstrate that our method achieves state-of-the-art performance in 3D
language-visual retrieval (39.85% of R@100, +5.96% gain), 3D medical report
generation (24.01% of BLEU-4, +8.01% gain), and 3D visual question answering
(73.60% of Major Class Accuracy, +1.99% gain), confirming its effectiveness.
Our code is available at https://github.com/YanzhaoShi/HSENet.
comment: 27 pages, 9 figures. arXiv admin note: text overlap with
arXiv:2410.14200 by other authors
★ ECAM: A Contrastive Learning Approach to Avoid Environmental Collision in Trajectory Forecasting IJCNN 2025
Human trajectory forecasting is crucial in applications such as autonomous
driving, robotics and surveillance. Accurate forecasting requires models to
consider various factors, including social interactions, multi-modal
predictions, pedestrian intention and environmental context. While existing
methods account for these factors, they often overlook the impact of the
environment, which leads to collisions with obstacles. This paper introduces
ECAM (Environmental Collision Avoidance Module), a contrastive learning-based
module to enhance collision avoidance ability with the environment. The
proposed module can be integrated into existing trajectory forecasting models,
improving their ability to generate collision-free predictions. We evaluate our
method on the ETH/UCY dataset and quantitatively and qualitatively demonstrate
its collision avoidance capabilities. Our experiments show that
state-of-the-art methods reduce their collision rate by 40-50% when integrated
with the proposed module. The code is available at
https://github.com/CVML-CFU/ECAM.
comment: IJCNN 2025
★ Consistent Story Generation with Asymmetry Zigzag Sampling
Text-to-image generation models have made significant progress in producing
high-quality images from textual descriptions, yet they continue to struggle
with maintaining subject consistency across multiple images, a fundamental
requirement for visual storytelling. Existing methods attempt to address this
by either fine-tuning models on large-scale story visualization datasets, which
is resource-intensive, or by using training-free techniques that share
information across generations, which still yield limited success. In this
paper, we introduce a novel training-free sampling strategy called Zigzag
Sampling with Asymmetric Prompts and Visual Sharing to enhance subject
consistency in visual story generation. Our approach employs a zigzag sampling
mechanism that alternates asymmetric prompts to retain subject characteristics,
while a visual sharing module transfers visual cues across generated images to
further enforce consistency. Experimental results, based
on both quantitative metrics and qualitative evaluations, demonstrate that our
method significantly outperforms previous approaches in generating coherent and
consistent visual stories. The code is available at
https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.
comment: 17 pages, 9 figures
★ SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields
Holistic 3D scene understanding, which jointly models geometry, appearance,
and semantics, is crucial for applications like augmented reality and robotic
interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM)
are limited to extracting language-based semantics from scenes, failing to
achieve holistic scene comprehension. Additionally, they suffer from
low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene
optimization methods rely on dense input views, which reduces practicality and
increases complexity during deployment. In this paper, we propose
SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which
unifies 3D Gaussians with latent semantic attributes for joint
geometry-appearance-semantics modeling. To predict the semantic anisotropic
Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a
cost volume representation that stores cross-view feature similarities,
enhancing coherent and accurate scene comprehension. Leveraging a two-stage
distillation framework, SemanticSplat reconstructs a holistic multi-modal
semantic feature field from sparse-view images. Experiments demonstrate the
effectiveness of our method for 3D scene understanding tasks like promptable
and open-vocabulary segmentation. Video results are available at
https://semanticsplat.github.io.
★ AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions
Chain-of-Thought (CoT) reasoning has emerged as a powerful approach to
enhance the structured, multi-step decision-making capabilities of Multi-Modal
Large Models (MLLMs), and is particularly crucial for autonomous driving in
adverse weather conditions and complex traffic environments. However, existing
benchmarks have largely overlooked the need for rigorous evaluation of CoT
processes in these specific and challenging scenarios. To address this critical
gap, we introduce AD^2-Bench, the first Chain-of-Thought benchmark specifically
designed for autonomous driving with adverse weather and complex scenes.
AD^2-Bench is meticulously constructed to fulfill three key criteria:
comprehensive data coverage across diverse adverse environments, fine-grained
annotations that support multi-step reasoning, and a dedicated evaluation
framework tailored for assessing CoT performance. The core contribution of
AD^2-Bench is its extensive collection of over 5.4k high-quality, manually
annotated CoT instances. Each intermediate reasoning step in these annotations
is treated as an atomic unit with explicit ground truth, enabling unprecedented
fine-grained analysis of MLLMs' inferential processes under text-level,
point-level, and region-level visual prompts. Our comprehensive evaluation of
state-of-the-art MLLMs on AD^2-Bench reveals accuracy below 60%, highlighting
the benchmark's difficulty and the need to advance robust, interpretable
end-to-end autonomous driving systems. AD^2-Bench thus provides a standardized
evaluation platform, driving research forward by improving MLLMs' reasoning in
autonomous driving, making it an invaluable resource.
★ GLD-Road: A global-local decoding road network extraction model for remote sensing images
Road networks are crucial for mapping, autonomous driving, and disaster
response. While manual annotation is costly, deep learning offers efficient
extraction. Current methods fall into three categories: postprocessing-based
(prone to errors), global parallel (fast but prone to missing nodes), and local
iterative (accurate but slow). We
propose GLD-Road, a two-stage model combining global efficiency and local
precision. First, it detects road nodes and connects them via a Connect Module.
Then, it iteratively refines broken roads using local searches, drastically
reducing computation. Experiments show GLD-Road outperforms state-of-the-art
methods, improving APLS by 1.9% (City-Scale) and 0.67% (SpaceNet3). It also
reduces retrieval time by 40% vs. Sat2Graph (global) and 92% vs. RNGDet++
(local). The experimental results are available at
https://github.com/ucas-dlg/GLD-Road.
★ Enhancing Human-Robot Collaboration: A Sim2Real Domain Adaptation Algorithm for Point Cloud Segmentation in Industrial Environments
The robust interpretation of 3D environments is crucial for human-robot
collaboration (HRC) applications, where safety and operational efficiency are
paramount. Semantic segmentation plays a key role in this context by enabling a
precise and detailed understanding of the environment. Given the scarcity of
the annotated real-world industrial data essential for effective semantic
segmentation, this paper introduces a pioneering approach to
Sim2Real domain adaptation for semantic segmentation of 3D point cloud data,
specifically tailored for HRC. Our focus is on developing a network that
robustly transitions from simulated environments to real-world applications,
thereby enhancing its practical utility and impact on a safe HRC.
In this work, we propose a dual-stream network architecture (FUSION)
combining Dynamic Graph Convolutional Neural Networks (DGCNN) and Convolutional
Neural Networks (CNN) augmented with residual layers as a Sim2Real domain
adaptation algorithm for industrial environments. Evaluated on real-world HRC
setups and simulated industrial point clouds, the proposed model achieved
state-of-the-art performance with a segmentation accuracy of 97.76% and
superior robustness compared to existing methods.
comment: Preprint, Journal of Intelligent & Robotic Systems
★ 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection
This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection
approach that effectively handles single- and multi-view RGB images in indoor
and outdoor environments, showcasing its general-purpose applicability. The key
challenge for image-based 3D object detection tasks is the lack of 3D geometric
cues, which leads to ambiguity in establishing correspondences between images
and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D
geometric representations in both explicit and implicit manners based on
predicted depth information. Specifically, we utilize the predicted depth to
learn voxel occupancy and optimize the voxelized 3D feature volume explicitly
through the proposed voxel occupancy attention. To further enhance 3D
awareness, the feature volume is integrated with an implicit 3D representation,
the truncated signed distance function (TSDF). Without requiring supervision
from 3D signals, we significantly improve the model's comprehension of 3D
geometry by leveraging intermediate 3D representations and achieve end-to-end
training. Our approach surpasses the performance of state-of-the-art
image-based methods on both single- and multi-view benchmark datasets across
diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D
dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19
AP3D@0.7 improvement on the KITTI dataset. The project page is available at:
https://cindy0725.github.io/3DGeoDet/.
comment: Accepted by IEEE Transactions on Multimedia
★ AngleRoCL: Angle-Robust Concept Learning for Physically View-Invariant T2I Adversarial Patches
Cutting-edge works have demonstrated that text-to-image (T2I) diffusion
models can generate adversarial patches that mislead state-of-the-art object
detectors in the physical world, revealing detectors' vulnerabilities and
risks. However, these methods neglect the T2I patches' attack effectiveness
when observed from different views in the physical world (i.e., angle
robustness of the T2I adversarial patches). In this paper, we study the angle
robustness of T2I adversarial patches comprehensively, revealing their
angle-robustness issues, demonstrating that text prompts significantly affect
the angle robustness of generated patches, and showing that task-specific
linguistic instructions fail to enhance it. Motivated by these studies, we introduce
Angle-Robust Concept Learning (AngleRoCL), a simple and flexible approach that
learns a generalizable concept (i.e., text embeddings in implementation)
representing the capability of generating angle-robust patches. The learned
concept can be incorporated into textual prompts and guides T2I models to
generate patches with their attack effectiveness inherently resistant to
viewpoint variations. Through extensive simulation and physical-world
experiments on five SOTA detectors across multiple views, we demonstrate that
AngleRoCL significantly enhances the angle robustness of T2I adversarial
patches compared to baseline methods. Our patches maintain high attack success
rates even under challenging viewing conditions, with over 50% average relative
improvement in attack effectiveness across multiple angles. This research
advances the understanding of physically angle-robust patches and provides
insights into the relationship between textual concepts and physical properties
in T2I-generated content.
★ Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance
field rendering, but it typically requires millions of redundant Gaussian
primitives, overwhelming memory and rendering budgets. Existing compaction
approaches address this by pruning Gaussians based on heuristic importance
scores, without global fidelity guarantee. To bridge this gap, we propose a
novel optimal transport perspective that casts 3DGS compaction as global
Gaussian mixture reduction. Specifically, we first minimize the composite
transport divergence over a KD-tree partition to produce a compact geometric
representation, and then decouple appearance from geometry by fine-tuning color
and opacity attributes with far fewer Gaussian primitives. Experiments on
benchmark datasets show that our method (i) yields negligible loss in rendering
quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10% Gaussians;
and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques.
Notably, our method is applicable to any stage of vanilla or accelerated 3DGS
pipelines, providing an efficient and agnostic pathway to lightweight neural
rendering.
comment: 18 pages, 8 figures
★ Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
We present Athena-PRM, a multimodal process reward model (PRM) designed to
evaluate the reward score for each step in solving complex reasoning problems.
Developing high-performance PRMs typically demands significant time and
financial investment, primarily due to the necessity for step-level annotations
of reasoning steps. Conventional automated labeling methods, such as Monte
Carlo estimation, often produce noisy labels and incur substantial
computational costs. To efficiently generate high-quality process-labeled data,
we propose leveraging prediction consistency between weak and strong completers
as a criterion for identifying reliable process labels. Remarkably, Athena-PRM
demonstrates outstanding effectiveness across various scenarios and benchmarks
with just 5,000 samples. Furthermore, we develop two effective strategies
to improve the performance of PRMs: ORM initialization and up-sampling for
negative data. We validate our approach in three specific scenarios:
verification for test time scaling, direct evaluation of reasoning step
correctness, and reward-ranked fine-tuning. Our Athena-PRM consistently
achieves superior performance across multiple benchmarks and scenarios.
Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances
performance by 10.2 points on WeMath and 7.1 points on MathVista for test time
scaling. Furthermore, Athena-PRM sets the state-of-the-art (SoTA) results in
VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score,
showcasing its robust capability to accurately assess the correctness of the
reasoning step. Additionally, utilizing Athena-PRM as the reward model, we
develop Athena-7B with reward-ranked fine-tuning, which outperforms the
baseline by a significant margin on five benchmarks.
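A minimal sketch of the weak/strong-completer consistency criterion described
above, assuming each candidate reasoning step has already been rolled out to
final answers by both completers; the function name, the rollout format, and
the agreement threshold are illustrative assumptions rather than the paper's
interface.

    def label_step(weak_rollouts, strong_rollouts, threshold=0.5):
        """Assign a process label to one reasoning step.

        weak_rollouts / strong_rollouts: lists of booleans, one per rollout,
        indicating whether continuing from this step reached a correct answer.
        Returns 1 (good step), 0 (bad step), or None (inconsistent, discard).
        """
        weak_ok = sum(weak_rollouts) / len(weak_rollouts) >= threshold
        strong_ok = sum(strong_rollouts) / len(strong_rollouts) >= threshold
        if weak_ok == strong_ok:           # weak and strong completers agree
            return 1 if strong_ok else 0
        return None                        # disagreement: treat the label as noisy

    # Example: both completers mostly succeed from this step -> positive label.
    print(label_step([True, True, False], [True, True, True]))   # 1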
★ Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance
across various multimodal tasks by integrating visual perception with language
understanding. However, conventional decoding strategies of LVLMs often fail to
successfully utilize visual information, leading to visually ungrounded
responses. While various approaches have been proposed to address this
limitation, they typically require additional training, multi-step inference
procedures, or external model dependencies. This paper introduces ReVisiT, a
simple yet effective decoding method that references vision tokens to guide the
text generation process in LVLMs. Our approach leverages the semantic
information embedded within vision tokens by projecting them into the text
token distribution space, and dynamically selecting the most relevant vision
token at each decoding step through constrained divergence minimization. This
selected vision token is then used to refine the output distribution to better
incorporate visual semantics. Experiments on three LVLM hallucination
benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances
visual grounding with minimal computational overhead. Moreover, our method
achieves competitive or superior results relative to state-of-the-art baselines
while reducing computational costs by up to $2\times$.
comment: Code available at https://github.com/bscho333/ReVisiT
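A minimal sketch of the vision-token-guided refinement described above,
assuming access to the vision-token hidden states and the language-model head
of an LVLM; it projects each vision token into the text-token distribution
space, selects the token closest in KL divergence to the current output
distribution, and blends the two. The blending weight and tensor layout are
assumptions, not the released implementation.

    import torch
    import torch.nn.functional as F

    def revisit_refine(next_token_logits, vision_hidden, lm_head, alpha=0.5):
        """next_token_logits: (V,); vision_hidden: (N, d) vision-token hidden
        states; lm_head: the model's d -> V output projection (nn.Linear)."""
        p_text = F.softmax(next_token_logits, dim=-1)              # (V,)
        p_vis = F.softmax(lm_head(vision_hidden), dim=-1)          # (N, V)
        # KL(p_text || p_vis_i): how well each vision token explains the output.
        kl = (p_text * (p_text.clamp_min(1e-9).log()
                        - p_vis.clamp_min(1e-9).log())).sum(-1)    # (N,)
        best = p_vis[kl.argmin()]                                  # most relevant vision token
        refined = (1 - alpha) * p_text + alpha * best              # refined distribution
        return refined.clamp_min(1e-9).log()                       # refined log-probabilities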
★ HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene
Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Chengxuan Qian, Juyuan Kang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang
Reconstructing dynamic 3D scenes from monocular videos remains a fundamental
challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time
rendering in static settings, extending it to dynamic scenes is challenging due
to the difficulty of learning structured and temporally consistent motion
representations. This challenge often manifests as three limitations in
existing methods: redundant Gaussian updates, insufficient motion supervision,
and weak modeling of complex non-rigid deformations. These issues collectively
hinder coherent and efficient dynamic reconstruction. To address these
limitations, we propose HAIF-GS, a unified framework that enables structured
and consistent dynamic modeling through sparse anchor-driven deformation. It
first identifies motion-relevant regions via an Anchor Filter to suppress
redundant updates in static areas. A self-supervised Induced Flow-Guided
Deformation module induces anchor motion using multi-frame feature aggregation,
eliminating the need for explicit flow labels. To further handle fine-grained
deformations, a Hierarchical Anchor Propagation mechanism increases anchor
resolution based on motion complexity and propagates multi-level
transformations. Extensive experiments on synthetic and real-world benchmarks
validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in
rendering quality, temporal coherence, and reconstruction efficiency.
★ Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals
Gaussian and Laplacian entropy models have proven effective in learned point
cloud attribute compression, as they assist in arithmetic coding of latents.
However, we demonstrate through experiments that there is still unutilized
information in entropy parameters estimated by neural networks in current
methods, which can be used for more accurate probability estimation. Thus we
introduce a generalized Gaussian entropy model, which controls the tail shape
through a shape parameter to more accurately estimate the probability of latents.
Meanwhile, to the best of our knowledge, existing methods use fixed likelihood
intervals for each integer during arithmetic coding, which limits model
performance. We propose Mean Error Discriminator (MED) to determine whether the
entropy parameter estimation is accurate and then dynamically adjust likelihood
intervals. Experiments show that our method significantly improves
rate-distortion (RD) performance on three VAE-based models for point cloud
attribute compression, and our method can be applied to other compression
tasks, such as image and video compression.
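A minimal sketch of how a generalized Gaussian entropy model can assign
per-integer likelihoods for arithmetic coding, using scipy's gennorm (the
shape parameter beta controls the tail; beta = 2 recovers a Gaussian). The
fixed half-unit likelihood interval below is the conventional choice that the
proposed MED module would adjust dynamically, which is not reproduced here.

    import numpy as np
    from scipy.stats import gennorm

    def integer_likelihoods(symbols, mu, scale, beta):
        """P(symbol) under a generalized Gaussian with fixed unit-width
        likelihood intervals [s - 0.5, s + 0.5]."""
        symbols = np.asarray(symbols, dtype=float)
        upper = gennorm.cdf(symbols + 0.5, beta, loc=mu, scale=scale)
        lower = gennorm.cdf(symbols - 0.5, beta, loc=mu, scale=scale)
        return np.clip(upper - lower, 1e-9, 1.0)

    # beta = 2 recovers a Gaussian; beta < 2 gives heavier tails.
    probs = integer_likelihoods([-1, 0, 1, 2], mu=0.3, scale=1.2, beta=1.5)
    bits = -np.log2(probs).sum()   # ideal code length for these latents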
★ DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects
Transparent and reflective objects in everyday environments pose significant
challenges for depth sensors due to their unique visual properties, such as
specular reflections and light transmission. These characteristics often lead
to incomplete or inaccurate depth estimation, which severely impacts downstream
geometry-based vision tasks, including object recognition, scene
reconstruction, and robotic manipulation. To address the issue of missing depth
information in transparent and reflective objects, we propose DCIRNet, a novel
multimodal depth completion network that effectively integrates RGB images and
depth maps to enhance depth estimation quality. Our approach incorporates an
innovative multimodal feature fusion module designed to extract complementary
information between RGB images and incomplete depth maps. Furthermore, we
introduce a multi-stage supervision and depth refinement strategy that
progressively improves depth completion and effectively mitigates the issue of
blurred object boundaries. We integrate our depth completion model into
dexterous grasping frameworks and achieve a $44\%$ improvement in the grasp
success rate for transparent and reflective objects. We conduct extensive
experiments on public datasets, where DCIRNet demonstrates superior
performance. The experimental results validate the effectiveness of our
approach and confirm its strong generalization capability across various
transparent and reflective objects.
★ Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
We introduce TransDiff, the first image generation model that marries
Autoregressive (AR) Transformer with diffusion models. In this joint modeling
framework, TransDiff encodes labels and images into high-level semantic
features and employs a diffusion model to estimate the distribution of image
samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms
other image generation models based on standalone AR Transformer or diffusion
models. Specifically, TransDiff achieves a Fr\'echet Inception Distance (FID)
of 1.61 and an Inception Score (IS) of 293.4, and delivers 2x faster inference
than state-of-the-art methods based on AR Transformers and 112x faster
inference than diffusion-only models. Furthermore,
building on the TransDiff model, we introduce a novel image generation paradigm
called Multi-Reference Autoregression (MRAR), which performs autoregressive
generation by predicting the next image. MRAR enables the model to reference
multiple previously generated images, thereby facilitating the learning of more
diverse representations and improving the quality of generated images in
subsequent iterations. By applying MRAR, the performance of TransDiff is
improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open
up a new frontier in the field of image generation.
★ TinySplat: Feedforward Approach for Generating Compact 3D Scene Representation
The recent development of feedforward 3D Gaussian Splatting (3DGS) presents a
new paradigm to reconstruct 3D scenes. Using neural networks trained on
large-scale multi-view datasets, it can directly infer 3DGS representations
from sparse input views. Although the feedforward approach achieves high
reconstruction speed, it still suffers from the substantial storage cost of 3D
Gaussians. Existing 3DGS compression methods relying on scene-wise optimization
are not applicable due to architectural incompatibilities. To overcome this
limitation, we propose TinySplat, a complete feedforward approach for
generating compact 3D scene representations. Built upon standard feedforward
3DGS methods, TinySplat integrates a training-free compression framework that
systematically eliminates key sources of redundancy. Specifically, we introduce
View-Projection Transformation (VPT) to reduce geometric redundancy by
projecting geometric parameters into a more compact space. We further present
Visibility-Aware Basis Reduction (VABR), which mitigates perceptual redundancy
by aligning feature energy along dominant viewing directions via basis
transformation. Lastly, spatial redundancy is addressed through an
off-the-shelf video codec. Comprehensive experimental results on multiple
benchmark datasets demonstrate that TinySplat achieves over 100x compression
for 3D Gaussian data generated by feedforward methods. Compared to the
state-of-the-art compression approach, we achieve comparable quality with only
6% of the storage size. Meanwhile, our compression framework requires only 25%
of the encoding time and 1% of the decoding time.
★ Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imageries
Historical satellite imagery, such as mid-20$^{th}$ century Keyhole data,
offers rare insights into understanding early urban development and long-term
transformation. However, severe quality degradation (e.g., distortion,
misalignment, and spectral scarcity) and annotation absence have long hindered
semantic segmentation on such historical RS imagery. To bridge this gap and
enhance understanding of urban development, we introduce
$\textbf{Urban1960SatBench}$, an annotated segmentation dataset based on
historical satellite imagery with the earliest observation time among all
existing segmentation datasets, along with a benchmark framework for
unsupervised segmentation tasks, $\textbf{Urban1960SatUSM}$. First,
$\textbf{Urban1960SatBench}$ serves as a novel, expertly annotated semantic
segmentation dataset built on mid-20$^{th}$ century Keyhole imagery, covering
1,240 km$^2$ and key urban classes (buildings, roads, farmland, water). As the
earliest segmentation dataset of its kind, it provides a pioneering benchmark
for historical urban understanding. Second,
$\textbf{Urban1960SatUSM}$ (Unsupervised Segmentation Model) is a novel
unsupervised semantic segmentation framework for historical RS imagery. It
employs a confidence-aware alignment mechanism and focal-confidence loss based
on a self-supervised learning architecture, which generates robust
pseudo-labels and adaptively prioritizes prediction difficulty and label
reliability to improve unsupervised segmentation on noisy historical data
without manual supervision. Experiments show Urban1960SatUSM significantly
outperforms existing unsupervised segmentation methods on Urban1960SatSeg for
segmenting historical urban scenes, showing promise for paving the way toward
quantitative studies of long-term urban change using modern computer vision.
Our benchmark and supplementary material are available at
https://github.com/Tianxiang-Hao/Urban1960SatSeg.
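A minimal sketch of a focal-style loss weighted by pseudo-label confidence, in
the spirit of the focal-confidence loss described above; the confidence map is
assumed to come from the self-supervised pseudo-labeling stage, and the
exponent and weighting are illustrative rather than the paper's exact
formulation.

    import torch
    import torch.nn.functional as F

    def focal_confidence_loss(logits, pseudo_labels, confidence, gamma=2.0):
        """logits: (B, C, H, W); pseudo_labels: (B, H, W) long;
        confidence: (B, H, W) values in [0, 1]."""
        ce = F.cross_entropy(logits, pseudo_labels, reduction="none")   # (B, H, W)
        p_t = torch.exp(-ce)                       # probability assigned to the pseudo-label
        focal = (1.0 - p_t) ** gamma * ce          # focus training on hard pixels
        return (confidence * focal).mean()         # trust reliable pseudo-labels more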
★ Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning CVPR 2025
In-context learning (ICL), a predominant trend in instruction learning, aims
at enhancing the performance of large language models by providing clear task
guidance and examples, improving their capability in task understanding and
execution. This paper investigates ICL on Large Vision-Language Models (LVLMs)
and explores the policies of multi-modal demonstration selection. Existing
research efforts in ICL face significant challenges: First, they rely on
pre-defined demonstrations or heuristic selection strategies based on human
intuition, which are usually inadequate for covering diverse task requirements,
leading to sub-optimal solutions; Second, individually selecting each
demonstration fails in modeling the interactions between them, resulting in
information redundancy. Unlike these prevailing efforts, we propose a new
exploration-exploitation reinforcement learning framework, which explores
policies to fuse multi-modal information and adaptively select adequate
demonstrations as an integrated whole. The framework allows LVLMs to optimize
themselves by continually refining their demonstrations through
self-exploration, enabling them to autonomously identify and generate
the most effective selection policies for in-context learning. Experimental
results verify the superior performance of our approach on four Visual
Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing
the generalization capability of few-shot LVLMs.
comment: 10 pages, 6 figures, CVPR 2025
★ Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing
Multi-Object Tracking (MOT) plays a crucial role in autonomous driving
systems, as it lays the foundations for advanced perception and precise path
planning modules. Nonetheless, single-agent MOT suffers from limited sensing
of the surroundings due to occlusions, sensor failures, etc. Hence, the
integration of multi-agent information is essential for a comprehensive understanding of the
environment. This paper proposes a novel Cooperative MOT framework for tracking
objects in 3D LiDAR scene by formulating and solving a graph topology-aware
optimization problem so as to fuse information coming from multiple vehicles.
By exploiting a fully connected graph topology defined by the detected bounding
boxes, we employ the Graph Laplacian processing optimization technique to
smooth the position error of bounding boxes and effectively combine them. In
that manner, we reveal and leverage inherent coherences of diverse multi-agent
detections, and associate the refined bounding boxes to tracked objects at two
stages, optimizing localization and tracking accuracies. An extensive
evaluation study has been conducted, using the real-world V2V4Real dataset,
where the proposed method significantly outperforms the baseline frameworks,
including the state-of-the-art deep-learning DMSTrack and V2V4Real, in various
testing sequences.
comment: 2025 IEEE International Conference on Multimedia and Expo Workshops,
3DMM - 3D Multimedia Analytics, Search and Generation
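A minimal sketch of graph-Laplacian smoothing of fused box centers, assuming a
fully connected graph over the detections with Gaussian edge weights; the
closed-form Tikhonov-style solve below is a simplified version of the
topology-aware optimization described above and omits the two-stage
association.

    import numpy as np

    def laplacian_smooth(centers, sigma=2.0, lam=0.5):
        """centers: (N, 3) box centers gathered from multiple vehicles.
        Returns smoothed centers trading data fidelity against graph smoothness:
            argmin_X ||X - centers||^2 + lam * tr(X^T L X)
        """
        d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * sigma ** 2))
        np.fill_diagonal(W, 0.0)                 # fully connected, no self-loops
        L = np.diag(W.sum(1)) - W                # graph Laplacian
        return np.linalg.solve(np.eye(len(centers)) + lam * L, centers)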
★ Evidential Deep Learning with Spectral-Spatial Uncertainty Disentanglement for Open-Set Hyperspectral Domain Generalization
Open-set domain generalization (OSDG) for hyperspectral image classification
presents significant challenges due to the presence of unknown classes in
target domains and the need for models to generalize across multiple unseen
domains without target-specific adaptation. Existing domain adaptation methods
assume access to target domain data during training and fail to address the
fundamental issue of domain shift when unknown classes are present, leading to
negative transfer and reduced classification performance. To address these
limitations, we propose a novel open-set domain generalization framework that
combines four key components: Spectrum-Invariant Frequency Disentanglement
(SIFD) for domain-agnostic feature extraction, Dual-Channel Residual Network
(DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning
(EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty
Disentanglement (SSUD) for reliable open-set classification. The SIFD module
extracts domain-invariant spectral features in the frequency domain through
attention-weighted frequency analysis and domain-agnostic regularization, while
DCRN captures complementary spectral and spatial information via parallel
pathways with adaptive fusion. EDL provides principled uncertainty estimation
using Dirichlet distributions, enabling the SSUD module to make reliable
open-set decisions through uncertainty-aware pathway weighting and adaptive
rejection thresholding. Experimental results on three cross-scene hyperspectral
classification tasks show that our approach achieves performance comparable to
state-of-the-art domain adaptation methods while requiring no access to the
target domain during training. The implementation will be made available at
https://github.com/amir-khb/SSUDOSDG upon acceptance.
★ Harmonizing and Merging Source Models for CLIP-based Domain Generalization
CLIP-based domain generalization aims to improve model generalization to
unseen domains by leveraging the powerful zero-shot classification capabilities
of CLIP and multiple source datasets. Existing methods typically train a single
model across multiple source domains to capture domain-shared information.
However, this paradigm inherently suffers from two types of conflicts: 1)
sample conflicts, arising from noisy samples and extreme domain shifts among
sources; and 2) optimization conflicts, stemming from competition and
trade-offs during multi-source training. Both hinder the generalization and
lead to suboptimal solutions. Recent studies have shown that model merging can
effectively mitigate the competition of multi-objective optimization and
improve generalization performance. Inspired by these findings, we propose
Harmonizing and Merging (HAM), a novel source model merging framework for
CLIP-based domain generalization. During the training process of the source
models, HAM enriches the source samples while excluding conflicting ones, and
harmonizes the update directions of all models. Then, a redundancy-aware
historical model merging method is introduced to effectively integrate
knowledge across all source models. HAM comprehensively consolidates source
domain information while enabling mutual enhancement among source models,
ultimately yielding a final model with optimal generalization capabilities.
Extensive experiments on five widely used benchmark datasets demonstrate the
effectiveness of our approach, achieving state-of-the-art performance.
★ TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
We address the problem of video question answering (video QA) with temporal
grounding in a weakly supervised setup, without any temporal annotations. Given
a video and a question, we generate an open-ended answer grounded with the
start and end time. For this task, we propose TOGA: a vision-language model for
Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune
TOGA to jointly generate the answer and the temporal grounding. We operate in a
weakly supervised setup where the temporal grounding annotations are not
available. We generate pseudo labels for temporal grounding and ensure the
validity of these labels by imposing a consistency constraint between the
question of a grounding response and the response generated by a question
referring to the same temporal segment. We notice that jointly generating the
answers with the grounding improves performance on question answering as well
as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For
grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate
weakly supervised grounded question answering. For open-ended QA, we consider
the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art
performance for both tasks on these benchmarks.
★ A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning
Transformer-based models have achieved strong performance in remote sensing
image captioning by capturing long-range dependencies and contextual
information. However, their practical deployment is hindered by high
computational costs, especially in multi-modal frameworks that employ separate
transformer-based encoders and decoders. In addition, existing remote sensing
image captioning models primarily focus on high-level semantic extraction while
often overlooking fine-grained structural features such as edges, contours, and
object boundaries. To address these challenges, a lightweight transformer
architecture is proposed by reducing the dimensionality of the encoder layers
and employing a distilled version of GPT-2 as the decoder. A knowledge
distillation strategy is used to transfer knowledge from a more complex teacher
model to improve the performance of the lightweight network. Furthermore, an
edge-aware enhancement strategy is incorporated to enhance image representation
and object boundary understanding, enabling the model to capture fine-grained
spatial details in remote sensing images. Experimental results demonstrate that
the proposed approach significantly improves caption quality compared to
state-of-the-art methods.
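A minimal sketch of the soft-target distillation objective such a
teacher-student setup typically uses, assuming teacher and student caption
logits over a shared vocabulary; the temperature, the weighting, and the
edge-aware components are illustrative and not taken from the paper.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
        """Soft-target KL (scaled by T^2) combined with the usual cross-entropy."""
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, targets)
        return alpha * soft + (1 - alpha) * hard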
★ A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang
Recent advancements in Large Multimodal Models (LMMs) have significantly
improved multimodal understanding and generation. However, these models still
struggle to generate tightly interleaved image-text outputs, primarily due to
the limited scale, quality and instructional richness of current training
datasets. To address this, we introduce InterSyn, a large-scale multimodal
dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR)
method. InterSyn features multi-turn, instruction-driven dialogues with tightly
interleaved image-text responses, providing rich object diversity and rigorous
automated quality refinement, making it well-suited for training
next-generation instruction-following LMMs. Furthermore, to address the lack of
reliable evaluation tools capable of assessing interleaved multimodal outputs,
we introduce SynJudge, an automatic evaluation model designed to quantitatively
assess multimodal outputs along four dimensions: text content, image content,
image quality, and image-text synergy.
Experimental studies show that the SEIR method leads to substantially higher
dataset quality compared to an otherwise identical process without refinement.
Moreover, LMMs trained on InterSyn achieve uniform performance gains across
all evaluation metrics, confirming InterSyn's utility for advancing multimodal
systems.
★ ODG: Occupancy Prediction Using Dual Gaussians
3D occupancy provides fine-grained 3D geometry and semantics for scene
understanding which is critical for autonomous driving. Most existing methods,
however, carry high compute costs, requiring dense 3D feature volume and
cross-attention to effectively aggregate information. More recent works have
adopted Bird's Eye View (BEV) or sparse points as scene representation with
much reduced cost, but still suffer from their respective shortcomings. More
concretely, BEV struggles with small objects that often experience significant
information loss after being projected to the ground plane. On the other hand,
points can flexibly model small objects in 3D, but are inefficient at capturing
flat surfaces or large objects. To address these challenges, in this paper, we
present a novel 3D occupancy prediction approach, ODG, which combines BEV and
sparse points based representations. We propose a dual-branch design: a
query-based sparse points branch and a BEV branch. The 3D information learned
in the sparse points branch is shared with the BEV stream via cross-attention,
which enriches the weakened signals of difficult objects on the BEV plane. The
outputs of both branches are finally fused to generate predicted 3D occupancy.
We conduct extensive experiments on the Occ3D-nuScenes and Occ3D-Waymo
benchmarks that demonstrate the superiority of our proposed ODG. Moreover, ODG
also delivers competitive inference speed when compared to the latest efficient
approaches.
★ Noise Conditional Variational Score Distillation
Xinyu Peng, Ziyang Zheng, Yaoming Wang, Han Li, Nuowen Kan, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
We propose Noise Conditional Variational Score Distillation (NCVSD), a novel
method for distilling pretrained diffusion models into generative denoisers. We
achieve this by revealing that the unconditional score function implicitly
characterizes the score function of denoising posterior distributions. By
integrating this insight into the Variational Score Distillation (VSD)
framework, we enable scalable learning of generative denoisers capable of
approximating samples from the denoising posterior distribution across a wide
range of noise levels. The proposed generative denoisers exhibit desirable
properties that allow fast generation while preserving the benefits of iterative
refinement: (1) fast one-step generation through sampling from pure Gaussian
noise at high noise levels; (2) improved sample quality by scaling the
test-time compute with multi-step sampling; and (3) zero-shot probabilistic
inference for flexible and controllable sampling. We evaluate NCVSD through
extensive experiments, including class-conditional image generation and inverse
problem solving. By scaling the test-time compute, our method outperforms
teacher diffusion models and is on par with consistency models of larger sizes.
Additionally, with significantly fewer NFEs than diffusion-based methods, we
achieve record-breaking LPIPS on inverse problems.
★ Synthetic Human Action Video Data Generation with Pose Transfer
In video understanding tasks, particularly those involving human motion,
synthetic data generation often suffers from uncanny features, diminishing its
effectiveness for training. Tasks such as sign language translation, gesture
recognition, and human motion understanding in autonomous driving have thus
been unable to exploit the full potential of synthetic data. This paper
proposes a method for generating synthetic human action video data using pose
transfer (specifically, controllable 3D Gaussian avatar models). We evaluate
this method on the Toyota Smarthome and NTU RGB+D datasets and show that it
improves performance in action recognition tasks. Moreover, we demonstrate that
the method can effectively scale few-shot datasets, making up for groups
underrepresented in the real training data and adding diverse backgrounds. We
open-source the method along with RANDOM People, a dataset with videos and
avatars of novel human identities for pose transfer crowd-sourced from the
internet.
★ SRPL-SFDA: SAM-Guided Reliable Pseudo-Labels for Source-Free Domain Adaptation in Medical Image Segmentation
Domain Adaptation (DA) is crucial for robust deployment of medical image
segmentation models when applied to new clinical centers with significant
domain shifts. Source-Free Domain Adaptation (SFDA) is appealing as it can deal
with privacy concerns and access constraints on source-domain data during
adaptation to target-domain data. However, SFDA faces challenges such as
insufficient supervision in the target domain with unlabeled images. In this
work, we propose a Segment Anything Model (SAM)-guided Reliable Pseudo-Labels
method for SFDA (SRPL-SFDA) with three key components: 1) Test-Time Tri-branch
Intensity Enhancement (T3IE) that not only improves the quality of raw
pseudo-labels in the target domain, but also leads to SAM-compatible inputs
with three channels to better leverage SAM's zero-shot inference ability for
refining the pseudo-labels; 2) A reliable pseudo-label selection module that
rejects low-quality pseudo-labels based on Consistency of Multiple SAM Outputs
(CMSO) under input perturbations with T3IE; and 3) A reliability-aware training
procedure in the unlabeled target domain where reliable pseudo-labels are used
for supervision and unreliable parts are regularized by entropy minimization.
Experiments conducted on two multi-domain medical image segmentation datasets
for the fetal brain and the prostate, respectively, demonstrate that: 1) SRPL-SFDA
effectively enhances pseudo-label quality in the unlabeled target domain, and
improves SFDA performance by leveraging the reliability-aware training; 2)
SRPL-SFDA outperforms state-of-the-art SFDA methods, and its performance is
close to that of supervised training in the target domain. The code of this
work is available online: https://github.com/HiLab-git/SRPL-SFDA.
comment: 18 pages, 4 figures. Accepted for publication in Neurocomputing
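A minimal sketch of a consistency check in the spirit of CMSO, assuming a list
of binary masks produced by SAM under different input perturbations; a
pseudo-label is kept only when the masks agree, measured by mean pairwise
Dice. The threshold and perturbation scheme are placeholders, not the paper's
settings.

    import numpy as np
    from itertools import combinations

    def dice(a, b, eps=1e-6):
        inter = np.logical_and(a, b).sum()
        return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

    def is_reliable(masks, tau=0.85):
        """masks: list of binary arrays from SAM under input perturbations."""
        scores = [dice(a, b) for a, b in combinations(masks, 2)]
        return float(np.mean(scores)) >= tau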
★ Improving Out-of-Distribution Detection via Dynamic Covariance Calibration
Out-of-Distribution (OOD) detection is essential for the trustworthiness of
AI systems. Methods using prior information (i.e., subspace-based methods) have
shown strong performance by extracting information geometry to detect OOD
data with a more appropriate distance metric. However, these methods fail to
address the geometry distorted by ill-distributed samples, due to the
limitation of statically extracting information geometry from the training
distribution. In this paper, we argue that the influence of ill-distributed
samples can be corrected by dynamically adjusting the prior geometry in
response to new data. Based on this insight, we propose a novel approach that
dynamically updates the prior covariance matrix using real-time input features,
refining its information. Specifically, we reduce the covariance along the
direction of real-time input features and constrain adjustments to the residual
space, thus preserving essential data characteristics and avoiding effects on
unintended directions in the principal space. We evaluate our method on two
pre-trained models for the CIFAR dataset and five pre-trained models for
ImageNet-1k, including the self-supervised DINO model. Extensive experiments
demonstrate that our approach significantly enhances OOD detection across
various models. The code is released at https://github.com/workerbcd/ooddcc.
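A minimal sketch of the dynamic adjustment described above, assuming a
precomputed training-feature mean and covariance; each incoming feature is
projected onto the residual space (the complement of the top principal
directions), the covariance is mildly shrunk along that projected direction,
and the sample is scored with a Mahalanobis distance. The rank count and step
size are illustrative assumptions.

    import numpy as np

    class DynamicCovOOD:
        def __init__(self, mean, cov, n_principal=64, step=0.1):
            self.mean, self.cov, self.step = mean, cov.copy(), step
            _, eigvec = np.linalg.eigh(cov)
            self.P = eigvec[:, -n_principal:]          # top principal directions

        def score(self, x):
            # Restrict the adjustment to the residual space of the prior geometry.
            r = x - self.mean
            r_res = r - self.P @ (self.P.T @ r)
            norm = np.linalg.norm(r_res)
            if norm > 1e-8:
                u = (r_res / norm)[:, None]
                # Mildly shrink the covariance along this input's residual direction.
                self.cov -= self.step * float(u.T @ self.cov @ u) * (u @ u.T)
                self.cov = 0.5 * (self.cov + self.cov.T)   # keep it symmetric
            d = x - self.mean
            return float(d @ np.linalg.solve(self.cov, d))  # Mahalanobis-style OOD score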
♻ ★ Do Multiple Instance Learning Models Transfer? ICML 2025
Multiple Instance Learning (MIL) is a cornerstone approach in computational
pathology (CPath) for generating clinically meaningful slide-level embeddings
from gigapixel tissue images. However, MIL often struggles with small, weakly
supervised clinical datasets. In contrast to fields such as NLP and
conventional computer vision, where transfer learning is widely used to address
data scarcity, the transferability of MIL models remains poorly understood. In
this study, we systematically evaluate the transfer learning capabilities of
pretrained MIL models by assessing 11 models across 21 pretraining tasks for
morphological and molecular subtype prediction. Our results show that
pretrained MIL models, even when trained on different organs than the target
task, consistently outperform models trained from scratch. Moreover,
pretraining on pancancer datasets enables strong generalization across organs
and tasks, outperforming slide foundation models while using substantially less
pretraining data. These findings highlight the robust adaptability of MIL
models and demonstrate the benefits of leveraging transfer learning to boost
performance in CPath. Lastly, we provide a resource that standardizes the
implementation of MIL models and a collection of pretrained model weights on
popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab
comment: ICML 2025 (Spotlight). 20 pages, 8 figures
♻ ★ SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping
Recent studies on Visual Autoregressive (VAR) models have highlighted that
high-frequency components, or later steps, in the generation process contribute
disproportionately to inference latency. However, the underlying computational
redundancy involved in these steps has yet to be thoroughly investigated. In
this paper, we conduct an in-depth analysis of the VAR inference process and
identify two primary sources of inefficiency: step redundancy and unconditional
branch redundancy. To address step redundancy, we propose an automatic
step-skipping strategy that selectively omits unnecessary generation steps to
improve efficiency. For unconditional branch redundancy, we observe that the
information gap between the conditional and unconditional branches is minimal.
Leveraging this insight, we introduce unconditional branch replacement, a
technique that bypasses the unconditional branch to reduce computational cost.
Notably, we observe that the effectiveness of acceleration strategies varies
significantly across different samples. Motivated by this, we propose SkipVAR,
a sample-adaptive framework that leverages frequency information to dynamically
select the most suitable acceleration strategy for each instance. To evaluate
the role of high-frequency information, we introduce high-variation benchmark
datasets that test model sensitivity to fine details. Extensive experiments
show SkipVAR achieves over 0.88 average SSIM with up to 1.81x overall
acceleration and 2.62x speedup on the GenEval benchmark, maintaining model
quality. These results confirm the effectiveness of frequency-aware,
training-free adaptive acceleration for scalable autoregressive image
generation. Our code is available at https://github.com/fakerone-li/SkipVAR and
has been publicly released.
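A minimal sketch of a frequency-aware skipping decision in the spirit of
SkipVAR, assuming a per-sample intermediate (low-resolution) generation is
available; the high-frequency energy ratio of its spectrum decides whether the
remaining high-frequency steps can be skipped. The cutoff and threshold are
illustrative.

    import numpy as np

    def high_freq_ratio(img, cutoff=0.25):
        """Fraction of spectral energy outside a centered low-frequency box."""
        spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
        h, w = spec.shape
        ch, cw = int(h * cutoff), int(w * cutoff)
        low = spec[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw].sum()
        return 1.0 - low / spec.sum()

    def should_skip_late_steps(intermediate_img, threshold=0.05):
        # Smooth samples (little high-frequency content) can skip later VAR steps.
        return high_freq_ratio(intermediate_img) < threshold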
♻ ★ MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis
José Morano, Botond Fazekas, Emese Sükei, Ronald Fecso, Taha Emre, Markus Gumpinger, Georg Faustmann, Marzieh Oghbaie, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Artificial intelligence (AI) has become a fundamental tool for assisting
clinicians in analyzing ophthalmic images, such as optical coherence tomography
(OCT). However, developing AI models often requires extensive annotation, and
existing models tend to underperform on independent, unseen data. Foundation
models (FMs), large AI models trained on vast unlabeled datasets, have shown
promise in overcoming these challenges. Nonetheless, available FMs for
ophthalmology lack extensive validation, especially for segmentation tasks, and
focus on a single imaging modality. In this context, we propose MIRAGE, a novel
multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO)
images. Additionally, we propose a new evaluation benchmark with OCT/SLO
classification and segmentation tasks. The comparison with general and
specialized FMs and segmentation methods shows the superiority of MIRAGE in
both types of tasks, highlighting its suitability as a basis for the
development of robust AI systems for retinal OCT image analysis. Both MIRAGE
and the evaluation benchmark are publicly available:
https://github.com/j-morano/MIRAGE.
♻ ★ Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Jingguo Qu, Xinyang Han, Tonghuan Xiao, Jia Ai, Juan Wu, Tong Zhao, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying
Medical ultrasonography is an essential imaging technique for examining
superficial organs and tissues, including lymph nodes, breast, and thyroid. It
employs high-frequency ultrasound waves to generate detailed images of the
internal structures of the human body. However, manually contouring regions of
interest in these images is a labor-intensive task that demands expertise and
often results in inconsistent interpretations among individuals.
Vision-language foundation models, which have excelled in various computer
vision applications, present new opportunities for enhancing ultrasound image
analysis. Yet, their performance is hindered by the significant differences
between natural and medical imaging domains. This research seeks to overcome
these challenges by developing domain adaptation methods for vision-language
foundation models. In this study, we explore the fine-tuning pipeline for
vision-language foundation models by utilizing a large language model as a
text refiner, together with specially designed adaptation strategies and task-driven heads. Our
approach has been extensively evaluated on six ultrasound datasets and two
tasks: segmentation and classification. The experimental results show that our
method can effectively improve the performance of vision-language foundation
models for ultrasound image analysis, and outperform the existing
state-of-the-art vision-language and pure foundation models. The source code of
this study is available at https://github.com/jinggqu/NextGen-UIA.
♻ ★ Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought
Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, Shanghang Zhang
Video content comprehension is essential for various applications, ranging
from video analysis to interactive systems. Despite advancements in large-scale
vision-language models (VLMs), these models often struggle to capture the
nuanced, spatiotemporal details essential for thorough video analysis. To
address this gap, we introduce Video-CoT, a groundbreaking dataset designed to
enhance spatiotemporal understanding using Chain-of-Thought (CoT)
methodologies. Video-CoT contains 192,000 fine-grained spatiotemporal
question-answer pairs and 23,000 high-quality CoT-annotated samples, providing
a solid foundation for evaluating spatiotemporal understanding in video
comprehension. Additionally, we provide a comprehensive benchmark for assessing
these tasks, with each task featuring 750 images and tailored evaluation
metrics. Our extensive experiments reveal that current VLMs face significant
challenges in achieving satisfactory performance, highlighting the
difficulties of effective spatiotemporal understanding. Overall, the Video-CoT
dataset and benchmark open new avenues for research in multimedia understanding
and support future innovations in intelligent systems requiring advanced video
analysis capabilities. By making these resources publicly available, we aim to
encourage further exploration in this critical area. Project
website: https://video-cot.github.io/.
♻ ★ Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting
Self-supervised learning (SSL) for point cloud pre-training has become a
cornerstone for many 3D vision tasks, enabling effective learning from
large-scale unannotated data. At the scene level, existing SSL methods often
incorporate volume rendering into the pre-training framework, using RGB-D
images as reconstruction signals to facilitate cross-modal learning. This
strategy promotes alignment between 2D and 3D modalities and enables the model
to benefit from rich visual cues in the RGB-D inputs. However, these approaches
are limited by their reliance on implicit scene representations and high memory
demands. Furthermore, since their reconstruction objectives are applied only in
2D space, they often fail to capture underlying 3D geometric structures. To
address these challenges, we propose Gaussian2Scene, a novel scene-level SSL
framework that leverages the efficiency and explicit nature of 3D Gaussian
Splatting (3DGS) for pre-training. The use of 3DGS not only alleviates the
computational burden associated with volume rendering but also supports direct
3D scene reconstruction, thereby enhancing the geometric understanding of the
backbone network. Our approach follows a progressive two-stage training
strategy. In the first stage, a dual-branch masked autoencoder learns both 2D
and 3D scene representations. In the second stage, we initialize training with
reconstructed point clouds and further supervise learning using the geometric
locations of Gaussian primitives and rendered RGB images. This process
reinforces both geometric and cross-modal learning. We demonstrate the
effectiveness of Gaussian2Scene across several downstream 3D object detection
tasks, showing consistent improvements over existing pre-training methods.
♻ ★ Geometric deep learning for local growth prediction on abdominal aortic aneurysm surfaces
Dieuwertje Alblas, Patryk Rygiel, Julian Suk, Kaj O. Kappe, Marieke Hofman, Christoph Brune, Kak Khee Yeung, Jelmer M. Wolterink
Abdominal aortic aneurysms (AAAs) are progressive focal dilatations of the
abdominal aorta. AAAs may rupture, with a survival rate of only 20\%. Current
clinical guidelines recommend elective surgical repair when the maximum AAA
diameter exceeds 55 mm in men or 50 mm in women. Patients that do not meet
these criteria are periodically monitored, with surveillance intervals based on
the maximum AAA diameter. However, this diameter does not take into account the
complex relation between the 3D AAA shape and its growth, making standardized
intervals potentially unfit. Personalized AAA growth predictions could improve
monitoring strategies. We propose to use an SE(3)-symmetric transformer model
to predict AAA growth directly on the vascular model surface enriched with
local, multi-physical features. In contrast to other works which have
parameterized the AAA shape, this representation preserves the vascular
surface's anatomical structure and geometric fidelity. We train our model using
a longitudinal dataset of 113 computed tomography angiography (CTA) scans of 24
AAA patients at irregularly sampled intervals. After training, our model
predicts AAA growth to the next scan moment with a median diameter error of
1.18 mm. We further demonstrate our model's utility to identify whether a
patient will become eligible for elective repair within two years (acc = 0.93).
Finally, we evaluate our model's generalization on an external validation set
consisting of 25 CTAs from 7 AAA patients from a different hospital. Our
results show that local directional AAA growth prediction from the vascular
surface is feasible and may contribute to personalized surveillance strategies.
♻ ★ Autonomous Imagination: Closed-Loop Decomposition of Visual-to-Textual Conversion in Visual Reasoning for Multimodal Large Language Models
Jingming Liu, Yumeng Li, Boyuan Xiao, Yichang Jian, Ziang Qin, Tianjia Shao, Yao-Xiang Ding, Kun Zhou
Under pure textual modality, Large Language Models (LLMs) have demonstrated
remarkable success in complex reasoning tasks by decomposing them into simpler
sub-problems. However, Multimodal Large Language Models (MLLMs) still struggle
with some seemingly straightforward visual tasks, such as counting and solving
jigsaw puzzles. We argue that these tasks challenge the ability of
visual-to-textual conversion, where MLLMs convert visual information perceived
from the input scene into textual information for further reasoning and
generating the answer. If the complexity of the visual input is beyond the
perceptual capability of the MLLMs, without decomposing this conversion
process, simply scaling inference-time reasoning cannot solve the task because
it repeatedly encounters the same perceptual bottleneck. We propose an
approach, autonomous imagination, to enable MLLMs to iteratively modify visual
inputs (e.g. isolating objects, rearranging puzzle pieces) into intermediate
visual states, decomposing visual-to-textual conversion into closed-loop visual
modification steps. We show that, without any retraining, MLLMs can now solve
tasks initially beyond their perceptual capability, highlighting that
closed-loop visual modification can be an effective way of decomposing the
visual reasoning task into solvable substeps. Project page:
https://future-item.github.io/autoimagine-site/
♻ ★ ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts
Scientific fact-checking has mostly focused on text and tables, overlooking
scientific charts, which are key for presenting quantitative evidence and
statistical reasoning. We introduce ClimateViz, the first large-scale benchmark
for scientific fact-checking using expert-curated scientific charts. ClimateViz
contains 49,862 claims linked to 2,896 visualizations, each labeled as support,
refute, or not enough information. To improve interpretability, each example
includes structured knowledge graph explanations covering trends, comparisons,
and causal relations. We evaluate state-of-the-art multimodal language models,
including both proprietary and open-source systems, in zero-shot and few-shot
settings. Results show that current models struggle with chart-based reasoning:
even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to
77.8 percent accuracy in label-only settings, far below human performance (89.3
and 92.7 percent). Explanation-augmented outputs improve performance in some
models. We released our dataset and code alongside the paper.
♻ ★ Beyond Calibration: Physically Informed Learning for Raw-to-Raw Mapping
Achieving consistent color reproduction across multiple cameras is essential
for seamless image fusion and Image Processing Pipeline (ISP) compatibility in
modern devices, but it is a challenging task due to variations in sensors and
optics. Existing raw-to-raw conversion methods face limitations such as poor
adaptability to changing illumination, high computational costs, or impractical
requirements such as simultaneous camera operation and overlapping
fields-of-view. We introduce the Neural Physical Model (NPM), a lightweight,
physically-informed approach that simulates raw images under specified
illumination to estimate transformations between devices. The NPM effectively
adapts to varying illumination conditions, can be initialized with physical
measurements, and supports training with or without paired data. Experiments on
public datasets like NUS and BeyondRGB demonstrate that NPM outperforms recent
state-of-the-art methods, providing robust chromatic consistency across
different sensors and optical systems.
♻ ★ RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation
Creating recipe images is a key challenge in food computing, with
applications in culinary education and multimodal recipe assistants. However,
existing datasets lack fine-grained alignment between recipe goals, step-wise
instructions, and visual content. We present RecipeGen, the first large-scale,
real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video
(I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes,
196,724 images, and 4,491 videos, covering diverse ingredients, cooking
procedures, styles, and dish types. We further propose domain-specific
evaluation metrics to assess ingredient fidelity and interaction modeling,
benchmark representative T2I, I2V, and T2V models, and provide insights for
future recipe generation models. Project page is available now.
comment: This is an extended version of arXiv:2503.05228
♻ ★ Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
Multimodal Large Language Models (MLLMs) have demonstrated impressive
capabilities in understanding common visual elements, largely due to their
large-scale datasets and advanced training strategies. However, their
effectiveness in medical applications remains limited due to the inherent
discrepancies between data and tasks in medical scenarios and those in the
general domain. Concretely, existing medical MLLMs face the following critical
limitations: (1) limited coverage of medical knowledge beyond imaging, (2)
heightened susceptibility to hallucinations due to suboptimal data curation
processes, (3) lack of reasoning capabilities tailored for complex medical
scenarios. To address these challenges, we first propose a comprehensive data
curation procedure that (1) efficiently acquires rich medical knowledge data
not only from medical imaging but also from extensive medical texts and
general-domain data; and (2) synthesizes accurate medical captions, visual
question answering (VQA), and reasoning samples. As a result, we build a
multimodal dataset enriched with extensive medical knowledge. Building on the
curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu
undergoes multi-stage training to embed medical expertise and enhance its
task-solving capabilities progressively. Besides, we preliminarily explore the
potential of applying reinforcement learning with verifiable rewards paradigm
to enhance Lingshu's medical reasoning ability. Additionally, we develop
MedEvalKit, a unified evaluation framework that consolidates leading multimodal
and textual medical benchmarks for standardized, fair, and efficient model
assessment. We evaluate the performance of Lingshu on three fundamental medical
tasks, multimodal QA, text-based QA, and medical report generation. The results
show that Lingshu consistently outperforms the existing open-source multimodal
models on most tasks ...
comment: Technical Report, 53 pages, 25 tables, and 16 figures
♻ ★ Human-like object concept representations emerge naturally in multimodal large language models
Changde Du, Kaicheng Fu, Bincheng Wen, Yi Sun, Jie Peng, Wei Wei, Ying Gao, Shengpei Wang, Chuncheng Zhang, Jinpeng Li, Shuang Qiu, Le Chang, Huiguang He
Understanding how humans conceptualize and categorize natural objects offers
critical insights into perception and cognition. With the advent of Large
Language Models (LLMs), a key question arises: can these models develop
human-like object representations from linguistic and multimodal data? In this
study, we combined behavioral and neuroimaging analyses to explore the
relationship between object concept representations in LLMs and human
cognition. We collected 4.7 million triplet judgments from LLMs and Multimodal
LLMs (MLLMs) to derive low-dimensional embeddings that capture the similarity
structure of 1,854 natural objects. The resulting 66-dimensional embeddings
were stable, predictive, and exhibited semantic clustering similar to human
mental representations. Remarkably, the dimensions underlying these embeddings
were interpretable, suggesting that LLMs and MLLMs develop human-like
conceptual representations of objects. Further analysis showed strong alignment
between model embeddings and neural activity patterns in brain regions such as
EBA, PPA, RSC, and FFA. This provides compelling evidence that the object
representations in LLMs, while not identical to human ones, share fundamental
similarities that reflect key aspects of human conceptual knowledge. Our
findings advance the understanding of machine intelligence and inform the
development of more human-like artificial cognitive systems.
comment: Published in Nature Machine Intelligence
♻ ★ MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding
Different medical imaging modalities capture diagnostic information at
varying spatial resolutions, from coarse global patterns to fine-grained
localized structures. However, most existing vision-language frameworks in the
medical domain apply a uniform strategy for local feature extraction,
overlooking the modality-specific demands. In this work, we present MedMoE, a
modular and extensible vision-language processing framework that dynamically
adapts visual representation based on the diagnostic context. MedMoE
incorporates a Mixture-of-Experts (MoE) module conditioned on the report type,
which routes multi-scale image features through specialized expert branches
trained to capture modality-specific visual semantics. These experts operate
over feature pyramids derived from a Swin Transformer backbone, enabling
spatially adaptive attention to clinically relevant regions. This framework
produces localized visual representations aligned with textual descriptions,
without requiring modality-specific supervision at inference. Empirical results
on diverse medical benchmarks demonstrate that MedMoE improves alignment and
retrieval performance across imaging modalities, underscoring the value of
modality-specialized visual representations in clinical vision-language
systems.
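A minimal sketch of report-type-conditioned expert routing in the spirit of
MedMoE, assuming flattened multi-scale image features and a small set of
modality experts; the Swin backbone, feature pyramid, and text alignment are
omitted, and all module names are placeholders.

    import torch
    import torch.nn as nn

    class ReportConditionedMoE(nn.Module):
        def __init__(self, feat_dim, n_experts=4, n_report_types=5):
            super().__init__()
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                              nn.Linear(feat_dim, feat_dim))
                for _ in range(n_experts)])
            # The router is conditioned only on the report type (imaging modality).
            self.router = nn.Embedding(n_report_types, n_experts)

        def forward(self, feats, report_type):
            """feats: (B, N, feat_dim); report_type: (B,) long tensor."""
            gates = torch.softmax(self.router(report_type), dim=-1)       # (B, E)
            out = torch.stack([e(feats) for e in self.experts], dim=1)    # (B, E, N, D)
            return (gates[:, :, None, None] * out).sum(1)                 # weighted fusion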
♻ ★ Spectral Image Tokenizer
Image tokenizers map images to sequences of discrete tokens, and are a
crucial component of autoregressive transformer-based image generation. The
tokens are typically associated with spatial locations in the input image,
arranged in raster scan order, which is not ideal for autoregressive modeling.
In this paper, we propose to tokenize the image spectrum instead, obtained from
a discrete wavelet transform (DWT), such that the sequence of tokens represents
the image in a coarse-to-fine fashion. Our tokenizer brings several advantages:
1) it leverages that natural images are more compressible at high frequencies,
2) it can take and reconstruct images of different resolutions without
retraining, 3) it improves the conditioning for next-token prediction --
instead of conditioning on a partial line-by-line reconstruction of the image,
it takes a coarse reconstruction of the full image, 4) it enables partial
decoding where the first few generated tokens can reconstruct a coarse version
of the image, 5) it enables autoregressive models to be used for image
upsampling. We evaluate the tokenizer reconstruction metrics as well as
multiscale image generation, text-guided image upsampling and editing.
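A minimal sketch of coarse-to-fine spectral tokenization, assuming a grayscale
image and a uniform scalar quantizer per subband in place of the paper's
learned tokenizer; PyWavelets returns the approximation band first, so
emitting subbands in order already yields a coarse-to-fine token sequence
whose prefix reconstructs a coarse version of the image.

    import numpy as np
    import pywt

    def spectral_tokens(img, wavelet="haar", levels=3, step=8.0):
        """Quantize DWT subbands into integer tokens, coarsest first."""
        coeffs = pywt.wavedec2(img, wavelet, level=levels)
        tokens = [np.round(coeffs[0] / step).astype(int).ravel()]     # approximation band
        for (ch, cv, cd) in coeffs[1:]:                               # finer details last
            for band in (ch, cv, cd):
                tokens.append(np.round(band / step).astype(int).ravel())
        return np.concatenate(tokens)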
♻ ★ Fine-Grained Spatially Varying Material Selection in Images
Julia Guerrero-Viu, Michael Fischer, Iliyan Georgiev, Elena Garces, Diego Gutierrez, Belen Masia, Valentin Deschaintre
Selection is the first step in many image editing processes, enabling faster
and simpler modifications of all pixels sharing a common modality. In this
work, we present a method for material selection in images, robust to lighting
and reflectance variations, which can be used for downstream editing tasks. We
rely on vision transformer (ViT) models and leverage their features for
selection, proposing a multi-resolution processing strategy that yields finer
and more stable selection results than prior methods. Furthermore, we enable
selection at two levels: texture and subtexture, leveraging a new two-level
material selection (DuMaS) dataset which includes dense annotations for over
800,000 synthetic images, both on the texture and subtexture levels.
♻ ★ Understanding Long Videos with Multimodal Language Models ICLR 2025
Large Language Models (LLMs) have allowed recent LLM-based approaches to
achieve excellent performance on long-video understanding benchmarks. We
investigate how extensive world knowledge and strong reasoning skills of
underlying LLMs influence this strong performance. Surprisingly, we discover
that LLM-based approaches can yield remarkably good accuracy on long-video
tasks with limited video information, sometimes even with no video specific
information. Building on this, we explore injecting video-specific information
into an LLM-based framework. We utilize off-the-shelf vision tools to extract
three object-centric information modalities from videos, and then leverage
natural language as a medium for fusing this information. Our resulting
Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art
performance across multiple video understanding benchmarks. Strong performance
on robotics-domain tasks also establishes its generality. Code:
https://github.com/kahnchana/mvu
comment: 17 pages (main paper), 7 pages appendix. ICLR 2025 conference paper
♻ ★ HRTR: A Single-stage Transformer for Fine-grained Sub-second Action Segmentation in Stroke Rehabilitation
Stroke rehabilitation often demands precise tracking of patient movements to
monitor progress, with the complexity of rehabilitation exercises presenting two
critical challenges: fine-grained and sub-second (under one second) action
detection. In this work, we propose the High Resolution Temporal Transformer
(HRTR), to time-localize and classify high-resolution (fine-grained),
sub-second actions in a single-stage transformer, eliminating the need for
multi-stage methods and post-processing. Without any refinements, HRTR
outperforms state-of-the-art systems on both stroke related and general
datasets, achieving Edit Score (ES) of 70.1 on StrokeRehab Video, 69.4 on
StrokeRehab IMU, and 88.4 on 50Salads.
♻ ★ TerraMind: Large-Scale Generative Multimodality for Earth Observation
Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé
We present TerraMind, the first any-to-any generative, multimodal foundation
model for Earth observation (EO). Unlike other multimodal models, TerraMind is
pretrained on dual-scale representations combining both token-level and
pixel-level data across modalities. On a token level, TerraMind encodes
high-level contextual information to learn cross-modal relationships, while on
a pixel level, TerraMind leverages fine-grained representations to capture
critical spatial nuances. We pretrained TerraMind on nine geospatial modalities
of a global, large-scale dataset. In this paper, we demonstrate that (i)
TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and
few-shot applications for Earth observation, (ii) TerraMind introduces
"Thinking-in-Modalities" (TiM) -- the capability of generating additional
artificial data during finetuning and inference to improve the model output --
and (iii) TerraMind achieves performance beyond the state of the art on
community-standard EO benchmarks such as PANGAEA. The pretraining dataset, the
model weights, and our code are open-sourced under a permissive license.
♻ ★ RS-MTDF: Multi-Teacher Distillation and Fusion for Remote Sensing Semi-Supervised Semantic Segmentation
Semantic segmentation in remote sensing images is crucial for various
applications, yet its performance is heavily reliant on large-scale,
high-quality pixel-wise annotations, which are notoriously expensive and
time-consuming to acquire. Semi-supervised semantic segmentation (SSS) offers a
promising alternative to mitigate this data dependency. However, existing SSS
methods often struggle with the inherent distribution mismatch between limited
labeled data and abundant unlabeled data, leading to suboptimal generalization.
To alleviate this issue, we introduce Vision Foundation Models (VFMs),
pre-trained on vast and diverse datasets, into the SSS task, since VFMs
possess robust generalization capabilities that can effectively bridge this
distribution gap and provide strong semantic priors for SSS. Building on this
insight, we propose RS-MTDF (Multi-Teacher Distillation and Fusion), a novel framework
that leverages the powerful semantic knowledge embedded in VFMs to guide
semi-supervised learning in remote sensing. Specifically, RS-MTDF employs
multiple frozen VFMs (e.g., DINOv2 and CLIP) as expert teachers, utilizing
feature-level distillation to align student features with their robust
representations. To further enhance discriminative power, the distilled
knowledge is seamlessly fused into the student decoder. Extensive experiments
on three challenging remote sensing datasets demonstrate that RS-MTDF
consistently achieves state-of-the-art performance. Notably, our method
outperforms existing approaches across various label ratios on LoveDA and
secures the highest IoU in the majority of semantic categories. These results
underscore the efficacy of multi-teacher VFM guidance in significantly
enhancing both generalization and semantic understanding for remote sensing
segmentation. Ablation studies further validate the contribution of each
proposed module.
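The feature-level distillation described above, aligning student features with several frozen VFM teachers, can be sketched as follows. The projection heads, cosine-distance loss, and loss weighting are assumptions for illustration, not the released RS-MTDF code.

```python
# Sketch of multi-teacher feature-level distillation (assumed form, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistill(nn.Module):
    """Aligns student features with multiple frozen VFM teachers via projection heads."""
    def __init__(self, student_dim: int, teacher_dims: list[int]):
        super().__init__()
        # One linear projection per teacher so feature dimensions may differ (an assumption).
        self.projs = nn.ModuleList([nn.Linear(student_dim, d) for d in teacher_dims])

    def forward(self, student_feat: torch.Tensor, teacher_feats: list[torch.Tensor]) -> torch.Tensor:
        # student_feat: (B, N, C_s); teacher_feats[k]: (B, N, C_k); teachers are frozen.
        loss = 0.0
        for proj, t_feat in zip(self.projs, teacher_feats):
            s = F.normalize(proj(student_feat), dim=-1)
            t = F.normalize(t_feat.detach(), dim=-1)
            loss = loss + (1.0 - (s * t).sum(dim=-1)).mean()  # cosine distance per token
        return loss / len(self.projs)

# Usage: total = seg_loss + lambda_distill * distiller(student_tokens, [dino_tokens, clip_tokens])
```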
♻ ★ MVTamperBench: Evaluating Robustness of Vision-Language Models
Amit Agarwal, Srikant Panda, Angeline Charles, Bhargava Kumar, Hitesh Patel, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Hansa Meghwani, Karan Gupta, Dong-Kyu Chae
Multimodal Large Language Models (MLLMs), a recent advancement of
Vision-Language Models (VLMs), have driven major advances in video
understanding. However, their vulnerability to adversarial tampering and
manipulations remains underexplored. To address this gap, we introduce
\textbf{MVTamperBench}, a benchmark that systematically evaluates MLLM
robustness against five prevalent tampering techniques: rotation, masking,
substitution, repetition, and dropping, grounded in real-world visual tampering
scenarios such as surveillance interference, social media content edits, and
misinformation injection. MVTamperBench comprises approximately 3.4K original
videos, expanded into approximately 17K tampered clips covering 19 distinct
video manipulation tasks. This benchmark challenges models to detect
manipulations that disrupt spatial and temporal coherence. We evaluate 45
recent MLLMs from 15+ model families. We
reveal substantial variability in resilience across tampering types and show
that larger parameter counts do not necessarily guarantee robustness.
MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs in
safety-critical applications, including detecting clickbait, preventing harmful
content distribution, and enforcing policies on media platforms. We release all
code, data, and benchmark to foster open research in trustworthy video
understanding.
Code: https://amitbcp.github.io/MVTamperBench/ Data:
https://huggingface.co/datasets/Srikant86/MVTamperBench
♻ ★ Traveling Waves Integrate Spatial Information Through Time
Traveling waves of neural activity are widely observed in the brain, but
their precise computational function remains unclear. One prominent hypothesis
is that they enable the transfer and integration of spatial information across
neural populations. However, few computational models have explored how
traveling waves might be harnessed to perform such integrative processing.
Drawing inspiration from the famous "Can one hear the shape of a drum?" problem
-- which highlights how normal modes of wave dynamics encode geometric
information -- we investigate whether similar principles can be leveraged in
artificial neural networks. Specifically, we introduce convolutional recurrent
neural networks that learn to produce traveling waves in their hidden states in
response to visual stimuli, enabling spatial integration. By then treating
these wave-like activation sequences as visual representations themselves, we
obtain a powerful representational space that outperforms local feed-forward
networks on tasks requiring global spatial context. In particular, we observe
that traveling waves effectively expand the receptive field of locally
connected neurons, supporting long-range encoding and communication of
information. We demonstrate that models equipped with this mechanism solve
visual semantic segmentation tasks demanding global integration, significantly
outperforming local feed-forward models and rivaling non-local U-Net models
with fewer parameters. As a first step toward traveling-wave-based
communication and visual representation in artificial networks, our findings
suggest that wave dynamics may provide efficiency and training stability benefits,
while simultaneously offering a new framework for connecting models to
biological recordings of neural activity.
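The mechanism described above, a convolutional recurrent network whose locally coupled hidden states propagate activity across space while the rollout is treated as the visual representation, can be illustrated with a minimal sketch. The architecture, nonlinearity, and rollout length are assumptions, not the authors' model.

```python
# Minimal convolutional recurrent rollout producing a wave-like representation (a sketch).
import torch
import torch.nn as nn

class ConvRNNWaves(nn.Module):
    def __init__(self, channels: int = 16, steps: int = 20):
        super().__init__()
        self.steps = steps
        self.inp = nn.Conv2d(3, channels, 3, padding=1)
        self.rec = nn.Conv2d(channels, channels, 3, padding=1)  # local coupling lets activity propagate

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        drive = self.inp(image)                      # (B, C, H, W) stimulus drive
        h = torch.zeros_like(drive)
        states = []
        for _ in range(self.steps):
            h = torch.tanh(drive + self.rec(h))      # recurrent update; activity travels via the 3x3 kernel
            states.append(h)
        return torch.stack(states, dim=1)            # (B, T, C, H, W) wave-like representation
```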
♻ ★ SpikeSMOKE: Spiking Neural Networks for Monocular 3D Object Detection with Cross-Scale Gated Coding
Low energy consumption for 3D object detection is an important research area
because of the increasing energy consumption with their wide application in
fields such as autonomous driving. The spiking neural networks (SNNs) with
low-power consumption characteristics can provide a novel solution for this
research. Therefore, we apply SNNs to monocular 3D object detection and propose
the SpikeSMOKE architecture in this paper, which is a new attempt for low-power
monocular 3D object detection. However, the discrete signals of SNNs cause
information loss and limit feature expression ability compared with artificial
neural networks (ANNs). To address this issue, inspired by the filtering
mechanism of biological neuronal synapses, we propose a cross-scale gated
coding mechanism (CSGC), which enhances feature representation by combining
cross-scale fusion of attentional methods with gated filtering mechanisms. In
addition, to reduce computation and increase training speed, we present a
novel lightweight residual block that maintains the spiking computing paradigm
and the highest possible detection performance. Compared to the baseline
SpikeSMOKE on 3D object detection, the proposed SpikeSMOKE with CSGC achieves
11.78 (+2.82, Easy), 10.69 (+3.2, Moderate), and 10.48 (+3.17, Hard) on the
KITTI autonomous driving dataset in AP|R11 at a 0.7 IoU threshold. Notably,
SpikeSMOKE significantly reduces energy consumption compared to SMOKE: for
example, energy consumption is reduced by 72.2% on the Hard category, while
detection performance drops by only 4%. SpikeSMOKE-L (lightweight) further
reduces the number of parameters by 3x and computation by 10x compared to
SMOKE.
♻ ★ ContentV: Efficient Training of Video Generation Models with Limited Compute
Wenfeng Lin, Renjie Chen, Boyuan Liu, Shiyue Yan, Ruoyu Feng, Jiangchuan Wei, Yichen Zhang, Yimeng Zhou, Chao Feng, Jiao Ran, Qi Wu, Zuotao Liu, Mingyu Guo
Recent advances in video generation demand increasingly efficient training
recipes to mitigate escalating computational costs. In this report, we present
ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art
performance (85.14 on VBench) after training on 256 x 64GB Neural Processing
Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality
videos across multiple resolutions and durations from text prompts, enabled by
three key innovations: (1) A minimalist architecture that maximizes reuse of
pre-trained image generation models for video generation; (2) A systematic
multi-stage training strategy leveraging flow matching for enhanced efficiency;
and (3) A cost-effective reinforcement learning with human feedback framework
that improves generation quality without requiring additional human
annotations. All the code and models are available at:
https://contentv.github.io.
comment: Project Page: https://contentv.github.io
♻ ★ ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Reasoning over sequences of images remains a challenge for multimodal large
language models (MLLMs). While recent models incorporate multi-image data
during pre-training, they still struggle to recognize sequential structures,
often treating images independently. This work introduces ImageChain, a
framework that enhances MLLMs with sequential reasoning capabilities over image
data by modeling visual sequences as a multi-turn conversation. In ImageChain,
images are interleaved with corresponding textual descriptions to form a
controlled dialogue that explicitly captures temporal dependencies and
narrative progression. Our method optimizes for the task of next-scene
description, where the model generates a context-aware description of an
upcoming scene based on preceding visual and textual cues. We demonstrate that
our approach improves performance on the next-scene description task --
achieving an average improvement from 3.7% to 19% in SimRate, a metric that
quantifies semantic similarity to human-annotated ground truths. Moreover,
ImageChain achieves robust zero-shot out-of-domain performance in applications
ranging from comics to robotics. Extensive experiments validate that
instruction-tuning in a multimodal, multi-turn conversation design is key to
bridging the gap between static image understanding and temporally-aware
reasoning.
comment: Code, dataset, and checkpoints are publicly available at
https://github.com/danaesavi/ImageChain; v2: added human annotation study to
validate SimRate
♻ ★ One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Multi-modal retrieval augmented generation (M-RAG) is instrumental for
inhibiting hallucinations in large multi-modal models (LMMs) through the use of
a factual knowledge base (KB). However, M-RAG introduces new attack vectors for
adversaries that aim to disrupt the system by injecting malicious entries into
the KB. In this paper, we present the first poisoning attack against M-RAG
targeting visual document retrieval applications where the KB contains images
of document pages. We propose two attacks, each of which requires injecting only
a single adversarial image into the KB. Firstly, we propose a universal attack
that, for any potential user query, influences the response to cause a
denial-of-service (DoS) in the M-RAG system. Secondly, we present a targeted
attack against one or a group of user queries, with the goal of spreading
targeted misinformation. For both attacks, we use a multi-objective
gradient-based adversarial approach to craft the injected image while
optimizing for both retrieval and generation. We evaluate our attacks against
several visual document retrieval datasets, a diverse set of state-of-the-art
retrievers (embedding models) and generators (LMMs), demonstrating the attack
effectiveness in both the universal and targeted settings. We additionally
present results including commonly used defenses, various attack
hyper-parameter settings, ablations, and attack transferability.
comment: 19 pages, 7 figures
♻ ★ Unseen Visual Anomaly Generation
Visual anomaly detection (AD) presents significant challenges due to the
scarcity of anomalous data samples. While numerous works have been proposed to
synthesize anomalous samples, these synthetic anomalies often lack authenticity
or require extensive training data, limiting their applicability in real-world
scenarios. In this work, we propose Anomaly Anything (AnomalyAny), a novel
framework that leverages Stable Diffusion (SD)'s image generation capabilities
to generate diverse and realistic unseen anomalies. By conditioning on a single
normal sample during test time, AnomalyAny is able to generate unseen anomalies
for arbitrary object types with text descriptions. Within AnomalyAny, we
propose attention-guided anomaly optimization to direct SD attention on
generating hard anomaly concepts. Additionally, we introduce prompt-guided
anomaly refinement, incorporating detailed descriptions to further improve the
generation quality. Extensive experiments on MVTec AD and VisA datasets
demonstrate AnomalyAny's ability to generate high-quality unseen anomalies
and its effectiveness in enhancing downstream AD performance.
comment: 8 pages excluding supplementary
♻ ★ Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
Existing approaches to drone visual geo-localization predominantly adopt the
image-based setting, where a single drone-view snapshot is matched with images
from other platforms. Such task formulation, however, underutilizes the
inherent video output of the drone and is sensitive to occlusions and viewpoint
disparity. To address these limitations, we formulate a new video-based drone
geo-localization task and propose the Video2BEV paradigm. This paradigm
transforms the video into a Bird's Eye View (BEV), simplifying the subsequent
\textbf{inter-platform} matching process. In particular, we employ Gaussian
Splatting to reconstruct a 3D scene and obtain the BEV projection. Different
from existing transform methods, e.g., the polar transform, our BEVs preserve
more fine-grained details without significant distortion. To facilitate the
discriminative \textbf{intra-platform} representation learning, our Video2BEV
paradigm also incorporates a diffusion-based module for generating hard
negative samples. To validate our approach, we introduce UniV, a new
video-based geo-localization dataset that extends the image-based
University-1652 dataset. UniV features flight paths at $30^\circ$ and
$45^\circ$ elevation angles with increased frame rates of up to 10 frames per
second (FPS). Extensive experiments on the UniV dataset show that our Video2BEV
paradigm achieves competitive recall rates and outperforms conventional
video-based methods. Compared to other competitive methods, our proposed
approach exhibits robustness at lower elevations with more occlusions.
♻ ★ Using Shapley interactions to understand how models use structure ACL 2025
Language is an intricately structured system, and a key goal of NLP
interpretability is to provide methodological insights for understanding how
language models represent this structure internally. In this paper, we use
Shapley Taylor interaction indices (STII) in order to examine how language and
speech models internally relate and structure their inputs. Pairwise Shapley
interactions measure how much two inputs work together to influence model
outputs beyond what linearly adding their independent influences would predict, providing a
view into how models encode structural interactions between inputs. We relate
the interaction patterns in models to three underlying linguistic structures:
syntactic structure, non-compositional semantics, and phonetic coarticulation.
We find that autoregressive text models encode interactions that correlate with
the syntactic proximity of inputs, and that both autoregressive and masked
models encode nonlinear interactions in idiomatic phrases with
non-compositional semantics. Our speech results show that inputs are more
entangled for pairs where a neighboring consonant is likely to influence a
vowel or approximant, showing that models encode the phonetic interaction
needed for extracting discrete phonemic representations.
comment: Published in ACL 2025
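The pairwise interaction quantity underlying the analysis above is, at its core, a second-order finite difference of the model output under masking. The sketch below computes that difference for two token positions; the masking scheme, the assumption that all other tokens stay present, and the scalar model score are illustrative assumptions rather than the paper's exact estimator.

```python
# Sketch of a pairwise interaction as a second-order finite difference (illustrative).
import torch

def pairwise_interaction(model, input_ids: torch.Tensor, i: int, j: int, mask_id: int) -> float:
    """delta = f(x) - f(x without i) - f(x without j) + f(x without i and j)."""
    def score(ids):
        with torch.no_grad():
            # Assumed: the model returns a scalar score (e.g., log-prob of the observed next token).
            return model(ids).item()

    def masked(positions):
        ids = input_ids.clone()
        ids[0, list(positions)] = mask_id
        return ids

    return (score(input_ids) - score(masked({i}))
            - score(masked({j})) + score(masked({i, j})))
```

A large positive or negative value indicates that the two inputs influence the output jointly, beyond their independent contributions.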
♻ ★ TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval AAAI 2025
Cross-modal retrieval maps data across different modalities via semantic
relevance. Existing approaches implicitly assume that data pairs are
well-aligned and ignore the pervasive annotation noise, i.e., noisy
correspondence (NC), which inevitably causes performance degradation. Despite
attempts that employ the co-teaching paradigm with identical architectures to
provide distinct data perspectives, the differences between these
architectures stem primarily from random initialization, so the models become
increasingly homogeneous as training progresses. Consequently, the additional
information brought by this paradigm is severely limited. To resolve this
problem, we introduce a Tripartite
severely limited. In order to resolve this problem, we introduce a Tripartite
learning with Semantic Variation Consistency (TSVC) for robust image-text
retrieval. We design a tripartite cooperative learning mechanism comprising a
Coordinator, a Master, and an Assistant model. The Coordinator distributes
data, and the Assistant model supports the Master model's noisy label
prediction with diverse data. Moreover, we introduce a soft label estimation
method based on mutual information variation, which quantifies the noise in new
samples and assigns corresponding soft labels. We also present a new loss
function to enhance robustness and optimize training effectiveness. Extensive
experiments on three widely used datasets demonstrate that, even at increasing
noise ratios, TSVC exhibits significant advantages in retrieval accuracy and
maintains stable training performance.
comment: This paper has been accepted to the Main Track of AAAI 2025. It
contains 9 pages, 7 figures, and is relevant to the areas of cross-modal
retrieval and machine learning. The work presents a novel approach in robust
image-text retrieval using a tripartite learning framework
♻ ★ SMMT: Siamese Motion Mamba with Self-attention for Thermal Infrared Target Tracking
Thermal infrared (TIR) object tracking often suffers from challenges such as
target occlusion, motion blur, and background clutter, which significantly
degrade the performance of trackers. To address these issues, this paper
proposes a novel Siamese Motion Mamba Tracker (SMMT), which integrates a
bidirectional state-space model and a self-attention mechanism. Specifically,
we introduce the Motion Mamba module into the Siamese architecture to extract
motion features and recover overlooked edge details using bidirectional
modeling and self-attention. We propose a Siamese parameter-sharing strategy
that allows certain convolutional layers to share weights. This approach
reduces computational redundancy while preserving strong feature
representation. In addition, we design a motion edge-aware regression loss to
improve tracking accuracy, especially for motion-blurred targets. Extensive
experiments are conducted on four TIR tracking benchmarks, including
LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017. The results show that SMMT
achieves superior performance in TIR target tracking.
♻ ★ Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations
Reasoning Segmentation (RS) is a multimodal vision-text task that requires
segmenting objects based on implicit text queries, demanding both precise
visual perception and vision-text reasoning capabilities. Current RS approaches
rely on fine-tuning vision-language models (VLMs) for both perception and
reasoning, but their tokenization of images fundamentally disrupts continuous
spatial relationships between objects. We introduce DTwinSeger, a novel RS
approach that leverages Digital Twin (DT) representation as an intermediate
layer to decouple perception from reasoning. Innovatively, DTwinSeger
reformulates RS as a two-stage process, where the first transforms the image
into a structured DT representation that preserves spatial relationships and
semantic properties and then employs a Large Language Model (LLM) to perform
explicit reasoning over this representation to identify target objects. We
propose a supervised fine-tuning method specifically for LLM with DT
representation, together with a corresponding fine-tuning dataset Seg-DT, to
enhance the LLM's reasoning capabilities with DT representations. Experiments
show that our method can achieve state-of-the-art performance on two image RS
benchmarks and three image referring segmentation benchmarks. These results
show that the DT representation functions as an effective bridge between
vision and text, enabling complex multimodal reasoning tasks to be
accomplished solely with an LLM.
comment: This work was submitted without the consent of all co-authors. We
request withdrawal until all parties agree
♻ ★ XMeCap: Meme Caption Generation with Sub-Image Adaptability
Humor, deeply rooted in societal meanings and cultural details, poses a
unique challenge for machines. While advances have been made in natural
language processing, real-world humor often thrives in a multi-modal context,
encapsulated distinctively by memes. This paper places particular emphasis on
the impact of multiple images on meme captioning. We then introduce the
\textsc{XMeCap} framework, a novel approach that adopts supervised fine-tuning
and reinforcement learning based on an innovative reward model, which factors
in both global and local similarities between visuals and text. Our results,
benchmarked against contemporary models, manifest a marked improvement in
caption generation for both single-image and multi-image memes, as well as
different meme categories. \textsc{XMeCap} achieves an average evaluation score
of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming
the best baseline by 6.75\% and 8.56\%, respectively. This research not only
establishes a new frontier in meme-related studies but also underscores the
potential of machines in understanding and generating humor in a multi-modal
setting.
comment: Accepted to ACM Multimedia 2024
♻ ★ LLM2TEA: Agentic AI Designer Finds Innovative Objects with Generative Evolutionary Multitasking
In this paper, we introduce LLM-driven MultiTask Evolutionary Algorithm
(LLM2TEA), the first agentic AI designer within a generative evolutionary
multitasking (GEM) framework that promotes the crossover and synergy of designs
from multiple domains, leading to innovative solutions that transcend
individual disciplines. Of particular interest is the discovery of objects that
are not only innovative but also conform to the physical specifications of the
real world in science and engineering. LLM2TEA comprises a large language model
to initialize a population of genotypes (defined by text prompts) describing
the objects of interest, a text-to-3D generative model to produce phenotypes
from these prompts, a classifier to interpret the semantic representations of
the objects, and a physics simulation model to assess their physical
properties. We propose several novel LLM-based multitask evolutionary operators
to guide the search toward the discovery of high-performing practical objects.
Experimental results in conceptual design optimization validate the
effectiveness of LLM2TEA, revealing a 97\% to 174\% improvement in the
diversity of innovative objects compared to the present text-to-3D generative
model baseline. In addition, more than 73\% of the generated designs
outperform the top 1\% of baseline designs in physical performance. Moreover,
LLM2TEA generates designs that are not
generated in the baseline. Moreover, LLM2TEA generates designs that are not
only aesthetically creative but also functional in real-world applications.
Several of these designs have been successfully 3D-printed, emphasizing the
proposed approach's capacity to transform AI-generated outputs into tangible
physical objects. The designs produced by LLM2TEA meet practical requirements
while showcasing creative and innovative features, underscoring its potential
applications in complex design optimization and discovery.
comment: This work has been submitted to the IEEE for review
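The overall loop described above, where text prompts act as genotypes, a text-to-3D model produces phenotypes, and a classifier plus physics simulator score fitness, can be sketched as a small evolutionary skeleton. The function names (llm_crossover, llm_mutate, text_to_3d, and so on), the fitness weighting, and the selection scheme are placeholders, not the authors' API.

```python
# Skeleton of a generative evolutionary multitasking loop in the spirit of LLM2TEA (a sketch).
import random

def evolve(seed_prompts, llm_crossover, llm_mutate, text_to_3d, classifier, physics_sim,
           generations: int = 10, pop_size: int = 16):
    population = list(seed_prompts)
    for _ in range(generations):
        # Genotype -> phenotype: render each text prompt into a 3D object.
        shapes = [text_to_3d(p) for p in population]
        # Fitness combines semantic plausibility and simulated physical performance (assumed 50/50 mix).
        fitness = [0.5 * classifier(s) + 0.5 * physics_sim(s) for s in shapes]
        ranked = [p for _, p in sorted(zip(fitness, population), reverse=True)]
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            children.append(llm_mutate(llm_crossover(a, b)))  # LLM-based variation operators
        population = parents + children
    return population
```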
♻ ★ HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
Despite emerging efforts to enhance the safety of Vision-Language Models
(VLMs), current approaches face two main shortcomings. 1) Existing
safety-tuning datasets and benchmarks only partially consider how image-text
interactions can yield harmful content, often overlooking contextually unsafe
outcomes from seemingly benign pairs. This narrow coverage leaves VLMs
vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely
primarily on data-centric tuning, with limited architectural innovations to
intrinsically strengthen safety. We address these gaps by introducing a
holistic safety dataset and benchmark, HoliSafe, that spans all five
safe/unsafe image-text combinations, providing a more robust basis for both
training and evaluation. We further propose SafeLLaVA, a novel VLM augmented
with a learnable safety meta token and a dedicated safety head. The meta token
encodes harmful visual cues during training, intrinsically guiding the language
model toward safer responses, while the safety head offers interpretable
harmfulness classification aligned with refusal rationales. Experiments show
that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety
performance across multiple VLM benchmarks. Additionally, the HoliSafe
benchmark itself reveals critical vulnerabilities in existing models. We hope
that HoliSafe and SafeLLaVA will spur further research into robust and
interpretable VLM safety, expanding future avenues for multimodal alignment.
comment: Project page: https://youngwanlee.github.io/holisafe
♻ ★ AugGen: Synthetic Augmentation Can Improve Discriminative Models
The increasing reliance on large-scale datasets in machine learning poses
significant privacy and ethical challenges, particularly in sensitive domains
such as face recognition (FR). Synthetic data generation offers a promising
alternative; however, most existing methods depend heavily on external datasets
or pre-trained models, increasing complexity and resource demands. In this
paper, we introduce AugGen, a self-contained synthetic augmentation technique.
AugGen strategically samples from a class-conditional generative model trained
exclusively on the target FR dataset, eliminating the need for external
resources. Evaluated across 8 FR benchmarks, including IJB-C and IJB-B, our
method achieves 1-12% performance improvements, outperforming models trained
solely on real data and surpassing state-of-the-art synthetic data generation
approaches, while using less real data. Notably, these gains often exceed those
from architectural modifications, underscoring the value of synthetic
augmentation in data-limited scenarios. Our findings demonstrate that carefully
integrated synthetic data can both mitigate privacy constraints and
substantially enhance discriminative performance in face recognition. Paper
website: https://parsa-ra.github.io/auggen/.
♻ ★ Question-Aware Gaussian Experts for Audio-Visual Question Answering CVPR 2025
Audio-Visual Question Answering (AVQA) requires not only question-based
multimodal reasoning but also precise temporal grounding to capture subtle
dynamics for accurate prediction. However, existing methods mainly use question
information implicitly, limiting focus on question-specific details.
Furthermore, most studies rely on uniform frame sampling, which can miss key
question-relevant frames. Although recent Top-K frame selection methods aim to
address this, their discrete nature still overlooks fine-grained temporal
details. This paper proposes QA-TIGER, a novel framework that explicitly
incorporates question information and models continuous temporal dynamics. Our
key idea is to use Gaussian-based modeling to adaptively focus on both
consecutive and non-consecutive frames based on the question, while explicitly
injecting question information and applying progressive refinement. We leverage
a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models,
activating temporal experts specifically tailored to the question. Extensive
experiments on multiple AVQA benchmarks show that QA-TIGER consistently
achieves state-of-the-art performance. Code is available at
https://aim-skku.github.io/QA-TIGER/
comment: CVPR 2025. Code is available at https://github.com/AIM-SKKU/QA-TIGER
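The Gaussian-based temporal modeling described above, where question-conditioned Gaussian experts produce soft weights over frames and a gating network mixes them, can be sketched as follows. The shapes, the sigmoid/softplus parameterization, and the normalization are assumptions for illustration, not the released QA-TIGER code.

```python
# Sketch of question-conditioned Gaussian temporal weighting with a mixture of experts (illustrative).
import torch
import torch.nn as nn

class GaussianTemporalWeights(nn.Module):
    def __init__(self, q_dim: int, num_experts: int = 4):
        super().__init__()
        self.num_experts = num_experts
        # Each expert predicts a center (mu in [0, 1]) and a width (sigma > 0) from the question.
        self.head = nn.Linear(q_dim, 2 * num_experts)
        self.gate = nn.Linear(q_dim, num_experts)  # mixture-of-experts gating

    def forward(self, q_feat: torch.Tensor, num_frames: int) -> torch.Tensor:
        B = q_feat.shape[0]
        params = self.head(q_feat).view(B, self.num_experts, 2)
        mu = torch.sigmoid(params[..., 0]).unsqueeze(-1)                       # (B, E, 1)
        sigma = torch.nn.functional.softplus(params[..., 1]).unsqueeze(-1) + 1e-3
        t = torch.linspace(0, 1, num_frames, device=q_feat.device).view(1, 1, -1)
        gauss = torch.exp(-0.5 * ((t - mu) / sigma) ** 2)                      # (B, E, T)
        gate = torch.softmax(self.gate(q_feat), dim=-1).unsqueeze(-1)          # (B, E, 1)
        weights = (gate * gauss).sum(dim=1)                                    # (B, T)
        return weights / (weights.sum(dim=-1, keepdim=True) + 1e-6)

# Usage: weighted_video = (weights.unsqueeze(-1) * frame_features).sum(dim=1)
```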
♻ ★ Holistic Uncertainty Estimation For Open-Set Recognition
Accurate uncertainty estimation is a critical challenge in open-set
recognition, where a probe biometric sample may belong to an unknown identity.
It can be addressed through sample quality estimation via probabilistic
embeddings. However, the low variance of probabilistic embedding only partly
implies a low identification error probability: an embedding of a sample could
be close to several classes in a gallery, thus yielding high uncertainty
despite high sample quality. We propose HolUE - a holistic uncertainty
estimation method based on a Bayesian probabilistic model; it is aware of two
sources of ambiguity in the open-set recognition system: (1) the gallery
uncertainty caused by overlapping classes and (2) the uncertainty of
embeddings. Challenging open-set recognition datasets, such as IJB-C for the
image domain and VoxBlink for the audio domain, serve as a testbed for our
method. We also provide a new open-set recognition protocol for the
identification of whales and dolphins. In all cases, HolUE better identifies
recognition errors than alternative uncertainty estimation methods, including
those based solely on sample quality.
♻ ★ Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025 CVPR
In this report, we present a novel three-stage framework developed for the
Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in
foundation models, our method consists of three stages: feature extraction,
action recognition, and long-term action anticipation. First, visual features
are extracted using a high-performance visual encoder. The features are then
fed into a Transformer to predict verbs and nouns, with a verb-noun
co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the
predicted verb-noun pairs are formatted as textual prompts and input into a
fine-tuned large language model (LLM) to anticipate future action sequences.
Our framework achieves first place in this challenge at CVPR 2025, establishing
a new state-of-the-art in long-term action prediction. Our code will be
released at https://github.com/CorrineQiu/Ego4D-LTA-Challenge-2025.
comment: The champion solution for the Ego4D Long-Term Action Anticipation
Challenge at the CVPR EgoVis Workshop 2025
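One concrete way a verb-noun co-occurrence matrix can sharpen recognition, as mentioned above, is to re-score the joint verb/noun prediction with the prior before building the textual prompt for the LLM. The matrix construction and the geometric fusion weight below are assumptions, not the authors' exact scheme.

```python
# Sketch of re-scoring verb/noun predictions with a co-occurrence prior (illustrative).
import torch

def rescore_with_cooccurrence(verb_logits: torch.Tensor,
                              noun_logits: torch.Tensor,
                              cooc: torch.Tensor,
                              alpha: float = 0.5):
    """verb_logits: (V,), noun_logits: (N,), cooc: (V, N) row-normalized co-occurrence counts."""
    verb_probs = torch.softmax(verb_logits, dim=-1)
    noun_probs = torch.softmax(noun_logits, dim=-1)
    # Joint score mixes independent predictions with the co-occurrence prior (geometric mixture).
    joint = (verb_probs[:, None] * noun_probs[None, :]) ** (1 - alpha) * cooc ** alpha
    v_idx, n_idx = divmod(int(torch.argmax(joint)), joint.shape[1])
    return v_idx, n_idx  # the selected pair is then formatted into the LLM prompt
```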
♻ ★ Fourier-Modulated Implicit Neural Representation for Multispectral Satellite Image Compression
Multispectral satellite images play a vital role in agriculture, fisheries,
and environmental monitoring. However, their high dimensionality, large data
volumes, and diverse spatial resolutions across multiple channels pose
significant challenges for data compression and analysis. This paper presents
ImpliSat, a unified framework specifically designed to address these challenges
through efficient compression and reconstruction of multispectral satellite
data. ImpliSat leverages Implicit Neural Representations (INR) to model
satellite images as continuous functions over coordinate space, capturing fine
spatial details across varying spatial resolutions. Furthermore, we introduce a
Fourier modulation algorithm that dynamically adjusts to the spectral and
spatial characteristics of each band, ensuring optimal compression while
preserving critical image details.
comment: Accepted to IGARSS 2025 (Oral)
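The general mechanism behind ImpliSat, a coordinate MLP with Fourier features fit per image band, can be sketched as below. The band-specific Fourier modulation is only hinted at here; the frequency scale, network width, and loss are assumptions, not the paper's configuration.

```python
# Minimal coordinate-MLP with Fourier features (a sketch of the INR mechanism, not ImpliSat itself).
import torch
import torch.nn as nn

class FourierINR(nn.Module):
    def __init__(self, num_bands: int = 16, hidden: int = 128, out_channels: int = 1):
        super().__init__()
        # Random Fourier frequencies for the 2D coordinates (x, y); scale 10.0 is an assumption.
        self.register_buffer("freqs", torch.randn(2, num_bands) * 10.0)
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_bands, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_channels),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) in [0, 1]; returns predicted pixel values for one spectral band.
        proj = 2 * torch.pi * coords @ self.freqs            # (N, num_bands)
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.mlp(feats)

# Compressing a band then amounts to regressing its pixel values at sampled coordinates
# and storing only the network weights.
```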
♻ ★ BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection
3D object detection is an important task that has been widely applied in
autonomous driving. To perform this task, a new trend is to fuse multi-modal
inputs, i.e., LiDAR and camera. Under such a trend, recent methods fuse these
two modalities by unifying them in the same 3D space. However, during direct
fusion in a unified space, the drawbacks of both modalities (LiDAR features
struggle with detailed semantic information and the camera lacks accurate 3D
spatial information) are also preserved, diluting semantic and spatial
awareness of the final unified representation. To address the issue, this
letter proposes a novel bidirectional complementary LiDAR-camera fusion
framework, called BiCo-Fusion that can achieve robust semantic- and
spatial-aware 3D object detection. The key insight is to fuse LiDAR and camera
features in a bidirectional complementary way to enhance the semantic awareness
of the LiDAR and the 3D spatial awareness of the camera. The enhanced features
from both modalities are then adaptively fused to build a semantic- and
spatial-aware unified representation. Specifically, we introduce Pre-Fusion
consisting of a Voxel Enhancement Module (VEM) to enhance the semantic
awareness of voxel features from 2D camera features and an Image Enhancement
Module (IEM) to enhance the 3D spatial awareness of camera features from 3D
voxel features. We then introduce Unified Fusion (U-Fusion) to adaptively fuse
the enhanced features from the last stage to build a unified representation.
Extensive experiments demonstrate the superiority of our BiCo-Fusion against
the prior arts. Project page: https://t-ys.github.io/BiCo-Fusion/.
comment: Accepted by IEEE Robotics and Automation Letters (RA-L)
♻ ★ SmartEraser: Remove Anything from Images using Masked-Region Guidance
Longtao Jiang, Zhendong Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Lei Shi, Dong Chen, Houqiang Li
Object removal has so far been dominated by the mask-and-inpaint paradigm,
where the masked region is excluded from the input, leaving models relying on
unmasked areas to inpaint the missing region. However, this approach lacks
contextual information for the masked area, often resulting in unstable
performance. In this work, we introduce SmartEraser, built with a new removing
paradigm called Masked-Region Guidance. This paradigm retains the masked region
in the input, using it as guidance for the removal process. It offers several
distinct advantages: (a) it guides the model to accurately identify the object
to be removed, preventing its regeneration in the output; (b) since the user
mask often extends beyond the object itself, it aids in preserving the
surrounding context in the final result. Leveraging this new paradigm, we
present Syn4Removal, a large-scale object removal dataset, where instance
segmentation data is used to copy and paste objects onto images as removal
targets, with the original images serving as ground truths. Experimental
results demonstrate that SmartEraser significantly outperforms existing
methods, achieving superior performance in object removal, especially in
complex scenes with intricate compositions.
comment: Project at: https://longtaojiang.github.io/smarteraser.github.io/
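The contrast between mask-and-inpaint conditioning and the masked-region guidance described above can be shown with a small input-construction sketch. The channel layout is an assumption about how such a model could be conditioned, not the SmartEraser architecture.

```python
# Sketch contrasting mask-and-inpaint input with masked-region guidance (illustrative).
import torch

def build_removal_input(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W); mask: (1, H, W) with 1 inside the user-drawn removal region."""
    inpaint_style = torch.cat([image * (1 - mask), mask], dim=0)  # classic paradigm: masked pixels dropped
    guidance_style = torch.cat([image, mask], dim=0)              # masked region retained as guidance
    return guidance_style  # 4-channel conditioning; inpaint_style is shown only for contrast
```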
♻ ★ ProbDiffFlow: An Efficient Learning-Free Framework for Probabilistic Single-Image Optical Flow Estimation
Mo Zhou, Jianwei Wang, Xuanmeng Zhang, Dylan Campbell, Kai Wang, Long Yuan, Wenjie Zhang, Xuemin Lin
This paper studies optical flow estimation, a critical task in motion
analysis with applications in autonomous navigation, action recognition, and
film production. Traditional optical flow methods require consecutive frames,
which are often unavailable due to limitations in data acquisition or
real-world scene disruptions. Thus, single-frame optical flow estimation is
emerging in the literature. However, existing single-frame approaches suffer
from two major limitations: (1) they rely on labeled training data, making them
task-specific, and (2) they produce deterministic predictions, failing to
capture motion uncertainty. To overcome these challenges, we propose
ProbDiffFlow, a training-free framework that estimates optical flow
distributions from a single image. Instead of directly predicting motion,
ProbDiffFlow follows an estimation-by-synthesis paradigm: it first generates
diverse plausible future frames using a diffusion-based model, then estimates
motion from these synthesized samples using a pre-trained optical flow model,
and finally aggregates the results into a probabilistic flow distribution. This
design eliminates the need for task-specific training while capturing multiple
plausible motions. Experiments on both synthetic and real-world datasets
demonstrate that ProbDiffFlow achieves superior accuracy, diversity, and
efficiency, outperforming existing single-image and two-frame baselines.
comment: 18 pages, 13 figures, accepted by Frontiers of Computer Science (FCS)
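The estimation-by-synthesis pipeline described above reduces to a short loop: sample plausible next frames with a generative model, run a frozen two-frame flow estimator on each, and aggregate. The function names below stand in for any diffusion-based next-frame sampler and any pre-trained flow network; they are assumptions, not the authors' interfaces.

```python
# Sketch of the estimation-by-synthesis loop for probabilistic single-image flow (illustrative).
import torch

def probabilistic_flow(frame, sample_next_frame, estimate_flow, num_samples: int = 8):
    """Return per-pixel mean and variance of flow over synthesized futures."""
    flows = []
    for _ in range(num_samples):
        next_frame = sample_next_frame(frame)            # diffusion model draws one plausible future
        flows.append(estimate_flow(frame, next_frame))   # (2, H, W) flow from a frozen estimator
    flows = torch.stack(flows)                           # (S, 2, H, W)
    return flows.mean(dim=0), flows.var(dim=0)           # a simple summary of the flow distribution
```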
♻ ★ Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency
Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
We present Genesis, a unified framework for joint generation of multi-view
driving videos and LiDAR sequences with spatio-temporal and cross-modal
consistency. Genesis employs a two-stage architecture that integrates a
DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR
generator with NeRF-based rendering and adaptive sampling. Both modalities are
directly coupled through a shared latent space, enabling coherent evolution
across visual and geometric domains. To guide the generation with structured
semantics, we introduce DataCrafter, a captioning module built on
vision-language models that provides scene-level and instance-level
supervision. Extensive experiments on the nuScenes benchmark demonstrate that
Genesis achieves state-of-the-art performance across video and LiDAR metrics
(FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including
segmentation and 3D detection, validating the semantic fidelity and practical
utility of the generated data.
♻ ★ NeRF-CA: Dynamic Reconstruction of X-ray Coronary Angiography with Extremely Sparse-views
Dynamic three-dimensional (4D) reconstruction from two-dimensional X-ray
coronary angiography (CA) remains a significant clinical problem. Existing CA
reconstruction methods often require extensive user interaction or large
training datasets. Recently, Neural Radiance Field (NeRF) has successfully
reconstructed high-fidelity scenes in natural and medical contexts without
these requirements. However, challenges such as sparse views, intra-scan
motion, and complex vessel morphology hinder its direct application to CA data.
We introduce NeRF-CA, a first step toward a fully automatic 4D CA
reconstruction that achieves reconstructions from sparse coronary angiograms.
To the best of our knowledge, we are the first to address the challenges of
sparse views and cardiac motion by decoupling the scene into the moving
coronary artery and the static background, effectively translating the problem
of motion into a strength. NeRF-CA serves as a first stepping stone for solving
the 4D CA reconstruction problem, achieving adequate 4D reconstructions from as
few as four angiograms, as required by clinical practice, while significantly
outperforming state-of-the-art sparse-view X-ray NeRF. We validate our approach
quantitatively and qualitatively using representative 4D phantom datasets and
ablation studies. To accelerate research in this domain, we made our codebase
public: https://github.com/kirstenmaas/NeRF-CA.
♻ ★ MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
Character video synthesis aims to produce realistic videos of animatable
characters within lifelike scenes. As a fundamental problem in the computer
vision and graphics community, 3D works typically require multi-view captures
for per-case training, which severely limits their applicability to modeling
arbitrary characters in a short time. Recent 2D methods break this limitation
via pre-trained diffusion models, but they struggle for pose generality and
scene interaction. To this end, we propose MIMO, a novel framework which can
not only synthesize character videos with controllable attributes (i.e.,
character, motion and scene) provided by simple user inputs, but also
simultaneously achieve advanced scalability to arbitrary characters, generality
to novel 3D motions, and applicability to interactive real-world scenes in a
unified framework. The core idea is to encode the 2D video to compact spatial
codes, considering the inherent 3D nature of video occurrence. Concretely, we
lift the 2D frame pixels into 3D using monocular depth estimators, and
decompose the video clip into three spatial components (i.e., main human,
underlying scene, and floating occlusion) in hierarchical layers based on the
3D depth. These components are further encoded to canonical identity code,
structured motion code and full scene code, which are utilized as control
signals of synthesis process. The design of spatial decomposed modeling enables
flexible user control, complex motion expression, as well as 3D-aware synthesis
for scene interactions. Experimental results demonstrate effectiveness and
robustness of the proposed method.
comment: Project Page: https://menyifang.github.io/projects/MIMO/index.html
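The depth-based layering described above, splitting pixels into a floating-occlusion layer, the main human, and the background using monocular depth, can be sketched as follows. The thresholding rule and the source of the human mask are assumptions, not the MIMO decomposition itself.

```python
# Sketch of depth-based decomposition into occlusion / human / background layers (illustrative).
import torch

def decompose_by_depth(frame: torch.Tensor, depth: torch.Tensor, human_mask: torch.Tensor):
    """frame: (3, H, W); depth: (H, W), smaller = closer; human_mask: (H, W) bool, assumed non-empty."""
    near = depth[human_mask].min()                              # closest depth on the person
    human_layer = frame * human_mask
    occlusion_layer = frame * ((depth < near) & ~human_mask)    # content floating in front of the person
    background_layer = frame * (~human_mask & (depth >= near))  # everything behind or beside
    return occlusion_layer, human_layer, background_layer
```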
♻ ★ Temporal-Guided Spiking Neural Networks for Event-Based Human Action Recognition
This paper explores the promising interplay between spiking neural networks
(SNNs) and event-based cameras for privacy-preserving human action recognition
(HAR). The unique feature of event cameras in capturing only the outlines of
motion, combined with SNNs' proficiency in processing spatiotemporal data
through spikes, establishes a highly synergistic compatibility for event-based
HAR. Previous studies, however, have been limited by SNNs' restricted ability
to process long-term temporal information, which is essential for precise HAR.
In this paper, we
introduce two novel frameworks to address this: temporal segment-based SNN
(\textit{TS-SNN}) and 3D convolutional SNN (\textit{3D-SNN}). The
\textit{TS-SNN} extracts long-term temporal information by dividing actions
into shorter segments, while the \textit{3D-SNN} replaces 2D spatial elements
with 3D components to facilitate the transmission of temporal information. To
promote further research in event-based HAR, we create a dataset,
\textit{FallingDetection-CeleX}, collected using the high-resolution CeleX-V
event camera $(1280 \times 800)$, comprising 7 distinct actions. Extensive
experimental results show that our proposed frameworks surpass state-of-the-art
SNN methods on our newly collected dataset and three other neuromorphic
datasets, showcasing their effectiveness in handling long-range temporal
information for event-based HAR.
♻ ★ LEMUR Neural Network Dataset: Towards Seamless AutoML
Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte
Neural networks are fundamental in artificial intelligence, driving progress
in computer vision and natural language processing. High-quality datasets are
crucial for their development, and there is growing interest in datasets
composed of neural networks themselves to support benchmarking, automated
machine learning (AutoML), and model analysis. We introduce LEMUR, an open
source dataset of neural network models with well-structured code for diverse
architectures across tasks such as object detection, image classification,
segmentation, and natural language processing. LEMUR is primarily designed to
provide a rich source of structured model representations and associated
performance data, enabling the fine-tuning of large language models for AutoML
applications. Leveraging Python and PyTorch, LEMUR enables seamless extension
to new datasets and models while maintaining consistency. It integrates an
Optuna-powered framework for evaluation, hyperparameter optimization,
statistical analysis, and graphical insights. The LEMUR VR extension enables the
seamless deployment of models in virtual reality, optimizing their performance
on resource-constrained devices. Providing tools for model evaluation,
preprocessing, and database management, LEMUR supports researchers and
practitioners in developing, testing, and analyzing neural networks. It offers
an API that delivers comprehensive information about neural network models and
their complete performance statistics with a single request, which can be used
in experiments with code-generating large language models. The LEMUR and its
plugins are accessible as open source projects under the MIT license at
https://github.com/ABrain-One/nn-dataset,
https://github.com/ABrain-One/nn-plots and https://github.com/ABrain-One/nn-vr.
♻ ★ Dynamic Negative Guidance of Diffusion Models ICLR 2025
Negative Prompting (NP) is widely utilized in diffusion models, particularly
in text-to-image applications, to prevent the generation of undesired features.
In this paper, we show that conventional NP is limited by the assumption of a
constant guidance scale, which may lead to highly suboptimal results, or even
complete failure, due to the non-stationarity and state-dependence of the
reverse process. Based on this analysis, we derive a principled technique
called Dynamic Negative Guidance (DNG), which relies on a near-optimal time-
and state-dependent modulation of the guidance without requiring additional training.
Unlike NP, negative guidance requires estimating the posterior class
probability during the denoising process, which is achieved with limited
additional computational overhead by tracking the discrete Markov Chain during
the generative process. We evaluate the performance of DNG on class removal using
MNIST and CIFAR10, where we show that DNG leads to higher safety, preservation
of class balance and image quality when compared with baseline methods.
Furthermore, we show that it is possible to use DNG with Stable Diffusion to
obtain more accurate and less invasive guidance than NP.
comment: Paper accepted at ICLR 2025 (poster). Our implementation is available
at https://github.com/FelixKoulischer/Dynamic-Negative-Guidance.git
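The core idea above, scaling the negative-guidance term by how likely the current noisy state is to belong to the concept being removed, can be sketched at the level of one denoising step. This is a simplified reading with an assumed linear modulation; the paper's exact estimator of the posterior and its modulation rule differ.

```python
# Sketch of posterior-weighted negative guidance at one denoising step (a simplified reading).
import torch

def dng_step_score(eps_uncond: torch.Tensor,
                   eps_negative: torch.Tensor,
                   posterior_neg: torch.Tensor,
                   base_scale: float = 5.0) -> torch.Tensor:
    """Push the noise prediction away from the negative concept, scaled by the per-sample
    posterior probability (posterior_neg in [0, 1]) that the current state matches it."""
    scale = base_scale * posterior_neg.view(-1, 1, 1, 1)   # time/state-dependent guidance scale
    return eps_uncond + scale * (eps_uncond - eps_negative)
```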
♻ ★ DeepMultiConnectome: Deep Multi-Task Prediction of Structural Connectomes Directly from Diffusion MRI Tractography
Marcus J. Vroemen, Yuqian Chen, Yui Lo, Tengfei Xue, Weidong Cai, Fan Zhang, Josien P. W. Pluim, Lauren J. O'Donnell
Diffusion MRI (dMRI) tractography enables in vivo mapping of brain structural
connections, but traditional connectome generation is time-consuming and
requires gray matter parcellation, posing challenges for large-scale studies.
We introduce DeepMultiConnectome, a deep-learning model that predicts
structural connectomes directly from tractography, bypassing the need for gray
matter parcellation while supporting multiple parcellation schemes. Using a
point-cloud-based neural network with multi-task learning, the model classifies
streamlines according to their connected regions across two parcellation
schemes, sharing a learned representation. We train and validate
DeepMultiConnectome on tractography from the Human Connectome Project Young
Adult dataset ($n = 1000$), labeled with 84- and 164-region gray matter
parcellation schemes. DeepMultiConnectome predicts multiple structural
connectomes from a whole-brain tractogram containing 3 million streamlines in
approximately 40 seconds. DeepMultiConnectome is evaluated by comparing
predicted connectomes with traditional connectomes generated using the
conventional method of labeling streamlines using a gray matter parcellation.
The predicted connectomes are highly correlated with traditionally generated
connectomes ($r = 0.992$ for an 84-region scheme; $r = 0.986$ for a 164-region
scheme) and largely preserve network properties. A test-retest analysis of
DeepMultiConnectome demonstrates reproducibility comparable to traditionally
generated connectomes. The predicted connectomes perform similarly to
traditionally generated connectomes in predicting age and cognitive function.
Overall, DeepMultiConnectome provides a scalable, fast model for generating
subject-specific connectomes across multiple parcellation schemes.
comment: 15 pages, 5 figures
♻ ★ MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks
As automated attack techniques rapidly advance, CAPTCHAs remain a critical
defense mechanism against malicious bots. However, existing CAPTCHA schemes
encompass a diverse range of modalities -- from static distorted text and
obfuscated images to interactive clicks, sliding puzzles, and logic-based
questions -- yet the community still lacks a unified, large-scale, multimodal
benchmark to rigorously evaluate their security robustness. To address this
gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking
suite that integrates heterogeneous CAPTCHA types into a single evaluation
protocol. Leveraging a shared vision-language model backbone, we fine-tune
specialized cracking agents for each CAPTCHA category, enabling consistent,
cross-modal assessments. Extensive experiments reveal that MCA-Bench
effectively maps the vulnerability spectrum of modern CAPTCHA designs under
varied attack settings, and crucially offers the first quantitative analysis of
how challenge complexity, interaction depth, and model solvability interrelate.
Based on these findings, we propose three actionable design principles and
identify key open challenges, laying the groundwork for systematic CAPTCHA
hardening, fair benchmarking, and broader community collaboration. Datasets and
code are available online.
comment: 31 pages, 8 figures
♻ ★ Plug-and-Play image restoration with Stochastic deNOising REgularization
Plug-and-Play (PnP) algorithms are a class of iterative algorithms that
address image inverse problems by combining a physical model and a deep neural
network for regularization. Even if they produce impressive image restoration
results, these algorithms rely on a non-standard use of a denoiser on images
that are less and less noisy along the iterations, which contrasts with recent
algorithms based on Diffusion Models (DM), where the denoiser is applied only
on re-noised images. We propose a new PnP framework, called Stochastic
deNOising REgularization (SNORE), which applies the denoiser only on images
with noise of the adequate level. It is based on an explicit stochastic
regularization, which leads to a stochastic gradient descent algorithm to solve
ill-posed inverse problems. A convergence analysis of this algorithm and its
annealing extension is provided. Experimentally, we show that SNORE is
competitive with respect to state-of-the-art methods on deblurring and
inpainting tasks, both quantitatively and qualitatively.
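The distinguishing point above, applying the denoiser only to images re-noised to the level it expects, can be sketched as one stochastic gradient step on a regularized objective. The step sizes, noise level, and quadratic data term are illustrative assumptions, not the SNORE algorithm as published.

```python
# One stochastic denoising-regularized gradient step in the spirit of SNORE (a sketch).
import torch

def snore_step(x, degradation, y, denoiser, sigma=0.05, step=0.1, reg_weight=0.5):
    """x: current estimate; degradation: forward operator A(x); y: observation;
    denoiser(z, sigma): a network trained to remove Gaussian noise of level sigma."""
    x = x.detach().requires_grad_(True)
    data_term = 0.5 * ((degradation(x) - y) ** 2).sum()
    grad_data, = torch.autograd.grad(data_term, x)
    with torch.no_grad():
        x_noisy = x + sigma * torch.randn_like(x)              # re-noise to the level the denoiser expects
        grad_reg = (x_noisy - denoiser(x_noisy, sigma)) / sigma ** 2
        x_new = x - step * (grad_data + reg_weight * grad_reg)  # stochastic gradient descent step
    return x_new.detach()
```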
♻ ★ Exploring Test-Time Adaptation for Object Detection in Continually Changing Environments
Real-world application models are commonly deployed in dynamic environments,
where the target domain distribution undergoes temporal changes. Continual
Test-Time Adaptation (CTTA) has recently emerged as a promising technique to
gradually adapt a source-trained model to continually changing target domains.
Despite recent advancements in addressing CTTA, two critical issues remain: 1)
Fixed thresholds for pseudo-labeling in existing methodologies lead to
low-quality pseudo-labels, as model confidence varies across categories and
domains; 2) Stochastic parameter restoration methods for mitigating
catastrophic forgetting fail to preserve critical information effectively, due
to their intrinsic randomness. To tackle these challenges for detection models
in CTTA scenarios, we present AMROD, featuring three core components. Firstly,
the object-level contrastive learning module extracts object-level features for
contrastive learning to refine the feature representation in the target domain.
Secondly, the adaptive monitoring module dynamically skips unnecessary
adaptation and updates the category-specific threshold based on predicted
confidence scores to enable efficiency and improve the quality of
pseudo-labels. Lastly, the adaptive randomized restoration mechanism
selectively resets inactive parameters with higher probability, ensuring the
retention of essential knowledge. We demonstrate the effectiveness of AMROD on
four CTTA object detection tasks, where AMROD outperforms existing methods,
especially achieving a 3.2 mAP improvement and a 20\% increase in efficiency on
the Cityscapes-to-Cityscapes-C CTTA task. The code of this work is available at
https://github.com/ShileiCao/AMROD.
♻ ★ Diffusion-based Adversarial Purification from the Perspective of the Frequency Domain
Diffusion-based adversarial purification methods attempt to submerge
adversarial perturbations in isotropic noise through the forward process and
then recover the clean images through the reverse process. Because
distribution information about adversarial perturbations in the pixel domain
is lacking, damage to normal semantics is often unavoidable. We turn to
the frequency domain perspective, decomposing the image into amplitude spectrum
and phase spectrum. We find that for both spectra, the damage caused by
adversarial perturbations tends to increase monotonically with frequency. This
means that we can extract the content and structural information of the
original clean sample from the frequency components that are less damaged.
Meanwhile, theoretical analysis indicates that existing purification methods
indiscriminately damage all frequency components, leading to excessive damage
to the image. Therefore, we propose a purification method that can eliminate
adversarial perturbations while maximizing the preservation of the content and
structure of the original image. Specifically, at each time step during the
reverse process, for the amplitude spectrum, we replace the low-frequency
components of the estimated image's amplitude spectrum with the corresponding
parts of the adversarial image. For the phase spectrum, we project the phase of
the estimated image into a designated range of the adversarial image's phase
spectrum, focusing on the low frequencies. Empirical evidence from extensive
experiments demonstrates that our method significantly outperforms most current
defense methods.
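The amplitude/phase mixing step described above, performed at each reverse step, can be sketched with a plain FFT on a single-channel image. The cutoff radius, the phase projection band, and the use of numpy are assumptions for clarity, not the paper's exact settings.

```python
# Sketch of low-frequency amplitude replacement and phase projection at one reverse step (illustrative).
import numpy as np

def mix_low_frequencies(estimate: np.ndarray, adversarial: np.ndarray,
                        amp_cutoff: float = 0.1, phase_margin: float = 0.2) -> np.ndarray:
    """Both inputs are (H, W) grayscale arrays; returns the frequency-corrected estimate."""
    F_est = np.fft.fftshift(np.fft.fft2(estimate))
    F_adv = np.fft.fftshift(np.fft.fft2(adversarial))
    H, W = estimate.shape
    fy = np.fft.fftshift(np.fft.fftfreq(H))
    fx = np.fft.fftshift(np.fft.fftfreq(W))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    low = radius < amp_cutoff                                  # the less-damaged low-frequency band
    # Amplitude: copy the low-frequency amplitudes from the adversarial image.
    amp = np.abs(F_est)
    amp[low] = np.abs(F_adv)[low]
    # Phase: clip the estimate's low-frequency phase into a band around the adversarial phase.
    phase = np.angle(F_est)
    diff = np.angle(np.exp(1j * (phase - np.angle(F_adv))))    # wrapped phase difference
    phase_proj = np.angle(F_adv) + np.clip(diff, -phase_margin, phase_margin)
    phase[low] = phase_proj[low]
    mixed = amp * np.exp(1j * phase)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))
```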
♻ ★ SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis
Despite recent advances in text-conditioned 3D indoor scene generation, there
remain gaps in the evaluation of these methods. Existing metrics primarily
assess the realism of generated scenes by comparing them to a set of
ground-truth scenes, often overlooking alignment with the input text - a
critical factor in determining how effectively a method meets user
requirements. We present SceneEval, an evaluation framework designed to address
this limitation. SceneEval includes metrics for both explicit user
requirements, such as the presence of specific objects and their attributes
described in the input text, and implicit expectations, like the absence of
object collisions, providing a comprehensive assessment of scene quality. To
facilitate evaluation, we introduce SceneEval-500, a dataset of scene
descriptions with annotated ground-truth scene properties. We evaluate recent
scene generation methods using SceneEval and demonstrate its ability to provide
detailed assessments of the generated scenes, highlighting strengths and areas
for improvement across multiple dimensions. Our results show that current
methods struggle to generate scenes that meet user requirements, underscoring
the need for further research in this direction.
comment: Expanded dataset to 500 annotated scene descriptions with new scene
types; added validation via extended manual evaluation and a new user study;
clarified distinctions from prior metrics; included results using an
open-source VLM; stated intent to release code and data; corrected
terminology and typos. 24 pages with 8 figures and 6 tables
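To make the two metric families concrete, here is a toy sketch of an explicit check (required objects present) and an implicit check (no pairwise bounding-box collisions). The scene and requirement formats are invented for illustration and do not reflect SceneEval's actual interfaces.
```python
# Toy versions of an explicit-requirement metric and an implicit-expectation
# metric (assumed data formats, for illustration only).
from collections import Counter

def explicit_presence_score(scene_objects, required_counts):
    """scene_objects: list of category names; required_counts: {category: count}."""
    have = Counter(scene_objects)
    satisfied = sum(min(have[c], n) for c, n in required_counts.items())
    total = sum(required_counts.values())
    return satisfied / total if total else 1.0

def collision_free(boxes):
    """boxes: list of axis-aligned (xmin, ymin, zmin, xmax, ymax, zmax) tuples."""
    def overlap(a, b):
        return all(a[i] < b[i + 3] and b[i] < a[i + 3] for i in range(3))
    return not any(overlap(boxes[i], boxes[j])
                   for i in range(len(boxes)) for j in range(i + 1, len(boxes)))
```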
♻ ★ ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions
Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
Editing images with instructions to reflect non-rigid motions, camera
viewpoint shifts, object deformations, human articulations, and complex
interactions poses a challenging yet underexplored problem in computer vision.
Existing approaches and datasets predominantly focus on static scenes or rigid
transformations, limiting their capacity to handle expressive edits involving
dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive
framework for instruction-based image editing with an emphasis on non-rigid
motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong
baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher.
ByteMorph-6M includes over 6 million high-resolution image editing pairs for
training, along with a carefully curated evaluation benchmark ByteMorph-Bench.
Both capture a wide variety of non-rigid motion types across diverse
environments, human figures, and object categories. The dataset is constructed
using motion-guided data generation, layered compositing techniques, and
automated captioning to ensure diversity, realism, and semantic coherence. We
further conduct a comprehensive evaluation of recent instruction-based image
editing methods from both academic and commercial domains.
comment: Website: https://boese0601.github.io/bytemorph Dataset:
https://huggingface.co/datasets/ByteDance-Seed/BM-6M Benchmark:
https://huggingface.co/datasets/ByteDance-Seed/BM-Bench Code:
https://github.com/ByteDance-Seed/BM-code Demo:
https://huggingface.co/spaces/Boese0601/ByteMorph-Demo
♻ ★ Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations CVPR 2025
Modeling spatial-temporal interactions among neighboring agents is at the
heart of multi-agent problems such as motion forecasting and crowd navigation.
Despite notable progress, it remains unclear to what extent modern
representations can capture the causal relationships behind agent interactions.
In this work, we take an in-depth look at the causal awareness of these
representations, from computational formalism to real-world practice. First, we
cast doubt on the notion of non-causal robustness studied in the recent
CausalAgents benchmark. We show that recent representations are already
partially resilient to perturbations of non-causal agents, and yet modeling
indirect causal effects involving mediator agents remains challenging. To
address this challenge, we introduce a metric learning approach that
regularizes latent representations with causal annotations. Our controlled
experiments show that this approach not only leads to higher degrees of causal
awareness but also yields stronger out-of-distribution robustness. To further
operationalize it in practice, we propose a sim-to-real causal transfer method
via cross-domain multi-task learning. Experiments on pedestrian datasets show
that our method can substantially boost generalization, even in the absence of
real-world causal annotations. We hope our work provides a new perspective on
the challenges and pathways towards causally-aware representations of
multi-agent interactions. Our code is available at
https://github.com/vita-epfl/CausalSim2Real.
comment: CVPR 2025
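A hedged sketch of what a metric-learning regularizer with causal annotations could look like is given below: the distance between a full-scene embedding and an embedding with one neighbor removed is pushed to reflect that neighbor's annotated causal effect. The margin form and the pairing scheme are assumptions, not the paper's exact loss.
```python
# Sketch of a causally-annotated metric-learning regularizer (assumed form).
import torch
import torch.nn.functional as F

def causal_metric_loss(z_full, z_removed, causal_effect, margin=0.1):
    """
    z_full:        (B, D) embedding of the full scene
    z_removed:     (B, D) embedding with one neighbor agent removed
    causal_effect: (B,) annotated effect of that neighbor (>= 0)
    """
    d = F.pairwise_distance(z_full, z_removed)          # (B,)
    # Non-causal neighbors (effect ~ 0) should barely move the embedding;
    # causal neighbors should move it by at least a margin.
    non_causal = (causal_effect < 1e-6).float()
    loss = non_causal * d + (1 - non_causal) * F.relu(margin - d)
    return loss.mean()
```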
♻ ★ Directing Mamba to Complex Textures: An Efficient Texture-Aware State Space Model for Image Restoration IJCAI 2025
Long Peng, Xin Di, Zhanfeng Feng, Wenbo Li, Renjing Pei, Yang Wang, Xueyang Fu, Yang Cao, Zheng-Jun Zha
Image restoration aims to recover details and enhance contrast in degraded
images. With the growing demand for high-quality imaging (e.g., 4K and
8K), achieving a balance between restoration quality and computational
efficiency has become increasingly critical. Existing methods, primarily based
on CNNs, Transformers, or their hybrid approaches, apply uniform deep
representation extraction across the image. However, these methods often
struggle to effectively model long-range dependencies and largely overlook the
spatial characteristics of image degradation (regions with richer textures tend
to suffer more severe damage), making it hard to achieve the best trade-off
between restoration quality and efficiency. To address these issues, we propose
a novel texture-aware image restoration method, TAMambaIR, which simultaneously
perceives image textures and achieves a trade-off between performance and
efficiency. Specifically, we introduce a novel Texture-Aware State Space Model,
which enhances texture awareness and improves efficiency by modulating the
transition matrix of the state-space equation and focusing on regions with
complex textures. Additionally, we design a Multi-Directional Perception
Block to improve multi-directional receptive fields while maintaining low
computational overhead. Extensive experiments on benchmarks for image
super-resolution, deraining, and low-light image enhancement demonstrate that
TAMambaIR achieves state-of-the-art performance with significantly improved
efficiency, establishing it as a robust and efficient framework for image
restoration.
comment: Accepted by the 34th International Joint Conference on Artificial
Intelligence (IJCAI 2025)
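As a heavily simplified picture of modulating a state-space recurrence with texture, the sketch below runs a scalar linear recurrence whose step size depends on a per-position texture score, so richly textured positions weight the current input more strongly. The discretization and the modulation rule are assumptions and only convey the general idea, not the authors' actual Mamba kernel.
```python
# Minimal texture-modulated state-space recurrence (assumed, single channel).
import torch

def texture_modulated_scan(x, texture, A=-1.0, B=1.0, C=1.0):
    """
    x:       (N, L) input sequence (e.g., flattened image features)
    texture: (N, L) texture score in [0, 1]; high = richly textured region
    Runs h_t = exp(dt * A) * h_{t-1} + dt * B * x_t with a texture-dependent
    step size dt, so textured positions contribute more to the state update.
    """
    N, L = x.shape
    h = torch.zeros(N, device=x.device)
    y = torch.empty_like(x)
    for t in range(L):
        dt = 0.1 + 0.9 * texture[:, t]    # larger step on complex textures (assumed rule)
        decay = torch.exp(dt * A)         # A < 0 keeps the recurrence stable
        h = decay * h + dt * B * x[:, t]
        y[:, t] = C * h
    return y
```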
♻ ★ MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models
Philip R. Liu, Sparsh Bansal, Jimmy Dinh, Aditya Pawar, Ramani Satishkumar, Shail Desai, Neeraj Gupta, Xin Wang, Shu Hu
The integration of deep learning-based glaucoma detection with large language
models (LLMs) presents an automated strategy to mitigate ophthalmologist
shortages and improve clinical reporting efficiency. However, applying general
LLMs to medical imaging remains challenging due to hallucinations, limited
interpretability, and insufficient domain-specific medical knowledge, which can
potentially reduce clinical accuracy. Although recent approaches combining
imaging models with LLM reasoning have improved reporting, they typically rely
on a single generalist agent, restricting their capacity to emulate the diverse
and complex reasoning found in multidisciplinary medical teams. To address
these limitations, we propose MedChat, a multi-agent diagnostic framework and
platform that combines specialized vision models with multiple role-specific
LLM agents, all coordinated by a director agent. This design enhances
reliability, reduces hallucination risk, and enables interactive diagnostic
reporting through an interface tailored for clinical review and educational
use. Code available at https://github.com/Purdue-M2/MedChat.
comment: 7 pages, 6 figures. Accepted to the 2025 IEEE 8th International
Conference on Multimedia Information Processing and Retrieval (MIPR)
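The director/specialist pattern described above might be organized along the lines of the sketch below, where role-specific agents each comment on the vision model's findings and a director agent merges their opinions into a report. All names and callable interfaces are placeholders rather than MedChat's actual API.
```python
# Schematic director/specialist coordination (placeholder interfaces).
from typing import Callable, Dict

def run_diagnostic_round(image_findings: str,
                         specialists: Dict[str, Callable[[str], str]],
                         director: Callable[[Dict[str, str]], str]) -> str:
    # Each role-specific agent reasons over the same vision-model findings.
    opinions = {role: agent(image_findings) for role, agent in specialists.items()}
    # The director agent reconciles specialist opinions into one report.
    return director(opinions)
```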