Computer Vision and Pattern Recognition
★ ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang
Structured image understanding, such as interpreting tables and charts,
requires strategically refocusing across various structures and texts within an
image, forming a reasoning sequence to arrive at the final answer. However,
current multimodal large language models (LLMs) lack this multihop selective
attention capability. In this work, we introduce ReFocus, a simple yet
effective framework that equips multimodal LLMs with the ability to generate
"visual thoughts" by performing visual editing on the input image through code,
shifting and refining their visual focuses. Specifically, ReFocus enables
multimodal LLMs to generate Python code to call tools and modify the input
image, sequentially drawing boxes, highlighting sections, and masking out
areas, thereby enhancing the visual reasoning process. We experiment on a
wide range of structured image understanding tasks involving tables and charts.
ReFocus largely improves performance on all tasks over GPT-4o without visual
editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart
tasks. We present an in-depth analysis of the effects of different visual
edits, and reasons why ReFocus can improve the performance without introducing
additional information. Further, we collect a 14k training set using ReFocus,
and prove that such visual chain-of-thought with intermediate information
offers better supervision than standard VQA data, reaching an 8.0% average
gain over the same model trained with QA pairs and 2.6% over CoT.
comment: Project link: https://zeyofu.github.io/ReFocus/
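As a concrete illustration of the kind of edits described above, the sketch below draws a box around one region and masks out another with PIL. The function names, coordinates, and file path are hypothetical illustrations, not the authors' actual tool interface.

```python
# Minimal sketch of the visual edits described above, using PIL.
# Function names, coordinates, and file paths are illustrative only.
from PIL import Image, ImageDraw

def draw_box(image, box, color="red", width=4):
    """Draw a bounding box to direct visual attention to a region."""
    img = image.copy()
    ImageDraw.Draw(img).rectangle(box, outline=color, width=width)
    return img

def mask_out(image, box, fill=(255, 255, 255)):
    """Mask out an irrelevant region so later reasoning ignores it."""
    img = image.copy()
    ImageDraw.Draw(img).rectangle(box, fill=fill)
    return img

# Example: highlight one table column and mask another (coordinates made up).
table = Image.open("table.png").convert("RGB")
step1 = draw_box(table, (120, 40, 260, 400))   # focus on the relevant column
step2 = mask_out(step1, (300, 40, 440, 400))   # hide a distracting column
step2.save("table_refocused.png")
```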
★ An Empirical Study of Autoregressive Pre-training from Videos
Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik
We empirically study autoregressive pre-training from videos. To perform our
study, we construct a series of autoregressive video models, called Toto. We
treat videos as sequences of visual tokens and train transformer models to
autoregressively predict future tokens. Our models are pre-trained on a diverse
dataset of videos and images comprising over 1 trillion visual tokens. We
explore different architectural, training, and inference design choices. We
evaluate the learned visual representations on a range of downstream tasks
including image recognition, video classification, object tracking, and
robotics. Our results demonstrate that, despite minimal inductive biases,
autoregressive pre-training leads to competitive performance across all
benchmarks. Finally, we find that scaling our video models results in similar
scaling curves to those seen in language models, albeit with a different rate.
More details at https://brjathu.github.io/toto/
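The core training signal described here is next-token prediction over sequences of visual tokens. The toy sketch below illustrates that objective with a small causal transformer; the vocabulary size, model width, and random tokens are placeholders, not the actual Toto tokenizer or architecture.

```python
# Toy sketch of autoregressive next-token prediction over visual tokens (PyTorch).
import torch
import torch.nn as nn

vocab, d_model, seq_len = 8192, 256, 64          # assumed sizes, not Toto's
embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(block, num_layers=2)
head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (4, seq_len))   # stand-in for tokenized video frames
causal = nn.Transformer.generate_square_subsequent_mask(seq_len - 1)

hidden = backbone(embed(tokens[:, :-1]), mask=causal)
logits = head(hidden)                            # predict the next visual token
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab),
                                   tokens[:, 1:].reshape(-1))
loss.backward()
```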
★ Decentralized Diffusion Models
Large-scale AI model training divides work across thousands of GPUs, then
synchronizes gradients across them at each step. This incurs a significant
network burden that only centralized, monolithic clusters can support, driving
up infrastructure costs and straining power systems. We propose Decentralized
Diffusion Models, a scalable framework for distributing diffusion model
training across independent clusters or datacenters by eliminating the
dependence on a centralized, high-bandwidth networking fabric. Our method
trains a set of expert diffusion models over partitions of the dataset, each in
full isolation from one another. At inference time, the experts ensemble
through a lightweight router. We show that the ensemble collectively optimizes
the same objective as a single model trained over the whole dataset. This means
we can divide the training burden among a number of "compute islands," lowering
infrastructure costs and improving resilience to localized GPU failures.
Decentralized diffusion models empower researchers to take advantage of
smaller, more cost-effective and more readily available compute like on-demand
GPU nodes rather than central integrated systems. We conduct extensive
experiments on ImageNet and LAION Aesthetics, showing that decentralized
diffusion models outperform standard diffusion models FLOP-for-FLOP. We finally
scale our approach to 24 billion parameters, demonstrating that high-quality
diffusion models can now be trained with just eight individual GPU nodes in
less than a week.
comment: Project webpage: https://decentralizeddiffusion.github.io/
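A minimal sketch of the inference-time idea, ensembling independently trained expert denoisers through a lightweight router, is shown below. The experts and router are toy placeholder modules, and the softmax mixture is a generic weighting rule rather than the paper's exact formulation.

```python
# Sketch: mix predictions of independently trained expert denoisers via a router.
import torch
import torch.nn as nn

class RoutedEnsemble(nn.Module):
    def __init__(self, experts, router):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.router = router

    def forward(self, x_t):  # a real denoiser would also take a timestep
        weights = torch.softmax(self.router(x_t.flatten(1)), dim=-1)   # (B, K)
        preds = torch.stack([e(x_t) for e in self.experts], dim=1)     # (B, K, C, H, W)
        return (weights[:, :, None, None, None] * preds).sum(dim=1)    # mixed prediction

# Toy usage: four "experts" and a linear router over flattened inputs.
experts = [nn.Conv2d(3, 3, 3, padding=1) for _ in range(4)]
router = nn.Linear(3 * 32 * 32, 4)
ensemble = RoutedEnsemble(experts, router)
out = ensemble(torch.randn(2, 3, 32, 32))   # (2, 3, 32, 32)
```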
★ Explainable AI-Enhanced Deep Learning for Pumpkin Leaf Disease Detection: A Comparative Analysis of CNN Architectures
Pumpkin leaf diseases are significant threats to agricultural productivity,
requiring a timely and precise diagnosis for effective management. Traditional
identification methods are laborious and susceptible to human error,
emphasizing the necessity for automated solutions. This study uses the
"Pumpkin Leaf Disease Dataset", which comprises 2000 high-resolution images
separated into five categories: downy mildew, powdery mildew, mosaic disease,
bacterial leaf spot, and healthy leaves. The dataset was rigorously assembled
from several agricultural fields to ensure a strong representation for model
training. We explored several established deep learning architectures, including
DenseNet201, DenseNet121, DenseNet169, Xception, ResNet50, ResNet101 and
InceptionResNetV2, and observed that ResNet50 performed most effectively, with
an accuracy of 90.5% and comparable precision, recall, and F1-Score. We used
Explainable AI (XAI) approaches like Grad-CAM, Grad-CAM++, Score-CAM, and
Layer-CAM to provide meaningful representations of model decision-making
processes, which improved understanding and trust in automated disease
diagnostics. These findings demonstrate ResNet50's potential to revolutionize
pumpkin leaf disease detection, allowing for earlier and more accurate
treatments.
comment: Accepted in 2024 27th International Conference on Computer and
Information Technology (ICCIT)
★ Relative Pose Estimation through Affine Corrections of Monocular Depth Priors
Monocular depth estimation (MDE) models have undergone significant
advancements over recent years. Many MDE models aim to predict affine-invariant
relative depth from monocular images, while recent developments in large-scale
training and vision foundation models enable reasonable estimation of metric
(absolute) depth. However, effectively leveraging these predictions for
geometric vision tasks, in particular relative pose estimation, remains
relatively underexplored. While depths provide rich constraints for cross-view
image alignment, the intrinsic noise and ambiguity from the monocular depth
priors present practical challenges to improving upon classic keypoint-based
solutions. In this paper, we develop three solvers for relative pose estimation
that explicitly account for independent affine (scale and shift) ambiguities,
covering both calibrated and uncalibrated conditions. We further propose a
hybrid estimation pipeline that combines our proposed solvers with classic
point-based solvers and epipolar constraints. We find that the affine
correction modeling is beneficial to not only the relative depth priors but
also, surprisingly, the "metric" ones. Results across multiple datasets
demonstrate large improvements of our approach over classic keypoint-based
baselines and PnP-based solutions, under both calibrated and uncalibrated
setups. We also show that our method improves consistently with different
feature matchers and MDE models, and can further benefit from very recent
advances on both modules. Code is available at
https://github.com/MarkYu98/madpose.
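The ambiguity being modeled is that a monocular depth prediction relates to true depth only up to an unknown scale and shift, d_true ≈ a * d_pred + b. The toy sketch below recovers (a, b) by least squares on synthetic points; it only illustrates the affine model, not the paper's minimal solvers or hybrid RANSAC pipeline.

```python
# Toy illustration of the affine (scale, shift) depth ambiguity.
import numpy as np

def fit_affine_depth(d_pred, d_true):
    """Recover scale a and shift b minimizing ||a * d_pred + b - d_true||^2."""
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d_true, rcond=None)
    return a, b

# Synthetic check: a known affine distortion is recovered from noisy samples.
rng = np.random.default_rng(0)
d_true = rng.uniform(1.0, 10.0, size=200)
d_pred = (d_true - 0.5) / 2.0 + rng.normal(0, 0.01, size=200)   # distorted prediction
a, b = fit_affine_depth(d_pred, d_true)
print(a, b)   # close to (2.0, 0.5)
```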
★ Consistent Flow Distillation for Text-to-3D Generation
Score Distillation Sampling (SDS) has made significant strides in distilling
image-generative models for 3D generation. However, its
maximum-likelihood-seeking behavior often leads to degraded visual quality and
diversity, limiting its effectiveness in 3D applications. In this work, we
propose Consistent Flow Distillation (CFD), which addresses these limitations.
We begin by leveraging the gradient of the diffusion ODE or SDE sampling
process to guide the 3D generation. From the gradient-based sampling
perspective, we find that the consistency of 2D image flows across different
viewpoints is important for high-quality 3D generation. To achieve this, we
introduce multi-view consistent Gaussian noise on the 3D object, which can be
rendered from various viewpoints to compute the flow gradient. Our experiments
demonstrate that CFD, through consistent flows, significantly outperforms
previous methods in text-to-3D generation.
comment: Project page: https://runjie-yan.github.io/cfd/
★ Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
The ability to organically reason over and with both text and images is a
pillar of human intelligence, yet the ability of Multimodal Large Language
Models (MLLMs) to perform such multimodal reasoning remains under-explored.
Existing benchmarks often emphasize text-dominant reasoning or rely on shallow
visual cues, failing to adequately assess integrated visual and textual
reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark
targeting organic multimodal reasoning across mathematics, physics, chemistry,
and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be
addressed by reasoning independently in each modality, offering an enhanced
test suite for MLLMs' reasoning capabilities. Our evaluation of
state-of-the-art MLLMs on EMMA reveals significant limitations in handling
complex multimodal and multi-step reasoning tasks, with even advanced
techniques like Chain-of-Thought prompting and test-time compute scaling
underperforming. These findings underscore the need for improved multimodal
architectures and training paradigms to close the gap between human and model
reasoning in multimodality.
★ Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces
Video tokenizers are essential for latent video diffusion models, converting
raw video data into spatiotemporally compressed latent spaces for efficient
training. However, extending state-of-the-art video tokenizers to achieve a
temporal compression ratio beyond 4x without increasing channel capacity poses
significant challenges. In this work, we propose an alternative approach to
enhance temporal compression. We find that the reconstruction quality of
temporally subsampled videos from a low-compression encoder surpasses that of
high-compression encoders applied to original videos. This indicates that
high-compression models can leverage representations from lower-compression
models. Building on this insight, we develop a bootstrapped
high-temporal-compression model that progressively trains high-compression
blocks atop well-trained lower-compression models. Our method includes a
cross-level feature-mixing module to retain information from the pretrained
low-compression model and guide higher-compression blocks to capture the
remaining details from the full video sequence. Evaluations on video benchmarks
shows that our method significantly improves reconstruction quality while
increasing temporal compression compared to direct extensions of existing video
tokenizers. Furthermore, the resulting compact latent space effectively trains
a video diffusion model for high-quality video generation with a reduced token
budget.
comment: Project website:
https://progressive-video-tokenizer.github.io/Pro-MAG/
★ The GAN is dead; long live the GAN! A Modern GAN Baseline NeurIPS 2024
There is a widespread claim that GANs are difficult to train, and GAN
architectures in the literature are littered with empirical tricks. We provide
evidence against this claim and build a modern GAN baseline in a more
principled manner. First, we derive a well-behaved regularized relativistic GAN
loss that addresses issues of mode dropping and non-convergence that were
previously tackled via a bag of ad-hoc tricks. We analyze our loss
mathematically and prove that it admits local convergence guarantees, unlike
most existing relativistic losses. Second, our new loss allows us to discard
all ad-hoc tricks and replace outdated backbones used in common GANs with
modern architectures. Using StyleGAN2 as an example, we present a roadmap of
simplification and modernization that results in a new minimalist baseline --
R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ,
ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against
state-of-the-art GANs and diffusion models.
comment: Accepted to NeurIPS 2024. Code available at
https://github.com/brownvc/R3GAN/
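The loss family referenced here is the relativistic pairwise GAN objective. The sketch below shows its generic softplus form for the discriminator and generator; the paper's full regularized loss also includes R1/R2 gradient penalties, which are omitted here.

```python
# Generic relativistic pairwise GAN loss with softplus f (gradient penalties omitted).
import torch
import torch.nn.functional as F

def relativistic_d_loss(d_real, d_fake):
    """Discriminator: push D(real) above D(fake) on paired samples."""
    return F.softplus(d_fake - d_real).mean()

def relativistic_g_loss(d_real, d_fake):
    """Generator: push D(fake) above D(real) on paired samples."""
    return F.softplus(d_real - d_fake).mean()

# Toy usage with random critic scores for a batch of paired real/fake samples.
d_real, d_fake = torch.randn(8), torch.randn(8)
print(relativistic_d_loss(d_real, d_fake), relativistic_g_loss(d_real, d_fake))
```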
★ $DPF^*$: improved Depth Potential Function for scale-invariant sulcal depth estimation
The shape of the human brain is complex and highly variable, with interactions
between brain size, cortical folding, and age well-documented in the
literature. However, few studies have explored how global brain size influences
geometric features of the cortical surface derived from anatomical MRI. In this
work, we focus on sulcal depth, an imaging phenotype that has gained
significant attention in both basic research and clinical applications. We make
key contributions to the field by: 1) providing the first quantitative analysis
of how brain size affects sulcal depth measurements; 2) introducing a novel,
scale-invariant method for sulcal depth estimation based on an original
formalization of the problem; 3) presenting a validation framework and sharing
our code and benchmark data with the community; and 4) demonstrating the
biological relevance of our new sulcal depth measure using a large sample of
1,987 subjects spanning the developmental period from 26 weeks post-conception
to adulthood.
comment: GA and JL contributed equally to this work
★ Flatland Vision
When is it possible to project two sets of labeled points lying in a pair of
projective planes to the same image on a projective line? We give a complete
answer to this question and describe the loci of the projection centers that
enable a common image. In particular, we find that there exists a solution to
this problem if and only if these two sets are themselves images of a common
pointset in projective space.
★ Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation
Recent advances in 2D image generation have achieved remarkable
quality, largely driven by the capacity of diffusion models and the availability
of large-scale datasets. However, direct 3D generation is still constrained by
the scarcity and lower fidelity of 3D datasets. In this paper, we introduce
Zero-1-to-G, a novel approach that addresses this problem by enabling direct
single-view generation on Gaussian splats using pretrained 2D diffusion models.
Our key insight is that Gaussian splats, a 3D representation, can be decomposed
into multi-view images encoding different attributes. This reframes the
challenging task of direct 3D generation within a 2D diffusion framework,
allowing us to leverage the rich priors of pretrained 2D diffusion models. To
incorporate 3D awareness, we introduce cross-view and cross-attribute attention
layers, which capture complex correlations and enforce 3D consistency across
generated splats. This makes Zero-1-to-G the first direct image-to-3D
generative model to effectively utilize pretrained 2D diffusion priors,
enabling efficient training and improved generalization to unseen objects.
Extensive experiments on both synthetic and in-the-wild datasets demonstrate
superior performance in 3D object generation, offering a new approach to
high-quality 3D generation.
★ From Images to Insights: Transforming Brain Cancer Diagnosis with Explainable AI
Brain cancer represents a major challenge in medical diagnostics, requiring
precise and timely detection for effective treatment. Diagnosis initially
relies on the proficiency of radiologists, which can cause difficulties and
risks when such expertise is scarce. Despite the use of imaging resources,
brain cancer diagnosis often remains difficult, time-consuming, and vulnerable
to intraclass variability. This study presents the Bangladesh Brain Cancer MRI
Dataset, containing 6,056 MRI images organized into three categories: Brain
Tumor, Brain Glioma, and Brain Menin. The dataset was collected from several
hospitals in Bangladesh, providing a diverse and realistic sample for research.
We implemented advanced deep learning models, and DenseNet169 achieved
exceptional results, with accuracy, precision, recall, and F1-Score all
reaching 0.9983. In addition, Explainable AI (XAI) methods including GradCAM,
GradCAM++, ScoreCAM, and LayerCAM were employed to provide visual
representations of the decision-making processes of the models. In the context
of brain cancer, these techniques highlight DenseNet169's potential to enhance
diagnostic accuracy while simultaneously offering transparency, facilitating
early diagnosis and better patient outcomes.
comment: Accepted in 2024 27th International Conference on Computer and
Information Technology (ICCIT)
★ Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation
Training audio-to-image generative models requires an abundance of diverse
audio-visual pairs that are semantically aligned. Such data is almost always
curated from in-the-wild videos, given the cross-modal semantic correspondence
that is inherent to them. In this work, we hypothesize that insisting on the
absolute need for ground truth audio-visual correspondence is not only
unnecessary, but also leads to severe restrictions in scale, quality, and
diversity of the data, ultimately impairing its use in modern generative
models. To this end, we propose a scalable image sonification framework where
instances from a variety of high-quality yet disjoint uni-modal origins can be
artificially paired through a retrieval process that is empowered by reasoning
capabilities of modern vision-language models. To demonstrate the efficacy of
this approach, we use our sonified images to train an audio-to-image generative
model that performs competitively against state-of-the-art. Finally, through a
series of ablation studies, we exhibit several intriguing auditory capabilities
like semantic mixing and interpolation, loudness calibration and acoustic space
modeling through reverberation that our model has implicitly developed to guide
the image generation process.
★ A Novel Pathology Foundation Model by Mayo Clinic, Charité, and Aignostics
Maximilian Alber, Stephan Tietz, Jonas Dippel, Timo Milbich, Timothée Lesort, Panos Korfiatis, Moritz Krügener, Beatriz Perez Cancer, Neelay Shah, Alexander Möllers, Philipp Seegerer, Alexandra Carpen-Amarie, Kai Standvoss, Gabriel Dernbach, Edwin de Jong, Simon Schallenberg, Andreas Kunft, Helmut Hoffer von Ankershoffen, Gavin Schaeferle, Patrick Duffy, Matt Redlon, Philipp Jurmeister, David Horst, Lukas Ruff, Klaus-Robert Müller, Frederick Klauschen, Andrew Norgan
Recent advances in digital pathology have demonstrated the effectiveness of
foundation models across diverse applications. In this report, we present a
novel vision foundation model based on the RudolfV approach. Our model was
trained on a dataset comprising 1.2 million histopathology whole slide images,
collected from two medical institutions: Mayo Clinic and Charité -
Universitätsmedizin Berlin. Comprehensive evaluations show that our model
achieves state-of-the-art performance across twenty-one public benchmark
datasets, even though it is neither the largest model by parameter count nor by
training dataset size.
★ Performance of YOLOv7 in Kitchen Safety While Handling Knife
Safe knife practices in the kitchen significantly reduce the risk of cuts,
injuries, and serious accidents during food preparation. Using YOLOv7, an
advanced object detection model, this study focuses on identifying safety risks
during knife handling, particularly improper finger placement and blade contact
with the hand. The model's performance was evaluated using metrics such as
precision, recall, mAP50, and mAP50-95. The results demonstrate that YOLOv7
achieved its best performance at epoch 31, with a mAP50-95 score of 0.7879,
precision of 0.9063, and recall of 0.7503. These findings highlight YOLOv7's
potential to accurately detect knife-related hazards, promoting the development
of improved kitchen safety practices.
★ Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance
Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, Stefanos Zafeiriou
Inspired by the effectiveness of 3D Gaussian Splatting (3DGS) in
reconstructing detailed 3D scenes within multi-view setups and the emergence of
large 2D human foundation models, we introduce Arc2Avatar, the first SDS-based
method utilizing a human face foundation model as guidance with just a single
image as input. To achieve that, we extend such a model for diverse-view human
head generation by fine-tuning on synthetic data and modifying its
conditioning. Our avatars maintain a dense correspondence with a human face
mesh template, allowing blendshape-based expression generation. This is
achieved through a modified 3DGS approach, connectivity regularizers, and a
strategic initialization tailored for our task. Additionally, we propose an
optional efficient SDS-based correction step to refine the blendshape
expressions, enhancing realism and diversity. Experiments demonstrate that
Arc2Avatar achieves state-of-the-art realism and identity preservation,
effectively addressing color issues by allowing the use of very low guidance,
enabled by our strong identity prior and initialization strategy, without
compromising detail.
★ 1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On
Virtual Try-On (VTON) has become a crucial tool in e-commerce, enabling the
realistic simulation of garments on individuals while preserving their original
appearance and pose. Early VTON methods relied on single generative networks,
but challenges remain in preserving fine-grained garment details due to
limitations in feature extraction and fusion. To address these issues, recent
approaches have adopted a dual-network paradigm, incorporating a complementary
"ReferenceNet" to enhance garment feature extraction and fusion. While
effective, this dual-network approach introduces significant computational
overhead, limiting its scalability for high-resolution and long-duration
image/video VTON applications. In this paper, we challenge the dual-network
paradigm by proposing a novel single-network VTON method that overcomes the
limitations of existing techniques. Our method, namely MNVTON, introduces a
Modality-specific Normalization strategy that separately processes text, image
and video inputs, enabling them to share the same attention layers in a VTON
network. Extensive experimental results demonstrate the effectiveness of our
approach, showing that it consistently achieves higher-quality, more detailed
results for both image and video VTON tasks. Our results suggest that the
single-network paradigm can rival the performance of dual-network approaches,
offering a more efficient alternative for high-quality, scalable VTON
applications.
comment: Project page: https://ningshuliang.github.io/2023/Arxiv/index.html
★ CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models
With advances in diffusion models, image generation has shown significant
performance improvements. This raises concerns about the potential abuse of
image generation, such as the creation of explicit or violent images, commonly
referred to as Not Safe For Work (NSFW) content. To address this, the Stable
Diffusion model includes several safety checkers to censor initial text prompts
and final output images generated from the model. However, recent research has
shown that these safety checkers have vulnerabilities against adversarial
attacks, allowing them to generate NSFW images. In this paper, we find that
these adversarial attacks are not robust to small changes in text prompts or
input latents. Based on this, we propose CROPS (Circular or RandOm Prompts for
Safety), a model-agnostic framework that easily defends against adversarial
attacks generating NSFW images without requiring additional training. Moreover,
we develop an approach that utilizes one-step diffusion models for efficient
NSFW detection (CROPS-1), further reducing computational resources. We
demonstrate the superiority of our method in terms of performance and
applicability.
★ JAQ: Joint Efficient Architecture Design and Low-Bit Quantization with Hardware-Software Co-Exploration AAAI 2025
Mingzi Wang, Yuan Meng, Chen Tang, Weixiang Zhang, Yijian Qin, Yang Yao, Yingxin Li, Tongtong Feng, Xin Wang, Xun Guan, Zhi Wang, Wenwu Zhu
The co-design of neural network architectures, quantization precisions, and
hardware accelerators offers a promising approach to achieving an optimal
balance between performance and efficiency, particularly for model deployment
on resource-constrained edge devices. In this work, we propose the JAQ
Framework, which jointly optimizes the three critical dimensions. However,
effectively automating the design process across the vast search space of those
three dimensions poses significant challenges, especially when pursuing
extremely low-bit quantization. Specifically, the primary challenges include: (1)
Memory overhead on the software side: low-precision quantization-aware training can
lead to significant memory usage due to storing large intermediate features and
latent weights for back-propagation, potentially causing memory exhaustion. (2)
Time-consuming search on the hardware side: the discrete nature of hardware
parameters and the complex interplay between compiler optimizations and
individual operators make the accelerator search time-consuming. To address
these issues, JAQ mitigates the memory overhead through a channel-wise sparse
quantization (CSQ) scheme, selectively applying quantization to the most
sensitive components of the model during optimization. Additionally, JAQ
designs BatchTile, which employs a hardware generation network to encode all
possible tiling modes, thereby speeding up the search for the optimal compiler
mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ,
achieving approximately 7% higher Top-1 accuracy on ImageNet compared to
previous methods and reducing the hardware search time per iteration to 0.15
seconds.
comment: Accepted by AAAI 2025
★ Comparison Study: Glacier Calving Front Delineation in Synthetic Aperture Radar Images With Deep Learning
Nora Gourmelon, Konrad Heidler, Erik Loebel, Daniel Cheng, Julian Klink, Anda Dong, Fei Wu, Noah Maul, Moritz Koch, Marcel Dreier, Dakota Pyles, Thorsten Seehaus, Matthias Braun, Andreas Maier, Vincent Christlein
Calving front position variation of marine-terminating glaciers is an
indicator of ice mass loss and a crucial parameter in numerical glacier models.
Deep Learning (DL) systems can automatically extract this position from
Synthetic Aperture Radar (SAR) imagery, enabling continuous, weather- and
illumination-independent, large-scale monitoring. This study presents the first
comparison of DL systems on a common calving front benchmark dataset. A
multi-annotator study with ten annotators is performed to contrast the
best-performing DL system against human performance. The best DL model's
outputs deviate 221 m on average, while the average deviation of the human
annotators is 38 m. This significant difference shows that current DL systems
do not yet match human performance and that further research is needed to
enable fully automated monitoring of glacier calving fronts. The study of
Vision Transformers, foundation models, and the inclusion and processing
strategy of more information are identified as avenues for future research.
★ Solving the Catastrophic Forgetting Problem in Generalized Category Discovery CVPR 2024
Generalized Category Discovery (GCD) aims to identify a mix of known and
novel categories within unlabeled datasets, providing a more realistic setting
for image recognition. Essentially, GCD needs to remember existing patterns
thoroughly to recognize novel categories. Recent state-of-the-art method SimGCD
transfers the knowledge from known-class data to the learning of novel classes
through debiased learning. However, some patterns are catastrophically
forgotten during adaptation, leading to poor performance in novel category
classification. To address this issue, we propose a novel learning approach,
LegoGCD, which is seamlessly integrated into previous methods to enhance the
discrimination of novel classes while maintaining performance on previously
encountered known classes. Specifically, we design two types of techniques
termed Local Entropy Regularization (LER) and Dual-views Kullback-Leibler
divergence constraint (DKL). The LER optimizes the distribution of potential
known class samples in unlabeled data, thus ensuring the preservation of
knowledge related to known categories while learning novel classes. Meanwhile,
DKL introduces Kullback-Leibler divergence to encourage the model to produce a
similar prediction distribution of two view samples from the same image. In
this way, it successfully avoids mismatched prediction and generates more
reliable potential known class samples simultaneously. Extensive experiments
validate that the proposed LegoGCD effectively addresses the known category
forgetting issue across all datasets, e.g., delivering a 7.74% and 2.51% accuracy
boost on known and novel classes in CUB, respectively. Our code is available
at: https://github.com/Cliffia123/LegoGCD.
comment: Accepted by CVPR 2024
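The DKL constraint encourages two augmented views of the same image to produce similar prediction distributions. The sketch below shows a generic two-view KL consistency term of that kind; the exact direction, temperature, and weighting used in LegoGCD are not reproduced here.

```python
# Generic dual-view KL consistency term between predictions for two augmentations.
import torch
import torch.nn.functional as F

def dual_view_kl(logits_v1, logits_v2):
    """KL(p_v2 || p_v1), averaged over the batch."""
    log_p1 = F.log_softmax(logits_v1, dim=-1)
    p2 = F.softmax(logits_v2, dim=-1)
    return F.kl_div(log_p1, p2, reduction="batchmean")

# Toy usage: logits from two augmented views of the same batch of images.
logits_v1, logits_v2 = torch.randn(16, 100), torch.randn(16, 100)
consistency = dual_view_kl(logits_v1, logits_v2)
```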
★ CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models
Digital Pathology is a cornerstone in the diagnosis and treatment of
diseases. A key task in this field is the identification and segmentation of
cells in hematoxylin and eosin-stained images. Existing methods for cell
segmentation often require extensive annotated datasets for training and are
limited to a predefined cell classification scheme. To overcome these
limitations, we propose CellViT++, a framework
for generalized cell segmentation in digital pathology.
CellViT++ utilizes Vision Transformers with
foundation models as encoders to compute deep cell features and segmentation
masks simultaneously. To adapt to unseen cell types, we rely on a
computationally efficient approach. It requires minimal data for training and
leads to a drastically reduced carbon footprint. We demonstrate excellent
performance on seven different datasets, covering a broad spectrum of cell
types, organs, and clinical settings. The framework achieves remarkable
zero-shot segmentation and data-efficient cell-type classification.
Furthermore, we show that CellViT++ can
leverage immunofluorescence stainings to generate training datasets without the
need for pathologist annotations. The automated dataset generation approach
surpasses the performance of networks trained on manually labeled data,
demonstrating its effectiveness in creating high-quality training datasets
without expert annotations. To advance digital pathology,
CellViT++ is available as an open-source
framework featuring a user-friendly, web-based interface for visualization and
annotation. The code is available at
https://github.com/TIO-IKIM/CellViT-plus-plus.
★ Patch-GAN Transfer Learning with Reconstructive Models for Cloud Removal
Cloud removal plays a crucial role in enhancing remote sensing image
analysis, yet accurately reconstructing cloud-obscured regions remains a
significant challenge. Recent advancements in generative models have made the
generation of realistic images increasingly accessible, offering new
opportunities for this task. Given the conceptual alignment between image
generation and cloud removal tasks, generative models present a promising
approach for addressing cloud removal in remote sensing. In this work, we
propose a deep transfer learning approach built on a generative adversarial
network (GAN) framework to explore the potential of the novel masked
autoencoder (MAE) image reconstruction model in cloud removal. Due to the
complexity of remote sensing imagery, we further propose using a patch-wise
discriminator to determine whether each patch of the image is real or not. The
proposed reconstructive transfer learning approach demonstrates significant
improvements in cloud removal performance compared to other GAN-based methods.
Additionally, whilst direct comparisons with some of the state-of-the-art cloud
removal techniques are limited due to unclear details regarding their
train/test data splits, the proposed model achieves competitive results based
on available benchmarks.
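A patch-wise discriminator scores each local patch of an image as real or fake instead of producing a single global score. The sketch below is a generic PatchGAN-style discriminator; the channel sizes and depth are illustrative, not the configuration used in this work.

```python
# Generic patch-wise discriminator: outputs a grid of real/fake logits, one per patch.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),   # one logit per patch
        )

    def forward(self, x):
        return self.net(x)

scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256))
print(scores.shape)   # torch.Size([1, 1, 31, 31]) -- a map of patch logits
```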
★ Towards Balanced Continual Multi-Modal Learning in Human Pose Estimation
3D human pose estimation (3D HPE) has emerged as a prominent research topic,
particularly in the realm of RGB-based methods. However, RGB images are
susceptible to limitations such as sensitivity to lighting conditions and
potential user discomfort. Consequently, multi-modal sensing, which leverages
non-intrusive sensors, is gaining increasing attention. Nevertheless,
multi-modal 3D HPE still faces challenges, including modality imbalance and the
imperative for continual learning. In this work, we introduce a novel balanced
continual multi-modal learning method for 3D HPE, which harnesses the power of
RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based
contribution algorithm to quantify the contribution of each modality and
identify modality imbalance. To address this imbalance, we employ a re-learning
strategy. Furthermore, recognizing that raw data is prone to noise
contamination, we develop a novel denoising continual learning approach. This
approach incorporates a noise identification and separation module to mitigate
the adverse effects of noise and collaborates with the balanced learning
strategy to enhance optimization. Additionally, an adaptive EWC mechanism is
employed to alleviate catastrophic forgetting. We conduct extensive experiments
on the widely-adopted multi-modal dataset, MM-Fi, which demonstrate the
superiority of our approach in boosting 3D pose estimation and mitigating
catastrophic forgetting in complex scenarios. We will release our code.
★ Domain-Incremental Semantic Segmentation for Autonomous Driving under Adverse Driving Conditions ICPR
Semantic segmentation for autonomous driving is an even more challenging task
when faced with adverse driving conditions. Standard models trained on data
recorded under ideal conditions show a deteriorated performance in unfavorable
weather or illumination conditions. Fine-tuning on the new task or condition
would lead to overwriting the previously learned information, resulting in
catastrophic forgetting. Adapting to the new conditions through traditional
domain adaptation methods improves the performance on the target domain at the
expense of the source domain. Addressing these issues, we propose an
architecture-based domain-incremental learning approach called Progressive
Semantic Segmentation (PSS). PSS is a task-agnostic, dynamically growing
collection of domain-specific segmentation models. The task of inferring the
domain and subsequently selecting the appropriate module for segmentation is
carried out using a collection of convolutional autoencoders. We extensively
evaluate our proposed approach using several datasets at varying levels of
granularity in the categorization of adverse driving conditions. Furthermore,
we demonstrate the generalization of the proposed approach to similar and
unseen domains.
comment: Accepted at ICPRAM 2025
★ Optimized Sampling for Non-Line-of-Sight Imaging Using Modified Fast Fourier Transforms
Non-line-of-Sight (NLOS) imaging systems collect light at a diffuse relay
surface and input this measurement into computational algorithms that output a
3D volumetric reconstruction. These algorithms utilize the Fast Fourier
Transform (FFT) to accelerate the reconstruction process but require both input
and output to be sampled spatially with uniform grids. However, the geometry of
NLOS imaging inherently results in non-uniform sampling on the relay surface
when using multi-pixel detector arrays, even though such arrays significantly
reduce acquisition times. Furthermore, using these arrays increases the data
rate required for sensor readout, posing challenges for real-world deployment.
In this work, we utilize the phasor field framework to demonstrate that
existing NLOS imaging setups typically oversample the relay surface spatially,
explaining why the measurement can be compressed without significantly
sacrificing reconstruction quality. This enables us to utilize the Non-Uniform
Fast Fourier Transform (NUFFT) to reconstruct from sparse measurements acquired
from irregularly sampled relay surfaces of arbitrary shapes. Furthermore, we
utilize the NUFFT to reconstruct at arbitrary locations in the hidden volume,
ensuring flexible sampling schemes for both the input and output. Finally, we
utilize the Scaled Fast Fourier Transform (SFFT) to reconstruct larger volumes
without increasing the number of samples stored in memory. All algorithms
introduced in this paper preserve the computational complexity of FFT-based
methods, ensuring scalability for practical NLOS imaging applications.
★ Scaffold-SLAM: Structured 3D Gaussians for Simultaneous Localization and Photorealistic Mapping
3D Gaussian Splatting (3DGS) has recently revolutionized novel view synthesis
in Simultaneous Localization and Mapping (SLAM). However, existing SLAM
methods utilizing 3DGS have failed to provide high-quality novel view rendering
for monocular, stereo, and RGB-D cameras simultaneously. Notably, some methods
perform well for RGB-D cameras but suffer significant degradation in rendering
quality for monocular cameras. In this paper, we present Scaffold-SLAM, which
delivers simultaneous localization and high-quality photorealistic mapping
across monocular, stereo, and RGB-D cameras. We introduce two key innovations
to achieve this state-of-the-art visual quality. First, we propose
Appearance-from-Motion embedding, enabling 3D Gaussians to better model image
appearance variations across different camera poses. Second, we introduce a
frequency regularization pyramid to guide the distribution of Gaussians,
allowing the model to effectively capture finer details in the scene. Extensive
experiments on monocular, stereo, and RGB-D datasets demonstrate that
Scaffold-SLAM significantly outperforms state-of-the-art methods in
photorealistic mapping quality, e.g., PSNR is 16.76% higher on the TUM RGB-D
dataset for monocular cameras.
comment: 12 pages, 6 figures
★ Contrast-Free Myocardial Scar Segmentation in Cine MRI using Motion and Texture Fusion
Guang Yang, Jingkun Chen, Xicheng Sheng, Shan Yang, Xiahai Zhuang, Betty Raman, Lei Li, Vicente Grau
Late gadolinium enhancement MRI (LGE MRI) is the gold standard for the
detection of myocardial scars after myocardial infarction (MI). LGE MRI
requires the injection of a contrast agent, which carries potential side
effects and increases scanning time and patient discomfort. To address these
issues, we propose a novel framework that combines cardiac motion observed in
cine MRI with image texture information to segment the myocardium and scar
tissue in the left ventricle. Cardiac motion tracking can be formulated as a
full cardiac image cycle registration problem, which can be solved via deep
neural networks. Experimental results prove that the proposed method can
achieve scar segmentation based on non-contrasted cine images with comparable
accuracy to LGE MRI. This demonstrates its potential as an alternative to
contrast-enhanced techniques for scar detection.
comment: 5 pages, 2 figures, 2 tables
★ Is Your Autonomous Vehicle Safe? Understanding the Threat of Electromagnetic Signal Injection Attacks on Traffic Scene Perception AAAI 2025
Autonomous vehicles rely on camera-based perception systems to comprehend
their driving environment and make crucial decisions, thereby ensuring that
vehicles steer safely. However, a significant threat known as Electromagnetic Signal
Injection Attacks (ESIA) can distort the images captured by these cameras,
leading to incorrect AI decisions and potentially compromising the safety of
autonomous vehicles. Despite the serious implications of ESIA, there is limited
understanding of its impacts on the robustness of AI models across various and
complex driving scenarios. To address this gap, our research analyzes the
performance of different models under ESIA, revealing their vulnerabilities to
the attacks. Moreover, due to the challenges in obtaining real-world attack
data, we develop a novel ESIA simulation method and generate a simulated attack
dataset for different driving scenarios. Our research provides a comprehensive
simulation and evaluation framework, aiming to enhance the development of more
robust AI models and secure intelligent systems, ultimately contributing to the
advancement of safer and more reliable technology across various fields.
comment: To appear in AAAI 2025
★ FOCUS: Towards Universal Foreground Segmentation
Foreground segmentation is a fundamental task in computer vision,
encompassing various subdivision tasks. Previous research has typically
designed task-specific architectures for each task, leading to a lack of
unification. Moreover, they primarily focus on recognizing foreground objects
without effectively distinguishing them from the background. In this paper, we
emphasize the importance of the background and its relationship with the
foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation
framework that can handle multiple foreground tasks. We develop a multi-scale
semantic network using the edge information of objects to enhance image
features. To achieve boundary-aware segmentation, we propose a novel
distillation method, integrating the contrastive learning strategy to refine
the prediction mask in multi-modal feature space. We conduct extensive
experiments on a total of 13 datasets across 5 tasks, and the results
demonstrate that FOCUS consistently outperforms the state-of-the-art
task-specific models on most metrics.
★ Automated external cervical resorption segmentation in cone-beam CT using local texture features
External cervical resorption (ECR) is a resorptive process affecting teeth.
While in some patients, active resorption ceases and gets replaced by osseous
tissue, in other cases, the resorption progresses and ultimately results in
tooth loss. For proper ECR assessment, cone-beam computed tomography (CBCT) is
the recommended imaging modality, enabling a 3-D characterization of these
lesions. While it is possible to manually identify and measure ECR resorption
in CBCT scans, this process can be time-intensive and highly subject to human
error. Therefore, there is an urgent need to develop an automated method to
identify and quantify the severity of ECR resorption using CBCT. Here, we
present a method for ECR lesion segmentation that is based on automatic, binary
classification of locally extracted voxel-wise texture features. We evaluate
our method on 6 longitudinal CBCT datasets and show that certain
texture-features can be used to accurately detect subtle CBCT signal changes
due to ECR. We also present preliminary analyses clustering texture features
within a lesion to stratify the defects and identify patterns indicative of
calcification. These methods are important steps in developing prognostic
biomarkers to predict whether ECR will continue to progress or cease,
ultimately informing treatment decisions.
comment: 4 pages, 3 figures, 1 table
★ Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection
Out-of-distribution (OOD) detection has seen significant advancements with
zero-shot approaches by leveraging the powerful Vision-Language Models (VLMs)
such as CLIP. However, prior research has predominantly focused on
enhancing Far-OOD performance, while potentially compromising Near-OOD
efficacy, as observed from our pilot study. To address this issue, we propose a
novel strategy to enhance zero-shot OOD detection performances for both Far-OOD
and Near-OOD scenarios by innovatively harnessing Large Language Models (LLMs)
and VLMs. Our approach first exploits an LLM to generate superclasses of the ID
labels and their corresponding background descriptions followed by feature
extraction using CLIP. We then isolate the core semantic features for ID data
by subtracting background features from the superclass features. The refined
representation facilitates the selection of more appropriate negative labels
for OOD data from a comprehensive candidate label set of WordNet, thereby
enhancing the performance of zero-shot OOD detection in both scenarios.
Furthermore, we introduce novel few-shot prompt tuning and visual prompt tuning
to adapt the proposed framework to better align with the target distribution.
Experimental results demonstrate that the proposed approach consistently
outperforms current state-of-the-art methods across multiple benchmarks, with
an improvement of up to 2.9% in AUROC and a reduction of up to 12.6% in FPR95.
Additionally, our method exhibits superior robustness against covariate shift
across different domains, further highlighting its effectiveness in real-world
scenarios.
comment: 9 pages, 4 figures
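The refinement step described here subtracts background text features from superclass text features to isolate core ID semantics before scoring. The sketch below illustrates one simple version of that operation on placeholder embeddings standing in for CLIP features; the WordNet negative-label selection and prompt-tuning stages are not shown.

```python
# Sketch of feature refinement: subtract background features, renormalize, then score.
import torch
import torch.nn.functional as F

def refine_id_features(superclass_feats, background_feats):
    """Remove background semantics and re-normalize (one simple variant)."""
    refined = superclass_feats - background_feats
    return F.normalize(refined, dim=-1)

# Placeholder embeddings: 10 superclasses with matching background descriptions.
superclass_feats = F.normalize(torch.randn(10, 512), dim=-1)
background_feats = F.normalize(torch.randn(10, 512), dim=-1)
image_feats = F.normalize(torch.randn(4, 512), dim=-1)

refined = refine_id_features(superclass_feats, background_feats)
id_scores = image_feats @ refined.T          # higher max score => more ID-like
print(id_scores.max(dim=-1).values)
```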
★ Light Transport-aware Diffusion Posterior Sampling for Single-View Reconstruction of 3D Volumes
We introduce a single-view reconstruction technique of volumetric fields in
which multiple light scattering effects are omnipresent, such as in clouds. We
model the unknown distribution of volumetric fields using an unconditional
diffusion model trained on a novel benchmark dataset comprising 1,000
synthetically simulated volumetric density fields. The neural diffusion model
is trained on the latent codes of a novel, diffusion-friendly, monoplanar
representation. The generative model is used to incorporate a tailored
parametric diffusion posterior sampling technique into different reconstruction
tasks. A physically-based differentiable volume renderer is employed to provide
gradients with respect to light transport in the latent space. This stands in
contrast to classic NeRF approaches and makes the reconstructions better
aligned with observed data. Through various experiments, we demonstrate
single-view reconstruction of volumetric clouds at a previously unattainable
quality.
★ MHAFF: Multi-Head Attention Feature Fusion of CNN and Transformer for Cattle Identification
Convolutional Neural Networks (CNNs) have drawn researchers' attention to
identifying cattle using muzzle images. However, CNNs often fail to capture
long-range dependencies within the complex patterns of the muzzle. Transformers,
in contrast, handle these challenges. This inspired us to fuse the strengths of
CNNs and transformers in muzzle-based cattle identification. Addition and
concatenation have been the most commonly used techniques for feature fusion.
However, addition fails to preserve discriminative information, while
concatenation results in an increase in dimensionality. Both methods are simple
operations and cannot discover the relationships or interactions between fusing
features. This research aims to overcome the issues faced by addition and
concatenation. This research introduces a novel approach called Multi-Head
Attention Feature Fusion (MHAFF) for the first time in cattle identification.
MHAFF captures relations between the different types of fusing features while
preserving their originality. The experiments show that MHAFF outperformed
addition and concatenation techniques and the existing cattle identification
methods in accuracy on two publicly available cattle datasets. MHAFF
demonstrates excellent performance and quickly converges to achieve optimum
accuracy of 99.88% and 99.52% on the two cattle datasets, respectively.
comment: 30 pages
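The sketch below shows a generic way to fuse two feature streams with multi-head attention, letting one stream query the other so the fusion captures cross-feature relations rather than simple addition or concatenation. It illustrates the general idea only, not the exact MHAFF architecture.

```python
# Generic cross-attention fusion of CNN and transformer features (PyTorch).
import torch
import torch.nn as nn

d_model = 256
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

cnn_feats = torch.randn(2, 49, d_model)   # e.g. flattened 7x7 CNN feature map
vit_feats = torch.randn(2, 49, d_model)   # e.g. transformer patch tokens

# CNN features attend to transformer features; a residual keeps their originality.
fused, attn_weights = attn(query=cnn_feats, key=vit_feats, value=vit_feats)
fused = fused + cnn_feats
print(fused.shape)    # torch.Size([2, 49, 256])
```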
★ Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
Infants develop complex visual understanding rapidly, even preceding the
acquisition of linguistic inputs. As computer vision seeks to replicate the
human vision system, understanding infant visual development may offer valuable
insights. In this paper, we present an interdisciplinary study exploring this
question: can a computational model that imitates the infant learning process
develop broader visual concepts that extend beyond the vocabulary it has heard,
similar to how infants naturally learn? To investigate this, we analyze a
recently published model in Science by Vong et al., which is trained on
longitudinal, egocentric images of a single child paired with transcribed
parental speech. We introduce a training-free framework that can discover
visual concept neurons hidden in the model's internal representations. Our
findings show that these neurons can classify objects outside its original
vocabulary. Furthermore, we compare the visual representations in infant-like
models with those in modern computer vision models, such as CLIP or ImageNet
pre-trained models, highlighting key similarities and differences. Ultimately,
our work bridges cognitive science and computer vision by analyzing the
internal representations of a computational model trained on an infant's visual
and linguistic inputs.
comment: 12 pages, 11 figures
★ HipyrNet: Hypernet-Guided Feature Pyramid network for mixed-exposure correction
Recent advancements in image translation for enhancing mixed-exposure images
have demonstrated the transformative potential of deep learning algorithms.
However, addressing extreme exposure variations in images remains a significant
challenge due to the inherent complexity and contrast inconsistencies across
regions. Current methods often struggle to adapt effectively to these
variations, resulting in suboptimal performance. In this work, we propose
HipyrNet, a novel approach that integrates a HyperNetwork within a Laplacian
Pyramid-based framework to tackle the challenges of mixed-exposure image
enhancement. The inclusion of a HyperNetwork allows the model to adapt to these
exposure variations. A HyperNetwork dynamically generates weights for another
network, allowing dynamic changes during deployment. In our model, the
HyperNetwork is used to predict optimal kernels for Feature Pyramid
decomposition, which enables a tailored and adaptive decomposition process for
each input image. Our enhanced translational network incorporates multiscale
decomposition and reconstruction, leveraging dynamic kernel prediction to
capture and manipulate features across varying scales. Extensive experiments
demonstrate that HipyrNet outperforms existing methods, particularly in
scenarios with extreme exposure variations, achieving superior results in both
qualitative and quantitative evaluations. Our approach sets a new benchmark for
mixed-exposure image enhancement, paving the way for future research in
adaptive image translation.
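The central mechanism is a hypernetwork that predicts convolution kernels conditioned on the input image, which are then applied to that image. The sketch below shows a minimal version of this idea; the sizes and conditioning are illustrative, and the kernels HipyrNet actually predicts for Laplacian pyramid decomposition are more involved.

```python
# Minimal hypernetwork sketch: predict a per-image conv kernel, then apply it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelHyperNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.predict = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, 64), nn.ReLU(),
            nn.Linear(64, out_ch * in_ch * k * k),
        )

    def forward(self, x):
        # Predict a kernel from the image itself (batch size 1 for simplicity).
        weight = self.predict(x).view(self.out_ch, self.in_ch, self.k, self.k)
        return F.conv2d(x, weight, padding=self.k // 2)

out = KernelHyperNet()(torch.randn(1, 3, 64, 64))
print(out.shape)   # torch.Size([1, 3, 64, 64])
```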
★ Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration
Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen
Multimodal large language models (MLLMs) have attracted considerable
attention due to their exceptional performance in visual content understanding
and reasoning. However, their inference efficiency has been a notable concern,
as the increasing length of multimodal contexts leads to quadratic complexity.
Token compression techniques, which reduce the number of visual tokens, have
demonstrated their effectiveness in reducing computational costs. Yet, these
approaches have struggled to keep pace with the rapid advancements in MLLMs,
especially the AnyRes strategy in the context of high-resolution image
understanding. In this paper, we propose a novel token compression method,
GlobalCom$^2$, tailored for high-resolution MLLMs that receive both the
thumbnail and multiple crops. GlobalCom$^2$ treats the tokens derived from the
thumbnail as the "commander" of the entire token compression process,
directing the allocation of retention ratios and the specific compression for
each crop. In this way, redundant tokens are eliminated while important local
details are adaptively preserved to the highest extent feasible. Empirical
results across 10 benchmarks reveal that GlobalCom$^2$ achieves an optimal
balance between performance and efficiency, and consistently outperforms
state-of-the-art token compression methods with LLaVA-NeXT-7B/13B models. Our
code is released at https://github.com/xuyang-liu16/GlobalCom2.
comment: Our code is released at
https://github.com/xuyang-liu16/GlobalCom2
★ FaceMe: Robust Blind Face Restoration with Personal Identification AAAI 2025
Blind face restoration is a highly ill-posed problem due to the lack of
necessary context. Although existing methods produce high-quality outputs, they
often fail to faithfully preserve the individual's identity. In this paper, we
propose a personalized face restoration method, FaceMe, based on a diffusion
model. Given a single or a few reference images, we use an identity encoder to
extract identity-related features, which serve as prompts to guide the
diffusion model in restoring high-quality and identity-consistent facial
images. By simply combining identity-related features, we effectively minimize
the impact of identity-irrelevant features during training and support any
number of reference image inputs during inference. Additionally, thanks to the
robustness of the identity encoder, synthesized images can be used as reference
images during training, and identity changing during inference does not require
fine-tuning the model. We also propose a pipeline for constructing a reference
image training pool that simulates the poses and expressions that may appear in
real-world scenarios. Experimental results demonstrate that our FaceMe can
restore high-quality facial images while maintaining identity consistency,
achieving excellent performance and robustness.
comment: To appear at AAAI 2025
★ A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision
Depth estimation (DE) provides spatial information about a scene and enables
tasks such as 3D reconstruction, object detection, and scene understanding.
Recently, there has been an increasing interest in using deep learning
(DL)-based methods for DE. Traditional techniques rely on handcrafted features
that often struggle to generalise to diverse scenes and require extensive
manual tuning. However, DL models for DE can automatically extract relevant
features from input data, adapt to various scene conditions, and generalise
well to unseen environments. Numerous DL-based methods have been developed,
making it necessary to survey and synthesize the state-of-the-art (SOTA).
Previous reviews on DE have mainly focused on either monocular or stereo-based
techniques, rather than comprehensively reviewing DE. Furthermore, to the best
of our knowledge, there is no systematic literature review (SLR) that
comprehensively focuses on DE. Therefore, this SLR study is being conducted.
Initially, electronic databases were searched for relevant publications,
resulting in 1284 publications. Using defined exclusion and quality criteria,
128 publications were shortlisted and further filtered to select 59
high-quality primary studies. These studies were analysed to extract data and
answer defined research questions. Based on the results, DL methods were
developed for mainly three different types of DE: monocular, stereo, and
multi-view. 20 publicly available datasets were used to train, test, and
evaluate DL models for DE, with KITTI, NYU Depth V2, and Make3D being the most
used datasets. 29 evaluation metrics were used to assess the performance of DE.
35 base models were reported in the primary studies, and the top five most-used
base models were ResNet-50, ResNet-18, ResNet-101, U-Net, and VGG-16. Finally,
the lack of ground truth data was among the most significant challenges
reported by primary studies.
★ CorrDiff: Adaptive Delay-aware Detector with Temporal Cue Inputs for Real-time Object Detection
Real-time object detection plays an essential part in the decision-making
process of numerous real-world applications, including collision avoidance and
path planning in autonomous driving systems. This paper presents a novel
real-time streaming perception method named CorrDiff, designed to tackle the
challenge of delays in real-time detection systems. The main contribution of
CorrDiff lies in its adaptive delay-aware detector, which is able to utilize
runtime-estimated temporal cues to predict objects' locations for multiple
future frames, and selectively produce the predictions that match real-world
time, effectively compensating for communication and computational delays.
The proposed model outperforms current state-of-the-art methods by leveraging
motion estimation and feature enhancement, both for 1) single-frame detection
for the current frame or the next frame, in terms of the metric mAP, and 2) the
prediction for (multiple) future frame(s), in terms of the metric sAP (the sAP
metric evaluates object detection algorithms in streaming scenarios,
factoring in both latency and accuracy). It demonstrates robust performance
across a range of devices, from powerful Tesla V100 to modest RTX 2080Ti,
achieving the highest level of perceptual accuracy on all platforms. Unlike
most state-of-the-art methods that struggle to complete computation within a
single frame on less powerful devices, CorrDiff meets the stringent real-time
processing requirements on all kinds of devices. The experimental results
emphasize the system's adaptability and its potential to significantly improve
the safety and reliability for many real-world systems, such as autonomous
driving. Our code is completely open-sourced and is available at
https://anonymous.4open.science/r/CorrDiff.
comment: Submitted to IEEE JSAC Special Issue: Intelligent Communications for
Real-Time Computer Vision (Comm4CV)
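The core delay-compensation idea described above can be illustrated with a small selection rule: given predictions for several future frames, pick the one whose timestamp best matches when the output will actually be consumed. This is a simplified sketch, not CorrDiff's implementation; the latency model and names are illustrative.

```python
# Hedged sketch of delay-aware output selection: choose the future-frame
# prediction whose time offset best matches the estimated compute + communication
# delay. Names and the latency estimate are illustrative, not CorrDiff's.
def select_prediction(predictions, frame_interval, runtime_estimate):
    """predictions: list of (horizon_steps, boxes) produced at t0 for t0 + k*dt.
    runtime_estimate: estimated compute + communication delay in seconds."""
    best = min(predictions,
               key=lambda p: abs(p[0] * frame_interval - runtime_estimate))
    return best[1]

# Example: 30 FPS stream, ~70 ms estimated delay -> the 2-frames-ahead prediction wins.
preds = [(1, "boxes@t+1"), (2, "boxes@t+2"), (3, "boxes@t+3")]
print(select_prediction(preds, frame_interval=1 / 30, runtime_estimate=0.07))
```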
★ 3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
The growing demand for controllable outputs in text-to-image generation has
driven significant advancements in multi-instance generation (MIG), enabling
users to define both instance layouts and attributes. Currently, the
state-of-the-art methods in MIG are primarily adapter-based. However, these
methods necessitate retraining a new adapter each time a more advanced model is
released, resulting in significant resource consumption. A methodology named
Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which
decouples MIG into two distinct phases: 1) depth-based scene construction and
2) detail rendering with widely pre-trained depth control models. The 3DIS
method requires adapter training solely during the scene construction phase,
while enabling various models to perform training-free detail rendering.
Initially, 3DIS focused on rendering techniques utilizing U-Net architectures
such as SD1.5, SD2, and SDXL, without exploring the potential of recent
DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension
of the 3DIS framework that integrates the FLUX model for enhanced rendering
capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map
controlled image generation and introduce a detail renderer that manipulates
the Attention Mask in FLUX's Joint Attention mechanism based on layout
information. This approach allows for the precise rendering of fine-grained
attributes of each instance. Our experimental results indicate that 3DIS-FLUX,
leveraging the FLUX model, outperforms the original 3DIS method, which utilized
SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in
terms of both performance and image quality. Project Page:
https://limuloo.github.io/3DIS/.
comment: tech report
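To make the layout-to-attention idea concrete, the sketch below builds a boolean attention mask from bounding boxes so that each instance's text tokens only interact with image tokens inside its box. This illustrates the general mechanism only; the actual manipulation of FLUX's Joint Attention differs, and the token-span interface here is an assumption.

```python
# Hedged sketch: derive a boolean attention mask from layout boxes so that each
# instance's text tokens only attend to image tokens inside its box. Illustrative
# only; not the actual FLUX joint-attention code.
import torch

def layout_attention_mask(boxes, token_spans, grid_h, grid_w):
    """boxes: list of (x0, y0, x1, y1) in [0, 1]; token_spans: matching list of
    (start, end) indices into the text sequence. Returns (n_text, n_image) mask."""
    n_text = max(end for _, end in token_spans)
    mask = torch.zeros(n_text, grid_h * grid_w, dtype=torch.bool)
    for (x0, y0, x1, y1), (start, end) in zip(boxes, token_spans):
        ys = slice(int(y0 * grid_h), max(int(y1 * grid_h), int(y0 * grid_h) + 1))
        xs = slice(int(x0 * grid_w), max(int(x1 * grid_w), int(x0 * grid_w) + 1))
        region = torch.zeros(grid_h, grid_w, dtype=torch.bool)
        region[ys, xs] = True
        mask[start:end] = region.flatten()
    return mask  # True = attention allowed between that text token and image token

m = layout_attention_mask([(0.0, 0.0, 0.5, 1.0)], [(0, 4)], grid_h=16, grid_w=16)
print(m.shape, m.sum().item())  # torch.Size([4, 256]), 4 tokens x 128 allowed cells
```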
★ Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, Goran Glavaš
Most Large Vision-Language Models (LVLMs) to date are trained predominantly
on English data, which makes them struggle to understand non-English input and
fail to generate output in the desired target language. Existing efforts
mitigate these issues by adding multilingual training data, but do so in a
largely ad-hoc manner, lacking insight into how different training mixes tip
the scales for different groups of languages. In this work, we present a
comprehensive investigation into the training strategies for massively
multilingual LVLMs. First, we conduct a series of multi-stage experiments
spanning 13 downstream vision-language tasks and 43 languages, systematically
examining: (1) the number of training languages that can be included without
degrading English performance and (2) optimal language distributions of
pre-training as well as (3) instruction-tuning data. Further, we (4)
investigate how to improve multilingual text-in-image understanding, and
introduce a new benchmark for the task. Surprisingly, our analysis reveals that
one can (i) include as many as 100 training languages simultaneously (ii) with
as little as 25-50\% of non-English data, to greatly improve multilingual
performance while retaining strong English performance. We further find that
(iii) including non-English OCR data in pre-training and instruction-tuning is
paramount for improving multilingual text-in-image understanding. Finally, we
put all our findings together and train Centurio, a 100-language LVLM, offering
state-of-the-art performance in an evaluation covering 14 tasks and 56
languages.
★ Improving the U-Net Configuration for Automated Delineation of Head and Neck Cancer on MRI
Tumor volume segmentation on MRI is a challenging and time-consuming process
that is performed manually in typical clinical settings. This work presents an
approach to automated delineation of head and neck tumors on MRI scans,
developed in the context of the MICCAI Head and Neck Tumor Segmentation for
MR-Guided Applications (HNTS-MRG) 2024 Challenge. Rather than designing a new,
task-specific convolutional neural network, the focus of this research was to
propose improvements to the configuration commonly used in medical segmentation
tasks, relying solely on the traditional U-Net architecture. The empirical
results presented in this article suggest the superiority of patch-wise
normalization used for both training and sliding window inference. They also
indicate that the performance of segmentation models can be enhanced by
applying a scheduled data augmentation policy during training. Finally, it is
shown that a small improvement in quality can be achieved by using Gaussian
weighting to combine predictions for individual patches during sliding window
inference. The model with the best configuration obtained an aggregated Dice
Similarity Coefficient (DSCagg) of 0.749 in Task 1 and 0.710 in Task 2 on five
cross-validation folds. The ensemble of five models (one best model per
validation fold) showed consistent results on a private test set of 50 patients
with a DSCagg of 0.752 in Task 1 and 0.718 in Task 2 (team name:
andrei.iantsen). The source code and model weights are freely available at
www.github.com/iantsen/hntsmrg.
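The Gaussian-weighted combination of overlapping patch predictions mentioned above can be sketched in a few lines: each patch prediction is weighted by a center-peaked kernel before accumulation, so voxels near patch borders contribute less. The sketch below is 2D with arbitrary patch, stride, and sigma values, whereas the paper works on 3D MRI volumes.

```python
# Hedged sketch of Gaussian-weighted sliding-window inference in 2D. Patch size,
# stride, and sigma are illustrative; the paper applies the idea to 3D MRI.
import numpy as np

def gaussian_weight(patch_size, sigma_scale=0.125):
    coords = [np.arange(s) - (s - 1) / 2 for s in patch_size]
    yy, xx = np.meshgrid(*coords, indexing="ij")
    sigma = np.array(patch_size) * sigma_scale
    w = np.exp(-(yy**2 / (2 * sigma[0]**2) + xx**2 / (2 * sigma[1]**2)))
    return w / w.max()

def sliding_window_predict(image, predict_fn, patch=(64, 64), stride=(32, 32)):
    acc = np.zeros_like(image, dtype=float)
    norm = np.zeros_like(image, dtype=float)
    w = gaussian_weight(patch)
    for y in range(0, image.shape[0] - patch[0] + 1, stride[0]):
        for x in range(0, image.shape[1] - patch[1] + 1, stride[1]):
            pred = predict_fn(image[y:y + patch[0], x:x + patch[1]])
            acc[y:y + patch[0], x:x + patch[1]] += w * pred
            norm[y:y + patch[0], x:x + patch[1]] += w
    return acc / np.maximum(norm, 1e-8)   # center-weighted average of overlaps

out = sliding_window_predict(np.random.rand(128, 128), predict_fn=lambda p: p)
```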
★ Optimizing Multitask Industrial Processes with Predictive Action Guidance
Monitoring complex assembly processes is critical for maintaining
productivity and ensuring compliance with assembly standards. However,
variability in human actions and subjective task preferences complicate
accurate task anticipation and guidance. To address these challenges, we
introduce the Multi-Modal Transformer Fusion and Recurrent Units (MMTF-RU)
Network for egocentric activity anticipation, utilizing multimodal fusion to
improve prediction accuracy. Integrated with the Operator Action Monitoring
Unit (OAMU), the system provides proactive operator guidance, preventing
deviations in the assembly process. OAMU employs two strategies: (1) Top-5
MMTF-RU predictions, combined with a reference graph and an action dictionary,
for next-step recommendations; and (2) Top-1 MMTF-RU predictions, integrated
with a reference graph, for detecting sequence deviations and predicting
anomaly scores via an entropy-informed confidence mechanism. We also introduce
Time-Weighted Sequence Accuracy (TWSA) to evaluate operator efficiency and
ensure timely task completion. Our approach is validated on the industrial
Meccano dataset and the large-scale EPIC-Kitchens-55 dataset, demonstrating its
effectiveness in dynamic environments.
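The entropy-informed confidence mechanism can be illustrated with a simple score that flags a step only when the observed action falls outside the reference graph's allowed next actions and the predictor is confident (low entropy). The threshold-free form below is an assumption for illustration, not OAMU's exact formulation.

```python
# Hedged sketch of an entropy-informed anomaly score: high only when the action
# deviates from the reference graph and the predictor's distribution is peaked.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

def anomaly_score(top1_action, prob_dist, allowed_next):
    """allowed_next: set of actions the reference graph permits after this step."""
    h = entropy(prob_dist)
    h_max = np.log(len(prob_dist))
    confidence = 1.0 - h / h_max            # 1 = very confident, 0 = uniform
    deviation = 0.0 if top1_action in allowed_next else 1.0
    return deviation * confidence           # high only for confident, off-graph steps

probs = [0.85, 0.05, 0.05, 0.03, 0.02]      # predictor strongly favours action 0
print(anomaly_score(top1_action=0, prob_dist=probs, allowed_next={2, 3}))
```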
★ Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset NeurIPS 2023
Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
In this paper, we introduce Motion-X++, a large-scale multimodal 3D
expressive whole-body human motion dataset. Existing motion datasets
predominantly capture body-only poses, lacking facial expressions, hand
gestures, and fine-grained pose descriptions, and are typically limited to lab
settings with manually labeled text descriptions, thereby restricting their
scalability. To address this issue, we develop a scalable annotation pipeline
that can automatically capture 3D whole-body human motion and comprehensive
textual labels from RGB videos, and build the Motion-X dataset comprising 81.1K
text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving
the annotation pipeline, introducing more data modalities, and scaling up the
data quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations
covering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K
audios, 19.5M frame-level whole-body pose descriptions, and 120.5K
sequence-level semantic labels. Comprehensive experiments validate the accuracy
of our annotation pipeline and highlight Motion-X++'s significant benefits for
generating expressive, precise, and natural motion with paired multimodal
labels, supporting several downstream tasks including text-driven whole-body
motion generation, audio-driven motion generation, 3D whole-body human mesh
recovery, and 2D whole-body keypoint estimation.
comment: 17 pages, 14 figures, This work extends and enhances the research
published in the NeurIPS 2023 paper, "Motion-X: A Large-scale 3D Expressive
Whole-body Human Motion Dataset". arXiv admin note: substantial text overlap
with arXiv:2307.00818
★ A 1Mb mixed-precision quantized encoder for image classification and patch-based compression
Even if Application-Specific Integrated Circuits (ASIC) have proven to be a
relevant choice for integrating inference at the edge, they are often limited
in terms of applicability. In this paper, we demonstrate that an ASIC neural
network accelerator dedicated to image processing can be applied to multiple
tasks of different levels: image classification and compression, while
requiring very limited hardware. The key component is a reconfigurable,
mixed-precision (3b/2b/1b) encoder that takes advantage of proper weight and
activation quantizations combined with convolutional layer structural pruning
to lower hardware-related constraints (memory and computing). We introduce an
automatic adaptation of linear symmetric quantizer scaling factors to perform
quantized levels equalization, aiming at stabilizing quinary and ternary
weights training. In addition, a proposed layer-shared Bit-Shift Normalization
significantly simplifies the implementation of the hardware-expensive Batch
Normalization. For a specific configuration in which the encoder design only
requires 1Mb, the classification accuracy reaches 87.5% on CIFAR-10. Besides,
we also show that this quantized encoder can be used to compress images
patch-by-patch while the reconstruction can be performed remotely by a dedicated
full-frame decoder. This solution typically enables an end-to-end compression
almost without any block artifacts, outperforming patch-based state-of-the-art
techniques employing a patch-constant bitrate.
comment: Published at IEEE Transactions on Circuits and Systems for Video
Technology (TCSVT)
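The idea of adapting a linear symmetric quantizer's scale so that the discrete levels are used more evenly can be sketched as follows for ternary weights. The occupancy-balancing rule here (pick the scale whose level histogram has maximum entropy) is an assumption made for illustration; the paper's automatic adaptation scheme may differ.

```python
# Hedged sketch of linear symmetric ternary quantization with a scale chosen to
# equalize usage of the levels {-1, 0, +1}. The max-entropy selection rule is an
# illustrative assumption, not the paper's exact procedure.
import numpy as np

def ternary_quantize(w, scale):
    q = np.clip(np.round(w / scale), -1, 1)
    return q, q * scale

def equalized_scale(w, candidates=None):
    if candidates is None:
        candidates = np.linspace(0.2, 2.0, 19) * np.abs(w).mean()
    best, best_entropy = candidates[0], -1.0
    for s in candidates:
        q, _ = ternary_quantize(w, s)
        counts = np.array([(q == v).sum() for v in (-1, 0, 1)], dtype=float)
        p = counts / counts.sum()
        h = -(p[p > 0] * np.log(p[p > 0])).sum()
        if h > best_entropy:
            best, best_entropy = s, h
    return best

w = np.random.randn(1024) * 0.05
s = equalized_scale(w)
q, w_hat = ternary_quantize(w, s)
print(s, np.bincount((q + 1).astype(int)))  # roughly balanced level usage
```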
★ Advancing ALS Applications with Large-Scale Pre-training: Dataset Development and Downstream Assessment
The pre-training and fine-tuning paradigm has revolutionized satellite remote
sensing applications. However, this approach remains largely underexplored for
airborne laser scanning (ALS), an important technology for applications such as
forest management and urban planning. In this study, we address this gap by
constructing a large-scale ALS point cloud dataset and evaluating its impact on
downstream applications. Our dataset comprises ALS point clouds collected
across the contiguous United States, provided by the United States Geological
Survey's 3D Elevation Program. To ensure efficient data collection while
capturing diverse land cover and terrain types, we introduce a geospatial
sampling method that selects point cloud tiles based on land cover maps and
digital elevation models. As a baseline self-supervised learning model, we
adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point
clouds, and pre-train it on the constructed dataset. The pre-trained models are
subsequently fine-tuned for downstream tasks, including tree species
classification, terrain scene recognition, and point cloud semantic
segmentation. Our results show that the pre-trained models significantly
outperform their scratch counterparts across all downstream tasks,
demonstrating the transferability of the representations learned from the
proposed dataset. Furthermore, we observe that scaling the dataset using our
geospatial sampling method consistently enhances performance, whereas
pre-training on datasets constructed with random sampling fails to achieve
similar improvements. These findings highlight the utility of the constructed
dataset and the effectiveness of our sampling strategy in the pre-training and
fine-tuning paradigm. The source code and pre-trained models will be made
publicly available at \url{https://github.com/martianxiu/ALS_pretraining}.
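A minimal version of the geospatial sampling idea is to stratify candidate tiles by a joint (land-cover class, elevation bin) key and draw evenly from each stratum. The bin width and per-stratum quota below are illustrative choices, not the paper's parameters.

```python
# Hedged sketch of stratified geospatial tile sampling over land cover and
# elevation. Bin edges and quotas are illustrative assumptions.
import random
from collections import defaultdict

def stratified_tile_sample(tiles, per_stratum=2, elev_bin_m=200, seed=0):
    """tiles: list of dicts with 'id', 'land_cover' (str), 'elevation' (m)."""
    strata = defaultdict(list)
    for t in tiles:
        key = (t["land_cover"], int(t["elevation"] // elev_bin_m))
        strata[key].append(t)
    rng = random.Random(seed)
    selected = []
    for key, members in sorted(strata.items()):
        rng.shuffle(members)
        selected.extend(members[:per_stratum])   # even coverage across strata
    return selected

tiles = [{"id": i, "land_cover": lc, "elevation": e}
         for i, (lc, e) in enumerate([("forest", 120), ("forest", 150),
                                      ("urban", 30), ("urban", 45),
                                      ("forest", 820), ("cropland", 300)])]
print([t["id"] for t in stratified_tile_sample(tiles, per_stratum=1)])
```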
★ ResPanDiff: Diffusion Model with Disentangled Modulations for Image Fusion
The implementation of diffusion-based pansharpening task is predominantly
constrained by its slow inference speed, which results from numerous sampling
steps. Despite the existing techniques aiming to accelerate sampling, they
often compromise performance when fusing multi-source images. To ease this
limitation, we introduce a novel and efficient diffusion model named Diffusion
Model for Pansharpening by Inferring Residual Inference (ResPanDiff), which
significantly reduces the number of diffusion steps without sacrificing the
performance to tackle pansharpening task. In ResPanDiff, we innovatively
propose a Markov chain that transits from noisy residuals to the residuals
between the LRMS and HRMS images, thereby reducing the number of sampling steps
and enhancing performance. Additionally, we design the latent space to help the
model extract more features at the encoding stage, Shallow
Cond-Injection~(SC-I) to help the model fetch condition-injected hidden features
with higher dimensions, and loss functions to give better guidance for the
residual generation task, enabling the model to achieve superior performance in
residual generation. Furthermore, experimental evaluations on pansharpening
datasets demonstrate that the proposed method achieves superior outcomes
compared to recent state-of-the-art~(SOTA) techniques, requiring only 15
sampling steps, reducing the step count by over $90\%$ compared with benchmark
diffusion models. Our experiments also include thorough discussions and
ablation studies to underscore the effectiveness of our approach.
★ End-to-End Deep Learning for Interior Tomography with Low-Dose X-ray CT
Objective: There exist several X-ray computed tomography (CT) scanning
strategies to reduce a radiation dose, such as (1) sparse-view CT, (2) low-dose
CT, and (3) region-of-interest (ROI) CT (called interior tomography). To
further reduce the dose, the sparse-view and/or low-dose CT settings can be
applied together with interior tomography. Interior tomography has various
advantages in terms of reducing the number of detectors and decreasing the
X-ray radiation dose. However, a large patient or small field-of-view (FOV)
detector can cause truncated projections, and then the reconstructed images
suffer from severe cupping artifacts. In addition, although the low-dose CT can
reduce the radiation exposure dose, analytic reconstruction algorithms produce
image noise. Recently, many researchers have utilized image-domain deep
learning (DL) approaches to remove each artifact and demonstrated impressive
performances, and the theory of deep convolutional framelets supports the
reason for the performance improvement. Approach: In this paper, based on deep
convolutional framelets, we found that image-domain convolutional neural
networks (CNNs) have difficulty resolving such coupled artifacts. Significance: To
address the coupled problem, we decouple it into two sub-problems: (i)
image-domain noise reduction inside the truncated projection to solve the
low-dose CT problem and (ii) extrapolation of the projection outside the
truncated projection to solve the
ROI CT problem. The decoupled sub-problems are solved directly with a novel
proposed end-to-end learning using dual-domain CNNs. Main results: We
demonstrate that the proposed method outperforms the conventional image-domain
deep learning methods, and a projection-domain CNN shows better performance
than the image-domain CNNs which are commonly used by many researchers.
comment: Published by Physics in Medicine & Biology (2022.5)
★ TipSegNet: Fingertip Segmentation in Contactless Fingerprint Imaging
Contactless fingerprint recognition systems offer a hygienic, user-friendly,
and efficient alternative to traditional contact-based methods. However, their
accuracy heavily relies on precise fingertip detection and segmentation,
particularly under challenging background conditions. This paper introduces
TipSegNet, a novel deep learning model that achieves state-of-the-art
performance in segmenting fingertips directly from grayscale hand images.
TipSegNet leverages a ResNeXt-101 backbone for robust feature extraction,
combined with a Feature Pyramid Network (FPN) for multi-scale representation,
enabling accurate segmentation across varying finger poses and image qualities.
Furthermore, we employ an extensive data augmentation strategy to enhance the
model's generalizability and robustness. TipSegNet outperforms existing
methods, achieving a mean Intersection over Union (mIoU) of 0.987 and an
accuracy of 0.999, representing a significant advancement in contactless
fingerprint segmentation. This enhanced accuracy has the potential to
substantially improve the reliability and effectiveness of contactless
biometric systems in real-world applications.
★ A Flexible and Scalable Framework for Video Moment Search
Video moment search, the process of finding relevant moments in a video
corpus to match a user's query, is crucial for various applications. Existing
solutions, however, often assume a single perfect matching moment, struggle
with inefficient inference, and have limitations with hour-long videos. This
paper introduces a flexible and scalable framework for retrieving a ranked list
of moments from a collection of videos of any length to match a text query, a
task termed Ranked Video Moment Retrieval (RVMR). Our framework, called
Segment-Proposal-Ranking (SPR), simplifies the search process into three
independent stages: segment retrieval, proposal generation, and moment
refinement with re-ranking. Specifically, videos are divided into equal-length
segments with precomputed embeddings indexed offline, allowing efficient
retrieval regardless of video length. For scalable online retrieval, both
segments and queries are projected into a shared feature space to enable
approximate nearest neighbor (ANN) search. Retrieved segments are then merged
into coarse-grained moment proposals. Then a refinement and re-ranking module
is designed to reorder and adjust timestamps of the coarse-grained proposals.
Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves
state-of-the-art performance with significant reductions in computational cost
and processing time. The flexible design also allows for independent
improvements to each stage, making SPR highly adaptable for large-scale
applications.
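The first two SPR stages can be sketched compactly: top-k retrieval over precomputed, L2-normalized segment embeddings, followed by merging temporally adjacent hits from the same video into coarse proposals. The embedding dimensions, metadata layout, and merge rule below are illustrative assumptions.

```python
# Hedged sketch of SPR's segment retrieval and proposal generation stages.
# Dimensions, metadata format, and the merge rule are illustrative.
import numpy as np

def retrieve_segments(query_emb, seg_embs, seg_meta, k=5):
    """seg_meta: list of (video_id, start_s, end_s) aligned with seg_embs rows."""
    sims = seg_embs @ query_emb            # cosine similarity (inputs pre-normalized)
    top = np.argsort(-sims)[:k]
    return [(seg_meta[i], float(sims[i])) for i in top]

def merge_into_proposals(hits):
    """Merge retrieved segments that touch in time within the same video."""
    hits = sorted(hits, key=lambda h: (h[0][0], h[0][1]))
    proposals = []
    for (vid, s, e), score in hits:
        if proposals and proposals[-1][0] == vid and s <= proposals[-1][2]:
            proposals[-1][2] = max(proposals[-1][2], e)
            proposals[-1][3] = max(proposals[-1][3], score)
        else:
            proposals.append([vid, s, e, score])
    return proposals

rng = np.random.default_rng(0)
segs = rng.normal(size=(100, 64)); segs /= np.linalg.norm(segs, axis=1, keepdims=True)
meta = [("v%d" % (i // 20), (i % 20) * 4.0, (i % 20) * 4.0 + 4.0) for i in range(100)]
q = segs[3] + 0.1 * rng.normal(size=64); q /= np.linalg.norm(q)
print(merge_into_proposals(retrieve_segments(q, segs, meta)))
```

In a production setting the dot-product search would typically be delegated to an ANN index rather than a full matrix product, as the abstract indicates.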
★ Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
This paper proposes the first video-grounded entailment tree reasoning method
for commonsense video question answering (VQA). Despite the remarkable progress
of large visual-language models (VLMs), there are growing concerns that they
learn spurious correlations between videos and likely answers, reinforced by
their black-box nature and remaining benchmarking biases. Our method explicitly
grounds VQA tasks to video fragments in four steps: entailment tree
construction, video-language entailment verification, tree reasoning, and
dynamic tree expansion. A vital benefit of the method is its generalizability
to current video and image-based VLMs across reasoning types. To support fair
evaluation, we devise a de-biasing procedure based on large-language models
that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic
experiments on existing and de-biased benchmarks highlight the impact of our
method components across benchmarks, VLMs, and reasoning types.
★ LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
In this paper, we introduce LLaVA-Octopus, a novel video multimodal large
language model. LLaVA-Octopus adaptively weights features from different visual
projectors based on user instructions, enabling us to leverage the
complementary strengths of each projector. We observe that different visual
projectors exhibit distinct characteristics when handling specific tasks. For
instance, some projectors excel at capturing static details, while others are
more effective at processing temporal information, and some are better suited
for tasks requiring temporal coherence. By adjusting feature weights according
to user instructions, LLaVA-Octopus dynamically selects and
combines the most suitable features, significantly enhancing the model's
performance in multimodal tasks. Experimental results demonstrate that
LLaVA-Octopus achieves excellent performance across multiple benchmarks,
especially in tasks such as multimodal understanding, visual question
answering, and video understanding, highlighting its broad application
potential.
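The instruction-driven fusion described above can be sketched as a small gate that maps the instruction embedding to softmax weights over the outputs of K projectors. The gate design and dimensions are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch of instruction-conditioned projector fusion: softmax gating over
# K projector outputs. Dimensions and gate design are illustrative assumptions.
import torch
import torch.nn as nn

class ProjectorGate(nn.Module):
    def __init__(self, instr_dim=768, num_projectors=3):
        super().__init__()
        self.gate = nn.Linear(instr_dim, num_projectors)

    def forward(self, instr_emb, projector_feats):
        # instr_emb: (B, instr_dim); projector_feats: (B, K, T, D) visual tokens
        weights = torch.softmax(self.gate(instr_emb), dim=-1)             # (B, K)
        return (weights[:, :, None, None] * projector_feats).sum(dim=1)   # (B, T, D)

gate = ProjectorGate()
fused = gate(torch.randn(2, 768), torch.randn(2, 3, 16, 1024))
print(fused.shape)  # torch.Size([2, 16, 1024])
```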
★ Improving Skeleton-based Action Recognition with Interactive Object Information
Human skeleton information is important in skeleton-based action recognition,
which provides a simple and efficient way to describe human pose. However,
existing skeleton-based methods focus more on the skeleton, ignoring the
objects interacting with humans, resulting in poor performance in recognizing
actions that involve object interactions. We propose a new action recognition
framework introducing object nodes to supplement absent interactive object
information. We also propose Spatial Temporal Variable Graph Convolutional
Networks (ST-VGCN) to effectively model the Variable Graph (VG) containing
object nodes. Specifically, in order to validate the role of interactive object
information, by leveraging a simple self-training approach, we establish a new
dataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, including more
than 2 million additional object nodes. At the same time, we design the
Variable Graph construction method to accommodate a variable number of nodes
for graph structure. Additionally, we are the first to explore the overfitting
issue introduced by incorporating additional object information, and we propose
a VG-based data augmentation method to address this issue, called Random Node
Attack. Finally, regarding the network structure, we introduce two fusion
modules, CAF and WNPool, along with a novel Node Balance Loss, to enhance the
comprehensive performance by effectively fusing and balancing skeleton and
object node information. Our method surpasses the previous state-of-the-art on
multiple skeleton-based action recognition benchmarks. The accuracy of our
method on NTU RGB+D 60 cross-subject split is 96.7\%, and on cross-view split,
it is 99.2\%.
★ LongViTU: Instruction Tuning for Long-Form Video Understanding
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos),
automatically generated dataset for long-form video understanding. We developed
a systematic approach that organizes videos into a hierarchical tree structure
and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each
QA pair in LongViTU features: 1) long-term context (average certificate length
of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense,
causality, planning, etc.); and 3) explicit timestamp labels for relevant
events. LongViTU also serves as a benchmark for instruction following in
long-form and streaming video understanding. We evaluate the open-source
state-of-the-art long video understanding model, LongVU, and the commercial
model, Gemini-1.5-Pro, on our benchmark. They achieve GPT-4 scores of 49.9 and
52.3, respectively, underscoring the substantial challenge posed by our
benchmark. Further supervised fine-tuning (SFT) on LongVU led to performance
improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID)
benchmark EgoSchema, 1.0%, 2.2% and 1.2% on the out-of-distribution (OOD)
benchmarks VideoMME (Long), WorldQA and OpenEQA, respectively. These outcomes
demonstrate LongViTU's high data quality and robust OOD generalizability.
★ Towards Fingerprint Mosaicking Artifact Detection: A Self-Supervised Deep Learning Approach
Fingerprint mosaicking, which is the process of combining multiple
fingerprint images into a single master fingerprint, is an essential process in
modern biometric systems. However, it is prone to errors that can significantly
degrade fingerprint image quality. This paper proposes a novel deep
learning-based approach to detect and score mosaicking artifacts in fingerprint
images. Our method leverages a self-supervised learning framework to train a
model on large-scale unlabeled fingerprint data, eliminating the need for
manual artifact annotation. The proposed model effectively identifies
mosaicking errors, achieving high accuracy on various fingerprint modalities,
including contactless, rolled, and pressed fingerprints, and furthermore proves
to be robust to different data sources. Additionally, we introduce a novel
mosaicking artifact score to quantify the severity of errors, enabling
automated evaluation of fingerprint images. By addressing the challenges of
mosaicking artifact detection, our work contributes to improving the accuracy
and reliability of fingerprint-based biometric systems.
★ ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing
The enhancement of generalization in robots by large vision-language models
(LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of
LVLMs based on egocentric videos are of great interest. However, current
datasets for embodied video question answering lack comprehensive and
systematic evaluation frameworks. Critical embodied cognitive issues, such as
robotic self-cognition, dynamic scene perception, and hallucination, are rarely
addressed. To tackle these challenges, we propose ECBench, a high-quality
benchmark designed to systematically evaluate the embodied cognitive abilities
of LVLMs. ECBench features a diverse range of scene video sources, open and
varied question formats, and 30 dimensions of embodied cognition. To ensure
quality, balance, and high visual dependence, ECBench uses class-independent
meticulous human annotation and multi-round question screening strategies.
Additionally, we introduce ECEval, a comprehensive evaluation system that
ensures the fairness and rationality of the indicators. Utilizing ECBench, we
conduct extensive evaluations of proprietary, open-source, and task-specific
LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of
LVLMs, laying a solid foundation for developing reliable core models for
embodied agents. All data and code are available at
https://github.com/Rh-Dang/ECBench.
★ Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
Motion-controllable image animation is a fundamental task with a wide range
of potential applications. Recent works have made progress in controlling
camera or object motion via various motion representations, while they still
struggle to support collaborative camera and object motion control with
adaptive control granularity. To this end, we introduce 3D-aware motion
representation and propose an image animation framework, called
Perception-as-Control, to achieve fine-grained collaborative motion control.
Specifically, we construct 3D-aware motion representation from a reference
image, manipulate it based on interpreted user intentions, and perceive it from
different viewpoints. In this way, camera and object motions are transformed
into intuitive, consistent visual changes. Then, the proposed framework
leverages the perception results as motion control signals, enabling it to
support various motion-related video synthesis tasks in a unified and flexible
way. Experiments demonstrate the superiority of the proposed framework. For
more details and qualitative results, please refer to our project webpage:
https://chen-yingjie.github.io/projects/Perception-as-Control.
★ Continuous Knowledge-Preserving Decomposition for Few-Shot Continual Learning SC
Few-shot class-incremental learning (FSCIL) involves learning new classes
from limited data while retaining prior knowledge, and often results in
catastrophic forgetting. Existing methods either freeze backbone networks to
preserve knowledge, which limits adaptability, or rely on additional modules or
prompts, introducing inference overhead. To this end, we propose Continuous
Knowledge-Preserving Decomposition for FSCIL (CKPD-FSCIL), a framework that
decomposes a model's weights into two parts: one that compacts existing
knowledge (knowledge-sensitive components) and another that carries redundant
capacity to accommodate new abilities (redundant-capacity components). The
decomposition is guided by a covariance matrix from replay samples, ensuring
principal components align with classification abilities. During adaptation, we
freeze the knowledge-sensitive components and only adapt the redundant-capacity
components, fostering plasticity while minimizing interference without changing
the architecture or increasing overhead. Additionally, CKPD introduces an
adaptive layer selection strategy to identify layers with redundant capacity,
dynamically allocating adapters. Experiments on multiple benchmarks show that
CKPD-FSCIL outperforms state-of-the-art methods.
comment: Code: https://github.com/xiaojieli0903/CKPD-FSCIL
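The covariance-guided split can be illustrated with a toy weight decomposition: eigendirections of the replay feature covariance with large eigenvalues are treated as knowledge-sensitive and frozen, while the remaining low-variance subspace is left trainable as redundant capacity. This is a sketch of the idea only; the keep ratio, layer choice, and exact projection differ from CKPD-FSCIL's actual procedure.

```python
# Hedged sketch of a covariance-guided split of a linear weight into a frozen,
# knowledge-sensitive part and a trainable, redundant-capacity part.
import torch

def split_weight(weight, replay_feats, keep_ratio=0.5):
    """weight: (out, in) linear weight; replay_feats: (N, in) replay features."""
    feats = replay_feats - replay_feats.mean(dim=0, keepdim=True)
    cov = feats.T @ feats / max(len(feats) - 1, 1)            # (in, in)
    eigvals, eigvecs = torch.linalg.eigh(cov)                  # ascending eigenvalues
    k = int(weight.shape[1] * keep_ratio)
    principal = eigvecs[:, -k:]                                # knowledge-sensitive dirs
    redundant = eigvecs[:, :-k]                                # low-variance directions
    w_knowledge = weight @ principal @ principal.T             # frozen component
    w_redundant = weight @ redundant @ redundant.T             # adapted component
    return w_knowledge, w_redundant

w = torch.randn(128, 64)
wk, wr = split_weight(w, torch.randn(500, 64))
print(torch.allclose(wk + wr, w, atol=1e-4))  # the two parts reconstruct the weight
```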
★ A Scalable System for Visual Analysis of Ocean Data
Toshit Jain, Upkar Singh, Varun Singh, Vijay Kumar Boda, Ingrid Hotz, Sathish S. Vadhiyar, P. N. Vinayachandran, Vijay Natarajan
Oceanographers rely on visual analysis to interpret model simulations,
identify events and phenomena, and track dynamic ocean processes. The
ever-increasing resolution and complexity of ocean data, due to its dynamic
nature and multivariate relationships, demand a scalable and adaptable
visualization tool for interactive exploration. We introduce pyParaOcean, a scalable and
interactive visualization system designed specifically for ocean data analysis.
pyParaOcean offers specialized modules for common oceanographic analysis tasks,
including eddy identification and salinity movement tracking. These modules
seamlessly integrate with ParaView as filters, ensuring a user-friendly and
easy-to-use system while leveraging the parallelization capabilities of
ParaView and a plethora of inbuilt general-purpose visualization
functionalities. The creation of an auxiliary dataset stored as a Cinema
database helps address I/O and network bandwidth bottlenecks while supporting
the generation of quick overview visualizations. We present a case study on the
Bay of Bengal (BoB) to demonstrate the utility of the system and scaling
studies to evaluate the efficiency of the system.
★ A CT Image Classification Network Framework for Lung Tumors Based on Pre-trained MobileNetV2 Model and Transfer Learning, and Its Application and Market Analysis in the Medical Field
In the medical field, accurate diagnosis of lung cancer is crucial for
treatment. Traditional manual analysis methods have significant limitations in
terms of accuracy and efficiency. To address this issue, this paper proposes a
deep learning network framework based on the pre-trained MobileNetV2 model,
initialized with weights from the ImageNet-1K dataset (version 2). The last
layer of the model (the fully connected layer) is replaced with a new fully
connected layer, and a softmax activation function is added to efficiently
classify three types of lung cancer CT scan images. Experimental results show
that the model achieves an accuracy of 99.6% on the test set, with significant
improvements in feature extraction compared to traditional models. With the
rapid development of artificial intelligence technologies, deep learning
applications in medical image processing are bringing revolutionary changes to
the healthcare industry. AI-based lung cancer detection systems can
significantly improve diagnostic efficiency, reduce the workload of doctors,
and occupy an important position in the global healthcare market. The potential
of AI to improve diagnostic accuracy, reduce medical costs, and promote
precision medicine will have a profound impact on the future development of the
healthcare industry.
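The head replacement described in the abstract maps directly onto a few lines of torchvision: load the ImageNet-1K (V2) pretrained MobileNetV2, swap the final fully connected layer for a 3-class head, and apply softmax at inference. Training hyperparameters are not given in the abstract and are omitted; the snippet is a sketch, not the paper's code.

```python
# Hedged sketch of the described transfer-learning setup with torchvision.
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V2)
model.classifier[1] = nn.Linear(model.last_channel, 3)   # 3 lung-tumor CT classes

# During training, CrossEntropyLoss would consume the raw logits directly.
criterion = nn.CrossEntropyLoss()

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(torch.randn(1, 3, 224, 224)), dim=1)
print(probs.shape)  # torch.Size([1, 3])
```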
★ IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation AAAI 2025
3D Referring Expression Segmentation (3D-RES) aims to segment point cloud
scenes based on a given expression. However, existing 3D-RES approaches face
two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity
arises from information loss or distortion during point cloud acquisition due
to limitations such as lighting and viewpoint. Intent ambiguity refers to the
model's equal treatment of all queries during the decoding process, lacking
top-down task-specific guidance. In this paper, we introduce an Image enhanced
Prompt Decoding Network (IPDN), which leverages multi-view images and
task-driven information to enhance the model's reasoning capabilities. To
address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE)
module, which injects multi-view 2D image information into the 3D scene and
compensates for potential spatial information loss. To tackle intent ambiguity,
we designed a Prompt-Aware Decoder (PAD) that guides the decoding process by
deriving task-driven signals from the interaction between the expression and
visual features. Comprehensive experiments demonstrate that IPDN outperforms
the state-of-the-art by 1.9 and 4.2 points in mIoU on the 3D-RES and
3D-GRES tasks, respectively.
comment: AAAI 2025
★ V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer AAAI2025
Concept Bottleneck Models (CBMs) offer inherent interpretability by initially
translating images into human-comprehensible concepts, followed by a linear
combination of these concepts for classification. However, the annotation of
concepts for visual recognition tasks requires extensive expert knowledge and
labor, constraining the broad adoption of CBMs. Recent approaches have
leveraged the knowledge of large language models to construct concept
bottlenecks, with multimodal models like CLIP subsequently mapping image
features into the concept feature space for classification. Despite this, the
concepts produced by language models can be verbose and may introduce
non-visual attributes, which hurts accuracy and interpretability. In this
study, we investigate how to avoid these issues by constructing CBMs directly
from multimodal models. To this end, we adopt common words as the base concept
vocabulary and leverage auxiliary unlabeled images to construct a
Vision-to-Concept (V2C) tokenizer that can explicitly quantize images into
their most relevant visual concepts, thus creating a vision-oriented concept
bottleneck tightly coupled with the multimodal model. This leads to our V2C-CBM
which is training efficient and interpretable with high accuracy. Our V2C-CBM
has matched or outperformed LLM-supervised CBMs on various visual
classification benchmarks, validating the efficacy of our approach.
comment: Accepted by AAAI2025
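A minimal version of a vision-to-concept bottleneck quantizes an image embedding into its top-m most similar concept-word embeddings and classifies from the resulting sparse concept activations. The encoders are stand-ins for a CLIP-style multimodal model; the dimensions and top-m value are illustrative, not V2C-CBM's settings.

```python
# Hedged sketch of a vision-to-concept bottleneck: keep only the top-m
# image-concept similarities and classify from the sparse concept activations.
import torch
import torch.nn as nn

class V2CBottleneck(nn.Module):
    def __init__(self, concept_embs, num_classes, top_m=8):
        super().__init__()
        # concept_embs: (C, D) L2-normalized embeddings of common concept words.
        self.register_buffer("concepts", concept_embs)
        self.top_m = top_m
        self.head = nn.Linear(concept_embs.shape[0], num_classes)

    def forward(self, image_embs):                       # (B, D), L2-normalized
        sims = image_embs @ self.concepts.T              # (B, C) image-concept scores
        acts = torch.zeros_like(sims)
        top = sims.topk(self.top_m, dim=-1)
        acts.scatter_(-1, top.indices, top.values)       # keep only top-m concepts
        return self.head(acts), acts                     # logits + interpretable acts

concepts = nn.functional.normalize(torch.randn(200, 512), dim=-1)
model = V2CBottleneck(concepts, num_classes=10)
logits, acts = model(nn.functional.normalize(torch.randn(4, 512), dim=-1))
print(logits.shape, (acts != 0).sum(dim=-1))  # each image activates 8 concepts
```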
★ AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding Predictive Architecture for Autonomous Driving with LiDAR Data
As opposed to human drivers, current autonomous driving systems still require
vast amounts of labeled data to train. Recently, world models have been
proposed to simultaneously enhance autonomous driving capabilities by improving
the way these systems understand complex real-world environments and reduce
their data demands via self-supervised pre-training. In this paper, we present
AD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding
Predictive Architecture), a novel self-supervised pre-training framework for
autonomous driving with LiDAR data that, as opposed to existing methods, is
neither generative nor contrastive. Our method learns spatial world models with
a joint embedding predictive architecture. Instead of explicitly generating
masked unknown regions, our self-supervised world models predict Bird's Eye
View (BEV) embeddings to represent the diverse nature of autonomous driving
scenes. Our approach furthermore eliminates the need to manually create
positive and negative pairs, as is the case in contrastive learning. AD-L-JEPA
leads to simpler implementation and enhanced learned representations. We
qualitatively and quantitatively demonstrate the high quality of the embeddings
learned with AD-L-JEPA. We furthermore evaluate the accuracy and label efficiency of
AD-L-JEPA on popular downstream tasks such as LiDAR 3D object detection and
associated transfer learning. Our experimental evaluation demonstrates that
AD-L-JEPA is a plausible approach for self-supervised pre-training in
autonomous driving applications and is the best available approach,
outperforming SOTA methods including the recently proposed Occupancy-MAE [1] and
ALSO [2]. The source code of AD-L-JEPA is available at
https://github.com/HaoranZhuExplorer/AD-L-JEPA-Release.
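The non-generative, non-contrastive objective can be illustrated with a toy JEPA-style loss on BEV feature maps: a context encoder sees a masked BEV grid, a predictor regresses the embeddings of the masked cells produced by a gradient-free target encoder, and no raw-input reconstruction or negative pairs are involved. Every module below is a simplified placeholder, not AD-L-JEPA's actual architecture.

```python
# Hedged sketch of a JEPA-style objective on BEV embeddings: predict target
# embeddings of masked cells from the context encoder's output.
import copy
import torch
import torch.nn as nn

class TinyBEVEncoder(nn.Module):
    def __init__(self, c_in=32, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(c_in, dim, 3, padding=1), nn.GELU(),
                                 nn.Conv2d(dim, dim, 3, padding=1))
    def forward(self, bev):                      # (B, C, H, W) -> (B, D, H, W)
        return self.net(bev)

context_enc = TinyBEVEncoder()
target_enc = copy.deepcopy(context_enc)          # would be updated by EMA, no grads
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Conv2d(128, 128, 1)

def jepa_loss(bev, mask):
    """mask: (B, 1, H, W) bool, True where BEV cells are hidden from the context."""
    ctx = context_enc(bev * (~mask))             # context sees only visible cells
    with torch.no_grad():
        tgt = target_enc(bev)                    # target embeds the full scene
    pred = predictor(ctx)
    return nn.functional.smooth_l1_loss(pred[mask.expand_as(pred)],
                                        tgt[mask.expand_as(tgt)])

bev = torch.randn(2, 32, 64, 64)
mask = torch.rand(2, 1, 64, 64) < 0.5
print(jepa_loss(bev, mask).item())
```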
★ Emergence of Painting Ability via Recognition-Driven Evolution
From Paleolithic cave paintings to Impressionism, human painting has evolved
to depict increasingly complex and detailed scenes, conveying more nuanced
messages. This paper attempts to make this artistic capability emerge by
simulating the evolutionary pressures that enhance visual communication efficiency.
Specifically, we present a model with a stroke branch and a palette branch that
together simulate human-like painting. The palette branch learns a limited
colour palette, while the stroke branch parameterises each stroke using
B\'ezier curves to render an image, subsequently evaluated by a high-level
recognition module. We quantify the efficiency of visual communication by
measuring the recognition accuracy achieved with machine vision. The model then
optimises the control points and colour choices for each stroke to maximise
recognition accuracy with minimal strokes and colours. Experimental results
show that our model achieves superior performance in high-level recognition
tasks, delivering artistic expression and aesthetic appeal, especially in
abstract sketches. Additionally, our approach shows promise as an efficient
bit-level image compression technique, outperforming traditional methods.
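The stroke parameterisation can be made concrete with a cubic Bézier evaluation: each stroke is defined by four control points and sampled into a polyline for rendering. The recognition-driven optimization of control points and palette colours is omitted here; this only shows the curve evaluation.

```python
# Hedged sketch: sample points along a cubic Bezier stroke from its four control
# points. The recognition-driven optimization loop is not reproduced.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=64):
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)   # (n, 2) points on the curve

pts = cubic_bezier(np.array([0.0, 0.0]), np.array([0.2, 0.8]),
                   np.array([0.8, 0.8]), np.array([1.0, 0.0]))
print(pts[0], pts[-1])   # the curve starts at p0 and ends at p3
```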
★ Addressing Domain Shift via Imbalance-Aware Domain Adaptation in Embryo Development Assessment
Deep learning models in medical imaging face dual challenges: domain shift,
where models perform poorly when deployed in settings different from their
training environment, and class imbalance, where certain disease conditions are
naturally underrepresented. We present Imbalance-Aware Domain Adaptation
(IADA), a novel framework that simultaneously tackles both challenges through
three key components: (1) adaptive feature learning with class-specific
attention mechanisms, (2) balanced domain alignment with dynamic weighting, and
(3) adaptive threshold optimization. Our theoretical analysis establishes
convergence guarantees and complexity bounds. Through extensive experiments on
embryo development assessment across four imaging modalities, IADA demonstrates
significant improvements over existing methods, achieving up to 25.19\% higher
accuracy while maintaining balanced performance across classes. In challenging
scenarios with low-quality imaging systems, IADA shows robust generalization
with AUC improvements of up to 12.56\%. These results demonstrate IADA's
potential for developing reliable and equitable medical imaging systems for
diverse clinical settings. The code is made publicly available at
\url{https://github.com/yinghemedical/imbalance-aware_domain_adaptation}
comment: 15 pages
★ MORDA: A Synthetic Dataset to Facilitate Adaptation of Object Detectors to Unseen Real-target Domain While Preserving Performance on Real-source Domain ICRA2025
Deep neural network (DNN) based perception models are indispensable in the
development of autonomous vehicles (AVs). However, their reliance on
large-scale, high-quality data is broadly recognized as a burdensome necessity
due to the substantial cost of data acquisition and labeling. Further, the
issue is not a one-time concern, as AVs might need a new dataset if they are to
be deployed to another region (real-target domain) that the in-hand dataset
within the real-source domain cannot incorporate. To mitigate this burden, we
propose leveraging synthetic environments as an auxiliary domain where the
characteristics of real domains are reproduced. This approach could enable
indirect experience about the real-target domain in a time- and cost-effective
manner. As a practical demonstration of our methodology, nuScenes and South
Korea are employed to represent real-source and real-target domains,
respectively. That means we construct digital twins for several regions of
South Korea, and the data-acquisition framework of nuScenes is reproduced.
Blending the aforementioned components within a simulator allows us to obtain a
synthetic-fusion domain in which we forge our novel driving dataset, MORDA:
Mixture Of Real-domain characteristics for synthetic-data-assisted Domain
Adaptation. To verify the value of synthetic features that MORDA provides in
learning about driving environments of South Korea, 2D/3D detectors are trained
solely on a combination of nuScenes and MORDA. Afterward, their performance is
evaluated on the unforeseen real-world dataset (AI-Hub) collected in South
Korea. Our experiments show that MORDA can significantly improve mean
Average Precision (mAP) on the AI-Hub dataset while mAP on nuScenes is retained
or slightly enhanced.
comment: 7 pages, 6 figures, 4 tables, This work has been submitted to the
IEEE for possible publication (the paper is submitted to the conference
ICRA2025 and is under review)
★ Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments
In assistive robotics serving people with disabilities (PWD), accurate place
recognition in built environments is crucial to ensure that robots navigate and
interact safely within diverse indoor spaces. Language interfaces, particularly
those powered by Large Language Models (LLM) and Vision Language Models (VLM),
hold significant promise in this context, as they can interpret visual scenes
and correlate them with semantic information. However, such interfaces are also
known for their hallucinated predictions. In addition, language instructions
provided by humans can also be ambiguous and lack precise details about
specific locations, objects, or actions, exacerbating the hallucination issue.
In this work, we introduce Seeing with Partial Certainty (SwPC) - a framework
designed to measure and align uncertainty in VLM-based place recognition,
enabling the model to recognize when it lacks confidence and seek assistance
when necessary. This framework is built on the theory of conformal prediction
to provide statistical guarantees on place recognition while minimizing
requests for human help in complex indoor environment settings. Through
experiments on the widely used richly-annotated scene dataset Matterport3D, we
show that SwPC significantly increases the success rate and decreases the
amount of human intervention required relative to the prior art. SwPC can be
utilized with any VLMs directly without requiring model fine-tuning, offering a
promising, lightweight approach to uncertainty modeling that complements and
scales alongside the expanding capabilities of foundational models.
comment: 10 pages, 4 Figures
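The conformal-prediction backbone of SwPC can be illustrated with standard split conformal prediction: calibrate a threshold on nonconformity scores (here, 1 minus the softmax probability of the true place label), then include in the prediction set every place whose score falls below it, and ask for human help whenever the set is not a singleton. The calibration data and alpha below are toy values; SwPC's exact score function may differ.

```python
# Hedged sketch of split conformal prediction for place recognition.
import numpy as np

def calibrate(cal_probs, cal_labels, alpha=0.1):
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]      # nonconformity scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n             # finite-sample correction
    return np.quantile(scores, min(q_level, 1.0), method="higher")

def prediction_set(test_probs, qhat):
    return np.where(1.0 - test_probs <= qhat)[0]             # coverage >= 1 - alpha

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5) * 2, size=200)
cal_labels = cal_probs.argmax(axis=1)                        # toy "ground truth"
qhat = calibrate(cal_probs, cal_labels)
places = prediction_set(rng.dirichlet(np.ones(5) * 2), qhat)
print(places, "-> ask a human" if len(places) != 1 else "-> proceed")
```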
★ MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification
Transformer has been extensively explored for hyperspectral image (HSI)
classification. However, transformer poses challenges in terms of speed and
memory usage because of its quadratic computational complexity. Recently, the
Mamba model has emerged as a promising approach, which has strong long-distance
modeling capabilities while maintaining a linear computational complexity.
However, representing the HSI is challenging for the Mamba due to the
requirement for an integrated spatial and spectral understanding. To remedy
these drawbacks, we propose a novel HSI classification model based on a Mamba
model, named MambaHSI, which can simultaneously model long-range interaction of
the whole image and integrate spatial and spectral information in an adaptive
manner. Specifically, we design a spatial Mamba block (SpaMB) to model the
long-range interaction of the whole image at the pixel-level. Then, we propose
a spectral Mamba block (SpeMB) to split the spectral vector into multiple
groups, mine the relations across different spectral groups, and extract
spectral features. Finally, we propose a spatial-spectral fusion module (SSFM)
to adaptively integrate the spatial and spectral features of an HSI. To the best
of our knowledge, this is the first image-level HSI classification model based on the
Mamba. We conduct extensive experiments on four diverse HSI datasets. The
results demonstrate the effectiveness and superiority of the proposed model for
HSI classification. This reveals the great potential of Mamba to be the
next-generation backbone for HSI models. Code is available at
https://github.com/li-yapeng/MambaHSI.
comment: accepted by IEEE TGRS
★ Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation
Referring video object segmentation aims to segment objects within a video
corresponding to a given text description. Existing transformer-based temporal
modeling approaches face challenges related to query inconsistency and the
limited consideration of context. Query inconsistency produces unstable masks
of different objects in the middle of the video. The limited consideration of
context leads to the segmentation of incorrect objects by failing to adequately
account for the relationship between the given text and instances. To address
these issues, we propose the Multi-context Temporal Consistency Module (MTCM),
which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner
removes noise from queries and aligns them to achieve query consistency. The
MCE predicts text-relevant queries by considering multi-context. We applied
MTCM to four different models, improving performance across all of them and
achieving 47.6 J&F on MeViS. Code is available at
https://github.com/Choi58/MTCM.
★ Plug-and-Play DISep: Separating Dense Instances for Scene-to-Pixel Weakly-Supervised Change Detection in High-Resolution Remote Sensing Images SP
Existing Weakly-Supervised Change Detection (WSCD) methods often encounter
the problem of "instance lumping" under scene-level supervision, particularly
in scenarios with a dense distribution of changed instances (i.e., changed
objects). In these scenarios, unchanged pixels between changed instances are
also mistakenly identified as changed, causing multiple changes to be
mistakenly viewed as one. In practical applications, this issue prevents the
accurate quantification of the number of changes. To address this issue, we
propose a Dense Instance Separation (DISep) method as a plug-and-play solution,
refining pixel features from a unified instance perspective under scene-level
supervision. Specifically, our DISep comprises a three-step iterative training
process: 1) Instance Localization: We locate instance candidate regions for
changed pixels using high-pass class activation maps. 2) Instance Retrieval: We
identify and group these changed pixels into different instance IDs through
connectivity searching. Then, based on the assigned instance IDs, we extract
corresponding pixel-level features on a per-instance basis. 3) Instance
Separation: We introduce a separation loss to enforce intra-instance pixel
consistency in the embedding space, thereby ensuring separable instance feature
representations. The proposed DISep adds only minimal training cost and no
inference cost. It can be seamlessly integrated to enhance existing WSCD
methods. We achieve state-of-the-art performance by enhancing three
Transformer-based and four ConvNet-based methods on the LEVIR-CD, WHU-CD,
DSIFN-CD, SYSU-CD, and CDD datasets. Additionally, our DISep can be used to
improve fully-supervised change detection methods. Code is available at
https://github.com/zhenghuizhao/Plug-and-Play-DISep-for-Change-Detection.
comment: Accepted by ISPRS Journal of Photogrammetry and Remote Sensing
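Two of the DISep steps can be sketched compactly: group changed pixels into instance IDs with connected-component labelling on a thresholded activation map, then apply an intra-instance consistency loss that pulls each pixel embedding toward its instance mean. The threshold, connectivity, and exact loss form below are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of instance retrieval (connected components on a thresholded CAM)
# and an intra-instance consistency loss over pixel embeddings.
import numpy as np
import torch
from scipy import ndimage

def retrieve_instances(cam, thresh=0.5):
    labels, num = ndimage.label(cam > thresh)      # connectivity-based grouping
    return labels, num                              # (H, W) instance IDs, count

def separation_loss(pixel_feats, instance_map):
    """pixel_feats: (D, H, W) embeddings; instance_map: (H, W) ints, 0 = background."""
    loss, count = 0.0, 0
    for inst_id in range(1, int(instance_map.max()) + 1):
        mask = torch.as_tensor(instance_map == inst_id)
        feats = pixel_feats[:, mask]                # (D, N_pixels_in_instance)
        center = feats.mean(dim=1, keepdim=True)
        loss = loss + ((feats - center) ** 2).mean()
        count += 1
    return loss / max(count, 1)

cam = np.zeros((32, 32)); cam[2:8, 2:8] = 0.9; cam[20:26, 20:30] = 0.8
ids, n = retrieve_instances(cam)
print(n, separation_loss(torch.randn(16, 32, 32), ids).item())
```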
★ Image2CADSeq: Computer-Aided Design Sequence and Knowledge Inference from Product Images
Computer-aided design (CAD) tools empower designers to design and modify 3D
models through a series of CAD operations, commonly referred to as a CAD
sequence. In scenarios where digital CAD files are not accessible, reverse
engineering (RE) has been used to reconstruct 3D CAD models. Recent advances
have seen the rise of data-driven approaches for RE, with a primary focus on
converting 3D data, such as point clouds, into 3D models in boundary
representation (B-rep) format. However, obtaining 3D data poses significant
challenges, and B-rep models do not reveal knowledge about the 3D modeling
process of designs. To this end, our research introduces a novel data-driven
approach with an Image2CADSeq neural network model. This model aims to reverse
engineer CAD models by processing images as input and generating CAD sequences.
These sequences can then be translated into B-rep models using a solid modeling
kernel. Unlike B-rep models, CAD sequences offer enhanced flexibility to modify
individual steps of model creation, providing a deeper understanding of the
construction process of CAD models. To quantitatively and rigorously evaluate
the predictive performance of the Image2CADSeq model, we have developed a
multi-level evaluation framework for model assessment. The model was trained on
a specially synthesized dataset, and various network architectures were
explored to optimize the performance. The experimental and validation results
show great potential for the model in generating CAD sequences from 2D image
data.
comment: 20 pages, 10 figures, and 6 tables
★ From Mesh Completion to AI Designed Crown
Designing a dental crown is a time-consuming and labor-intensive process. Our
goal is to simplify crown design and minimize the tediousness of making manual
adjustments while still ensuring the highest level of accuracy and consistency.
To this end, we present a new end- to-end deep learning approach, coined Dental
Mesh Completion (DMC), to generate a crown mesh conditioned on a point cloud
context. The dental context includes the tooth prepared to receive a crown and
its surroundings, namely the two adjacent teeth and the three closest teeth in
the opposing jaw. We formulate crown generation in terms of completing this
point cloud context. A feature extractor first converts the input point cloud
into a set of feature vectors that represent local regions in the point cloud.
The set of feature vectors is then fed into a transformer to predict a new set
of feature vectors for the missing region (crown). Subsequently, a point
reconstruction head, followed by a multi-layer perceptron, is used to predict a
dense set of points with normals. Finally, a differentiable point-to-mesh layer
serves to reconstruct the crown surface mesh. We compare our DMC method to a
graph-based convolutional neural network which learns to deform a crown mesh
from a generic crown shape to the target geometry. Extensive experiments on our
dataset demonstrate the effectiveness of our method, which attains an average
Chamfer Distance of 0.062. The code is available at:
https://github.com/Golriz-code/DMC.gi
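For reference, a common symmetric, average form of the Chamfer Distance used to compare generated and target point sets is sketched below; the exact averaging or squaring convention behind the reported 0.062 may differ.

```python
# Hedged sketch of a symmetric, average Chamfer Distance between two point sets.
import numpy as np

def chamfer_distance(a, b):
    """a: (N, 3), b: (M, 3) point clouds."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.random.rand(1024, 3)
target = pred + 0.01 * np.random.randn(1024, 3)
print(chamfer_distance(pred, target))
```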
★ A Machine Learning Model for Crowd Density Classification in Hajj Video Frames
Managing the massive annual gatherings of Hajj and Umrah presents significant
challenges, particularly as the Saudi government aims to increase the number of
pilgrims. Currently, around two million pilgrims attend Hajj and 26 million
attend Umrah, making crowd control, especially in critical areas like the Grand
Mosque during Tawaf, a major concern. Additional risks arise in managing dense
crowds at key sites such as Arafat, where the potential for stampedes, fires,
and pandemics poses serious threats to public safety. This research proposes a
machine learning model to classify crowd density into three levels: moderate
crowd, overcrowded and very dense crowd in video frames recorded during Hajj,
with a flashing red light to alert organizers in real-time when a very dense
crowd is detected. While current research efforts in processing Hajj
surveillance videos focus solely on using CNN to detect abnormal behaviors,
this research focuses more on high-risk crowds that can lead to disasters.
Hazardous crowd conditions require a robust method, as incorrect classification
could trigger unnecessary alerts and government intervention, while failure to
classify could result in disaster. The proposed model integrates Local Binary
Pattern (LBP) texture analysis, which enhances feature extraction for
differentiating crowd density levels, along with edge density and area-based
features. The model was tested on the KAU-Smart Crowd 'HAJJv2' dataset which
contains 18 videos from various key locations during Hajj including 'Massaa',
'Jamarat', 'Arafat' and 'Tawaf'. The model achieved an accuracy rate of 87%
with a 2.14% error percentage (misclassification rate), demonstrating its
ability to detect and classify various crowd conditions effectively. This
contributes to enhanced crowd management and safety during large-scale events
like Hajj.
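The hand-crafted features described (an LBP texture histogram plus an edge-density scalar per frame) can be sketched with scikit-image; the LBP parameters (P=8, R=1) and the downstream classifier are typical choices and not necessarily those used in the paper.

```python
# Hedged sketch of per-frame crowd-density features: uniform LBP histogram plus
# edge density, which could then feed any classifier.
import numpy as np
from skimage.feature import local_binary_pattern, canny

def crowd_features(gray_frame, P=8, R=1):
    lbp = local_binary_pattern(gray_frame, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    edge_density = canny(gray_frame).mean()          # fraction of edge pixels
    return np.concatenate([hist, [edge_density]])    # (P + 3,) feature vector

frame = (np.random.rand(240, 320) * 255).astype(np.uint8)   # stand-in grayscale frame
print(crowd_features(frame).shape)                           # (11,)
```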
♻ ★ Gradient-based facial encoding for key generation to encrypt and decrypt multimedia data
Security systems relying on passwords are vulnerable to being forgotten,
guessed, or breached. Likewise, biometric systems that operate independently
are at risk of template spoofing and replay incidents. This paper introduces a
biocryptosystem utilizing face recognition techniques to address these issues,
allowing for the encryption and decryption of various file types through the
Advanced Encryption Standard (AES). The proposed system creates a distinct
32-bit encryption key derived from facial features identified by Histogram of
Oriented Gradients (HOG) and categorized using Support Vector Machines (SVM).
HOG efficiently identifies edge-aligned facial features, even in dim lighting,
ensuring that reliable biometric keys can be generated. This key is then used
with AES to encrypt and decrypt a variety of data formats, such as text, audio,
and video files. This encryption key, derived from an individual's distinctive
facial traits, is exceedingly challenging for adversaries to reproduce or
guess. The security and performance of the system have been validated through
experiments using several metrics, including correlation analysis, Shannon
entropy, normalized Hamming distance, and the avalanche effect on 25 different
file types. Potential uses for the proposed system include secure file sharing,
online transactions, and data archiving, making it a strong and trustworthy
approach to safeguarding sensitive information by integrating the uniqueness of
facial biometrics with the established security of AES encryption.
comment: 12 pages, 2 figures, This work has been submitted to the IEEE for
possible publication
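A minimal sketch of the overall flow described above: derive an AES key from HOG facial features and use it for authenticated encryption of arbitrary data. Hashing coarsely quantized HOG features with SHA-256 into a 256-bit key is an assumption for illustration, not the paper's exact key-derivation or SVM-based scheme.

```python
# Hypothetical sketch: HOG facial features -> hashed biometric key -> AES encryption.
# The quantize-and-hash key derivation is an assumption, not the paper's method.
import hashlib
import os
import numpy as np
from skimage.feature import hog
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def biometric_key(face_gray: np.ndarray) -> bytes:
    """Quantize HOG features and hash them into a 256-bit AES key."""
    feats = hog(face_gray, orientations=9, pixels_per_cell=(8, 8),
                cells_per_block=(2, 2))
    quantized = np.round(feats, 2).tobytes()    # coarse quantization for repeatability
    return hashlib.sha256(quantized).digest()   # 32 bytes = AES-256 key

def encrypt_blob(key: bytes, data: bytes) -> bytes:
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, data, None)

def decrypt_blob(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)
```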
♻ ★ AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning WACV
Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, Rao Muhammad Anwer
Significant progress has been made in advancing large multimodal
conversational models (LMMs), capitalizing on vast repositories of image-text
data available online. Despite this progress, these models often encounter
substantial domain gaps, hindering their ability to engage in complex
conversations across new domains. Recent efforts have aimed to mitigate this
issue, albeit relying on domain-specific image-text data to curate
instruction-tuning data. However, many domains, such as agriculture, lack such
vision-language data. In this work, we propose an approach to construct
instruction-tuning data that harnesses vision-only data for the agriculture
domain. We utilize diverse agricultural datasets spanning multiple domains,
curate class-specific information, and employ large language models (LLMs) to
construct an expert-tuning set, resulting in a 70k expert-tuning dataset called
AgroInstruct. Subsequently, we use this dataset for expert tuning to create
AgroGPT, an efficient LMM that can hold complex agriculture-related
conversations and provide useful insights. We also develop AgroEvals for
evaluation and compare AgroGPT's performance with large open- and closed-source
models. AgroGPT excels at
identifying fine-grained agricultural concepts, can act as an agriculture
expert, and provides helpful information for multimodal agriculture questions.
The code, datasets, and models are available at
https://github.com/awaisrauf/agroGPT.
comment: Accepted at WACV, 2025
♻ ★ Snapshot: Towards Application-centered Models for Pedestrian Trajectory Prediction in Urban Traffic Environments
This paper explores pedestrian trajectory prediction in urban traffic while
focusing on both model accuracy and real-world applicability. While promising
approaches exist, they often rely on pedestrian datasets that exclude
traffic-related information, or use architectures that are either not real-time
capable or not robust. To address these limitations, we first introduce a
dedicated benchmark based on Argoverse 2, specifically targeting pedestrians in
traffic environments. Following this, we present Snapshot, a modular,
feed-forward neural network that outperforms the current state of the art,
reducing the Average Displacement Error (ADE) by 8.8% while utilizing
significantly less information. Despite its agent-centric encoding scheme,
Snapshot demonstrates scalability, real-time performance, and robustness to
varying motion histories. Moreover, by integrating Snapshot into a modular
autonomous driving software stack, we showcase its real-world applicability.
comment: 8 Pages, 9 Figures
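A toy illustration of the two quantities at play here: a feed-forward, agent-centric trajectory predictor and the Average Displacement Error (ADE). The two-layer MLP and the input features are placeholders, not Snapshot's actual architecture.

```python
# Toy feed-forward pedestrian-trajectory predictor plus the ADE metric.
# The MLP and agent-centric xy inputs are placeholders, not Snapshot's design.
import torch
import torch.nn as nn

class TrajectoryMLP(nn.Module):
    def __init__(self, hist_len=8, pred_len=12, hidden=128):
        super().__init__()
        self.pred_len = pred_len
        self.net = nn.Sequential(
            nn.Linear(hist_len * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pred_len * 2),
        )

    def forward(self, history):               # history: (B, hist_len, 2), agent-centric xy
        out = self.net(history.flatten(1))
        return out.view(-1, self.pred_len, 2)

def ade(pred, gt):                             # Average Displacement Error
    return (pred - gt).norm(dim=-1).mean()

model = TrajectoryMLP()
hist = torch.randn(4, 8, 2)                    # 4 agents, 8 past steps
future = torch.randn(4, 12, 2)
print("ADE:", ade(model(hist), future).item())
```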
♻ ★ GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
In recent years, 2D Vision-Language Models (VLMs) have made significant
strides in image-text understanding tasks. However, their performance in 3D
spatial comprehension, which is critical for embodied intelligence, remains
limited. Recent advances have leveraged 3D point clouds and multi-view images
as inputs, yielding promising results. However, we propose exploring a purely
vision-based solution inspired by human perception, which merely relies on
visual cues for 3D spatial understanding. This paper empirically investigates
the limitations of VLMs in 3D spatial knowledge, revealing that their primary
shortcoming lies in the lack of global-local correspondence between the scene
and individual frames. To address this, we introduce GPT4Scene, a novel visual
prompting paradigm in VLM training and inference that helps build the
global-local relationship, significantly improving the 3D spatial understanding
of indoor scenes. Specifically, GPT4Scene constructs a 3D Bird's Eye View (BEV)
image from the video and marks consistent object IDs across both frames and the
BEV image. The model then inputs the concatenated BEV image and video frames
with markers. In zero-shot evaluations, GPT4Scene improves performance over
closed-source VLMs like GPT-4o. Additionally, we prepare a processed video
dataset consisting of 165K text annotations to fine-tune open-source VLMs,
achieving state-of-the-art performance on all 3D understanding tasks.
Surprisingly, after training with the GPT4Scene paradigm, VLMs consistently
improve during inference even without visual prompting or the BEV image as
explicit correspondence. This demonstrates that the proposed paradigm helps
VLMs develop an intrinsic ability to understand 3D scenes, paving the way for a
noninvasive approach to extending pre-trained VLMs to 3D scene understanding.
comment: Project page: https://gpt4scene.github.io/
♻ ★ OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang
Recent advancements in omnimodal learning have been achieved in understanding
and generation across images, text, and speech, though mainly within
proprietary models. Limited omnimodal datasets and the inherent challenges
associated with real-time emotional speech generation have hindered open-source
progress. To address these issues, we propose OpenOmni, a two-stage training
method combining omnimodal alignment and speech generation to develop a
state-of-the-art omnimodal large language model. In the alignment phase, a
pre-trained speech model is further trained on text-image tasks to generalize
from vision to speech in a (near) zero-shot manner, outperforming models
trained on tri-modal datasets. In the speech generation phase, a lightweight
decoder facilitates real-time emotional speech through training on speech tasks
and preference learning. Experiments demonstrate that OpenOmni consistently
improves across omnimodal, vision-language, and speech-language evaluations,
enabling natural, emotion-rich dialogues and real-time emotional speech
generation.
♻ ★ Voxel-Aggregated Feature Synthesis: Efficient Dense Mapping for Simulated 3D Reasoning CVPR 2025
We address the exploding computational requirements of recent
state-of-the-art (SOTA) open-set multimodal 3D mapping (dense 3D mapping)
algorithms and present Voxel-Aggregated Feature Synthesis (VAFS), a novel
approach to dense 3D mapping in simulation. Dense 3D mapping involves
segmenting and embedding sequential RGBD frames which are then fused into 3D.
This leads to redundant computation as the differences between frames are small
but all are individually segmented and embedded. This makes dense 3D mapping
impractical for research involving embodied agents in which the environment,
and thus the mapping, must be modified with regularity. VAFS drastically
reduces this computation by using the segmented point cloud computed by a
simulator's physics engine and synthesizing views of each region. This reduces
the number of features to embed from the number of captured RGBD frames to the
number of objects in the scene, effectively allowing a "ground truth" semantic
map to be computed an order of magnitude faster than traditional methods. We
test the resulting representation by assessing the IoU scores of semantic
queries for different objects in the simulated scene, and find that VAFS
exceeds the accuracy and speed of prior dense 3D mapping techniques.
comment: 6 pages, 2 figures, CVPR 2025
♻ ★ Less is More: The Influence of Pruning on the Explainability of CNNs
Modern, state-of-the-art Convolutional Neural Networks (CNNs) in computer
vision have millions of parameters. Thus, explaining the complex decisions of
such networks to humans is challenging. A technical approach to reduce CNN
complexity is network pruning, where less important parameters are deleted. The
work presented in this paper investigates whether this technical complexity
reduction also helps with perceived explainability. To do so, we conducted a
pre-study and two human-grounded experiments, assessing the effects of
different pruning ratios on CNN explainability. Overall, we evaluated four
different compression rates (i.e., CPR 2, 4, 8, and 32) with 37,500 tasks on
Mechanical Turk. Results indicate that lower compression rates have a positive
influence on explainability, while higher compression rates show negative
effects. Furthermore, we were able to identify sweet spots that increase both
the perceived explainability and the model's performance.
♻ ★ Geometry Restoration and Dewarping of Camera-Captured Document Images
This research focuses on developing a method for restoring the topology of
digital images of paper documents captured by a camera, using algorithms for
detection, segmentation, geometry restoration, and dewarping. Our methodology
employs deep learning (DL) for document outline detection, followed by computer
vision (CV) to create a topological 2D grid using cubic polynomial
interpolation and correct nonlinear distortions by remapping the image. Using
classical CV methods makes the document topology restoration process more
efficient and faster, as it requires significantly fewer computational
resources and memory. We developed a new pipeline for automatic document
dewarping and reconstruction, along with a framework and annotated dataset to
demonstrate its efficiency. Our experiments confirm the promise of our
methodology and its superiority over existing benchmarks (including mobile apps
and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both
visually and in terms of document readability via Optical Character Recognition
(OCR) and geometry restoration metrics. This paves the way for creating
high-quality digital copies of paper documents and enhancing the efficiency of
OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI
comment: 28 pages, 16 figures
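A toy illustration of the classical-CV remapping step described above: fit cubic polynomials to the detected top and bottom document edges, build a 2D grid between them, and remap the image to a rectangle. The detection step, the grid construction, and the output size are simplifying assumptions, not the paper's full pipeline.

```python
# Toy remap-based dewarping from two detected boundary curves. The document
# outline is assumed to have been detected already (e.g., by a DL detector).
import cv2
import numpy as np

def dewarp(image, top_pts, bottom_pts, out_w=800, out_h=1000):
    """top_pts/bottom_pts: (N, 2) arrays of (x, y) points along the curved edges."""
    top_poly = np.polyfit(top_pts[:, 0], top_pts[:, 1], deg=3)
    bot_poly = np.polyfit(bottom_pts[:, 0], bottom_pts[:, 1], deg=3)

    xs = np.linspace(top_pts[:, 0].min(), top_pts[:, 0].max(), out_w)
    y_top = np.polyval(top_poly, xs)
    y_bot = np.polyval(bot_poly, xs)

    # For each output row, sample source y-coordinates linearly between the curves.
    t = np.linspace(0.0, 1.0, out_h)[:, None]              # (out_h, 1)
    map_y = (1 - t) * y_top[None, :] + t * y_bot[None, :]  # (out_h, out_w)
    map_x = np.broadcast_to(xs[None, :], map_y.shape)

    return cv2.remap(image, map_x.astype(np.float32), map_y.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR)
```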
♻ ★ Identity-Preserving Video Dubbing Using Motion Warping
Video dubbing aims to synthesize realistic, lip-synced videos from a
reference video and a driving audio signal. Although existing methods can
accurately generate mouth shapes driven by audio, they often fail to preserve
identity-specific features, largely because they do not effectively capture the
nuanced interplay between audio cues and the visual attributes of the reference
identity. As a result, the generated outputs frequently lack fidelity in
reproducing the unique textural and structural details of the reference
identity. To address these limitations, we propose IPTalker, a novel and robust
framework for video dubbing that achieves seamless alignment between driving
audio and reference identity while ensuring both lip-sync accuracy and
high-fidelity identity preservation. At the core of IPTalker is a
transformer-based alignment mechanism designed to dynamically capture and model
the correspondence between audio features and reference images, thereby
enabling precise, identity-aware audio-visual integration. Building on this
alignment, a motion warping strategy further refines the results by spatially
deforming reference images to match the target audio-driven configuration. A
dedicated refinement process then mitigates occlusion artifacts and enhances
the preservation of fine-grained textures, such as mouth details and skin
features. Extensive qualitative and quantitative evaluations demonstrate that
IPTalker consistently outperforms existing approaches in terms of realism, lip
synchronization, and identity retention, establishing a new state of the art
for high-quality, identity-consistent video dubbing.
comment: v2, Under Review
♻ ★ BTMTrack: Robust RGB-T Tracking via Dual-template Bridging and Temporal-Modal Candidate Elimination
RGB-T tracking leverages the complementary strengths of RGB and thermal
infrared (TIR) modalities to address challenging scenarios such as low
illumination and adverse weather. However, existing methods often fail to
effectively integrate temporal information and perform efficient cross-modal
interactions, which constrain their adaptability to dynamic targets. In this
paper, we propose BTMTrack, a novel framework for RGB-T tracking. The core of
our approach lies in the dual-template backbone network and the Temporal-Modal
Candidate Elimination (TMCE) strategy. The dual-template backbone effectively
integrates temporal information, while the TMCE strategy focuses the model on
target-relevant tokens by evaluating temporal and modal correlations, reducing
computational overhead and avoiding irrelevant background noise. Building upon
this foundation, we propose the Temporal Dual Template Bridging (TDTB) module,
which facilitates precise cross-modal fusion through dynamically filtered
tokens. This approach further strengthens the interaction between templates and
the search region. Extensive experiments conducted on three benchmark datasets
demonstrate the effectiveness of BTMTrack. Our method achieves state-of-the-art
performance, with a 72.3% precision rate on the LasHeR test set and competitive
results on RGBT210 and RGBT234 datasets.
♻ ★ Visual Semantic Navigation with Real Robots
Carlos Gutiérrez-Álvarez, Pablo Ríos-Navarro, Rafael Flor-Rodríguez, Francisco Javier Acevedo-Rodríguez, Roberto J. López-Sastre
Visual Semantic Navigation (VSN) is the ability of a robot to learn visual
semantic information for navigating in unseen environments. These VSN models
are typically tested in those virtual environments where they are trained,
mainly using reinforcement learning based approaches. Therefore, we do not yet
have an in-depth analysis of how these models would behave in the real world.
In this work, we propose a new solution to integrate VSN models into real
robots, so that we have true embodied agents. We also release a novel ROS-based
framework for VSN, ROS4VSN, so that any VSN-model can be easily deployed in any
ROS-compatible robot and tested in a real setting. Our experiments with two
different robots, where we have embedded two state-of-the-art VSN agents,
confirm that there is a noticeable performance difference of these VSN
solutions when tested in real-world and simulation environments. We hope that
this research provides a foundation for addressing this consequential issue,
with the ultimate aim of advancing the performance and efficiency of embodied
agents in authentic real-world scenarios. Code to
reproduce all our experiments can be found at
https://github.com/gramuah/ros4vsn.
♻ ★ Rendering-Oriented 3D Point Cloud Attribute Compression using Sparse Tensor-based Transformer
The evolution of 3D visualization techniques has fundamentally transformed
how we interact with digital content. At the forefront of this change is point
cloud technology, offering an immersive experience that surpasses traditional
2D representations. However, the massive data size of point clouds presents
significant challenges in data compression. Current methods for lossy point
cloud attribute compression (PCAC) generally focus on reconstructing the
original point clouds with minimal error. However, for point cloud
visualization scenarios, the reconstructed point clouds with distortion still
need to undergo a complex rendering process, which affects the final
user-perceived quality. In this paper, we propose an end-to-end deep learning
framework that seamlessly integrates PCAC with differentiable rendering,
denoted as rendering-oriented PCAC (RO-PCAC), directly targeting the quality of
rendered multiview images for viewing. In a differentiable manner, the impact
of the rendering process on the reconstructed point clouds is taken into
account. Moreover, we characterize point clouds as sparse tensors and propose a
sparse tensor-based transformer, called SP-Trans. By aligning with the local
density of the point cloud and utilizing an enhanced local attention mechanism,
SP-Trans captures the intricate relationships within the point cloud, further
improving feature analysis and synthesis within the framework. Extensive
experiments demonstrate that the proposed RO-PCAC achieves state-of-the-art
compression performance, compared to existing reconstruction-oriented methods,
including traditional, learning-based, and hybrid methods.
♻ ★ Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance AAAI2025
Accurate prediction of 3D semantic occupancy from 2D visual images is vital
in enabling autonomous agents to comprehend their surroundings for planning and
navigation. State-of-the-art methods typically employ fully supervised
approaches, necessitating a huge labeled dataset acquired through expensive
LiDAR sensors and meticulous voxel-wise labeling by human annotators. The
resource-intensive nature of this annotating process significantly hampers the
application and scalability of these methods. We introduce a novel
semi-supervised framework to alleviate the dependency on densely annotated
data. Our approach leverages 2D foundation models to generate essential 3D
scene geometric and semantic cues, facilitating a more efficient training
process. Our framework exhibits notable properties: (1) Generalizability,
applicable to various 3D semantic scene completion approaches, including 2D-3D
lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated
through experiments on SemanticKITTI and NYUv2, wherein our method achieves up
to 85% of the fully-supervised performance using only 10% labeled data. This
approach not only reduces the cost and labor associated with data annotation
but also demonstrates the potential for broader adoption in camera-based
systems for 3D semantic occupancy prediction.
comment: Accepted at AAAI2025. Project Page:
https://vinairesearch.github.io/SemiSSC
♻ ★ CoE: Deep Coupled Embedding for Non-Rigid Point Cloud Correspondences
The interest in matching non-rigidly deformed shapes represented as raw point
clouds is rising due to the proliferation of low-cost 3D sensors. Yet, the task
is challenging since point clouds are irregular and there is a lack of
intrinsic shape information. We propose to tackle these challenges by learning
a new shape representation -- a per-point high dimensional embedding, in an
embedding space where semantically similar points share similar embeddings. The
learned embedding has multiple beneficial properties: it is aware of the
underlying shape geometry and is robust to shape deformations and various shape
artefacts, such as noise and partiality. Consequently, this embedding can be
directly employed to retrieve high-quality dense correspondences through a
simple nearest neighbor search in the embedding space. Extensive experiments
demonstrate new state-of-the-art results and robustness in numerous challenging
non-rigid shape matching benchmarks and show its great potential in other shape
analysis tasks, such as segmentation.
comment: 16 pages, 17 figures
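A minimal sketch of the retrieval step described above: once per-point embeddings are available from the trained network, dense correspondences reduce to a nearest-neighbor search in the embedding space. The embedding arrays here are assumed inputs.

```python
# Dense correspondences via nearest-neighbor search in a learned per-point
# embedding space. Embeddings are assumed to come from the trained network.
import numpy as np
from scipy.spatial import cKDTree

def dense_correspondences(emb_src, emb_tgt):
    """emb_src: (N, D), emb_tgt: (M, D) per-point embeddings.
    Returns, for each source point, the index of its matching target point."""
    tree = cKDTree(emb_tgt)
    _, idx = tree.query(emb_src, k=1)
    return idx
```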
♻ ★ DGNN-YOLO: Interpretable Dynamic Graph Neural Networks with YOLO11 for Detecting and Tracking Small Occluded Objects in Urban Traffic
The detection and tracking of small, occluded objects such as pedestrians,
cyclists, and motorbikes pose significant challenges for traffic surveillance
systems because of their erratic movement, frequent occlusion, and poor
visibility in dynamic urban environments. Traditional methods like YOLO11,
while proficient in spatial feature extraction for precise detection, often
struggle with these small and dynamically moving objects, particularly in
handling real-time data updates and resource efficiency. This paper introduces
DGNN-YOLO, a novel framework that integrates dynamic graph neural networks
(DGNNs) with YOLO11 to address these limitations. Unlike standard GNNs, DGNNs
are chosen for their superior ability to dynamically update graph structures in
real-time, which enables adaptive and robust tracking of objects in highly
variable urban traffic scenarios. This framework constructs and regularly
updates its graph representations, capturing objects as nodes and their
interactions as edges, thus effectively responding to rapidly changing
conditions. Additionally, DGNN-YOLO incorporates Grad-CAM, Grad-CAM++, and
Eigen-CAM visualization techniques to enhance interpretability and foster
trust, offering insights into the model's decision-making process. Extensive
experiments validate the framework's performance, achieving a precision of
0.8382, recall of 0.6875, and mAP@0.5:0.95 of 0.6476, significantly
outperforming existing methods. This study offers a scalable and interpretable
solution for real-time traffic surveillance and significantly advances
intelligent transportation systems' capabilities by addressing the critical
challenge of detecting and tracking small, occluded objects.
♻ ★ CMTNet: Convolutional Meets Transformer Network for Hyperspectral Images Classification
Hyperspectral remote sensing (HSI) enables the detailed capture of spectral
information from the Earth's surface, facilitating precise classification and
identification of surface crops due to its superior spectral diagnostic
capabilities. However, current convolutional neural networks (CNNs) focus on
local features in hyperspectral data, leading to suboptimal performance when
classifying intricate crop types and addressing imbalanced sample
distributions. In contrast, the Transformer framework excels at extracting
global features from hyperspectral imagery. To leverage the strengths of both
approaches, this research introduces the Convolutional Meet Transformer Network
(CMTNet). This innovative model includes a spectral-spatial feature extraction
module for shallow feature capture, a dual-branch structure combining CNN and
Transformer branches for local and global feature extraction, and a
multi-output constraint module that enhances classification accuracy through
multi-output loss calculations and cross constraints across local, global, and
joint features. Extensive experiments conducted on three datasets
(WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu) demonstrate that CMTNet
significantly outperforms other state-of-the-art networks in
classification performance, validating its effectiveness in hyperspectral crop
classification.
comment: After submission, our research team underwent a significant shift in
the project's focus and direction. As a result, the current manuscript no
longer accurately reflects the revised scope or findings of our research. To
prevent potential misinterpretations or misleading citations, we believe it
is in the best interest of the academic community to withdraw this article
♻ ★ Exosense: A Vision-Based Scene Understanding System For Exoskeletons
Jianeng Wang, Matias Mattamala, Christina Kassab, Guillaume Burger, Fabio Elnecave, Lintong Zhang, Marine Petriaux, Maurice Fallon
Self-balancing exoskeletons are a key enabling technology for individuals
with mobility impairments. While the current challenges focus on
human-compliant hardware and control, unlocking their use for daily activities
requires a scene perception system. In this work, we present Exosense, a
vision-centric scene understanding system for self-balancing exoskeletons. We
introduce a multi-sensor visual-inertial mapping device as well as a navigation
stack for state estimation, terrain mapping and long-term operation. We tested
Exosense attached to both a human leg and Wandercraft's Personal Exoskeleton in
real-world indoor scenarios. This enabled us to test the system during typical
periodic walking gaits, as well as future uses in multi-story environments. We
demonstrate that Exosense can achieve an odometry drift of about 4 cm per meter
traveled, and construct terrain maps under 1 cm average reconstruction error.
It can also work in a visual localization mode in a previously mapped
environment, providing a step towards long-term operation of exoskeletons.
comment: 8 pages, 9 figures
♻ ★ Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos
Procedural activities are sequences of key-steps aimed at achieving specific
goals. They are crucial to build intelligent agents able to assist users
effectively. In this context, task graphs have emerged as a
human-understandable representation of procedural activities, encoding a
partial ordering over the key-steps. While previous works generally relied on
hand-crafted procedures to extract task graphs from videos, in this paper, we
propose an approach based on direct maximum likelihood optimization of edges'
weights, which allows gradient-based learning of task graphs and can be
naturally plugged into neural network architectures. Experiments on the
CaptainCook4D dataset demonstrate the ability of our approach to predict
accurate task graphs from the observation of action sequences, with an
improvement of +16.7% over previous approaches. Owing to the differentiability
of the proposed framework, we also introduce a feature-based approach, aiming
to predict task graphs from key-step textual or video embeddings, for which we
observe emerging video understanding abilities. Task graphs learned with our
approach are also shown to significantly enhance online mistake detection in
procedural egocentric videos, achieving notable gains of +19.8% and +7.5% on
the Assembly101-O and EPIC-Tent-O datasets. Code for replicating experiments is
available at https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.
♻ ★ OneLLM: One Framework to Align All Modalities with Language CVPR 2024
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
Multimodal large language models (MLLMs) have gained significant attention
due to their strong multimodal understanding capability. However, existing
works rely heavily on modality-specific encoders, which usually differ in
architecture and are limited to common modalities. In this paper, we present
OneLLM, an MLLM that aligns eight modalities to language using a unified
framework. We achieve this through a unified multimodal encoder and a
progressive multimodal alignment pipeline. In detail, we first train an image
projection module to connect a vision encoder with LLM. Then, we build a
universal projection module (UPM) by mixing multiple image projection modules
and dynamic routing. Finally, we progressively align more modalities to LLM
with the UPM. To fully leverage the potential of OneLLM in following
instructions, we also curated a comprehensive multimodal instruction dataset,
including 2M items from image, audio, video, point cloud, depth/normal map, IMU
and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks,
encompassing tasks such as multimodal captioning, question answering and
reasoning, where it delivers excellent performance. Code, data, model and
online demo are available at https://github.com/csuhan/OneLLM
comment: Accepted by CVPR 2024. Code: https://github.com/csuhan/OneLLM
♻ ★ tCURLoRA: Tensor CUR Decomposition Based Low-Rank Parameter Adaptation and Its Application in Medical Image Segmentation
Transfer learning, by leveraging knowledge from pre-trained models, has
significantly enhanced the performance of target tasks. However, as deep neural
networks scale up, full fine-tuning introduces substantial computational and
storage challenges in resource-constrained environments, limiting its
widespread adoption. To address this, parameter-efficient fine-tuning (PEFT)
methods have been developed to reduce computational complexity and storage
requirements by minimizing the number of updated parameters. While matrix
decomposition-based PEFT methods, such as LoRA, show promise, they struggle to
fully capture the high-dimensional structural characteristics of model weights.
In contrast, high-dimensional tensors offer a more natural representation of
neural network weights, allowing for a more comprehensive capture of
higher-order features and multi-dimensional interactions. In this paper, we
propose tCURLoRA, a novel fine-tuning method based on tensor CUR decomposition.
By concatenating pre-trained weight matrices into a three-dimensional tensor
and applying tensor CUR decomposition, we update only the lower-order tensor
components during fine-tuning, effectively reducing computational and storage
overhead. Experimental results demonstrate that tCURLoRA outperforms existing
PEFT methods in medical image segmentation tasks.
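An illustrative matrix-level CUR decomposition, shown here only as the building block behind the approach: columns and rows are sampled from the weight matrix and a small core U is computed from pseudo-inverses. The tensor extension, the selection strategy, and how the low-order factors are updated during fine-tuning are simplified away.

```python
# Illustrative matrix CUR decomposition: A ~= C @ U @ R, with C and R built from
# sampled columns/rows and U = pinv(C) @ A @ pinv(R). Selection by squared-norm
# sampling is an assumption; the paper works with a 3D tensor instead.
import numpy as np

def cur_decomposition(A, n_cols, n_rows, seed=0):
    rng = np.random.default_rng(seed)
    col_p = (A ** 2).sum(axis=0); col_p /= col_p.sum()
    row_p = (A ** 2).sum(axis=1); row_p /= row_p.sum()
    cols = rng.choice(A.shape[1], size=n_cols, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=n_rows, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # small core, the trainable part
    return C, U, R

A = np.random.randn(256, 256)
C, U, R = cur_decomposition(A, n_cols=32, n_rows=32)
print("relative error:", np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))
```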
♻ ★ DATransNet: Dynamic Attention Transformer Network for Infrared Small Target Detection
Infrared small target detection (ISTD) is widely used in civilian and
military applications. However, ISTD encounters several challenges, including
the tendency for small and dim targets to be obscured by complex backgrounds. To
address this issue, we propose the Dynamic Attention Transformer Network
(DATransNet), which aims to extract and preserve edge information of small
targets. DATransNet employs the Dynamic Attention Transformer (DATrans),
simulating central difference convolutions (CDC) to extract and integrate
gradient features with deeper features. Furthermore, we propose a global feature
extraction module (GFEM) that offers a comprehensive perspective to prevent the
network from focusing solely on details while neglecting the background
information. We compare the network with state-of-the-art (SOTA) approaches,
and the results demonstrate that our method performs effectively. Our source
code is available at https://github.com/greekinRoma/DATransNet.
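For reference, the commonly used formulation of a central difference convolution (vanilla convolution minus theta times the response of the spatially summed kernel at the center) is sketched below. This is the generic CDC building block, not the paper's full DATrans module.

```python
# Generic central difference convolution (CDC): y = conv(x) - theta * conv_sum(x),
# where conv_sum uses the kernel summed over its spatial extent (a 1x1 kernel).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDiffConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        # Summing the kernel spatially gives the response at the center pixel.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # (out_ch, in_ch, 1, 1)
        out_center = F.conv2d(x, kernel_sum)
        return out - self.theta * out_center

x = torch.randn(1, 16, 64, 64)
print(CentralDiffConv2d(16, 32)(x).shape)      # torch.Size([1, 32, 64, 64])
```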
♻ ★ TextToucher: Fine-Grained Text-to-Touch Generation AAAI 2025
Tactile sensation plays a crucial role in the development of multi-modal
large models and embodied intelligence. To collect tactile data at as low a
cost as possible, a series of studies have attempted to generate tactile images
by vision-to-touch image translation. However, compared to text modality,
visual modality-driven tactile generation cannot accurately depict human
tactile sensation. In this work, we analyze the characteristics of tactile
images in detail from two granularities: object-level (tactile texture, tactile
shape), and sensor-level (gel status). We model these granularities of
information through text descriptions and propose a fine-grained Text-to-Touch
generation method (TextToucher) to generate high-quality tactile samples.
Specifically, we introduce a multimodal large language model to build the text
sentences about object-level tactile information and employ a set of learnable
text prompts to represent the sensor-level tactile information. To better guide
the tactile generation process with the built text information, we fuse the
dual grains of text information and explore various dual-grain text
conditioning methods within the diffusion transformer architecture.
Furthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to
precisely evaluate the quality of text-driven generated tactile data. Extensive
experiments demonstrate the superiority of our TextToucher method. The source
code will be available at https://github.com/TtuHamg/TextToucher.
comment: This paper has been accepted by AAAI 2025
♻ ★ DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Generative Learning on 3D Meshes
Xuyang Wang, Ziang Cheng, Zhenyu Li, Jiayu Yang, Haorui Ji, Pan Ji, Mehrtash Harandi, Richard Hartley, Hongdong Li
This paper proposes DoubleDiffusion, a novel framework that combines heat
dissipation diffusion and denoising diffusion for direct generative learning on
3D mesh surfaces. Our approach addresses the challenges of generating
continuous signal distributions residing on a curved manifold surface. Unlike
previous methods that rely on unrolling 3D meshes into 2D or adopting field
representations, DoubleDiffusion leverages the Laplace-Beltrami operator to
process features respecting the mesh structure. This combination enables
effective geometry-aware signal diffusion across the underlying geometry. As
shown in Fig.1, we demonstrate that DoubleDiffusion has the ability to generate
RGB signal distributions on complex 3D mesh surfaces and achieves per-category
shape-conditioned texture generation across different shape geometry. Our work
contributes a new direction in diffusion-based generative modeling on 3D
surfaces, with potential applications in the field of 3D asset generation.
♻ ★ UltraCortex: Submillimeter Ultra-High Field 9.4 T Brain MR Image Collection and Manual Cortical Segmentations
Lucas Mahler, Julius Steiglechner, Benjamin Bender, Tobias Lindig, Dana Ramadan, Jonas Bause, Florian Birk, Rahel Heule, Edyta Charyasz, Michael Erb, Vinod Jangir Kumar, Gisela E Hagberg, Pascal Martin, Gabriele Lohmann, Klaus Scheffler
The UltraCortex repository (https://www.ultracortex.org) houses magnetic
resonance imaging data of the human brain obtained at an ultra-high field
strength of 9.4 T. It contains 86 structural MR images with spatial resolutions
ranging from 0.6 to 0.8 mm. Additionally, the repository includes segmentations
of 12 brains into gray and white matter compartments. These segmentations have
been independently validated by two expert neuroradiologists, thus establishing
them as a reliable gold standard. This resource provides researchers with
access to high-quality brain imaging data and validated segmentations,
facilitating neuroimaging studies and advancing our understanding of brain
structure and function. Existing repositories do not accommodate field
strengths beyond 7 T, nor do they offer validated segmentations, underscoring
the significance of this new resource.
♻ ★ LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Large language models have demonstrated substantial advancements in reasoning
capabilities, particularly through inference-time scaling, as illustrated by
models such as OpenAI's o1. However, current Vision-Language Models (VLMs)
often struggle to perform systematic and structured reasoning, especially when
handling complex visual question-answering tasks. In this work, we introduce
LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning.
Unlike chain-of-thought prompting, LLaVA-CoT independently engages in
sequential stages of summarization, visual interpretation, logical reasoning,
and conclusion generation. This structured approach enables LLaVA-CoT to
achieve marked improvements in precision on reasoning-intensive tasks. To
accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples
from various visual question answering sources and providing structured
reasoning annotations. Besides, we propose an inference-time stage-level beam
search method, which enables effective inference-time scaling. Remarkably, with
only 100k training samples and a simple yet effective inference time scaling
method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range
of multimodal reasoning benchmarks, but also surpasses the performance of
larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and
Llama-3.2-90B-Vision-Instruct.
♻ ★ INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models
Di Jin, Xing Liu, Yu Liu, Jia Qing Yap, Andrea Wong, Adriana Crespo, Qi Lin, Zhiyuan Yin, Qiang Yan, Ryan Ye
The rapid development of large language models (LLMs) and large vision models
(LVMs) has propelled the evolution of multi-modal AI systems, which have
demonstrated remarkable potential for industrial applications by emulating
human-like cognition. However, they also pose significant ethical challenges,
including amplifying harmful content and reinforcing societal biases. For
instance, biases in some industrial image generation models highlighted the
urgent need for robust fairness assessments. Most existing evaluation
frameworks focus on the comprehensiveness of various aspects of the models, but
they exhibit critical limitations, including insufficient attention to content
generation alignment and social bias-sensitive domains. More importantly, their
reliance on pixel-detection techniques is prone to inaccuracies.
To address these issues, this paper presents INFELM, an in-depth fairness
evaluation on widely-used text-to-image models. Our key contributions are: (1)
an advanced skintone classifier incorporating facial topology and refined skin
pixel representation to enhance classification precision by at least 16.04%,
(2) a bias-sensitive content alignment measurement for understanding societal
impacts, (3) a generalizable representation bias evaluation for diverse
demographic groups, and (4) extensive experiments analyzing large-scale
text-to-image model outputs across six social-bias-sensitive domains. We find
that existing models in the study generally do not meet the empirical fairness
criteria, and representation bias is generally more pronounced than alignment
errors. INFELM establishes a robust benchmark for fairness assessment,
supporting the development of multi-modal AI systems that align with ethical
and human-centric principles.
comment: Di Jin and Xing Liu contributed equally to this work
♻ ★ McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction
Iso-surface extraction from an implicit field is a fundamental process in
various applications of computer vision and graphics. When dealing with
geometric shapes with complicated geometric details, many existing algorithms
suffer from high computational costs and memory usage. This paper proposes
McGrids, a novel approach to improve the efficiency of iso-surface extraction.
The key idea is to construct adaptive grids for iso-surface extraction rather
than using a simple uniform grid as prior art does. Specifically, we formulate
the problem of constructing adaptive grids as a probability sampling problem,
which is then solved by a Monte Carlo process. We demonstrate McGrids' capability
with extensive experiments from both analytical SDFs computed from surface
meshes and learned implicit fields from real multiview images. The experiment
results show that our McGrids can significantly reduce the number of implicit
field queries, resulting in significant memory reduction, while producing
high-quality meshes with rich geometric details.
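A minimal sketch of the core idea of Monte Carlo-driven adaptive sampling: propose random points and keep them with a probability that decays with |SDF|, so field queries concentrate near the zero level set. The acceptance kernel, bandwidth, and sphere SDF are illustrative choices, not the paper's exact sampling scheme.

```python
# Monte Carlo sampling concentrated near an iso-surface of an implicit field.
import numpy as np

def sphere_sdf(p, radius=0.5):
    return np.linalg.norm(p, axis=-1) - radius

def adaptive_samples(sdf, n_proposals=200_000, bandwidth=0.05, seed=0):
    rng = np.random.default_rng(seed)
    proposals = rng.uniform(-1.0, 1.0, size=(n_proposals, 3))
    d = np.abs(sdf(proposals))
    accept = rng.random(n_proposals) < np.exp(-(d / bandwidth) ** 2)
    return proposals[accept]          # grid vertices clustered around the iso-surface

pts = adaptive_samples(sphere_sdf)
print(pts.shape, "mean |SDF| of kept points:", np.abs(sphere_sdf(pts)).mean())
```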
♻ ★ MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
We address the problem of facial expression editing by controlling the
relative variation of facial action-unit (AU) from the same person. This
enables us to edit this specific person's expression in a fine-grained,
continuous and interpretable manner, while preserving their identity, pose,
background and detailed facial attributes. Key to our model, which we dub
MagicFace, is a diffusion model conditioned on AU variations and an ID encoder
to preserve facial details of high consistency. Specifically, to preserve the
facial details with the input identity, we leverage the power of pretrained
Stable-Diffusion models and design an ID encoder to merge appearance features
through self-attention. To keep background and pose consistency, we introduce
an efficient Attribute Controller by explicitly informing the model of current
background and pose of the target. By injecting AU variations into a denoising
UNet, our model can animate arbitrary identities with various AU combinations,
yielding superior results in high-fidelity expression editing compared to other
facial expression editing works. Code is publicly available at
https://github.com/weimengting/MagicFace.
♻ ★ UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation
Semi-supervised semantic segmentation (SSS) aims at learning rich visual
knowledge from cheap unlabeled images to enhance semantic segmentation
capability. Among recent works, UniMatch improves its precedents tremendously
by amplifying the practice of weak-to-strong consistency regularization.
Subsequent works typically follow similar pipelines and propose various
delicate designs. Despite the achieved progress, strangely, even in this
flourishing era of numerous powerful vision models, almost all SSS works are
still sticking to 1) using outdated ResNet encoders with small-scale
ImageNet-1K pre-training, and 2) evaluation on simple Pascal and Cityscapes
datasets. In this work, we argue that it is necessary to switch the baseline
of SSS from ResNet-based encoders to more capable ViT-based encoders (e.g.,
DINOv2) that are pre-trained on massive data. A simple update on the encoder
(even using 2x fewer parameters) can bring more significant improvement than
careful method designs. Built on this competitive baseline, we present our
upgraded and simplified UniMatch V2, inheriting the core spirit of
weak-to-strong consistency from V1, but requiring less training cost and
providing consistently better results. Additionally, given the gradually
saturating performance on Pascal and Cityscapes, we argue that the community
should focus on more challenging benchmarks with complex taxonomies, such as
ADE20K and COCO
datasets. Code, models, and logs of all reported values, are available at
https://github.com/LiheYoung/UniMatch-V2.
comment: Accepted by TPAMI
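For context, a generic weak-to-strong consistency loss for semi-supervised segmentation is sketched below: pseudo-label the weakly augmented view and supervise the strongly augmented view on confident pixels. The threshold and the loss shape are placeholders in the spirit of this line of work, not UniMatch V2's exact recipe.

```python
# Generic weak-to-strong consistency for semi-supervised segmentation.
import torch
import torch.nn.functional as F

def weak_to_strong_loss(model, x_weak, x_strong, conf_thresh=0.95):
    with torch.no_grad():
        probs = torch.softmax(model(x_weak), dim=1)   # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)               # (B, H, W)
    logits_s = model(x_strong)
    loss = F.cross_entropy(logits_s, pseudo, reduction="none")
    mask = (conf >= conf_thresh).float()              # only supervise confident pixels
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```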
♻ ★ InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion
Zhaoyi Yan, Zhijie Sang, Yiming Zhang, Yuhao Fu, Baoyi He, Qi Zhou, Yining Di, Chunlin Ji, Shengyu Zhang, Fei Wu, Hongxia Yang
Large Language Models (LLMs) have demonstrated strong performance across
various reasoning tasks, yet building a single model that consistently excels
across all domains remains challenging. This paper addresses this problem by
exploring strategies to integrate multiple domain-specialized models into an
efficient pivot model. We propose two fusion strategies to combine the strengths
of multiple LLMs: (1) a pairwise, multi-step fusion approach that sequentially
distills each source model into the pivot model, followed by a weight merging
step to integrate the distilled models into the final model. This method
achieves strong performance but requires substantial training effort; and (2) a
unified fusion approach that aggregates all source models' outputs
simultaneously. To improve the fusion process, we introduce a novel
Rate-Skewness Adaptive Fusion (RSAF) technique, which dynamically adjusts top-K
ratios during parameter merging for enhanced flexibility and stability.
Furthermore, we propose an uncertainty-based weighting method for the unified
approach, which dynamically balances the contributions of source models and
outperforms other logits/distribution ensemble methods. We achieved accuracy
improvements of 9.27%, 8.80%, and 8.89% on the GSM8K, MATH, and HumanEval
tasks, respectively.
comment: Under review
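One plausible instantiation of uncertainty-based weighting is to weight each source model's output distribution by the inverse of its predictive entropy, as sketched below. The exact weighting function used in the paper may differ.

```python
# Entropy-weighted fusion of several source models' token distributions:
# lower predictive entropy (less uncertainty) -> higher weight.
import torch

def entropy_weighted_fusion(logits_list):
    """logits_list: list of (B, V) logits from different source models."""
    probs = [torch.softmax(l, dim=-1) for l in logits_list]
    entropies = torch.stack(
        [-(p * p.clamp_min(1e-12).log()).sum(dim=-1) for p in probs], dim=0)  # (M, B)
    weights = torch.softmax(-entropies, dim=0)                                # (M, B)
    fused = sum(w.unsqueeze(-1) * p for w, p in zip(weights, probs))
    return fused                                                              # (B, V)
```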
♻ ★ Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu
Diffusion models have demonstrated impressive performance in generating
high-quality videos from text prompts or images. However, precise control over
the video generation process, such as camera manipulation or content editing,
remains a significant challenge. Existing methods for controlled video
generation are typically limited to a single control type, lacking the
flexibility to handle diverse control demands. In this paper, we introduce
Diffusion as Shader (DaS), a novel approach that supports multiple video
control tasks within a unified architecture. Our key insight is that achieving
versatile video control necessitates leveraging 3D control signals, as videos
are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods
limited to 2D control signals, DaS leverages 3D tracking videos as control
inputs, making the video diffusion process inherently 3D-aware. This innovation
allows DaS to achieve a wide range of video controls by simply manipulating the
3D tracking videos. A further advantage of using 3D tracking videos is their
ability to effectively link frames, significantly enhancing the temporal
consistency of the generated videos. With just 3 days of fine-tuning on 8 H800
GPUs using less than 10k videos, DaS demonstrates strong control capabilities
across diverse tasks, including mesh-to-video generation, camera control,
motion transfer, and object manipulation.
comment: Project page: https://igl-hkust.github.io/das/ Codes:
https://github.com/IGL-HKUST/DiffusionAsShader
♻ ★ Nothing Stands Still: A Spatiotemporal Benchmark on 3D Point Cloud Registration Under Large Geometric and Temporal Change SP
Building 3D geometric maps of man-made spaces is a well-established and
active field that is fundamental to computer vision and robotics. However,
considering the evolving nature of built environments, it is essential to
question the capabilities of current mapping efforts in handling temporal
changes. In addition, spatiotemporal mapping holds significant potential for
achieving sustainability and circularity goals. Existing mapping approaches
focus on small changes, such as object relocation or self-driving car
operation; in all of these cases, the main structure of the scene remains fixed.
Consequently, these approaches fail to address more radical changes in the
structure of the built environment, such as geometry and topology. To this end,
we introduce the Nothing Stands Still (NSS) benchmark, which focuses on the
spatiotemporal registration of 3D scenes undergoing large spatial and temporal
change, ultimately creating one coherent spatiotemporal map. Specifically, the
benchmark involves registering two or more partial 3D point clouds (fragments)
from the same scene but captured from different spatiotemporal views. In
addition to the standard pairwise registration, we assess the multi-way
registration of multiple fragments that belong to any temporal stage. As part
of NSS, we introduce a dataset of 3D point clouds recurrently captured in
large-scale building indoor environments that are under construction or
renovation. The NSS benchmark presents three scenarios of increasing
difficulty, to quantify the generalization ability of point cloud registration
methods over space (within one building and across buildings) and time. We
conduct extensive evaluations of state-of-the-art methods on NSS. The results
demonstrate the necessity for novel methods specifically designed to handle
large spatiotemporal changes. The homepage of our benchmark is at
http://nothing-stands-still.com.
comment: To appear in the ISPRS Journal of Photogrammetry and Remote Sensing.
29 pages, 26 figures. For the project page, see
http://nothing-stands-still.com
♻ ★ STITCH: Surface reconstrucTion using Implicit neural representations with Topology Constraints and persistent Homology
Anushrut Jignasu, Ethan Herron, Zhanhong Jiang, Soumik Sarkar, Chinmay Hegde, Baskar Ganapathysubramanian, Aditya Balu, Adarsh Krishnamurthy
We present STITCH, a novel approach for neural implicit surface
reconstruction of a sparse and irregularly spaced point cloud while enforcing
topological constraints (such as having a single connected component). We
develop a new differentiable framework based on persistent homology to
formulate topological loss terms that enforce the prior of a single 2-manifold
object. Our method demonstrates excellent performance in preserving the
topology of complex 3D geometries, evident through both visual and empirical
comparisons. We supplement this with a theoretical analysis, and provably show
that optimizing the loss with stochastic (sub)gradient descent leads to
convergence and enables reconstructing shapes with a single connected
component. Our approach showcases the integration of differentiable topological
data analysis tools for implicit surface reconstruction.
comment: 19 pages, 12 figures, 29 tables
♻ ★ Multi-Task Model Merging via Adaptive Weight Disentanglement
Model merging has recently gained attention as an economical and scalable
approach to incorporate task-specific weights from various tasks into a unified
multi-task model. For example, in Task Arithmetic (TA), adding the fine-tuned
weights of different tasks can enhance the model's performance on those tasks,
while subtracting them leads to task forgetting. Although TA is highly
effective, interference among tasks still hampers the performance of the merged
model. Existing methods for handling conflicts between tasks generally rely on
empirical selection, resulting in suboptimal performance. In this paper, we
introduce an Adaptive Weight Disentanglement method. We begin by theoretically
proving that task vectors employed in model merging should be orthogonal to
minimize interference among tasks. Guided by this insight, we initialize
redundant vectors such that, when subtracted from the original task vectors,
the resulting vectors exhibit increased orthogonality. Additionally, we impose
a norm constraint on the redundant vectors to preserve the performance of the
task-specific models. Experimental results demonstrate the effectiveness of our
proposed technique: it successfully extracts redundant vectors, and after their
subtraction, the task vectors not only retain robust performance but also
achieve superior fusion outcomes. Our code is available at
https://github.com/FarisXiong/AWD.git.
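A small sketch of the idea in flat-vector form: learn redundant vectors r_i so that the adjusted task vectors (tau_i - r_i) become more mutually orthogonal while a norm penalty keeps r_i small. The objective weights, initialization, and optimizer settings are illustrative assumptions, not the paper's exact procedure.

```python
# Toy adaptive weight disentanglement on flattened task vectors.
import torch

def disentangle(task_vectors, steps=500, lr=1e-2, lam=0.1):
    tau = torch.stack(task_vectors)                       # (T, D)
    r = torch.zeros_like(tau, requires_grad=True)
    opt = torch.optim.Adam([r], lr=lr)
    eye = torch.eye(tau.shape[0])
    for _ in range(steps):
        adj = tau - r
        adj_n = torch.nn.functional.normalize(adj, dim=1)
        cos = adj_n @ adj_n.T                             # pairwise cosine similarities
        ortho_loss = ((cos - eye) ** 2).sum()             # push off-diagonals toward zero
        loss = ortho_loss + lam * r.norm(dim=1).sum()     # keep redundant vectors small
        opt.zero_grad(); loss.backward(); opt.step()
    return (tau - r.detach()).unbind(0)                   # adjusted task vectors
```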
♻ ★ Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
This paper investigates the problem of understanding dynamic 3D scenes from
egocentric observations, a key challenge in robotics and embodied AI. Unlike
prior studies that explored this as long-form video understanding and utilized
egocentric video only, we instead propose an LLM-based agent, Embodied
VideoAgent, which constructs scene memory from both egocentric video and
embodied sensory inputs (e.g. depth and pose sensing). We further introduce a
VLM-based approach to automatically update the memory when actions or
activities over objects are perceived. Embodied VideoAgent attains significant
advantages over counterparts in challenging reasoning and planning tasks in 3D
scenes, achieving gains of 4.9% on Ego4D-VQ3D, 5.8% on OpenEQA, and 11.7% on
EnvQA. We have also demonstrated its potential in various embodied AI tasks
including generating embodied interactions and perception for robot
manipulation. The code and demo will be made public.
comment: project page: https://embodied-videoagent.github.io/
♻ ★ MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
The generation of talking avatars has achieved significant advancements in
precise audio synchronization. However, crafting lifelike talking head videos
requires capturing a broad spectrum of emotions and subtle facial expressions.
Current methods face fundamental challenges: a) the absence of frameworks for
modeling single basic emotional expressions, which restricts the generation of
complex emotions such as compound emotions; b) the lack of comprehensive
datasets rich in human emotional expressions, which limits the potential of
models. To address these challenges, we propose the following innovations: 1)
the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental
emotions to enable the precise synthesis of both singular and compound
emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to
include six prevalent human emotional expressions as well as four types of
compound emotions, thereby expanding the training potential of emotion-driven
models. Furthermore, to enhance the flexibility of emotion control, we propose
an emotion-to-latents module that leverages multimodal inputs, aligning diverse
control signals, such as audio, text, and labels, to ensure more varied control
inputs as well as the ability to control emotions using audio alone. Through
extensive quantitative and qualitative evaluations, we demonstrate that the
MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in
generating complex emotional expressions and nuanced facial details, setting a
new benchmark in the field. These datasets will be publicly released.
♻ ★ Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion
Fan Yang, Jianfeng Zhang, Yichun Shi, Bowen Chen, Chenxu Zhang, Huichao Zhang, Xiaofeng Yang, Xiu Li, Jiashi Feng, Guosheng Lin
Benefiting from the rapid development of 2D diffusion models, 3D content
generation has witnessed significant progress. One promising solution is to
finetune the pre-trained 2D diffusion models to produce multi-view images and
then reconstruct them into 3D assets via feed-forward sparse-view
reconstruction models. However, limited by the 3D inconsistency of the
generated multi-view images and the low reconstruction resolution of the
feed-forward reconstruction models, the generated 3D assets still suffer from
incorrect geometries and blurry textures. To address this problem, we
present a multi-view based refinement method, named Magic-Boost, to further refine
the generation results. In detail, we first propose a novel multi-view
conditioned diffusion model that extracts a 3D prior from the synthesized
multi-view images to synthesize high-fidelity novel-view images, and then
introduce a novel iterative-update strategy that uses it to provide precise
guidance for refining the coarse generated results through a fast optimization
process. Conditioned on the strong 3D priors extracted from the synthesized
multi-view images, Magic-Boost is capable of providing precise optimization
guidance that well aligns with the coarse generated 3D assets, enriching the
local detail in both geometry and texture within a short time ($\sim15$min).
Extensive experiments show Magic-Boost greatly enhances the coarse generated
inputs, generates high-quality 3D assets with rich geometric and textural
details. (Project Page: https://magic-research.github.io/magic-boost/)
♻ ★ YOLO11 to Its Genesis: A Decadal and Comprehensive Review of The You Only Look Once (YOLO) Series
Ranjan Sapkota, Rizwan Qureshi, Marco Flores Calero, Chetan Badjugar, Upesh Nepal, Alwin Poulose, Peter Zeno, Uday Bhanu Prakash Vaddevolu, Sheheryar Khan, Maged Shoman, Hong Yan, Manoj Karkee
Given the rapid emergence and applications of large language models, this
review systematically examines the progression of the You Only Look Once (YOLO)
object detection algorithms from YOLOv1 to the recently unveiled YOLO11 (or
YOLOv11).
Employing a reverse chronological analysis, this study examines the
advancements introduced by YOLO algorithms, beginning with YOLOv11 and
progressing through YOLOv10, YOLOv9, YOLOv8, and subsequent versions to explore
each version's contributions to enhancing speed, detection accuracy, and
computational efficiency in real-time object detection. By detailing the
incremental technological advancements in subsequent YOLO versions, this review
chronicles the evolution of YOLO, and discusses the challenges and limitations
in each earlier versions. The evolution signifies a path towards integrating
YOLO with multimodal, context-aware, and Artificial General Intelligence (AGI)
systems for the next YOLO decade, promising significant implications for future
developments in AI-driven applications.
comment: 11 Figures, 7 Tables
♻ ★ Multi-Domain Features Guided Supervised Contrastive Learning for Radar Target Detection
Detecting small targets in sea clutter is challenging due to dynamic maritime
conditions. Existing solutions either model sea clutter for detection or
extract target features based on clutter-target echo differences, including
statistical and deep features. While more common, the latter often excels in
controlled scenarios but struggles with robust detection and generalization in
diverse environments, limiting practical use. In this letter, we propose a
multi-domain features guided supervised contrastive learning (MDFG_SCL) method,
which integrates statistical features derived from multi-domain differences
with deep features obtained through supervised contrastive learning, thereby
capturing both low-level domain-specific variations and high-level semantic
information. This comprehensive feature integration enables the model to
effectively distinguish between small targets and sea clutter, even under
challenging conditions. Experiments conducted on real-world datasets
demonstrate that the proposed shallow-to-deep detector not only achieves
effective identification of small maritime targets but also maintains superior
detection performance across varying sea conditions, outperforming the
mainstream unsupervised contrastive learning and supervised contrastive
learning methods.
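For reference, the standard supervised contrastive (SupCon) objective on L2-normalized features is sketched below; it stands in for the deep-feature branch only, and the paper's multi-domain statistical-feature fusion is not shown.

```python
# Standard supervised contrastive (SupCon) loss on L2-normalized features.
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.07):
    """features: (B, D) embeddings, labels: (B,) class ids (e.g., target vs. clutter)."""
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature                                   # (B, B)
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask  # same-class pairs
    sim = sim.masked_fill(self_mask, float("-inf"))               # exclude self from denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)               # avoid -inf * 0
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_counts
    return loss[pos_mask.sum(1) > 0].mean()                       # anchors with positives only
```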
♻ ★ ContextMRI: Enhancing Compressed Sensing MRI through Metadata Conditioning
Compressed sensing MRI seeks to accelerate MRI acquisition processes by
sampling fewer k-space measurements and then reconstructing the missing data
algorithmically. The success of these approaches often relies on strong priors
or learned statistical models. While recent diffusion model-based priors have
shown great potential, previous methods typically ignore clinically available
metadata (e.g. patient demographics, imaging parameters, slice-specific
information). In practice, metadata contains meaningful cues about the anatomy
and acquisition protocol, suggesting it could further constrain the
reconstruction problem. In this work, we propose ContextMRI, a text-conditioned
diffusion model for MRI that integrates granular metadata into the
reconstruction process. We train a pixel-space diffusion model directly on
minimally processed, complex-valued MRI images. During inference, metadata is
converted into a structured text prompt and fed to the model via CLIP text
embeddings. By conditioning the prior on metadata, we unlock more accurate
reconstructions and show consistent gains across multiple datasets,
acceleration factors, and undersampling patterns. Our experiments demonstrate
that increasing the fidelity of metadata, ranging from slice location and
contrast to patient age, sex, and pathology, systematically boosts
reconstruction performance. This work highlights the untapped potential of
leveraging clinical context for inverse problems and opens a new direction for
metadata-driven MRI reconstruction.
comment: 29 pages, 9 figures. Code is available at
https://github.com/DoHunLee1/ContextMRI
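The metadata-to-prompt conditioning step might look roughly as follows, here using the Hugging Face CLIP text encoder; the prompt format and the metadata fields are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch of serializing scan metadata into a text prompt and
# encoding it with a CLIP text encoder; field names and prompt layout are
# assumptions for demonstration.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def metadata_to_embedding(meta: dict) -> torch.Tensor:
    """Turn a metadata dictionary into CLIP text embeddings [1, 77, 768]."""
    prompt = ", ".join(f"{k}: {v}" for k, v in meta.items())
    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        return text_encoder(**tokens).last_hidden_state

# Example: condition on slice location, contrast, and patient attributes.
emb = metadata_to_embedding({"contrast": "T2", "slice": 12,
                             "age": 54, "sex": "F"})
```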
♻ ★ Hyper-3DG: Text-to-3D Gaussian Generation via Hypergraph
Text-to-3D generation represents an exciting field that has seen rapid
advancements, facilitating the transformation of textual descriptions into
detailed 3D models. However, current progress often neglects the intricate
high-order correlation of geometry and texture within 3D objects, leading to
challenges such as over-smoothness, over-saturation and the Janus problem. In
this work, we propose a method named ``3D Gaussian Generation via Hypergraph
(Hyper-3DG)'', designed to capture the sophisticated high-order correlations
present within 3D objects. Our framework is anchored by a well-established
main flow and an essential module named ``Geometry and Texture Hypergraph
Refiner (HGRefiner)''. This module not only refines the representation of 3D
Gaussians but also accelerates the update process of these 3D Gaussians by
conducting the Patch-3DGS Hypergraph Learning on both explicit attributes and
latent visual features. Our framework allows for the production of finely
generated 3D objects within a cohesive optimization, effectively circumventing
degradation. Extensive experimentation has shown that our proposed method
significantly enhances the quality of 3D generation while incurring no
additional computational overhead for the underlying framework. (Project code:
https://github.com/yjhboy/Hyper3DG)
comment: Accepted by IJCV
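A toy sketch of patch-level hypergraph construction and one round of feature aggregation over 3D Gaussian patch features is shown below; the kNN hyperedge rule and mean pooling are illustrative assumptions, not the HGRefiner itself.

```python
# A toy sketch of building a kNN hypergraph over patches of 3D Gaussians and
# running one node -> hyperedge -> node aggregation step. The hyperedge rule
# and mean pooling are assumptions for illustration.
import torch

def knn_hyperedges(patch_feats, k=4):
    """Hyperedge e groups patch e with its k nearest patches in feature
    space. Returns an incidence matrix H [E, P] with H[e, p] = 1."""
    d = torch.cdist(patch_feats, patch_feats)             # pairwise distances
    idx = d.topk(k + 1, largest=False).indices            # self + k neighbours
    H = torch.zeros(len(patch_feats), len(patch_feats))
    H.scatter_(1, idx, 1.0)
    return H

def hypergraph_aggregate(patch_feats, H):
    """One mean-pooled propagation step over the hypergraph."""
    edge_feats = (H @ patch_feats) / H.sum(1, keepdim=True)        # hyperedge means
    node_feats = (H.t() @ edge_feats) / H.t().sum(1, keepdim=True) # back to patches
    return node_feats
```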
♻ ★ EndoPerfect: A Hybrid NeRF-Stereo Vision Approach Pioneering Monocular Depth Estimation and 3D Reconstruction in Endoscopy
Pengcheng Chen, Wenhao Li, Nicole Gunderson, Jeremy Ruthberg, Randall Bly, Zhenglong Sun, Waleed M. Abuzeid, Eric J. Seibel
3D reconstruction in endoscopic sinus surgery (ESS) demands exceptional
accuracy, with the mean error and standard deviation required to fall within
the thickness of a single CT slice (0.625 mm), as critical structures in the
nasal cavity are situated within submillimeter distances of surgical
instruments.
This poses a formidable challenge when using conventional monocular endoscopes.
Depth estimation is crucial for 3D reconstruction, yet existing depth
estimation methodologies either suffer from inherent accuracy limitations or,
in the case of learning-based approaches, perform poorly when applied to ESS
despite succeeding on their original datasets. In this study, we present a
novel, highly generalizable method that combines Neural Radiance Fields (NeRF)
and stereo depth estimation for 3D reconstruction that can derive metric
monocular depth. Our approach begins with an initial NeRF reconstruction that
yields a coarse 3D scene, followed by the creation of binocular pairs within
that coarse scene and the generation of depth maps through stereo vision.
These depth maps are then used to supervise the next NeRF iteration,
progressively refining both the NeRF and the binocular depth; the refinement
continues until the depth maps converge. This recursive process generates
high-accuracy depth maps
from monocular endoscopic video. Evaluation in synthetic endoscopy shows a
depth accuracy of 0.125 $\pm$ 0.443 mm, well within the 0.625 mm threshold.
Further clinical experiments with real endoscopic data demonstrate a mean
distance to CT mesh of 0.269 mm, representing the highest accuracy among
monocular 3D reconstruction methods in ESS.
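The alternating NeRF/stereo refinement described above can be summarized by the following schematic; the callables train_nerf, make_binocular_pairs, and stereo_depth, as well as the convergence tolerance, are hypothetical placeholders rather than the authors' release.

```python
# A schematic of the alternating NeRF / stereo-depth refinement loop with
# hypothetical helper callables.
def refine_depth(frames, train_nerf, make_binocular_pairs, stereo_depth,
                 max_iters=5, tol=0.05):
    """Alternate NeRF reconstruction and stereo depth estimation until the
    per-frame depth maps stop changing (mean absolute change below tol)."""
    depth_supervision, prev_depths = None, None
    for _ in range(max_iters):
        nerf = train_nerf(frames, depth=depth_supervision)    # (re)fit the scene
        pairs = make_binocular_pairs(nerf, frames)            # virtual stereo views
        depths = [stereo_depth(left, right) for left, right in pairs]
        if prev_depths is not None:
            change = sum(float(abs(d - p).mean())
                         for d, p in zip(depths, prev_depths)) / len(depths)
            if change < tol:
                break
        prev_depths = depth_supervision = depths
    return depths
```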