Computer Vision and Pattern Recognition
★ Toon3D: Seeing Cartoons from a New Perspective
In this work, we recover the underlying 3D structure of non-geometrically
consistent scenes. We focus our analysis on hand-drawn images from cartoons and
anime. Many cartoons are created by artists without a 3D rendering engine,
which means that any new image of a scene is hand-drawn. The hand-drawn images
are usually faithful representations of the world, but only in a qualitative
sense, since it is difficult for humans to draw multiple perspectives of an
object or scene in a 3D-consistent way. Nevertheless, people can easily perceive 3D
scenes from inconsistent inputs! In this work, we correct for 2D drawing
inconsistencies to recover a plausible 3D structure such that the newly warped
drawings are consistent with each other. Our pipeline consists of a
user-friendly annotation tool, camera pose estimation, and image deformation to
recover a dense structure. Our method warps images to obey a perspective camera
model, enabling our aligned results to be plugged into novel-view synthesis
reconstruction methods to experience cartoons from viewpoints never drawn
before. Our project page is https://toon3d.studio/.
comment: Please see our project page: https://toon3d.studio/
★ Text-to-Vector Generation with Neural Path Representation SIGGRAPH 2024
Vector graphics are widely used in digital art and highly favored by
designers due to their scalability and layer-wise properties. However, the
process of creating and editing vector graphics requires creativity and design
expertise, making it a time-consuming task. Recent advancements in
text-to-vector (T2V) generation have aimed to make this process more
accessible. However, existing T2V methods directly optimize control points of
vector graphics paths, often resulting in intersecting or jagged paths due to
the lack of geometry constraints. To overcome these limitations, we propose a
novel neural path representation by designing a dual-branch Variational
Autoencoder (VAE) that learns the path latent space from both sequence and
image modalities. By optimizing the combination of neural paths, we can
incorporate geometric constraints while preserving expressivity in generated
SVGs. Furthermore, we introduce a two-stage path optimization method to improve
the visual and topological quality of generated SVGs. In the first stage, a
pre-trained text-to-image diffusion model guides the initial generation of
complex vector graphics through the Variational Score Distillation (VSD)
process. In the second stage, we refine the graphics using a layer-wise image
vectorization strategy to achieve clearer elements and structure. We
demonstrate the effectiveness of our method through extensive experiments and
showcase various applications. The project page is
https://intchous.github.io/T2V-NPR.
comment: Accepted by SIGGRAPH 2024. Project page:
https://intchous.github.io/T2V-NPR
★ Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model
Visual In-Context Learning (ICL) has emerged as a promising research area due
to its capability to accomplish various tasks with limited example pairs
through analogical reasoning. However, training-based visual ICL has
limitations in its ability to generalize to unseen tasks and requires the
collection of a diverse task dataset. On the other hand, existing methods in
the inference-based visual ICL category solely rely on textual prompts, which
fail to capture fine-grained contextual information from given examples and can
be time-consuming when converting from images to text prompts. To address these
challenges, we propose Analogist, a novel inference-based visual ICL approach
that exploits both visual and textual prompting techniques using a
text-to-image diffusion model pretrained for image inpainting. For visual
prompting, we propose a self-attention cloning (SAC) method to guide the
fine-grained structural-level analogy between image examples. For textual
prompting, we leverage GPT-4V's visual reasoning capability to efficiently
generate text prompts and introduce a cross-attention masking (CAM) operation
to enhance the accuracy of semantic-level analogy guided by text prompts. Our
method is out-of-the-box and does not require fine-tuning or optimization. It
is also generic and flexible, enabling a wide range of visual tasks to be
performed in an in-context manner. Extensive experiments demonstrate the
superiority of our method over existing approaches, both qualitatively and
quantitatively.
comment: Project page: https://analogist2d.github.io
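A minimal sketch may make the self-attention cloning (SAC) idea concrete: attention weights computed among the example-pair tokens are cloned onto the query-pair tokens inside a self-attention layer. All shapes, names, and the index convention below are illustrative assumptions on our part, not the paper's implementation.

```python
import torch

def self_attention_cloning(q, k, v, src_idx, tgt_idx, scale):
    # q, k, v: (tokens, dim) projections from one self-attention layer of the
    # inpainting diffusion model; src_idx indexes the example pair's tokens
    # (A -> A'), tgt_idx the query pair's tokens (B -> B'). Hypothetical layout.
    attn = torch.softmax(q @ k.t() * scale, dim=-1)        # (tokens, tokens)
    # Clone the A->A' attention block onto the B->B' block, transferring the
    # fine-grained structural analogy from the example to the query.
    attn[tgt_idx[:, None], tgt_idx[None, :]] = attn[src_idx[:, None], src_idx[None, :]]
    attn = attn / attn.sum(dim=-1, keepdim=True)           # re-normalize rows
    return attn @ v

q = k = v = torch.randn(64, 32)
out = self_attention_cloning(q, k, v, torch.arange(16), torch.arange(16, 32),
                             scale=32 ** -0.5)
```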
★ CAT3D: Create Anything in 3D with Multi-View Diffusion Models
Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, Ben Poole
Advances in 3D reconstruction have enabled high-quality 3D capture, but
require a user to collect hundreds to thousands of images to create a 3D scene.
We present CAT3D, a method for creating anything in 3D by simulating this
real-world capture process with a multi-view diffusion model. Given any number
of input images and a set of target novel viewpoints, our model generates
highly consistent novel views of a scene. These generated views can be used as
input to robust 3D reconstruction techniques to produce 3D representations that
can be rendered from any viewpoint in real-time. CAT3D can create entire 3D
scenes in as little as one minute, and outperforms existing methods for single
image and few-view 3D scene creation. See our project page for results and
interactive demos at https://cat3d.github.io .
comment: Project page: https://cat3d.github.io
★ 4D Panoptic Scene Graph Generation NeurIPS 2023
Jingkang Yang, Jun Cen, Wenxuan Peng, Shuai Liu, Fangzhou Hong, Xiangtai Li, Kaiyang Zhou, Qifeng Chen, Ziwei Liu
We are living in a three-dimensional space while moving forward through a
fourth dimension: time. To allow artificial intelligence to develop a
comprehensive understanding of such a 4D environment, we introduce 4D Panoptic
Scene Graph (PSG-4D), a new representation that bridges the raw visual data
perceived in a dynamic 4D world and high-level visual understanding.
Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent
entities with precise location and status information, and edges, which capture
the temporal relations. To facilitate research in this new area, we build a
richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of
1M frames, each of which is labeled with 4D panoptic segmentation masks as well
as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer,
a Transformer-based model that can predict panoptic segmentation masks, track
masks along the time axis, and generate the corresponding scene graphs via a
relation component. Extensive experiments on the new dataset show that our
method can serve as a strong baseline for future research on PSG-4D. Finally,
we provide a real-world application example to demonstrate how we can
achieve dynamic scene understanding by integrating a large language model into
our PSG-4D system.
comment: Accepted at NeurIPS 2023. Code: https://github.com/Jingkang50/PSG4D
Previous Series: PSG https://github.com/Jingkang50/OpenPSG and PVSG
https://github.com/Jingkang50/OpenPVSG
★ Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object
detection models developed by IDEA Research, which aims to advance the "Edge"
of open-set object detection. The suite encompasses two models: Grounding DINO
1.5 Pro, a high-performance model designed for stronger generalization
capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an
efficient model optimized for the faster speeds demanded by applications
requiring edge deployment. The Grounding DINO 1.5 Pro model advances its
predecessor by scaling up the model architecture, integrating an enhanced
vision backbone, and expanding the training dataset to over 20 million images
with grounding annotations, thereby achieving a richer semantic understanding.
The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced
feature scales, maintains robust detection capabilities by being trained on the
same comprehensive dataset. Empirical results demonstrate the effectiveness of
Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP
on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot
transfer benchmark, setting new records for open-set object detection.
Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT,
achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP
on the LVIS-minival benchmark, making it more suitable for edge computing
scenarios. Model examples and demos with API will be released at
https://github.com/IDEA-Research/Grounding-DINO-1.5-API
comment: Technical report
★ Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
Large vision-language models (VLMs) fine-tuned on specialized visual
instruction-following data have exhibited impressive language reasoning
capabilities across various scenarios. However, this fine-tuning paradigm may
not be able to efficiently learn optimal decision-making agents in multi-step
goal-directed tasks from interactive environments. To address this challenge,
we propose an algorithmic framework that fine-tunes VLMs with reinforcement
learning (RL). Specifically, our framework provides a task description and then
prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM
to efficiently explore intermediate reasoning steps that lead to the final
text-based action. Next, the open-ended text output is parsed into an
executable action to interact with the environment to obtain goal-directed task
rewards. Finally, our framework uses these task rewards to fine-tune the entire
VLM with RL. Empirically, we demonstrate that our proposed framework enhances
the decision-making capabilities of VLM agents across various tasks, enabling
7B models to outperform commercial models such as GPT-4V and Gemini.
Furthermore, we find that CoT reasoning is a crucial component for performance
improvement, as removing the CoT reasoning results in a significant decrease in
the overall performance of our method.
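As a rough illustration of the training loop described above, the sketch below parses a CoT rollout into an executable action and applies a REINFORCE-style update weighted by the environment's task reward. The "Action:" output format and all names are hypothetical; the paper's exact parsing and RL algorithm may differ.

```python
import re
import torch

def parse_action(generated_text: str) -> str:
    # Extract the final text action from a CoT rollout; we assume
    # (hypothetically) that the VLM ends its reasoning with "Action: <verb>".
    match = re.search(r"Action:\s*(\w+)", generated_text)
    return match.group(1) if match else "noop"

def reinforce_loss(token_log_probs: torch.Tensor, reward: float,
                   baseline: float = 0.0) -> torch.Tensor:
    # Vanilla REINFORCE over the whole CoT-plus-action sequence:
    # maximize the reward-weighted log-likelihood of the generated tokens.
    return -(reward - baseline) * token_log_probs.sum()

rollout = "The door is closed. I should open it first. Action: open"
action = parse_action(rollout)                       # -> "open"
log_probs = torch.log(torch.rand(20, requires_grad=True))
reinforce_loss(log_probs, reward=1.0, baseline=0.5).backward()
```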
★ FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models CVPR 2024
Despite noise and caption quality having been acknowledged as important
factors impacting vision-language contrastive pre-training, in this paper, we
show that the full potential of improving the training process by addressing
such issues is yet to be realized. Specifically, we first study and analyze
two issues affecting training: incorrect assignment of negative pairs, and low
caption quality and diversity. Then, we devise effective solutions for
addressing both problems, which essentially require training with multiple true
positive pairs. Finally, we propose training with sigmoid loss to address such
a requirement. We show very large gains over the current state-of-the-art for
both image recognition ($\sim +6\%$ on average over 11 datasets) and image
retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).
comment: Accepted at CVPR 2024
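Since the abstract's key requirement is training with multiple true positive pairs, a pairwise sigmoid loss handles this naturally: each image-text pair becomes an independent binary problem, so several positives per row need no softmax renormalization. Below is our SigLIP-style sketch of such a loss; the positive-mask convention and the fixed temperature and bias are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_positive_sigmoid_loss(img_emb, txt_emb, pos_mask, t=10.0, b=-10.0):
    # img_emb, txt_emb: L2-normalized embeddings of shape (N, d) and (M, d).
    # pos_mask: (N, M) bool, True wherever a pair is a true positive (e.g.
    # near-duplicates or multiple diverse captions), not just the diagonal.
    logits = img_emb @ txt_emb.t() * t + b        # (N, M) pair logits
    labels = pos_mask.float() * 2.0 - 1.0         # +1 positives, -1 negatives
    return -F.logsigmoid(labels * logits).mean()  # independent binary losses

img = F.normalize(torch.randn(4, 16), dim=-1)
txt = F.normalize(torch.randn(4, 16), dim=-1)
mask = torch.eye(4, dtype=torch.bool)
mask[0, 1] = True   # image 0 also matches caption 1: a second true positive
loss = multi_positive_sigmoid_loss(img, txt, mask)
```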
★ Faces that Speak: Jointly Synthesising Talking Face and Speech from Text CVPR 2024
Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung
The goal of this work is to simultaneously generate natural talking faces and
speech outputs from text. We achieve this by integrating Talking Face
Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We
address the main challenges of each task: (1) generating a range of head poses
representative of real-world scenarios, and (2) ensuring voice consistency
despite variations in facial motion for the same identity. To tackle these
issues, we introduce a motion sampler based on conditional flow matching, which
is capable of high-quality motion code generation in an efficient way.
Moreover, we introduce a novel conditioning method for the TTS system, which
utilises motion-removed features from the TFG model to yield uniform speech
outputs. Our extensive experiments demonstrate that our method effectively
creates natural-looking talking faces and speech that accurately match the
input text. To our knowledge, this is the first effort to build a multimodal
synthesis system that can generalise to unseen identities.
comment: CVPR 2024
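To make the motion sampler's training objective concrete, here is a generic conditional flow matching step with linear probability paths: noise and a motion code are interpolated, and a small network regresses the constant path velocity. The network size, conditioning scheme, and names are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MotionVectorField(nn.Module):
    # Tiny stand-in for a motion-code vector field v_theta(x_t, t, cond).
    def __init__(self, dim=64, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 128),
                                 nn.SiLU(), nn.Linear(128, dim))
    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, x1, cond):
    # Conditional flow matching with linear paths: x_t = (1-t) x0 + t x1,
    # target velocity x1 - x0. A generic sketch, not the paper's exact sampler.
    x0 = torch.randn_like(x1)                  # noise sample
    t = torch.rand(x1.size(0), 1)              # uniform time in [0, 1)
    x_t = (1 - t) * x0 + t * x1                # point on the path
    v_target = x1 - x0                         # constant path velocity
    return ((model(x_t, t, cond) - v_target) ** 2).mean()

model = MotionVectorField()
loss = cfm_loss(model, x1=torch.randn(8, 64), cond=torch.randn(8, 16))
```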
★ A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
Charles Raude, K R Prajwal, Liliane Momeni, Hannah Bull, Samuel Albanie, Andrew Zisserman, Gül Varol
In this work, our goals are twofold: large-vocabulary continuous sign
language recognition (CSLR), and sign language retrieval. To this end, we
introduce a multi-task Transformer model, CSLR2, that ingests a signing
sequence and outputs representations in a joint embedding space shared by
signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary
setting, we introduce new dataset annotations that have been manually
collected. These provide continuous sign-level annotations for six hours of
test videos, and will be made publicly available. We demonstrate that by a
careful choice of loss functions, training the model for both the CSLR and
retrieval tasks is mutually beneficial in terms of performance -- retrieval
improves CSLR performance by providing context, while CSLR improves retrieval
with more fine-grained supervision. We further show the benefits of leveraging
weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely
sign-level pseudo-labels, and English subtitles. Our model significantly
outperforms the previous state of the art on both tasks.
★ Two-Phase Dynamics of Interactions Explains the Starting Point of a DNN Learning Over-Fitted Features
This paper investigates the dynamics of a deep neural network (DNN) learning
interactions. Previous studies have discovered and mathematically proven that
given each input sample, a well-trained DNN usually only encodes a small number
of interactions (non-linear relationships) between input variables in the
sample. A series of theorems have been derived to prove that we can consider
the DNN's inference equivalent to using these interactions as primitive
patterns for inference. In this paper, we discover that a DNN learns interactions
in two phases. The first phase mainly penalizes interactions of medium and high
orders, and the second phase mainly learns interactions of gradually increasing
orders. We can consider the two-phase phenomenon as the starting point of a DNN
learning over-fitted features. This phenomenon is widely shared across DNNs
with various architectures trained for different tasks. Therefore, the
discovery of the two-phase dynamics provides a detailed mechanism for how a DNN
gradually learns different inference patterns (interactions). In particular, we
have also verified the claim that high-order interactions have weaker
generalization power than low-order interactions. Thus, the discovered
two-phase dynamics also explains how the generalization power of a DNN changes
during the training process.
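The interaction measure in this line of work is typically the Harsanyi dividend, $I(S)=\sum_{T\subseteq S}(-1)^{|S|-|T|}v(T)$, where $v(T)$ is the model output with only the variables in $T$ present and $|S|$ is the interaction's order. A small sketch under that assumption:

```python
from itertools import chain, combinations

def harsanyi_interaction(v, S):
    # Harsanyi dividend I(S) = sum over T subseteq S of (-1)^(|S|-|T|) v(T);
    # v maps a tuple of variable indices to the model output with only those
    # variables present, and |S| is the order of the interaction.
    subsets = chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))
    return sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets)

# Toy "model": additive weights plus a synergy when variables 0 and 1 co-occur.
w = [1.0, 2.0, 0.5]
v = lambda T: sum(w[i] for i in T) + (0.7 if {0, 1} <= set(T) else 0.0)
print(harsanyi_interaction(v, (0, 1)))      # 0.7: the pairwise synergy
print(harsanyi_interaction(v, (0, 1, 2)))   # 0.0: no third-order interaction
```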
★ Biasing & Debiasing based Approach Towards Fair Knowledge Transfer for Equitable Skin Analysis
Deep learning models, particularly Convolutional Neural Networks (CNNs), have
demonstrated exceptional performance in diagnosing skin diseases, often
outperforming dermatologists. However, they have also unveiled biases linked to
specific demographic traits, notably concerning diverse skin tones or gender,
prompting concerns regarding fairness and limiting their widespread deployment.
Researchers are actively working to ensure fairness in AI-based solutions, but
existing methods incur an accuracy loss when striving for fairness. To solve
this issue, we propose a 'two-biased teachers' (i.e., biased on different
sensitive attributes) based approach to transfer fair knowledge into the
student network. Our approach mitigates biases present in the student network
without harming its predictive accuracy. In fact, in most cases, our approach
improves the accuracy of the baseline model. To achieve this goal, we developed
a weighted loss function comprising biasing and debiasing loss terms. We
surpass available state-of-the-art approaches in attaining fairness while also
improving accuracy at the same time. The proposed approach has been
evaluated and validated on two dermatology datasets using standard accuracy and
fairness evaluation measures. We will make source code publicly available to
foster reproducibility and future research.
★ When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu
As large language models (LLMs) evolve, their integration with 3D spatial
data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for
understanding and interacting with physical spaces. This survey provides a
comprehensive overview of the methodologies enabling LLMs to process,
understand, and generate 3D data. Highlighting the unique advantages of LLMs,
such as in-context learning, step-by-step reasoning, open-vocabulary
capabilities, and extensive world knowledge, we underscore their potential to
significantly advance spatial comprehension and interaction within embodied
Artificial Intelligence (AI) systems. Our investigation spans various 3D data
representations, from point clouds to Neural Radiance Fields (NeRFs). It
examines their integration with LLMs for tasks such as 3D scene understanding,
captioning, question-answering, and dialogue, as well as LLM-based agents for
spatial reasoning, planning, and navigation. The paper also includes a brief
review of other methods that integrate 3D and language. The meta-analysis
presented in this paper reveals significant progress yet underscores the
necessity for novel approaches to harness the full potential of 3D-LLMs. Hence,
with this paper, we aim to chart a course for future research that explores and
expands the capabilities of 3D-LLMs in understanding and interacting with the
complex 3D world. To support this survey, we have established a project page
where papers related to our topic are organized and listed:
https://github.com/ActiveVisionLab/Awesome-LLM-3D.
★ PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology
George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D. Kunz, Juan A. Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, Matthew Hanna, Michal Zelechowski, Julian Viret, Neil Tenenholtz, James Hall, Nicolo Fusi, Razik Yousfi, Peter Hamilton, William A. Moye, Eugene Vorontsov, Siqi Liu, Thomas J. Fuchs
Foundation models in computational pathology promise to unlock the
development of new clinical decision support systems and models for precision
medicine. However, there is a mismatch between most clinical analysis, which is
defined at the level of one or more whole slide images, and foundation models
to date, which process the thousands of image tiles contained in a whole slide
image separately. The requirement to train a network to aggregate information
across a large number of tiles in multiple whole slide images limits these
models' impact. In this work, we present a slide-level foundation model for
H&E-stained histopathology, PRISM, that builds on Virchow tile embeddings and
leverages clinical report text for pre-training. Using the tile embeddings,
PRISM produces slide-level embeddings with the ability to generate clinical
reports, resulting in several modes of use. Using text prompts, PRISM achieves
zero-shot cancer detection and sub-typing performance approaching and
surpassing that of a supervised aggregator model. Using the slide embeddings
with linear classifiers, PRISM surpasses supervised aggregator models.
Furthermore, we demonstrate that fine-tuning of the PRISM slide encoder yields
label-efficient training for biomarker prediction, a task that typically
suffers from low availability of training data; an aggregator initialized with
PRISM and trained on as little as 10% of the training data can outperform a
supervised baseline that uses all of the data.
★ A Foundation Model for Brain Lesion Segmentation with Mixture of Modality Experts MICCAI 2024
Xinru Zhang, Ni Ou, Berke Doga Basaran, Marco Visentin, Mengyun Qiao, Renyang Gu, Cheng Ouyang, Yaou Liu, Paul M. Matthew, Chuyang Ye, Wenjia Bai
Brain lesion segmentation plays an essential role in neurological research
and diagnosis. As brain lesions can be caused by various pathological
alterations, different types of brain lesions tend to manifest with different
characteristics on different imaging modalities. Due to this complexity, brain
lesion segmentation methods are often developed in a task-specific manner. A
specific segmentation model is developed for a particular lesion type and
imaging modality. However, the use of task-specific models requires
predetermination of the lesion type and imaging modality, which complicates
their deployment in real-world scenarios. In this work, we propose a universal
foundation model for 3D brain lesion segmentation, which can automatically
segment different types of brain lesions for input data of various imaging
modalities. We formulate a novel Mixture of Modality Experts (MoME) framework
with multiple expert networks attending to different imaging modalities. A
hierarchical gating network combines the expert predictions and fosters
expertise collaboration. Furthermore, we introduce a curriculum learning
strategy during training to avoid the degeneration of each expert network and
preserve their specialization. We evaluated the proposed method on nine brain
lesion datasets, encompassing five imaging modalities and eight lesion types.
The results show that our model outperforms state-of-the-art universal models
and provides promising generalization to unseen datasets.
comment: The work has been early accepted by MICCAI 2024
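A flattened sketch of the expert-plus-gate idea may help: one expert per imaging modality and a gating network that weights the expert predictions per input. The real MoME uses full 3D segmentation experts and a hierarchical gate trained with a curriculum; the toy layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoMESketch(nn.Module):
    # Minimal mixture-of-modality-experts segmenter: an expert per modality
    # plus a gate that mixes expert predictions per input volume.
    def __init__(self, n_experts=5, ch=1, n_classes=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv3d(ch, n_classes, 3, padding=1) for _ in range(n_experts))
        self.gate = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(ch, n_experts))
    def forward(self, x):                          # x: (B, ch, D, H, W)
        weights = self.gate(x).softmax(dim=-1)     # (B, n_experts)
        preds = torch.stack([e(x) for e in self.experts], dim=1)  # (B,E,C,D,H,W)
        return (weights[:, :, None, None, None, None] * preds).sum(dim=1)

seg = MoMESketch()(torch.randn(2, 1, 8, 16, 16))   # -> (2, 2, 8, 16, 16)
```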
★ Towards Task-Compatible Compressible Representations ICME
We identify an issue in multi-task learnable compression, in which a
representation learned for one task does not positively contribute to the
rate-distortion performance of a different task as much as expected, given the
estimated amount of information available in it. We interpret this issue using
the predictive $\mathcal{V}$-information framework. In learnable scalable
coding, previous work increased the utilization of side-information for input
reconstruction by also rewarding input reconstruction when learning this shared
representation. We evaluate the impact of this idea in the context of input
reconstruction more rigorously and extend it to other computer vision tasks.
We perform experiments using representations trained for object detection on
COCO 2017 and depth estimation on the Cityscapes dataset, and use them to
assist in image reconstruction and semantic segmentation tasks. The results
show considerable improvements in the rate-distortion performance of the
assisted tasks. Moreover, using the proposed representations, the performance
of the base tasks are also improved. Results suggest that the proposed method
induces simpler representations that are more compatible with downstream
processes.
comment: To be published in ICME Workshops 2024
★ DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data CVPR 2024
Instance segmentation is data-hungry, and as model capacity increases, data
scale becomes crucial for improving the accuracy. Most instance segmentation
datasets today require costly manual annotation, limiting their data scale.
Models trained on such data are prone to overfitting on the training set,
especially for those rare categories. While recent works have delved into
exploiting generative models to create synthetic datasets for data
augmentation, these approaches do not efficiently harness the full potential of
generative models.
To address these issues, we introduce a more efficient strategy to construct
generative datasets for data augmentation, termed DiverGen. First, we provide
an explanation of the role of generative data from the perspective of
distribution discrepancy. We investigate the impact of different data on the
distribution learned by the model. We argue that generative data can expand the
data distribution that the model can learn, thus mitigating overfitting.
Additionally, we find that the diversity of generative data is crucial for
improving model performance and enhance it through various strategies,
including category diversity, prompt diversity, and generative model diversity.
With these strategies, we can scale the data to millions while maintaining the
trend of model performance improvement. On the LVIS dataset, DiverGen
significantly outperforms the strong model X-Paste, achieving +1.1 box AP and
+1.1 mask AP across all categories, and +1.9 box AP and +2.5 mask AP for rare
categories.
comment: Accepted to CVPR 2024, code is available at
https://github.com/aim-uofa/DiverGen
★ Filling Missing Values Matters for Range Image-Based Point Cloud Segmentation
Point cloud segmentation (PCS) plays an essential role in robot perception
and navigation tasks. To efficiently understand large-scale outdoor point
clouds, their range image representation is commonly adopted. This image-like
representation is compact and structured, making range image-based PCS models
practical. However, undesirable missing values in the range images damage the
shapes and patterns of objects. This problem creates difficulty for the models
in learning coherent and complete geometric information from the objects.
Consequently, PCS models achieve only inferior performance. Delving deeply
into this issue, we find that unsuitable projection approaches and the
deskewing of scans are the main causes of the unwanted missing values in the
range images. Moreover, almost all previous works neglect to fill in these
missing values in the PCS task. To alleviate this problem, we first propose a
new projection method, namely scan unfolding++ (SU++), to avoid massive missing
values in the generated range images. Then, we introduce a simple yet effective
approach, namely range-dependent $K$-nearest neighbor interpolation ($K$NNI),
to further fill in missing values. Finally, we introduce the Filling Missing
Values Network (FMVNet) and Fast FMVNet. Extensive experimental results on
SemanticKITTI, SemanticPOSS, and nuScenes datasets demonstrate that by
employing the proposed SU++ and $K$NNI, existing range image-based PCS models
consistently achieve better performance than the baseline models. Besides, both
FMVNet and Fast FMVNet achieve state-of-the-art performance in terms of the
speed-accuracy trade-off. The proposed methods can be applied to other range
image-based tasks and practical applications.
comment: This paper has been submitted to a journal
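As a rough illustration of the range-dependent $K$NNI step, the sketch below fills missing range-image pixels from their $K$ nearest valid pixels, but only when those neighbors' ranges agree within a threshold, so interpolation does not bleed across depth boundaries. The validity convention and parameters are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def range_knni(range_img, k=3, max_gap=2.0):
    # Fill missing range-image pixels (value <= 0, an assumed convention)
    # from their k nearest valid pixels in image space, but only when the
    # neighbors' ranges agree within max_gap meters.
    out = range_img.copy()
    valid = range_img > 0
    vy, vx = np.nonzero(valid)
    for y, x in zip(*np.nonzero(~valid)):
        d2 = (vy - y) ** 2 + (vx - x) ** 2            # squared pixel distance
        nearest = np.argsort(d2)[:k]
        ranges = range_img[vy[nearest], vx[nearest]]
        if ranges.max() - ranges.min() < max_gap:     # range-consistent?
            out[y, x] = ranges.mean()
    return out

img = np.array([[5.0, 0.0, 5.2], [5.1, 5.0, 0.0], [0.0, 5.3, 5.1]])
print(range_knni(img))
```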
★ PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning
Remote sensing image-text retrieval constitutes a foundational aspect of
remote sensing interpretation tasks, facilitating the alignment of vision and
language representations. This paper introduces a prior instruction
representation (PIR) learning paradigm that draws on prior knowledge to
instruct adaptive learning of vision and text representations. Based on PIR, a
domain-adapted remote sensing image-text retrieval framework PIR-ITR is
designed to address semantic noise issues in vision-language understanding
tasks. Furthermore, as vision-language foundation models pre-trained on
massive additional data have emerged, remote sensing image-text retrieval has
developed into an open-domain retrieval task. Building on
the above, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote
sensing image-text retrieval, to address semantic noise in remote sensing
vision-language representations and further improve open-domain retrieval
performance. In vision representation, Vision Instruction Representation (VIR)
based on Spatial-PAE utilizes prior knowledge from remote sensing
scene recognition, building a belief matrix that selects key features to
reduce the impact of semantic noise. In text representation, Language Cycle
Attention (LCA) based on Temporal-PAE uses the previous time step to cyclically
activate the current time step to enhance text representation capability. A
cluster-wise Affiliation Loss (AL) is proposed to constrain inter-class
relations and reduce semantic confusion zones in the common subspace.
Comprehensive experiments demonstrate that PIR could enhance vision and text
representations and outperform the state-of-the-art methods of closed-domain
and open-domain retrieval on two benchmark datasets, RSICD and RSITMD.
comment: 15 pages, 9 figures
★ SpecDETR: A Transformer-based Hyperspectral Point Object Detection Network
Hyperspectral target detection (HTD) aims to identify specific materials
based on spectral information in hyperspectral imagery and can detect point
targets, some of which occupy an area smaller than one pixel. However, existing
HTD methods are developed based on per-pixel binary classification, which
limits the feature representation capability for point targets. In this paper,
we rethink the hyperspectral point target detection from the object detection
perspective, and focus more on the object-level prediction capability rather
than the pixel classification capability. Inspired by the token-based
processing flow of Detection Transformer (DETR), we propose the first
specialized network for hyperspectral multi-class point object detection,
SpecDETR. Without the backbone part of the current object detection framework,
SpecDETR treats the spectral features of each pixel in hyperspectral images as
a token and utilizes a multi-layer Transformer encoder with local and global
coordination attention modules to extract deep spatial-spectral joint features.
SpecDETR regards point object detection as a one-to-many set prediction
problem, thereby achieving a concise and efficient DETR decoder that surpasses
the current state-of-the-art DETR decoder in terms of parameters and accuracy
in point object detection. We develop a simulated hyperSpectral Point Object
Detection benchmark termed SPOD, and for the first time, evaluate and compare
the performance of current object detection networks and HTD methods on
hyperspectral multi-class point object detection. SpecDETR demonstrates
superior performance as compared to current object detection networks and HTD
methods on the SPOD dataset. Additionally, we validate on a public HTD dataset
that by using data simulation instead of manual annotation, SpecDETR can detect
real-world single-spectral point objects directly.
★ Libra: Building Decoupled Vision System on Large Language Models ICML2024
In this work, we introduce Libra, a prototype model with a decoupled vision
system on a large language model (LLM). The decoupled vision system decouples
inner-modal modeling and cross-modal interaction, yielding unique visual
information modeling and effective cross-modal comprehension. Libra is trained
through discrete auto-regressive modeling on both vision and language inputs.
Specifically, we incorporate a routed visual expert with a cross-modal bridge
module into a pretrained LLM to route the vision and language flows during
attention computing to enable different attention patterns in inner-modal
modeling and cross-modal interaction scenarios. Experimental results
demonstrate that the dedicated design of Libra achieves a strong MLLM baseline
that rivals existing works in the image-to-text scenario with merely 50 million
training samples, providing a new perspective for future multimodal foundation
models. Code is available at https://github.com/YifanXu74/Libra.
comment: ICML2024
★ Cooperative Visual-LiDAR Extrinsic Calibration Technology for Intersection Vehicle-Infrastructure: A review
In the typical urban intersection scenario, both vehicles and infrastructures
are equipped with visual and LiDAR sensors. By successfully integrating the
data from vehicle-side and road monitoring devices, a more comprehensive and
accurate environmental perception and information acquisition can be achieved.
The calibration of sensors, as an essential component of autonomous driving
technology, has consistently drawn significant attention. Particularly in
scenarios involving multiple sensors collaboratively perceiving and addressing
localization challenges, the requirement for inter-sensor calibration becomes
crucial. Recent years have witnessed the emergence of the concept of multi-end
cooperation, where infrastructure captures and transmits surrounding
environment information to vehicles, bolstering their perception capabilities
while mitigating costs. However, this also poses technical complexities,
underscoring the pressing need for calibration across these diverse ends. Camera and LiDAR,
the bedrock sensors in autonomous driving, exhibit expansive applicability.
This paper comprehensively examines and analyzes the calibration of multi-end
camera-LiDAR setups from vehicle, roadside, and vehicle-road cooperation
perspectives, outlining their relevant applications and profound significance.
We conclude with a summary and present our future-oriented ideas and hypotheses.
★ Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
João Bordalo, Vasco Ramos, Rodrigo Valério, Diogo Glória-Silva, Yonatan Bitton, Michal Yarom, Idan Szpektor, Joao Magalhaes
Multistep instructions, such as recipes and how-to guides, greatly benefit
from visual aids, such as a series of images that accompany the instruction
steps. While Large Language Models (LLMs) have become adept at generating
coherent textual steps, Large Vision/Language Models (LVLMs) are less capable
of generating accompanying image sequences. The most challenging aspect is that
each generated image needs to adhere to the relevant textual step instruction,
as well as be visually consistent with earlier images in the sequence. To
address this problem, we propose an approach for generating consistent image
sequences, which integrates a Latent Diffusion Model (LDM) with an LLM to
transform the sequence into a caption to maintain the semantic coherence of the
sequence. In addition, to maintain the visual coherence of the image sequence,
we introduce a copy mechanism that initialises the reverse diffusion process with a
latent vector from a previously generated image at a relevant step.
Both strategies will condition the reverse diffusion process on the sequence of
instruction steps and tie the contents of the current image to previous
instruction steps and corresponding images. Experiments show that the proposed
approach is preferred by humans in 46.6% of the cases against 26.6% for the
second-best method. In addition, automatic metrics showed that the proposed
method maintains semantic coherence and visual consistency across steps in both
domains.
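The copy mechanism can be pictured as SDEdit-style partial renoising: instead of starting the reverse diffusion from pure noise, it starts from a re-noised latent of a previously generated image, so shared content (characters, backgrounds) carries over between steps. The schedule and names below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def copy_init_latent(prev_latent, strength=0.6, num_steps=50):
    # Re-enter the denoising trajectory partway through: forward-noise the
    # previous image's latent to an intermediate step (cosine schedule assumed)
    # and resume reverse diffusion from there; 'strength' sets the re-entry point.
    start_step = int(num_steps * strength)
    alpha_bar = torch.cos(torch.tensor(start_step / num_steps) * torch.pi / 2) ** 2
    noise = torch.randn_like(prev_latent)
    x_t = alpha_bar.sqrt() * prev_latent + (1 - alpha_bar).sqrt() * noise
    return x_t, start_step                    # resume denoising at start_step

x_t, t0 = copy_init_latent(torch.randn(1, 4, 64, 64))
```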
★ An Integrated Framework for Multi-Granular Explanation of Video Summarization
In this paper, we propose an integrated framework for multi-granular
explanation of video summarization. This framework integrates methods for
producing explanations both at the fragment level (indicating which video
fragments most influenced the summarizer's decisions) and the more
fine-grained visual object level (highlighting which visual objects were the
most influential for the summarizer). To build this framework, we extend our
previous work on this field, by investigating the use of a model-agnostic,
perturbation-based approach for fragment-level explanation of the video
summarization results, and introducing a new method that combines the results
of video panoptic segmentation with an adaptation of a perturbation-based
explanation approach to produce object-level explanations. The performance of
the developed framework is evaluated using a state-of-the-art summarization
method and two datasets for benchmarking video summarization. The findings of
the conducted quantitative and qualitative evaluations demonstrate the ability
of our framework to spot the most and least influential fragments and visual
objects of the video for the summarizer, and to provide a comprehensive set of
visual-based explanations about the output of the summarization process.
comment: Under review
★ HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition MICCAI2024
Natural language could play an important role in developing generalist
surgical models by providing a broad source of supervision from raw texts. This
flexible form of supervision can enable the model's transferability across
datasets and tasks as natural language can be used to reference learned visual
concepts or describe new ones. In this work, we present HecVL, a novel
hierarchical video-language pretraining approach for building a generalist
surgical model. Specifically, we construct a hierarchical video-text paired
dataset by pairing the surgical lecture video with three hierarchical levels of
texts: at clip-level, atomic actions using transcribed audio texts; at
phase-level, conceptual text summaries; and at video-level, overall abstract
text of the surgical procedure. Then, we propose a novel fine-to-coarse
contrastive learning framework that learns separate embedding spaces for the
three video-text hierarchies using a single model. By disentangling embedding
spaces of different hierarchical levels, the learned multi-modal
representations encode short-term and long-term surgical concepts in the same
model. Thanks to the injected textual semantics, we demonstrate that the HecVL
approach can enable zero-shot surgical phase recognition without any human
annotation. Furthermore, we show that the same HecVL model for surgical phase
recognition can be transferred across different surgical procedures and medical
centers.
comment: Accepted by MICCAI2024
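Schematically, the fine-to-coarse objective can be written as one contrastive loss per hierarchy level, each with its own projection head so the three embedding spaces stay disentangled. The sketch below assumes precomputed video and text features per level; head sizes and names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    # Symmetric InfoNCE between matched batches of embeddings.
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / tau
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# One projection head per hierarchy level keeps the embedding spaces separate.
levels = ("clip", "phase", "video")
heads = {lvl: nn.Linear(32, 16) for lvl in levels}
vid = {lvl: torch.randn(4, 32) for lvl in levels}    # video features per level
txt = {lvl: torch.randn(4, 32) for lvl in levels}    # paired text features
loss = sum(info_nce(heads[lvl](vid[lvl]), heads[lvl](txt[lvl])) for lvl in levels)
```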
★ MrRegNet: Multi-resolution Mask Guided Convolutional Neural Network for Medical Image Registration with Large Deformations
Deformable image registration (alignment) is highly sought after in numerous
clinical applications, such as computer aided diagnosis and disease progression
analysis. Deep Convolutional Neural Network (DCNN)-based image registration
methods have demonstrated advantages in terms of registration accuracy and
computational speed. However, while most methods excel at global alignment,
they often perform worse in aligning local regions. To address this challenge,
this paper proposes a mask-guided encoder-decoder DCNN-based image registration
method, named MrRegNet. This approach employs a multi-resolution encoder for
feature extraction and subsequently estimates multi-resolution displacement
fields in the decoder to handle the substantial deformation of images.
Furthermore, segmentation masks are employed to direct the model's attention
toward aligning local regions. The results show that the proposed method
outperforms traditional methods like Demons and a well-known deep learning
method, VoxelMorph, on a public 3D brain MRI dataset (OASIS) and a local 2D
brain MRI dataset with large deformations. Importantly, the image alignment
accuracies are significantly improved at local regions guided by segmentation
masks. GitHub link: https://github.com/ruizhe-l/MrRegNet.
comment: Accepted for publication at IEEE International Symposium on
Biomedical Imaging (ISBI) 2024
★ SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection CVPR 2024
Open-vocabulary object detection (OvOD) has transformed detection into a
language-guided task, empowering users to freely define their class
vocabularies of interest during inference. However, our initial investigation
indicates that existing OvOD detectors exhibit significant variability when
dealing with vocabularies across various semantic granularities, posing a
concern for real-world deployment. To this end, we introduce Semantic Hierarchy
Nexus (SHiNe), a novel classifier that uses semantic knowledge from class
hierarchies. It runs offline in three steps: i) it retrieves relevant
super-/sub-categories from a hierarchy for each target class; ii) it integrates
these categories into hierarchy-aware sentences; iii) it fuses these sentence
embeddings to generate the nexus classifier vector. Our evaluation on various
detection benchmarks demonstrates that SHiNe enhances robustness across diverse
vocabulary granularities, achieving up to +31.9% mAP50 with ground truth
hierarchies, while retaining improvements using hierarchies generated by large
language models. Moreover, when applied to open-vocabulary classification on
ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy.
SHiNe is training-free and can be seamlessly integrated with any off-the-shelf
OvOD detector, without incurring additional computational overhead during
inference. The code is open source.
comment: Accepted as a conference paper (highlight) at CVPR 2024
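The three offline steps translate almost directly into code. Below is a minimal sketch, where embed stands in for a frozen text encoder such as CLIP's; the sentence template and mean fusion are our assumptions.

```python
import torch
import torch.nn.functional as F

def build_nexus_classifier(class_name, hierarchy, embed):
    # (i) retrieve related super-categories for the class from a hierarchy,
    # (ii) wrap them into hierarchy-aware sentences,
    # (iii) fuse the sentence embeddings into one nexus classifier vector.
    related = hierarchy.get(class_name, [])
    sentences = [f"a {class_name}, which is a kind of {sup}" for sup in related]
    sentences.append(f"a photo of a {class_name}")
    vecs = torch.stack([embed(s) for s in sentences])
    return F.normalize(vecs.mean(dim=0), dim=-1)

# Toy frozen "text encoder": hash-seeded random embedding (placeholder only).
def embed(s, dim=64):
    g = torch.Generator().manual_seed(abs(hash(s)) % (2**31))
    return torch.randn(dim, generator=g)

w = build_nexus_classifier("beagle", {"beagle": ["dog", "animal"]}, embed)
```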
★ A Preprocessing and Postprocessing Voxel-based Method for LiDAR Semantic Segmentation Improvement in Long Distance
In recent years, considerable research on LiDAR semantic segmentation has been
conducted, introducing several new state-of-the-art models. However, most
research focuses on single-scan point clouds, omitting time-sequential
information and thereby limiting performance, especially in long-distance
outdoor scenarios. Moreover, varying density and occlusions constitute
significant challenges in single-scan approaches. In this paper, we propose a LiDAR point cloud
preprocessing and postprocessing method. This multi-stage approach, in
conjunction with state of the art models in a multi-scan setting, aims to solve
those challenges. We demonstrate the benefits of our method through
quantitative evaluation with the given models in single-scan settings. In
particular, we achieve significant improvements in mIoU performance of over 5
percentage point in medium range and over 10 percentage point in far range.
This is essential for 3D semantic scene understanding in long distance as well
as for applications where offline processing is permissible.
★ Revealing Hierarchical Structure of Leaf Venations in Plant Science via Label-Efficient Segmentation: Dataset and Method IJCAI2024
Weizhen Liu, Ao Li, Ze Wu, Yue Li, Baobin Ge, Guangyu Lan, Shilin Chen, Minghe Li, Yunfei Liu, Xiaohui Yuan, Nanqing Dong
Hierarchical leaf vein segmentation is a crucial but under-explored task in
agricultural sciences, where analysis of the hierarchical structure of plant
leaf venation can contribute to plant breeding. While current segmentation
techniques rely on data-driven models, there is no publicly available dataset
specifically designed for hierarchical leaf vein segmentation. To address this
gap, we introduce the HierArchical Leaf Vein Segmentation (HALVS) dataset, the
first public hierarchical leaf vein segmentation dataset. HALVS comprises 5,057
real-scanned high-resolution leaf images collected from three plant species:
soybean, sweet cherry, and London planetree. It also includes human-annotated
ground truth for three orders of leaf veins, with a total labeling effort of
83.8 person-days. Based on HALVS, we further develop a label-efficient learning
paradigm that leverages partial label information, i.e. missing annotations for
tertiary veins. Empirical studies are performed on HALVS, revealing new
observations, challenges, and research directions on leaf vein segmentation.
comment: Accepted by IJCAI2024, Code:
https://github.com/WeizhenLiuBioinform/HALVS-Hierarchical-Vein-Segment.git
★ Bilateral Event Mining and Complementary for Event Stream Super-Resolution CVPR2024
Event Stream Super-Resolution (ESR) aims to address the challenge of
insufficient spatial resolution in event streams, which holds great
significance for the application of event cameras in complex scenarios.
Previous works for ESR often process positive and negative events in a mixed
paradigm. This paradigm limits their ability to model the unique
characteristics of each event type and to let the two types mutually refine
each other through their correlations. In this paper, we propose a bilateral event mining and
complementary network (BMCNet) to fully leverage the potential of each event
and capture the shared information to complement each other simultaneously.
Specifically, we resort to a two-stream network to accomplish comprehensive
mining of each type of events individually. To facilitate the exchange of
information between two streams, we propose a bilateral information exchange
(BIE) module. This module is embedded layer-wise between the two streams,
enabling the effective propagation of hierarchical global information while
alleviating the impact of invalid information brought by inherent
characteristics of events. The experimental results demonstrate that our
approach outperforms the previous state-of-the-art methods in ESR, achieving
performance improvements of over 11\% on both real and synthetic datasets.
Moreover, our method significantly enhances the performance of event-based
downstream tasks such as object recognition and video reconstruction. Our code
is available at https://github.com/Lqm26/BMCNet-ESR.
comment: Accepted to CVPR2024
★ RSDehamba: Lightweight Vision Mamba for Remote Sensing Satellite Image Dehazing
Remote sensing image dehazing (RSID) aims to remove nonuniform and physically
irregular haze factors for high-quality image restoration. Since their
emergence, CNNs and Transformers have made extraordinary strides in the RSID
arena. However, these methods often struggle to balance adequate long-range
dependency modeling with computational efficiency. To this end, we propose
RSDehamba, the first lightweight Mamba-based network for RSID. Inspired by the
recent rise of the Selective State Space Model (SSM), with its superior
ability to model long-range dependencies at linear complexity, our RSDehamba integrates
the SSM framework into the U-Net architecture. Specifically, we propose the
Vision Dehamba Block (VDB) as the core component of the overall network, which
utilizes the linear complexity of SSM to achieve the capability of global
context encoding. Simultaneously, the Direction-aware Scan Module (DSM) is
designed to dynamically aggregate feature exchanges over different directional
domains to effectively enhance the flexibility of sensing the spatially varying
distribution of haze. In this way, our RSDehamba fully exploits spatial
long-distance dependencies and channel information exchange for better
extraction of haze features. Extensive experimental results
on widely used benchmarks validate the surpassing performance of our RSDehamba
against existing state-of-the-art methods.
★ Natural Language Can Help Bridge the Sim2Real Gap
The main challenge in learning image-conditioned robotic policies is
acquiring a visual representation conducive to low-level control. Due to the
high dimensionality of the image space, learning a good visual representation
requires a considerable amount of visual data. However, when learning in the
real world, data is expensive. Sim2Real is a promising paradigm for overcoming
data scarcity in the real-world target domain by using a simulator to collect
large amounts of cheap data closely related to the target task. However, it is
difficult to transfer an image-conditioned policy from sim to real when the
domains are very visually dissimilar. To bridge the sim2real visual gap, we
propose using natural language descriptions of images as a unifying signal
across domains that captures the underlying task-relevant semantics. Our key
insight is that if two image observations from different domains are labeled
with similar language, the policy should predict similar action distributions
for both images. We demonstrate that training the image encoder to predict the
language description or the distance between descriptions of a sim or real
image serves as a useful, data-efficient pretraining step that helps learn a
domain-invariant image representation. We can then use this image encoder as
the backbone of an IL policy trained simultaneously on a large amount of
simulated and a handful of real demonstrations. Our approach outperforms widely
used prior sim2real methods and strong vision-language pretraining baselines
like CLIP and R3M by 25 to 40%.
comment: To appear in RSS 2024
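The pretraining idea lends itself to a short sketch: train the image encoder so that its output predicts the (frozen) language embedding of the image's description, pushing sim and real images with similar descriptions toward similar representations. The encoder architecture and embedding sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LangRegPretrainer(nn.Module):
    # Image encoder trained to regress a frozen language embedding of the
    # image's natural-language description (hypothetical sizes).
    def __init__(self, img_dim=512, lang_dim=384):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(),
                                     nn.Linear(3 * 64 * 64, img_dim),
                                     nn.ReLU(), nn.Linear(img_dim, lang_dim))
    def forward(self, images, lang_emb):
        pred = F.normalize(self.encoder(images), dim=-1)
        target = F.normalize(lang_emb, dim=-1)
        # Cosine-distance loss: similar descriptions, whether the image is
        # simulated or real, pull the representations together.
        return 1 - F.cosine_similarity(pred, target).mean()

model = LangRegPretrainer()
loss = model(torch.randn(8, 3, 64, 64), torch.randn(8, 384))
```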
★ Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution
The performance of single image super-resolution depends heavily on how
high-frequency details are generated to complement low-resolution images.
Recently, diffusion-based models exhibit great potential in generating
high-quality images for super-resolution tasks. However, existing models
encounter difficulties in directly predicting high-frequency information of
wide bandwidth by solely utilizing the high-resolution ground truth as the
target for all sampling timesteps. To tackle this problem and achieve
higher-quality super-resolution, we propose a novel Frequency Domain-guided
multiscale Diffusion model (FDDiff), which decomposes the high-frequency
information complementing process into finer-grained steps. In particular, a
wavelet packet-based frequency complement chain is developed to provide
multiscale intermediate targets with increasing bandwidth for reverse diffusion
process. Then FDDiff guides the reverse diffusion process to progressively
complement the missing high-frequency details over timesteps. Moreover, we
design a multiscale frequency refinement network to predict the required
high-frequency components at multiple scales within one unified network.
Comprehensive evaluations on popular benchmarks are conducted, and demonstrate
that FDDiff outperforms prior generative methods with higher-fidelity
super-resolution results.
★ Solving the enigma: Deriving optimal explanations of deep networks
Michail Mamalakis, Antonios Mamalakis, Ingrid Agartz, Lynn Egeland Mørch-Johnsen, Graham Murray, John Suckling, Pietro Lio
The accelerated progress of artificial intelligence (AI) has popularized deep
learning models across domains, yet their inherent opacity poses challenges,
notably in critical fields like healthcare, medicine and the geosciences.
Explainable AI (XAI) has emerged to shed light on these "black box" models,
helping decipher their decision-making process. Nevertheless, different XAI
methods yield highly different explanations. This inter-method variability
increases uncertainty and lowers trust in deep networks' predictions. In this
study, for the first time, we propose a novel framework designed to enhance the
explainability of deep networks, by maximizing both the accuracy and the
comprehensibility of the explanations. Our framework integrates various
explanations from established XAI methods and employs a non-linear "explanation
optimizer" to construct a unique and optimal explanation. Through experiments
on multi-class and binary classification tasks in 2D object and 3D neuroscience
imaging, we validate the efficacy of our approach. Our explanation optimizer
achieved superior faithfulness scores, averaging 155% and 63% higher than the
best-performing XAI method in the 3D and 2D applications, respectively.
Additionally, our approach yielded lower complexity, increasing
comprehensibility. Our results suggest that optimal explanations based on
specific criteria are derivable and address the issue of inter-method
variability in the current XAI literature.
comment: keywords: XAI, neuroscience, brain, 3D, 2D, computer vision,
classification
★ ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset
Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S. Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, Henning Müller, Peter A. Horn, Felix Nensa, Christoph M. Friedrich
Automated medical image analysis systems often require large amounts of
training data with high quality labels, which are difficult and time consuming
to generate. This paper introduces Radiology Object in COntext version 2
(ROCOv2), a multimodal dataset consisting of radiological images and associated
medical concepts and captions extracted from the PMC Open Access subset. It is
an updated version of the ROCO dataset published in 2018, adding 35,705 new
images that have appeared in PMC since 2018. It further provides manually curated concepts
for imaging modalities with additional anatomical and directional concepts for
X-rays. The dataset consists of 79,789 images and has been used, with minor
modifications, in the concept detection and caption prediction tasks of
ImageCLEFmedical Caption 2023. The dataset is suitable for training image
annotation models based on image-caption pairs, or for multi-label image
classification using Unified Medical Language System (UMLS) concepts provided
with each image. In addition, it can serve for pre-training of medical domain
models, and evaluation of deep learning models for multi-task learning.
comment: Major revision at Scientific Data
★ Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance CVPR 2024
Real driving-video dehazing poses a significant challenge due to the inherent
difficulty in acquiring precisely aligned hazy/clear video pairs for effective
model training, especially in dynamic driving scenarios with unpredictable
weather conditions. In this paper, we propose a pioneering approach that
addresses this challenge through a non-aligned regularization strategy. Our core
concept involves identifying clear frames that closely match hazy frames,
serving as references to supervise a video dehazing network. Our approach
comprises two key components: reference matching and video dehazing. Firstly,
we introduce a non-aligned reference frame matching module, leveraging an
adaptive sliding window to match high-quality reference frames from clear
videos. Video dehazing incorporates flow-guided cosine attention sampler and
deformable cosine attention fusion modules to enhance spatial multiframe
alignment and fuse their improved information. To validate our approach, we
collect a GoProHazy dataset captured effortlessly with GoPro cameras in diverse
rural and urban road environments. Extensive experiments demonstrate the
superiority of the proposed method over current state-of-the-art methods in the
challenging task of real driving-video dehazing. Project page.
comment: Accepted by CVPR 2024
★ Histopathology Foundation Models Enable Accurate Ovarian Cancer Subtype Classification
Large pretrained transformers are increasingly being developed as generalised
foundation models which can underpin powerful task-specific artificial
intelligence models. Histopathology foundation models show promise across many
tasks, but analyses have been limited by arbitrary hyperparameters that were
not tuned to the specific task/dataset. We report the most rigorous single-task
validation conducted to date of a histopathology foundation model, and the
first performed in ovarian cancer subtyping. Attention-based multiple instance
learning classifiers were compared using vision transformer and ResNet features
generated through varied preprocessing and pretraining procedures. The training
set consisted of 1864 whole slide images from 434 ovarian carcinoma cases at
Leeds Hospitals. Five-class classification performance was evaluated through
five-fold cross-validation, and these cross-validation models were ensembled
for evaluation on a hold-out test set and an external set from the
Transcanadian study. Reporting followed the TRIPOD+AI checklist. The vision
transformer-based histopathology foundation model, UNI, performed best in every
evaluation, with five-class balanced accuracies of 88% and 93% in hold-out
internal and external testing, compared to the best ResNet model scores of 68%
and 81%, respectively. Normalisations and augmentations aided the
generalisability of ResNet-based models, but these still did not match the
performance of UNI, which gave the best external performance in any ovarian
cancer subtyping study to date. Histopathology foundation models offer a clear
benefit to subtyping, improving classification performance to a degree where
clinical utility is tangible, albeit with an increased computational burden.
Such models could provide a second opinion in challenging cases and may improve
the accuracy, objectivity, and efficiency of pathological diagnoses overall.
★ VirtualModel: Generating Object-ID-retentive Human-object Interaction Image by Diffusion Model for E-commerce Marketing
Due to the significant advances in large-scale text-to-image generation by
diffusion models (DMs), controllable human image generation has been attracting
much attention recently. Although existing works such as ControlNet [36], T2I-Adapter
[20], and HumanSD [10] have demonstrated good abilities in generating human
images based on pose conditions, they still fail to meet the requirements of
real e-commerce scenarios. These include (1) the interaction between the shown
product and human should be considered, (2) human parts like face/hand/arm/foot
and the interaction between human model and product should be hyper-realistic,
and (3) the identity of the product shown in advertising should be exactly
consistent with the product itself. To this end, in this paper, we first define
a new human image generation task for e-commerce marketing, i.e.,
Object-ID-retentive Human-object Interaction image Generation (OHG), and then
propose VirtualModel, a framework for generating human images for product
display, which supports any category of product and any type of
human-object interaction. As shown in Figure 1, VirtualModel not only
outperforms other methods in terms of accurate pose control and image quality
but also allows for the display of user-specified product objects by
maintaining the product-ID consistency and enhancing the plausibility of
human-object interaction. Codes and data will be released.
comment: project page: https://aigcdesigngroup.github.io/replace-anything
★ Adversarial Robustness for Visual Grounding of Multimodal Large Language Models ICLR 2024
Multi-modal Large Language Models (MLLMs) have recently achieved enhanced
performance across various vision-language tasks including visual grounding
capabilities. However, the adversarial robustness of visual grounding remains
unexplored in MLLMs. To fill this gap, we use referring expression
comprehension (REC) as an example task in visual grounding and propose three
adversarial attack paradigms as follows. Firstly, untargeted adversarial
attacks induce MLLMs to generate incorrect bounding boxes for each object.
Besides, exclusive targeted adversarial attacks force all generated outputs to
the same target bounding box. In addition, permuted targeted adversarial
attacks aim to permute all bounding boxes among different objects within a
single image. Extensive experiments demonstrate that the proposed methods can
successfully attack visual grounding capabilities of MLLMs. Our methods not
only provide a new perspective for designing novel attacks but also serve as a
strong baseline for improving the adversarial robustness for visual grounding
of MLLMs.
comment: ICLR 2024 Workshop on Reliable and Responsible Foundation Models
★ Language-Oriented Semantic Latent Representation for Image Transmission SP
Giordano Cicchetti, Eleonora Grassucci, Jihong Park, Jinho Choi, Sergio Barbarossa, Danilo Comminiello
In the new paradigm of semantic communication (SC), the focus is on
delivering meanings behind bits by extracting semantic information from raw
data. Recent advances in data-to-text models facilitate language-oriented SC,
particularly for text-transformed image communication via image-to-text (I2T)
encoding and text-to-image (T2I) decoding. However, although semantically
aligned, the text is too coarse to precisely capture sophisticated visual
features such as spatial locations, color, and texture, incurring a significant
perceptual difference between intended and reconstructed images. To address
this limitation, in this paper, we propose a novel language-oriented SC
framework that communicates both text and a compressed image embedding and
combines them using a latent diffusion model to reconstruct the intended image.
Experimental results validate the potential of our approach, which transmits
only 2.09% of the original image size while achieving higher perceptual
similarities in noisy communication channels compared to a baseline SC method
that communicates only through text. The code is available at
https://github.com/ispamm/Img2Img-SC/ .
comment: Under review at IEEE International Workshop on Machine Learning for
Signal Processing (MLSP) 2024
★ KPNDepth: Depth Estimation of Lane Images under Complex Rainy Environment
With the development of deep neural network generative models in recent
years, significant progress has been made in the research of depth estimation
in lane scenes. However, current research achievements are mainly focused on
clear daytime scenarios. In complex rainy environments, the influence of rain
streaks and local fog effects often leads to erroneous increases in the overall
depth estimation values in images. Moreover, these natural factors can
introduce disturbances to the accurate prediction of depth boundaries in
images. In this paper, we investigate lane depth estimation in complex rainy
environments. Based on the concept of convolutional kernel prediction, we
propose a dual-layer pixel-wise convolutional kernel prediction network trained
on offline data. By predicting two sets of independent convolutional kernels
for the target image, we recover the depth information lost to complex
environmental factors and avoid the rain-streak artifacts generated by a
single convolutional kernel set. Furthermore, considering the lack of real
rainy lane data currently available, we introduce an image synthesis algorithm,
RCFLane, which comprehensively considers the darkening of the environment due
to rainfall and local fog effects. Building on the commonly used depth
estimation dataset KITTI, we create a synthetic dataset of 820 images, which
we refer to as RainKITTI. Extensive experiments demonstrate that our proposed
depth estimation framework achieves favorable results in highly complex lane
rainy environments.
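A minimal sketch of the per-pixel kernel application step described above, assuming
each predicted kernel set stores one k x k filter per pixel; the softmax
normalization and all function names are illustrative assumptions, not the
paper's implementation:

    import torch
    import torch.nn.functional as F

    def apply_pixelwise_kernels(img, kernels, k=5):
        # img: (B, C, H, W); kernels: (B, k*k, H, W), one k x k filter per pixel
        B, C, H, W = img.shape
        patches = F.unfold(img, kernel_size=k, padding=k // 2)    # (B, C*k*k, H*W)
        patches = patches.view(B, C, k * k, H, W)
        weights = kernels.view(B, 1, k * k, H, W).softmax(dim=2)  # assumed normalization
        return (patches * weights).sum(dim=2)                     # filtered image

    def dual_layer_restore(img, kernels_a, kernels_b, k=5):
        # Two independent kernel sets applied in sequence, mirroring the dual-layer
        # design intended to suppress rain-streak artifacts of a single set.
        return apply_pixelwise_kernels(apply_pixelwise_kernels(img, kernels_a, k),
                                       kernels_b, k)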
★ Patient-Specific Real-Time Segmentation in Trackerless Brain Ultrasound MICCAI 2024
Reuben Dorent, Erickson Torio, Nazim Haouchine, Colin Galvin, Sarah Frisken, Alexandra Golby, Tina Kapur, William Wells
Intraoperative ultrasound (iUS) imaging has the potential to improve surgical
outcomes in brain surgery. However, its interpretation is challenging, even for
expert neurosurgeons. In this work, we designed the first patient-specific
framework that performs brain tumor segmentation in trackerless iUS. To
disambiguate ultrasound imaging and adapt to the neurosurgeon's surgical
objective, a patient-specific real-time network is trained using synthetic
ultrasound data generated by simulating virtual iUS sweep acquisitions in
pre-operative MR data. Extensive experiments performed on real ultrasound data
demonstrate the effectiveness of the proposed approach, which adapts to the
surgeon's definition of surgical targets and outperforms
non-patient-specific models, neurosurgeon experts, and high-end tracking
systems. Our code is available at: \url{https://github.com/ReubenDo/MHVAE-Seg}.
comment: Early accept at MICCAI 2024 - code available at:
https://github.com/ReubenDo/MHVAE-Seg
★ Dual-band feature selection for maturity classification of specialty crops by hyperspectral imaging
The maturity classification of specialty crops such as strawberries and
tomatoes is an essential agricultural downstream activity for selective
harvesting and quality control (QC) at production and packaging sites. Recent
advancements in Deep Learning (DL) have produced encouraging results in color
images for maturity classification applications. However, hyperspectral imaging
(HSI) outperforms methods based on color vision. Multivariate analysis methods
and Convolutional Neural Networks (CNNs) deliver promising results; however,
the large amount of input data and the associated preprocessing requirements
hinder practical application. Conventionally, the reflectance intensity
in a given electromagnetic spectrum is employed in estimating fruit maturity.
We present a feature extraction method and empirically demonstrate that four
features, namely the peak reflectance within the 500-670 nm pigment band and
its wavelength, together with the trough reflectance within the 671-790 nm
chlorophyll band and its wavelength, are convenient to compute yet distinctive
for maturity classification. The proposed feature selection is beneficial
because preprocessing such as dimensionality reduction is avoided before every
prediction; the feature set is designed to capture exactly these traits (see
the sketch below). The best SOTA methods, among 3D-CNN, 1D-CNN, and SVM,
achieve at most 90.0% accuracy for strawberries and 92.0% for tomatoes on our
dataset. Results show that the proposed method outperforms the SOTA as it
yields an accuracy above 98.0% in strawberry and 96.0% in tomato
classification. A comparative analysis of the time efficiency of these methods
is also conducted, which shows the proposed method performs prediction at 13
Frames Per Second (FPS) compared to the maximum of 1.16 FPS attained by the
full-spectrum SVM classifier.
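A minimal sketch of the four dual-band features described in the abstract,
assuming cube is an (H, W, bands) reflectance cube and wl holds the band
wavelengths in nm; averaging the spectrum over pixels is an illustrative choice:

    import numpy as np

    def dual_band_features(cube, wl):
        spectrum = cube.reshape(-1, cube.shape[-1]).mean(axis=0)  # mean reflectance per band
        pig = (wl >= 500) & (wl <= 670)   # pigment band
        chl = (wl > 670) & (wl <= 790)    # chlorophyll band
        i_peak = np.argmax(spectrum[pig])
        i_trough = np.argmin(spectrum[chl])
        return np.array([
            spectrum[pig][i_peak], wl[pig][i_peak],       # peak reflectance, wavelength
            spectrum[chl][i_trough], wl[chl][i_trough],   # trough reflectance, wavelength
        ])

These four scalars can feed any lightweight classifier, which is what makes the
reported 13 FPS prediction rate plausible compared to full-spectrum models.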
★ FPDIoU Loss: A Loss Function for Efficient Bounding Box Regression of Rotated Object Detection
Bounding box regression is one of the important steps of object detection.
However, rotation detectors often involve a more complicated loss based on
SkewIoU which is unfriendly to gradient-based training. Most of the existing
loss functions for rotated object detection calculate the difference between
two bounding boxes only focus on the deviation of area or each points distance
(e.g., $\mathcal{L}_{Smooth-\ell 1}$, $\mathcal{L}_{RotatedIoU}$ and
$\mathcal{L}_{PIoU}$). The calculation process of some loss functions is
extremely complex (e.g., $\mathcal{L}_{KFIoU}$). In order to improve the
efficiency and accuracy of bounding box regression for rotated object
detection, we propose a novel metric for arbitrary-shape comparison based on
minimum points distance, which takes most of the factors from existing loss
functions for rotated object detection into account, i.e., the overlap or
nonoverlapping area, the central points distance and the rotation angle. We
also propose a loss function called $\mathcal{L}_{FPDIoU}$ based on
four-point distance for accurate bounding box regression, focusing on fast
convergence and high-quality anchor boxes. In the experiments, $FPDIoU$ loss has been applied
to state-of-the-art rotated object detection (e.g., RTMDET, H2RBox) models
training on three popular rotated object detection benchmarks, DOTA, DIOR,
and HRSC2016, and two arbitrary-orientation scene text detection benchmarks,
ICDAR 2017 RRC-MLT and ICDAR 2019 RRC-MLT, achieving better performance than
existing loss functions.
comment: arXiv admin note: text overlap with arXiv:2307.07662, text overlap
with arXiv:1902.09630 by other authors
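A hedged sketch of a four-point-distance IoU-style loss for rotated boxes; the
exact normalization in the paper is not given here, so the corner deviation is
normalized by the squared image diagonal, and the rotated IoU is assumed to
come from any off-the-shelf SkewIoU routine:

    import torch

    def fpdiou_loss(pred_corners, gt_corners, iou, img_w, img_h):
        # pred_corners, gt_corners: (N, 4, 2) corners in matching order
        # iou: (N,) rotated IoU from an external SkewIoU implementation
        d2 = ((pred_corners - gt_corners) ** 2).sum(dim=-1)  # (N, 4) squared corner distances
        norm = img_w ** 2 + img_h ** 2                       # squared image diagonal
        fpdiou = iou - d2.sum(dim=-1) / (4 * norm)           # penalize corner deviation
        return (1 - fpdiou).mean()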
★ Detecting Domain Shift in Multiple Instance Learning for Digital Pathology Using Fréchet Domain Distance
Multiple-instance learning (MIL) is an attractive approach for digital
pathology applications as it reduces the costs related to data collection and
labelling. However, it is not clear how sensitive MIL is to clinically
realistic domain shifts, i.e., differences in data distribution that could
negatively affect performance, and if already existing metrics for detecting
domain shifts work well with these algorithms. We trained an attention-based
MIL algorithm to classify whether a whole-slide image of a lymph node contains
breast tumour metastases. The algorithm was evaluated on data from a hospital
in a different country and various subsets of this data that correspond to
different levels of domain shift. Our contributions include showing that MIL
for digital pathology is affected by clinically realistic differences in data,
evaluating which features from a MIL model are most suitable for detecting
changes in performance, and proposing an unsupervised metric named Fr\'echet
Domain Distance (FDD) for quantifying domain shifts. Shift-measure
performance was evaluated through the mean Pearson correlation with the change
in classification performance, where FDD achieved 0.70 on 10-fold
cross-validation models. The baselines included Deep ensemble, Difference of
Confidence, and Representation shift, which resulted in 0.45, -0.29, and 0.56 mean Pearson
correlation, respectively. FDD could be a valuable tool for care providers and
vendors who need to verify if a MIL system is likely to perform reliably when
implemented at a new site, without requiring any additional annotations from
pathologists.
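Assuming FDD takes the usual Frechet (FID-style) form between Gaussians fitted
to MIL features from the two domains, a minimal sketch looks as follows; which
features of the MIL model to use is exactly what the paper evaluates:

    import numpy as np
    from scipy import linalg

    def frechet_domain_distance(feats_a, feats_b):
        # feats_*: (n_samples, dim) feature vectors from the MIL model
        mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
        cov_a = np.cov(feats_a, rowvar=False)
        cov_b = np.cov(feats_b, rowvar=False)
        covmean = linalg.sqrtm(cov_a @ cov_b)
        if np.iscomplexobj(covmean):      # numerical noise can add tiny imaginary parts
            covmean = covmean.real
        diff = mu_a - mu_b
        return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)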
★ MiniMaxAD: A Lightweight Autoencoder for Feature-Rich Anomaly Detection
Previous unsupervised anomaly detection (UAD) methods often struggle with
significant intra-class diversity, i.e., a class in a dataset contains multiple
subclasses; we categorize such datasets as Feature-Rich Anomaly Detection
Datasets (FRADs). This is evident in applications such as the unified setting
and unmanned supermarket scenarios. To address this challenge, we developed MiniMaxAD: a
lightweight autoencoder designed to efficiently compress and memorize extensive
information from normal images. Our model utilizes a large kernel convolutional
network equipped with a Global Response Normalization (GRN) unit and employs a
multi-scale feature reconstruction strategy. The GRN unit significantly
increases the upper limit of the network's capacity, while the large kernel
convolution facilitates the extraction of highly abstract patterns, leading to
compact normal feature modeling. Additionally, we introduce an Adaptive
Contraction Loss (ADCLoss), tailored to FRADs to overcome the limitations of
global cosine distance loss. MiniMaxAD was comprehensively tested across six
challenging UAD benchmarks, achieving state-of-the-art results in four and
highly competitive outcomes in the remaining two. Notably, our model achieved a
detection AUROC of up to 97.0% in VisA under the unified setting. Moreover, it
not only achieved state-of-the-art performance in unmanned supermarket tasks
but also exhibited an inference speed 37 times faster than the previous best
method, demonstrating its effectiveness in complex UAD tasks.
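For reference, a Global Response Normalization unit in the ConvNeXt-V2 style,
which is presumably close to the GRN the abstract refers to; the exact variant
used in MiniMaxAD may differ:

    import torch
    import torch.nn as nn

    class GRN(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.gamma = nn.Parameter(torch.zeros(1, dim, 1, 1))
            self.beta = nn.Parameter(torch.zeros(1, dim, 1, 1))

        def forward(self, x):                                  # x: (B, C, H, W)
            gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)  # global response per channel
            nx = gx / (gx.mean(dim=1, keepdim=True) + 1e-6)    # divisive feature competition
            return self.gamma * (x * nx) + self.beta + x       # residual formulation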
★ Learning from Observer Gaze:Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition CVPR2024
Most existing attention prediction research focuses on salient instances like
humans and objects. However, the more complex interaction-oriented attention,
arising from the comprehension of interactions between instances by human
observers, remains largely unexplored. This is equally crucial for advancing
human-machine interaction and human-centered artificial intelligence. To bridge
this gap, we first collect a novel gaze fixation dataset named IG, comprising
530,000 fixation points across 740 diverse interaction categories, capturing
visual attention during human observers' cognitive processing of interactions.
Subsequently, we introduce the zero-shot interaction-oriented attention
prediction task ZeroIA, which challenges models to predict visual cues for
interactions not encountered during training. Thirdly, we present the
Interactive Attention model IA, designed to emulate human observers' cognitive
processes to tackle the ZeroIA problem. Extensive experiments demonstrate that
the proposed IA outperforms other state-of-the-art approaches in both ZeroIA
and fully supervised settings. Lastly, we endeavor to apply
interaction-oriented attention to the interaction recognition task itself.
Further experimental results demonstrate the promising potential to enhance the
performance and interpretability of existing state-of-the-art HOI models by
incorporating real human attention data from IG and attention labels generated
by IA.
comment: Accepted by CVPR2024. Project HomePage:
https://yuchen2199.github.io/Interactive-Gaze/
★ Infrared Adversarial Car Stickers CVPR 2024
Infrared physical adversarial examples are of great significance for studying
the security of infrared AI systems that are widely used in our lives such as
autonomous driving. Previous infrared physical attacks mainly focused on 2D
infrared pedestrian detection, which may not fully manifest their destructiveness
to AI systems. In this work, we propose a physical attack method against
infrared detectors based on 3D modeling, which is applied to a real car. The
goal is to design a set of infrared adversarial stickers to make cars invisible
to infrared detectors at various viewing angles, distances, and scenes. We
build a 3D infrared car model with real infrared characteristics and propose an
infrared adversarial pattern generation method based on 3D mesh shadow. We
propose a 3D control points-based mesh smoothing algorithm and use a set of
smoothness loss functions to enhance the smoothness of adversarial meshes and
facilitate the sticker implementation. In addition, we fabricated the aluminum
stickers and conducted physical experiments on two real Mercedes-Benz A200L
cars. Our adversarial stickers hid the cars from Faster RCNN, an object
detector, at various viewing angles, distances, and scenes. The attack success
rate (ASR) was 91.49% for real cars. In comparison, the ASRs of random stickers
and no sticker were only 6.21% and 0.66%, respectively. In addition, the ASRs
of the designed stickers against six unseen object detectors such as YOLOv3 and
Deformable DETR ranged from 73.35% to 95.80%, showing good transferability of the
attack performance across detectors.
comment: Accepted by CVPR 2024
★ NTIRE 2024 Restore Any Image Model (RAIM) in the Wild Challenge
Jie Liang, Radu Timofte, Qiaosi Yi, Shuaizheng Liu, Lingchen Sun, Rongyuan Wu, Xindong Zhang, Hui Zeng, Lei Zhang
In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model
(RAIM) in the Wild. The RAIM challenge constructed a benchmark for image
restoration in the wild, including real-world images with/without reference
ground truth in various scenarios from real applications. The participants were
required to restore the real-captured images from complex and unknown
degradation, where generative perceptual quality and fidelity are desired in
the restoration result. The challenge consisted of two tasks. Task one employed
real referenced data pairs, where quantitative evaluation is available. Task
two used unpaired images, and a comprehensive user study was conducted. The
challenge attracted more than 200 registrations, 39 of which submitted
results totaling more than 400 submissions. Top-ranked methods improved the
state-of-the-art restoration performance and obtained unanimous recognition
from all 18 judges. The proposed datasets are available at
https://drive.google.com/file/d/1DqbxUoiUqkAIkExu3jZAqoElr_nu1IXb/view?usp=sharing
and the homepage of this challenge is at
https://codalab.lisn.upsaclay.fr/competitions/17632.
★ Cross-sensor self-supervised training and alignment for remote sensing
Large-scale "foundation models" have gained traction as a way to leverage the
vast amounts of unlabeled remote sensing data collected every day. However, due
to the multiplicity of Earth Observation satellites, these models should learn
"sensor agnostic" representations, that generalize across sensor
characteristics with minimal fine-tuning. This is complicated by data
availability, as low-resolution imagery, such as Sentinel-2 and Landsat-8 data,
are available in large amounts, while very high-resolution aerial or satellite
data is less common. To tackle these challenges, we introduce cross-sensor
self-supervised training and alignment for remote sensing (X-STARS). We design
a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD),
to align representations across sensors, even with vastly different
resolutions. Our X-STARS can be applied to train models from scratch, or to
adapt large models pretrained on, e.g., low-resolution EO data to new
high-resolution sensors in a continual pretraining framework. We collect and
release MSC-France, a new multi-sensor dataset, on which we train our X-STARS
models; we then evaluate them on seven downstream classification and segmentation
tasks. We demonstrate that X-STARS outperforms the state-of-the-art by a
significant margin with less data across various conditions of data
availability and resolutions.
★ Unveiling the Potential: Harnessing Deep Metric Learning to Circumvent Video Streaming Encryption
Encryption on the internet with the shift to HTTPS has been an important step
to improve the privacy of internet users. However, there is an increasing body
of work about extracting information from encrypted internet traffic without
having to decrypt it. Such attacks bypass security guarantees assumed to be
given by HTTPS and thus need to be understood. Prior works showed that the
variable bitrates of video streams are sufficient to identify which video
someone is watching. These works generally have to make trade-offs in aspects
such as accuracy, scalability, robustness, etc. These trade-offs complicate the
practical use of these attacks. To that end, we propose a deep metric learning
framework based on the triplet loss method. Through this framework, we achieve
robust, generalisable, scalable and transferable encrypted video stream
detection. First, the triplet loss is better able to deal with video streams
not seen during training. Second, our approach can accurately classify videos
not seen during training. Third, we show that our method scales well to a
dataset of over 1000 videos. Finally, we show that a model trained on video
streams over Chrome can also classify streams over Firefox. Our results suggest
that this side-channel attack is more broadly applicable than originally
thought. We provide our code alongside a diverse and up-to-date dataset for
future research.
comment: Published in the WI-IAT 2023 proceedings
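A minimal sketch of the triplet setup for encrypted-stream fingerprinting: the
anchor and positive are traffic traces of the same video, the negative a trace
of a different one; the tiny 1D-CNN and the margin value are illustrative
assumptions:

    import torch
    import torch.nn as nn

    embed = nn.Sequential(               # toy embedding over bitrate time series
        nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 64),
    )
    triplet = nn.TripletMarginLoss(margin=1.0)

    anchor = torch.randn(8, 1, 256)      # (batch, channel, time) bitrate traces
    positive = torch.randn_like(anchor)  # same videos, different captures
    negative = torch.randn_like(anchor)  # different videos
    loss = triplet(embed(anchor), embed(positive), embed(negative))
    loss.backward()

At test time, a stream is matched to the video whose reference embedding lies
closest, which is why videos unseen during training can still be separated.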
★ RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception
Xiaosu Zhu, Hualian Sheng, Sijia Cai, Bing Deng, Shaopeng Yang, Qiao Liang, Ken Chen, Lianli Gao, Jingkuan Song, Jieping Ye
We introduce RoScenes, the largest multi-view roadside perception dataset,
which aims to shed light on the development of vision-centric Bird's Eye View
(BEV) approaches for more challenging traffic scenes. The highlights of
RoScenes include significantly large perception area, full scene coverage and
crowded traffic. More specifically, our dataset provides a remarkable 21.13M 3D
annotations within 64,000 $m^2$. To reduce the expensive cost of roadside 3D
labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently
collect such a large volume of data. After that, we organize a comprehensive
study for current BEV methods on RoScenes in terms of effectiveness and
efficiency. Tested methods suffer from the vast perception area and variation
of sensor layout across scenes, resulting in performance levels falling below
expectations. To this end, we propose RoBEV that incorporates feature-guided
position embedding for effective 2D-3D feature assignment. With its help, our
method outperforms the state-of-the-art by a large margin without extra
computational overhead on the validation set. Our dataset and devkit will be made
available at \url{https://github.com/xiaosu-zhu/RoScenes}.
comment: Technical report. 32 pages, 21 figures, 13 tables.
https://github.com/xiaosu-zhu/RoScenes
★ DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
With the rapid development of face recognition (FR) systems, the privacy of
face images on social media is facing severe challenges due to the abuse of
unauthorized FR systems. Some studies utilize adversarial attack techniques to
defend against malicious FR systems by generating adversarial examples.
However, the generated adversarial examples, i.e., the protected face images,
tend to suffer from subpar visual quality and low transferability. In this
paper, we propose a novel face protection approach, dubbed DiffAM, which
leverages the powerful generative ability of diffusion models to generate
high-quality protected face images with adversarial makeup transferred from
reference images. To be specific, we first introduce a makeup removal module to
generate non-makeup images utilizing a fine-tuned diffusion model with guidance
of textual prompts in CLIP space. As the inverse process of makeup transfer,
makeup removal can make it easier to establish the deterministic relationship
between makeup domain and non-makeup domain regardless of elaborate text
prompts. Then, with this relationship, a CLIP-based makeup loss along with an
ensemble attack strategy is introduced to jointly guide the direction of
adversarial makeup domain, achieving the generation of protected face images
with natural-looking makeup and high black-box transferability. Extensive
experiments demonstrate that DiffAM achieves higher visual quality and attack
success rates, with a gain of 12.98% under the black-box setting compared with the
state of the art. The code will be available at
https://github.com/HansSunY/DiffAM.
comment: 16 pages, 11 figures
★ Deep Learning-Based Quasi-Conformal Surface Registration for Partial 3D Faces Applied to Facial Recognition
3D face registration is an important process in which a 3D face model is
aligned and mapped to a template face. However, the task of 3D face
registration becomes particularly challenging when dealing with partial face
data, where only limited facial information is available. To address this
challenge, this paper presents a novel deep learning-based approach that
combines quasi-conformal geometry with deep neural networks for partial face
registration. The proposed framework begins with a Landmark Detection Network
that utilizes curvature information to detect the presence of facial features
and estimate their corresponding coordinates. These facial landmark features
serve as essential guidance for the registration process. To establish a dense
correspondence between the partial face and the template surface, a
registration network based on quasi-conformal theories is employed. The
registration network establishes a bijective quasi-conformal surface mapping
aligning corresponding partial faces based on detected landmarks and curvature
values. It consists of the Coefficients Prediction Network, which outputs the
optimal Beltrami coefficient representing the surface mapping. The Beltrami
coefficient quantifies the local geometric distortion of the mapping. By
controlling the magnitude of the Beltrami coefficient through a suitable
activation function, the bijectivity and geometric distortion of the mapping
can be controlled. The Beltrami coefficient is then fed into the Beltrami
solver network to reconstruct the corresponding mapping. The surface
registration enables the acquisition of corresponding regions and the
establishment of point-wise correspondence between different partial faces,
facilitating precise shape comparison through the evaluation of point-wise
geometric differences at these corresponding regions. Experimental results
demonstrate the effectiveness of the proposed method.
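A minimal sketch of one way to bound the Beltrami coefficient's magnitude with
an activation so the mapping stays bijective (|mu| < 1); the specific squashing
function used in the paper may differ:

    import torch

    def bounded_beltrami(raw, eps=1e-2):
        # raw: (..., 2) unconstrained output, real and imaginary parts of mu
        mag = torch.linalg.vector_norm(raw, dim=-1, keepdim=True)
        scale = (1.0 - eps) * torch.tanh(mag) / (mag + 1e-8)  # shrink magnitude below 1
        return raw * scale                                    # |mu| <= 1 - eps everywhere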
★ Generative Unlearning for Any Identity CVPR 2024
Recent advances in generative models trained on large-scale datasets have
made it possible to synthesize high-quality samples across various domains.
Moreover, the emergence of strong inversion networks enables not only a
reconstruction of real-world images but also the modification of attributes
through various editing methods. However, in certain domains related to privacy
issues, e.g., human faces, advanced generative models along with strong
inversion methods can lead to potential misuses. In this paper, we propose an
essential yet under-explored task called generative identity unlearning, which
steers the model away from generating images of a specific identity. In
generative identity unlearning, we target the following objectives: (i)
preventing the generation of images with a certain identity, and (ii)
preserving the overall quality of the generative model. To satisfy these goals,
we propose a novel framework, Generative Unlearning for Any Identity (GUIDE),
which prevents the reconstruction of a specific identity by unlearning the
generator with only a single image. GUIDE consists of two parts: (i) finding a
target point for optimization that un-identifies the source latent code and
(ii) novel loss functions that facilitate the unlearning procedure while minimally
affecting the learned distribution. Our extensive experiments demonstrate that
our proposed method achieves state-of-the-art performance in the generative
machine unlearning task. The code is available at
https://github.com/KHU-AGI/GUIDE.
comment: 15 pages, 17 figures, 10 tables, CVPR 2024 Poster
★ Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion
Xinyang Li, Zhangyu Lai, Linning Xu, Jianfei Guo, Liujuan Cao, Shengchuan Zhang, Bo Dai, Rongrong Ji
We present Dual3D, a novel text-to-3D generation framework that generates
high-quality 3D assets from texts in only $1$ minute. The key component is a
dual-mode multi-view latent diffusion model. Given the noisy multi-view
latents, the 2D mode can efficiently denoise them with a single latent
denoising network, while the 3D mode can generate a tri-plane neural surface
for consistent rendering-based denoising. Most modules for both modes are tuned
from a pre-trained text-to-image latent diffusion model to circumvent the
expensive cost of training from scratch. To overcome the high rendering cost
during inference, we propose a dual-mode toggling inference strategy that uses the
3D mode for only $1/10$ of the denoising steps, successfully generating a 3D asset in
just $10$ seconds without sacrificing quality. The texture of the 3D asset can
be further enhanced by our efficient texture refinement process in a short
time. Extensive experiments demonstrate that our method delivers
state-of-the-art performance while significantly reducing generation time. Our
project page is available at https://dual3d.github.io
comment: Project Page: https://dual3d.github.io
★ IRSRMamba: Infrared Image Super-Resolution via Mamba-based Wavelet Transform Feature Modulation Model
Infrared (IR) image super-resolution faces challenges from homogeneous
background pixel distributions and sparse target regions, requiring models that
effectively handle long-range dependencies and capture detailed local-global
information. Recent advancements in Mamba-based models, built on selective
structured state space models, have shown significant
potential in visual tasks, suggesting their applicability to IR enhancement.
In this work, we introduce IRSRMamba: Infrared Image Super-Resolution via
Mamba-based Wavelet Transform Feature Modulation Model, a novel Mamba-based
model designed specifically for IR image super-resolution. This model enhances
the restoration of context-sparse target details through its advanced
dependency modeling capabilities. Additionally, a new wavelet transform feature
modulation block improves multi-scale receptive field representation, capturing
both global and local information efficiently. Comprehensive evaluations
confirm that IRSRMamba outperforms existing models on multiple benchmarks. This
research advances IR super-resolution and demonstrates the potential of
Mamba-based models in IR image processing. Code is available at
\url{https://github.com/yongsongH/IRSRMamba}.
comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
★ Solar multi-object multi-frame blind deconvolution with a spatially variant convolution neural emulator
The study of astronomical phenomena through ground-based observations is
always challenged by the distorting effects of Earth's atmosphere. Traditional
methods of post-facto image correction, essential for correcting these
distortions, often rely on simplifying assumptions that limit their
effectiveness, particularly in the presence of spatially variant atmospheric
turbulence. Such cases are often solved by partitioning the field-of-view into
small patches, deconvolving each patch independently, and merging all patches
together. This approach is often inefficient and can produce artifacts. Recent
advancements in computational techniques and the advent of deep learning offer
new pathways to address these limitations. This paper introduces a novel
framework leveraging a deep neural network to emulate spatially variant
convolutions, offering a breakthrough in the efficiency and accuracy of
astronomical image deconvolution. By training on a dataset of images convolved
with spatially invariant point spread functions and validating its
generalizability to spatially variant conditions, this approach presents a
significant advancement over traditional methods. The convolution emulator is
used as a forward model in a multi-object multi-frame blind deconvolution
algorithm for solar images. The emulator enables the deconvolution of solar
observations across large fields of view without resorting to patch-wise
mosaicking, thus avoiding artifacts associated with such techniques. This
method represents a significant computational advantage, reducing processing
times by orders of magnitude.
comment: 15 pages, 14 figures, accepted for publication in A&A
★ Box-Free Model Watermarks Are Prone to Black-Box Removal Attacks
Box-free model watermarking is an emerging technique to safeguard the
intellectual property of deep learning models, particularly those for low-level
image processing tasks. Existing works have verified and improved its
effectiveness in several aspects. However, in this paper, we reveal that
box-free model watermarking is prone to removal attacks, even under a
real-world threat model in which the protected model and the watermark
extractor are black boxes. Under this setting, we carry out three studies.
1) We develop an extractor-gradient-guided (EGG) remover and show its
effectiveness when the extractor uses ReLU activation only. 2) More generally,
for an unknown extractor, we leverage adversarial attacks and design the EGG
remover based on the estimated gradients. 3) Under the most stringent condition
that the extractor is inaccessible, we design a transferable remover based on a
set of private proxy models. In all cases, the proposed removers can
successfully remove embedded watermarks while preserving the quality of the
processed images, and we also demonstrate that the EGG remover can even replace
the watermarks. Extensive experimental results verify the effectiveness and
generalizability of the proposed attacks, revealing the vulnerabilities of the
existing box-free methods and calling for further research.
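A hedged sketch of the extractor-gradient-guided idea in its white-box form:
the image is optimized so the extractor's response approaches a blank
watermark while staying close to the original; loss weights and step counts
are illustrative, and the black-box variants replace the true gradient with
estimates:

    import torch

    def egg_remove(image, extractor, steps=200, lr=1e-2, lam=0.1):
        x = image.clone().requires_grad_(True)
        blank = torch.zeros_like(extractor(image))  # target: no watermark response
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            wm_loss = ((extractor(x) - blank) ** 2).mean()  # suppress the watermark
            fid_loss = ((x - image) ** 2).mean()            # preserve image quality
            (wm_loss + lam * fid_loss).backward()           # gradients via the extractor
            opt.step()
        return x.detach()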
★ Towards Realistic Incremental Scenario in Class Incremental Semantic Segmentation
This paper addresses an unrealistic aspect of the commonly adopted
Class Incremental Semantic Segmentation (CISS) scenario, termed
overlapped. We point out that overlapped allows the same image to reappear in
future tasks with different pixel labels, which is far from practical
incremental learning scenarios. Moreover, we identified that this flawed
scenario may lead to biased results for two commonly used techniques in CISS,
pseudo-labeling and exemplar memory, resulting in unintended advantages or
disadvantages for certain techniques. To mitigate this, a practical scenario
called partitioned is proposed, in which the dataset is first divided into
distinct subsets representing each class, and then the subsets are assigned to
each corresponding task. This efficiently addresses the issue above while
meeting the requirements of the CISS scenario, such as capturing background
shifts. Furthermore, we identify and address code implementation issues
related to retrieving data from the exemplar memory, which was ignored in
previous works. Lastly, we introduce a simple yet competitive memory-based
baseline, MiB-AugM, that handles background shifts of current tasks in the
exemplar memory. This baseline achieves state-of-the-art results across
multiple tasks involving learning numerous new classes.
★ Region of Interest Detection in Melanocytic Skin Tumor Whole Slide Images -- Nevus & Melanoma NeurIPS 2022
Automated region of interest detection in histopathological image analysis is
a challenging and important topic with tremendous potential impact on clinical
practice. The deep-learning methods used in computational pathology may help us
to reduce costs and increase the speed and accuracy of cancer diagnosis. We
started with the UNC Melanocytic Tumor Dataset cohort that contains 160
hematoxylin and eosin whole-slide images of primary melanomas (86) and nevi
(74). We randomly assigned 80% (134) as a training set and built an in-house
deep-learning method to allow for classification, at the slide level, of nevi
and melanomas. The proposed method performed well on the remaining 20% (26
slides) test set: the accuracy of the slide classification task was 92.3%, and
the model also predicted the regions of interest annotated by the pathologists
well, demonstrating strong performance on melanocytic
skin tumors. Even though we tested the experiments on the skin tumor dataset,
our work could also be extended to other medical image detection problems to
benefit the clinical evaluation and diagnosis of different tumors.
comment: 5 figures, NeurIPS 2022 Workshop
★ PillarNeXt: Improving the 3D detector by introducing Voxel2Pillar feature encoding and extracting multi-scale features
Multi-line LiDAR is widely used in autonomous vehicles, so point cloud-based
3D detectors are essential for autonomous driving. Extracting rich multi-scale
features is crucial for point cloud-based 3D detectors in autonomous driving
due to significant differences in the size of different types of objects.
However, due to the real-time requirements, large-size convolution kernels are
rarely used to extract large-scale features in the backbone. Current 3D
detectors commonly use feature pyramid networks to obtain large-scale features;
however, objects containing few points are further lost during
downsampling, resulting in degraded performance. Since pillar-based schemes
require much less computation than voxel-based schemes, they are more suitable
for constructing real-time 3D detectors. Hence, we propose PillarNeXt, a
pillar-based scheme. We redesigned the feature encoding, the backbone, and the
neck of the 3D detector. We propose Voxel2Pillar feature encoding, which uses a
sparse convolution constructor to construct pillars with richer point cloud
features, especially height features. Moreover, additional learnable parameters
are added, enabling the initial pillars to achieve higher representational
capacity. We extract multi-scale and large-scale features in the proposed
fully sparse backbone, which does not utilize large-size convolutional kernels;
the backbone consists of the proposed multi-scale feature extraction module.
The neck consists of the proposed sparse ConvNeXt, whose simple structure
significantly improves the performance. The effectiveness of the proposed
PillarNeXt is validated on the Waymo Open Dataset, and object detection
accuracy for vehicles, pedestrians, and cyclists is improved; we also verify
the effectiveness of each proposed module in detail.
★ Parallel Backpropagation for Shared-Feature Visualization
Alexander Lappe, Anna Bognár, Ghazaleh Ghamkhari Nejad, Albert Mukovskiy, Lucas Martini, Martin A. Giese, Rufin Vogels
High-level visual brain regions contain subareas in which neurons appear to
respond more strongly to examples of a particular semantic category, like faces
or bodies, rather than objects. However, recent work has shown that while this
finding holds on average, some out-of-category stimuli also activate neurons in
these regions. This may be due to visual features common among the preferred
class also being present in other images. Here, we propose a
deep-learning-based approach for visualizing these features. For each neuron,
we identify relevant visual features driving its selectivity by modelling
responses to images based on latent activations of a deep neural network. Given
an out-of-category image which strongly activates the neuron, our method first
identifies a reference image from the preferred category yielding a similar
feature activation pattern. We then backpropagate latent activations of both
images to the pixel level, while enhancing the identified shared dimensions and
attenuating non-shared features. The procedure highlights image regions
containing shared features driving responses of the model neuron. We apply the
algorithm to novel recordings from body-selective regions in macaque IT cortex
in order to understand why some images of objects excite these neurons.
Visualizations reveal object parts which resemble parts of a macaque body,
shedding light on neural preference of these objects.
★ Densely Distilling Cumulative Knowledge for Continual Learning
Continual learning, involving sequential training on diverse tasks, often
faces catastrophic forgetting. While knowledge distillation-based approaches
exhibit notable success in preventing forgetting, we pinpoint a limitation in
their ability to distill the cumulative knowledge of all the previous tasks. To
remedy this, we propose Dense Knowledge Distillation (DKD). DKD uses a task
pool to track the model's capabilities. It partitions the output logits of the
model into dense groups, each corresponding to a task in the task pool. It then
distills all tasks' knowledge using all groups. However, since using all the groups
can be computationally expensive, we also suggest random group selection in
each optimization step. Moreover, we propose an adaptive weighting scheme,
which balances the learning of new classes and the retention of old classes,
based on the count and similarity of the classes. Our DKD outperforms recent
state-of-the-art baselines across diverse benchmarks and scenarios. Empirical
analysis underscores DKD's ability to enhance model stability, promote flatter
minima for improved generalization, and remain robust across various memory
budgets and task orders. Moreover, it seamlessly integrates with other CL
methods to boost performance and proves versatile in offline scenarios like
model compression.
comment: 12 pages; Continual Learning; Class-incremental Learning; Knowledge
Distillation; Forgetting
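A minimal sketch of the group-wise distillation term, assuming one logit group
per task in the task pool; the temperature and the random-subset option mirror
the abstract, while the adaptive weighting scheme is omitted:

    import random
    import torch.nn.functional as F

    def dkd_loss(student_logits, teacher_logits, task_groups, T=2.0, n_sampled=None):
        # task_groups: list of index tensors, one per task in the task pool
        groups = random.sample(task_groups, n_sampled) if n_sampled else task_groups
        loss = 0.0
        for idx in groups:
            s = F.log_softmax(student_logits[:, idx] / T, dim=1)
            t = F.softmax(teacher_logits[:, idx] / T, dim=1)
            loss = loss + F.kl_div(s, t, reduction="batchmean") * T * T
        return loss / len(groups)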
★ Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis
In this work, we present Semantic Gesticulator, a novel framework designed to
synthesize realistic gestures accompanying speech with strong semantic
correspondence. Semantically meaningful gestures are crucial for effective
non-verbal communication, but such gestures often fall within the long tail of
the distribution of natural human motion. The sparsity of these movements makes
it challenging for deep learning-based systems, trained on moderately sized
datasets, to capture the relationship between the movements and the
corresponding speech semantics. To address this challenge, we develop a
generative retrieval framework based on a large language model. This framework
efficiently retrieves suitable semantic gesture candidates from a motion
library in response to the input speech. To construct this motion library, we
summarize a comprehensive list of commonly used semantic gestures based on
findings in linguistics, and we collect a high-quality motion dataset
encompassing both body and hand movements. We also design a novel GPT-based
model with strong generalization capabilities to audio, capable of generating
high-quality gestures that match the rhythm of speech. Furthermore, we propose
a semantic alignment mechanism to efficiently align the retrieved semantic
gestures with the GPT's output, ensuring the naturalness of the final
animation. Our system demonstrates robustness in generating gestures that are
rhythmically coherent and semantically explicit, as evidenced by a
comprehensive collection of examples. User studies confirm the quality and
human-likeness of our results, and show that our system outperforms
state-of-the-art systems in terms of semantic appropriateness by a clear
margin.
comment: 17 pages
★ MediSyn: Text-Guided Diffusion Models for Broad Medical 2D and 3D Image Synthesis
Diffusion models have recently gained significant traction due to their
ability to generate high-fidelity and diverse images and videos conditioned on
text prompts. In medicine, this application promises to address the critical
challenge of data scarcity, a consequence of barriers in data sharing,
stringent patient privacy regulations, and disparities in patient population
and demographics. By generating realistic and varying medical 2D and 3D images,
these models offer a rich, privacy-respecting resource for algorithmic training
and research. To this end, we introduce MediSyn, a pair of instruction-tuned
text-guided latent diffusion models with the ability to generate high-fidelity
and diverse medical 2D and 3D images across specialties and modalities. Through
established metrics, we show significant improvement in broad medical image and
video synthesis guided by text prompts.
★ Many-Shot In-Context Learning in Multimodal Foundation Models
Large language models are well-known to be effective at few-shot in-context
learning (ICL). Recent advancements in multimodal foundation models have
enabled unprecedentedly long context windows, presenting an opportunity to
explore their capability to perform ICL with many more demonstrating examples.
In this work, we evaluate the performance of multimodal foundation models
scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro
across 10 datasets spanning multiple domains (natural imagery, medical imagery,
remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and
fine-grained classification). We observe that many-shot ICL, including up to
almost 2,000 multimodal demonstrating examples, leads to substantial
improvements compared to few-shot (<100 examples) ICL across all of the
datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly
up to the maximum number of tested examples on many datasets. Given the high
inference costs associated with the long prompts required for many-shot ICL, we
also explore the impact of batching multiple queries in a single API call. We
show that batching up to 50 queries can lead to performance improvements under
zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on
multiple datasets, while drastically reducing per-query cost and latency.
Finally, we measure ICL data efficiency of the models, or the rate at which the
models learn from more demonstrating examples. We find that while GPT-4o and
Gemini 1.5 Pro achieve similar zero-shot performance across the datasets,
Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most
datasets. Our results suggest that many-shot ICL could enable users to
efficiently adapt multimodal foundation models to new applications and domains.
Our codebase is publicly available at
https://github.com/stanfordmlgroup/ManyICL.
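A sketch of how multiple queries can be packed into a single many-shot prompt;
the wording and <image:...> placeholders are illustrative, and the images
would be attached through whichever multimodal chat API is in use:

    def build_batched_prompt(demos, queries):
        # demos: list of (image_ref, label); queries: list of image_refs
        lines = ["Classify each image. Examples:"]
        for i, (img, label) in enumerate(demos, 1):
            lines.append(f"Example {i}: <image:{img}> -> {label}")
        lines.append("Now answer for each query, one label per line:")
        for j, img in enumerate(queries, 1):
            lines.append(f"Query {j}: <image:{img}>")
        return "\n".join(lines)

    prompt = build_batched_prompt(
        [("demo1.png", "benign"), ("demo2.png", "malignant")],
        ["q1.png", "q2.png", "q3.png"],
    )

Amortizing the long demonstration prefix over up to 50 queries is what drives
the reported per-query cost and latency reductions.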
★ LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation IJCAI'2024
Due to spatial redundancy in remote sensing images, sparse tokens containing
rich information are usually used in self-attention (SA) to reduce the
overall number of tokens in the computation, avoiding the high computational
cost of Vision Transformers. However, such methods usually obtain sparse
tokens through hand-crafted or parallel-unfriendly designs, making it challenging
to reach a better balance between efficiency and performance. Different from them,
this paper proposes to use learnable meta tokens to formulate sparse tokens,
which effectively learn key information meanwhile improving the inference
speed. Technically, the meta tokens are first initialized from image tokens via
cross-attention. Then, we propose Dual Cross-Attention (DCA) to promote
information exchange between image tokens and meta tokens, where they serve as
query and key (value) tokens alternatively in a dual-branch structure,
significantly reducing the computational complexity compared to self-attention.
By employing DCA in the early stages with dense visual tokens, we obtain the
hierarchical architecture LeMeViT with various sizes. Experimental results in
classification and dense prediction tasks show that LeMeViT has a significant
$1.7 \times$ speedup, fewer parameters, and competitive performance compared to
the baseline models, and achieves a better trade-off between efficiency and
performance.
comment: Accepted by IJCAI'2024. The code is available at
https://github.com/ViTAE-Transformer/LeMeViT
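A minimal sketch of one Dual Cross-Attention block: meta tokens attend to the
image tokens, then image tokens attend back to the updated meta tokens, so
attention cost grows linearly with the image-token count; layer sizes are
illustrative, not the paper's configuration:

    import torch
    import torch.nn as nn

    class DualCrossAttention(nn.Module):
        def __init__(self, dim, heads=4):
            super().__init__()
            self.meta_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.img_from_meta = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_tokens, meta_tokens):
            # Branch 1: meta tokens query the image tokens (key/value)
            meta, _ = self.meta_from_img(meta_tokens, img_tokens, img_tokens)
            # Branch 2: image tokens query the updated meta tokens
            img, _ = self.img_from_meta(img_tokens, meta, meta)
            return img + img_tokens, meta + meta_tokens  # residual connections

    x = torch.randn(2, 1024, 192)   # (batch, image tokens, dim)
    m = torch.randn(2, 16, 192)     # (batch, meta tokens, dim)
    img_out, meta_out = DualCrossAttention(192)(x, m)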
★ Analysis of the BraTS 2023 Intracranial Meningioma Segmentation Challenge MICCAI
Dominic LaBella, Ujjwal Baid, Omaditya Khanna, Shan McBurney-Lin, Ryan McLean, Pierre Nedelec, Arif Rashid, Nourel Hoda Tahon, Talissa Altes, Radhika Bhalerao, Yaseen Dhemesh, Devon Godfrey, Fathi Hilal, Scott Floyd, Anastasia Janas, Anahita Fathi Kazerooni, John Kirkpatrick, Collin Kent, Florian Kofler, Kevin Leu, Nazanin Maleki, Bjoern Menze, Maxence Pajot, Zachary J. Reitman, Jeffrey D. Rudie, Rachit Saluja, Yury Velichko, Chunhao Wang, Pranav Warman, Maruf Adewole, Jake Albrecht, Udunna Anazodo, Syed Muhammad Anwar, Timothy Bergquist, Sully Francis Chen, Verena Chung, Gian-Marco Conte, Farouk Dako, James Eddy, Ivan Ezhov, Nastaran Khalili, Juan Eugenio Iglesias, Zhifan Jiang, Elaine Johanson, Koen Van Leemput, Hongwei Bran Li, Marius George Linguraru, Xinyang Liu, Aria Mahtabfar, Zeke Meier, Ahmed W. Moawad, John Mongan, Marie Piraud, Russell Takeshi Shinohara, Walter F. Wiggins, Aly H. Abayazeed, Rachel Akinola, András Jakab, Michel Bilello, Maria Correia de Verdier, Priscila Crivellaro, Christos Davatzikos, Keyvan Farahani, John Freymann, Christopher Hess, Raymond Huang, Philipp Lohmann, Mana Moassefi, Matthew W. Pease, Phillipp Vollmuth, Nico Sollmann, David Diffley, Khanak K. Nandolia, Daniel I. Warren, Ali Hussain, Pascal Fehringer, Yulia Bronstein, Lisa Deptula, Evan G. Stein, Mahsa Taherzadeh, Eduardo Portela de Oliveira, Aoife Haughey, Marinos Kontzialis, Luca Saba, Benjamin Turner, Melanie M. T. Brüßeler, Shehbaz Ansari, Athanasios Gkampenis, David Maximilian Weiss, Aya Mansour, Islam H. Shawali, Nikolay Yordanov, Joel M. Stein, Roula Hourani, Mohammed Yahya Moshebah, Ahmed Magdy Abouelatta, Tanvir Rizvi, Klara Willms, Dann C. Martin, Abdullah Okar, Gennaro D'Anna, Ahmed Taha, Yasaman Sharifi, Shahriar Faghani, Dominic Kite, Marco Pinho, Muhammad Ammar Haider, Alejandro Aristizabal, Alexandros Karargyris, Hasan Kassem, Sarthak Pati, Micah Sheller, Michelle Alonso-Basanta, Javier Villanueva-Meyer, Andreas M. Rauschecker, Ayman Nada, Mariam Aboian, Adam E. Flanders, Benedikt Wiestler, Spyridon Bakas, Evan Calabrese
We describe the design and results from the BraTS 2023 Intracranial
Meningioma Segmentation Challenge. The BraTS Meningioma Challenge differed from
prior BraTS Glioma challenges in that it focused on meningiomas, which are
typically benign extra-axial tumors with diverse radiologic and anatomical
presentation and a propensity for multiplicity. Nine participating teams each
developed deep-learning automated segmentation models using image data from the
largest multi-institutional systematically expert annotated multilabel
multi-sequence meningioma MRI dataset to date, which included 1000 training set
cases, 141 validation set cases, and 283 hidden test set cases. Each case
included T2, T2/FLAIR, T1, and T1Gd brain MRI sequences with associated tumor
compartment labels delineating enhancing tumor, non-enhancing tumor, and
surrounding non-enhancing T2/FLAIR hyperintensity. Participant automated
segmentation models were evaluated and ranked based on a scoring system
evaluating lesion-wise metrics including dice similarity coefficient (DSC) and
95% Hausdorff Distance. The top ranked team had a lesion-wise median dice
similarity coefficient (DSC) of 0.976, 0.976, and 0.964 for enhancing tumor,
tumor core, and whole tumor, respectively and a corresponding average DSC of
0.899, 0.904, and 0.871, respectively. These results serve as state-of-the-art
benchmarks for future pre-operative meningioma automated segmentation
algorithms. Additionally, we found that 1286 of 1424 cases (90.3%) had at least
1 compartment voxel abutting the edge of the skull-stripped image, which
warrants further investigation into optimal pre-processing and face anonymization
steps.
comment: 16 pages, 11 tables, 10 figures, MICCAI
★ Size-invariance Matters: Rethinking Metrics and Losses for Imbalanced Multi-object Salient Object Detection ICML2024
This paper explores the size-invariance of evaluation metrics in Salient
Object Detection (SOD), especially when multiple targets of diverse sizes
co-exist in the same image. We observe that current metrics are size-sensitive:
larger objects receive most of the attention, and smaller ones tend to be ignored. We argue
that the evaluation should be size-invariant because bias based on size is
unjustified without additional semantic information. In pursuit of this, we
propose a generic approach that evaluates each salient object separately and
then combines the results, effectively alleviating the imbalance. We further
develop an optimization framework tailored to this goal, achieving considerable
improvements in detecting objects of different sizes. Theoretically, we provide
evidence supporting the validity of our new metrics and present the
generalization analysis of SOD. Extensive experiments demonstrate the
effectiveness of our method. The code is available at
https://github.com/Ferry-Li/SI-SOD.
comment: This paper has been accepted by ICML2024
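One simple instantiation of the "evaluate each object separately, then
combine" idea, using per-object Dice within each ground-truth object's
bounding box; the paper's actual size-invariant metrics differ in detail:

    import numpy as np
    from scipy import ndimage

    def size_invariant_score(pred, gt):
        # pred, gt: binary (H, W) saliency masks
        labels, n = ndimage.label(gt)
        if n == 0:
            return float(pred.sum() == 0)
        scores = []
        for k, sl in enumerate(ndimage.find_objects(labels), start=1):
            p, g = pred[sl], labels[sl] == k     # crop to the object's bounding box
            inter = np.logical_and(p, g).sum()
            denom = p.sum() + g.sum()
            scores.append(2.0 * inter / denom if denom else 1.0)
        return float(np.mean(scores))            # every object weighs equally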
★ Rethinking Barely-Supervised Segmentation from an Unsupervised Domain Adaptation Perspective
This paper investigates an extremely challenging problem, barely-supervised
medical image segmentation (BSS), where the training dataset comprises limited
labeled data with only single-slice annotations and numerous unlabeled images.
Currently, state-of-the-art (SOTA) BSS methods utilize a registration-based
paradigm, depending on image registration to propagate single-slice annotations
into volumetric pseudo labels for constructing a complete labeled set. However,
this paradigm has a critical limitation: the pseudo labels generated by image
registration are unreliable and noisy. Motivated by this, we propose a new
perspective: training a model using only single-annotated slices as the labeled
set without relying on image registration. To this end, we formulate BSS as an
unsupervised domain adaptation (UDA) problem. Specifically, we first design a
novel noise-free labeled data construction algorithm (NFC) for slice-to-volume
labeled data synthesis, which may result in a side effect: domain shifts
between the synthesized images and the original images. Then, a frequency and
spatial mix-up strategy (FSX) is further introduced to mitigate the domain
shifts for UDA. Extensive experiments demonstrate that our method provides a
promising alternative for BSS. Remarkably, the proposed method with only one
labeled slice achieves an 80.77% dice score on left atrial segmentation,
outperforming the SOTA by 61.28%. The code will be released upon the
publication of this paper.
★ Collision Avoidance Metric for 3D Camera Evaluation
3D cameras have emerged as a critical source of information for applications
in robotics and autonomous driving. These cameras provide robots with the
ability to capture and utilize point clouds, enabling them to navigate their
surroundings and avoid collisions with other objects. However, current standard
camera evaluation metrics often fail to consider the specific application
context. These metrics typically focus on measures like Chamfer distance (CD)
or Earth Mover's Distance (EMD), which may not directly translate to
performance in real-world scenarios. To address this limitation, we propose a
novel metric for point cloud evaluation, specifically designed to assess the
suitability of 3D cameras for the critical task of collision avoidance. This
metric incorporates application-specific considerations and provides a more
accurate measure of a camera's effectiveness in ensuring safe robot navigation.
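The paper's exact metric is not given in the abstract. A hypothetical collision-aware score in the same spirit might ask how many nearby ground-truth obstacle points the camera fails to report; all names and thresholds below are illustrative assumptions, not the proposed metric:

```python
import numpy as np
from scipy.spatial import cKDTree

def collision_miss_rate(measured, reference, safety_dist=1.0, tol=0.05):
    """Of the reference obstacle points within the robot's safety
    distance, the fraction with no measured point within `tol` --
    missing nearby points is what actually causes collisions."""
    near = reference[np.linalg.norm(reference, axis=1) < safety_dist]
    if near.shape[0] == 0:
        return 0.0
    dists, _ = cKDTree(measured).query(near, k=1)
    return float((dists > tol).mean())
```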
♻ ★ Neural Collapse Meets Differential Privacy: Curious Behaviors of NoisyGD with Near-perfect Representation Learning ICML 2024
A recent study by De et al. (2022) has reported that large-scale
representation learning through pre-training on a public dataset significantly
enhances differentially private (DP) learning in downstream tasks, despite the
high dimensionality of the feature space. To theoretically explain this
phenomenon, we consider the setting of a layer-peeled model in representation
learning, which results in interesting phenomena related to learned features in
deep learning and transfer learning, known as Neural Collapse (NC).
Within the framework of NC, we establish an error bound indicating that the
misclassification error is independent of dimension when the distance between
actual features and the ideal ones is smaller than a threshold. Additionally,
the quality of the features in the last layer is empirically evaluated under
different pre-trained models within the framework of NC, showing that a more
powerful transformer leads to a better feature representation. Furthermore, we
reveal that DP fine-tuning is less robust compared to fine-tuning without DP,
particularly in the presence of perturbations. These observations are supported
by both theoretical analyses and experimental evaluation. Moreover, to enhance
the robustness of DP fine-tuning, we suggest several strategies, such as
feature normalization or employing dimension reduction methods like Principal
Component Analysis (PCA). Empirically, we demonstrate a significant improvement
in testing accuracy by conducting PCA on the last-layer features.
comment: To appear in ICML 2024
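The PCA mitigation suggested above is standard; a minimal sketch, assuming the last-layer features have been extracted as a NumPy array (standard PCA via SVD, not the paper's code):

```python
import numpy as np

def pca_project(feats, k=64):
    """Project last-layer features onto their top-k principal components
    so that DP noise is added in a much lower-dimensional space."""
    mu = feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(feats - mu, full_matrices=False)
    w = vt[:k].T                       # (d, k) principal directions
    return (feats - mu) @ w            # (n, k) projected features
```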
♻ ★ Exploring Graph-based Knowledge: Multi-Level Feature Distillation via Channels Relational Graph
In visual tasks, large teacher models capture essential features and deep
information, enhancing performance. However, distilling this information into
smaller student models often leads to performance loss due to structural
differences and capacity limitations. To tackle this, we propose a distillation
framework based on graph knowledge, including a multi-level feature alignment
strategy and an attention-guided mechanism to provide a targeted learning
trajectory for the student model. We emphasize spectral embedding (SE) as a key
technique in our distillation process, which merges the student's feature space
with relational knowledge and structural complexities similar to those of the
teacher network. This method captures the teacher's understanding in a
graph-based representation, enabling the student model to more accurately mimic
the complex structural dependencies present in the teacher model. Compared to
methods that focus only on specific distillation areas, our strategy not only
considers key features within the teacher model but also endeavors to capture
the relationships and interactions among feature sets, encoding these complex
pieces of information into a graph structure to understand and utilize the
dynamic relationships among these pieces of information from a global
perspective. Experiments show that our method outperforms previous feature
distillation methods on the CIFAR-100, MS-COCO, and Pascal VOC datasets,
proving its efficiency and applicability.
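One plausible reading of the channels relational graph, offered as a hedged sketch rather than the authors' method: treat channels as graph nodes with cosine-similarity edges and distill by matching the student's graph to the teacher's. This assumes matching channel counts; otherwise a 1x1 projection would be needed:

```python
import torch
import torch.nn.functional as F

def channel_relation_graph(feat):
    """Channels-as-nodes relational graph: edge weights are cosine
    similarities between flattened channel responses."""
    b, c, h, w = feat.shape
    v = F.normalize(feat.reshape(b, c, h * w), dim=2)
    return v @ v.transpose(1, 2)              # (b, c, c) adjacency

def graph_distill_loss(f_student, f_teacher):
    # match the student's channel-relation graph to the teacher's
    return F.mse_loss(channel_relation_graph(f_student),
                      channel_relation_graph(f_teacher))
```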
♻ ★ Global-Local Image Perceptual Score (GLIPS): Evaluating Photorealistic Quality of AI-Generated Images
This paper introduces the Global-Local Image Perceptual Score (GLIPS), an
image metric designed to assess the photorealistic image quality of
AI-generated images with a high degree of alignment to human visual perception.
Traditional metrics such as FID and KID scores do not align closely with human
evaluations. The proposed metric incorporates advanced transformer-based
attention mechanisms to assess local similarity and Maximum Mean Discrepancy
(MMD) to evaluate global distributional similarity. To evaluate the performance
of GLIPS, we conducted a human study on photorealistic image quality.
Comprehensive tests across various generative models demonstrate that GLIPS
consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms
of correlation with human scores. Additionally, we introduce the Interpolative
Binning Scale (IBS), a refined scaling method that enhances the
interpretability of metric scores by aligning them more closely with human
evaluative standards. The proposed metric and scaling approach not only
provide more reliable assessments of AI-generated images but also suggest
pathways for future enhancements in image generation technologies.
comment: 10 pages, 3 figures. Submitted to IEEE Transactions on Human-Machine
Systems
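The MMD component referenced above has a standard form; below is the usual (biased) RBF-kernel estimator of squared MMD between two sets of image features, as a self-contained sketch rather than the GLIPS code:

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimator of squared Maximum Mean Discrepancy with an RBF
    kernel between two feature sets x (m, d) and y (n, d)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    m, n = x.shape[0], y.shape[0]
    return (k(x, x).sum() / (m * m)
            + k(y, y).sum() / (n * n)
            - 2 * k(x, y).sum() / (m * n))
```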
♻ ★ MMFusion: Multi-modality Diffusion Model for Lymph Node Metastasis Diagnosis in Esophageal Cancer MICCAI 2024
Esophageal cancer is one of the most common types of cancer worldwide and
ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis
of cancer progression can help physicians effectively customize personalized
treatment plans. Currently, CT-based cancer diagnosis methods have received
much attention for their comprehensive ability to examine patients' conditions.
However, multi-modal methods are likely to introduce information redundancy,
leading to underperformance. In addition, efficient and effective interactions
between multi-modal representations remain underexplored, and the prognostic
correlations within multi-modality features lack insightful investigation. In
this work, we introduce a multi-modal heterogeneous graph-based conditional
feature-guided diffusion model for lymph node metastasis diagnosis based on CT
images as well as clinical measurements and radiomics data. To explore the
intricate relationships between multi-modal features, we construct a
heterogeneous graph. Following this, a conditional feature-guided diffusion
approach is applied to eliminate information redundancy. Moreover, we propose a
masked relational representation learning strategy, aiming to uncover the
latent prognostic correlations and priorities of primary tumor and lymph node
image representations. Various experimental results validate the effectiveness
of our proposed method. The code is available at
https://github.com/wuchengyu123/MMFusion.
comment: Early accepted to MICCAI 2024 (6/6/5)
♻ ★ Common Corruptions for Enhancing and Evaluating Robustness in Air-to-Air Visual Object Detection
Anastasios Arsenos, Vasileios Karampinis, Evangelos Petrongonas, Christos Skliros, Dimitrios Kollias, Stefanos Kollias, Athanasios Voulodimos
The main barrier to achieving fully autonomous flights lies in autonomous
aircraft navigation. Managing non-cooperative traffic presents the most
important challenge in this problem. The most efficient strategy for handling
non-cooperative traffic is based on monocular video processing through deep
learning models. This study contributes to the vision-based deep learning
aircraft detection and tracking literature by investigating the impact of data
corruption arising from environmental and hardware conditions on the
effectiveness of these methods. More specifically, we designed $7$ types of
common corruptions for camera inputs taking into account real-world flight
conditions. By applying these corruptions to the Airborne Object Tracking (AOT)
dataset we constructed the first robustness benchmark dataset named AOT-C for
air-to-air aerial object detection. The corruptions included in this dataset
cover a wide range of challenging conditions such as adverse weather and sensor
noise. The second main contribution of this letter is to present an extensive
experimental evaluation involving $8$ diverse object detectors to explore the
degradation in the performance under escalating levels of corruptions (domain
shifts). Based on the evaluation results, the key observations that emerge are
the following: 1) One-stage detectors of the YOLO family demonstrate better
robustness, 2) Transformer-based and multi-stage detectors like Faster R-CNN
are extremely vulnerable to corruptions, 3) Robustness against corruptions is
related to the generalization ability of models. The third main contribution is
to show that fine-tuning on our augmented synthetic data improves the
generalisation ability of the object detector in real-world flight experiments.
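The seven corruption types are not enumerated in the abstract. The sketch below shows the common pattern such robustness benchmarks follow, a corruption applied at escalating severity levels; the sigma table is an illustrative assumption:

```python
import numpy as np

def gaussian_noise(img, severity=1):
    """Additive Gaussian noise at five escalating severity levels
    (the sigma table is an illustrative assumption)."""
    sigma = [0.04, 0.08, 0.12, 0.18, 0.26][severity - 1]
    x = img.astype(np.float32) / 255.0
    x = x + np.random.normal(0.0, sigma, x.shape)
    return (np.clip(x, 0.0, 1.0) * 255).astype(np.uint8)
```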
♻ ★ Ensuring UAV Safety: A Vision-only and Real-time Framework for Collision Avoidance Through Object Detection, Tracking, and Distance Estimation
Vasileios Karampinis, Anastasios Arsenos, Orfeas Filippopoulos, Evangelos Petrongonas, Christos Skliros, Dimitrios Kollias, Stefanos Kollias, Athanasios Voulodimos
In the last twenty years, unmanned aerial vehicles (UAVs) have garnered
growing interest due to their expanding applications in both military and
civilian domains. Detecting non-cooperative aerial vehicles with efficiency and
estimating collisions accurately are pivotal for achieving fully autonomous
aircraft and facilitating Advanced Air Mobility (AAM). This paper presents a
deep-learning framework that utilizes optical sensors for the detection,
tracking, and distance estimation of non-cooperative aerial vehicles. In
implementing this comprehensive sensing framework, the availability of depth
information is essential for enabling autonomous aerial vehicles to perceive
and navigate around obstacles. In this work, we propose a method for estimating
the distance information of a detected aerial object in real time using only
the input of a monocular camera. In order to train our deep learning components
for the object detection, tracking and depth estimation tasks we utilize the
Amazon Airborne Object Tracking (AOT) Dataset. In contrast to previous
approaches that integrate the depth estimation module into the object detector,
our method formulates the problem as image-to-image translation. We employ a
separate lightweight encoder-decoder network for efficient and robust depth
estimation. In a nutshell, the object detection module identifies and localizes
obstacles, conveying this information to both the tracking module for
monitoring obstacle movement and the depth estimation module for calculating
distances. Our approach is evaluated on the Airborne Object Tracking (AOT)
dataset which is the largest (to the best of our knowledge) air-to-air airborne
object dataset.
comment: accepted at ICUAS 2024
♻ ★ Bridging the Gap: Protocol Towards Fair and Consistent Affect Analysis
The increasing integration of machine learning algorithms in daily life
underscores the critical need for fairness and equity in their deployment. As
these technologies play a pivotal role in decision-making, addressing biases
across diverse subpopulation groups, including age, gender, and race, becomes
paramount. Automatic affect analysis, at the intersection of physiology,
psychology, and machine learning, has seen significant development. However,
existing databases and methodologies lack uniformity, leading to biased
evaluations. This work addresses these issues by analyzing six affective
databases, annotating demographic attributes, and proposing a common protocol
for database partitioning. Emphasis is placed on fairness in evaluations.
Extensive experiments with baseline and state-of-the-art methods demonstrate
the impact of these changes, revealing the inadequacy of prior assessments. The
findings underscore the importance of considering demographic attributes in
affect analysis research and provide a foundation for more equitable
methodologies. Our annotations, code and pre-trained models are available at:
https://github.com/dkollias/Fair-Consistent-Affect-Analysis
comment: accepted at IEEE FG 2024
♻ ★ MaterialSeg3D: Segmenting Dense Materials from 2D Priors for 3D Assets
Zeyu Li, Ruitong Gan, Chuanchen Luo, Yuxi Wang, Jiaheng Liu, Ziwei Zhu, Man Zhang, Qing Li, Xucheng Yin, Zhaoxiang Zhang, Junran Peng
Driven by powerful image diffusion models, recent research has achieved the
automatic creation of 3D objects from textual or visual guidance. By performing
score distillation sampling (SDS) iteratively across different views, these
methods succeed in lifting the 2D generative prior to 3D space. However, such a
2D generative image prior bakes the effect of illumination and shadow into the
texture. As a result, material maps optimized by SDS inevitably involve
spuriously correlated components. The absence of a precise material definition
makes it infeasible to relight the generated assets reasonably in novel scenes,
which limits their application in downstream scenarios. In contrast, humans can
effortlessly circumvent this ambiguity by deducing the material of the object
from its appearance and semantics. Motivated by this insight, we propose
MaterialSeg3D, a 3D asset material generation framework to infer underlying
material from the 2D semantic prior. Based on such a prior model, we devise a
mechanism to parse material in 3D space. We maintain a UV stack, each map of
which is unprojected from a specific viewpoint. After traversing all
viewpoints, we fuse the stack through a weighted voting scheme and then employ
region unification to ensure the coherence of the object parts. To fuel the
learning of semantics prior, we collect a material dataset, named Materialized
Individual Objects (MIO), which features abundant images, diverse categories,
and accurate annotations. Extensive quantitative and qualitative experiments
demonstrate the effectiveness of our method.
♻ ★ MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition ICPR
This paper presents a novel approach to processing multimodal data for
dynamic emotion recognition, named the Multimodal Masked Autoencoder for
Dynamic Emotion Recognition (MultiMAE-DER). The MultiMAE-DER leverages the
closely correlated representation information within spatiotemporal sequences
across visual and audio modalities. By utilizing a pre-trained masked
autoencoder model, the MultiMAE-DER is obtained through simple, straightforward
fine-tuning. The performance of the MultiMAE-DER is enhanced by
optimizing six fusion strategies for multimodal input sequences. These
strategies address dynamic feature correlations within cross-domain data across
spatial, temporal, and spatiotemporal sequences. In comparison to
state-of-the-art multimodal supervised learning models for dynamic emotion
recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41%
on the RAVDESS dataset and by 2.06% on CREMA-D. Furthermore, when compared
with the state-of-the-art model of multimodal self-supervised learning,
MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.
comment: Camera-ready Version, Accepted by ICPRS 2024
♻ ★ Mesh Neural Cellular Automata SIGGRAPH 2024
Texture modeling and synthesis are essential for enhancing the realism of
virtual environments. Methods that directly synthesize textures in 3D offer
distinct advantages to the UV-mapping-based methods as they can create seamless
textures and align more closely with the ways textures form in nature. We
propose Mesh Neural Cellular Automata (MeshNCA), a method that directly
synthesizes dynamic textures on 3D meshes without requiring any UV maps.
MeshNCA is a generalized type of cellular automata that can operate on a set of
cells arranged on non-grid structures such as the vertices of a 3D mesh.
MeshNCA accommodates multi-modal supervision and can be trained using different
targets such as images, text prompts, and motion vector fields. Only trained on
an Icosphere mesh, MeshNCA shows remarkable test-time generalization and can
synthesize textures on unseen meshes in real time. We conduct qualitative and
quantitative comparisons to demonstrate that MeshNCA outperforms other 3D
texture synthesis methods in terms of generalization and producing high-quality
textures. Moreover, we introduce a way of grafting trained MeshNCA instances,
enabling interpolation between textures. MeshNCA allows several user
interactions including texture density/orientation controls,
grafting/regenerate brushes, and motion speed/direction controls. Finally, we
implement the forward pass of our MeshNCA model using the WebGL shading
language and showcase our trained models in an online interactive demo, which
is accessible on personal computers and smartphones and is available at
https://meshnca.github.io.
comment: ACM Transactions on Graphics (TOG) - SIGGRAPH 2024
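A minimal sketch of a cellular-automaton update on mesh vertices, assuming a row-normalized sparse adjacency matrix `adj`; this illustrates the "cells arranged on non-grid structures" idea from the MeshNCA abstract, not the authors' architecture:

```python
import torch
import torch.nn as nn

class MeshNCAStep(nn.Module):
    """One cellular-automaton step on a mesh: every vertex (cell)
    perceives its neighbours through a row-normalized sparse adjacency
    `adj` (V, V) and updates its state with a small shared MLP."""
    def __init__(self, state_dim=16, hidden=64):
        super().__init__()
        self.update = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, states, adj):           # states: (V, state_dim)
        neigh = torch.sparse.mm(adj, states)  # mean of neighbour states
        delta = self.update(torch.cat([states, neigh], dim=-1))
        return states + delta                 # residual CA update
```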
♻ ★ GraCo: Granularity-Controllable Interactive Segmentation CVPR2024
Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, Jie Chen
Interactive Segmentation (IS) segments specific objects or parts in the image
according to user input. Current IS pipelines fall into two categories:
single-granularity output and multi-granularity output. The latter aims to
alleviate the spatial ambiguity present in the former. However, the
multi-granularity output pipeline suffers from limited interaction flexibility
and produces redundant results. In this work, we introduce
Granularity-Controllable Interactive Segmentation (GraCo), a novel approach
that allows precise control of prediction granularity by introducing additional
parameters to input. This enhances the customization of the interactive system
and eliminates redundancy while resolving ambiguity. Nevertheless, the
exorbitant cost of annotating multi-granularity masks and the lack of available
datasets with granularity annotations make it difficult for models to acquire
the necessary guidance to control output granularity. To address this problem,
we design an any-granularity mask generator that exploits the semantic property
of the pre-trained IS model to automatically generate abundant mask-granularity
pairs without requiring additional manual annotation. Based on these pairs, we
propose a granularity-controllable learning strategy that efficiently imparts
the granularity controllability to the IS model. Extensive experiments on
intricate scenarios at object and part levels demonstrate that our GraCo has
significant advantages over previous methods. This highlights the potential of
GraCo to be a flexible annotation tool, capable of adapting to diverse
segmentation scenarios. The project page: https://zhao-yian.github.io/GraCo.
comment: CVPR2024 Highlight, Project: https://zhao-yian.github.io/GraCo
♻ ★ Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models ICLR 2024
Machine learning has demonstrated remarkable performance over finite
datasets, yet whether the scores over the fixed benchmarks can sufficiently
indicate the model's performance in the real world is still under discussion. In
reality, an ideal robust model will probably behave similarly to the oracle
(e.g., the human users), thus a good evaluation protocol is probably to
evaluate the models' behaviors in comparison to the oracle. In this paper, we
introduce a new robustness measurement that directly measures the image
classification model's performance compared with a surrogate oracle (i.e., a
foundation model). Besides, we design a simple method that can accomplish the
evaluation beyond the scope of the benchmarks. Our method extends the image
datasets with new samples that are sufficiently perturbed to be distinct from
the ones in the original sets, but are still bounded within the same
image-label structure the original test image represents, constrained by a
foundation model pretrained with a large amount of samples. As a result, our
new method will offer us a new way to evaluate the models' robustness
performance, free of limitations of fixed benchmarks or constrained
perturbations, although scoped by the power of the oracle. In addition to the
evaluation results, we also leverage our generated data to understand the
behaviors of the model and our new evaluation strategies.
comment: Accepted by ICLR 2024 Poster
♻ ★ Geo-Localization Based on Dynamically Weighted Factor-Graph
Miguel Ángel Muñoz-Bañón, Alejandro Olivas, Edison Velasco-Sánchez, Francisco A. Candelas, Fernando Torres
Feature-based geo-localization relies on associating features extracted from
aerial imagery with those detected by the vehicle's sensors. This requires that
the landmark types be observable from both sources. The resulting lack of
variety in feature types generates poor representations that lead to outliers
and deviations, produced by ambiguities and missed detections, respectively. To
mitigate these drawbacks, in this paper, we present a dynamically weighted
factor graph model for the vehicle's trajectory estimation. The weight
adjustment in this implementation depends on information quantification in the
detections performed using a LiDAR sensor. Also, a prior (GNSS-based) error
estimation is included in the model. Then, when the representation becomes
ambiguous or sparse, the weights are dynamically adjusted to rely on the
corrected prior trajectory, mitigating outliers and deviations in this way. We
compare our method against state-of-the-art geo-localization ones in a
challenging and ambiguous environment, where we also cause detection losses. We
demonstrate mitigation of the mentioned drawbacks where the other methods fail.
comment: This paper is published in the journal "IEEE Robotics and Automation
Letters"
♻ ★ Deepfake Generation and Detection: A Benchmark and Survey
Gan Pei, Jiangning Zhang, Menghan Hu, Zhenyu Zhang, Chengjie Wang, Yunsheng Wu, Guangtao Zhai, Jian Yang, Chunhua Shen, Dacheng Tao
Deepfake is a technology dedicated to creating highly realistic facial images
and videos under specific conditions, which has significant application
potential in fields such as entertainment, movie production, digital human
creation, to name a few. With the advancements in deep learning, techniques
primarily represented by Variational Autoencoders and Generative Adversarial
Networks have achieved impressive generation results. More recently, the
emergence of diffusion models with powerful generation capabilities has sparked
a renewed wave of research. In addition to deepfake generation, corresponding
detection technologies continuously evolve to regulate the potential misuse of
deepfakes, such as for privacy invasion and phishing attacks. This survey
comprehensively reviews the latest developments in deepfake generation and
detection, summarizing and analyzing current state-of-the-arts in this rapidly
evolving field. We first unify task definitions, comprehensively introduce
datasets and metrics, and discuss developing technologies. Then, we discuss the
development of several related sub-fields and focus on researching four
representative deepfake fields: face swapping, face reenactment, talking face
generation, and facial attribute editing, as well as forgery detection.
Subsequently, we comprehensively benchmark representative methods on popular
datasets for each field, fully evaluating the latest and influential published
works. Finally, we analyze challenges and future research directions of the
discussed fields.
comment: We closely follow the latest developments in
https://github.com/flyingby/Awesome-Deepfake-Generation-and-Detection
♻ ★ SpecNeRF: Gaussian Directional Encoding for Specular Reflections CVPR2024
Li Ma, Vasu Agrawal, Haithem Turki, Changil Kim, Chen Gao, Pedro Sander, Michael Zollhöfer, Christian Richardt
Neural radiance fields have achieved remarkable performance in modeling the
appearance of 3D scenes. However, existing approaches still struggle with the
view-dependent appearance of glossy surfaces, especially under complex lighting
of indoor environments. Unlike existing methods, which typically assume distant
lighting like an environment map, we propose a learnable Gaussian directional
encoding to better model the view-dependent effects under near-field lighting
conditions. Importantly, our new directional encoding captures the
spatially-varying nature of near-field lighting and emulates the behavior of
prefiltered environment maps. As a result, it enables the efficient evaluation
of preconvolved specular color at any 3D location with varying roughness
coefficients. We further introduce a data-driven geometry prior that helps
alleviate the shape radiance ambiguity in reflection modeling. We show that our
Gaussian directional encoding and geometry prior significantly improve the
modeling of challenging specular reflections in neural radiance fields, which
helps decompose appearance into more physically meaningful components.
comment: Accepted to CVPR2024 as Highlight, Project page:
https://limacv.github.io/SpecNeRF_web/
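The abstract does not give the encoding's functional form; one plausible shape for a learnable Gaussian directional encoding, offered purely as an illustrative assumption, is a set of Gaussian lobes over the view direction:

```python
import torch

def gaussian_dir_encoding(dirs, mus, sigmas):
    """Hypothetical Gaussian directional encoding: K learnable Gaussian
    lobes on the sphere; the response peaks when the view direction
    aligns with a lobe center mu and falls off with angular distance."""
    cos = dirs @ mus.t()                         # (N, K) cosine to centers
    return torch.exp((cos - 1.0) / sigmas ** 2)  # 1 at alignment, decays
```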
♻ ★ Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation ICML 2024
The ability to accurately comprehend natural language instructions and
navigate to the target location is essential for an embodied agent. Such agents
are typically required to execute user instructions in an online manner,
leading us to explore the use of unlabeled test samples for effective online
model adaptation. However, for online Vision-and-Language Navigation (VLN), due
to the intrinsic nature of inter-sample online instruction execution and
intra-sample multi-step action decision, frequent updates can result in drastic
changes in model parameters, while occasional updates can make the model
ill-equipped to handle dynamically changing environments. Therefore, we propose
a Fast-Slow Test-Time Adaptation (FSTTA) approach for online VLN by performing
joint decomposition-accumulation analysis for both gradients and parameters in
a unified framework. Extensive experiments show that our method obtains
impressive performance gains on four popular benchmarks. Code is available at
https://github.com/Feliciaxyao/ICML2024-FSTTA.
comment: Accepted by International Conference on Machine Learning (ICML 2024)
♻ ★ Cell Maps Representation For Lung Adenocarcinoma Growth Patterns Classification In Whole Slide Images
Lung adenocarcinoma is a morphologically heterogeneous disease, characterized
by five primary histologic growth patterns. The quantity of these patterns can
be related to tumor behavior and has a significant impact on patient prognosis.
In this work, we propose a novel machine learning pipeline capable of
classifying tissue tiles into one of the five patterns or as non-tumor, with an
Area Under the Receiver Operating Characteristic Curve (AUCROC) score of 0.97.
Our model's strength lies in its comprehensive consideration of cellular
spatial patterns, where it first generates cell maps from Hematoxylin and Eosin
(H&E) whole slide images (WSIs), which are then fed into a convolutional neural
network classification model. Exploiting these cell maps provides the model
with robust generalizability to new data, achieving approximately 30% higher
accuracy on unseen test-sets compared to current state of the art approaches.
The insights derived from our model can be used to predict prognosis, enhancing
patient outcomes.
♻ ★ CNN-based Game State Detection for a Foosball Table
The automation of games using Deep Reinforcement Learning Strategies (DRL) is
a well-known challenge in AI research. While feature extraction in a video
game typically uses the whole image, this is hardly practical for many
real-world games. Instead, using a smaller game state that reduces the
dimension of the parameter space to only the essential parameters seems to be
a promising approach. In the game of Foosball, a compact and comprehensive game state
description consists of the positional shifts and rotations of the figures and
the position of the ball over time. In particular, velocities and accelerations
can be derived from consecutive time samples of the game state. In this paper,
a figure detection system to determine the game state in Foosball is presented.
We capture a dataset containing the rotations of the rods, measured using
accelerometers, and the positional shifts, derived using traditional
Computer Vision techniques (in a laboratory setting). This dataset is utilized
to train Convolutional Neural Network (CNN) based end-to-end regression models
to predict the rotations and shifts of each rod. We present an evaluation of
our system using different state-of-the-art CNNs as base architectures for the
regression model. We show that our system is able to predict the game state
with high accuracy. By providing data for both black and white teams, the
presented system is intended to provide the required data for future
developments of Imitation Learning techniques w.r.t. observing human
players.
♻ ★ Deep Regression Representation Learning with Topology ICML 2024
Most works studying representation learning focus only on classification and
neglect regression. Yet, the learning objectives and, therefore, the
representation topologies of the two tasks are fundamentally different:
classification targets class separation, leading to disconnected
representations, whereas regression requires ordinality with respect to the
target, leading to continuous representations. We thus wonder how the
effectiveness of a regression representation is influenced by its topology,
with evaluation based on the Information Bottleneck (IB) principle. The IB
principle is an important framework that provides principles for learning
effective representations. We establish two connections between it and the
topology of regression representations. The first connection reveals that a
lower intrinsic dimension of the feature space implies a reduced complexity of
the representation Z. This complexity can be quantified as the conditional
entropy of Z given the target Y, and serves as an upper bound on the
generalization error. The second connection suggests a feature space that is
topologically similar to the target space will better align with the IB
principle. Based on these two connections, we introduce PH-Reg, a regularizer
specific to regression that matches the intrinsic dimension and topology of the
feature space with the target space. Experiments on synthetic and real-world
regression tasks demonstrate the benefits of PH-Reg. Code:
https://github.com/needylove/PH-Reg.
comment: ICML 2024
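PH-Reg itself relies on persistent homology, which needs a dedicated library; the hedged sketch below keeps only the underlying "match the feature geometry to the target geometry" idea via pairwise-distance matching:

```python
import torch

def distance_matching_reg(z, y):
    """Geometry-matching regularizer: encourage pairwise distances in
    feature space z (n, d) to mirror those in the target space y.
    PH-Reg uses persistent homology; this keeps only the matching idea."""
    if y.dim() == 1:
        y = y.unsqueeze(1)
    dz, dy = torch.cdist(z, z), torch.cdist(y, y)
    dz = dz / (dz.max() + 1e-8)      # scale-normalize both spaces
    dy = dy / (dy.max() + 1e-8)
    return ((dz - dy) ** 2).mean()
```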
♻ ★ Testing the Segment Anything Model on radiology data
Deep learning models trained with large amounts of data have become a recent
and effective approach to predictive problem solving -- these have become known
as "foundation models" as they can be used as fundamental tools for other
applications. While the paramount examples of image classification (earlier)
and large language models (more recently) led the way, the Segment Anything
Model (SAM) was recently proposed and stands as the first foundation model for
image segmentation, trained on over 10 million images and with recourse to over
1 billion masks. However, the question remains -- what are the limits of this
foundation? Given that magnetic resonance imaging (MRI) stands as an important
method of diagnosis, we sought to understand whether SAM could be used for a
few tasks of zero-shot segmentation using MRI data. Particularly, we wanted to
know if selecting masks from the pool of SAM predictions could lead to good
segmentations.
Here, we provide a critical assessment of the performance of SAM on magnetic
resonance imaging data. We show that, while acceptable in a very limited set of
cases, the overall trend implies that these models are insufficient for MRI
segmentation across the whole volume, but can provide good segmentations in a
few, specific slices. More importantly, we note that while foundation models
trained on natural images are set to become key aspects of predictive
modelling, they may prove ineffective when used on other imaging modalities.
♻ ★ FSL-Rectifier: Rectify Outliers in Few-Shot Learning via Test-Time Augmentation
Few-shot-learning (FSL) commonly requires a model to identify images
(queries) that belong to classes unseen during training, based on a few
labelled samples of the new classes (support set) as reference. As the test
classes are novel, FSL suffers from high generalization error with respect
to the novel classes, and outlier query or support images during
inference exacerbate the error further. So far, plenty of algorithms involve
training data augmentation to improve the generalization capability of FSL
models. In contrast, inspired by the fact that test samples are more relevant
to the target domain, we believe that test-time augmentation may be more useful
than training augmentation for FSL. In this work, to reduce the bias caused by
unconventional test samples, we generate new test samples by combining
them with similar train-class samples. Averaged representations of the
test-time augmentation are then considered for few-shot classification.
According to our experiments, by augmenting the support set and query with a
few additional generated samples, we can achieve improvements for trained FSL
models. Importantly, our method is universally compatible with different
off-the-shelf FSL models, whose performance can be improved without extra
datasets or further training of the models themselves. Code is available at
https://github.com/WendyBaiYunwei/FSL-Rectifier.
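The test-time rectification described above reduces to a simple averaging pattern. In this sketch, `generate_similar` is an assumed stand-in for the paper's combination of a test sample with similar train-class samples:

```python
import torch

def rectified_embedding(encoder, image, generate_similar, n_aug=3):
    """Average the embedding of a test image with embeddings of a few
    generated variants (combinations with similar train-class samples);
    `generate_similar` is an assumed stand-in for the paper's generator."""
    views = [image] + [generate_similar(image) for _ in range(n_aug)]
    with torch.no_grad():
        embs = torch.stack([encoder(v.unsqueeze(0)).squeeze(0)
                            for v in views])
    return embs.mean(dim=0)          # rectified support/query representation
```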
♻ ★ Temporal-Spatial Object Relations Modeling for Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) is a challenging task where an agent is
required to navigate to a natural language described location via vision
observations. The navigation abilities of the agent can be enhanced by the
relations between objects, which are usually learned using internal objects or
external datasets. The relationships between internal objects are modeled
using graph convolutional networks (GCNs) in traditional studies. However,
GCNs tend to be shallow, limiting their modeling ability. To address this issue,
we utilize a cross attention mechanism to learn the connections between objects
over a trajectory, which takes temporal continuity into account, termed
Temporal Object Relations (TOR). The external datasets have a gap with the
navigation environment, leading to inaccurate modeling of relations. To avoid
this problem, we construct object connections based on observations from all
viewpoints in the navigational environment, which ensures complete spatial
coverage and eliminates the gap, called Spatial Object Relations (SOR).
Additionally, we observe that agents may repeatedly visit the same location
during navigation, significantly hindering their performance. For resolving
this matter, we introduce the Turning Back Penalty (TBP) loss function, which
penalizes the agent's repetitive visiting behavior, substantially reducing the
navigational distance. Experimental results on the REVERIE, SOON, and R2R
datasets demonstrate the effectiveness of the proposed method.
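The exact form of the Turning Back Penalty is not in the abstract; a hypothetical penalty term that grows as the agent steps back near previously visited positions could look like:

```python
def turning_back_penalty(visited, new_pos, radius=0.5):
    """Hypothetical TBP-style term: grows as the agent steps close to an
    already-visited position, discouraging repeated visits."""
    penalty = 0.0
    for p in visited:
        d = sum((a - b) ** 2 for a, b in zip(p, new_pos)) ** 0.5
        if d < radius:
            penalty += 1.0 - d / radius   # stronger the closer it returns
    return penalty
```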
♻ ★ PCLMix: Weakly Supervised Medical Image Segmentation via Pixel-Level Contrastive Learning and Dynamic Mix Augmentation
In weakly supervised medical image segmentation, the absence of structural
priors and the discreteness of class feature distribution present a challenge,
i.e., how to accurately propagate supervision signals from local to global
regions without excessively spreading them to other irrelevant regions? To
address this, we propose a novel weakly supervised medical image segmentation
framework named PCLMix, comprising dynamic mix augmentation, pixel-level
contrastive learning, and consistency regularization strategies. Specifically,
PCLMix is built upon a heterogeneous dual-decoder backbone, addressing the
absence of structural priors through a strategy of dynamic mix augmentation
during training. To handle the discrete distribution of class features, PCLMix
incorporates pixel-level contrastive learning based on prediction uncertainty,
effectively enhancing the model's ability to differentiate inter-class pixel
differences and intra-class consistency. Furthermore, to reinforce segmentation
consistency and robustness, PCLMix employs an auxiliary decoder for dual
consistency regularization. In the inference phase, the auxiliary decoder will
be dropped and no computation complexity is increased. Extensive experiments on
the ACDC dataset demonstrate that PCLMix appropriately propagates local
supervision signals to the global scale, further narrowing the gap between
weakly supervised and fully supervised segmentation methods. Our code is
available at https://github.com/Torpedo2648/PCLMix.
♻ ★ Training-Free Consistent Text-to-Image Generation SIGGRAPH 2024
Text-to-image models offer a new level of creative flexibility by allowing
users to guide the image generation process through natural language. However,
using these models to consistently portray the same subject across diverse
prompts remains challenging. Existing approaches fine-tune the model to teach
it new words that describe specific user-provided subjects or add image
conditioning to the model. These methods require lengthy per-subject
optimization or large-scale pre-training. Moreover, they struggle to align
generated images with text prompts and face difficulties in portraying multiple
subjects. Here, we present ConsiStory, a training-free approach that enables
consistent subject generation by sharing the internal activations of the
pretrained model. We introduce a subject-driven shared attention block and
correspondence-based feature injection to promote subject consistency between
images. Additionally, we develop strategies to encourage layout diversity while
maintaining subject consistency. We compare ConsiStory to a range of baselines,
and demonstrate state-of-the-art performance on subject consistency and text
alignment, without requiring a single optimization step. Finally, ConsiStory
can naturally extend to multi-subject scenarios, and even enable training-free
personalization for common objects.
comment: Accepted to journal track of SIGGRAPH 2024 (TOG). Project page is at
https://consistory-paper.github.io
♻ ★ An Adaptive Cost-Sensitive Learning and Recursive Denoising Framework for Imbalanced SVM Classification
Category imbalance is one of the most popular and important issues in the
domain of classification. Emotion classification models trained on imbalanced
datasets easily lead to unreliable predictions. Traditional machine
learning methods tend to favor the majority class, which leads to a lack of
minority-class information in the model. Moreover, most existing models
produce abnormal sensitivity issues or performance degradation. We propose a
robust learning algorithm based on adaptive cost-sensitivity and recursive
denoising, which is a generalized framework and can be incorporated into most
stochastic optimization algorithms. The proposed method uses the dynamic kernel
distance optimization model between the sample and the decision boundary, which
makes full use of the sample's prior information. In addition, we also put
forward an effective method to filter noise, the main idea of which is to judge
the noise by finding the nearest neighbors of the minority class. In order to
evaluate the strength of the proposed method, we not only carry out experiments
on standard datasets but also apply it to emotional classification problems
with different imbalance rates (IR). Experimental results show that the
proposed general framework is superior to traditional methods in accuracy,
recall and G-means.
comment: 22 pages, 30 figures
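The nearest-neighbour noise filter described above can be sketched directly; the function name and the agreement threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_minority_noise(X, y, minority=1, k=5, min_agree=2):
    """Drop minority samples whose k nearest neighbours contain fewer
    than `min_agree` other minority samples, treating them as noise."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    keep = np.ones(len(y), dtype=bool)
    for i in np.where(y == minority)[0]:
        agree = int((y[idx[i, 1:]] == minority).sum())  # idx 0 is the point itself
        keep[i] = agree >= min_agree
    return X[keep], y[keep]
```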
♻ ★ Rectified Gaussian kernel multi-view k-means clustering
In this paper, we present two new variants of multi-view k-means (MVKM)
algorithms for multi-view data. The general idea is to express the
distance between the $h$-th view data points $x_i^h$ and the $h$-th view
cluster centers $a_k^h$ in a manner different from the usual centroid-based
approach. Unlike other methods, our proposed methods learn the multi-view data
by computing similarity via the Euclidean norm in the feature space of a
Gaussian kernel, yielding multi-view k-means with exponent distance (MVKM-ED).
By simultaneously tuning the stabilizer parameter $p$ and the kernel
coefficients $\beta^h$, the compression of the Gaussian-kernel-based weighted
distance in the Euclidean norm reduces the sensitivity of MVKM-ED; the
resulting method is designated the Gaussian-kernel multi-view k-means (GKMVKM)
clustering algorithm. Numerical evaluation on five real-world multi-view
datasets demonstrates the robustness and efficiency of our proposed MVKM-ED
and GKMVKM approaches.
comment: 13 pages, 1 figure, 7 Tables
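The Euclidean distance in Gaussian-kernel feature space has a closed form, since $\|\phi(x) - \phi(a)\|^2 = 2 - 2k(x, a)$ for a Gaussian kernel $k$; a one-function sketch of this distance (the surrounding clustering loop is omitted):

```python
import numpy as np

def kernel_space_dist2(x, a, beta=1.0):
    """Squared Euclidean distance in the feature space of the Gaussian
    kernel k(x, a) = exp(-beta * ||x - a||^2): it equals 2 - 2 k(x, a)."""
    k = np.exp(-beta * np.sum((np.asarray(x) - np.asarray(a)) ** 2, axis=-1))
    return 2.0 - 2.0 * k
```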
♻ ★ BrepGen: A B-rep Generative Diffusion Model with Structured Latent Geometry SIGGRAPH 2024
Xiang Xu, Joseph G. Lambourne, Pradeep Kumar Jayaraman, Zhengqing Wang, Karl D. D. Willis, Yasutaka Furukawa
This paper presents BrepGen, a diffusion-based generative approach that
directly outputs a Boundary representation (B-rep) Computer-Aided Design (CAD)
model. BrepGen represents a B-rep model as a novel structured latent geometry
in a hierarchical tree. With the root node representing a whole CAD solid, each
element of a B-rep model (i.e., a face, an edge, or a vertex) progressively
turns into a child-node from top to bottom. B-rep geometry information goes
into the nodes as the global bounding box of each primitive along with a latent
code describing the local geometric shape. The B-rep topology information is
implicitly represented by node duplication. When two faces share an edge, the
edge curve will appear twice in the tree, and a T-junction vertex with three
incident edges appears six times in the tree with identical node features.
Starting from the root and progressing to the leaf, BrepGen employs
Transformer-based diffusion models to sequentially denoise node features while
duplicated nodes are detected and merged, recovering the B-Rep topology
information. Extensive experiments show that BrepGen advances the task of CAD
B-rep generation, surpassing existing methods on various benchmarks. Results on
our newly collected furniture dataset further showcase its exceptional
capability in generating complicated geometry. While previous methods were
limited to generating simple prismatic shapes, BrepGen incorporates free-form
and doubly-curved surfaces for the first time. Additional applications of
BrepGen include CAD autocomplete and design interpolation. The code, pretrained
models, and dataset are available at https://github.com/samxuxiang/BrepGen.
comment: Accepted to ACM SIGGRAPH 2024. Code at
https://github.com/samxuxiang/BrepGen
♻ ★ RCM-Fusion: Radar-Camera Multi-Level Fusion for 3D Object Detection ICRA 2024
While LiDAR sensors have been successfully applied to 3D object detection,
the affordability of radar and camera sensors has led to a growing interest in
fusing radars and cameras for 3D object detection. However, previous
radar-camera fusion models were unable to fully utilize the potential of radar
information. In this paper, we propose Radar-Camera Multi-level fusion
(RCM-Fusion), which attempts to fuse both modalities at both feature and
instance levels. For feature-level fusion, we propose a Radar Guided BEV
Encoder which transforms camera features into precise BEV representations using
the guidance of radar Bird's-Eye-View (BEV) features and combines the radar and
camera BEV features. For instance-level fusion, we propose a Radar Grid Point
Refinement module that reduces localization error by accounting for the
characteristics of the radar point clouds. The experiments conducted on the
public nuScenes dataset demonstrate that our proposed RCM-Fusion achieves
state-of-the-art performances among single frame-based radar-camera fusion
methods in the nuScenes 3D object detection benchmark. Code will be made
publicly available.
comment: Accepted by IEEE International Conference on Robotics and Automation
(ICRA 2024, Oral presentation), 7 pages, 5 figures
♻ ★ MIMIC: Masked Image Modeling with Image Correspondences
Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Patrick Howe, Sharan Ranjit S, Anand Bhattad, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna
Dense pixel-specific representation learning at scale has been bottlenecked
due to the unavailability of large-scale multi-view datasets. Current methods
for building effective pretraining datasets heavily rely on annotated 3D
meshes, point clouds, and camera parameters from simulated environments,
preventing them from building datasets from real-world data sources where such
metadata is lacking. We propose a pretraining dataset-curation approach that
does not require any additional annotations. Our method allows us to generate
multi-view datasets from both real-world videos and simulated environments at
scale. Specifically, we experiment with two scales: MIMIC-1M with 1.3M and
MIMIC-3M with 3.1M multi-view image pairs. We train multiple models with
different masked image modeling objectives to showcase the following findings:
Representations trained on our automatically generated MIMIC-3M outperform
those learned from expensive crowdsourced datasets (ImageNet-1K) and those
learned from synthetic environments (MULTIVIEW-HABITAT) on two dense geometric
tasks: depth estimation on NYUv2 (1.7%), and surface normals estimation on
Taskonomy (2.05%). For dense tasks which also require object understanding, we
outperform MULTIVIEW-HABITAT, on semantic segmentation on ADE20K (3.89%), pose
estimation on MSCOCO (9.4%), and reduce the gap with models pre-trained on the
object-centric, expensive ImageNet-1K. We outperform even when the
representations are frozen and when downstream training data is limited to
the few-shot regime. The larger dataset (MIMIC-3M) significantly improves performance, which
is promising since our curation method can arbitrarily scale to produce even
larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at
https://github.com/RAIVNLab/MIMIC.
♻ ★ Zero-shot sketch-based remote sensing image retrieval based on multi-level and attention-guided tokenization
Effectively and efficiently retrieving images from remote sensing databases
is a critical challenge in the realm of remote sensing big data. Utilizing
hand-drawn sketches as retrieval inputs offers intuitive and user-friendly
advantages, yet the potential of multi-level feature integration from sketches
remains underexplored, leading to suboptimal retrieval performance. To address
this gap, our study introduces a novel zero-shot, sketch-based retrieval method
for remote sensing images, leveraging multi-level feature extraction,
self-attention-guided tokenization and filtering, and cross-modality attention
update. This approach employs only vision information and does not require
semantic knowledge concerning the sketch and image. It starts by employing
multi-level self-attention guided feature extraction to tokenize the query
sketches, as well as self-attention feature extraction to tokenize the
candidate images. It then employs cross-attention mechanisms to establish token
correspondence between these two modalities, facilitating the computation of
sketch-to-image similarity. Our method significantly outperforms existing
sketch-based remote sensing image retrieval techniques, as evidenced by tests
on multiple datasets. Notably, it also exhibits robust zero-shot learning
capabilities and strong generalizability in handling unseen categories and
novel remote sensing data. The method's scalability can be further enhanced by
the pre-calculation of retrieval tokens for all candidate images in a database.
This research underscores the significant potential of multi-level,
attention-guided tokenization in cross-modal remote sensing image retrieval.
For broader accessibility and research facilitation, we have made the code and
dataset used in this study publicly available online. Code and dataset are
available at https://github.com/Snowstormfly/Cross-modal-retrieval-MLAGT.
comment: 44 pages, 6 figures
♻ ★ V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection
AI-generated video has revolutionized short video production, filmmaking, and
personalized media, making video local editing an essential tool. However, this
progress also blurs the line between reality and fiction, posing challenges in
multimedia forensics. To solve this urgent issue, V2A-Mark is proposed to
address the limitations of current video tampering forensics, such as poor
generalizability, singular function, and single modality focus. Combining the
fragility of video-into-video steganography with deep robust watermarking, our
method can embed invisible visual-audio localization watermarks and copyright
watermarks into the original video frames and audio, enabling precise
manipulation localization and copyright protection. We also design a temporal
alignment and fusion module and degradation prompt learning to enhance the
localization accuracy and decoding robustness. Meanwhile, we introduce a
sample-level audio localization method and a cross-modal copyright extraction
mechanism to couple the information of audio and video frames. The
effectiveness of V2A-Mark has been verified on a visual-audio tampering
dataset, emphasizing its superiority in localization precision and copyright
accuracy, crucial for the sustainable development of video editing in the AIGC
video era.
♻ ★ Retrieval-Augmented Egocentric Video Captioning CVPR 2024
Understanding human actions from videos of first-person view poses
significant challenges. Most prior approaches explore representation learning
on egocentric videos only, while overlooking the potential benefit of
exploiting existing large-scale third-person videos. In this paper, (1) we
develop EgoInstructor, a retrieval-augmented multimodal captioning model that
automatically retrieves semantically relevant third-person instructional videos
to enhance the video captioning of egocentric videos. (2) For training the
cross-view retrieval module, we devise an automatic pipeline to discover
ego-exo video pairs from distinct large-scale egocentric and exocentric
datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE
loss that pulls egocentric and exocentric video features closer by aligning
them to shared text features that describe similar actions. (4) Through
extensive experiments, our cross-view retrieval module demonstrates superior
performance across seven benchmarks. Regarding egocentric video captioning,
EgoInstructor exhibits significant improvements by leveraging third-person
videos as references.
comment: CVPR 2024. Project page: https://jazzcharles.github.io/Egoinstructor/
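A hedged sketch of the EgoExoNCE idea, contrasting both ego and exo video features against shared text features of the same action; the batch pairing and temperature below are assumptions, not the released loss:

```python
import torch
import torch.nn.functional as F

def ego_exo_nce(ego, exo, text, tau=0.07):
    """Both ego and exo clip features are contrasted against shared text
    features of their action; paired views are pulled together through
    the common text anchor. Assumes row i of each tensor is one pair."""
    ego, exo, text = (F.normalize(t, dim=-1) for t in (ego, exo, text))
    labels = torch.arange(text.shape[0], device=text.device)
    loss = 0.0
    for vid in (ego, exo):
        logits = vid @ text.t() / tau        # (B, B) video-text similarity
        loss = loss + F.cross_entropy(logits, labels)
    return loss / 2
```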
♻ ★ Remembering Transformer for Continual Learning
Neural networks encounter the challenge of Catastrophic Forgetting (CF) in
continual learning, where new task learning interferes with previously learned
knowledge. Existing data fine-tuning and regularization methods necessitate
task identity information during inference and cannot eliminate interference
among different tasks, while soft parameter sharing approaches encounter the
problem of an increasing model parameter size. To tackle these challenges, we
propose the Remembering Transformer, inspired by the brain's Complementary
Learning Systems (CLS). Remembering Transformer employs a mixture-of-adapters
architecture and a generative model-based novelty detection mechanism in a
pretrained Transformer to alleviate CF. Remembering Transformer dynamically
routes task data to the most relevant adapter with enhanced parameter
efficiency based on knowledge distillation. We conducted extensive experiments,
including ablation studies on the novelty detection mechanism and model
capacity of the mixture-of-adapters, in a broad range of class-incremental
split tasks and permutation tasks. Our approach demonstrated SOTA performance
surpassing the second-best method by 15.90% in the split tasks, reducing the
memory footprint from 11.18M to 0.22M in the five-split CIFAR10 task.
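The generative novelty-detection routing can be sketched as picking, per sample, the adapter whose task autoencoder reconstructs the input best; this is an assumed reading of the mechanism, not the authors' code:

```python
import torch

def route_to_adapter(x, autoencoders):
    """Per-sample routing by generative novelty detection: send x (B, D)
    to the adapter whose task autoencoder reconstructs it best."""
    errs = torch.stack([((ae(x) - x) ** 2).mean(dim=-1)
                        for ae in autoencoders])   # (T, B) recon errors
    return errs.argmin(dim=0)                      # adapter index per sample
```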