Computer Vision and Pattern Recognition
★ PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation ECCV 2024
We present PhysGen, a novel image-to-video generation method that converts a
single image and an input condition (e.g., force and torque applied to an
object in the image) into a realistic, physically plausible, and
temporally consistent video. Our key insight is to integrate model-based
physical simulation with a data-driven video generation process, enabling
plausible image-space dynamics. At the heart of our system are three core
components: (i) an image understanding module that effectively captures the
geometry, materials, and physical parameters of the image; (ii) an image-space
dynamics simulation model that utilizes rigid-body physics and inferred
parameters to simulate realistic behaviors; and (iii) an image-based rendering
and refinement module that leverages generative video diffusion to produce
realistic video footage featuring the simulated motion. The resulting videos
are realistic in both physics and appearance and are even precisely
controllable, showcasing superior results over existing data-driven
image-to-video generation works through quantitative comparisons and a
comprehensive user study. PhysGen's resulting videos can be used for various
downstream applications, such as turning an image into a realistic animation or
allowing users to interact with the image and create various dynamics. Project
page: https://stevenlsw.github.io/physgen/
comment: Accepted to ECCV 2024. Project page:
https://stevenlsw.github.io/physgen/
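As a toy illustration of the model-based simulation stage, the sketch below steps a rigid body in pymunk under a user-specified impulse and records per-frame poses. The mass, friction, and force values are placeholders for the quantities PhysGen infers from the image, and the rendering and refinement stages are omitted.

    import pymunk

    # Simulation stage only: push a rigid body and log its per-frame poses,
    # which would then drive image-space warping and video refinement.
    space = pymunk.Space()
    space.gravity = (0.0, -981.0)

    mass, radius = 1.0, 20.0  # assumed; PhysGen infers physical parameters
    body = pymunk.Body(mass, pymunk.moment_for_circle(mass, 0, radius))
    body.position = (100.0, 200.0)
    shape = pymunk.Circle(body, radius)
    shape.friction = 0.7      # assumed material parameter
    space.add(body, shape)

    body.apply_impulse_at_local_point((150.0, 0.0))  # the input force condition

    poses = []                # (position, angle) trajectory for the renderer
    for _ in range(60):       # 60 frames at 30 fps
        space.step(1.0 / 30.0)
        poses.append((tuple(body.position), body.angle))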
★ Exploring Token Pruning in Vision State Space Models NeurIPS'24
Zheng Zhan, Zhenglun Kong, Yifan Gong, Yushu Wu, Zichong Meng, Hangyu Zheng, Xuan Shen, Stratis Ioannidis, Wei Niu, Pu Zhao, Yanzhi Wang
State Space Models (SSMs) have the advantage of keeping linear computational
complexity compared to attention modules in transformers, and have been applied
to vision tasks as a new type of powerful vision foundation model. Inspired by
the observation that the final prediction in vision transformers (ViTs) is
based on only a subset of the most informative tokens, we take the novel step of
enhancing the efficiency of SSM-based vision models through token-based
pruning. However, direct applications of existing token pruning techniques
designed for ViTs fail to deliver good performance, even with extensive
fine-tuning. To address this issue, we revisit the unique computational
characteristics of SSMs and discover that naive application disrupts the
sequential token positions. This insight motivates us to design a novel and
general token pruning method specifically for SSM-based vision models. We first
introduce a pruning-aware hidden state alignment method to stabilize the
neighborhood of the remaining tokens and improve performance. In addition, based on
our detailed analysis, we propose a token importance evaluation method adapted
for SSM models, to guide the token pruning. With efficient implementation and
practical acceleration methods, our method yields actual speedups. Extensive
experiments demonstrate that our approach can achieve significant computation
reduction with minimal impact on performance across different tasks. Notably,
we achieve 81.7% accuracy on ImageNet with a 41.6% reduction in FLOPs for
pruned PlainMamba-L3. Furthermore, our work provides deeper insights into
understanding the behavior of SSM-based vision models for future research.
comment: NeurIPS'24
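A minimal PyTorch sketch of order-preserving token pruning; the L2-norm importance score is a stand-in for the paper's SSM-adapted criterion, and the pruning-aware hidden-state alignment step is omitted:

    import torch

    def prune_tokens(x, keep_ratio=0.6):
        # x: (B, N, D) token features. Score tokens (here: feature L2 norm,
        # a stand-in for the paper's importance measure), keep the top-k, and
        # re-sort indices so surviving tokens keep their sequential order --
        # the property the paper finds critical for SSM-based models.
        B, N, D = x.shape
        k = max(1, int(N * keep_ratio))
        idx = x.norm(dim=-1).topk(k, dim=1).indices.sort(dim=1).values
        return x.gather(1, idx.unsqueeze(-1).expand(B, k, D))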
★ ProMerge: Prompt and Merge for Unsupervised Instance Segmentation ECCV2024
Unsupervised instance segmentation aims to segment distinct object instances
in an image without relying on human-labeled data. This field has recently seen
significant advancements, partly due to the strong local correspondences
afforded by rich visual feature representations from self-supervised models
(e.g., DINO). Recent state-of-the-art approaches use self-supervised features
to represent images as graphs and solve a generalized eigenvalue system (i.e.,
normalized-cut) to generate foreground masks. While effective, this strategy is
limited by its attendant computational demands, leading to slow inference
speeds. In this paper, we propose Prompt and Merge (ProMerge), which leverages
self-supervised visual features to obtain initial groupings of patches and
applies a strategic merging to these segments, aided by a sophisticated
background-based mask pruning technique. ProMerge not only yields competitive
results but also offers a significant reduction in inference time compared to
state-of-the-art normalized-cut-based approaches. Furthermore, when training an
object detector using our mask predictions as pseudo-labels, the resulting
detector surpasses the current leading unsupervised model on various
challenging instance segmentation benchmarks.
comment: ECCV2024 camera-ready
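The merging step can be pictured as greedy agglomeration of patch groupings by feature similarity; the sketch below is a simplified stand-in (ProMerge's actual procedure also prompts the initial groupings and prunes background masks):

    import torch
    import torch.nn.functional as F

    def merge_segments(feats, masks, thresh=0.8):
        # feats: list of (D,) mean self-supervised features; masks: list of
        # (H, W) boolean masks. Merge the first pair whose cosine similarity
        # exceeds thresh, then rescan until no pair qualifies.
        changed = True
        while changed:
            changed = False
            for i in range(len(feats)):
                for j in range(i + 1, len(feats)):
                    if F.cosine_similarity(feats[i], feats[j], dim=0) > thresh:
                        masks[i] = masks[i] | masks[j]
                        feats[i] = (feats[i] + feats[j]) / 2
                        del feats[j], masks[j]
                        changed = True
                        break
                if changed:
                    break
        return feats, masks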
★ UniCal: Unified Neural Sensor Calibration ECCV 2024
Ze Yang, George Chen, Haowei Zhang, Kevin Ta, Ioan Andrei Bârsan, Daniel Murphy, Sivabalan Manivasagam, Raquel Urtasun
Self-driving vehicles (SDVs) require accurate calibration of LiDARs and
cameras to reliably fuse sensor data for autonomy. Traditional calibration
methods typically leverage fiducials captured in a controlled and structured
scene and compute correspondences to optimize over. These approaches are costly
and require substantial infrastructure and operations, making it challenging to
scale for vehicle fleets. In this work, we propose UniCal, a unified framework
for effortlessly calibrating SDVs equipped with multiple LiDARs and cameras.
Our approach is built upon a differentiable scene representation capable of
rendering multi-view geometrically and photometrically consistent sensor
observations. We jointly learn the sensor calibration and the underlying scene
representation through differentiable volume rendering, utilizing outdoor
sensor data without the need for specific calibration fiducials. This
"drive-and-calibrate" approach significantly reduces costs and operational
overhead compared to existing calibration systems, enabling efficient
calibration for large SDV fleets at scale. To ensure geometric consistency
across observations from different sensors, we introduce a novel surface
alignment loss that combines feature-based registration with neural rendering.
Comprehensive evaluations on multiple datasets demonstrate that UniCal
outperforms or matches the accuracy of existing calibration approaches while
being more efficient, demonstrating the value of UniCal for scalable
calibration.
comment: ECCV 2024. Project page: https://waabi.ai/unical/
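A heavily simplified, hedged sketch of the underlying idea in PyTorch: refine a camera extrinsic so that projected LiDAR points are photometrically consistent with the image. UniCal instead renders from a learned neural scene representation and calibrates all sensors jointly; the pinhole intrinsics, single frame, and per-point intensity targets below are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def so3_exp(r):
        # Rodrigues' formula: axis-angle (3,) -> rotation matrix (3, 3).
        theta = torch.sqrt((r * r).sum() + 1e-12)
        k = r / theta
        z = torch.zeros((), dtype=r.dtype)
        K = torch.stack([torch.stack([z, -k[2], k[1]]),
                         torch.stack([k[2], z, -k[0]]),
                         torch.stack([-k[1], k[0], z])])
        return (torch.eye(3) + torch.sin(theta) * K
                + (1.0 - torch.cos(theta)) * (K @ K))

    pts = torch.cat([torch.rand(500, 2) * 2 - 1,
                     torch.rand(500, 1) + 4], dim=1)  # toy LiDAR points
    targets = torch.rand(500)               # per-point intensity targets
    image = torch.rand(1, 1, 240, 320)      # grayscale camera frame
    r = (0.01 * torch.randn(3)).requires_grad_()  # rotation correction
    t = torch.zeros(3, requires_grad=True)        # translation correction
    opt = torch.optim.Adam([r, t], lr=1e-2)

    for _ in range(200):
        X = pts @ so3_exp(r).T + t              # points in the camera frame
        u = 300.0 * X[:, 0] / X[:, 2] + 160.0   # assumed pinhole intrinsics
        v = 300.0 * X[:, 1] / X[:, 2] + 120.0
        grid = torch.stack([u / 159.5 - 1.0, v / 119.5 - 1.0],
                           dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(image, grid, align_corners=True).view(-1)
        loss = F.mse_loss(sampled, targets)  # photometric-consistency surrogate
        opt.zero_grad()
        loss.backward()
        opt.step()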
★ Spectral Wavelet Dropout: Regularization in the Wavelet Domain ICMLA
Regularization techniques help prevent overfitting and therefore improve the
ability of convolutional neural networks (CNNs) to generalize. One reason for
overfitting is the complex co-adaptation among different parts of the network,
which makes the CNN dependent on their joint response rather than encouraging
each part to learn a useful feature representation independently. Frequency
domain manipulation is a powerful strategy for modifying data that has temporal
and spatial coherence by utilizing frequency decomposition. This work
introduces Spectral Wavelet Dropout (SWD), a novel regularization method that
includes two variants: 1D-SWD and 2D-SWD. These variants improve CNN
generalization by randomly dropping detailed frequency bands in the discrete
wavelet decomposition of feature maps. Our approach distinguishes itself from
the pre-existing Spectral "Fourier" Dropout (2D-SFD), which eliminates
coefficients in the Fourier domain. Notably, SWD requires only a single
hyperparameter, unlike the two required by SFD. We also extend the literature
by implementing a one-dimensional version of Spectral "Fourier" Dropout
(1D-SFD), setting the stage for a comprehensive comparison. Our evaluation
shows that both 1D and 2D SWD variants have competitive performance on
CIFAR-10/100 benchmarks relative to both 1D-SFD and 2D-SFD. Specifically,
1D-SWD has a significantly lower computational complexity compared to
1D/2D-SFD. In the Pascal VOC Object Detection benchmark, SWD variants surpass
1D-SFD and 2D-SFD in performance and demonstrate lower computational complexity
during training.
comment: Accepted by The International Conference on Machine Learning and
Applications (ICMLA) 2024
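A minimal sketch of the 2D variant using PyWavelets; applying it to CNN feature maps during training, and the exact sub-band selection scheme, follow the paper:

    import numpy as np
    import pywt

    def spectral_wavelet_dropout_2d(fmap, p=0.5, wavelet="haar", rng=None):
        # Decompose a 2D feature map, zero each detail sub-band (cH, cV, cD)
        # independently with probability p, and reconstruct. The drop rate p
        # is the method's single hyperparameter.
        rng = rng or np.random.default_rng()
        cA, (cH, cV, cD) = pywt.dwt2(fmap, wavelet)
        kept = tuple(c if rng.random() > p else np.zeros_like(c)
                     for c in (cH, cV, cD))
        return pywt.idwt2((cA, kept), wavelet)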
★ From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou, Tianze Luo, Guiyang Xie, Victor Zhang, Fengmao Lv, Guangcong Wang, Juanyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang
The integration of Large Language Models (LLMs) with visual encoders has
recently shown promising performance in visual understanding tasks, leveraging
their inherent capability to comprehend and generate human-like text for visual
reasoning. Given the diverse nature of visual data, MultiModal Large Language
Models (MM-LLMs) exhibit variations in model design and training for
understanding images, short videos, and long videos. Our paper focuses on the
substantial differences and unique challenges posed by long video understanding
compared to static image and short video understanding. Unlike static images,
short videos encompass sequential frames with both spatial and within-event
temporal information, while long videos consist of multiple events with
between-event and long-term temporal information. In this survey, we aim to
trace and summarize the advancements of MM-LLMs from image understanding to
long video understanding. We review the differences among various visual
understanding tasks and highlight the challenges in long video understanding,
including more fine-grained spatiotemporal details, dynamic events, and
long-term dependencies. We then provide a detailed summary of the advancements
in MM-LLMs in terms of model design and training methodologies for
understanding long videos. Finally, we compare the performance of existing
MM-LLMs on video understanding benchmarks of various lengths and discuss
potential future directions for MM-LLMs in long video understanding.
comment: 11 pages
★ ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions
Images captured in challenging environments--such as nighttime, foggy, rainy
weather, and underwater--often suffer from significant degradation, resulting
in a substantial loss of visual quality. Effective restoration of these
degraded images is critical for subsequent vision tasks. While many
existing approaches have successfully incorporated specific priors for
individual tasks, such tailored solutions have limited applicability to other
degradations. In this work, we propose a universal network architecture, dubbed
"ReviveDiff", which can address a wide range of degradations and bring images
back to life by enhancing and restoring their quality. Our approach is inspired
by the observation that, unlike degradation caused by movement or electronic
issues, quality degradation under adverse conditions primarily stems from
natural media (such as fog, water, and low luminance), which generally
preserves the original structures of objects. To restore such images, we
leveraged the latest advancements in diffusion models and developed ReviveDiff
to recover image quality at both macro and micro levels, targeting key
factors such as sharpness, distortion,
noise level, dynamic range, and color accuracy. We rigorously evaluated
ReviveDiff on seven benchmark datasets covering five types of degrading
conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our
experimental results demonstrate that ReviveDiff outperforms the
state-of-the-art methods both quantitatively and visually.
★ SurfaceAI: Automated creation of cohesive road surface quality datasets based on open street-level imagery SP
This paper introduces SurfaceAI, a pipeline designed to generate
comprehensive georeferenced datasets on road surface type and quality from
openly available street-level imagery. The motivation stems from the
significant impact of road unevenness on the safety and comfort of traffic
participants, especially vulnerable road users, emphasizing the need for
detailed road surface data in infrastructure modeling and analysis. SurfaceAI
addresses this gap by leveraging crowdsourced Mapillary data to train models
that predict the type and quality of road surfaces visible in street-level
images, which are then aggregated to provide cohesive information on entire
road segment conditions.
comment: 4 pages, 2 figures; accepted at 2nd ACM SIGSPATIAL International
Workshop on Advances in Urban-AI
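The aggregation step can be sketched as a group-by over per-image predictions; the column names and reduction rules (majority type, mean quality) below are illustrative, not SurfaceAI's exact schema:

    import pandas as pd

    # Per-image model outputs, keyed by the road segment they were matched to.
    preds = pd.DataFrame({
        "segment_id": [1, 1, 1, 2, 2],
        "surface_type": ["asphalt", "asphalt", "paving_stones",
                         "gravel", "gravel"],
        "quality": [2.0, 3.0, 2.0, 4.0, 3.0],
    })
    segments = preds.groupby("segment_id").agg(
        surface_type=("surface_type", lambda s: s.mode().iloc[0]),  # majority
        quality=("quality", "mean"),
    )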
★ Improving Visual Object Tracking through Visual Prompting
Learning a discriminative model to distinguish a target from its surrounding
distractors is essential to generic visual object tracking. Dynamic target
representation adaptation against distractors is challenging due to the limited
discriminative capabilities of prevailing trackers. We present a new visual
Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this
issue. PiVOT proposes a prompt generation network with the pre-trained
foundation model CLIP to automatically generate and refine visual prompts,
enabling the transfer of foundation model knowledge for tracking. While CLIP
offers broad category-level knowledge, the tracker, trained on
instance-specific data, excels at recognizing unique object instances. Thus,
PiVOT first compiles a visual prompt highlighting potential target locations.
To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to
refine the visual prompt based on the similarities between candidate objects
and the reference templates across potential targets. Once the visual prompt is
refined, it can better highlight potential target locations, thereby reducing
irrelevant prompt information. With the proposed prompting mechanism, the
tracker can generate improved instance-aware feature maps through the guidance
of the visual prompt, thus effectively reducing distractors. The proposed
method does not involve CLIP during training, thereby keeping the same training
complexity and preserving the generalization capability of the pretrained
foundation model. Extensive experiments across multiple benchmarks indicate
that PiVOT, using the proposed prompting method, can suppress distracting
objects and enhance tracking performance.
comment: Accepted and to appear in IEEE Transactions on Multimedia
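The refinement idea can be sketched as similarity-gated re-weighting; the embeddings are assumed precomputed, and PiVOT's actual prompt generation network is more elaborate:

    import torch
    import torch.nn.functional as F

    def refine_prompt(cand_embs, template_emb, prompt_scores):
        # cand_embs: (N, D) CLIP embeddings of candidate regions;
        # template_emb: (D,) embedding of the reference target;
        # prompt_scores: (N,) initial prompt scores. Candidates dissimilar
        # to the reference template are suppressed.
        sims = F.cosine_similarity(cand_embs, template_emb.unsqueeze(0), dim=1)
        return prompt_scores * sims.clamp(min=0)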
★ Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors
Yunlong Lin, Zhenqi Fu, Kairun Wen, Tian Ye, Sixiang Chen, Ge Meng, Yingying Wang, Yue Huang, Xiaotong Tu, Xinghao Ding
Low-light image enhancement (LIE) aims at precisely and efficiently
recovering an image degraded in poor illumination environments. Recent advanced
LIE techniques use deep neural networks, which require large numbers of paired
low-/normal-light images, many network parameters, and substantial
computational resources. As a
result, their practicality is limited. In this work, we devise a novel
unsupervised LIE framework based on diffusion priors and lookup tables (DPLUT)
to achieve efficient low-light image recovery. The proposed approach comprises
two critical components: a light adjustment lookup table (LLUT) and a noise
suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised
losses. It aims at predicting pixel-wise curve parameters for the dynamic range
adjustment of a specific image. NLUT is designed to remove the noise amplified
by light brightening. As diffusion models are sensitive to noise,
diffusion priors are introduced to achieve high-performance noise suppression.
Extensive experiments demonstrate that our approach outperforms
state-of-the-art methods in terms of visual quality and efficiency.
comment: 13 pages, 10 figures
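For intuition, pixel-wise curve adjustment in curve-based LIE methods (e.g., Zero-DCE) takes the quadratic form below; DPLUT predicts such parameters through lookup tables rather than a deep network, so this sketch only illustrates how curve parameters act on an image:

    import torch

    def apply_light_curve(x, alpha, iters=4):
        # x: image in [0, 1]; alpha: per-pixel curve parameters in [-1, 1].
        # Each iteration applies x' = x + alpha * x * (1 - x), brightening
        # dark regions while keeping values in range.
        for _ in range(iters):
            x = x + alpha * x * (1 - x)
        return x.clamp(0.0, 1.0)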
★ Detecting Dataset Abuse in Fine-Tuning Stable Diffusion Models for Text-to-Image Synthesis
Text-to-image synthesis has become highly popular for generating realistic
and stylized images, often requiring fine-tuning generative models with
domain-specific datasets for specialized tasks. However, these valuable
datasets face risks of unauthorized usage and unapproved sharing, compromising
the rights of the owners. In this paper, we address the issue of dataset abuse
during the fine-tuning of Stable Diffusion models for text-to-image synthesis.
We present a dataset watermarking framework designed to detect unauthorized
usage and trace data leaks. The framework employs two key strategies across
multiple watermarking schemes and is effective for large-scale dataset
authorization. Extensive experiments demonstrate the framework's effectiveness,
minimal impact on the dataset (only 2% of the data needs to be modified for
high detection accuracy), and ability to trace data leaks. Our results also
highlight the robustness and transferability of the framework, proving its
practical applicability in detecting dataset abuse.
★ S2O: Static to Openable Enhancement for Articulated 3D Objects
Despite much progress in large 3D datasets, there are currently few
interactive 3D object datasets, and their scale is limited due to the manual
effort required in their construction. We introduce the static to openable
(S2O) task, which creates interactive articulated 3D objects from static
counterparts through openable part detection, motion prediction, and interior
geometry completion. We formulate a unified framework to tackle this task, and
curate a challenging dataset of openable 3D objects that serves as a test bed
for systematic evaluation. Our experiments benchmark methods from prior work
and simple yet effective heuristics for the S2O task. We find that turning
static 3D objects into interactively openable counterparts is possible but that
all methods struggle to generalize to realistic settings of the task, and we
highlight promising future work directions.
★ Explainable Artifacts for Synthetic Western Blot Source Attribution
João Phillipe Cardenuto, Sara Mandelli, Daniel Moreira, Paolo Bestagini, Edward Delp, Anderson Rocha
Recent advancements in artificial intelligence have enabled generative models
to produce synthetic scientific images that are indistinguishable from pristine
ones, posing a challenge even for expert scientists habituated to working with
such content. When exploited by organizations known as paper mills, which
systematically generate fraudulent articles, these technologies can
significantly contribute to the spread of misinformation about ungrounded
science, potentially undermining trust in scientific research. While previous
studies have explored black-box solutions, such as Convolutional Neural
Networks, for identifying synthetic content, only some have addressed the
challenge of generalizing across different models and providing insight into
the artifacts in synthetic images that inform the detection process. This study
aims to identify explainable artifacts generated by state-of-the-art generative
models (e.g., Generative Adversarial Networks and Diffusion Models) and
leverage them for open-set identification and source attribution (i.e.,
pointing to the model that created the image).
comment: Accepted in IEEE International Workshop on Information Forensics and
Security - WIFS 2024, Rome, Italy
★ UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception
Visual emotion analysis holds significant research value in both computer
vision and psychology. However, existing methods for visual emotion analysis
suffer from limited generalizability due to the ambiguity of emotion perception
and the diversity of data scenarios. To tackle this issue, we introduce
UniEmoX, a cross-modal semantic-guided large-scale pretraining framework.
Inspired by psychological research emphasizing the inseparability of the
emotional exploration process from the interaction between individuals and
their environment, UniEmoX integrates scene-centric and person-centric
low-level image spatial structural information, aiming to derive more nuanced
and discriminative emotional representations. By exploiting the similarity
between paired and unpaired image-text samples, UniEmoX distills rich semantic
knowledge from the CLIP model to enhance emotional embedding representations
more effectively. To the best of our knowledge, this is the first large-scale
pretraining framework that integrates psychological theories with contemporary
contrastive learning and masked image modeling techniques for emotion analysis
across diverse scenarios. Additionally, we develop a visual emotional dataset
titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural,
realistic, science-fiction, and advertising-cover styles, spanning nearly all
common emotional scenes. Comprehensive experiments conducted on six benchmark
datasets across two downstream tasks validate the effectiveness of UniEmoX. The
source code is available at https://github.com/chincharles/u-emo.
comment: Submitted to TIP
★ CemiFace: Center-based Semi-hard Synthetic Face Generation for Face Recognition NeurIPS 2024
Privacy is a major concern in developing face recognition techniques.
Although synthetic face images can partially mitigate potential legal risks
while maintaining effective face recognition (FR) performance, FR models
trained by face images synthesized by existing generative approaches frequently
suffer from performance degradation problems due to the insufficient
discriminative quality of these synthesized samples. In this paper, we
systematically investigate what contributes to solid face recognition model
training, and reveal that face images with a certain degree of similarity to
their identity centers greatly benefit the performance of trained
FR models. Inspired by this, we propose a novel diffusion-based approach
(namely Center-based Semi-hard Synthetic Face Generation (CemiFace)) which
produces facial samples with various levels of similarity to the subject
center, thus allowing the generation of face datasets containing effective
discriminative samples for training face recognition. Experimental results show
that with a modest degree of similarity, training on the generated dataset can
produce competitive performance compared to previous generation methods.
comment: accepted to NeurIPS 2024. We are preparing the camera-ready version
according to the reviews
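The notion of semi-hard samples can be sketched as a similarity band around the identity center; note that CemiFace generates such samples with a diffusion model rather than filtering existing ones, and the band below is illustrative:

    import torch
    import torch.nn.functional as F

    def select_semi_hard(embs, center, lo=0.3, hi=0.6):
        # embs: (N, D) face embeddings; center: (D,) identity-center embedding.
        # Keep samples whose cosine similarity to the center is mid-range:
        # neither trivially easy (too similar) nor noise-like (too dissimilar).
        sims = F.cosine_similarity(embs, center.unsqueeze(0), dim=1)
        keep = (sims >= lo) & (sims <= hi)
        return embs[keep], sims[keep]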
★ Simulating Dynamic Tumor Contrast Enhancement in Breast MRI using Conditional Generative Adversarial Networks
Richard Osuala, Smriti Joshi, Apostolia Tsirikoglou, Lidia Garrucho, Walter H. L. Pinaya, Daniel M. Lang, Julia A. Schnabel, Oliver Diaz, Karim Lekadir
This paper presents a method for virtual contrast enhancement in breast MRI,
offering a promising non-invasive alternative to traditional contrast
agent-based DCE-MRI acquisition. Using a conditional generative adversarial
network, we predict DCE-MRI images, including jointly-generated sequences of
multiple corresponding DCE-MRI timepoints, from non-contrast-enhanced MRIs,
enabling tumor localization and characterization without the associated health
risks. Furthermore, we qualitatively and quantitatively evaluate the synthetic
DCE-MRI images, proposing a multi-metric Scaled Aggregate Measure (SAMe),
assessing their utility in a tumor segmentation downstream task, and conclude
with an analysis of the temporal patterns in multi-sequence DCE-MRI generation.
Our approach demonstrates promising results in generating realistic and useful
DCE-MRI sequences, highlighting the potential of virtual contrast enhancement
for improving breast cancer diagnosis and treatment, particularly for patients
where contrast agent administration is contraindicated.
★ Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang
While next-token prediction is considered a promising path towards artificial
general intelligence, it has struggled to excel in multimodal tasks, which are
still dominated by diffusion models (e.g., Stable Diffusion) and compositional
approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a
new suite of state-of-the-art multimodal models trained solely with next-token
prediction. By tokenizing images, text, and videos into a discrete space, we
train a single transformer from scratch on a mixture of multimodal sequences.
Emu3 outperforms several well-established task-specific models in both
generation and perception tasks, surpassing flagship models such as SDXL and
LLaVA-1.6, while eliminating the need for diffusion or compositional
architectures. Emu3 is also capable of generating high-fidelity video via
predicting the next token in a video sequence. We simplify complex multimodal
model designs by converging on a singular focus: tokens, unlocking great
potential for scaling both during training and inference. Our results
demonstrate that next-token prediction is a promising path towards building
general multimodal intelligence beyond language. We open-source key techniques
and models to support further research in this direction.
comment: Project Page: https://emu.baai.ac.cn
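The training objective reduces to ordinary next-token prediction over a mixed token stream; the vocabulary split and token ids below are illustrative:

    import torch
    import torch.nn.functional as F

    VOCAB = 1000  # e.g., text ids in [0, 500), visual-codebook ids in [500, 1000)
    seq = torch.tensor([[3, 17, 42, 501, 733, 642, 999]])  # text + image tokens

    logits = torch.randn(1, seq.shape[1], VOCAB)  # stand-in for the transformer
    loss = F.cross_entropy(                       # predict token t+1 from prefix
        logits[:, :-1].reshape(-1, VOCAB), seq[:, 1:].reshape(-1))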
★ MCUBench: A Benchmark of Tiny Object Detectors on MCUs
Sudhakar Sah, Darshan C. Ganji, Matteo Grimaldi, Ravish Kumar, Alexander Hoffman, Honnesh Rohmetra, Ehsan Saboori
We introduce MCUBench, a benchmark featuring over 100 YOLO-based object
detection models evaluated on the VOC dataset across seven different MCUs. This
benchmark provides detailed data on average precision, latency, RAM, and Flash
usage for various input resolutions and YOLO-based one-stage detectors. By
conducting a controlled comparison with a fixed training pipeline, we collect
comprehensive performance metrics. Our Pareto-optimal analysis shows that
integrating modern detection heads and training techniques allows various YOLO
architectures, including legacy models like YOLOv3, to achieve a highly
efficient tradeoff between mean Average Precision (mAP) and latency. MCUBench
serves as a valuable tool for benchmarking the MCU performance of contemporary
object detectors and aids in model selection based on specific constraints.
comment: Code and data are available at
https://github.com/Deeplite/deeplite-torch-zoo
★ Positional Encoder Graph Quantile Neural Networks for Geographic Data
Positional Encoder Graph Neural Networks (PE-GNNs) are a leading approach for
modeling continuous spatial data. However, they often fail to produce
calibrated predictive distributions, limiting their effectiveness for
uncertainty quantification. We introduce the Positional Encoder Graph Quantile
Neural Network (PE-GQNN), a novel method that integrates PE-GNNs, Quantile
Neural Networks, and recalibration techniques in a fully nonparametric
framework, requiring minimal assumptions about the predictive distributions. We
propose a new network architecture that, when combined with a quantile-based
loss function, yields accurate and reliable probabilistic models without
increasing computational complexity. Our approach provides a flexible, robust
framework for conditional density estimation, applicable beyond spatial data
contexts. We further introduce a structured method for incorporating a KNN
predictor into the model while avoiding data leakage through the GNN layer
operation. Experiments on benchmark datasets demonstrate that PE-GQNN
significantly outperforms existing state-of-the-art methods in both predictive
accuracy and uncertainty quantification.
comment: 17 main text pages, 4 figures
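The quantile-based loss mentioned above is the standard pinball loss; a minimal PyTorch version follows (PE-GQNN's contribution lies in the architecture and recalibration built around it):

    import torch

    def pinball_loss(pred, target, quantiles):
        # pred: (B, Q) predicted quantiles; target: (B,) observations;
        # quantiles: (Q,) levels in (0, 1). Under-prediction of high quantiles
        # and over-prediction of low quantiles are penalized asymmetrically.
        err = target.unsqueeze(1) - pred
        return torch.maximum(quantiles * err, (quantiles - 1.0) * err).mean()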
★ LW2G: Learning Whether to Grow for Prompt-based Continual Learning
Continual Learning (CL) aims to learn in non-stationary scenarios,
progressively acquiring and maintaining knowledge from sequential tasks. Recent
Prompt-based Continual Learning (PCL) has achieved remarkable performance with
Pre-Trained Models (PTMs). These approaches grow a pool of prompt sets by
adding a new set of prompts when learning each new task (prompt learning) and
adopt a matching mechanism to select the correct set for each testing sample
(prompt retrieval). Previous studies focus on the latter stage by
improving the matching mechanism to enhance Prompt Retrieval Accuracy (PRA). To
promote cross-task knowledge facilitation and form an effective and efficient
pool of prompt sets, we propose a plug-in module in the former stage to
Learn Whether to Grow (LW2G) based on the disparities between tasks.
Specifically, a shared set of prompts is utilized when several tasks share
certain commonalities, and a new set is added when there are significant
differences between the new task and previous tasks. Inspired by Gradient
Projection Continual Learning, our LW2G develops a metric called Hinder Forward
Capability (HFC) to measure the hindrance imposed on learning new tasks by
surgically modifying the original gradient onto the orthogonal complement of
the old feature space. With HFC, an automated scheme, the Dynamic Growing
Approach, adaptively learns whether to grow using a dynamic threshold. Furthermore, we
design a gradient-based constraint to ensure the consistency between the
updating prompts and pre-trained knowledge, and a prompt-weight reuse
strategy to enhance forward transfer. Extensive experiments show the
effectiveness of our method. The source codes are available at
https://github.com/RAIAN08/LW2G.
comment: Submitted to NeurIPS 2024
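The gradient-projection idea can be sketched as removing the component of a new-task gradient that lies in the old feature space; the hindrance ratio below is an illustrative reading of the HFC metric, not the paper's exact definition:

    import torch

    def project_out(grad, basis):
        # grad: (D,) flattened gradient; basis: (D, K) orthonormal basis of
        # the old tasks' feature space. The residual is the part of the
        # update that does not interfere with old knowledge.
        return grad - basis @ (basis.T @ grad)

    g = torch.randn(128)                           # toy flattened gradient
    B = torch.linalg.qr(torch.randn(128, 16)).Q    # orthonormal basis (D, K)
    # Fraction of the gradient sacrificed by the projection: large values
    # suggest learning the new task is hindered and a new prompt set may be
    # warranted (in the spirit of LW2G's HFC).
    hindrance = 1.0 - project_out(g, B).norm() / g.norm()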
★ Space-time 2D Gaussian Splatting for Accurate Surface Reconstruction under Complex Dynamic Scenes
Previous surface reconstruction methods suffer from either low geometric
accuracy or lengthy training times when dealing with real-world complex dynamic
scenes involving multi-person activities and human-object interactions. To
tackle the dynamic contents and the occlusions in complex scenes, we present a
space-time 2D Gaussian Splatting approach. Specifically, to improve geometric
quality in dynamic scenes, we learn canonical 2D Gaussian splats and deform
them while enforcing, via depth and normal regularizers, that the Gaussian
disks lie on the surfaces of objects.
Further, to tackle the occlusion issues in complex scenes, we introduce a
compositional opacity deformation strategy, which further reduces the surface
recovery of those occluded areas. Experiments on real-world sparse-view video
datasets and monocular dynamic datasets demonstrate that our reconstructions
outperform state-of-the-art methods, especially on surface details.
The project page and more visualizations can be found at:
https://tb2-sy.github.io/st-2dgs/.
comment: Project page: https://tb2-sy.github.io/st-2dgs/
★ MinerU: An Open-Source Solution for Precise Document Content Extraction
Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, Conghui He
Document content analysis has been a crucial research area in computer
vision. Despite significant advancements in methods such as OCR, layout
detection, and formula recognition, existing open-source solutions struggle to
consistently deliver high-quality content extraction due to the diversity in
document types and content. To address these challenges, we present MinerU, an
open-source solution for high-precision document content extraction. MinerU
leverages the sophisticated PDF-Extract-Kit models to extract content from
diverse documents effectively and employs finely-tuned preprocessing and
postprocessing rules to ensure the accuracy of the final results. Experimental
results demonstrate that MinerU consistently achieves high performance across
various document types, significantly enhancing the quality and consistency of
content extraction. The MinerU open-source project is available at
https://github.com/opendatalab/MinerU.
comment: MinerU Technical Report
★ Classification and regression of trajectories rendered as images via 2D Convolutional Neural Networks
Trajectories can be regarded as time-series of coordinates, typically arising
from motile objects. Methods for trajectory classification are particularly
important for detecting different movement patterns, while regression methods
are needed to compute motility metrics and for forecasting. Recent advances in computer vision
have facilitated the processing of trajectories rendered as images via
artificial neural networks with 2D convolutional layers (CNNs). This approach
leverages the capability of CNNs to learn spatial hierarchies of features from
images, necessary to recognize complex shapes. Moreover, it overcomes the
limitation of other machine learning methods that require input trajectories
with a fixed number of points. However, rendering trajectories as images can
introduce poorly investigated artifacts such as information loss due to the
plotting of coordinates on a discrete grid, and spectral changes due to line
thickness and aliasing. In this study, we investigate the effectiveness of CNNs
for solving classification and regression problems from synthetic trajectories
that have been rendered as images using different modalities. The parameters
considered in this study include line thickness, image resolution, usage of
motion history (color-coding of the temporal component) and anti-aliasing.
Results highlight the importance of choosing an appropriate image resolution
according to model depth and motion history in applications where movement
direction is critical.
comment: 13 pages, 5 figures
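A minimal rasterization sketch with PIL, covering line thickness and a crude form of anti-aliasing via supersampling (motion-history color-coding is omitted for brevity):

    from PIL import Image, ImageDraw

    def render_trajectory(xy, size=128, width=2, supersample=4):
        # xy: iterable of (x, y) coordinates normalized to [0, 1]^2.
        # Draw at higher resolution, then downsample to approximate
        # anti-aliasing; set supersample=1 for a hard-edged (aliased) render.
        s = size * supersample
        img = Image.new("L", (s, s), 0)
        draw = ImageDraw.Draw(img)
        pts = [(x * (s - 1), y * (s - 1)) for x, y in xy]
        draw.line(pts, fill=255, width=width * supersample)
        return img.resize((size, size), Image.LANCZOS)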
★ YOLOv8-ResCBAM: YOLOv8 Based on An Effective Attention Module for Pediatric Wrist Fracture Detection ICONIP 2024
Wrist trauma and even fractures occur frequently in daily life, particularly
among children, who account for a significant proportion of fracture cases.
Before performing surgery, surgeons often request patients to undergo X-ray
imaging first, and prepare for the surgery based on the analysis of the X-ray
images. With the development of neural networks, You Only Look Once (YOLO)
series models have been widely used in fracture detection for Computer-Assisted
Diagnosis, where the YOLOv8 model has obtained satisfactory results.
Applying the attention modules to neural networks is one of the effective
methods to improve the model performance. This paper proposes YOLOv8-ResCBAM,
which incorporates Convolutional Block Attention Module integrated with
resblock (ResCBAM) into the original YOLOv8 network architecture. The
experimental results on the GRAZPEDWRI-DX dataset demonstrate that the mean
Average Precision at an Intersection over Union threshold of 0.5 (mAP 50) of
the proposed model increased from 63.6% for the original YOLOv8 model to
65.8%, achieving state-of-the-art performance. The implementation code
is available at
https://github.com/RuiyangJu/Fracture_Detection_Improved_YOLOv8.
comment: Accepted by ICONIP 2024. arXiv admin note: substantial text overlap
with arXiv:2402.09329
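A hedged sketch of a CBAM block wrapped in a residual connection; the exact channel sizes and placement inside the YOLOv8 architecture follow the authors' design:

    import torch
    import torch.nn as nn

    class ResCBAM(nn.Module):
        def __init__(self, c, r=16):
            super().__init__()
            self.mlp = nn.Sequential(  # shared MLP for channel attention
                nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c))
            self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

        def forward(self, x):
            b, c, _, _ = x.shape
            ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) +
                               self.mlp(x.amax(dim=(2, 3)))).view(b, c, 1, 1)
            y = x * ca                      # channel attention
            sa = torch.sigmoid(self.spatial(torch.cat(
                [y.mean(1, keepdim=True), y.amax(1, keepdim=True)], dim=1)))
            return x + y * sa               # spatial attention + residual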
★ Early diagnosis of Alzheimer's disease from MRI images with deep learning model SP
It is acknowledged that the most common cause of dementia worldwide is
Alzheimer's disease (AD). This condition progresses in severity from mild to
severe and interferes with people's everyday routines. Early diagnosis plays a
critical role in patient care and clinical trials. Convolutional neural
networks (CNNs) are used to create a framework for identifying specific disease
features from MRI scans. Classification of dementia involves approaches such as
medical history review, neuropsychological tests, and magnetic resonance
imaging (MRI). However, the image dataset obtained from Kaggle suffers from
significant class imbalance, which requires balancing the number of samples
from each class. In this article, to address this imbalance,
the Synthetic Minority Oversampling Technique (SMOTE) is utilized. Furthermore,
a pre-trained convolutional neural network has been applied to the DEMNET
dementia network to extract key features from AD images. The proposed model
achieved an impressive accuracy of 98.67%.
comment: 7 pages, 3 figures, Presented at the 20-th CSI International
Symposium on Artificial Intelligence and Signal Processing (AISP) 21-22
February, 2024, Mazandaran University of Science and Technology, Babol, Iran
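The SMOTE step is available off the shelf in imbalanced-learn; a minimal sketch on synthetic stand-in features (the real pipeline balances the Kaggle MRI classes before training):

    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Synthetic stand-in for imbalanced image features: 90% / 10% classes.
    X, y = make_classification(n_samples=300, n_features=32,
                               weights=[0.9, 0.1], random_state=0)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))  # minority oversampled to parity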
★ EyeTrAES: Fine-grained, Low-Latency Eye Tracking via Adaptive Event Slicing
Eye-tracking technology has gained significant attention in recent years due
to its wide range of applications in human-computer interaction, virtual and
augmented reality, and wearable health. Traditional RGB camera-based
eye-tracking systems often struggle with poor temporal resolution and
computational constraints, limiting their effectiveness in capturing rapid eye
movements. To address these limitations, we propose EyeTrAES, a novel approach
using neuromorphic event cameras for high-fidelity tracking of natural
pupillary movement that shows significant kinematic variance. One of EyeTrAES's
highlights is the use of a novel adaptive windowing/slicing algorithm that
ensures just the right amount of descriptive asynchronous event data
accumulation within an event frame, across a wide range of eye movement
patterns. EyeTrAES then applies lightweight image processing functions over
accumulated event frames from just a single eye to perform pupil segmentation
and tracking. We show that these methods boost pupil tracking fidelity by 6+%,
achieving IoU ≈ 92%, while incurring at least 3x lower latency than competing
pure event-based eye tracking alternatives [38]. We additionally demonstrate
that the microscopic pupillary motion captured by EyeTrAES exhibits distinctive
variations across individuals and can thus serve as a biometric fingerprint.
For robust user authentication, we train a lightweight per-user Random Forest
classifier using a novel feature vector of short-term pupillary kinematics,
comprising a sliding window of pupil (location, velocity, acceleration)
triples. Experimental studies with two different datasets demonstrate that the
EyeTrAES-based authentication technique can simultaneously achieve high
authentication accuracy (≈0.82) and low processing latency (≈12 ms), and
significantly outperform multiple state-of-the-art competitive baselines.
comment: 32 pages, 15 figures
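Adaptive event slicing can be pictured as count-based accumulation with a staleness bound; the rule below is a simplified stand-in for EyeTrAES's kinematics-aware windowing:

    import numpy as np

    def adaptive_slices(timestamps, count=3000, max_dt=0.02):
        # timestamps: sorted 1D array of event times (seconds). Close a slice
        # once it holds `count` events (fast motion -> short windows) or spans
        # more than max_dt seconds (slow motion -> bounded staleness).
        slices, start = [], 0
        for i in range(len(timestamps)):
            if (i - start + 1 >= count or
                    timestamps[i] - timestamps[start] >= max_dt):
                slices.append((start, i + 1))
                start = i + 1
        if start < len(timestamps):
            slices.append((start, len(timestamps)))
        return slices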
★ MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation
In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced
rapidly, yet the increasing size of models conflicts with the limited
computational capabilities of Embodied AI platforms. To address this challenge,
we aim to achieve both high model performance and practical deployability.
Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in
Embodied AI. This paper introduces a two-stage knowledge distillation
framework, producing a student model, MiniVLN, and showcasing the significant
potential of distillation techniques in developing lightweight models. The
proposed method aims to capture fine-grained knowledge during the pretraining
phase and navigation-specific knowledge during the fine-tuning phase. Our
findings indicate that the two-stage distillation approach is more effective in
narrowing the performance gap between the teacher model and the student model
compared to single-stage distillation. On the public R2R and REVERIE
benchmarks, MiniVLN achieves performance on par with the teacher model while
having only about 12% of the teacher model's parameter count.
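The generic building block behind such distillation is a temperature-scaled KL loss; MiniVLN applies it in two stages with navigation-specific targets, so the sketch below shows only the common core:

    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, T=4.0):
        # Hinton-style distillation: match softened teacher and student
        # distributions; the T*T factor keeps gradient magnitudes comparable.
        p_t = F.softmax(teacher_logits / T, dim=-1)
        log_p_s = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)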
★ Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs
Vision-and-Language Navigation (VLN) tasks require an agent to follow textual
instructions to navigate through 3D environments. Traditional approaches use
supervised learning methods, relying heavily on domain-specific datasets to
train VLN models. Recent methods try to utilize closed-source large language
models (LLMs) like GPT-4 to solve VLN tasks in a zero-shot manner, but face
challenges related to expensive token costs and potential data breaches in
real-world applications. In this work, we introduce Open-Nav, a novel study
that explores open-source LLMs for zero-shot VLN in the continuous environment.
Open-Nav employs a spatial-temporal chain-of-thought (CoT) reasoning approach
to break down tasks into instruction comprehension, progress estimation, and
decision-making. It enhances scene perceptions with fine-grained object and
spatial knowledge to improve the LLM's reasoning in navigation. Our extensive
experiments in both simulated and real-world environments demonstrate that
Open-Nav achieves competitive performance compared to using closed-source LLMs.
★ Excavating in the Wild: The GOOSE-Ex Dataset for Semantic Segmentation
The successful deployment of deep learning-based techniques for autonomous
systems is highly dependent on the data availability for the respective system
in its deployment environment. Especially for unstructured outdoor
environments, very few datasets exist for even fewer robotic platforms and
scenarios. In an earlier work, we presented the German Outdoor and Offroad
Dataset (GOOSE) framework along with 10000 multimodal frames from an offroad
vehicle to enhance the perception capabilities in unstructured environments. In
this work, we address the generalizability of the GOOSE framework. To
accomplish this, we open-source the GOOSE-Ex dataset, which contains an additional
5000 labeled multimodal frames from various completely different environments,
recorded on a robotic excavator and a quadruped platform. We perform a
comprehensive analysis of the semantic segmentation performance on different
platforms and sensor modalities in unseen environments. In addition, we
demonstrate how the combined datasets can be utilized for different downstream
applications or competitions such as offroad navigation, object manipulation or
scene completion. The dataset, its platform documentation and pre-trained
state-of-the-art models for offroad perception will be made available on
https://goose-dataset.de/.
comment: Submitted to IEEE for review
★ Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation
Knowledge distillation has become widely recognized for its ability to
transfer knowledge from a large teacher network to a compact and more
streamlined student network. Traditional knowledge distillation methods
primarily follow a teacher-oriented paradigm that imposes the task of learning
the teacher's complex knowledge onto the student network. However, significant
disparities in model capacity and architectural design hinder the student's
comprehension of the complex knowledge imparted by the teacher, resulting in
sub-optimal performance. This paper introduces a novel student-oriented
perspective, refining the teacher's knowledge to better align with the
student's needs and thereby improving knowledge transfer effectiveness.
Specifically, we present the Student-Oriented Knowledge Distillation (SoKD),
which incorporates a learnable feature augmentation strategy during training to
dynamically refine the teacher's knowledge for the student. Furthermore, we
deploy the Distinctive Area Detection Module (DAM) to identify areas of mutual
interest between the teacher and student, concentrating knowledge transfer
within these critical areas to avoid transferring irrelevant information. This
customized module ensures a more focused and effective knowledge distillation
process. Our approach, functioning as a plug-in, could be integrated with
various knowledge distillation methods. Extensive experimental results
demonstrate the efficacy and generalizability of our method.
★ DualDn: Dual-domain Denoising via Differentiable ISP ECCV 2024
Image denoising is a critical component in a camera's Image Signal Processing
(ISP) pipeline. There are two typical ways to inject a denoiser into the ISP
pipeline: applying a denoiser directly to captured raw frames (raw domain) or
to the ISP's output sRGB images (sRGB domain). However, both approaches have
their limitations. Residual noise from raw-domain denoising can be amplified by
the subsequent ISP processing, and the sRGB domain struggles to handle
spatially varying noise since it only sees noise distorted by the ISP.
Consequently, most raw or sRGB domain denoising works only for specific noise
distributions and ISP configurations. To address these challenges, we propose
DualDn, a novel learning-based dual-domain denoising approach. Unlike previous
single-domain denoising, DualDn consists of two denoising networks: one in the
raw domain and one in the sRGB domain. The raw domain denoising adapts to
sensor-specific noise as well as spatially varying noise levels, while the sRGB
domain denoising adapts to ISP variations and removes residual noise amplified
by the ISP. Both denoising networks are connected with a differentiable ISP,
which is trained end-to-end and discarded during the inference stage. With this
design, DualDn achieves greater generalizability compared to most
learning-based denoising methods, as it can adapt to different unseen noises,
ISP parameters, and even novel ISP pipelines. Experiments show that DualDn
achieves state-of-the-art performance and can adapt to different denoising
architectures. Moreover, DualDn can be used as a plug-and-play denoising module
with real cameras without retraining, and still demonstrate better performance
than commercial on-camera denoising. The project website is available at:
https://openimaginglab.github.io/DualDn/
comment: Accepted at ECCV 2024, Project page:
https://openimaginglab.github.io/DualDn/
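Structurally, the pipeline chains a raw-domain denoiser, a differentiable ISP, and an sRGB-domain denoiser; the modules below are toy placeholders that only illustrate the end-to-end wiring:

    import torch
    import torch.nn as nn

    class DualDomainDenoise(nn.Module):
        def __init__(self):
            super().__init__()
            self.raw_dn = nn.Conv2d(1, 1, 3, padding=1)   # stand-in raw denoiser
            self.srgb_dn = nn.Conv2d(3, 3, 3, padding=1)  # stand-in sRGB denoiser
            self.gains = nn.Parameter(torch.ones(3))      # toy differentiable ISP

        def forward(self, raw):                            # raw: (N, 1, H, W)
            raw = self.raw_dn(raw)
            srgb = (raw.repeat(1, 3, 1, 1) *
                    self.gains.view(1, 3, 1, 1)).clamp(0.0, 1.0)
            return self.srgb_dn(srgb)  # at inference, a real ISP replaces the toy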
★ Relighting from a Single Image: Datasets and Deep Intrinsic-based Architecture
Single image scene relighting aims to generate a realistic new version of an
input image so that it appears to be illuminated by a new target light
condition. Although existing works have explored this problem from various
perspectives, generating relit images under arbitrary light conditions remains
highly challenging, and related datasets are scarce. Our work addresses this
problem from both the dataset and methodological perspectives. We propose two
new datasets: a synthetic dataset with the ground truth of intrinsic components
and a real dataset collected under laboratory conditions. These datasets
alleviate the scarcity of existing datasets. To incorporate physical
consistency in the relighting pipeline, we establish a two-stage network based
on intrinsic decomposition, giving outputs at intermediate steps, thereby
introducing physical constraints. When the training set lacks ground truth for
intrinsic decomposition, we introduce an unsupervised module to ensure that the
intrinsic outputs are satisfactory. Our method outperforms the state-of-the-art
methods in performance, as tested on both existing datasets and our newly
developed datasets. Furthermore, pretraining our method or other prior methods
using our synthetic dataset can enhance their performance on other datasets.
Since our method can accommodate any light conditions, it is capable of
producing animated results. The dataset, method, and videos are publicly
available.
comment: Accepted for publication as a Regular paper in the IEEE Transactions
on Multimedia
★ State-of-the-Art Periorbital Distance Prediction and Disease Classification Using Periorbital Features
George R. Nahass, Ghasem Yazdanpanah, Madison Cheung, Alex Palacios, Jeffery Peterson, Kevin Heinze, Sasha Hubschman, Chad A. Purnell, Pete Setabutr, Ann Q. Tran, Darvin Yi
Periorbital distances and features around the eyes and lids hold valuable
information for disease quantification and monitoring of surgical and medical
intervention. These distances are commonly measured manually, a process that is
both subjective and highly time-consuming. Here, we set out to develop three
deep-learning methods for segmentation and periorbital distance prediction, and
also evaluate the utility of periorbital distances for disease classification.
The MAE of our deep learning predicted distances was less than or very close to
the error observed between trained human annotators. We compared our models to
the current state-of-the-art (SOTA) method for periorbital distance prediction
and found that our methods outperformed SOTA on all of our datasets on all but
one periorbital measurement. We also show that robust segmentation can be
achieved on diseased eyes using models trained on open-source, healthy eyes,
and that periorbital distances can be used as high-quality features in
downstream classification models. Leveraging segmentation networks as
intermediary steps in classification has broad implications for increasing the
generalizability of classification models in ophthalmic plastic and
craniofacial surgery by avoiding the out-of-distribution problem observed in
traditional convolutional neural networks.
comment: 16 pages, 4 figures, 4 tables
★ Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations
We propose a novel framework that leverages Visual Question Answering (VQA)
models to automate the evaluation of LLM-generated data visualizations.
Traditional evaluation methods often rely on human judgment, which is costly
and unscalable, or focus solely on data accuracy, neglecting the effectiveness
of visual communication. By employing VQA models, we assess data representation
quality and the general communicative clarity of charts. Experiments were
conducted using two leading VQA benchmark datasets, ChartQA and PlotQA, with
visualizations generated by OpenAI's GPT-3.5 Turbo and Meta's Llama 3.1
70B-Instruct models. Our results indicate that LLM-generated charts do not
match the accuracy of the original non-LLM-generated charts based on VQA
performance measures. Moreover, while our results demonstrate that few-shot
prompting significantly boosts the accuracy of chart generation, considerable
progress remains to be made before LLMs can fully match the precision of
human-generated graphs. This underscores the importance of our work, which
expedites the research process by enabling rapid iteration without the need for
human annotation, thus accelerating advancements in this field.
★ Enhancing Explainability in Multimodal Large Language Models Using Ontological Context
Recently, there has been a growing interest in Multimodal Large Language
Models (MLLMs) due to their remarkable potential in various tasks integrating
different modalities, such as image and text, as well as applications such as
image captioning and visual question answering. However, such models still face
challenges in accurately captioning and interpreting specific visual concepts
and classes, particularly in domain-specific applications. We argue that
integrating domain knowledge in the form of an ontology can significantly
address these issues. In this work, as a proof of concept, we propose a new
framework that combines ontology with MLLMs to classify images of plant
diseases. Our method uses concepts about plant diseases from an existing
disease ontology to query MLLMs and extract relevant visual concepts from
images. Then, we use the reasoning capabilities of the ontology to classify the
disease according to the identified concepts. Ensuring that the model
accurately uses the concepts describing the disease is crucial in
domain-specific applications. By employing an ontology, we can assist in
verifying this alignment. Additionally, using the ontology's inference
capabilities increases transparency, explainability, and trust in the
decision-making process while serving as a judge by checking if the annotations
of the concepts by MLLMs are aligned with those in the ontology and displaying
the rationales behind their errors. Our framework offers a new direction for
synergizing ontologies and MLLMs, supported by an empirical study using
different well-known MLLMs.
★ Effectiveness of learning-based image codecs on fingerprint storage
The success of learning-based coding techniques and the development of
learning-based image coding standards, such as JPEG-AI, point towards the
adoption of such solutions in different fields, including the storage of
biometric data, like fingerprints. However, the peculiar nature of
learning-based compression artifacts poses several issues concerning their
impact and effectiveness on extracting biometric features and landmarks, e.g.,
minutiae. This problem is exacerbated by the fact that most models are
trained on natural color images, whose characteristics are very different from
typical biometric images, e.g., fingerprint or iris pictures. These issues
therefore warrant careful investigation, as such analysis remains largely
unexplored.
This study represents the first investigation into the adaptability of
learning-based image codecs in the storage of fingerprint images by measuring
their impact on the extraction and characterization of minutiae. Experimental
results show that at a fixed rate point, learned solutions considerably
outperform previous fingerprint coding standards, like JPEG2000, both in terms
of distortion and minutiae preservation. Indeed, experimental results prove
that the peculiarities of learned compression artifacts do not prevent
automatic fingerprint identification (since minutiae types and locations are
not significantly altered), nor do they compromise image quality for human
visual inspection (with gains of 47.8% in BD-rate and +3.97 dB in PSNR,
respectively).
comment: Accepted at WIFS 2024
★ A Generalized Tensor Formulation for Hyperspectral Image Super-Resolution Under General Spatial Blurring
Hyperspectral super-resolution is commonly accomplished by fusing a
hyperspectral image of low spatial resolution with a multispectral image of
high spatial resolution, and many tensor-based approaches to this task have
been recently proposed. Yet, it is assumed in such tensor-based methods that
the spatial-blurring operation that creates the observed hyperspectral image
from the desired super-resolved image is separable into independent horizontal
and vertical blurring. Recent work has argued that such separable spatial
degradation is ill-equipped to model the operation of real sensors which may
exhibit, for example, anisotropic blurring. To accommodate this fact, a
generalized tensor formulation based on a Kronecker decomposition is proposed
to handle any general spatial-degradation matrix, including those that are not
separable as previously assumed. Analysis of the generalized formulation
reveals conditions under which exact recovery of the desired super-resolved
image is guaranteed, and a practical algorithm for such recovery, driven by a
blockwise-group-sparsity regularization, is proposed. Extensive experimental
results demonstrate that the proposed generalized tensor approach outperforms
not only traditional matrix-based techniques but also state-of-the-art
tensor-based methods; the gains with respect to the latter are especially
significant in cases of anisotropic spatial blurring.
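In LaTeX, the contrast between the two degradation models can be written as follows (symbols are illustrative; the paper's notation may differ):

    % Prior tensor methods assume a separable spatial blur:
    %   \mathbf{B} = \mathbf{B}_1 \otimes \mathbf{B}_2
    % The generalized formulation covers any blur via a Kronecker decomposition:
    \mathbf{B} \;=\; \sum_{k=1}^{K} \mathbf{B}_1^{(k)} \otimes \mathbf{B}_2^{(k)},
    \qquad K = 1 \;\Longleftrightarrow\; \text{separable blurring.}

Since any matrix of compatible size admits such a decomposition for sufficiently large K, non-separable (e.g., anisotropic) blurs are covered.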
★ Multi-modal Medical Image Fusion For Non-Small Cell Lung Cancer Classification
The early detection and nuanced subtype classification of non-small cell lung
cancer (NSCLC), a predominant cause of cancer mortality worldwide, pose a
critical and complex challenge. In this paper, we introduce an innovative
integration of multi-modal data, synthesizing fused medical imaging (CT and PET
scans) with clinical health records and genomic data. This unique fusion
methodology leverages advanced machine learning models, notably MedClip and
BEiT, for sophisticated image feature extraction, setting a new standard in
computational oncology. Our research surpasses existing approaches, as
evidenced by a substantial enhancement in NSCLC detection and classification
precision. The results showcase notable improvements across key performance
metrics, including accuracy, precision, recall, and F1-score. Specifically, our
leading multi-modal classifier model records an impressive accuracy of 94.04%.
We believe that our approach has the potential to transform NSCLC diagnostics,
facilitating earlier detection and more effective treatment planning and,
ultimately, leading to superior patient outcomes in lung cancer care.
★ 3DPX: Single Panoramic X-ray Analysis Guided by 3D Oral Structure Reconstruction
Xiaoshuang Li, Zimo Huang, Mingyuan Meng, Eduardo Delamare, Dagan Feng, Lei Bi, Bin Sheng, Lingyong Jiang, Bo Li, Jinman Kim
Panoramic X-ray (PX) is a prevalent modality in dentistry practice owing to
its wide availability and low cost. However, as a 2D projection of a 3D
structure, PX suffers from anatomical information loss and PX diagnosis is
limited compared to that with 3D imaging modalities. 2D-to-3D reconstruction
methods have been explored for the ability to synthesize the absent 3D
anatomical information from 2D PX for use in PX image analysis. However, there
are challenges in leveraging such 3D synthesized reconstructions. First,
inferring 3D depth from 2D images remains a challenging task with limited
accuracy. The second challenge is the joint analysis of 2D PX with its 3D
synthesized counterpart, with the aim to maximize the 2D-3D synergy while
minimizing the errors arising from the synthesized image. In this study, we
propose a new method termed 3DPX - PX image analysis guided by 2D-to-3D
reconstruction, to overcome these challenges. 3DPX consists of (i) a novel
progressive reconstruction network to improve 2D-to-3D reconstruction and, (ii)
a contrastive-guided bidirectional multimodality alignment module for 3D-guided
2D PX classification and segmentation tasks. The reconstruction network
progressively reconstructs 3D images with knowledge imposed on the intermediate
reconstructions at multiple pyramid levels and incorporates Multilayer
Perceptrons to improve semantic understanding. The downstream networks leverage
the reconstructed images as 3D anatomical guidance to the PX analysis through
feature alignment, which increases the 2D-3D synergy with bidirectional feature
projection and decreases the impact of potential errors with contrastive
guidance. Extensive experiments on two oral datasets involving 464 studies
demonstrate that 3DPX outperforms the state-of-the-art methods in various tasks
including 2D-to-3D reconstruction, PX classification and lesion segmentation.
★ Learning from Pattern Completion: Self-supervised Controllable Generation
The human brain exhibits a strong ability to spontaneously associate
different visual attributes of the same or similar visual scene, such as
associating sketches and graffiti with real-world visual objects, usually
without supervisory information. In contrast, in the field of artificial
intelligence, controllable generation methods like ControlNet heavily rely on
annotated training datasets such as depth maps, semantic segmentation maps, and
poses, which limits these methods' scalability. Inspired by the neural mechanisms
that may contribute to the brain's associative power, specifically the cortical
modularization and hippocampal pattern completion, here we propose a
self-supervised controllable generation (SCG) framework. Firstly, we introduce
an equivariant constraint to promote inter-module independence and intra-module
correlation in a modular autoencoder network, thereby achieving functional
specialization. Subsequently, based on these specialized modules, we employ a
self-supervised pattern completion approach for controllable generation
training. Experimental results demonstrate that the proposed modular
autoencoder effectively achieves functional specialization, including the
modular processing of color, brightness, and edge detection, and exhibits
brain-like features including orientation selectivity, color antagonism, and
center-surround receptive fields. Through self-supervised training, associative
generation capabilities spontaneously emerge in SCG, demonstrating excellent
generalization ability to various tasks such as associative generation on
painting, sketches, and ancient graffiti. Compared to the previous
representative method ControlNet, our proposed approach not only demonstrates
superior robustness in more challenging high-noise scenarios but also possesses
more promising scalability potential due to its self-supervised manner.
★ A Novel Unified Architecture for Low-Shot Counting by Detection and Segmentation NeurIPS2024
Low-shot object counters estimate the number of objects in an image using few
or no annotated exemplars. Objects are localized by matching them to
prototypes, which are constructed by unsupervised image-wide object appearance
aggregation. Due to potentially diverse object appearances, the existing
approaches often lead to overgeneralization and false positive detections.
Furthermore, the best-performing methods train object localization with a
surrogate loss that predicts a unit Gaussian at each object center. This loss
is sensitive to annotation error and hyperparameters, and it does not directly
optimize the detection task, leading to suboptimal counts. We introduce GeCo, a
novel low-shot counter that achieves accurate object detection, segmentation,
and count estimation in a unified architecture. GeCo robustly generalizes the
prototypes across object appearances through a novel dense object query
formulation. In addition, a novel counting loss is proposed, that directly
optimizes the detection task and avoids the issues of the standard surrogate
loss. GeCo surpasses the leading few-shot detection-based counters by
$\sim$25\% in the total count MAE, achieves superior detection accuracy and
sets a solid new state-of-the-art result across all low-shot counting setups.
comment: Accepted to NeurIPS2024
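For context, a hedged sketch of the standard surrogate loss the abstract
criticizes (not of GeCo's proposed counting loss): a unit Gaussian is rendered
at each annotated center and the network regresses the resulting density map.
Function names and sigma are illustrative assumptions.

    import torch

    def gaussian_density_target(centers, h, w, sigma=2.0):
        """centers: (N, 2) tensor of (y, x) object centers."""
        ys = torch.arange(h).view(h, 1, 1).float()
        xs = torch.arange(w).view(1, w, 1).float()
        cy = centers[:, 0].view(1, 1, -1).float()
        cx = centers[:, 1].view(1, 1, -1).float()
        g = torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        return g.sum(dim=-1)  # (h, w) density map; integral ~ object count

    centers = torch.tensor([[10, 12], [30, 40]])
    target = gaussian_density_target(centers, 64, 64)
    pred = torch.rand(64, 64, requires_grad=True)
    loss = torch.nn.functional.mse_loss(pred, target)
    loss.backward()
    # Sensitivity to `sigma` and to noisy center annotations is exactly the
    # hyperparameter/annotation-error issue the abstract points out.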
★ Image-guided topic modeling for interpretable privacy classification ECCV 2024
Predicting and explaining the private information contained in an image in
human-understandable terms is a complex and contextual task. This task is
challenging even for large language models. To facilitate the understanding of
privacy decisions, we propose to predict image privacy based on a set of
natural language content descriptors. These content descriptors are associated
with privacy scores that reflect how people perceive image content. We generate
descriptors with our novel Image-guided Topic Modeling (ITM) approach. ITM
leverages, via multimodality alignment, both vision information and image
textual descriptions from a vision language model. We use the ITM-generated
descriptors to learn a privacy predictor, Priv$\times$ITM, whose decisions are
interpretable by design. Our Priv$\times$ITM classifier outperforms the
reference interpretable method by 5 percentage points in accuracy and performs
comparably to the current non-interpretable state-of-the-art model.
comment: Paper accepted at the eXCV Workshop at ECCV 2024. Supplementary
material included. Code available at https://github.com/idiap/itm
★ Exploiting Motion Prior for Accurate Pose Estimation of Dashboard Cameras
Dashboard cameras (dashcams) record millions of driving videos daily,
offering a valuable potential data source for various applications, including
driving map production and updates. A necessary step for utilizing these
dashcam data involves the estimation of camera poses. However, the low-quality
images captured by dashcams, characterized by motion blurs and dynamic objects,
pose challenges for existing image-matching methods in accurately estimating
camera poses. In this study, we propose a precise pose estimation method for
dashcam images, leveraging the inherent camera motion priors. Typically, image
sequences captured by dash cameras exhibit pronounced motion priors, such as
forward movement or lateral turns, which serve as essential cues for
correspondence estimation. Building upon this observation, we devise a pose
regression module aimed at learning camera motion priors, subsequently
integrating these priors into both the correspondence and pose estimation
processes. Experiments show that, on a real dashcam dataset, our method
outperforms the baseline by 22% in pose-estimation AUC@5\textdegree, and that
it can estimate poses for 19% more images with lower reprojection error in
Structure from Motion (SfM).
★ When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation
This study investigates the application and performance of the Segment
Anything Model 2 (SAM2) in the challenging task of video camouflaged object
segmentation (VCOS). VCOS involves detecting objects that blend seamlessly
into their surroundings in videos due to similar colors and textures, poor
lighting conditions, etc. Compared to objects in normal scenes, camouflaged
objects are much more difficult to detect. SAM2, a video foundation model, has
shown potential in various tasks, but its effectiveness in dynamic camouflaged
scenarios remains under-explored. We present a comprehensive study of SAM2's
ability in VCOS. First, we assess SAM2's performance on camouflaged
video datasets using different models and prompts (click, box, and mask).
Second, we explore the integration of SAM2 with existing multimodal large
language models (MLLMs) and VCOS methods. Third, we specifically adapt SAM2 by
fine-tuning it on the video camouflaged dataset. Our comprehensive experiments
demonstrate that SAM2 has an excellent zero-shot ability to detect camouflaged
objects in videos. We also show that this ability could be further improved by
specifically adjusting SAM2's parameters for VCOS. The code will be available
at https://github.com/zhoustan/SAM2-VCOS
comment: Technical report
★ Enhanced Convolution Neural Network with Optimized Pooling and Hyperparameter Tuning for Network Intrusion Detection
Network Intrusion Detection Systems (NIDS) are essential for protecting
computer networks from malicious activities, including Denial of Service (DoS),
Probing, User-to-Root (U2R), and Remote-to-Local (R2L) attacks. Without
effective NIDS, networks are vulnerable to significant security breaches and
data loss. Machine learning techniques provide a promising approach to enhance
NIDS by automating threat detection and improving accuracy. In this research,
we propose an Enhanced Convolutional Neural Network (EnCNN) for NIDS and
evaluate its performance using the KDDCUP'99 dataset. Our methodology includes
comprehensive data preprocessing, exploratory data analysis (EDA), and feature
engineering. We compare EnCNN with various machine learning algorithms,
including Logistic Regression, Decision Trees, Support Vector Machines (SVM),
and ensemble methods like Random Forest, AdaBoost, and Voting Ensemble. The
results show that EnCNN significantly improves detection accuracy, with a
notable 10% increase over state-of-the-art approaches. This demonstrates the
effectiveness of EnCNN in real-time network intrusion detection, offering a
robust solution for identifying and mitigating security threats, and enhancing
overall network resilience.
comment: 7 pages, 2 figures, 4 tables, conference paper
★ Unsupervised Fingerphoto Presentation Attack Detection With Diffusion Models
Smartphone-based contactless fingerphoto authentication has become a reliable
alternative to traditional contact-based fingerprint biometric systems owing to
rapid advances in smartphone camera technology. Despite its convenience,
fingerprint authentication through fingerphotos is more vulnerable to
presentation attacks, which has motivated recent research efforts towards
developing fingerphoto Presentation Attack Detection (PAD) techniques. However,
prior PAD approaches utilized supervised learning methods that require labeled
training data for both bona fide and attack samples. This can suffer from two
key issues, namely (i) generalization: the detection of novel presentation
attack instruments (PAIs) unseen in the training data, and (ii) scalability:
the collection of a large dataset of attack samples using different PAIs. To
address these challenges, we propose a novel unsupervised approach based on a
state-of-the-art deep-learning-based diffusion model, the Denoising Diffusion
Probabilistic Model (DDPM), which is trained solely on bona fide samples. The
proposed approach detects Presentation Attacks (PA) by calculating the
reconstruction similarity between the input and output pairs of the DDPM. We
present extensive experiments across three PAI datasets to test the accuracy
and generalization capability of our approach. The results show that the
proposed DDPM-based PAD method achieves significantly better detection error
rates on several PAI classes compared to other baseline unsupervised
approaches.
comment: Accepted by IJCB 2024
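A minimal sketch of the scoring idea, under the assumption that
`ddpm_reconstruct` stands in for the bona-fide-trained DDPM's
noise-then-denoise round trip; MSE is a placeholder for the paper's
reconstruction-similarity metric.

    import torch

    def pad_score(x, ddpm_reconstruct):
        """Higher score = worse reconstruction = more likely an attack."""
        with torch.no_grad():
            x_hat = ddpm_reconstruct(x)  # noise-then-denoise round trip
        return torch.mean((x - x_hat) ** 2, dim=(1, 2, 3))  # per-image error

    def is_attack(x, ddpm_reconstruct, threshold):
        # Threshold chosen on a bona-fide validation set (e.g., a percentile),
        # since no attack samples are available during unsupervised training.
        return pad_score(x, ddpm_reconstruct) > threshold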
★ Towards Integrating Epistemic Uncertainty Estimation into the Radiotherapy Workflow
The precision of contouring target structures and organs-at-risk (OAR) in
radiotherapy planning is crucial for ensuring treatment efficacy and patient
safety. Recent advancements in deep learning (DL) have significantly improved
OAR contouring performance, yet the reliability of these models, especially in
the presence of out-of-distribution (OOD) scenarios, remains a concern in
clinical settings. This application study explores the integration of epistemic
uncertainty estimation within the OAR contouring workflow to enable OOD
detection in clinically relevant scenarios, using specifically compiled data.
Furthermore, we introduce an advanced statistical method for OOD detection to
enhance the methodological framework of uncertainty estimation. Our empirical
evaluation demonstrates that epistemic uncertainty estimation is effective in
identifying instances where model predictions are unreliable and may require an
expert review. Notably, our approach achieves an AUC-ROC of 0.95 for OOD
detection, with a specificity of 0.95 and a sensitivity of 0.92 for implant
cases, underscoring its efficacy. This study addresses significant gaps in the
current research landscape, such as the lack of ground truth for uncertainty
estimation and limited empirical evaluations. Additionally, it provides a
clinically relevant application of epistemic uncertainty estimation in an
FDA-approved and widely used clinical solution for OAR segmentation from
Varian, a Siemens Healthineers company, highlighting its practical benefits.
comment: Keywords: Epistemic Uncertainty - Out-of-Distribution Detection - CT
Segmentation - OAR contouring - Radiotherapy
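One common estimator of epistemic uncertainty for segmentation is the mutual
information between predictions and model parameters, approximated with an
ensemble. The sketch below assumes this choice for illustration; the paper's
exact estimator and OOD statistic may differ.

    import torch

    def epistemic_uncertainty(probs):
        """probs: (M, C, H, W) softmax outputs from M ensemble members."""
        eps = 1e-8
        mean_p = probs.mean(dim=0)                              # (C, H, W)
        entropy_mean = -(mean_p * (mean_p + eps).log()).sum(0)  # total
        mean_entropy = -(probs * (probs + eps).log()).sum(1).mean(0)  # aleatoric
        return entropy_mean - mean_entropy  # mutual information, (H, W)

    probs = torch.softmax(torch.randn(5, 3, 64, 64), dim=1)
    mi = epistemic_uncertainty(probs)
    score = mi.mean()  # image-level score; flag for expert review if large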
★ Metasurface-generated large and arbitrary analog convolution kernels for accelerated machine vision
In the rapidly evolving field of artificial intelligence, convolutional
neural networks are essential for tackling complex challenges such as machine
vision and medical diagnosis. Recently, to address the challenges in processing
speed and power consumption of conventional digital convolution operations,
many optical components have been suggested to replace the digital convolution
layer in the neural network, accelerating various machine vision tasks.
Nonetheless, the analog nature of the optical convolution kernel has not been
fully explored. Here, we develop a spatial frequency domain training method to
create arbitrarily shaped analog convolution kernels using an optical
metasurface as the convolution layer, with its receptive field largely
surpassing digital convolution kernels. By employing spatial multiplexing, the
multiple parallel convolution kernels with both positive and negative weights
are generated under the incoherent illumination condition. We experimentally
demonstrate a 98.59% classification accuracy on the MNIST dataset, with
simulations showing 92.63% and 68.67% accuracy on the Fashion-MNIST and
CIFAR-10 datasets with additional digital layers. This work underscores the
unique advantage of analog optical convolution, offering a promising avenue to
accelerate machine vision tasks, especially in edge devices.
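A digital analogue of this idea, as a hedged numpy sketch: a kernel specified
directly in the spatial-frequency domain applies as an element-wise product of
spectra, with no 3x3-style locality constraint, which is what yields the large
receptive field; the optical metasurface realizes this product physically.

    import numpy as np

    def freq_domain_conv(image, kernel_ft):
        """kernel_ft: complex transfer function, same shape as `image`."""
        img_ft = np.fft.fft2(image)
        return np.real(np.fft.ifft2(img_ft * kernel_ft))

    h, w = 28, 28
    image = np.random.rand(h, w)
    # A large, arbitrarily shaped kernel is just an arbitrary transfer
    # function over the full field of view.
    kernel_ft = np.fft.fft2(np.random.randn(h, w))
    out = freq_domain_conv(image, kernel_ft)
    # Negative weights via spatial multiplexing can be emulated digitally by
    # subtracting two non-negative channels: w = w_pos - w_neg.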
★ From One to the Power of Many: Augmentations for Invariance to Multi-LiDAR Perception from Single-Sensor Datasets
Recently, LiDAR perception methods for autonomous vehicles, powered by deep
neural networks, have experienced steep growth in performance on classic
benchmarks, such as nuScenes and SemanticKITTI. However, there are still large
gaps in performance when deploying models trained on such single-sensor setups
to modern multi-sensor vehicles. In this work, we investigate if a lack of
invariance may be responsible for these performance gaps, and propose some
initial solutions in the form of application-specific data augmentations, which
can facilitate better transfer to multi-sensor LiDAR setups. We provide
experimental evidence that our proposed augmentations improve generalization
across LiDAR sensor setups, and investigate how these augmentations affect the
models' invariance properties on simulations of different LiDAR sensor setups.
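As one plausible augmentation in this spirit (the paper's concrete
augmentations are not reproduced here), the numpy sketch below mimics a
lower-beam-count sensor by quantizing elevation angles into beams and dropping
alternating ones; uniform beam spacing is an assumption.

    import numpy as np

    def drop_beams(points, num_beams=64, keep_every=2):
        """points: (N, 4) array of x, y, z, intensity."""
        elev = np.arctan2(points[:, 2],
                          np.linalg.norm(points[:, :2], axis=1))
        lo, hi = elev.min(), elev.max()
        beam = np.clip(((elev - lo) / (hi - lo + 1e-9)
                        * num_beams).astype(int), 0, num_beams - 1)
        return points[beam % keep_every == 0]

    pts = np.random.randn(100000, 4).astype(np.float32)
    pts_32beam = drop_beams(pts, num_beams=64, keep_every=2)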
★ Off to new Shores: A Dataset & Benchmark for (near-)coastal Flood Inundation Forecasting NeurIPS 2024
Floods are among the most common and devastating natural hazards, imposing
immense costs on our society and economy due to their disastrous consequences.
Recent progress in weather prediction and spaceborne flood mapping demonstrated
the feasibility of anticipating extreme events and reliably detecting their
catastrophic effects afterwards. However, these efforts are rarely linked to
one another and there is a critical lack of datasets and benchmarks to enable
the direct forecasting of flood extent. To resolve this issue, we curate a
novel dataset enabling a timely prediction of flood extent. Furthermore, we
provide a representative evaluation of state-of-the-art methods, structured
into two benchmark tracks for forecasting flood inundation maps: i) in general
and ii) focused on coastal regions. Altogether, our dataset and benchmark
provide a comprehensive platform for evaluating flood forecasts, enabling
future solutions for this critical challenge. Data, code & models are shared at
https://github.com/Multihuntr/GFF under a CC0 license.
comment: Accepted at NeurIPS 2024 Datasets & Benchmarks
★ Cross-video Identity Correlating for Person Re-identification Pre-training NeurIPS 2024
Recent research has shown that pre-training on large-scale person images
extracted from internet videos is an effective way to learn better
representations for person re-identification. However, these studies are
mostly confined to pre-training at the instance-level or single-video
tracklet-level. They ignore the identity-invariance in images of the same
person across different videos, which is a key focus in person
re-identification. To address this issue, we propose a Cross-video
Identity-cOrrelating pre-traiNing (CION) framework. Defining a noise concept
that comprehensively considers both intra-identity consistency and
inter-identity discrimination, CION seeks the identity correlation from
cross-video images by modeling it as a progressive multi-level denoising
problem. Furthermore, an identity-guided self-distillation loss is proposed to
implement better large-scale pre-training by mining the identity-invariance
within person images. We conduct extensive experiments to verify the
superiority of our CION in terms of efficiency and performance. CION achieves
significantly leading performance with even fewer training samples. For
example, compared with the previous state-of-the-art~\cite{ISR}, CION with the
same ResNet50-IBN achieves higher mAP of 93.3\% and 74.3\% on Market1501 and
MSMT17, while only utilizing 8\% training samples. Finally, with CION
demonstrating superior model-agnostic ability, we contribute a model zoo named
ReIDZoo to meet diverse research and application needs in this field. It
contains a series of CION pre-trained models spanning diverse structures and
parameters, totaling 32 models with 10 different structures, including
GhostNet, ConvNext, RepViT, FastViT and so on. The code and models will be made
publicly available at https://github.com/Zplusdragon/CION_ReIDZoo.
comment: NeurIPS 2024 Accepted Paper
★ Harmonizing knowledge Transfer in Neural Network with Unified Distillation
Knowledge distillation (KD), known for its ability to transfer knowledge from
a cumbersome network (teacher) to a lightweight one (student) without altering
the architecture, has been garnering increasing attention. Two primary
categories emerge within KD methods: feature-based, focusing on intermediate
layers' features, and logits-based, targeting the final layer's logits. This
paper introduces a novel perspective by leveraging diverse knowledge sources
within a unified KD framework. Specifically, we aggregate features from
intermediate layers into a comprehensive representation, effectively gathering
semantic information from different stages and scales. Subsequently, we predict
the distribution parameters from this representation. These steps transform
knowledge from the intermediate layers into corresponding distributive forms,
thereby allowing for knowledge distillation through a unified distribution
constraint at different stages of the network, ensuring the comprehensiveness
and coherence of knowledge transfer. Numerous experiments were conducted to
validate the effectiveness of the proposed method.
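A hedged sketch of the recipe described above, with assumed names and an
assumed diagonal-Gaussian parameterization: stage features are pooled into one
representation, distribution parameters are predicted from it, and student and
teacher are matched with a closed-form KL constraint.

    import torch

    def stage_summary(feats):
        """feats: list of (B, C_i, H_i, W_i) stage features -> (B, sum C_i)."""
        return torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)

    class DistributionHead(torch.nn.Module):
        def __init__(self, in_dim, z_dim=128):
            super().__init__()
            self.mu = torch.nn.Linear(in_dim, z_dim)
            self.logvar = torch.nn.Linear(in_dim, z_dim)

        def forward(self, x):
            return self.mu(x), self.logvar(x)

    def gaussian_kl(mu_s, lv_s, mu_t, lv_t):
        """KL(student || teacher) for diagonal Gaussians, batch-averaged."""
        kl = 0.5 * (lv_t - lv_s
                    + (lv_s.exp() + (mu_s - mu_t) ** 2) / lv_t.exp() - 1)
        return kl.sum(dim=1).mean()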
★ AL-GTD: Deep Active Learning for Gaze Target Detection
Gaze target detection aims at determining the image location where a person
is looking. While existing studies have made significant progress in this area
by regressing accurate gaze heatmaps, these achievements have largely relied on
access to extensive labeled datasets, which demands substantial human labor. In
this paper, our goal is to reduce the reliance on the size of labeled training
data for gaze target detection. To achieve this, we propose AL-GTD, an
innovative approach that integrates supervised and self-supervised losses
within a novel sample acquisition function to perform active learning (AL).
Additionally, it utilizes pseudo-labeling to mitigate distribution shifts
during the training phase. AL-GTD achieves the best overall AUC results while
utilizing only 40-50% of the training data, in contrast to state-of-the-art
(SOTA) gaze target detectors requiring the entire training dataset to achieve
the same performance. Importantly, AL-GTD quickly reaches satisfactory
performance with 10-20% of the training data, showing the effectiveness of our
acquisition function, which is able to acquire the most informative samples. We
provide a comprehensive experimental analysis by adapting several AL methods
for the task. AL-GTD outperforms AL competitors, simultaneously exhibiting
superior performance compared to SOTA gaze target detectors when all are
trained within a low-data regime. Code is available at
https://github.com/francescotonini/al-gtd.
comment: Accepted to ACM Multimedia 2024
★ CodeSCAN: ScreenCast ANalysis for Video Programming Tutorials
Programming tutorials in the form of coding screencasts play a crucial role
in programming education, serving both novices and experienced developers.
However, the video format of these tutorials presents a challenge due to the
difficulty of searching for and within videos. Addressing the absence of
large-scale and diverse datasets for screencast analysis, we introduce the
CodeSCAN dataset. It comprises 12,000 screenshots captured from the Visual
Studio Code environment during development, featuring 24 programming languages,
25 fonts, and over 90 distinct themes, in addition to diverse layout changes
and realistic user interactions. Moreover, we conduct detailed quantitative and
qualitative evaluations to benchmark the performance of Integrated Development
Environment (IDE) element detection, color-to-black-and-white conversion, and
Optical Character Recognition (OCR). We hope that our contributions facilitate
more research in coding screencast analysis, and we make the source code for
creating the dataset and the benchmark publicly available on this website.
★ Efficient Noise Mitigation for Enhancing Inference Accuracy in DNNs on Mixed-Signal Accelerators
In this paper, we propose a framework to enhance the robustness of the neural
models by mitigating the effects of process-induced and aging-related
variations of analog computing components on the accuracy of the analog neural
networks. We model these variations as the noise affecting the precision of the
activations and introduce a denoising block inserted between selected layers of
a pre-trained model. We demonstrate that training the denoising block
significantly increases the model's robustness against various noise levels. To
minimize the overhead associated with adding these blocks, we present an
exploration algorithm to identify optimal insertion points for the denoising
blocks. Additionally, we propose a specialized architecture to efficiently
execute the denoising blocks, which can be integrated into mixed-signal
accelerators. We evaluate the effectiveness of our approach using Deep Neural
Network (DNN) models trained on the ImageNet and CIFAR-10 datasets. The results
show that on average, by accepting 2.03% parameter count overhead, the accuracy
drop due to the variations reduces from 31.7% to 1.15%.
★ Reducing Semantic Ambiguity In Domain Adaptive Semantic Segmentation Via Probabilistic Prototypical Pixel Contrast
Domain adaptation aims to reduce the model degradation on the target domain
caused by the domain shift between the source and target domains. Although
encouraging performance has been achieved by combining cognitive learning with
the self-training paradigm, they suffer from ambiguous scenarios caused by
scale, illumination, or overlapping when deploying deterministic embedding. To
address these issues, we propose probabilistic prototypical pixel contrast
(PPPC), a universal adaptation framework that models each pixel embedding as a
probability via multivariate Gaussian distribution to fully exploit the
uncertainty within them, eventually improving the representation quality of the
model. In addition, we derive prototypes from posterior probability
estimation, which helps push the decision boundary away from ambiguous
points. Moreover, we employ an efficient method to compute similarity
between distributions, eliminating the need for sampling and
reparameterization, thereby significantly reducing computational overhead.
Further, we dynamically select the ambiguous crops at the image level to
enlarge the number of boundary points involved in contrastive learning, which
benefits the establishment of precise distributions for each category.
Extensive experimentation demonstrates that PPPC not only helps to address
ambiguity at the pixel level, yielding discriminative representations but also
achieves significant improvements in both synthetic-to-real and day-to-night
adaptation tasks. It surpasses the previous state-of-the-art (SOTA) by +5.2%
mIoU in the most challenging daytime-to-nighttime adaptation scenario,
exhibiting stronger generalization on other unseen datasets. The code and
models are available at
https://github.com/DarlingInTheSV/Probabilistic-Prototypical-Pixel-Contrast.
comment: revise
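One closed-form choice for a sampling-free similarity between Gaussian
embeddings is the probability product kernel, \int N1(x) N2(x) dx =
N(mu1; mu2, S1 + S2); whether PPPC uses exactly this kernel is an assumption
made here for illustration.

    import math
    import torch

    def log_gaussian_similarity(mu1, var1, mu2, var2):
        """All inputs (..., D); returns log \int N1 N2 dx, shape (...)."""
        var = var1 + var2
        return (-0.5 * ((mu1 - mu2) ** 2 / var + var.log()
                        + math.log(2 * math.pi))).sum(dim=-1)

    # Pixel embedding vs. class prototype, both modeled as Gaussians:
    mu_p, var_p = torch.randn(64), torch.rand(64) + 0.1
    mu_c, var_c = torch.randn(64), torch.rand(64) + 0.1
    sim = log_gaussian_similarity(mu_p, var_p, mu_c, var_c)
    # `sim` can feed a contrastive objective directly, at O(D) cost and
    # without sampling or reparameterization.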
★ How Effective is Pre-training of Large Masked Autoencoders for Downstream Earth Observation Tasks?
Jose Sosa, Mohamed Aloulou, Danila Rukhovich, Rim Sleimi, Boonyarit Changaival, Anis Kacem, Djamila Aouada
Self-supervised pre-training has proven highly effective for many computer
vision tasks, particularly when labelled data are scarce. In the context of
Earth Observation (EO), foundation models and various other Vision Transformer
(ViT)-based approaches have been successfully applied for transfer learning to
downstream tasks. However, it remains unclear under which conditions
pre-trained models offer significant advantages over training from scratch. In
this study, we investigate the effectiveness of pre-training ViT-based Masked
Autoencoders (MAE) for downstream EO tasks, focusing on reconstruction,
segmentation, and classification. We consider two large ViT-based MAE
pre-trained models: a foundation model (Prithvi) and SatMAE. We evaluate
Prithvi on reconstruction and segmentation-based downstream tasks, and for
SatMAE we assess its performance on a classification downstream task. Our
findings suggest that pre-training is particularly beneficial when the
fine-tuning task closely resembles the pre-training task, e.g. reconstruction.
In contrast, for tasks such as segmentation or classification, training from
scratch with specific hyperparameter adjustments proved to be equally or more
effective.
★ Prompt-Driven Temporal Domain Adaptation for Nighttime UAV Tracking IROS2024
Nighttime UAV tracking under low-illuminated scenarios has achieved great
progress by domain adaptation (DA). However, previous DA training-based works
are deficient in narrowing the discrepancy of temporal contexts for UAV
trackers. To address the issue, this work proposes a prompt-driven temporal
domain adaptation training framework to fully utilize temporal contexts for
challenging nighttime UAV tracking, i.e., TDA. Specifically, the proposed
framework aligns the distribution of temporal contexts from daytime and
nighttime domains by training the temporal feature generator against the
discriminator. The temporal-consistent discriminator progressively extracts
shared domain-specific features to generate coherent domain discrimination
results in the time series. Additionally, to obtain high-quality training
samples, a prompt-driven object miner is employed to precisely locate objects
in unannotated nighttime videos. Moreover, a new benchmark for long-term
nighttime UAV tracking is constructed. Exhaustive evaluations on both public
and self-constructed nighttime benchmarks demonstrate the remarkable
performance of the tracker trained in TDA framework, i.e., TDA-Track.
Real-world tests at nighttime also show its practicality. The code and demo
videos are available at https://github.com/vision4robotics/TDA-Track.
comment: Accepted by IROS2024
★ Token Caching for Diffusion Transformer Acceleration
Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Jiajiong Cao, Yuming Li, Chenguang Ma
Diffusion transformers have gained substantial interest in diffusion
generative modeling due to their outstanding performance. However, their high
computational cost, arising from the quadratic computational complexity of
attention mechanisms and multi-step inference, presents a significant
bottleneck. To address this challenge, we propose TokenCache, a novel
post-training acceleration method that leverages the token-based multi-block
architecture of transformers to reduce redundant computations among tokens
across inference steps. TokenCache specifically addresses three critical
questions in the context of diffusion transformers: (1) which tokens should be
pruned to eliminate redundancy, (2) which blocks should be targeted for
efficient pruning, and (3) at which time steps caching should be applied to
balance speed and quality. In response to these challenges, TokenCache
introduces a Cache Predictor that assigns importance scores to tokens, enabling
selective pruning without compromising model performance. Furthermore, we
propose an adaptive block selection strategy to focus on blocks with minimal
impact on the network's output, along with a Two-Phase Round-Robin (TPRR)
scheduling policy to optimize caching intervals throughout the denoising
process. Experimental results across various models demonstrate that TokenCache
achieves an effective trade-off between generation quality and inference speed
for diffusion transformers. Our code will be publicly available.
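A hedged sketch of the caching pattern, with the learned Cache Predictor,
block selection, and TPRR schedule all simplified into fixed rules: at
non-refresh steps, only the most important tokens are recomputed and the rest
reuse cached block outputs.

    import torch

    def cached_block_forward(block, x, cache, step,
                             refresh_every=2, keep_ratio=0.5):
        """x: (B, N, D) tokens. Recompute everything every `refresh_every`
        steps; otherwise recompute only top-scoring tokens."""
        if step % refresh_every == 0 or "out" not in cache:
            out = block(x)
            cache["out"] = out
            # Toy importance score; TokenCache instead *learns* a predictor.
            cache["importance"] = out.norm(dim=-1)
            return out
        k = int(x.shape[1] * keep_ratio)
        idx = cache["importance"].topk(k, dim=1).indices        # (B, k)
        out = cache["out"].clone()
        sel = idx.unsqueeze(-1).expand(-1, -1, x.shape[2])
        fresh = block(torch.gather(x, 1, sel))  # recompute selected tokens
        out.scatter_(1, sel, fresh)             # splice into cached output
        return out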
★ Med-IC: Fusing a Single Layer Involution with Convolutions for Enhanced Medical Image Classification and Segmentation
Md. Farhadul Islam, Sarah Zabeen, Meem Arafat Manab, Mohammad Rakibul Hasan Mahin, Joyanta Jyoti Mondal, Md. Tanzim Reza, Md Zahidul Hasan, Munima Haque, Farig Sadeque, Jannatun Noor
The majority of medical images, especially those that resemble cells, have
similar characteristics. These images, which occur in a variety of shapes,
often show abnormalities in the organ or cell region. The convolution operation
possesses a restricted capability to extract visual patterns across several
spatial regions of an image. The involution process, which is the inverse
operation of convolution, complements this inherent lack of spatial information
extraction present in convolutions. In this study, we investigate how applying
a single layer of involution prior to a convolutional neural network (CNN)
architecture can significantly improve classification and segmentation
performance, with a comparatively negligible number of weight parameters. The
study additionally shows how excessive use of involution layers might result in
inaccurate predictions in a particular type of medical image. According to our
findings from experiments, the strategy of adding only a single involution
layer before a CNN-based model outperforms most of the previous works.
comment: 13 pages, 5 figures, 4 tables, preprint submitted to an Elsevier
journal
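For reference, a minimal involution layer in the style of Li et al.'s
involution operator, which the abstract builds on: the kernel is generated per
spatial location and shared across channel groups, the complement of
convolution's location-shared, channel-specific kernels.

    import torch
    import torch.nn as nn

    class Involution2d(nn.Module):
        def __init__(self, channels, kernel_size=3, groups=1, reduction=4):
            super().__init__()
            self.k, self.g = kernel_size, groups
            self.kernel_gen = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, groups * kernel_size ** 2, 1),
            )
            self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

        def forward(self, x):
            b, c, h, w = x.shape
            # Per-pixel kernels, shared within each channel group.
            kernel = self.kernel_gen(x).view(b, self.g, 1, self.k ** 2, h, w)
            patches = self.unfold(x).view(b, self.g, c // self.g,
                                          self.k ** 2, h, w)
            return (kernel * patches).sum(dim=3).view(b, c, h, w)

    x = torch.randn(2, 16, 32, 32)
    y = Involution2d(16, groups=4)(x)  # same spatial size, per-pixel kernels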
★ Neural Video Representation for Redundancy Reduction and Consistency Preservation
Implicit neural representations (INRs) embed various signals into networks.
They have gained attention in recent years because of their versatility in
handling diverse signal types. For videos, INRs achieve video compression by
embedding video signals into networks and compressing them. Conventional
methods use an index that expresses the time of the frame or the features
extracted from the frame as inputs to the network. The latter method provides
greater expressive capability as the input is specific to each video. However,
the features extracted from frames often contain redundancy, which contradicts
the purpose of video compression. Moreover, since frame time information is not
explicitly provided to the network, learning the relationships between frames
is challenging. To address these issues, we aim to reduce feature redundancy by
extracting features based on the high-frequency components of the frames. In
addition, we use feature differences between adjacent frames in order for the
network to learn frame relationships smoothly. We propose a video
representation method that uses the high-frequency components of frames and the
differences in features between adjacent frames. The experimental results show
that our method outperforms the existing HNeRV method in 90 percent of the
videos.
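A hedged sketch of the two ingredients described above, with the specific
high-pass filter and feature extractor as illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def high_frequency(frames):
        """frames: (T, C, H, W). High-pass = frame minus Gaussian blur."""
        k = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        k = k.view(1, 1, 3, 3).repeat(frames.shape[1], 1, 1, 1)
        blur = F.conv2d(frames, k, padding=1, groups=frames.shape[1])
        return frames - blur

    def feature_diffs(feats):
        """feats: (T, D) per-frame features -> explicit temporal deltas."""
        diffs = feats[1:] - feats[:-1]
        return torch.cat([feats[:1], diffs], dim=0)  # first frame as anchor

    frames = torch.rand(8, 3, 64, 64)
    hf = high_frequency(frames)       # redundancy-reduced per-frame signal
    feats = hf.mean(dim=(2, 3))       # stand-in feature extractor
    inputs = feature_diffs(feats)     # what the INR would be conditioned on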
★ Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
With the development of video understanding, there is a proliferation of
tasks for clip-level temporal video analysis, including temporal action
detection (TAD), temporal action segmentation (TAS), and generic event boundary
detection (GEBD). While task-specific video understanding models have exhibited
outstanding performance on each task, a unified framework capable of
simultaneously addressing multiple tasks is still lacking, which is a
promising direction for the next generation of AI. To this end, in this paper,
we propose a single unified framework, coined as Temporal2Seq, to formulate the
output of these temporal video understanding tasks as a sequence of discrete
tokens. With this unified token representation, Temporal2Seq can train a
generalist model within a single architecture on different video understanding
tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a
comprehensive co-training dataset by borrowing the datasets from TAD, TAS, and
GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding
test sets of three tasks, demonstrating that Temporal2Seq can produce
reasonable results on various tasks and achieve advantages compared with
single-task training on this framework. We also investigate the generalization
performance of our generalist model on new datasets from different tasks, which
yields superior performance to the specific model.
★ Underwater Image Enhancement with Physical-based Denoising Diffusion Implicit Models
Underwater vision is crucial for autonomous underwater vehicles (AUVs), and
enhancing degraded underwater images in real time on a resource-constrained AUV
is a key challenge due to factors like light absorption and scattering, and to
the computational complexity a model needs to resolve such factors. Traditional
image enhancement techniques lack adaptability to varying underwater
conditions, while learning-based methods, particularly those using
convolutional neural networks (CNNs) and generative adversarial networks
(GANs), offer more robust solutions but face limitations such as inadequate
enhancement, unstable training, or mode collapse. Denoising diffusion
probabilistic models (DDPMs) have emerged as a state-of-the-art approach in
image-to-image tasks but require intensive computational complexity to achieve
the desired underwater image enhancement (UIE) using the recent UW-DDPM
solution. To address these challenges, this paper introduces UW-DiffPhys, a
novel physical-based and diffusion-based UIE approach. UW-DiffPhys combines
light-computation physical-based UIE network components with a denoising U-Net
to replace the computationally intensive distribution transformation U-Net in
the existing UW-DDPM framework, reducing complexity while maintaining
performance. Additionally, the Denoising Diffusion Implicit Model (DDIM) is
employed to accelerate the inference process through non-Markovian sampling.
Experimental results demonstrate that UW-DiffPhys achieved a substantial
reduction in computational complexity and inference time compared to UW-DDPM,
with competitive performance in key metrics such as PSNR, SSIM, UCIQE, and an
improvement in the overall underwater image quality UIQM metric. The
implementation code can be found at the following repository:
https://github.com/bachzz/UW-DiffPhys
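For reference, the standard deterministic DDIM update (eta = 0) that enables
the skipped, non-Markovian sampling mentioned above; the epsilon-predictor
interface is an assumed convention.

    import torch

    def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
        """One jump t -> t_prev, where t_prev may skip many Markov steps."""
        x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) \
                  / alpha_bar_t.sqrt()
        return (alpha_bar_prev.sqrt() * x0_pred
                + (1 - alpha_bar_prev).sqrt() * eps_pred)

    # With, say, 50 evenly spaced timesteps instead of 1000, the same trained
    # network needs ~20x fewer denoising evaluations -- the inference saving
    # that matters on resource-constrained AUVs.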
★ Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration NeurIPS 2024
Federated Learning has emerged as a promising paradigm for collaborative
machine learning, while preserving user data privacy. Despite its potential,
standard FL lacks support for diverse heterogeneous device prototypes, which
vary significantly in model and dataset sizes -- from small IoT devices to
large workstations. This limitation is only partially addressed by existing
knowledge distillation techniques, which often fail to transfer knowledge
effectively across a broad spectrum of device prototypes with varied
capabilities. This failure primarily stems from two issues: the dilution of
informative logits from more capable devices by those from less capable ones,
and the use of a single set of integrated logits as the distillation target
across all devices, which neglects their individual learning capacities and the
unique
contributions of each. To address these challenges, we introduce TAKFL, a novel
KD-based framework that treats the knowledge transfer from each device
prototype's ensemble as a separate task, independently distilling each to
preserve its unique contributions and avoid dilution. TAKFL also incorporates a
KD-based self-regularization technique to mitigate the issues related to the
noisy and unsupervised ensemble distillation process. To integrate the
separately distilled knowledge, we introduce an adaptive task arithmetic
knowledge integration process, allowing each student model to customize the
knowledge integration for optimal performance. Additionally, we present
theoretical results demonstrating the effectiveness of task arithmetic in
transferring knowledge across heterogeneous devices with varying capacities.
Comprehensive evaluations of our method across both CV and NLP tasks
demonstrate that TAKFL achieves SOTA results in a variety of datasets and
settings, significantly outperforming existing KD-based methods. Code is
released at https://github.com/MMorafah/TAKFL
comment: NeurIPS 2024
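A hedged sketch of task-arithmetic integration, with TAKFL's adaptive,
per-student integration reduced to fixed coefficients for illustration.

    import torch

    def integrate_task_vectors(base_state, distilled_states, coeffs):
        """base_state: student init; distilled_states: one state_dict per
        device prototype, each distilled *separately*; coeffs: one weight
        per prototype."""
        merged = {k: v.clone() for k, v in base_state.items()}
        for state, c in zip(distilled_states, coeffs):
            for k in merged:
                merged[k] += c * (state[k] - base_state[k])  # scaled task vector
        return merged

    # model.load_state_dict(integrate_task_vectors(
    #     theta0, [theta_small, theta_large], coeffs=[0.3, 0.7]))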
★ FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation
Research on food image understanding using recipe data has been a
long-standing focus due to the diversity and complexity of the data. Moreover,
food is inextricably linked to people's lives, making it a vital research area
for practical applications such as dietary management. Recent advancements in
Multimodal Large Language Models (MLLMs) have demonstrated remarkable
capabilities, not only in their vast knowledge but also in their ability to
handle languages naturally. While English is predominantly used, they can also
support multiple languages including Japanese. This suggests that MLLMs are
expected to significantly improve performance in food image understanding
tasks. We fine-tuned open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe
dataset and benchmarked their performance against the closed model GPT-4o. We
then evaluated the content of generated recipes, including ingredients and
cooking procedures, using 5,000 evaluation samples that comprehensively cover
Japanese food culture. Our evaluation demonstrates that the open models trained
on recipe data outperform GPT-4o, the current state-of-the-art model, in
ingredient generation. Our model achieved an F1 score of 0.531, surpassing
GPT-4o's F1 score of 0.481, indicating a higher level of accuracy. Furthermore,
our model exhibited comparable performance to GPT-4o in generating cooking
procedure text.
comment: 14 pages, 5 figures
★ Enhancing Crime Scene Investigations through Virtual Reality and Deep Learning Techniques
The analysis of a crime scene is a pivotal activity in forensic
investigations. Crime Scene Investigators and forensic science practitioners
rely on best practices, standard operating procedures, and critical thinking,
to produce rigorous scientific reports to document the scenes of interest and
meet the quality standards expected in the courts. However, crime scene
examination is a complex and multifaceted task often performed in environments
susceptible to deterioration, contamination, and alteration, despite the use of
contact-free and non-destructive methods of analysis. In this context, the
documentation of the sites, and the identification and isolation of traces of
evidential value remain challenging endeavours. In this paper, we propose a
photogrammetric reconstruction of the crime scene for inspection in virtual
reality (VR) and focus on fully automatic object recognition with deep learning
(DL) algorithms through a client-server architecture. A pre-trained Faster-RCNN
model was chosen as the method that best categorizes the relevant objects
at the scene, as selected by experts in the VR environment. These operations can
considerably improve and accelerate crime scene analysis and help the forensic
expert in extracting measurements and analysing in detail the objects under
analysis. Experimental results on a simulated crime scene have shown that the
proposed method can be effective in finding and recognizing objects with
potential evidentiary value, enabling timely analyses of crime scenes,
particularly those with health and safety risks (e.g. fires, explosions,
chemicals, etc.), while minimizing subjective bias and contamination of the
scene.
★ DynaWeightPnP: Toward global real-time 3D-2D solver in PnP without correspondences
This paper addresses a special Perspective-n-Point (PnP) problem: estimating
the optimal pose to align 3D and 2D shapes in real-time without
correspondences, termed correspondence-free PnP. While several studies have
focused on 3D and 2D shape registration, achieving both real-time and accurate
performance remains challenging. This study specifically targets the 3D-2D
geometric shape registration tasks, applying the recently developed Reproducing
Kernel Hilbert Space (RKHS) to address the "big-to-small" issue. An iterative
reweighted least squares method is employed to solve the RKHS-based formulation
efficiently. Moreover, our work identifies a unique and interesting
observability issue in correspondence-free PnP: the numerical ambiguity between
rotation and translation. To address this, we propose DynaWeightPnP,
introducing a dynamic weighting sub-problem and an alternative searching
algorithm designed to enhance pose estimation and alignment accuracy.
Experiments were conducted on a typical case, that is, a 3D-2D vascular
centerline registration task within Endovascular Image-Guided Interventions
(EIGIs). Results demonstrated that the proposed algorithm achieves registration
processing rates of 60 Hz (without post-refinement) and 31 Hz (with
post-refinement) on modern single-core CPUs, with competitive accuracy
comparable to existing methods. These results underscore the suitability of
DynaWeightPnP for future robot navigation tasks like EIGIs.
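For reference, the iteratively reweighted least squares pattern mentioned
above, shown on a generic robust linear problem rather than the paper's
RKHS-based pose objective; the Welsch-style weight is an illustrative choice.

    import numpy as np

    def irls(A, b, iters=20, eps=1e-6):
        """Robust fit: re-solve weighted least squares with weights that
        downweight large residuals."""
        x = np.linalg.lstsq(A, b, rcond=None)[0]
        for _ in range(iters):
            r = A @ x - b
            w = np.exp(-(r / (r.std() + eps)) ** 2)  # downweight outliers
            Aw = A * w[:, None]
            x = np.linalg.solve(A.T @ Aw, Aw.T @ b)  # normal equations
        return x

    A = np.random.randn(200, 6)              # 6-DoF-style parameter vector
    b = A @ np.ones(6) + 0.01 * np.random.randn(200)
    b[:20] += 5.0                            # gross outliers
    x = irls(A, b)                           # close to all-ones ground truth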
★ Gradient-free Decoder Inversion in Latent Diffusion Models NeurIPS 2024
In latent diffusion models (LDMs), the denoising diffusion process takes place
efficiently in a latent space whose dimension is lower than that of the pixel
space. A decoder is typically used to transform the latent-space representation
into the pixel space. While a decoder is assumed to have an encoder as an
accurate inverse, an exact encoder-decoder pair rarely exists in practice, even
though applications often require precise inversion of the decoder. Prior works on
decoder inversion in LDMs employed gradient descent inspired by inversions of
generative adversarial networks. However, gradient-based methods require larger
GPU memory and longer computation time for larger latent space. For example,
recent video LDMs can generate more than 16 frames, but GPUs with 24 GB memory
can only perform gradient-based decoder inversion for 4 frames. Here, we
propose an efficient gradient-free decoder inversion for LDMs, which can be
applied to diverse latent models. Theoretical convergence property of our
proposed inversion has been investigated not only for the forward step method,
but also for the inertial Krasnoselskii-Mann (KM) iterations under mild
assumption on cocoercivity that is satisfied by recent LDMs. Our proposed
gradient-free method with Adam optimizer and learning rate scheduling
significantly reduced computation time and memory usage over prior
gradient-based methods and enabled efficient computation in applications such
as noise-space watermarking while achieving comparable error levels.
comment: 19 pages, Accepted to NeurIPS 2024
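A hedged sketch of a gradient-free fixed-point inversion in the spirit of the
forward-step and KM iterations described above; the operator built from the
encoder is an illustrative assumption, not the paper's exact formulation.

    import torch

    @torch.no_grad()
    def invert_decoder(decoder, encoder, x_target, z0, steps=100, lam=0.5):
        """Find z with decoder(z) ~ x_target using only forward passes."""
        z = z0.clone()
        e_target = encoder(x_target)
        for _ in range(steps):
            t = z + e_target - encoder(decoder(z))  # forward-step operator T(z)
            z = (1 - lam) * z + lam * t             # Krasnoselskii-Mann averaging
        return z

    # Memory stays at inference level: no autograd graph over the decoder is
    # stored, which is what allows inverting long video latents on a 24 GB GPU.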
★ Search3D: Hierarchical Open-Vocabulary 3D Segmentation
Ayca Takmaz, Alexandros Delitzas, Robert W. Sumner, Francis Engelmann, Johanna Wald, Federico Tombari
Open-vocabulary 3D segmentation enables the exploration of 3D spaces using
free-form text descriptions. Existing methods for open-vocabulary 3D instance
segmentation primarily focus on identifying object-level instances in a scene.
However, they face challenges when it comes to understanding more fine-grained
scene entities such as object parts, or regions described by generic
attributes. In this work, we introduce Search3D, an approach that builds a
hierarchical open-vocabulary 3D scene representation, enabling the search for
entities at varying levels of granularity: fine-grained object parts, entire
objects, or regions described by attributes like materials. Our method aims to
expand the capabilities of open vocabulary instance-level 3D segmentation by
shifting towards a more flexible open-vocabulary 3D search setting less
anchored to explicit object-centric queries, compared to prior work. To ensure
a systematic evaluation, we also contribute a scene-scale open-vocabulary 3D
part segmentation benchmark based on MultiScan, along with a set of
open-vocabulary fine-grained part annotations on ScanNet++. We verify the
effectiveness of Search3D across several tasks, demonstrating that our approach
outperforms baselines in scene-scale open-vocabulary 3D part segmentation,
while maintaining strong performance in segmenting 3D objects and materials.
comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
★ Robust Network Learning via Inverse Scale Variational Sparsification
While neural networks have made significant strides in many AI tasks, they
remain vulnerable to a range of noise types, including natural corruptions,
adversarial noise, and low-resolution artifacts. Many existing approaches focus
on enhancing robustness against specific noise types, limiting their
adaptability to others. Previous studies have addressed general robustness by
adopting a spectral perspective, which tends to blur crucial features like
texture and object contours. Our proposed solution, however, introduces an
inverse scale variational sparsification framework within a time-continuous
inverse scale space formulation. This framework progressively learns
finer-scale features by discerning variational differences between pixels,
ultimately preserving only large-scale features in the smoothed image. Unlike
frequency-based methods, our approach not only removes noise by smoothing
small-scale features where corruptions often occur but also retains
high-contrast details such as textures and object contours. Moreover, our
framework offers simplicity and efficiency in implementation. By integrating
this algorithm into neural network training, we guide the model to prioritize
learning large-scale features. We show the efficacy of our approach through
enhanced robustness against various noise types.
comment: 21 pages, 7 figures
★ A3: Active Adversarial Alignment for Source-Free Domain Adaptation ICMLA 2024
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a
labeled source domain to an unlabeled target domain. Recent works have focused
on source-free UDA, where only target data is available. This is challenging as
models rely on noisy pseudo-labels and struggle with distribution shifts. We
propose Active Adversarial Alignment (A3), a novel framework combining
self-supervised learning, adversarial training, and active learning for robust
source-free UDA. A3 actively samples informative and diverse data using an
acquisition function for training. It adapts models via adversarial losses and
consistency regularization, aligning distributions without source data access.
A3 advances source-free UDA through its synergistic integration of active and
adversarial learning for effective domain alignment and noise reduction.
comment: Accepted at ICMLA 2024
★ Query matching for spatio-temporal action detection with query-based object detector
In this paper, we propose a method that extends the query-based object
detection model, DETR, to spatio-temporal action detection, which requires
maintaining temporal consistency in videos. Our proposed method applies DETR to
each frame and uses feature shift to incorporate temporal information. However,
DETR's object queries in each frame may correspond to different objects, making
a simple feature shift ineffective. To overcome this issue, we propose query
matching across different frames, ensuring that queries for the same object are
matched and used for the feature shift. Experimental results show that
performance on the JHMDB21 dataset improves significantly when query features
are shifted using the proposed query matching.
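A hedged sketch of query matching between adjacent frames, with
cosine-similarity cost and Hungarian assignment as assumed concrete choices;
the matched ordering is what makes a simple feature shift meaningful.

    import torch
    from scipy.optimize import linear_sum_assignment

    def match_queries(q_t, q_t1):
        """q_t, q_t1: (N, D) DETR query features of frames t and t+1."""
        a = torch.nn.functional.normalize(q_t, dim=1)
        b = torch.nn.functional.normalize(q_t1, dim=1)
        cost = (-a @ b.T).cpu().numpy()          # negative cosine similarity
        row, col = linear_sum_assignment(cost)   # one-to-one matching
        return q_t1[torch.as_tensor(col)]        # reordered to align with q_t

    q_t, q_t1 = torch.randn(100, 256), torch.randn(100, 256)
    q_t1_aligned = match_queries(q_t, q_t1)
    shifted = 0.5 * q_t + 0.5 * q_t1_aligned     # toy temporal feature shift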
★ GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation
Large-scale text-guided image diffusion models have shown astonishing results
in text-to-image (T2I) generation. However, applying these models to synthesize
textures for 3D geometries remains challenging due to the domain gap between 2D
images and textures on a 3D surface. Early works that used a
projecting-and-inpainting approach managed to preserve generation diversity but
often resulted in noticeable artifacts and style inconsistencies. While recent
methods have attempted to address these inconsistencies, they often introduce
other issues, such as blurring, over-saturation, or over-smoothing. To overcome
these challenges, we propose a novel text-to-texture synthesis framework that
leverages pretrained diffusion models. We first introduce a local attention
reweighting mechanism in the self-attention layers to guide the model in
concentrating on spatial-correlated patches across different views, thereby
enhancing local details while preserving cross-view consistency. Additionally,
we propose a novel latent space merge pipeline, which further ensures
consistency across different viewpoints without sacrificing too much diversity.
Our method significantly outperforms existing state-of-the-art techniques
regarding texture consistency and visual quality, while delivering results much
faster than distillation-based methods. Importantly, our framework does not
require additional training or fine-tuning, making it highly adaptable to a
wide range of models available on public platforms.
★ You Only Speak Once to See ICASSP 2025
Grounding objects in images using visual cues is a well-established approach
in computer vision, yet the potential of audio as a modality for object
recognition and grounding remains underexplored. We introduce YOSS, "You Only
Speak Once to See," to leverage audio for grounding objects in visual scenes,
termed Audio Grounding. By integrating pre-trained audio models with visual
models using contrastive learning and multi-modal alignment, our approach
captures speech commands or descriptions and maps them directly to
corresponding objects within images. Experimental results indicate that audio
guidance can be effectively applied to object grounding, suggesting that
incorporating audio guidance may enhance the precision and robustness of
current object grounding methods and improve the performance of robotic systems
and computer vision applications. This finding opens new possibilities for
advanced object recognition, scene understanding, and the development of more
intuitive and capable robotic systems.
comment: 7 pages, 4 figures, submitted to ICASSP 2025
★ Multi-hypotheses Conditioned Point Cloud Diffusion for 3D Human Reconstruction from Occluded Images NeurIPS 2024
3D human shape reconstruction under severe occlusion due to human-object or
human-human interaction is a challenging problem. Parametric models i.e.,
SMPL(-X), which are based on the statistics across human shapes, can represent
whole human body shapes but are limited to minimally-clothed human shapes.
Implicit-function-based methods extract features from the parametric models to
employ prior knowledge of human bodies and can capture geometric details such
as clothing and hair. However, they often struggle to handle misaligned
parametric models and inpaint occluded regions given a single RGB image. In
this work, we propose a novel pipeline, MHCDIFF, Multi-hypotheses Conditioned
Point Cloud Diffusion, composed of point cloud diffusion conditioned on
probabilistic distributions for pixel-aligned detailed 3D human reconstruction
under occlusion. Compared to previous implicit-function-based methods, the
point cloud diffusion model can capture the global consistent features to
generate the occluded regions, and the denoising process corrects the
misaligned SMPL meshes. The core of MHCDIFF is extracting local features from
multiple hypothesized SMPL(-X) meshes and aggregating the set of features to
condition the diffusion model. In the experiments on CAPE and MultiHuman
datasets, the proposed method outperforms various SOTA methods based on SMPL,
implicit functions, point cloud diffusion, and their combinations, under synthetic
and real occlusions.
comment: 17 pages, 7 figures, accepted NeurIPS 2024
★ SinoSynth: A Physics-based Domain Randomization Approach for Generalizable CBCT Image Enhancement MICCAI 2024
Cone Beam Computed Tomography (CBCT) finds diverse applications in medicine.
Ensuring high image quality in CBCT scans is essential for accurate diagnosis
and treatment delivery. Yet, the susceptibility of CBCT images to noise and
artifacts undermines both their usefulness and reliability. Existing methods
typically address CBCT artifacts through image-to-image translation approaches.
These methods, however, are limited by the artifact types present in the
training data, which may not cover the complete spectrum of CBCT degradations
stemming from variations in imaging protocols. Gathering additional data to
encompass all possible scenarios can often pose a challenge. To address this,
we present SinoSynth, a physics-based degradation model that simulates various
CBCT-specific artifacts to generate a diverse set of synthetic CBCT images from
high-quality CT images without requiring pre-aligned data. Through extensive
experiments, we demonstrate that several different generative networks trained
on our synthesized data achieve remarkable results on heterogeneous
multi-institutional datasets, outperforming even the same networks trained on
actual data. We further show that our degradation model conveniently provides
an avenue to enforce anatomical constraints in conditional generative models,
yielding high-quality and structure-preserving synthetic CT images.
comment: MICCAI 2024
♻ ★ InterNet: Unsupervised Cross-modal Homography Estimation Based on Interleaved Modality Transfer and Self-supervised Homography Prediction
Junchen Yu, Si-Yuan Cao, Runmin Zhang, Chenghao Zhang, Jianxin Hu, Zhu Yu, Beinan Yu, Hui-liang Shen
We propose a novel unsupervised cross-modal homography estimation framework,
based on interleaved modality transfer and self-supervised homography
prediction, named InterNet. InterNet integrates modality transfer and
self-supervised homography estimation, introducing an innovative interleaved
optimization framework to alternately promote both components. The modality
transfer gradually narrows the modality gaps, facilitating the self-supervised
homography estimation to fully leverage the synthetic intra-modal data. The
self-supervised homography estimation progressively achieves reliable
predictions, thereby providing robust cross-modal supervision for the modality
transfer. To further boost the estimation accuracy, we also formulate a
fine-grained homography feature loss to improve the connection between two
components. Furthermore, we employ a simple yet effective distillation training
technique to reduce model parameters and improve cross-domain generalization
ability while maintaining comparable performance. Experiments reveal that
InterNet achieves the state-of-the-art (SOTA) performance among unsupervised
methods, and even outperforms many supervised methods such as MHN and
LocalTrans.
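A rough training-loop skeleton of such interleaved optimization, assuming
user-supplied transfer_net and homo_net modules and a hypothetical random_warp
helper; the actual losses and schedule in the paper differ.

    import torch

    def interleaved_train(transfer_net, homo_net, loader, steps, k=1):
        # Alternate between phase A (modality transfer) and phase B
        # (self-supervised homography) every k steps.
        opt_t = torch.optim.Adam(transfer_net.parameters(), lr=1e-4)
        opt_h = torch.optim.Adam(homo_net.parameters(), lr=1e-4)
        data = iter(loader)
        for step in range(steps):
            src, tgt = next(data)  # cross-modal image pair
            if (step // k) % 2 == 0:
                # phase A: narrow the modality gap (homography net untouched)
                loss = (transfer_net(src) - tgt).abs().mean()  # stand-in loss
                opt_t.zero_grad(); loss.backward(); opt_t.step()
            else:
                # phase B: supervise the estimator on synthetic intra-modal
                # pairs produced from the transferred image
                with torch.no_grad():
                    fake_tgt = transfer_net(src)
                warped, h_gt = random_warp(fake_tgt)  # hypothetical helper
                loss = (homo_net(fake_tgt, warped) - h_gt).pow(2).mean()
                opt_h.zero_grad(); loss.backward(); opt_h.step()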
♻ ★ Confidence intervals uncovered: Are we ready for real-world medical imaging AI? MICCAI 2024
Evangelia Christodoulou, Annika Reinke, Rola Houhou, Piotr Kalinowski, Selen Erkan, Carole H. Sudre, Ninon Burgos, Sofiène Boutaj, Sophie Loizillon, Maëlys Solal, Nicola Rieke, Veronika Cheplygina, Michela Antonelli, Leon D. Mayer, Minu D. Tizabi, M. Jorge Cardoso, Amber Simpson, Paul F. Jäger, Annette Kopp-Schneider, Gaël Varoquaux, Olivier Colliot, Lena Maier-Hein
Medical imaging is spearheading the AI transformation of healthcare.
Performance reporting is key to determine which methods should be translated
into clinical practice. Frequently, broad conclusions are simply derived from
mean performance values. In this paper, we argue that this common practice is
often a misleading simplification as it ignores performance variability. Our
contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n =
221) published in 2023, we first observe that more than 50% of papers do not
assess performance variability at all. Moreover, only one (0.5%) paper reported
confidence intervals (CIs) for model performance. (2) To address the reporting
bottleneck, we show that the unreported standard deviation (SD) in segmentation
papers can be approximated by a second-order polynomial function of the mean
Dice similarity coefficient (DSC). Based on external validation data from 56
previous MICCAI challenges, we demonstrate that this approximation can
accurately reconstruct the CI of a method using information provided in
publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of
MICCAI 2023 segmentation papers. The median CI width was 0.03, which is three
times larger than the median performance gap between the first- and
second-ranked methods. For more than 60% of papers, the mean performance of the
second-ranked method was within the CI of the first-ranked method. We conclude
that current publications typically do not provide sufficient evidence to
support which models could potentially be translated into clinical practice.
comment: Paper accepted at MICCAI 2024 conference
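To make the recipe concrete, a minimal sketch: approximate the unreported SD
as a second-order polynomial of the mean DSC and plug it into the standard CI
formula. The polynomial coefficients below are placeholders, not the values
fitted on the 56 previous challenges.

    import numpy as np

    # Placeholder coefficients for SD ~= A*DSC^2 + B*DSC + C; the values
    # fitted in the paper are not reproduced here.
    A, B, C = 0.5, -1.0, 0.55

    def reconstruct_ci(mean_dsc, n_test_cases, z=1.96):
        # Approximate 95% CI for a mean DSC when only the mean and the
        # number of test cases are reported (illustrative version).
        sd_hat = A * mean_dsc**2 + B * mean_dsc + C
        half_width = z * sd_hat / np.sqrt(n_test_cases)
        return mean_dsc - half_width, mean_dsc + half_width

    print(reconstruct_ci(0.85, n_test_cases=100))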
♻ ★ Leveraging Anthropometric Measurements to Improve Human Mesh Estimation and Ensure Consistent Body Shapes
The basic body shape of a person does not change within a single video.
However, most SOTA human mesh estimation (HME) models output a slightly
different body shape for each video frame, which results in inconsistent body
shapes for the same person. In contrast, we leverage anthropometric
measurements of the kind tailors have been taking from humans for centuries. We
create a model called A2B that converts such anthropometric measurements to
body shape parameters of human mesh models. Moreover, we find that finetuned
SOTA 3D human pose estimation (HPE) models outperform HME models regarding the
precision of the estimated keypoints. We show that applying inverse kinematics
(IK) to the results of such a 3D HPE model and combining the resulting body
pose with the A2B body shape leads to superior and consistent human meshes for
challenging datasets like ASPset or fit3D, where we can lower the MPJPE by over
30 mm compared to SOTA HME models. Further, replacing HME models' estimates of
the body shape parameters with A2B results not only increases the performance
of these HME models but also leads to consistent body shapes.
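A minimal sketch of what an A2B-style mapping could look like: a small
regressor from a fixed set of anthropometric measurements to SMPL shape
parameters (betas). The architecture and the measurement count of 36 are
guesses for illustration, not the paper's specification.

    import torch
    import torch.nn as nn

    # Hypothetical A2B-style regressor: 36 measurements (heights, girths,
    # lengths) -> 10 SMPL betas; trained elsewhere on mesh data.
    a2b = nn.Sequential(nn.Linear(36, 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(),
                        nn.Linear(128, 10))

    betas = a2b(torch.randn(1, 36))  # one consistent body shape per person,
    print(betas.shape)               # reused across all frames of the video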
♻ ★ VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly Detection ACCV 2024
Video anomaly detection (VAD) is a crucial task in video analysis and
surveillance within computer vision. Currently, VAD is gaining attention with
memory techniques that store the features of normal frames. The stored features
are utilized for frame reconstruction, identifying an abnormality when a
significant difference exists between the reconstructed and input frames.
However, this approach faces several challenges due to the simultaneous
optimization required for both the memory and encoder-decoder model. These
challenges include increased optimization difficulty, complexity of
implementation, and performance variability depending on the memory size. To
address these challenges, we propose an effective memory method for VAD, called
VideoPatchCore. Inspired by PatchCore, our approach introduces a structure that
prioritizes memory optimization and configures three types of memory tailored
to the characteristics of video data. This method effectively addresses the
limitations of existing memory-based methods, achieving performance
comparable to state-of-the-art methods. Furthermore, our method requires no
training and is straightforward to implement, making VAD tasks more accessible.
Our code is available online at github.com/SkiddieAhn/Paper-VideoPatchCore.
comment: Accepted to ACCV 2024
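For orientation, a minimal PatchCore-style memory in the spirit of the method:
store features of normal data and score queries by nearest-neighbor distance.
Coreset subsampling and the three video-specific memories are omitted.

    import torch

    class PatchMemory:
        # Minimal memory bank: store features of normal frames, score a
        # query by its distance to the nearest stored feature.
        def __init__(self):
            self.bank = []

        def add(self, feats):                 # feats: (N, C), normal data
            self.bank.append(feats)

        def score(self, query):               # query: (M, C)
            bank = torch.cat(self.bank)       # (K, C)
            d = torch.cdist(query, bank)      # (M, K) pairwise L2 distances
            return d.min(dim=1).values        # anomaly score per query patch

    mem = PatchMemory()
    mem.add(torch.randn(500, 128))            # features of normal frames
    scores = mem.score(torch.randn(10, 128))  # higher = more anomalous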
♻ ★ SpaRED benchmark: Enhancing Gene Expression Prediction from Histology Images with Spatial Transcriptomics Completion
Spatial Transcriptomics is a novel technology that aligns histology images
with spatially resolved gene expression profiles. Although groundbreaking, it
struggles with gene capture, yielding high corruption in the acquired data. Given
potential applications, recent efforts have focused on predicting
transcriptomic profiles solely from histology images. However, differences in
databases, preprocessing techniques, and training hyperparameters hinder a fair
comparison between methods. To address these challenges, we present a
systematically curated and processed database collected from 26 public sources,
representing an 8.6-fold increase compared to previous works. Additionally, we
propose a state-of-the-art transformer based completion technique for inferring
missing gene expression, which significantly boosts the performance of
transcriptomic profile predictions across all datasets. Altogether, our
contributions constitute the most comprehensive benchmark of gene expression
prediction from histology images to date and a stepping stone for future
research on spatial transcriptomics.
♻ ★ ChaosBench: A Multi-Channel, Physics-Based Benchmark for Subseasonal-to-Seasonal Climate Prediction NeurIPS'24
Accurate prediction of climate in the subseasonal-to-seasonal scale is
crucial for disaster preparedness and robust decision making amidst climate
change. Yet, forecasting beyond the weather timescale is challenging because it
deals with problems other than initial condition, including boundary
interaction, butterfly effect, and our inherent lack of physical understanding.
At present, existing benchmarks tend to have a short forecasting range of up to
15 days, do not include a wide range of operational baselines, and lack
physics-based constraints for explainability. Thus, we propose ChaosBench, a
challenging benchmark to extend the predictability range of data-driven weather
emulators to the S2S timescale. First, ChaosBench comprises variables beyond
the typical surface-atmospheric ERA5 to also include ocean, ice, and land
reanalysis products that span over 45 years to allow for full Earth system
emulation that respects boundary conditions. We also propose physics-based
metrics, in addition to deterministic and probabilistic ones, to ensure a
physically consistent ensemble that accounts for the butterfly effect.
Furthermore, we evaluate a diverse set of physics-based forecasts from four
national weather agencies as baselines alongside data-driven counterparts such
as ViT/ClimaX, PanguWeather, GraphCast, and FourCastNetV2. Overall, we find
that methods originally developed for weather-scale applications fail on the
S2S task: their performance simply collapses to that of an unskilled
climatology. Nonetheless, we
outline and demonstrate several strategies that can extend the predictability
range of existing weather emulators, including the use of ensembles, robust
control of error propagation, and the use of physics-informed models. Our
benchmark, datasets, and instructions are available at
https://leap-stc.github.io/ChaosBench.
comment: Accepted as Oral in NeurIPS'24 D&B Track
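As one example of what a physics-based diagnostic can look like (the
benchmark's actual metrics may differ), the sketch below compares radially
averaged power spectra of forecast and reference; a forecast that has
collapsed toward climatology loses small-scale spectral power.

    import numpy as np

    def radial_power_spectrum(field):
        # Radially averaged power spectrum of a 2D field.
        power = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
        h, w = field.shape
        yy, xx = np.indices((h, w))
        r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
        return np.bincount(r.ravel(), power.ravel()) / np.bincount(r.ravel())

    def spectral_divergence(forecast, reference, eps=1e-12):
        # Mean absolute log-ratio of spectra; large values flag the loss of
        # small-scale energy typical of over-smoothed, climatology-like output.
        pf = radial_power_spectrum(forecast)
        pr = radial_power_spectrum(reference)
        n = min(len(pf), len(pr))
        return float(np.mean(np.abs(np.log((pf[:n] + eps) / (pr[:n] + eps)))))

    print(spectral_divergence(np.random.rand(64, 64), np.random.rand(64, 64)))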
♻ ★ A New Dataset for Monocular Depth Estimation Under Viewpoint Shifts ECCV 2024
Aurel Pjetri, Stefano Caprasecca, Leonardo Taccari, Matteo Simoncini, Henrique Piñeiro Monteagudo, Walter Wallace, Douglas Coimbra de Andrade, Francesco Sambo, Andrew David Bagdanov
Monocular depth estimation is a critical task for autonomous driving and many
other computer vision applications. While significant progress has been made in
this field, the effects of viewpoint shifts on depth estimation models remain
largely underexplored. This paper introduces a novel dataset and evaluation
methodology to quantify the impact of different camera positions and
orientations on monocular depth estimation performance. We propose a ground
truth strategy based on homography estimation and object detection, eliminating
the need for expensive lidar sensors. We collect a diverse dataset of road
scenes from multiple viewpoints and use it to assess the robustness of a modern
depth estimation model to geometric shifts. After assessing the validity of our
strategy on a public dataset, we provide valuable insights into the limitations
of current models and highlight the importance of considering viewpoint
variations in real-world applications.
comment: 17 pages, 5 figures. Accepted at ECCV 2024 2nd Workshop on
Vision-Centric Autonomous Driving (VCAD)
♻ ★ A preliminary study on continual learning in computer vision using Kolmogorov-Arnold Networks
Alessandro Cacciatore, Valerio Morelli, Federica Paganica, Emanuele Frontoni, Lucia Migliorelli, Daniele Berardini
Deep learning has long been dominated by multi-layer perceptrons (MLPs),
which have demonstrated superiority over other optimizable models in various
domains. Recently, a new alternative to MLPs has emerged: Kolmogorov-Arnold
Networks (KANs), which are based on a fundamentally different mathematical
framework. According to their authors, KANs address several major issues in
MLPs, such as catastrophic forgetting in continual learning scenarios. However,
this claim has only been supported by results from a regression task on a toy
1D dataset. In this paper, we extend the investigation by evaluating the
performance of KANs in continual learning tasks within computer vision,
specifically using the MNIST datasets. To this end, we conduct a structured
analysis of the behavior of MLPs and two KAN-based models in a
class-incremental learning scenario, ensuring that the architectures involved
have the same number of trainable parameters. Our results demonstrate that an
efficient version of KAN outperforms both traditional MLPs and the original KAN
implementation. We further analyze the influence of hyperparameters in MLPs and
KANs, as well as the impact of certain trainable parameters in KANs, such as
bias and scale weights. Additionally, we provide a preliminary investigation of
recent KAN-based convolutional networks and compare their performance with that
of traditional convolutional neural networks. Our codes can be found at
https://github.com/MrPio/KAN-Continual_Learning_tests.
♻ ★ A Novel Framework for the Automated Characterization of Gram-Stained Blood Culture Slides Using a Large-Scale Vision Transformer
Jack McMahon, Naofumi Tomita, Elizabeth S. Tatishev, Adrienne A. Workman, Cristina R Costales, Niaz Banaei, Isabella W. Martin, Saeed Hassanpour
This study introduces a new framework for the artificial
intelligence-assisted characterization of Gram-stained whole-slide images
(WSIs). As a test for the diagnosis of bloodstream infections, Gram stains
provide critical early data to inform patient treatment. Rapid and reliable
analysis of Gram stains has been shown to be positively associated with better
clinical outcomes, underscoring the need for improved tools to automate Gram
stain analysis. In this work, we developed a novel transformer-based model for
Gram-stained WSI classification, which is more scalable to large datasets than
previous convolutional neural network (CNN)-based methods, as it does not
require patch-level manual annotations. We also introduce a large Gram stain
dataset from Dartmouth-Hitchcock Medical Center (Lebanon, New Hampshire, USA)
to evaluate our model, exploring the classification of five major categories of
Gram-stained WSIs: Gram-positive cocci in clusters, Gram-positive cocci in
pairs/chains, Gram-positive rods, Gram-negative rods, and slides with no
bacteria. Our model achieves a classification accuracy of 0.858 (95% CI: 0.805,
0.905) and an AUC of 0.952 (95% CI: 0.922, 0.976) using five-fold nested
cross-validation on our 475-slide dataset, demonstrating the potential of
large-scale transformer models for Gram stain classification. We further
demonstrate the generalizability of our trained model, which achieves strong
performance on external datasets without additional fine-tuning.
♻ ★ The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers ECCV 2024
Knowledge distillation is an effective method for training lightweight vision
models. However, acquiring teacher supervision for training samples is often
costly, especially from large-scale models like vision transformers (ViTs). In
this paper, we develop a simple framework to reduce the supervision cost of ViT
distillation: masking out a fraction of input tokens given to the teacher. By
masking input tokens, one can skip the computations associated with the masked
tokens without requiring any change to teacher parameters or architecture. We
find that masking patches with the lowest student attention scores is highly
effective, saving up to 50% of teacher FLOPs without any drop in student
accuracy, while other masking criteria lead to suboptimal efficiency gains.
Through in-depth analyses, we reveal that the student-guided masking provides a
good curriculum to the student, making teacher supervision easier to follow
during the early stage and challenging in the later stage.
comment: ECCV 2024
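A minimal sketch of the selection step, assuming the student's CLS-to-patch
attention scores are already available: keep the top fraction of tokens by
attention and hand only those to the teacher.

    import torch

    def mask_tokens_for_teacher(tokens, student_cls_attn, keep_ratio=0.5):
        # tokens:           (B, N, C) patch tokens (CLS excluded)
        # student_cls_attn: (B, N) student's CLS-to-patch attention scores
        # Keep the tokens the student attends to most; the teacher's forward
        # pass then skips computation on the dropped tokens.
        B, N, C = tokens.shape
        k = max(1, int(N * keep_ratio))
        idx = student_cls_attn.topk(k, dim=1).indices            # (B, k)
        kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(B, k, C))
        return kept, idx                                         # feed kept to teacher

    kept, idx = mask_tokens_for_teacher(torch.randn(2, 196, 768),
                                        torch.rand(2, 196), keep_ratio=0.5)
    print(kept.shape)  # torch.Size([2, 98, 768])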
♻ ★ Deep Bayesian Future Fusion for Self-Supervised, High-Resolution, Off-Road Mapping
Shubhra Aich, Wenshan Wang, Parv Maheshwari, Matthew Sivaprakasam, Samuel Triest, Cherie Ho, Jason M. Gregory, John G. Rogers III, Sebastian Scherer
High-speed off-road navigation requires long-range, high-resolution maps to
enable robots to safely navigate over different surfaces while avoiding
dangerous obstacles. However, due to limited computational power and sensing
noise, most approaches to off-road mapping focus on producing coarse (20-40cm)
maps of the environment. In this paper, we propose Future Fusion, a framework
capable of generating dense, high-resolution maps from sparse sensing data (30m
forward at 2cm). This is accomplished by (1) the efficient realization of the
well-known Bayes filtering within the standard deep learning models that
explicitly accounts for the sparsity pattern in stereo and LiDAR depth data,
and (2) leveraging perceptual losses common in generative image completion. The
proposed methodology outperforms the conventional baselines. Moreover, the
learned features and the completed dense maps lead to improvements in the
downstream navigation task.
♻ ★ Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
Text-to-Image (T2I) models excel at synthesizing concepts such as nouns,
appearances, and styles. To enable customized content creation based on a few
example images of a concept, methods such as Textual Inversion and DreamBooth
invert the desired concept and enable synthesizing it in new scenes. However,
inverting personalized concepts that go beyond object appearance and style
(adjectives and verbs) through natural language remains a challenge. Two key
characteristics of these concepts contribute to the limitations of current
inversion methods. 1) Adjectives and verbs are entangled with nouns (subject)
and can hinder appearance-based inversion methods, where the subject appearance
leaks into the concept embedding, and 2) describing such concepts often extends
beyond single word embeddings.
In this study, we introduce Lego, a textual inversion method designed to
invert subject-entangled concepts from a few example images. Lego disentangles
concepts from their associated subjects using a simple yet effective Subject
Separation step and employs a Context Loss that guides the inversion of
single/multi-embedding concepts. In a thorough user study, Lego-generated
concepts were preferred over 70% of the time when compared to the baseline in
terms of authentically generating concepts according to a reference.
Additionally, visual question answering using an LLM suggested Lego-generated
concepts are better aligned with the text description of the concept.
♻ ★ DeRainGS: Gaussian Splatting for Enhanced Scene Reconstruction in Rainy Environments
Reconstruction under adverse rainy conditions poses significant challenges
due to reduced visibility and the distortion of visual perception. These
conditions can severely impair the quality of geometric maps, which is
essential for applications ranging from autonomous planning to environmental
monitoring. In response to these challenges, this study introduces the novel
task of 3D Reconstruction in Rainy Environments (3DRRE), specifically designed
to address the complexities of reconstructing 3D scenes under rainy conditions.
To benchmark this task, we construct the HydroViews dataset that comprises a
diverse collection of both synthesized and real-world scene images
characterized by various intensities of rain streaks and raindrops.
Furthermore, we propose DeRainGS, the first 3DGS method tailored for
reconstruction in adverse rainy environments. Extensive experiments across a
wide range of rain scenarios demonstrate that our method delivers
state-of-the-art performance, remarkably outperforming existing occlusion-free
methods.
♻ ★ High-Frequency Anti-DreamBooth: Robust Defense against Personalized Image Synthesis ECCV 2024
Recently, text-to-image generative models have been misused to create
unauthorized malicious images of individuals, posing a growing social problem.
Previous solutions, such as Anti-DreamBooth, add adversarial noise to images to
protect them from being used as training data for malicious generation.
However, we found that the adversarial noise can be removed by adversarial
purification methods such as DiffPure. Therefore, we propose a new adversarial
attack method that adds strong perturbation on the high-frequency areas of
images to make it more robust to adversarial purification. Our experiment
showed that the adversarial images retained noise even after adversarial
purification, hindering malicious image generation.
comment: ECCV 2024 Workshop The Dark Side of Generative AIs and Beyond
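For intuition, a sketch of the spatial weighting idea only: concentrate a
perturbation budget on high-frequency (edge) regions via a Sobel-magnitude
mask. The paper's attack optimizes the perturbation adversarially; the random
noise here is a stand-in.

    import numpy as np
    from scipy import ndimage

    def high_freq_weighted_noise(img, eps=8 / 255,
                                 rng=np.random.default_rng(0)):
        # Scale a perturbation by local high-frequency content so most of
        # the budget lands where purification struggles to remove it.
        gx = ndimage.sobel(img, axis=0)
        gy = ndimage.sobel(img, axis=1)
        hf = np.hypot(gx, gy)
        mask = hf / (hf.max() + 1e-8)     # 0 in flat areas, 1 on edges
        delta = rng.uniform(-eps, eps, size=img.shape) * mask
        return np.clip(img + delta, 0.0, 1.0)

    protected = high_freq_weighted_noise(np.random.rand(64, 64))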
♻ ★ Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer
Vision-based perception and reasoning is essential for scene understanding in
any autonomous system. RGB and depth images are commonly used to capture both
the semantic and geometric features of the environment. Developing methods to
reliably interpret this data is critical for real-world applications, where
noisy measurements are often unavoidable. In this work, we introduce a
diffusion-based framework to address the RGB-D semantic segmentation problem.
Additionally, we demonstrate that utilizing a Deformable Attention Transformer
as the encoder to extract features from depth images effectively captures the
characteristics of invalid regions in depth measurements. Our generative
framework shows a greater capacity to model the underlying distribution of
RGB-D images, achieving robust performance in challenging scenarios with
significantly less training time compared to discriminative methods.
Experimental results indicate that our approach achieves State-of-the-Art
performance on both the NYUv2 and SUN-RGBD datasets overall, and especially on
their most challenging image data. Our project page will be available at
https://diffusionmms.github.io/
♻ ★ I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing NeurIPS2024
Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, Rongrong Ji
Significant progress has been made in the field of Instruction-based Image
Editing (IIE). However, evaluating these models poses a significant challenge.
A crucial requirement in this field is the establishment of a comprehensive
evaluation benchmark for accurately assessing editing results and providing
valuable insights for its further development. In response to this need, we
propose I2EBench, a comprehensive benchmark designed to automatically evaluate
the quality of edited images produced by IIE models from multiple dimensions.
I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding
original and diverse instructions. It offers three distinctive characteristics:
1) Comprehensive Evaluation Dimensions: I2EBench comprises 16 evaluation
dimensions that cover both high-level and low-level aspects, providing a
comprehensive assessment of each IIE model. 2) Human Perception Alignment: To
ensure the alignment of our benchmark with human perception, we conducted an
extensive user study for each evaluation dimension. 3) Valuable Research
Insights: By analyzing the advantages and disadvantages of existing IIE models
across the 16 dimensions, we offer valuable research insights to guide future
development in the field. We will open-source I2EBench, including all
instructions, input images, human annotations, edited images from all evaluated
methods, and a simple script for evaluating the results from new IIE models.
The code, dataset and generated images from all IIE models are provided in
github: https://github.com/cocoshe/I2EBench.
comment: NeurIPS2024, 15 pages, 7 figures
♻ ★ Hierarchical Windowed Graph Attention Network and a Large Scale Dataset for Isolated Indian Sign Language Recognition
Suvajit Patra, Arkadip Maitra, Megha Tiwari, K. Kumaran, Swathy Prabhu, Swami Punyeshwarananda, Soumitra Samanta
Automatic Sign Language (SL) recognition is an important task in the computer
vision community. To build a robust SL recognition system, we need a
considerable amount of data which is lacking particularly in Indian sign
language (ISL). In this paper, we introduce a large-scale isolated ISL dataset
and a novel SL recognition model based on skeleton graph structure. The dataset
covers 2002 common words used daily in the deaf community, recorded by 20 (10
male and 10 female) deaf adult signers (40033 videos in total). We propose an SL
recognition model namely Hierarchical Windowed Graph Attention Network (HWGAT)
by utilizing the human upper body skeleton graph. The HWGAT tries to capture
distinctive motions by giving attention to different body parts induced by the
human skeleton graph. The utility of the proposed dataset and the usefulness of
our model are evaluated through extensive experiments. We pre-trained the
proposed model on the presented dataset and fine-tuned it on different sign
language datasets, boosting performance by 1.10, 0.46, 0.78, and 6.84
percentage points on INCLUDE, LSA64, AUTSL, and WLASL, respectively, compared
to existing state-of-the-art keypoint-based models.
♻ ★ TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian Tang
Vision-Language-Action (VLA) models have shown remarkable potential in
visuomotor control and instruction comprehension through end-to-end learning
processes. However, current VLA models face significant challenges: they are
slow during inference and require extensive pre-training on large amounts of
robotic data, making real-world deployment difficult. In this paper, we
introduce a new family of compact vision-language-action models, called
TinyVLA, which offers two key advantages over existing VLA models: (1) faster
inference speeds, and (2) improved data efficiency, eliminating the need for
pre-training stage. Our framework incorporates two essential components to
build TinyVLA: (1) initializing the policy backbone with robust, high-speed
multimodal models, and (2) integrating a diffusion policy decoder during
fine-tuning to enable precise robot actions. We conducted extensive evaluations
of TinyVLA in both simulation and on real robots, demonstrating that our
approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in
terms of speed and data efficiency, while delivering comparable or superior
performance. Additionally, TinyVLA exhibits strong generalization capabilities
across various dimensions, including language instructions, novel objects,
unseen positions, changes in object appearance, background variations, and
environmental shifts, often matching or exceeding the performance of OpenVLA.
We believe that TinyVLA offers an interesting perspective on utilizing
pre-trained multimodal models for policy learning. Our project is at
https://tiny-vla.github.io.
comment: add more citations
♻ ★ Implicit Image-to-Image Schrodinger Bridge for Image Restoration
Yuang Wang, Siyeop Yoon, Pengfei Jin, Matthew Tivnan, Sifan Song, Zhennong Chen, Rui Hu, Li Zhang, Quanzheng Li, Zhiqiang Chen, Dufan Wu
Diffusion-based models are widely recognized for their effectiveness in image
restoration tasks; however, their iterative denoising process, which begins
from Gaussian noise, often results in slow inference speeds. The Image-to-Image
Schr\"odinger Bridge (I$^2$SB) presents a promising alternative by starting the
generative process from corrupted images and leveraging training techniques
from score-based diffusion models. In this paper, we introduce the Implicit
Image-to-Image Schr\"odinger Bridge (I$^3$SB) to further accelerate the
generative process of I$^2$SB. I$^3$SB reconfigures the generative process into
a non-Markovian framework by incorporating the initial corrupted image into
each step, while ensuring that the marginal distribution aligns with that of
I$^2$SB. This allows for the direct use of the pretrained network from I$^2$SB.
Extensive experiments on natural images, human face images, and medical images
validate the acceleration benefits of I$^3$SB. Compared to I$^2$SB, I$^3$SB
achieves the same perceptual quality with fewer generative steps, while
maintaining equal or improved fidelity to the ground truth.
comment: 23 pages, 8 figures, submitted to Pattern Recognition
♻ ★ CCFExp: Facial Image Synthesis with Cycle Cross-Fusion Diffusion Model for Facial Paralysis Individuals
Facial paralysis is a debilitating condition that affects the movement of
facial muscles, leading to a significant loss of facial expressions. Currently,
the diagnosis of facial paralysis remains a challenging task, often relying
heavily on the subjective judgment and experience of clinicians, which can
introduce variability and uncertainty in the assessment process. One promising
application in real-life situations is the automatic estimation of facial
paralysis. However, the scarcity of facial paralysis datasets limits the
development of robust machine learning models for automated diagnosis and
therapeutic interventions. To this end, this study aims to synthesize a
high-quality facial paralysis dataset to address this gap, enabling more
accurate and efficient algorithm training. Specifically, a novel Cycle
Cross-Fusion Expression Generative Model (CCFExp) based on the diffusion model
is proposed to combine different features of facial information and enhance the
visual details of facial appearance and texture in facial regions, thus
creating synthetic facial images that accurately represent various degrees and
types of facial paralysis. We have qualitatively and quantitatively evaluated
the proposed method on the commonly used public clinical datasets of facial
paralysis to demonstrate its effectiveness. Experimental results indicate that
the proposed method surpasses state-of-the-art methods, generating more
realistic facial images and maintaining identity consistency.
♻ ★ Platypose: Calibrated Zero-Shot Multi-Hypothesis 3D Human Motion Estimation
Single camera 3D pose estimation is an ill-defined problem due to inherent
ambiguities from depth, occlusion or keypoint noise. Multi-hypothesis pose
estimation accounts for this uncertainty by providing multiple 3D poses
consistent with the 2D measurements. Current research has predominantly
concentrated on generating multiple hypotheses for single frame static pose
estimation or single hypothesis motion estimation. In this study we focus on
the new task of multi-hypothesis motion estimation. Multi-hypothesis motion
estimation is not simply multi-hypothesis pose estimation applied to multiple
frames, which would ignore temporal correlation across frames. Instead, it
requires distributions which are capable of generating temporally consistent
samples, which is significantly more challenging than multi-hypothesis pose
estimation or single-hypothesis motion estimation. To this end, we introduce
Platypose, a framework that uses a diffusion model pretrained on 3D human
motion sequences for zero-shot 3D pose sequence estimation. Platypose
outperforms baseline methods on multiple hypotheses for motion estimation.
Additionally, Platypose also achieves state-of-the-art calibration and
competitive joint error when tested on static poses from Human3.6M,
MPI-INF-3DHP and 3DPW. Finally, because it is zero-shot, our method generalizes
flexibly to different settings such as multi-camera inference.
♻ ★ EMR-Merging: Tuning-Free High-Performance Model Merging NeurIPS 2024
The success of pretrain-finetune paradigm brings about the release of
numerous model weights. In this case, merging models finetuned on different
tasks to enable a single model with multi-task capabilities is gaining
increasing attention for its practicability. Existing model merging methods
usually suffer from (1) significant performance degradation or (2) requiring
tuning by additional data or training. In this paper, we rethink and analyze
the existing model merging paradigm. We discover that using a single model's
weights can hardly simulate all the models' performance. To tackle this issue,
we propose Elect, Mask & Rescale-Merging (EMR-Merging). We first (a) elect a
unified model from all the model weights and then (b) generate extremely
lightweight task-specific modulators, including masks and rescalers, to align
the direction and magnitude between the unified model and each specific model,
respectively. EMR-Merging is tuning-free, thus requiring no data availability
or any additional training while showing impressive performance. We find that
EMR-Merging shows outstanding performance compared to existing merging methods
under different classical and newly-established settings, including merging
different numbers of vision models (up to 30), NLP models, PEFT models, and
multi-modal models.
comment: NeurIPS 2024
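A minimal sketch of the elect/mask/rescale idea on flattened task vectors
(finetuned minus pretrained weights); the election rule and modulator
definitions below are one plausible reading for illustration, not the paper's
verified equations.

    import torch

    def emr_merge(task_vectors):
        # task_vectors: list of 1-D tensors (finetuned - pretrained weights).
        # Returns the unified vector plus per-task (mask, rescaler) modulators.
        tv = torch.stack(task_vectors)                   # (T, D)
        sign = torch.sign(tv.sum(dim=0))                 # elected direction
        agree = (torch.sign(tv) == sign)                 # (T, D) sign agreement
        # unified magnitude: max agreeing magnitude per coordinate (assumed)
        unified = sign * (tv.abs() * agree).max(dim=0).values
        mods = []
        for t in range(tv.shape[0]):
            mask = agree[t].float()                      # task-specific mask
            masked = unified * mask
            scale = tv[t].abs().mean() / (masked.abs().mean() + 1e-12)
            mods.append((mask, scale))                   # rescaler aligns magnitude
        return unified, mods

    uni, mods = emr_merge([torch.randn(1000) for _ in range(3)])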
♻ ★ FracGM: A Fast Fractional Programming Technique for Geman-McClure Robust Estimator
Robust estimation is essential in computer vision, robotics, and navigation,
aiming to minimize the impact of outlier measurements for improved accuracy. We
present a fast algorithm for Geman-McClure robust estimation, FracGM,
leveraging fractional programming techniques. This solver reformulates the
original non-convex fractional problem to a convex dual problem and a linear
equation system, iteratively solving them in an alternating optimization
pattern. Compared to graduated non-convexity approaches, this strategy exhibits
a faster convergence rate and better outlier rejection capability. In addition,
the global optimality of the proposed solver can be guaranteed under given
conditions. We demonstrate the proposed FracGM solver with Wahba's rotation
problem and 3-D point-cloud registration along with relaxation pre-processing
and projection post-processing. Compared to state-of-the-art algorithms, when
the outlier rate increases from 20% to 80%, FracGM shows 53% and 88% smaller
increases in rotation and translation error, respectively. In real-world
scenarios, FracGM achieves better results in 13 out of 18 outcomes, while
improving computation time by 19.43%.
comment: 8 pages, 6 figures
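For context, the classical alternative that FracGM improves upon: iteratively
reweighted least squares (IRLS) for the Geman-McClure cost on a linear toy
problem. This is the baseline technique, plainly swapped in for illustration,
not the fractional-programming solver itself.

    import numpy as np

    def gm_irls(A, b, c=1.0, iters=50):
        # min_x sum_i GM(r_i), GM(r) = r^2 / (r^2 + c^2), solved by IRLS.
        x = np.linalg.lstsq(A, b, rcond=None)[0]
        for _ in range(iters):
            r = A @ x - b
            w = c**2 / (r**2 + c**2) ** 2   # GM weights downweight outliers
            Aw = A * w[:, None]
            x = np.linalg.solve(A.T @ Aw, Aw.T @ b)
        return x

    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 3))
    x_true = np.array([1.0, -2.0, 0.5])
    b = A @ x_true + 0.01 * rng.normal(size=100)
    b[:30] += rng.normal(scale=5.0, size=30)  # 30% gross outliers
    print(gm_irls(A, b))                      # close to x_true despite outliers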
♻ ★ 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation?
Co-speech gestures are fundamental for communication. The advent of recent
deep learning techniques has facilitated the creation of lifelike, synchronous
co-speech gestures for Embodied Conversational Agents. "In-the-wild" datasets,
aggregating video content from platforms like YouTube via human pose detection
technologies, provide a feasible solution by offering 2D skeletal sequences
aligned with speech. Concurrent developments in lifting models enable the
conversion of these 2D sequences into 3D gesture databases. However, it is
important to note that the 3D poses estimated from the 2D extracted poses are,
in essence, approximations of the ground-truth, which remains in the 2D domain.
This distinction raises questions about the impact of gesture representation
dimensionality on the quality of generated motions - a topic that, to our
knowledge, remains largely unexplored. Our study examines the effect of using
either 2D or 3D joint coordinates as training data on the performance of
speech-to-gesture deep generative models. We employ a lifting model for
converting generated 2D pose sequences into 3D and assess how gestures created
directly in 3D stack up against those initially generated in 2D and then
converted to 3D. We perform an objective evaluation using widely used metrics
in the gesture generation field as well as a user study to qualitatively
evaluate the different approaches.
comment: arXiv admin note: substantial text overlap with arXiv:2406.15111
♻ ★ JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation
We introduce the Joint Video-Image Diffusion model (JVID), a novel approach
to generating high-quality and temporally coherent videos. We achieve this by
integrating two diffusion models: a Latent Image Diffusion Model (LIDM) trained
on images and a Latent Video Diffusion Model (LVDM) trained on video data. Our
method combines these models in the reverse diffusion process, where the LIDM
enhances image quality and the LVDM ensures temporal consistency. This unique
combination allows us to effectively handle the complex spatio-temporal
dynamics in video generation. Our results demonstrate quantitative and
qualitative improvements in producing realistic and coherent videos.
♻ ★ Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer
Motivated by the huge success of Transformers in the field of natural
language processing (NLP), Vision Transformers (ViTs) have been rapidly
developed and achieved remarkable performance in various computer vision tasks.
However, their huge model sizes and intensive computations hinder ViTs'
deployment on embedded devices, calling for effective model compression
methods, such as quantization. Unfortunately, due to the existence of
hardware-unfriendly and quantization-sensitive non-linear operations,
particularly Softmax, it is non-trivial to completely quantize all operations
in ViTs, yielding either significant accuracy drops or non-negligible hardware
costs. In response to challenges associated with \textit{standard ViTs}, we
focus our attention towards the quantization and acceleration for
\textit{efficient ViTs}, which not only eliminate the troublesome Softmax but
also integrate linear attention with low computational complexity, and propose
Trio-ViT accordingly. Specifically, at the algorithm level, we develop a
tailored post-training quantization engine taking the unique activation
distributions of Softmax-free efficient ViTs into full consideration, aiming to
boost quantization accuracy. Furthermore, at the hardware level, we build an
accelerator dedicated to the specific Convolution-Transformer hybrid
architecture of efficient ViTs, thereby enhancing hardware efficiency.
Extensive experimental results consistently prove the effectiveness of our
Trio-ViT framework. Particularly, we can gain up to
$\uparrow$$\mathbf{3.6}\times$, $\uparrow$$\mathbf{5.0}\times$, and
$\uparrow$$\mathbf{7.3}\times$ FPS under comparable accuracy over
state-of-the-art ViT accelerators, as well as $\uparrow$$\mathbf{6.0}\times$,
$\uparrow$$\mathbf{1.5}\times$, and $\uparrow$$\mathbf{2.1}\times$ DSP
efficiency. Codes are available at
\url{https://github.com/shihuihong214/Trio-ViT}.
♻ ★ Personalized Video Relighting With an At-Home Light Stage
In this paper, we develop a personalized video relighting algorithm that
produces high-quality and temporally consistent relit videos under any pose,
expression, and lighting condition in real-time. Existing relighting algorithms
typically rely either on publicly available synthetic data, which yields poor
relighting results, or on actual light stage data which is difficult to
acquire. We show that by just capturing recordings of a user watching YouTube
videos on a monitor we can train a personalized algorithm capable of performing
high-quality relighting under any condition. Our key contribution is a novel
image-based neural relighting architecture that effectively separates the
intrinsic appearance features - the geometry and reflectance of the face - from
the source lighting and then combines them with the target lighting to generate
a relit image. This neural architecture enables smoothing of intrinsic
appearance features leading to temporally stable video relighting. Both
qualitative and quantitative evaluations show that our architecture improves
portrait image relighting quality and temporal consistency over
state-of-the-art approaches on both casually captured `Light Stage at Your
Desk' (LSYD) and light-stage-captured `One Light At a Time' (OLAT) datasets.
♻ ★ SharkTrack: an accurate, generalisable software for streamlining shark and ray underwater video analysis
Filippo Varini, Joel H. Gayford, Jeremy Jenrette, Matthew J. Witt, Francesco Garzon, Francesco Ferretti, Sophie Wilday, Mark E. Bond, Michael R. Heithaus, Danielle Robinson, Devon Carter, Najee Gumbs, Vincent Webster, Ben Glocker
Elasmobranchs (sharks and rays) represent a critical component of marine
ecosystems. Yet, they are experiencing global population declines, and effective
monitoring of populations is essential to their protection. Underwater
stationary videos, such as those from Baited Remote Underwater Video Stations
(BRUVS), are critical for understanding elasmobranch spatial ecology and
abundance. However, processing these videos requires time-consuming manual
analysis that can delay conservation. To address this challenge, we developed
SharkTrack, a semi-automatic underwater video analysis software. SharkTrack
uses Convolutional Neural Networks (CNN) and Multi-Object Tracking to
automatically detect and track elasmobranchs and provides an annotation
pipeline to manually classify elasmobranch species and compute species-specific
MaxN (ssMaxN), the standard metric of relative abundance. When tested on BRUVS
footage from locations unseen by the CNN model during training, SharkTrack
computed ssMaxN with 89% accuracy over 207 hours of footage. The semi-automatic
SharkTrack pipeline required two minutes of manual classification per hour of
video, an estimated 95% reduction of manual analysis time compared to
traditional methods. Furthermore, we demonstrate SharkTrack accuracy across
diverse marine ecosystems and elasmobranch species, an advancement compared to
previous models, which were limited to specific species or locations.
SharkTrack applications extend beyond BRUVS, facilitating the analysis of any
underwater stationary video. By making video analysis faster and more
accessible, SharkTrack enables research and conservation organisations to
monitor elasmobranch populations more efficiently, thereby improving
conservation efforts. To further support these goals, we provide public access
to the SharkTrack software.
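A minimal sketch of the ssMaxN computation from classified detections: for
each species, MaxN is the maximum number of individuals visible in any single
frame.

    from collections import Counter, defaultdict

    def species_maxn(detections):
        # detections: iterable of (frame_id, species) pairs after the manual
        # species-classification step. Returns species-specific MaxN.
        per_frame = defaultdict(Counter)
        for frame, species in detections:
            per_frame[frame][species] += 1
        maxn = Counter()
        for counts in per_frame.values():
            for sp, n in counts.items():
                maxn[sp] = max(maxn[sp], n)
        return dict(maxn)

    dets = [(0, "blacktip"), (0, "blacktip"), (1, "nurse"), (2, "blacktip")]
    print(species_maxn(dets))  # {'blacktip': 2, 'nurse': 1}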
♻ ★ Efficient Exploration of Image Classifier Failures with Bayesian Optimization and Text-to-Image Models
Image classifiers should be used with caution in the real world. Performance
evaluated on a validation set may not reflect performance in the real world. In
particular, classifiers may perform well for conditions that are frequently
encountered during training, but poorly for other infrequent conditions. In
this study, we hypothesize that recent advances in text-to-image generative
models make them valuable for benchmarking computer vision models such as image
classifiers: they can generate images conditioned by textual prompts that cause
classifier failures, allowing failure conditions to be described with textual
attributes. However, their generation cost becomes an issue when a large number
of synthetic images need to be generated, which is the case when many different
attribute combinations need to be tested. We propose an image classifier
benchmarking method as an iterative process that alternates image generation,
classifier evaluation, and attribute selection. This method efficiently
explores the attribute space to uncover the conditions under which the
classifier behaves poorly.
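A skeleton of the generate/evaluate/select loop; generate and classify are
user-supplied stand-ins, the prompt template is a toy, and the greedy-random
selection below replaces the paper's Bayesian optimization over attributes.

    import itertools
    import random

    def explore_failures(attributes, generate, classify,
                         rounds=20, per_prompt=8):
        # attributes: dict mapping attribute name -> list of values.
        # generate(prompt, n) returns n images; classify(images) returns
        # per-image correctness (0/1). Both are stand-ins here.
        combos = list(itertools.product(*attributes.values()))
        random.shuffle(combos)  # stand-in for a BO acquisition rule
        results = {}
        for combo in combos[:rounds]:
            prompt = "a photo of a dog, " + ", ".join(combo)  # toy template
            images = generate(prompt, per_prompt)
            results[combo] = sum(classify(images)) / per_prompt
        # worst-performing attribute combinations first
        return sorted(results.items(), key=lambda kv: kv[1])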
♻ ★ Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector ECCV2024
Yuqian Fu, Yu Wang, Yixuan Pan, Lian Huai, Xingyu Qiu, Zeyu Shangguan, Tong Liu, Yanwei Fu, Luc Van Gool, Xingqun Jiang
This paper studies the challenging cross-domain few-shot object detection
(CD-FSOD), aiming to develop an accurate object detector for novel domains with
minimal labeled examples. While transformer-based open-set detectors, such as
DE-ViT, show promise in traditional few-shot object detection, their
generalization to CD-FSOD remains unclear: 1) can such open-set detection
methods easily generalize to CD-FSOD? 2) If not, how can models be enhanced
when facing huge domain gaps? To answer the first question, we employ measures
including style, inter-class variance (ICV), and indefinable boundaries (IB) to
understand the domain gap. Based on these measures, we establish a new
benchmark named CD-FSOD to evaluate object detection methods, revealing that
most of the current approaches fail to generalize across domains. Technically,
we observe that the performance decline is associated with our proposed
measures: style, ICV, and IB. Consequently, we propose several novel modules to
address these issues. First, the learnable instance features align initial
fixed instances with target categories, enhancing feature distinctiveness.
Second, the instance reweighting module assigns higher importance to
high-quality instances with slight IB. Third, the domain prompter encourages
features resilient to different styles by synthesizing imaginary domains
without altering semantic contents. These techniques collectively contribute to
the development of the Cross-Domain Vision Transformer for CD-FSOD (CD-ViTO),
significantly improving upon the base DE-ViT. Experimental results validate the
efficacy of our model.
comment: Accepted by ECCV2024 (project website:
http://yuqianfu.com/CDFSOD-benchmark)
♻ ★ CauSkelNet: Causal Representation Learning for Human Behaviour Analysis
Xingrui Gu, Chuyi Jiang, Erte Wang, Zekun Wu, Qiang Cui, Leimin Tian, Lianlong Wu, Siyang Song, Chuang Yu
Traditional machine learning methods for movement recognition offer limited
model interpretability and only a shallow understanding of human movement. To
address this, this study introduces a novel representation learning method
based on causal inference to better understand human joint dynamics and
complex behaviors. We
propose a two-stage framework that combines the Peter-Clark (PC) algorithm and
Kullback-Leibler (KL) divergence to identify and quantify causal relationships
between joints. Our method effectively captures interactions and produces
interpretable, robust representations. Experiments on the EmoPain dataset show
that our causal GCN outperforms traditional GCNs in accuracy, F1 score, and
recall, especially in detecting protective behaviors. The model is also highly
invariant to data scale changes, enhancing its reliability in practical
applications. Our approach advances human motion analysis and paves the way for
more adaptive intelligent healthcare solutions.
♻ ★ Ultra-High-Definition Image Restoration: New Benchmarks and A Dual Interaction Prior-Driven Solution
Ultra-High-Definition (UHD) image restoration has acquired remarkable
attention due to its practical demand. In this paper, we construct UHD snow and
rain benchmarks, named UHD-Snow and UHD-Rain, to remedy the deficiency in this
field. UHD-Snow/UHD-Rain are established by taking the physical process
of rain/snow into consideration, and each benchmark contains 3200 degraded/clear
image pairs of 4K resolution. Furthermore, we propose an effective UHD image
restoration solution by considering gradient and normal priors in model design
thanks to these priors' spatial and detail contributions. Specifically, our
method contains two branches: (a) feature fusion and reconstruction branch in
high-resolution space and (b) prior feature interaction branch in
low-resolution space. The former learns high-resolution features and fuses
prior-guided low-resolution features to reconstruct clear images, while the
latter utilizes normal and gradient priors to mine useful spatial features and
detail features to guide high-resolution recovery better. To better utilize
these priors, we introduce single prior feature interaction and dual prior
feature interaction, where the former respectively fuses normal and gradient
priors with high-resolution features to enhance prior ones, while the latter
calculates the similarity between enhanced prior ones and further exploits dual
guided filtering to boost the feature interaction of dual priors. We conduct
experiments on both new and existing public datasets and demonstrate the
state-of-the-art performance of our method on UHD image low-light enhancement,
dehazing, deblurring, desnowing, and deraining. The source codes and benchmarks
are available at \url{https://github.com/wlydlut/UHDDIP}.
♻ ★ TOP-Nav: Legged Navigation Integrating Terrain, Obstacle and Proprioception Estimation
Legged navigation is typically examined within open-world, off-road, and
challenging environments. In these scenarios, estimating external disturbances
requires a complex synthesis of multi-modal information. This underlines a
major limitation in existing works that primarily focus on avoiding obstacles.
In this work, we propose TOP-Nav, a novel legged navigation framework that
integrates a comprehensive path planner with Terrain awareness, Obstacle
avoidance and closed-loop Proprioception. TOP-Nav underscores the synergies
between vision and proprioception in both path and motion planning. Within the
path planner, we present and integrate a terrain estimator that enables the
robot to select waypoints on terrains with higher traversability while
effectively avoiding obstacles. In the motion planning level, we not only
implement a locomotion controller to track the navigation commands, but also
construct a proprioception advisor to provide motion evaluations for the path
planner. Based on the closed-loop motion feedback, we make online corrections
to the vision-based terrain and obstacle estimations. Consequently, TOP-Nav
achieves open-world navigation in which the robot can handle terrains or
disturbances beyond the distribution of prior knowledge and overcomes
constraints imposed by visual conditions. Building upon extensive experiments
conducted in both simulation and real-world environments, TOP-Nav demonstrates
superior performance in open-world navigation compared to existing methods.
comment: Published on CoRL 2024
♻ ★ Transformer with Leveraged Masked Autoencoder for video-based Pain Assessment
Accurate pain assessment is crucial in healthcare for effective diagnosis and
treatment; however, traditional methods relying on self-reporting are
inadequate for populations unable to communicate their pain. Cutting-edge AI is
promising for supporting clinicians in pain recognition using facial video
data. In this paper, we enhance pain recognition by employing facial video
analysis within a Transformer-based deep learning model. By combining a
powerful Masked Autoencoder with a Transformers-based classifier, our model
effectively captures pain level indicators through both expressions and
micro-expressions. We conducted our experiment on the AI4Pain dataset, which
produced promising results that pave the way for innovative healthcare
solutions that are both comprehensive and objective.
♻ ★ Lemon and Orange Disease Classification using CNN-Extracted Features and Machine Learning Classifier
Lemons and oranges are among the most economically significant citrus fruits
globally. Their production is severely affected by diseases during the growth
stages, and fruit quality degrades due to the resulting flaws. It is therefore
necessary to diagnose these diseases accurately to avoid major losses of lemons
and oranges. To improve citrus farming, we
proposed a disease classification approach for lemons and oranges. This
approach would enable early disease detection and intervention, reduce yield
losses, and optimize resource allocation. For the initial modeling of disease
classification, the research uses innovative deep learning architectures such
as VGG16, VGG19 and ResNet50. In addition, for achieving better accuracy, the
basic machine learning algorithms used for classification problems include
Random Forest, Naive Bayes, K-Nearest Neighbors (KNN) and Logistic Regression.
The model classifies lemon and orange diseases with high accuracy (95.0% for
lemon and 99.69% for orange). Its base features are extracted from the
pre-trained ResNet50 model, and the diseases are classified with Logistic
Regression, which outperforms VGG16 and VGG19 features paired with the other
classifiers. Experimental outcomes show that the proposed model also
outperforms existing models, most of which classified the diseases with a
Softmax classifier rather than a separate classifier.
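A minimal sketch of the winning pipeline as described: a frozen
ImageNet-pretrained ResNet50 feature extractor with a Logistic Regression
classifier on the pooled features. Dataset loading and preprocessing are
omitted; toy tensors stand in for the citrus images.

    import torch
    import torchvision.models as models
    from sklearn.linear_model import LogisticRegression

    # frozen ImageNet-pretrained ResNet50, truncated before the classifier head
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    @torch.no_grad()
    def extract(images):                  # images: (N, 3, 224, 224) tensor
        return backbone(images).numpy()   # (N, 2048) pooled features

    # toy stand-in data; in practice, use preprocessed lemon/orange images
    X = extract(torch.randn(16, 3, 224, 224))
    y = [0] * 8 + [1] * 8
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.score(X, y))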
♻ ★ FedRepOpt: Gradient Re-parametrized Optimizers in Federated Learning
Federated Learning (FL) has emerged as a privacy-preserving method for
training machine learning models in a distributed manner on edge devices.
However, on-device models face inherent computational power and memory
limitations, potentially resulting in constrained gradient updates. As the
model's size increases, the frequency of gradient updates on edge devices
decreases, ultimately leading to suboptimal training outcomes during any
particular FL round. This limits the feasibility of deploying advanced and
large-scale models on edge devices, hindering the potential for performance
enhancements. To address this issue, we propose FedRepOpt, a gradient
re-parameterized optimizer for FL. The gradient re-parameterized method allows
training a simple local model with a similar performance as a complex model by
modifying the optimizer's gradients according to a set of model-specific
hyperparameters obtained from the complex models. In this work, we focus on
VGG-style and Ghost-style models in the FL environment. Extensive experiments
demonstrate that models using FedRepOpt obtain a significant boost in
performance of 16.7% and 11.4% compared to the RepGhost-style and RepVGG-style
networks, while also converging 11.7% and 57.4% faster than their complex
counterparts.
♻ ★ Compact 3D Gaussian Splatting For Dense Visual SLAM
Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Jiuming Liu, Danwei Wang, Hesheng Wang, Weidong Chen
Recent work has shown that 3D Gaussian-based SLAM enables high-quality
reconstruction, accurate pose estimation, and real-time rendering of scenes.
However, these approaches are built on a tremendous number of redundant 3D
Gaussian ellipsoids, leading to high memory and storage costs, and slow
training speed. To address the limitation, we propose a compact 3D Gaussian
Splatting SLAM system that reduces the number and the parameter size of
Gaussian ellipsoids. A sliding window-based masking strategy is first proposed
to reduce the redundant ellipsoids. Then we observe that the covariance
matrices (geometry) of most 3D Gaussian ellipsoids are extremely similar, which
motivates a novel geometry codebook to compress 3D Gaussian geometric
attributes, i.e., the parameters. Robust and accurate pose estimation is
achieved by a global bundle adjustment method with reprojection loss. Extensive
experiments demonstrate that our method achieves faster training and rendering
speed while maintaining the state-of-the-art (SOTA) quality of the scene
representation.
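For intuition, a sketch of the geometry-codebook idea using plain k-means:
cluster per-Gaussian covariance parameters and store only a small codebook
plus one index per ellipsoid. The actual system learns its codebook during
mapping; the 7-D scale-plus-rotation parameterization here is an assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    def compress_geometry(cov_params, codebook_size=64):
        # cov_params: (N, 7) per-Gaussian geometric attributes, e.g. a 3-D
        # scale and a 4-D rotation quaternion. Returns codebook + indices.
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0)
        idx = km.fit_predict(cov_params)       # one index per ellipsoid
        codebook = km.cluster_centers_         # (codebook_size, 7)
        return codebook, idx.astype(np.uint8)  # storage: N bytes + codebook

    params = np.random.rand(10000, 7)          # toy scales/quaternions
    codebook, idx = compress_geometry(params)
    print(codebook.shape, idx.shape)           # (64, 7) (10000,)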
♻ ★ GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning
The rapid advancement of photorealistic generators has reached a critical
juncture where the discrepancy between authentic and manipulated images is
increasingly indistinguishable. Thus, benchmarking and advancing techniques
detecting digital manipulation become an urgent issue. Although there have been
a number of publicly available face forgery datasets, the forgery faces are
mostly generated using GAN-based synthesis technology, which does not involve
the most recent technologies like diffusion. The diversity and quality of
images generated by diffusion models have been significantly improved and thus
a much more challenging face forgery dataset shall be used to evaluate SOTA
forgery detection literature. In this paper, we propose a large-scale, diverse,
and fine-grained high-fidelity dataset, namely GenFace, to facilitate the
advancement of deepfake detection, which contains a large number of forgery
faces generated by advanced generators such as the diffusion-based model and
more detailed labels about the manipulation approaches and adopted generators.
In addition to evaluating SOTA approaches on our benchmark, we design an
innovative cross appearance-edge learning (CAEL) detector to capture
multi-grained appearance and edge global representations, and detect
discriminative and general forgery traces. Moreover, we devise an
appearance-edge cross-attention (AECA) module to explore the various
integrations across two domains. Extensive experiment results and
visualizations show that our detection model outperforms the state of the arts
on different settings like cross-generator, cross-forgery, and cross-dataset
evaluations. Code and datasets will be available at
\url{https://github.com/Jenine-321/GenFace}.
comment: Accepted by IEEE Transactions on Information Forensics and Security
♻ ★ Perception-Guided Quality Metric of 3D Point Clouds Using Hybrid Strategy
Full-reference point cloud quality assessment (FR-PCQA) aims to infer the
quality of distorted point clouds with available references. Most of the
existing FR-PCQA metrics ignore the fact that the human visual system (HVS)
dynamically tackles visual information according to different distortion levels
(i.e., distortion detection for high-quality samples and appearance perception
for low-quality samples) and measure point cloud quality using unified
features. To bridge the gap, in this paper, we propose a perception-guided
hybrid metric (PHM) that adaptively leverages two visual strategies with
respect to distortion degree to predict point cloud quality: to measure visible
difference in high-quality samples, PHM takes into account the masking effect
and employs texture complexity as an effective compensatory factor for absolute
difference; on the other hand, PHM leverages spectral graph theory to evaluate
appearance degradation in low-quality samples. Variations in geometric signals
on graphs and changes in the spectral graph wavelet coefficients are utilized
to characterize geometry and texture appearance degradation, respectively.
Finally, the results obtained from the two components are combined in a
non-linear method to produce an overall quality score of the tested point
cloud. Experimental results on five independent databases show that PHM
achieves state-of-the-art (SOTA) performance and offers significant
improvements in multiple distortion environments. The code is
publicly available at https://github.com/zhangyujie-1998/PHM.
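The hybrid design above boils down to computing two component scores and blending them according to an estimated distortion degree. A minimal sketch of such adaptive blending (the sigmoid weighting and score names are illustrative assumptions; the paper's actual non-linear combination may differ):

    import numpy as np

    def hybrid_quality(score_visibility: float, score_appearance: float,
                       distortion_degree: float, steepness: float = 8.0) -> float:
        """Blend a masking-aware visible-difference score (high-quality regime)
        with a spectral-graph appearance score (low-quality regime).
        distortion_degree in [0, 1]: 0 = pristine, 1 = heavily distorted."""
        # sigmoid gate: low distortion -> trust visibility term, high -> appearance term
        w = 1.0 / (1.0 + np.exp(-steepness * (distortion_degree - 0.5)))
        return (1.0 - w) * score_visibility + w * score_appearance

    print(hybrid_quality(0.9, 0.4, distortion_degree=0.2))  # mostly visibility-driven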
♻ ★ High-Fidelity GAN Inversion for Image Attribute Editing CVPR 2022
We present a novel high-fidelity generative adversarial network (GAN)
inversion framework that enables attribute editing with image-specific details
well-preserved (e.g., background, appearance, and illumination). We first
analyze the challenges of high-fidelity GAN inversion from the perspective of
lossy data compression. With a low bit-rate latent code, previous works have
difficulty preserving high-fidelity details in reconstructed and edited
images. Increasing the size of the latent code can improve the accuracy of GAN
inversion but at the cost of inferior editability. To improve image fidelity
without compromising editability, we propose a distortion consultation approach
that employs a distortion map as a reference for high-fidelity reconstruction.
In the distortion consultation inversion (DCI), the distortion map is first
projected to a high-rate latent map, which then complements the basic low-rate
latent code with more details via consultation fusion. To achieve high-fidelity
editing, we propose an adaptive distortion alignment (ADA) module with a
self-supervised training scheme, which bridges the gap between the edited and
inversion images. Extensive experiments in the face and car domains show a
clear improvement in both inversion and editing quality.
comment: CVPR 2022; Project Page is at https://tengfei-wang.github.io/HFGI/
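The distortion consultation step above can be pictured as computing the residual that the basic latent code fails to explain, encoding it into a high-rate latent map, and fusing it back. A minimal sketch under that reading (the encoder and fusion modules are placeholders I introduce for illustration, not the HFGI code):

    import torch
    import torch.nn as nn

    class ConsultationFusion(nn.Module):
        """Fuse a high-rate distortion latent map into the generator features."""
        def __init__(self, ch: int = 64):
            super().__init__()
            self.distortion_encoder = nn.Conv2d(3, ch, 3, padding=1)
            self.fuse = nn.Conv2d(2 * ch, ch, 1)

        def forward(self, gen_feat, image, coarse_recon):
            # distortion map: what the low-rate code failed to reconstruct
            distortion = image - coarse_recon
            d_latent = self.distortion_encoder(distortion)        # high-rate latent map
            return self.fuse(torch.cat([gen_feat, d_latent], 1))  # consultation fusion

    x = torch.randn(1, 3, 256, 256)       # input image
    coarse = torch.randn(1, 3, 256, 256)  # reconstruction from the low-rate code
    feat = torch.randn(1, 64, 256, 256)   # intermediate generator features
    out = ConsultationFusion()(feat, x, coarse)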
♻ ★ DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction ACM MM 2024
With the recent burst of 2D and 3D data, cross-modal retrieval has attracted
increasing attention. However, manual labeling by non-experts will
inevitably introduce corrupted annotations given ambiguous 2D/3D content.
Though previous works have addressed this issue by designing a naive division
strategy with hand-crafted thresholds, their performance generally exhibits
great sensitivity to the threshold value. Besides, they fail to fully utilize
the valuable supervisory signals within each divided subset. To tackle this
problem, we propose a Divide-and-conquer 2D-3D cross-modal Alignment and
Correction framework (DAC), which comprises Multimodal Dynamic Division (MDD)
and Adaptive Alignment and Correction (AAC). Specifically, the former performs
accurate sample division by adaptive credibility modeling for each sample based
on the compensation information within the multimodal loss distribution. Then in
AAC, samples in distinct subsets are exploited with different alignment
strategies to fully enhance semantic compactness while alleviating
over-fitting to noisy labels, where a self-correction strategy is introduced to
improve the quality of representation. Moreover, to evaluate the effectiveness
in real-world scenarios, we introduce a challenging noisy benchmark, namely
Objaverse-N200, which comprises 200k-level samples annotated with 1156
realistic noisy labels. Extensive experiments on both traditional and the newly
proposed benchmarks demonstrate the generality and superiority of our DAC,
where DAC outperforms state-of-the-art models by a large margin (i.e., +5.9%
gain on ModelNet40 and +5.8% on Objaverse-N200).
comment: accepted by ACM MM 2024
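The adaptive credibility modeling in MDD is, at heart, a soft clean/noisy division driven by the per-sample loss distribution rather than a hand-crafted threshold. A minimal sketch of that idea using a two-component Gaussian mixture over the joint 2D/3D losses (the GMM choice and the stacking of modal losses are my assumptions for illustration):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def credibility_from_losses(loss_2d: np.ndarray, loss_3d: np.ndarray) -> np.ndarray:
        """Per-sample probability of being clean, inferred from the joint
        2D/3D loss distribution instead of a fixed threshold."""
        losses = np.stack([loss_2d, loss_3d], axis=1)  # N x 2 multimodal losses
        gmm = GaussianMixture(n_components=2).fit(losses)
        clean = np.argmin(gmm.means_.sum(axis=1))      # low-loss component = clean
        return gmm.predict_proba(losses)[:, clean]     # credibility per sample

    w = credibility_from_losses(np.random.rand(1000), np.random.rand(1000))
    # samples with high w join the "clean" subset; the rest get label correction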
♻ ★ Prompt-Agnostic Adversarial Perturbation for Customized Diffusion Models NeurIPS 2024
Diffusion models have revolutionized customized text-to-image generation,
allowing for efficient synthesis of photos from personal data with textual
descriptions. However, these advancements bring forth risks including privacy
breaches and unauthorized replication of artworks. Previous research primarily
centers on prompt-specific methods to generate adversarial
examples to protect personal images, yet the effectiveness of existing methods
is hindered by constrained adaptability to different prompts. In this paper, we
introduce a Prompt-Agnostic Adversarial Perturbation (PAP) method for
customized diffusion models. PAP first models the prompt distribution using a
Laplace Approximation, and then produces prompt-agnostic perturbations by
maximizing a disturbance expectation based on the modeled distribution. This
approach effectively tackles prompt-agnostic attacks, leading to improved
defense stability. Extensive experiments on face privacy and artistic style
protection demonstrate the superior generalization of PAP in comparison to
existing techniques. Our project page is available at
https://github.com/vancyland/Prompt-Agnostic-Adversarial-Perturbation-for-Customized-Diffusion-Models.github.io.
comment: Accepted by NeurIPS 2024
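The core computation described above, maximizing an expected disturbance over a modeled prompt distribution, can be sketched as projected gradient ascent where each step samples prompt embeddings from the Gaussian given by the Laplace approximation. The loss function, sampler, and model interface below are illustrative assumptions:

    import torch

    def prompt_agnostic_perturb(image, loss_fn, mu, sigma, eps=8/255,
                                step=1/255, iters=40, n_samples=4):
        """PGD maximizing the expected loss over prompt embeddings drawn
        from N(mu, sigma^2), approximating the prompt distribution."""
        delta = torch.zeros_like(image, requires_grad=True)
        for _ in range(iters):
            loss = 0.0
            for _ in range(n_samples):                  # Monte Carlo expectation
                prompt_emb = mu + sigma * torch.randn_like(mu)
                loss = loss + loss_fn(image + delta, prompt_emb)
            loss.backward()
            with torch.no_grad():                       # signed ascent + eps-ball clip
                delta += step * delta.grad.sign()
                delta.clamp_(-eps, eps)
            delta.grad.zero_()
        return (image + delta).detach()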
♻ ★ SynRS3D: A Synthetic Dataset for Global 3D Semantic Understanding from Monocular Remote Sensing Imagery NeurIPS 2024
Global semantic 3D understanding from single-view high-resolution remote
sensing (RS) imagery is crucial for Earth Observation (EO). However, this task
faces significant challenges due to the high costs of annotations and data
collection, as well as geographically restricted data availability. To address
these challenges, synthetic data offer a promising solution by being easily
accessible and thus enabling the provision of large and diverse datasets. We
develop a specialized synthetic data generation pipeline for EO and introduce
SynRS3D, the largest synthetic RS 3D dataset. SynRS3D comprises 69,667
high-resolution optical images that cover six different city styles worldwide
and feature eight land cover types, precise height information, and building
change masks. To further enhance its utility, we develop a novel multi-task
unsupervised domain adaptation (UDA) method, RS3DAda, coupled with our
synthetic dataset, which facilitates the RS-specific transition from synthetic
to real scenarios for land cover mapping and height estimation tasks,
ultimately enabling global monocular 3D semantic understanding based on
synthetic data. Extensive experiments on various real-world datasets
demonstrate the adaptability and effectiveness of our synthetic dataset and
proposed RS3DAda method. SynRS3D and the related code will be made available.
comment: Accepted at NeurIPS 2024 as a Spotlight
♻ ★ $\texttt{NePhi}$: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration ECCV 2024
This work proposes NePhi, a generalizable neural deformation model which
results in approximately diffeomorphic transformations. In contrast to the
predominant voxel-based transformation fields used in learning-based
registration approaches, NePhi represents deformations functionally, leading to
great flexibility within the design space of memory consumption during training
and inference, inference time, registration accuracy, as well as transformation
regularity. Specifically, NePhi 1) requires less memory compared to voxel-based
learning approaches, 2) improves inference speed by predicting latent codes,
compared to existing neural-deformation-based registration approaches
that \emph{only} rely on optimization, 3) improves accuracy via instance
optimization, and 4) shows excellent deformation regularity which is highly
desirable for medical image registration. We demonstrate the performance of
NePhi on a 2D synthetic dataset as well as for real 3D medical image datasets
(e.g., lungs and brains). Our results show that NePhi can match the accuracy of
voxel-based representations in a single-resolution registration setting. For
multi-resolution registration, our method matches the accuracy of current SOTA
learning-based registration approaches with instance optimization while
reducing memory requirements by a factor of five. Our code is available at
https://github.com/uncbiag/NePhi.
comment: ECCV 2024
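Representing deformations functionally, as NePhi does, means a network maps a spatial coordinate (plus a latent code) to a displacement instead of storing a dense voxel field. A minimal sketch of such a neural deformation field (the architecture sizes and latent conditioning are my assumptions):

    import torch
    import torch.nn as nn

    class NeuralDeformationField(nn.Module):
        """Maps (3D coordinate, latent code) -> warped coordinate."""
        def __init__(self, latent_dim: int = 64, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3 + latent_dim, hidden), nn.SiLU(),
                nn.Linear(hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, 3),
            )

        def forward(self, coords: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
            # coords: N x 3 query points; z: pair-specific latent, broadcast per point
            z = z.expand(coords.shape[0], -1)
            return coords + self.net(torch.cat([coords, z], dim=-1))

    pts = torch.rand(1024, 3)                   # query coordinates in [0,1]^3
    z = torch.randn(1, 64)                      # latent code for one image pair
    warped = NeuralDeformationField()(pts, z)   # memory scales with queries, not voxels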
♻ ★ SpikeGS: Learning 3D Gaussian Fields from Continuous Spike Stream ACCV 2024
A spike camera is a specialized high-speed visual sensor that offers
advantages such as high temporal resolution and high dynamic range compared to
conventional frame cameras. These features provide the camera with significant
advantages in many computer vision tasks. However, the tasks of 3D
reconstruction and novel view synthesis based on spike cameras remain
underdeveloped. Although there are existing methods for learning neural
radiance fields from spike stream, they either lack robustness in extremely
noisy, low-quality lighting conditions or suffer from high computational
complexity due to the deep fully connected neural networks and ray marching
rendering strategies used in neural radiance fields, making it difficult to
recover fine texture details. In contrast, the latest advancements in 3DGS have
achieved high-quality real-time rendering by optimizing the point cloud
representation into Gaussian ellipsoids. Building on this, we introduce
SpikeGS, the first method to learn 3D Gaussian fields solely from a spike stream.
We designed a differentiable spike stream rendering framework based on 3DGS,
incorporating noise embedding and spiking neurons. By leveraging the multi-view
consistency of 3DGS and the tile-based multi-threaded parallel rendering
mechanism, we achieved high-quality real-time rendering results. Additionally,
we introduced a spike rendering loss function that generalizes under varying
illumination conditions. Our method can reconstruct view synthesis results with
fine texture details from a continuous spike stream captured by a moving spike
camera, while demonstrating high robustness in extremely noisy low-light
scenarios. Experimental results on both real and synthetic datasets demonstrate
that our method surpasses existing approaches in terms of rendering quality and
speed. Our code will be available at https://github.com/520jz/SpikeGS.
comment: Accepted by ACCV 2024. Project page: https://github.com/520jz/SpikeGS
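For readers unfamiliar with spike cameras: each pixel fires a binary spike whenever its accumulated brightness crosses a threshold, so light intensity can be estimated from the firing rate within a temporal window. A minimal sketch of that recovery step (the window size and normalization are illustrative; SpikeGS's differentiable rendering adds noise embedding and spiking neurons on top):

    import torch

    def spikes_to_intensity(spikes: torch.Tensor, window: int = 32) -> torch.Tensor:
        """spikes: T x H x W binary stream. Returns per-window intensity
        estimates (T // window x H x W) from the spike firing rate."""
        t, h, w = spikes.shape
        t = (t // window) * window                # drop incomplete tail
        chunks = spikes[:t].reshape(-1, window, h, w)
        return chunks.float().mean(dim=1)         # firing rate ~ normalized intensity

    stream = (torch.rand(256, 64, 64) > 0.7).int()  # fake 256-step spike stream
    frames = spikes_to_intensity(stream)            # 8 reconstructed intensity frames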
♻ ★ Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model
Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen
The emerging video LMMs (Large Multimodal Models) have achieved significant
improvements on generic video understanding in the form of VQA (Visual Question
Answering), where the raw videos are captured by cameras. However, a large
portion of videos in real-world applications are edited videos, \textit{e.g.},
users usually cut and add effects/modifications to the raw video before
publishing it on social media platforms. The edited videos usually have high
view counts but they are not covered in existing benchmarks of video LMMs,
\textit{e.g.}, ActivityNet-QA or the VideoChatGPT benchmark. In this paper, we
leverage the edited videos on a popular short video platform, \textit{i.e.},
TikTok, and build a video VQA benchmark (named EditVid-QA) covering four
typical editing categories, i.e., effect, funny, meme, and game. Funny and meme
videos benchmark nuanced understanding and high-level reasoning, while effect
and game evaluate the understanding capability of artificial design. Most of
the open-source video LMMs perform poorly on the EditVid-QA benchmark,
indicating a huge domain gap between edited short videos on social media and
regular raw videos. To improve the generalization ability of LMMs, we collect a
training set for the proposed benchmark based on both Panda-70M/WebVid raw
videos and small-scale TikTok/CapCut edited videos, which boosts the
performance on the proposed EditVid-QA benchmark, indicating the effectiveness
of high-quality training data. We also identified a serious issue in the
existing evaluation protocol using the GPT-3.5 judge, namely a "sorry" attack,
where a sorry-style naive answer can achieve an extremely high rating from the
GPT judge, e.g., over 4.3 correctness score on the VideoChatGPT evaluation
protocol. To avoid the "sorry" attacks, we evaluate results with GPT-4 judge
and keyword filtering. The dataset is released at
https://github.com/XenonLamb/EditVid-QA.
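The "sorry" attack noted above exploits judge leniency toward apologetic non-answers. A minimal sketch of the keyword-filtering countermeasure, scoring an answer as zero before it ever reaches the GPT judge (the keyword list and scoring convention are my assumptions):

    SORRY_MARKERS = ("sorry", "i apologize", "i cannot", "i can't", "unable to")

    def filter_sorry_answers(answer: str) -> bool:
        """Return True if the answer should be scored 0 without judging,
        i.e., it is an apologetic non-answer rather than an attempt."""
        text = answer.lower().strip()
        return any(text.startswith(m) or m in text[:80] for m in SORRY_MARKERS)

    assert filter_sorry_answers("Sorry, as an AI I cannot watch videos.")
    assert not filter_sorry_answers("The clip shows a meme about cats.")
    # answers that pass the filter are then rated by the GPT-4 judge as usual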
♻ ★ 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Imitation learning provides an efficient way to teach robots dexterous
skills; however, learning complex skills robustly and generalizably usually
consumes large amounts of human demonstrations. To tackle this challenging
problem, we present 3D Diffusion Policy (DP3), a novel visual imitation
learning approach that incorporates the power of 3D visual representations into
diffusion policies, a class of conditional action generative models. The core
design of DP3 is the utilization of a compact 3D visual representation,
extracted from sparse point clouds with an efficient point encoder. In our
experiments involving 72 simulation tasks, DP3 successfully handles most tasks
with just 10 demonstrations and surpasses baselines with a 24.2% relative
improvement. In 4 real robot tasks, DP3 demonstrates precise control with a
high success rate of 85%, given only 40 demonstrations of each task, and shows
excellent generalization abilities in diverse aspects, including space,
viewpoint, appearance, and instance. Interestingly, in real robot experiments,
DP3 rarely violates safety requirements, in contrast to baseline methods which
frequently do, necessitating human intervention. Our extensive evaluation
highlights the critical importance of 3D representations in real-world robot
learning. Videos, code, and data are available on
https://3d-diffusion-policy.github.io .
comment: Published at Robotics: Science and Systems (RSS) 2024. Videos, code,
and data: https://3d-diffusion-policy.github.io
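DP3's compact 3D representation comes from an efficient point encoder over sparse point clouds. A minimal sketch of one standard choice for such an encoder, a PointNet-style per-point MLP with max pooling (layer sizes are illustrative; the paper's exact encoder may differ):

    import torch
    import torch.nn as nn

    class CompactPointEncoder(nn.Module):
        """Sparse point cloud (B x N x 3) -> compact global feature (B x out_dim)."""
        def __init__(self, out_dim: int = 64):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(3, 64), nn.ReLU(),
                nn.Linear(64, 128), nn.ReLU(),
                nn.Linear(128, out_dim),
            )

        def forward(self, points: torch.Tensor) -> torch.Tensor:
            feats = self.mlp(points)        # per-point features, B x N x out_dim
            return feats.max(dim=1).values  # order-invariant max pooling

    cloud = torch.randn(8, 512, 3)          # batch of sparse point clouds
    cond = CompactPointEncoder()(cloud)     # conditioning vector for the diffusion policy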
♻ ★ Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model
Current state-of-the-art diffusion models employ U-Net architectures
containing convolutional and (qkv) self-attention layers. The U-Net processes
images while being conditioned on the time embedding input for each sampling
step and the class or caption embedding input corresponding to the desired
conditional generation. Such conditioning involves scale-and-shift operations
to the convolutional layers but does not directly affect the attention layers.
While these standard architectural choices are certainly effective, not
conditioning the attention layers feels arbitrary and potentially suboptimal.
In this work, we show that simply adding LoRA conditioning to the attention
layers without changing or tuning the other parts of the U-Net architecture
improves the image generation quality. For example, a drop-in addition of LoRA
conditioning to the EDM diffusion model yields FID scores of 1.91/1.75 for
unconditional and class-conditional CIFAR-10 generation, improving upon the
baseline of 1.97/1.79.
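The idea above is that the conditioning signal modulates low-rank updates to the attention projections, rather than only scale-and-shifting convolutional features. A minimal sketch of a LoRA-conditioned qkv-style projection (how the embedding scales the low-rank branch is my assumption about one natural instantiation):

    import torch
    import torch.nn as nn

    class LoRAConditionedLinear(nn.Module):
        """Linear layer whose low-rank update is scaled by a conditioning embedding."""
        def __init__(self, dim: int, cond_dim: int, rank: int = 4):
            super().__init__()
            self.base = nn.Linear(dim, dim)               # base projection
            self.down = nn.Linear(dim, rank, bias=False)  # LoRA A
            self.up = nn.Linear(rank, dim, bias=False)    # LoRA B
            self.gate = nn.Linear(cond_dim, rank)         # condition -> per-rank scale

        def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
            scale = self.gate(cond).unsqueeze(1)          # B x 1 x rank
            return self.base(x) + self.up(self.down(x) * scale)

    x = torch.randn(2, 196, 256)   # tokens entering an attention projection
    emb = torch.randn(2, 512)      # time/class embedding
    y = LoRAConditionedLinear(256, 512)(x, emb)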
♻ ★ RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models ECCV
With the extensive use of vision-language models in various downstream tasks,
evaluating their robustness is crucial. In this paper, we propose a benchmark
for assessing the robustness of vision-language models. We believe that a
robust model should properly understand both linguistic and visual semantics
and be resilient to explicit variations. In pursuit of this goal, we create new
variants of texts and images in the MS-COCO test set and re-evaluate the
state-of-the-art (SOTA) models with the new data. Specifically, we alter the
meaning of text by replacing a word, and generate visually altered images that
maintain some visual context while introducing noticeable pixel changes through
image mixing techniques. Our evaluations on the proposed benchmark reveal
substantial performance degradation in many SOTA models (e.g., Image-to-Text
Recall@1: 81.9\% $\rightarrow$ 48.4\% in BLIP, 66.1\% $\rightarrow$ 37.6\% in
VSE$\infty$), with the models often favoring the altered texts/images over the
original ones. This indicates that current vision-language models struggle with
subtle changes and often fail to understand the overall context of texts and
images. Based on these findings, we propose semantic contrastive loss and
visual contrastive loss to learn more robust embeddings. Datasets and code are
available at {\url{https://github.com/pseulki/rococo}}.
comment: Accepted to ECCV Synthetic Data for Computer Vision Workshop (Oral)
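The visual perturbation above, keeping some context while introducing noticeable pixel changes, can be realized with simple image mixing such as mixup-style blending. A minimal sketch (the blend ratio and strategy are my illustrative assumptions, not the benchmark's exact recipe):

    import numpy as np

    def mix_images(img_a: np.ndarray, img_b: np.ndarray, alpha: float = 0.35):
        """Blend a distractor image into the original: visual context survives,
        but pixels change noticeably. Both images: H x W x 3, uint8."""
        mixed = (1 - alpha) * img_a.astype(np.float32) + alpha * img_b.astype(np.float32)
        return mixed.clip(0, 255).astype(np.uint8)

    a = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
    b = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
    stress_img = mix_images(a, b)  # pair with the original caption to stress-test matching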
♻ ★ Segment Any Change NeurIPS 2024
Visual foundation models have achieved remarkable results in zero-shot image
classification and segmentation, but zero-shot change detection remains an open
problem. In this paper, we propose the segment any change models (AnyChange), a
new type of change detection model that supports zero-shot prediction and
generalization on unseen change types and data distributions. AnyChange is
built on the segment anything model (SAM) via our training-free adaptation
method, bitemporal latent matching. By revealing and exploiting intra-image and
inter-image semantic similarities in SAM's latent space, bitemporal latent
matching endows SAM with zero-shot change detection capabilities in a
training-free way. We also propose a point query mechanism to enable
AnyChange's zero-shot object-centric change detection capability. We perform
extensive experiments to confirm the effectiveness of AnyChange for zero-shot
change detection. AnyChange sets a new record on the SECOND benchmark for
unsupervised change detection, exceeding the previous SOTA by up to 4.4% F$_1$
score, and achieving comparable accuracy with negligible manual annotations (1
pixel per image) for supervised change detection.
comment: Accepted by NeurIPS 2024
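Bitemporal latent matching, as described, compares SAM's latent embeddings of the same location at two times; low similarity flags change. A minimal sketch of that matching step on precomputed embedding maps (the thresholding and normalization choices are my assumptions):

    import torch
    import torch.nn.functional as F

    def bitemporal_change_mask(feat_t1: torch.Tensor, feat_t2: torch.Tensor,
                               threshold: float = 0.5) -> torch.Tensor:
        """feat_t1/feat_t2: C x H x W SAM image embeddings of the two dates.
        Returns a boolean H x W mask where latent similarity drops."""
        f1 = F.normalize(feat_t1, dim=0)
        f2 = F.normalize(feat_t2, dim=0)
        similarity = (f1 * f2).sum(dim=0)  # per-pixel cosine similarity
        return similarity < threshold      # low similarity -> changed

    t1 = torch.randn(256, 64, 64)              # embeddings from SAM's encoder, date 1
    t2 = torch.randn(256, 64, 64)              # date 2
    mask = bitemporal_change_mask(t1, t2)      # zero-shot change proposal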